
Character Sets and Character Coding Mismatches

This article was supposed to be the first article of CSS Reboot Spring 2006. However, HTML and CSS validation results squelched that.

The Document Type Definition (DTD) should be the first thing considered when construction begins; “Why XHTML™?” explains it. The character set encoding (charset) is the second thing that should be considered. “Which Character Set Encoding should be Used?” discussed charsets, but “Content-Type” by Lachlan Hunt [January 11, 2006] is an excellent introduction, technical yet readable.

Of the 738 CSS Reboot Spring 2006 sites analyzed, the following charsets were declared (or not) in the <meta> element of the source code.

<meta http-equiv="Content-Type" content="NONE DECLARED">
  • utf-8 = 40*
  • iso-8859-1 = 12*
  • Unknown = 4*

*These charsets were found in the HTTP header results, since the source code declared none. “Unknown” is accurate: the HTTP header results did not return a character set either. No character set in the code; no character set on the server.

  • <meta http-equiv="Content-Type" content="charset=utf-8"> = 432
  • <meta http-equiv="Content-Type" content="charset=iso-8859-1"> = 235
  • <meta http-equiv="Content-Type" content="charset=iso-8859-2"> = 6
  • <meta http-equiv="Content-Type" content="charset=iso-8859-15"> = 1
  • <meta http-equiv="Content-Type" content="charset=euc-kr"> = 1
  • <meta http-equiv="Content-Type" content="charset=Shift_JIS"> = 1
  • <meta http-equiv="Content-Type" content="charset=windows-1250"> = 1
  • <meta http-equiv="Content-Type" content="charset=windows-1252"> = 5

[Note: All application/xhtml+xml MIME types used utf-8.]

That’s simple.

An interesting warning occurred during HTML Markup Validation. What was written in the source code wasn’t necessarily written on the server.

W3C Character Encoding Mismatch
The character encoding specified in the HTTP header (iso-8859-1) is different from the value in the <meta> element (utf-8). I will use the value from the HTTP header (iso-8859-1) for this validation.
W3C Character Encoding Mismatch
The character encoding specified in the HTTP header (utf-8) is different from the value in the <meta> element (iso-8859-1). I will use the value from the HTTP header (utf-8) for this validation.

31 sites experienced this.
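
The validator's check can be approximated in a few lines. What follows is only a rough sketch in Python, not the W3C validator's actual code: the function names are hypothetical, and the regular expression is a simplification that assumes the charset appears literally inside a <meta> tag.

```python
import re
from email.message import Message

# Simplified pattern: finds "charset=..." anywhere inside a <meta ...> tag.
META_CHARSET = re.compile(
    r'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def charset_from_header(content_type):
    """Return the lower-cased charset of a Content-Type header value, or None."""
    message = Message()
    message["Content-Type"] = content_type
    return message.get_content_charset()


def charset_from_meta(html):
    """Return the lower-cased charset declared in a <meta> tag, or None."""
    match = META_CHARSET.search(html)
    return match.group(1).lower() if match else None


def find_mismatch(content_type, html):
    """Return (header_charset, meta_charset) when both exist and differ."""
    header = charset_from_header(content_type)
    meta = charset_from_meta(html)
    if header and meta and header != meta:
        return header, meta
    return None


page = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
print(find_mismatch("text/html; charset=iso-8859-1", page))
# ('iso-8859-1', 'utf-8')
```

A result of None means either that the two declarations agree or that one of them is missing.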

What does this mean for a site with a character encoding mismatch?

It means that whatever the server says (the charset sent to the browser in the HTTP Content-Type header) will be used. The following explanation may be found in various W3C documents, but I’ve excerpted it from W3C Internationalization (I18n) Activity, Tutorial: Character sets & encodings in XHTML, HTML and CSS:

Precedence rules
In the case of conflict between multiple encoding declarations, precedence rules apply to determine which declaration wins out. For XHTML and HTML, the precedence is as follows, with 1 being the highest:
  1. HTTP Content-Type
  2. XML declaration
  3. meta charset declaration
  4. link charset attribute
The fourth item here is a method of declaring the encoding of a file that we have not yet mentioned. A charset attribute can be added to an a element to indicate the encoding of the file being linked to. In general, this approach is not recommended, since it is likely to provide incorrect information if the encoding of the target file is changed.
The high precedence of the HTTP header is useful, as mentioned earlier, in situations where the encoding of the document is changed by an intermediary server, since that transcoding is unlikely to change the in-document declarations. The transcoding server should declare the new encoding in the HTTP header.

[Note: The above excerpt applies to HTML/XHTML with content-type="text/html" only; XML has different requirements that may be seen in the W3C tutorial.]
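
The precedence list in the excerpt can be read as a simple "first declaration found wins" lookup. A minimal sketch in Python, assuming each declaration has already been extracted as a string (or None when absent); the function and parameter names are made up for illustration and come from neither the W3C tutorial nor any browser:

```python
def effective_encoding(http_charset=None, xml_declaration=None,
                       meta_charset=None, link_charset=None):
    """Return (source, encoding) for the highest-precedence declaration present."""
    # Ordered as in the W3C excerpt, highest precedence first.
    candidates = [
        ("HTTP Content-Type", http_charset),
        ("XML declaration", xml_declaration),
        ("meta charset declaration", meta_charset),
        ("link charset attribute", link_charset),
    ]
    for source, encoding in candidates:
        if encoding:
            return source, encoding.lower()
    return None, None


print(effective_encoding(http_charset="windows-1252", meta_charset="UTF-8"))
# ('HTTP Content-Type', 'windows-1252')
```

When the server sends no charset at all, the same call falls through to the <meta> declaration.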

So, if your server, i.e., HTTP Content-Type, is set for windows-1252, then your page will be parsed by browsers as windows-1252 regardless of whether you have stated UTF-8 in the meta section, i.e., meta charset declaration. Precedence. What practical effect does this have? If you use UTF-8 encoding in your code, the browser will not read it as UTF-8, since the server is saying to use windows-1252; characters outside ASCII may be rendered as garbage or, worse, your site could break. And, if you don’t declare a charset at all, the suggested default for browsers is iso-8859-1.
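
The effect is easy to reproduce. The snippet below is an illustration in Python rather than a model of any particular browser: it encodes text as UTF-8 and then decodes the same bytes as windows-1252, which is what happens when the HTTP header wins over the <meta> declaration. The helper name misread_as is hypothetical.

```python
def misread_as(text, declared="windows-1252", actual="utf-8"):
    """Encode text with its actual encoding, then decode it as the declared one."""
    # errors="replace" stands in for bytes that the declared charset leaves undefined.
    return text.encode(actual).decode(declared, errors="replace")


print(misread_as("café"))         # cafÃ©
print(misread_as("That’s"))       # Thatâ€™s
print(misread_as("plain ASCII"))  # plain ASCII
```

Pure ASCII text survives because both encodings agree on those bytes, which is why a mismatch can go unnoticed until the first accented letter or curly quote appears.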

If you don’t know what your server is doing, you can see for yourself.

Online services include:

  • W3C HTTP Head
  • Mozilla Web-sniffer: View HTTP Request and Response Header (my preference)

Or, Firefox has:

  • Web Developer Extension
  • LiveHTTPHeaders
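
If you would rather check from a script, the same information can be read with a HEAD request. This is a minimal sketch using only Python's standard library; the helper names are hypothetical, the URL is supplied on the command line, and it assumes the server answers HEAD requests (some do not).

```python
import sys
from email.message import Message
from urllib.request import Request, urlopen


def describe_content_type(headers: Message) -> str:
    """Summarize the Content-Type header and the charset it carries, if any."""
    content_type = headers.get("Content-Type", "none sent")
    charset = headers.get_content_charset() or "Unknown"
    return f"Content-Type: {content_type} | charset: {charset}"


def fetch_headers(url, timeout=10):
    """Send a HEAD request and return the response headers."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return response.headers


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_charset.py <url>
    print(describe_content_type(fetch_headers(sys.argv[1])))
```

A charset of “Unknown” here corresponds to the case described earlier, where the server sends no character set at all.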

And what can be done if the server administrators haven’t a clue about altering Content-Type/Character Set, especially on free hosting companies’ Windows Servers? You can take matters into your own hands by using your .htaccess file (this works where the site is served by Apache and the host permits overrides). See W3C Internationalization (I18n) Activity, “Setting charset information in .htaccess”, for information on how it’s done.

[Why do You use HTML 4? is the final article in this series.]


Sean Fraser posted this on August 5, 2006 09:41 AM.



The Elementary Standards: A Compendium of Web Standards, CSS, Linguistics and Search Engine Optimization methodology Copyright ©2005-2007 Sean Fraser. All work is published under a Creative Commons License. All Rights Reserved.
