This post explains the fundamental differences between the character sets used on websites. Here is the most common charset declaration, which should be placed within the HEAD tag:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> 

For pages limited to ASCII or Latin-1 (Western European) characters, the following declaration can be used instead:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-8859-1">

In XHTML the following charset definition should be used:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
    "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<meta http-equiv="content-type" content="application/xhtml+xml; charset=UTF-8" />

“ISO-8859” on its own means nothing. “ISO-8859-1” is the Latin-1 (Western European languages) encoding. There are 15 different encodings in the ISO-8859-x family, so you have to be careful about which one you’re talking about. What they have in common is that they are single byte, the first 128 characters are the standard ASCII set, and each has a different set of characters in the upper 128 slots (32 of which are reserved for control codes, leaving 96 printable characters). There are sets for eastern European languages, Baltic/Scandinavian languages, and other Latin-based alphabets, as well as Greek, Hebrew, Arabic, Cyrillic, and Thai.
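
To see why the part number matters, here is a minimal Python sketch that decodes the same single byte under four members of the family. Python is only my choice for illustration; the codec names are the ones shipped in its standard library, and the byte E6 is an arbitrary pick from the upper half of the table.

```python
# The same byte value means a different character in each ISO-8859-x part.
raw = bytes([0xE6])  # one byte from the upper half of the table

codecs_to_try = ["iso-8859-1", "iso-8859-2", "iso-8859-5", "iso-8859-7"]
decoded = {name: raw.decode(name) for name in codecs_to_try}

for name, char in decoded.items():
    print(f"{name}: U+{ord(char):04X}")

# The lower half (ASCII) is identical in every part.
ascii_bytes = b"Hello"
ascii_views = {name: ascii_bytes.decode(name) for name in codecs_to_try}
```

The byte E6 comes out as æ in Latin-1, ć in Latin-2, ц in the Cyrillic part, and ζ in the Greek part, while the ASCII text reads the same under all four.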

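Unicode is a character set that assigns a number (a code point, from 0000 to 10FFFF) to every character in essentially every writing system on Earth. It is often described as a double-byte (16-bit) encoding, but that is really UTF-16, and even UTF-16 needs four bytes for characters above FFFF. The first 128 code points (0000 – 007F) are the same as ASCII, and the next 128 (0080 – 00FF) are the same as Latin-1. Various other alphabet standards got dropped into Unicode, sometimes unaltered, sometimes rearranged a bit. UTF-8 is a variable-length encoding of Unicode in which the most common characters (the ASCII set 0000 – 007F) come through as single bytes 00 – 7F, a big chunk of (mostly) Western alphabets (0080 – 07FF) get two bytes, the rest of the 16-bit range (0800 – FFFF) gets three bytes, and everything above that gets four.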
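
The byte counts are easy to check. The following Python sketch (the sample characters are my own picks, one from each range) measures how many bytes a character needs in UTF-8 and in UTF-16, and confirms that Latin-1 byte values coincide with Unicode code points.

```python
# Bytes needed per character in UTF-8 and UTF-16, one sample per range.
samples = {
    "A": "ASCII (U+0041)",
    "\u00e9": "Latin-1 range (U+00E9, e with acute)",
    "\u20ac": "rest of the 16-bit range (U+20AC, euro sign)",
    "\U0001F600": "above U+FFFF (U+1F600, an emoji)",
}

utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
# "utf-16-le" avoids the 2-byte byte-order mark that plain "utf-16" prepends.
utf16_lengths = {ch: len(ch.encode("utf-16-le")) for ch in samples}

for ch, label in samples.items():
    print(f"{label}: UTF-8 {utf8_lengths[ch]} bytes, UTF-16 {utf16_lengths[ch]} bytes")

# Every Latin-1 byte value equals the Unicode code point it decodes to.
latin1_matches = all(ord(bytes([b]).decode("latin-1")) == b for b in range(256))
```

The four samples take 1, 2, 3, and 4 bytes in UTF-8, and 2, 2, 2, and 4 bytes in UTF-16.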

If the alphabet(s) you want to use for all your languages together need more than the 96 non-ASCII characters a single ISO-8859-x encoding offers, you have no choice but to go to UTF-8. Even if they don’t, UTF-8 is still worth choosing: accented text will consume a bit more space than with an ISO-8859-x encoding, but you’ll have future flexibility if a new member wants to type in some, say, Japanese text.
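
As a rough illustration of that trade-off, this Python sketch compares the size of a short accented phrase in both encodings, then tries to store Japanese text in Latin-1. The sample strings are my own, chosen only for the demonstration.

```python
# Storage cost of UTF-8 versus Latin-1 for accented Western text.
text = "Cr\u00e8me br\u00fbl\u00e9e"  # "Crème brûlée": 12 characters, 3 accented

latin1_size = len(text.encode("iso-8859-1"))  # one byte per character
utf8_size = len(text.encode("utf-8"))         # accented letters take two bytes

print(latin1_size, utf8_size)  # 12 15

# Text outside the Latin-1 repertoire cannot be stored in Latin-1 at all.
japanese = "日本語"
try:
    japanese.encode("iso-8859-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False

japanese_utf8_size = len(japanese.encode("utf-8"))  # three bytes per character
```

The accented phrase grows from 12 to 15 bytes, which is the "bit more space" mentioned above, while the Japanese text simply has no Latin-1 representation.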

There are three places where you are concerned about character encoding.

1) Database: The database really doesn’t care what encoding the character data is in, except when it comes to sorting (collating) text. If you feed UTF-8 data to a database table defined as Latin-1, it may sort a bit differently than you expected. But no data is lost, provided the bytes are stored exactly as they were sent.

2) Language support text files: Most non-English files (text for headings, titles, prompts, button labels, etc.) are in UTF-8. English is in ASCII, and so is compatible with Latin-1 and UTF-8 pages.

3) Browser: The browser is told what encoding text is being sent in (and what encoding to return input data in). The traditional default when nothing is declared is Latin-1 (ISO-8859-1), but the other usual choice is UTF-8.
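
Item (1) can be mimicked outside a database. The Python sketch below is an approximation under a stated assumption: it compares strings by their raw UTF-8 bytes, whereas a real database applies its own collation rules. It shows accented words sorting after "z", and shows that the stored bytes still decode back to the original text.

```python
# UTF-8 bytes kept in a Latin-1 column: compared byte by byte, yet not lost.
words = ["zebra", "\u00e9tude", "apple"]  # "\u00e9tude" is "étude"

# A byte-wise comparison sees "é" as the bytes C3 A9, which sort after "z" (7A).
bytewise_order = sorted(words, key=lambda w: w.encode("utf-8"))

# Reading the stored bytes as Latin-1 and writing them back unchanged
# preserves every byte, so the original UTF-8 text is still recoverable.
stored = "\u00e9tude".encode("utf-8")
seen_as_latin1 = stored.decode("iso-8859-1")
recovered = seen_as_latin1.encode("iso-8859-1").decode("utf-8")

print(len(seen_as_latin1))  # 6: the Latin-1 view counts bytes, not letters
```

A reader expecting dictionary order would put "étude" between "apple" and "zebra"; the byte-wise order puts it last.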

Needless to say, items (2) and (3) really need to match up if you don’t want gibberish on your page. It’s not uncommon to have UTF-8 text (double byte accented characters) coming out of a database or language support file, and being displayed on a page declared to be Latin-1. This produces two odd-looking accented characters instead of the desired one. It’s nice to have item (1) match up (database as UTF-8, rather than Latin-1), but it’s not critical.
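
The mismatch is easy to reproduce. This Python sketch takes a sample word of my own choosing, encodes it as UTF-8, and decodes it the way a page declared as Latin-1 would; it also shows the opposite mistake, Latin-1 bytes on a page declared as UTF-8.

```python
# UTF-8 bytes rendered by a page that declares Latin-1.
intended = "caf\u00e9"                        # "café"
sent_bytes = intended.encode("utf-8")         # what the server sends
displayed = sent_bytes.decode("iso-8859-1")   # what a Latin-1 page shows

print(displayed)  # cafÃ©

# The opposite mismatch: Latin-1 bytes on a page declared as UTF-8.
latin1_bytes = intended.encode("iso-8859-1")
reverse = latin1_bytes.decode("utf-8", errors="replace")  # invalid byte becomes U+FFFD
```

In the first case the single "é" turns into the two characters "Ã©"; in the second it becomes the replacement character that browsers usually draw as a question mark in a diamond.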

It’s not hard to get all three items consistent (all UTF-8), but you do have to take care that the data currently in your database is actually Latin-1 and not already UTF-8 (caused by manually setting the page display to be UTF-8, while leaving the database at Latin-1). If it is already UTF-8, converting it again will encode each accented character a second time and garble it. When you use the tool to convert the database to UTF-8, it should translate accented characters (if any) in your tables to UTF-8, resulting in a small increase in size. You will now be using your -UTF8 file for language support, and your page heading should include a Content-Type that tells the browser to render as UTF-8.
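
Checking whether data is already UTF-8 can be sketched as follows. This is not the conversion tool mentioned above; `looks_like_utf8` and `latin1_to_utf8` are hypothetical helpers of my own, and the check is only a heuristic, since a few Latin-1 strings happen to be valid UTF-8 as well.

```python
def looks_like_utf8(raw: bytes) -> bool:
    """Heuristic: True if raw has non-ASCII bytes that form valid UTF-8."""
    if raw.isascii():
        return False  # pure ASCII is identical in both encodings
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode Latin-1 bytes as UTF-8; leave data that already looks like UTF-8."""
    if looks_like_utf8(raw):
        return raw
    return raw.decode("iso-8859-1").encode("utf-8")


latin1_row = "caf\u00e9".encode("iso-8859-1")  # b"caf\xe9"
utf8_row = "caf\u00e9".encode("utf-8")         # b"caf\xc3\xa9"

# What a blind conversion does to data that was already UTF-8.
double_encoded = utf8_row.decode("iso-8859-1").encode("utf-8")
```

Here `latin1_to_utf8(latin1_row)` yields the two-byte UTF-8 form of "é", `latin1_to_utf8(utf8_row)` returns the row untouched, and `double_encoded` shows the four-byte garbage a blind second conversion would store.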

Source: SimpleMachines.org