UTF-8 in Joomla

Have you ever browsed to a website only to find that half the content is unreadable? Or that certain characters are being displayed in strange and mysterious ways? Or perhaps you wanted to enter a foreign or unusual character but found that the result was a garbled mess?

The chances are you have been subject to poorly managed character encodings. Joomla! extensions are no exception to these occurrences, but with a little bit of effort and some help from the Joomla! framework, we can avoid these problems with relative ease.

The Joomla! way using the Unicode Repertoire and UTF-8

In Joomla! we use Unicode. Amongst other things, Unicode describes a super-uber-massive-enormous repertoire of characters (an abstract character repertoire, or ACR). Unicode 5.2 (the current version of Unicode at the time of writing) contains over 107,000 characters and a code space of 1,114,112 code points, numbered 0 to 1,114,111 (a coded character set, or CCS)! A code space is an allocated range in which code points can fall, and a code point is a unique number given to every character in a character set. In Unicode, the upper case letter J has a code point value of 74, the Greek capital letter Ω (omega) has a code point value of 937, and the Tamil letter ஹ (ha) has a code point value of 3001.
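To make the idea of a code point concrete, here is a minimal sketch that turns the three code points above back into characters. It assumes plain PHP with no Joomla! classes loaded, and it leans on the core html_entity_decode() function, which understands decimal numeric references such as &#937;. The variable names are ours, purely for illustration.

```php
<?php
// Code points taken from the paragraph above: J, Ω (omega) and ஹ (ha).
$codePoints = array(74, 937, 3001);

$characters = array();
foreach ($codePoints as $codePoint) {
    // "&#937;" is a numeric character reference naming code point 937;
    // html_entity_decode() returns the UTF-8 bytes for that character.
    $characters[$codePoint] = html_entity_decode('&#' . $codePoint . ';', ENT_QUOTES, 'UTF-8');

    echo $codePoint . ' => ' . $characters[$codePoint] . "\n";
}
```

The same number always gives the same character; how many bytes that character occupies is a separate question, answered by the encoding.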

To store and present this data in Joomla! we use UTF-8 (both a character encoding form, or CEF, and a character encoding scheme, or CES; we won't be discussing either further in this article). UTF-8 is a multibyte character encoding that enables us to store Unicode in a relatively small amount of space. Being a multibyte character encoding, individual UTF-8 characters are stored in memory using a variable number of bytes. We will explain how UTF-8 achieves this later; for now all we need to know is that the number of bytes used to store a given character varies from one to four.

When dealing with multibyte character encodings, PHP is unfortunately not the clever little monkey with which we are all familiar... PHP makes the assumption that all strings contain characters that are stored in single bytes. A good example of when this is problematic is counting the number of characters in a string using the PHP strlen() function. If the string contains UTF-8 data and one or more of the characters are represented in memory using multiple bytes, the value that is returned will be larger than expected.
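The sketch below shows the strlen() problem in isolation. It assumes plain PHP outside Joomla!, so it counts characters with PCRE rather than JString; the Greek word is written as byte escapes so the example does not depend on how the file itself is saved.

```php
<?php
// "γάτα": four Greek characters, each stored as two bytes in UTF-8.
$word = "\xCE\xB3\xCE\xAC\xCF\x84\xCE\xB1";

// strlen() counts bytes, not characters.
$byteCount = strlen($word);

// With the u modifier PCRE treats the subject as UTF-8, so "." matches one
// whole character and preg_match_all() returns the number of characters.
$characterCount = preg_match_all('/./su', $word, $matches);

echo $byteCount . ' bytes, ' . $characterCount . " characters\n";
```

Here strlen() reports 8 while the word contains only 4 characters, which is exactly the "larger than expected" result described above.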

Using JString

Luckily for us, when it comes to dealing with UTF-8 strings in our PHP code, Joomla! gets all dressed up in its fancy knight-in-shining-armour costume and rushes to the rescue! The Joomla! JString class contains a bunch of static methods that are UTF-8 aware. There is an equivalent JString method for each PHP string function that does not behave as expected when using UTF-8 strings.

The table below describes the PHP string functions for which there is an equivalent JString method. For a complete list of PHP string functions refer to the Strings section in the PHP manual.

PHP String Functions and JString Equivalents

| PHP Function | JString Method | Return Type | Parameters | Description |
|---|---|---|---|---|
| strlen | JString::strlen | int | string $str | Determines the length of $str. |
| trim | JString::trim | string | string $str, [string $charlist] | Removes leading and trailing whitespace or characters defined in $charlist. |
| ltrim | JString::ltrim | string | string $str, [string $charlist] | Removes leading whitespace or characters defined in $charlist. |
| rtrim | JString::rtrim | string | string $str, [string $charlist] | Removes trailing whitespace or characters defined in $charlist. |
| strpos | JString::strpos | int or false | string $haystack, string $needle, [int $offset = 0] | Finds the position of the first occurrence of $needle in $haystack. |
| strrpos | JString::strrpos | int or false | string $haystack, string $needle, [int $offset = 0] | Finds the position of the last occurrence of $needle in $haystack (PHP 4 behaves slightly differently). WARNING: JString does not support $offset. |
| substr | JString::substr | string | string $string, int $start, [int $length] | Gets a portion of $string based on the character position $start and maximum length $length. |
| substr_replace | JString::substr_replace | string | string $string, string $replacement, int $start, [int $length] | Replaces the chunk of $string described by $start and $length with $replacement. WARNING: JString does not handle arrays of strings. |
| str_ireplace | JString::str_ireplace | string or array of strings | mixed $search, mixed $replace, mixed $subject, [int &$count] | Replaces occurrences of $search in $subject with $replace. In its most basic form, all of these parameters are strings. It is possible to provide arrays, see str_ireplace for more information. WARNING: The $count parameter differs between the PHP function and the JString method. It is best to avoid $count entirely. |
| strtolower | JString::strtolower | string | string $string | * Makes every character lowercase. |
| strtoupper | JString::strtoupper | string | string $string | * Makes every character uppercase. |
| ucfirst | JString::ucfirst | string | string $string | * Makes the first character uppercase. |
| ucwords | JString::ucwords | string | string $string | * Makes every character at the start of a new word uppercase. |
| strcspn | JString::strcspn | int | string $str, string $mask, [int $start], [int $length] | Determines the number of characters in $str before any character in $mask is encountered. |
| strspn | JString::strspn | int | string $subject, string $mask, [int $start], [int $length] | Determines the number of characters at the start of $subject that are also in $mask. |
| stristr | JString::stristr | string | string $str, string $search, [bool $before_needle = false] | Returns the part of $str starting from the first occurrence of $search (case insensitive). WARNING: JString::stristr does not support $before_needle. |
| strcasecmp | JString::strcasecmp | int | string $str1, string $str2 | Case insensitive comparison of $str1 and $str2. |
| strrev | JString::strrev | string | string $str | Reverses $str. |
| str_split | JString::str_split | array | string $string, [int $split_length = 1] | Splits a string into chunks of the specified number of characters. |
| iconv | JString::transcode | string | string $source, string $from_encoding, string $to_encoding | † Converts $source from one character encoding to another. Depending on the encodings, this can result in data loss. |

* The concept of uppercase and lowercase differs between alphabets, and does not exist in all alphabets. The results of these methods rely on mappings between lowercase and uppercase characters, mappings which won't exist for the vast majority of characters.

† The PHP iconv() function and JString::transcode() method are not technically equivalent to one another. The JString::transcode() method is intended more as a helper method for using the PHP iconv() function. When using the JString::transcode() method, transliteration is always enabled.

Although the JString methods essentially make their PHP counterparts redundant, some of the PHP functions still have their place in Joomla! applications. For example, the PHP strlen() function can be used as a bytelen() type function. And of course, if we are not dealing with UTF-8, the PHP string functions may be the preferred option.
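To illustrate why a character-aware method such as JString::substr matters, the sketch below compares the byte-oriented substr() with a small stand-in. It assumes plain PHP outside Joomla!; utf8Substring() is a hypothetical helper of our own built on PCRE, not the JString implementation.

```php
<?php
// Hypothetical helper, not part of Joomla!: a character-aware substring.
function utf8Substring($string, $start, $length)
{
    // Splitting on the empty pattern with the u modifier yields one array
    // element per UTF-8 character.
    $characters = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);

    return implode('', array_slice($characters, $start, $length));
}

// "γάτα" written as byte escapes: four characters, eight bytes.
$word = "\xCE\xB3\xCE\xAC\xCF\x84\xCE\xB1";

$byByte      = substr($word, 0, 3);        // three bytes: cuts the second character in half
$byCharacter = utf8Substring($word, 0, 3); // three characters: "γάτ"

// With the u modifier, preg_match() refuses subjects that are not valid
// UTF-8, which gives us a quick validity check.
$byteResultIsValid      = (preg_match('//u', $byByte) === 1);
$characterResultIsValid = (preg_match('//u', $byCharacter) === 1);

echo 'substr() result valid UTF-8: ' . ($byteResultIsValid ? 'yes' : 'no') . "\n";
echo 'utf8Substring() result valid UTF-8: ' . ($characterResultIsValid ? 'yes' : 'no') . "\n";
```

Inside a Joomla! application we would simply call JString::substr($word, 0, 3) rather than rolling our own helper.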

Regular Expressions in PHP with UTF-8

You may have noticed that there are no regular expression methods in JString. This is because the PCRE (Perl Compatible Regular Expression) PHP functions are UTF-8 aware. When we define a pattern in which we want to include UTF-8 characters, we don't use raw UTF-8; instead we use the syntax \x{HEX}, where HEX is the hexadecimal Unicode code point of the character in question. For this to work, we must also use the u pattern modifier.

An example of this in use is shown below. This example searches for the Greek word γάτα (cat). A description of all these characters can be found at unicodemap.org.

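// Single quotes keep PHP from touching the backslashes; PCRE interprets \x{...}.
$pattern  = '~\x{3B3}\x{3AC}\x{3C4}\x{3B1}~u';
$haystack = "I like that Greek γάτα";

// preg_match_all() returns the number of matches found;
// preg_match() would stop at the first match and return 1.
$numberOfGreekCats = preg_match_all($pattern, $haystack, $matches);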

UTF-8 in the Database

It isn't just our PHP that requires special attention. We must make sure that our database is ready and willing to store our UTF-8 data. Prior to MySQL 4.1.2, MySQL counted strings in bytes. Since Joomla! 1.5 supports MySQL 3.23 and above, we need to pay special attention when creating database tables where the version of MySQL is less than 4.1.2.

Joomla! component installers (the only Joomla! 1.5 installers in which we programmatically create database tables) allow us to specify separate SQL files for MySQL databases that count strings in bytes and for those that count strings in characters. The best way to use this functionality is to define two SQL files, one for each scenario.

<file driver="mysql" charset="utf8">install.sql</file>
<file driver="mysql">install.noutf8.sql</file>

The fundamental difference between these files is the way in which we define the string fields CHAR and VARCHAR. When describing fields of these types we use the syntax CHAR(n) and VARCHAR(n). As of MySQL 4.1.2, n represents the number of characters that can be stored in these fields. Prior to MySQL 4.1.2, n represents the number of bytes that can be stored in these fields.

To deal with this, in the install.noutf8.sql file we multiply the value of n by three. There is a trade-off here. We mentioned earlier that UTF-8 characters are represented using between one and four bytes, which suggests we should be quadrupling the value of n. The CHAR and VARCHAR types allow a maximum n value of 255, making it hard to quadruple the value of n without exceeding this. In truth the four-byte characters are rarely used, thus tripling the value is generally considered a safe compromise.

In instances where tripling the value of n results in a value greater than 255, we change the MySQL type to TEXT. Below is an example of two SQL files, one for each scenario.

install.sql

CREATE TABLE `#__sometable`
(
`key` CHAR(10) NOT NULL,
`value` VARCHAR(255),
PRIMARY KEY (`key`)
)

install.noutf8.sql

CREATE TABLE `#__sometable`
(
`key` CHAR(30) NOT NULL,
`value` TEXT,
PRIMARY KEY (`key`)
)
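The tripling rule can be written down as a few lines of PHP. This is only a sketch of the arithmetic described above; noUtf8ColumnType() is a hypothetical helper of our own and is not something the Joomla! installer provides.

```php
<?php
// Hypothetical helper: derive the column type for install.noutf8.sql from
// the type and character length used in install.sql.
function noUtf8ColumnType($type, $characters)
{
    // Allow three bytes per character, as discussed above.
    $bytes = $characters * 3;

    // CHAR and VARCHAR top out at 255 here, so anything larger becomes TEXT.
    return ($bytes <= 255) ? $type . '(' . $bytes . ')' : 'TEXT';
}

echo noUtf8ColumnType('CHAR', 10) . "\n";     // CHAR(30)
echo noUtf8ColumnType('VARCHAR', 255) . "\n"; // TEXT
```

The two calls reproduce the `key` and `value` columns from the example files.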

What's under the UTF-8 covers?

When we jump in bed with UTF-8, we really should cosy up to Unicode and work out what is going on. Metaphors really are underrated... anyway... returning to the point at hand:

Earlier we mentioned that Unicode describes a repertoire of characters, each of which is assigned a unique code point (integer value). For example, the upper case letter J from the Latin alphabet has a code point value of 74, the Greek capital letter Ω (omega) has a code point value of 937, and the Tamil letter ஹ (ha) has a code point value of 3001.

NOTE: Unicode code points are often described in decimal, and in hexadecimal. For example, the upper case letter J from the Latin alphabet is code point 74 decimal, and 4A hexadecimal. When Unicode code points are represented in hexadecimal they are commonly written in the form U+004A.
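Converting between the two notations is a one-liner in PHP. The sketch below assumes nothing beyond the core sprintf() and hexdec() functions; formatCodePoint() is a hypothetical name chosen for illustration.

```php
<?php
// Hypothetical helper that formats a decimal code point in U+ notation.
function formatCodePoint($codePoint)
{
    // %04X pads the uppercase hexadecimal value to at least four digits.
    return sprintf('U+%04X', $codePoint);
}

echo formatCodePoint(74) . "\n";   // U+004A
echo formatCodePoint(937) . "\n";  // U+03A9
echo formatCodePoint(3001) . "\n"; // U+0BB9

// hexdec() converts the hexadecimal digits back to decimal.
echo hexdec('004A') . "\n"; // 74
```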

So, if Unicode defines numeric values for each and every character, why do we need UTF-8? Code points are references to characters; they are not computer memory representations. UTF-8 is a UTF (Unicode Transformation Format). A UTF maps a Unicode code point to a unique byte sequence.

UTF-8 represents characters in a sequence of one to four 8-bit chunks (a standard modern byte). For example, we stated that the Latin character J has a code point value of 74; this will easily fit into one byte. We also stated that the Tamil letter ஹ (ha) is code point 3001; this does not fit into one byte by a long shot, so it is spread across three bytes.

In comparison, UTF-8's two big brothers, UTF-16 (also a variable-length encoding) and UTF-32, map to sequences of one or two 16-bit chunks (a standard modern word) and one 32-bit chunk (a standard modern double word) respectively. Since UTF-8 stores characters in chunks of 8 bits, and it is a multibyte character encoding (it uses only as many bytes as necessary), UTF-8 is generally the most bandwidth-efficient UTF. This is why Joomla!, and many other modern web based systems, opt for UTF-8.
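As a rough sketch of that comparison, the function below works out how many bytes a single code point needs in each UTF, using the standard range boundaries for each encoding. bytesPerEncoding() is a hypothetical helper of our own; it does no validation of its input.

```php
<?php
// Hypothetical helper: bytes needed to store one code point in each UTF.
function bytesPerEncoding($codePoint)
{
    if ($codePoint < 0x80) {
        $utf8 = 1;
    } elseif ($codePoint < 0x800) {
        $utf8 = 2;
    } elseif ($codePoint < 0x10000) {
        $utf8 = 3;
    } else {
        $utf8 = 4;
    }

    return array(
        'UTF-8'  => $utf8,
        'UTF-16' => ($codePoint < 0x10000) ? 2 : 4, // one or two 16-bit chunks
        'UTF-32' => 4,                              // always one 32-bit chunk
    );
}

// J, Ω (omega) and ஹ (ha).
foreach (array(74, 937, 3001) as $codePoint) {
    $sizes = bytesPerEncoding($codePoint);
    echo $codePoint . ': ' . $sizes['UTF-8'] . ' / ' . $sizes['UTF-16'] . ' / ' . $sizes['UTF-32'] . "\n";
}
```

Notice that ஹ needs three bytes in UTF-8 but only two in UTF-16, which is why UTF-8 is "generally" rather than "always" the smallest. For text dominated by ASCII characters, such as HTML markup, UTF-8 comes out well ahead.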

How UTF-8 works

Since UTF-8 is a multibyte encoding in which every character in a sequence of characters can be represented by a different number of bytes, UTF-8 needs to tell the parser on a per character basis how many bytes to parse. This is achieved using the msbs (most significant bits) in each byte.

If a byte starts with an msb of 0, it means that this byte represents a single character. If a byte starts with the two msbs 10, it means it is part of a sequence representing a single character, but that it is not the first byte in that sequence. Bytes starting with the msbs 110, 1110, and 11110 are always the first bytes in a sequence of two, three and four bytes respectively.

Let's try some examples to demonstrate how this works in practice. We'll start with a simple example: the upper case letter J from the Latin alphabet has a code point value of 74 and requires a single byte in UTF-8.
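These rules translate directly into bit masks. The following sketch classifies a single byte by its most significant bits; utf8ByteRole() is a hypothetical helper written for this article, not a PHP or Joomla! function.

```php
<?php
// Hypothetical helper: classify a single byte by its most significant bits.
function utf8ByteRole($byte)
{
    $value = ord($byte);

    if (($value & 0x80) === 0x00) {
        return 'single-byte character'; // 0xxxxxxx
    }
    if (($value & 0xC0) === 0x80) {
        return 'continuation byte';     // 10xxxxxx
    }
    if (($value & 0xE0) === 0xC0) {
        return 'first of two bytes';    // 110xxxxx
    }
    if (($value & 0xF0) === 0xE0) {
        return 'first of three bytes';  // 1110xxxx
    }
    if (($value & 0xF8) === 0xF0) {
        return 'first of four bytes';   // 11110xxx
    }

    return 'not valid in UTF-8';
}

echo utf8ByteRole('J') . "\n";    // single-byte character
echo utf8ByteRole("\xCE") . "\n"; // first of two bytes
echo utf8ByteRole("\xA9") . "\n"; // continuation byte
```

Each mask keeps only the leading bits we care about, so the comparison ignores the payload bits that follow them.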

code point decimal: 74
code point hexadecimal: U+004A
code point binary: 1001010
UTF-8 representation: 01001010

Notice that the msb (the leftmost bit) is 0, indicating that this one byte represents one character. Now let's try a more complex example. The Greek capital letter Ω (omega) has a code point value of 937.

code point decimal: 937
code point hexadecimal: U+03A9
code point binary: 1110101001
UTF-8 representation: 11001110 10101001

This time the msbs of the first byte are 110, which tells us that this byte is the first byte in a sequence of two bytes representing one character. The msbs of the second byte are 10, which tells us that this byte continues a sequence started by an earlier byte. The remaining bits of the two bytes (01110 and 101001) join up to give the code point in binary. Now let's ramp it up another gear. We have already established that the Tamil letter ஹ (ha) has a code point value of 3001 and requires three bytes.

code point decimal: 3001
code point hexadecimal: U+0BB9
code point binary: 101110111001
UTF-8 representation: 11100000 10101110 10111001

This time the msbs in the first byte, 1110, are telling us that this is the first byte in a sequence of three bytes, whilst the msbs in the second and third bytes, 10, are telling us that these bytes continue that sequence.
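The three worked examples can be reproduced with a small encoder. The sketch below is a minimal illustration of the bit layout, assuming a valid code point is passed in (it does not reject surrogates or values beyond the Unicode code space); utf8Encode() and toBinary() are hypothetical helpers of our own.

```php
<?php
// Hypothetical helper: encode one Unicode code point as a UTF-8 byte string.
function utf8Encode($codePoint)
{
    if ($codePoint < 0x80) {
        return chr($codePoint);                          // 0xxxxxxx
    }
    if ($codePoint < 0x800) {
        return chr(0xC0 | ($codePoint >> 6))             // 110xxxxx
             . chr(0x80 | ($codePoint & 0x3F));          // 10xxxxxx
    }
    if ($codePoint < 0x10000) {
        return chr(0xE0 | ($codePoint >> 12))            // 1110xxxx
             . chr(0x80 | (($codePoint >> 6) & 0x3F))    // 10xxxxxx
             . chr(0x80 | ($codePoint & 0x3F));          // 10xxxxxx
    }

    return chr(0xF0 | ($codePoint >> 18))                // 11110xxx
         . chr(0x80 | (($codePoint >> 12) & 0x3F))       // 10xxxxxx
         . chr(0x80 | (($codePoint >> 6) & 0x3F))        // 10xxxxxx
         . chr(0x80 | ($codePoint & 0x3F));              // 10xxxxxx
}

// Hypothetical helper: show a byte string as groups of eight bits.
function toBinary($bytes)
{
    $groups = array();
    foreach (str_split($bytes) as $byte) {
        $groups[] = sprintf('%08b', ord($byte));
    }

    return implode(' ', $groups);
}

echo toBinary(utf8Encode(74)) . "\n";   // 01001010
echo toBinary(utf8Encode(937)) . "\n";  // 11001110 10101001
echo toBinary(utf8Encode(3001)) . "\n"; // 11100000 10101110 10111001
```

Each continuation byte carries six bits of the code point (hence the 0x3F mask), and the first byte carries whatever is left over after its length marker.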

UTF-8 and ASCII

If we take a look at Unicode, we will discover something interesting: the first 128 code points (0 to 127) are the same as in ASCII. The great thing about UTF-8 is that those code points are also mapped to the same byte sequences as they are in ASCII (this is not true of UTF-16 or UTF-32).

A side effect of this, or rather the intended effect of this (shared by a number of other character encodings that are supersets of ASCII), is that characters in the ASCII character set often appear correctly even when the data has been decoded using the wrong character encoding. Characters outside that set do not, and that is precisely the issue we described in the introduction to this article.

Those of us who use the Latin alphabet tend to be less aware of this problem because the vast majority of characters that we use land in that beautiful intersection, such that we find only a handful of characters appear incorrectly.
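One practical consequence is that a string made only of ASCII characters has the same bytes whether we call it ASCII or UTF-8. The sketch below checks for that case by looking for any byte of 128 or above; isAscii() is a hypothetical helper written for illustration.

```php
<?php
// Hypothetical helper: true when every byte is below 128, i.e. plain ASCII.
function isAscii($string)
{
    // No u modifier here, so the pattern is matched byte by byte.
    return preg_match('/^[\x00-\x7F]*$/D', $string) === 1;
}

$ascii = 'Joomla!';
$mixed = "Joomla! \xCE\xA9"; // "Joomla! Ω" with omega as two UTF-8 bytes

echo (isAscii($ascii) ? 'ASCII only' : 'contains multibyte characters') . "\n";
echo (isAscii($mixed) ? 'ASCII only' : 'contains multibyte characters') . "\n";

// For pure ASCII text the byte count and the character count agree.
echo strlen($ascii) . "\n"; // 7
```

When isAscii() returns true, the byte-oriented PHP string functions and their JString equivalents will agree on that string.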