PHP, XML and encoding

It was not until I tried to parse an RSS feed encoded in GB2312 that I realized how painful encoding is in PHP. It began with the xml_parser_create function, whose encoding handling is as awkward as it is clever. Steve has posted about his sadness, rage and loss on this. A Chinese translation of his tale is available as PHPXML、以及字元編碼:一則關於悲情、憤怒以及傷逝(資料)的故事 ("PHP, XML, and character encodings: a tale of sadness, rage and (data) loss"). Steve’s code was later shipped as part of MagpieRSS 0.7. However, Steve overrated PHP 5, and so did the newly shipped MagpieRSS. MagpieRSS 0.72 checks the PHP version before calling xml_parser_create. On PHP 4, MagpieRSS handles encoding with Steve’s approach. On PHP 5, it simply forwards the call to xml_parser_create, trusting that "by default php5 does a fine job of detecting input encodings".

 

PHP 5’s input encoding detection depends largely on the XML prologue. The rule seems quite simple: if an encoding attribute is present, its value is used; otherwise a default (ISO-8859-1 in PHP 5.0.0 and 5.0.1, UTF-8 in PHP 5.0.2 and later) is assumed. In PHP 5, the encoding parameter of xml_parser_create only specifies the output encoding (target encoding). This means that, unlike in PHP 4, there is no longer any way to specify the input encoding. The input encoding is always detected, and only ISO-8859-1, UTF-8 and US-ASCII are supported.
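The rule can be observed with a small experiment. This is only a sketch, assuming PHP 5.0.2 or later with the xml extension enabled; the helper name parse_single_text_node is made up for illustration. It feeds the parser a document that declares ISO-8859-1 and contains the single byte 0xE9 ("é"), then asks for two different output encodings.

```php
<?php
// Hypothetical helper: parse a one-element document and return its text
// in the requested target (output) encoding.
function parse_single_text_node($xml, $targetEncoding)
{
    // In PHP 5 this argument only selects the output encoding;
    // the input encoding is taken from the XML prologue.
    $parser = xml_parser_create($targetEncoding);
    $values = array();
    xml_parse_into_struct($parser, $xml, $values);
    return isset($values[0]['value']) ? $values[0]['value'] : null;
}

// The document declares ISO-8859-1 and contains the raw byte 0xE9.
$latin1Doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><t>\xE9</t>";

// Same input, two target encodings: dump the resulting bytes as hex.
echo bin2hex(parse_single_text_node($latin1Doc, 'UTF-8')), "\n";
echo bin2hex(parse_single_text_node($latin1Doc, 'ISO-8859-1')), "\n";
```

If the rule holds, the first line should show the two-byte UTF-8 form of "é" and the second the original single byte, even though the input was identical in both calls.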

So I decided to convert the source to UTF-8 first. Unfortunately, even when the content really is UTF-8, the XML prologue can still fool xml_parser_create. I had to resort to a really nasty trick to make PHP 5 happy: replacing <?xml version="1.0" encoding="xxx"?> with <?xml version="1.0" encoding="utf-8"?>.
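The prologue trick can be wrapped in a small function. The following is a minimal sketch, not a hardened implementation: the name force_utf8_prologue is hypothetical, and it assumes the document body has already been converted to UTF-8. It also handles single-quoted attributes and a prologue with no encoding attribute at all.

```php
<?php
// Hypothetical helper: make the XML prologue declare UTF-8.
// Assumes $xml has already been converted to UTF-8.
function force_utf8_prologue($xml)
{
    // Only touch the declaration at the very start (an optional BOM is allowed).
    if (!preg_match('/^(?:\xEF\xBB\xBF)?<\?xml\b[^>]*\?>/', $xml, $m)) {
        return $xml; // no prologue, nothing to rewrite
    }
    $decl = $m[0];
    $attr = '/encoding\s*=\s*(["\'])[^"\']*\1/';

    if (preg_match($attr, $decl)) {
        // Replace whatever encoding is declared.
        $fixed = preg_replace($attr, 'encoding="UTF-8"', $decl, 1);
    } else {
        // No encoding attribute: add one right after the version attribute.
        $fixed = preg_replace(
            '/(version\s*=\s*(["\'])[^"\']*\2)/',
            '$1 encoding="UTF-8"',
            $decl,
            1
        );
    }
    return $fixed . substr($xml, strlen($decl));
}

echo force_utf8_prologue('<?xml version="1.0" encoding="GB2312"?><rss/>'), "\n";
```

Restricting the match to the start of the document avoids rewriting text that merely looks like a prologue somewhere inside the feed content.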

How do we determine the input encoding? I suggest looking first for the encoding attribute in the prologue, then for the charset in the Content-Type HTTP header. If possible, we can also guess the encoding by sampling characters and analyzing how frequently they occur. Finally, if all of these fail, we fall back to a default: the encoding most common among the users our service targets.
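That order of precedence could be sketched as a single function. Everything here is an assumption made for illustration: the name guess_feed_encoding, the shape of $headers (a list of raw header lines), and above all step 3, which is not real frequency analysis but a much cruder stand-in that only checks whether the non-ASCII bytes form valid UTF-8.

```php
<?php
// Hypothetical helper: pick an input encoding for a fetched feed.
// $body is the raw response body, $headers a list of raw header lines,
// $default the fallback encoding for your audience.
function guess_feed_encoding($body, $headers, $default)
{
    // 1. The encoding attribute of the XML prologue.
    $prologue = '/^(?:\xEF\xBB\xBF)?<\?xml\b[^>]*?encoding\s*=\s*["\']([^"\']+)["\']/';
    if (preg_match($prologue, $body, $m)) {
        return strtoupper($m[1]);
    }

    // 2. The charset parameter of the Content-Type header.
    foreach ($headers as $h) {
        if (preg_match('/^Content-Type:.*?charset\s*=\s*"?([^\s;"]+)/i', $h, $m)) {
            return strtoupper($m[1]);
        }
    }

    // 3. Crude guess: non-ASCII bytes that form valid UTF-8 are very
    //    likely UTF-8 (the /u modifier makes PCRE validate the subject).
    if (preg_match('/[\x80-\xFF]/', $body) && preg_match('//u', $body)) {
        return 'UTF-8';
    }

    // 4. Nothing worked: use the caller's default.
    return $default;
}

echo guess_feed_encoding('<rss/>', array("Content-Type: text/xml; charset=gb2312\r\n"), 'GB2312'), "\n";
```

A real service would probably want a proper statistical detector for step 3; the validity check only separates UTF-8 from everything else.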

The behavior change in xml_parser_create removes one limitation by introducing another: it is now impossible to specify the input encoding explicitly. PHP has long suffered from a lack of encoding support. Its Unicode support was added only under the pressure of growing demand. The widely used mbstring is not a default extension. PHP doesn’t have Unicode support at the core level yet, and that is one of the most important things PHP 6 targets.

Joel wrote about the lack of Unicode support in PHP in The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

"… I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues, blithely using 8 bits for characters, making it darn near impossible to develop good international web applications…"

<?php

// …

/* @var Snoopy $resp */
$resp = _fetch_remote_file($url);

// Determine the input encoding, in order of preference:
//   a. the encoding attribute of the XML prologue
//   b. the charset in the HTTP headers
//   c. a default
$encoding = null;

if (preg_match('/<\?xml[^>]*encoding=[\'"](.*?)[\'"][^>]*>/', $resp->results, $m)) {
    $encoding = strtoupper($m[1]);
} else {
    foreach ($resp->headers as $h) {
        // strpos() returns 0 for a match at offset 0, so compare against false
        if (strpos($h, 'charset=') !== false) {
            list(, $encoding) = explode('charset=', $h, 2);
            $encoding = strtoupper(trim($encoding)); // drop the trailing CRLF
            break;
        }
    }
}

if (!$encoding) {
    $encoding = 'GB2312'; // the most likely encoding for our users
}

$feedStr = $resp->results;

// Convert the content to UTF-8
if ($encoding != 'UTF-8') {
    $feedStr = mb_convert_encoding($feedStr, 'UTF-8', $encoding);
}

// Normalize the prologue so that it declares UTF-8
$pattern = '/<\?xml[^>]*?encoding=[\'"](.*?)[\'"]/';
preg_match($pattern, $feedStr, $matches, PREG_OFFSET_CAPTURE);

if (count($matches) == 2) {
    // $matches[1] holds the captured encoding name and its byte offset
    $feedStr = substr_replace($feedStr, 'UTF-8', $matches[1][1], strlen($matches[1][0]));
}

$parser = xml_parser_create('UTF-8');

// …

?>

Alternatively, if you love Python as much as I do, you can hand the parsing job over to Mark’s Universal Feed Parser, which Ben Hammersley’s Developing Feeds with RSS and Atom introduces as the best feed application ever written.

 
