Convert escaped Unicode to HTML entities
January 7th, 2012
I’m posting this here because I spent far too long looking for the answer to this problem before eventually finding a solution at http://ho.runcode.us, and then realising the information had been on Stack Overflow all along; the search terms I was using must have just been missing it.
Everything that follows comes from http://stackoverflow.com/questions/3480074/how-do-i-convert-unicode-codepoints-to-hexadecimal-html-entities and is thanks to Stack Overflow users Joey and Artefacto.
The problem
In short:
Data that I want to display on a Web page contains escaped Unicode code units in it (like \u0096) that I want to convert into HTML entities (like &#x96;) so that they will show up correctly on the Web page.
In a little more detail:
I have a Web page with a table control that is populated from JSON. A PHP script takes some user input (search query terms), searches a database, and builds the search results into an array (called $data) in the format that json_encode() expects:
$str = json_encode($data);
However, the value you pass to json_encode() must be UTF-8 encoded data, and mine isn’t, so I need to pass it through utf8_encode() first, before I pass it to json_encode(). For example, for the contents of the name field:
utf8_encode($matchedRecord['name'])
The result is that any slightly unusual character, like the en dash (–), comes back as one or more escaped Unicode code units (e.g. the en dash becomes \u0096). Presumably this is because my data is really Windows-1252, where byte 0x96 is the en dash, while utf8_encode() assumes ISO-8859-1 and so produces the control character U+0096 rather than the real en dash, U+2013. When this is included in an HTML page and sent to a browser, Firefox displays it as a strange little symbol and Internet Explorer does not display it at all.
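The escaping step can be reproduced in isolation. This is only a sketch: the field value is made up, and the bytes that utf8_encode() would return for the Windows-1252 en dash are written out by hand so that the snippet does not depend on that function.

```php
<?php
// Byte 0x96 is the en dash in Windows-1252, but ISO-8859-1 maps it to the
// control character U+0096. utf8_encode() assumes ISO-8859-1, so for "\x96"
// it returns the two-byte UTF-8 sequence C2 96, which is hard-coded here.
$name = "1990\xC2\x962000";   // hypothetical field value after utf8_encode()

// By default json_encode() writes every non-ASCII character as \uXXXX.
$json = json_encode(array('name' => $name));

echo $json, "\n";   // {"name":"1990\u00962000"}
```

The \u0096 in the output is the escaped code unit that still has to be turned into something a browser will render.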
So I need to take the string from json_encode() and pass it through something that will convert the escaped Unicode code units (like \u0096) into HTML entities (like &#x96;) so that they will show up correctly in the browser.
The solution
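// A JSON escape is always \u followed by exactly four hex digits, so match
// exactly four; a looser count can swallow hex digits that merely follow the
// escape. The leading zeros are kept (&#x0096;), which is equivalent to &#x96;.
$str = preg_replace('/\\\\u([0-9a-fA-F]{4})/', '&#x\1;', $str);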
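To see what this kind of replacement does to a real json_encode() result, here is a minimal sketch. The helper name json_escapes_to_entities() and the sample value are my own assumptions, and the pattern is a variant that matches exactly four hex digits, since that is what a JSON escape always contains.

```php
<?php
// Hypothetical helper (not from the Stack Overflow answer): turn every
// \uXXXX escape into a hexadecimal HTML character reference.
function json_escapes_to_entities($str) {
    // '\\\\' in a single-quoted PHP string is two backslashes, which the
    // regex engine reads as one literal backslash.
    return preg_replace('/\\\\u([0-9a-fA-F]{4})/', '&#x$1;', $str);
}

// "\xC2\x96" is U+0096 in UTF-8, as in the en dash case described above.
$json = json_encode(array('name' => "1990\xC2\x962000"));

echo json_escapes_to_entities($json), "\n";   // {"name":"1990&#x0096;2000"}
```

With a pattern that strips leading zeros and then accepts up to five hex digits, the same input would come out as 1990&#x96200;0, because the digits of 2000 directly follow the escape.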
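The one-line preg_replace() above did the trick for me. However, it treats every \uXXXX escape separately, so a character outside the Basic Multilingual Plane, which JSON escapes as a UTF-16 surrogate pair (two 16-bit code units), would come out as two invalid entities. Artefacto supplies a function that handles both cases, i.e. characters made up of one or two 16-bit code units: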
$str = unenc_utf16_code_units($str);

function unenc_utf16_code_units($string) {
    /* go for possible surrogate pairs first:
       a high surrogate (D800-DBFF) followed by a low surrogate (DC00-DFFF) */
    $string = preg_replace_callback(
        '/\\\\u(D[89ab][0-9a-f]{2})\\\\u(D[c-f][0-9a-f]{2})/i',
        function ($matches) {
            $hi_surr = hexdec($matches[1]);
            $lo_surr = hexdec($matches[2]);
            // each surrogate carries 10 bits of (code point - 0x10000)
            $scalar = (0x10000 + (($hi_surr & 0x3FF) << 10))
                | ($lo_surr & 0x3FF);
            return "&#x" . dechex($scalar) . ";";
        }, $string);

    /* now the rest: single code units in the Basic Multilingual Plane */
    $string = preg_replace_callback('/\\\\u([0-9a-f]{4})/i',
        function ($matches) {
            // just to remove leading zeros
            return "&#x" . dechex(hexdec($matches[1])) . ";";
        }, $string);

    return $string;
}
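The surrogate-pair arithmetic is easier to follow on a single worked value. The helper below, surrogates_to_scalar(), is a hypothetical name of mine that isolates just that calculation. It is a sketch of the arithmetic, not a replacement for the function above.

```php
<?php
// Hypothetical helper illustrating the arithmetic only: combine a UTF-16
// high/low surrogate pair into a single Unicode code point.
function surrogates_to_scalar($hi_surr, $lo_surr) {
    // Each surrogate carries 10 payload bits; together they encode
    // (code point - 0x10000) as a 20-bit number.
    return 0x10000 + (($hi_surr & 0x3FF) << 10) + ($lo_surr & 0x3FF);
}

// U+1F600 appears in json_encode() output as \ud83d\ude00.
$scalar = surrogates_to_scalar(0xD83D, 0xDE00);

echo '&#x' . dechex($scalar) . ';', "\n";   // &#x1f600;
```

Plain addition is used here instead of the bitwise OR in the function above. The two give the same result because the shifted high bits and the low 10 bits never overlap.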

