Using XSLT with bad HTML

Tags
php, xslt, programming
Posted
Tue 19 Oct, 2004

We have a PHP CMS with a lot of poorly written HTML in the client-contributed content. This kept causing my XSL template system to output XML errors. I got around this problem by:

  1. Wrapping the content in CDATA sections.
  2. Checking whether the content is valid XML with xml_parse() in PHP; if it is not, I add a CDATA section and try again.
  3. Stripping out bad characters that may have crept in from Word.
  4. Processing the XSL and XML using xsl:value-of tags with disable-output-escaping="yes".
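
As a rough sketch of step 2, assuming PHP 7 or later with the xml extension loaded: the helper names below (is_well_formed_fragment, wrap_in_cdata, make_xml_safe) are made up for illustration and are not from the CMS itself. Only the xml_* functions are PHP built-ins.

```php
<?php
// Hypothetical helpers; only xml_parser_create() and xml_parse() are PHP built-ins.

/** Return true if $fragment parses as well-formed XML inside a dummy root element. */
function is_well_formed_fragment(string $fragment): bool
{
    $parser = xml_parser_create('UTF-8');
    // xml_parse() returns 1 on success and 0 on a parse error.
    $ok = xml_parse($parser, '<root>' . $fragment . '</root>', true) === 1;
    unset($parser);
    return $ok;
}

/** Wrap $fragment in a CDATA section so the parser treats it as plain text. */
function wrap_in_cdata(string $fragment): string
{
    // A literal "]]>" would close the section early, so split it across two sections.
    return '<![CDATA[' . str_replace(']]>', ']]]]><![CDATA[>', $fragment) . ']]>';
}

/** Leave well-formed content alone; wrap everything else in CDATA. */
function make_xml_safe(string $fragment): string
{
    return is_well_formed_fragment($fragment) ? $fragment : wrap_in_cdata($fragment);
}

$clean = make_xml_safe('<p>ok</p>');     // parses, so it is returned unchanged
$messy = make_xml_safe('<p>bad<br>');    // unclosed tags, so it is wrapped in CDATA
```

Parsing inside a dummy root element is a design choice here: client content is usually a fragment with several top-level elements, which would fail a well-formedness check on its own.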

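Wrapping unpredictable HTML in CDATA sections keeps it from tripping up the XML parser, because the parser treats the contents as plain text rather than markup. Without the final step, the XSL processor escapes that text on output, so the resulting page contains the original HTML as entities (for example, &lt;p&gt; instead of <p>).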
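
To illustrate the difference the final step makes, here is a minimal sketch using PHP's XSLTProcessor. It assumes the dom and xsl extensions are loaded; the page and body element names and the CSS class names are invented for the example.

```php
<?php
// Assumes the dom and xsl extensions; the <page>/<body> document is a made-up example.
$xml = new DOMDocument();
$xml->loadXML('<page><body><![CDATA[<p>Hello<br>world</p>]]></body></page>');

$xslSource = <<<'XSL'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>
  <xsl:template match="/page">
    <div class="escaped"><xsl:value-of select="body"/></div>
    <div class="raw"><xsl:value-of select="body" disable-output-escaping="yes"/></div>
  </xsl:template>
</xsl:stylesheet>
XSL;

$xsl = new DOMDocument();
$xsl->loadXML($xslSource);

$processor = new XSLTProcessor();
$processor->importStylesheet($xsl);

// The first div receives the markup as entities; the second receives it verbatim.
$html = $processor->transformToXml($xml);
echo $html;
```

Note that disable-output-escaping passes the content through untouched, so whatever broken markup the client wrote ends up in the page as-is.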

In PHP, mb_convert_encoding($string, 'ASCII') has proven very useful for handling text that users paste from applications like Word. PHP has to be compiled with --enable-mbstring for this function to work. It prevents strings in a different encoding from the one your XSL declares from confusing the XML parser.
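
A small sketch of that call, assuming the pasted text arrives as UTF-8 (Word content often arrives as Windows-1252 instead, in which case the third argument would change). Characters with no ASCII equivalent are replaced by mbstring's substitute character, which is normally a question mark unless configured otherwise.

```php
<?php
// Curly quotes and an en dash as Word might paste them, written out as UTF-8 bytes.
$pasted = "\xE2\x80\x9CQuoted\xE2\x80\x9D text \xE2\x80\x93 from Word";

// The post uses the two-argument form, which takes the source encoding from
// PHP's internal encoding. Naming it explicitly (UTF-8 is an assumption here)
// makes the result predictable.
$ascii = mb_convert_encoding($pasted, 'ASCII', 'UTF-8');

// Every byte is now plain ASCII, so it cannot clash with the declared XML encoding.
echo $ascii, "\n";
```

The trade-off is that the conversion is lossy: the typographic characters are gone rather than translated, which is acceptable when the goal is simply to keep the parser from failing.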

