I'm trying to parse HTML with DOMDocument, make some changes to it, then serialize it back to a string that I send to the output.
But there are a few parsing issues: what I send to DOMDocument does not always come back in the same form :)
Here's a list:
using ->loadHTML:
- formats my document regardless of the preserveWhiteSpace and formatOutput settings (losing whitespace in preformatted text)
- gives me errors when I have HTML5 tags like <header>, <footer> etc. But they can be suppressed, so I can live with this.
- produces inconsistent markup: for example, if I add a <link ... /> element (with a self-closing tag), after parsing/saveHTML the output will be <link .. >
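A minimal sketch of the last two loadHTML points, assuming a made-up document and stylesheet name (`a.css`); whether `<header>` actually triggers a warning depends on the libxml2 version, so the collected errors are only gathered here, not relied on:

```php
<?php
// Collect libxml parse warnings (e.g. about unknown HTML5 tags) instead of emitting them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
// Sample markup is an assumption made for illustration only.
$dom->loadHTML('<html><head><link rel="stylesheet" href="a.css" /></head><body><header>Hi</header></body></html>');

// Warnings gathered during parsing, if any; clear them so they do not accumulate.
$errors = libxml_get_errors();
libxml_clear_errors();

// The HTML serializer writes void elements without the trailing slash.
$out = $dom->saveHTML();
echo $out;
```

The `<link ... />` from the input comes back as `<link rel="stylesheet" href="a.css">`.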
using ->loadXML:
- encodes entities like > from <style> or <script> tags: body > div becomes body &gt; div
- all tags are closed the same way, for example <meta ... /> becomes <meta...></meta>; but this can be fixed with a regex.
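The entity problem with loadXML can be reproduced with a short sketch like this one (the sample stylesheet rule is an assumption, not my real markup):

```php
<?php
$dom = new DOMDocument();
// A bare ">" is legal in XML text, so the document parses fine...
$dom->loadXML('<html><head><style>body > div { color: red; }</style></head></html>');

// ...but the XML serializer escapes it when writing the text node back out.
$out = $dom->saveXML($dom->documentElement);
echo $out;
```

The selector in the output reads `body &gt; div`, which a browser treating the result as HTML would not interpret as CSS.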
I haven't tried HTML5lib, but I'd prefer DOMDocument over a custom parser for performance reasons.
Update:
So, as the Honeymonster mentioned, using CDATA fixes the main problem with loadXML.
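As a rough sketch of that CDATA approach (sample markup assumed, and no LIBXML_NOCDATA option passed to the parser):

```php
<?php
$dom = new DOMDocument();
// Wrapping the style content in a CDATA section keeps it as a CDATA node in the tree.
$dom->loadXML('<html><head><style><![CDATA[body > div { color: red; }]]></style></head></html>');

// CDATA sections are serialized verbatim, so ">" is not turned into an entity.
$out = $dom->saveXML($dom->documentElement);
echo $out;
```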
Is there any way to prevent self-closing of all empty HTML tags except a certain set, without using a regex?
Right now I have:
$html = $dom->saveXML($node);
$html = preg_replace_callback('#<(\w+)([^>]*)\s*/>#s', function ($matches) {
    // keep the self-closing form only for these tags
    $xhtml_tags = array('br', 'hr', 'input', 'frame', 'img', 'area', 'link', 'col', 'base', 'basefont', 'param', 'meta');
    // if an element that is not in the above list is empty,
    // it should close like `<element></element>` (e.g. an empty `<title>`)
    return in_array($matches[1], $xhtml_tags) ? "<{$matches[1]}{$matches[2]} />" : "<{$matches[1]}{$matches[2]}></{$matches[1]}>";
}, $html);
which works, but it also performs the replacements inside the CDATA content, which I don't want...
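To make the CDATA problem concrete, here is a small reproduction. The wrapper function `fix_empty_tags` and the sample document are hypothetical, introduced only for this illustration; the regex and tag list are the ones from my code:

```php
<?php
// Hypothetical wrapper around the replacement from the question.
function fix_empty_tags(string $html): string
{
    $xhtml_tags = array('br', 'hr', 'input', 'frame', 'img', 'area', 'link', 'col', 'base', 'basefont', 'param', 'meta');

    return preg_replace_callback('#<(\w+)([^>]*)\s*/>#s', function ($matches) use ($xhtml_tags) {
        return in_array($matches[1], $xhtml_tags)
            ? "<{$matches[1]}{$matches[2]} />"
            : "<{$matches[1]}{$matches[2]}></{$matches[1]}>";
    }, $html);
}

$dom = new DOMDocument();
// The script content contains something that merely looks like an empty tag.
$dom->loadXML('<html><head><title/><script><![CDATA[var s = "<b/>";]]></script></head></html>');

$out = fix_empty_tags($dom->saveXML($dom->documentElement));
echo $out;
```

The empty `<title/>` is expanded as intended, but the string literal inside the CDATA section is rewritten too, because the regex works on the serialized string and has no notion of where the CDATA section starts or ends.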