PHP parse xml with html content

0

votes

Is it possible in PHP, using the default XML classes, to parse an XML file in such a way that only elements from one namespace are treated as XML? I want to parse XML files in which some elements contain HTML code, and preferably I don't want to wrap every such element in CDATA tags or escape all special characters. Since HTML has a syntax quite similar to XML, most parsers won't be able to parse this correctly.

Example:

<ns:root>
    <ns:date>
        06-12-2011
    </ns:date>
    <ns:content>
        <html>
        <head>
        <title>Sometitle</title>
        </head>
        <body>
        --a lot of stuff here
        </body>
        </html>
    </ns:content>
</ns:root>

In this example I want all the HTML inside ns:content to be the content of that element, and it shouldn't be parsed itself. Is this possible with the default parsers such as SimpleXML, or should I write my own parser?

Edit: Let me explain my situation a little better: I want to create a small personal PHP framework in which code is separated from the HTML (similar to MVC, but not quite the same). However, much of the HTML will be the same across multiple pages, though not all of it, and some data (e.g. from a database) has to be inserted into some pages, just as on any normal website. So I came up with the idea of using separate HTML component files, which can then be parsed by a script. This would look something like this:

main.fw:

<html>
<head>
    <title>
        <fw:placeholder name="title" />
    </title>
</head>
<body>
    <div id="menubar">
        <ul>
            <li>page1</li>
            <li>page2</li>
        </ul>
    </div>
    <div id="content">
        <fw:placeholder name="maincontent" />
    </div>
</body>
</html>

page1.fw

<fw:component file="main.fw">
    <fw:content name="title">
        page1
    </fw:content>
    <fw:content name="maincontent">
        some content with html
    </fw:content>
</fw:component>

Result after parsing (as rendered):

page1

page1
page2

some content with html

This question is mainly about that second type of file, in which html is nested inside xml elements.

php xml parsing

This has been done a million times before. Take a look at how other PHP CMS systems do it; I guess they have found a way that has proved to be good. – CodeZombie

I assumed many people had done this before, which is why I thought it should be possible. Do you happen to know a CMS that uses something similar? – Tiddo

1

votes

You can use textContent when using DOMDocument: http://www.php.net/manual/en/class.domnode.php
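A minimal sketch of this, assuming the ns prefix is declared (the URI urn:example:ns is made up for illustration; the question's example declares none) and that the HTML inside ns:content happens to be well-formed XML. Note that textContent returns only the text of a node and its descendants, so the tags themselves are dropped; if the markup is needed, serializing the child nodes with saveXML() keeps it.

```php
<?php
// Hypothetical input: the namespace URI "urn:example:ns" is an assumption.
$xml = <<<XML
<ns:root xmlns:ns="urn:example:ns">
    <ns:date>06-12-2011</ns:date>
    <ns:content><html><head><title>Sometitle</title></head><body><p>Hello</p></body></html></ns:content>
</ns:root>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

// Look up the element by namespace URI and local name.
$content = $doc->getElementsByTagNameNS('urn:example:ns', 'content')->item(0);

// textContent: concatenated text of all descendants, tags are dropped.
$text = $content->textContent;

// To keep the markup, serialize each child node instead.
$markup = '';
foreach ($content->childNodes as $child) {
    $markup .= $doc->saveXML($child);
}

echo $text, "\n";
echo $markup, "\n";
```

This only works when the embedded HTML is well-formed XML; otherwise loadXML() fails before textContent can be read.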

1

votes

You want the HTML code to be treated as non-XML content, and that's exactly what character data (CDATA) sections are designed for.

<ns:root>
    <ns:date>
        06-12-2011
    </ns:date>
    <ns:content>
        <![CDATA[
            <html>
            <head>
            <title>Sometitle</title>
            </head>
            <body>
            --a lot of stuff here
            </body>
            </html>
        ]]>
    </ns:content>
</ns:root>

It is better to rely on this than to write your own parser. Use the XMLWriter::writeCData() method to write the CDATA section.

Important: HTML tags inside the CDATA section do not need to be escaped!

Quote from Wikipedia CDATA:

However, if written like this:

<![CDATA[<sender>John Smith</sender>]]>

then the code is interpreted the same as if it had been written like this:

&lt;sender&gt;John Smith&lt;/sender&gt;
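A rough sketch of the round trip, writing the CDATA section with XMLWriter and reading it back with DOMDocument. The namespace URI urn:example:ns and the sample HTML string are assumptions made for illustration.

```php
<?php
// Sample HTML payload (made up for this sketch).
$html = '<html><head><title>Sometitle</title></head><body>a lot of stuff here</body></html>';

// Write the document; the namespace URI "urn:example:ns" is an assumption.
$writer = new XMLWriter();
$writer->openMemory();
$writer->startDocument('1.0', 'UTF-8');
$writer->startElementNs('ns', 'root', 'urn:example:ns');
$writer->writeElementNs('ns', 'date', null, '06-12-2011');
$writer->startElementNs('ns', 'content', null);
$writer->writeCdata($html);   // HTML goes in verbatim, no escaping
$writer->endElement();        // closes ns:content
$writer->endElement();        // closes ns:root
$writer->endDocument();
$xml = $writer->outputMemory();

// Read it back: the CDATA section comes out as plain character data.
$doc = new DOMDocument();
$doc->loadXML($xml);
$content = $doc->getElementsByTagNameNS('urn:example:ns', 'content')->item(0)->textContent;

var_dump($content === $html);
```

One caveat: a CDATA section cannot itself contain the sequence ]]>, so HTML that includes it (for example inside inline scripts) would have to be split across sections or escaped.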

1

votes

An XML file with some parts that are not XML is not an XML file. Thus you can't expect that an XML parser will be able to parse it. For a document to be XML the whole thing must be XML.

What you are asking for is essentially "is there a parser that will parse my made-up angle-bracket language." Maybe DOMDocument->loadHTML() or html5lib will interpret it according to your expectations, but no guarantees.

Is it really a terrible burden for your included "html" bits to be valid XML? This is good HTML hygiene anyway, and if you are willing to do that, you can implement your entire view system with XSL templates very easily. Most of the benefit of a node-aware template system is precisely that you can manipulate nodes directly and have pretty good assurances that the final document will be valid. Why have the burden of node-awareness with none of the benefit? You might as well use a string-based system like every other template system out there. At least it will be faster.

Note that once you have constructed your final DOM, you can output it as something else, like HTML, so just because all your input templates are XML doesn't mean your output has to be.
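As a small illustration of that last point (plain DOM only, not the XSL approach mentioned above), the sketch below loads a hypothetical well-formed XML fragment, modifies it through the DOM API, and serializes the same tree once as XML and once as HTML. The fragment and the added paragraph are made up for the example.

```php
<?php
// A hypothetical, well-formed XML template fragment.
$template = '<div id="content"><p>Line one<br/>line two</p></div>';

$doc = new DOMDocument();
$doc->loadXML($template);

// Manipulate nodes directly: append a paragraph built through the DOM API.
$p = $doc->createElement('p', 'Added through the DOM');
$doc->documentElement->appendChild($p);

// Serialize the same tree two ways.
$asXml  = $doc->saveXML($doc->documentElement); // XML rules, e.g. <br/>
$asHtml = $doc->saveHTML();                     // HTML rules, e.g. <br>

echo $asXml, "\n";
echo $asHtml;
```

Because the tree is built from nodes rather than strings, the appended paragraph cannot leave an unclosed tag behind, which is the assurance the answer refers to.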

1

votes

I decided to create a simple parser to see what the results would be. Since I don't parse valid XML, I will call it XMLIsh from now on.

The parser actually works quite well, and the performance isn't that bad either: in my tests it was only ~10 times slower than SimpleXMLElement on valid XML documents, even though SimpleXMLElement is built-in PHP functionality and my function is pure PHP. Unlike SimpleXMLElement, this parser also works on 'XMLIsh' documents, as described above. So as long as top speed is not required, this might be a valid solution.

In my situation these documents are only parsed once in a while, since the output is cached, so I think this will work for me.

Anyway, this is my code:

/**
 * This function parses a string as an XMLIsh document. An XMLIsh document is very similar to xml, but only one namespace should be parsed. 
 * 
 * parseXMLish walks through the document and creates a tree while doing so. 
 * Each element will be represented as an array, with the following content:
 * -index = 0: An array with as first element (index = 0) the type of the element. All following elements are its arguments with index=name and value=value.
 * -index = 1: Optional:an array with the content of this element. If the content is a string, this array will only have one element, namely the content of the string.
 * 
 * @param &$string The XMLIsh string to be parsed
 * @param $namespace The namespace which should be parsed.
 * @param &$offset The starting point of parsing. Default = 0
 * @param $openingTag The current opening tag. This argument shouldn't be set manually; the function needs it to check whether a closing tag is valid.
 */
function parseXMLish(&$string,$namespace,&$offset=0,$openingTag = ""){
    //Whitespace doesn't matter, so trim it:)
    $string = trim($string);
    $result = array();
    //We need to find our mvc elements. These elements use xml syntax and should have the namespace mvc. 
    //Opening, closing and self closing tags are found.
    while(preg_match("/<(\/)?{$namespace}:(\w*)(.*?)(\/)?>/",$string,$matches,PREG_OFFSET_CAPTURE,$offset)){
        //Before our first mvc element, other text might have been found (e.g. html code). 
        //This should be added to our result array first. Again, strip the whitespace.
        $preText = substr($string,$offset,$matches[0][1]-$offset);
        $trimmedPreText = trim($preText);
        if (!empty($trimmedPreText))
            $result[] = $trimmedPreText;
        //We could have found 2 types of tags: closing and opening (including self closing) tags.
        //We need to distinguish between those two.
        if ($matches[1][0] == ''){
            //This tag was an opening tag. This means we should add this to the result array.
            //We add the name of this tag to the element first.
            $result[][0][0] = $matches[2][0];
            //Tags can also have arguments. We will find them here, and store them in the result array.
            preg_match_all("/\s*(\w+=[\"']?\S+[\"'])/",$matches[0][0],$arguments);
            foreach($arguments[1] as $argument){
                //Limit the split to 2 parts so a value that contains "=" stays intact.
                list($name,$value)=explode("=",$argument,2);
                $value = str_replace("\"","",$value);
                $value = str_replace("'","",$value);
                $result[count($result)-1][0][$name]=$value;
            }
            //We need to recalculate our offset. So lets do that. 
            $offset +=  strlen($preText) + strlen($matches[0][0]);
            //Now we will have to fill our element with content. 
            //This is only necessary if this is a regular opening tag, and not a self-closing tag.
            //Reset $content first, so a self-closing tag doesn't inherit the content of an earlier sibling.
            $content = array();
            if (!(isset($matches[4]) && $matches[4][0] == "/")){
                $content = parseXMLish($string, $namespace, $offset,$matches[2][0]);                
            }
            //Only add content when there is any. 
            if (!empty($content))
                $result[count($result)-1][] = $content;
        }else{
            //This tag is a closing tag. It means that we only have to update the offset, and that we can go one level up
            //That is: return what we have so far back to the previous level. 
            //Note: the closing tag is the closing tag of the previous level, not of the current level. 
            if ($matches[2][0] != $openingTag)
                throw new Exception("Closing tag doesn't match the opening tag. Opening tag: $openingTag. Closing tag: {$matches[2][0]}");
            $offset +=  strlen($preText) + strlen($matches[0][0]);
            return $result;
        }
    }
    //If we have any text left after our last element, we should add that to the array too.
    $postText = substr($string,$offset);
    if (!empty($postText))
        $result[] = $postText;

    //We're done!
    return $result;     
}
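The core idea, treating only tags with one prefix as markup and everything else as opaque text, can also be seen in a much smaller form. The sketch below is not the parser above but a simplified tokenizer built on preg_split; the fw prefix and the sample string are assumptions for illustration, and it builds no tree and does no tag matching.

```php
<?php
// Simplified sketch, not the parser above: split on tags of one prefix only.
$source = '<fw:content name="title">page1</fw:content><p>plain <b>html</b></p>';
$prefix = 'fw'; // hypothetical namespace prefix

// Opening, closing and self-closing tags with this prefix act as delimiters.
$pattern = '/(<\/?' . preg_quote($prefix, '/') . ':\w+[^>]*>)/';

// Keep the delimiters in the result and drop empty strings.
$tokens = preg_split($pattern, $source, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

print_r($tokens);
```

Here the HTML fragment stays in one piece as a single token, while the fw: tags come out as separate tokens that a recursive step could then nest.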
