I decided to create a simple parser to see what the results would be. Since I don't parse valid XML, I will call it XMLIsh from now on.
The parser actually works quite well, and the performance isn't bad either: in my tests it was only ~10 times slower than SimpleXMLElement on valid XML documents, even though SimpleXMLElement is built-in PHP functionality and my function is pure PHP. Unlike SimpleXMLElement, this parser also works on 'XMLIsh' documents, as described a few times before. So as long as top speed is not required, this might be a valid solution.
In my situation these documents are only parsed once in a while, since the output is cached, so I think this will work for me.
Anyway, this is my code:
/**
* This function parses a string as an XMLIsh document. An XMLIsh document is very similar to xml, but only one namespace should be parsed.
*
* parseXMLish walks through the document and creates a tree while doing so.
* Each element will be represented as an array, with the following content:
* -index = 0: An array with as first element (index = 0) the type of the element. All following elements are its arguments with index=name and value=value.
* -index = 1: Optional:an array with the content of this element. If the content is a string, this array will only have one element, namely the content of the string.
*
* @param &$string The XMLIsh string to be parsed
* @param $namespace The namespace which should be parsed.
* @param &$offset The starting point of parsing. Default = 0
* @param $openingTag The current opening tag. This argument shouldn't be set manually; the function needs it to check whether a closing tag is valid.
*/
function parseXMLish(&$string,$namespace,&$offset=0,$openingTag = ""){
//Whitespace doesn't matter, so trim it:)
$string = trim($string);
$result = array();
//We need to find our mvc elements. These elements use xml syntax and should have the namespace mvc.
//Opening, closing and self closing tags are found.
while(preg_match("/<(\/)?" . preg_quote($namespace, "/") . ":(\w*)(.*?)(\/)?>/",$string,$matches,PREG_OFFSET_CAPTURE,$offset)){
//Before our first mvc element, other text might have been found (e.g. html code).
//This should be added to our result array first. Again, strip the whitespace.
$preText = substr($string,$offset,$matches[0][1]-$offset);
$trimmedPreText = trim($preText);
if ($trimmedPreText !== '') //Compare with '' instead of using empty(), so the text "0" isn't dropped.
$result[] = $trimmedPreText;
//We could have found 2 types of tags: closing and opening (including self closing) tags.
//We need to distinguish between those two.
if ($matches[1][0] == ''){
//This tag was an opening tag. This means we should add this to the result array.
//We add the name of this tag to the element first.
$result[][0][0] = $matches[2][0];
//Tags can also have arguments. We will find them here, and store them in the result array.
preg_match_all("/\s*(\w+=[\"']?\S+[\"'])/",$matches[0][0],$arguments);
foreach($arguments[1] as $argument){
list($name,$value)=explode("=",$argument,2); //Limit to 2 parts, so a value containing "=" stays intact.
$value = str_replace("\"","",$value);
$value = str_replace("'","",$value);
$result[count($result)-1][0][$name]=$value;
}
//We need to recalculate our offset. So lets do that.
$offset += strlen($preText) + strlen($matches[0][0]);
//Now we will have to fill our element with content.
//This is only necessary if this is a regular opening tag, and not a self-closing tag.
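$content = array(); //Reset, so a self-closing tag doesn't inherit the content of a previous sibling.
if (!(isset($matches[4]) && $matches[4][0] == "/")){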
$content = parseXMLish($string, $namespace, $offset,$matches[2][0]);
}
//Only add content when there is any.
if (!empty($content))
$result[count($result)-1][] = $content;
}else{
//This tag is a closing tag. It means that we only have to update the offset, and that we can go one level up
//That is: return what we have so far back to the previous level.
//Note: the closing tag is the closing tag of the previous level, not of the current level.
if ($matches[2][0] != $openingTag)
throw new Exception("Closing tag doesn't match the opening tag. Opening tag: $openingTag. Closing tag: {$matches[2][0]}");
$offset += strlen($preText) + strlen($matches[0][0]);
return $result;
}
}
//If we have any text left after our last element, we should add that to the array too.
$postText = substr($string,$offset);
if (!empty($postText))
$result[] = $postText;
//We're done!
return $result;
}
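To make the array layout from the docblock a bit more concrete, here is a minimal sketch of how such a tree could be walked. The helper describeXMLishTree is hypothetical and not part of the parser. The sample tree is written out by hand in the shape I would expect for `<mvc:list id="main"><mvc:item name="a"/>Hello</mvc:list>` with "mvc" as the namespace, so treat it as an illustration of the format rather than captured output.

```php
<?php
// Hypothetical helper (an assumption for illustration, not part of the parser):
// renders a tree in the layout documented for parseXMLish() as an indented outline.
function describeXMLishTree(array $nodes, int $depth = 0): string
{
    $out = '';
    $indent = str_repeat('  ', $depth);
    foreach ($nodes as $node) {
        if (!is_array($node)) {
            // Plain text between tags is stored as a bare string.
            $out .= $indent . 'text: ' . $node . "\n";
            continue;
        }
        // $node[0] holds the tag name at index 0, followed by name => value arguments.
        $arguments = $node[0];
        $type = $arguments[0];
        unset($arguments[0]);
        $pairs = array();
        foreach ($arguments as $name => $value) {
            $pairs[] = $name . '=' . $value;
        }
        $out .= $indent . $type . ($pairs ? ' (' . implode(', ', $pairs) . ')' : '') . "\n";
        // $node[1] is optional and holds the content (child nodes and text).
        if (isset($node[1])) {
            $out .= describeXMLishTree($node[1], $depth + 1);
        }
    }
    return $out;
}

// Hand-written sample tree following the documented layout.
$tree = array(
    array(
        array('list', 'id' => 'main'),            // index 0: type + arguments
        array(                                     // index 1: content
            array(array('item', 'name' => 'a')),   // self-closing tag: no content entry
            'Hello',
        ),
    ),
);

echo describeXMLishTree($tree);
```

With a real document, `$tree` would simply be the return value of parseXMLish($string, "mvc").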