
The use case is quite simple. I have a string (!) that contains an HTML document, and I would like to find nodes in it via an XPath expression and delete them.

I know how to find the nodes with PHP. It basically goes like this: create a new DOMDocument, call loadHTML (or loadXML), create a new DOMXPath, and then call its "query" or "evaluate" method. Done.
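For reference, a minimal sketch of those lookup steps. The sample markup and the `//div[@class="ad"]` expression are made up for illustration and are not from my real data:

```php
<?php
// Hypothetical sample input; any HTML string would do.
$html = '<html><body><div class="ad">Ad</div><p>Keep me</p><div class="ad">Ad 2</div></body></html>';

// Collect libxml parse warnings internally instead of printing them,
// since real-world HTML is rarely perfectly valid.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// query() returns a DOMNodeList with the matching nodes.
$nodes = $xpath->query('//div[@class="ad"]');

foreach ($nodes as $node) {
    echo $node->nodeName, ': ', $node->textContent, PHP_EOL;
}
```
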

However, deleting is the tricky part. One would think that you could just remove the nodes with a few statements (ending with parentNode->removeChild) and save the result back into the string with saveHTML. Unfortunately, this operation almost always transforms "too many things" in the original HTML string.

So my question now is: how could I delete the nodes returned by $xpath->query($query) without using saveHTML or saveXML, and without writing my own parser?

Hope it was clear enough :-)

Thanks for looking at this!


2 Answers

0
votes

First of all, make sure you remove the found nodes from the bottom up. This ensures that child nodes are removed before their parent nodes.
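A rough sketch of what removing from the bottom up could look like. It assumes the XPath result list comes back in document order (which is what you normally get), so walking it backwards handles descendants before their ancestors. The nested sample markup is invented for the example:

```php
<?php
// Hypothetical input with a matching node nested inside another matching node.
$html = '<html><body><div class="x">outer<div class="x">inner</div></div><p>Keep me</p></body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[@class="x"]');

// Iterate backwards so that nested matches are removed before their ancestors.
for ($i = $nodes->length - 1; $i >= 0; $i--) {
    $node = $nodes->item($i);
    if ($node->parentNode !== null) {
        $node->parentNode->removeChild($node);
    }
}

// Re-run the query against the document to confirm nothing is left.
$remaining = $xpath->query('//div[@class="x"]')->length;
echo "Remaining matches: ", $remaining, PHP_EOL;
```
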

Second, what do you mean by "transforms too many things"? PHP's DOM extension parses the document into a DOM node tree. You then work on the tree, and when you are done it converts the DOM tree back into XML/HTML. You may very well lose indentation, attributes may change places, and so on. The important thing is that the document means exactly the same thing, i.e. it is an exact XML/HTML representation of the DOM tree.
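To illustrate the parse-then-serialize round trip, here is a small sketch with a made-up fragment as input. Even without touching the tree, the serialized output is not byte-identical to the input, because loadHTML builds a full document around the fragment:

```php
<?php
// Hypothetical input: a bare fragment without doctype, <html> or <body>.
$original = '<p>Hello</p>';

$doc = new DOMDocument();
$doc->loadHTML($original);

// Serialize the unchanged tree straight back to a string.
$roundTripped = $doc->saveHTML();

echo $roundTripped;

// The string differs from the input even though no node was modified.
var_dump($roundTripped === $original);
```
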

0
votes

Emil, thanks for your quick answer.

Yes, you are right. This is how I removed the nodes and it worked:

Convert html STRING to DOM with loadHTML/loadXML -> identify nodes with xpath query -> remove nodes from DOM (like you described) -> convert DOM to html STRING with saveHTML/XML
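Roughly, that pipeline can be sketched as a single string-in, string-out function. The name `removeNodesByXPath` and the sample input are hypothetical and only there to make the example self-contained:

```php
<?php
// Hypothetical helper: the name and signature are illustrative only.
function removeNodesByXPath(string $html, string $query): string
{
    // Keep libxml quiet about imperfect markup, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($query);
    if ($nodes === false) {
        throw new InvalidArgumentException('Invalid XPath expression: ' . $query);
    }

    // Remove from the end of the list so children go before their parents.
    for ($i = $nodes->length - 1; $i >= 0; $i--) {
        $node = $nodes->item($i);
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // Serialize the whole document back to an HTML string.
    return $doc->saveHTML();
}

// Made-up sample input.
$input  = '<html><body><div class="ad">Ad</div><p>Keep me</p></body></html>';
$output = removeNodesByXPath($input, '//div[@class="ad"]');
echo $output;
```
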

That works. However, the problem is that the output after saveHTML is usually significantly different from the input (beyond the deleted nodes). I don't care about attribute order or white space, but sometimes sites don't even render correctly in a browser after saveHTML. I suspect that browsers deal better with less than perfect HTML code ...

Is there another way I could try - besides saveHTML?

Maybe it is not possible (or at least not without significant effort)? What do you think?
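One thing that might be worth trying, as a sketch only: it still goes through saveHTML, but it passes the `LIBXML_HTML_NOIMPLIED` and `LIBXML_HTML_NODEFDTD` options to loadHTML so that no doctype or html/body wrapper gets added. This assumes PHP 5.4 or later with a reasonably recent libxml, and a fragment with a single root element (with several top-level elements NOIMPLIED is known to rearrange them). The markup is invented for the example, and this does not stop the parser from repairing broken markup:

```php
<?php
// Hypothetical single-root fragment.
$fragment = '<div id="wrap"><span class="ad">Ad</span><p>Keep me</p></div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
// NOIMPLIED: do not add implied <html>/<body>; NODEFDTD: do not add a default doctype.
$doc->loadHTML($fragment, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//span[@class="ad"]');
for ($i = $nodes->length - 1; $i >= 0; $i--) {
    $node = $nodes->item($i);
    $node->parentNode->removeChild($node);
}

// Serialize only the root element rather than the whole document.
$result = $doc->saveHTML($doc->documentElement);
echo $result, PHP_EOL;
```
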