I have HTML markup like the following:

<body>
    <div>......</div>
    ............
    <div class="entry-content">
        <div class="code1 code2">(ads.....);</div>
        <p><img src="https://www..."></img></p>
        <h2> title </h2>
        <div class="code1-block code2">(ads.....);</div>
        <div class="data1 dta-ta1">
              <ul><li><p> text</p></li>
                  <li><span> text2 </span></li>
                  <li><span> text3 </span></li>
                  <div class="codex1 code-block"><span>(ads ....); </span></div>
                  <li><span> text4 </span></li>
                  <div class="codex1 code-block"><span>(ads ....); </span></div>
              </ul>
        </div> 
        <div class="codex2-block code2">(ads.....);</div>
        <div class="data2-entry dta-ta2">
              <p>
                <span> text5</span>
              </p>
              <p> text6 </p>
              <p> text7 </p
              <div class="codex1 code-block"><span>(ads ....); </span></div>
              <li><span> text8 </span></li>
              <div class="codex1 code-block"><span>(ads ....); </span></div>
        </div>
  </div>
</body>

I am trying to go into the div with class="entry-content" and get all the text from its child nodes, excluding child nodes whose class contains "code1", "code2", "codex1", or "codex2".

My code below just goes to that div and gets all the text from its child nodes. However, I cannot exclude the text of the child nodes with the code1 and code2 classes. I would appreciate your help. Thanks.

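 // Assumes $dom is a DOMDocument that already holds the parsed HTML.
 $classname = 'entry-content';
 $bodyContent = '';
 $a = new DOMXPath($dom);
 // Match any element whose class attribute contains $classname as a whole token.
 $query = "//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]";

 $list = $a->query($query);

 if ($list->length > 0) {
    foreach ($list as $element) {
        $nodes = $element->childNodes;

          // Collect the text of every child node (the ad blocks are still included).
          foreach ($nodes as $node) {
             $bodytext = trim(preg_replace('/[\r\n]+/', ' ', $node->nodeValue));
             $bodyContent .= '<p>' . $bodytext . '</p>';
          }
    }
 }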

My expected output:

https://www...

title

text2

text3

text4

text5

text6

text7

text8

What is your expected output? - Nick
I want to get all the text from the child nodes inside the div with class="entry-content" and exclude child nodes whose class contains "code", e.g. code1, code2, codex1, codex2. It's a bit challenging because of the complex HTML markup. I think we need a complex XPath query. - An Nguyen
Please edit your question with the expected output for the sample input you have provided. - Nick
OK! I've just edited. - An Nguyen

1 Answer

Your input document is not well-formed: a > is missing from </p, and one div is not closed properly. With the input document fixed, a working path expression is:

XPath expression

//div[@class='entry-content']//text()[not(ancestor::div/@class[contains(., 'code')])][normalize-space()]

It selects all text nodes inside the target div, except those that have an ancestor div element whose class attribute value contains "code". Whitespace-only text nodes are excluded as well.

Output

Individual results are separated by ------:

 title 
-----------------------
 text
-----------------------
 text2 
-----------------------
 text3 
-----------------------
 text4 
-----------------------
 text5
-----------------------
 text6 
-----------------------
 text7 
-----------------------
 text8 
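
As a minimal sketch of how the text expression could be evaluated from PHP: the markup below is a shortened, repaired stand-in for the question's document (an assumption, not the real page), and the variable names $xpath and $texts are only illustrative.

```php
<?php
// Shortened sample modeled on the question, with the malformed parts repaired.
$html = <<<'HTML'
<body>
  <div class="entry-content">
    <div class="code1 code2">(ads);</div>
    <h2> title </h2>
    <div class="data1 dta-ta1">
      <ul>
        <li><span> text2 </span></li>
        <div class="codex1 code-block"><span>(ads);</span></div>
      </ul>
    </div>
  </div>
</body>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$query = "//div[@class='entry-content']//text()"
       . "[not(ancestor::div/@class[contains(., 'code')])]"
       . "[normalize-space()]";

$texts = [];
foreach ($xpath->query($query) as $textNode) {
    // Collapse runs of whitespace and trim each selected text node.
    $texts[] = trim(preg_replace('/\s+/', ' ', $textNode->nodeValue));
}

echo implode("\n", $texts), "\n";
```

Note that @class='entry-content' is an exact match on the whole attribute value; if the div may carry additional classes, the token-matching predicate from the question is the safer choice.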

Update

I tried your answer and it works. However, I still need the source from the img tag. How can I get it?

It's possible to also select the src attribute of an img element in the same expression, but this would make the XPath expression even more complicated. You should just add another line of PHP to evaluate a separate path expression, such as:

//div[@class='entry-content']/p/img/@src
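
A rough sketch of that extra step, assuming the attribute is src (as on the img element in the question). The URL is a placeholder and $sources is a hypothetical variable name.

```php
<?php
// Placeholder markup; the image URL is an assumption for illustration only.
$html = '<div class="entry-content">'
      . '<p><img src="https://example.com/a.png"></p>'
      . '<h2> title </h2>'
      . '</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$sources = [];
foreach ($xpath->query("//div[@class='entry-content']/p/img/@src") as $attr) {
    // Each result is a DOMAttr; nodeValue holds the attribute's value.
    $sources[] = $attr->nodeValue;
}

echo implode("\n", $sources), "\n";
```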

Update 2

While I absolutely do not recommend using this expression (because it obfuscates your code), here is how to combine both expressions into a single one with the union operator:

//div[@class='entry-content']//text()[not(ancestor::div/@class[contains(., 'code')])][normalize-space()] | //div[@class='entry-content']//p/img/@src
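
If the combined expression is used anyway, the result list mixes text nodes and attribute nodes. One way to tell them apart in PHP is an instanceof check, sketched below on placeholder markup (the URL and the names $texts and $sources are assumptions).

```php
<?php
// Placeholder markup modeled on the question; not the real page.
$html = '<div class="entry-content">'
      . '<div class="code1 code2">(ads);</div>'
      . '<p><img src="https://example.com/a.png"></p>'
      . '<h2> title </h2>'
      . '<div class="data2-entry dta-ta2"><p> text6 </p>'
      . '<div class="codex1 code-block"><span>(ads);</span></div></div>'
      . '</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$query = "//div[@class='entry-content']//text()"
       . "[not(ancestor::div/@class[contains(., 'code')])][normalize-space()]"
       . " | //div[@class='entry-content']//p/img/@src";

$texts = [];
$sources = [];
foreach ($xpath->query($query) as $node) {
    $value = trim(preg_replace('/\s+/', ' ', $node->nodeValue));
    if ($node instanceof DOMAttr) {
        $sources[] = $value;   // came from the @src branch of the union
    } else {
        $texts[] = $value;     // came from the text() branch
    }
}

echo implode("\n", array_merge($sources, $texts)), "\n";
```

Keeping two separate queries avoids the type check altogether, which is one more reason to prefer them over the union.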