
I am looking for an HTML content extractor that supports XPath. I have seen various Node.js modules for this, such as:

jsdom, htmlparser2, xpath, cheerio

I found cheerio the best for getting data by class, id, tag, etc., but I cannot get data with it by specifying an XPath. With the xpath Node.js module I can get data by XPath for small HTML documents, but for longer HTML it gives various errors, such as:

entity not found:  @#[line:120,col:9], unclosed xml attribute @#[line:1,col:877]

Note: I do not have permission to change the HTML in any way.

For example, if my HTML is:

<html>
<body>

<div>

    <ul id="fruits">
        <li class="apple">Apple</li>
        <li class="orange">Orange</li>
        <li class="pear">Pear</li>
    </ul>

</div>

</body>


</html>

With this HTML and the XPath //*[@id="fruits"]/li[2], the xpath Node.js module gives no error and returns Orange, as expected. But if I use the HTML of this page, http://www.infotaxi.org/india_taxi/ahmedabad_taxi.htm (which is much longer), and try to access part of the text using the XPath

//*[@id="navlistmeniu"]/li[3]/a/b

I get the error:

entity not found:  @#[line:120,col:9]

Using cheerio I can extract data by class, id, tag, etc., but not by XPath.

Please help.

Is there a reason you need to use XPath? Isn't the point of cheerio to use normal selectors? $('#navlistmeniu > li').eq(2).find('a > b'); - Duane
Hi, this is also a great way, but I only have the XPath available, so I would need to convert my XPath into this form. Is there any way to do that? Specifically, I have the XPath of a child, such as the XPath of <li class="orange">Orange</li>, and I need to get the content of all three items, i.e. my output should be Apple, Orange, Pear. In other words, the output should be built from the parent of the given child. I hope that makes clear what I am asking. - Rajit Garg
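For the conversion asked about in this comment, here is a minimal sketch that turns a small subset of XPath into a CSS selector that cheerio could accept. The helper name xpathToCss is hypothetical, not part of cheerio or any other module mentioned here. It only handles steps of the form tag, *, tag[n], tag[@id="..."] and tag[@class="..."], and it assumes that attribute values contain no "/" and that id values need no CSS escaping. Anything else throws instead of guessing.

```javascript
// Hypothetical helper (not part of cheerio): converts a small XPath subset
// to a CSS selector. Unsupported steps raise an error rather than guessing.
function xpathToCss(xpath) {
  // One step = "/" or "//" followed by everything up to the next "/".
  const stepRe = /(\/\/?)([^\/]+)/g;
  // tag or *, optionally followed by [n], [@id="..."] or [@class="..."].
  const predRe = /^(\*|[A-Za-z][\w-]*)(?:\[(\d+)\]|\[@(id|class)=["']([^"']+)["']\])?$/;
  let css = "";
  let consumed = 0;
  let m;
  while ((m = stepRe.exec(xpath)) !== null) {
    const [whole, axis, step] = m;
    const p = predRe.exec(step);
    if (!p) throw new Error(`Unsupported XPath step: ${step}`);
    const [, tag, index, attr, value] = p;
    let sel = tag;
    if (index) {
      // XPath positions are 1-based, like :nth-of-type / :nth-child.
      sel += tag === "*" ? `:nth-child(${index})` : `:nth-of-type(${index})`;
    } else if (attr === "id") {
      sel = (tag === "*" ? "" : tag) + `#${value}`;
    } else if (attr === "class") {
      // Exact attribute match, which is what @class="..." means in XPath.
      sel += `[class="${value}"]`;
    }
    // "//" is the descendant axis, "/" is the child axis.
    if (css) css += axis === "//" ? " " : " > ";
    css += sel;
    consumed += whole.length;
  }
  if (consumed !== xpath.length) throw new Error(`Unsupported XPath: ${xpath}`);
  return css;
}

console.log(xpathToCss('//*[@id="fruits"]/li[2]'));
// prints: #fruits > li:nth-of-type(2)
```

The resulting string can then be used as an ordinary selector, e.g. $(xpathToCss(myXPath)) in cheerio. Note that li[2] maps to :nth-of-type(2), which is 1-based, whereas .eq() is 0-based.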

1 Answer


I think xpath-html is what you are looking for; test it yourself:

const xpath = require("xpath-html");
// `html` is the page source as a string.
const node = xpath.fromPageSource(html).findElement("//*[contains(text(), 'with love')]");
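For the follow-up in the comments (given the XPath of one child, get Apple, Orange, Pear), one option is to rewrite the XPath itself before handing it to whichever XPath engine you use. This is only a sketch: siblingsXPath is a hypothetical helper name, and it assumes the child XPath either ends in a positional predicate like [2] or can simply be extended with /../* (a standard XPath 1.0 step that goes to the parent and selects all of its element children).

```javascript
// Hypothetical helper: from the XPath of one child, build an XPath that
// selects that child together with its siblings.
function siblingsXPath(childXPath, sameTagOnly = false) {
  if (sameTagOnly) {
    // Drop a trailing positional predicate: ".../li[2]" becomes ".../li".
    return childXPath.replace(/\[\d+\]$/, "");
  }
  // Step up to the parent, then select all of its element children.
  return `${childXPath}/../*`;
}

console.log(siblingsXPath('//*[@id="fruits"]/li[2]'));
// prints: //*[@id="fruits"]/li[2]/../*
console.log(siblingsXPath('//*[@id="fruits"]/li[2]', true));
// prints: //*[@id="fruits"]/li
```

If I read the xpath-html README correctly, there is also a findElements method that returns every match, so the rewritten XPath could be passed to that and the text of each node collected, but I have not verified this against the page in the question.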