2
votes

First of all, is it possible to do such a thing?

I have been trying to generate XPath expressions from the sub-element text values present in a webpage, using the lxml (etree, html, getpath) and ElementTree modules in Python. But I don't know how to generate an XPath expression for a value present in the webpage. I do know about the Scrapy framework in Python, but this is different.

Below is my incomplete code:

import urllib2, re
from lxml import etree

def wgetUrl(target):
    try:
        req = urllib2.Request(target)
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')
        response = urllib2.urlopen(req)
        outtxt = response.read()
        response.close()
    except urllib2.URLError:
        # return an empty string on any fetch error
        return ''
    return outtxt


newUrl = 'http://www.iupui.edu/~webtrain/tutorials/tables.html' # homepage

dt = wgetUrl(newUrl)
parser = etree.HTMLParser()
tree   = etree.fromstring(dt, parser)

As per the lxml documentation, the element tree is created manually there, but how can I use my fetched and parsed HTML data (in my example, the variable tree) to access sub-elements, or, more importantly, their text values?

Let's say that, in the example webpage above, I want to search for the table "Supplies and Expenses" and generate an XPath expression dynamically from that value, "Supplies and Expenses".

Is there any option to do so? The ultimate goal I would like to achieve is to read a webpage and generate an XPath expression for a given sub-element text value present in it.

1
Do you want to generate an XPath for elements with specific text, or a "general" XPath like /html/body/div[6]/table/tbody/tr[1]/td/div/b for given elements (which might have different text on another page)? - Binux
I want to generate an XPath for elements with specific text. If I've understood correctly, there is no need to worry about other pages. - techDiscussion
The answer below is exactly what you want: an XPath pointing to the element with specific text. If you want an XPath like /html/..., first locate the element (with the XPath in the answer below, or by traversing the tree), then trace back to the root yourself. - Binux

1 Answer

3
votes

To find all elements based on part of their text value:

"//*[contains(text(), 'some_value')]"

For example, if you have this:

<div id="somediv">
    <span>Something is here</span>
    <a href="#">Click here</a>
</div>

You can find all sub-elements containing the word "here" like this:

"//div[@id='somediv']//*[contains(text(), 'here')]"

Or you can, for example, find all span sub-elements containing the word "Something":

"//div[@id='somediv']//span[contains(text(), 'Something')]"
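To sanity-check these expressions, here is a small self-contained sketch (assuming lxml is installed) that runs all three against the sample snippet above:

```python
from lxml import etree

html = """
<div id="somediv">
    <span>Something is here</span>
    <a href="#">Click here</a>
</div>
"""

# parse the fragment with the forgiving HTML parser
root = etree.fromstring(html, etree.HTMLParser())

# all elements anywhere whose text contains "here"
hits = root.xpath("//*[contains(text(), 'here')]")
print([e.tag for e in hits])  # the <span> and the <a>

# only sub-elements of #somediv containing "here"
hits = root.xpath("//div[@id='somediv']//*[contains(text(), 'here')]")
print([e.tag for e in hits])

# only <span> sub-elements of #somediv containing "Something"
hits = root.xpath("//div[@id='somediv']//span[contains(text(), 'Something')]")
print([e.text for e in hits])
```

Note that `contains(text(), ...)` looks at the element's own text node, so an element whose match lives only in a descendant will not be returned.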

As for running this in lxml (note that real-world pages need the HTML parser, since etree.fromstring() alone expects well-formed XML):

from lxml import etree

outtxt = response.read()
root = etree.fromstring(outtxt, etree.HTMLParser())
root.xpath("my_xpath_expression")

Update:

To get the full XPath expression for an element, use the ElementTree.getpath() method, like so:

tree = etree.ElementTree(root)
# print the absolute XPath of every
# element under 'root'
for e in root.iter():
    print(tree.getpath(e))
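Putting the two steps together, here is a minimal sketch (assuming lxml, and using a hypothetical sample page in place of the fetched HTML) that locates an element by its text and then asks the tree for its absolute XPath, which is the workflow described in the comments above:

```python
from lxml import etree

# hypothetical sample page standing in for the fetched HTML
html = """
<html><body>
  <div><table><tr><td><b>Supplies and Expenses</b></td></tr></table></div>
</body></html>
"""

root = etree.fromstring(html, etree.HTMLParser())
tree = etree.ElementTree(root)

# step 1: locate the element(s) carrying the target text
matches = root.xpath("//*[contains(text(), 'Supplies and Expenses')]")

# step 2: let lxml generate the absolute XPath for each match
for e in matches:
    print(tree.getpath(e))
```

The printed path is index-qualified (e.g. div[2]) only where an element has same-named siblings, so it stays as short as the document allows.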