2
votes

I am trying to extract the 'Total Marks' value from a results page using Python 3. The web page is shown in the image; the value I want to extract is '515'. The XPath of the element (from Firebug) is:

/html/body/div/div/div/div[3]/div[1]/div/div[2]/div[2]/table/tbody/tr[1]/td[2]/b

The code snippet used is:

summary_data_xpath = '//tbody/tr[1]/td[2]/b/text()'
data = html_tree.xpath(summary_data_xpath)
print(data)

But I get the output: []

I tried using the absolute path (the XPath given by Firebug). I also tried starting the expression from '//table', but I got the same result.

The two tables are structured as:

...
<div>
    <div>
        Upper Table with subject marks
    </div>
    Lower Table with subject marks and division
</div>
...

How can I extract the total marks '515' from the table? Thanks in advance for any assistance!


2 Answers

2
votes

I would locate the value relative to the "Total Marks" label that precedes it, using the following-sibling axis:

import requests
from lxml.html import fromstring

url = "http://results.vtu.ac.in/results/result_page.php?usn=3ae13cs089"

# Send a browser-like User-Agent header with the request
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36'})

root = fromstring(response.content)
# Find the cell whose <b> child reads "Total Marks", then take the <b> in the next cell
summary_data_xpath = './/td[b = "Total Marks"]/following-sibling::td/b'
# Strip any leading/trailing colons and spaces from the matched text
data = root.xpath(summary_data_xpath)[0].text.strip(": ")
print(data)

Prints 515.
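To try the expression without hitting the live site, here is a minimal offline sketch. The markup in SAMPLE_HTML and the helper name extract_total_marks are made up for illustration (the real page's table may be laid out differently); only the XPath and the strip(": ") call come from the snippet above.

```python
from lxml.html import fromstring

# Hypothetical markup standing in for the summary table; the real page may differ.
SAMPLE_HTML = """
<table>
  <tr><td><b>Total Marks</b></td><td><b>: 515</b></td></tr>
  <tr><td><b>Result</b></td><td><b>: FIRST CLASS</b></td></tr>
</table>
"""


def extract_total_marks(html):
    """Return the text next to the "Total Marks" label, or None if it is absent."""
    root = fromstring(html)
    # Select the <b> inside the cell(s) that follow the "Total Marks" label cell.
    cells = root.xpath('.//td[b = "Total Marks"]/following-sibling::td/b')
    if not cells:
        return None
    # Remove surrounding colons and spaces, as in the snippet above.
    return cells[0].text.strip(": ")


total = extract_total_marks(SAMPLE_HTML)
print(total)
```

Returning None when nothing matches avoids the IndexError that indexing [0] on an empty result list would raise.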

1
votes

As there are no useful id attributes to anchor on here, I'd select the row by its "Total Marks" label instead:

//tr[./td/b/text()="Total Marks"]/td[2]/b
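A small sketch of how this expression could be run with lxml, next to the tbody-based path from the question. The markup in SAMPLE_HTML is an assumption made for illustration, not the actual page source. It deliberately has no tbody element: browsers add one to the DOM that Firebug inspects, whereas lxml leaves the markup as written, which is one possible (unconfirmed for this page) reason the original expression returned [].

```python
from lxml.html import fromstring

# Hypothetical markup: a table written without <tbody>, as raw HTML often is.
SAMPLE_HTML = """
<div>
  <table>
    <tr><td><b>Total Marks</b></td><td><b>515</b></td></tr>
  </table>
</div>
"""

root = fromstring(SAMPLE_HTML)

# Label-anchored expression: pick the row containing the "Total Marks" label,
# then the <b> in its second cell.
matches = root.xpath('//tr[./td/b/text()="Total Marks"]/td[2]/b')
values = [b.text_content().strip() for b in matches]

# The expression from the question depends on a <tbody> that this markup lacks.
tbody_hits = root.xpath('//tbody/tr[1]/td[2]/b/text()')

print(values)
print(tbody_hits)
```

Because the label-anchored expression never mentions tbody, it matches whether or not that element is present in the served HTML.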