1
votes

I am new to Python and Scrapy, but I am impressed by the capabilities they have. My question is: what is the best way to extract data when the XPaths are the same?

For example:

<tr>
  <td>
    <a href="/user/Bob">Bob Job</a>
  </td>
  <td>hi</td>
  <td>280.0</td>
</tr>

I need to scrape the information from all 3 td fields. I used Firebug to copy the XPath, which it shows as

/html/body/table[2]/tbody/tr/td[2]/div/table/tbody/tr[2]/td[3]

What is the best way to extract the data when every row has the same XPath? I may only need the data from td[1] and td[3].

2 Answers

1
votes

You will have to identify a criterion for extracting each value and put the values in their respective item fields, e.g.

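# Attributes are addressed with @ in XPath, so the link target is @href.
link     = hxs.select('//td/a/@href').extract()[0]
linktext = hxs.select('//td/a/text()').extract()[0]
# .re() returns a list of all matches; the raw string keeps the backslashes literal.
number   = hxs.select('//td').re(r'\d+\.\d+')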

0
votes

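Firebug's "Copy XPath" output isn't always optimal. It describes the DOM as the browser built it, which can include tbody elements that the browser inserted and that are not in the page source, so an absolute path like that may not match the HTML Scrapy downloads.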

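When scraping tables, first find a way to iterate over the <tr> rows, like //table[@id='results']/tr, then run a second query relative to each row to grab the td fields you need, like ./td[1] and ./td[3]. (A bare //td would search the whole document again rather than the current row.) It's simpler that way.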
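
To illustrate the row-by-row idea, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree instead of Scrapy so that it runs without extra dependencies. The wrapping body and table elements, the second row, and the parse_rows name are assumptions made up for this example, and the results id is borrowed from the XPath above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup: the row from the question plus one invented row,
# wrapped in a table whose id matches the XPath used in the answer.
HTML = """
<body>
  <table id="results">
    <tr>
      <td><a href="/user/Bob">Bob Job</a></td>
      <td>hi</td>
      <td>280.0</td>
    </tr>
    <tr>
      <td><a href="/user/Ann">Ann Example</a></td>
      <td>hello</td>
      <td>12.5</td>
    </tr>
  </table>
</body>
"""


def parse_rows(markup):
    """Return one dict per table row, built from td[1] and td[3] only."""
    root = ET.fromstring(markup.strip())
    items = []
    # Step 1: select the rows of the table we care about.
    for row in root.findall(".//table[@id='results']/tr"):
        # Step 2: query relative to the current row, so each value
        # comes from this row and not from the whole document.
        link = row.find("./td[1]/a")
        amount = row.find("./td[3]")
        items.append({
            "link": link.get("href"),
            "name": link.text,
            "amount": float(amount.text),
        })
    return items


if __name__ == "__main__":
    for item in parse_rows(HTML):
        print(item)
```

The same shape should carry over to Scrapy: select the rows once, then call the selector on each row with a relative path such as ./td[1]/a/@href or ./td[3]/text(). ElementTree only supports a subset of XPath, which is why the href is read with .get() here rather than selected in the path.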