I ran into a complicated HTML structure on a website from which I want to extract text.
The website has the following structure:
<ul class = "listing_pages">
<li id = "list_1" style = ""></li>
<li id = "list_2" style = ""></li>
<li id = "list_3" style = ""></li>
<li id = "list_4" style = ""></li>
<li id = "list_5" style = ""></li>
<li id = "list_6" style = ""></li>
<li id = "list_7" style = ""></li>
<li id = "list_8" style = ""></li>
<li id = "list_9" style = ""></li>
</ul>
Each <li id="list_*"> expands into:
<li id="list_1">
<div class="description_block">
<table valign="top">
<tbody>
<tr valign="top">
<td width="400px">
<table>
<tbody>
<tr>
<td style="width:350px">
<div></div>
<table></table>
<table cellspacing="0">
<tbody>
<tr>
<td height="15px">
<h2>
<a class="product_title" title="PRODUCT_NAME" href="http://example.com">PRODUCT_NAME</a>
It's a nightmarish structure, and it's repeated for each list_*.
The relative XPath for the link above is
/div[9]/div[2]/div[3]/div[2]/form/div/div[2]/ul/li[1]/div[2]/table/tbody/tr/td[1]/table/tbody/tr/td/table[2]/tbody/tr/td/h2/a
This fails.
A few things I tried, with limited success:
response.xpath('//*[@id="one"]//table//tr//h2//a[position()]//text()').extract()
This extracts every h2/a on the page, not just those from a single list_*.
response.xpath('//*[@id="list_1"]//table//tr//h2//a//text()').extract()
This extracts the text correctly, but only from list_1. I could step through the results with extract()[++i], but that is not an optimal solution, and I think there must be a better way to do it.
What I want to accomplish:
Extract the text (PRODUCT_NAME) from each list_* in order.
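To make the goal concrete, here is a minimal sketch of the "one relative query per list item" idea, using only Python's standard library rather than Scrapy itself. The sample markup, the product names, and the helper product_names are made up for illustration. It also assumes the fragment is well-formed XML, which a real page may not be.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for the real page (assumed well-formed).
SAMPLE = """
<html><body>
<ul class="listing_pages">
  <li id="list_1"><div class="description_block"><table><tbody><tr><td><h2>
    <a class="product_title" title="First product" href="http://example.com">First product</a>
  </h2></td></tr></tbody></table></div></li>
  <li id="list_2"><div class="description_block"><table><tbody><tr><td><h2>
    <a class="product_title" title="Second product" href="http://example.com">Second product</a>
  </h2></td></tr></tbody></table></div></li>
</ul>
</body></html>
"""


def product_names(root):
    """Return the product title text of each list item, in document order."""
    names = []
    # Select every <li> directly under the listing, whatever its id is.
    for li in root.findall(".//ul[@class='listing_pages']/li"):
        # The leading "." keeps the search inside this <li> only, so the
        # deep table nesting between <li> and <a> does not matter.
        link = li.find(".//a[@class='product_title']")
        if link is not None and link.text:
            names.append(link.text.strip())
    return names


root = ET.fromstring(SAMPLE)
print(product_names(root))
```

In a Scrapy callback the same pattern would presumably be a loop over response.xpath('//ul[@class="listing_pages"]/li') with li.xpath('.//a[@class="product_title"]/text()').extract_first() inside it. If per-item grouping is not needed, a single expression such as response.xpath('//li[starts-with(@id, "list_")]//a[@class="product_title"]/text()').extract() should also return the names in document order.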