I am attempting to scrape items from a page containing various HTML elements and a series of nested tables.

I have some code working that is successfully scraping from table X where class="ClassA" and outputting table elements into a series of items, such as company address, phone number, website address, etc.

I would like to add some extra items to the list that I am outputting. However, the other items to be scraped aren't located within the same table, and some aren't located in a table at all, e.g. the <h1> tag in another part of the page.

How can I add other items to my output using XPath filters and have them appear in the same array / output structure? I noticed that if I scrape extra items from another table (even when the table has exactly the same class name and ID), those items are written to different lines in the CSV, so the CSV structure is not kept intact :(

I'm sure there must be a way for items to remain unified in the CSV output, even if they are scraped from slightly different areas of the page. Hopefully it's just a simple fix...

----- HTML EXAMPLE PAGE BEING SCRAPED -----

<html>
<head></head>
<body>

<!-- huge amount of other HTML and tables NOT to be scraped -->

<h2>HEADING TO BE SCRAPED - Company Name</h2>
<p>Company Description</p>

<table cellspacing="0" class="contenttable company-details">
<tr>
  <th>Item Code</th>
  <td>IT123</td>
</tr>
<tr>
  <th>Listing Date</th>
  <td>12 September, 2011</td>
</tr>
<tr>
  <th>Internet Address</th>
  <td class="altrow"><a href="http://www.website.com/" target="_top">http://www.website.com/</a></td>
</tr>
<tr>
  <th>Office Address</th>
  <td>123 Example Street</td>
</tr>    
<tr>
  <th>Office Telephone</th>
  <td>(01) 1234 5678</td>
</tr>       
</table>

<table cellspacing="0" class="contenttable" id="staff">
<tr><th>Management Names</th></tr>
<tr>
    <td>
    Mr John Citizen (CEO)<br/>Mrs Mary Doe (Director)<br/>Dr J. Watson (Manager)<br/>
    </td>
</tr>
</table>

<table cellspacing="0" class="contenttable company-details">    
<tr>
    <th>Contact Person</th>
    <td>        
    Mr John Citizen<br/>        
    </td>
</tr>   
<tr>
    <th class=principal>Company Mission</th>
    <td>ACME Corp is a retail sales company.</td>
</tr>   
</table>

</body>
</html>

---- SCRAPY CODE EXAMPLE ----

from scrapy.spider import Spider
from scrapy.selector import Selector
from my.items import MyItem

class MySpider(Spider):
    name = "my"
    allowed_domains = ["website.com"]
    start_urls = ["http://www.website.com/ABC"]

    def parse(self, response):
        sel = Selector(response)
        # Every table with this exact class attribute is treated as one "site"
        sites = sel.xpath('//table[@class="contenttable company-details"]')
        items = []

        # One item is created and yielded per matched table
        for site in sites:
            item = MyItem()
            item['Company_name'] = site.xpath('.//h1//text()').extract()
            item['Item_Code'] = site.xpath('.//th[text()="Item Code"]/following-sibling::td//text()').extract()
            item['Listing_Date'] = site.xpath('.//th[text()="Listing Date"]/following-sibling::td//text()').extract()
            item['Website_URL'] = site.xpath('.//th[text()="Internet Address"]/following-sibling::td//text()').extract()
            item['Office_Address'] = site.xpath('.//th[text()="Office Address"]/following-sibling::td//text()').extract()
            item['Office_Phone'] = site.xpath('.//th[text()="Office Telephone"]/following-sibling::td//text()').extract()
            item['Company_Mission'] = site.xpath('//th[text()="Company Mission"]/following-sibling::td//text()').extract()
            yield item

Outputting to CSV

scrapy crawl my -o items.csv -t csv

With the example code above, the [company mission] item appears on a different line in the CSV from the other items (I'm guessing because it's in a different table), even though the table has the same class name and ID. Additionally, I'm unsure how to scrape the <h1> field, since it falls outside the table structure matched by my current XPath sites filter.

I could expand the sites XPath filter to include more content, but won't that be less efficient and defeat the point of filtering altogether?

Here's an example of the debug log, where you can see that Company Mission is being processed twice for some reason and is empty in the first pass, which must be why it is output on a new line in the CSV. But why?

{'Item_Code': [u'ABC'],
 'Listing_Date': [u'1 January, 2000'],
 'Office_Address': [u'Level 1, Some Street, SYDNEY, NSW, AUSTRALIA, 2000'],
 'Office_Fax': [u'(02) 1234 5678'],
 'Office_Phone': [u'(02) 1234 5678'],
 'Company_Mission': [],
 'Website_URL': [u'http://www.company.com']}
2014-02-06 16:32:13+1000 [my] DEBUG: Scraped from <200 http://www.website.com/Code=ABC>
{'Item_Code': [],
 'Listing_Date': [],
 'Office_Address': [],
 'Office_Fax': [],
 'Office_Phone': [],
 'Company_Mission': [u'The comapany is involved in retail, food and beverage, wholesale services.'],
 'Website_URL': []}

The other thing I am completely baffled about is why the items are written to the CSV in a completely different order from the items on the HTML page and from the order I have defined in the spider's config file. Does Scrapy run completely asynchronously, returning items in whatever order it pleases?

2 Answers

0
votes

I understand you want to scrape 1 item for this page, but //table[@class="contenttable company-details"] matches 2 table elements in your HTML content, so the for site in sites: loop will run twice, creating 2 items.

And for each table, XPath expressions will be applied within the current table if they are relative -- .//th[text()="Item Code"]. Absolute XPath expressions, such as //th[text()="Company Mission"], will look for elements from the root element of your HTML document.

Your sample output shows the "Company_Mission" only once while you say it appears twice. And because you're using an absolute XPath expression for it, it should have indeed appeared twice. Not sure if the output matches your current spider code in the question.

So, first iteration of the loop,

    <table cellspacing="0" class="contenttable company-details">
    <tr>
      <th>Item Code</th>
      <td>IT123</td>
    </tr>
    <tr>
      <th>Listing Date</th>
      <td>12 September, 2011</td>
    </tr>
    <tr>
      <th>Internet Address</th>
      <td class="altrow"><a href="http://www.website.com/" target="_top">http://www.website.com/</a></td>
    </tr>
    <tr>
      <th>Office Address</th>
      <td>123 Example Street</td>
    </tr>    
    <tr>
      <th>Office Telephone</th>
      <td>(01) 1234 5678</td>
    </tr>       
    </table>

in which you can scrape:

  • Item Code
  • Listing Date
  • Internet Address --> Website URL
  • Office Address
  • Office Telephone

and because you're using an absolute XPath expression, //th[text()="Company Mission"]/following-sibling::td//text() will look anywhere in the document, not only in this first <table cellspacing="0" class="contenttable company-details">

These extracted fields go into an item of their own.

Then comes the 2nd table matching your XPath for sites:

    <table cellspacing="0" class="contenttable company-details">    
    <tr>
        <th>Contact Person</th>
        <td>        
        Mr John Citizen<br/>        
        </td>
    </tr>   
    <tr>
        <th class=principal>Company Mission</th>
        <td>ACME Corp is a retail sales company.</td>
    </tr>   
    </table>

for which a new MyItem() is instantiated, and here, no XPath expression matches except the absolute XPath for "Company Mission", so at the end of the loop iteration, you've got an item with only "Company Mission".

If you're sure you only expect 1 and only 1 item from this page, you can use longer XPaths like //table[@class="contenttable company-details"]//th[text()="Item Code"]/following-sibling::td//text() for each field you want, so that it will match in the 1st or 2nd table, and use only 1 MyItem() instance.

Also, you can try CSS selectors that would be shorter to read and write and easier to maintain:

  • "Company_name" <-- sel.css('h2::text')
  • "Item_Code" <-- sel.css('table.company-details th:contains("Item Code") + td::text')
  • "Listing_Date" <-- sel.css('table.company-details th:contains("Listing Date") + td::text')
  • etc.

Note that :contains() is available in Scrapy via cssselect underneath, but it's not standard (it was removed from the CSS specs, but is handy). The ::text pseudo-element selector is also non-standard, a Scrapy extension, and is also handy.

0
votes

"guessing because its in a different table": wrong guess. There is no correlation between tables and items; in fact, it does not matter where the data comes from, as long as you set it on the item fields.

Meaning you can take Company_name and Company_Mission from wherever you want.

Having said that, check what is returned from //th[text()="Company Mission"] and how many times it appears on the page. While the other items' XPaths are relative (they start with a .), this one is absolute (it starts with //), so it may scrape a list of items and not just one.