I am attempting to scrape items from a page containing various HTML elements and a series of nested tables.
I have some code working that is successfully scraping from table X where class="ClassA" and outputting table elements into a series of items, such as company address, phone number, website address, etc.
I would like to add some extra items into this list that i am outputting, however the other items to be scraped aren't located within the same table, and some aren't even located in a table at all, eg < H1 > tag in another part of the page.
How is it possible to add some other items into my output, using xpath filter and have them appear in the same array / output structure ? I noticed if I scrape extra table items from another table (even when the table has the exact same CLASS Name and ID) the CSV output for those other items are outputted on different lines in the CSV, not keeping the CSV structure intact :(
Im sure there must be a way for items to remain unified in a csv output, even if they are scraped from slightly different areas on a page ? Hopefully its just a simple fix...
----- HTML EXAMPLE PAGE BEING SCRAPED -----
<html>
<head></head>
<body>
< // huge amount of other HTML and tables NOT to be scraped >
<h2>HEADING TO BE SCRAPED - Company Name</h2>
<p>Company Description</p>
< table cellspacing="0" class="contenttable company-details">
<tr>
<th>Item Code</th>
<td>IT123</td>
</tr>
<th>Listing Date</th>
<td>12 September, 2011</td>
</tr>
<tr>
<th>Internet Address</th>
<td class="altrow"><a href="http://www.website.com/" target="_top">http://www.website.com/</a></td>
</tr>
<tr>
<th>Office Address</th>
<td>123 Example Street</td>
</tr>
<tr>
<th>Office Telephone</th>
<td>(01) 1234 5678</td>
</tr>
</table>
<table cellspacing="0" class="contenttable" id="staff">
<tr><th>Management Names</th></tr>
<tr>
<td>
Mr John Citizen (CEO)<br/>Mrs Mary Doe (Director)<br/>Dr J. Watson (Manager)<br/>
</td>
</tr>
</table>
<table cellspacing="0" class="contenttable company-details">
<tr>
<th>Contact Person</th>
<td>
Mr John Citizen<br/>
</td>
</tr>
<tr>
<th class=principal>Company Mission</th>
<td>ACME Corp is a retail sales company.</td>
</tr>
</table>
</body>
</html>
---- SCRAPY CODE EXAMPLE ----
from scrapy.spider import Spider
from scrapy.selector import Selector
from my.items import AsxItem
class MySpider(Spider):
name = "my"
allowed_domains = ["website.com"]
start_urls = ["http://www.website.com/ABC" ]
def parse(self, response):
sel = Selector(response)
sites = sel.xpath('//table[@class="contenttable company-details"]')
items = []
for site in sites:
item = MyItem()
item['Company_name'] = site.xpath('.//h1//text()').extract()
item['Item_Code'] = site.xpath('.//th[text()="Item Code"]/following-sibling::td//text()').extract()
item['Listing_Date'] = site.xpath('.//th[text()="Listing Date"]/following-sibling::td//text()').extract()
item['Website_URL'] = site.xpath('.//th[text()="Internet Address"]/following-sibling::td//text()').extract()
item['Office_Address'] = site.xpath('.//th[text()="Office Address"]/following-sibling::td//text()').extract()
item['Office_Phone'] = site.xpath('.//th[text()="Office Telephone"]/following-sibling::td//text()').extract()
item['Company_Mission'] = site.xpath('//th[text()="Company Mission"]/following-sibling::td//text()').extract()
yield item
Outputting to CSV
scrapy crawl my -o items.csv -t csv
With the example code above, the [company mission] item appears on a different line in the CSV to the other items (guessing because its in a different table) even though it has the same CLASS name and ID, and additionally im unsure how to scrape the < H1 > field since it falls outside the table structure for my current XPATH sites filter ?
I could expand the sites XPATH filter to include more content, but won't that be less effecient and defeat the point of filtering all together ?
Here's an example of the debug log, where you can see the Company Mission is being processed twice for some reason, and the first loop is empty, which must be why it is outputting onto a new line in the CSV, but why ??
{'Item_Code': [u'ABC'],
'Listing_Date': [u'1 January, 2000'],
'Office_Address': [u'Level 1, Some Street, SYDNEY, NSW, AUSTRALIA, 2000'],
'Office_Fax': [u'(02) 1234 5678'],
'Office_Phone': [u'(02) 1234 5678'],
'Company_Mission': [],
'Website_URL': [u'http://www.company.com']}
2014-02-06 16:32:13+1000 [my] DEBUG: Scraped from <200 http://www.website.com/Code=ABC>
{'Item_Code': [],
'Listing_Date': [],
'Office_Address': [],
'Office_Fax': [],
'Office_Phone': [],
'Company_Mission': [u'The comapany is involved in retail, food and beverage, wholesale services.'],
'Website_URL': []}
The other thing I am completely baffled about is why the items are spat out in the CSV in a completely different order to the items on the HTML page and the order I have defined in the spiders config file. Does scrapy run completely asynchronously returning items in whatever order it pleases ?