I am attempting to scrape some nutritional data from a website, and everything was going swimmingly until I ran into pages that are formatted slightly differently.
Using Selenium, a line like this returns an empty list:
values = browser.find_elements_by_class_name('size-12-fl-oz' or 'size-330-ml hide nutrition-value' or 'size-8-fl-oz nutrition-value')
Printing the result just gives:
[]
[]
[]
[]
[]
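One thing I noticed while poking at this in the interpreter is that the or chain seems to collapse to just the first string, so I suspect Selenium is only ever being asked for one class name (and from what I've read, class names containing spaces don't work with find_elements_by_class_name either):

# Checking what the `or` expression actually evaluates to before Selenium ever sees it:
selector = 'size-12-fl-oz' or 'size-330-ml hide nutrition-value' or 'size-8-fl-oz nutrition-value'
print(selector)   # prints: size-12-fl-oz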
But if I spell out the element position, it works fine:
kcal = data.find_elements_by_xpath("(.//div[@class='size-12-fl-oz nutrition-value' or 'size-330-ml hide nutrition-value' or 'size-8-fl-oz nutrition-value'])[position()=1]").text
The issue I've run into is that the elements are not the same from page to page as I iterate, so if a div does not exist at position 9, an error is thrown.
Now when I go back and try to wrap my code in a try/except, I am getting:
AttributeError: 'list' object has no attribute 'find_element_by_xpath'
or
AttributeError: 'list' object has no attribute 'find_elements_by_xpath'
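If it helps, this stripped-down version seems to reproduce the error, which makes me think it has something to do with find_elements_by_xpath (plural) handing back a plain Python list rather than a single element:

from selenium import webdriver

browser = webdriver.Firefox()
browser.get('http://www.tapintoyourbeer.com/index.cfm?id=3')
# find_elements (plural) returns a list of WebElements
data = browser.find_elements_by_xpath("//table[@class='beer-data-table']")
print(type(data))                             # <type 'list'>
kcal = data.find_element_by_xpath(".//div")   # AttributeError raised here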
Here is the code, with the areas I commented out during my back-and-forth testing.
import requests, bs4, urllib2, csv
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import NoSuchElementException

browser = webdriver.Firefox()
...
# Loop over URLs to get nutritional information from each one.
with open('products.txt') as f:
    for line in f:
        url = line
        # url = 'http://www.tapintoyourbeer.com/index.cfm?id=3'
        browser.get(url)
        with open("output.csv", "a") as o:
            writeFile = csv.writer(o)
            browser.implicitly_wait(3)
            product_name = browser.find_element_by_tag_name('h1').text.title()  # Get product name
            size = browser.find_element_by_xpath("(//div[@class='dotted-tab'])").text  # Get product size
            data = browser.find_elements_by_xpath("//table[@class='beer-data-table']")
            # values = []
            # values = browser.find_elements_by_class_name('size-12-fl-oz' or 'size-330-ml hide nutrition-value' or 'size-8-fl-oz nutrition-value')
            try:
                # values = data.find_elements_by_xpath("(.//div[@class='size-12-fl-oz nutrition-value' or 'size-330-ml hide nutrition-value' or 'size-8-fl-oz nutrition-value'])")
                kcal = data.find_element_by_xpath("(.//div[@class='size-12-fl-oz nutrition-value' or 'size-330-ml hide nutrition-value' or 'size-8-fl-oz nutrition-value'])[position()=1]").text
                kj = data.find_element_by_xpath("(.//div[@class='size-12-fl-oz nutrition-value' or 'size-330-ml hide nutrition-value' or 'size-8-fl-oz nutrition-value'])[position()=3]").text
                fat = data.find_element_by_xpath("(.//div[@class='size-12-fl-oz nutrition-value' or 'size-330-ml hide nutrition-value' or 'size-8-fl-oz nutrition-value'])[position()=5]").text
                carbs = data.find_element_by_xpath("(.//div[@class='size-12-fl-oz nutrition-value' or 'size-330-ml hide nutrition-value' or 'size-8-fl-oz nutrition-value'])[position()=7]").text
                protein = data.find_element_by_xpath("(.//div[@class='size-12-fl-oz nutrition-value' or 'size-330-ml hide nutrition-value' or 'size-8-fl-oz nutrition-value'])[position()=9]").text
                values = [kcal, kj, fat, carbs, protein]
                print values
                writeFile.writerow([product_name] + [size] + values)
            except NoSuchElementException:
                print("No Protein listed")

browser.quit()
I had it working earlier, producing a list and writing it to a CSV, but on occasion the position count would come out wrong.
[u'Budweiser', u'12 FL OZ', u'145.00', u'', u'', u'', u'']
[u"Beck'S", u'12 FL OZ', u'146.00', u'610.86', u'0.00', u'10.40', u'1.80']
[u'Bud Light', u'12 FL OZ', u'110.00', u'460.24', u'0.00', u'6.60', u'0.90']
[u'Michelob Ultra', u'12 FL OZ', u'95.00', u'397.48', u'0.00', u'2.60', u'0.60']
[u'Stella Artois', u'100 ML', u'43.30', u'KCAL/100 ML', u'181.17', u'KJ/100 ML', u'0.00']
The problems started when position 9 didn't exist on a particular page.
Are there any suggestions on how to fix this headache? Do I need to set up separate cases for the different pages and sizes?
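For what it's worth, the direction I was imagining is something like the snippet below (inside the same loop as above), using find_elements so a missing div just shortens the list instead of throwing an exception. The contains() selector and the index handling are only guesses on my part and not tested:

# Rough sketch of the fallback I was considering (same loop/browser as above):
table = browser.find_element_by_xpath("//table[@class='beer-data-table']")
cells = table.find_elements_by_xpath(".//div[contains(@class, 'nutrition-value')]")
texts = [c.text for c in cells]
# The values I want were at positions 1, 3, 5, 7, 9, i.e. list indexes 0, 2, 4, 6, 8.
values = [texts[i] if i < len(texts) else '' for i in (0, 2, 4, 6, 8)]
print(values)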
I appreciate the help.