I'm building a simple scraping tool. Here's the code I have so far.

from bs4 import BeautifulSoup
import requests
from lxml import html
import gspread
from oauth2client.service_account import ServiceAccountCredentials
import datetime

scope = ['https://spreadsheets.google.com/feeds']

credentials = ServiceAccountCredentials.from_json_keyfile_name('Programming 4 Marketers-File-goes-here.json', scope)

site = 'http://nathanbarry.com/authority/'
hdr = {'User-Agent':'Mozilla/5.0'}
req = requests.get(site, headers=hdr)

soup = BeautifulSoup(req.content)

def getFullPrice(soup):
    divs = soup.find_all('div', id='complete-package')
    price = ""
    for i in divs:
        price = i.a
    completePrice = (str(price).split('$',1)[1]).split('<', 1)[0]
    return completePrice


def getVideoPrice(soup):
    divs = soup.find_all('div', id='video-package')
    price = ""
    for i in divs:
        price = i.a
    videoPrice = (str(price).split('$',1)[1]).split('<', 1)[0]
    return videoPrice

fullPrice = getFullPrice(soup)
videoPrice = getVideoPrice(soup)
date = datetime.date.today()

gc = gspread.authorize(credentials)
wks = gc.open("Authority Tracking").sheet1

row = len(wks.col_values(1))+1

wks.update_cell(row, 1, date)
wks.update_cell(row, 2, fullPrice)
wks.update_cell(row, 3, videoPrice)

This script runs fine on my local machine, but when I deploy it as part of an app to Heroku and run it there, I get the following error:

Traceback (most recent call last):
  File "/app/.heroku/python/lib/python3.6/site-packages/gspread/client.py", line 219, in put_feed
    r = self.session.put(url, data, headers=headers)
  File "/app/.heroku/python/lib/python3.6/site-packages/gspread/httpsession.py", line 82, in put
    return self.request('PUT', url, params=params, data=data, **kwargs)
  File "/app/.heroku/python/lib/python3.6/site-packages/gspread/httpsession.py", line 69, in request
    response.status_code, response.content))
gspread.exceptions.RequestError: (400, "400: b'Invalid query parameter value for cell_id.'")

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "AuthorityScraper.py", line 44, in <module>
    wks.update_cell(row, 1, date)
  File "/app/.heroku/python/lib/python3.6/site-packages/gspread/models.py", line 517, in update_cell
    self.client.put_feed(uri, ElementTree.tostring(feed))
  File "/app/.heroku/python/lib/python3.6/site-packages/gspread/client.py", line 221, in put_feed
    if ex[0] == 403:
TypeError: 'RequestError' object does not support indexing

What do you think might be causing this error? Do you have any suggestions for how I can fix it?

2 Answers

2
votes

There are a couple of things going on:

1) The Google Sheets API returned an error: "Invalid query parameter value for cell_id":

gspread.exceptions.RequestError: (400, "400: b'Invalid query parameter value for cell_id.'")

2) A bug in gspread caused an exception upon receipt of the error:

TypeError: 'RequestError' object does not support indexing

Python 3 removed __getitem__ from BaseException, which gspread's error handling relies on here (the ex[0] check in put_feed). This doesn't matter much, because the handler would have raised an UpdateCellError exception anyway.
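
To illustrate the Python 3 behaviour, here is a minimal sketch. The RequestError class below is a stand-in I defined for the example, not gspread's actual class, and the two helper functions are hypothetical names. Indexing the exception fails, while reading ex.args[0] still works:

```python
class RequestError(Exception):
    """Stand-in for gspread.exceptions.RequestError (illustration only)."""


def status_by_index(ex):
    # Python 2 style: exceptions could be indexed like a tuple.
    return ex[0]


def status_by_args(ex):
    # Works on Python 2 and 3: the constructor arguments live in .args.
    return ex.args[0]


err = RequestError(400, "400: b'Invalid query parameter value for cell_id.'")

try:
    status_by_index(err)
    indexing_result = "indexing worked"
except TypeError as exc:
    # On Python 3 this branch runs, matching the second traceback.
    indexing_result = "TypeError: {}".format(exc)

status = status_by_args(err)
print(indexing_result)
print(status)
```

Either way, the underlying problem is still the 400 response from the API, not the TypeError.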

My guess is that you are passing an invalid row number to update_cell. It would be helpful to add some debug logging to your script to show, for example, which row it is trying to update.
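
As a rough sketch of that kind of debug logging: the helper name next_free_row and the logger name are my own, and I am assuming the worksheet object exposes col_values and row_count as gspread worksheets do. Whether the row really exceeds the sheet size on Heroku is a guess you would confirm from the log output:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("AuthorityScraper")


def next_free_row(wks):
    """Return the row after the last value in column 1, logging what was found."""
    used = len(wks.col_values(1))
    row = used + 1
    log.info("column 1 has %d values, sheet has %d rows, target row is %d",
             used, wks.row_count, row)
    if row > wks.row_count:
        # A target row past the end of the sheet is one possible cause
        # of an invalid cell_id; this is an assumption, not a diagnosis.
        log.warning("target row %d is beyond the sheet's %d rows",
                    row, wks.row_count)
    return row
```

In the script above, this would replace the line that computes row, i.e. row = next_free_row(wks).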

It may be better to start with a worksheet that has zero rows and use append_row instead. However, there does seem to be an outstanding issue in gspread with append_row, and it may actually be the same issue you are running into.
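
A minimal sketch of the append_row approach, in case it helps. The function name record_prices is hypothetical, and wks is assumed to be the gspread worksheet from your script. I also convert the date to a string explicitly, which is my own precaution rather than a confirmed fix:

```python
import datetime


def record_prices(wks, full_price, video_price, today=None):
    """Append one row of (date, full price, video price) to the worksheet."""
    today = today or datetime.date.today()
    # Pass plain strings rather than a date object.
    row = [today.isoformat(), full_price, video_price]
    # append_row adds the values after the last row of the sheet,
    # so there is no row number to compute.
    wks.append_row(row)
    return row
```

You would call it as record_prices(wks, fullPrice, videoPrice) in place of the three update_cell calls.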

0
votes

I encountered the same problem. BS4 works fine on a local machine, but for some reason it is far too slow on the Heroku server, which ends up causing an error.

I switched to lxml and it is working fine now.

Install it with:

pip install lxml

A sample code snippet is given below:

from lxml import html
import requests

getpage = requests.get("https://url_here")
gethtmlcontent = html.fromstring(getpage.content)
data = gethtmlcontent.xpath('//div[@class="class-name"]/text()')
# This is a sample for fetching text from a dummy div.
n = 5  # placeholder: number of items to keep, adjust as required
data = data[:n]

# Now inject the data into the Django template.