2
votes

I'm using BeautifulSoup to scrape the href of each product on this page: http://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=digital+camera. These hrefs end with "keywords=digital+camera". Here's my code:

from bs4 import BeautifulSoup
import requests

url = "http://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=digital+camera"
keyword = "keywords=digital+camera"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)
for link in soup.find_all('a'):
    href = link.get('href')
    if href is None:
        continue
    elif keyword in href:
        print(href)

The script above prints nothing. Is there anything I can do to fix it? Thanks.

1
Have you tried printing all of the hrefs and searching them (Ctrl+F) to check that you are actually getting what you think you are getting? I just printed [x.get("href") for x in soup.find_all('a')] and didn't get anything containing the string "keywords=digital+camera". - TehTris
@TehTris Yes, I tried. I didn't get anything with the keyword in it. - c20ad4d76fe97759aa27a0c99bff67

1 Answer

2
votes

Amazon detects the User-Agent header (the name of your browser) and changes the content it returns based on that value. If you add a User-Agent to the request, you'll get the hrefs ending in "keywords=digital+camera". Otherwise, you won't.

import urllib2
from bs4 import BeautifulSoup

url = "http://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=digital+camera"

# Send a browser-like User-Agent so Amazon returns the full search page
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib2.urlopen(req).read()
soup = BeautifulSoup(html)

for link in soup.find_all('a'):
    href = link.get('href')
    title = link.get('title', '')  # default to '' so the 'in' test never sees None
    if 'Sony W800/B 20.1 MP Digital' in title:
        print(href)  # prints: http://www.amazon.com/Sony-W800-Digital-Camera-Black/dp/B00I8BIBCW/ref=sr_1_2/183-0842534-8993425?s=photo&ie=UTF8&qid=1421400650&sr=1-2&keywords=digital+camera
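
The code above is Python 2. In Python 3, urllib2 was merged into urllib.request, so here is a rough sketch of the same idea for Python 3: send a browser-like User-Agent, then keep only the hrefs that contain the keyword from the question. To stay dependency-free it swaps BeautifulSoup for the standard library's html.parser. The names HrefCollector, matching_hrefs, build_request and fetch are made up for illustration, and the UTF-8 decoding is an assumption about the response. Amazon may also change its markup or block the request anyway, so treat this as a starting point rather than a guaranteed fix.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen

URL = "http://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=digital+camera"
KEYWORD = "keywords=digital+camera"


class HrefCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:  # skip anchors without an href
                self.hrefs.append(href)


def matching_hrefs(html, keyword=KEYWORD):
    """Return the hrefs in `html` that contain `keyword`."""
    collector = HrefCollector()
    collector.feed(html)
    return [href for href in collector.hrefs if keyword in href]


def build_request(url, user_agent="Mozilla/5.0"):
    """Build a request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Download the page; assumes the body can be decoded as UTF-8."""
    with urlopen(build_request(url)) as response:
        return response.read().decode("utf-8", errors="replace")
```

To run it against the live page, loop over matching_hrefs(fetch(URL)) and print each href. Keeping the filtering in matching_hrefs, separate from the download, also lets you test it on a saved copy of the HTML without hitting Amazon each time.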