This is my first attempt at web scraping. I am trying to use Beautiful Soup to scrape phone numbers from Raymond James' website. An example would be http://www.raymondjames.com/office_locator_display.asp?addressline=90210

When I parse the page with BeautifulSoup, the phone numbers are nowhere in the HTML I get back.

import urllib2
from bs4 import BeautifulSoup

url='http://www.raymondjames.com/office_locator_display.asp?addressline=90210'

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36')]
page_to_scrape=opener.open(url).read()
soup=BeautifulSoup(page_to_scrape.decode('utf-8','ignore'))

The output does not contain the information I need. It seems the URL I provide does not point to the frame that holds the office locations.

I don't do much work with web data in Python, so I don't know how to direct Beautiful Soup into that frame to get the contact information.

I believe additional JavaScript code loads the address list after the browser has loaded the page. You'll have to analyze the page using your browser's developer tools. Look for extra network requests that may contain the addresses, and emulate those. - Martijn Pieters

1 Answer

As Martijn said, dig into the network requests and you'll find the source data. In this case it's an XML response to a GET request made from inside the iframe. Armed with that URL, the solution is pretty simple:

import urllib2
from bs4 import BeautifulSoup
# the request URL observed in the browser's network tab (the search is a URL-encoded XML document)
url = 'http://hosted.where2getit.com/raymondjames/ajax?&xml_request=%3Crequest%3E%3Cappkey%3E7BD67064-FC36-11E0-B80D-3AEEDDB2B31E%3C%2Fappkey%3E%3Cformdata+id%3D%22locatorsearch%22%3E%3Cdataview%3Estore_default%3C%2Fdataview%3E%3Climit%3E30%3C%2Flimit%3E%3Cgeolocs%3E%3Cgeoloc%3E%3Caddressline%3E90210%3C%2Faddressline%3E%3Clongitude%3E%3C%2Flongitude%3E%3Clatitude%3E%3C%2Flatitude%3E%3Ccountry%3E%3C%2Fcountry%3E%3C%2Fgeoloc%3E%3C%2Fgeolocs%3E%3Csearchradius%3E25%7C50%7C100%3C%2Fsearchradius%3E%3C%2Fformdata%3E%3C%2Frequest%3E'
soup = BeautifulSoup(urllib2.urlopen(url), 'lxml')
# parse the points of interest into a list
pois = soup.find_all('poi')
# now have your way with them!
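
If you want to search other ZIP codes, the long query string above is just a URL-encoded XML document, so you can rebuild it rather than paste it. Below is a rough sketch in Python 3 (the snippets above are Python 2) using only the standard library; note that it swaps BeautifulSoup for xml.etree.ElementTree so it has no third-party dependencies. The helper names build_locator_url and parse_pois are mine, and the `phone` tag and the sample response are assumptions for illustration: only the `poi` element name comes from the code above, so check the real response for the actual field names.

```python
from urllib.parse import quote_plus
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# Endpoint and app key copied from the request URL above.
BASE_URL = 'http://hosted.where2getit.com/raymondjames/ajax'
APPKEY = '7BD67064-FC36-11E0-B80D-3AEEDDB2B31E'


def build_locator_url(addressline, limit=30):
    """Rebuild the locator request URL for an arbitrary address or ZIP code."""
    request_xml = (
        '<request><appkey>{key}</appkey>'
        '<formdata id="locatorsearch"><dataview>store_default</dataview>'
        '<limit>{limit}</limit><geolocs><geoloc>'
        '<addressline>{addr}</addressline>'
        '<longitude></longitude><latitude></latitude><country></country>'
        '</geoloc></geolocs><searchradius>25|50|100</searchradius>'
        '</formdata></request>'
    ).format(key=APPKEY, limit=limit, addr=escape(str(addressline)))
    # quote_plus percent-encodes the XML and turns spaces into '+'
    return BASE_URL + '?&xml_request=' + quote_plus(request_xml)


def parse_pois(xml_text, fields=('phone',)):
    """Collect the requested child fields from every <poi> element.

    The default field name 'phone' is a guess; missing fields come back as None.
    """
    root = ET.fromstring(xml_text)
    return [{f: poi.findtext(f) for f in fields} for poi in root.iter('poi')]


# Hypothetical response, made up for illustration only.
SAMPLE_RESPONSE = (
    '<response><collection>'
    '<poi><name>Office A</name><phone>555-0100</phone></poi>'
    '<poi><name>Office B</name><phone>555-0101</phone></poi>'
    '</collection></response>'
)

if __name__ == '__main__':
    print(build_locator_url('90210'))
    print(parse_pois(SAMPLE_RESPONSE, fields=('name', 'phone')))
```

To run it against the live service you would fetch build_locator_url(zip_code) with urllib.request.urlopen and pass the decoded body to parse_pois, assuming the endpoint still answers the way it did when this was written.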