
I'm trying to use Python and Beautiful Soup to open a link and extract data that is embedded within a tag. I've tried several approaches but have reached the limit of what I know.

Here are the relevant portions of my code and the markup I am trying to extract the data from:

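import urllib.request
from bs4 import BeautifulSoup

sauce = urllib.request.urlopen(link).read()  # link is the url
soup = BeautifulSoup(sauce, 'lxml')

# select() returns a list of matching tags; take the first one
yy = soup.select('span[id^=ctl00_ContentPlaceHolder1_Label1]')
y = yy[0]
print(y)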

print(y) results in the following data:

        '<span id="ctl00_ContentPlaceHolder1_Label1"><div style="width:100%;clear:both;overflow:hidden;">\
        <div style="width:17%;float:left;margin-right:10px;"><span style="font-size:16px;font-weight:bold;"> \
        Licensee:</span></div><div style="float:left;"><span style="font-size:14px;font-weight:bold;">Company, INC.</span></div></div><div \
        style="width:100%;clear:both;overflow: hidden;"><div style="width:17%;float:left;margin-right:10px;"> \
        <span style="font-size:16px;font-weight:bold;">Facility:</span></div><div style="float:left;"> \
        <span style="font-size:14px;font-weight:bold;">Joes Shop</span></div></div><br/><b>Status:</b> \
        Licensed<br/><b>JOE SMITH - Director</b><br/><b>Phone:</b> (555)555-5555<br/> <span style="font-size:8pt"><table \
        border="1" style="padding:1px 1px 5px 1px;border:1px solid #999999;width:497px;border-collapse:collapse;"><tr><td \
        width="50%"><b>Daytime Hours:</b>  07:30 AM - 03:30 PM</td><td width="50%"><b>Nighttime Hours:</b>   \
        N/A - N/A</td></tr><tr><td width="50%"><b>Daytime Ages:</b>  4 YRS Through 5 YRS</td><td width="50%"><b> \
        Nighttime Ages:</b>  N/A</td></tr></table></span><br/><span style="font-size:12px;font-weight:bold;"> \
        Mailing Address:</span><br/><span style="font-size:12px;">1909 CENTRAL PARK</span><br/> \
        <span style="font-size:12px;">NEW YORK</span>, <span style="font-size:12px;">NY</span> \
        <span style="font-size:12px;">58756</span><br/><br/><span style="font-size:12px;font-weight:bold;"> \
        Street Address:</span><br/><span style="font-size:12px;">3996 Rhode Ave</span><br/> \
        <span style="font-size:12px;">Cleveland</span>, <span style="font-size:12px;">OH</span> <span style="font-size:12px;">58475</span></span>'

I've tried:

ystring = y.getText(separator=u' ')

But this only gave me all the text, labels included, as one string. All I want is the actual values: name, phone number, address, etc.

Specifically, I'm trying to extract the following: Licensee (Company, Inc), Facility (Joes Shop), Status (Licensed), Director (Joe Smith), Phone ((555) 555-5555), Daytime Hours (07:30 AM - 03:30 PM), Nighttime Hours (N/A - N/A), Daytime Ages (4 YRS Through 5 YRS), Nighttime Ages (N/A), Mailing Address (1909 Central Park, New York, NY, 58756) and Street Address (3996 Rhode Ave, Cleveland, OH, 58475), with street, city, state and zip separated by commas in both addresses.

Any thoughts or suggestions are greatly appreciated.

2 Answers

0
votes

.descendants iterates over everything nested inside a tag: its children, their children, and so on. You can use it to collect all NavigableString objects (and drop the empty ones). The snippet below does just that.

From there it depends on what you want to do: use regular expressions to search the list and format the parts to your specification; implement some static extraction, if the pages you parse all look the same and the list's indices are identical; or try some machine learning to parse the content.

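import urllib.request
from bs4 import BeautifulSoup, NavigableString

sauce = urllib.request.urlopen(link).read()  # link is the url
soup = BeautifulSoup(sauce, 'lxml')
# find() returns the single matching tag; soup(...) would return a list of matches
span = soup.find('span', attrs={'id': 'ctl00_ContentPlaceHolder1_Label1'})

# keep every non-empty text node nested inside that span
[c.strip() for c in span.descendants if type(c) == NavigableString and len(c.strip()) > 0]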
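
As a rough illustration of the regular-expression option mentioned above, the sketch below searches such a list of strings for a phone number and an hours range and reformats the phone number. The `strings` sample, the two patterns and the `first_match` helper are all assumptions made for this example, based only on the single page shown in the question; other pages may need looser patterns.

```python
import re

# Sample of the stripped strings the comprehension yields for the question's HTML
# (hard-coded here so the example runs without fetching the page).
strings = ['Status:', 'Licensed', 'JOE SMITH - Director', 'Phone:', '(555)555-5555',
           'Daytime Hours:', '07:30 AM - 03:30 PM']

# Assumed formats, taken from the one page in the question.
PHONE = re.compile(r'\((\d{3})\)\s*(\d{3})-(\d{4})')
HOURS = re.compile(r'\d{2}:\d{2} [AP]M - \d{2}:\d{2} [AP]M')

def first_match(pattern, strings):
    """Return the first full match among the strings, or None (hypothetical helper)."""
    for s in strings:
        m = pattern.fullmatch(s)
        if m:
            return m
    return None

phone = first_match(PHONE, strings)
hours = first_match(HOURS, strings)

# Reformat the phone number as '(555) 555-5555'; leave the hours untouched.
phone_text = '({}) {}-{}'.format(*phone.groups()) if phone else None
hours_text = hours.group(0) if hours else None
print(phone_text)
print(hours_text)
```

Matching on the value's shape rather than its position in the list keeps this working if a page omits a field, at the cost of one pattern per field.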
0
votes

I think you can pull the text fragments out of the HTML string of y with a regular expression and then regroup them.

import re
html = "..."
print([ele.strip() for ele in re.findall("(?<=>).*?(?=<)",html) if ele.strip() not in ["",","]])

Output

['Licensee:', 'Company, INC.', 'Facility:', 'Joes Shop', 'Status:', 'Licensed',
 'JOE SMITH - Director', 'Phone:', '(555)555-5555',
 'Daytime Hours:', '07:30 AM - 03:30 PM',
 'Nighttime Hours:', 'N/A - N/A', 'Daytime Ages:', '4 YRS Through 5 YRS',
 'Nighttime Ages:', 'N/A', 'Mailing Address:', '1909 CENTRAL PARK',
 'NEW YORK', 'NY', '58756', 'Street Address:', '3996 Rhode Ave', 'Cleveland', 'OH', '58475']
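
One possible way to do the regrouping step is sketched below. It assumes that every label ends with a colon, that the director always appears as 'NAME - Director', and that multi-part values (the addresses) should be joined with commas. The `regroup` helper and those rules are assumptions for this example, not something the page guarantees.

```python
# Flat list of strings, copied from the output above.
parts = [
    'Licensee:', 'Company, INC.', 'Facility:', 'Joes Shop', 'Status:', 'Licensed',
    'JOE SMITH - Director', 'Phone:', '(555)555-5555',
    'Daytime Hours:', '07:30 AM - 03:30 PM',
    'Nighttime Hours:', 'N/A - N/A', 'Daytime Ages:', '4 YRS Through 5 YRS',
    'Nighttime Ages:', 'N/A', 'Mailing Address:', '1909 CENTRAL PARK',
    'NEW YORK', 'NY', '58756', 'Street Address:', '3996 Rhode Ave',
    'Cleveland', 'OH', '58475',
]

def regroup(parts):
    """Collect values under the most recent 'Label:' entry (hypothetical helper)."""
    record = {}
    label = None
    for part in parts:
        if part.endswith(':'):
            # A new label starts a new field.
            label = part[:-1]
            record[label] = []
        elif part.endswith(' - Director'):
            # Assumption: the director line has no label of its own.
            record['Director'] = [part[:-len(' - Director')]]
        elif label is not None:
            record[label].append(part)
    # Join multi-part values such as street, city, state and zip with commas.
    return {key: ', '.join(values) for key, values in record.items()}

record = regroup(parts)
for key, value in record.items():
    print(key, '->', value)
```

With the list above, `record['Mailing Address']` comes out as `'1909 CENTRAL PARK, NEW YORK, NY, 58756'` and `record['Director']` as `'JOE SMITH'`.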