I'm having trouble using Beautiful Soup 4 to extract content stored in span tags from a number of HTML files.

I've used a SoupStrainer and find("dl") to reduce the HTML to the repeated items inside the "dl" tag, and then find_all('span') to get all the spans.

My problem is how to extract the correct value from each span and store it in a variable, and also how to handle the ordering of spans such as

<span class="iconYes">Public</span>
<span class="iconNo">Private</span>

so that I know which services each dentist offers.

My Python 3 code

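from bs4 import BeautifulSoup

# DentistStrainer is a SoupStrainer and fileToProcess a file path, both defined earlier (not shown)
WebText = BeautifulSoup(open(fileToProcess), "html.parser", parse_only=DentistStrainer)
datalist = WebText.find("dl")
# find_all already searches every descendant of the dl, so no outer loop is needed
spans = datalist.find_all('span')
for span in spans:
    print(span)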

Sample Output

<span id="Content_Result_lblDentistName">Dr First Surname</span> 
<span class="lblAddress" id="Content_Result_lblAddress"><strong>Address</strong>: Dental Centre, Street, Town</span> 
<span class="lblAddress" id="Content_Result_lblPhone"><strong>Phone</strong>: 123-1234567</span> 
<span class="lblAddress" id="Content_Result_lblFax"><strong>Fax</strong>: 123-3456789</span> 
<span class="lblAddress" id="Content_Result_lblEmail">[email protected]</span> 
<span class="lblAddress" id="Content_Result_lblWebsiteUrl">www.somewhere.tld</span> 
<span><strong>Services</strong>: </span> 
<span class="iconYes">Private</span> 
<span class="iconYes">Public</span>
<span class="iconNo">Credit Card</span>

I unsuccessfully tried to extract the values using

if span.contains("lblDentistName"):
    DentistName = span.text()
    print("Dentist ",DentistName)

Can any Beautiful Soup users help me?

1 Answer

Use CSS selectors:

dentist_names = soup.select('dl span[id$="lblDentistName"]')
for span in dentist_names:
    print(span.text)  # .text is a property, not a method

$= matches attributes whose value ends with the specified text, so the selector works even though the id carries a "Content_Result_" prefix.
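
As a minimal sketch of pulling individual values into variables this way, assuming markup shaped like the sample output in the question (the html string and the variable names dentist_name and phone are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the sample output in the question.
html = """
<dl>
  <span id="Content_Result_lblDentistName">Dr First Surname</span>
  <span class="lblAddress" id="Content_Result_lblPhone"><strong>Phone</strong>: 123-1234567</span>
  <span class="lblAddress" id="Content_Result_lblWebsiteUrl">www.somewhere.tld</span>
</dl>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one returns the first matching element, or None if nothing matches.
name_span = soup.select_one('dl span[id$="lblDentistName"]')
dentist_name = name_span.get_text(strip=True) if name_span else None

# The phone span contains a "Phone:" label, so keep only the part after the colon.
phone_span = soup.select_one('dl span[id$="lblPhone"]')
phone = None
if phone_span:
    phone = phone_span.get_text(strip=True).split(":", 1)[1].strip()

print("Dentist", dentist_name)
print("Phone", phone)
```

Because each value is looked up by the end of its id, the order of the spans in the file does not matter, and a missing span simply leaves the variable as None.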

CSS selectors can also be used to find all class="icon.." span elements; these are matched in the same order they appear in the tree:

soup.select('dl span[class^="icon"]')

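^= matches when the attribute value starts with the given text. Note that it tests the whole class attribute string, so it will miss a span that has multiple classes with the icon class not listed first.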
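
To know which services are offered regardless of their order, one option is to key the result by the service name and read yes/no from the class. This is only a sketch, assuming the iconYes/iconNo markup from the question; the html string and the services dictionary are illustrative names:

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the service spans in the question.
html = """
<dl>
  <span><strong>Services</strong>: </span>
  <span class="iconYes">Private</span>
  <span class="iconYes">Public</span>
  <span class="iconNo">Credit Card</span>
</dl>
"""

soup = BeautifulSoup(html, "html.parser")

services = {}
for span in soup.select('dl span[class^="icon"]'):
    # Beautiful Soup stores "class" as a list of class names.
    offered = "iconYes" in span.get("class", [])
    services[span.get_text(strip=True)] = offered

print(services)
```

With the sample markup this maps Private and Public to True and Credit Card to False, so a lookup such as services.get("Public") works whatever order the spans appear in.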