I'm using Python 3 and BeautifulSoup 4.4.0 to extract data from a website. The data I want is in the tables inside a div, but to tell which table is which I have to match the text of the preceding h4 tag and then take its next sibling, which is the table. The problem is that one of the h4 tags contains a span, and BeautifulSoup returns None for a tag's string value when there is another tag nested inside it.

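import re

def get_table_items(self, soup, header_title):
    # Match the title as a whole word, ignoring case.
    # find() returns None for the h4 that contains a span.
    header = soup.find('h4', string=re.compile(r'\b{}\b'.format(header_title), re.I))
    header_table = header.find_next_sibling('table')
    items = header_table.find_all('td')
    return items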

The code above works for every h4 except <h4>Unique Title 2<span>(<a href="...">Something</a>)</span></h4>

....
<div id="some_id">
    <h4>Unique Title 1</h4>
    <table>
     ...
    </table>
    <h4>Unique Title 2<span>(<a href="...">Something</a>)</span></h4>
    <table>
    ...
    </table>
    <h4>Unique Title 3</h4>
    <table>
    ...
    </table>
</div>

1 Answer

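The string argument is compared against each tag's .string, and .string is None when a tag has more than one child, as the h4 with the span does. You might need to do the search manually and compare against the tag's full text rather than relying on the regular expression: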

from bs4 import BeautifulSoup

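# html is assumed to hold the page source as a string
soup = BeautifulSoup(html, "html.parser")
header_title = "Unique Title 2"

for h4 in soup.find_all('h4'):
    # .text includes the text of nested tags such as <span>
    if header_title in h4.text:
        ...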
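
To put the loop together with the function from the question, here is a minimal sketch. It assumes the table is always the next table sibling of the h4, as in the sample markup. The HTML string and its cell values (a1, b1, ...) are made up for illustration, the function is written as a plain function rather than a method, and re.escape is my addition so that titles containing regex characters still match literally.

```python
import re
from bs4 import BeautifulSoup

# Sample markup modelled on the question; the cell contents are invented.
HTML = """
<div id="some_id">
    <h4>Unique Title 1</h4>
    <table><tr><td>a1</td><td>a2</td></tr></table>
    <h4>Unique Title 2<span>(<a href="#">Something</a>)</span></h4>
    <table><tr><td>b1</td></tr></table>
    <h4>Unique Title 3</h4>
    <table><tr><td>c1</td></tr></table>
</div>
"""


def get_table_items(soup, header_title):
    """Return the <td> cells of the table that follows the matching <h4>."""
    # Same whole-word, case-insensitive match as in the question.
    pattern = re.compile(r'\b{}\b'.format(re.escape(header_title)), re.I)
    for h4 in soup.find_all('h4'):
        # get_text() joins the text of all descendants, so a nested
        # <span> no longer hides the title.
        if pattern.search(h4.get_text()):
            table = h4.find_next_sibling('table')
            return table.find_all('td') if table else []
    # No header matched.
    return []


soup = BeautifulSoup(HTML, "html.parser")
cells = [td.get_text() for td in get_table_items(soup, "Unique Title 2")]
print(cells)  # ['b1']
```

Returning an empty list when nothing matches avoids the AttributeError you would otherwise get from calling find_next_sibling on None.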