I'm using Python 3 and BeautifulSoup 4.4.0 to extract data from a website. I'm interested in the tables inside a div tag, but to tell what data a table holds I have to match the text of the preceding h4 tag and then get its next sibling, which is the table. The problem is that one of the h4 tags contains a span, and BeautifulSoup returns None for a tag's string value when there is another tag inside it.
def get_table_items(self, soup, header_title):
    header = soup.find('h4', string=re.compile(r'\b{}\b'.format(header_title), re.I))
    header_table = header.find_next_sibling('table')
    items = header_table.find_all('td')
    return items
The code above works for every h4 tag except this one: <h4>Unique Title 2<span>(<a href="...">Something</a>)</span></h4>
....
<div id="some_id">
<h4>Unique Title 1</h4>
<table>
...
</table>
<h4>Unique Title 2<span>(<a href="...">Something</a>)</span></h4>
<table>
...
</table>
<h4>Unique Title 3</h4>
<table>
...
</table>
</div>
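For reference, here is a minimal sketch of the behavior described above, along with one possible workaround: matching against the full text of each h4 via get_text() instead of relying on the string argument. This is only an illustration under assumptions, not a confirmed fix: the table cells ("a", "b", "c") are made-up placeholder data, the function is written as a plain function rather than a method, and re.escape and the None checks are additions that are not in the original code.

```python
import re
from bs4 import BeautifulSoup

# Sample markup modeled on the structure in the question.
# The <td> contents are placeholder data invented for this sketch.
html = """
<div id="some_id">
<h4>Unique Title 1</h4>
<table><tr><td>a</td></tr></table>
<h4>Unique Title 2<span>(<a href="...">Something</a>)</span></h4>
<table><tr><td>b</td></tr></table>
<h4>Unique Title 3</h4>
<table><tr><td>c</td></tr></table>
</div>
"""

def get_table_items(soup, header_title):
    # re.escape is an addition: it guards against regex metacharacters in the title.
    pattern = re.compile(r'\b{}\b'.format(re.escape(header_title)), re.I)

    # Match on get_text(), which concatenates the text of nested tags,
    # rather than on .string, which is None when a tag has several children.
    def is_wanted_header(tag):
        return tag.name == 'h4' and pattern.search(tag.get_text()) is not None

    header = soup.find(is_wanted_header)
    if header is None:
        return []
    header_table = header.find_next_sibling('table')
    if header_table is None:
        return []
    return header_table.find_all('td')

soup = BeautifulSoup(html, 'html.parser')

# The second h4 holds both text and a <span>, so .string is None.
second_h4 = soup.find_all('h4')[1]
print(second_h4.string)

for title in ('Unique Title 1', 'Unique Title 2', 'Unique Title 3'):
    print(title, [td.get_text() for td in get_table_items(soup, title)])
```

Passing a function to find() keeps the lookup in a single call. Note that get_text() on the second header also includes the span's text, so the trailing \b in the pattern matches at the boundary between "2" and "(".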