BeautifulSoup get nearest tag with class, not a sibling and nested in unknown sibling

Question

<h3>
    <span></span>
    <span class='headline'>Headline #1</span>
</h3>
<table class='striped'></table>
<h4>
    <span class='headline'>Headline #2</span>
</h4>
<table class='striped'></table>
<p>
    <span class='headline'>Headline #3</span>
</p>
<ul></ul>
<center>
    <table class='striped'></table>
</center>

This is my structure. I am iterating over the table tags and want to retrieve the text of the span with class 'headline' that is nearest to each table. By "nearest" I mean that if you were to flatten out the HTML, I want the span with class 'headline' that you would come across first when walking backwards from the table.

Sometimes those spans are nested inside an h3, sometimes an h4, sometimes a p tag. Sometimes the table tag is on the same level as the h3/h4/p and sometimes it is itself nested inside a center tag. And sometimes the h3/h4/p tag is an immediate sibling of the table and sometimes it isn't.

How can I use BeautifulSoup to find the nearest span.headline regardless of nesting level, and regardless of whether it is nested inside a parent or a sibling?

So far I've got this code

tables = soup.find_all("table", {"class": "striped"})

for index, table in enumerate(tables):
    # Only works when the headline sits inside an h3
    headline = table.find_previous("h3").find("span", {"class": "headline"}).text

I have modified my question slightly to better represent the actual structure I am dealing with. (vesperknight)

Jonathan · Accepted Answer · 2019-01-25T01:27:36

I used the find_previous method on each table to find the preceding headline in the sample HTML you provided. To check that a headline actually belongs to a given table, I give each table an extra idx attribute and confirm that the first table after the headline carries the same idx. I also added one table at the beginning and one at the end of the HTML, neither of which has a headline of its own.

from bs4 import BeautifulSoup

html = '''
<table class='striped'></table>
<h3>
    <span></span>
    <span class='headline'>Headline #1</span>
</h3>
<table class='striped'></table>
<h4>
    <span class='headline'>Headline #2</span>
</h4>
<table class='striped'></table>
<p>
    <span class='headline'>Headline #3</span>
</p>
<ul></ul>
<center>
    <table class='striped'></table>
</center>
<table class='striped'></table>
'''.replace('\n', '')

soup = BeautifulSoup(html, 'lxml')
table_query = ('table', {'class': 'striped'})
headline_query = ('span', {'class': 'headline'})

for idx, table in enumerate(soup.find_all(*table_query)):
    # Tag each table so it can be recognised later
    table.attrs['idx'] = idx
    previous_headline = table.find_previous(*headline_query)
    # The headline belongs to this table only if this table is the first one after it
    if (previous_headline and
            previous_headline.find_next(*table_query).attrs['idx'] == idx):
        print(previous_headline.text)
    else:
        print('No headline found.')

Output:

No headline found.
Headline #1
Headline #2
Headline #3
No headline found.
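
If you would rather not write an idx attribute into the tree, the same ownership check can probably be done by comparing the Tag objects themselves. The following is only a sketch under a few assumptions: the helper name nearest_headline is made up for illustration, the sample HTML is a shortened version of the one above, and it uses the built-in 'html.parser' instead of 'lxml' so it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Shortened sample: one table before any headline, one directly after a
# headline, one nested in <center>, and a trailing table with no headline.
HTML = """
<table class='striped'></table>
<h3><span></span><span class='headline'>Headline #1</span></h3>
<table class='striped'></table>
<p><span class='headline'>Headline #2</span></p>
<ul></ul>
<center><table class='striped'></table></center>
<table class='striped'></table>
"""


def nearest_headline(table):
    """Hypothetical helper: text of the closest preceding span.headline, or None.

    A headline counts only if this table is the first striped table after it.
    """
    headline = table.find_previous("span", class_="headline")
    if headline is None:
        return None
    # Compare by identity ("is"), so no temporary attribute is needed.
    if headline.find_next("table", class_="striped") is not table:
        return None
    return headline.get_text(strip=True)


soup = BeautifulSoup(HTML, "html.parser")
results = [nearest_headline(t) for t in soup.find_all("table", class_="striped")]
print(results)
```

The identity comparison matters here: Beautiful Soup treats two tags as equal (==) when they represent the same markup, so two empty striped tables would compare equal even though they are different elements. Using "is" avoids that.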
