One way to do this is all in a single filter. See "Kinds of filters" under "Searching the tree" in the docs. If you can write a function that checks whether a tag is a `p` tag with `subclass` in the text, you can use it to find the `p` tags. For example:
>>> soup.find_all(lambda tag: tag.name=='p' and tag.text=='subclass')
[<p><b>subclass</b></p>]
Of course you don't need that function. (It's the same thing as `soup.find_all('p', text='subclass')`.) But it illustrates the idea.
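If you want to try the two forms side by side, here is a minimal self-contained sketch. The markup is made up for illustration, and `html.parser` is just an assumed parser choice. It uses `string=`, which is the newer spelling of the `text=` argument in recent Beautiful Soup releases.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example.
html = "<div><p><b>subclass</b></p><p>other</p></div>"
soup = BeautifulSoup(html, "html.parser")

# A function filter is called once per tag; tags it returns True for are kept.
by_function = soup.find_all(lambda tag: tag.name == "p" and tag.text == "subclass")

# The built-in equivalent: a tag name plus a string match.
by_string = soup.find_all("p", string="subclass")

print(by_function)
print(by_string)
```

Both calls should pick out the same single `p` tag here; the function form only pays off once the test gets more complicated than a name and a string.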
So now, you want to find `table` tags that follow such `p` tags. This is going to get a bit more complicated, so let's write it out-of-line.
First, a quick&dirty solution:
def is_table_after_subclass(tag):
    return (tag.name == 'table' and
            tag.find_previous_sibling('p').text == 'subclass')
But this isn't very robust. You don't want to scan through all the previous siblings, just check the immediate one. Also, if no `p` tag is found, you'll get an exception instead of `False`. So:
import bs4

# This is necessary because the table's previous_sibling is the
# '\n' string between the `p` and the `table`, not the `p`.
def previous_tag(tag):
    tag = tag.previous_sibling
    # Stop at the first real tag, or at None if we run out of siblings.
    while tag is not None and not isinstance(tag, bs4.Tag):
        tag = tag.previous_sibling
    return tag

def is_table_after_subclass(tag):
    if tag.name != 'table':
        return False
    prev = previous_tag(tag)
    # prev is None when the table has no earlier sibling tag at all.
    return prev is not None and prev.name == 'p' and prev.text == 'subclass'
Now, you can do this:
soup.find_all(is_table_after_subclass)
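To see the filter pick out only the right table, here is a self-contained sketch on made-up markup (the `id` values and cell text are invented, and `html.parser` is an assumed parser choice). It is a compact variant rather than the exact code above: instead of a hand-written helper, it calls `find_previous_sibling(True)`, where `True` matches any tag, so the whitespace strings between tags are skipped and `None` comes back when there is no earlier sibling tag.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example.
html = """
<div>
<table id="t0"><tr><td>no previous tag</td></tr></table>
<p><b>superclass</b></p>
<table id="t1"><tr><td>after the wrong p</td></tr></table>
<p><b>subclass</b></p>
<table id="t2"><tr><td>wanted</td></tr></table>
<table id="t3"><tr><td>after a table, not a p</td></tr></table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

def is_table_after_subclass(tag):
    if tag.name != "table":
        return False
    # True matches any tag, so the '\n' strings between tags are skipped;
    # the result is None when there is no earlier sibling tag.
    prev = tag.find_previous_sibling(True)
    return prev is not None and prev.name == "p" and prev.text == "subclass"

matches = soup.find_all(is_table_after_subclass)
print([t["id"] for t in matches])
```

Only the table directly after the `subclass` paragraph should survive; the first table (no previous tag), the one after the other paragraph, and the one after another table are all rejected.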
Another way to do it is to first iterate all the tables, then skip the ones with the wrong previous sibling, or to first iterate all the `subclass` paragraphs, then skip the ones with the wrong next sibling. For example:
def next_tag(tag):
    tag = tag.next_sibling
    # Skip strings such as '\n'; stop at None if there is no next tag.
    while tag is not None and not isinstance(tag, bs4.Tag):
        tag = tag.next_sibling
    return tag

for subclass in soup.find_all('p', text='subclass'):
    tag = next_tag(subclass)
    if tag is not None and tag.name == 'table':
        do_stuff(tag)
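The paragraph-first approach can also be packaged as a generator, which is handy if you want to collect the tables instead of acting on them in the loop. This is only a sketch: `tables_after_subclass` is a hypothetical helper name, the markup is invented, and `find_next_sibling(True)` stands in for the hand-written `next_tag` loop.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example.
html = """
<div>
<p><b>subclass</b></p>
<table id="t1"><tr><td>wanted</td></tr></table>
<p><b>subclass</b></p>
<p>not a table</p>
<table id="t2"><tr><td>skipped</td></tr></table>
<p><b>subclass</b></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

def tables_after_subclass(soup):
    """Yield each table whose nearest preceding sibling tag is a 'subclass' p."""
    for p in soup.find_all(lambda tag: tag.name == "p" and tag.text == "subclass"):
        # True matches any tag; None means the paragraph is the last tag.
        nxt = p.find_next_sibling(True)
        if nxt is not None and nxt.name == "table":
            yield nxt

found = [t["id"] for t in tables_after_subclass(soup)]
print(found)
```

The second `subclass` paragraph is followed by another paragraph and the third one by nothing, so neither contributes a table.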
`<table>...</table>`, rather than a giant table that never closes with three tables nested underneath each other? – abarnert
`//p[descendant-or-self::text()='subclass']/following::table`. Which I'm pretty sure `lxml` can handle. Of course if that looks like incomprehensible magic to you, ignore this and just learn how to do it imperatively with BS4 first; it's a lot simpler. – abarnert