0
votes

I'm new to BeautifulSoup and I'm trying to find the table that follows a particular p tag containing the text "subclass".

Here is an example of the HTML:

<p><b>subclass</b></p>
<table>...<table>
<p><b>frnekr</b></p>
<table>...<table>

I only want to grab the table under the p tag whose text is "subclass". Unfortunately, those p tags don't have classes.

2
I hope that's supposed to be <table>...</table>, rather than a giant table that never closes with three tables nested underneath each other? – abarnert
By the way, unlike BS4's find syntax, XPath can do this all in one go, like //p[descendant-or-self::text()='subclass']/following::table. Which I'm pretty sure lxml can handle. Of course if that looks like incomprehensible magic to you, ignore this and just learn how to do it imperatively with BS4 first; it's a lot simpler. – abarnert

2 Answers

1
votes

One way is to do it all in a single filter.

See the "Kinds of filters" section under "Searching the tree" in the docs.

If you can write a function that checks whether a tag is a p tag with subclass in the text, you can use it to find the p tags. For example:

>>> soup.find_all(lambda tag: tag.name=='p' and tag.text=='subclass')
[<p><b>subclass</b></p>]

Of course you don't need a function for that. (Here it matches the same tags as soup.find_all('p', text='subclass'); the text argument is called string in Beautiful Soup 4.4.0 and later.) But it illustrates the idea.
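The two spellings can be compared side by side. This is a minimal sketch, assuming Beautiful Soup 4.4.0 or later is installed and using the built-in html.parser; the markup is the question's sample with the closing table tags assumed to be </table>.

```python
from bs4 import BeautifulSoup

# The question's sample, with the closing </table> tags assumed.
html = """
<p><b>subclass</b></p>
<table>...</table>
<p><b>frnekr</b></p>
<table>...</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Function filter: called once per tag; tags it returns True for are kept.
by_function = soup.find_all(lambda tag: tag.name == "p" and tag.text == "subclass")

# String filter: compares against tag.string rather than tag.text.
by_string = soup.find_all("p", string="subclass")

print(by_function)
print(by_string)
```

Both calls pick out the same paragraph here because the p tag wraps a single b tag, so its .string and .text agree. With mixed content inside the p tag the two filters can differ.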

So now, you want to find table tags that follow such p tags. This is going to get a bit more complicated, so let's write it out-of-line.

First, a quick-and-dirty solution:

def is_table_after_subclass(tag):
    return (tag.name == 'table' and 
            tag.find_previous_sibling('p').text == 'subclass')

But this isn't very robust. find_previous_sibling('p') scans back through all the previous siblings, when you only want to check the immediately preceding tag. Also, if no p tag is found, it returns None and you'll get an AttributeError instead of False. So:

import bs4

# This is necessary because the table's previous_sibling is the
# '\n' string between the `p` and the `table`, not the `p`.
def previous_tag(tag):
    tag = tag.previous_sibling
    # Stop at None so a tag with no preceding tag doesn't raise.
    while tag is not None and not isinstance(tag, bs4.Tag):
        tag = tag.previous_sibling
    return tag

def is_table_after_subclass(tag):
    if tag.name != 'table':
        return False
    prev = previous_tag(tag)
    return prev is not None and prev.name == 'p' and prev.text == 'subclass'

Now, you can do this:

soup.find_all(is_table_after_subclass)
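For an end-to-end check, here is a sketch that runs the filter on a slightly extended version of the question's sample. The cell contents and the leading table with no heading are hypothetical, added only to exercise the edge case; the helpers are a None-guarded variant of the ones above, and html.parser is assumed as the parser.

```python
import bs4

# Hypothetical sample: the first table has no preceding <p> at all.
html = """
<table><tr><td>no heading</td></tr></table>
<p><b>subclass</b></p>
<table><tr><td>wanted</td></tr></table>
<p><b>frnekr</b></p>
<table><tr><td>unwanted</td></tr></table>
"""

def previous_tag(tag):
    """Return the nearest preceding sibling that is a Tag, or None."""
    sibling = tag.previous_sibling
    while sibling is not None and not isinstance(sibling, bs4.Tag):
        sibling = sibling.previous_sibling
    return sibling

def is_table_after_subclass(tag):
    if tag.name != "table":
        return False
    prev = previous_tag(tag)
    # prev is None for the first table, so it is rejected instead of raising.
    return prev is not None and prev.name == "p" and prev.text == "subclass"

soup = bs4.BeautifulSoup(html, "html.parser")
tables = soup.find_all(is_table_after_subclass)
print(tables)
```

Only the table directly after the "subclass" paragraph should come back; the table with no heading is skipped quietly.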

Another way to do it is to first iterate over all the tables, then skip the ones with the wrong previous sibling. Or to first iterate over all the subclass paragraphs, then skip the ones with the wrong next sibling. For example:

def next_tag(tag):
    tag = tag.next_sibling
    # Stop at None so the last tag in its parent doesn't raise.
    while tag is not None and not isinstance(tag, bs4.Tag):
        tag = tag.next_sibling
    return tag

for subclass in soup.find_all('p', text='subclass'):
    tag = next_tag(subclass)
    if tag is not None and tag.name == 'table':
        do_stuff(tag)
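
The other direction (iterate over the tables, then check what comes before each one) might look like the sketch below. It swaps the hand-written sibling loop for find_previous_sibling(True), relying on Beautiful Soup's documented rule that a True filter matches any tag but no strings. The cell contents are hypothetical and html.parser is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample based on the question's markup.
html = """
<p><b>subclass</b></p>
<table><tr><td>first</td></tr></table>
<p><b>frnekr</b></p>
<table><tr><td>second</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

wanted = []
for table in soup.find_all("table"):
    # True matches any tag but no strings, so this returns the nearest
    # preceding sibling tag (or None), skipping the newline strings.
    prev = table.find_previous_sibling(True)
    if prev is not None and prev.name == "p" and prev.text == "subclass":
        wanted.append(table)

print(wanted)
```

Whether to start from the tables or from the paragraphs is mostly a matter of which side there are fewer of, and which reads more naturally.
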
0
votes

Soupy is my attempt to make queries like this more natural (Soupy wraps BeautifulSoup, to make query chaining easier). Here's one solution:

from soupy import Soupy, Q

text = """
<p><b>subclass</b></p>
<table>...</table>
<p><b>frnekr</b></p>
<table>...</table>
<p><b>subclass</b></p>
<p> No table here </p>
"""
dom = Soupy(text)  # wrap the HTML so it can be queried

(dom.find_all('p', text="subclass")     # find relevant p tags
    .each(Q.find_next_sibling('table')) # look for sibling tables
    .filter(Q)                          # drop failed searches
    .val())                             # dump out of Soupy

which produces:

[<table>...</table>]

This is roughly equivalent to @abarnert's last code example.