2
votes

I want to select all <div> elements whose class is either "post has-profile bg2" or "post has-profile bg1", but not the last one, i.e. the <div> with class "panel bg1".

<div id="6" class="post has-profile bg2"> some text 1 </div>
<div id="7" class="post has-profile bg1"> some text 2 </div>
<div id="8" class="post has-profile bg2"> some text 3 </div>
<div id="9" class="post has-profile bg1"> some text 4 </div>

<div class="panel bg1" id="abc"> ... </div>

select() matches only a single occurrence. I tried find_all() instead, but bs4 does not find the elements.

if soup.find(class_ = re.compile(r"post has-profile [bg1|bg2]")):
    posts = soup.find_all(class_ = re.compile(r"post has-profile [bg1|bg2]"))

How can I solve this with and without a regex? Thanks.


3 Answers

2
votes

You can use the built-in CSS selector support in BeautifulSoup:

data = """<div id="6" class="post has-profile bg2"> some text 1 </div>
<div id="7" class="post has-profile bg1"> some text 2 </div>
<div id="8" class="post has-profile bg2"> some text 3 </div>
<div id="9" class="post has-profile bg1"> some text 4 </div>
<div class="panel bg1" id="abc"> ... </div>"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')

divs = soup.select('div.post.has-profile.bg2, div.post.has-profile.bg1')

for div in divs:
    print(div)
    print('-' * 80)

Prints:

<div class="post has-profile bg2" id="6"> some text 1 </div>
--------------------------------------------------------------------------------
<div class="post has-profile bg2" id="8"> some text 3 </div>
--------------------------------------------------------------------------------
<div class="post has-profile bg1" id="7"> some text 2 </div>
--------------------------------------------------------------------------------
<div class="post has-profile bg1" id="9"> some text 4 </div>
--------------------------------------------------------------------------------

The 'div.post.has-profile.bg2, div.post.has-profile.bg1' selector selects all <div> tags with class "post has-profile bg2" and all <div> tags with class "post has-profile bg1". Because the "panel bg1" <div> has neither the "post" nor the "has-profile" class, it is not matched.
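
The same selection can probably be written more compactly with the :is() pseudo-class. This is only a sketch: it assumes a BeautifulSoup release that delegates select() to soupsieve (4.7 or later), and it uses html.parser so that lxml is not required. With such a release, matches should come back in document order, so the order may differ from the output shown above.

```python
from bs4 import BeautifulSoup

html = """<div id="6" class="post has-profile bg2"> some text 1 </div>
<div id="7" class="post has-profile bg1"> some text 2 </div>
<div id="8" class="post has-profile bg2"> some text 3 </div>
<div id="9" class="post has-profile bg1"> some text 4 </div>
<div class="panel bg1" id="abc"> ... </div>"""

soup = BeautifulSoup(html, "html.parser")

# Require "post" and "has-profile", plus either "bg1" or "bg2".
# :is() is assumed to be available (soupsieve-backed select()).
divs = soup.select("div.post.has-profile:is(.bg1, .bg2)")

ids = [div["id"] for div in divs]
print(ids)
```

The class order inside the selector does not matter, so this keeps working if the markup lists the classes in a different order.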

1
votes

You can define a function that describes the tags of interest:

def test_tag(tag):
    return tag.name=='div' \
       and tag.has_attr('class') \
       and "post" in tag['class'] \
       and "has-profile" in tag['class'] \
       and ("bg1" in tag['class'] or "bg2" in tag['class']) \
       and "panel" not in tag['class']

And apply the function to the "soup":

soup.find_all(test_tag)

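
As a possible variation on the function above, the class checks can be expressed with sets, which keeps the "all of these" and "any of these" rules separate. The names REQUIRED, ANY_OF and is_profile_post are made up for this sketch and are not part of bs4.

```python
from bs4 import BeautifulSoup

html = """<div id="6" class="post has-profile bg2"> some text 1 </div>
<div id="7" class="post has-profile bg1"> some text 2 </div>
<div id="8" class="post has-profile bg2"> some text 3 </div>
<div id="9" class="post has-profile bg1"> some text 4 </div>
<div class="panel bg1" id="abc"> ... </div>"""

REQUIRED = {"post", "has-profile"}  # hypothetical name: all must be present
ANY_OF = {"bg1", "bg2"}             # hypothetical name: at least one must be present


def is_profile_post(tag):
    """Return True for <div> tags carrying the wanted class combination."""
    if tag.name != "div":
        return False
    # bs4 exposes the multi-valued class attribute as a list of strings
    classes = set(tag.get("class", []))
    return REQUIRED <= classes and bool(ANY_OF & classes)


soup = BeautifulSoup(html, "html.parser")
matched_ids = [tag["id"] for tag in soup.find_all(is_profile_post)]
print(matched_ids)  # ['6', '7', '8', '9']
```

The "panel bg1" tag is rejected because it lacks the required classes, so no explicit check for "panel" is needed here.
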
0
votes

Using Regex.

Try:

from bs4 import BeautifulSoup
import re
s = """<div id="6" class="post has-profile bg2"> some text 1 </div>
<div id="7" class="post has-profile bg1"> some text 2 </div>
<div id="8" class="post has-profile bg2"> some text 3 </div>
<div id="9" class="post has-profile bg1"> some text 4 </div>

<div class="panel bg1" id="abc"> ... </div>"""

soup = BeautifulSoup(s, "html.parser")
for i in soup.find_all("div", class_=re.compile(r"post has-profile bg(1|2)")):
    print(i)

Output:

<div class="post has-profile bg2" id="6"> some text 1 </div>
<div class="post has-profile bg1" id="7"> some text 2 </div>
<div class="post has-profile bg2" id="8"> some text 3 </div>
<div class="post has-profile bg1" id="9"> some text 4 </div>
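
One thing worth noting about the pattern in the question: square brackets define a character class, not alternation, so [bg1|bg2] matches exactly one character out of b, g, 1, | and 2. The sketch below uses only the re module to show the difference; the variable names are chosen for illustration.

```python
import re

# Pattern from the question: [...] is a character class (one character).
char_class = re.compile(r"post has-profile [bg1|bg2]")
# Pattern from this answer: (...) with | is real alternation.
grouped = re.compile(r"post has-profile bg(1|2)")
# Equivalent and slightly shorter: a character class for the digit only.
compact = re.compile(r"post has-profile bg[12]")

target = "post has-profile bg1"

print(char_class.search(target).group())  # post has-profile b
print(grouped.search(target).group())     # post has-profile bg1
print(compact.search(target).group())     # post has-profile bg1

# The character-class version also accepts strings it should reject.
print(bool(char_class.search("post has-profile gx")))  # True
print(bool(grouped.search("post has-profile gx")))     # False
```

Keep in mind that a regex over the whole class string depends on the classes appearing in exactly this order in the markup.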