5
votes

I have two different sets of div tags in HTML:

<div class="ABC BCD CDE123">

<div class="ABC BCD CDE234">

<div class="ABC BCD CDE345">

and

<div class="ABC XYZ BCD">

I want to select all the tags that have both the ABC and BCD classes but not the XYZ class, using BeautifulSoup4.

I already know about this approach:

soup.find_all('div', class_=['ABC','BCD'])

Which searches as OR (so ABC or BCD must be present).

I also know about that approach here:

def myfunction(theclass):
    return theclass is not None and len(theclass) == 5
soup.find_all('div', class_=myfunction)

Which returns all divs that have a class name exactly 5 characters long.

I then tried to solve my problem with this:

soup.find_all('div', class_ = lambda x: x and 'ABC' and 'BCD' in x.split() and x and 'XYZ' not in x.split())

But this was not working. So I tried to debug it with this approach:

def myfunction(theclass):
    print(theclass)
    return True
soup.find_all('div', class_=myfunction)

The problem seems to be that, for a tag like this:

<div class="ABC BCD CDE123">

only 'ABC' is handed over to myfunction, so theclass = 'ABC' and not theclass = 'ABC BCD CDE123', which is what I would have expected. I guess that is also why the lambda function fails.

Any clue how I can filter the tags according to my requirement?

I want to select all the tags that have both the ABC and BCD classes but not the XYZ class, using BeautifulSoup4.

3
I can of course solve this problem myself with some ugly multi-line hack, but I'm looking for a nice and clean solution. - stoney
Lxml + cssselect lets you do: .ABC.BCD:not(.XYZ) - I'm not sure about BS4 though. - pguardiario

3 Answers

2
votes

This can be done with Python sets. Collect all the divs that have both ABC and BCD and wrap the result in a set. Do the same for the divs that have XYZ. You now have two sets, one for ABC and BCD and one for XYZ. Subtract the second from the first.

To require both ABC and BCD, use the select function instead of find_all:

from bs4 import BeautifulSoup

data = '''
<div class="ABC BCD CDE123"></div>
<div class="ABC BCD CDE234"></div>
<div class="ABC BCD CDE345"></div>
<div class="ABC XYZ BCD"></div>
<div class="ABC XYZ AAC"></div>
<div class="ABC AAC"></div>
'''

soup = BeautifulSoup(data, 'html.parser')
ABC_BCD = set(soup.select('div.ABC.BCD'))  # divs with both ABC and BCD
XYZ     = set(soup.select('div.XYZ'))      # divs with XYZ
result = ABC_BCD - XYZ
for element in result:
    print(element)

Output (sets are unordered, so the order may differ):

<div class="ABC BCD CDE234"></div>
<div class="ABC BCD CDE123"></div>
<div class="ABC BCD CDE345"></div>

The same code using find_all, which matches ABC or BCD:

ABC_BCD = set(soup.find_all('div', class_=['ABC','BCD']))
XYZ     = set(soup.find_all('div', class_=['XYZ']))
result = ABC_BCD - XYZ
for element in result:
    print(element)

Output:

<div class="ABC BCD CDE234"></div>
<div class="ABC AAC"></div>  # this is the one we don't want
<div class="ABC BCD CDE123"></div>
<div class="ABC BCD CDE345"></div>
2
votes

Your approach was correct, but you missed one thing: BeautifulSoup converts the value of the class attribute into a list.

For example:

>>> soup.div['class']
['ABC', 'BCD', 'CDE123']

Instead of using x.split(), directly check whether the value is in the list or not.

Code:

from bs4 import BeautifulSoup

html = '''
<div class="ABC BCD CDE123"></div>
<div class="ABC BCD CDE234"></div>
<div class="ABC BCD CDE345"></div>
<div class="ABC XYZ BCD"></div>'''
soup = BeautifulSoup(html, 'html.parser')
# "c and" guards against divs that have no class attribute at all (c is None)
print(soup.find_all('div', class_=lambda c: c and 'ABC' in c and 'BCD' in c and 'XYZ' not in c))

Output:

[<div class="ABC BCD CDE123"></div>,
 <div class="ABC BCD CDE234"></div>,
 <div class="ABC BCD CDE345"></div>]
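
A stricter variant, offered only as a sketch: as far as I can tell, a function passed as class_ receives strings (each class on its own, then the whole attribute value), so `in` acts as a substring test there and a class such as ABCD would also count as containing ABC. Passing a function that takes the whole tag lets you compare exact class names instead. The helper name wanted and the extra test divs (the ABCD one and the one without a class) are made up for illustration.

```python
from bs4 import BeautifulSoup

html = '''
<div class="ABC BCD CDE123"></div>
<div class="ABC XYZ BCD"></div>
<div class="ABCD BCDE"></div>
<div></div>
'''

def wanted(tag):
    # Hypothetical helper: exact class-name matching on the whole tag.
    # tag.get('class', []) falls back to an empty list when there is no class attribute.
    classes = set(tag.get('class', []))
    return tag.name == 'div' and {'ABC', 'BCD'} <= classes and 'XYZ' not in classes

soup = BeautifulSoup(html, 'html.parser')
matches = soup.find_all(wanted)  # find_all calls wanted() once per tag
print(matches)
```

Here only the first div should be kept: the second has XYZ, the third merely contains ABC and BCD as substrings, and the fourth has no classes.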
1
votes

I don't know about a one-step solution but you can use CSS select and then filter out the elements you don't want.

from bs4 import BeautifulSoup

html = '''
<div class="ABC BCD CDE123"></div>
<div class="ABC BCD CDE234"></div>
<div class="ABC BCD CDE345"></div>
<div class="ABC XYZ BCD"></div>
<div class="ABC XYZ AAC"></div>
<div class="ABC AAC"></div>
'''

soup = BeautifulSoup(html, "html.parser")
divs = soup.select('div.ABC.BCD')
result = [div for div in divs if "XYZ" not in div['class']]
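
If your Beautiful Soup is 4.7 or newer, select() is handled by the soupsieve package, which as far as I know understands the :not() pseudo-class, so the selector from the comment under the question may work in one step. A sketch under that assumption (I have not checked older versions):

```python
from bs4 import BeautifulSoup

html = '''
<div class="ABC BCD CDE123"></div>
<div class="ABC BCD CDE234"></div>
<div class="ABC BCD CDE345"></div>
<div class="ABC XYZ BCD"></div>
<div class="ABC XYZ AAC"></div>
<div class="ABC AAC"></div>
'''

soup = BeautifulSoup(html, "html.parser")
# Assumes bs4 >= 4.7 with soupsieve installed: both classes required, XYZ excluded
result = soup.select('div.ABC.BCD:not(.XYZ)')
for div in result:
    print(div)
```

Unlike the set-based approach, select() returns a list in document order.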