BeautifulSoup: find Class names: AND + NOT

Question

I have two different sets of div tags in my HTML:

<div class="ABC BCD CDE123">

<div class="ABC BCD CDE234">

<div class="ABC BCD CDE345">

and

<div class="ABC XYZ BCD">

I want to select all the tags that have both ABC and BCD, but not the XYZ class, using BeautifulSoup4.

I already know about this approach:

soup.find_all('div', class_=['ABC','BCD'])

Which searches as OR (so ABC or BCD must be present).

I also know about this approach:

def myfunction(theclass):
    return theclass is not None and len(theclass) == 5
soup.find_all('div', class_=myfunction)

This returns all divs that have a class name of length 5.

I then tried to solve my problem with this:

soup.find_all('div', class_ = lambda x: x and 'ABC' and 'BCD' in x.split() and x and 'XYZ' not in x.split())

But this was not working. So I tried to debug it with this approach:

def myfunction(theclass):
    print(theclass)
    return True
soup.find_all('div', class_=myfunction)

The problem seems to be that, for a tag like this:

<div class="ABC BCD CDE123">

Only 'ABC' is handed over to myfunction, so theclass = 'ABC' rather than theclass = 'ABC BCD CDE123', which is what I would have expected. I guess that is also why the lambda function fails.
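The behaviour described here can be reproduced with a small sketch. It assumes a Beautiful Soup 4 release in which a class_ function is called once per individual class token, and with the full attribute string only if no token matched; the name record and the explicit "html.parser" argument are illustrative and not part of the original code. Because the filter rejects every value, all candidates that Beautiful Soup tries end up in the list:

```python
from bs4 import BeautifulSoup

html = '<div class="ABC BCD CDE123"></div>'
soup = BeautifulSoup(html, "html.parser")

seen = []  # every value Beautiful Soup hands to the filter


def record(value):
    """Hypothetical debugging filter: log the value, then reject it."""
    seen.append(value)
    return False  # rejecting keeps Beautiful Soup trying further candidates


matches = soup.find_all("div", class_=record)
print(seen)     # the individual values that were passed in
print(matches)  # nothing matches, since the filter always returns False
```

If the filter returned True instead, the search would stop at the first token, which is consistent with only 'ABC' being printed in the debugging attempt.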

Any clue how I can filter the tags according to my requirement:

I want to select all the tags that have both ABC and BCD, but not the XYZ class, using BeautifulSoup4.

I can of course solve this problem myself with some ugly multi line hack, but I'm looking here for a nice and clean solution. — stoney
Lxml + cssselect lets you do: .ABC.BCD:not(.XYZ) - I'm not sure about BS4 though. — pguardiario

Saurabh Pandey · Accepted Answer · 2018-07-05T11:42:49

This can be done with sets. Collect every element that has both ABC and BCD and wrap the result in a Python set, then do the same for XYZ. You now have two sets, one for ABC and BCD and one for XYZ. Subtracting the second from the first leaves the elements you want.

To require both ABC and BCD in the search, use the select function instead of find_all:

from bs4 import BeautifulSoup

data = '''
<div class="ABC BCD CDE123"></div>
<div class="ABC BCD CDE234"></div>
<div class="ABC BCD CDE345"></div>
<div class="ABC XYZ BCD"></div>
<div class="ABC XYZ AAC"></div>
<div class="ABC AAC"></div>
'''

soup = BeautifulSoup(data, 'html.parser')
# Elements carrying both ABC and BCD
ABC_BCD = set(soup.select('div.ABC.BCD'))
# Elements carrying XYZ
XYZ     = set(soup.select('div.XYZ'))
result = ABC_BCD - XYZ
for element in result:
    print(element)

output

<div class="ABC BCD CDE234"></div>
<div class="ABC BCD CDE123"></div>
<div class="ABC BCD CDE345"></div>
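As the comment on the question suggests for lxml, the exclusion can probably be folded into the selector itself. This is a minimal sketch, assuming Beautiful Soup 4.7.0 or later, where select() is backed by the Soup Sieve package and accepts the :not() pseudo-class; older versions may reject this selector. The explicit "html.parser" argument is an assumption added for the sketch.

```python
from bs4 import BeautifulSoup

data = '''
<div class="ABC BCD CDE123"></div>
<div class="ABC BCD CDE234"></div>
<div class="ABC BCD CDE345"></div>
<div class="ABC XYZ BCD"></div>
<div class="ABC XYZ AAC"></div>
<div class="ABC AAC"></div>
'''

soup = BeautifulSoup(data, "html.parser")

# AND is expressed by chaining classes, NOT by the :not() pseudo-class.
result = soup.select("div.ABC.BCD:not(.XYZ)")
for element in result:
    print(element)
```

Unlike the set subtraction, select() returns a list in document order, so no ordering is lost.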

The same code using find_all:

ABC_BCD = set(soup.find_all('div', class_=['ABC','BCD']))
XYZ     = set(soup.find_all('div', class_=['XYZ']))
result = ABC_BCD - XYZ
for element in result:
    print(element)

The output is:

<div class="ABC BCD CDE234"></div>
<div class="ABC AAC"></div> # This is what we don't need
<div class="ABC BCD CDE123"></div>
<div class="ABC BCD CDE345"></div>
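For comparison, the AND + NOT condition can also be written as a single find_all call by passing a function that receives the whole tag rather than one class value at a time. This is a different technique from the set subtraction in this answer, sketched under the assumption that the class attribute is parsed as a list (the default for HTML parsers in Beautiful Soup 4); the function name abc_bcd_not_xyz is hypothetical.

```python
from bs4 import BeautifulSoup

data = '''
<div class="ABC BCD CDE123"></div>
<div class="ABC BCD CDE234"></div>
<div class="ABC BCD CDE345"></div>
<div class="ABC XYZ BCD"></div>
<div class="ABC XYZ AAC"></div>
<div class="ABC AAC"></div>
'''

soup = BeautifulSoup(data, "html.parser")


def abc_bcd_not_xyz(tag):
    """Hypothetical filter: a div with ABC and BCD but without XYZ."""
    classes = set(tag.get("class", []))  # empty set when no class attribute
    return (
        tag.name == "div"
        and {"ABC", "BCD"} <= classes  # both required classes present
        and "XYZ" not in classes       # excluded class absent
    )


result = soup.find_all(abc_bcd_not_xyz)
for element in result:
    print(element)
```

Because the function sees the full list of classes, it avoids the per-class matching that makes class_=['ABC','BCD'] behave as OR.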
