I have two different sets of div tags in HTML:
<div class="ABC BCD CDE123">
<div class="ABC BCD CDE234">
<div class="ABC BCD CDE345">
and
<div class="ABC XYZ BCD">
I want to select all the tags that have both the ABC and BCD classes, but not the XYZ class, using BeautifulSoup4.
I already know about this approach:
soup.find_all('div', class_=['ABC','BCD'])
This searches with OR semantics (so ABC or BCD must be present).
I also know about this approach:
def myfunction(theclass):
    return theclass is not None and len(theclass) == 5
soup.find_all('div', class_=myfunction)
This returns all divs that have a class name of length 5.
I then tried to solve my problem with this:
soup.find_all('div', class_ = lambda x: x and 'ABC' and 'BCD' in x.split() and x and 'XYZ' not in x.split())
But this did not work, so I tried to debug it with this approach:
def myfunction(theclass):
    print(theclass)
    return True
soup.find_all('div', class_=myfunction)
The problem seems to be that, for a tag like this:
<div class="ABC BCD CDE123">
only 'ABC' is handed over to myfunction, so theclass = 'ABC' and not theclass = 'ABC BCD CDE123', which is what I would have expected.
I guess that is also why the lambda function fails.
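Separately from what BeautifulSoup hands to the function, the condition in the lambda may not test what was intended: in Python, `'ABC' and 'BCD' in x.split()` is evaluated as `'ABC' and ('BCD' in x.split())`, and the non-empty string `'ABC'` is always truthy, so the presence of ABC is never checked. A minimal pure-Python sketch of this, with no BeautifulSoup involved (the names `original_check` and `explicit_check` are made up for illustration):

```python
def original_check(x):
    # Same condition as the lambda above, copied unchanged.
    return x and 'ABC' and 'BCD' in x.split() and x and 'XYZ' not in x.split()


def explicit_check(x):
    # Each class is tested on its own against the split class string.
    classes = x.split() if x else []
    return 'ABC' in classes and 'BCD' in classes and 'XYZ' not in classes


# 'ABC' is missing here, yet the original condition is still truthy.
print(bool(original_check("BCD CDE123")))   # True
print(explicit_check("BCD CDE123"))         # False
print(explicit_check("ABC BCD CDE123"))     # True
print(explicit_check("ABC XYZ BCD"))        # False
```

Note that `explicit_check` is only meaningful when it receives the full class string, which, as described above, is not what the `class_` filter passed in here.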
Any idea how I can filter the tags according to my requirement?
I want to select all the tags that have both the ABC and BCD classes, but not the XYZ class, using BeautifulSoup4.
.ABC.BCD:not(.XYZ)
- I'm not sure about BS4 though. – pguardiario
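A minimal sketch of two ways this could be done, assuming a BeautifulSoup 4 release whose `select()` supports the `:not()` pseudo-class (recent releases delegate CSS selectors to the soupsieve package). The sample text inside the divs and the function name `wanted` are assumptions added for illustration. The first variant passes a function as the first argument to `find_all`, so it receives the whole tag and can inspect the complete class list; the second uses the selector from the comment above.

```python
from bs4 import BeautifulSoup

html = """
<div class="ABC BCD CDE123">one</div>
<div class="ABC BCD CDE234">two</div>
<div class="ABC BCD CDE345">three</div>
<div class="ABC XYZ BCD">four</div>
"""

soup = BeautifulSoup(html, "html.parser")


def wanted(tag):
    # A function given as the first argument receives the Tag itself,
    # so tag.get("class") yields the full list of classes at once.
    classes = tag.get("class", [])
    return (
        tag.name == "div"
        and "ABC" in classes
        and "BCD" in classes
        and "XYZ" not in classes
    )


by_function = soup.find_all(wanted)

# CSS selector variant; relies on select() supporting :not().
by_selector = soup.select("div.ABC.BCD:not(.XYZ)")

print([d.get_text() for d in by_function])   # ['one', 'two', 'three']
print([d.get_text() for d in by_selector])   # ['one', 'two', 'three']
```

The function variant does not depend on CSS selector support, so it may be the safer choice on older installations.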