The original BeautifulSoup object looks like this:

<p style="padding-left: 140pt;text-indent: 0pt;line-height: 13pt;text-align: center;">blahblah</p>
<ul>
    <li style="padding-left: 11pt;text-indent: 0pt;line-height: 14pt;text-align: left;">
        <p style="display: inline;">blahblah</p>
    </li>
    <li style="padding-left: 11pt;text-indent: 0pt;line-height: 14pt;text-align: left;">
         <p style="text-indent: 0pt;text-align: center;">blahblah</p>
    </li>
</ul>

The first step is to remove every tag whose style attribute contains text-align: center:

<ul>
    <li style="padding-left: 11pt;text-indent: 0pt;line-height: 14pt;text-align: left;">
        <p style="display: inline;">blahblah</p>
    </li>
    <li style="padding-left: 11pt;text-indent: 0pt;line-height: 14pt;text-align: left;">
    </li>
</ul>

The second step is to remove all style attributes:

<ul>
    <li>
        <p>blahblah</p>
    </li>
    <li>
    </li>
</ul>

This example may look contrived, but the underlying question is general: finding a tag (or tags) in a BeautifulSoup object is easy, but is there an equally easy way to modify the object itself? If I know the position of a tag, I can remove it easily. For example, to remove the second <li> tag, I can use soup.ul.li to get the first <li> tag, move to the second one with .next_sibling, and call .decompose() to remove it from the tree. But if I don't know the position of the tags I want to remove and only know the criteria they should meet, there seems to be no way to locate them and then modify the BeautifulSoup object.
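
For reference, the positional approach described above can be sketched roughly as follows. This is only a minimal sketch assuming the html.parser backend and a shortened, made-up snippet. One caveat: with indented markup like this, .next_sibling on the first <li> returns the whitespace text node between the tags, so the sketch uses find_next_sibling("li") instead.

```python
from bs4 import BeautifulSoup

# Shortened, made-up markup for illustration only.
html = """<ul>
    <li><p>first</p></li>
    <li><p>second</p></li>
</ul>"""

soup = BeautifulSoup(html, "html.parser")

first_li = soup.ul.li
# .next_sibling here would be the whitespace text node between the two <li> tags,
# so ask explicitly for the next sibling that is an <li> tag.
second_li = first_li.find_next_sibling("li")
second_li.decompose()

remaining = [li.get_text(strip=True) for li in soup.find_all("li")]
print(remaining)
```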

1 Answer

You can use the re module to match text-align: center in the style attribute and remove the matching tags. Then you can delete the style attribute from every remaining tag that still has one.

Code:

from bs4 import BeautifulSoup as soup
import re

html = """<p style="padding-left: 140pt;text-indent: 0pt;line-height: 13pt;text-align: center;">blahblah</p>
<ul>
    <li style="padding-left: 11pt;text-indent: 0pt;line-height: 14pt;text-align: left;">
        <p style="display: inline;">blahblah</p>
    </li>
    <li style="padding-left: 11pt;text-indent: 0pt;line-height: 14pt;text-align: left;">
         <p style="text-indent: 0pt;text-align: center;">blahblah</p>
    </li>
</ul>"""

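page = soup(html, 'html.parser')

# Step 1: remove every tag whose style attribute sets text-align to center.
# \s* tolerates "text-align:center" as well as "text-align: center".
style_center = page.find_all(style=re.compile(r'text-align:\s*center'))
for tag in style_center:
    tag.decompose()

# Step 2: drop the style attribute from every remaining tag that has one.
for tag in page.find_all(style=True):
    del tag['style']

print(page)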

OUTPUT:

<ul>
<li>
<p>blahblah</p>
</li>
<li>

</li>
</ul>
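
If the inline styles are not written consistently (different case, extra spaces around the colon), a regular expression on the raw attribute string can miss some tags. As a possible alternative, find_all also accepts a function as a filter. Below is a minimal sketch of that idea; parse_style, is_centered and strip_centered_and_styles are hypothetical helper names made up for this example, not part of BeautifulSoup, and the parsing is deliberately naive (it does not handle values such as "center !important").

```python
from bs4 import BeautifulSoup


def parse_style(style):
    """Turn 'a: b; c: d' into {'a': 'b', 'c': 'd'} (lower-cased, whitespace-trimmed)."""
    declarations = {}
    for part in style.split(";"):
        if ":" in part:
            name, value = part.split(":", 1)
            declarations[name.strip().lower()] = value.strip().lower()
    return declarations


def is_centered(tag):
    """Filter for find_all: True for tags whose inline style sets text-align to center."""
    style = tag.get("style")
    return bool(style) and parse_style(style).get("text-align") == "center"


def strip_centered_and_styles(html):
    """Hypothetical helper: remove centered tags, then remove all style attributes."""
    soup = BeautifulSoup(html, "html.parser")
    # Step 1: remove tags matched by the function filter.
    for tag in soup.find_all(is_centered):
        tag.decompose()
    # Step 2: style=True matches any remaining tag that has a style attribute.
    for tag in soup.find_all(style=True):
        del tag["style"]
    return str(soup)


sample = (
    '<div><p style="text-align:center">a</p>'
    '<p style="color: red; Text-Align: left">b</p>'
    '<p style="TEXT-ALIGN : CENTER;">c</p></div>'
)
print(strip_centered_and_styles(sample))  # <div><p>b</p></div>
```

The function filter costs a few more lines than the regex, but the matching criteria live in ordinary Python, so it is easy to extend to other style properties.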