There are various ways to split a BeautifulSoup parse tree into a list of its elements or into the strings of its tags. But there seems to be no way to split it while keeping the tree structure intact.

I want to split the following snippet (soup) on the <br />s. That is trivial with strings, but I want to keep the structure: I want a list of parse trees.

from bs4 import BeautifulSoup

s = """<p>
foo<br />
<a href="http://...html" target="_blank">foo</a> | bar<br />
<a href="http://...html" target="_blank">foo</a> | bar<br />
<a href="http://...html" target="_blank">foo</a> | bar<br />
<a href="http://...html" target="_blank">foo</a> | bar
</p>"""
soup = BeautifulSoup(s)

I could, obviously, do a [BeautifulSoup(i) for i in str(soup).split('<br />')], but that's ugly and I have way too many links for that.

Iterating with soup.next and soup.previousSibling() on soup.findAll('br') is possible, but it returns only the individual elements, not a parse tree.

Is there a way to extract a full subtree of tags from a BeautifulSoup tag while keeping all parent and sibling relations?

edit for more clarity:

The result should be a list of BeautifulSoup objects, so that I can traverse the split soup further down with output[0].a, output[1].text and so on. Splitting the soup on the <br />s would return a list of all links to process further, which is what I need: all links from the snippet above, with text, attributes and the following "bar", which is a description of each link.

Sorry, what output do you expect exactly? A tree without <br/> tags in it? Should the <p> tag still have a parent (if there was one before)? What exactly are you trying to achieve? - Martijn Pieters
You don't have to remove the <br/> tags at all to achieve your goal. - Martijn Pieters

1 Answer

If you don't mind that the original tree is changed, I'd use .extract() on the <br /> tags to simply remove them from the tree:

>>> for br in soup.find_all('br'): br.extract()
... 
<br/>
<br/>
<br/>
<br/>
>>> soup
<html><body><p>
foo
<a href="http://...html" target="_blank">foo</a> | bar
<a href="http://...html" target="_blank">foo</a> | bar
<a href="http://...html" target="_blank">foo</a> | bar
<a href="http://...html" target="_blank">foo</a> | bar
</p></body></html>

This is a full working tree still:

>>> soup.p
<p>
foo
<a href="http://...html" target="_blank">foo</a> | bar
<a href="http://...html" target="_blank">foo</a> | bar
<a href="http://...html" target="_blank">foo</a> | bar
<a href="http://...html" target="_blank">foo</a> | bar
</p>
>>> soup.p.a
<a href="http://...html" target="_blank">foo</a>
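
If the original tree does need to stay untouched, one option is to remove the tags from a second, independent parse instead. The following is only a minimal sketch: the names `html`, `original` and `working` are made up for illustration, the markup is shortened, and it assumes the built-in 'html.parser' backend rather than whichever parser produced the output above.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical sample markup in the shape of the question's snippet.
html = '<p>foo<br/><a href="http://...html" target="_blank">foo</a> | bar<br/></p>'

original = BeautifulSoup(html, "html.parser")

# Re-parsing the serialized document yields a tree that shares no nodes
# with `original`, so edits to `working` cannot affect it.
working = BeautifulSoup(str(original), "html.parser")

for br in working.find_all("br"):
    br.extract()  # detach the <br> from the working copy only

print(len(original.find_all("br")), len(working.find_all("br")))
```

The price is a second parse of the document, which may matter for very large inputs.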

But you do not need to remove those tags at all to achieve what you want:

for link in soup.find_all('a'):
    print(link['href'], ''.join(link.stripped_strings), link.next_sibling)

results in:

>>> for link in soup.find_all('a'):
...     print(link['href'], ''.join(link.stripped_strings), link.next_sibling)
... 
http://...html foo  | bar
http://...html foo  | bar
http://...html foo  | bar
http://...html foo  | bar

regardless of whether or not we removed the <br/> tags from the tree first.
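
If you really do want a list of separate trees, one per <br />-delimited chunk, a rough sketch could group the direct children of the <p> tag and re-parse each group. The helper names `split_on_br` and `as_soups` are hypothetical, not part of BeautifulSoup, and the sketch assumes the built-in 'html.parser' backend and a shortened version of the question's markup.

```python
from bs4 import BeautifulSoup, Tag

# Shortened sample in the shape of the question's snippet.
html = """<p>
foo<br />
<a href="http://...html" target="_blank">foo</a> | bar<br />
<a href="http://...html" target="_blank">foo</a> | bar
</p>"""


def split_on_br(parent):
    """Group the direct children of `parent` into runs separated by <br> tags."""
    groups = [[]]
    for node in parent.children:
        if isinstance(node, Tag) and node.name == "br":
            groups.append([])  # a <br> ends the current run and starts a new one
        else:
            groups[-1].append(node)
    return groups


def as_soups(groups):
    """Re-parse each run so that every piece becomes its own small tree."""
    return [
        BeautifulSoup("".join(str(node) for node in group), "html.parser")
        for group in groups
    ]


soup = BeautifulSoup(html, "html.parser")
pieces = as_soups(split_on_br(soup.p))

for piece in pieces:
    link = piece.a  # None for a chunk without a link, such as the leading "foo"
    print(link["href"] if link else None, piece.get_text(" ", strip=True))
```

Each piece is a copy, so it no longer has the original <p> as its parent. Because the chunks are rebuilt from strings, text containing a literal < or & would need escaping before re-parsing; this sketch does not handle that.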