7
votes

Say that my html looks like this:

<td>Potato1 <span somestuff...>Potato2</span></td>
...
<td>Potato9 <span somestuff...>Potato10</span></td>

I have BeautifulSoup doing this:

for tag in soup.find_all("td"):
    print(tag.text)

And I get

Potato1 Potato2
....
Potato9 Potato10

Is it possible to get only the text that sits directly inside the td tag, without the text nested inside the span tag?

2 Answers

9
votes

You can use .contents:

>>> for tag in soup.find_all("td"):
...     print(tag.contents[0])
...
Potato1
Potato9

What does it do?

A tag's children are available as a list through its .contents attribute.

>>> for tag in soup.find_all("td"):
...     print(tag.contents)
...
[u'Potato1 ', <span somestuff...="">Potato2</span>]
[u'Potato9 ', <span somestuff...="">Potato10</span>]

Since we are only interested in the first element, we use

print(tag.contents[0])
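
For reference, a minimal self-contained sketch of the same idea. The markup is a hypothetical stand-in for the question's HTML (the span attributes elided as "somestuff..." are left out), and "html.parser" is assumed as the parser:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the question's markup; the elided span
# attributes ("somestuff...") are simply omitted here.
html = """
<table><tr>
<td>Potato1 <span>Potato2</span></td>
<td>Potato9 <span>Potato10</span></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# contents[0] is the first child node of each <td>; here it is a text node.
first_children = [td.contents[0] for td in soup.find_all("td")]

# Convert to plain strings and drop the space that precedes the <span>.
texts = [str(node).strip() for node in first_children]
print(texts)  # ['Potato1', 'Potato9']
```

Note that contents[0] is only the first child: it raises IndexError on an empty td and returns a Tag rather than text when the td starts with an element.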

1
votes

Another method, which, unlike tag.contents[0], guarantees that each result is a NavigableString (text directly inside the td) and never a child Tag, is the following (bs is the bs4 module, imported as in the example below):

[child for tag in soup.find_all("td") 
 for child in tag if isinstance(child, bs.NavigableString)]

Here is an example which highlights the difference:

import bs4 as bs

content = '''
<td>Potato1 <span>Potato2</span></td>
<td><span>FOO</span></td>
<td><span>Potato10</span>Potato9</td>
'''
soup = bs.BeautifulSoup(content, "html.parser")  # name the parser explicitly

print([tag.contents[0] for tag in soup.find_all("td")])
# [u'Potato1 ', <span>FOO</span>, <span>Potato10</span>]

print([child for tag in soup.find_all("td") 
       for child in tag if isinstance(child, bs.NavigableString)])
# [u'Potato1 ', u'Potato9']
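
A related sketch, in case one result per td is wanted instead of a flat list: find_all(string=True, recursive=False) selects only the text nodes that are direct children of a tag. This assumes Beautiful Soup 4.4 or later, where the argument is named string (older releases call it text); own_text is a hypothetical helper name, not part of bs4.

```python
from bs4 import BeautifulSoup

content = """
<td>Potato1 <span>Potato2</span></td>
<td><span>FOO</span></td>
<td><span>Potato10</span>Potato9</td>
"""
soup = BeautifulSoup(content, "html.parser")


def own_text(tag):
    """Join the text nodes that are direct children of `tag`.

    Hypothetical helper for illustration; not a bs4 function.
    """
    # recursive=False restricts the search to direct children,
    # string=True keeps only text nodes (no Tag objects).
    pieces = tag.find_all(string=True, recursive=False)
    return "".join(pieces).strip()


result = [own_text(td) for td in soup.find_all("td")]
print(result)  # ['Potato1', '', 'Potato9']
```

The empty string for the second td makes it visible that the cell has no text of its own, which the flat list comprehension silently skips.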

Or, with lxml, you could use the XPath td/text():

import lxml.html as LH

content = '''
<td>Potato1 <span>Potato2</span></td>
<td><span>FOO</span></td>
<td><span>Potato10</span>Potato9</td>
'''
root = LH.fromstring(content)

print(root.xpath('td/text()'))

yields

['Potato1 ', 'Potato9']
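
If the text should stay grouped per cell with lxml as well, a sketch along these lines may help. It assumes the cells are wrapped in a table (a hypothetical wrapper, added so the fragment parses as ordinary table markup) and evaluates text() relative to each td:

```python
import lxml.html as LH

# Same cells as above, wrapped in a table for this sketch.
content = """
<table><tr>
<td>Potato1 <span>Potato2</span></td>
<td><span>FOO</span></td>
<td><span>Potato10</span>Potato9</td>
</tr></table>
"""
root = LH.fromstring(content)

# For each <td>, text() returns only its direct text nodes,
# which are then joined and stripped into one string per cell.
cells = ["".join(td.xpath("text()")).strip() for td in root.iter("td")]
print(cells)  # ['Potato1', '', 'Potato9']
```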