Another method, which, unlike tag.contents[0], guarantees that the text is a NavigableString and not text from within a child Tag, is:
[child for tag in soup.find_all("td")
 for child in tag if isinstance(child, bs.NavigableString)]
Here is an example which highlights the difference:
import bs4 as bs
content = '''
<td>Potato1 <span>Potato2</span></td>
<td><span>FOO</span></td>
<td><span>Potato10</span>Potato9</td>
'''
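# Name the parser explicitly so the result does not depend on which parsers are installed
soup = bs.BeautifulSoup(content, "html.parser")
# contents[0] is simply the first child, which may be a Tag
print([tag.contents[0] for tag in soup.find_all("td")])
# ['Potato1 ', <span>FOO</span>, <span>Potato10</span>]
# Keep only the children that are text nodes of the td itself
print([child for tag in soup.find_all("td")
       for child in tag if isinstance(child, bs.NavigableString)])
# ['Potato1 ', 'Potato9']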
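If you need to know which cell each string came from, the same isinstance check can be applied per cell. The following is only a sketch: direct_text is a hypothetical helper name, not part of BeautifulSoup, and it assumes the built-in "html.parser" backend.

```python
import bs4 as bs

content = '''
<td>Potato1 <span>Potato2</span></td>
<td><span>FOO</span></td>
<td><span>Potato10</span>Potato9</td>
'''

soup = bs.BeautifulSoup(content, "html.parser")


def direct_text(tag):
    """Hypothetical helper: return the tag's own text nodes,
    ignoring any text nested inside child tags."""
    return [child for child in tag.children
            if isinstance(child, bs.NavigableString)]


# One list per <td>, so a cell with no direct text shows up as an empty list
per_cell = [direct_text(td) for td in soup.find_all("td")]
print(per_cell)
# [['Potato1 '], [], ['Potato9']]
```

Keeping one list per cell makes it obvious that the second td has no text of its own, which the flattened list hides.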
Or, with lxml, you could use the XPath td/text():
import lxml.html as LH
content = '''
<td>Potato1 <span>Potato2</span></td>
<td><span>FOO</span></td>
<td><span>Potato10</span>Potato9</td>
'''
root = LH.fromstring(content)
print(root.xpath('td/text()'))
yields
['Potato1 ', 'Potato9']
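To see why text() skips the strings inside the span elements, it may help to know that in the ElementTree data model (which lxml also exposes) an element's direct text is stored in its own .text plus the .tail of each of its children. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for lxml; the own_text helper name and the wrapping tr element (added so the input is well-formed XML) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the cells are wrapped in a <tr> so the input is well-formed XML
content = (
    "<tr>"
    "<td>Potato1 <span>Potato2</span></td>"
    "<td><span>FOO</span></td>"
    "<td><span>Potato10</span>Potato9</td>"
    "</tr>"
)
root = ET.fromstring(content)


def own_text(elem):
    """Hypothetical helper: collect elem's direct text nodes,
    i.e. its .text plus the .tail of each child element."""
    parts = [elem.text] + [child.tail for child in elem]
    # Drop the None entries left by missing text or tails
    return [part for part in parts if part]


texts = [text for td in root.findall("td") for text in own_text(td)]
print(texts)
# ['Potato1 ', 'Potato9']
```

'Potato2', 'FOO' and 'Potato10' live in the .text of the span elements, so neither this helper nor td/text() picks them up.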