1
votes

I am experimenting with Beautiful Soup and trying to extract information from an HTML document that contains segments of the following type:

<div class="entity-body">
<h3 class="entity-name with-profile">
<a href="https://www.linkedin.com/profile/view?id=AA4AAAAC9qXUBMuA3-txf-cKOPsYZZ0TbWJkhgfxfpY&amp;trk=manage_invitations_profile" 
data-li-url="/profile/mini-profile-with-connections?_ed=0_3fIDL9gCh6b5R-c9s4-e_B&amp;trk=manage_invitations_miniprofile" 
class="miniprofile" 
aria-label="View profile for Ivan Grigorov">
<span>Ivan Grigorov</span>
</a>
</h3>
<p class="entity-subheader">
Teacher
</p>
</div>

I have used the following commands:

with open("C:\Users\pv\MyFiles\HTML\Invites.html","r") as Invites:
    soup = bs(Invites, 'lxml')
soup.title
out: <title>Sent Invites\n| LinkedIn\n</title>
invites = soup.find_all("div", class_ = "entity-body")
type(invites)
out: bs4.element.ResultSet
len(invites)
out: 0

Why does find_all return an empty ResultSet object?

Your advice will be appreciated.

2
Try viewing the page as you fetch it. If you can't see this div tag there, it means this part is generated using JS, so you won't be able to scrape it this way (you'd have to use Selenium). - Fejs
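One way to act on this comment is to search the raw text of the saved file for the class name before involving any parser. The following is only a sketch: the helper name file_mentions and the throwaway demo file are hypothetical, and the real path to Invites.html would be substituted in.

```python
import os
import tempfile


def file_mentions(path, marker):
    """Return True if the raw text of the file at `path` contains `marker`."""
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        return marker in fh.read()


# Demo on a throwaway file (an assumption for illustration);
# replace demo_path with the real path to the saved page.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.html")
    with open(demo_path, "w", encoding="utf-8") as fh:
        fh.write('<div class="entity-body"><p>Teacher</p></div>')

    found = file_mentions(demo_path, "entity-body")
    missing = file_mentions(demo_path, "no-such-class")

print(found, missing)
```

If the marker is absent from the raw file, no parser setting will make find_all return it, which would point to content that the browser builds after the page loads.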

2 Answers

0
votes

The problem is that the document is not read; it is just a TextIOWrapper (Python 3) or file (Python 2) object. You have to read the document and pass the markup, essentially a string, to BeautifulSoup.

The correct code would be:

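from bs4 import BeautifulSoup

# Raw string so the backslashes in the Windows path are not treated as escapes.
with open(r"C:\Users\pv\MyFiles\HTML\Invites.html", "r") as Invites:
    soup = BeautifulSoup(Invites.read(), "html.parser")

print(soup.title)
invites = soup.find_all("div", class_="entity-body")
print(len(invites))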
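To check whether reading the file first makes a difference, the same file can be parsed both ways and the counts compared. This is a minimal sketch, assuming a small throwaway file stands in for Invites.html and using html.parser to avoid depending on lxml. The Beautiful Soup documentation describes the constructor as accepting either a string or an open filehandle, so the two counts are expected to match.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Stand-in markup (an assumption); the real file would be Invites.html.
markup = '<html><body><div class="entity-body"><p>Teacher</p></div></body></html>'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.html")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(markup)

    # Parse directly from the open file handle.
    with open(path, "r", encoding="utf-8") as fh:
        from_handle = BeautifulSoup(fh, "html.parser")

    # Parse from the string returned by read().
    with open(path, "r", encoding="utf-8") as fh:
        from_string = BeautifulSoup(fh.read(), "html.parser")

count_handle = len(from_handle.find_all("div", class_="entity-body"))
count_string = len(from_string.find_all("div", class_="entity-body"))
print(count_handle, count_string)
```

If both counts are also equal on the real file, the empty result more likely comes from the file's contents than from how it was opened.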
0
votes
import bs4

html = '''<div class="entity-body">
<h3 class="entity-name with-profile">
<a href="https://www.linkedin.com/profile/view?id=AA4AAAAC9qXUBMuA3-txf-cKOPsYZZ0TbWJkhgfxfpY&amp;trk=manage_invitations_profile" 
data-li-url="/profile/mini-profile-with-connections?_ed=0_3fIDL9gCh6b5R-c9s4-e_B&amp;trk=manage_invitations_miniprofile" 
class="miniprofile" 
aria-label="View profile for Ivan Grigorov">
<span>Ivan Grigorov</span>
</a>
</h3>
<p class="entity-subheader">
Teacher
</p>
</div>'''

soup = bs4.BeautifulSoup(html, 'lxml')
invites = soup.find_all("div", class_ = "entity-body")
len(invites)

out:

1

This code works fine, so find_all does match this markup when it is actually present in the parsed document.
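Once find_all returns the matching div elements, the name and subheader can be pulled out of each one. The following is a sketch only: the variable names are illustrative, the markup is a shortened copy of the segment from the question, and html.parser is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Shortened copy of the segment from the question (attributes trimmed).
html = '''<div class="entity-body">
<h3 class="entity-name with-profile">
<a class="miniprofile" aria-label="View profile for Ivan Grigorov">
<span>Ivan Grigorov</span>
</a>
</h3>
<p class="entity-subheader">
Teacher
</p>
</div>'''

soup = BeautifulSoup(html, "html.parser")

results = []
for div in soup.find_all("div", class_="entity-body"):
    name_tag = div.find("h3", class_="entity-name")
    sub_tag = div.find("p", class_="entity-subheader")
    # get_text(strip=True) drops the surrounding newlines and spaces.
    name = name_tag.get_text(strip=True) if name_tag else None
    subheader = sub_tag.get_text(strip=True) if sub_tag else None
    results.append((name, subheader))

print(results)
```

The checks for None guard against segments where one of the two tags is missing.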