The following code uses Scrapy + scrapy-splash + Python. I am trying to extract the upcoming matches (team names, tournament name, and start time) from this site: https://www.hltv.org/matches
My code in the callback 'parse' function is:
match_days = response.xpath("//div[@class = 'upcoming-matches']//div[@class = 'match-day']")
for match in match_days.xpath("./a"):
    print(match.extract())
    # tournament_name = match.xpath(".//td[@class='event']//span[@class='event-name']/text()").extract_first()
    # team1_name = match.xpath(".//td[@class='team-cell'][1]//div[@class='team']/text()").extract_first()
It is supposed to give me the full contents of every <a> element, which should look something like this:
<a href="/matches/2318355/dkiss-vs-psychoactive-prowince-winner-winner-of-the-future-2017" class="a-reset block upcoming-match standard-box" data-zonedgrouping-entry-unix="1514028600000">
<table class="table">
<tbody>
<tr>
<td class="time">
<div class="time" data-time-format="HH:mm" data-unix="1514028600000">12:30</div>
</td>
<td class="team-cell">
<div class="line-align">
<img alt="DKISS" src="https://static.hltv.org/images/team/logo/8657" class="logo" title="DKISS">
<div class="team">DKISS</div>
</div>
</td>
<td class="vs">vs</td>
<td class="team-cell">
<div class="team">PSYCHOACTIVE/proWince winner</div>
</td>
<td class="event"><img alt="Winner of the Future 2017" src="https://static.hltv.org/images/eventLogos/3464.png" class="event-logo" title="Winner of the Future 2017"><span class="event-name">Winner of the Future 2017</span></td>
<td class="star-cell">
<div class="map-text">bo3</div>
</td>
</tr>
</tbody>
</table>
</a>
But I only get this for each <a>:
<a href="/matches/2318355/dkiss-vs-psychoactive-prowince-winner-winner-of-the-future-2017" class="a-reset block upcoming-match standard-box" data-zonedgrouping-entry-unix="1514028600000">
</a>
I have tried this in the Scrapy shell and got the same result.
In Chrome Developer Tools, I can see the full contents of each <a> in its innerHTML property.
I don't believe the issue is with <tbody>. I understand that it is sometimes omitted from the source and added by web browsers, but when I print the HTML of the page returned in the response, <tbody> is there. (By the way, I use a Lua script via scrapy-splash to make a POST request to the URL and get the HTML page.)
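One check that could separate "the markup is missing from the response" from "the markup is lost while the selector builds its tree" is to read the raw response text with a plain tokenizer that does not restructure anything. The following is a minimal sketch using only the standard library. `MatchCollector` and `SAMPLE` are hypothetical names made up for illustration, the sample is a trimmed copy of the expected markup above, and treating `data-unix` as epoch milliseconds is an assumption based on how the value looks.

```python
from datetime import datetime, timezone
from html.parser import HTMLParser

# Trimmed copy of the expected markup; in the spider this would be response.text.
SAMPLE = """
<a href="/matches/2318355/dkiss-vs-psychoactive-prowince-winner-winner-of-the-future-2017" class="a-reset block upcoming-match standard-box">
<table class="table"><tbody><tr>
<td class="time"><div class="time" data-unix="1514028600000">12:30</div></td>
<td class="team-cell"><div class="line-align"><div class="team">DKISS</div></div></td>
<td class="vs">vs</td>
<td class="team-cell"><div class="team">PSYCHOACTIVE/proWince winner</div></td>
<td class="event"><span class="event-name">Winner of the Future 2017</span></td>
</tr></tbody></table>
</a>
"""


class MatchCollector(HTMLParser):
    """Hypothetical helper: collects fields seen between <a class="...upcoming-match..."> and </a>."""

    def __init__(self):
        super().__init__()
        self.matches = []     # one dict per upcoming-match anchor
        self._current = None  # dict being filled while inside an anchor
        self._field = None    # field the next non-empty text node belongs to

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "upcoming-match" in classes:
            self._current = {"href": attrs.get("href"), "teams": [], "event": None, "start": None}
        elif self._current is not None:
            if tag == "div" and "team" in classes:
                self._field = "team"
            elif tag == "span" and "event-name" in classes:
                self._field = "event"
            elif tag == "div" and "time" in classes and attrs.get("data-unix"):
                # Assumption: data-unix holds epoch milliseconds.
                seconds = int(attrs["data-unix"]) / 1000
                self._current["start"] = datetime.fromtimestamp(seconds, tz=timezone.utc)

    def handle_data(self, data):
        text = data.strip()
        if self._current is None or self._field is None or not text:
            return
        if self._field == "team":
            self._current["teams"].append(text)
        else:
            self._current["event"] = text
        self._field = None

    def handle_endtag(self, tag):
        self._field = None
        if tag == "a" and self._current is not None:
            self.matches.append(self._current)
            self._current = None


collector = MatchCollector()
collector.feed(SAMPLE)  # in the spider: collector.feed(response.text)
for match in collector.matches:
    print(match["teams"], match["event"], match["start"])
```

If this finds the team and event names in `response.text` while `match.extract()` still prints an empty <a>, the nested table is present in what Splash returns and is being dropped or moved during parsing, which would suggest looking at how the selector's HTML parser handles a <table> inside an <a> rather than at the Lua script. If it finds nothing, the response itself lacks the content.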
Does anyone know why this is happening? I have spent the past couple of days on this without finding an answer, and I have run out of ideas for what else to test.
Thank you.