This may very well be a duplicate. I have read a lot of table-related questions (like this one) while trying to understand how to extract web page contents that are more deeply nested.
Anyhow, here's the source code:
<div class='event-details'>
<div class='event-content'>
<p style="text-align:center;">
<span style="font-size:14px;">
<span style="color:#800000;">TICKETS: ...<br></span>
</span>
<span style="font-size:14px;">
<span style="color:#800000;">Front-of-House $35 ...<br></span>
</span>
</p>
<p><span style="font-size:14px;">From My Generation ...</span></p>
<p><span style="font-size:14px;">With note for note ...</span></p>
<p><span style="font-size:14px;">The Who Show was cast ...</span></p>
<p><span style="font-size:14px;">With excellent musicianship ...</span></p>
<p><span style="font-size:14px;">http://www.thewhoshow.com</span></p>
</div>
</div>
Here's what's making it difficult: I don't want the ticket information, which precedes the paragraph text that I do want, and all of the text is, at one point or another, preceded by an identical span tag with the same style attribute, namely: <span style="font-size:14px;">
What I am hoping is that there is a way in BS (BeautifulSoup) to grab the unique feature that the wanted paragraphs provide, i.e., a p tag with no attributes followed immediately by the above span tag: <p><span style="font-size:14px;">
Here's what I've done:
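# newsoup is the BeautifulSoup object for the page, built elsewhere
# the attrs filter is normally a dict ({'class': ...}); the set literal looked like a typo
desc_block = newsoup.find('div', {'class': 'event-details'}).find_all('p')
description = []
for desc in desc_block:
    desc_check = desc.get_text()
    description.append(desc_check)
# skip the first two entries; the parentheses keep this valid on Python 2 and 3
print(description[2:])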
The problem is twofold: one, I'm appending characters (\n for instance) and information (ticket info) that I don't want; and two, I'm appending to a list at all, since what I really wish to do is extract the text and add it as a UTF-8 string to an empty string. Can anyone please assist me with the first problem, i.e., grabbing extraneous p tags and info I don't want? Any assistance would be greatly appreciated. Thank you.
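As a sketch of the second point (building one string instead of a list), the stray newlines can be collapsed while concatenating. The raw_texts values below are stand-ins for what get_text() might return for the wanted paragraphs; the extra whitespace in them is invented for the example.

```python
# Stand-ins for the get_text() results; the surrounding whitespace is made up.
raw_texts = [
    "\nFrom My Generation ...\n",
    "With note for note ...",
    "  http://www.thewhoshow.com\n",
]

description = ""
for text in raw_texts:
    # split() with no arguments breaks on any whitespace run, so joining
    # the pieces with single spaces removes "\n" and repeated blanks.
    cleaned = " ".join(text.split())
    if cleaned:
        # Separate paragraphs with one space, but not before the first one.
        description += (" " if description else "") + cleaned

# Encoding is only needed at the point of writing out; this yields bytes.
encoded = description.encode("utf-8")
```

Keeping the text as a normal string and encoding only at the output step avoids mixing encoded and decoded values while concatenating.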
"Front-of-House $35..." as well, right? So the first two 14px spans, respectively all the text from the #800000 spans? – Lukas Graf
<p><span style="font-size:14px;"> paragraphs. PS: thank you for the edit. I'll try to format HTML blocks like that in the future. – Bee Smears