2
votes

This may very well be a duplicate. I have read a lot of table-related questions -- like this one -- while attempting to understand how to extract web page contents that are more deeply nested.

Anyhow, here's the source code:

<div class='event-details'>
    <div class='event-content'>
    <p style="text-align:center;">
        <span style="font-size:14px;">
            <span style="color:#800000;">TICKETS: ...<br></span>
        </span>
        <span style="font-size:14px;">
            <span style="color:#800000;">Front-of-House $35 ...<br></span>
        </span>
    </p>
    <p><span style="font-size:14px;">From My Generation ...</span></p>
    <p><span style="font-size:14px;">With note for note ...</span></p>
    <p><span style="font-size:14px;">The Who Show was cast ...</span></p>
    <p><span style="font-size:14px;">With excellent musicianship ...</span></p>
    <p><span style="font-size:14px;">http://www.thewhoshow.com</span></p>
    </div>
</div>

Here's what's making it difficult: I don't want the ticket information, which precedes the paragraph text that I do want, and all of the text is, at one point or another, preceded by an identical style tag, namely <span style="font-size:14px;">.

What I am hoping is that there is a way in BS to grab the unique feature which the paragraphs provide -- i.e., a p tag followed immediately by the above span tag. See: <p><span style="font-size:14px;">
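To make the idea concrete, here's a rough sketch of the kind of filter I have in mind -- assuming BeautifulSoup 4 (bs4) and an illustrative page_source variable; I don't know whether this is the idiomatic way:

from bs4 import BeautifulSoup

soup = BeautifulSoup(page_source)  # page_source is an illustrative name

paragraphs = []
for p in soup.find('div', {'class': 'event-details'}).find_all('p'):
    # keep only <p> tags whose first direct child is the 14px span,
    # and skip the centered ticket block
    span = p.find('span', recursive=False)
    if span is not None and span.get('style') == 'font-size:14px;' \
            and 'text-align:center' not in (p.get('style') or ''):
        paragraphs.append(p.get_text())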

Here's what I've done:

desc_block = newsoup.find('div', {'class': 'event-details'}).find_all('p')
description = []
for desc in desc_block:
    desc_check = desc.get_text()
    description.append(desc_check)
print description[2:]

The problem is twofold: one, I'm appending characters (\n, for instance) and information (ticket info) that I don't want; and two, I'm appending at all, when what I really want is to extract the text and add it as a utf-8 string to an empty string. Can anyone please assist me with the first problem -- i.e., skipping the extraneous p tags and info I don't want? Any assistance would be greatly appreciated. Thank you.
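To illustrate the second point, here's a rough sketch (again with illustrative names) of what I mean by adding the text to an empty string as utf-8:

description = u''
for desc in desc_block[2:]:  # skip the leading ticket paragraphs; adjust the index as needed
    description += desc.get_text().strip() + u'\n'
description_utf8 = description.encode('utf-8')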

You might be looking for XPath expressions. See the answer I gave here for examples on how to use it. You can use XPath expressions with lxml, which also provides the Beautiful Soup parser (amongst others). – Lukas Graf
@LukasGraf that's an awesome response you offered; thank you, Lukas. Can you provide any advice, or point to another tutorial, for making lxml (the only tool I've used with XPath) play nicely with BS? I can never seem to make the two work, though I'm confident it can be done, as the lxml docs say as much. Probably a dumb question, but certainly you'd know better than I, and I will dutifully follow any recommendation you make :). Thank you kindly, sir. – Bee Smears
working on an answer ;-) – Lukas Graf
You want to discard "Front-of-House $35..." as well, right? So the first two 14px spans, i.e. all the text from the #800000 spans? – Lukas Graf
@LukasGraf yes, that's right; I just want the <p><span style="font-size:14px;"> paragraphs. PS: thank you for the edit. I'll try to format HTML blocks like that in the future. – Bee Smears

1 Answer

3
votes

If you parse your document with lxml, you can use XPath expressions to select only the elements you care about based on their location in the tree and their attributes.

To install lxml, do one of the following:

  • easy_install lxml
  • pip install lxml
  • declare it as a dependency for your package in setup.py
  • or use any other way to install the package

(assuming you already have BeautifulSoup installed)

Example

from BeautifulSoup import UnicodeDammit
from lxml import html


def decode_html(html_string):
    converted = UnicodeDammit(html_string, isHTML=True)
    if not converted.unicode:
        raise ValueError(
            "Failed to detect encoding, tried [%s]" %
            ', '.join(converted.triedEncodings))
    # print converted.originalEncoding
    return converted.unicode


tag_soup = open('mess.html').read()

# Use BeautifulSoup's UnicodeDammit to detect and fix the encoding
decoded = decode_html(tag_soup)

# Use lxml's HTML parser (faster) to parse the document
root = html.fromstring(decoded)

spans = root.xpath("//span[@style='font-size:14px;']")
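# skip the first two 14px spans -- they hold the ticket info we don't want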
wanted_spans = spans[2:]

blocks = []
for span in wanted_spans:
    line = span.text.strip().replace('\n', '')
    blocks.append(line)

description = '\n'.join(blocks)
print description

This code uses lxml's fast HTML parser to parse the document (works just fine for the snippet you provided), but BeautifulSoup's encoding detection to first guess the appropriate character set and decode the document. For more information on how to use lxml with the BeautifulSoup parser, see the lxml docs on BeautifulSoup.
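If you'd rather have BeautifulSoup do the parsing as well, lxml ships a soupparser module that wraps it -- a minimal sketch, reusing the tag_soup string from the example above:

from lxml.html import soupparser

# BeautifulSoup parses under the hood; slower than lxml's own parser,
# but more forgiving with badly broken markup
root = soupparser.fromstring(tag_soup)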

The spans are selected by the XPath expression //span[@style='font-size:14px;'], which basically means: "any <span /> anywhere in the document that has a style attribute with the exact value font-size:14px;".

If you want to be more specific about selecting your elements, you could use an expression like

//div[@class='event-details']//span[@style='font-size:14px;']

to select only spans (somewhere) below a div with class event-details. Note that it really is the exact attribute value that's being matched: if even the trailing ; is missing from the style value, the expression won't match. XPath knows nothing about CSS; it's a generic query language for traversing to elements or attributes in XML documents. If your document is that messy and you need to account for variations, you can use something like contains() in your XPath expression, as sketched below.
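For example, a looser selection that tolerates a missing semicolon or extra declarations in the style attribute could look like this (a sketch against the root from the example above):

# contains() matches any span whose style attribute merely contains
# the substring, rather than requiring an exact match
spans = root.xpath("//span[contains(@style, 'font-size:14px')]")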

spans[2:] then selects all but the first two spans, and strip().replace('\n', '') makes sure we get no stray whitespace in the text. Finally, I join all the lines to form a newline-separated description -- if you don't want any newlines at all, join them with ' '.join(blocks) instead.

For more information on the XPath syntax, see for example the XPath Syntax page in the W3Schools XPath Tutorial.

To get going with XPath, it can also be very helpful to fiddle around with your document in one of the many online XPath testers. The Firebug plugin for Firefox and the Google Chrome inspector can also show you an XPath (or rather, one of many possible ones) for a selected element.