BeautifulSoup Malformed Start Tag?

Question

I am trying to convert a WordPress XML export to Octopress, using BeautifulSoup for part of the migration.

When I run exitwp, I get the following output:

writing......................................................Traceback (most recent call last):
  File "exitwp.py", line 293, in <module>
    write_jekyll(data, target_format)
  File "exitwp.py", line 284, in write_jekyll
    out.write(html2fmt(i['body'], target_format))
  File "exitwp.py", line 45, in html2fmt
    return html2text(html, '')
  File "/Users/kevinquillen/Documents/workspace/exitwp2/html2text.py", line 700, in html2text
    return optwrap(html2text_file(html, None, baseurl))
  File "/Users/kevinquillen/Documents/workspace/exitwp2/html2text.py", line 695, in html2text_file
    h.feed(html)
  File "/Users/kevinquillen/Documents/workspace/exitwp2/html2text.py", line 285, in feed
    HTMLParser.HTMLParser.feed(self, data)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 108, in feed
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 148, in goahead
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 229, in parse_starttag
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 304, in check_for_whole_start_tag
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/HTMLParser.py", line 115, in error
HTMLParser.HTMLParseError: malformed start tag, at line 1, column 64

I tried both BeautifulSoup 3.2.0 and 3.0.7a, without much luck.

I also tried exporting different date ranges of posts, but I still get the same error at line 1; only the column number changes.

The only thing I can think of is that some older posts have AdSense code in them. Beyond that, how can I easily track down where it is choking on the post content?

Python 2.7 on OS X 10.7.

Edit: This also happens on a Page dump (just 2 pages) that has no bad markup.

Update: It doesn't seem to like anchor tags, even very basic links in the content like the one shown below. After I removed them, it ran without error. Why does it not like this HTML?

<a href="http://www.google.com" target="_blank">Google</a>

guettli · Accepted Answer · 2012-01-03T08:34:22

Modify your code like this (in html2text.py, around the feed call shown in the traceback):

try:
    HTMLParser.HTMLParser.feed(self, data)
except:
    # Show the exact input that broke the parser, then re-raise the error.
    print 'malformed data: %r' % data
    raise

I guess you will see that 'data' contains something strange. If not, please add the data to your question.
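Once the failing data has been printed, the position from the error message (for example "line 1, column 64") can be used to look at just the text around the failure. Here is a minimal sketch of that idea. The helper name context_at and its width parameter are made up for illustration and are not part of html2text or HTMLParser, and it assumes the line number is counted from 1 and the column from 0:

```python
def context_at(data, lineno, column, width=30):
    """Return the text surrounding a reported parse-error position.

    Assumption: lineno is 1-based and column is 0-based.
    context_at is a hypothetical helper, not a library function.
    """
    lines = data.splitlines()
    line = lines[lineno - 1]
    # Clamp the window so it never runs past either end of the line.
    start = max(0, column - width)
    end = min(len(line), column + width)
    return line[start:end]


# Made-up input: 64 filler characters followed by the link from the question.
sample = 'x' * 64 + '<a href="http://www.google.com" target="_blank">Google</a>'
snippet = context_at(sample, 1, 64, width=5)
```

Calling it with the printed data and the numbers from the error message should show which tag the parser stopped on, which is easier to read than the whole post body.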
