4
votes

I apologize in advance for any lack of clarity (I'm new to programming). I'm trying to parse a set of local files with lxml.etree. I wrote a parsing script using lxml (and XPath) that finds the relevant data on an SEC webpage and exports it to a .csv file. That script works for a single URL, but I want to generalize it to thousands of HTML pages. I've downloaded all the HTML files locally (I used curl to get the links and wget to download them), but I haven't had any success adapting my parser to read them. The old version that worked was:

import requests
from lxml import html
page = requests.get('url')
tree = html.fromstring(page.text)

I've tried to replace it with etree.parse so that it parses the downloaded files in the local directory 'bullseye':

path = "/Users/dbk13/Desktop/SEC/bullseye"
dirs = os.listdir( path )

for files in dirs: 
    page = os.path.join(path,files)
    etree.parse(page)

Is there an issue with my path to the local files?

The error I keep getting is something like:

File "postings_up_updated.py", line 26, in <module>
    etree.parse(page)
File "lxml.etree.pyx", line 3299, in lxml.etree.parse (src/lxml/lxml.etree.c:72421)
File "parser.pxi", line 1791, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:105883)
File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:106182)
File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:105181)
File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:100131)
File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:94254)
File "parser.pxi", line 690, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:95690)
File "parser.pxi", line 620, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:94757)
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1

1
An empty file is not a valid XML document! Obviously, in /Users/dbk13/Desktop/SEC/bullseye there is an empty file. - Antti Haapala

1 Answer

2
votes

The error message suggests that the file is empty. However, I think it more likely that you are trying to parse a directory as though it were a file. This code produces the same traceback as you've shown:

from lxml import etree

etree.parse('/tmp')
Traceback (most recent call last):
.
.
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1

This might be happening if there are subdirectories in "/Users/dbk13/Desktop/SEC/bullseye", because os.listdir() includes subdirectories in the list it returns. If this is the case, you could try checking for regular files using os.path.isfile():

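import os
from lxml import etree

path = "/Users/dbk13/Desktop/SEC/bullseye"
entries = os.listdir(path)  # includes subdirectories as well as files

for filename in entries:
    page = os.path.join(path, filename)
    # Only parse regular files; skip subdirectories
    if os.path.isfile(page):
        etree.parse(page)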
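
To find out which explanation applies (a subdirectory, or a zero-byte file as suggested in the comment), a small diagnostic along these lines might help. It is only a sketch: the function name classify_entries is made up for illustration and is not part of os or lxml.

```python
import os


def classify_entries(path):
    """Sort the entries of `path` into subdirectories, empty files and non-empty files."""
    subdirs, empty_files, usable_files = [], [], []
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if os.path.isdir(full):
            subdirs.append(name)
        elif os.path.isfile(full) and os.path.getsize(full) == 0:
            # Zero-byte regular file: a parser has nothing to read here
            empty_files.append(name)
        elif os.path.isfile(full):
            usable_files.append(name)
    return subdirs, empty_files, usable_files
```

Calling classify_entries("/Users/dbk13/Desktop/SEC/bullseye") and printing the first two lists should show whether the directory contains anything other than non-empty regular files.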

Another point worth making is that you appear to be parsing HTML files with an XML parser. That is unlikely to succeed, because the vast majority of HTML files are not well-formed XML and therefore cannot be reliably parsed with an XML parser. I'd recommend lxml.html, but you already seem to have tried that. An alternative HTML parser is BeautifulSoup.
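
For local files, lxml.html also provides a parse() function that accepts a file path, so the requests step can be dropped without falling back to the XML parser. A minimal sketch, assuming lxml is installed; the generator name parse_html_dir is hypothetical, and skipping zero-byte files is an extra precaution prompted by the comment above:

```python
import os
from lxml import html


def parse_html_dir(path):
    """Yield (filename, root element) for each non-empty regular file in `path`."""
    for filename in sorted(os.listdir(path)):
        page = os.path.join(path, filename)
        # Skip subdirectories and zero-byte files
        if not os.path.isfile(page) or os.path.getsize(page) == 0:
            continue
        # html.parse() uses lxml's HTML parser, which tolerates markup
        # that is not well-formed XML
        tree = html.parse(page)
        yield filename, tree.getroot()
```

Each root element supports the same xpath() calls as the tree returned by html.fromstring(), so the existing extraction code should be able to run inside a loop such as `for filename, root in parse_html_dir(path):`.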