I apologize in advance for any lack of clarity (I'm new to programming). I'm trying to parse a set of local files with lxml.etree. I wrote a parsing script using lxml (and XPath) that finds the relevant data on an SEC webpage and exports it to a .csv file. That script works for a single URL, but I want to generalize it to thousands of HTML pages. I've downloaded all the HTML files locally (I used curl to get the links and wget to download them), but I haven't had any success adapting my parser to read them. The old version that worked was:
import requests
from lxml import html
page = requests.get('url')
tree = html.fromstring(page.text)
I've tried to replace it with etree.parse so that I parse the locally downloaded files in the directory 'bullseye':
import os
from lxml import etree
path = "/Users/dbk13/Desktop/SEC/bullseye"
dirs = os.listdir( path )
for files in dirs:
    page = os.path.join(path,files)
    etree.parse(page)
Is there an issue with my path to the local files?
The error I keep getting is something like:
  File "postings_up_updated.py", line 26, in <module>
    etree.parse(page)
  File "lxml.etree.pyx", line 3299, in lxml.etree.parse (src/lxml/lxml.etree.c:72421)
  File "parser.pxi", line 1791, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:105883)
  File "parser.pxi", line 1817, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:106182)
  File "parser.pxi", line 1721, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:105181)
  File "parser.pxi", line 1122, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:100131)
  File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:94254)
  File "parser.pxi", line 690, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:95690)
  File "parser.pxi", line 620, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:94757)
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1
/Users/dbk13/Desktop/SEC/bullseye
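One way to tell whether the path or the files themselves are at fault is to list what os.listdir actually returns together with each entry's size. The following is only a rough sketch using the standard library. The helper name describe_dir is made up for illustration and is not part of the original script. It assumes Python 3.

```python
import os

def describe_dir(directory):
    """Return (name, is_file, size_in_bytes) for every entry in `directory`.

    `describe_dir` is a hypothetical helper, not part of the original script.
    The size is None for entries that are not regular files.
    """
    report = []
    for name in sorted(os.listdir(directory)):
        full_path = os.path.join(directory, name)
        is_file = os.path.isfile(full_path)
        size = os.path.getsize(full_path) if is_file else None
        report.append((name, is_file, size))
    return report

if __name__ == "__main__":
    # Path taken from the question; adjust as needed.
    target = "/Users/dbk13/Desktop/SEC/bullseye"
    if os.path.isdir(target):
        for name, is_file, size in describe_dir(target):
            print(name, is_file, size)
```

Any entry reported with a size of 0, or one that is not a regular file, is a likely candidate for the "Document is empty" error, since the parser is handed something with no markup in it.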
there is an empty file. – Antti Haapala
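If an empty file is indeed the cause, the loop could skip zero-byte files before parsing. Below is a minimal sketch under a few assumptions: Python 3, an installed lxml, and a made-up helper name parse_html_dir. It also swaps etree.parse for lxml.html.parse, on the assumption that the downloaded pages are HTML rather than well-formed XML. That swap is a suggestion and is not confirmed by the question.

```python
import os
from lxml import etree, html

def parse_html_dir(directory):
    """Parse every non-empty regular file in `directory` as HTML.

    `parse_html_dir` is a hypothetical helper. It returns a dict that maps
    each file name to the parsed lxml tree; empty or unparsable files are
    skipped with a message instead of stopping the whole run.
    """
    trees = {}
    for name in sorted(os.listdir(directory)):
        full_path = os.path.join(directory, name)
        # Skip sub-directories and zero-byte files.
        if not os.path.isfile(full_path) or os.path.getsize(full_path) == 0:
            print("skipping", full_path)
            continue
        try:
            tree = html.parse(full_path)
        except (etree.LxmlError, OSError) as exc:
            print("could not parse", full_path, exc)
            continue
        # Guard against files that yield no root element at all.
        if tree.getroot() is None:
            print("no content in", full_path)
            continue
        trees[name] = tree
    return trees
```

Each returned tree can then be queried with the same XPath expressions as before, for example trees[name].xpath(...), so the rest of the script should need little change.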