3
votes

I am trying to parse the DBLP data set using lxml in Python, but it raises this error:

lxml.etree.XMLSyntaxError: Entity 'uuml' not defined, line 54, column 43

DBLP provides a DTD file that defines these entities. How can I use that file when parsing the DBLP XML document?

Here is my current code:

import sqlite3
import sys

from lxml import etree as ET

filename = sys.argv[1]
dtd_name = sys.argv[2]
db_name = sys.argv[3]

conn = sqlite3.connect(db_name)

dblp_record_types_for_publications = ('article', 'inproceedings', 'proceedings', 'book', 'incollection',
    'phdthesis', 'masterthesis', 'www')

# read dtd
dtd = ET.DTD(dtd_name) #pylint: disable=E1101

# get an iterable
context = ET.iterparse(filename, events=('start', 'end'), load_dtd=True, #pylint: disable=E1101
    resolve_entities=True) 

# turn it into an iterator
context = iter(context)

# get the root element
event, root = next(context)

n_records_parsed = 0
for event, elem in context:
    if event == 'end' and elem.tag in dblp_record_types_for_publications:
        pub_year = None
        for year in elem.findall('year'):
            pub_year = year.text
        if pub_year is None:
            continue

        pub_title = None
        for title in elem.findall('title'):
            pub_title = title.text
        if pub_title is None:
            continue

        pub_authors = []
        for author in elem.findall('author'):
            if author.text is not None:
                pub_authors.append(author.text)

        # print(pub_year)
        # print(pub_title)
        # print(pub_authors)
        # insert the publication and its authors into the SQL tables;
        # parameterized queries let sqlite3 handle the quoting
        conn.execute("INSERT OR IGNORE INTO publications VALUES (?, ?)",
            (pub_title, pub_year))
        for author in pub_authors:
            conn.execute("INSERT OR IGNORE INTO authors VALUES (?)", (author,))
            conn.execute("INSERT INTO authored VALUES (?, ?)", (author, pub_title))

        elem.clear()
        root.clear()

        n_records_parsed += 1
        print("No. of records parsed: {}".format(n_records_parsed))

conn.commit()
conn.close()
2
If the XML document has a doctype declaration (<!DOCTYPE dblp SYSTEM "dblp.dtd">), dblp.dtd is in the same directory as the XML file, and load_dtd=True is used, then I don't get any syntax error. I don't think dtd = ET.DTD(dtd_name) has any effect in this case. - mzjn

2 Answers

3
votes

You can add a custom URI resolver (see https://lxml.de/resolvers.html):

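import os
from lxml import etree

class DTDResolver(etree.Resolver):
    def resolve(self, system_url, public_id, context):
        # look up the requested DTD in the directory that holds the DTD file
        return self.resolve_filename(os.path.join("/path/to/dtd/file", system_url), context)

# context is the iterparse object from the question; add the resolver before iterating
context.resolvers.add(DTDResolver())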
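
For anyone who wants to try the resolver idea in isolation, here is a minimal self-contained sketch. The names dtd_dir, author_names and SAMPLE_XML are made up for illustration, and the one-line DTD is only a stand-in for the real dblp.dtd. It takes the base name of system_url before joining, on the assumption that the parser may hand the resolver a longer path than the bare file name.

```python
import io
import os
import tempfile

from lxml import etree


class DTDResolver(etree.Resolver):
    """Redirect DTD lookups to a fixed directory (dtd_dir is a hypothetical name)."""

    def __init__(self, dtd_dir):
        super().__init__()
        self.dtd_dir = dtd_dir

    def resolve(self, system_url, public_id, context):
        # Keep only the file name so the lookup always lands in dtd_dir,
        # even if the parser passes a longer path or URL.
        dtd_path = os.path.join(self.dtd_dir, os.path.basename(system_url))
        return self.resolve_filename(dtd_path, context)


def author_names(xml_bytes, dtd_dir):
    """Return the text of every <author> element in xml_bytes."""
    context = etree.iterparse(io.BytesIO(xml_bytes), events=("end",),
                              load_dtd=True, resolve_entities=True)
    # Register the resolver before the first event is pulled from the parser.
    context.resolvers.add(DTDResolver(dtd_dir))
    return [elem.text for _, elem in context if elem.tag == "author"]


# Tiny stand-in for a DBLP record that uses the entity from the error message.
SAMPLE_XML = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<!DOCTYPE dblp SYSTEM "dblp.dtd">\n'
    b"<dblp><article><author>J&uuml;rgen</author></article></dblp>"
)

with tempfile.TemporaryDirectory() as dtd_dir:
    # Stand-in DTD that declares only the entity used above.
    with open(os.path.join(dtd_dir, "dblp.dtd"), "w", encoding="utf-8") as f:
        f.write('<!ENTITY uuml "&#252;">\n')
    names = author_names(SAMPLE_XML, dtd_dir)

print(names)
```

With the real data you would point dtd_dir at the directory that contains the downloaded dblp.dtd and pass the XML file name to iterparse instead of a BytesIO object.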
2
votes

As mzjn suggested in the comments, I put the DTD file in the same directory as the XML file and made sure its file name matches the one in the XML document's doctype declaration (<!DOCTYPE dblp SYSTEM "dblp.dtd">). The parser no longer raises syntax errors.
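
A small sketch of that setup, in case it helps someone reproduce it. The temporary directory, the one-entity DTD and the single-record XML file are stand-ins invented for the example, not the real DBLP files.

```python
import os
import tempfile

from lxml import etree

with tempfile.TemporaryDirectory() as workdir:
    # Stand-in DTD, stored next to the XML file under the name used in the doctype.
    with open(os.path.join(workdir, "dblp.dtd"), "w", encoding="utf-8") as f:
        f.write('<!ENTITY uuml "&#252;">\n')

    xml_path = os.path.join(workdir, "dblp.xml")
    with open(xml_path, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n'
                '<!DOCTYPE dblp SYSTEM "dblp.dtd">\n'
                '<dblp><article><author>J&uuml;rgen</author></article></dblp>\n')

    # load_dtd=True makes the parser read dblp.dtd, which defines &uuml;
    context = etree.iterparse(xml_path, events=("end",),
                              load_dtd=True, resolve_entities=True)
    authors = [elem.text for _, elem in context if elem.tag == "author"]

print(authors)
```

Note that ET.DTD(dtd_name) is not needed here; the DTD is found through the doctype declaration alone.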