Problem reading an XML file in Python

Question

I want to read this XML file with Python on Google Colab like this:

import xml.etree.ElementTree as ET

tree = ET.parse('drive/MyDrive/pubmed22n1192.xml')

where pubmed22n1192.xml is the name of the file.

But I get this error:

File "<string>", line unknown
ParseError: syntax error: line 1, column 0

Is there something wrong with this file? Because of its size, I am sharing only the first few lines:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2019//EN" "https://dtd.nlm.nih.gov/ncbi/pubmed/out/pubmed_190101.dtd">
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation Status="MEDLINE" Owner="NLM">
      <PMID Version="1">14584002</PMID>
      <DateCompleted>
        <Year>2004</Year>
        <Month>05</Month>
        <Day>04</Day>
      </DateCompleted>
      <DateRevised>
        <Year>2022</Year>
        <Month>02</Month>
        <Day>13</Day>
      </DateRevised>
      <Article PubModel="Print">
        <Journal>
          <ISSN IssnType="Electronic">1469-493X</ISSN>
          <JournalIssue CitedMedium="Internet">
            <Issue>4</Issue>
            <PubDate>
              <Year>2003</Year>
            </PubDate>
          </JournalIssue>
          <Title>The Cochrane database of systematic reviews</Title>
          <ISOAbbreviation>Cochrane Database Syst Rev</ISOAbbreviation>
        </Journal>
        <ArticleTitle>Intravenous immunoglobulin for the treatment of Kawasaki disease in children.</ArticleTitle>
        <Pagination>
          <MedlinePgn>CD004000</MedlinePgn>
        </Pagination>
        <Abstract>
          <AbstractText Label="BACKGROUND" NlmCategory="BACKGROUND">Kawasaki disease is the most common cause of acquired heart disease in children in developed countries. The coronary arteries supplying the heart can be damaged in Kawasaki disease. The principal advantage of timely diagnosis is the potential to prevent this complication with early treatment. Intravenous immunoglobulin (IVIG) is widely used for this purpose.</AbstractText>
          <AbstractText Label="OBJECTIVES" NlmCategory="OBJECTIVE">The objective of this review was to evaluate the effectiveness of IVIG in treating, and preventing cardiac consequences, of Kawasaki disease in children.</AbstractText>
          <AbstractText Label="SEARCH STRATEGY" NlmCategory="METHODS">Electronic searches of the Cochrane Peripheral Vascular Disease Group Specialised Register, CENTRAL, MEDLINE, EMBASE, and CINAHL were performed (last searched April 2003). We also searched references from relevant articles and contacted authors where necessary. In addition we contacted experts in the field for unpublished works.</AbstractText>
          <AbstractText Label="SELECTION CRITERIA" NlmCategory="METHODS">Randomised controlled trials of intravenous immunoglobulin to treat Kawasaki disease were eligible for inclusion.</AbstractText>
          <AbstractText Label="DATA COLLECTION AND ANALYSIS" NlmCategory="METHODS">Fifty-nine trials were identified in the initial search. On careful inspection only sixteen of these met all the inclusion criteria. Trials were data extracted and assessed for quality by at least two reviewers. Data were combined for meta-analysis using relative risk ratios for dichotomous data or weighted mean difference for continuous data. A random effects statistical model was used.</AbstractText>
          <AbstractText Label="MAIN RESULTS" NlmCategory="RESULTS">The meta-analysis of IVIG versus placebo, including all children, showed a significant decrease in new coronary artery abnormalities (CAAs) in favour of IVIG, at thirty days RR (95% CI) = 0.74 (0.61 to 0.90). No statistically significant difference was found thereafter. A subgroup analysis excluding children with CAAs at enrollment also found a significant reduction of new CAAs in children receiving IVIG RR (95%) = 0.67 (0.46 to 1.00). There was a trend towards benefit from IVIG at sixty days (p=0.06). Results of dose comparisons showed a decrease in the number of new CAAs with increased dose. The meta-analysis of 400 mg/kg/day for five days versus 2 gm/kg in a single dose showed statistically significant reduction in CAAs at thirty days RR (95%) = 4.47 (1.55 to 12.86). This comparison also showed a significant reduction in duration of fever with the higher dose. There was no statistically significant difference noted between different preparations of IVIG. There was no statistically significant difference of adverse effects in any group.</AbstractText>
          <AbstractText Label="REVIEWER'S CONCLUSIONS" NlmCategory="CONCLUSIONS">Children fulfilling the diagnostic criteria for Kawasaki disease should be treated with IVIG (2 gm/kg single dose) within 10 days of onset of symptoms.</AbstractText>
        </Abstract>
        <AuthorList CompleteYN="Y">
          <Author ValidYN="Y">
            <LastName>Oates-Whitehead</LastName>
            <ForeName>R M</ForeName>
            <Initials>RM</Initials>
            <AffiliationInfo>
              <Affiliation>Research Division, Royal College of Paediatrics, 50 Hallam Street, London, UK, W1W 6DE.</Affiliation>
            </AffiliationInfo>
          </Author>
          <Author ValidYN="Y">
            <LastName>Baumer</LastName>
            <ForeName>J H</ForeName>
            <Initials>JH</Initials>
          </Author>
          <Author ValidYN="Y">
            <LastName>Haines</LastName>
            <ForeName>L</ForeName>
            <Initials>L</Initials>
          </Author>
          <Author ValidYN="Y">
            <LastName>Love</LastName>
            <ForeName>S</ForeName>
            <Initials>S</Initials>
          </Author>
          <Author ValidYN="Y">
            <LastName>Maconochie</LastName>
            <ForeName>I K</ForeName>
            <Initials>IK</Initials>
          </Author>
          <Author ValidYN="Y">
            <LastName>Gupta</LastName>
            <ForeName>A</ForeName>
            <Initials>A</Initials>
          </Author>
          <Author ValidYN="Y">
            <LastName>Roman</LastName>
            <ForeName>K</ForeName>
            <Initials>K</Initials>
          </Author>
          <Author ValidYN="Y">
            <LastName>Dua</LastName>
            <ForeName>J S</ForeName>
            <Initials>JS</Initials>
          </Author>
          <Author ValidYN="Y">
            <LastName>Flynn</LastName>
            <ForeName>I</ForeName>
            <Initials>I</Initials>
          </Author>
        </AuthorList>
        <Language>eng</Language>
        <PublicationTypeList>
          <PublicationType UI="D016428">Journal Article</PublicationType>
          <PublicationType UI="D017418">Meta-Analysis</PublicationType>
          <PublicationType UI="D016454">Review</PublicationType>
          <PublicationType UI="D000078182">Systematic Review</PublicationType>
        </PublicationTypeList>
      </Article>
      <MedlineJournalInfo>
        <Country>England</Country>
        <MedlineTA>Cochrane Database Syst Rev</MedlineTA>
        <NlmUniqueID>100909747</NlmUniqueID>
        <ISSNLinking>1361-6137</ISSNLinking>
      </MedlineJournalInfo>
      <ChemicalList>
        <Chemical>
          <RegistryNumber>0</RegistryNumber>
          <NameOfSubstance UI="D016756">Immunoglobulins, Intravenous</NameOfSubstance>
        </Chemical>
      </ChemicalList>
      <CitationSubset>IM</CitationSubset>
      <MeshHeadingList>
        <MeshHeading>
          <DescriptorName UI="D002648" MajorTopicYN="N">Child</DescriptorName>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName UI="D006801" MajorTopicYN="N">Humans</DescriptorName>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName UI="D016756" MajorTopicYN="N">Immunoglobulins, Intravenous</DescriptorName>
          <QualifierName UI="Q000627" MajorTopicYN="Y">therapeutic use</QualifierName>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName UI="D009080" MajorTopicYN="N">Mucocutaneous Lymph Node Syndrome</DescriptorName>
          <QualifierName UI="Q000628" MajorTopicYN="Y">therapy</QualifierName>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName UI="D016032" MajorTopicYN="N">Randomized Controlled Trials as Topic</DescriptorName>
        </MeshHeading>
      </MeshHeadingList>
      <NumberOfReferences>90</NumberOfReferences>
    </MedlineCitation>
    <PubmedData>
      <History>
        <PubMedPubDate PubStatus="pubmed">
          <Year>2003</Year>
          <Month>10</Month>
          <Day>30</Day>
          <Hour>5</Hour>
          <Minute>0</Minute>
        </PubMedPubDate>
        <PubMedPubDate PubStatus="medline">
          <Year>2004</Year>
          <Month>5</Month>
          <Day>5</Day>
          <Hour>5</Hour>
          <Minute>0</Minute>
        </PubMedPubDate>
        <PubMedPubDate PubStatus="entrez">
          <Year>2003</Year>
          <Month>10</Month>
          <Day>30</Day>
          <Hour>5</Hour>
          <Minute>0</Minute>
        </PubMedPubDate>
      </History>
      <PublicationStatus>ppublish</PublicationStatus>
      <ArticleIdList>
        <ArticleId IdType="pubmed">14584002</ArticleId>
        <ArticleId IdType="doi">10.1002/14651858.CD004000</ArticleId>
        <ArticleId IdType="pmc">PMC6544780</ArticleId>
      </ArticleIdList>
    </PubmedData>
  </PubmedArticle>

The file contains records for a number of articles. The excerpt above is only the first one, so it is not complete. I used the XML extension in VS Code to look for format errors, but the file seemed okay.

Your link to Google Drive is the generic front page. Anyone who opens it will get their own Google Drive page. You should probably just include the first few lines of your XML file. — Thom Wiggers
It's possible that this xml is malformed in which case you would have more success using Beautiful Soup as it would be more resilient in this case, while using lxml as the parser. — Frostyfeet909

Frostyfeet909 · Accepted Answer · 2022-07-27T12:29:56

It's hard to say without the full file, but parsing this snippet with xml.etree.ElementTree raised xml.etree.ElementTree.ParseError: no element found, which makes me think the XML may be malformed.
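As a minimal sketch of where a "no element found" error can come from, assuming the cause is simply that the posted excerpt stops before the closing </PubmedArticleSet> tag: the shortened document below is a hypothetical stand-in for the excerpt, and parse_error is a helper name made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the excerpt: the root element is never closed.
TRUNCATED = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    "<PubmedArticleSet>\n"
    "  <PubmedArticle>\n"
    '    <PMID Version="1">14584002</PMID>\n'
    "  </PubmedArticle>\n"
)


def parse_error(text):
    """Return the ParseError message for text, or None if it parses."""
    try:
        ET.fromstring(text)
    except ET.ParseError as exc:
        return str(exc)
    return None


# The truncated text fails; closing the root element makes it parse.
print(parse_error(TRUNCATED))
print(parse_error(TRUNCATED + "</PubmedArticleSet>\n"))
```

Note that this is a different message from the "syntax error: line 1, column 0" in the question, so the truncation of the excerpt may not explain the original failure.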

If the XML is malformed, you can use Beautiful Soup, which is more resilient to bad XML. With it, the snippet returned the expected result:

import bs4

xml = ...  # the XML text from the question

# The "xml" feature requires lxml to be installed
soup = bs4.BeautifulSoup(xml, features="xml")
funny_chemical = soup.find("NameOfSubstance").text

print(funny_chemical)

Prints:

Immunoglobulins, Intravenous
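
Because the error in the question points at line 1, column 0, it may also be worth looking at the first bytes of the file on disk before parsing. The sketch below is only a rough diagnostic, not a fix. peek and looks_like_xml are hypothetical helper names, and the temporary file stands in for the real path.

```python
import os
import tempfile


def peek(path, n=64):
    """Return the first n bytes of the file at path."""
    with open(path, "rb") as f:
        return f.read(n)


def looks_like_xml(head):
    """Rough check: does the data start with '<', ignoring a UTF-8 BOM?"""
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]
    return head.startswith(b"<")


# Demo on a temporary file; in Colab, pass the real path instead.
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(b'<?xml version="1.0" encoding="utf-8"?>\n<PubmedArticleSet/>\n')

head = peek(tmp.name)
os.remove(tmp.name)
print(head[:20], looks_like_xml(head))
```

If the first bytes are not "<?xml" or at least "<", the file on disk is probably not the plain XML text shown in the question, and that would be worth ruling out before switching parsers.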
