I am trying to extract search keywords from a SOAP XML schema with BeautifulSoup, but I cannot figure out how to extract the value attributes.

I have tried using soapSoup.findAll, but it will not let me extract the value attribute.

Here is what I have so far:

import requests
from bs4 import BeautifulSoup

soap = requests.get('http://ecp.iedadata.org/soap_search_schema.xsd')
soapXML = soap.content.decode("utf-8")
soapSoup = BeautifulSoup(soapXML, "xml")
level1 = soapSoup.findAll('xs:attribute', {'name':'level1'})[0]
level1['value']  # raises KeyError: 'value'

This is where I have an issue. According to the BeautifulSoup documentation, this should output all the 'value' attributes.

print(level1):

<xs:attribute name="level1" use="optional">
<xs:simpleType>
<xs:restriction base="xs:string">
<xs:enumeration value=""/>
<xs:enumeration value="alteration"/>
<xs:enumeration value="igneous"/>
<xs:enumeration value="metamorphic"/>
<xs:enumeration value="notfound"/>
<xs:enumeration value="ore"/>
<xs:enumeration value="sedimentary"/>
<xs:enumeration value="vein"/>
<xs:enumeration value="xenolith"/>
</xs:restriction>
</xs:simpleType>
</xs:attribute>

So as you can see, I am trying to get the text from the value attribute in all of the xs:enumeration tags. The end result would be a list of search terms for level1. i.e.:

(alteration, igneous, metamorphic, notfound, ore, sedimentary, vein, xenolith)

I cannot just select every xs:enumeration tag, because there are multiple keywords (e.g. level2, level3, SampleType, etc.) and each has different xs:enumeration values.

Here is the error on the last line (level1['value']):

KeyError                                  Traceback (most recent call last)
in
----> 1 level1test['value']

~/anaconda3/envs/py37/lib/python3.7/site-packages/bs4/element.py in __getitem__(self, key)
   1069         """tag[key] returns the value of the 'key' attribute for the tag,
   1070         and throws an exception if it's not there."""
-> 1071         return self.attrs[key]
   1072
   1073     def __iter__(self):

KeyError: 'value'

2 Answers

0
votes

Just replace level1['value'] with:

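import bs4

# Walk the children of level1, skipping whitespace text nodes.
for i in level1:
    if isinstance(i, bs4.element.Tag):
        # i is xs:simpleType; contents[1] is its xs:restriction child
        data = i.contents
        for k in data[1]:
            if isinstance(k, bs4.element.Tag):
                print(k['value'])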

Output:

alteration
igneous
metamorphic
notfound
ore
sedimentary
vein
xenolith
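
The KeyError in the question arises because level1 is the xs:attribute tag itself, whose only attributes are name and use; the value attributes sit on its xs:enumeration descendants. The loop above reaches them by position (data[1]), which depends on where the whitespace nodes fall. A minimal sketch of a less position-dependent variant follows, searching only inside level1 with find_all. It assumes the same "xml" parser as the question (which needs lxml installed) and uses a short inline excerpt in place of the downloaded schema; the level2 entry and its value are made up for illustration.

```python
from bs4 import BeautifulSoup

# Inline excerpt standing in for the downloaded schema (assumption: the real
# file declares the xs prefix the same way). The level2 block is hypothetical.
SAMPLE_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:attribute name="level1" use="optional">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:enumeration value=""/>
        <xs:enumeration value="alteration"/>
        <xs:enumeration value="igneous"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:attribute>
  <xs:attribute name="level2" use="optional">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:enumeration value="plutonic"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:attribute>
</xs:schema>"""

soup = BeautifulSoup(SAMPLE_XSD, "xml")

# Locate the one xs:attribute tag we care about, as in the question.
level1 = soup.find("xs:attribute", {"name": "level1"})

# Search only below that tag, so level2's enumerations are not picked up.
# The trailing "if" drops the empty-string entry.
level1_values = [
    tag["value"]
    for tag in level1.find_all("xs:enumeration")
    if tag["value"]
]
print(level1_values)
```

Because the search starts from level1 rather than from the whole document, the same two lines can be repeated for level2, level3 and so on by changing the name passed to find.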
0
votes

Simply use an attribute selector for value:

import requests
from bs4 import BeautifulSoup as bs

soap = requests.get('http://ecp.iedadata.org/soap_search_schema.xsd')
soapXML = soap.content.decode("utf-8")
soapSoup = bs(soapXML, "xml")
enumeration_values = [item['value'] for item in soapSoup.select("[value]") if item['value']]
print(enumeration_values)

A type selector would be marginally faster:

import requests
from bs4 import BeautifulSoup as bs

soap = requests.get('http://ecp.iedadata.org/soap_search_schema.xsd')
soapXML = soap.content.decode("utf-8")
soapSoup = bs(soapXML, "xml")
enumeration_values = [item['value'] for item in soapSoup.select("enumeration") if item['value']]
print(enumeration_values)
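
Both selectors above collect values from every keyword in the schema at once, whereas the question asks for them per keyword (level1, level2, ...). One way to keep them separate is sketched below. Note that it swaps BeautifulSoup for the standard library's xml.etree.ElementTree, so it is an alternative rather than a variant of the code above. The helper name enumerations_by_attribute and the inline sample (including the level2 value) are assumptions for illustration, not taken from the real schema.

```python
import xml.etree.ElementTree as ET

# Namespace URI bound to the xs prefix in XML Schema documents.
XS = "http://www.w3.org/2001/XMLSchema"
NS = {"xs": XS}

# Inline excerpt standing in for the downloaded schema; level2 is hypothetical.
SAMPLE_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:attribute name="level1" use="optional">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:enumeration value=""/>
        <xs:enumeration value="alteration"/>
        <xs:enumeration value="igneous"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:attribute>
  <xs:attribute name="level2" use="optional">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:enumeration value="plutonic"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:attribute>
</xs:schema>"""


def enumerations_by_attribute(xsd_text):
    """Map each named xs:attribute to its non-empty enumeration values."""
    root = ET.fromstring(xsd_text)
    result = {}
    for attr in root.iter(f"{{{XS}}}attribute"):
        name = attr.get("name")
        if name is None:
            continue  # skip attribute references without a name
        # Look only below this xs:attribute, and drop empty values.
        result[name] = [
            enum.get("value")
            for enum in attr.iterfind(".//xs:enumeration", NS)
            if enum.get("value")
        ]
    return result


keywords = enumerations_by_attribute(SAMPLE_XSD)
print(keywords["level1"])
```

With the real file, the text passed to the helper would be the soapXML string from the snippets above; whether every keyword in that schema is declared as an xs:attribute with inline enumerations is not confirmed here.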