3
votes

How can I extract attribute value from XML element using XML Extractor in U-SQL for my Azure data lake analytic job.

Update: More details about the issue

My XML file looks like this:

<?xml version="1.0" encoding="utf-8"?>
<testelement testatr="xyz">
</testelement>

Here is my U-SQL Script:

DECLARE @testfile string = "sample2.xml";
@logText =
EXTRACT log string            
FROM @testfile
USING Extractors.Tsv();

@gethID = SELECT Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(@logText.log, "testelement/attribute::testatr").ElementAt(0) AS siteName FROM @logText;
OUTPUT @gethID TO "result.out" USING Outputters.Tsv(); 

After debugging I observed, exception occurred when Load method of XPath class try to load:

"<?xml version=1.0 encoding=utf-8?>"

Here is an exception:

Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.ScopeDebugException was unhandled
Message: An unhandled exception of type 'Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.ScopeDebugException' occurred in Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.dll
Additional information: {"diagnosticCode":195887111,"severity":"Error","component":"RUNTIME","source":"User","errorId":"E_RUNTIME_USER_EXPRESSIONEVALUATION","message":"Error while evaluating expression Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(log, \"testelement/attribute::testatr\").ElementAt(0)","description":"Inner exception from user expression: '1.0' is an unexpected token. The expected token is '\"' or '''. Line 1, position 15.\nCurrent row dump: \tlog:\t\"<?xml version=1.0 encoding=utf-8?>\"
\n","resolution":"","helpLink":"","details":"==== Caught exception System.Xml.XmlException\n\n   at System.Xml.XmlTextReaderImpl.Throw(Exception e)
\n   at System.Xml.XmlTextReaderImpl.ParseXmlDeclaration(Boolean isTextDecl)
\n   at System.Xml.XmlTextReaderImpl.Read()
\n   at System.Xml.XmlLoader.Load(XmlDocument doc, XmlReader reader, Boolean preserveWhitespace)
\n   at System.Xml.XmlDocument.Load(XmlReader reader)
\n   at System.Xml.XmlDocument.LoadXml(String xml)
\n   at Microsoft.Analytics.Samples.Formats.Xml.XPath.Load(String xml)
\n   at Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(String xml, String xpath)
\n   at ___Scope_Generated_Classes___.SqlFilterTransformer_2.Process(IRow row, IUpdatableRow output) in c:\\workarea\\bswbigdata\\USQLAppForLogs\\USQLAppForLogs\\bin\\Debug\\A06D46624BBA798\\ReadBlobs.usql.Debug_A54F30D359F939C7\\__ScopeCodeGen__.dll.cs:line 53","internalDiagnostics":""}

Update 2:

After using quoting:false I am getting another exception:

Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.ScopeDebugException was unhandled
Message: An unhandled exception of type 'Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.ScopeDebugException' occurred in Microsoft.Cosmos.ScopeStudio.BusinessObjects.Debugger.dll
Additional information: {"diagnosticCode":195887111,"severity":"Error","component":"RUNTIME","source":"User","errorId":"E_RUNTIME_USER_EXPRESSIONEVALUATION","message":"Error while evaluating expression Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(log, \"testelement/attribute::testatr\").ElementAt(0)","description":"Inner exception from user expression: Root element is missing.\nCurrent row dump: \tlog:\t\"<?xml version=\"1.0\" encoding=\"utf-8\"?>\"
\n","resolution":"","helpLink":"","details":"==== Caught exception System.Xml.XmlException\n\n   at System.Xml.XmlTextReaderImpl.Throw(Exception e)
\n   at System.Xml.XmlTextReaderImpl.ParseDocumentContent()
\n   at System.Xml.XmlLoader.LoadDocSequence(XmlDocument parentDoc)
\n   at System.Xml.XmlDocument.Load(XmlReader reader)
\n   at System.Xml.XmlDocument.LoadXml(String xml)
\n   at Microsoft.Analytics.Samples.Formats.Xml.XPath.Load(String xml)
\n   at Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(String xml, String xpath)
\n   at ___Scope_Generated_Classes___.SqlFilterTransformer_2.Process(IRow row, IUpdatableRow output) in c:\\workarea\\bswbigdata\\USQLAppForLogs\\USQLAppForLogs\\bin\\Debug\\A06D46624BBA798\\ReadBlobs.usql.Debug_A54F30D359F939C7\\__ScopeCodeGen__.dll.cs:line 53","internalDiagnostics":""}
2

2 Answers

3
votes

You identify the values with XPath expressions. An attribute is queried with @attr_name (or the full axis expression attribute::attr_name).

UPDATE based on question update:

It looks like that the parser somehow gets confused by the " inside the XML declaration. I see that you use the built-in Tsv() extractor that per default currently processes " inside a field as a quoting character and then drops it. This is a bug we are planning on fixing.

Until then I suggest that you use Extractors.Tsv(quoting:false).

Also please make sure that your XML document does not contain any CR/LF if you are using any of the built-in text extractors (Extractors.*) and that it does not contain a tab-value if you are using .Tsv.

If your XML will contain CR and/or LF then you will have to use a custom extractor to use a different row-delimiter. If you need to do that, please leave me a message, because I am currently tracking such requests to see what we can improve in the built-in extractors.

If your file only contains a single XML document (as opposed to several rows of XML documents) I would suggest to use the XML extractor that is also part of the XML sample on GitHub.

0
votes

On the new error message: It looks like the XML document contains a CR or LF or both after the XML declaration and thus the Tsv() extractor splits the XML document. See my comment in the previous answer:

Also please make sure that your XML document does not contain any CR/LF if you are using any of the built-in text extractors (Extractors.*) and that it does not contain a tab-value if you are using .Tsv.

If your XML will contain CR and/or LF then you will have to use a custom extractor to use a different row-delimiter. If you need to do that, please leave me a message, because I am currently tracking such requests to see what we can improve in the built-in extractors.