
I'm receiving XML-formatted messages over a TCP socket and trying to parse them with Nokogiri. If I could rely on a single, complete root element in my buffer, everything would be straightforward.

Trivial example:

<doc><a>some long text ....</a><b>more text</b></doc>

=> #<Nokogiri::XML::Document:0x1326a30 name="document" children=[#<Nokogiri::XML::Element:0x1325fcc name="doc" children=[#<Nokogiri::XML::Element:0x1325aa4 name="a" children=[#<Nokogiri::XML::Text:0x13255f4 "some long text ....">]>, #<Nokogiri::XML::Element:0x1324f3c name="b" children=[#<Nokogiri::XML::Text:0x1324b68 "more text">]>]>]>

Everything is as expected.

Long messages may be split across packets, leaving the buffer holding an incomplete tag:

<doc><a>exceptionally long text ....

=> #<Nokogiri::XML::Document:0x12c45ec name="document" children=[#<Nokogiri::XML::Element:0x12c2968 name="doc" children=[#<Nokogiri::XML::Element:0x12c210c name="a" children=[#<Nokogiri::XML::Text:0x12c1cc0 "exceptionally long text">]>]>]>

Still as expected: Nokogiri reports Nokogiri::XML::SyntaxError: Premature end of data in tag doc line 1, so we can wait for more data to arrive in the buffer.

However, short messages may be clustered within a single packet and arrive at once:

<doc><a>text</a></doc><doc><a>other text</a></doc>

=> #<Nokogiri::XML::Document:0x1312cd8 name="document" children=[#<Nokogiri::XML::Element:0x1312814 name="doc" children=[#<Nokogiri::XML::Element:0x1312594 name="a" children=[#<Nokogiri::XML::Text:0x1312288 "text">]>]>]>

The second message is not parsed, and Nokogiri reports Nokogiri::XML::SyntaxError: Extra content at the end of the document.

I can't see any way to get Nokogiri to return the extra content so that I can continue parsing it. This may be a limitation of the underlying libxml2 or of Nokogiri's interface to the library. String#scan doesn't give string indexes (which I would need to split the messages and preserve the extra text), and Regexp#match won't match globally. Any ideas on how best to extract all of the complete messages from my buffer and leave the trailing incomplete one?

You need to give a sample of how you're reading your content. Nokogiri only reads a file stream or a string buffer, so the incomplete content is a result of whatever is feeding data to Nokogiri. - the Tin Man
You also need to give an example of how you are calling Nokogiri when creating the parsed document. - the Tin Man
I'm using EventMachine and collecting data into a buffer from a receive_data callback, which is why I can't control how many messages arrive or how complete they are within my buffer. I'm using Nokogiri::XML(buffer), about as generic as you can get with a Nokogiri parse: def receive_data(data); @receive_buffer << data; document = Nokogiri::XML(@receive_buffer) - ct00

2 Answers


Nokogiri expects an IO stream or a string. From the docs for Nokogiri::HTML::Document.parse and Nokogiri::XML::Document.parse:

parse(string_or_io, url = nil, encoding = nil, options = XML::ParseOptions::DEFAULT_HTML)

Parse HTML. thing may be a String, or any object that responds to read and close such as an IO, or StringIO.

"thing" should actually be "string_or_io", to match their example, but you get the idea.

If you can add more information about how you're retrieving the content and parsing it we might be able to give more help.
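In the meantime, here is a minimal sketch of one way to split the buffer before handing anything to Nokogiri. It rests on assumptions that go beyond the question: every message uses the same root tag (doc, as in the examples), root elements never nest, and the text "</doc>" never appears inside a CDATA section or comment. The method name extract_messages and its root: parameter are hypothetical names chosen for illustration. String#index takes a start offset, which gives the positions that String#scan does not.

```ruby
# Split a receive buffer into complete messages plus an unparsed remainder.
# Assumptions (not from the original question): all messages share one root
# tag, root elements do not nest, and the closing tag text never occurs
# inside CDATA or comments.
def extract_messages(buffer, root: "doc")
  closing  = "</#{root}>"
  messages = []
  start    = 0

  # Walk the buffer, cutting just after each closing root tag.
  while (stop = buffer.index(closing, start))
    stop += closing.length
    messages << buffer[start...stop].strip
    start = stop
  end

  # Whatever follows the last closing tag is an incomplete message (or empty).
  [messages, buffer[start..-1]]
end

buffer = "<doc><a>text</a></doc><doc><a>other text</a></doc><doc><a>exceptionally lo"
messages, rest = extract_messages(buffer)
# messages => ["<doc><a>text</a></doc>", "<doc><a>other text</a></doc>"]
# rest     => "<doc><a>exceptionally lo"
```

In a receive_data callback, each entry in messages could then be passed to Nokogiri::XML on its own, and @receive_buffer replaced with rest so the incomplete tail is kept until more data arrives.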
