Background: I'm using Ruby's Nokogiri gem to parse an XML file. The problem I'm having is that the SAX parser returns an incomplete result when a string contains &gt;, which is the HTML encoding for >. For example:

<element>PART1PART2</element> #=> returns "PART1PART2"
<element>PART3&gt;PART4</element> #=> returns "PART3"

My parser looks like this:

require 'nokogiri'
class MySample < Nokogiri::XML::SAX::Document
  def characters(string)
    puts string
  end
end
# Create a new parser
parser = Nokogiri::XML::SAX::Parser.new(MySample.new)
# Feed the parser some XML
parser.parse_file(ARGV[0])

Research: If a string contains >, then Nokogiri thinks that's the end of the string. Having a > within a string would be considered poorly formatted XML. However, my XML is properly formatted, but Nokogiri thinks that &gt; marks the end of the string. This would mean that Nokogiri is interpreting the HTML (converting &gt; to >) before it parses the string.

Question: Why is Nokogiri interpreting the HTML for &gt;, and how can I ensure it parses the full string?


1-YEAR UPDATE (FWIW)

It's been over a year since I first posted this question, and at this point in time I have not come across a definitive answer to my original question. Therefore I thought I'd provide a little update for anyone who comes across this post in the future. Please keep in mind that I am strictly speaking of SAX parsing, not DOM parsing.

Major points:

  • The original question concerns Nokogiri v1.6.1. The most current release (at the time of this writing) is v1.6.6, but the issue still has not been resolved.

  • There is, however, a workaround for this problem (see matt's comment below), but it will be tricky to implement if not all strings are formatted the same way (e.g. one string contains &gt; once, another string contains &gt; twice, etc.).

  • I briefly tested another Ruby parser called Ox and found out that it does not have the same issue as Nokogiri. Indeed it correctly handles strings that contain &gt;. Additionally, it can also handle strings that contain >. As a bonus, it appears to perform faster than Nokogiri (but it's not without its faults).

Bottom line:

If you are having a similar issue with Nokogiri, then I suggest checking out Ox as a possible alternative. I'm not going to argue that one gem is better than the other (that's not what this is for). I can, however, vouch for Ox in terms of its ability to handle strings that contain &gt; and/or >.

+1 for asking the question in a nice way. - Arup Rakshit
This works okay for me. Note that the characters method “might be called multiple times given one contiguous string of characters” and in this case (at least for me) it is called three times – once with PART3, once for the entity (> is passed in) and once with PART4, so it looks like Nokogiri (or libxml) is splitting the string up around the entity. Are you only looking at what’s passed in the first time it’s called? You will need to buffer multiple calls to characters to form the complete string. - matt
You're absolutely correct. This is the workaround I ended up implementing, but it's not ideal. It works fine when every string has the same number of >'s, but my strings didn't. I got it to work, but it's extremely ugly, so I was hoping to turn off the HTML interpretation to make things cleaner. - seane
Also: > is actually valid here (< wouldn’t be, but > is okay). - matt
You're right. However, W3Schools says, "The greater than character is legal, but it is a good habit to replace it." I have taken this precaution, so (unless I'm totally missing something) I'm a little disappointed that Nokogiri doesn't handle it accordingly. - seane

1 Answer

You don't say why you're trying to use a SAX parser. Nokogiri handles the document correctly when parsing it using the DOM parser:

require 'nokogiri'

doc = Nokogiri::XML(<<EOT)
<root>
  <element>PART1PART2</element>
  <element>PART3&gt;PART4</element>
</root>
EOT

puts doc.to_xml
# >> <?xml version="1.0"?>
# >> <root>
# >>   <element>PART1PART2</element>
# >>   <element>PART3&gt;PART4</element>
# >> </root>

You might want to check with the developers on their mailing list.