Background: I'm using Ruby's Nokogiri gem to parse an XML file. The problem I'm having is that the SAX parser returns an incomplete result when a string contains &gt;, which is the HTML encoding for >. For example:
<element>PART1PART2</element> #=> returns "PART1PART2"
<element>PART3&gt;PART4</element> #=> returns "PART3"
My parser looks like this:
require 'nokogiri'

class MySample < Nokogiri::XML::SAX::Document
  def characters(string)
    puts string
  end
end

# Create a new parser
parser = Nokogiri::XML::SAX::Parser.new(MySample.new)

# Feed the parser some XML
parser.parse_file(ARGV[0])
Research: If a string contains a literal >, then Nokogiri thinks that's the end of the string. Having a > within a string would be considered poorly formatted XML. However, my XML is properly formatted (it uses &gt;), yet Nokogiri still thinks that &gt; marks the end of the string. This would mean that Nokogiri is interpreting the HTML (converting &gt; to >) before it parses the string.
Question: Why is Nokogiri interpreting the HTML for &gt;, and how can I ensure it parses the full string?
1-YEAR UPDATE (FWIW)
It's been over a year since I first posted this question, and at this point in time I have not come across a definitive answer to my original question. Therefore I thought I'd provide a little update for anyone who comes across this post in the future. Please keep in mind that I am strictly speaking of SAX parsing, not DOM parsing.
Major points:
The original question is in regards to Nokogiri v1.6.1. The most current release (at the time of this writing) is v1.6.6, but the issue still has not been resolved.
There is, however, a workaround for this problem (see matt's comment below), but it will be tricky to implement if not all strings are formatted the same way (e.g. one string contains &gt; once, another string contains &gt; twice, etc.).
I briefly tested another Ruby parser called Ox and found that it does not have the same issue as Nokogiri. It correctly handles strings that contain &gt;. Additionally, it can also handle strings that contain a literal >. As a bonus, it appears to perform faster than Nokogiri (but it's not without its faults).
Bottom line:
If you are having a similar issue with Nokogiri, then I suggest checking out Ox as a possible alternative. I'm not going to argue that one gem is better than the other (that's not what this is for). I can however vouch for Ox in terms of its ability to handle strings that contain &gt; and/or >.
The characters method “might be called multiple times given one contiguous string of characters” and in this case (at least for me) it is called three times: once with PART3, once for the entity (> is passed in) and once with PART4, so it looks like Nokogiri (or libxml) is splitting the string up around the entity. Are you only looking at what’s passed in the first time it’s called? You will need to buffer multiple calls to characters to form the complete string. - matt
[...] >'s, but my strings didn't. I got it to work, but it's extremely ugly, so I was hoping to turn off the HTML interpretation to make things cleaner. - seane
> is actually valid here (< wouldn’t be, but > is okay). - matt