0
votes

Html Agility Pack seems to build element nodes from markup inside a TEXTAREA, even though that markup is plain text and not real elements. For example:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1255">
<title>Sample</title>
</head>
<body>
<TEXTAREA>Text in the <div>hello</div>area</TEXTAREA>
</body>
</html>

This yields a "div" child node under the "textarea" node, whereas browsers treat everything inside the textarea as text.

Is there a way to compel html-agility-pack to behave the same way?

Clarification

I don't want the node to be created in the first place. If I run doc.DocumentNode.SelectNodes("//div") I want it to yield nothing. Right now I have to use doc.DocumentNode.SelectNodes("//div[not(ancestor::textarea)]"), and I have to repeat that predicate in every select I perform to avoid phantom nodes.
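To show what that workaround looks like when it is centralized rather than repeated in every XPath expression, here is a minimal sketch. The class TextAreaSafeSelect and the method SelectOutsideTextArea are hypothetical names made up for illustration and are not part of Html Agility Pack. The sketch only filters results after the fact, so the phantom nodes still exist in the tree.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of Html Agility Pack.
public static class TextAreaSafeSelect
{
    // Runs the XPath query, then drops every match that has a
    // <textarea> ancestor (the "phantom" nodes described above).
    public static IEnumerable<HtmlNode> SelectOutsideTextArea(this HtmlNode root, string xpath)
    {
        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection nodes = root.SelectNodes(xpath);
        if (nodes == null)
        {
            return Enumerable.Empty<HtmlNode>();
        }

        return nodes.Where(n => !n.Ancestors("textarea").Any());
    }
}
```

Usage would then be doc.DocumentNode.SelectOutsideTextArea("//div") instead of adding the not(ancestor::textarea) predicate by hand each time.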

Any ideas?

2 Answers

0
votes

Use the InnerText property to get just the text of a node. It concatenates the text of all child nodes as well (in this case the div), so the text inside the div is included while the div tags themselves are dropped.

var textArea = doc.DocumentNode.SelectSingleNode("//textarea");

string text = textArea.InnerText;
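
If the literal markup inside the textarea matters, closer to what a browser would show, InnerHtml may be a better fit than InnerText. The following is a rough sketch under that assumption, using a shortened version of the sample document from the question; the class name TextAreaContentDemo is made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class TextAreaContentDemo
{
    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><body><TEXTAREA>Text in the <div>hello</div>area</TEXTAREA></body></html>");

        // Element names are lower-cased by the parser, so "//textarea" matches <TEXTAREA>.
        HtmlNode textArea = doc.DocumentNode.SelectSingleNode("//textarea");
        if (textArea == null)
        {
            Console.WriteLine("No textarea found.");
            return;
        }

        // InnerHtml serializes the node's content, so the <div> tags are kept
        // whether or not the parser turned them into a child node.
        Console.WriteLine(textArea.InnerHtml);
    }
}
```

This does not remove the phantom div from the tree; it only recovers the textarea content as written.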
0
votes

This issue has been fixed by the kind folks at zzzprojects.

The fix is available, and I have tested it, in version 1.8.2.

You can see the ticket here: Issue 183