Loading HTML document in DOM XML parser in VBScript fails if the <!DOCTYPE html> tag is present

Question

I am currently attempting to use VBScript to perform batch modifications to HTML files. To do this, I'm using the Microsoft.XMLDOM object. It is failing to load my HTML file as an XML document. After some experimenting, it appears that the following tag on the first line is the culprit:

<!DOCTYPE html>

If this line is removed, my script works as expected. If this line is included, the file does not load. No particular error message appears, but attempting to get anything out of the XMLDOM object returns nothing, which is the same behavior as when the file the object attempts to load doesn't exist.

Does anyone know why this occurs and how to work around this? I cannot remove this tag from my files as they are HTML documents and they are routinely regenerated by another application.

Here is a sample of my code:

strFilePath = WScript.Arguments(0)
strTitlePrefix = WScript.Arguments(1)

Set objXMLDoc = CreateObject("Microsoft.XMLDOM")
objXMLDoc.Async = False

objXMLDoc.load(strFilePath)

Set objDoc = objXMLDoc.documentElement
Set objNodes = objDoc.selectNodes("//title")
For Each thisNode in objNodes
  OriginalTitle = thisNode.text
  NewTitle = strTitlePrefix & OriginalTitle
  thisNode.text = NewTitle
Next

It fails at this line:

Set objNodes = objDoc.selectNodes("//title")

This is the error message:

Microsoft VBScript runtime error: Object required: 'objDoc'

The code does what I expect it to do if I remove the tag at the top of the document it's trying to read, so I know that the problem is that this tag causes it to think the file is not an XML document.

Microsoft.XMLDOM is deprecated, use Msxml2.DOMDocument.6.0 instead. For further help with your code: please show your code. — Ansgar Wiechers
@AnsgarWiechers I just tried my code loading the Msxml2.DOMDocument.6.0 object instead and unfortunately it made no difference. Thanks. — Quote

FrankM · Accepted Answer · 2018-07-06T06:33:28

First, you can narrow down the cause of the problem by adding some error checking after load:

objXMLDoc.load(strFilePath)
If objXMLDoc.parseError.errorCode <> 0 Then
   MsgBox "ERROR when loading " & strFilePath & ": " & objXMLDoc.parseError.reason
End If

(Depending on your VBScript environment, you may have to use something other than MsgBox.)
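
If you run the script with cscript, a slightly more detailed check can also help. The following is only a sketch: it uses WScript.Echo instead of MsgBox and also prints the line, column and source text reported by parseError. The path is the same sample path used further down, so adjust it to your file.

```vbscript
' Sketch: report a load failure together with its location in the file.
strFilePath = "C:\Temp\test.html"   ' sample path, adjust as needed

Set objXMLDoc = CreateObject("Msxml2.DOMDocument.6.0")
objXMLDoc.async = False
objXMLDoc.load strFilePath

With objXMLDoc.parseError
    If .errorCode <> 0 Then
        ' reason, line, linepos and srcText are properties of the parseError object
        WScript.Echo "ERROR when loading " & strFilePath & vbCrLf & _
                     "Reason: " & .reason & vbCrLf & _
                     "Line " & .line & ", column " & .linepos & vbCrLf & _
                     "Source: " & .srcText
    End If
End With
```

The line and column usually make it obvious whether the parser stopped at the DOCTYPE or somewhere later in the markup.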

You will probably get the error message

error: DTD is prohibited.

The reason is that loading DTD syntax (such as !DOCTYPE) is prohibited by default in MSXML 6.0. See MSXML Security Overview for details. Here is the relevant part:

Some parts of XML (such as DTDs and inline schemas) are inherently risky. In the default installation configuration of MSXML 6.0, these features have been disabled. You are free to enable these features, but first you should ensure that the security risks associated with them do not apply to you.

If you attempt to load a DTD without explicitly enabling ProhibitDTD Property, you will receive the following error:

error: DTD is prohibited.

If you add the line

objXMLDoc.setProperty "ProhibitDTD", False

before loading, the "DTD is prohibited" error will no longer occur.

You will most likely also have to add the line

objXMLDoc.validateOnParse = False

before loading, in case your HTML file does not contain the full HTML DTD (which it usually does not).

To summarize, here is the complete code:

strFilePath = "C:\Temp\test.html"

Set objXMLDoc = CreateObject("Msxml2.DOMDocument.6.0")
objXMLDoc.Async = False
objXMLDoc.setProperty "ProhibitDTD", False
objXMLDoc.validateOnParse = False
objXMLDoc.load(strFilePath)
If objXMLDoc.parseError.errorCode <> 0 Then
   MsgBox "ERROR when loading " & strFilePath & ": " & objXMLDoc.parseError.reason
End If

Set objDoc = objXMLDoc.documentElement

MsgBox TypeName(objDoc)
Set objNodes = objDoc.selectNodes("//title")
MsgBox objNodes.Length

It can successfully load and parse this file:

<!DOCTYPE html>
<html>
<head>
<title>Title of the document</title>
</head>

<body>
The content of the document......
</body>

</html>

The last line will output "1", as there is only one <title> tag.
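
Putting this together with the loop from the question, the title-prefix task could look roughly like the sketch below. Note the assumptions: it writes the result back to the same file with save, which is my addition and not something the question shows, and save writes an XML serialization, so the formatting may differ from the original HTML. It also assumes the <html> element declares no default namespace; otherwise "//title" will not match anything.

```vbscript
' Sketch: prefix every <title> in an XML-compliant HTML file and save it.
Option Explicit

Dim strFilePath, strTitlePrefix, objXMLDoc, objNodes, thisNode

strFilePath = WScript.Arguments(0)
strTitlePrefix = WScript.Arguments(1)

Set objXMLDoc = CreateObject("Msxml2.DOMDocument.6.0")
objXMLDoc.async = False
objXMLDoc.setProperty "ProhibitDTD", False   ' allow the <!DOCTYPE html> line
objXMLDoc.validateOnParse = False            ' no DTD to validate against

' load returns False when the file is missing or not well-formed
If Not objXMLDoc.load(strFilePath) Then
    WScript.Echo "ERROR when loading " & strFilePath & ": " & objXMLDoc.parseError.reason
    WScript.Quit 1
End If

Set objNodes = objXMLDoc.selectNodes("//title")
For Each thisNode In objNodes
    thisNode.text = strTitlePrefix & thisNode.text
Next

' Assumption: overwriting the input file is acceptable; use another path to keep the original
objXMLDoc.save strFilePath
```

Run it with something like cscript script.vbs C:\Temp\test.html "Draft: " (the script name is just a placeholder). Try it on a copy first, since the file is overwritten.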

Note that there is one drawback, however: HTML is not XML, so not every HTML file is a well-formed XML file. For example, if the sample HTML file above contained a <br> tag (without a matching </br>), loading would fail. Only XML-compliant HTML files can be opened using the above method.
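
You can see this limitation without touching any files. The sketch below feeds two small made-up strings to loadXML: one with an unclosed <br> and one with the self-closing form <br/>. The exact wording of the error message depends on MSXML, so I am not quoting it here.

```vbscript
' Sketch: an unclosed <br> is valid HTML but not well-formed XML.
Set objTest = CreateObject("Msxml2.DOMDocument.6.0")
objTest.async = False

' loadXML returns False here because <br> is never closed
If Not objTest.loadXML("<html><body>one<br>two</body></html>") Then
    WScript.Echo "Unclosed <br>: " & objTest.parseError.reason
End If

' The self-closing form is well-formed XML, so this string loads
If objTest.loadXML("<html><body>one<br/>two</body></html>") Then
    WScript.Echo "Self-closed <br/>: loaded"
End If
```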
