I am trying to write a C# regular expression that matches URLs of the following forms:

  • https://www.test.com/help/about/index.aspx?at=eng&st=png...
  • http://www.test.com/help/about/index.aspx?at=eng&st=png...
  • www.test.com/help/about/index.aspx?at=eng&st=png...
  • test.com/help/about/index.aspx?at=eng&st=png...

My regular expression is:

^(http(s)?(:\/\/))?(www\.)?[a-zA-Z0-9-_\.]+/([-a-zA-Z0-9:%_\+.~#?&//=]*) 

It works fine when I test it in online C# regex testers, but when I use it in my code, I get a parsing error.

Code:

public SSLUrl(XElement configurationEntry)
{
    XAttribute xSsl = configurationEntry.Attribute("ssl");
    XAttribute xIgnore = configurationEntry.Attribute("ignore");

    mUseSSL = false;

    if (xSsl != null)
        bool.TryParse(xSsl.Value, out mUseSSL);

    mIgnore = false;

    if (xIgnore != null)
        bool.TryParse(xIgnore.Value, out mIgnore);

    mRegex = new Regex(HandleRootOperator(configurationEntry.Value),
        RegexOptions.Compiled | RegexOptions.IgnoreCase);
}

Sample XML file:

<?xml version="1.0"?>
<SSLSwitch>
<!-- Redirect status code for HTTP and HTTPs-->
  <http>301</http>
  <https>301</https>

  <!-- Do not change HTTP or HTTPS for anything under /system/ -->
  <url ignore="true">^~/system/</url>  

  <!-- Do not change HTTP or HTTPS for anything in the root folder -->
  <url ignore="true">^~/[^/]*\.</url>

 <url ignore="true">^(http(s)?(:\/\/))?(www\.)?[a-zA-Z0-9-_\.]+/([-a-zA-Z0-9:%_\+.?&//=]*)</url>
</SSLSwitch>

Error:

An error occurred while parsing EntityName. Line 45, position 85.

Description:

An unhandled exception occurred during the execution of the current web request. Please review the stack trace for more information about the error and where it originated in the code.

Exception Details:

System.Xml.XmlException: An error occurred while parsing EntityName. Line 45, position 85.

Source Error:

An unhandled exception was generated during the execution of the current web request. Information regarding the origin and location of the exception can be identified using the exception stack trace below.

Stack Trace:

[XmlException: An error occurred while parsing EntityName. Line 45, position 85.]
System.Xml.XmlTextReaderImpl.Throw(String res, Int32 lineNo, Int32 linePos) +189
System.Xml.XmlTextReaderImpl.HandleEntityReference(Boolean isInAttributeValue, EntityExpandType expandType, Int32& charRefEndPos) +7432563
System.Xml.XmlTextReaderImpl.ParseText(Int32& startPos, Int32& endPos, Int32& outOrChars) +1042
System.Xml.XmlTextReaderImpl.FinishPartialValue() +79
System.Xml.XmlTextReaderImpl.get_Value() +72
System.Xml.Linq.XContainer.ReadContentFrom(XmlReader r) +225
System.Xml.Linq.XContainer.ReadContentFrom(XmlReader r, LoadOptions o) +75
System.Xml.Linq.XElement.ReadElementFrom(XmlReader r, LoadOptions o) +722
System.Xml.Linq.XElement.Load(XmlReader reader, LoadOptions options) +79
System.Xml.Linq.XElement.Load(String uri, LoadOptions options) +137
Handlers.SSLSwitch..cctor() +102

Sharing the error you get would be a good starting point. Even better would be to show your code that raises the error. - Brendan Green
Please show your code, and the error that you are getting. - DeanOC
There is a forward slash after the plus that's not escaped with a backslash - jwatts1980
your XML is invalid - the & in your regex should be escaped to &amp; - Keith Hall
Put your pattern into a CDATA block and replace all \/ with /. - Wiktor Stribiżew

1 Answer

The & inside the regex is treated as the start of an XML entity reference, but the text that follows it (/) cannot be parsed as an entity name, hence the error.

I'd suggest wrapping the pattern in a CDATA section:

<url ignore="true"><![CDATA[^(https?://)?(www\.)?[\w.-]+/([-\w:%+.?&/=]*)]]></url>
                   ^-------------------------------------------------------^

Inside a CDATA section, characters such as & and < are treated as literals, so the parser does not try to read an entity reference.
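To make the difference concrete, here is a minimal sketch that parses three variants of a url element: a raw &, an &amp; escape, and a CDATA section. The pattern is shortened to the part that matters, and the class and method names (CDataDemo, TryReadValue) are hypothetical, not part of the asker's handler.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class CDataDemo
{
    // Returns the element's text value, or null when the XML is not well-formed.
    public static string TryReadValue(string xml)
    {
        try
        {
            return XElement.Parse(xml).Value;
        }
        catch (XmlException)
        {
            return null;
        }
    }

    public static void Main()
    {
        // Shortened, illustrative pattern: only the "&/" part matters here.
        string raw = "<url>[?&/=]*</url>";                 // bare & starts an entity reference
        string escaped = "<url>[?&amp;/=]*</url>";         // & written as an entity
        string cdata = "<url><![CDATA[[?&/=]*]]></url>";   // & is literal inside CDATA

        Console.WriteLine(TryReadValue(raw) ?? "XmlException");
        Console.WriteLine(TryReadValue(escaped));
        Console.WriteLine(TryReadValue(cdata));
    }
}
```

Both the escaped and the CDATA variants should hand the same pattern text to the Regex constructor, so which one to use is mostly a matter of readability. One caveat: a pattern that itself contains ]]> cannot sit inside a single CDATA section.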

Note that \w is almost the same as [a-zA-Z0-9_]: by default it also matches Unicode letters and digits, but if you pass the RegexOptions.ECMAScript flag when constructing the Regex object, it is exactly equal to that character class.
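A small sketch of that difference, using a non-ASCII letter as the probe. WordCharDemo and IsWordChar are hypothetical names chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordCharDemo
{
    // True when the whole input is a single \w character under the given options.
    public static bool IsWordChar(string s, RegexOptions options)
    {
        return Regex.IsMatch(s, @"^\w$", options);
    }

    public static void Main()
    {
        string eAcute = "\u00E9"; // Latin small letter e with acute

        Console.WriteLine(IsWordChar("a", RegexOptions.None));          // True
        Console.WriteLine(IsWordChar(eAcute, RegexOptions.None));       // True: Unicode letter
        Console.WriteLine(IsWordChar(eAcute, RegexOptions.ECMAScript)); // False: ASCII only
    }
}
```

For host names and query strings that are plain ASCII, the distinction rarely matters, but it is worth knowing if the pattern is meant to reject non-ASCII input.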

Also, a forward slash (/) does not have to be escaped, and arguably should not be, since it has no special meaning in .NET regex. In PHP or Perl, it is often used as a regex delimiter to separate the action, pattern, and modifiers. In .NET, you use inline modifiers or RegexOptions flags to change how the regex behaves, so / is not needed to delimit those parts.
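As a quick check of the point about slashes, the sketch below compares an escaped and an unescaped version of the same prefix pattern, and shows an inline modifier doing the job a trailing /i flag would do in Perl. SlashDemo and SameResult are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SlashDemo
{
    // True when the escaped and unescaped patterns agree on the input.
    public static bool SameResult(string input)
    {
        bool escaped = Regex.IsMatch(input, @"^https?:\/\/");
        bool plain = Regex.IsMatch(input, @"^https?://");
        return escaped == plain;
    }

    public static void Main()
    {
        Console.WriteLine(Regex.IsMatch("http://test.com", @"^https?://")); // True
        Console.WriteLine(SameResult("http://test.com"));                   // True
        Console.WriteLine(SameResult("test.com"));                          // True

        // An inline modifier takes the place of a trailing /i flag.
        Console.WriteLine(Regex.IsMatch("HTTP://test.com", @"(?i)^https?://")); // True
    }
}
```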

I also removed the unnecessary groupings. I do not understand why // is used in the last character class, so I replaced it with / (a // inside a character class still matches only one /). If you need to match a literal backslash, use \\ in the character class.
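The following sketch checks the character-class remarks and runs the simplified pattern against the four URL shapes from the question (with the trailing "..." dropped for the test inputs). CharClassDemo and its members are hypothetical names, and the same RegexOptions as in the question's constructor are assumed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharClassDemo
{
    // The simplified pattern from the answer, compiled with the question's options.
    public static readonly Regex Simplified = new Regex(
        @"^(https?://)?(www\.)?[\w.-]+/([-\w:%+.?&/=]*)",
        RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Length of what a doubled slash in a character class consumes from "//".
    public static int DoubledSlashMatchLength()
    {
        return Regex.Match("//", "^[//]").Value.Length;
    }

    public static void Main()
    {
        Console.WriteLine(DoubledSlashMatchLength());        // 1: [//] is the same set as [/]
        Console.WriteLine(Regex.IsMatch(@"a\b", @"a[\\]b")); // True: \\ is a literal backslash

        string[] samples =
        {
            "https://www.test.com/help/about/index.aspx?at=eng&st=png",
            "http://www.test.com/help/about/index.aspx?at=eng&st=png",
            "www.test.com/help/about/index.aspx?at=eng&st=png",
            "test.com/help/about/index.aspx?at=eng&st=png",
        };

        foreach (string url in samples)
        {
            Console.WriteLine(Simplified.IsMatch(url)); // True for each sample
        }
    }
}
```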