
Why does the code below not throw a URISyntaxException at runtime, even though the URI looks illegal?

new URI("http:us//er:ps//w@si//te.c/om/dir1/di//r2/fi//le.txt#frag//ment");

// or same with "http:// ... "
new URI("http://us//er:ps//w@si//te.c/om/dir1/di//r2/fi//le.txt#frag//ment");

According to Wikipedia, "/" is a reserved (special) character and must be percent-encoded (URL-encoded) as %2F:

The reserved character /, for example, if used in the "path" component of a URI, has the special meaning of being a delimiter between path segments. If, according to a given URI scheme, / needs to be in a path segment, then the three characters %2F or %2f must be used in the segment instead of a raw /.

But the URI constructor does not require it to be percent-encoded!

Wikipedia defines URI format as follows (RFC 3986, section 3 (2005)):

URI = scheme:[//authority]path[?query][#fragment]

And the URI constructor accepts a raw (not percent-encoded) / in any component (except perhaps the scheme).

URI Javadoc states:

This constructor parses the given string exactly as specified by the grammar in RFC 2396, Appendix A, except for the following deviations: ...

Characters in the other category are permitted wherever RFC 2396 permits escaped octets, that is, in the user-information, path, query, and fragment components, as well as in the authority component if the authority is registry-based. This allows URIs to contain Unicode characters beyond those in the US-ASCII character set.

This permits unencoded "other" characters (see the Wikipedia link above for the reserved / unreserved / other classification), such as ɷ (non-ASCII), so it is not about reserved characters like the forward slash.

But anyway - why and what for?

P.S. Wikipedia explains why forward slashes can be used in the other components, but it is still unclear why they can be used inside the path component (in directory names and file names).

Reserved characters that have no reserved purpose in a particular context may also be percent-encoded but are not semantically different from those that are not.

In the "query" component of a URI (the part after a ? character), for example, / is still considered a reserved character but it normally has no reserved purpose, unless a particular URI scheme says otherwise. The character does not need to be percent-encoded when it has no reserved purpose.

Which slashes are you specifically expecting to cause an exception? - Andy Turner
Any single / within the path component. There are two such cases here: .... di//r2/fi//le.txt ... one in the middle of the directory name dir2 and one in the middle of the file name file.txt. - Code Complete
Ok, so the question really is why the path element of the URI allows /. And in the first wikipedia note you show it clearly says / is a delimiter between path segments. You must escape the ones within a path segment. You must not escape the ones that delimit path segments. - Perdi Estaquel
“Wikipedia defines URI format as follows…” Not quite. Wikipedia and the URI specification define a hierarchical URI as scheme:[//authority]path[?query][#fragment]. Since the character after http: in your URI is not a slash, you don’t have a hierarchical URI at all, so the only character restriction is what the spec terms uric, which includes slashes. - VGR
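VGR's point can be checked directly against java.net.URI. The sketch below is only an illustration: the class name UriShapeDemo and the helper describe are made up for this example, and it uses URI.create (which runs the same parser as new URI(String) but wraps URISyntaxException in an unchecked IllegalArgumentException) to keep the code short.

```java
import java.net.URI;

class UriShapeDemo {
    // Hypothetical helper: summarizes how java.net.URI split the string.
    static String describe(String text) {
        URI uri = URI.create(text);
        return "opaque=" + uri.isOpaque()
                + " authority=" + uri.getAuthority()
                + " path=" + uri.getPath()
                + " fragment=" + uri.getFragment();
    }

    public static void main(String[] args) {
        // No "//" after the scheme: parsed as an opaque URI, so there is
        // no authority and no path, only a scheme-specific part.
        System.out.println(describe(
                "http:us//er:ps//w@si//te.c/om/dir1/di//r2/fi//le.txt#frag//ment"));

        // With "//": hierarchical. The authority ends at the next "/",
        // and everything up to "#" is the path, where "/" is a legal delimiter.
        System.out.println(describe(
                "http://us//er:ps//w@si//te.c/om/dir1/di//r2/fi//le.txt#frag//ment"));
    }
}
```

In the first case the parser never sees a path at all; in the second, the authority is just "us" and the remaining slashes are read as path-segment delimiters (some segments being empty), so neither string violates the grammar.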

1 Answer


Ok, so the question really is why the path element of the URI allows /.

And in the first wikipedia note you show it clearly says / is a delimiter between path segments. (Path element != Path segment)

You must escape the ones WITHIN a path segment.

You must not escape the ones that DELIMIT path segments

URI: http://address.com/path%2fSegment1/path%2fSegment2/path%2fSegment3
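As a rough illustration of the difference, the sketch below splits the example URI into segments. The class name PathSegments and the helper rawSegments are hypothetical names for this example; getRawPath and getPath are the standard java.net.URI accessors.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

class PathSegments {
    // Hypothetical helper: splits the raw (still percent-encoded) path on "/".
    // Empty segments (from a leading or doubled slash) are skipped for brevity.
    static List<String> rawSegments(String text) {
        String rawPath = URI.create(text).getRawPath();
        List<String> segments = new ArrayList<>();
        for (String segment : rawPath.split("/")) {
            if (!segment.isEmpty()) {
                segments.add(segment);
            }
        }
        return segments;
    }

    public static void main(String[] args) {
        String text = "http://address.com/path%2fSegment1/path%2fSegment2/path%2fSegment3";

        // Three segments; the escaped slash stays inside each one.
        System.out.println(rawSegments(text));

        // getPath() decodes %2f to "/", so the segment boundaries
        // can no longer be told apart from the escaped slashes.
        System.out.println(URI.create(text).getPath());
    }
}
```

Splitting must happen on the raw path, before decoding; once %2f has been decoded, a slash that was part of a segment looks the same as a delimiter.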