188
votes

I am making a website with articles, and I need the articles to have "friendly" URLs, based on the title.

For example, if the title of my article is "Article Test", I would like the URL to be http://www.example.com/articles/article_test.

However, article titles (as any string) can contain multiple special characters that would not be possible to put literally in my URL. For instance, I know that ? or # need to be replaced, but I don't know all the others.

What characters are permissible in URLs? What is safe to keep?

13
There was a similar question, here. Check it out, you may find some useful answers there also (there were quite a lot of them).Rook
I reworded the question to be more clear. The question and answers are useful and of good quality. (48 people, including me, have favorited it) In my opinion, it should be reopened.Jonathan Allard

13 Answers

236
votes

To quote section 2.3 of RFC 3986:

Characters that are allowed in a URI, but do not have a reserved purpose, are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde.

  ALPHA  DIGIT  "-" / "." / "_" / "~"

Note that RFC 3986 lists fewer reserved punctuation marks than the older RFC 2396.

118
votes

There are two sets of characters you need to watch out for: reserved and unsafe.

The reserved characters are:

  • ampersand ("&")
  • dollar ("$")
  • plus sign ("+")
  • comma (",")
  • forward slash ("/")
  • colon (":")
  • semi-colon (";")
  • equals ("=")
  • question mark ("?")
  • 'At' symbol ("@")
  • pound ("#").

The characters generally considered unsafe are:

  • space (" ")
  • less than and greater than ("<>")
  • open and close brackets ("[]")
  • open and close braces ("{}")
  • pipe ("|")
  • backslash ("\")
  • caret ("^")
  • percent ("%")

I may have forgotten one or more, which leads to me echoing Carl V's answer. In the long run you are probably better off using a "white list" of allowed characters and then encoding the string rather than trying to stay abreast of characters that are disallowed by servers and systems.

47
votes

Always Safe

In theory and by the specification, these are safe basically anywhere, except the domain name. Percent-encode anything not listed, and you're good to go.

    A-Z a-z 0-9 - . _ ~ ( ) ' ! * : @ , ;

Sometimes Safe

Only safe when used within specific URL components; use with care.

    Paths:     + & =
    Queries:   ? /
    Fragments: ? / # + & =
    

Never Safe

According to the URI specification (RFC 3986), all other characters must be percent-encoded. This includes:

    <space> <control-characters> <extended-ascii> <unicode>
    % < > [ ] { } | \ ^
    

If maximum compatibility is a concern, limit the character set to A-Z a-z 0-9 - _ . (with periods only for filename extensions).

Keep Context in Mind

Even if valid per the specification, a URL can still be "unsafe", depending on context. Such as a file:/// URL containing invalid filename characters, or a query component containing "?", "=", and "&" when not used as delimiters. Correct handling of these cases are generally up to your scripts and can be worked around, but it's something to keep in mind.

43
votes

You are best keeping only some characters (whitelist) instead of removing certain characters (blacklist).

You can technically allow any character, just as long as you properly encode it. But, to answer in the spirit of the question, you should only allow these characters:

  1. Lower case letters (convert upper case to lower)
  2. Numbers, 0 through 9
  3. A dash - or underscore _
  4. Tilde ~

Everything else has a potentially special meaning. For example, you may think you can use +, but it can be replaced with a space. & is dangerous, too, especially if using some rewrite rules.

As with the other comments, check out the standards and specifications for complete details.

19
votes

Looking at RFC3986 - Uniform Resource Identifier (URI): Generic Syntax, your question revolves around the path component of a URI.

    foo://example.com:8042/over/there?name=ferret#nose
     \_/   \______________/\_________/ \_________/ \__/
      |           |            |            |        |
   scheme     authority       path        query   fragment
      |   _____________________|__
     / \ /                        \
     urn:example:animal:ferret:nose

Citing section 3.3, valid characters for a URI segment are of type pchar:

pchar = unreserved / pct-encoded / sub-delims / ":" / "@"

Which breaks down to:

ALPHA / DIGIT / "-" / "." / "_" / "~"

pct-encoded

"!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

":" / "@"

Or in other words: You may use any (non-control-) character from the ASCII table, except /, ?, #, [ and ].

This understanding is backed by RFC1738 - Uniform Resource Locators (URL).

12
votes

From the context you describe, I suspect that what you're actually trying to make is something called an 'SEO slug'. The best general known practice for those is:

  1. Convert to lower-case
  2. Convert entire sequences of characters other than a-z and 0-9 to one hyphen (-) (not underscores)
  3. Remove 'stop words' from the URL, i.e. not-meaningfully-indexable words like 'a', 'an', and 'the'; Google 'stop words' for extensive lists

So, as an example, an article titled "The Usage of !@%$* to Represent Swearing In Comics" would get a slug of "usage-represent-swearing-comics".

11
votes

unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"

6
votes

The format for an URI is defined in RFC 3986. See section 3.3 for details.

6
votes

From an SEO perspective, hyphens are preferred over underscores. Convert to lowercase, remove all apostrophes, then replace all non-alphanumeric strings of characters with a single hyphen. Trim excess hyphens off the start and finish.

3
votes

I had a similar problem. I wanted to have pretty URLs and reached the conclusion that I have to allow only letters, digits, - and _ in URLs.

That is fine, but then I wrote some nice regex and I realized that it recognizes all UTF-8 characters are not letters in .NET and was screwed. This appears to be a know problem for the .NET regex engine. So I got to this solution:

private static string GetTitleForUrlDisplay(string title)
{
    if (!string.IsNullOrEmpty(title))
    {
        return Regex.Replace(Regex.Replace(title, @"[^A-Za-z0-9_-]", new MatchEvaluator(CharacterTester)).Replace(' ', '-').TrimStart('-').TrimEnd('-'), "[-]+", "-").ToLower();
    }
    return string.Empty;
}


/// <summary>
/// All characters that do not match the patter, will get to this method, i.e. useful for Unicode characters, because
/// .NET implementation of regex do not handle Unicode characters. So we use char.IsLetterOrDigit() which works nicely and we
/// return what we approve and return - for everything else.
/// </summary>
/// <param name="m"></param>
/// <returns></returns>
private static string CharacterTester(Match m)
{
    string x = m.ToString();
    if (x.Length > 0 && char.IsLetterOrDigit(x[0]))
    {
        return x.ToLower();
    }
    else
    {
        return "-";
    }
}
1
votes

I found it very useful to encode my URL to a safe one when I was returning a value through Ajax/PHP to a URL which was then read by the page again.

PHP output with URL encoder for the special character &:

// PHP returning the success information of an Ajax request
echo "".str_replace('&', '%26', $_POST['name']) . " category was changed";

// JavaScript sending the value to the URL
window.location.href = 'time.php?return=updated&val=' + msg;

// JavaScript/PHP executing the function printing the value of the URL,
// now with the text normally lost in space because of the reserved & character.

setTimeout("infoApp('updated','<?php echo $_GET['val'];?>');", 360);
0
votes

I think you're looking for something like "URL encoding" - encoding a URL so that it's "safe" to use on the web:

Here's a reference for that. If you don't want any special characters, just remove any that require URL encoding:

HTML URL Encoding Reference

-4
votes

Between 3-50 characters. Can contain lowercase letters, numbers and special characters - dot(.), dash(-), underscore(_) and at the rate(@).