1
votes

I want to sanitize blog titles with unicode characters in url. I need to replace invalid characters and spaces with "-" for better seo rewriting like this.

        http://example.com/это-моя-хорошая

Can anyone have any idea how to do it?

1
Please define "invalid characters". - deceze♦
i don't want the characters like . , [] {} / ? in my url. If user post the title with these characters i want to change it to '-' for better seo. - uttam
I've no idea which language u are using, since i don't see a C# tag. But in C# i would do Url.Encode() - rfcdejong

1 Answers

3
votes

You can use this algorithm for an SEO-friendly Unicode URL:

  1. Convert the text to Unicode Normalization Form C, i.e. precomposed characters.
  2. Use a regular expression with Unicode character classes to replace each non-letter non-digit character with a space.
  3. Remove leading, trailing and double spaces.
  4. Shorten.
  5. Replace spaces with hyphens.