3
votes

I'm building a site in PHP that will support content in multiple languages. One part of the site will have business listings. I have SEO-friendly URLs set up to view these listings, so for example, I would have a business listing called "A bar down the street". The URL would look like this:

/listing/a-bar-down-the-street

However, let's say there is an Arabic version of this listing; then the name would look like this:

شريط أسفل الشارع

How would I turn that into the same URL format as the English version, but in the language it is currently in? When I run the Arabic version through my function that turns a string into an SEO-friendly URL, it comes back empty.

EDIT: To clarify further, all I'm looking for is a PHP function that turns any string into an SEO-friendly URL, no matter what language the site is in.

EDIT PART 2: Below is the function I'm using to rewrite the string into an SEO-friendly URL. Perhaps you can tell me what I need to add to make it language friendly?

    public function urlTitle($str,$separator = 'dash',$lowercase = TRUE)
    {

        if ($separator == 'dash')
        {

            $search     = '_';
            $replace    = '-';

        }else
        {

            $search     = '-';
            $replace    = '_';

        }

        $trans = array(
                        '&\#\d+?;'              => '',
                        '&\S+?;'                => '',
                        '\s+'                   => $replace,
                        '[^a-z0-9\-_]'          => '',
                        $replace.'+'            => $replace,
                        $replace.'$'            => $replace,
                        '^'.$replace            => $replace,
                        '\.+$'                  => ''
                        );

        $str = strip_tags($str);
        $str = preg_replace("#\/#ui",'-',$str);

        foreach ($trans AS $key => $val)
        {

            $str = preg_replace("#".$key."#ui", $val, $str);

        }

        if($lowercase === TRUE)
        {

            $str = mb_strtolower($str);

        }

        return trim(stripslashes($str));

    }
4
stackoverflow.com/questions/9511254/… this link might help you - uttam
@uttam Unfortunately I don't have the normalizer installed on the server and don't think I could get it installed. - John

4 Answers

1
votes

I found a similar question in an existing SO discussion. It seems that what you are asking for should be possible "out of the box".

I would recommend looking into your web server config to see what the problem is; there should be no difference between SEO-friendly English URLs and any other URL-encodable string.

What webserver are you running?

UPDATE: I see that you are only accepting ASCII alphanumeric characters (plus dashes and underscores):

'[^a-z0-9\-_]'          => '',

I suspect that this filters out every character outside a-z, 0-9, dash and underscore, which would cause the empty return. Alternatively, you can debug your function to see which of the replacement rules wipes out your content.
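The effect of that character class can be checked in isolation. A minimal sketch, assuming the source file is saved as UTF-8 (the variable names are only for illustration):

```php
<?php
// The Arabic title from the question.
$title = 'شريط أسفل الشارع';

// Same character class and modifiers as in urlTitle(): everything that is
// not an ASCII letter, digit, dash or underscore is removed.
$filtered = preg_replace('#[^a-z0-9\-_]#ui', '', $title);

var_dump($filtered === ''); // bool(true): no Arabic letter survives the filter

// An English title only loses its spaces when run through this rule alone.
$english = preg_replace('#[^a-z0-9\-_]#ui', '', 'A bar down the street');
var_dump($english); // string(17) "Abardownthestreet"
```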

What you are encountering here is that URLs cannot contain arbitrary characters by default; browsers generally rely on encoding to display nice-looking multi-language URLs.

See example from link:

URLs are allowed only a certain set of english letter characters, which includes the numbers, dashes, slashes, and the question mark. All other characters have to be encoded, which applies to non-Latin domain names. If you go to فنادق.com, you will notice that some browsers will decode it and show you فنادق.com but some like Chrome will show you something like this http://www.xn--mgbq6cgr.com/.

This means that you can no longer filter your post title down to URL-valid characters only; you need to encode the titles and hope that the browser renders them the way you would like.
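If you go the encoding route, PHP's rawurlencode() percent-encodes a path segment. A minimal sketch: the /listing/ prefix comes from the question, while listingUrl() is a hypothetical helper name used only for illustration.

```php
<?php
// Hypothetical helper: builds the listing path from an already-cleaned title.
function listingUrl($slug)
{
    // rawurlencode() turns every byte outside the unreserved set into %XX,
    // so the UTF-8 bytes of non-Latin letters become percent escapes.
    return '/listing/' . rawurlencode($slug);
}

$url = listingUrl('شريط');
// $url is now '/listing/%D8%B4%D8%B1%D9%8A%D8%B7'

// Decoding the last segment gives the original text back.
$decoded = rawurldecode(substr($url, strlen('/listing/')));
```

Whether the address bar shows the decoded Arabic text or the percent escapes is up to the browser.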

Another option would be to use transliteration, possibly after detecting a browser that is known not to render URL-encoded special characters.

0
votes

So what seems to work for me is taking this part out of my PHP function:

'[^a-z0-9\-_]'          => '',

And updating the strtolower line to:

$str = mb_strtolower($str,'UTF-8');

And it seems to work as normal. However, can anyone confirm this will keep working going forward? Will browsers understand this for all languages? Or do I have to normalize the string to make sure every browser can understand the URL? The problem is that I'm not on PHP 5.3, which is required to install the normalization extension for PHP. I'm currently on 5.2.x, and I'm afraid upgrading will break things.
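For the lowercasing step, passing the encoding explicitly means the result does not depend on the server's mb_internal_encoding() setting. A small sketch, assuming the mbstring extension is loaded and the input is valid UTF-8:

```php
<?php
// Multibyte-aware lowercasing with an explicit encoding.
$french = mb_strtolower('ÉCOLE', 'UTF-8');

// Arabic has no letter case, so the text passes through unchanged.
$arabic = mb_strtolower('شريط أسفل الشارع', 'UTF-8');

var_dump($french === 'école');            // bool(true)
var_dump($arabic === 'شريط أسفل الشارع'); // bool(true)
```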

0
votes

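John, you're right, the main problem is that your regex character class ([^a-z0-9\-_]) doesn't allow non-ASCII letters. This should work better: [^\p{L}0-9\-_] (\p{L} matches a letter in any script, and it needs the u modifier, which your patterns already use).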

I had been working on a function like this recently and just published a blog post that includes the function I came up with: Creating SEO Friendly URLs in PHP with url_slug()
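To show how that character class behaves inside a complete function, here is a minimal sketch. It is not the url_slug() from the blog post; unicodeSlug() is a hypothetical name, and the code assumes valid UTF-8 input and the mbstring extension.

```php
<?php
// Hypothetical helper for illustration only.
function unicodeSlug($str)
{
    $str = strip_tags($str);
    // Turn runs of whitespace and slashes into a single dash.
    $str = preg_replace('#[\s/]+#u', '-', $str);
    // Keep letters of any script (\p{L}), digits, dashes and underscores.
    $str = preg_replace('#[^\p{L}0-9\-_]#u', '', $str);
    // Collapse repeated dashes, then trim them from both ends.
    $str = preg_replace('#-+#', '-', $str);
    $str = trim($str, '-');
    return mb_strtolower($str, 'UTF-8');
}

$english = unicodeSlug('A bar down the street'); // 'a-bar-down-the-street'
$arabic  = unicodeSlug('شريط أسفل الشارع');       // 'شريط-أسفل-الشارع'
```

Arabic text written with vowel marks would lose them here, since combining marks belong to \p{M} rather than \p{L}; adding \p{M} to the class is one way to keep them.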

0
votes

I have a site that supports 48 different languages. The function I use to clean the URLs is below (in JavaScript); perhaps it is helpful for you:

const noHyphenLangs = ['ko', 'ja', 'zh-cn', 'zh-tw', 'ar', 'th']
const formatTranslationIntoPath = (text, symbol) => { // utf-8 encoding
  let t = text
  const replaceChar = noHyphenLangs.includes(symbol) ? '' : '-'
  t = t.replace(/-/g, ' ')
  t = t.replace(/\s/g, replaceChar)
  t = t.replace(/['`’]/g, '') // remove quotes
  t = t.replace(/[,,()]/g, '') // remove junk
  t = t.normalize('NFD').replace(/\p{Diacritic}/gu, '') // simplify letters for url https://stackguides.com/questions/990904/remove-accents-diacritics-in-a-string-in-javascript
  t = t.replace(/[Łł]/g, 'l') // doesn't get replaced in diacritic replacements

  return t.toLowerCase()
}

const ex1 = formatTranslationIntoPath('让我们  尝试-这样-做', 'zh-cn') // 让我们尝试这样做
const ex2 = formatTranslationIntoPath('Việt miễn phí', 'vi') // viet-mien-phi

PS: For most languages, you don't want to remove the non-alphanumeric characters if there are no diacritic replacements available.

Ref: https://gist.github.com/KevinDanikowski/24c79cbb7a3ef2a7f3e452e740848249
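Since the question is about PHP, here is a rough PHP counterpart of the diacritic-stripping step. It is only a sketch: stripDiacritics() is a hypothetical name, the code assumes the intl extension (which provides the Normalizer class and which the asker said is not available on PHP 5.2), and it uses \p{Mn} (nonspacing marks) as an approximation of the JavaScript \p{Diacritic} class.

```php
<?php
// Hypothetical helper; requires the intl extension for Normalizer.
function stripDiacritics($str)
{
    // Decompose accented letters into base letter + combining marks (NFD).
    $decomposed = Normalizer::normalize($str, Normalizer::FORM_D);
    // Remove the combining marks (Unicode category Mn).
    return preg_replace('#\p{Mn}+#u', '', $decomposed);
}

if (class_exists('Normalizer')) {
    $plain = stripDiacritics('Việt miễn phí');
    $slug  = mb_strtolower(str_replace(' ', '-', $plain), 'UTF-8');
    // $slug is 'viet-mien-phi'
}
```

As the PS above warns, this is only appropriate for scripts where the base letters remain readable without their marks.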