We recently met up with a company that publishes content in several languages, and it got me thinking about URL structures again, specifically the duplicate (non-canonical) URLs that often crop up in this environment.
For example, what is the best URL structure to use? And how can we encode URLs automatically so that they are "human-readable" in every language?
Canonical URLs and URL Normalisation
This is probably the best place to start, as we have all come across URL rewriting methods in the past. A canonical URL is the single preferred address for a document; the problem arises when several similar URLs point to the same place, which Google and other search engines might see as duplicate content. (There are several things you can do to help the search engines avoid this, but it's best not to have multiple URLs for the same document in the first place.) Here is one common example:
http://www.example.com
http://example.com
http://example.com/index.html
http://www.example.com/index.html
These four URLs all lead to the domain's homepage, but if the server's URL handling isn't set up correctly (along with a few other measures, such as your robots file and Google's webmaster tools settings), the search engines could, in theory, see this as duplicate content.
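As a rough illustration of what normalisation means here, the sketch below maps all four URLs above onto one preferred form. The choice of the "www" host and the list of default document names are assumptions made for the example, not rules; in practice this is usually handled by a server-side redirect rather than application code.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical choices for this example only.
PREFERRED_HOST = "www.example.com"
HOST_ALIASES = {"example.com", "www.example.com"}
DEFAULT_DOCUMENTS = ("index.html", "index.htm", "index.php")


def canonical_url(url: str) -> str:
    """Return one preferred form for URLs that address the same page."""
    parts = urlsplit(url)

    # Host names are case-insensitive, so compare them in lower case.
    host = parts.netloc.lower()
    if host in HOST_ALIASES:
        host = PREFERRED_HOST

    # An empty path and "/" address the same resource.
    path = parts.path or "/"

    # Drop a trailing default document such as "index.html".
    for doc in DEFAULT_DOCUMENTS:
        if path.endswith("/" + doc):
            path = path[: -len(doc)]
            break

    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


urls = [
    "http://www.example.com",
    "http://example.com",
    "http://example.com/index.html",
    "http://www.example.com/index.html",
]
for u in urls:
    print(u, "->", canonical_url(u))
```

All four inputs come out as http://www.example.com/, which is the single address you would want the search engines to index.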
The issue deepens when we go multi-lingual...
With multi-lingual websites the issue becomes more prominent, because we need to encode the URLs into a human-readable structure whilst retaining the keywords of the native language.
For example:
We may have a page about mobile phones on our UTF-8 encoded website, such as www.example.co.uk/mobile-phones/. Translated into French, this page could be:
http://www.example.com/téléphones-mobiles/
Now, because URLs may only contain a limited set of US-ASCII characters, the accented letters have to be percent-encoded (as their UTF-8 bytes), so this URL could end up as:
http://www.example.com/t%c3%a9l%c3%a9phones-mobiles/
Which quite frankly is grotesque and by no means human readable.
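To see where that encoded form comes from, here is a small sketch using Python's standard urllib.parse module. The variable names are ours; the only thing assumed is that the path is encoded as UTF-8, which is the default for quote().

```python
from urllib.parse import quote, unquote

# "é" is written as \u00e9 so the exact code point is unambiguous.
path = "/t\u00e9l\u00e9phones-mobiles/"

# quote() percent-encodes each non-ASCII character as its UTF-8 bytes.
# "é" is the two bytes C3 A9 in UTF-8, hence "%C3%A9".
encoded = quote(path)
print(encoded)

# Decoding restores the original path. The hex digits are
# case-insensitive, so the lower-case form decodes the same way.
print(unquote(encoded) == path)
print(unquote("/t%c3%a9l%c3%a9phones-mobiles/") == path)
```

Python emits upper-case hex digits (%C3%A9) where the example above shows lower case; both decode to the same path.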
We're currently writing extensions for a few open source platforms to get around this very issue. My personal preference is simply to replace those characters with their closest ASCII equivalents, as it makes for a more "human-readable" URL, so the above would become...
http://www.example.com/telephones-mobiles/
(replacing 'é' with 'e')
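One common way to do this kind of replacement is to decompose each accented letter into its base letter plus a combining mark and then drop the marks. The sketch below uses Python's standard unicodedata module; the function name ascii_slug is hypothetical, and this is an illustration of the general technique rather than the code in our extensions.

```python
import re
import unicodedata


def ascii_slug(text: str) -> str:
    """Turn a page title into a lower-case, hyphenated ASCII slug."""
    # NFKD splits "é" into "e" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", text)

    # Drop the combining marks, keeping only the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    # Discard anything that still has no ASCII representation.
    ascii_only = stripped.encode("ascii", "ignore").decode("ascii")

    # Collapse every run of other characters into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_only.lower())
    return slug.strip("-")


print(ascii_slug("T\u00e9l\u00e9phones mobiles"))
```

This works well for accented Latin letters, but it has an obvious limit: text in a script with no ASCII base letters, such as Cyrillic or Chinese, is discarded entirely and produces an empty slug. Those languages need a proper transliteration table or a different approach altogether.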
But now that looks too "English" right?
So we need to identify the language of the page. We could do this with a "lang" attribute, but as far as we know these are ignored by most search engines. We are therefore writing the extensions for a few popular content management systems to put the language into the URL when content is published, so ultimately our new, sexy, human-readable URLs (using the above example) would become...
http://www.example.com/fr/telephones-mobiles/
...or...
http://fr.example.com/telephones-mobiles/
Depending on the site owner's preference.
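The two layouts can be sketched as a single helper that takes the site owner's preference as a parameter. The function name localized_url and its style argument are hypothetical, and the "www" host for the subdirectory layout is simply what the example above uses.

```python
from urllib.parse import urlunsplit


def localized_url(domain: str, lang: str, slug: str,
                  style: str = "subdirectory") -> str:
    """Build a language-specific URL in one of the two layouts."""
    if style == "subdirectory":
        # Language code as the first path segment: /fr/telephones-mobiles/
        return urlunsplit(("http", f"www.{domain}", f"/{lang}/{slug}/", "", ""))
    if style == "subdomain":
        # Language code as the subdomain: fr.example.com
        return urlunsplit(("http", f"{lang}.{domain}", f"/{slug}/", "", ""))
    raise ValueError(f"unknown style: {style!r}")


print(localized_url("example.com", "fr", "telephones-mobiles"))
print(localized_url("example.com", "fr", "telephones-mobiles", style="subdomain"))
```

Whichever layout is chosen, it should be applied consistently across the site, otherwise the same page ends up reachable at two addresses and the duplicate content problem returns.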
We'd love to hear feedback and opinions on what other people think works best, especially from anyone publishing to Cyrillic-language sites or other more complex character sets. Opinions on Chinese, for example, would be interesting to hear!
If you have anything to say, or if you would like to beta test our URL encoders when they are available, please do get in touch, specifying the CMS you are using and the issues you currently have!
Further reading on what prompted this article...
An article on multi-lingual websites at Google Webmaster Central