Countries And Languages

Date published: 16 May 2005

Author: Danny Sullivan

Categories: Industry

This page deals with the specifics of optimizing content for regional search engines. To be successful, you’ll need two key ingredients. First, it’s important to have a domain for the country or region you wish to target. Second, you should write pages in the language of that country or region.

By the way, is it even worthwhile to worry about country-specific search engines? Absolutely, assuming that you want to sell products and services to people in different countries. Only you can assess where your target audience is located. If you only care about the US, then don’t worry about non-US search engines. If you want to reach different parts of the world, then you do want to take country-specific search engines into account.

Domain Issues

Having a domain name for a particular country is one of the most important things you can do in order to be listed with a search engine designed for that country. It’s a key factor in declaring that your site belongs within its listings. If you want to be found in a UK search engine, then having a UK domain name will be a huge help. Similarly, to be found in a German search engine, having a German domain name is important. If you want to be found in several country-specific search engines, then you’ll want to have a domain name for each country you wish to target. How do you then make use of these domain names? You can follow either the “One Site” or the “Many Sites” strategy.

One Site

In the One Site situation, you have a single web site, and you make the various country domains you’ve registered all resolve to the same place. For instance, let’s say car manufacturer Ford has all of these domains:

ford.com (US)
ford.fr (France)
ford.de (Germany)
ford.co.uk (UK)

With One Site, no matter which address you entered into your browser, you’d always end up at the same web site.

Many Sites

The only difference with the Many Sites method is that each domain resolves to an independent web site. For instance, entering ford.co.uk would take the user to a completely different site than ford.com. It’s also likely that the country-specific sites will be written in the country’s primary language. For example, a site targeted at Germans would use German right on the home page.

Registering Country-Specific Domain Names

Any large domain registrar should be able to help you register domain names around the world. ICANN, which oversees the domain name system, both lists accredited registrars and provides information about the domain names assigned to each country.

Language Issues

When it comes to language, you need to understand that no major crawler-based search engine allows you to specify what language your page is in. They will determine this automatically. Yes, there is a meta tag that is designed to help them. For instance, here’s how the format would look for a page written in French:

<meta http-equiv="content-language" content="fr">

However, as far as I currently know, no major search engine will recognize this tag or act upon it. That means if you want a page to be seen as French, it simply must be written in French.

You should also avoid mixing large amounts of different languages on the same page. For instance, don’t have the same page written first in French, followed by the same text in English and then Italian. That will confuse the search engines and probably prevent them from tagging the page for any of those languages. Instead, have a separate page for each language.

For Latin languages, there’s often a concern about using “extended” characters such as accent marks. Should you use them? Yes, if you think your audience will enter the words complete with accent marks into search engines.

For instance, most search engines will return pages that have either manana or mañana on them in a search for manana. But if someone specifically enters mañana, then usually only pages that have mañana with the tilde (that symbol over the n) will appear. Thus, if you use the tilde, you increase the odds of appearing in response to that search. Plus, you should still come up even if someone doesn’t include the symbol, though it is possible you may not rank as highly because you used it.
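To make the matching behavior above concrete, here is a minimal Python sketch. It only illustrates the rule as described here, not how any particular search engine works, and the function names strip_accents and matches are hypothetical.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks, so 'mañana' becomes 'manana'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches(query: str, page_word: str) -> bool:
    """Hypothetical rule: plain queries match loosely, accented ones exactly."""
    query = unicodedata.normalize("NFC", query)
    page_word = unicodedata.normalize("NFC", page_word)
    if query == strip_accents(query):
        # No accents in the query: accept both spellings on the page.
        return strip_accents(page_word) == query
    # Accents in the query: only the exact accented spelling matches.
    return page_word == query


print(matches("manana", "mañana"))  # True
print(matches("manana", "manana"))  # True
print(matches("mañana", "manana"))  # False
print(matches("mañana", "mañana"))  # True
```

Under this assumed rule, a page that uses the tilde is found by both kinds of search, while a page without it is found only by the unaccented search.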

Think about your audience. Do they have a keyboard with these extended characters on it? If so, they’ll probably use the characters, and so you should probably use them on your pages. However, most people in English-speaking countries such as the US and Britain will not have keyboards with these characters. If you are targeting these places, you may wish to experiment with leaving out the extended characters. It possibly could improve your rankings.

Encoding Issues

You also need to be aware that there are two parts to using extended characters: the character set and the character encoding.

The character set is like the alphabet you are using on the page. Someone writing in English will use a standard alphabet set (a, b, c…) plus numerals (1, 2, 3…) and some special characters (!, &, +, …). Someone writing in French may use a similar character set, but with some unique characters of its own (â, é, è…). Someone writing in Chinese or another non-Latin language will use a radically different character set than English. In general, look at your computer keyboard. What you see there is the character set you are using. (For more on official character sets, see the Registered Character Sets list.)

The character set itself means nothing to a computer or to computer software like your browser. They need the character set translated into computer language, which can then be shared between computers, such as between a web server and your browser. This translation is called character encoding. If both computers know what character encoding is being used, then they can talk behind the scenes to render the character set that you, a human, will view.
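As a small illustration of the difference between a character and its encoding, the Python sketch below takes one accented letter and shows the bytes it turns into under three encodings. The choice of encodings is my own, purely for illustration.

```python
character = "é"

# The character's number within the character set (its code point).
code_point = ord(character)
print(code_point)  # 233

# The same character becomes different bytes under different encodings.
encoded = {
    name: character.encode(name).hex()
    for name in ("iso-8859-1", "utf-8", "utf-16-be")
}
for name, hex_bytes in encoded.items():
    print(name, hex_bytes)
# iso-8859-1 e9
# utf-8 c3a9
# utf-16-be 00e9
```

The letter is the same in every case; only the translation into bytes changes, which is why both sides need to agree on the encoding.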

Now, HTML documents use a special character set called the Universal Character Set, or UCS. This is virtually identical to a set of characters called Unicode, which itself is a greatly expanded version of the ASCII character set you may have heard of. (The Unicode Web Site provides more information about Unicode and how it serves as a standard for rendering all the world’s languages).

All you really need to know is that UCS is like having the ability to represent any character from any language in the world.

So you have access to all those characters, but remember, those characters also need to be encoded in some way so that the computers can talk to each other. As a simplified explanation, this basically lets the computers know which portion of the UCS character set you’ll be using.

For example, if you are writing in a Latin language, the computers use an encoding format for those particular characters. If you are writing in a Cyrillic language, they’ll use a different encoding method. Because the character encoding method is so closely linked to the portion of the UCS character set you are using, it’s referred to in short as the “charset,” for character set.
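The idea that a charset covers only a portion of the full character set can be checked with a short Python sketch. The sample words and the three charsets are my own picks for illustration, and can_encode is a hypothetical helper name.

```python
def can_encode(text: str, charset: str) -> bool:
    """Return True if every character in text exists in the given charset."""
    try:
        text.encode(charset)
    except UnicodeEncodeError:
        return False
    return True


samples = {"French": "théâtre", "Russian": "театр"}
charsets = ("iso-8859-1", "iso-8859-5", "utf-8")

coverage = {
    (charset, language): can_encode(word, charset)
    for charset in charsets
    for language, word in samples.items()
}
for (charset, language), ok in coverage.items():
    print(f"{charset:<11} {language:<8} {ok}")
# iso-8859-1  French   True
# iso-8859-1  Russian  False
# iso-8859-5  French   False
# iso-8859-5  Russian  True
# utf-8       French   True
# utf-8       Russian  True
```

The Latin charset handles the French word but not the Russian one, the Cyrillic charset does the reverse, and UTF-8 handles both.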

Ideally, your web server is supposed to know what charset your documents use and then pass along this information to browsers or search engine spiders. In reality, many servers may not do this. That’s why HTML allows you to specify your charset in each page, using (yep) a meta tag. For instance, here’s the charset meta tag that says a page is written using a Latin alphabet set (also called ISO-Latin-1):

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

What if both you and your web server fail to specify a charset? Don’t worry — things will probably still work OK. For instance, if you write in English, then your audience probably reads English. That means when they installed their browser, it probably made a note of how their computer was configured and set itself up to guess that all web pages viewed are written using an English alphabet. In fact, most documents on the web probably do NOT have a charset declared for them. Fortunately, the behind-the-scenes smarts of our browsers make up for this.
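The lookup described above, where the server's declaration is checked first and the page's own meta tag second, can be sketched roughly in Python. The function name declared_charset, the 1024-byte scan window and the regular expressions are my own assumptions; real browsers and spiders follow more elaborate rules.

```python
import re
from typing import Optional

# Rough patterns, for illustration only; they do not fully parse HTTP or HTML.
HEADER_CHARSET = re.compile(r'charset\s*=\s*"?([^\s;"]+)', re.IGNORECASE)
META_CHARSET = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE)


def declared_charset(http_content_type: Optional[str], body: bytes) -> Optional[str]:
    """Return the declared charset in lowercase, or None if nothing declares one."""
    if http_content_type:
        header_match = HEADER_CHARSET.search(http_content_type)
        if header_match:
            return header_match.group(1).lower()
    # Fall back to a meta tag near the top of the page.
    meta_match = META_CHARSET.search(body[:1024])
    if meta_match:
        return meta_match.group(1).decode("ascii").lower()
    return None


page = b'<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'
print(declared_charset("text/html; charset=UTF-8", page))  # utf-8
print(declared_charset("text/html", page))                 # iso-8859-1
print(declared_charset(None, b"<p>Hello</p>"))             # None
```

When the function returns None, the reader of the page is left to guess, which is the situation described above.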

Now imagine you are writing in Chinese, and someone comes to your site with a browser configured for English pages. The characters won’t be rendered correctly, unless the person manually configures their browser to understand an additional charset. Similarly, if a search engine comes to your web site, it’s possible that it may not understand the charset you are using. If so, then your pages may not be indexed correctly.

For instance, search engines that deal with Asian language sites typically have to be specially configured to understand the coding of the pages they encounter. Even accented Latin characters may cause problems. For instance, here’s how the title of a site called ¡Olé! Venezuela appeared at Google in April 2000:

¡Olé! Venezuela

What happened was that Google encountered the é character but failed to render it properly, perhaps because the page didn’t declare a charset or perhaps for one of many other reasons. Avoiding these problems is covered in the next section.
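I can’t say exactly what went wrong in that case, but one common way this sort of garbling arises is when text is stored in one encoding and read back as another. The Python sketch below assumes UTF-8 bytes being read as ISO-8859-1; that pairing is my own choice for illustration, not a claim about what Google did.

```python
title = "¡Olé! Venezuela"

# The bytes as they might be stored on a server (assumed here to be UTF-8).
stored_bytes = title.encode("utf-8")

# A reader that knows the right encoding recovers the title.
read_correctly = stored_bytes.decode("utf-8")

# A reader that wrongly assumes ISO-8859-1 turns each accented
# character into two unrelated characters.
read_wrongly = stored_bytes.decode("iso-8859-1")

print(read_correctly)  # ¡Olé! Venezuela
print(read_wrongly)    # Â¡OlÃ©! Venezuela
```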

Encoding Documents And Characters

If you are working with extended Latin characters, such as accented letters, or writing in a non-Latin language, then it is probably a good idea to specifically note the charset you are using. Most authoring tools should help you do this. Check your help files. There are also links to more information, below.

As a general catch-all, I’d suggest the following charset meta tag for anyone using a European language. In fact, it is supposed to cover the entire range of characters that HTML allows:

Adding a charset tag doesn’t necessarily mean that search engines will then understand all your extended or unusual characters. None of the majors have ever required that a charset tag be placed on pages. Instead, they seem to rely on looking at documents and making their own guesses as to the charset being used. Only AltaVista, at the time of this writing, seems to make note of the charset you provide. It still doesn’t use this for actually interpreting your document, relying instead on making its best guess. But, at least it will display your charset along with your description.

So, for those writing in non-English languages, adding charset meta tags possibly might be helpful, but don’t expect that they will make a huge impact. On the other hand, you should be very careful of how you insert what are called character references.

Character references are just a way of encoding a specific character independently of the charset of your document, or even of your computer’s operating system and hardware. For instance, take the word “theatre.” In French, it makes use of accent marks, as so:

théâtre

Since I don’t have a French keyboard, I cannot create this word just by using keystrokes. Instead, using the Netscape Composer authoring tool, I am able to select the special characters using the Tools | Character Tools | Insert Special Characters command. They are then inserted into my document, so that the word above is rendered with this behind-the-scenes HTML code:

th&eacute;&acirc;tre

See how the é is made by HTML that says &eacute;, while the â is rendered by code that says &acirc;?
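If you want to see the conversion in both directions, Python’s standard html module can do it. This is a minimal sketch, and to_named_references is a hypothetical helper of my own, not part of any authoring tool mentioned here.

```python
import html
from html.entities import codepoint2name


def to_named_references(text: str) -> str:
    """Replace non-ASCII characters with named character references where one exists."""
    pieces = []
    for ch in text:
        name = codepoint2name.get(ord(ch)) if ord(ch) > 127 else None
        pieces.append(f"&{name};" if name else ch)
    return "".join(pieces)


encoded = to_named_references("théâtre")
decoded = html.unescape(encoded)

print(encoded)  # th&eacute;&acirc;tre
print(decoded)  # théâtre
```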

Something different happens within Microsoft FrontPage, the HTML authoring tool I regularly use. Using the Insert | Symbol command, I am able to select and insert the special characters just as with Netscape Composer. But they look exactly the same in the HTML source code as when they are rendered. The chart below summarizes the situation:

Program	HTML Code	Rendered As
Composer	th&eacute;&acirc;tre	théâtre
FrontPage	théâtre	théâtre

How can completely different code still render the same characters? It’s because my browser understands that my operating system can display an extended character such as é. Therefore, when it sees that symbol in the HTML code, it already knows how to translate it. In contrast, had it been another character, the display may not have worked unless I manually configured my browser.

Similarly, a search engine spider may encounter problems if it isn’t configured to read and store special characters properly. You could use character references, as Composer does, only to find that the word théâtre becomes th&eacute;&acirc;tre when the search engine indexes your page. Similarly, if your keyboard allows you to type théâtre, then you might not use character references, only to discover that the search engine then strips off all the accent marks when indexing your page, or worse.

My recommendation from when this article was originally written in 2001 was to NOT use character references like th&eacute;&acirc;tre in your title tag or meta tags. When I did some test runs on various search engines then, I found many failed to interpret character references when used in these areas. I suspected that the decoding kicks in only once they process the body portion of the document.

That means within your body, you should be fine regardless of whether you use character references or not. For instance, if you type théâtre using a French keyboard, and your authoring tool leaves it that way in your HTML code, great: you should be OK. Similarly, if théâtre appears as th&eacute;&acirc;tre in the source code, that should work also. Just avoid having it appear that way, with character references, in the HTML source code for your title and meta tags.
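If you’d like to check your own pages against this advice, a rough scan along the following lines may help. It is a sketch only: the regular expressions are a simplification rather than a full HTML parser, and find_head_references is a hypothetical name of my own.

```python
import re

# Named (&eacute;), decimal (&#233;) and hexadecimal (&#xE9;) references.
CHAR_REF = re.compile(r"&(?:#\d+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")
TITLE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
META = re.compile(r"<meta\b[^>]*>", re.IGNORECASE)


def find_head_references(source: str) -> list:
    """List (where, reference) pairs found in the title and in meta tags."""
    found = []
    for title in TITLE.finditer(source):
        found += [("title", ref.group(0)) for ref in CHAR_REF.finditer(title.group(1))]
    for meta in META.finditer(source):
        found += [("meta", ref.group(0)) for ref in CHAR_REF.finditer(meta.group(0))]
    return found


page = (
    "<html><head><title>Le th&eacute;&acirc;tre</title>"
    '<meta name="description" content="Le th&eacute;&acirc;tre de Paris">'
    "</head><body><p>th&eacute;&acirc;tre</p></body></html>"
)
for where, reference in find_head_references(page):
    print(where, reference)
# title &eacute;
# title &acirc;
# meta &eacute;
# meta &acirc;
```

References in the body are deliberately ignored, since those should be fine either way.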

Related Resources

Here are some additional articles specifically related to issues discussed on this page:

Blair “Liar” Linkbomb Highlights Country-Specific Skewing
Search Engine Watch, May 17, 2005

A search putting the UK prime minister at the top of Google UK but not Google.com highlights how search engines more and more are skewing their results on country-specific editions. A look at how and why these changes have happened, ranging from mirroring and censorship issues to specific ranking differences that are done in hopes of bettering the user experience.

HTML Document Representation
W3C HTML 4.01 Specification, Dec. 24, 1999

From the official HTML 4.01 specifications, this document explains more about character encoding, the charset tag and using character references to insert special characters.

Notes on helping search engines index your Web site
W3C HTML 4.01 Specification, Dec. 24, 1999

This briefly discusses tips on helping search engines recognize that you have documents available in multiple languages. This mechanism is NOT recognized by any of the major search engines, despite being part of the HTML specifications.

Language information and text direction
W3C HTML 4.01 Specification, Dec. 24, 1999

More about how pages can be identified as written in a particular language. Again, the major search engines are not making use of the specifications here to determine the language of your web page.

Extended ASCII
Webopedia, Sept. 1, 1996

Why are accented characters sometimes called “extended” characters? Because they were added to the original ASCII character set, forming “extended ASCII” or “high” ASCII. This page also has links defining ASCII and ISO Latin 1, which is similar to extended ASCII.
