Language issues

Part of an article on Building a Multilingual Web Service Using the ACS, by John Lowry (lowry@arsdigita.com)


Translating site content into different languages is one of the most tangible requirements for building a multilingual site. Less obvious are the steps involved in getting that content translated. In this section, we describe how to determine which language a user prefers; how to build a message catalog of the text strings that get displayed; how to build a data model with language-dependent columns; how to do language-aware sorting; and how to manage the translation process.

Language negotiation

A web site chooses the language in which to serve its content by a process known as language negotiation. This process begins with finding out which language a user prefers. Here are some possible ways to do this:
  1. Preference determined from the HTTP request

    The HTTP standard includes an Accept-Language header which the client can send to the server as part of a request for a web page. Here is a request sent by a web browser, which includes a language preference:

    GET /index.adp HTTP/1.0
    User-Agent: Mozilla/4.61 [en] (X11; I; Linux 2.2.12-20 i686) default
    Host: www.arsdigita.com
    Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
    Accept-Encoding: gzip
    Accept-Language: en-GB, en
    Accept-Charset: iso-8859-1,*,utf-8
    
    This user prefers British English and accepts other types of English. An advantage of using the Accept-Language header is that it can contain a list, thereby increasing the likelihood of a preferred language being available. This header, however, may not always be the most convenient or reliable way to choose a user's language. The user's preferred languages may not be available on our web site, or the user may have incorrectly configured his language preference in his browser.

  2. Preference specified by user on first visit to the site

    The first time a user visits a site he is prompted to select a language from the list available. This causes a cookie to be stored on the user's local computer, which the user's browser sends back to the server on every subsequent request. A user's language preference will be lost if he moves to a different computer or deletes his cookie file.

  3. Preference determined by page that links to the site

    A user's language preference can be encoded in a URL. For example, http://host/en/index.tcl would specify that the contents be served in English. In order for this method to work, all links to the site must have the language identifier embedded correctly. The page that contains the link lives on a remote server so it would be hard to ensure that this method works reliably. Users that preferred different languages would not be able to share URLs because their language preferences would be included.

  4. Preference specified by user at registration

    A user can select a language preference when he registers for the site. Whenever he returns to the site the language can be ascertained from his login token. On the surface, this seems almost identical to the cookie mechanism described above. However, the implementation of the filter that populates the ad_locale data structure is quite different. Reading the value of a cookie is inexpensive, whereas retrieving the user's language preference from the database on every request would cost significantly more. Therefore we memoize the code that creates the user's preferences. The drawback is that if a user changes his preferred language in the database, the change will not be recognized until the cache is flushed or the server is restarted.

In practice, the method used to discover a user's language preference will depend on the requirements of the site. What works for one site may be inappropriate for another. In some cases, the cookie solution will turn out to be best. In others, however, we will need an algorithm to select a default choice if the language specified by a user's Accept-Language header is not available. For now, let's just assume that we have written a procedure that will implement whatever solution we have chosen.
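
As an illustration of the kind of procedure assumed above, here is a minimal sketch that parses an Accept-Language value and falls back to a default when none of the listed languages is available. The procedure name, its arguments and the fallback policy are our own assumptions for this example; it is plain Tcl (8.5 or later) and does not use any AOLserver or ACS call.

```tcl
# Hypothetical helper: pick the best available language from an
# Accept-Language header value, or return a site default.
proc negotiate_language_sketch {header available default} {
    set ranked {}
    foreach item [split $header ","] {
        # Each item looks like "en-GB" or "fr;q=0.8".
        set parts [split $item ";"]
        set tag [string tolower [string trim [lindex $parts 0]]]
        if {$tag eq ""} { continue }
        set q 1.0
        foreach param [lrange $parts 1 end] {
            if {[regexp {^\s*q\s*=\s*([01](?:\.[0-9]+)?)\s*$} $param -> value]} {
                set q $value
            }
        }
        lappend ranked [list $tag $q]
    }
    # Highest quality value first; lsort is stable, so the order in
    # the header breaks ties.
    set ranked [lsort -real -decreasing -index 1 $ranked]
    foreach entry $ranked {
        lassign $entry tag q
        if {$q <= 0} { continue }
        # Reduce a tag such as "en-gb" to its primary language, "en".
        set lang [lindex [split $tag "-"] 0]
        if {[lsearch -exact $available $lang] >= 0} {
            return $lang
        }
    }
    return $default
}
```

With the header from the request shown earlier, negotiate_language_sketch "en-GB, en" {fr en} fr returns en. A header that names only unavailable languages yields the default.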

A site may not be able, or may not choose, to serve content in a user's preferred language. In some cases, a site may wish to serve a page in multiple languages. For example, part of the content may be in a user's preferred language and part may be in the language of the group that owns the page. We need to provide an API for determining in which language to serve a page that can be sensitive to the page context.

Because we need to run this procedure before serving each page on the site, it is convenient to return other details besides the language to use. For example, we may want to return other properties that can be used to localize the content on a web page, such as timezone and locale. For this reason, our procedure is separate from the language API we describe later. Here is its signature and a short description of how it is invoked:

ad_locale context property

This procedure returns the value of a locale-specific property within the context of a web request. Possible values for context are user (the current logged-in user) or community (the group that owns the requested web page). Possible values for property are language, locale or timezone.
A programmer will need to call the ad_locale procedure before any content is generated in response to a request. For performance reasons, we populate the data structure that lies behind ad_locale the first time a user requests a page or a group's content is accessed. We accomplish this by using a pre-authorization filter that runs for every web page that gets requested. For each possible context, the filter creates a global variable containing a set of properties. Adding new contexts or properties is easily done by either adding a new global variable or an additional key to the set of properties.

The programmer can determine a user's language as follows:

set user_lang [ad_locale user language]
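
To make the idea of one global variable per context more concrete, here is a rough sketch in plain Tcl. It is not the ACS implementation: the procedure names, the array names and the sample property values are all hypothetical, and a real filter would fill the arrays from the request and the database rather than from constants.

```tcl
# Stand-in for the pre-authorization filter: one global array per
# context, each holding a set of properties (sample values only).
proc locale_filter_sketch {} {
    global locale_user locale_community
    array set locale_user      {language fr locale fr timezone Europe/Paris}
    array set locale_community {language en locale en timezone America/New_York}
}

# Hypothetical equivalent of: ad_locale context property
proc ad_locale_sketch {context property} {
    # Map the context name onto its global array.
    upvar #0 locale_$context props
    if {![info exists props($property)]} {
        error "unknown context or property: $context $property"
    }
    return $props($property)
}

locale_filter_sketch
set user_lang [ad_locale_sketch user language]
```

Adding a new context in this sketch means adding another array, and adding a new property means adding another key, which mirrors the extension mechanism described above.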

Message catalog

In a static web site, we can translate the content on a page by page basis. But on a dynamic site, the content on a page can be different each time it is displayed, because it is generated by a program. In order to translate this content, we need to divide it up into shorter elements that are at the same level of granularity as handled by the program.

A message catalog is a list of the text strings that the application will display. For example, if the program needs to display a phrase such as Welcome Philip Greenspun, the string that gets stored in the message catalog is Welcome. The name Philip Greenspun is dynamically generated from the database.

We maintain a message catalog for each language that the web site uses. A programmer can look up a string in the appropriate message catalog using a key. On a typical web page there will be dozens of strings that need to be displayed. The catalogs need to be very efficient at doing lookups to return the translated content.

A typical web page contains the text to display interposed with HTML tags. Ideally, each chunk of text between HTML tags should be a single message in the catalog. This makes it simpler to separate the content from the markup tags so that translators don't have to enter HTML when they make their entries in the catalog. A web page designer who is using a message catalog cannot expect to have the same freedom to arrange content as he would in a site that was displayed in only one language. He faces the following constraints:

  • Content retrieved from the database cannot be interposed with content that comes from the message catalog. Imagine a shopping site that is selling different types of wine, such as red, white and rosé. The types are stored in the database and then displayed on a web page. A page designer must be careful not to specify a content string such as red wine where the first word is taken from the database and the second is a message catalog lookup. In French, this would display as rouge vin which is an incorrect translation. It should be vin rouge.

  • For best performance, we need to ensure that there are as few message catalog lookups as possible per web page. One way to do this is to use the templating module, supplied as part of the ACS toolkit, which can cache partial pages under certain conditions. We can thus ensure that the only parts of the page that require lookup are those that are dynamically generated for each request.

  • Designers should strive to have big chunks of text between HTML tags so that there will be as few catalog lookups as possible per page. Each catalog lookup requires, at a minimum, a procedure call and a lookup in a hash table. If performance becomes critical, we could avoid message catalog lookups entirely by coding a different web page for each language. However, this becomes much more expensive to maintain.
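
The word-order constraint in the first point can be demonstrated with a few lines of plain Tcl. This is only an illustration: the array, the keys and the procedure names are invented for the example, and both fragments come from the same array here, whereas in the scenario above one of them would come from the database.

```tcl
# Hypothetical translations: single words and whole phrases.
array set wine_catalog_sketch {
    en,red        "red"
    fr,red        "rouge"
    en,wine       "wine"
    fr,wine       "vin"
    en,wine_red   "red wine"
    fr,wine_red   "vin rouge"
}

# Wrong: glue two fragments together in English word order.
proc wine_label_concat_sketch {lang} {
    global wine_catalog_sketch
    return "$wine_catalog_sketch($lang,red) $wine_catalog_sketch($lang,wine)"
}

# Better: translate the phrase as a single unit.
proc wine_label_phrase_sketch {lang} {
    global wine_catalog_sketch
    return $wine_catalog_sketch($lang,wine_red)
}
```

Here wine_label_concat_sketch fr returns rouge vin, while wine_label_phrase_sketch fr returns vin rouge.
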
Let's now look at our proposed implementation for a message catalog using AOLserver, Oracle and Tcl. First, here is the data model. Messages are stored in the lang_messages table:
CREATE TABLE lang_messages (    
        key             VARCHAR(200),
        locale          REFERENCES ad_locales,
        message         CLOB,
        -- if the message is "registered" then it exists in a file
        -- so we should not allow editing via the web interface
        registered_p    CHAR(1) CHECK(registered_p IN ('t','f')),
        PRIMARY KEY (key, locale)
);
Each message has a unique key and locale combination. You might be surprised to see that we claim to be storing messages for each language, but the column is actually named locale rather than language. A locale can specify language, country and dialect. Specifying dialect, however, is needlessly fine-grained for most web sites. We would likely not want to provide different translations for countries with the same language such as England and the United States.

In fact, we have chosen to populate the ad_locales table with locale identifiers that omit the dialect and country and use just a language identifier as its primary key. Examples include en (English) or fr (French). Consequently, everywhere else within the web site that we need to specify a locale in a database table, we will use the same level of granularity. A full list of possible language codes, which we can use in the primary key column of the ad_locales table, is defined in the ISO 639 standard.

The procedures for inserting and retrieving messages to and from the catalog are described below:

ad_lang_message_lookup lang key

This procedure retrieves a message from the catalog given a key and a locale. We provide _ (underscore) as a synonym for ad_lang_message_lookup, so that our API follows a naming convention similar to that used by GNU's gettext tools.

ad_lang_message_register lang key message

This procedure inserts a message into the catalog. Callers to this procedure need to specify a key and a locale for the message. We provide _mr as a synonym for ad_lang_message_register.
The implementation of ad_lang_message_lookup needs to pay great attention to performance, since this could be called many times per page request. The messages are cached in the server's memory in a set structure. There is one set for each language. Each message lookup thus requires a procedure call and a set lookup, but the overhead of a database query is normally avoided once the data structure has been initialized.
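
The caching behaviour described above can be sketched as follows. This is a simplified stand-in and not the ACS code: ordinary global Tcl arrays take the place of the server's shared sets, a small array plays the role of the lang_messages table, and all names are hypothetical. Returning an empty string for a missing key is likewise an assumption made for this example.

```tcl
# Stand-in for the lang_messages table; a real server would query Oracle.
array set fake_db_sketch {
    en,hello_world "Hello world"
    fr,hello_world "Bonjour monde"
}
set fake_db_queries 0

# Pretend to fetch every message for one language from the database.
proc fake_db_load_sketch {lang} {
    global fake_db_sketch fake_db_queries
    incr fake_db_queries
    set rows {}
    foreach name [array names fake_db_sketch "$lang,*"] {
        set key [string range $name [expr {[string length $lang] + 1}] end]
        lappend rows $key $fake_db_sketch($name)
    }
    return $rows
}

# Hypothetical equivalent of: ad_lang_message_lookup lang key
proc catalog_lookup_sketch {lang key} {
    # One cache array per language.
    upvar #0 catalog_cache_$lang cache
    # Load the whole language once, on first use.
    if {![array exists cache]} {
        array set cache [fake_db_load_sketch $lang]
    }
    if {[info exists cache($key)]} {
        return $cache($key)
    }
    # Assumption: an unknown key yields an empty string.
    return ""
}
```

After the first lookup for a language, further lookups in that language cost one procedure call and one array access. In this sketch the counter fake_db_queries increases only when a language is loaded for the first time.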

How does a page designer include translated strings in the HTML for a web page? A programmer can use the procedure above to return a string within a Tcl script. For example, the code below retrieves the message with key hello_world for the French language:

_ fr hello_world
However, we also need to provide an interface to the message catalog for web page designers who are not programmers. Since most modern web servers have a mechanism for specifying custom markup tags, the natural choice for our interface is to provide web page designers with a custom tag for displaying a translated string. We name our new tag <TRN> and it is used as follows:
<TRN symbolic="hello_world">Hello world</TRN>
When the server's parser encounters this tag, it returns the results of a message catalog lookup for the key hello_world in the current user's language. We can configure the language to depend on the context in which this page is served. By default, the context is user but we can set it using the type property. In this case, the ADP parser returns the result of the catalog lookup in the language of the group that owns the web page:
<TRN type="community" symbolic="hello_world">Hello world</TRN>

The string between the opening and closing tags is not printed out on the web page that gets displayed to the user. It serves two purposes. First, it allows the web page designer to identify what is going to get printed out to the user. Second, it is a method for entering a translation for the specified key into the message catalog.

The first time the tag is encountered by the server, the content of the tag is automatically registered in the Oracle database and the server's cache. By default, English is assumed to be the language of the translation. We could specify a different language for the message text that gets registered with the following code:

<TRN lang="fr" symbolic="hello_world">Bonjour monde</TRN>
There is another way that an entry can be added to the message catalog. This is by calling the ad_lang_message_register procedure described earlier. Programmers can do this within a Tcl library that gets sourced on server startup. This must be done for all messages that are not present in ADP pages. Examples would include the results of procedures that return strings that get displayed on a web page. These procedures should return a message catalog key that can get translated into the user's language before display on a page.
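
The behaviour of the <TRN> tag described above (register the enclosed text on first sight, then display the translation for the language of the chosen context) can be approximated in plain Tcl. The handler below is a sketch under several assumptions: it receives the tag's attributes as a Tcl dict rather than through the server's tag-registration mechanism, the catalog is a global array, the context languages are hard-coded sample values, and falling back to the registered text when no translation exists is our own choice. All names are hypothetical. It requires Tcl 8.5 or later for dict.

```tcl
# Sample context languages: the user reads French, the group uses English.
array set trn_context_lang {user fr community en}

# Hypothetical handler for <TRN ...>body</TRN>.
proc trn_tag_sketch {attrs body} {
    global trn_catalog trn_context_lang
    set key [dict get $attrs symbolic]
    # type selects the context; the default is user.
    set type user
    if {[dict exists $attrs type]} { set type [dict get $attrs type] }
    # lang names the language of the body text; the default is English.
    set body_lang en
    if {[dict exists $attrs lang]} { set body_lang [dict get $attrs lang] }

    # Register the body the first time this key is seen in that language.
    if {![info exists trn_catalog($body_lang,$key)]} {
        set trn_catalog($body_lang,$key) $body
    }
    set display_lang $trn_context_lang($type)
    if {[info exists trn_catalog($display_lang,$key)]} {
        return $trn_catalog($display_lang,$key)
    }
    # Assumption: show the registered text until a translation exists.
    return $trn_catalog($body_lang,$key)
}
```

Calling trn_tag_sketch {symbolic hello_world} "Hello world" registers the English text. Once trn_tag_sketch {lang fr symbolic hello_world} "Bonjour monde" has also run, the French text is returned for the user context and the English text for the community context.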

A few problems can arise with this message catalog. The code that registers the message can be contained within conditional statements and never be executed. In that case, it will not be stored in the database and flagged for translation. Or a programmer could write a procedure that returns a key to the message catalog without ensuring that the message has been registered. Programmers must therefore be careful to ensure that all their messages have been registered properly. The design of our system does not implement a mechanism to enforce proper registration of messages in the catalog.

An additional problem is that there is no automatic means for a programmer to ensure that a message catalog key is unique. It is possible to have different ADP pages and Tcl libraries with different messages provided for the same key. The programmers who write the HTML pages and the Tcl code must be careful not to duplicate keys. We help ensure this by using a unique prefix to the key for each module so that a programmer can have his own namespace. For example, the events module will have all message keys begin with events. so that they will never conflict with the global keys.
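
One way to catch duplicate keys early is to have the registration code refuse to overwrite an existing message with different text. The system described here does not do this; the following is only a sketch of the idea in plain Tcl, with a hypothetical procedure name and a global array standing in for the catalog.

```tcl
# Hypothetical registry that prefixes keys with the module name and
# rejects a second, different message for the same key and language.
proc register_message_sketch {module lang key message} {
    global registered_sketch
    # Build the namespaced key, for example "events.title".
    set full_key "$module.$key"
    set slot "$lang,$full_key"
    if {[info exists registered_sketch($slot)]
            && $registered_sketch($slot) ne $message} {
        error "conflicting message for key $full_key in $lang"
    }
    set registered_sketch($slot) $message
    return $full_key
}
```

Registering the same text twice is harmless in this sketch, so a library that is sourced again after a change does not fail. Only a genuinely different message for an existing key raises an error.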

Data Model

Columns in the database may need to be translated into each language that will be used on the web site. The approach we use to solve this problem is the same in each case. We split all tables into language-dependent and language-independent tables.

As an example, we will look at the country codes table which is part of the core ACS toolkit. Here is a version of the original table.

CREATE TABLE country_codes (
        iso             CHAR(2) PRIMARY KEY,
        country_name    VARCHAR(150) NOT NULL,
        telephone_code  VARCHAR(10)
);
The iso column is the primary key and is thus included in each table so that we can join on this column. The telephone code, which is the country's international dialing code, is language independent because it is the same in each locale. For example, the international dialing code for the UK is 44, whatever language the site is displayed in. The country name is language dependent: United States in English becomes États-Unis in French. Here are the tables created after we have split country_codes into language dependent and independent parts. By convention, we've used the suffix _data to denote a language independent table and _lang to denote the language dependent table.
CREATE TABLE country_codes_data (
        iso             CHAR(2) PRIMARY KEY,
        telephone_code  VARCHAR(10)
);

CREATE TABLE country_codes_lang (
        iso             REFERENCES country_codes_data, 
        locale          REFERENCES ad_locales, 
        country_name    VARCHAR(150) NOT NULL,
        UNIQUE(locale, iso)
);
As a convenience, we create a view with the same name as the original table.
CREATE OR REPLACE VIEW country_codes AS
       SELECT ccd.iso,
              ccd.telephone_code,
              ccl.locale,
              ccl.country_name
         FROM country_codes_data ccd,
              country_codes_lang ccl
        WHERE ccd.iso = ccl.iso;
Using this view, we need only make a small modification to our database queries to ensure that the results appear in the correct language. Before introducing the multilingual data model, we might have a query that looked like this.
SELECT country_name FROM country_codes;
We now replace this query with the following which ensures that the country name is translated.
SELECT country_name FROM country_codes WHERE locale = '$user_locale';
The example table we have chosen, country_codes, is not subject to any insert, update or delete statements in the ACS toolkit. We populate this table when we load the data model and do not need to provide an interface to change a country name in the database.

However, in many cases a user of the web site will be permitted to modify language dependent columns of tables. All the insert, update and delete statements that affect these tables will need to be modified so that they refer to the user's locale when selecting the rows of the table that need to be modified. In some cases, when a row is inserted or updated, it will be necessary to create or change translations for every language. An example where this may be required is the user_groups table. Lists of groups appear on various pages in the ACS. If a new group is created, it will be necessary to have the group name translated so that it can be read in each language.

Language-aware sorting

Lists of textual data displayed on a web page often need to be sorted in alphabetical order. But each language can use a different sorting sequence for characters. Oracle provides the NLSSORT function that carries out linguistic sorting. An example using this function for a Spanish language sort is shown below.
  SELECT key
    FROM testsort 
ORDER BY NLSSORT(key, 'NLS_SORT = XSpanish');
Rather than find and replace each query that uses an ORDER BY, we originally wrote all queries to use a Tcl procedure, ad_lang_sort, that returns an appropriate call to NLSSORT for a particular database column and language.
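
We do not reproduce ad_lang_sort here, but a procedure of that kind could look roughly like the sketch below. The procedure name, the mapping from language codes to Oracle sort names, and the fallback to a binary sort for unlisted languages are assumptions made for this example. The code is plain Tcl and only builds a string; it does not talk to the database.

```tcl
# Hypothetical mapping from the site's language codes to NLS_SORT names.
array set nls_sort_names_sketch {
    es XSpanish
    fr French
    de German
}

# Return an NLSSORT expression for use in an ORDER BY clause.
proc lang_sort_sketch {column lang} {
    global nls_sort_names_sketch
    # Accept only plain (optionally qualified) column names, since the
    # result is pasted into a SQL statement.
    if {![regexp {^[A-Za-z_][A-Za-z0-9_.]*$} $column]} {
        error "invalid column name: $column"
    }
    if {[info exists nls_sort_names_sketch($lang)]} {
        set sort $nls_sort_names_sketch($lang)
    } else {
        # Assumption: unlisted languages fall back to a binary sort.
        set sort Binary
    }
    return "NLSSORT($column, 'NLS_SORT = $sort')"
}
```

A query can then be written once and sorted according to the language in effect, for example: "SELECT key FROM testsort ORDER BY [lang_sort_sketch key es]", which expands to the Spanish example shown above.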

Translation

It is easy to forget that someone must go through a multilingual web site and translate every single item of content. We need to provide an interface that can be used by translators to ensure that all necessary parts of the web site are translated into each language.

First, we need to record all database tables that contain columns that need to be translated. We do this in this table:

CREATE TABLE lang_translate_columns (
        column_id               INTEGER PRIMARY KEY,
        on_which_table          VARCHAR2(50),
        on_which_column         VARCHAR2(50),
        required_p              CHAR(1)
                                CHECK(required_p in ('t','f')),
        UNIQUE (on_which_table, on_which_column)
);
We can use the required_p flag to indicate whether all entries in a column must be translated for the site to function. Ideally, we would also maintain dependency information so that we automatically identify things that need to be retranslated, though this is not done in the data model shown.

We also need to maintain a list of all entries in the message catalog and flag the ones which require translation. It would be possible to present a translator with a web page which simply listed each string in the catalog and provide a text input box for entering the translation. However, it can be hard to provide a good translation when a message is shown out of context.

A better way to do this is to provide a special view of the web site to translators. We can modify the ad_lang_message_lookup procedure to display a hyperlink to a translation page beside each message. A translator can simply browse the web site and click on the appropriate link to get a form to enable him to translate a string.
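
A lookup procedure modified in this way could behave like the sketch below. It is plain Tcl with hypothetical names: the translation page URL and its query parameters are invented for the example, and the sketch assumes that keys and language codes contain no characters that need URL encoding.

```tcl
# Hypothetical wrapper: append a link to a translation form when the
# page is being viewed in translator mode.
proc translator_lookup_sketch {lang key message translator_p} {
    if {!$translator_p} {
        return $message
    }
    # The URL and its parameters are made up for this illustration.
    set url "/admin/translate?lang=$lang&amp;key=$key"
    return "$message <a href=\"$url\">\[translate\]</a>"
}
```

Ordinary visitors see only the message text, so the extra markup costs them nothing. Translators see each string in its place on the page, which gives them the context that a flat list of catalog entries lacks.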

Code example

Below is an example of an AOLserver Tcl page that creates an entry in the message catalog for four languages: English, French, Spanish and German. We use Babelfish as a quick and dirty way to provide translations for our original message text which is Hello world. The script displays a web page showing the translations taken from the message catalog:
# procedure to translate strings using Babelfish
proc babel_translate { msg lang } {
    set marker "XXYYZZXX. "
    set qmsg "$marker $msg"
    set url "http://babel.altavista.com/translate.dyn?doit=done&BabelFishFrontPage=yes&bblType=urltext&url="
    set babel_result [ns_httpget "$url&lp=$lang&urltext=[ns_urlencode $qmsg]"]
    set result_pattern "$marker (\[^<\]*)"
    set msg_tr "** Babelfish TRANSLATION ERROR **"
    regexp -nocase $result_pattern $babel_result ignore msg_tr
    regsub "$marker." $msg_tr "" msg_tr
    return [string trim $msg_tr]
}                               

# set a test message and add it to the catalog
set msg "Hello world"
_mr en "hello_world" $msg

# add the translations to the catalog
_mr fr "hello_world" [babel_translate $msg en_fr]
_mr es "hello_world" [babel_translate $msg en_es]
_mr de "hello_world" [babel_translate $msg en_de]

# return a web page that displays the translations
ns_return 200 text/plain "
English:  [_ en hello_world]
Français: [_ fr hello_world]
Español:  [_ es hello_world]
Deutsch:  [_ de hello_world]"
If you have the code described in this article, you can run the above Tcl script to get the following results:
English:  Hello world
Français: Bonjour monde
Español:  Hola mundo
Deutsch:  Hallo Welt
We have to wait for three requests to Babelfish before the web page is returned, so this page might be quite slow.

More information

The Accept-Language header section of the HTTP specification http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.4
Gnu gettext tools http://www.gnu.org/manual/gettext/html_mono/gettext.html
AOLserver NSV sets http://aolserver.com/doc/3.0/nsv.txt
ACS templating module http://www.arsdigita.com/doc/templates/
Apache Content Negotiation http://www.apache.org/docs/content-negotiation.html
Go Translator http://translator.go.com/
Babelfish http://babel.altavista.com/

asj-editors@arsdigita.com

Reader's Comments

Using an automatic translator like babelfish is very useful, but you will need to warn the user as the translation is shocking!

I have tried viewing Japanese pages through these online translators and they are often absolutely meaningless (and very funny!) in English. For example here's Yahoo Japan in English.

Also here in Japan the use of web based translators is already very common, particularly Excite's. Be careful to offer the original language as well, rather than forcing the user to see your page through a translator.

If you want to look professional then you would be better off using a software localization company. And while I'm about it I will shamelessly plug the company I work for in Tokyo! Intersoft who offer localization into Japanese. :)



-- Matthew Lock, February 4, 2001