Character Set Encoding

Part of an article on Building a Multilingual Web Service Using the ACS, by Henry Minsky (hqm@ai.mit.edu)




Overview of Character Set Handling in AOLserver and ACS

The ISO-8859-1 character set, also known as ISO Latin 1, can handle most characters in Western European languages. If you are building a multilingual web site that uses only these languages, you can avoid character set problems, provided you use a programming language and database that handle ISO Latin 1 correctly. AOLserver's embedded scripting language, Tcl, uses Unicode internally in recent versions, and versions of AOLserver after the ad5 release can automatically convert between ISO Latin 1 and Unicode when data is transferred to and from the client.

If, however, you need to use languages from outside Western Europe or you need to use characters that are not part of ISO Latin 1, such as the Euro currency symbol, you will need to deal with character sets. The following areas may require character set conversion:

  • Loading data from the filesystem into AOLserver/Tcl
  • Exchanging data between AOLserver and Oracle
  • Delivering content to web browsers
  • Processing text input from users
We will discuss each of these areas, and then describe a solution for handling character sets in a uniform way: a character set encoding management scheme for content in the ACS.

Unicode in AOLserver and ACS

Our solution to managing different character sets is to convert content to Unicode as soon as possible, and keep it in Unicode for as long as possible. Unicode is a character set that can represent most of the world's writing systems. By using a Unicode-centric approach, we reduce the complexity of trying to manage content in many different character set encodings throughout the system.

AOLserver 3.0 uses Tcl 8.3 as its internal scripting engine. Tcl 8.3 represents strings internally in Unicode using the UTF-8 encoding. Tcl 8.3 has support for conversion between about 30 common character set encodings, and new encodings can be added to the system library if needed.
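
For reference, here is a minimal sketch of the standard Tcl commands involved (encoding names, encoding convertto, and encoding convertfrom); the string value is just an illustrative example.

# List the character set encodings this Tcl build knows about
puts [encoding names]

# Convert a Tcl (internally UTF-8) string to a euc-jp byte sequence,
# then convert those bytes back into a Tcl string
set original "some text"
set eucBytes [encoding convertto euc-jp $original]
set roundTrip [encoding convertfrom euc-jp $eucBytes]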

In addition, ArsDigita has augmented the current AOLserver 3.0 API with functions which perform character set encoding conversions on HTTP connection data.

Unfortunately, Unicode is only beginning to be widely supported. So for the near future, content will have to be delivered to browsers in legacy encodings, publishers will want to author content in specific encodings, and developers will want to edit their code using their favorite tools which only support certain encodings. In making decisions about how we manage content and character set encodings in ACS, we have to decide how accommodating to be towards these different "legacy" users of the system, while trying to reduce the complexity of the publishing environment.

It is important to understand that encoding a text string as Unicode does not relieve us of the task of representing the language or languages of the content it contains. Unicode, by design, does not attempt to represent what language a string of characters belongs to.

When we need to know what language a string contains, say to sort it correctly or to present the correct user interface in a specific language, we must implement some mechanism to associate a language or locale with the string. That is the job of the internationalization and localization facilities of the system. The Language section of this article describes the approach for representing and storing this language and locale information for database fields and text strings.

Definitions

Character
A character is an abstract entity, such as "LATIN CAPITAL LETTER A" or "JAPANESE HIRAGANA KA".

Coded Character Set
A Coded Character Set (CCS) is a mapping from a set of characters to a set of integers, as defined in RFC 2277 and RFC 2130.

Character Encoding Scheme
A Character Encoding Scheme (CES) is a mapping from a CCS (or several CCSs) to a set of bytes, as defined in RFC 2277 and RFC 2130. UTF-8 is an example of a character encoding scheme.

Character Set
This document uses the term character set or charset to mean a set of rules for mapping from a sequence of bytes to a sequence of characters, such as the combination of a coded character set and a character encoding scheme; this is also what is used as an identifier in MIME "charset=" parameters, and registered in the IANA charset registry. In this document we will use the terms charset and encoding somewhat interchangeably; using "encoding" when referring to Tcl/AOLServer character set encoding conversions, and "character set" when talking about the MIME and HTTP content-type information.

ISO-8859-1
ISO-8859-1, also known as ISO Latin 1, is the default character set for HTML. It uses a single byte character set encoding, and can be used to represent text in most Western European languages. ISO-8859-1 is a superset of 7-bit ASCII.

Unicode
Unicode defines a coded character set (also known as UCS, the Universal Character Set) which encompasses most of the world's writing systems. The Unicode Standard, Version 3.0, is code-for-code identical with International Standard ISO/IEC 10646.

UTF-8
UTF-8 is a common character encoding scheme for representing Unicode. UTF-8 uses a variable-length encoding, where a single character can be represented by one to six bytes. UTF-8 has some features which make it convenient for representing Unicode on today's operating systems; one of the primary features is backward compatibility with ordinary 7-bit ASCII text. ISO-8859-1 text, however, is not valid UTF-8, because ISO-8859-1 uses all eight bits of each byte for a single character and so leaves no room for the variable-length encoding.
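
As a small illustration (a Tcl check, not part of the ACS), the character é occupies a single byte, 0xE9, in ISO-8859-1, but two bytes, 0xC3 0xA9, in UTF-8:

set bytes [encoding convertto utf-8 "\u00e9"]   ;# é as UTF-8 bytes
binary scan $bytes H* hex
puts $hex                                       ;# prints c3a9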

Loading Data from the Filesystem into AOLServer

While using Unicode internally to encode strings, Tcl uses a default "system" encoding to communicate with the operating system. That is, it will convert between UTF-8 and the system encoding for passing string data back and forth from the underlying operating system; data such as filenames and other system call arguments and return values.

This description of the system encoding process comes from the Tcl Internationalization How-To:

Tcl attempts to determine the system encoding during initialization based on the platform and locale settings. Tcl usually can determine a reasonable default system encoding based on these settings, but if for some reason it cannot, it uses ISO 8859-1 as the default system encoding.

You can override the default system encoding with the encoding system command. The Internationalization How-To recommends that you avoid using this command if at all possible: if you set the default system encoding to anything other than the actual encoding used by your operating system, Tcl will likely find it impossible to communicate properly with your operating system.
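
For example, you can inspect (and, if you really must, override) the system encoding as follows:

# Report the encoding Tcl is using to talk to the operating system
puts [encoding system]

# Override it (discouraged, per the Internationalization How-To)
encoding system iso8859-1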

When dealing with text strings in international character sets, you must take care when you use Tcl or AOLserver facilities which communicate with the operating system. Trying to create a file whose name contains Japanese characters on a Solaris machine, for example, would cause trouble, since the filename cannot be represented directly in the operating system's character set. In this particular case an extra layer of encoding using a 7-bit compatible encoding such as ISO-2022-JP might be used, but you would have to make sure you explicitly encoded and decoded strings in this character set when they were passed to the operating system. Windows NT, by contrast, uses Unicode internally and so will support such filenames. To avoid portability issues, restricting filenames to 7-bit ASCII is recommended.

Character Set Encoding in Tcl Script Files

Whenever Tcl attempts to load a script or library file from disk into the interpreter for evaluation, it must convert that file into UTF-8 internally. To do that correctly, it must be told what character set encoding the source file is in.

There is one exception to this process: during server bootstrap, the .tcl startup files are read directly with no encoding conversion, so these files must be in UTF-8.

ACS and AOLserver load Tcl files in a number of places. At server startup, Tcl library files are loaded from certain directories in a bootstrap process, and then the ACS package loader takes over. At runtime, .tcl files may be sourced dynamically to service URL requests from HTTP clients.

In the common case, a Tcl file is loaded using the source command. This will read the file in using the Tcl system encoding. That will generally be ISO-8859-1, unless someone has explicitly set it differently. However, when developing applications in other languages, we may want to author Tcl files containing text strings in Unicode UTF-8 or in other character sets. In that case we must explicitly tell Tcl what encoding conversion to use when loading the file.

To load a file in another encoding, Japanese EUC in this example, the following Tcl code can be used to set the file channel encoding:

set fd [open "app.tcl" r]
fconfigure $fd -encoding euc-jp
set jpscript [read $fd]
close $fd
eval $jpscript
In general, of course, we don't want web site developers to be doing this kind of conversion manually on individual files. Some approaches to a framework for providing automatic encoding conversion will be discussed in the ACS Encoding Management section.

Character Set Encoding in HTML Files

AOLserver is capable of delivering static content files from disk directly to a browser with no character set conversion or interpretation. This may be useful in some cases; however, the general ACS content-processing methodology is to allow various dynamic transformations on static content, such as appending general comments, headers and footers, or other dynamic modifications. In any case where content will be loaded into Tcl, the file must be converted into UTF-8 internally, so we need a way to determine the file's source encoding before loading it.

Thus, the same encoding questions which apply to Tcl script files also apply to HTML or any other content that gets loaded into Tcl for processing: if the file is not in the system encoding, how do we specify its encoding to the ACS so that it can be converted to Unicode correctly when it is read from disk?

See the sidebar on HTML Character Entity References for ISO-Latin-1 and Unicode Characters for information on displaying characters from international character sets within an HTML document.

Character Set Encoding in Template (ADP) Files

The AOLserver ADP parser assumes that a template is in UTF-8. If the content has been loaded into a Tcl string, then it will already be in UTF-8 (assuming the proper encoding conversions were done to produce a legal UTF-8 string in the first place). However, if ns_adp_parse is reading a file directly from disk (with the -file option), then we must ensure that the ADP file is in UTF-8 encoding.

The ACS currently has several different mechanisms for handling templates. There's the Dynamic Publishing system, the new Document API, and various project-specific templating mechanisms in use on different sites.

We need a mechanism for the ACS which lets us specify the source language and encoding of template documents, and also possibly independently specify the desired output character set for a template document. The required encoding conversions can then be performed automatically by the template handling section of the ACS request processor.

Delivering Content to Browsers

When a URL is requested from the web server, via an HTTP GET or POST request, the web server returns the requested content along with some HTTP/MIME headers. The most important header is Content-Type, which has a common value of text/html for HTML documents. Here is a stripped-down HTTP response to a GET request from an ACS server:

HTTP/1.0 200 OK
MIME-Version: 1.0
Content-Type: text/html
The Content-Type header can contain a parameter which specifies the character set, as in the example below. Adding a Content-Language header is a good idea as well, as it provides more information to the browser on how best to present the document.
HTTP/1.0 200 OK
MIME-Version: 1.0
Content-Type: text/html; charset=euc-jp
Content-Language: ja
The character set parameter tells the client what encoding the content is in. According to the HTTP specifications, if no character set parameter is sent, the client should assume ISO-8859-1. In practice on the Web today, however, many servers of non-ISO-8859-1 content neglect to send the character set encoding information, and it is up to the browser to guess the correct encoding, or up to the user to manually set the browser to what looks like the correct encoding.

Note on Browser Autodetect of Character Set Encodings

Particularly with respect to the Asian languages, there may be multiple encodings in common use for documents. Japanese for example has ShiftJIS, EUC, ISO-2022-JP, and possibly Unicode encodings. Many browsers have an auto-select mode where they will try to guess the charset encoding of a document. There are algorithms which can quickly determine if a document is unambiguously in a particular Japanese or Chinese encoding.

To allow browsers to properly render our content, we must make sure that a character set encoding parameter is always sent with every web page we serve.

In many cases the author of content in a non-ISO-8859-1 character set will include a HTML META tag in the document body which specifies the content type and character set. For example

<meta http-equiv="content-type" content="text/html;charset=x-sjis">

Many browsers will use the META tag to help determine the page encoding, although if it conflicts with a charset specified in the Content-Type header, then the results will be unpredictable.

The following HTTP GET from Yahoo Japan's web site shows that no character set information was included in the HTTP header. None was present in the HTML content itself either, via a META tag. The encoding of the document was in fact Japanese EUC, but the browser must guess this heuristically in order to correctly render the page.

telnet www.yahoo.co.jp 80
Trying 210.140.200.16...
Connected to www.yahoo.co.jp.
Escape character is '^]'.
GET / HTTP/1.0

HTTP/1.0 200 OK
Content-Type: text/html
Content-Length: 21673

Managing the Output Encoding

AOLserver has several pathways for returning data to the HTTP connection.

Serving files directly from disk can be done verbatim; the file is sent byte-for-byte with no translation performed by the server. The AOLserver encoding API addresses those cases, and describes how to configure the MIME content type tables to help identify the character set encoding to a client. In that case no encoding translation is required.

The AOLserver API functions for returning content to a network stream are:

ns_writefp 
ns_connsendfp 
ns_returnfp 
ns_respond 
ns_returnfile 
ns_return (and variants like ns_returnerror) 
ns_write 
We extended the AOLserver API to support explicit specification of a character set encoding conversion for the output stream. For example, ns_return has been modified to inspect the Content-Type header, and if it finds a charset parameter, it will perform encoding conversion to that character set:
ns_return 200 "text/html; charset=euc-jp" $html

If no character set is specified, the output will default to the init parameter ns/parameters/OutputCharset, or ISO-8859-1 if no explicit default is specified.

The new API function ns_startcontent is used to explicitly set the conversion on the network output stream to a given charset encoding. It is used in conjunction with ns_write to manually set the output encoding for a document:

ReturnHeaders "text/html; charset=euc-jp"
ns_startcontent -charset "euc-jp"

Why not just send Unicode?

At this point you might ask if we could just encode all output in Unicode (UTF-8), thus relieving us of the task of deciding which output encoding to use. Ultimately that would be the most portable way for all documents to be delivered, greatly relieving the burden of tracking the numerous redundant character sets in use today.

However, Unicode is still a relatively new standard, and many browsers and other tools can not work with it yet. So we must provide support to encode content in the character sets that are in common use today.

Given this requirement, we must decide how we are going to build an API for developers and content publishers to specify the desired output encoding for individual documents or entire classes of documents.

Exchanging Data between AOLserver and Oracle

Text data can be passed between Oracle and Tcl without requiring any character set conversion if the database is configured to use UTF-8 encoding. Ideally, the UTF-8 character encoding is specified when the CREATE DATABASE command is run, though it is also possible to change a database's encoding after it has been created.
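
One quick way to verify the database configuration from AOLserver is to ask Oracle for its NLS_CHARACTERSET setting. The sketch below assumes a configured database pool and uses the standard ns_db API.

# Sanity check: report the character set the database was created with
set db [ns_db gethandle]
set row [ns_db 1row $db "select value from nls_database_parameters
                         where parameter = 'NLS_CHARACTERSET'"]
ns_log Notice "Database character set: [ns_set iget $row value]"
ns_db releasehandle $db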

You need to set the character encoding as an environment variable when you start up your Oracle client process. If you are using the multi-threaded server (MTS) option, the client processes are started when the Oracle server is started. If, however, you have configured your Oracle server to use a dedicated server-side process for each client, you can explicitly set its character set encoding to UTF-8 in a start-up script such as the following:

#!/bin/sh

. /etc/shell-mods.sh

TCL_LIBRARY=/home/aol30/lib/tcl8.3
export TCL_LIBRARY

NLS_LANG=AMERICAN_AMERICA.UTF8
export NLS_LANG

TZ=GMT
export TZ

exec `dirname $0`/nsd8x-i18n $*
The AOLserver nsd8x-i18n executable referenced above is a version of AOLserver compiled with the ArsDigita extensions for character set encoding support described in Character Encoding in AOLserver 3.0 and ACS.

You must also use a version of ArsDigita's Oracle driver with the patch for LOB fetches and variable width character sets (version 2.2 or later):

$ strings /home/aol30/bin/ora8.so | grep ArsDigita
ArsDigita Oracle Driver version 2.2
If your driver does not have this support, you may get the following error when sending multibyte characters from AOLserver to Oracle:
[05/Jun/2000:23:20:04][5128.11][-conn1-] Error: ora8.c:1398:ora_exec:error in `OCIStmtExecute()': 
       ORA-03127: no new operations allowed until the active operation ends

When loading the ACS data model into Oracle, the following script ensures that the character set is explicitly set to ISO-8859-1, so that any high-bit characters in the SQL code are properly converted to UTF-8.

#!/bin/sh
NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P1
export NLS_LANG
exec sqlplus $* < load-data-model.sql

Processing Text Input from Users

Arbitrary text data can be posted to the server in HTTP POST and GET requests. Unfortunately the HTTP protocol is very weak in the area of annotating user submitted data with character set information. This creates some problems in unambiguously converting user-submitted data into a common UTF-8 encoding.

Form data can arrive at the server in three ways:

  1. Query data in the URL path query string
  2. Form (POST) data in application/x-www-form-urlencoded encoding
  3. Form (POST) data in multipart/form-data encoding

URL Query String Data

A URL can contain arbitrary data in the query section of the path (the part after the first '?' character). Imagine a web server receives a GET request for a URL from a browser which looks like this:
http://hqm.arsdigita.com/i18n/examples/form-1.tcl?mydata=%C5%EC%B5%FE
Looking at just the structure of this URL, we cannot determine which charset the variable mydata is encoded in. We have to examine the content itself and make our best guess. It happens that this was the kanji for "Tokyo" encoded in Japanese EUC, but it could be a legal string in any number of character sets.

The HTTP specification mandates that text data in a URL query be URL-encoded, hence the %XX escape sequences; however, it does not say anything about how to specify what character set is being used.

In practice we have to rely on other contextual knowledge to figure out what character set query data is encoded in. Most browsers will look at the character set in the HTTP header of the last document they retrieved, or at the META tag if there is one, and set their current character set to that automatically. In that sense, the browser character set is set automatically to that of the last document viewed (usually the form that you are clicking 'submit' on). However, the user can usually override that character set choice by setting the encoding manually from a menu. If we have some way, implicit or explicit, of knowing the encoding of the form from which the GET or POST came, then we can properly decode the query strings.

Note on Autodetect of Character Set Data from Browsers

Under some circumstances a user's browser may POST or GET data in an encoding which is different from that which your application expects. At that point it will be impossible for AOLserver to correctly convert the data into UTF-8. In the case of Japanese, it is usually possible to heuristically guess the encoding, because the common Japanese encodings have enough redundancy that in even a small sample of text, some code sequences can be unambiguously assigned to a particular character set. Several software libraries are available to heuristically detect Japanese character sets. In order to use this mechanism for character set detection it would be necessary to modify the AOLserver URL and form data decoding process by adding a call-out to a user-configurable character set detection library.

Handling Simple URL and Form Data

When a simple HTTP GET or POST request comes in from a browser, query and form data have been URL-encoded. URL encoding is a content transfer encoding where a single byte may be encoded as a '%' character followed by two hexadecimal characters, such as %CF. The encoding is designed to be safe for 7-bit transmission channels, using only a safe common subset of ASCII printing characters. The request data needs to be decoded back into bytes, using the methods described above to discover the character set in use by the client browser.

Composing URL-Encoded Strings

In developing an application you may need to create hyperlinks in your documents, and encode data into the URL. The W3C recommends that UTF-8 be used to encode data within a URL. However, we have provided an API to allow different character encodings within URLs. The ns_urlencode function has been extended to accept a character set argument. Thus, you could call:
ns_urlencode -charset shift_jis $your_data
See Character Encoding in AOLserver 3.0 and ACS for more details on this API function.

Decoding Form Data

The AOLserver encoding API defines the ns_urlcharset command which can be used to set the automatic decoding of submitted form data from a specified character set to UTF-8. In the current API, ns_urlcharset must be called before form data is requested for the first time from AOLserver.
ns_urlcharset "euc-jp"
...
...

ad_page_contract { foo } {
  Allow user to submit foo data
}
By default, form data is treated as if it were in the ISO-8859-1 encoding. This default can be changed with a configuration parameter (the URLCharset parameter shown later in this article).

Another new API function ns_formfieldcharset allows you to designate a field in the form data as containing the name of the form's character set encoding. This is then extracted and used as if it were passed to ns_urlcharset.
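
A minimal sketch of how this might be used, assuming the form carries a hidden field named formcharset holding the name of its character set:

# In the HTML form served to the browser:
#   <input type=hidden name=formcharset value="euc-jp">
#
# In the page processing the submission, before the form data is
# first accessed:
ns_formfieldcharset formcharset
set form [ns_getform]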

At this time the chance is pretty low of getting correct language or character set information in the HTTP headers from browsers submitting user data. Neither Internet Explorer 5 nor Netscape 4 specifies the character set for either GET or POST requests. Even if browsers did send character set headers, it would be unwise to depend on them, since they may be incorrect. Thus, we are not going to suggest trying to use browser HTTP language or charset header information to decode user-submitted data. Since HTTP and HTML have up to now been developed primarily in either the English-speaking world or in single-language communities, the internationalization protocols for HTTP have been loosely adhered to at best.

Deducing the correct character set in which to interpret user-submitted data is largely a matter of setting up your application context so that, as much as possible, you already know what language to expect.

Multipart form submission

An HTML form which contains the enctype=multipart/form-data parameter will encode the POSTed form in a MIME-style multipart message. RFC 2388 describes the charset handling for such submissions:

4.5 Charset of text in form data

   Each part of a multipart/form-data is supposed to have a Content-
   Type.  In the case where a field element is text, the charset
   parameter for the text indicates the character encoding used.

   For example, a form with a text field in which a user typed 'Joe owes
   €100', where € is the Euro symbol, might have form data returned
   as:

    --AaB03x
    Content-Disposition: form-data; name="field1"
    Content-Type: text/plain;charset=windows-1250
    Content-Transfer-Encoding: quoted-printable

This would seem to be a great help in allowing us to find the character set encoding of submitted form data. Unfortunately, most browsers do not obey this standard and will not provide any Content-Type header at all with the parts of a multipart form submission, much less character set information.

Below is the trace of a POST request from Internet Explorer 5.0 for a form encoded in Japanese EUC. Note that the only information provided in the multipart/form-data entry for the form variable mydata is the name of the form field itself; no character set or content-transfer-encoding field was sent. Note also that even though the page containing the form was being viewed in Japanese, the browser sent an Accept-Language header stating that it only accepts English.

<form method=post action=mpform-1.tcl enctype=multipart/form-data>
<input size=40 type=text name=mydata value=\"[util_quotehtml $mydata]\">
<input type=submit>
</form>

POST /i18n/examples/mpform-1.tcl HTTP/1.0
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, ...
Accept-Language: en-us
Content-Type: multipart/form-data; boundary=---------------------------7d02c699c0
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)
Host: hqm.arsdigita.com
Content-Length: 139
Pragma: no-cache


-----------------------------7d02c699c0
Content-Disposition: form-data; name="mydata"

[... RAW EUC-JP BYTE DATA HERE ...]
-----------------------------7d02c699c0--

Another warning: because of carelessness or bugs in browser implementations, if you deliver a form which contains default data for an input field, the form may come back in a different encoding than it was sent in, even if the user does not modify the input field at all. Thus it is wise to compose forms with a hidden "reference" field containing a string that can be decoded unambiguously by a character set auto-detection routine. This provides protection against this kind of "floating character-set" problem.

It is wise to use only 7-bit ASCII for the names of form variables. That way, there is little chance of improperly decoding them when processing the form submission, since most character set encodings support 7-bit ASCII as a subset.

Currently while ns_urlcharset will do automatic charset conversion on data submitted in application/x-www-form-urlencoded encoding, it does not do any automatic encoding conversion on data from multipart/form-data submissions. That is, the ns_urlcharset command will have no effect if the form enctype was multipart/form-data.

If you want to decode multipart posted data, you will need to call encoding convertfrom explicitly to convert each string to UTF-8 for use in your application.
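
A sketch of what that looks like for a single field, assuming the form variable is named mydata and the submitting page was served in euc-jp:

# Fetch the raw, unconverted multipart/form-data field and convert
# it from euc-jp bytes into Tcl's internal (UTF-8) representation
set form [ns_getform]
set raw [ns_set get $form mydata]
set mydata [encoding convertfrom euc-jp $raw]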

This restriction is currently present in order to allow uploaded file data to be read without character set conversion being performed. This is necessary if the form data is going to be stored verbatim directly into the database in BLOB form, without being interpreted in a particular character set.

Character Set Management for ACS

As of ACS version 3.3, absolutely every request is served through a unified request processor (rpp_handler in /packages/acs-core/request-processor-procs.tcl), described in detail in the request processor documentation.

We can solve the character set conversion problem by extending the request processor. It needs to be aware of the encodings of source documents and the desired output encodings, then automatically perform encoding conversion where needed. Below is an example of how this facility would appear to the site developers and publisher.

As a general principle, we need to have a convention for specifying the source encoding of a document, and for specifying the output encoding in which we deliver the document to a browser. In the common case, the source encoding and output encoding for a document will be the same, but we want to provide an easy way for a page author to override this. So, for example, it should be possible for a developer to create a Japanese template file in UTF-8, but have it served to browsers encoded in ShiftJIS.

For multi-lingual web sites, it is probably more appropriate to try to provide an API which lets authors work at the "language" level rather than the character set level. In most cases, the content authors will want to think in terms of managing content in different languages, and would like to have the character set encoding issues take care of themselves as much as possible. However we also want to provide a way to explicitly specify character set encodings to give developers full control over how their documents are interpreted.

File Name Conventions for a Multilingual Site

Maintaining a multilingual web site can be made easier by making use of the abstract URL facility to create the logical structure of the site, independent of language, and then having the system automatically dispatch to the correct language-specific file for a given user or session. Language identification is discussed in the Language section of this article.

In the common case, a document will be requested by a browser using an abstract URL. We need the request processor to combine this request with the connection's environment information, based on cookies, user and session id's, URL pathname, etc., and come up with the desired target language for the document.

Once we have computed the desired language for a document, the request processor can be extended to search for the file on disk which matches the language and perform the necessary encoding conversions in order to process and deliver the file to the browser.

Here are some techniques for structuring the files on a multi-lingual web site:

One Language per Template, with Abstract URLs

A site where each source file has an explicit language. For a given abstract URL, these files might be named by overloading filename extensions, for example:
foo.en_US.html  American HTML file
foo.en_GB.html  English HTML file
foo.fr.html  French HTML file
bar.ja.adp   Japanese ADP file
bar.el.adp   Greek ADP file
baz.ru.tcl   Russian Tcl file

The language code is a language_country pair, using abbreviations defined by the ISO 639 standard for language names and the ISO 3166 standard for country names.

When we are only dealing with a single locale for a given language, we can shorten the filename from the full locale to just the language name for convenience, for example fr instead of fr_FR. It has also been suggested that aliases for file language suffixes be assigned for convenience, such as .us for en_US and .gb for en_GB.
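
Under this naming convention, the request processor's file search might look something like the following sketch (rp_find_language_file is a hypothetical helper, not an existing ACS function):

proc rp_find_language_file { abstract_path language } {
    # Prefer a file tagged with the requested language...
    foreach ext { html adp tcl } {
        if { [file exists "$abstract_path.$language.$ext"] } {
            return "$abstract_path.$language.$ext"
        }
    }
    # ...then fall back to an untagged file
    foreach ext { html adp tcl } {
        if { [file exists "$abstract_path.$ext"] } {
            return "$abstract_path.$ext"
        }
    }
    return ""
}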

Single Fully-Multilingual Template Files

The other style of usage is to create multilingual applications where a single template page serves content in multiple languages, with all language-specific content generated using the Message Catalog and Localization API's. In that case, it does not necessarily make sense to assign a single language to a template or source file.

For a truly multilingual page, the source encoding would probably be either ISO-8859-1 or UTF-8. Since the actual language-specific content will be drawn from the message catalog or database, the page itself should be thought of as a language-independent logical template. The output encoding in which we send the content to the browser, however, will depend on what language we are serving in the request. So we must have an API for dynamically computing and setting the output encoding from within the document request itself.

Automatic Mapping of Language to Character Set

A configuration table can be created to map languages to character set encodings. This might look like:
en = iso-8859-1   English
fr = iso-8859-1   French
ja = sjis         Japanese ShiftJIS
ru = iso-8859-5   Russian
el = iso-8859-7   Greek
An API function called ad_charset_for_language could be used to select the appropriate character set for a given language.
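
A sketch of such a function, assuming the table above is kept in an nsv array (the array name and its initialization here are illustrative):

nsv_array set ad_language_charset {
    en iso-8859-1
    fr iso-8859-1
    ja sjis
    ru iso-8859-5
    el iso-8859-7
}

proc ad_charset_for_language { language } {
    # Fall back to the site-wide default if the language is unknown
    if { [nsv_exists ad_language_charset $language] } {
        return [nsv_get ad_language_charset $language]
    }
    return "iso-8859-1"
}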

It is certainly mandatory that we know what character set a source file is in so that it can be converted to UTF-8 correctly when loading into Tcl. We can also use the character set of the file as the default output encoding when we deliver the final page content to the browser. This will generally be the correct thing to do. In some cases the publisher may want to manually override this for some files. For example authoring Japanese content may be more easily done in EUC (or even Unicode) on Unix systems, but the site may want to deliver the content to users in the more widely used ShiftJIS encoding. Configuration options and APIs for authors to do this should be allowed for in our system.

While we are overloading filenames with language info, it might be useful to be able to specify a character set directly in the same way. Thus we could simultaneously allow filenames like this

foo.ej.html         Japanese EUC HTML file
foo.sjis.html       Japanese ShiftJIS HTML file
foo.iso8859-5.html  ISO-8859-5 HTML file
bar.utf8.adp        UTF-8 template 
bar.iso8859-1.adp   ISO-8859-1 template 
As long as we use charset names which are distinct from language names, there should be no conflict. One algorithm for handling an abstract URL could have the request processor look for a file with an explicit charset first, and then look for language files, or vice versa.

ACS/AOLserver Platform-specific Changes

The sections below outline some modifications that can be made to an ACS version 3.x system running on the AOLserver/Tcl platform to perform some of the character set encoding handling described above. For more detailed and up-to-the-minute patch kits, see ACS I18N Patch Kit and ACS 3.4.x International Character Set Support v.3.4.5.

Serving Static Files

AOLserver is capable of delivering a document from disk directly to a web browser, byte-for-byte. In that case, it does not need to know anything about the character set encoding of the document. If you have a file stored in a non-ISO-8859-1 encoding, it is possible to have AOLserver add the correct content-type character set parameter to the HTTP header using its built-in MIME-type lookup mechanism. Entries can be added to the AOLserver MIME-type table by modifying the ns/mimetypes section of the init file to assign a MIME type to a given file name extension.
yourserver.ini:


[ns/mimetypes] 
Default=text/plain 
NoExtension=text/plain 
.html=text/html; charset=iso-8859-1
.html_sj=text/html; charset=Shift_JIS
.html_ej=text/html; charset=euc-jp
.tcl_sj=text/plain; charset=Shift_JIS
.tcl_ej=text/plain; charset=euc-jp
.adp_ej=text/html; charset=euc-jp
.adp=text/html; charset=utf-8

This approach will work as long as the data passes directly from the file through AOLserver and out to the browser, without being loaded into Tcl. However, the ACS generally does some processing in Tcl on a file's contents before delivering it to the browser. For example, .tcl scripts must be loaded into the interpreter and evaluated, static pages may have generalized comments and system headers or footers appended to them, and .adp template files need to have a template parser run on their content, possibly running arbitrary Tcl code. In all of these cases we must know the source file's encoding in order to read it properly into a Tcl string.

Serving HTML Files with Dynamic Annotations

On a typical ACS installation, there may be many static HTML files which are annotated dynamically by the system before they are delivered to the browser. In these cases, we must know the source encoding of the files. Using the filename extensions suggested above we can tell the system what the encoding is, and the request processor routine that handles static files (rp_handle_html_request) can be modified to set the correct channel encoding when loading the file.

Patching ad_serve_html_page with the following code will allow HTML files in alternate encodings to be served. For example, with the ns/mimetypes shown above, a file foo.html_ej would be loaded into Tcl as EUC-JP and, by default, delivered to the browser in the same encoding.

ad-html.tcl, inside ad_serve_html_page:

    set type [ns_guesstype $full_filename]
    set encoding [ns_encodingfortype $type]
    set stream [open $full_filename r]
    fconfigure $stream -encoding $encoding
    set whole_page [read $stream]
    close $stream
    # set the default output encoding to the file mime type
    ns_startcontent -type $type

Actually, if we want to use the generalized filename convention suggested in the previous section, where the file's language is encoded into a second suffix (foo.locale.html), then the code above will need to use a function other than the built-in AOLserver ns_guesstype command for mapping a filename to a MIME type. ns_guesstype only parses out simple filename extensions, where everything after the last '.' in the filename is considered to be the extension. We will need to write a function which parses out the language field from the filename and looks it up in our language-to-character-set table.
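
Such a function might look like the following sketch (ad_guesstype_with_locale is hypothetical; it reuses the ad_charset_for_language helper sketched earlier in this article):

proc ad_guesstype_with_locale { filename } {
    set parts [split [file tail $filename] "."]
    if { [llength $parts] >= 3 } {
        # e.g. foo.ja.html -> language "ja", extension "html"
        set language  [lindex $parts [expr {[llength $parts] - 2}]]
        set extension [lindex $parts [expr {[llength $parts] - 1}]]
        set charset [ad_charset_for_language $language]
        if { $extension == "html" || $extension == "adp" } {
            return "text/html; charset=$charset"
        }
        return "text/plain; charset=$charset"
    }
    # No language suffix; fall back to the standard MIME-type table
    return [ns_guesstype $filename]
}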

Specifying Source and Output Encoding of Tcl Files

In the same manner as static HTML files, .tcl script filenames can be annotated with a language or character set extension, and the request processor routine rp_handle_tcl_file can be modified to set this channel encoding when loading the file.

For output encoding, we should make rp_handle_tcl_request set the output character set to the source character set by default, but the developer should be able to explicitly override this at any point by one or more of the methods below:

  • Calling ns_startcontent
  • Setting an output Content-Type header with a charset parameter
  • Calling ns_return or related functions with an explicit character type

Modifications to the Request Processor

The new API function ns_encodingfortype is used to return the Tcl encoding to use for a given document MIME type.

The following source_with_encoding function would be used to replace the basic source call in rp_handle_tcl_request:


proc_doc source_with_encoding {filename} { Loads filename, using a charset encoding
looked up via the ns_encodingfortype command, based on the ns_guesstype MIME
type of the filename. } {
    set type [ns_guesstype $filename]
    set encoding [ns_encodingfortype $type]
    set fd [open $filename r]
    fconfigure $fd -encoding $encoding
    set code [read $fd]
    close $fd
    ns_startcontent -type $type
    uplevel 1 $code
}

proc_doc rp_handle_tcl_request {} { Handles a request for a .tcl file. } {
    global ad_conn

    doc_init
    rp_eval [list source_with_encoding [ad_conn file]]
    if { [doc_exists_p] } {
        # The file returned a document. We need to serve it.
        rp_eval doc_serve_document
    }
}


acs-core/abstract-url-init.tcl:

foreach { type handler } {
    tcl rp_handle_tcl_request
    tcl_ej rp_handle_tcl_request
    adp rp_handle_adp_request
    adp_ej rp_handle_adp_request
    html rp_handle_html_request
    htm rp_handle_html_request
} {
    rp_register_extension_handler $type $handler
}

Specifying Source and Output Encoding of Template Files

Using ns_adp_parse directly, template (.adp) files can only be authored in UTF-8. However, we can modify the request processor to read the files into Tcl strings first, performing any needed conversions, and then apply the ADP parser to the string. There are some issues of efficiency here, but the same issues arise in general caching architectures, so whatever solution we design for caching template pages should also be able to improve the performance of dynamically converting the source file's charset encoding.
proc_doc rp_handle_adp_request {} { Handles a request for an .adp file. } {
    doc_init

    set mimetype [ns_guesstype [ad_conn file]]
    set encoding [ns_encodingfortype $mimetype]
    set fd [open [ad_conn file] r]
    fconfigure $fd -encoding $encoding
    set template [read $fd]
    close $fd

    if { ![rp_eval [list ns_adp_parse -string $template] adp] } {
        return
    }

    if { [doc_exists_p] } {
        doc_set_property body $adp
        rp_eval doc_serve_document
    } else {
        set content_type [ns_set iget [ns_conn outputheaders] "content-type"]
        if { $content_type == "" } {
            set content_type [ns_guesstype [ad_conn file]]
        } else {
            ns_set idelkey [ns_conn outputheaders] "content-type"
        }
        ns_return 200 $content_type $adp
    }
}

Setting the Template's Output Encoding

When evaluating a template file, there should be a well defined chain of control for setting the document's output encoding.

The output charset for a template should default to whatever source charset was computed, as described above.

From that point, code inside the template could be used to override the default encoding. For example, in an ADP file, the output encoding could be explicitly set by the following means:

  • For the Document API, there could be special property names language or content-type which are recognized by doc_set_property and which set the Content-Type output headers.

    Or perhaps an explicit doc_set_language or doc_set_encoding command would make the intention clearer.

  • In the dynamic publishing system, a directive in a datasource file such as
     <content-type>text/html; charset="euc-jp"</content-type>
    
    or else a more explicit charset directive
     <charset name="euc-jp">
    

META HTTP-EQUIV="Content-Type"

The HTML META HTTP-EQUIV tag can be included in a document as an extra hint to inform a browser what charset a document is using. From RFC 2070:
In any document, it is possible to include an indication of the encoding scheme like the following, as early as possible within the HEAD of the document:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=ISO-2022-JP">

This is not foolproof, but will work if the encoding scheme is such that ASCII-valued octets stand for ASCII characters only at least until the META element is parsed. Note that there are better ways for a server to obtain character encoding information, instead of the unreliable META above

You might think that if we are setting the Content-Type header properly with a charset parameter, the META tag is redundant. However, consider what happens if the user downloads the file to disk or emails it to a friend: the Content-Type and other header information will likely be discarded. So it is useful to annotate your documents this way if possible.

For a multilingual template the content may be output in a variety of encodings. If we use an HTML META HTTP-EQUIV tag, then we must make sure it refers to the same charset as we are actually serving the document in. This means we need an API for the code in the template to ask what character set is being used.

Setting the Default Encoding for the Entire Site

For a web site which is going to be basically monolingual, but using a different character set than the default ISO-8859-1, it is possible to set the default encoding for most cases using the following ns/parameters values. For example, to set the default character set encoding to Japanese ShiftJIS, add the following to the server.ini file:
[ns/parameters]
HackContentType=1
URLCharset=Shift_JIS
OutputCharset=Shift_JIS
This will cause files which have no explicit character set parameter in their content-type to be treated as ShiftJIS, both for output conversion and for user input conversion. To ensure that Tcl script files are loaded using ShiftJIS encoding, you should explicitly set the mime type of .tcl files to contain the chosen character set. The modified version of the request processor, described above, will then choose the correct encoding conversion to use when loading the file.

[ns/mimetypes] 
.tcl=text/plain; charset=Shift_JIS
.adp=text/plain; charset=Shift_JIS

Modifications to the Document API

Currently, there is a new proposed ACS mechanism for document creation and delivery, the Document API, which will perform most of the default steps needed to compose and return a document from the server.

The new Document API performs similar steps to the request processor's default document handlers, using the doc_serve_document function. This function will need to be modified in a similar way to how we modified rp_handle_adp_request in order to become "encoding aware". The document's source encoding will need to be looked up, the channel encoding set before reading it, and the output encoding set to the correct value.

Modifications to ad_return_template

The ad_return_template function in the ACS is used to tie a Tcl file to a corresponding template from a template library. ad_return_template has a model of user language, using either a language preference from the user's preferences table or a cookie called language_preference, and a scoring system to choose the best matching template. Although it currently uses ns_adp_parse with the -file option, thus limiting it to UTF-8 templates, the code could be extended to incorporate the character set lookup mechanisms described above for properly setting the source and output encodings.

Note that while ad_return_template attempts to score files based on how well they match the user's preferred language, this feature needs to be used very deliberately. Giving a visitor a link to a page in a language they are not expecting should be regarded as a serious site design problem, and not merely a graceful degradation of service.
