Internationalization Cover Story: A Practical Solution to Internationalization of a J2EE Web App
Making Web Applications Multilingual
By: Murali Kashaboina; Bin Liu
Jan. 21, 2006 03:15 PM
Essentially, there are two parts to the internationalization of a Web application. The first part is internationalization of the application code: preparing the code so that it can adapt to new languages and regions. In practice, this means separating out text, labels, display messages, and any other data that is sensitive to language and region. Code prepared this way is general enough to handle new languages and countries without any redesign. The second part is localization of the application: adapting the internationalized code to a specific language or region (aka locale). In practice, localization involves creating translated text, labels, and messages, and adding any other application data that is specific to a certain locale.

Internationalization is a common problem that is typically overlooked during design and development. It must be up-front work in the development life cycle, not an afterthought. It is rightly said, "a stitch in time saves nine." Designing for internationalization up front may not be easy; however, it is far more difficult to retrofit internationalization once the application has already been developed. Up-front planning can save significant amounts of time and money. There are many ways of addressing this problem; however, the following approaches are widely used:
The Fundamental Concepts

A character is the smallest component of a written language that has a specific name and some semantic value. Each character can have more than one graphical representation. For example, the character "A" can be drawn differently in different typefaces. Independent of the graphical representation, the meaning of the character remains the same. Each such graphical representation is called a glyph. A set of glyphs is called a font, so a character will have a different glyph in different fonts.

A character set comprises a group of related characters that can be used for some purpose. All the characters on an English keyboard can be grouped into a character set because they make it possible to write meaningful and informative documents in English.

Computers do not understand characters automatically; they need a coded set of characters to process the data. In a coded character set, each character is assigned an integer value, commonly referred to as a code point. American Standard Code for Information Interchange (ASCII) is a good example of a coded character set. ASCII is a small coded character set that comprises 128 characters (code points 0 through 127). There are other coded character sets such as ISO-8859-1 and Unicode. Essentially, the code point of a character in the coded character set is used to identify the right glyph to display on the computer screen.

Character set encoding is yet another widely used term. A character set encoding scheme is a set of rules for mapping character code values to byte sequences (aka octets) and vice versa. ISO-8859-1 has its own single-byte encoding scheme, while UTF-8 and UTF-16 are encoding schemes for Unicode. For example, different schemes encode the character "ß" into byte sequences as shown in Table 1. The terms "coded character set" and "character set encoding" have different meanings and should not be used interchangeably.
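As a rough illustration of how one code point maps to different byte sequences, the following sketch encodes "ß" (U+00DF) with three charsets and prints the resulting bytes in hexadecimal. The class name `EszettBytes` and the helper `hex` are hypothetical names chosen for this example, and `StandardCharsets` assumes Java 7 or later; on an older JRE, `Charset.forName("UTF-8")` and similar calls would be needed instead.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the article's example application.
public class EszettBytes {

    // Encodes the text with the given charset and renders each byte as two hex digits.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF)); // mask to treat the byte as unsigned
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String eszett = "\u00DF"; // the character "ß", code point U+00DF

        System.out.println("ISO-8859-1: " + hex(eszett, StandardCharsets.ISO_8859_1)); // DF
        System.out.println("UTF-8:      " + hex(eszett, StandardCharsets.UTF_8));      // C3 9F
        // UTF_16BE is used so that no byte order mark is prepended to the output.
        System.out.println("UTF-16BE:   " + hex(eszett, StandardCharsets.UTF_16BE));   // 00 DF
    }
}
```

The code point stays the same in all three cases; only the byte sequence produced by the encoding scheme changes.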
To avoid confusing the two terms, the short name "charset" is usually used to refer to a character set encoding. Table 2 shows some of the charsets that support different languages. Table 2 leads to a big question: which charset should be used to support multiple languages in an internationalized application? For example, the ISO-8859-1 character set does not include Chinese characters, which are supported by the GB2312 character set. Clearly, there should be a common character set that covers the characters of all the world's languages. Unicode is one such coded character set; it aims to provide a unique code point for every character in every language.

Java uses Unicode to represent characters. JRE 1.4 supports Unicode 3.0, in which every character fits in 2 bytes: code points range from U+0000 to U+FFFF (written in hexadecimal), which allows for 65,536 code points covering almost all world languages. Later versions of Unicode extend the range to U+10FFFF. There is one more character set, known as the Universal Character Set (UCS), which can support all language characters and symbols. However, UCS was defined with a 31-bit code space, which most computer applications do not support, whereas 16-bit encoding is widely supported. To address this issue, transformation formats have been created based on Unicode and UCS. One of them is UTF-8 (UCS Transformation Format, 8-bit). UTF-8 encodes each character in 1 to 4 bytes (octets). It preserves ASCII: every ASCII character is encoded as a single byte with the same value. UTF-8's support for a wide range of characters and its efficient encoding make it the de facto charset for displaying multiple languages.
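The two points above, that ISO-8859-1 cannot represent Chinese text and that UTF-8 uses between 1 and 4 bytes per character, can be checked with a short sketch. The class name `CharsetCoverage` and its helper methods are hypothetical names for this illustration. It assumes Java 7 or later for `StandardCharsets`; the 4-byte case uses a supplementary character (U+1D11E), which lies beyond the Unicode 3.0 range of JRE 1.4 and is written as a surrogate pair in a Java string.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the article's example application.
public class CharsetCoverage {

    // Returns true if every character in the text can be represented in the charset.
    static boolean canEncode(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    // Returns the number of bytes UTF-8 needs to encode the text.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String chinese = "\u4E2D\u6587"; // "中文"

        System.out.println(canEncode(chinese, StandardCharsets.ISO_8859_1)); // false
        System.out.println(canEncode(chinese, StandardCharsets.UTF_8));      // true

        System.out.println(utf8Length("A"));            // 1 byte: ASCII character
        System.out.println(utf8Length("\u00DF"));       // 2 bytes: "ß"
        System.out.println(utf8Length("\u4E2D"));       // 3 bytes: "中"
        System.out.println(utf8Length("\uD834\uDD1E")); // 4 bytes: U+1D11E as a surrogate pair
    }
}
```

Checking `canEncode` before writing output is one way to detect that a chosen charset would silently lose characters; using UTF-8 throughout avoids the problem.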
The application described in this article uses UTF-8 everywhere there is a need to encode content in different languages.

The Internationalization Requirements for the Example Application