|
New Codes Open Doors for Non-Roman Alphabet Users
Computer systems can now accommodate more than 65,000 different characters -
enough for Chinese, Japanese and Korean ideograms, and other alphabets such as
Greek, Russian, Hebrew and Arabic script.
Dealing with different languages may be a challenge for website operators,
but when it comes to computers there are additional problems for nations
and cultures whose languages do not use the Roman alphabet. Until recently,
even the accents commonly used above and below letters in German, French
and Spanish could cause problems. This was because the original system for
representing letters on computers - known as Ascii coding - set only
half of the 256 codes available to identify different characters.
(The basic unit of computing data used to do this is the byte,
which has eight bits, each of which can be either 0 or 1. That gives 256
alternatives.)
This system covered the standard upper and lower case alphabet,
numbers, punctuation and a few special symbols found on keyboards, such
as currency and percentage symbols. The other 128 codes could be used arbitrarily.
Printers used them to create special effects. Software applications used
them, among other things, to represent accented characters. But different
applications adopted different standards. This is why you still occasionally
see gobbledegook in e-mail messages from other countries.
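
The arithmetic and the source of the gobbledegook can be checked in a few lines of Python. This is a minimal sketch, not a description of any particular e-mail program: the three code pages (`cp437`, `cp1251`, `mac_roman`) are chosen here purely for illustration and are not named in the article.

```python
import unicodedata

# One byte has eight bits, each 0 or 1, so it can hold 2**8 values.
BYTE_VALUES = 2 ** 8               # 256 alternatives
ASCII_VALUES = BYTE_VALUES // 2    # Ascii fixes only codes 0-127

# 0xE9 (233) lies in the upper half, where Ascii defines nothing.
HIGH_BYTE = bytes([0xE9])

# Code pages picked for illustration (an assumption, not from the article).
CODE_PAGES = ("cp437", "cp1251", "mac_roman")


def readings(raw: bytes) -> dict:
    """Decode the same raw bytes under each of the code pages above."""
    return {page: raw.decode(page) for page in CODE_PAGES}


if __name__ == "__main__":
    print(BYTE_VALUES, ASCII_VALUES)
    for page, char in readings(HIGH_BYTE).items():
        # Print the Unicode name of the character rather than the glyph.
        print(page, unicodedata.name(char))
```

Each code page turns the same byte into a different character, which is what a reader sees when the sender and the recipient assume different conventions for the upper 128 codes.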
The International Organization for Standardization (ISO) has now agreed to give
standard meanings to these remaining codes. The new standard is known as
‘Latin-1’ or ‘extended Ascii’ and includes accented characters.
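
What a shared standard buys can be seen with Python's built-in `latin-1` codec. The sample text and the helper name `fits_ascii` are made up for this sketch; only the codec names are Python's own.

```python
# Sample text with three accented letters (an illustrative choice).
TEXT = "café à Zürich"

# Under Latin-1 every character here still fits in a single byte.
ENCODED = TEXT.encode("latin-1")

# The accented letters are the bytes that fall in the upper 128 codes.
HIGH_CODES = [b for b in ENCODED if b >= 128]


def fits_ascii(text: str) -> bool:
    """Return True if the text can be written with the 128 Ascii codes alone."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    print(len(TEXT), len(ENCODED))          # same count: one byte per character
    print([hex(b) for b in HIGH_CODES])     # codes used for the accented letters
    print(fits_ascii(TEXT), fits_ascii("cafe"))
```

Because both ends agree on the meaning of each upper code, decoding the bytes as Latin-1 returns the original text unchanged.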
Another coding system has been devised to handle Asian languages such
as Chinese, Japanese and Korean. These languages have thousands of ideographic
characters, each representing a single word. A coding system called Unicode
has emerged as the standard. This uses twice as much data to represent
each character, and so is known as a ‘double-byte’ coding system. This
would be an inefficient means of representing English and European languages
that use the Roman alphabet because e-mails and text files would be twice as
big in terms of data. But the system can accommodate more than 65,000 different
characters - enough for Chinese, Japanese and Korean ideograms and other
alphabets such as Greek, Russian, Hebrew and Arabic script. In all, 49,194 of
the codes have been allocated in the latest version of the Unicode system.

Double-byte codes are a very efficient means of storing ideographic characters,
such as Chinese, since a whole word is stored in the equivalent of the space
for two letters. Since each word has a unique code, there is also less of the
ambiguity that is inherent in, for example, English, where a word such as
‘wind’ can be a noun meaning ‘movement of air’ or a verb meaning ‘to crank a
handle’. Unicode’s unambiguous meanings improve the accuracy of automated
translation systems, especially when attempting translations from these
languages.

The next challenge is to be sure that users can display and enter characters.
Software applications and web browsers are starting to adopt Unicode. Often,
an additional software component known as a plug-in is required. On web
browsers, this can often be downloaded automatically, and can increase the
size of the web browser considerably (by around 3MB, or megabytes). And it
appears not to cater to all scenarios. Some sites operate correctly after
these software downloads, while others continue to display a hotch-potch of
symbols.
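
One plausible cause of such a hotch-potch is text being read with a different coding system from the one used to write it. The Python sketch below illustrates that, along with the storage trade-off of a double-byte code. It assumes Python's `utf-16-be` codec as a stand-in for the article's double-byte Unicode and `latin-1` as the mistaken reader; neither choice comes from the article.

```python
# Two bytes of eight bits each give 2**16 possible codes.
DOUBLE_BYTE_CODES = 2 ** 16        # 65,536, i.e. "more than 65,000"

ENGLISH = "wind"                   # four Roman letters
CHINESE = "中文"                    # two ideographs


def size(text: str, encoding: str) -> int:
    """Number of bytes the text occupies in the given encoding."""
    return len(text.encode(encoding))


ENGLISH_SINGLE = size(ENGLISH, "ascii")       # one byte per letter
ENGLISH_DOUBLE = size(ENGLISH, "utf-16-be")   # two bytes per letter
CHINESE_DOUBLE = size(CHINESE, "utf-16-be")   # two bytes per ideograph

# Reading double-byte data as if it were single-byte Latin-1 yields
# four unrelated characters instead of two ideographs.
GARBLED = CHINESE.encode("utf-16-be").decode("latin-1")

if __name__ == "__main__":
    print(DOUBLE_BYTE_CODES)
    print(ENGLISH_SINGLE, ENGLISH_DOUBLE, CHINESE_DOUBLE)
    print(ascii(GARBLED))          # shown with escapes so any terminal can print it
```

The English word doubles in size, while each ideograph costs no more than two Roman letters would.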
There will be improvements as the Unicode Standard becomes adopted in
the next version of HTML - the computer language that underpins information
display on the web. In the meantime, specialist browsers, such as Tango
and Mozilla, are designed specifically to handle different character systems.
There are also software products to help with handling different character
systems on standard computer systems, including Windows and Macintosh.
The problem of entering ideographic characters online from the keyboard has
been solved by building words from phonetic symbols, rather like the
pronunciation guides in dictionaries. For Japanese kanji, for example, the 46
phonetic symbols can be entered from the Japanese PC keyboard. Since
homophones exist, there may be more than one character for a pronunciation -
much like “colonel”, the military rank, and “kernel”, the nut in a fruit, in
English. The appropriate character can be selected from a menu whenever the
space key is pressed.

Since November, it has been possible to register top-level internet domain
names (TLDs) - the much-cherished dot.com and dot.org names that have been so
fought over - using Chinese, Japanese and Korean characters. Companies have
rushed to protect their brand interests in Asian markets by registering their
translated trading names to stop others doing so. This would prevent a new
outbreak of so-called "cyber squatting," in which individuals register
well-known corporate names as internet addresses with the intention of selling
them back to the companies so named for a considerable fee.
However, Net Searchers, a consultancy specialising in online brand
and trade mark protection, cautions that neither of the two companies that
offer registration systems today - Verisign/NSI and IDNS - has been fully
ratified by Icann,
the internet-naming convention authority. Meanwhile, the Chinese government
has put forward an alternative registration system. Net Searchers suggests
that companies can best protect themselves by registering their trade names
across all the available systems, rather than choosing a single one.
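
To return briefly to keyboard entry: the phonetic scheme described earlier, in which a pronunciation is typed and the space key steps through the matching characters, can be sketched in Python. The two-entry dictionary and the function name `convert` are hypothetical, made up for illustration; real input software uses far larger dictionaries and typically ranks candidates by context.

```python
# Hypothetical mini-dictionary: phonetic reading -> homophone candidates.
CANDIDATES = {
    "かみ": ["紙", "髪", "神"],   # "kami": paper, hair, god
    "はし": ["橋", "箸", "端"],   # "hashi": bridge, chopsticks, edge
}


def convert(reading: str, space_presses: int = 1) -> str:
    """Return the character selected after pressing space a number of times.

    The first press picks the first candidate and further presses step
    through the menu, wrapping round at the end. With no presses, or no
    dictionary entry, the phonetic symbols are left as typed.
    """
    options = CANDIDATES.get(reading)
    if not options or space_presses < 1:
        return reading
    return options[(space_presses - 1) % len(options)]


if __name__ == "__main__":
    # Print code points rather than glyphs so any terminal can show them.
    for presses in range(1, 5):
        chosen = convert("かみ", presses)
        print(presses, [f"U+{ord(c):04X}" for c in chosen])
```

A fourth press of the space key wraps back to the first candidate, much as a menu would cycle.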
|