(unidentified newspaper source)
 
 
New Codes Open Doors for Non-Roman Alphabet Users

Computer systems can now accommodate more than 65,000 different characters
- enough for Chinese, Japanese and Korean ideograms, and other alphabets
such as Greek, Russian, Hebrew and Arabic script.

Dealing with different languages may be a challenge for website operators, but when it comes to computers there are additional problems for nations and cultures whose languages do not use the Roman alphabet. Until recently, even the accents commonly used above and below letters in German, French and Spanish could cause problems. This was because the original system for representing letters on computers - known as Ascii coding - defined only half of the 256 codes available to identify different characters.
(The basic unit of computing data used to do this is the byte, each of which has eight bits, which can be either 0 or 1. That gives 256 alternatives.)
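
The arithmetic behind those figures can be checked in a few lines. This is a minimal sketch in Python; the names `BITS_PER_BYTE` and `to_bits` are chosen here for illustration and do not come from any standard.

```python
# Each bit is 0 or 1, so a byte of eight bits has 2**8 possible values.
BITS_PER_BYTE = 8
byte_values = 2 ** BITS_PER_BYTE      # 256 alternatives in one byte

# Ascii defines only seven bits' worth of codes, i.e. half of the total.
ascii_values = 2 ** 7                 # 128 codes given a standard meaning
spare_values = byte_values - ascii_values  # 128 codes left undefined


def to_bits(ch: str) -> str:
    """Return the eight-bit pattern for a single Ascii character."""
    return format(ord(ch), "08b")


print(byte_values, ascii_values, spare_values)
print("A is stored as", to_bits("A"))
```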

This system covered the standard upper and lower case alphabet, numbers, punctuation and a few special symbols found on keyboards, such as currency and percentage symbols. The other 128 codes could be used arbitrarily. Printers used them to create special effects. Software applications used them, among other things, to represent accented characters. But different applications adopted different standards. This is why you still occasionally see gobbledegook in e-mail messages from other countries.
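
The gobbledegook effect is easy to reproduce. The sketch below, in Python, stores one accented word as bytes and then reads those same bytes back under three different conventions for the upper 128 codes. The three code tables chosen (`latin-1`, `cp437` for the old IBM PC set, `mac_roman` for the Macintosh set) are illustrative assumptions, not the specific ones any particular e-mail program used.

```python
# The sender stores "café" using one convention for the upper 128 codes.
raw = "café".encode("latin-1")   # the é becomes the single byte 0xE9


def decode_with(data: bytes, codec: str) -> str:
    """Interpret the same bytes using the given code table."""
    return data.decode(codec)


# The receiver's software may assume a different table for the same byte.
readings = {codec: decode_with(raw, codec)
            for codec in ("latin-1", "cp437", "mac_roman")}

for codec, text in readings.items():
    print(f"{codec:10} -> {text}")
```

Only the reading that uses the sender's own table gives back the original word; the others show a different symbol where the accented letter should be.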

The International Organization for Standardization (ISO) has now agreed to give standard
meanings to these remaining codes. The new standard is known as ‘Latin-1’ or
‘extended Ascii’ and includes accented characters.
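
As a rough illustration of what the agreed standard provides, the Python sketch below looks up the single-byte code that Latin-1 assigns to a few characters. The helper name `latin1_code` is hypothetical; the `latin-1` codec itself is part of Python's standard library.

```python
def latin1_code(ch: str) -> int:
    """Return the one-byte Latin-1 code for a single character."""
    return ch.encode("latin-1")[0]


# Plain letters keep their Ascii codes (below 128); accented letters
# now have fixed codes in the upper half (128 to 255).
for ch in "Aéñü":
    code = latin1_code(ch)
    half = "Ascii half" if code < 128 else "upper half"
    print(ch, code, half)
```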

Another coding system has been devised to handle Asian languages such as Chinese, Japanese and Korean. These languages have thousands of ideographic characters, each representing a single word. A coding system called Unicode has emerged as the standard. This uses twice as much data to represent each character, and so is known as a ‘double-byte’ coding system. This would be an inefficient means of representing English and European languages that use the Roman alphabet because e-mails and text files would be twice as big in terms of data. But the system can accommodate more than 65,000 different characters - enough for Chinese, Japanese and Korean ideograms and other alphabets such as Greek, Russian, Hebrew and Arabic script. In all, 49,194 of the codes have been allocated in the latest version of the Unicode system.

Double-byte codes are a very efficient means of storing ideographic characters, such as Chinese, since a whole word is stored in the equivalent of the space for two letters. Since each word has a unique code, there is also less of the ambiguity that is inherent in, for example, English, where a word such as ‘wind’ can be a noun meaning ‘movement of air’ or a verb meaning to ‘crank a handle’. Unicode’s unambiguous meanings improve the accuracy of automated translation systems, especially when attempting translations from these languages.

The next challenge is to be sure that users can display and enter characters. Software applications and web browsers are starting to adopt Unicode. Often, an additional software component known as a plug-in is required. On web browsers, this can often be downloaded automatically, and can increase the size of the web browser considerably (by around 3MB, or megabytes). And it appears not to cater for all scenarios. Some sites operate correctly after these software downloads, while others continue to display a hotch-potch of symbols.
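
The size trade-off of a double-byte system can be sketched in Python. The example assumes the standard `utf-16-be` codec as a stand-in for the two-bytes-per-character scheme described above, and the sample character is chosen purely for illustration. Later versions of Unicode extend beyond this 16-bit range, which the sketch does not cover.

```python
# Two bytes give 16 bits, hence 2**16 possible codes.
DOUBLE_BYTE_CODES = 2 ** 16   # 65,536, i.e. "more than 65,000"


def size_in_bytes(text: str, codec: str) -> int:
    """Return how many bytes the text occupies under the given coding."""
    return len(text.encode(codec))


english = "wind"   # four Roman letters
chinese = "風"     # one ideogram (illustrative sample, meaning 'wind')

print(size_in_bytes(english, "ascii"))      # one byte per letter
print(size_in_bytes(english, "utf-16-be"))  # two bytes per letter
print(size_in_bytes(chinese, "utf-16-be"))  # a whole word in two bytes
```

The English word doubles in size when stored in two bytes per letter, while the single ideogram takes no more room than two Roman letters would in Ascii.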

There will be improvements as the Unicode Standard becomes adopted in the next version of HTML - the computer language that underpins information display on the web. In the meantime, specialist browsers, such as Tango and Mozilla, are designed specifically to handle different character systems. There are also software products to help with handling different character systems on standard computer systems, including Windows and Macintosh.

The problem of entering ideographic characters online from the keyboard has been solved by building words from phonetic symbols, rather like the pronunciation guides in dictionaries. To enter Japanese kanji, for example, the 46 phonetic symbols can be typed on the Japanese PC keyboard. Since homophones exist, there may be more than one character for a pronunciation - much like “colonel”, the military rank, and “kernel”, the nut in a fruit, in English. The appropriate character can be selected from a menu whenever the space key is pressed.

Since November, it has been possible to register the top-level internet domain names (TLDs) - the much-cherished dot.com and dot.org names that have been so fought over - using Chinese, Japanese and Korean characters. Companies have rushed to protect their brand interests in Asian markets by registering their translated trading names to stop others doing so. This would prevent a new outbreak of so-called "cyber squatting," in which individuals register well-known corporate names as internet addresses with the intention of selling them back to the companies so named for a considerable fee.

However, Net Searchers, a consultancy specialising in online brand and trade mark protection, cautions that neither of the two companies that offer registration systems today - Verisign/NSI and IDNS - has been fully ratified by Icann, the internet naming authority. Meanwhile, the Chinese government has put forward an alternative registration system. Net Searchers suggests that companies can best protect themselves by registering their trade names across all the available systems, rather than choosing a single one.
 


