
Tutorial Notes on Unicode - JDK - Encoding Conversion



 31 December 18:00   

    



    



    

Compile this program and use it to convert our hello message file into several encodings:

    

 

    

javac EncodingConverter.java

    

java EncodingConverter hello.utf-16be utf-16be hello.ascii ascii

    

java EncodingConverter hello.utf-16be utf-16be hello.iso-8859-1 iso-...

    

java EncodingConverter hello.utf-16be utf-16be hello.utf-8 utf-8

    

java EncodingConverter hello.utf-16be utf-16be hello.gbk gbk

    

java EncodingConverter hello.utf-16be utf-16be hello.big5 big5

    

java EncodingConverter hello.utf-16be utf-16be hello.shift_jis shift_jis
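The EncodingConverter source is not reproduced in this section. As a rough sketch only, assuming the four arguments are input file, input charset, output file, and output charset (which is what the commands above suggest), a converter of this shape could look like the following. The class name ConverterSketch and its convert method are hypothetical and are not the program used in these notes.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical stand-in, not the EncodingConverter used in these notes.
class ConverterSketch {

    // Decodes the input bytes with inCharset and re-encodes the text with
    // outCharset. Characters that outCharset cannot represent are replaced
    // with that charset's default replacement bytes.
    static byte[] convert(byte[] input, Charset inCharset, Charset outCharset) {
        String text = new String(input, inCharset);
        return text.getBytes(outCharset);
    }

    // Assumed argument order: inFile inCharset outFile outCharset
    public static void main(String[] a) throws IOException {
        byte[] input = Files.readAllBytes(Paths.get(a[0]));
        byte[] output = convert(input, Charset.forName(a[1]), Charset.forName(a[3]));
        Files.write(Paths.get(a[2]), output);
    }
}
```

Reading the whole file into memory keeps the sketch short; a stream-based reader and writer pair would suit large files better.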

    



    



    

By viewing the output files, you should notice the following:

    

hello.ascii - In this file, only the English message is good, because it contains only ASCII characters. Both the Simplified Chinese and Traditional Chinese messages are not good. Characters in these messages are replaced by 0x3F (the ASCII question mark), an indication of invalid code.

    

hello.iso-8859-1 - This is identical to hello.ascii, because the messages contain no characters in the 0x80 - 0xFF range.

    

hello.utf-8 - This file contains all messages with no damage. The ASCII characters are stored as one-byte characters, as expected.

    

hello.gbk - In this file, the Simplified Chinese message is good. In fact, characters in the Simplified Chinese message are stored as code values in the GBK character set standard. The English message is also good, because GBK is ASCII backward compatible. We are lucky with the Traditional Chinese message, because the Big5 characters used in the message are also valid in the GBK standard. If you use some characters specific to Big5, the result could be different.

    

hello.big5 - In this file, the Traditional Chinese message is good. In fact, characters in the Traditional Chinese message are stored as code values in the Big5 character set standard. The English message is also good, because Big5 is ASCII backward compatible. We are not lucky with the Simplified Chinese message: two GB characters used in the message are not valid in the Big5 standard. 0x3F was stored for those characters.

    

hello.shift_jis - In this file, the English message is still good. Some of the characters from both the Simplified and Traditional Chinese messages are invalid, and are replaced by 0x3F placeholders. Some of the Chinese characters are still valid in the Shift_JIS character set. This is not so surprising, because there are many shared characters in Chinese and Japanese.
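The 0x3F substitution and the one-byte storage of ASCII characters described above can be reproduced without any files. The following is a minimal sketch, assuming a sample string made of "Hi" followed by the Chinese character U+4F60; the class name EncodeDemo, the toHex helper, and the sample text are illustrative and not part of the original programs.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodeDemo {

    // Encodes text with the given charset and formats the bytes as hex.
    // String.getBytes replaces unmappable characters with the charset's
    // default replacement bytes instead of throwing an exception.
    static String toHex(String text, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(cs)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] a) {
        String text = "Hi\u4F60"; // assumed sample: "Hi" plus U+4F60
        System.out.println(toHex(text, StandardCharsets.US_ASCII));   // 48 69 3F
        System.out.println(toHex(text, StandardCharsets.ISO_8859_1)); // 48 69 3F
        System.out.println(toHex(text, StandardCharsets.UTF_8));      // 48 69 E4 BD A0
    }
}
```

The same helper can be called with Charset.forName("GBK"), Charset.forName("Big5"), or Charset.forName("Shift_JIS") when the Java runtime provides those charsets.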

    



    



    

Viewing Unicode Text

    



    

Now, we have the greeting messages stored in several different encodings. The next question is how to display them as glyphs of the corresponding languages on the screen. One of the ways I have used in the past is to run a multi-language enabled Web browser like IE to view the text files. To do this, we have to mark up the text into an HTML file, by using a program like this one:

    

 

    

/**
 * EncodingHtml.java
 * Copyright (c) 2002 by Dr. Yang
 *
 * This program allows you to mark up a text file into an HTML file.
 */
import java.io.*;
import java.util.*;

class EncodingHtml {
   static HashMap charsetMap = new HashMap(); // not used in this version

   public static void main(String[] a) {
      String inFile = a[0];
      String inCharsetName = a[1];
      String outFile = inFile + ".html";
      try {
         // Read and write with the same charset, so the text itself is
         // not converted; only the HTML wrapper is added around it.
         InputStreamReader in = new InputStreamReader(
            new FileInputStream(inFile), inCharsetName);
         OutputStreamWriter out = new OutputStreamWriter(
            new FileOutputStream(outFile), inCharsetName);
         writeHead(out, inCharsetName);
         int c = in.read();
         int n = 0;
         while (c!=-1) {
            out.write(c);
            n++;
            c = in.read();
         }
         writeTail(out);
         in.close();
         out.close();
         System.out.println("Number of characters: "+n);
      } catch (IOException e) {
         System.out.println(e.toString());
      }
   }

   // Writes the HTML header, declaring the charset in a meta tag.
   public static void writeHead(OutputStreamWriter out, String cs)
      throws IOException {
      out.write("<html><head>\n");
      out.write("<meta http-equiv=\"Content-Type\""+
         " content=\"text/html; charset="+cs+"\">\n");
      out.write("</head><body><pre>");
   }

   // Closes the elements opened by writeHead.
   public static void writeTail(OutputStreamWriter out)
      throws IOException {
      out.write("</pre></body></html>\n");
   }
}
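EncodingHtml copies each character straight into the pre element. That is fine for the greeting messages, but a text file containing < or & would have those characters read as markup. A minimal sketch of an escaping helper follows; the class name HtmlEscapeSketch and the escape method are hypothetical additions, not part of the original program.

```java
class HtmlEscapeSketch {

    // Returns the HTML-safe form of one character read from the input file.
    // Only the three characters that are significant inside <pre> are mapped;
    // every other character is returned unchanged.
    static String escape(int c) {
        switch (c) {
            case '<': return "&lt;";
            case '>': return "&gt;";
            case '&': return "&amp;";
            default:  return String.valueOf((char) c);
        }
    }

    public static void main(String[] a) {
        StringBuilder sb = new StringBuilder();
        for (char c : "a<b & c".toCharArray()) {
            sb.append(escape(c));
        }
        System.out.println(sb); // a&lt;b &amp; c
    }
}
```

In the copy loop of EncodingHtml, out.write(escape(c)) could then take the place of out.write(c).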

    



    



    

Now, let's compile this program and run it with hello.utf-8:

    

 

    

javac EncodingHtml.java

    

java EncodingHtml hello.utf-8 utf-8

    



    



    

If you have installed IE with Chinese language support, you should be able to open the output file, hello.utf-8.html, and enjoy reading the messages in English, Simplified Chinese, and Traditional Chinese.

    



    

Then, run EncodingHtml with other encodings:

    

 

    

java EncodingHtml hello.gbk gbk

    

java EncodingHtml hello.big5 big5

    

java EncodingHtml hello.shift_jis shift_jis

    



    



    

View the output files with IE, and compare the results:

    



        

  • hello.utf-8.html - IE auto sets View/Encoding to utf-8. All messages are perfect.
  • hello.gbk.html - IE auto sets View/Encoding to gb2312. All messages are perfect.
  • hello.big5.html - IE auto sets View/Encoding to big5. The Simplified Chinese message has two bad characters.
  • hello.shift_jis.html - IE auto sets View/Encoding to shift_jis. Both the Simplified and Traditional Chinese messages have bad characters.
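The bad characters in these results are the ones the target encoding could not represent when the file was converted. One way to list them before opening a browser is to ask a CharsetEncoder directly. The sketch below assumes the text contains only characters from the Basic Multilingual Plane; the class name UnmappableSketch and the unmappable method are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class UnmappableSketch {

    // Collects the characters of text that the given charset cannot encode.
    // Each char is tested on its own, so a surrogate pair would be reported
    // as two unencodable halves; the sketch assumes BMP-only text.
    static String unmappable(String text, Charset cs) {
        CharsetEncoder enc = cs.newEncoder();
        StringBuilder bad = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!enc.canEncode(c)) {
                bad.append(c);
            }
        }
        return bad.toString();
    }

    public static void main(String[] a) {
        // a[0] is a charset name such as big5; a[1] is the text to check.
        System.out.println(unmappable(a[1], Charset.forName(a[0])));
    }
}
```

An empty result means every character in the text has a code value in that charset.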



    



    

If you manually change the setting of View/Encoding, IE will not be able to display the messages with the correct glyphs.
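The effect of a wrong encoding setting can be imitated in Java by encoding text with one charset and decoding the bytes with another. This is only a sketch of the idea, using UTF-8 and ISO-8859-1 as the assumed pair; the class name MojibakeSketch and the reinterpret method are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeSketch {

    // Encodes text with one charset, then decodes the same bytes with
    // another, as a viewer does when its encoding setting does not match
    // the encoding the file was written in.
    static String reinterpret(String text, Charset written, Charset assumed) {
        return new String(text.getBytes(written), assumed);
    }

    public static void main(String[] a) {
        String text = "\u4F60"; // one Chinese character, U+4F60
        String right = reinterpret(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String wrong = reinterpret(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(right.equals(text)); // true
        System.out.println(wrong.length());     // 3
    }
}
```

With the matching charset the character survives. With ISO-8859-1 the three UTF-8 bytes E4 BD A0 are shown as three unrelated Latin-1 characters.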

    



    



 

