Tutorial Notes on Unicode - JDK - Encoding Conversion
Compile this program and use it to convert our hello message file into several
encodings:
javac EncodingConverter.java
java EncodingConverter hello.utf-16be utf-16be hello.ascii ascii
java EncodingConverter hello.utf-16be utf-16be hello.iso-8859-1 iso-...
java EncodingConverter hello.utf-16be utf-16be hello.utf-8 utf-8
java EncodingConverter hello.utf-16be utf-16be hello.gbk gbk
java EncodingConverter hello.utf-16be utf-16be hello.big5 big5
java EncodingConverter hello.utf-16be utf-16be hello.shift_jis shift_jis
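The EncodingConverter source is not reproduced in this section. As a rough sketch of the idea behind the commands above, and not the original program, a conversion can be done by reading characters through an InputStreamReader in one charset and writing them through an OutputStreamWriter in another. The class name ConvertSketch and the helpers convert and hex are hypothetical, and the sketch works on byte arrays rather than files to stay self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;

class ConvertSketch {
   // Hypothetical helper: decode 'input' with inCharset, then re-encode
   // the characters with outCharset. Characters that the output charset
   // cannot represent are written as its replacement byte (0x3F for ASCII).
   static byte[] convert(byte[] input, String inCharset, String outCharset) {
      ByteArrayOutputStream sink = new ByteArrayOutputStream();
      try (Reader in = new InputStreamReader(
              new ByteArrayInputStream(input), inCharset);
           Writer out = new OutputStreamWriter(sink, outCharset)) {
         int c;
         while ((c = in.read()) != -1) {
            out.write(c);
         }
      } catch (IOException e) {
         throw new UncheckedIOException(e);
      }
      return sink.toByteArray();
   }

   // Hypothetical helper: format bytes as space-separated hex pairs.
   static String hex(byte[] bytes) {
      StringBuilder sb = new StringBuilder();
      for (byte b : bytes) {
         if (sb.length() > 0) sb.append(' ');
         sb.append(String.format("%02X", b & 0xFF));
      }
      return sb.toString();
   }

   public static void main(String[] args) {
      // "Hi" followed by U+4F60, encoded as UTF-16BE
      byte[] utf16be = {0x00, 0x48, 0x00, 0x69, 0x4F, 0x60};
      System.out.println(hex(convert(utf16be, "utf-16be", "utf-8")));  // 48 69 E4 BD A0
      System.out.println(hex(convert(utf16be, "utf-16be", "ascii")));  // 48 69 3F
   }
}
```

The second line of output shows the Chinese character collapsing to a single 0x3F byte when the target encoding is ASCII.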
By viewing the output files, you should notice the following:
hello.ascii - In this file, only the English message is good, because it contains only
ASCII characters. Both the Simplified Chinese and Traditional Chinese messages
are not good. Characters in these messages are replaced by 0x3F, an indication
of invalid code.
hello.iso-8859-1 - This is identical to hello.ascii, because the messages contain no
characters in the 0x80 - 0xFF range.
hello.utf-8 - This file contains all messages with no damage. The ASCII
characters are stored as one-byte characters as expected.
hello.gbk - In this file, the Simplified Chinese message is good. In fact,
characters in the Simplified Chinese message are stored as code values in the
GBK character set standard. The English message is also good, because GBK is
ASCII backward compatible. We are lucky with the Traditional Chinese message,
because the Big5 characters used in the message are also valid in the GBK standard.
If you use some Big5 special characters, the result could be different.
hello.big5 - In this file, the Traditional Chinese message is good. In fact,
characters in the Traditional Chinese message are stored as code values in the
Big5 character set standard. The English message is also good, because Big5 is
ASCII backward compatible. We are not lucky with the Simplified Chinese message,
because two GB characters used in the message are not valid in the Big5 standard.
0x3F was stored for those characters.
hello.shift_jis - In this file, the English message is still good. Some of the
characters from both the Simplified and Traditional Chinese messages are invalid,
replaced by 0x3F placeholders. Some of the Chinese characters are still valid
in the Shift_JIS character set. This is not so surprising, because there are many
shared characters in Chinese and Japanese.
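The observations above can also be checked without opening the files. The following is a minimal sketch, assuming a JDK whose extended charsets (GBK, Big5, Shift_JIS) may or may not be installed, so each name is tested with Charset.isSupported first. The class name ReplacementCheck and the helpers hex and survives are hypothetical; the sample string is just "A" plus one Chinese character, not the hello message itself.

```java
import java.nio.charset.Charset;

class ReplacementCheck {
   // Encode 'text' in the given charset and return the bytes as hex pairs.
   // String.getBytes(Charset) replaces unmappable characters with the
   // charset's replacement bytes, which is 0x3F ('?') for ASCII.
   static String hex(String text, Charset cs) {
      StringBuilder sb = new StringBuilder();
      for (byte b : text.getBytes(cs)) {
         if (sb.length() > 0) sb.append(' ');
         sb.append(String.format("%02X", b & 0xFF));
      }
      return sb.toString();
   }

   // True if every character of 'text' can be represented in the charset.
   static boolean survives(String text, Charset cs) {
      return cs.newEncoder().canEncode(text);
   }

   public static void main(String[] args) {
      String sample = "A\u4F60";
      String[] names = {"US-ASCII", "ISO-8859-1", "UTF-8",
                        "GBK", "Big5", "Shift_JIS"};
      for (String name : names) {
         if (!Charset.isSupported(name)) {
            System.out.println(name + ": not available in this JDK");
            continue;
         }
         Charset cs = Charset.forName(name);
         System.out.println(name + ": " + hex(sample, cs)
            + " encodable=" + survives(sample, cs));
      }
   }
}
```

For US-ASCII and ISO-8859-1 the sample encodes to 41 3F, matching the 0x3F placeholders described above, while UTF-8 gives 41 E4 BD A0.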
Viewing Unicode Text
Now, we have these greeting messages saved in several different encodings. The next
question is how to display them as glyphs of the corresponding languages on the screen.
One of the ways I have used in the past is to run a multi-language enabled Web browser
like IE to view the text files. To do this, we have to mark up the text into an HTML file,
by using a program like this one:
/**
 * EncodingHtml.java
 * Copyright (c) 2002 by Dr. Yang
 *
 * This program allows you to mark up a text file into an HTML file.
 */
import java.io.*;
import java.util.*;
class EncodingHtml {
   static HashMap charsetMap = new HashMap();
   public static void main(String[] a) {
      String inFile = a[0];
      String inCharsetName = a[1];
      String outFile = inFile + ".html";
      try {
         // Read and write with the same charset, so the text bytes
         // are copied unchanged between the HTML head and tail.
         InputStreamReader in = new InputStreamReader(
            new FileInputStream(inFile), inCharsetName);
         OutputStreamWriter out = new OutputStreamWriter(
            new FileOutputStream(outFile), inCharsetName);
         writeHead(out, inCharsetName);
         int c = in.read();
         int n = 0;
         while (c!=-1) {
            out.write(c);
            n++;
            c = in.read();
         }
         writeTail(out);
         in.close();
         out.close();
         System.out.println("Number of characters: "+n);
      } catch (IOException e) {
         System.out.println(e.toString());
      }
   }
   // Writes the HTML header, declaring the charset in a meta tag.
   public static void writeHead(OutputStreamWriter out, String cs)
      throws IOException {
      out.write("<html><head>\n");
      out.write("<meta http-equiv=\"Content-Type\""+
         " content=\"text/html; charset="+cs+"\">\n");
      out.write("</head><body><pre>");
   }
   // Closes the tags opened by writeHead().
   public static void writeTail(OutputStreamWriter out)
      throws IOException {
      out.write("</pre></body></html>\n");
   }
}
Now, let's compile this program and run it with hello.utf-8:
javac EncodingHtml.java
java EncodingHtml hello.utf-8 utf-8
If you have installed IE with Chinese language support, you should
be able to open the output file, hello.utf-8.html, and enjoy reading the
messages in English, Simplified Chinese, and Traditional Chinese.
Then, run EncodingHtml with the other encodings:
java EncodingHtml hello.gbk gbk
java EncodingHtml hello.big5 big5
java EncodingHtml hello.shift_jis shift_jis
View the output files with IE, and compare the results:
- hello.utf-8.html - IE auto sets View/Encoding to utf-8. All messages are perfect.
- hello.gbk.html - IE auto sets View/Encoding to gb2312. All messages are perfect.
- hello.big5.html - IE auto sets View/Encoding to big5. The Simplified Chinese message has two bad characters.
- hello.shift_jis.html - IE auto sets View/Encoding to shift_jis. Both the Simplified and Traditional Chinese messages have bad characters.
If you manually change the setting of View/Encoding, IE will not be able to display the
messages with the correct glyphs.
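The effect of a wrong View/Encoding setting can be imitated in Java by decoding bytes with a charset other than the one used to write them. This is only a sketch of the general idea, not a model of what IE does internally; the class name WrongCharsetDemo and the helper reinterpret are hypothetical, and UTF-8 read as ISO-8859-1 is used because both charsets are available in every JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WrongCharsetDemo {
   // Encode 'text' with the charset it was written in, then decode the
   // same bytes with the charset the viewer assumes.
   static String reinterpret(String text, Charset written, Charset assumed) {
      return new String(text.getBytes(written), assumed);
   }

   public static void main(String[] args) {
      String original = "\u4F60\u597D";  // two Chinese characters
      String right = reinterpret(original,
         StandardCharsets.UTF_8, StandardCharsets.UTF_8);
      String wrong = reinterpret(original,
         StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
      System.out.println(right.equals(original));  // true
      System.out.println(wrong.equals(original));  // false
      System.out.println(wrong.length());          // 6
   }
}
```

With the matching charset the two characters come back intact. With the wrong one, each of the six UTF-8 bytes is shown as a separate Latin-1 character, which is the kind of garbled display a mismatched encoding setting produces.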