
Tutorial Notes on Unicode - JDK - Encoding Map Counts



 31 December 18:00   

    



    

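   // Note: this excerpt needs java.nio.ByteBuffer, java.nio.CharBuffer and
   // java.nio.charset.* imported at the top of the file. hexDigit is assumed
   // to be a char[] table of the 16 hexadecimal digits declared earlier in
   // the class.

   // Encodes one char with the named charset; returns null if it cannot be encoded.
   public static byte[] encodeByEncoder(char c, String cs) {
      Charset cso = null;
      byte[] b = null;
      try {
         cso = Charset.forName(cs);
         CharsetEncoder e = cso.newEncoder();
         e.reset();
         ByteBuffer bb = e.encode(CharBuffer.wrap(new char[] {c}));
         if (bb.limit()>0) b = copyBytes(bb.array(),bb.limit());
      } catch (IllegalCharsetNameException e) {
         System.out.println(e.toString());
      } catch (CharacterCodingException e) {
         // invalid character, return null
      }
      return b;
   }

   // Prints each byte as two hex digits, or " XX" if there are no bytes.
   public static void printBytes(byte[] b) {
      if (b!=null) {
         for (int j=0; j<b.length; j++)
            System.out.print(" "+byteToHex(b[j]));
      } else {
         System.out.print(" XX");
      }
   }

   // Copies the first l bytes of a into a new array of length l.
   public static byte[] copyBytes(byte[] a, int l) {
      byte[] b = new byte[l];
      for (int i=0; i<Math.min(l,a.length); i++) b[i] = a[i];
      return b;
   }

   // Converts a byte to two hex digits: high nibble first, then low nibble.
   public static String byteToHex(byte b) {
      char[] a = { hexDigit[(b >> 4) & 0x0f], hexDigit[b & 0x0f] };
      return new String(a);
   }

   // Converts a char to four hex digits: high byte first, then low byte.
   public static String charToHex(char c) {
      byte hi = (byte) (c >>> 8);
      byte lo = (byte) (c & 0xff);
      return byteToHex(hi) + byteToHex(lo);
   }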

    

}

    



    



    

Note that:

  • CharsetEncoder.encode() is used to encode the code points stored as "char" type.

  • Since a Java "char" can only hold values in the 0x0000 - 0xFFFF range, this program covers only a subset of the character set for some encodings, like UTF-8, which can encode code points up to 0x10FFFF.

  • The encoding name should be specified as a command-line argument.
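The second note can be illustrated with a minimal sketch, assuming Java 7 or later. The class name SupplementaryDemo, the helper utf8Of and the sample code point 0x1F600 are chosen for illustration only and are not part of the original program. The sketch shows that a code point above 0xFFFF needs two "char" values (a surrogate pair) in Java, and that UTF-8 encodes it as 4 bytes.

```java
import java.nio.charset.StandardCharsets;

class SupplementaryDemo {
    // Encodes one Unicode code point (inside or outside 0x0000 - 0xFFFF) as UTF-8.
    static byte[] utf8Of(int codePoint) {
        // Character.toChars yields 1 char for 0x0000 - 0xFFFF, otherwise a surrogate pair.
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        int cp = 0x1F600; // hypothetical sample: any code point above 0xFFFF works
        System.out.println("chars needed: " + Character.charCount(cp));
        System.out.println("UTF-8 bytes:  " + utf8Of(cp).length);
    }
}
```

Because the counting program loops over single "char" values, code points like this one never reach the encoder, which is why its totals stop at 65536.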



    



    

If you run this program with US-ASCII as the argument, you will get:

    



    

US-ASCII encoding:
0000 > 00 - 007F > 7F = 128
0080 > XX - FFFF > XX = 65408
Total characters = 65536
Valid characters = 128
Invalid characters = 65408

    



    



    

This tells us that the US-ASCII character set has only 128 characters.

    



    

If you run this program with ISO-8859-1 (Latin 1) as the argument, you will get:

    



    

ISO-8859-1 encoding:
0000 > 00 - 00FF > FF = 256
0100 > XX - FFFF > XX = 65280
Total characters = 65536
Valid characters = 256
Invalid characters = 65280

    



    



    

This tells us that the ISO-8859-1 character set has only 256 characters.
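The two counts above can be cross-checked with a shorter sketch that uses CharsetEncoder.canEncode(char) instead of encoding each character and catching the exception. This is an alternative technique, not the method used by the program in these notes, and the class name EncodableCount and the method count are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class EncodableCount {
    // Counts how many of the 65536 possible char values the named charset can encode.
    static int count(String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        int valid = 0;
        for (int i = 0; i <= 0xFFFF; i++) {
            if (encoder.canEncode((char) i)) valid++;
        }
        return valid;
    }

    public static void main(String[] args) {
        // The encoding name comes from the command line; US-ASCII is only a fallback.
        String name = args.length > 0 ? args[0] : "US-ASCII";
        int valid = count(name);
        System.out.println(name + " encoding:");
        System.out.println("Total characters = 65536");
        System.out.println("Valid characters = " + valid);
        System.out.println("Invalid characters = " + (65536 - valid));
    }
}
```

Unlike the full program, this sketch reports only the totals and does not print the ranges of valid and invalid code points.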

    



    

Comparison of Encoding Maps

    



    

The following table is based on the output of the EncodingCounter program with different valid encoding names. It provides a quick comparison between the different encodings.

    

Encoding     Map     US-ASCII
Name         Size    Compatible   Notes

US-ASCII       128   Y            7-bit characters only
ISO-8859-1     256   Y            8-bit (single byte) characters
CP1252         251   Y            One byte output, with code points up to 0x2122
UTF-8        63488   Y            1-3 bytes
UTF-16BE     63488   N            2 bytes, a direct copy of the code point
UTF-16LE     63488   N            2 bytes, the code point with its bytes swapped
UTF-16       63488   N            4 bytes, last 2 bytes = UTF-16BE
GBK          24068   Y            1-2 bytes, Chinese 1993 standard
GB18030      63488   Y            1-4 bytes, superset of GBK, 2000 standard
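The notes on the three UTF-16 rows can be checked with a small sketch, assuming Java 6 or later. The class name Utf16Layout and its helpers are hypothetical. It encodes the single character 'A' (code point 0x0041) with each variant and prints the bytes in hex. The 4-byte result for plain UTF-16 comes from the byte order mark (FE FF) that Java's encoder writes before the big-endian bytes.

```java
import java.nio.charset.Charset;

class Utf16Layout {
    // Formats bytes as space-separated two-digit uppercase hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes a single char with the named charset and returns the hex dump.
    static String encode(char c, String charsetName) {
        return hex(String.valueOf(c).getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        for (String name : new String[] {"UTF-16BE", "UTF-16LE", "UTF-16"}) {
            System.out.println(name + ": " + encode('A', name));
        }
        // prints:
        // UTF-16BE: 00 41
        // UTF-16LE: 41 00
        // UTF-16: FE FF 00 41
    }
}
```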

    



    



    

Source: s Addendum on JDK.

    



    



 

