Tutorial Notes on Unicode - JDK - Encoding Map Counts
   // Encodes one character with the named encoding.
   // Returns null if the character cannot be encoded.
   public static byte[] encodeByEncoder(char c, String cs) {
      Charset cso = null;
      byte[] b = null;
      try {
         cso = Charset.forName(cs);
         CharsetEncoder e = cso.newEncoder();
         e.reset();
         ByteBuffer bb = e.encode(CharBuffer.wrap(new char[] {c}));
         if (bb.limit()>0) b = copyBytes(bb.array(),bb.limit());
      } catch (IllegalCharsetNameException e) {
         System.out.println(e.toString());
      } catch (CharacterCodingException e) {
         // invalid character, return null
      }
      return b;
   }
   // Prints the encoded bytes in hex, or " XX" for an invalid character.
   public static void printBytes(byte[] b) {
      if (b!=null) {
         for (int j=0; j<b.length; j++)
            System.out.print(" "+byteToHex(b[j]));
      } else {
         System.out.print(" XX");
      }
   }
   // Copies the first l bytes of a into a new array.
   public static byte[] copyBytes(byte[] a, int l) {
      byte[] b = new byte[l];
      for (int i=0; i<Math.min(l,a.length); i++) b[i] = a[i];
      return b;
   }
   // Converts a byte to two hex digits, using the hexDigit lookup table
   // that this class references (its declaration is not part of this excerpt).
   public static String byteToHex(byte b) {
      char[] a = { hexDigit[(b >> 4) & 0x0f], hexDigit[b & 0x0f] };
      return new String(a);
   }
   // Converts a char to four hex digits, high byte first.
   public static String charToHex(char c) {
      byte hi = (byte) (c >>> 8);
      byte lo = (byte) (c & 0xff);
      return byteToHex(hi) + byteToHex(lo);
   }
}
Note that:
- CharsetEncoder.encode() is used to encode the code points stored as "char" type.
- Since a Java "char" can only hold code points in the 0x0000 - 0xFFFF range, only
a subset of the character set will be encoded for some encodings, like UTF-8,
which can encode code points up to 0x10FFFF.
- The encoding name should be specified as the command argument.
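The second note above can be illustrated with a small sketch. It is not part of the original program: the class name SupplementarySketch and the helper utf8Hex are made up for this example, and it assumes Java 7 or later (for StandardCharsets). It shows that a code point above 0xFFFF needs two "char" values, which is why a loop over single "char" values never reaches it, even though UTF-8 can encode it.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the author's program.
class SupplementarySketch {
    // Encodes a single code point (which may be above 0xFFFF) as UTF-8
    // and returns the bytes as space-separated hex digits.
    static String utf8Hex(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        int cp = 0x10400; // a code point outside the 0x0000 - 0xFFFF range
        // Number of "char" values Java needs to store this code point
        System.out.println("chars needed = " + Character.charCount(cp));
        // UTF-8 bytes of the same code point
        System.out.println("UTF-8 bytes  = " + utf8Hex(cp));
    }
}
```

Running it should report 2 chars and the 4-byte UTF-8 sequence F0 90 90 80 for U+10400.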
If you run this program with US-ASCII as the argument, you will get:
US-ASCII encoding:
0000 > 00 - 007F > 7F = 128
0080 > XX - FFFF > XX = 65408
Total characters = 65536
Valid characters = 128
Invalid characters = 65408
This tells us that the US-ASCII character set has only 128 characters.
If you run this program with ISO-8859-1 (Latin 1) as the argument, you will get:
ISO-8859-1 encoding:
0000 > 00 - 00FF > FF = 256
0100 > XX - FFFF > XX = 65280
Total characters = 65536
Valid characters = 256
Invalid characters = 65280
This tells us that the ISO-8859-1 character set has only 256 characters.
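The valid-character counts in these two runs can be cross-checked with a shorter sketch. Note that it swaps in CharsetEncoder.canEncode(char) instead of the encode() call used in the program above, and the class name EncodableCountSketch and method count are hypothetical names chosen for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical sketch, not the author's program.
class EncodableCountSketch {
    // Counts how many of the 65536 possible "char" values the named
    // encoding can encode. canEncode(char) returns false for surrogates.
    static int count(String encodingName) {
        CharsetEncoder encoder = Charset.forName(encodingName).newEncoder();
        int valid = 0;
        for (int i = 0; i <= 0xFFFF; i++) {
            if (encoder.canEncode((char) i)) valid++;
        }
        return valid;
    }

    public static void main(String[] args) {
        // Encoding name from the command argument, US-ASCII if none is given
        String name = args.length > 0 ? args[0] : "US-ASCII";
        int valid = count(name);
        System.out.println(name + " encoding:");
        System.out.println("Total characters = 65536");
        System.out.println("Valid characters = " + valid);
        System.out.println("Invalid characters = " + (65536 - valid));
    }
}
```

For UTF-8 this approach counts 63488 valid characters, because the 2048 surrogate values (0xD800 - 0xDFFF) cannot be encoded on their own.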
The following table is based on the output of the EncodingCounter program with
different valid encoding names. It provides a quick comparison between
some of the different encodings.
Encoding    Map     US-ASCII
Name        Size    Compatible  Notes
US-ASCII      128   Y           7-bit characters only
ISO-8859-1    256   Y           8-bit (single byte) characters
CP1252        251   Y           One byte output, with code points up to 0x2122
UTF-8       63488   Y           1-3 bytes
UTF-16BE    63488   N           2 bytes, a direct copy of the code point
UTF-16LE    63488   N           2 bytes, the code point's bytes in reverse order
UTF-16      63488   N           4 bytes, last 2 bytes = UTF-16BE
GBK         24068   Y           1-2 bytes, Chinese 1993 standard
GB18030     63488   Y           1-4 bytes, superset of GBK, 2000 standard
Source: s Notes on JDK.
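The three UTF-16 rows in the table can be compared directly with a short sketch. The class name Utf16VariantsSketch and the helper hex are hypothetical, and the sketch assumes Java 7 or later for StandardCharsets. It encodes one character, "A" (code point 0x0041), with each variant.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the author's program.
class Utf16VariantsSketch {
    // Encodes a string with the given charset and returns the bytes
    // as space-separated hex digits.
    static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String a = "A"; // code point 0x0041
        System.out.println("UTF-16BE: " + hex(a, StandardCharsets.UTF_16BE));
        System.out.println("UTF-16LE: " + hex(a, StandardCharsets.UTF_16LE));
        System.out.println("UTF-16:   " + hex(a, StandardCharsets.UTF_16));
    }
}
```

UTF-16BE gives 00 41 and UTF-16LE gives 41 00. Java's UTF-16 encoder writes the byte order mark FE FF first, so its output is FE FF 00 41, which accounts for the 4 bytes noted in the table.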