December 3, 2009

1. Character Set


In the computer world we need to represent a great many characters. For a computer to display them correctly, we encode them, putting characters and numeric codes in one-to-one correspondence. When the system reads a document under a given encoding, it automatically converts the codes inside into the corresponding characters and displays them on screen. (We will not discuss here how a character is rendered on the monitor as a dot matrix.)
Because Chinese has far more characters, its encodings are naturally more complex than those of Western languages. So when writing code or using software, we often run into garbled Chinese text.

g.cn's home page is encoded in UTF-8 (on receiving the HTML, the browser first auto-detects its encoding). If we then tell the browser to parse the page as GB2312, the text on the page turns into garbage.
Why is this?
When we request a URL, the server encodes the content in some character set and transmits it to the browser as bytes; the browser then decodes those bytes according to some encoding. If the encoding used to produce the bytes and the encoding you use to decode them do not match, the result is garbled.
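The mismatch described above can be reproduced in a few lines. This is only a minimal sketch: the class and method names are made up for illustration, and it assumes the JVM ships the GB2312 charset (standard JDK builds do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MismatchedDecodeDemo {
    // Encodes text as UTF-8 (what the server sends) and decodes the bytes
    // with clientCharset (what the browser has been told to assume).
    static String decodeUtf8BytesAs(String text, String clientCharset) {
        byte[] wire = text.getBytes(StandardCharsets.UTF_8);
        return new String(wire, Charset.forName(clientCharset));
    }

    public static void main(String[] args) {
        // The two Chinese characters used throughout this article,
        // written as escapes so the source file encoding does not matter.
        String original = "\u4E2D\u6587";

        String right = decodeUtf8BytesAs(original, "UTF-8");
        String wrong = decodeUtf8BytesAs(original, "GB2312");

        System.out.println(original.equals(right)); // true
        System.out.println(original.equals(wrong)); // false
    }
}
```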

Let us first introduce several encodings that we meet frequently in everyday use.
In daily use we run into ISO 8859-1, GB2312, GBK, GB18030, Big5, Unicode and other character sets or encodings; these are all concepts at the same level. Some readers may ask: what about UTF-8 and UTF-16? Unicode is rather special. Under Unicode each character corresponds to a unique code, but there are several ways to represent that code on a computer. Such a representation is called a Unicode Transformation Format (UTF for short), so UTF-8 and UTF-16 are simply encoding schemes that implement Unicode. A few of these encodings are explained below:
1. ISO 8859-1
Officially numbered ISO/IEC 8859-1:1998, also known as Latin-1 or "Western European", it is the first 8-bit character set in the ISO/IEC 8859 family of the International Organization for Standardization. Building on ASCII, it adds 96 letters and symbols in the previously unused range 0xA0-0xFF, for languages that use the Latin alphabet with additional symbols. An earlier edition was published as ISO 8859-1:1987. ISO-8859-1 is a single-byte encoding.
2. GBK (GB extended)
Its full name is the Chinese Internal Code Extension Specification (English name: Chinese Internal Code Specification). The K is the initial consonant of "kuo" in "kuo zhan", the pinyin of the Chinese word for "extension". GBK is an extension of GB2312: besides Simplified Chinese it also supports Traditional Chinese. The standard now officially mandated by the People's Republic of China is GB18030.
3. Unicode
Now for Unicode. In Java or JavaScript, a Unicode string for "中文" is usually written as "\u4E2D\u6587", with each character occupying two bytes. UTF-8 is a variable-length encoding: an ordinary English character needs only one byte, while each Chinese character takes three bytes. UTF-16 uses a fixed two-byte code unit, so at the byte level it is not compatible with ASCII, and it comes in two storage forms, big-endian and little-endian.
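The byte counts and the two UTF-16 byte orders mentioned above can be checked with a short sketch like the following. The class and helper names are invented for this example; the charsets are the standard ones in java.nio.charset.StandardCharsets.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class UnicodeFormsDemo {
    // Number of bytes text occupies once encoded with charset.
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String latin = "a";
        String chinese = "\u4E2D\u6587"; // two Chinese characters

        System.out.println(chinese.length());                               // 2 chars
        System.out.println(byteLength(latin, StandardCharsets.UTF_8));      // 1
        System.out.println(byteLength(chinese, StandardCharsets.UTF_8));    // 6
        System.out.println(byteLength(latin, StandardCharsets.UTF_16BE));   // 2
        System.out.println(byteLength(chinese, StandardCharsets.UTF_16BE)); // 4

        // Same code unit 0x4E2D, two byte orders.
        byte[] be = "\u4E2D".getBytes(StandardCharsets.UTF_16BE);
        byte[] le = "\u4E2D".getBytes(StandardCharsets.UTF_16LE);
        System.out.printf("%02X %02X%n", be[0] & 0xFF, be[1] & 0xFF); // 4E 2D
        System.out.printf("%02X %02X%n", le[0] & 0xFF, le[1] & 0xFF); // 2D 4E
    }
}
```

Note that even "a" takes two bytes in UTF-16, which is why UTF-16 is not byte-compatible with ASCII.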

2. Character encoding in Java code, from compilation to execution


Here is what we need to consider:
1. The encoding of the source file
2. The encoding parameter specified at compile time (Eclipse automatically picks the compiler encoding parameter to match your source file's encoding)
3. The system default encoding (which can be obtained through System.getProperty("file.encoding"))
4. The encoding the console or terminal is set to display
5. At run time, a String inside the JVM is Unicode

First, we follow a simple piece of Java code from compilation to execution to illustrate the encodings used along the way.

package com.lukejin.stringtest;

import java.io.UnsupportedEncodingException;

public class StringTest {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String chinese = "ab中文";
        System.out.println(chinese);
    }
}
This code looks very simple.
Assume the source file is encoded in GBK. Opening it in a tool that shows the raw bytes, you can see that "中文" is stored as the GBK bytes D6 D0 CE C4.

After compiling with Eclipse (at compile time Eclipse automatically adds the parameter -encoding gbk for you), open the compiled class file in a binary editor.

After compilation, "中文" has been converted to UTF-8, in which three bytes represent each Chinese character:

E4 B8 AD E6 96 87
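The byte values quoted in this section can be reproduced with a small hex dump. This is a sketch under assumptions: the helper names are hypothetical, and the GBK charset must be available in the JVM. It encodes the string from StringTest both ways, once as the GBK source file stores it and once as UTF-8.

```java
import java.nio.charset.Charset;

class HexDumpDemo {
    // Formats bytes as space-separated upper-case hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Hex dump of text after encoding it with the named charset.
    static String hexOf(String text, String charsetName) {
        return toHex(text.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        String chinese = "ab\u4E2D\u6587"; // same value as in StringTest
        System.out.println(hexOf(chinese, "GBK"));   // 61 62 D6 D0 CE C4
        System.out.println(hexOf(chinese, "UTF-8")); // 61 62 E4 B8 AD E6 96 87
    }
}
```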

So what happens when the JVM runs it?
First, chinese is a correct Unicode string, "ab中文".
System.out.println(chinese);
This statement encodes chinese into a byte stream using the system default encoding and sends it to the output stream.
The terminal attached to the output stream then decodes the bytes it receives according to the terminal's own character encoding and displays the result.
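The two steps above (the stream encodes, the terminal decodes) can be simulated without a real console. In the sketch below a PrintStream with an explicit encoding stands in for System.out, and a plain new String(...) stands in for the terminal; the class and method names are illustrative only, and GBK is assumed to be available.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

class ConsoleEncodingDemo {
    // Bytes a PrintStream emits for text when it encodes with streamCharset.
    static byte[] printedBytes(String text, String streamCharset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, streamCharset);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
        return sink.toByteArray();
    }

    // What a terminal configured for terminalCharset would display.
    static String terminalView(byte[] bytes, String terminalCharset) {
        return new String(bytes, Charset.forName(terminalCharset));
    }

    public static void main(String[] args) {
        String chinese = "ab\u4E2D\u6587";
        byte[] bytes = printedBytes(chinese, "GBK");

        System.out.println(bytes.length);                                 // 6
        System.out.println(chinese.equals(terminalView(bytes, "GBK")));   // true
        System.out.println(chinese.equals(terminalView(bytes, "UTF-8"))); // false
    }
}
```

The output is readable only when the stream's encoding and the terminal's encoding agree.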

3. The two encoding methods


Two important methods concerning String encoding need to be mentioned:
getBytes(String charset)
new String(byte[] bytes, String charset)
Both methods are defined relative to String, that is, relative to Unicode strings.
Here is what the two do in detail.
getBytes encodes each char of the string according to charset and returns the resulting encoded byte array; this is an encode step.
new String(byte[] bytes, String charset) decodes the byte array according to charset, stores the decoded characters as Unicode, and returns the Unicode string; this is a decode step.
The following example demonstrates the process. Suppose
String a = "中文"; // of course, this can also be written in Unicode form: String a = "\u4E2D\u6587";
byte[] bs = a.getBytes("gbk");

String b = new String(bs, "iso-8859-1"); // decoding with gbk here would naturally give back the original a


At this point b is garbled, but since no information has been lost, it can still be restored to Chinese.

The recovery step is String c = new String(b.getBytes("iso-8859-1"), "gbk");
This process can be traced step by step.
First, b is encoded with iso-8859-1 into the bytes "D6D0 CEC4".
Then "D6D0 CEC4" is decoded as GBK to give "中文".
"中文" is represented in Unicode (because a String in Java is Unicode), so c is "中文" ("\u4E2D\u6587").

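The garble-and-restore round trip traced above can be written out as a runnable sketch. The method names misdecode and restore are invented for this example, and GBK is assumed to be available in the JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    static final Charset GBK = Charset.forName("GBK");

    // Encode as GBK, then (wrongly) decode the bytes as ISO-8859-1.
    static String misdecode(String a) {
        byte[] bs = a.getBytes(GBK);
        return new String(bs, StandardCharsets.ISO_8859_1);
    }

    // Undo the mistake: recover the original bytes, then decode as GBK.
    static String restore(String b) {
        byte[] bs = b.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bs, GBK);
    }

    public static void main(String[] args) {
        String a = "\u4E2D\u6587";
        String b = misdecode(a); // four Latin-1 characters: U+00D6 U+00D0 U+00CE U+00C4
        String c = restore(b);

        System.out.println(b.length()); // 4
        System.out.println(a.equals(b)); // false
        System.out.println(a.equals(c)); // true
    }
}
```

The restore works because ISO-8859-1 maps every byte value 0x00-0xFF to exactly one character, so decoding and re-encoding with it loses nothing.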
4. Common encoding problems in web programming



Why do we need so-called transcoding in Java at all?
The key reason is that the string was constructed with the wrong character set.
For example, the front end sends a GBK-encoded byte stream bytes, but the server mistakenly decodes it into characters using the iso-8859-1 character set.
This is equivalent to new String(bytes, "iso-8859-1");
Of course, in many cases this step is not written by us but carried out by the servlet framework, and you can change the character set it uses.

When the servlet sends the response back to the front end, encoding comes into play again: there is the encoding of the content you output, and there is telling the browser on the receiving end which encoding to use for decoding.
For instance, before calling response.getWriter(); we can call

response.setContentType("text/html; charset=utf-8");

This statement sets the output encoding; in fact it plays two roles:

  • First, it sets the encoding of this response's output. If we have a string a, the bytes sent to the browser are exactly a.getBytes(the encoding you set above).
  • Second, it puts this encoding in the header of the response sent to the browser, so that the browser can pick the correct encoding to decode the bytes.

Of course, since Servlet 2.4 there is also the setCharacterEncoding method, which sets the encoding on its own. You can read the web container's source code to see this.
Both methods must be called before

response.getWriter();

to take effect; the character set used to encode the response byte stream is then the same as the charset in the contentType response header.


Here is a problem I ran into:

In the back end there is a servlet, called by jQuery ajax from the front end, that returns a string containing Chinese. For historical reasons, the string obtained is a String a that was decoded as iso-8859-1, and the system default encoding is also iso-8859-1. If we print a directly to the console, we find that the Chinese comes out correctly (and think: strange, shouldn't the Chinese be broken?). The explanation is that output to the console goes through the following two steps.
First, the wrong String is encoded into bytes as iso-8859-1, and these bytes are identical to the correct GBK-encoded bytes. Second, our SecureCRT is set to GBK, so it decodes these iso-8859-1-encoded bytes as GBK. By a lucky accident, the Chinese is displayed correctly.

Now back to the web case. Here the encoding used to write the response bytes and the encoding the browser uses to decode are the same (if they differed, we could reproduce the console trick).
Therefore we must first convert: String b = new String(a.getBytes("iso-8859-1"), "gbk"); so that the string in b is the correct string.
Then we only need to send it to the client's browser in an encoding that supports Chinese:

response.setContentType("text/html; charset=utf-8");
PrintWriter out = response.getWriter();
out.write(b);
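The repair-then-send sequence can be tried outside a servlet container. The sketch below makes several assumptions: an OutputStreamWriter over a byte buffer stands in for response.getWriter(), a plain UTF-8 decode stands in for the browser, and all class and method names are made up for illustration. It is not the servlet API itself.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ResponseFixDemo {
    static final Charset GBK = Charset.forName("GBK");

    // Repairs a string whose GBK bytes were wrongly decoded as ISO-8859-1.
    static String repair(String a) {
        return new String(a.getBytes(StandardCharsets.ISO_8859_1), GBK);
    }

    // Stand-in for writing through response.getWriter() with a given charset.
    static byte[] writeResponse(String body, Charset charset) {
        ByteArrayOutputStream wire = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(wire, charset)) {
            out.write(body);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return wire.toByteArray();
    }

    public static void main(String[] args) {
        String intended = "\u4E2D\u6587";
        // The wrongly decoded string a that the back end is holding.
        String a = new String(intended.getBytes(GBK), StandardCharsets.ISO_8859_1);

        byte[] unrepaired = writeResponse(a, StandardCharsets.UTF_8);
        byte[] repaired = writeResponse(repair(a), StandardCharsets.UTF_8);

        // The "browser" decodes with the charset announced in the header.
        System.out.println(intended.equals(new String(unrepaired, StandardCharsets.UTF_8))); // false
        System.out.println(intended.equals(new String(repaired, StandardCharsets.UTF_8)));   // true
    }
}
```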


In many cases the garbled text that troubles people is also related to JavaScript. The main reason is that inside the JS runtime a String is also Unicode (UTF-16, to be precise).
So when using ajax, people are easily confused by garbling problems. Note that it is enough to make sure the bytes the server returns to the ajax call are correctly encoded (for Chinese, for example, that they are the correct GBK or UTF-8 bytes for that text, and so on).

5. Summary of garbled text


In the Java run-time world (garbling introduced at compile time is not covered here), garbled text originates in two places, namely the two methods mentioned above. (Of course, sometimes a framework calls one of them for us, so what you receive is a String a already converted from a byte array that came over the network.)

  • getBytes(String charset): when a Unicode String is encoded with a specified charset that does not contain a character (for example, iso-8859-1 has no Chinese characters), the character is encoded as 3F (a question mark). Information is lost and cannot be restored.
  • new String(byte[] bytes, String charset): when a byte array is decoded with a specified character set and some byte sequence is not valid in that encoding (for example, part of the array is not valid UTF-8), the Unicode string gets "\uFFFD" in its place. This is the REPLACEMENT CHARACTER, and it is displayed as a question mark.
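Both lossy cases can be observed directly. The following is a minimal sketch with illustrative names; the GBK bytes D6 D0 CE C4 are used as an example of input that is not valid UTF-8.

```java
import java.nio.charset.StandardCharsets;

class LossyConversionDemo {
    // Encoding into a charset that lacks the characters: each becomes 0x3F.
    static byte[] encodeLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Decoding bytes that are not valid UTF-8: U+FFFD is substituted.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] lost = encodeLatin1("\u4E2D\u6587");
        System.out.printf("%02X %02X%n", lost[0] & 0xFF, lost[1] & 0xFF); // 3F 3F
        // Decoding again only yields question marks; the original is gone.
        System.out.println(new String(lost, StandardCharsets.ISO_8859_1)); // ??

        byte[] gbkBytes = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};
        String decoded = decodeUtf8(gbkBytes);
        System.out.println(decoded.indexOf('\uFFFD') >= 0); // true
    }
}
```

Unlike the iso-8859-1 round trip in section 3, neither of these conversions can be undone, because the original byte or character values are no longer present in the result.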

    Therefore, garbled text typically appears in the following situations:
    1. A file written in one encoding is read and parsed in another. This is certain to garble, and it is what we often see when opening a file in the operating system.
    2. A byte stream received over the network is decoded with the wrong encoding, so we get the wrong Unicode string.
    3. A correct Unicode string is encoded in an encoding that differs from the console's and sent to the console for display. The output is garbled.