On a computer, all information is stored as binary code. The conversion between the symbols we recognize on screen (English letters, Chinese characters) and the binary code used for storage is called encoding.
Two basic concepts need explaining first, charset and character encoding:
charset: the character set, a table that maps symbols to numbers. It determines that 107 is the 'k' in 'koubei' and that 21475 is the '口' ("mouth") in '口碑' (koubei, "reputation"). Different tables define different mappings, for example ASCII, GB2312 and Unicode. With this number-to-character table, we can turn a number expressed in binary into a character.
character encoding: the encoding method. For example, '口' has the number 21475; should we express it as \u53e3, or as %E5%8F%A3? That is decided by the character encoding.
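To make the distinction concrete, here is a minimal sketch that takes one charset number and writes it out in two different encodings. It uses console.log instead of alert so that it also runs outside a browser, and the helper name toUnicodeEscape is made up for this illustration.

```javascript
// The charset assigns the number; the encoding decides how it is written down.
var codePoint = '口'.charCodeAt(0); // the number Unicode assigns to this character

// Hypothetical helper: write a number below 0x10000 as a \uXXXX escape.
function toUnicodeEscape(n) {
  var hex = n.toString(16);
  while (hex.length < 4) hex = '0' + hex;
  return '\\u' + hex;
}

console.log(codePoint);                  // 21475
console.log(toUnicodeEscape(codePoint)); // \u53e3
console.log(encodeURIComponent('口'));   // %E5%8F%A3
```

The number never changes; only its written form does.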
A string like 'koubei.com' consists of characters that Americans use every day. For these they developed a character set called ASCII, in full the American Standard Code for Information Interchange. It uses the 128 numbers 0-127 (2 to the 7th power, 0x00-0x7F) to represent 128 commonly used characters such as 123abc. That is 7 bits in total; together with one more leading bit (which acts as the sign bit when the byte is read as a signed number, with complement representations for negative values), 8 bits make up one byte. The Americans were a little stingy back then: if a byte had been designed as 16 or 32 bits from the very beginning, the world would have far fewer problems. At the time, though, they presumably felt that 8 bits were plenty, since 128 different characters could already be expressed!
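A quick way to see the 7-bit limit is to test whether every character code in a string fits in the range 0x00-0x7F. This is only a sketch; isAscii is a hypothetical helper name, and console.log is used so the lines also run outside a browser.

```javascript
// Hypothetical helper: true when every character code fits in 7 bits (0-127).
function isAscii(str) {
  for (var i = 0; i < str.length; i++) {
    if (str.charCodeAt(i) > 0x7f) return false;
  }
  return true;
}

console.log('k'.charCodeAt(0));     // 107
console.log(isAscii('koubei.com')); // true
console.log(isAscii('口碑'));       // false
```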
Since computers were invented by Americans, they kept things economical for themselves: encoding the symbols they used at home was enough, and it worked nicely. But once computers began to spread internationally, problems arose. Take Chinese as an example: with tens of thousands of Chinese characters, what can be done?
The existing system built on 8-bit bytes could not be broken, and changing the byte to something like 16 bits would have been too great a change. The only option was to go another way: use several ASCII-sized bytes to express one other character, that is, MBCS (Multi-Byte Character System).
The problem Chinese developers run into most often is conversion between GBK, GB2312 and UTF-8. Strictly speaking, that phrasing is not accurate: GBK and GB2312 are character sets (charset), while UTF-8 is an encoding (character encoding), one of the encodings of the UCS character set in the Unicode standard. Because web pages that use the Unicode character set are mostly encoded in UTF-8, people usually list the three side by side, but this is not accurate.
With Unicode, at least until humanity meets an extraterrestrial civilization, we have a master key, and everyone is using it now. The most widely used Unicode encoding today is UTF-8 (8-bit UCS/Unicode Transformation Format), which has two particularly good properties:
- It encodes the UCS character set, which is universal across the world
- It is a variable-length character encoding, compatible with ASCII
The second point is a great advantage: systems that previously used pure ASCII stay compatible, and no extra storage is needed (with a fixed-length encoding in which every character takes 2 bytes, the storage occupied by ASCII characters would double).
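The storage argument can be checked with a few lines. The sketch below assumes an environment that provides TextEncoder (modern browsers and recent Node.js versions), and both function names are invented for this example. The fixed-width function simply models "2 bytes per character" and assumes characters from the Basic Multilingual Plane.

```javascript
// Bytes needed for the UTF-8 form; TextEncoder always produces UTF-8.
function utf8ByteLength(str) {
  return new TextEncoder().encode(str).length;
}

// Bytes needed by a hypothetical fixed-length encoding of 2 bytes per character.
function fixedTwoByteLength(str) {
  return str.length * 2;
}

console.log(utf8ByteLength('koubei.com'));     // 10
console.log(fixedTwoByteLength('koubei.com')); // 20
console.log(utf8ByteLength('口碑'));           // 6
```

Pure ASCII text costs exactly one byte per character in UTF-8, while each of the two Chinese characters costs three.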
UTF-8 is easiest to explain with a table:
U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
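The table can be read as a simple lookup from the character's number to the number of bytes. The sketch below mirrors the six rows as listed; utf8SequenceLength is a name chosen for this illustration. Note that the current UTF-8 definition (RFC 3629) stops at U+10FFFF, so only the first four rows are used in practice.

```javascript
// Number of bytes the table above assigns to a given character number.
function utf8SequenceLength(codePoint) {
  if (codePoint < 0 || codePoint > 0x7fffffff) {
    throw new RangeError('outside the range covered by the table');
  }
  if (codePoint <= 0x7f) return 1;      // 0xxxxxxx
  if (codePoint <= 0x7ff) return 2;     // 110xxxxx 10xxxxxx
  if (codePoint <= 0xffff) return 3;    // 1110xxxx 10xxxxxx 10xxxxxx
  if (codePoint <= 0x1fffff) return 4;  // 11110xxx + 3 continuation bytes
  if (codePoint <= 0x3ffffff) return 5; // 111110xx + 4 continuation bytes
  return 6;                             // 1111110x + 5 continuation bytes
}

console.log(utf8SequenceLength(107));   // 1
console.log(utf8SequenceLength(21475)); // 3
```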
To understand this table, looking at the first two rows is enough:
U-00000000 - U-0000007F: 0xxxxxxx
This is the first row. It means that if a byte in UTF-8 encoded data has the binary form 0xxxxxxx, beginning with 0 (decimal 0-127), then that single byte represents one character, with exactly the same meaning as in ASCII. Every other UTF-8 byte has the form 1xxxxxxx, beginning with 1 (greater than 127), and at least 2 such bytes are needed to represent one symbol. So the first bit of a byte is a switch that tells whether the character is an ASCII code. This is the compatibility mentioned above; in the English definition, it corresponds to these two properties of UTF-8:
UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
All UCS characters >U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.
Now let's look at the second row:
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
Look at the first byte, 110xxxxx. It says: I am not an ASCII code (my first bit is not 0); I am the first byte of a multi-byte character (my second bit is 1); the character I belong to consists of 2 bytes (my third bit is 0); the character information is stored from the fourth bit onward.
Look at the second byte, 10xxxxxx. It says: I am not an ASCII code (my first bit is not 0); I am not the first byte of a multi-byte character (my second bit is 0); the character information is stored from the third bit onward.
From this example we can summarize. In a long run of consecutive bytes, UTF-8 may use 2-6 bytes to represent one symbol, so compared with ASCII, where one byte is one symbol, it needs room for two extra pieces of information. First, where a symbol starts: a "start" marker, which in biological terms is like the position of the AUG initiation codon in protein translation. Second, how many bytes the symbol uses (if every symbol has a start marker, the length is not strictly required, but providing it adds fault tolerance when some bytes are lost). The solution is this: the second bit of a byte indicates whether it is the starting byte of a character (the first bit is already taken: 0 means ASCII, 1 means non-ASCII), so the first byte of a multi-byte symbol must be 11xxxxxx, a binary number between 192 and 255. Then, starting at the third bit, comes the length information: a third bit of 0 means the symbol has 2 bytes, and each additional 1 from the third bit onward means one more byte. UTF-8 defines characters of at most 6 bytes; the start byte of such a character needs four more 1s than the 2-byte start byte 110xxxxx, so it is 1111110x, as shown in the table above.
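The rules in this paragraph translate almost directly into bit tests. The following is a sketch rather than a full validator; classifyUtf8Byte and its return strings are invented for this example.

```javascript
// Classify a single byte (0-255) by its leading bits.
function classifyUtf8Byte(b) {
  if ((b & 0x80) === 0x00) return 'ascii';        // 0xxxxxxx
  if ((b & 0xc0) === 0x80) return 'continuation'; // 10xxxxxx
  // Start byte: the number of leading 1 bits is the length of the sequence.
  var length = 0;
  for (var mask = 0x80; (b & mask) !== 0; mask >>= 1) length++;
  if (length > 6) return 'invalid';               // 0xFE and 0xFF are never used
  return 'start:' + length;
}

console.log(classifyUtf8Byte(0x6b)); // ascii
console.log(classifyUtf8Byte(0xe5)); // start:3
console.log(classifyUtf8Byte(0x8f)); // continuation
```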
Now let's look at how the standard states the same thing in English:
The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.
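"Easy resynchronization" means that a reader dropped into the middle of a byte stream can find the next character boundary just by skipping continuation bytes. A minimal sketch, with nextCharStart as a made-up name and a hand-written byte array for the text 'k口.':

```javascript
// Return the index of the first byte at or after offset that starts a character.
function nextCharStart(bytes, offset) {
  var i = offset;
  // Continuation bytes all look like 10xxxxxx (0x80 to 0xBF); skip them.
  while (i < bytes.length && (bytes[i] & 0xc0) === 0x80) i++;
  return i;
}

var bytes = [0x6b, 0xe5, 0x8f, 0xa3, 0x2e]; // 'k', the three bytes of '口', '.'
console.log(nextCharStart(bytes, 2)); // 4, landed inside '口', so skip to '.'
console.log(nextCharStart(bytes, 1)); // 1, already at a start byte
```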
The actual information bits (the character's number in the charset) are written directly in binary, in order, onto the 'x' positions in the table above. Take the Chinese characters that we Chinese programmers deal with most: their numbers lie in the range U-00000800 - U-0000FFFF, and the table shows that UTF-8 encodes this range in 3 bytes (which is why UTF-8 text needs more storage than the EUC-CN encoding of the GB2312 character set, which uses 2 bytes per Chinese character). Take the character '口' from 口碑 (koubei) as an example. Its Unicode number is:
口: 21475 == 0x53e3 == binary 101001111100011
alert('\u53e3'); // gives '口'
alert(escape('口')); // gives '%u53E3'
alert(String.fromCharCode(21475)); // gives '口'
alert('口'.charCodeAt(0)); // gives 21475
alert(encodeURI('口')); // gives '%E5%8F%A3'
As you can see, a string literal can use \u followed by the hexadecimal Unicode number to express the character '口', and fromCharCode accepts the decimal Unicode number and returns the character '口'.
The second alert gives '%u53E3'. This is a non-standard Unicode encoding that belongs to URI percent-encoding, but its use has been formally rejected by the W3C and no RFC standardizes it. The ECMA-262 standard does specify this behavior for escape, presumably only as a temporary measure.
More interesting is the '%E5%8F%A3' from the fifth alert. What is it, and how is it obtained?
This is percent-encoding, which is widely used in URIs and is specified by RFC 3986.
RFC 3986 defines the unreserved characters of percent-encoding as follows:
Unreserved characters, per RFC 3986 (January 2005)
A-Z a-z 0 1 2 3 4 5 6 7 8 9 - _ . ~
In other words, these characters need not be encoded when they appear in a URI, because they play no part in the URI format and simply carry their literal meaning.
The reserved characters are as follows:
Reserved characters, per RFC 3986 (January 2005)
! * ' ( ) ; : @ & = + $ , / ? % # [ ]
These characters have special meaning in a URI. When one of them appears without being intended in that special role, it must be encoded as follows:
Reserved characters after percent-encoding
!   *   '   (   )   ;   :   @   &   =   +   $   ,   /   ?   %   #   [   ]
%21 %2A %27 %28 %29 %3B %3A %40 %26 %3D %2B %24 %2C %2F %3F %25 %23 %5B %5D
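JavaScript's built-in encodeURIComponent follows these tables closely, with one difference worth knowing: it leaves ! * ' ( ) untouched. The sketch below adds a small wrapper that escapes those five as well; the name encodeRfc3986 is invented here, and console.log is used so the lines also run outside a browser.

```javascript
// Hypothetical wrapper: percent-encode everything except the unreserved characters.
function encodeRfc3986(str) {
  return encodeURIComponent(str).replace(/[!'()*]/g, function (c) {
    return '%' + c.charCodeAt(0).toString(16).toUpperCase();
  });
}

console.log(encodeURIComponent('Az09-_.~')); // Az09-_.~ (unreserved, unchanged)
console.log(encodeURIComponent("!*'()"));    // !*'() (not escaped by the built-in)
console.log(encodeRfc3986("!*'()"));         // %21%2A%27%28%29
console.log(encodeRfc3986('a&b=c'));         // a%26b%3Dc
```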
The two characters after each % sign are a hexadecimal number, and for a Unicode character this number is one byte of its UTF-8 encoding written in another form.
Let us work through in detail why '口' becomes '%E5%8F%A3'.
We said above that the binary form of 21475, the Unicode number of '口', is:
101001111100011
We also said that for a Chinese character, the UTF-8 encoded form is:
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
Now we cut the binary code of '口' into pieces and fill them into the x positions:
101001111100011 = ----0101 --001111 --100011
101001111100011 = 1110xxxx 10xxxxxx 10xxxxxx
The first byte is one bit short, so it is padded with a 0 on the left:
11100101 10001111 10100011
alert(
'%' + parseInt('11100101', 2).toString(16) +
'%' + parseInt('10001111', 2).toString(16) +
'%' + parseInt('10100011', 2).toString(16)
); // gives '%e5%8f%a3'
And that is how %e5%8f%a3 is obtained.
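The same cut-and-fill can be written with bit operations instead of binary strings. This is a sketch for the 3-byte row only; encodeThreeByte and toPercent are names chosen for this example, and the range check does not single out the surrogate range U+D800-U+DFFF.

```javascript
// Encode one character number in U+0800..U+FFFF as three UTF-8 bytes.
function encodeThreeByte(codePoint) {
  if (codePoint < 0x800 || codePoint > 0xffff) {
    throw new RangeError('expects a number in U+0800..U+FFFF');
  }
  return [
    0xe0 | (codePoint >> 12),         // 1110xxxx: top 4 bits
    0x80 | ((codePoint >> 6) & 0x3f), // 10xxxxxx: middle 6 bits
    0x80 | (codePoint & 0x3f)         // 10xxxxxx: low 6 bits
  ];
}

// Write each byte as % plus two hexadecimal digits (bytes here are all >= 0x80).
function toPercent(bytes) {
  return bytes.map(function (b) {
    return '%' + b.toString(16).toUpperCase();
  }).join('');
}

var encoded = toPercent(encodeThreeByte(21475));
console.log(encoded);                     // %E5%8F%A3
console.log(encoded === encodeURI('口')); // true
```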
In addition, a quick word about the numeric character reference (NCR) in HTML. As we all know, special characters in HTML must be encoded: for example, & must be written as &amp;, and the same goes for special characters such as ® (&reg;). In fact, HTML can display any Unicode character this way. Edit an HTML document as follows:
The result is three '口' characters.
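A numeric character reference is just the character's Unicode number wrapped in &# and ; with decimal digits, or in &#x and ; with hexadecimal digits. The helper below is a sketch with an invented name (toNcr), and it assumes a single character from the Basic Multilingual Plane.

```javascript
// Build the decimal and hexadecimal numeric character references for one character.
function toNcr(ch) {
  var n = ch.charCodeAt(0);
  return {
    decimal: '&#' + n + ';',
    hex: '&#x' + n.toString(16) + ';'
  };
}

var ncr = toNcr('口');
console.log(ncr.decimal); // &#21475;
console.log(ncr.hex);     // &#x53e3;
```

Placed in an HTML document, both references render as the same character as the literal '口'.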
Another commonly used encoding is base64. It was designed for transports such as email that are not 8-bit clean, so that binary data like email attachments can be sent through them. It uses the 64 characters a-z A-Z 0-9 + / to express the original data, with = as padding, and turns every three consecutive bytes into four characters, so the length grows by 33%.
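The three-bytes-to-four-characters step is short enough to write out by hand. The following is an illustrative sketch only (browsers already provide btoa for this); base64Encode and ALPHABET are names made up here, and the input is an array of byte values.

```javascript
var ALPHABET = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// Encode an array of bytes (0-255): every 3 bytes become 4 characters.
function base64Encode(bytes) {
  var out = '';
  for (var i = 0; i < bytes.length; i += 3) {
    var has1 = i + 1 < bytes.length;
    var has2 = i + 2 < bytes.length;
    // Pack up to three bytes into one 24-bit number, missing bytes count as 0.
    var n = (bytes[i] << 16) | ((has1 ? bytes[i + 1] : 0) << 8) | (has2 ? bytes[i + 2] : 0);
    out += ALPHABET.charAt(n >> 18);
    out += ALPHABET.charAt((n >> 12) & 0x3f);
    out += has1 ? ALPHABET.charAt((n >> 6) & 0x3f) : '='; // '=' marks padding
    out += has2 ? ALPHABET.charAt(n & 0x3f) : '=';
  }
  return out;
}

console.log(base64Encode([77, 97, 110]));       // TWFu (the bytes of 'Man')
console.log(base64Encode([77, 97]));            // TWE=
console.log(base64Encode([0xe5, 0x8f, 0xa3]));  // 5Y+j (the UTF-8 bytes of '口')
```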
var a = 'reputation';
\u0061 = 'koubei.com'; // \u0061 is the identifier a written as a Unicode escape
alert(a); // gives 'koubei.com'
Of course, hackers have more professional ways of evading filters and injecting code (SQL injection, XSS attacks and so on).
Thank you all for reading. I am stauren, a front-end development engineer on the Yahoo Koubei UED team. This is my first article on the Koubei UED blog; if anything in it is wrong, please point it out. You are also welcome to visit my personal blog: