Character encoding: ASCII, Unicode, UTF-8, GB2312
From the point of view of encoding, files can be divided into two kinds: ASCII files and binary files.
ASCII files are also known as text files. When stored on disk, each character occupies one byte, which holds that character's ASCII code. For example, the number 5678 is stored as:
ASCII code:     00110101  00110110  00110111  00111000
                   ↓         ↓         ↓         ↓
Decimal digit:     5         6         7         8
The four digits occupy 4 bytes in total. An ASCII file can be displayed on screen character by character. A source file, for instance, is an ASCII file, and the DOS TYPE command can display its contents. Because the file is displayed as characters, its contents can be read directly.
A binary file stores data in its binary encoding. For example, the number 5678 is stored as 00010110 00101110, which takes only two bytes. A binary file can also be displayed on screen, but its contents are not human-readable. When handling these files, the C system does not distinguish between the two types: it treats both as a stream of characters and processes them byte by byte. The start and end of an input or output stream are controlled by the program, not by physical markers such as carriage returns. For this reason such files are called "stream files".
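The difference between the two storage forms of 5678 can be checked with a few lines of Python. This is only a sketch: the variable names are made up for illustration, and the binary form assumes the high byte is written first, which is how the bits are shown above.

```python
number = 5678

# Text (ASCII) form: one byte per decimal digit.
as_text = str(number).encode("ascii")

# Binary form: the value itself as a 2-byte integer, high byte first (assumed).
as_binary = number.to_bytes(2, byteorder="big")

print([format(b, "08b") for b in as_text])    # four bytes, one per digit
print([format(b, "08b") for b in as_binary])  # two bytes for the whole number
print(len(as_text), len(as_binary))
```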
This is a light read written by a programmer for programmers. "Light" here means that it lets you pick up, fairly painlessly, some concepts that used to be unclear and so level up your knowledge, much like leveling up in an RPG. Two questions motivated this article:
Question 1: Using "Save As" in Windows Notepad, you can convert a file between the GBK, Unicode, Unicode big endian and UTF-8 encodings. They are all txt files, so how does Windows identify the encoding?
I noticed long ago that txt files saved as Unicode, Unicode big endian and UTF-8 start with a few extra bytes: FF FE (Unicode), FE FF (Unicode big endian) and EF BB BF (UTF-8). But what standard are these markers based on?
Question 2: I recently came across ConvertUTF.c online, which converts between the UTF-32, UTF-16 and UTF-8 encodings. I already understood encodings such as Unicode (UCS-2), GBK and UTF-8, but this program left me a little confused: I could not remember how UTF-16 relates to UCS-2.
After looking things up I finally sorted out these questions and, along the way, learned some details of Unicode. I have written them up as an article for friends who have had similar questions. I have tried to keep the article easy to follow, but the reader is expected to know what a byte is and what hexadecimal is.
0, big endian and little endian
Big endian and little endian are two different ways in which a CPU handles multi-byte numbers. For example, the Unicode code of the character "汉" (han) is 6C49. When it is written to a file, should 6C be written first, or 49? Writing 6C first is big endian; writing 49 first is little endian.
The word "endian" comes from "Gulliver's Travels". The civil war in Lilliput started over whether an egg should be cracked open at the big end (Big-Endian) or at the little end (Little-Endian). The dispute led to six rebellions, in which one emperor lost his life and another lost his throne.
In Chinese, "endian" is usually translated as "byte order", and big endian and little endian are called "big tail" and "little tail".
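The two byte orders for the code 6C49 can be made concrete with Python's int.to_bytes. This is a minimal illustration; the variable names are assumptions of the sketch.

```python
code_point = 0x6C49  # Unicode code of "汉", as given in the text

big = code_point.to_bytes(2, byteorder="big")        # 6C is written first
little = code_point.to_bytes(2, byteorder="little")  # 49 is written first

print(big.hex(), little.hex())
```

Python's "utf-16-be" and "utf-16-le" codecs produce the same two byte sequences for "汉".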
1, character encoding, internal code, and a brief introduction to Chinese character encoding
Characters must be encoded before a computer can process them. The default encoding that a computer uses is its internal code. Early computers used the 7-bit ASCII code. To handle Chinese characters, programmers designed GB2312 for Simplified Chinese and Big5 for Traditional Chinese.
GB2312 (1980) contains 7445 characters in total: 6763 Chinese characters and 682 other symbols. In the Chinese character area, the high byte of the internal code ranges from B0 to F7 and the low byte from A1 to FE, which gives 72 * 94 = 6768 code positions. Five of them, D7FA-D7FE, are vacant.
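The counts quoted above are easy to re-derive. The following sketch simply redoes the arithmetic from the byte ranges given in the text; the variable names are illustrative.

```python
high_bytes = range(0xB0, 0xF7 + 1)  # high bytes of the Chinese character area
low_bytes = range(0xA1, 0xFE + 1)   # low bytes of the Chinese character area

code_positions = len(high_bytes) * len(low_bytes)
vacant = len(range(0xD7FA, 0xD7FE + 1))  # D7FA..D7FE are unused
hanzi = code_positions - vacant
total = hanzi + 682                      # plus the 682 other symbols

print(len(high_bytes), len(low_bytes), code_positions, hanzi, total)
```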
GB2312 supports too few Chinese characters. GBK 1.0, the Chinese character extension specification of 1995, contains 21,886 symbols, divided into a Chinese character area and a graphic symbol area. The Chinese character area contains 21,003 characters. GB18030, published in 2000, is the official national standard that replaces GBK 1.0. It contains 27,484 Chinese characters and also covers Tibetan, Mongolian, Uighur and other major minority scripts. PC platforms are now required to support GB18030, while embedded products are exempt for the time being, so mobile phones and MP3 players generally support only GB2312.
From ASCII through GB2312 and GBK to GB18030, these encodings are backward compatible: the same character always has the same code in each of them, and the later standards support more characters. In these encodings English and Chinese can be handled uniformly. A Chinese code is recognized by the fact that the highest bit of its high byte is not 0. In programmers' terms, GB2312, GBK and GB18030 all belong to the double-byte character sets (DBCS).
The default internal code of some Chinese versions of Windows is still GBK, and it can be upgraded to GB18030 with the GB18030 upgrade package. However, the characters that GB18030 adds to GBK are ones that ordinary users rarely need, so we usually still use GBK to refer to the internal code of Chinese Windows.
Here are some further details:
The original text of GB2312 uses area codes. To convert an area code to the internal code, add A0 to the high byte and to the low byte.
In DBCS, the GB internal code is always stored in big endian format, that is, high byte first.
In GB2312 the highest bit of both bytes is 1. But only 128 * 128 = 16384 code positions satisfy this condition, so in GBK and GB18030 the highest bit of the low byte may not be 1. This does not affect the parsing of a DBCS character stream: when reading the stream, whenever a byte with its highest bit set is encountered, that byte and the one after it can be treated as one double-byte code, whatever the highest bit of the low byte is.
2, Unicode, UCS and UTF
As mentioned earlier, the encodings from ASCII through GB2312 and GBK to GB18030 are backward compatible. Unicode, however, is compatible only with ASCII (more precisely, with ISO-8859-1) and not with the GB codes. For example, the Unicode code of "汉" (han) is 6C49, while its GB code is BABA.
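A quick check of this compatibility claim, using Python's built-in codecs. The variable names are illustrative, and "é" is chosen here only as an example of an ISO-8859-1 character outside ASCII.

```python
han = "汉"
unicode_code = ord(han)          # Unicode code of the character
gb_code = han.encode("gb2312")   # GB internal code of the same character

# ASCII and ISO-8859-1 byte values are the same numbers as the Unicode codes.
ascii_same = ord("A") == "A".encode("ascii")[0]
latin1_same = ord("é") == "é".encode("iso-8859-1")[0]

print(hex(unicode_code), gb_code.hex(), ascii_same, latin1_same)
```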
Unicode is also a character encoding method, but it was designed by international organizations as a coding scheme that can accommodate all the world's languages. The formal name of Unicode is "Universal Multiple-Octet Coded Character Set", abbreviated UCS. UCS can also be read as an abbreviation of "Unicode Character Set".
According to Wikipedia (http://zh.wikipedia.org/wiki/), two organizations historically tried to design Unicode independently: the International Organization for Standardization (ISO) and an association of software manufacturers (unicode.org). ISO developed the ISO 10646 project, and the Unicode Consortium developed the Unicode project.
Around 1991, both sides recognized that the world does not need two incompatible character sets. They began to merge their work and to cooperate on a single code table. Starting with Unicode 2.0, the Unicode project has used the same character repertoire and codes as ISO 10646-1.
Both projects still exist and publish their standards independently. At the time of writing, the latest version from the Unicode Consortium is Unicode 4.1.0 (2005), and the latest ISO standard is 10646-3:2003.
UCS specifies how to represent text using multiple bytes. How these codes are transmitted is specified by the UTF (UCS Transformation Format) specifications. Common UTF specifications include UTF-8, UTF-7 and UTF-16.
The IETF's RFC 2781 and RFC 3629 describe the UTF-16 and UTF-8 encodings in the usual RFC style: clear, crisp, and yet rigorous. I can never remember that IETF stands for Internet Engineering Task Force, but the RFCs that the IETF maintains are the basis of every specification on the Internet.
3, UCS-2, UCS-4, BMP
UCS comes in two formats: UCS-2 and UCS-4. As the names suggest, UCS-2 encodes with two bytes and UCS-4 with four bytes (of which only 31 bits are used; the highest bit must be 0). Let's do some simple arithmetic:
UCS-2 has 2^16 = 65,536 code positions; UCS-4 has 2^31 = 2,147,483,648 code positions.
UCS-4 is divided into 2^7 = 128 groups according to the highest byte (whose highest bit is 0). Each group is divided into 256 planes according to the second-highest byte. Each plane is divided into 256 rows according to the third byte, and each row contains 256 cells. Cells in the same row differ only in the last byte.
Plane 0 of group 0 is called the Basic Multilingual Plane, or BMP. In other words, the UCS-4 code positions whose two high bytes are 0 make up the BMP.
Removing the two leading zero bytes from a UCS-4 code in the BMP gives the UCS-2 code. Adding two zero bytes in front of a UCS-2 code gives the UCS-4 code in the BMP. Characters outside the BMP do exist in current versions of the standard (for example, rare CJK ideographs in the supplementary planes), and they cannot be represented in UCS-2.
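The group/plane/row/cell division can be written down directly as bit operations. The sketch below is illustrative only; the function names ucs4_parts and in_bmp are made up for this example.

```python
def ucs4_parts(code: int) -> tuple:
    """Split a UCS-4 code into (group, plane, row, cell)."""
    if not 0 <= code < 2 ** 31:
        raise ValueError("UCS-4 codes use only 31 bits")
    group = (code >> 24) & 0x7F   # highest byte, top bit always 0
    plane = (code >> 16) & 0xFF   # second-highest byte
    row = (code >> 8) & 0xFF      # third byte
    cell = code & 0xFF            # last byte
    return group, plane, row, cell

def in_bmp(code: int) -> bool:
    """True if the code lies in plane 0 of group 0."""
    group, plane, _, _ = ucs4_parts(code)
    return group == 0 and plane == 0

print(ucs4_parts(0x6C49), in_bmp(0x6C49), in_bmp(0x10000))
```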
4, UTF encoding
UTF-8 encodes the UCS in units of 8 bits. The mapping from UCS-2 to UTF-8 is as follows:
UCS-2 code (hex)    UTF-8 byte stream (binary)
0000 - 007F         0xxxxxxx
0080 - 07FF         110xxxxx 10xxxxxx
0800 - FFFF         1110xxxx 10xxxxxx 10xxxxxx
For example, the Unicode code of "汉" (han) is 6C49. 6C49 lies between 0800 and FFFF, so it must use the 3-byte template: 1110xxxx 10xxxxxx 10xxxxxx. Written in binary, 6C49 is 0110 110001 001001. Substituting these bits for the x's in the template, in order, gives 11100110 10110001 10001001, that is, E6 B1 89.
Readers can use Notepad to check that this encoding is correct.
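The template substitution can also be sketched in Python. This is a minimal sketch, not a complete encoder: the function name ucs2_to_utf8 is made up for illustration, and the code does not reject the surrogate range D800-DFFF.

```python
def ucs2_to_utf8(code: int) -> bytes:
    """Encode one UCS-2 code (0000-FFFF) with the three templates above."""
    if code < 0:
        raise ValueError("negative code")
    if code <= 0x007F:
        return bytes([code])                        # 0xxxxxxx
    if code <= 0x07FF:
        return bytes([0xC0 | (code >> 6),           # 110xxxxx
                      0x80 | (code & 0x3F)])        # 10xxxxxx
    if code <= 0xFFFF:
        return bytes([0xE0 | (code >> 12),          # 1110xxxx
                      0x80 | ((code >> 6) & 0x3F),  # 10xxxxxx
                      0x80 | (code & 0x3F)])        # 10xxxxxx
    raise ValueError("not a UCS-2 code")

print(ucs2_to_utf8(0x6C49).hex())
```

The result for 6C49 agrees with Python's own encoder, "汉".encode("utf-8").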
UTF-16 encodes the UCS in units of 16 bits. For UCS codes below 0x10000, the UTF-16 encoding is simply the 16-bit unsigned integer equal to the UCS code. For UCS codes of 0x10000 and above, an algorithm is defined. UCS-2 codes, and UCS-4 codes within the BMP, are necessarily below 0x10000, so for these characters UTF-16 and UCS-2 can be regarded as essentially the same. However, UCS-2 is only an encoding scheme, whereas UTF-16 is used for actual transmission, so the question of byte order has to be considered.
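The article does not spell out the algorithm for codes of 0x10000 and above. In RFC 2781 it is the surrogate pair scheme, which can be sketched as below. The function name utf16_units is made up for this example, and the test character U+1F600 is chosen only because it lies outside the BMP.

```python
def utf16_units(code: int) -> list:
    """Return the 16-bit UTF-16 code units for one UCS code."""
    if code < 0x10000:
        return [code]                   # same 16-bit value as the UCS code
    if code > 0x10FFFF:
        raise ValueError("beyond the range that UTF-16 can encode")
    offset = code - 0x10000             # a 20-bit value remains
    high = 0xD800 | (offset >> 10)      # top 10 bits go into the first unit
    low = 0xDC00 | (offset & 0x3FF)     # bottom 10 bits go into the second unit
    return [high, low]

print([hex(u) for u in utf16_units(0x6C49)])
print([hex(u) for u in utf16_units(0x1F600)])
```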
5, UTF byte ordering and BOM
UTF-8 uses the byte as its coding unit, so it has no byte order problem. UTF-16 uses two bytes as its coding unit, so before interpreting UTF-16 text we must first find out the byte order of each coding unit. For example, the Unicode code of "奎" (kui) is 594E and that of "乙" (yi) is 4E59. If we receive the UTF-16 byte stream "594E", is it "奎" or "乙"?
The method that the Unicode specification recommends for marking byte order is the BOM. This BOM is not the "Bill Of Material" but the Byte Order Mark. The BOM is a rather clever little idea:
The UCS has a character called "ZERO WIDTH NO-BREAK SPACE", whose code is FEFF. FFFE, on the other hand, is not a character in the UCS, so it should never appear in actual transmission. The UCS specification suggests that we transmit the character "ZERO WIDTH NO-BREAK SPACE" before transmitting the byte stream.
If the receiver gets FEFF, the byte stream is Big-Endian; if it gets FFFE, the byte stream is Little-Endian. For this reason the character "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.
UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to indicate the encoding. The UTF-8 encoding of "ZERO WIDTH NO-BREAK SPACE" is EF BB BF (readers can verify this with the encoding method described above). So if the receiver gets a byte stream that begins with EF BB BF, it knows that the stream is UTF-8 encoded.
Windows uses the BOM in exactly this way to mark the encoding of text files.
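A BOM check of this kind takes only a few lines. The sketch below uses the BOM constants from Python's codecs module; the function name sniff_bom is hypothetical, and this is not a description of how Windows itself implements the check.

```python
import codecs

def sniff_bom(data: bytes) -> str:
    """Guess the encoding of a byte stream from its BOM, if it has one."""
    if data.startswith(codecs.BOM_UTF8):       # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "utf-16-le"
    return "unknown"

# Without a BOM, the two bytes 59 4E are ambiguous:
ambiguous = bytes([0x59, 0x4E])
print(ambiguous.decode("utf-16-be"), ambiguous.decode("utf-16-le"))

# With a BOM in front, the byte order is known:
print(sniff_bom(bytes([0xFE, 0xFF, 0x59, 0x4E])))
```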
6, further reference material
The main reference for this article is "Short overview of ISO-IEC 10646 and Unicode" (http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html).
I also found two other documents that look good, but since my original questions had been answered by then, I did not read them:
"Understanding Unicode: A general introduction to the Unicode Standard" (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-Chapter04a)
"Character set encoding basics: Understanding character set encodings and legacy encodings" (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-Chapter03)
About character encodings: ASCII, Unicode, UTF-8, GB2312
1. ASCII code
Inside a computer, all information is ultimately represented as a string of binary digits. Each binary digit (bit) has two states, 0 and 1, so eight bits can be combined into 256 states; this unit is called a byte. In other words, one byte can represent 256 different states, each corresponding to one symbol, for 256 symbols in all, from 00000000 to 11111111.
In the 1960s the United States drew up a character encoding that fixed the relationship between English characters and bit patterns. It is called ASCII and is still in use today.
ASCII defines codes for 128 characters. For example, the space character SPACE is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 32 control symbols that cannot be printed) occupy only the last 7 bits of a byte; the first bit is always 0.
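The two examples can be reproduced with a small helper that prints the 8-bit pattern of an ASCII character. The function name ascii_bits is made up for this illustration.

```python
def ascii_bits(ch: str) -> str:
    """Return the 8-bit pattern of a single ASCII character."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not an ASCII character")
    return format(code, "08b")

for ch in (" ", "A"):
    print(repr(ch), ord(ch), ascii_bits(ch))
```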
2, non-ASCII encoding
128 symbols are enough to encode English, but not enough for other languages. In French, for example, letters can carry accent marks above them, which ASCII cannot represent. So some European countries decided to use the idle highest bit of the byte to encode new symbols. For example, é in French is encoded as 130 (binary 10000010). In this way the encodings used in these European countries can represent up to 256 symbols.
However, this created a new problem. Different countries have different letters, so even though they all use 256-symbol encodings, the same code stands for different letters. For example, 130 represents é in the French encoding, the letter Gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding. In all of these encodings, though, the symbols for 0-127 are the same; only the range 128-255 differs.
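The effect can be reproduced with three legacy code pages that ship with Python. The text does not name the encodings it has in mind, so the choice of cp437 (Western European), cp862 (Hebrew) and cp866 (Russian) is an assumption of this sketch.

```python
byte_130 = bytes([130])

# Assumed code pages: the text does not say which encodings it refers to.
for codepage in ("cp437", "cp862", "cp866"):
    print(codepage, byte_130.decode(codepage))

# The lower half, 0-127, decodes identically in all three.
lower_half = bytes(range(128))
same_lower_half = (lower_half.decode("cp437")
                   == lower_half.decode("cp862")
                   == lower_half.decode("cp866"))
print(same_lower_half)
```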
The scripts of Asian countries use even more symbols; there are around 100,000 Chinese characters. One byte can represent only 256 symbols, which is certainly not enough, so several bytes must be used for one symbol. For example, the common encoding for Simplified Chinese is GB2312, which uses two bytes per character and can therefore represent at most 256 x 256 = 65536 symbols in theory.
Chinese encoding deserves a separate discussion and is not covered in this note. The only point made here is that, although the GB family of Chinese encodings also uses several bytes per symbol, it is unrelated to the Unicode and UTF-8 discussed below.
3. Unicode
The Unicode character set is abbreviated UCS. In April 1984 the International Organization for Standardization set up the ISO/IEC JTC1/SC2/WG2 working group to create a unified encoding for the scripts and symbols of all countries. In 1991 multinational companies in the United States founded the Unicode Consortium, which in October 1991 reached an agreement with WG2 to use the same coded character set. Unicode currently uses a 16-bit encoding scheme whose character set is the same as the BMP (Basic Multilingual Plane) of ISO 10646. Unicode passed the DIS (Draft International Standard) stage in June 1992. The current version, V2.0, was released in 1996 and contains 6811 symbols, 20,902 Chinese characters, 11,172 Korean Hangul syllables, a user-defined area of 6400 positions and 20,249 reserved positions, 65,534 in total. In this 16-bit form every character has the same encoded size: the letter "a" and the Chinese character "好" (good) both take exactly two bytes.
Unicode can represent the characters of all languages, and it is a fixed-length double-byte (or four-byte) encoding, even for Latin letters. So it can be said to be incompatible with the ISO-8859-1 encoding, and indeed with any other encoding. Compared with ISO-8859-1, however, the Unicode code merely adds a 0 byte in front; the letter 'a', for example, is "00 61".
Fixed-length codes are convenient for computers to process (note that GB2312/GBK are not fixed-length encodings), and Unicode can represent all characters, so a great deal of software, Java for example, uses Unicode internally.
Unicode is of course a very large set; at its current size it can hold more than one million symbols. Each symbol has its own code: U+0639 is the Arabic letter Ain, U+0041 is the English capital letter A, and U+4E25 is the Chinese character "严" (strict). For the full symbol tables, consult unicode.org or a dedicated Chinese character table such as http://www.chi2ko.com/tool/CJK.htm
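The three codes quoted above can be looked up with Python's unicodedata module, and the "00 61" form of the letter a can be reproduced with the two-byte big-endian codec. This is only a lookup sketch; the variable names are illustrative.

```python
import unicodedata

for code in (0x0639, 0x0041, 0x4E25):
    ch = chr(code)
    print(f"U+{code:04X}", ch, unicodedata.name(ch))

# In the fixed two-byte form, "a" is its ISO-8859-1 byte with a 0 byte in front.
two_byte_a = "a".encode("utf-16-be")
print(two_byte_a.hex())
```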
4. The problem with Unicode
Note that Unicode is only a symbol set. It specifies the binary code of each symbol, but not how that binary code should be stored.
For example, the Unicode code of the Chinese character "严" (strict) is hexadecimal 4E25, which is a full 15 bits in binary (100111000100101). Representing this symbol therefore needs at least 2 bytes. Other, larger symbols may need 3 or 4 bytes, or even more.
This raises two serious problems. The first: how can Unicode be told apart from ASCII? How does the computer know that three bytes represent one symbol rather than three separate symbols? The second: we already know that one byte is enough for an English letter. If Unicode required every symbol to use three or four bytes, every English letter would be preceded by two or three 0 bytes. That is a huge waste of storage, making text files two or three times larger, which is unacceptable.
The consequences were: 1) several ways of storing Unicode appeared, that is, there are many different binary formats that can represent Unicode; 2) for a long time Unicode could not be widely adopted, until the Internet came along.
5. UTF-8
The spread of the Internet created a strong demand for a unified encoding. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 and UTF-32, but they are rarely used on the Internet. To repeat the relationship: UTF-8 is one of the ways of implementing Unicode.
The most important feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes per symbol, with the length depending on the symbol.
The encoding rules of UTF-8 are simple. There are only two:
1) For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code. For English letters, therefore, the UTF-8 encoding is identical to the ASCII code.
2) For a symbol of n bytes (n > 1), the first n bits of the first byte are set to 1 and bit n+1 is set to 0; the first two bits of every following byte are set to 10. All the remaining bits hold the symbol's Unicode code.
The following table summarizes the encoding rules. The letter x marks the bits available for the code.
Unicode symbol range    | UTF-8 encoding
(hexadecimal)           | (binary)
0000 0000 - 0000 007F   | 0xxxxxxx
0000 0080 - 0000 07FF   | 110xxxxx 10xxxxxx
0000 0800 - 0000 FFFF   | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000 - 0010 FFFF   | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
The Chinese character "严" (strict) serves as an example of how UTF-8 encoding is carried out.
The Unicode code of "严" is 4E25 (100111000100101). According to the table, 4E25 falls in the range of the third row (0000 0800 - 0000 FFFF), so the UTF-8 encoding of "严" needs three bytes, in the format "1110xxxx 10xxxxxx 10xxxxxx". Then, starting from the last bit of "严", fill in the x's of the format from back to front, padding the leftover positions with 0. The result is that the UTF-8 encoding of "严" is "11100100 10111000 10100101", or E4B8A5 in hexadecimal.
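Read in reverse, the two rules also tell a decoder how many bytes belong to one symbol: the number of leading 1 bits in the first byte is the length n. The following is a minimal sketch of that idea; the function name utf8_decode_one is hypothetical, and the sketch skips checks that a real decoder performs, such as rejecting overlong forms.

```python
def utf8_decode_one(data: bytes) -> tuple:
    """Decode the first symbol of a UTF-8 byte string.

    Returns (Unicode code, number of bytes consumed).
    """
    first = data[0]
    if first < 0x80:                  # 0xxxxxxx: a single-byte symbol
        return first, 1
    n = 0
    while first & (0x80 >> n):        # count the leading 1 bits: that is n
        n += 1
    if n < 2 or n > 4 or len(data) < n:
        raise ValueError("invalid or truncated sequence")
    code = first & (0x7F >> n)        # bits after the n ones and the 0
    for byte in data[1:n]:
        if byte & 0xC0 != 0x80:       # following bytes must start with 10
            raise ValueError("invalid following byte")
        code = (code << 6) | (byte & 0x3F)
    return code, n

code, length = utf8_decode_one(bytes([0xE4, 0xB8, 0xA5]))
print(hex(code), length)
```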
6. Converting between Unicode and UTF-8
The example in the previous section shows that the Unicode code of "严" is 4E25 while its UTF-8 encoding is E4B8A5; the two are different. Converting between them can be done by a program.
On Windows, the simplest way to convert is the built-in Notepad program, Notepad.exe. Open the file, choose "Save As" from the "File" menu, and a dialog box appears with an "Encoding" drop-down list at the bottom.
It offers four options: ANSI, Unicode, Unicode big endian and UTF-8.
1) ANSI is the default encoding. For English files it is ASCII; for Simplified Chinese files it is GB2312 (on the Simplified Chinese version of Windows only; the Traditional Chinese version uses Big5).
2) Unicode refers to the UCS-2 encoding, which stores a character's Unicode code directly in two bytes. This option uses the little endian format.
3) Unicode big endian is the counterpart of the previous option. The next section explains what little endian and big endian mean.
4) UTF-8 is the encoding discussed in the previous section.
After choosing the encoding, click the "Save" button and the file is converted to that encoding immediately.
7. Little endian and Big endian
As the previous section mentioned, a Unicode code can be stored directly in UCS-2 format. Take the Chinese character "严" (strict): its Unicode code is 4E25, which needs two bytes of storage, one byte being 4E and the other 25. If 4E is stored first and 25 second, that is the Big endian way; if 25 is stored first and 4E second, that is the Little endian way.
These two odd names come from "Gulliver's Travels" by the British writer Jonathan Swift. In the book, a civil war breaks out in Lilliput because people argue over whether an egg should be cracked open at the big end (Big-Endian) or at the little end (Little-Endian). Six wars were fought over this matter; one emperor lost his life and another lost his throne.
So storing the first byte first is the "big end way" (Big endian), and storing the second byte first is the "little end way" (Little endian).
This naturally raises a question: how does a computer know which of the two ways a file uses?
The Unicode specification defines a character that is placed at the very beginning of each file to indicate the byte order. The character is called "ZERO WIDTH NO-BREAK SPACE" and is represented by FEFF. This is exactly two bytes, and FF is 1 greater than FE.
If the first two bytes of a text file are FE FF, the file uses the big end way; if the first two bytes are FF FE, the file uses the little end way.
Here is an example.
Open the Notepad program Notepad.exe, create a new text file whose only content is the character "严", and save it in turn with the ANSI, Unicode, Unicode big endian and UTF-8 encodings.
Then use the hex function of the text editor UltraEdit to examine how each file is encoded internally.
1) ANSI: the file consists of the two bytes "D1 CF", which is exactly the GB2312 encoding of "严". This also suggests that GB2312 is stored the big end way.
2) Unicode: the file consists of the four bytes "FF FE 25 4E". "FF FE" indicates little end storage, and the actual code is 4E25.
3) Unicode big endian: the file consists of the four bytes "FE FF 4E 25". "FE FF" indicates big end storage.
4) UTF-8: the file consists of the six bytes "EF BB BF E4 B8 A5". The first three bytes, "EF BB BF", indicate that this is UTF-8; the last three, "E4B8A5", are the actual encoding of "严", stored in the same order as the code is written.
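The four byte sequences can be reproduced without Notepad. The sketch below is an approximation under stated assumptions: it takes Python's "gbk" codec as a stand-in for ANSI on Simplified Chinese Windows, and it prepends the BOMs by hand (the "utf-8-sig" codec adds EF BB BF itself).

```python
import codecs

text = "严"

# Keys mirror Notepad's four options; the codec choices are assumptions.
saved = {
    "ANSI": text.encode("gbk"),
    "Unicode": codecs.BOM_UTF16_LE + text.encode("utf-16-le"),
    "Unicode big endian": codecs.BOM_UTF16_BE + text.encode("utf-16-be"),
    "UTF-8": text.encode("utf-8-sig"),
}

for option, data in saved.items():
    print(option + ":", " ".join(f"{b:02X}" for b in data))
```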
9.1 GB code
The full name of GB2312-80 is "Coded Chinese character set for information interchange, basic set". Released in 1980, it is the national standard for Chinese information processing, and it is the only mandatory Chinese encoding in mainland China and in overseas regions that use Simplified Chinese (such as Singapore). P-Windows 3.2 and Apple OS use GB2312 as their basic Chinese character encoding; Windows 95/98 use GBK as their basic Chinese character encoding while remaining compatible with GB2312.
Double-byte encoding range: A1A1 ~ FEFE
A1-A9: symbol area, containing 682 symbols
B0-F7: Chinese character area, containing 6763 Chinese characters
GB2312-80 contains 7445 characters in total, each encoded in two bytes. The GB2312-80 code is referred to as the national standard code. In the national standard code itself the highest bit of each byte is 0; the internal code, whose range is given above, sets the highest bit of both bytes to 1.
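The relationship between the internal code, the national standard code and the area code can be sketched as follows. The helper names are made up for this example, and "严" is used as the sample character.

```python
def internal_to_national(internal: bytes) -> bytes:
    """Clear the highest bit of each byte: internal code -> national standard code."""
    return bytes(b & 0x7F for b in internal)

def internal_to_area(internal: bytes) -> tuple:
    """Subtract A0 from each byte: internal code -> (area, position)."""
    return internal[0] - 0xA0, internal[1] - 0xA0

internal = "严".encode("gb2312")
print(internal.hex(), internal_to_national(internal).hex(), internal_to_area(internal))
```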
Traditional characters: GB12345-90, "Coded Chinese character set for information interchange, first supplementary set", is the encoding standard for traditional characters drawn up in 1990. Its purpose is to standardize the situations in which traditional characters must be used, such as the collation of ancient books. The standard contains 6866 Chinese characters (103 more than GB2312; most other vendors' character sets do not include these characters), of which about 2,200 are purely traditional forms.
Double-byte encoding range: A1A1 ~ FEFE
A1-A9: symbol area, with vertical-layout symbols added
B0-F9: Chinese character area, containing 6866 Chinese characters
GBK (Chinese Internal Code Specification) is a Chinese encoding extension standard drawn up in mainland China and aligned with the UCS. GBK can represent both traditional and simplified characters, whereas GB2312 can represent only simplified characters, and GBK is compatible with GB2312. The work of the GBK working group dates from October 1995, and the GBK specification was completed in December of the same year. The standard is compatible with GB2312 and contains 21,003 Chinese characters and 883 symbols, and it provides 1894 code positions for user-defined characters. Simplified and traditional characters are combined in a single repertoire. The Simplified Chinese versions of Windows 95/98 use GBK as the surface encoding of their character sets, linked to the underlying character set through a one-to-one code table between GBK and the UCS.
English name: Chinese Internal Code Specification
Chinese name: 汉字内码扩展规范 1.0 (Chinese character internal code extension specification, version 1.0)
Double-byte encoding; an extension of GB2312-80 that is compatible with GB2312-80 in its code positions
Range: 8140 ~ FEFE (excluding xx7F), 23,940 code positions in total
Contains 21,003 Chinese characters, including all the CJK ideographs in ISO/IEC 10646-1
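The figure of 23,940 code positions follows directly from the stated range. This sketch redoes the count; the variable names are illustrative.

```python
lead_bytes = range(0x81, 0xFE + 1)                                # first byte: 81..FE
trail_bytes = [b for b in range(0x40, 0xFE + 1) if b != 0x7F]     # second byte: 40..FE without 7F

code_positions = len(lead_bytes) * len(trail_bytes)
print(len(lead_bytes), len(trail_bytes), code_positions)
```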