An Analysis of Encodings Commonly Used in HTML and JavaScript

In day-to-day front-end development we deal with HTML, JavaScript, CSS and other languages. Like a natural language, a computer language has its alphabet, grammar, morphology, encoding and so on. Here I want to talk, in simple terms, about the encoding problems with HTML and JavaScript that front-end developers often run into in daily work.

In a computer, information is stored as binary code. Encoding is the conversion between the symbols we recognize on screen, such as English letters and Chinese characters, and the binary code used to store them.

Two basic concepts need explaining first, charset and character encoding:

charset (character set): a table that maps symbols to numbers. It decides that 107 is the 'k' in koubei and that 21475 is the '口' in 口碑 (koubei). Different tables define different mappings, for example ASCII, GB2312 and Unicode. With such a table, we can turn a number stored in binary into a character.
character encoding: how that number is written down. For example, for the same '口' with the number 21475, do we write \u53e3, or do we write %E5%8F%A3? The character encoding decides.
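
To make the distinction concrete, here is a small sketch that takes the character 口 and shows the number the Unicode character set assigns to it, followed by two different written forms of that same number. It only uses built-in JavaScript functions; the variable names are my own and purely illustrative.

```javascript
// The character set maps a character to a number (its code point).
const ch = '口';
const codePoint = ch.codePointAt(0);         // 21475
const hex = codePoint.toString(16);          // '53e3'

// Two encodings of the same code point:
const jsEscape = '\\u' + hex;                // the six characters \u53e3
const percentForm = encodeURIComponent(ch);  // UTF-8 bytes, percent-encoded

console.log(codePoint, jsEscape, percentForm);
```

The number stays the same in both cases; only the way it is spelled out changes.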

For a string such as 'koubei', made of the characters Americans commonly use, they developed a character set called ASCII (American Standard Code for Information Interchange). It uses the 128 numbers 0-127 (2 to the 7th power, 0x00-0x7f) to represent 128 commonly used characters such as 123abc. That takes 7 bits; together with a leading eighth bit that ASCII does not use for characters, they make up an 8-bit byte. The Americans were a bit stingy here: had a byte been designed as 16 or 32 bits from the very beginning, the world would have far fewer problems. At the time they presumably felt that 8 bits were plenty, since 7 of them could already express 128 different characters!
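
A quick way to see the 0-127 limit in practice is to test whether a string stays inside it. The helper below is a sketch with a name I made up (`isAscii`); it simply checks every code unit against 0x7f.

```javascript
// Returns true when every UTF-16 code unit in the string is in 0x00-0x7F.
function isAscii(str) {
  for (let i = 0; i < str.length; i++) {
    if (str.charCodeAt(i) > 0x7f) return false;
  }
  return true;
}

console.log('k'.charCodeAt(0));  // 107
console.log(isAscii('koubei'));  // true
console.log(isAscii('口碑'));     // false
```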

Computers were invented by Americans, so they economized for their own needs: their own symbols could all be encoded, and that worked nicely. But when computers went international, problems arose. Take Chinese: with tens of thousands of Chinese characters, what to do?

The existing system built on 8-bit bytes could not be broken, and changing the byte to 16 bits would have been too great a change. The only way out was to use several ASCII-sized bytes to represent one other character, that is, MBCS (Multi-Byte Character System).
With the MBCS concept we can represent more characters. Two bytes give 16 bits, which in theory allows 2 to the 16th power, 65,536, characters. But how are these codes assigned to characters? Who decides that the Unicode number of the '口' in 口碑 is 21475? The character set, the charset just introduced. ASCII is the most basic character set; on top of it there are MBCS character sets such as GB2312 for Simplified Chinese and Big5 for Traditional Chinese. Finally an organization called the Unicode Consortium decided to create a character set covering all characters (UCS, Universal Character Set) together with the corresponding encoding standard, namely Unicode. Since releasing the first edition in 1991 it has published the Unicode Standard (ISBN 0-321-18578-1), and the International Organization for Standardization takes part in defining it as ISO/IEC 10646: the Universal Character Set. In short, Unicode is a standard that covers essentially every symbol and character in existence on earth, and it is used more and more widely. The ECMA standard also specifies that the JavaScript language uses Unicode internally (which means that JavaScript variable names, function names and so on may be written in Chinese!).

Chinese developers most often run into problems converting between GBK, GB2312 and UTF-8. Strictly speaking, this way of putting it is not accurate: GBK and GB2312 are character sets (charset), while UTF-8 is an encoding (character encoding), one of the encodings of the UCS character set in the Unicode standard. Because pages that use the Unicode character set mostly use the UTF-8 encoding, we usually list them side by side, but that is not really accurate.

With Unicode we have a master key, at least until mankind meets an extraterrestrial civilization, and everyone is using it now. The most widely used Unicode encoding today is UTF-8 (8-bit UCS/Unicode Transformation Format), which has two particularly good properties:

  1. It encodes the UCS character set, so it is usable all over the world
  2. It is a variable-length character encoding, compatible with ASCII

The second point is a great advantage: it keeps systems that previously used pure ASCII compatible without adding any extra storage. (Suppose a fixed-length encoding used two bytes for every character; the storage occupied by ASCII characters would then double.)
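
The storage argument can be checked with a few lines of JavaScript. This is only a sketch: `utf8Size` and `fixedTwoByteSize` are names I made up, and the "fixed two-byte" encoding is the hypothetical one from the paragraph above, not a specific real charset. `TextEncoder` is the built-in UTF-8 encoder of modern browsers and Node.js.

```javascript
// Bytes needed for a string in UTF-8.
function utf8Size(str) {
  return new TextEncoder().encode(str).length;
}

// Bytes needed in a hypothetical fixed-width encoding, two bytes per character
// (assumes every character fits in one 16-bit unit).
function fixedTwoByteSize(str) {
  return str.length * 2;
}

console.log(utf8Size('koubei'), fixedTwoByteSize('koubei')); // 6 12
console.log(utf8Size('口碑'), fixedTwoByteSize('口碑'));       // 6 4
```

ASCII text costs half as much in UTF-8, while Chinese text costs three bytes per character instead of two.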

UTF-8 is easiest to explain with a table:

U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
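
Read as code, the table is just a chain of range checks. The sketch below (the function name `utf8SequenceLength` is my own) returns how many bytes the table assigns to a code point. Note that the 5-byte and 6-byte rows belong to this original design; the later RFC 3629 restricts UTF-8 to at most 4 bytes, up to U+10FFFF.

```javascript
// Number of bytes the six-row UTF-8 table assigns to a code point.
function utf8SequenceLength(codePoint) {
  if (codePoint < 0 || codePoint > 0x7fffffff) {
    throw new RangeError('code point out of range');
  }
  if (codePoint <= 0x7f) return 1;
  if (codePoint <= 0x7ff) return 2;
  if (codePoint <= 0xffff) return 3;
  if (codePoint <= 0x1fffff) return 4;
  if (codePoint <= 0x3ffffff) return 5; // not valid under RFC 3629
  return 6;                             // not valid under RFC 3629
}

console.log(utf8SequenceLength(0x6b));   // 1, the letter k
console.log(utf8SequenceLength(0x53e3)); // 3, the character 口
```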

To understand this table, looking at the first two lines is enough.

U-00000000 - U-0000007F: 0xxxxxxx
This is the first line. It means that if a byte in UTF-8 encoded data has the binary form 0xxxxxxx, starting with 0, that is, between 0 and 127 in decimal, then that single byte represents one character, with exactly the same meaning as in ASCII. All other UTF-8 bytes start with 1 (1xxxxxxx, greater than 127), and such characters need at least 2 bytes each. So the first bit of a byte is a switch that says whether the character is ASCII or not. This is the compatibility just mentioned. In the English definition, these are two properties of the UTF-8 encoding:

UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
All UCS characters >U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.

Now look at the second line:

U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
Look at the first byte, 110xxxxx. It says: I am not ASCII (the first bit is not 0); I am the first byte of a multi-byte character (the second bit is 1); the character I belong to consists of two bytes (the third bit is 0); the character information is stored from the fourth bit onward.
Look at the second byte, 10xxxxxx. It says: I am not ASCII (the first bit is not 0); I am not the first byte of a multi-byte character (the second bit is 0); the character information is stored from the third bit onward.

From this example we can summarize. In a long run of consecutive UTF-8 bytes, 2 to 6 bytes may represent one symbol, so compared with ASCII, where one byte is one symbol, we need room for two extra pieces of information. First, where a symbol starts: a "start" marker, which in the words of biology is like the AUG initiation codon that marks where protein translation begins. Second, how many bytes the symbol uses. (Strictly, if every symbol has a start marker the length is not required, but providing it improves fault tolerance when some bytes are lost.) The solution: the second bit of a byte tells whether that byte starts a character (the first bit is already taken: 0 means ASCII, 1 means non-ASCII), so the first byte of a multi-byte symbol must be 11xxxxxx, a binary number between 192 and 255. The length is given from the third bit onward: a 0 in the third bit means a 2-byte symbol, and every additional 1 before the terminating 0 adds one more byte. UTF-8 as defined here goes up to 6-byte characters, which need four more 1s than the 2-byte lead byte 110xxxxx, so the lead byte is 1111110x, as shown in the table above.
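
The bit tests described above translate almost word for word into code. The following sketch (the function name `classifyByte` and the returned object shape are my own choices) classifies a single byte and, for a lead byte, counts the leading 1 bits to get the sequence length.

```javascript
// Classify one byte of UTF-8 data by its leading bits.
function classifyByte(byte) {
  if ((byte & 0x80) === 0) return { kind: 'ascii', length: 1 };            // 0xxxxxxx
  if ((byte & 0xc0) === 0x80) return { kind: 'continuation', length: 0 };  // 10xxxxxx
  // Lead byte: the number of leading 1 bits is the length of the sequence.
  let length = 0;
  for (let mask = 0x80; mask !== 0 && (byte & mask) !== 0; mask >>= 1) {
    length++;
  }
  if (length > 6) return { kind: 'invalid', length: 0 };                   // 0xFE, 0xFF
  return { kind: 'lead', length };
}

console.log(classifyByte(0x6b)); // { kind: 'ascii', length: 1 }
console.log(classifyByte(0xe5)); // { kind: 'lead', length: 3 }
console.log(classifyByte(0x8f)); // { kind: 'continuation', length: 0 }
```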
Now look at the standard English definition, which expresses the same thing:

The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.
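
"Easy resynchronization" means that a reader dropped into the middle of a byte stream can find the next character boundary by skipping bytes of the form 10xxxxxx. A minimal sketch, assuming the bytes are given as an array of numbers (the function name `nextCharacterStart` is my own):

```javascript
// Move forward from an arbitrary offset to the first byte that starts a
// character, i.e. the first byte that is not of the form 10xxxxxx.
function nextCharacterStart(bytes, offset) {
  let i = offset;
  while (i < bytes.length && (bytes[i] & 0xc0) === 0x80) {
    i++;
  }
  return i;
}

// 'k', then 口 (E5 8F A3), then 'b'
const sample = [0x6b, 0xe5, 0x8f, 0xa3, 0x62];
console.log(nextCharacterStart(sample, 2)); // 4
```

Starting at offset 2, in the middle of 口, the function skips the two continuation bytes and lands on 'b'.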

The actual information bits (that is, the character's number in the charset) are placed, in plain binary and in order, on the 'x' positions of the table above. Take the Chinese characters that we Chinese programmers deal with most: they lie in the range U-00000800 - U-0000FFFF, and the table shows that UTF-8 uses three bytes for this range (which is why UTF-8 text takes more storage for Chinese than the EUC-CN encoding of the GB2312 character set, which uses 2 bytes per Chinese character). Take the '口' of 口碑 again; its Unicode number looks like this:
口: 21475 == 0x53e3 == binary 101001111100011
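
These three notations can be converted into each other with built-in number methods. A small check, with variable names that are only illustrative:

```javascript
const number = 21475;
const hexForm = number.toString(16);     // '53e3'
const binaryForm = number.toString(2);   // '101001111100011'

console.log(hexForm, binaryForm);
console.log(parseInt('53e3', 16));          // 21475
console.log(String.fromCodePoint(number));  // '口'
```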

In JavaScript, run the following code (in Firebug's console, or by putting it between a pair of script tags in an HTML file):

alert('\u53e3'); // gets '口'
alert(escape('口')); // gets '%u53E3'
alert(String.fromCharCode('21475')); // gets '口'
alert('口'.charCodeAt(0)); // gets 21475
alert(encodeURI('口')); // gets '%E5%8F%A3'

As you can see, a string literal can express '口' directly as \u plus the hexadecimal Unicode code, and fromCharCode accepts the decimal Unicode code and returns the character '口'.

The second alert gives '%u53E3'. This is a non-standard form of percent-encoding for Unicode in URIs; it has been formally rejected by the W3C and is not part of any RFC. The ECMA-262 standard does specify this behavior of escape, presumably only as a temporary measure.
More interesting is the '%E5%8F%A3' from the fifth alert. What is it, and how is it obtained?

This is percent-encoding, which is widely used in URIs and is defined by RFC 3986.

RFC 3986 defines the unreserved characters of percent-encoding as follows:

Unreserved characters, per RFC 3986 (January 2005)
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9 - _ . ~

In other words, when these characters appear in a URI they need no encoding: they play no part in the URI syntax and simply stand for themselves.

In addition, the reserved characters are as follows:

Reserved characters, per RFC 3986 (January 2005)
! * ' ( ) ; : @ & = + $ , / ? % # [ ]

These characters have special meaning in a URI. When one of them appears without that special meaning, it must be encoded as follows (% is listed because it introduces an escape and must itself always be written as %25):

Reserved characters after percent-encoding
!   *   '   (   )   ;   :   @   &   =   +   $   ,   /   ?   %   #   [   ]
%21 %2A %27 %28 %29 %3B %3A %40 %26 %3D %2B %24 %2C %2F %3F %25 %23 %5B %5D
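
The encoded forms are nothing more than the hexadecimal ASCII code of each character with a % in front, which a short sketch can reproduce (the helper name `percentEncodeAscii` is my own, and it only handles single ASCII characters):

```javascript
const reserved = "!*'();:@&=+$,/?%#[]";

// Percent-encode one ASCII character: '%' plus its two-digit hex code.
function percentEncodeAscii(ch) {
  return '%' + ch.charCodeAt(0).toString(16).toUpperCase().padStart(2, '0');
}

const encoded = Array.from(reserved, percentEncodeAscii);
console.log(encoded.join(' '));
// %21 %2A %27 %28 %29 %3B %3A %40 %26 %3D %2B %24 %2C %2F %3F %25 %23 %5B %5D
```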

The two characters after each % are a hexadecimal number; for non-ASCII characters, these numbers are the bytes of the character's UTF-8 encoding, written in another form.

Let us work out in detail why '口' becomes '%E5%8F%A3'.

As mentioned, the binary form of 21475, the Unicode number of '口', is:
101001111100011

And as mentioned, the UTF-8 form for a Chinese character is:
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx

Now we cut the binary code of '口' into pieces and fill them in place of the x's:
101001111100011 = ----0101 --001111 --100011
UTF-8 template  = 1110xxxx 10xxxxxx 10xxxxxx
The first piece is one bit short of four, so it is padded with a 0 on the left. The result is:
11100101 10001111 10100011

Let us convert these three binary numbers to hexadecimal and put a percent sign in front of each by running the following JavaScript code:

alert(
'%' + parseInt('11100101', 2).toString(16) +
'%' + parseInt('10001111', 2).toString(16) +
'%' + parseInt('10100011', 2).toString(16)
); // gets '%e5%8f%a3'

And there we have %e5%8f%a3.
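
The manual steps above can be folded into a small function. This is a sketch rather than a complete encoder: the names `utf8Bytes` and `percentEncode` are my own, it covers the 1 to 4 byte forms only, and it does no validation of its input (surrogates or values above U+10FFFF are not rejected).

```javascript
// Encode one code point into UTF-8 bytes by filling the x positions of the table.
function utf8Bytes(codePoint) {
  if (codePoint <= 0x7f) return [codePoint];
  if (codePoint <= 0x7ff) {
    return [0xc0 | (codePoint >> 6), 0x80 | (codePoint & 0x3f)];
  }
  if (codePoint <= 0xffff) {
    return [
      0xe0 | (codePoint >> 12),
      0x80 | ((codePoint >> 6) & 0x3f),
      0x80 | (codePoint & 0x3f),
    ];
  }
  return [
    0xf0 | (codePoint >> 18),
    0x80 | ((codePoint >> 12) & 0x3f),
    0x80 | ((codePoint >> 6) & 0x3f),
    0x80 | (codePoint & 0x3f),
  ];
}

// Write each byte as '%' plus two uppercase hex digits.
function percentEncode(bytes) {
  return bytes
    .map((b) => '%' + b.toString(16).toUpperCase().padStart(2, '0'))
    .join('');
}

const bytes = utf8Bytes('口'.codePointAt(0));
console.log(percentEncode(bytes));                             // '%E5%8F%A3'
console.log(percentEncode(bytes) === encodeURIComponent('口')); // true
```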

The built-in JavaScript functions encodeURI, decodeURI, encodeURIComponent and decodeURIComponent also perform percent-encoding; they differ only in how they treat special characters such as : / ; ?.
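
The difference is easiest to see side by side. In the sketch below the URL is a made-up placeholder (example.com); encodeURI leaves the characters that structure a URI alone, while encodeURIComponent encodes them because it expects a single component such as one query value.

```javascript
// Hypothetical URL, used only for illustration.
const url = 'http://example.com/search?q=口&lang=zh';

const wholeUri = encodeURI(url);
const oneValue = encodeURIComponent('口&lang=zh');

console.log(wholeUri);  // 'http://example.com/search?q=%E5%8F%A3&lang=zh'
console.log(oneValue);  // '%E5%8F%A3%26lang%3Dzh'
console.log(decodeURIComponent('%E5%8F%A3')); // '口'
```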

Next, a word about the numeric character reference (NCR) in HTML. As everyone knows, special characters in HTML must be encoded: & must be written as &amp;, and there are also special characters such as &reg; for ®. In fact, HTML references can display any Unicode character. Edit an HTML document as follows:

</body>
</html>

The result is three '口' characters.
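
Numeric character references are easy to produce and to read back in JavaScript. The helpers below are a sketch with names of my own (`toDecimalNcr`, `toHexNcr`, `fromNcr`); they handle one character or one reference at a time.

```javascript
// Decimal form: &#21475;
function toDecimalNcr(ch) {
  return '&#' + ch.codePointAt(0) + ';';
}

// Hexadecimal form: &#x53e3;
function toHexNcr(ch) {
  return '&#x' + ch.codePointAt(0).toString(16) + ';';
}

// Decode a single numeric character reference back into a character.
function fromNcr(ref) {
  const match = /^&#(?:x([0-9a-f]+)|([0-9]+));$/i.exec(ref);
  if (!match) throw new SyntaxError('not a numeric character reference: ' + ref);
  const codePoint =
    match[1] !== undefined ? parseInt(match[1], 16) : parseInt(match[2], 10);
  return String.fromCodePoint(codePoint);
}

console.log(toDecimalNcr('口'));   // '&#21475;'
console.log(toHexNcr('口'));       // '&#x53e3;'
console.log(fromNcr('&#x53e3;')); // '口'
```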

Another commonly used encoding is base64. It was designed for email, a transport that is not purely 8-bit, so that binary data such as attachments can be transmitted. It uses the 64 characters a-z, A-Z, 0-9, + and / (plus = as padding) to express the original data, encoding every three consecutive bytes as four characters, which increases the length by 33%.
This encoding is common in some of the more advanced JavaScript applications. For example, in the JavaScript Super Mario game, the music is written inside the JavaScript file; in some canvas graphics examples, the images are likewise written directly in the JavaScript source. This relies on the data URIs defined by RFC 2397, which Firefox supports and IE8 is beginning to support. With data URIs and base64 encoding we can create rich effects without any external music, image or other multimedia files.
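
To show the "three bytes become four characters" rule, here is a hand-written base64 encoder together with a helper that wraps the result in a data URI. It is a sketch for illustration (browsers already ship `btoa` for this); the names `base64Encode` and `toDataUri` are my own, and the input is assumed to be an array of byte values from 0 to 255.

```javascript
const BASE64_ALPHABET =
  'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// Every group of 3 bytes (24 bits) is cut into 4 pieces of 6 bits each.
function base64Encode(bytes) {
  let out = '';
  for (let i = 0; i < bytes.length; i += 3) {
    const hasSecond = i + 1 < bytes.length;
    const hasThird = i + 2 < bytes.length;
    const triple =
      (bytes[i] << 16) |
      ((hasSecond ? bytes[i + 1] : 0) << 8) |
      (hasThird ? bytes[i + 2] : 0);
    out += BASE64_ALPHABET[(triple >> 18) & 63];
    out += BASE64_ALPHABET[(triple >> 12) & 63];
    out += hasSecond ? BASE64_ALPHABET[(triple >> 6) & 63] : '='; // padding
    out += hasThird ? BASE64_ALPHABET[triple & 63] : '=';         // padding
  }
  return out;
}

// Wrap base64 text in a data URI as described by RFC 2397.
function toDataUri(mimeType, base64Text) {
  return 'data:' + mimeType + ';base64,' + base64Text;
}

const manBytes = [0x4d, 0x61, 0x6e]; // the ASCII text "Man"
console.log(base64Encode(manBytes));                          // 'TWFu'
console.log(base64Encode([0x4d]));                            // 'TQ=='
console.log(toDataUri('text/plain', base64Encode(manBytes))); // 'data:text/plain;base64,TWFu'
```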

That covers the encodings commonly used in JavaScript and HTML and the principles behind them. One last remark: a lot of hacking is related to encoding, because encoded code can slip past some simple filters, as in the following JS code:

var a = 'reputation';
\u0061 = ''; // \u0061 is another way to write the identifier a
alert(a); // gets ''

Of course, hackers have more professional ways to evade filtering and inject code (SQL injection, XSS attacks and so on).
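
To illustrate why such a simple filter fails, here is a sketch with a deliberately naive, made-up filter (`naiveFilter` is hypothetical, not a real library function). It only looks for the literal text "alert", so source text that spells the first letter as \u0061 passes, even though a JavaScript engine reads both spellings as the same identifier.

```javascript
// A deliberately naive filter (hypothetical): reject source text containing "alert".
function naiveFilter(source) {
  return !source.includes('alert');
}

// String.raw keeps the backslash, so this text contains \u0061lert, not alert.
const disguised = String.raw`return typeof \u0061lert;`;
console.log(naiveFilter('return typeof alert;')); // false: blocked
console.log(naiveFilter(disguised));              // true: slips through

// The engine still reads \u0061 as the letter a inside an identifier.
const run = new Function(String.raw`var a = 'reputation'; \u0061 = ''; return a;`);
console.log(run() === ''); // true
```

A real filter has to work on the parsed or decoded form of its input, not on the raw text.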

Thank you for reading. I am stauren, a front-end development engineer on the Yahoo Koubei UED team. This is my first article on the Koubei UED blog; if anything is wrong, please point it out. You are also welcome to visit my personal blog: and to use the online Hex, NCR, percent-encoding and Base64 encoding and decoding tools provided there:

Category: AJAX  Date: 2009-03-12  Views: 1062