The rise of Web 2.0 has set off a new wave of Internet entrepreneurship. The new user-oriented approach to site building, with finer subdivision of site functionality and user groups, has not only created a large number of successful new sites but also made access far more convenient for people. However, the user-oriented concept of Web 2.0 also gives the new sites new characteristics (high concurrency, high traffic, large data volumes, and complex logic) and places new demands on how sites are built.
This paper focuses on the architecture of high-traffic, high-concurrency websites. It studies and discusses the following topics.
First, at the level of the whole network, it discusses the convenience that mirror sites, CDN content distribution networks, and load balancing bring, along with their respective advantages and disadvantages. Then, at the level of the local area network, it briefly discusses layer 4 switching, including hardware and software solutions such as F5 and LVS. Next, at the level of a single server, it concentrates on socket optimization, hard disk-level caching, memory-level caching, CPU and IO balancing (that is, deploying compute-intensive programs together with read/write-intensive ones), and read-write separation. At the application layer, it describes some techniques commonly used by large sites and the reasons for choosing them. Finally, at the level of overall site architecture, it discusses scalability, fault tolerance, and related issues.
The paper combines theory with practice and draws on experience gained in actual work, so its conclusions are fairly widely applicable.
1.1 The development of the Internet
In recent years, the Internet has grown from an internal U.S. network, used purely for scientific research and for passing static documents, into a global network that is used in many fields and carries massive amounts of multimedia and dynamic information. In terms of scale, the number of Internet hosts, the bandwidth, and the number of people online have almost always grown exponentially. In July 2006 there were 439,286,364 Internet hosts and 96,854,877 WWW sites. The world's Internet population reached 729 million in 2004, and China's reached about 137 million in December 2006. At the same time, the content carried by the Internet has changed enormously. The early Internet carried mainly static, text-based public information, whereas the current Internet carries large amounts of dynamic, multimedia, and personalized information. People can not only read dynamically generated information over the Internet but also use it for highly interactive services such as e-commerce, instant messaging, and online gaming. The Internet is therefore no longer just an information-sharing network; it has become a ubiquitous platform for interactive services.
1.2 New trends in Internet websites
The expanding scale of the Internet, the growing user base, and the rise of Web 2.0 place new requirements on the building of Internet sites:
High performance and scalability. In May 2000, Yahoo, then ranked first in the world by visits, claimed 625 million page views per day, or about 30,000 HTTP requests per second (assuming an average of 4 requests per page view). Traffic on this scale places very high demands on service performance. More importantly, the Internet's broad audience means that a successful Internet service has enormous potential for rapid traffic growth, so the service system must scale very well to meet future growth.
Support for highly concurrent access. Highly concurrent access places high demands on the storage and concurrent processing capability of the service. Current mainstream superpipelined and superscalar processors can handle only a limited number of concurrent requests, because the cost of process scheduling rises very quickly as concurrency increases. Internet access delays are long by nature because the Internet is a wide area network, so each request takes a relatively long time to complete. Assuming 3 seconds from issuing a request to finishing the page download, Yahoo had an average of 90,000 concurrent requests in May 2000. In addition, for more complex services the server often has to maintain user session information. For example, if a site has 1 million user sessions per day, each lasting 20 minutes, there will be about 14,000 concurrent sessions on average.
High availability. Because Internet services are global, users will be accessing them 24 hours a day, so any interruption of service affects users. For e-commerce applications, a temporary interruption means permanent loss of customers and substantial economic loss. For example, in June 1999 ebay.com was inaccessible for 22 hours. This seriously damaged the loyalty of the site's 3.8 million users, eBay paid about 5 million U.S. dollars to compensate customers for their losses, and the company's market value dropped by 4 billion U.S. dollars. Critical Internet applications therefore have very high availability requirements.
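The request-rate and concurrency figures quoted in the requirements above follow from simple arithmetic: an average rate is a daily total divided by the seconds in a day, and average concurrency is the arrival rate multiplied by the time each item stays in the system. The sketch below reproduces that arithmetic; the function names are illustrative and not taken from the paper.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def requests_per_second(page_views_per_day, requests_per_page_view):
    """Average HTTP request rate implied by a daily page-view count."""
    return page_views_per_day * requests_per_page_view / SECONDS_PER_DAY

def average_concurrency(arrival_rate, duration_seconds):
    """Average number of items in flight: arrival rate times time in the system."""
    return arrival_rate * duration_seconds

# Yahoo, May 2000: 625 million page views per day, 4 requests per page view.
yahoo_rps = requests_per_second(625_000_000, 4)       # roughly 30,000 per second

# About 30,000 requests per second, each taking about 3 seconds to complete.
yahoo_concurrent = average_concurrency(30_000, 3)     # 90,000

# 1 million sessions per day, each lasting 20 minutes.
sessions_concurrent = average_concurrency(1_000_000 / SECONDS_PER_DAY, 20 * 60)

print(round(yahoo_rps), yahoo_concurrent, round(sessions_concurrent))
```

The computed request rate is slightly below 30,000 and the session count slightly below 14,000, which matches the rounded figures in the text.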
1.3 Introduction to Sina Podcast
Micro-video sharing sites, with YouTube as their representative, have been on the rise recently. In 2006 alone, nearly one hundred micro-video sharing sites imitating YouTube appeared in China, all trying to replicate YouTube's success. Such sites can be regarded as representative of the Web 2.0 concept and have all the typical features of a Web 2.0 site: high concurrency, high traffic, large data volumes, complex logic, and a dispersed user base. Sina, the largest portal, built on the blog service it had operated successfully since 2005 and launched the Sina Podcast service at the end of 2006. As the first micro-video sharing service launched by a portal, and relying on the huge popularity of Sina.com and Sina Blog, Sina Podcast achieved great success less than six months after launch: among similar sites it ranks first in the number of uploaded videos, has the fastest-growing traffic, and has the largest number of users. Behind these achievements lie a huge investment in hardware and a well-supported, flexible architecture in the application-level software design.
2. Network layer structure
2.1 Mirror site technology
A mirror site is an identical copy of a site placed on several servers, each with its own URL; the sites on these servers are mirrors of one another. A mirror site differs little from the main site and can be viewed as a copy of it. Mirror sites have a clear advantage: if the main site cannot be accessed normally (because of server failure, network failure, slow network speed, and so on), users can still get service through a mirror server. The inconveniences are that multiple servers must be updated whenever site content changes, and that users must either remember more than one address or choose among multiple mirrors, and the user's choice is not necessarily optimal. The site has no control over the user's selection process.
In the early stages of the Internet, sites had little content, most of it static, and it was updated infrequently. However, because servers had little computing power, bandwidth was small, and network speeds were slow, popular sites came under great access pressure. Mirror sites were an effective solution in this situation and were widely used. As the Internet developed, more and more sites used server-side scripts to generate content dynamically, synchronized updates became harder and harder, and the demand for controllability grew. Because mirror technology could not meet the needs of such sites, it gradually faded from view. Some large software download sites, however, still satisfy the conditions for mirroring: their downloadable content is static, it is updated infrequently, and their bandwidth and speed requirements are relatively high. Examples include SourceForge (http://www.SourceForge.net, a well-known open source hosting site) and Fedora (http://fedoraproject.org, the Linux distribution sponsored by RedHat) abroad, and Huajun Software Park (http://www.onlinedown.net) and Sky Software Station (http://www.skycn.com) in China, all of which still use this technology (Figure 1).
Figure 1. Top: mirror selection on the Sky Software Station home page. Bottom: the SourceForge download mirror selection page (default).
When building a site, static content can be mirrored in several places, depending on the actual situation, to speed up access and improve the user experience.
2.2 CDN Content Delivery Network
CDN stands for Content Delivery Network. Its purpose is to add a new layer of network architecture to the existing Internet and publish a site's content to the network "edge" closest to the user, so that users can fetch the content they need from nearby. This spreads the load on servers, relieves Internet congestion, and improves the response speed of user visits to the site. It addresses slow site response caused by small network bandwidth, large numbers of user visits, and uneven distribution of network nodes.
A CDN differs from mirror site technology in that the site, rather than the user, chooses the optimal content server, which improves controllability. A CDN is in effect a layer of mirrors or caches inserted between the browser and the server being accessed. The visitor still clicks on or requests the original URL, but what is actually served is the page cached on the mirror server that is best for that visitor. This is achieved by adjusting the resolution of the server's domain name. The name server used by a CDN maintains a list of mirror server IPs and a table mapping visitor IPs to mirror servers. When a user request arrives, the name server looks up the table by the user's IP, finds the IP address of the optimal mirror server, and returns it to the user. "Optimal" here has to take into account the server's processing capacity, its bandwidth, its distance from the visitor, and so on. When a mirror site in one region carries too much traffic, consumes bandwidth too quickly, or suffers a server or network fault, user access can easily be redirected to another region (Figure 2). This improves controllability.
Figure 2. Schematic diagram of a CDN
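The table lookup performed by the CDN's name server can be sketched as follows. This is a minimal illustration under stated assumptions, not any CDN vendor's implementation: the table contents, the documentation-range IP addresses, and the names `MIRROR_TABLE`, `UNAVAILABLE`, and `resolve` are all hypothetical, and "optimal" is reduced to "the mirror mapped to the client's network that is currently in rotation".

```python
import ipaddress

# Hypothetical mapping table: client network -> mirror server IP.
MIRROR_TABLE = [
    (ipaddress.ip_network("198.51.100.0/24"), "192.0.2.10"),
    (ipaddress.ip_network("203.0.113.0/24"), "192.0.2.20"),
]
DEFAULT_MIRROR = "192.0.2.1"   # fallback answer when no usable entry matches
UNAVAILABLE = set()            # mirrors taken out of rotation (overloaded or failed)

def resolve(client_ip):
    """Return the mirror IP the name server would answer with for this client."""
    address = ipaddress.ip_address(client_ip)
    for network, mirror in MIRROR_TABLE:
        if address in network and mirror not in UNAVAILABLE:
            return mirror
    return DEFAULT_MIRROR

print(resolve("198.51.100.7"))   # mirror mapped to the client's network
UNAVAILABLE.add("192.0.2.10")    # operator redirects traffic away from that mirror
print(resolve("198.51.100.7"))   # falls back to the default
```

Adding an address to `UNAVAILABLE` models the redirection described in the text; a real deployment would also weigh capacity, bandwidth, and distance.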
CDN acceleration also has its limitations. First, because multiple mirror servers must be updated at the same time whenever content changes, it is suitable only for sites that are updated infrequently or whose real-time requirements are not very high. Second, DNS resolution is cached. When visits have to be moved away from a mirror site, the primary DNS server changes the resolved IP, but the caches on DNS servers elsewhere update with a lag, so for a while users are still directed to the old server. Controllability is therefore still insufficient.
At present, the news channels of large high-traffic sites in China, such as Sina and Netease, all use CDN acceleration (Figure 3). Although these sites carry very heavy traffic, access is fast from anywhere. Channels that are updated frequently and have demanding real-time requirements, such as forums and mail, are not suited to this technology.
Figure 3. Sina uses the ChinaCache CDN service.
ChinaCache has more than 130 service nodes, more than 80 of them in China, covering the main provinces across the country's 6 major networks.
2.3 Application layer distributed design
To gain the speed advantage of a CDN while avoiding its shortcomings, Sina Podcast takes an alternative approach in its application-layer software design. Sina Podcast provides its player with a query interface for video file addresses. When a user opens a video playback page, the player first connects to the query interface, obtains from it the address of the optimal mirror server holding the video file, and then downloads the video file from that server. With this one extra query, access is fully controllable, while the traffic generated by the query itself is very small, almost negligible. The flexibility of CDN domain name resolution is also retained: the interface program only needs to maintain a list of mirror sites and a table mapping visitor IPs to mirror sites. The mirror sites do not need to mirror all content, only the video files, which are updated slowly. This cost is entirely affordable.
2.4 Network Layer Architecture Summary
Viewed at the level of the whole Internet, the direction for site architecture is clear: let users fetch content from somewhere close to them, while striking a balance between speed and controllability. For frequently updated content, keeping mirror sites synchronized is difficult, so other supporting techniques are needed.
3. Layer 4 switching architecture
3.1 Introduction to layer 4 switching
In the OSI seven-layer model, the fourth layer is the transport layer. The transport layer is responsible for end-to-end communication; in the IP protocol stack it is the layer where TCP and UDP reside. TCP and UDP packets contain port numbers, which uniquely identify the protocol and application to which each packet belongs. The operating system of the receiving computer uses the port number to determine the type of a received IP packet and hands it to the appropriate higher-level program. The combination of an IP address and a port number is usually called a socket.
A simple definition of layer 4 switching is this: it is a forwarding function whose decisions are based not only on the MAC address (layer 2 bridging) or the source/destination IP address (layer 3 routing) but also on the combination of IP address and TCP/UDP (layer 4) application port number, that is, the socket. A layer 4 switch acts as a virtual IP that points to the actual servers. It carries many different protocols, including HTTP, FTP, NFS, and Telnet.
Take HTTP as an example. On the layer 4 switch, a virtual IP (VIP) is set up for each server group, and each server group supports one or several domain names. The domain name server (DNS) stores the VIP of the server group rather than the real address of any particular server.
When a user requests a page, a connection request addressed to the VIP of the target server group is sent to the layer 4 switch. The switch applies a selection strategy to pick the best server in the group, replaces the VIP destination address in the packet with the IP address of the actual server, and passes the connection request to that server. Layer 4 switches generally also implement session persistence: all packets of the same session are mapped by the switch so that they travel between the user and the same server.
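The forwarding behaviour just described (select a real server for a new connection, rewrite the VIP, keep the session on the same server) can be modelled in a few lines. This is a toy sketch, not how F5, Alteon, or LVS is implemented: the class name `Layer4Switch` is hypothetical, and round-robin is assumed as the selection strategy because the text does not name one.

```python
import itertools

class Layer4Switch:
    """Toy model: maps a virtual IP to real servers and keeps sessions sticky."""

    def __init__(self, vip, real_servers):
        self.vip = vip
        self._next_server = itertools.cycle(real_servers)  # assumed round-robin strategy
        self._sessions = {}  # (client_ip, client_port) -> real server IP

    def forward(self, client_ip, client_port, destination_ip):
        """Return the real server address that replaces the VIP for this packet."""
        if destination_ip != self.vip:
            raise ValueError("packet is not addressed to the virtual IP")
        session = (client_ip, client_port)
        if session not in self._sessions:
            # First packet of a session: choose a server and remember the mapping.
            self._sessions[session] = next(self._next_server)
        return self._sessions[session]

switch = Layer4Switch("10.0.0.100", ["10.0.0.1", "10.0.0.2"])
first = switch.forward("198.51.100.7", 50000, "10.0.0.100")
second = switch.forward("203.0.113.9", 50001, "10.0.0.100")
again = switch.forward("198.51.100.7", 50000, "10.0.0.100")
print(first, second, again)
```

Packets from the same client socket keep reaching the server chosen for the first packet, which is the session persistence the text refers to.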
By implementation, layer 4 switching is divided into hardware implementations and software implementations.
3.2 Hardware Implementation
Hardware layer 4 switches are generally offered as commercial solutions by specialized hardware vendors. Common ones include Alteon and F5. These products are very expensive but provide excellent performance and very flexible management. Yahoo China's nearly 2,000 servers were handled with only three or four Alteon units. Because of limited conditions, hardware solutions are not discussed further here.
Layer 4 switching can also be implemented in software. Its performance is somewhat lower than that of dedicated hardware, but it can still handle a certain amount of load, and software is more flexible to configure. A commonly used software layer 4 switch is LVS (Linux Virtual Server) on Linux. It provides a heartbeat-based real-time failover solution that improves the robustness of the system, along with flexible VIP configuration and management that can meet a variety of application needs.
4. Server optimization
4.1 Considering overall server performance
Servers are expensive. To get value for money, how to configure a server so that it delivers the greatest effect without affecting normal service must be considered when designing the site architecture. The common factors that affect a server's processing speed are network connections, disk reads and writes, memory, and CPU speed. If one component of a server is already running at full load while the other components still have spare capacity, we call that component the performance bottleneck. To get the most out of a server, the key is to remove bottlenecks so that all components are fully used.
4.2 Socket Optimization
Take standard GNU/Linux as an example. GNU/Linux distributions try to be optimized for a wide variety of deployments, which means that for a specific server environment the standard distribution may not be optimal. GNU/Linux provides many tunable kernel parameters that can be used to configure a server dynamically, including some important options that affect socket performance. These options live in the /proc virtual file system. Each file in this file system represents one or more parameters, which can be read with the cat tool or changed with the echo command. Only some of the tunable kernel parameters that affect TCP/IP stack performance are listed here:
/proc/sys/net/ipv4/tcp_window_scaling "1" (1 means the option is enabled and 0 means it is disabled; the same applies below) Enables window scaling as defined in RFC 1323; it must be enabled to support windows larger than 64KB.
/proc/sys/net/ipv4/tcp_sack "1" Enables selective acknowledgment, which improves performance by selectively acknowledging packets received out of order (so that the sender retransmits only the missing segments). This option should be enabled for wide area network communication, but it increases CPU usage.
/proc/sys/net/ipv4/tcp_timestamps "1" Enables calculation of the RTT in a way that is more accurate than the retransmission timeout (see RFC 1323); it should be enabled for better performance.
/proc/sys/net/ipv4/tcp_mem "24576 32768 49152" Determines how the TCP stack should behave with respect to memory usage; each value is in memory pages (usually 4KB). The first value is the lower threshold for memory usage. The second value is the threshold at which memory pressure mode starts applying pressure to buffer usage. The third value is the upper limit. Above this limit, packets can be dropped to reduce memory usage.
/proc/sys/net/ipv4/tcp_wmem "4096 16384 131072" Defines per-socket memory usage for auto-tuning. The first value is the minimum number of bytes allocated for the socket's send buffer. The second value is the default (overridden by wmem_default), to which the buffer can grow when the system is not heavily loaded. The third value is the maximum number of bytes of send buffer space (overridden by wmem_max).
/proc/sys/net/ipv4/tcp_westwood "1" Enables a sender-side congestion control algorithm that maintains estimates of throughput and tries to optimize overall bandwidth utilization; it should be enabled for WAN communication.
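Reading a parameter with cat and changing it with echo amounts to reading and writing small text files, which the sketch below does from Python. It is an illustration only: the helper names are hypothetical, the `root` argument is an assumption added so the functions can be tried against a scratch directory, and writing under the real /proc/sys requires root privileges. The last helper converts the page-based tcp_mem values into bytes, assuming the usual 4KB page.

```python
from pathlib import Path

PAGE_SIZE = 4096  # bytes; assumes the usual 4KB memory page

def read_param(name, root="/proc/sys"):
    """Read a kernel parameter such as 'net/ipv4/tcp_sack' (like cat)."""
    return (Path(root) / name).read_text().strip()

def write_param(name, value, root="/proc/sys"):
    """Set a kernel parameter (like echo value > file)."""
    (Path(root) / name).write_text(f"{value}\n")

def tcp_mem_in_bytes(value):
    """Convert a tcp_mem triple given in pages into byte counts."""
    return [int(pages) * PAGE_SIZE for pages in value.split()]

# The example thresholds from the list above: 96 MiB, 128 MiB, and 192 MiB.
print(tcp_mem_in_bytes("24576 32768 49152"))
```

On a live system one would call, for example, `read_param("net/ipv4/tcp_sack")` before and after each change and measure the effect, changing one option at a time.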
As with any tuning effort, the best approach is continual experimentation. The behavior of the particular application, the processor speed, and the amount of available memory all influence how these parameters affect performance. In some cases, an operation expected to help may turn out to be harmful (and vice versa). Each option therefore needs to be tested one at a time and its results checked, to arrive at the parameter set that best suits the specific machine.
If the GNU/Linux system is rebooted, the kernel parameters that were set revert to their default values. To make the chosen values permanent, use the /etc/rc.local file so that the parameters are set to the required values automatically each time the system starts.
GNU/Linux offers some very powerful tools for checking the effect of each change:
ping is the most commonly used tool for checking the availability of a host; it can also be used to measure network round-trip delay.
traceroute prints the path (route) to a given network host through a series of routers and gateways, and determines the delay at each hop.
netstat reports various statistics about the network subsystem, protocols, and connections.
tcpdump shows protocol-level packet trace information for one or more connections, including timing information that can be used to study packet timing for different protocols.
Ethereal presents tcpdump (packet trace) information in an easy-to-use graphical interface and supports packet filtering.
iperf measures network performance for TCP and UDP; it measures maximum bandwidth and reports delay and packet loss.
4.3 Hard disk-level cache
A hard disk-level cache temporarily stores dynamically generated content on disk. Within an acceptable delay window, the same request is not generated dynamically again, which saves resources and increases the site's carrying capacity. On Linux, Squid is generally used for disk-level caching.
Squid is a high-performance caching proxy server. Unlike typical proxy cache software, Squid handles all client requests in a single, non-blocking, I/O-driven process. It accepts requests from clients for target objects and handles them appropriately. For example, when a user wants to download (that is, browse) a web page through a browser, the browser asks Squid to fetch the page. Squid then connects to the origin server hosting the page and requests the page from it. After obtaining the page, Squid returns it to the client browser and keeps a copy in its local cache directory. The next time a user needs the same page, Squid simply reads the copy from the cache and returns it directly, without asking the origin server again. Squid currently handles HTTP, FTP, GOPHER, SSL, WAIS, and other protocols.
By default, Squid determines how long to cache a response by examining the Expires and Cache-Control fields in the HTTP headers. In practice, you can output these HTTP headers explicitly in server-side scripts, or configure Apache's mod_expires module so that Apache automatically adds an expiration time to every page. For static content such as pictures, video files, and downloadable software, you can also use Squid's refresh_pattern directive to specify the cache time by file type (extension).
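To make the role of the two header fields concrete, the sketch below derives a cache lifetime from them. It is a simplified illustration of the general HTTP rule (a `max-age` directive in Cache-Control takes precedence over Expires), not a reproduction of Squid's actual algorithm; the function name `freshness_lifetime` is hypothetical, and header names are assumed to be spelled exactly as shown.

```python
from email.utils import parsedate_to_datetime

def freshness_lifetime(headers):
    """Seconds a response may be served from cache; 0 if nothing usable is present."""
    # Cache-Control: max-age=N wins when present.
    for directive in headers.get("Cache-Control", "").split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age" and value.isdigit():
            return int(value)
    # Otherwise fall back to Expires minus Date.
    if "Expires" in headers and "Date" in headers:
        try:
            expires = parsedate_to_datetime(headers["Expires"])
            date = parsedate_to_datetime(headers["Date"])
        except (TypeError, ValueError):
            return 0  # unparseable dates are treated as already expired
        return max(0, int((expires - date).total_seconds()))
    return 0

print(freshness_lifetime({"Cache-Control": "public, max-age=600"}))
print(freshness_lifetime({
    "Date": "Mon, 01 Jan 2007 10:00:00 GMT",
    "Expires": "Mon, 01 Jan 2007 11:00:00 GMT",
}))
```

A server-side script that sets either header therefore controls how long the cache in front of it may answer on its behalf.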
When Squid runs, by default it creates a two-level hashed directory structure on disk to store cached objects. It also builds a hash table in memory that records where objects are located on disk. If a Squid is configured as part of a Squid cluster, it also creates a digest table that stores summaries of the objects held by the other Squids. When the information a client wants is not on the local disk, Squid can quickly tell which machine in the cluster to get it from. When disk usage approaches the configured limit, Squid deletes some objects according to a configurable policy (LRU, least recently used, by default) to make room.
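The least-recently-used policy mentioned above can be illustrated with a small in-memory cache. This is a generic sketch of LRU replacement, not Squid's on-disk implementation; the class name and the capacity measured in number of objects are simplifications.

```python
from collections import OrderedDict

class LRUCache:
    """Keeps at most `capacity` objects and evicts the least recently used one."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # oldest access first, newest access last

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # a hit makes the object most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used object

cache = LRUCache(capacity=2)
cache.put("/a", "page A")
cache.put("/b", "page B")
cache.get("/a")             # "/a" is now more recently used than "/b"
cache.put("/c", "page C")   # over capacity, so "/b" is evicted
print(cache.get("/b"), cache.get("/a"))
```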
Squid servers in a cluster can have two kinds of relationship. The first is child and parent: when the child Squid server does not have the information, it asks the parent Squid server directly and waits until the parent returns it. The second is sibling and sibling: when a Squid server does not have the information, it first asks its sibling Squid servers; if the siblings do not have it either, it skips them and goes to the parent or directly to the origin site.
With the default Squid configuration and no optimization at all, the hit rate generally reaches about 50% (Figure 4). If necessary, the cache hit rate can be raised above 90% by tuning parameters, splitting services, optimizing the file system, and so on. Squid consumes fewer server resources per TCP connection than a real HTTP server, so when Squid absorbs most of the connections, the site's capacity to withstand load increases greatly.
Figure 4. The blue line shows Squid traffic; the green area shows Apache traffic.
4.4 Memory-level cache
A memory-level cache temporarily stores dynamically generated content in memory. Within an acceptable delay window, the same request is not generated dynamically again but is read directly from memory. On Linux, Memcached is a good choice for memory-level caching.
Memcached is an excellent distributed in-memory object caching system developed by danga.com (the technical team that operates LiveJournal). It is used in dynamic systems to reduce database load and improve performance. Unlike Squid, which accelerates the front end, Memcached improves site performance by caching objects in memory to reduce database queries. One of its most attractive features is support for distributed deployment: Memcached services can be set up on a group of machines, and each service can use a different amount of memory according to the hardware of its server, so in theory a memory-based caching system of unlimited size can be built.
Memcached runs as a daemon on one or more servers, always ready to accept client connections. Clients can be written in many languages; known client APIs include Perl, PHP, Python, Ruby, Java, C#, and C [Appendix 1]. A client first establishes a connection with the Memcached service and then stores and retrieves objects. Each object has a unique key, all access goes through this key, and an expiration time can be set when an object is saved. Objects stored in Memcached are kept in memory rather than on disk. After it starts, the Memcached process pre-allocates a large block of memory and manages it itself, requesting another block only when the current one is used up, rather than asking the operating system each time memory is needed. Memcached stores objects in one huge hash table and uses the NewHash algorithm to manage it for further performance gains. So when enough memory is allocated to Memcached, the time it consumes is essentially that of the network socket connection.
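The two ideas in the description above, keyed access with an expiration time and spreading keys over several cache services, can be sketched without a real Memcached server. Everything below is a stand-in written for illustration: `CacheNode`, `CacheClient`, and `load_user` are hypothetical names rather than a Memcached client API, and choosing a node by a CRC32 hash of the key is an assumed, simple distribution scheme.

```python
import time
import zlib

class CacheNode:
    """Stand-in for one cache server: an in-memory table with expiry times."""

    def __init__(self):
        self._table = {}  # key -> (value, expiry timestamp)

    def set(self, key, value, ttl):
        self._table[key] = (value, time.monotonic() + ttl)

    def get(self, key):
        entry = self._table.get(key)
        if entry is None or entry[1] <= time.monotonic():
            self._table.pop(key, None)  # missing or expired
            return None
        return entry[0]

class CacheClient:
    """Picks a node by hashing the key, so several nodes act as one large cache."""

    def __init__(self, nodes):
        self.nodes = nodes

    def _node_for(self, key):
        return self.nodes[zlib.crc32(key.encode("utf-8")) % len(self.nodes)]

    def set(self, key, value, ttl=60):
        self._node_for(key).set(key, value, ttl)

    def get(self, key):
        return self._node_for(key).get(key)

def load_user(client, user_id, query_database):
    """Read through the cache and query the database only on a miss."""
    key = f"user:{user_id}"
    value = client.get(key)
    if value is None:
        value = query_database(user_id)
        client.set(key, value)
    return value

client = CacheClient([CacheNode(), CacheNode(), CacheNode()])
print(load_user(client, 42, lambda uid: {"id": uid, "name": "example"}))
```

The second call to `load_user` for the same id is answered from memory, which is how such a cache reduces database load.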
Memcached has its shortcomings. First, its data is stored in memory, so once the service process restarts (because the process is shut down accidentally, the machine reboots, and so on), all data is lost. Second, Memcached is started with root privileges and has no permission management or authentication of its own, so its security is inadequate. The first problem cannot be avoided when Memcached is used as a memory cache service; if the data does need to be preserved, one option is to modify the Memcached source code so that it writes to disk periodically. For the second, we can bind the Memcached service to an internal network IP and protect it with the Linux firewall.
4.5 CPU and IO balance
Among all the functions of a website, some consume large amounts of server IO resources, such as downloads and video playback, while others consume large amounts of server CPU resources, such as video format conversion and log statistics. In a server cluster, when we find that the CPU and IO utilization of some machines differ greatly (for example, the IO load is very high while the CPU load is very low), we can consider swapping some of the IO-consuming processes on those machines for CPU-consuming processes to achieve balance. Balancing the CPU and IO consumption of each machine not only makes better use of server resources but also means that, when an unexpected event causes traffic to surge, performance degrades gracefully (graceful performance degradation) instead of collapsing immediately.
4.6 Separation of read and write
If disk read and write performance is the site's bottleneck, consider separating the disk read and write functions and optimizing each individually. On disks dedicated to writing, we can use Linux software RAID-0 (redundant array of inexpensive disks, level 0). RAID-0 improves disk IO performance, but it also increases the failure rate of the whole file system, which equals the sum of the failure rates of all the drives in the array. If disk fault tolerance has to be maintained or improved, we need software RAID-1, 4, or 5. These keep the whole file system working normally after one (or several) disk drives fail, but their read and write efficiency is not as good as that of RAID-0. Disks dedicated to reading involve no such trouble; ordinary server drives can be used to reduce cost.
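The statement that a RAID-0 array fails at roughly the summed rate of its drives can be checked with a short calculation. Assuming drives fail independently, the array is lost if any one drive fails, so the exact probability is one minus the probability that every drive survives; for small per-drive probabilities this is close to the sum. The 3% per-drive figure below is a made-up example value, not a measurement.

```python
def raid0_failure_probability(drive_failure_probability, drive_count):
    """Probability that at least one of the drives fails (independent failures assumed)."""
    return 1 - (1 - drive_failure_probability) ** drive_count

per_drive = 0.03   # hypothetical failure probability of one drive over some period
drives = 4

exact = raid0_failure_probability(per_drive, drives)
approximate = per_drive * drives   # the "sum of failure rates" rule of thumb

print(f"exact: {exact:.4f}  sum approximation: {approximate:.4f}")
```

The sum slightly overstates the exact value, and the gap grows as the per-drive probability or the number of drives increases.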
A general-purpose file system balances read and write efficiency across files of different sizes and formats, so its efficiency for any specific kind of file is not optimal. If necessary, you can choose a file system and modify its configuration parameters to maximize read or write efficiency for a particular kind of file. For example, if the file system needs to store a large number of small files, ReiserFS can replace ext3, the default Linux file system. ReiserFS is based on a balanced tree structure, and especially for large file systems holding many files, its searches are faster than those of ext3, which uses a local binary search. Directories in ReiserFS are allocated fully dynamically, so the situation common in ext3, in which the disk space taken by a huge directory cannot be reclaimed, does not arise. In ReiserFS, small files (<4K) can be stored directly in the tree, so small files are read and written faster; nodes within the tree are byte aligned, and several small files can share the same disk block, which saves space. ext3 uses fixed-size block allocation, so a small file of less than 4K must still occupy 4K of space, which wastes a considerable amount of space. However, many Linux kernels do not support ReiserFS well, including 2.4.3, 2.4.9, and even the relatively new 2.4.16. A site that wants to use it must install the 2.4.18 kernel, which works better with it. Administrators are generally unwilling to use a kernel that is too new, because the software running on it has not yet been tested extensively in practice and may contain minor bugs that have not been found, and for servers even a small bug is unacceptable. ReiserFS is still a relatively young, fast-developing file system. A major drawback compared with ext3 is that every upgrade of the ReiserFS file system requires the entire disk partition to be reformatted. So there are trade-offs to weigh when choosing it.
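The space cost of fixed-size block allocation for small files is easy to quantify. The sketch below assumes, as the text does, that every file occupies whole 4K blocks and ignores all other file system overhead; the function names and the sample of one thousand 1 KB files are illustrative.

```python
import math

def allocated_bytes(file_size, block_size=4096):
    """Space a file occupies when it must take whole fixed-size blocks."""
    return math.ceil(file_size / block_size) * block_size

def wasted_fraction(file_sizes, block_size=4096):
    """Share of allocated space not holding file data (needs at least one non-empty file)."""
    used = sum(file_sizes)
    allocated = sum(allocated_bytes(size, block_size) for size in file_sizes)
    return 1 - used / allocated

small_files = [1024] * 1000   # hypothetical workload: one thousand 1 KB files
print(allocated_bytes(1024), wasted_fraction(small_files))
```

With 1 KB files in 4K blocks, three quarters of the allocated space holds no data, which is the kind of waste that packing small files into shared blocks avoids.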
5. Application layer optimization
5.1 Web server software choice
According to statistics, more than 50% of Internet site hosts currently use the Apache server program. Apache is the open source Web server of choice in the industry because it is powerful, reliable, and suitable for most applications. However, its power sometimes makes it cumbersome, its configuration files are dauntingly complex, and its efficiency is not very high under high concurrency. The lightweight Web server Lighttpd is a rising star. It is based on single-process multiplexing, and its ability to serve static files is far higher than Apache's. Lighttpd also supports PHP very well and can support other languages, such as Python, through FastCGI. Although Lighttpd is a lightweight server whose functions cannot compare with Apache's and which cannot handle some complex applications, even a site whose content is mostly generated dynamically inevitably has some static elements, such as images, JS scripts, and CSS. We can therefore consider placing Lighttpd in front of Squid to form a Lighttpd -> Squid -> Apache processing chain. Lighttpd, at the front, handles requests for static content itself and forwards requests for dynamic content through its proxy module to Squid. If Squid holds the requested content and it has not expired, Squid returns it directly to Lighttpd. New requests and requests for expired pages are handled by the script programs running in Apache. After the two stages of filtering by Lighttpd and Squid, the number of requests Apache has to handle is greatly reduced, which relieves pressure on the Web application. This architecture also makes it easy to spread the different kinds of processing across multiple computers, with the Lighttpd at the front distributing requests uniformly.
In this architecture each level can be optimized separately: for example, Lighttpd can use asynchronous IO, Squid can enable its in-memory cache, and Apache can enable an MPM (Multi-Processing Module). Each level can also use multiple machines to balance the load, which gives good flexibility.
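The request flow through such a three-stage chain can be sketched as a small simulation. This is only a model of the routing logic under stated assumptions, not a configuration for the real servers: the classes `FrontEnd`, `Cache`, and `Backend` are hypothetical stand-ins for Lighttpd, Squid, and Apache, and the list of static file suffixes and the expiry time are assumed values.

```python
import time

STATIC_SUFFIXES = (".jpg", ".png", ".gif", ".js", ".css")  # assumed list


class Backend:
    """Stands in for Apache running the script programs."""

    def __init__(self):
        self.hits = 0

    def handle(self, path):
        self.hits += 1
        return f"dynamic:{path}"


class Cache:
    """Stands in for Squid: serve unexpired entries, otherwise ask the backend."""

    def __init__(self, backend, ttl=60.0, clock=time.monotonic):
        self.backend = backend
        self.ttl = ttl
        self.clock = clock
        self.store = {}  # path -> (body, expiry time)

    def handle(self, path):
        now = self.clock()
        entry = self.store.get(path)
        if entry is not None and entry[1] > now:
            return entry[0]  # cached and not expired
        body = self.backend.handle(path)  # new or expired request
        self.store[path] = (body, now + self.ttl)
        return body


class FrontEnd:
    """Stands in for Lighttpd: serve static files, forward everything else."""

    def __init__(self, cache):
        self.cache = cache

    def handle(self, path):
        if path.endswith(STATIC_SUFFIXES):
            return f"static:{path}"
        return self.cache.handle(path)


if __name__ == "__main__":
    backend = Backend()
    front = FrontEnd(Cache(backend, ttl=60.0))
    for path in ("/logo.png", "/index.php", "/index.php"):
        print(front.handle(path))
    print("backend requests:", backend.hits)
```

In this model only the first request for a dynamic page reaches the backend; static requests and repeated requests within the expiry time are absorbed by the two stages in front of it.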
The well-known video sharing website YouTube chose Lighttpd as its front-end server program.
5.2 Database Selection
MySQL is a fast, multi-threaded, multi-user, and robust SQL database server that supports mission-critical, heavy-load systems. It is the most popular open source database management system and the preferred choice for web development under Linux. It is developed, distributed, and supported by MySQL AB.
Features of MySQL that suit websites:
Performance. MySQL supports fast storage and retrieval of massive amounts of data. A 64-bit processor can yield some extra performance, because MySQL often uses 64-bit integers internally.
Ease of use. MySQL's core is a small, fast database. Its fast connections, fast and reliable access, and security features make MySQL very suitable for Internet sites.
Openness. MySQL offers a choice of back-end storage engines, such as MyISAM, Heap, InnoDB, and Berkeley DB. The default engine is MyISAM, which has very good compatibility with the disk.
Support for enterprise applications. MySQL keeps a binary log that records data changes. Because the log is binary, data changes can be replicated quickly from one machine to another. Even if the server crashes, the binary log remains intact. This feature is often used to build database clusters that support higher traffic (Figure 5).
Figure 5 Schematic of a MySQL database cluster in master-slave mode
MySQL also has drawbacks of its own, such as the lack of a graphical interface and of stored procedures, and no support for triggers, referential integrity, views, subqueries, and so on, but these features are on the developers' to-do list. This is the power of open source: you can always expect something better.
Yahoo! abroad, Sina and Sohu in China, and many other large commercial sites use MySQL as their back-end database. For the average website system, whether judged on cost or on performance, MySQL is the best choice.
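A cluster of this kind is usually paired with read-write separation in the application: statements that change data go to the master, and queries are spread over the slaves. The following is a minimal sketch of that routing decision only, assuming a hypothetical `ReplicatedRouter` class and plain strings in place of real database connections. It does not use any MySQL client library.

```python
import itertools

WRITE_VERBS = ("INSERT", "UPDATE", "DELETE", "REPLACE")


class ReplicatedRouter:
    """Choose a server for each statement in a master-slave cluster."""

    def __init__(self, master, slaves):
        self.master = master
        # Rotate over the slaves; with no slaves every statement uses the master.
        self._slaves = itertools.cycle(slaves) if slaves else None

    def pick(self, sql):
        words = sql.split()
        verb = words[0].upper() if words else ""
        if verb in WRITE_VERBS or self._slaves is None:
            return self.master
        return next(self._slaves)


if __name__ == "__main__":
    router = ReplicatedRouter("master", ["slave1", "slave2"])
    print(router.pick("INSERT INTO video (title) VALUES ('demo')"))
    print(router.pick("SELECT title FROM video WHERE id = 1"))
    print(router.pick("SELECT title FROM video WHERE id = 2"))
```

Because replication takes some time, a read sent to a slave immediately after a write may not yet see the new data. A real router would need a policy for such cases, for example sending those reads to the master.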
5.3 The choice of server-side script parser
The three most common server-side scripting technologies are ASP (Active Server Pages), JSP (Java Server Pages), and PHP (Hypertext Preprocessor).
ASP, short for Active Server Pages, together with its successor ASP.NET, is a Web server-side development environment from Microsoft that can be used to build and run dynamic, interactive, high-performance Web service applications. ASP uses the scripting language VBScript (C# in ASP.NET) as its development language. Because it runs only in the Windows environment, we do not discuss it here.
PHP is a cross-platform, server-side embedded scripting language. It borrows heavily from the syntax of C, Java, and Perl and combines this with features of its own, so that Web developers can write dynamically generated pages quickly. It supports most existing databases. PHP is open source and distributed under an open source license, and its binaries and full source code can be downloaded free of charge from the official PHP site (http://www.php.net). On the Linux platform in combination with MySQL, PHP is the best option.
JSP is a new generation of site development language introduced by Sun, and the third kind of Java application besides Java applications and Java applets. With the support of Servlets and JavaBeans, JSP can be used to build powerful site programs. As a member of the Java technology family and an integral part of the Java 2 Enterprise Edition architecture, JSP inherits all the advantages of Java technology, including excellent cross-platform support, highly reusable component design, robustness, and security, and so supports highly complex Web-based applications.
Besides these three common scripting technologies, Linux offers many other options, such as Python (used by Google) and Perl, and the range of choices is wider still if scripts are invoked through CGI. The advantage of these less common scripting languages is that, for specialized applications, they have strengths that other scripts lack. The disadvantage is that fewer people in China use them, so less information can be found when a technical problem arises.
In developing a large-scale website, whatever technology is used, the site must be configurable. During the later operation of the site there will certainly be many changes in requirements. If every change means modifying source code, the development of the site can be called a failure.
First, and most important, functionality must be separated from presentation. Both PHP and JSP support template technology, such as PHP's Smarty and PHPLIB and JSP's JSTL (JSP Standard Tag Library). The core functions are written in the scripting language, while the front-end display uses special tags mixed with HTML. This both speeds up development and eases future maintenance and upgrades.
Second, for the front-end templates, the page header and footer should generally be extracted into separate files, and the page body should also be split by module or function. Supplementary code such as CSS and JS should likewise be stored in separate files. This not only makes changes easier to manage, but also lets these files be cached when users visit, reducing network traffic and server load.
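The separation of functionality from presentation can be illustrated with a very small template. The sketch below uses `string.Template` from the Python standard library in place of Smarty or JSTL, so it is an analogy and not how those engines work internally. The function `load_video` and the page markup are hypothetical.

```python
from string import Template

# Presentation: markup with placeholders, editable without touching the logic.
PAGE = Template("<h1>$title</h1>\n<p>Uploaded by $author</p>")


def load_video(video_id):
    """Logic: a hypothetical lookup that knows nothing about HTML."""
    return {"title": f"Video {video_id}", "author": "guest"}


def render(video_id):
    """Join the two: fill the template with the data from the logic layer."""
    return PAGE.substitute(load_video(video_id))


if __name__ == "__main__":
    print(render(7))
```

A change to the page layout now touches only `PAGE`, and a change to how videos are looked up touches only `load_video`. A full template engine adds features such as loops, includes, and output escaping on top of this idea.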
Third, in the scripts that implement the core functions, server-related settings such as database connection settings and script header file paths must be separated from the code and kept in configuration files. This matters especially when the site uses clustering or CDN acceleration, because the configuration of each server may be different. Without configuration files, several different copies of the code would have to be maintained at the same time, which is error-prone.
Finally, changes to a configuration file should take effect in real time wherever possible, so that the service program does not need to be restarted after the file is modified.
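One simple way to make configuration changes take effect without a restart is to check the file's modification time on each access and reload it when it has changed. The sketch below assumes a JSON configuration file and a hypothetical `ReloadingConfig` class. The file format, key names, and path are illustrative choices and are not taken from any particular site.

```python
import json
import os


class ReloadingConfig:
    """Read settings from a JSON file, reloading whenever the file changes."""

    def __init__(self, path):
        self.path = path
        self._mtime = None  # modification time of the version in memory
        self._data = {}

    def get(self, key, default=None):
        mtime = os.stat(self.path).st_mtime_ns
        if mtime != self._mtime:  # first access, or the file was modified
            with open(self.path, encoding="utf-8") as f:
                self._data = json.load(f)
            self._mtime = mtime
        return self._data.get(key, default)


if __name__ == "__main__":
    path = "site_config.json"  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"db_host": "192.0.2.1"}, f)
    config = ReloadingConfig(path)
    print(config.get("db_host"))
```

Each server in a cluster can then keep its own copy of the file with its own values, while the code stays identical everywhere. Checking the modification time on every access costs one system call, and a busy service might check only every few seconds instead.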
5.5 Encapsulation and the idea of a middle layer
At the level of function blocks, if JSP is used, the object-oriented design of Java, a pure object-oriented language, means that basic functions such as database connections and session management are already encapsulated in classes. If PHP is used, the encapsulation must be done explicitly in script code, wrapping each function block in a function, a file, or a class.
At a higher level, the site can be divided into a presentation layer, a logic layer, and a persistence layer, each encapsulated separately, so that a structural change in one layer does not affect the others. For example, during one upgrade Sina Podcast moved its database persistence layer from a centralized to a distributed architecture. Because the database connections and all operations were encapsulated [Appendix 2], the upper layers needed no code changes and the transition went smoothly. The recently popular MVC architecture splits the whole site into three parts, Model (model/logic), View (view/interface), and Controller (control/flow), and many good framework codebases are available, such as Struts and Spring for JSP and php.MVC and Studs for PHP. Using an existing framework lets web development achieve more with less effort.
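The benefit of encapsulating the persistence layer can be shown with a small sketch: the logic layer depends only on an interface, so a centralized store can be replaced by a distributed one without changing the calling code. All names below (`VideoStore`, `CentralStore`, `ShardedStore`, `publish`) are hypothetical, and in-memory dictionaries stand in for real databases. This is not the Sina Podcast code.

```python
from abc import ABC, abstractmethod


class VideoStore(ABC):
    """Persistence-layer interface that the logic layer depends on."""

    @abstractmethod
    def save(self, video_id, record):
        ...

    @abstractmethod
    def load(self, video_id):
        ...


class CentralStore(VideoStore):
    """Centralized design: one store holds every record."""

    def __init__(self):
        self._rows = {}

    def save(self, video_id, record):
        self._rows[video_id] = record

    def load(self, video_id):
        return self._rows.get(video_id)


class ShardedStore(VideoStore):
    """Distributed design: records are spread over several stores by ID."""

    def __init__(self, shard_count):
        self._shards = [{} for _ in range(shard_count)]

    def _shard(self, video_id):
        return self._shards[video_id % len(self._shards)]

    def save(self, video_id, record):
        self._shard(video_id)[video_id] = record

    def load(self, video_id):
        return self._shard(video_id).get(video_id)


def publish(store, video_id, title):
    """Logic-layer code: identical for either kind of store."""
    store.save(video_id, {"title": title})
    return store.load(video_id)["title"]


if __name__ == "__main__":
    for store in (CentralStore(), ShardedStore(4)):
        print(type(store).__name__, publish(store, 42, "demo"))
```

Switching from `CentralStore` to `ShardedStore` changes one constructor call, while `publish` and everything above it stay as they are.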
6 Expansion and fault tolerance
When the architecture of a large site is designed, possible future capacity expansion must be taken into account. Sina Podcast was designed with this fully in mind. For a video sharing website, video storage consumes an enormous amount of space. On its main storage server, Sina Podcast uses a configuration file to specify the range of video file IDs stored in each disk cabinet. When a front-end server needs to read a video, it first queries an interface on the main storage server to obtain the address of the disk cabinet and the directory that hold the video, and then reads the actual video file from that cabinet. When more disk cabinets are needed, therefore, only the configuration file has to be modified, and the front-end programs are unaffected.
Sina Podcast uses a MySQL database cluster, and the logic layer encapsulates all database connections and operations. When the database storage structure changes, for example when a set of master databases is added, some data is split into separate database tables, or more slave databases are added for reads, only the class that encapsulates the database operations needs to be modified, and the upper-layer code is untouched.
Sina Podcast's front-end page servers sit behind F5 hardware layer-4 switches. Netcom and Telecom users are directed to different virtual IPs, and behind each virtual IP several servers provide the service. When traffic grows, servers can easily be added behind a virtual IP to share the load.
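The lookup from a video ID to a disk cabinet can be sketched as a search over a table of ID ranges. The table contents, host names, directory layout, and the `.flv` file naming below are all hypothetical. Only the idea of mapping ID ranges to cabinets through configuration comes from the text.

```python
import bisect

# Hypothetical configuration, sorted by first ID:
# (first_id, last_id, cabinet address, directory)
CABINETS = [
    (1, 100000, "cabinet-a.example", "/data/v1"),
    (100001, 250000, "cabinet-b.example", "/data/v2"),
]
_FIRST_IDS = [row[0] for row in CABINETS]


def locate(video_id):
    """Return (cabinet address, file path) for a video ID."""
    index = bisect.bisect_right(_FIRST_IDS, video_id) - 1
    if index < 0:
        raise KeyError(video_id)  # below the first configured range
    first_id, last_id, address, directory = CABINETS[index]
    if video_id > last_id:
        raise KeyError(video_id)  # falls in a gap or beyond the last range
    return address, f"{directory}/{video_id}.flv"


if __name__ == "__main__":
    print(locate(42))
    print(locate(180000))
```

Adding a cabinet then means appending one row to `CABINETS`, which in a real deployment would be read from the configuration file. Callers of `locate` do not change.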
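The effect of adding servers behind a virtual IP can be modeled in a few lines. The sketch below is a simplified simulation and not a description of how F5 hardware works: the `VirtualIP` class, the round-robin policy, the addresses, and the server names are all assumptions made for illustration.

```python
class VirtualIP:
    """A virtual address that hands requests to real servers in rotation."""

    def __init__(self, address, servers):
        self.address = address
        self.servers = list(servers)
        self._count = 0  # number of requests dispatched so far

    def add_server(self, server):
        """Expand capacity; clients keep using the same virtual address."""
        self.servers.append(server)

    def pick(self):
        server = self.servers[self._count % len(self.servers)]
        self._count += 1
        return server


# Hypothetical mapping from a user's carrier to a virtual IP.
POOLS = {
    "netcom": VirtualIP("192.0.2.10", ["web-n1", "web-n2"]),
    "telecom": VirtualIP("192.0.2.20", ["web-t1", "web-t2"]),
}


def route(carrier):
    """Choose the real server for a request arriving from the given carrier."""
    return POOLS[carrier].pick()


if __name__ == "__main__":
    print([route("netcom") for _ in range(3)])
    POOLS["netcom"].add_server("web-n3")
    print([route("netcom") for _ in range(3)])
```

Clients only ever see the virtual address, so capacity can grow behind it without any change on the client side or in DNS.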
6.2 Fault Tolerance
For commercial sites, availability is very important. A site that must be accessible 7 × 24 needs strong fault tolerance. Errors include network errors, server errors, and application errors.
On December 27, 2006, an earthquake measuring 7.6 on the Richter scale off the east coast of Taiwan broke several submarine cables in the Taiwan Strait, leaving many foreign sites such as MSN, NBA, and Yahoo! (the English main site) inaccessible from China. There were exceptions, however: many sites, with Google as the representative example, that had built distributed data nodes in China remained accessible. A network outage caused by an earthquake is force majeure, but a site that can still be visited in such circumstances will no doubt leave a deep impression on its users. The lesson for large commercial sites is that a site needs to keep its data present in the regions where its main users are located, to guard against possible network failures.
Server errors are generally guarded against by redundant design. For storage servers (mainly the servers responsible for writes), RAID (redundant array of independent disks) can be used. For databases (mainly the master database responsible for writes), a dual-master design can be used. For the front-end servers that provide the service, a layer-4 switching cluster can be used so that multiple servers provide the service together, both sharing the traffic load and acting as backups for one another.
In application-layer programs, error handling should be designed to be "user friendly". Typical examples are the HTTP 404 error page, the handling of internal program errors, and the error messages returned to the user, all of which should be as considerate of the user as possible.
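The idea that redundant servers act as backups for one another can be sketched as a failover loop: a request is tried on each server in turn until one answers. In the sketch the servers are plain Python callables and a failure is signalled by `ConnectionError`. Both are simplifying assumptions, and the function names are hypothetical.

```python
def call_with_failover(servers, request):
    """Try each redundant server in order and return the first success."""
    failures = []
    for server in servers:
        try:
            return server(request)
        except ConnectionError as exc:
            failures.append(exc)  # remember the failure, try the next server
    raise ConnectionError(f"all {len(failures)} servers failed")


def broken_server(request):
    """Hypothetical server that is currently down."""
    raise ConnectionError("server down")


def healthy_server(request):
    """Hypothetical server that answers normally."""
    return f"ok:{request}"


if __name__ == "__main__":
    print(call_with_failover([broken_server, healthy_server], "/index"))
```

In a layer-4 cluster this decision is made by the switch and not by application code, but the principle is the same: one failed server does not make the service unavailable as long as another can answer.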
7 Summary and Outlook
For a high-concurrency, high-traffic site, a bottleneck at any link will degrade site performance, hurt the user experience, and cause enormous economic losses. At the level of the whole Internet, a distributed design should be used to shorten the network distance between the site and its users, reduce backbone traffic, and prevent the site from becoming inaccessible when a network accident occurs. At the LAN level, server clusters should be used, which on the one hand support more visits and on the other hand serve as redundant backups, so that a server failure does not make the site inaccessible. At the single-server level, the operating system, file system, and application-layer software should be configured to balance the consumption of the various resources, eliminate performance bottlenecks, and realize the server's full potential. At the application layer, various caches can be used to improve program efficiency and reduce server resource consumption (Figure 6). In addition, application-layer programs need a sound design that anticipates future changes in requirements.
Figure 6 Typical architecture of a high-concurrency, high-traffic site
At every level, fault tolerance must be considered and single points of failure strictly eliminated, so that no error at any layer, whether a program error, a server software error, a server hardware error, or a network error, affects the site's service.
In the current Linux environment there is the famous LAMP (Linux + Apache + MySQL + PHP/Perl/Python) website-building solution, but it is generally suitable only for small sites. For large commercial sites with high concurrency and high traffic, there is not yet a complete, cost-effective solution. Beyond the hardware investment in servers, hard disks, and bandwidth, a great deal of budget, time, and effort must also be spent on software solutions.
With the sustained development of the Internet and the rise of Web2.0, in the foreseeable future the number of Internet users will keep growing, sites will offer more and more content for user participation, and more and more users will take part, so the concurrency and traffic of sites will reach a new level. This will lead more and more individuals, companies, and research institutions to focus on the architecture of high-concurrency, high-traffic sites. Just as LAMP made countless small and medium Web1.0 sites possible, the success of Web2.0 is bound to produce a new, efficient, low-cost solution. That solution should include transparent third-party CDN acceleration services; inexpensive layer-4 or even higher-layer network switching equipment; an operating system with optimized network performance; a distributed, highly reliable file system with optimized read and write performance; an HTTP server that integrates caches at the memory and hard disk levels; a more efficient server-side script parser; and an application-layer design framework that encapsulates most of the details.