A Written Discussion of NoSQL Databases
- Theory
- Techniques
  - Consistent hashing
  - Quorum NRW
  - Vector clocks
  - Virtual nodes
  - Merkle trees
  - MapReduce execution
  - Handling deletes
  - Storage implementation
  - Node changes
  - Keep out
- Software
  - Column stores
    - HBase (Hadoop)
    - HadoopDB (Yale University)
    - Cassandra (Facebook)
    - BigTable (Google)
    - PNUTS (Yahoo)
    - Microsoft SQL Data Services
    - Non-cloud-service competitors
  - Document stores
  - Key-value / tuple stores
    - Eventually consistent key-value stores
- Applications
  - eBay architecture experience
  - Taobao architecture experience
  - Flickr architecture experience
  - Twitter operations experience
  - Cloud computing architecture
- Underlying principles of NoSQL
There is no relatively complete body of NoSQL database material available yet; many pioneers have compiled a great deal, but not very systematically. Here I attempt to integrate that material and add some views of my own.

This text describes some of the major technologies, algorithms, and ideas behind current NoSQL systems, and cites a large number of existing databases as examples. After reading the whole text, I believe readers will have a general understanding of NoSQL databases.

In addition, I am also preparing to develop an open-source in-memory database, galaxydb. This book also provides some of the design background for that database.
CAP, BASE, and eventual consistency are the three cornerstones on which NoSQL databases rest. The five-minute rule is the theoretical basis for keeping data in memory. These are the source of everything that follows.
- C: Consistency
- A: Availability (fast access to the data)
- P: Partition tolerance (tolerance of network partitions; i.e., the system remains distributed)
Ten years ago, Professor Eric Brewer proposed the famous CAP theory; later, Seth Gilbert and Nancy Lynch proved its correctness. CAP theory tells us that a distributed system cannot simultaneously satisfy all three requirements of consistency, availability, and partition tolerance; it can satisfy at most two.

You cannot have your cake and eat it too. If consistency is your concern, you must handle writes that fail because the system is unavailable; if availability is your concern, you should accept that a read may not return the value of the latest write. The appropriate strategy therefore differs with each system's focus; only by truly understanding a system's requirements can you apply CAP theory well.
As an architect, there are generally two directions in which to take advantage of CAP theory:
- Key-value stores, such as Amazon Dynamo: following the three CAP principles, flexibly choose an existing database product with the tendency your application needs.
- Domain model + distributed cache + storage (the Qi4j and NoSQL movement): following the three CAP principles, design a custom, flexible distributed scheme for each project; this is difficult.
I intend to offer a third option: a database whose CAP tradeoff is configurable, achieving dynamic deployment of CAP.
- CA: traditional relational databases
- AP: key-value databases
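As a sketch of what such a configurable CAP knob could look like, the quorum NRW technique (listed in the table of contents) lets a single cluster lean toward consistency or availability just by changing two integers. The function and configuration names below are illustrative, not any product's API.

```python
# Minimal sketch: choosing a CAP tendency by tuning read/write quorums.
# N = number of replicas, W = write quorum, R = read quorum.
# R + W > N  -> every read quorum overlaps the latest write quorum (consistency-leaning).
# R + W <= N -> reads may miss the latest write, but fewer replicas must answer (availability-leaning).

def is_strongly_consistent(n: int, r: int, w: int) -> bool:
    """A read quorum that overlaps every write quorum always sees the newest value."""
    return r + w > n

# Two configurations over the same 3-replica cluster:
consistent_cfg = dict(n=3, r=2, w=2)  # 2 + 2 > 3: consistent, but needs 2 live replicas per op
available_cfg = dict(n=3, r=1, w=1)   # 1 + 1 <= 3: any single replica can serve, reads may be stale

print(is_strongly_consistent(**consistent_cfg))  # True
print(is_strongly_consistent(**available_cfg))   # False
```

The point is that the tradeoff is per-deployment configuration, not a property baked into the product.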
For large web sites, availability and partition tolerance have a higher priority than data consistency; their designs generally lean toward A and P, with the consistency the business needs then ensured by other means. Architects should not waste energy on a perfect design that satisfies all three properties of a distributed system, but should make tradeoffs.

Different data have different consistency requirements. For example, user comments are insensitive to inconsistency and can tolerate a relatively long window of inconsistency without affecting transactions or the user experience. Price data, however, are very sensitive, and usually cannot tolerate inconsistency for more than about 10 seconds.
Proof of the CAP theory: Brewer's CAP Theorem
In short: the process may be loose, but the result is tight; the final result must be consistent.

To better describe client-side consistency, we set up the following scenario, which consists of three components:
- Storage system
The storage system can be understood as a black box that provides us with availability and durability guarantees.
- Process A
Process A performs write and read operations against the storage system.
- Processes B and C
Processes B and C are independent of A, and independent of each other; they likewise perform write and read operations against the storage system.
Below, we use this scenario to describe the different degrees of consistency:
- Strong consistency
Strong consistency (immediate consistency): if A writes a value to the storage system, the storage system guarantees that subsequent reads by A, B, and C will all return the latest value.
- Weak consistency
If A writes a value to the storage system, the storage system cannot guarantee that subsequent reads by A, B, and C will return the latest value. Here there is a notion of an "inconsistency window": the period of time between A's write and the moment when subsequent reads by A, B, and C are guaranteed to return the latest value.
- Eventual consistency
Eventual consistency is a special case of weak consistency. If A writes a value to the storage system and no other write occurs before A, B, and C's subsequent reads, the storage system guarantees that eventually all reads will return the latest value written by A. If no failures occur, the size of the inconsistency window depends on factors such as interaction delay, system load, and the number of replicas in the replication scheme (in master/slave terms, the number of slaves). The most famous eventually consistent system is DNS: after a domain name's IP is updated, depending on configuration and cache-control policies, eventually all clients will see the latest value.
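The inconsistency window can be made concrete with a toy model. The class below is an illustrative sketch, not a real storage API: writes land on a primary immediately, while replicas only catch up when replication runs.

```python
# Toy model of eventual consistency: one primary, two asynchronous replicas.
# The "inconsistency window" is the time between the primary applying a write
# and the replicas catching up.

class EventuallyConsistentStore:
    def __init__(self, num_replicas=2):
        self.primary = {}
        self.replicas = [dict() for _ in range(num_replicas)]
        self.pending = []  # replication log not yet applied to replicas

    def write(self, key, value):
        """Process A writes to the primary; replication is deferred."""
        self.primary[key] = value
        self.pending.append((key, value))

    def read(self, key, replica=0):
        """Processes B and C read from a replica, which may be stale."""
        return self.replicas[replica].get(key)

    def replicate(self):
        """Apply the pending log everywhere, closing the inconsistency window."""
        for key, value in self.pending:
            for r in self.replicas:
                r[key] = value
        self.pending.clear()

store = EventuallyConsistentStore()
store.write("price", 100)
print(store.read("price"))  # None -- stale read inside the inconsistency window
store.replicate()
print(store.read("price"))  # 100  -- all reads now return the latest value
```

This mirrors the DNS example: once propagation (here, `replicate`) completes, every client sees the new value.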
- Causal consistency
If process A notifies process B that it has updated a data item, B's subsequent reads will return the latest written value. C, which has no causal relationship with A, remains under ordinary eventual consistency.
- Read-your-writes consistency
If process A writes a new value, A's subsequent reads will return that latest value; other processes, however, may only see it after a while.
- Session consistency
This requires read-your-writes consistency throughout the session in which the client interacts with the storage system. Hibernate's session provides this kind of consistency guarantee.
- Monotonic read consistency
If a process has read a particular value of an object, its subsequent reads will never return an earlier value.
- Monotonic write consistency
The system guarantees that all write operations from a single process are serialized, i.e. executed in order.
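One common way to provide read-your-writes (and hence session) consistency on top of an eventually consistent backend is for the session to remember its own writes. This is a minimal illustrative sketch; the class names are made up.

```python
# Sketch of read-your-writes consistency: the session keeps a local record of
# its own writes and serves them back, even while the backend replica lags.

class StaleBackend:
    """Stands in for a replica that has not yet received recent writes."""
    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

class Session:
    def __init__(self, backend):
        self.backend = backend
        self.own_writes = {}  # values written during this session

    def write(self, key, value):
        self.own_writes[key] = value
        # In a real system the write would also go to the primary and
        # replicate asynchronously; the stale replica may not see it yet.

    def read(self, key):
        # Read-your-writes: prefer this session's own writes, then fall
        # back to the (possibly stale) replica.
        if key in self.own_writes:
            return self.own_writes[key]
        return self.backend.get(key)

replica = StaleBackend()
a = Session(replica)
a.write("profile", "v2")
print(a.read("profile"))       # v2: A sees its own write immediately
print(replica.get("profile"))  # None: other sessions may not see it yet
```

Because `own_writes` never regresses, reads within the session are also monotonic.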
Amusingly, BASE is the English word for a base (an alkali), while ACID means acid. Truly incompatible!
- Basically Available
- Soft state (flexible state / flexible service)
"Soft state" can be understood as "connectionless", while "hard state" is "connection-oriented".
- Eventual consistency
Eventual consistency is also the ultimate goal of ACID.
BASE is the complete opposite of the ACID model; it sacrifices strong consistency to gain availability or reliability. Basically Available: the system is basically available and supports partition failures (e.g. sharding the database into fragments). Soft state: the state may be out of sync for a period of time, i.e. asynchronous. Eventually consistent: the data only needs to become consistent eventually, not to be consistent at all times.
The main ideas for implementing BASE:
1. Partition the database by function.
The basic idea of BASE mainly emphasizes availability. If you need high availability, i.e. pure performance, then consistency or fault tolerance must be sacrificed; the BASE approach still has potential to be tapped in terms of performance.
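Partitioning the database by function can be as simple as a routing table from functional area to database cluster. The connection strings below are invented for illustration; the point is that each functional area can then choose its own consistency tradeoff.

```python
# A routing table from functional area to its own database cluster.
# The connection strings are made up for illustration only.

FUNCTION_SHARDS = {
    "users":    "db://users-cluster",     # transactional, consistency-leaning
    "comments": "db://comments-cluster",  # eventual consistency is acceptable
    "prices":   "db://prices-cluster",    # tighter staleness bound (~10s)
}

def shard_for(area):
    """Return the database cluster that owns this functional area."""
    return FUNCTION_SHARDS[area]

print(shard_for("comments"))  # db://comments-cluster
```

This matches the earlier observation that comments and prices have very different consistency requirements: function-level partitioning lets each get its own policy.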
The I/O five-minute rule
In 1987, Jim Gray and Gianfranco Putzolu published the "five-minute rule": in short, if a record is accessed frequently enough, it should be kept in memory; otherwise it should stay on disk and be read on demand. The break-even point is five minutes. It looks like an empirical law, but the five-minute threshold is actually derived from cost: at the hardware prices of the time, keeping 1KB of data in memory cost about the same as fetching it from disk once every 400 seconds (close to five minutes). The rule was revisited around 1997 and confirmed to still hold (disk and memory had seen no qualitative leap); the latest revision examines the influence that SSDs, a kind of "new old hardware", may have.

In the flash era, the five-minute rule splits in two, depending on whether the SSD is used as slow memory (an extended buffer pool) or as a fast disk (an extended disk): small memory pages move between memory and flash, while large pages move between flash and disk. Twenty years after it was first proposed, the five-minute rule still holds in the flash era, but now for much larger pages (64KB pages; the change in page size precisely reflects the development of hardware bandwidth and latency).
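The rule itself is just a break-even computation between memory cost and disk-access cost. The formula follows Gray and Putzolu's analysis; the hardware prices plugged in below are made-up placeholders chosen to land near five minutes, not the actual 1987 figures.

```python
# Break-even interval per Gray & Putzolu: keep a page in RAM if it is accessed
# more often than once per this many seconds. Prices below are placeholders.

def break_even_seconds(pages_per_mb_ram, accesses_per_sec_per_disk,
                       price_per_disk, price_per_mb_ram):
    """Interval at which caching one page in RAM costs the same as
    re-reading it from disk on every access."""
    return (pages_per_mb_ram / accesses_per_sec_per_disk) * \
           (price_per_disk / price_per_mb_ram)

# 1KB pages -> 1024 pages per MB of RAM; a hypothetical disk doing 15 IO/s.
interval = break_even_seconds(pages_per_mb_ram=1024,
                              accesses_per_sec_per_disk=15,
                              price_per_disk=400,
                              price_per_mb_ram=100)
print(round(interval))  # 273 -- about 4.5 minutes with these placeholder prices
```

As RAM gets cheaper relative to disk I/O, the break-even interval grows, which is why the revisions track hardware prices rather than restating "five minutes" as a constant.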
Do not delete data
Oren Eini (aka Ayende Rahien) advised developers to avoid soft deletes in databases, so readers might conclude that hard deletes are the reasonable choice. In response to Ayende's article, Udi Dahan strongly recommends avoiding data deletion entirely.

Soft-delete advocates add an IsDeleted column to the table to preserve data integrity. If a row's IsDeleted flag is set, the row is considered deleted. Ayende feels this approach is "simple, easy to understand, easy to implement, easy to communicate", but "often wrong". The problem is:
Deleting a row or an entity is almost never a simple, isolated event. It affects not only the data in the model, but also the shape of the model itself. That is why we have foreign keys: to ensure that an "order line" never lacks its parent "order". And this example is only the simplest case. ...

When we soft delete, whether we like it or not, the data is prone to corruption. For example, a small adjustment that nobody minds can make a "customer"'s "latest order" point to an order that has been soft deleted.
If a developer receives a request to remove data from the database and soft deletes are ruled out, only hard deletes remain. To keep the data consistent, the developer must not only delete the directly related rows, but also cascade the delete to related data. Udi Dahan reminds readers that the real world does not cascade:

Suppose marketing decides to remove an item from the catalog. Does that mean all the old orders containing that product must disappear? And, cascading down, must all the invoices for those orders be deleted as well? Step by step, would our company's profit and loss statements then have to be redone?
Heaven forbid.
The problem seems to lie in the interpretation of the word "delete". Dahan gives this example:

When I say "delete", what I actually mean is that the product is "discontinued". We no longer sell this product; we will sell off the stock and purchase no more. Customers will no longer see the product when they search or browse the catalog, but the people in charge of the warehouse will have to keep managing it for a while. "Delete" is just a convenient shorthand.
He then gives some interpretations that stand correctly on the user's point of view:
Orders are not deleted; they are "cancelled". Cancelling an order too late may also incur a charge.
Employees are not deleted; they are "fired" (or perhaps retired), with the appropriate severance to handle.
Job positions are not deleted; they are "filled" (or the requisition is withdrawn).
In all these examples, our focus should be on what the user wants to accomplish, rather than on the technical action applied to a physical entity. In almost every case, more than one entity needs to be considered.
Instead of an IsDeleted flag, Dahan suggests a field representing the state of the data: valid, disabled, cancelled, discontinued, and so on. Users can then use such a status field to take past data into account when making decisions.
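Dahan's status-field suggestion can be sketched as follows; the enum values mirror the examples in the text, and all class and field names are illustrative.

```python
# Sketch: model the user's intent with a status field instead of an
# IsDeleted flag. The row stays in the database; only its state changes.

from enum import Enum

class OrderStatus(Enum):
    ACTIVE = "active"
    CANCELLED = "cancelled"        # orders are cancelled, not deleted
    DISCONTINUED = "discontinued"  # products leave the catalog but keep history

class Order:
    def __init__(self, order_id):
        self.order_id = order_id
        self.status = OrderStatus.ACTIVE

    def cancel(self):
        # Nothing is removed, so invoices and reports that reference this
        # order remain consistent; the business state is simply recorded.
        self.status = OrderStatus.CANCELLED

order = Order(42)
order.cancel()
print(order.status.name)  # CANCELLED
```

Queries for "current" orders then filter on status, which makes the business meaning explicit instead of hiding it behind a boolean.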
Besides destroying data consistency, deleting data has other negative consequences. Dahan recommends keeping all data in the database: "Don't delete. Just don't."