Hash table parsing algorithm

Author: July, wuliming, pkuoliver

Source: http://blog.csdn.net/v_JULY_v.

Description: This article is divided into three parts,

The first part is a Top K algorithm Baidu faces questions Wapakhabulo; second part on the elaboration of Hash table algorithms; third part of the Hash table to create a fast algorithm.

------------------------------------

Part: Top K algorithm Baidu faces questions Detailed Description of the problem:

Search engine each time a user through the log files to retrieve all the search strings used are recorded, each query string of the length of 1-255 bytes.

Suppose there are ten million records (the query string repeat is relatively high, although the total is 10 million, but if, after removing duplicate, not more than 3 million. A repetition of the query string the higher the inquiry for its users, the more , which is more popular.), you statistics of the 10 most popular query string, the memory can not require the use of more than 1G.

Prerequisite knowledge:

What is hash table?

Hash table (Hash table, also called hash table), is based on the key code value (Key value) and direct access to the data structure. In other words, it is mapped by the key code value to the table in a position to access the records to speed up the search speed. The mapping function is called the hash function, called an array of records stored hash.

Hash table is actually very simple, that is the Key functions through both a fixed algorithm called a hash function to convert integer numbers, then the number of array length to take over, take the result as an array of more than the subscript, to the value stored in the digital space for the next target array.

When using the hash table when the query is re-used hash function key into the corresponding array index, and navigate to the space for value, this way, you can take advantage of the array's positioning performance data location (the second article, three parts, will be elaborated for the Hash Table).

Problem Analysis:

To count the most popular queries, first of all there is to statistics the number of times each Query, then the statistical results, to find Top 10. So we can be designed based on this idea in two steps of the algorithm.

That is, to solve this problem is divided into the following two steps:

The first step: Query statistics

Query statistics following both a method to choose from:

1, the direct sequencing method first we thought the first sorting algorithm is, first of all inside of this Query logs are sorted, then traversing the sorted Query, Query statistics the number of occurrences of each.

But questions have clear requirements that the memory can not be more than 1G, ten million records, each record is 255Byte, it is clear to occupy 2.375G memory, this condition does not meet the requirements.

Let us recall the contents of data structures course, when larger than the data and the memory can not hold, we can use external sorting methods to sort, where we can merge sort, merge sort because there is a better time complexity O (NlgN).

Drained after the order has been ordered for us file through the Query, Query statistics the number of occurrences of each, again written to the file.

Comprehensive analysis, sort time complexity is O (NlgN), and through the time complexity is O (N), so the overall time complexity of the algorithm is O (N + NlgN) = O (NlgN).

2, Hash Table method in the first method, we used a statistical approach to sorting the number of occurrences of each Query time complexity is NlgN, you can not have better ways to store, and lower time complexity it?

The title shows, although there are ten million Query, but due to a relatively high degree of repetition, so the fact that only 300 million Query, each Query255Byte, so we can consider them all to go into the memory, and now just needs a suitable data structure, where, Hash Table is definitely our preferred choice, because Hash Table query speed is very fast, almost O (1) time complexity.

So, we have the algorithm: the maintenance of a Query Key as string, Value Query for the number of occurrences of the HashTable, each read a Query, if the string is not in the Table, then add the string, and the Value value of 1; if the string in the Table, then the string plus one can count. We finally O (N) time complexity completed within the massive data processing.

This method is compared to the algorithm 1: increase in time complexity an order of magnitude for the O (N), but not only on the time complexity of the optimization, the method requires only one data file IO, and IO times higher than Algorithm 1 and more, so the algorithm 2 to algorithm 1 works better operability.

Step two: find the Top 10

Algorithm A: I think for the ordinary sort sorting algorithm we have not unfamiliar, and not repeat them here, we should note that the time complexity of sorting algorithm is NlgN, in this subject, the three million records, with 1G of memory is able to save.

Algorithm II: some sort

Questions asked is to find the Top 10, so we do not need to sort all of the Query are, we only need to maintain a 10-size array, initialized into 10 Query, Query according to the statistics of each frequency by the large small order, and then traverse the 300 million records, and each reading of a record on an array of the last Query contrast, if less than the Query, then continue through, otherwise, the last data out of the array, adding the current Query. Finally, when all the data traversal is completed, then the array 10 Query is we are looking for the Top10.

Difficult to analyze, so the worst time complexity of the algorithm is N * K, where K is the top number.

Algorithm III: heap in the algorithm 2, we have the time to optimize the complexity of the NlogN NK, have to say this is a relatively large improvements, but there is no better way to do it?

Analysis, in algorithm II, after the completion of each comparison, the complexity of operations required are K, because the necessary elements into a linear form being, but also uses a sequence comparison. Here we note that the array is ordered, every time we find one when you can use binary search methods, the complexity of this operation dropped to the logK, however, the problem is followed by data movement because the movement increased frequency of the data. However, this algorithm is an improvement over algorithm two.

Based on the above analysis, we think, not only is there a quick search, but fast-moving elements of the data structure? The answer is yes, it is heap.

With the heap structure, we can find the log magnitude of time and adjust / move. So here, our algorithm can be improved as such, maintains a K (the title is 10) the small size of the root heap, and then traverse the 300 million Query, respectively, and compare the root element.

Consistent with the above algorithm two ideas, just algorithms algorithms Third, we use the minimum heap data structure instead of this array to find the target element has time complexity O (K) down to O (logK).

So this way, using the heap data structures, algorithms, three, the final time complexity dropped to N'logK, and compared two algorithms, there has been relatively large improvements.

Summary:

Thus, the algorithm is completely over, after the first step, each with a Hash Table Statistics Query the number of occurrences, O (N); then the second step, using the heap data structure to find Top 10, N * O (logK) . Therefore, our final time complexity is: O (N) + N '* O (logK). (N 10 million, N '300 million). If you have any better algorithms, comments welcome message. The first part, complete.

The second part, Hash table algorithms detailed analysis

What is a Hash

Hash, generally translated to do a "hash", also has a direct transliteration of "hash", that is, the input of arbitrary length (also called pre-mapping, pre-image), through a hash algorithm, converted into fixed-length output, the output is the hash value. This conversion is a contraction mapping, that is, the hash value of the space is usually much smaller than the input space, different input may hash to the same output, but not only from the hash value to determine the input value. Simply is a message of any length will be compressed to a fixed length message digest function.

HASH is used primarily in the field of information security encryption algorithm, which put some information into different lengths of 128-bit messy code, the value of these codes is called HASH value can also be said, hash is to find a store address data content and data of mapping between.

Array features are: easy addressing, insert and delete difficulties; and list the features are: addressing difficulties in inserting and removing easy. Then we can not composite characteristics of both, easy to make an addressing, inserting the data structure is also easy to remove? The answer is yes, this is what we want to bring the hash table, hash table implementations in many different ways, I am going to explain the most common method - zipper method, we can understand the "list of array ", as shown:

Obviously the left is an array, each array comprises a pointer to the head of a linked list, of course, this list may be empty, may also be many elements. According to some characteristics of our elements to be assigned to different elements to the list, and is based on these characteristics, to find the right list, and then find the element from the list.

Element characteristics of the subject into the array method is hashing. Hashing is more than one course, three of the more common are listed below:

1, the division hashing the most intuitive one, is this on the map using the hash method, the formula:

index = value% 16

Learned compiled all know, is seeking modulus obtained by a division operation, so called "division hashing."

2, square hashing method for the index is a very frequent operation, and multiplication is more time-saving than the division (of the current CPU, it is estimated that we do not feel it), so we consider the division into multiplication and a shift operation. Formula:

index = (value * value)>> 28 (right, divided by 2 ^ 28. notation: left bigger, is by. right smaller, is in addition.)

If the value is more evenly allocated, then this method can get good results, but I draw above that figure the value of each element of the index is calculated 0 - failed miserably. Perhaps you have a problem, value if large, value * value will not overflow it? The answer is yes, but we do not care about this multiplication overflow, because we do not multiply in order to obtain the results, but in order to obtain the index.

3, Fibonacci (Fibonacci) hashing

The disadvantage of hashing square is obvious, so we can find an ideal multiplier, not the value itself as a multiplier to get it? The answer is yes.

1, for 16-bit integer terms, this multiplier is 40503

2, for 32-bit integer terms, this multiplier is 2654435769

3, for 64-bit integer terms, this multiplier is 11400714819323198485

These "ideal multiplier," is how to get out of it? This is related with a law, called the golden rule, which describes the golden rule expression is undoubtedly the most famous classic Fibonacci series, that is the case in the form of the sequence: 0, 1, 1, 2, 3, 5, 8 , 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, .... In addition, the Fibonacci series and the value of the solar system eight planets, the ratio of orbital radius is surprisingly consistent.

Of our common 32-bit integer, the formula:

index = (value * 2654435769)>> 28

If you use this Fibonacci hashing, then the above diagram becomes like this:

Obviously, with the Fibonacci hashing adjusted to take than the original touch hashing a lot better.

Scope to quickly find, remove the basic data structure, usually the total amount of data can be placed in memory.

Basic principles and key points

choose hash function for strings, integers, order, specifically the corresponding hash method.

Collision process, one is open hashing, also known as the zipper method; the other is closed hashing, also known as the opening address method, opened addressing.

Expand

d-left hashing in d is more than one meaning, we first simplify the problem, look at the 2-left hashing. 2-left hashing refers to a hash table is divided into two halves of equal length, are called T1 and T2, T1 and T2 are equipped to a hash function, h1 and h2. In store a new key at the same time using two hash function calculations, two address h1 [key] and h2 [key]. Then need to check in T1 h1 [key] position and T2 h2 [key] position, a position which has been stored (a collision) key more, and then load the new key is stored in a small place. If both sides as much as, for example, two positions are empty or stores a key, put the new key is stored in the left sub-table T1 ,2-left is the resulting. In the search for a key, it must be two hash, also find the two locations.

Problem instance (mass data processing)

We know that massive data processing in the hash table has a wide range of applications, below, see the other side said Baidu questions:

Title: Massive log data, extract a visit that the highest number of Baidu IP.

Program: IP number is still limited, up to 2 ^ 32, so you can consider using the ip hash directly into memory, and then statistics.

The third part, the fastest algorithm of Hash Table

Next, we look to a specific analysis of the fastest Hasb table algorithm.

We gradually from a simple question to start: There is a huge array of strings, and then give you a single string, so you find from this array if the string and find it, how would you do? There is a way the most simple, honest and found the end from the beginning, one by one more, until you find the date, I think if people learned programming can be made to put such a program, but if there is such a program to pay programmers to to the user, I can only silent to evaluate, perhaps it really can work, but ... can only be the case.

Is naturally the most appropriate algorithm to use HashTable (hash table), the first who introduced one of the basic knowledge, the so-called Hash, usually an integer, by some algorithm, a string can be "compressed" into an integer. Of course, in any case, a 32-bit integer is not corresponding to the back of a string, but in the process, the two strings to calculate the Hash value equal to be very small, the following look at the MPQ in the Hash Algorithm:

Function a, the following function to generate a length of 0x500 (co-decimal: 1280) of cryptTable [0x500]

void prepareCryptTable ()

{

unsigned long seed = 0x00100001, index1 = 0, index2 = 0, i;

for (index1 = 0; index1 <0x100; index1 + +)

{

for (index2 = index1, i = 0; i <5; i + +, index2 + = 0x100)

{

unsigned long temp1, temp2;

seed = (seed * 125 + 3)% 0x2AAAAB;

temp1 = (seed & 0xFFFF) <<0x10;

seed = (seed * 125 + 3)% 0x2AAAAB;

temp2 = (seed & 0xFFFF);

cryptTable [index2] = (temp1 | temp2);

}

}

}

Function of two, the following function calculates the hash value string lpszFileName, which dwHashType type for the hash, in the following function three, GetHashTablePos function call this function in two, which can take values 0,1,2; the function returns lpszFileName hash of the string:

unsigned long HashString (char * lpszFileName, unsigned long dwHashType)

{

unsigned char * key = (unsigned char *) lpszFileName;

unsigned long seed1 = 0x7FED7FED;

unsigned long seed2 = 0xEEEEEEEE;

int ch;

while (* key! = 0)

{

ch = toupper (* key + +);

seed1 = cryptTable [(dwHashType <<+ ch] ^ (seed1 + seed2);

seed2 = ch + seed1 + seed2 + (seed2 <<5) + 3;

}

return seed1;

}

Blizzard this is a very efficient algorithm, known as the "One-Way Hash" (A one-way hash is a an algorithm that is constructed in such a way that deriving the original string (set of strings, actually) is virtually impossible ). For example, the string "unitneutralacritter.grp" the results obtained by this algorithm is 0xA26067F3.

Is not the first algorithm improve the look, compare the string one by one into the Hash value on it, the answer is not enough, in order to get the fastest algorithm, it can not be compared one by one, usually construct a hash Table (Hash Table) to solve the problem, the hash table is a large array, the array capacity is defined according to the requirements of the program, such as 1024, the Hash value of each by the modulo operator (mod) corresponds to a location in the array, In this way, compare the string as long as corresponding to the position of the hash value is not occupied, you can get the final result, and think this is what speed? Yes, it is the fastest O (1), it is now a closer look at the algorithm:

typedef struct

{

int nHashA;

int nHashB;

char bExists;

......

} SOMESTRUCTRUE;

A possible structure is defined?

Function of three, the following function in the Hash table to find the existence of the target string, a return Hash value of string to find, if any, return -1.

int GetHashTablePos (har * lpszString, SOMESTRUCTURE * lpTable)

/ / LpszString Hash table to find the string, lpTable Hash value for storage string Hash table.

{

int nHash = HashString (lpszString); / / call the function two, return to the search string lpszString Hash value.

int nHashPos = nHash% nTableSize;

if (lpTable [nHashPos]. bExists & &! strcmp (lpTable [nHashPos]. pString, lpszString))

{/ / Hash value if found in the table exists, and the string to find the location of the table corresponding to the same string,

return nHashPos; / / return the call function two, find the value of the Hash

}

else

{

return -1;

}

}

See this, I would like to wish everyone a very serious question: "If the two strings in the hash table corresponding to the position of how to do the same?", After all, an array capacity is limited, this possibility is very . Many ways to solve the problem, my first thought is to use "list", thanks to university to learn the data structure of this church hundred test Braun's magic, I met a lot of algorithms can be transformed into a linked list to solve, as long as the Kazakh Xi table linked to a linked list for each entry, save all the corresponding string on OK. This thing seems to have a perfect ending, if it is to me alone to solve the problem, then I might begin to define data structures and then write code.

Blizzard's programmers, however, the method used is more sophisticated methods. Basic principle is this: they are not in the hash table but with a hash value to verify the hash value with three strings.

MPQ file Mingha Xi using the internal table to keep track of all files. However, the format of the table with the normal number of different hash table. First, it does not use the hash as the next standard, the actual file name is stored in the table used for authentication, it does not actually store the file name. But the use of three different hash: a hash table for the next standard, two for validation. These two validation hash instead of the actual file name.

Of course, this will still be two different files Ming Haxi to three the same hash. However, the average probability of this happening is: 1:18889465931478580854784, the probability for any people who should all be small enough. Now go back to the data structure, Blizzard did not use the hash table using a linked list, while the use of "extended" way to solve the problem, look at this algorithm:

Function of four, lpszString order to find the hash string; lpTable string hash value for the stored hash table; nTableSize for the hash table length:

int GetHashTablePos (char * lpszString, MPQHASHTABLE * lpTable, int nTableSize)

{

const int HASH_OFFSET = 0, HASH_A = 1, HASH_B = 2;

int nHash = HashString (lpszString, HASH_OFFSET);

int nHashA = HashString (lpszString, HASH_A);

int nHashB = HashString (lpszString, HASH_B);

int nHashStart = nHash% nTableSize;

int nHashPos = nHashStart;

while (lpTable [nHashPos]. bExists)

{

/ * If the only time in the table to determine the existence of this string, it compares the two hash values can be, not on

* Structure of the string. This will speed up the run rate? Hash table to reduce the space? This

* What methods are generally used in situations? * /

if (lpTable [nHashPos]. nHashA == nHashA

& & LpTable [nHashPos]. NHashB == nHashB)

{

return nHashPos;

}

else

{

nHashPos = (nHashPos + 1)% nTableSize;

}

if (nHashPos == nHashStart)

break;

}

return -1;

}

The procedure explained:

1 to calculate the hash value of the string of three (one is used to determine the location, the other two for validation)

2. Look at this position in the hash table

3 hash table is empty in this position do? If empty, then certainly the string does not exist, returns -1.

4. If there is, then check whether the other two hash values match, if the match, then find the string, return the Hash value.

5 move to the next location, if you have moved to the end of the table, the anti-around to the table to the query from the beginning

6. See if it is returned to its original position, if so, return not found

7 back to 3

ok, this is mentioned in this article the fastest Hash table algorithms. What? Not fast enough?: D. Welcome, you criticize the correction.

--------------------------------------------

Add 1, a simple hash function:

/ * Key is a string, nTableLength for the length of the hash table

* The value of the hash function is more evenly distributed * /

unsigned long getHashIndex (const char * key, int nTableLength)

{

unsigned long nHash = 0;

while (* key)

{

nHash = (nHash <<5) + nHash + * key + +;

}

return (nHash% nTableLength);

}

Supplement 2, a complete test program:

Hash table is an array of fixed length, if too much is wasted, if too small, does not embody efficiency. Appropriate array size is the key to the performance of the hash table. Hash table size is the best is a prime number. Of course, depending on the amount of data, there will be a different hash table size. For the amount of data the application has been uneven, the best design is to use dynamic variable size hash table, then if you find the hash table size is too small, such as one of the elements is twice the size of the hash table, the We need to expand the hash table size, is generally doubled.

Here is the hash table size of the possible values:

17, 37, 79, 163, 331,

673, 1361, 2729, 5471, 10949,

21911, 43853, 87719, 175447, 350899,

701819, 1403641, 2807303, 5614657, 11229331,

22,458,671, 44,917,381, 89,834,777, 179,669,557, 359,339,171,

718678369, 1437356741, 2147483647

Following the complete source code for the program has been tested in linux:

view plain

# Include <stdio.h>

# Include <ctype.h> / / Thank you citylove correction.

/ / CrytTable [] which holds the HashString function inside some of the data will be used in prepareCryptTable

/ / Initialization function which

unsigned long cryptTable [0x500];

/ / The following function generates a length of 0x500 (co-decimal: 1280) of cryptTable [0x500]

void prepareCryptTable ()

{

unsigned long seed = 0x00100001, index1 = 0, index2 = 0, i;

for (index1 = 0; index1 <0x100; index1 + +)

{

for (index2 = index1, i = 0; i <5; i + +, index2 + = 0x100)

{

unsigned long temp1, temp2;

seed = (seed * 125 + 3)% 0x2AAAAB;

temp1 = (seed & 0xFFFF) <<0x10;

seed = (seed * 125 + 3)% 0x2AAAAB;

temp2 = (seed & 0xFFFF);

cryptTable [index2] = (temp1 | temp2);

}

}

}

/ / The following function calculates the hash value string lpszFileName, which dwHashType for the hash type,

/ / In the following GetHashTablePos function which calls this function, which can take values 0,1,2; the function

/ / Return lpszFileName hash value string;

unsigned long HashString (char * lpszFileName, unsigned long dwHashType)

{

unsigned char * key = (unsigned char *) lpszFileName;

unsigned long seed1 = 0x7FED7FED;

unsigned long seed2 = 0xEEEEEEEE;

int ch;

while (* key! = 0)

{

ch = toupper (* key + +);

seed1 = cryptTable [(dwHashType <<+ ch] ^ (seed1 + seed2);

seed2 = ch + seed1 + seed2 + (seed2 <<5) + 3;

}

return seed1;

}

/ / In the main test argv [1] of the three hash values:

/ /. / Hash "arr / units.dat"

/ /. / Hash "unit / neutral / acritter.grp"

int main (int argc, char ** argv)

{

unsigned long ulHashValue;

int i = 0;

if (argc! = 2)

{

printf ("please input two arguments / n");

return -1;

}

/ * Initialize the array: crytTable [0x500] * /

prepareCryptTable ();

/ * Print array crytTable [0x500] which the value * /

for (; i <0x500; i + +)

{

if (i% 10 == 0)

{

printf ("/ n");

}

printf ("%-12X", cryptTable [i]);

}

ulHashValue = HashString (argv [1], 0);

printf ("/ n ----% X ----/ n", ulHashValue);

ulHashValue = HashString (argv [1], 1);

printf ("----% X ----/ n ", ulHashValue);

ulHashValue = HashString (argv [1], 2);

printf ("----% X ----/ n ", ulHashValue);

return 0;

}

Thanks:

1, http://blog.redfox66.com/.

2, http://blog.csdn.net/wuliming_sc/. End.