Semi-structured data (semi-structured data), data that has more structure than ordinary plain text but less than data in a relational database, which is governed by a strict theoretical model. Web pages carry some structure, expressed through HTML tags, but that structure is weak. Data on the Web is therefore called semi-structured, and this is one of its basic characteristics.
Boolean model (boolean model), a term with different meanings in different information retrieval contexts. For user queries, it means that the final result set is obtained by applying Boolean operations to the result sets of the query's individual components. For the vector space model of documents, it means that each component of a document vector takes only the values 1 or 0, indicating whether the corresponding feature term appears in the document.
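Both senses of the Boolean model can be illustrated with a minimal Python sketch. The document numbers, the vocabulary, and the helper name `binary_vector` are hypothetical and chosen only for illustration.

```python
# Sense 1: combine per-term result sets with Boolean operations.
# The document numbers below are made-up illustration data.
results_search = {1, 2, 5}   # documents containing "search"
results_engine = {2, 5, 7}   # documents containing "engine"

and_result = results_search & results_engine   # search AND engine
or_result = results_search | results_engine    # search OR engine


# Sense 2: a document vector whose components are only 1 or 0.
def binary_vector(doc_terms, vocabulary):
    """Return a 0/1 vector: 1 if the vocabulary term occurs in the document."""
    present = set(doc_terms)
    return [1 if term in present else 0 for term in vocabulary]


vocabulary = ["engine", "index", "search", "web"]
vector = binary_vector(["web", "search", "web"], vocabulary)
```

Note that the repeated occurrence of "web" does not change the vector: the Boolean representation records presence only, not frequency.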
Recall (recall), a measure of the quality of a retrieval system: the number of relevant documents the system retrieves, as a percentage of the total number of documents relevant to the query.
Query (query), an expression of an information need that the user enters in the language and syntax defined by the information system. A common form consists of several keywords joined by Boolean connectives.
Precision (precision), a measure of the quality of a retrieval system: the number of relevant documents the system retrieves, as a percentage of all documents it retrieves. It reflects the "correctness" of the search results.
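Precision and recall can be computed together from the set of retrieved documents and the set of relevant documents. The following is a minimal sketch; the function name `precision_recall` and the document numbers are illustrative assumptions, and the values are returned as fractions rather than percentages.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) as fractions between 0 and 1."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 5 documents relevant, 2 in common.
precision, recall = precision_recall([1, 2, 3, 4], [2, 4, 6, 8, 10])
```

Here 2 of the 4 retrieved documents are relevant (precision 0.5), and 2 of the 5 relevant documents were retrieved (recall 0.4).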
Dictionary (vocabulary), the set of all distinct terms in a document (or document collection).
Term frequency (term frequency, tf, or TF), TF(i, j) is the number of occurrences of term ti in document dj.
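Term frequency amounts to counting tokens. A minimal sketch, assuming the document has already been split into tokens on whitespace; the sample sentence is illustrative.

```python
from collections import Counter


def term_frequencies(tokens):
    """Map each term to the number of times it occurs in one document."""
    return Counter(tokens)


tf = term_frequencies("the web is a web of pages".split())
```

With this sample, `tf["web"]` is 2 and a term that never occurs, such as `tf["index"]`, is 0.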
Agent (agent), a program, process, or part of a system that receives a user's request, carries out the task on the user's behalf, and returns the results without requiring user supervision. In information retrieval, an agent searches archives or repositories for keywords and related content on behalf of the user; it is sometimes called an intelligent agent.
Inverted file (inverted file), a way of organizing and indexing documents to facilitate retrieval. The index is built on a set of keywords: each keyword corresponds to a list of records, and each record contains a document number, the number of times the keyword occurs in that document, and other information.
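The structure described above can be sketched in a few lines of Python. This is a simplified illustration, assuming whitespace tokenization and records that hold only a document number and an occurrence count; the name `build_inverted_file` and the sample documents are hypothetical.

```python
from collections import Counter, defaultdict


def build_inverted_file(docs):
    """Map each keyword to a list of (document number, occurrence count) records."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        counts = Counter(docs[doc_id].lower().split())
        for term, count in counts.items():
            index[term].append((doc_id, count))
    return dict(index)


docs = {1: "web search engine", 2: "web page web crawler"}
index = build_inverted_file(docs)
```

Looking up `index["web"]` yields the records `(1, 1)` and `(2, 2)`: the keyword occurs once in document 1 and twice in document 2. A production index would also store positions and use compressed on-disk lists.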
Inverse document frequency (inverse document frequency, idf, or IDF), usually IDF(ti) = log(N / ni), where N is the total number of documents and ni is the number of documents among the N that contain term ti.
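A minimal sketch of the usual form IDF(ti) = log(N / ni). The natural logarithm is an assumption here, since the entry does not state a base, and the function name `idf` and sample documents are illustrative.

```python
import math


def idf(term, docs):
    """Return log(N / n_i) for a term over a list of per-document term sets."""
    n_i = sum(1 for terms in docs if term in terms)  # documents containing the term
    if n_i == 0:
        raise ValueError(f"term {term!r} occurs in no document")
    return math.log(len(docs) / n_i)


# Four hypothetical documents, each reduced to its set of terms.
docs = [{"web", "search"}, {"web", "page"}, {"index"}, {"web"}]
common = idf("web", docs)    # appears in 3 of 4 documents
rare = idf("index", docs)    # appears in 1 of 4 documents
```

The rarer term receives the larger value, which is the point of the measure: terms found in few documents discriminate better between documents.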
Dynamic page (dynamic Web page), a Web page whose information can be obtained only by submitting a query.
Dynamic summary (dynamic abstract), a way of producing document summaries. In a search engine, when responding to a user query, the system locates the query terms in the document, extracts the text surrounding them, and returns it to the user. Because a document may contain different query terms, dynamic summarization may produce different summaries for the same document.
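A dynamic abstract can be approximated by extracting a window of words around each query-term occurrence. The sketch below is a simplification under stated assumptions: whitespace tokenization, a fixed window size, and the hypothetical function name `dynamic_abstract`; real systems typically also score and rank candidate passages.

```python
def dynamic_abstract(text, query_terms, window=3):
    """Join the text windows that surround each query-term occurrence."""
    words = text.split()
    lowered = [w.lower().strip('.,;:!?"') for w in words]
    targets = {t.lower() for t in query_terms}
    snippets = []
    covered_until = -1  # last word index already included in a snippet
    for i, word in enumerate(lowered):
        if word in targets and i > covered_until:
            start = max(0, i - window)
            end = min(len(words), i + window + 1)
            snippets.append(" ".join(words[start:end]))
            covered_until = end - 1
    return " ... ".join(snippets)


text = ("A search engine gathers pages from the Web and builds "
        "an inverted index for fast retrieval")
summary_for_index = dynamic_abstract(text, ["index"], window=2)
summary_for_pages = dynamic_abstract(text, ["pages"], window=2)
```

The same document yields "an inverted index for fast" for the query "index" and "engine gathers pages from the" for the query "pages", which shows how the summary depends on the query.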
Bag-of-words assumption (shared bag of words), the most basic assumption in information retrieval technology: that the meaning of a document can be expressed by the set of keywords it contains.
HTML (hypertext markup language), one of the key technologies of the Web. It is an ASCII-based format that provides a standard way of expressing hypertext documents.
Cache (cache), a concept that appears throughout computer science. Its basic meaning is a mechanism that exploits the locality principle to match the speeds of two components that operate at different rates. A cache can sit between the CPU and RAM, or between an application's I/O operations and the disk. In search engines, to ease the mismatch between the high rate of query requests and the low speed of disk access, several in-memory caches are commonly designed, including the query cache, the click cache, and the inverted list cache.
Static page (static Web page), a Web page whose information can be obtained without submitting a query.
Mirror page (mirror Web page), a page whose content is exactly the same as that of another page, without any modification.
Locality principle (locality principle), a property of program behavior. It comprises temporal locality and spatial locality. The former means that data that has been accessed is likely to be accessed again in the near future; the latter means that once a data item has been accessed, data at adjacent locations is likely to be accessed as well.
Denial of service attack (denial of service, DoS), an attack that floods a Web server with requests, consuming network bandwidth or system resources until the network or system is overloaded and paralyzed and stops providing normal network services.
Link analysis (link analysis), the links between pages on the Web can be viewed as a huge directed graph; link analysis refers to techniques that use this link information to judge the importance (or relevance) of Web pages. Useful link information includes a page's in-degree, out-degree, and anchor text. Common link analysis algorithms include PageRank, HITS, SALSA, PHITS, and Bayesian.
MD5 (message digest 5), an algorithm for computing a digest of a message. MD5 is defined in RFC 1321. Its basic function is to transform a message of arbitrary length into a 128-bit digest. The probability that two different messages produce the same digest is very small, and the similarity between two digests is unrelated to the similarity between the two messages.
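The properties listed for MD5 can be observed with Python's standard `hashlib` module. This is a small illustration; the helper name `md5_digest` and the sample messages are assumptions.

```python
import hashlib


def md5_digest(message: bytes) -> str:
    """Return the 128-bit MD5 digest of a message as 32 hexadecimal characters."""
    return hashlib.md5(message).hexdigest()


# Two messages that differ by a single character.
digest_a = md5_digest(b"search engine")
digest_b = md5_digest(b"search engines")
empty_digest = md5_digest(b"")
```

Each digest is 32 hexadecimal characters (128 bits) regardless of message length, and the two nearly identical messages produce unrelated digests. This is why a digest is suitable for detecting exact copies but not for measuring how similar two messages are.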
Anchor text (anchor text), the descriptive text of a link in HTML, which hints to the reader at the nature or characteristics of the page the link points to. For example, if a page contains <a href="http://www.cctv.com">News Channel</a>, then "News Channel" is the anchor text of the link href="http://www.cctv.com" in that page.
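Anchor text can be extracted with Python's standard `html.parser` module. The sketch below is a minimal illustration that ignores nested anchors and anchors without an `href`; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collect (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_anchor_text(html):
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    return parser.links


links = extract_anchor_text(
    '<p>See <a href="http://www.cctv.com"> News Channel </a> today.</p>'
)
```

For the example in the entry, `links` contains the single pair `("http://www.cctv.com", "News Channel")`.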
Directory-type page (hub page), a page that provides many hyperlinks to authority pages. It is the counterpart of the authority page.
Zipf's law (Zipf's law), a law of word frequency distribution proposed by the American scholar G. K. Zipf in the 1940s. It can be stated as follows: count the frequency of each word in a long document, sort the words in descending order of frequency, and number them with natural-number ranks, so that the most frequent word has rank 1, the next most frequent has rank 2, and so on. If f denotes the frequency and r the rank, then f = C / r (C is a constant).
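The ranking procedure in Zipf's law can be sketched as follows. The function name `rank_frequency` is illustrative, and the token list is synthetic data constructed to fit f = C / r exactly; real text follows the law only approximately.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency) triples, most frequent word first."""
    counts = Counter(tokens)
    # Sort by descending frequency; break ties alphabetically for determinism.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(ordered, start=1)]


# Synthetic tokens: frequencies 6, 3, 2 for ranks 1, 2, 3.
tokens = ["a"] * 6 + ["b"] * 3 + ["c"] * 2
table = rank_frequency(tokens)
products = [rank * freq for rank, _, freq in table]
```

In this synthetic case the product of rank and frequency is 6 for every word, so C = 6.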
Word segmentation (word segmentation), used mainly in Chinese information processing: splitting a sentence into a sequence of words. For example, the Chinese phrase for "Network and Distributed Systems Laboratory" is segmented into the words for "network", "and", "distributed", "systems", and "laboratory".
Full-text retrieval (full text retrieval), a method (or level of capability) of text information retrieval, characterized by the fact that not only can every word that appears in a document be retrieved, but every occurrence of every word can be retrieved as well.
Authority page (authority page), a page whose content usually has a specific topic and that is linked to by many other pages. It is the counterpart of the directory-type page (hub page).
Hash table (hash table), a data structure that makes it easy to find information quickly. When a hash table is built, each data item is assigned an index code that behaves as if it were random. This randomness spreads the data more evenly across the table, which can greatly reduce the time needed for subsequent lookups.
Digital library (digital library), a collection of digital information objects, together with the methods for organizing and presenting these objects and the information technology that connects them to users. It includes services that help users locate, retrieve, and access the information objects.
Search engine (search engine, SE), a software system on the Web that gathers and discovers information on the Web according to some strategy, processes and organizes that information, and provides users with a Web information search service.
Index term carrier information (index term carrier), information about an index term that is conveyed by the HTML markup of a document, such as font and capitalization.
Stop words (stop word), words that appear in documents but carry little meaning, such as conjunctions, prepositions, and articles. For example, common English stop words include the, a, and it; common Chinese stop words include 是 ("is"), 的 (a possessive particle), and 地 (an adverbial particle).
Throughput (throughput), the amount of work a system completes per unit time. For a search engine, this means the maximum number of user queries the system can serve per unit time (per second).
URL (uniform resource locator), a protocol (or description specification) for locating information resources on the Internet. Web pages are usually described by URLs of the form "http://host/path/file.html", and FTP resources by URLs of the form "ftp://host/path/file".
URL domain name depth, the number of subdomains contained in the domain-name part of a page's URL.
URL directory depth, the number of directory levels in the part of a page's URL that remains after the domain name is removed, that is, in the localpath part of url = schema://host/localpath. If the URL is http://www.pku.edu.cn, the directory depth is 0; if it is http://www.pku.edu.cn/cs, the directory depth is 1.
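Directory depth can be computed with the standard `urllib.parse` module. This sketch assumes that every non-empty path segment counts as one level, which matches the two examples in the entry; whether a trailing file name should count is not stated in the entry, so that choice is an assumption. The function name `directory_depth` is illustrative.

```python
from urllib.parse import urlsplit


def directory_depth(url):
    """Count the non-empty segments of the local path of a URL."""
    path = urlsplit(url).path
    return len([segment for segment in path.split("/") if segment])


depth_root = directory_depth("http://www.pku.edu.cn")
depth_cs = directory_depth("http://www.pku.edu.cn/cs")
```

The two calls reproduce the depths 0 and 1 given in the entry.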
Page out-degree (page outdegree), for a Web page, the number of hyperlinks on it that point to other Web pages.
Page cleaning (noise reduction), the process of identifying and removing noise from a page, that is, removing content unrelated to the page's topic, such as advertising and copyright notices.
Page gatherer (gatherer), in the page collection subsystem, a process or thread that fetches a page given its URL. A collection subsystem usually runs multiple gatherers in parallel.
Page in-degree (page indegree), for a Web page, the number of hyperlinks across the entire Web that point to it.
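In-degree and out-degree can both be counted from a list of hyperlinks. A minimal sketch, assuming each link is given as a (source page, target page) pair; the page names and the function name `degrees` are hypothetical.

```python
from collections import Counter


def degrees(links):
    """Return (in_degree, out_degree) counters for a list of (source, target) links."""
    out_degree = Counter(source for source, _ in links)
    in_degree = Counter(target for _, target in links)
    return in_degree, out_degree


# Hypothetical link graph: a -> b, a -> c, b -> c.
links = [("a", "b"), ("a", "c"), ("b", "c")]
in_degree, out_degree = degrees(links)
```

Page "c" has in-degree 2 and out-degree 0, while page "a" has out-degree 2 and in-degree 0.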
Web collection subsystem (crawler system), in a search engine system, the component that exploits the links between HTML documents to fetch Web pages one by one by following the hyperlinks between them. Because it works by "crawling" along hyperlinks on the Web, the program is sometimes called a "spider". Crawler, spider, robot, and bot generally refer to the same thing.
Document object model (document object model, DOM), DOM converts an XML document into a collection of objects, and this object model can then be manipulated freely. The mechanism is also known as "random access", because any part of the data can be accessed at any time and then modified, deleted, or extended with new data.
Automatic document classification (automatic text categorization, ATC), using a computer program to determine which of a set of predefined document categories a given document belongs to.
FIFO (first in first out, FIFO), a page replacement algorithm that evicts the page that was loaded into main memory first, that is, the page that has resided in main memory the longest.
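FIFO replacement can be simulated with a queue of resident pages. The sketch below counts page faults for a reference string; the function name `fifo_page_faults` and the reference strings are illustrative assumptions.

```python
from collections import deque


def fifo_page_faults(reference_string, frame_count):
    """Count page faults when the oldest resident page is always evicted first."""
    if frame_count <= 0:
        raise ValueError("frame_count must be positive")
    frames = deque()   # resident pages, oldest on the left
    resident = set()   # same pages, for fast membership tests
    faults = 0
    for page in reference_string:
        if page in resident:
            continue   # hit: FIFO order is not affected by use
        faults += 1
        if len(frames) == frame_count:
            resident.remove(frames.popleft())   # evict the earliest-loaded page
        frames.append(page)
        resident.add(page)
    return faults


faults_small = fifo_page_faults([1, 2, 3, 1, 4, 2], 3)
faults_classic = fifo_page_faults([1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5], 3)
```

Note that a hit does not move a page within the queue: eviction depends only on load time, which is what distinguishes FIFO from usage-based policies.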
Relevance ranking (relevance ranking), the ordering of the results returned by an information retrieval system, in which the order of the entries reflects the relevance between each result and the query as judged by the system.
Vector space model (vector space model, VSM), under the bag-of-words assumption, a document collection has an overall term set Σ, and a document can be represented by a vector whose elements quantitatively describe how each term appears in the document. A document collection can then be viewed as a set of elements in a vector space, so that concepts such as distance in a vector space can be applied to examine, for example, the degree of similarity between two documents.
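One common way to compare two document vectors is cosine similarity. The entry speaks only of distance concepts in general, so the choice of cosine similarity and of raw term counts as vector elements are assumptions of this sketch, as is the function name.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-count vectors."""
    vec_a, vec_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0   # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


similar = cosine_similarity(["web", "search"], ["web", "page"])
unrelated = cosine_similarity(["web", "search"], ["digital", "library"])
```

Documents that share no terms score 0, identical documents score 1, and the pair sharing one of two terms scores 0.5.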
Response time (response time), in a computer system, the time between issuing a request (or query) and seeing the answer. For a search engine, it is the time the user experiences between submitting a query and receiving the results. In search engine practice, because this time varies with the state of the network, it is often approximated by the time the retrieval system takes to complete the query.
Duplicate elimination (replicas or near-replicas detection), the process of removing mirror pages and reprinted pages from the collected set of pages.
Protocol (protocol), a set of rules established so that functional units can communicate and operate in coordination.
Information retrieval (information retrieval, IR), the process of organizing and storing information in some way and finding the relevant information according to the user's needs.
Information retrieval model (IR model), a set of assumptions and algorithms for ranking a document collection in response to a user query. An IR model can be expressed as a four-tuple <D, Q, F, R(qi, dj)>, where D is a document collection, Q is a query set, F is a framework for modeling documents and queries, and R(qi, dj) is a ranking function that assigns a ranking value to the relevance between query qi and document dj. Commonly used information retrieval models include set-theoretic models, algebraic models, and probabilistic models.
User query log (user query log), information recorded automatically by the system when a user submits a query. It includes the query keywords, the submission time, the user's IP address, the page number, whether the query hit the cache, and other information. Query results are usually displayed in pages of 10 results each; the page number is 1 for the user's initial query, and when the user turns pages it is the number of the results page the user selected.
User click log (user hit log), information recorded automatically by the system when a user browsing query results clicks on a page. It usually includes the time of the click, the URL of the clicked page, the user's IP address, the rank of the clicked page (its position in the query results), the query terms that produced the results, and other information.
Metadata (meta data), data that describes the attributes of a type of resource (or object), supports locating and managing such resources, and helps with data retrieval.
Meta search engine (meta search engine), also known as an integrated search engine. It sends the user's query to multiple independent search engines, collects their results, selects and re-ranks them according to some algorithm, and returns the final result to the user.
Chinese information processing (Chinese information processing), using computers to process and operate on the sound, form, and meaning of the Chinese language. It covers technologies for the input, output, recognition, conversion, compression, storage, retrieval, analysis, understanding, and generation of characters, words, phrases, sentences, and texts.
Topic-specific collection (topic-specific/focused crawling), a topic-oriented information-gathering system. Its main task is to crawl as many pages closely related to the topic as possible, using limited network bandwidth and storage capacity and in less time.
Reprinted page (near-replicas Web page), a page whose content is essentially the same as that of another page but may carry some additional editorial information. Although the page has been changed somewhat, its subject matter has not: once the noise (such as advertising and copyright information) is removed, the remaining body content is the same. A reprinted page is also called an approximate mirror page.
Least frequently used (least frequently used, LFU), a replacement policy for maintaining the contents of a data cache. When the cache is full and new data arrives, it always evicts the existing item that has been used least frequently in the past. The granularity of the replaced data can be determined by the application.
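The LFU policy can be sketched with a dictionary of use counts. This is a deliberately simple illustration, not an efficient implementation: eviction scans all entries, ties go to the earliest-inserted item, and the class name `LFUCache` and the sample keys are assumptions.

```python
class LFUCache:
    """Evict the entry with the lowest use count when the cache is full."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.values = {}
        self.counts = {}   # key -> number of times the key was put or read

    def get(self, key, default=None):
        if key not in self.values:
            return default
        self.counts[key] += 1
        return self.values[key]

    def put(self, key, value):
        if key not in self.values and len(self.values) >= self.capacity:
            # Linear scan for the least frequently used key.
            victim = min(self.counts, key=self.counts.get)
            del self.values[victim]
            del self.counts[victim]
        self.values[key] = value
        self.counts[key] = self.counts.get(key, 0) + 1


cache = LFUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # "a" has now been used twice, "b" once
cache.put("c", 3)    # cache is full, so "b" is evicted
```

After these calls the cache holds "a" and "c"; "b" was evicted because it had the lowest use count.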
Least recently used (least recently used, LRU), a replacement policy for maintaining the contents of a data cache. When the cache is full and new data arrives, it always evicts the existing item that has gone unused for the longest time.
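The LRU policy maps naturally onto Python's `collections.OrderedDict`, which remembers insertion order and can move a key to the end on each use. A minimal sketch; the class name `LRUCache` and the sample query keys are illustrative assumptions.

```python
from collections import OrderedDict


class LRUCache:
    """Evict the entry that has gone unused for the longest time."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.entries = OrderedDict()   # least recently used entry first

    def get(self, key, default=None):
        if key not in self.entries:
            return default
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        elif len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # drop the least recently used
        self.entries[key] = value


cache = LRUCache(2)
cache.put("q1", "results for q1")
cache.put("q2", "results for q2")
cache.get("q1")                      # "q1" becomes the most recently used
cache.put("q3", "results for q3")    # cache is full, so "q2" is evicted
remaining = list(cache.entries)
```

After these calls `remaining` is `["q1", "q3"]`: reading "q1" protected it from eviction, so "q2" was removed instead.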