An Introduction to Lucene and How to Use It

Lucene is a Java-based full-text indexing toolkit.

  1. Introduction to Lucene, the Java-based full-text indexing engine: about the author and the history of Lucene
  2. The implementation of full-text search: comparing Lucene's full-text index with database indexes
  3. Chinese word segmentation: comparing dictionary-based and automatic segmentation algorithms
  4. Installation and use: an introduction to the system structure, with a demonstration
  5. Hacking Lucene: a simplified query analyzer, deleting records, custom sorting, and extending the application interfaces
  6. What else we can learn from Lucene

Also, if you are currently choosing a full-text engine, it may be time to try Sphinx: it is faster than Lucene, supports Chinese word segmentation, and has built-in support for simple distributed search.

Lucene: a Java-based full-text indexing/search engine

Lucene is not a complete full-text search application, but a full-text indexing engine toolkit written in Java that can easily be embedded into all kinds of applications to provide full-text indexing and search.

The author of Lucene: Lucene's contributor, Doug Cutting, is a veteran full-text indexing/search expert. He was the main developer of the V-Twin search engine (one of the achievements of Apple's Copland operating system) and later a senior system architect at Excite; he currently works on some of the underlying architecture of the Internet. His goal in contributing Lucene was to add full-text search capability to all kinds of small and medium-sized applications.

Lucene's development history: it was first published on the author's own site, then on SourceForge, and at the end of 2001 it became a sub-project of the Apache Jakarta project.

Many Java projects use Lucene as their back-end full-text indexing engine; the better known ones include:

  • Jive: a web forum system;
  • Eyebrows: an HTML mailing-list archiving/browsing/search system. The main reference for this article, "The Lucene Search Engine: Powerful, Flexible, and Free", was written by one of Eyebrows' main developers, and Eyebrows has since become the main mailing-list archiving system of the Apache project;
  • Cocoon: an XML-based web publishing framework whose full-text search part uses Lucene;
  • Eclipse: a Java-based open development platform; part of its help system's full-text indexing uses Lucene.

For Chinese users, the biggest concern is whether Lucene supports full-text retrieval of Chinese. From the discussion of Lucene's architecture later in this article, you will see that thanks to its good design, support for Chinese search can be added simply by extending Lucene's lexical-analysis interface.

How the full-text search mechanism is implemented

Lucene's API is a fairly generic input/output interface whose structure looks much like a database's table ==> record ==> field, so many traditional applications — files, databases, and so on — can be mapped onto Lucene's storage structure and interface quite easily. Overall, you can think of Lucene as a database system with first-class support for full-text indexing.

Comparing Lucene with a database:

  Lucene                                          Database
  Index data source: doc(field1,field2...)        Index data source: record(field1,field2...)
               \  indexer  /                                   \  SQL: insert /
              | Lucene Index |                                 |   DB Index   |
               /  searcher \                                   / SQL: select \
  Results output: Hits(doc(field1,field2)...)     Results output: results(record(field1,field2...)...)

  Document: the unit to be indexed;               Record: a record, made up of multiple
  a Document is made up of multiple Fields        fields
  Field: a field                                  Field: a field
  Hits: the query result set, made up of          RecordSet: the query result set, made up
  matching Documents                              of multiple Records

Full-text search ≠ like "%keyword%"

Thick books usually come with a keyword index at the back (e.g. Beijing: pages 12, 34; Shanghai: pages 3, 77 ...), which helps readers find the relevant pages much faster. The reason a database index can dramatically speed up queries is the same: imagine how much faster looking something up in the index at the back of a book is than flipping through the pages one by one. Another reason the index is efficient is that it is sorted. For a retrieval system, ranking is the core problem.

Because a database index is not designed for full-text search, the index simply does not work for like "%keyword%" queries: with LIKE, the search degenerates into a record-by-record traversal, just like flipping through every page of the book, so in a database-backed fuzzy-search service LIKE is devastating to performance. If several keywords must be matched — like "%keyword1%" and like "%keyword2%" ... — you can imagine the efficiency.

The key to building an efficient retrieval system is therefore to build an inverted index over the data source (say, a collection of articles), much like the index at the back of a book: alongside the articles stored in sorted order, keep a sorted list of keywords with a keyword ==> article mapping. Each index entry has the form [keyword ==> the IDs of the articles the keyword appears in, its occurrence count (possibly including positions: start offset, end offset), and its frequency]. Retrieval then becomes the process of turning a fuzzy query into a logical combination of several exact lookups that can use the index. This makes multi-keyword queries vastly more efficient, and so full-text search ultimately comes down, once again, to a ranking problem.
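As an illustration of the inverted-index idea, here is a minimal sketch in plain Java. The class and method names are hypothetical, and Lucene's real index adds sorted on-disk storage, compression, and frequencies; this only shows the keyword ==> documents mapping and how a multi-keyword query becomes an intersection of exact lookups:

```java
import java.util.*;

// Minimal inverted-index sketch: term -> (docID -> positions where it occurs).
public class InvertedIndex {
    private final Map<String, TreeMap<Integer, List<Integer>>> postings = new HashMap<>();

    // Index one document: record, for every term, the positions where it occurs.
    public void add(int docId, String text) {
        String[] terms = text.toLowerCase().split("\\s+");
        for (int pos = 0; pos < terms.length; pos++) {
            postings.computeIfAbsent(terms[pos], t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    // Multi-keyword query = intersection of the exact per-term lookups (AND semantics).
    public Set<Integer> search(String... terms) {
        Set<Integer> result = null;
        for (String term : terms) {
            Map<Integer, List<Integer>> docs = postings.get(term.toLowerCase());
            Set<Integer> ids = (docs == null) ? Collections.<Integer>emptySet() : docs.keySet();
            if (result == null) result = new TreeSet<>(ids);
            else result.retainAll(ids);
        }
        return (result == null) ? Collections.<Integer>emptySet() : result;
    }
}
```

Note that a query never scans the documents themselves; it only intersects the per-term posting lists, which is exactly what turns a fuzzy scan into a handful of exact index lookups.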

You can see that, compared with an exact database query, a fuzzy query is a very imprecise problem, which is why most databases offer only limited full-text support. The core feature of Lucene is that it implements, through a special index structure, the full-text indexing mechanism that traditional databases are not good at, and it provides extension interfaces that make it easy to customize for different applications.

You can compare it with a database's fuzzy query:

Index
  Lucene full-text indexing engine: builds an inverted index from the data source, record by record, through full-text indexing.
  Database: a LIKE query gets no help from a traditional index at all; the data must be traversed record by record with GREP-style fuzzy matching, which is orders of magnitude slower than an indexed lookup.

Matching
  Lucene: matches on word units (terms); through the linguistic-analysis interface, Chinese and other non-English languages can be supported.
  Database: like "%net%" will also match "netherlands"; a multi-keyword fuzzy query such as like "%com%net%" cannot match the words in reversed order.

Match ranking
  Lucene: has a ranking algorithm that puts results with a higher degree of match (similarity) first.
  Database: no control over degree of match: a record where "net" appears once and one where it appears 5 times rank the same.

Results output
  Lucene: a special algorithm outputs the 100 best-matching results first, and the result set is read in small buffered batches.
  Database: returns all matching records; when there are many (say tens of thousands), holding these temporary result sets requires a great deal of memory.

Customizability
  Lucene: through the linguistic-analysis interface, the indexing rules (including Chinese support) can easily be customized to the application's needs.
  Database: no interface, or an interface too complex to customize.

Conclusion
  Lucene: suits high-load fuzzy-query applications, where the fuzzy-matching rules are complex and the amount of indexed data is large.
  Database: suits low fuzzy-query volume, simple fuzzy-matching rules, or small amounts of data.

The biggest difference between full-text search and database applications: returning the most relevant results within the first 100 hits satisfies the needs of more than 98% of users.

Lucene's innovation:

Most search (and database) engines maintain a B-tree index, in which updates cause a large number of I/O operations. Lucene improves on this slightly: instead of maintaining a single index file, it keeps creating new small index files as the index grows, and periodically merges these small files into the original large index (the batch size can be tuned for different update strategies). This improves indexing efficiency without hurting search efficiency.
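The merge step can be illustrated with a minimal sketch (hypothetical names; Lucene's real segment merge works on term dictionaries and posting lists, not plain strings). Each small index is already sorted, so folding new segments into the large index is a purely sequential k-way merge rather than a scattered B-tree rewrite:

```java
import java.util.*;

// Sketch of Lucene-style segment merging: each "segment" is a sorted term list;
// merging produces one sorted list using only sequential reads and writes.
public class SegmentMerge {
    public static List<String> merge(List<List<String>> segments) {
        // Priority queue of (segment index, position); pull the smallest term each step.
        PriorityQueue<int[]> pq = new PriorityQueue<>(
            Comparator.comparing((int[] e) -> segments.get(e[0]).get(e[1])));
        for (int s = 0; s < segments.size(); s++)
            if (!segments.get(s).isEmpty()) pq.add(new int[]{s, 0});
        List<String> merged = new ArrayList<>();
        while (!pq.isEmpty()) {
            int[] e = pq.poll();
            String term = segments.get(e[0]).get(e[1]);
            // Skip duplicates so each term appears once in the merged index.
            if (merged.isEmpty() || !merged.get(merged.size() - 1).equals(term))
                merged.add(term);
            if (e[1] + 1 < segments.get(e[0]).size()) pq.add(new int[]{e[0], e[1] + 1});
        }
        return merged;
    }
}
```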

Comparing Lucene with some other full-text retrieval systems/applications:

Incremental vs. batch indexing
  Lucene: supports both incremental (append) indexing and batch indexing of large data sets; the interface design is optimized for batch indexing and small incremental updates.
  Other open-source retrieval systems: many support only batch indexing; sometimes even a small addition to the data source forces the whole index to be rebuilt.

Data source
  Lucene: does not prescribe a particular data source, only a document structure, so it adapts very flexibly to all kinds of applications (as long as a front-end converter turns the data source into that structure).
  Others: many systems handle only web pages and lack flexibility for documents in other formats.

Index content
  Lucene: a document is made up of multiple fields, and you can control which fields are indexed and which are not; indexed fields are further divided into tokenized fields (e.g. title, body) and untokenized ones (e.g. author and date fields).
  Others: often lack such control and index the whole document.

Language analysis
  Lucene: extensible by implementing different language analyzers: stop words ("an", "the", "of", ...) can be filtered out; Western-language analysis can reduce forms such as "jumps", "jumped", "jumper" to "jump" for indexing and search; non-English support covers indexing of Asian and Arabic languages.
  Others: lack a common interface.

Query analysis
  Lucene: through the query-analysis interface you can customize your own query-syntax rules, e.g. the +/- and AND/OR relations between multiple keywords.

Concurrent access
  Lucene: supports multi-user use.

On word segmentation in Asian languages (Word Segment)

For Chinese, a full-text index must first solve another problem of the language: in English, words in a sentence are naturally separated by spaces, but in the CJK languages of Asia the characters of a sentence run together with no separators. So the sentence must first be broken into "words" before indexing — and how to split out those words is a big problem.

First of all, a single character certainly cannot be the unit of indexing, or a search for "Shanghai" (上海) could not be prevented from matching text that merely contains the same characters in a different order, such as "海上" ("at sea").

But take the sentence "你好北京" ("Hello, Beijing"): how should the computer split it according to Chinese language habits — as "你好 / 北京" ("hello / Beijing"), or at some other character boundaries? To segment a sentence the way the language is actually used, the machine generally needs a fairly rich dictionary before it can identify the words in a sentence accurately.
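A classic dictionary-based method is forward maximum matching: at each position, take the longest dictionary word that starts there, falling back to a single character when nothing matches. This is a minimal sketch with a hypothetical class name; real segmenters are far more sophisticated:

```java
import java.util.*;

// Sketch of dictionary-based forward maximum matching: at each position take
// the longest dictionary word; fall back to a single character on no match.
public class MaxMatchSegmenter {
    public static List<String> segment(String text, Set<String> dict) {
        int maxLen = 1;
        for (String w : dict) maxLen = Math.max(maxLen, w.length());
        List<String> words = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxLen);
            // Try the longest candidate first, shrinking until a dictionary hit
            // or a single character remains.
            while (end > i + 1 && !dict.contains(text.substring(i, end))) end--;
            words.add(text.substring(i, end));
            i = end;
        }
        return words;
    }
}
```

The quality of the result depends entirely on the dictionary, which is exactly the maintenance cost discussed below.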

Another solution is an automatic segmentation algorithm: cut the text into overlapping two-character tokens according to a bigram (2-gram) rule, for example:

"你好北京" ==> "你好 好北 北京"

This way, at query time, whether the user queries "北京" ("Beijing") or "你好" ("hello"), the query string is cut by the same segmentation rules — "北京", "你好" — and the keywords are combined with an AND relation, so they map correctly onto the same index. This approach applies equally to other Asian languages such as Korean and Japanese.

The biggest advantage of automatic segmentation is that there is no dictionary to maintain and it is simple to implement; the drawback is low indexing efficiency. But for small and medium applications, bigram-based segmentation is good enough: a bigram-based index is roughly the same size as the source text, whereas for English the index file is generally only 30-40% of the source size.
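A bigram cutter is only a few lines. This sketch (hypothetical class name, not Lucene's own analyzer) shows why applying the same rule at index time and query time makes the tokens line up:

```java
import java.util.*;

// Sketch of 2-gram (bigram) segmentation: emit every overlapping pair of
// adjacent characters, so "ABCD" -> "AB", "BC", "CD".
public class BigramTokenizer {
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++)
            tokens.add(text.substring(i, i + 2));
        return tokens;
    }
}
```

Both the indexed sentence and the query string pass through the same tokenize method; matching then reduces to an AND over the query's bigrams, with no dictionary to maintain.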

Comparing automatic segmentation with dictionary-based segmentation:

Implementation
  Automatic: very simple to implement.
  Dictionary-based: complex.

Query
  Automatic: adds complexity to query analysis.
  Dictionary-based: can support more complex query-syntax rules.

Storage efficiency
  Automatic: redundant index, almost as large as the source text.
  Dictionary-based: efficient; the index is about 30% of the source size.

Maintenance cost
  Automatic: no dictionary to maintain.
  Dictionary-based: very high dictionary-maintenance cost (Chinese, Japanese, Korean and other languages each need their own dictionary), which must also include word frequencies and so on.

Applicable domains
  Automatic: embedded systems with limited resources; distributed systems (no dictionary-synchronization problems); multilingual environments.
  Dictionary-based: professional search engines with high requirements on query and storage efficiency.

Most large search engines today combine these two mechanisms for word segmentation. For more on Chinese segmentation algorithms, search Google for the keywords "word segmentation search".

Installation and use


Note: some of Lucene's more complex lexical analysis is generated with JavaCC (JavaCC: Java Compiler Compiler, a pure-Java generator of lexical analyzers), so if you build from source, or need to modify the QueryParser or write a custom lexical analyzer, you will also need to download JavaCC.

Lucene's makeup: for external applications, the indexing module (index) and the retrieval module (search) are the main entry points:

org.apache.Lucene.search / search entry point
org.apache.Lucene.index / index entry point
org.apache.Lucene.analysis / language analyzers
org.apache.Lucene.queryParser / query analyzer
org.apache.Lucene.document / storage structure
org.apache.Lucene.store / underlying I/O and storage structure
org.apache.Lucene.util / some common data structures

A simple example shows how to use Lucene:

The indexing process: read file names from the command line, store two fields for each file — its path (the path field) and its contents (the body field) — and full-text index the contents. The unit of indexing is the Document object; each Document contains multiple Field objects, and depending on how each field's data needs to be indexed and returned, a different indexing/storage rule can be chosen per field, as listed below:

Method                                      Tokenized  Indexed  Stored  Use
Field.Text(String name, String value)       Yes        Yes      Yes     tokenized, indexed, and stored: e.g. title, body
Field.Text(String name, Reader value)       Yes        Yes      No      tokenized and indexed but not stored: e.g. META
                                                                        data that must be searchable but is not returned
                                                                        for display
Field.Keyword(String name, String value)    No         Yes      Yes     indexed without tokenizing, and stored: e.g. date fields
Field.UnIndexed(String name, String value)  No         No       Yes     not indexed, only stored: e.g. file path
Field.UnStored(String name, String value)   Yes        Yes      No      full-text indexed only, not stored
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class IndexFiles {
  // Usage: IndexFiles [index output directory] [files to index] ...
  public static void main(String[] args) throws Exception {
    String indexPath = args[0];
    IndexWriter writer;
    // Construct a new index writer with the given language analyzer
    // (3rd parameter: true creates a new index, false appends to an existing one)
    writer = new IndexWriter(indexPath, new SimpleAnalyzer(), false);

    for (int i = 1; i < args.length; i++) {
      System.out.println("Indexing file " + args[i]);
      InputStream is = new FileInputStream(args[i]);

      // Build a Document containing two Fields:
      // "path" holds the file path: stored only, not indexed
      // "body" holds the contents: full-text indexed (the Reader variant is not stored)
      Document doc = new Document();
      doc.add(Field.UnIndexed("path", args[i]));
      doc.add(Field.Text("body", (Reader) new InputStreamReader(is)));
      // Write the document into the index
      writer.addDocument(doc);
      is.close();
    }
    // Close the index writer
    writer.close();
  }
}

From the indexing process we can see:

  • The language analyzer is provided as an abstract interface, so the linguistic analysis (Analyzer) is customizable. Lucene ships with two fairly general analyzers, SimpleAnalyzer and StandardAnalyzer, but neither supports Chinese by default, so to add Chinese segmentation rules these two analyzers need to be modified.
  • Lucene does not prescribe the format of the data source; it only provides a common structure (the Document object) to accept input for indexing, so the data source can be anything — a database, Word documents, PDF files, HTML pages ... — as long as a suitable converter parses the data source into a Document object.
  • For batch indexing of large data sets, you can also tune the IndexWriter's file-merge frequency (mergeFactor) to improve batch-indexing efficiency.

Retrieval and display of results:

A search returns a Hits object, through which you can then access the Document ==> Field contents.

Assuming the full-text search runs against the body field, the following prints each result's path field together with its query match score:

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;

public class Search {
  public static void main(String[] args) throws Exception {
    String indexPath = args[0], queryString = args[1];
    // Point the searcher at the index directory
    Searcher searcher = new IndexSearcher(indexPath);
    // Query parser: use the same language analyzer as at index time
    Query query = QueryParser.parse(queryString, "body", new SimpleAnalyzer());
    // Run the search; the results are held in a Hits object
    Hits hits = searcher.search(query);
    // Access each hit's field data and its match score
    for (int i = 0; i < hits.length(); i++) {
      System.out.println(hits.doc(i).get("path") + "; Score: " + hits.score(i));
    }
    searcher.close();
  }
}
Throughout the search process, the language analyzer, the query parser, and even the searcher (Searcher) all provide abstract interfaces and can be customized as needed.
Hacking Lucene

Simplified Query Analyzer

Personally I feel that since Lucene became a Jakarta project, far too much time has gone into debugging an increasingly complex QueryParser, most of whose features are unfamiliar to most users. Currently Lucene supports the following syntax:

Query ::= ( Clause )*
Clause ::= ["+", "-"] [<TERM> ":"] ( <TERM> | "(" Query ")" )

Among other things it supports logical operators — AND, OR, +, -, &&, || — as well as phrase queries and prefix/fuzzy queries for Western languages. Personally I feel these features are somewhat flashy for ordinary applications; in fact a query analyzer similar to Google's is enough for most users. So the QueryParser from an earlier version of Lucene may still be the better choice.

Adding, modifying, and deleting specified records (Document)

Lucene provides a mechanism for extending the index, so dynamically growing the index is no problem, but modifying a specified record apparently can only be done by deleting the record and re-adding it. How do you delete a specified record? It is quite simple: include the record's ID from the data source as a dedicated field when indexing, then use the IndexReader.delete(Term term) method to remove the corresponding Document by that record ID.

Sorting by the value of a field

By default Lucene sorts results by its own relevance algorithm (score), but sorting results by other fields is a frequently asked question on the Lucene development mailing list: many applications originally built on databases need sorting by something other than the match score. From the principles of full-text search we can understand that any search not driven by the index itself will be very inefficient; if sorting requires access to a stored field during the search, the speed drops dramatically, so that approach is not advisable.

But there is a compromise: the only parameters that influence the ordering of search results and are already stored in the index are docID and score. So, to sort by something other than score, you can pre-sort the data source before indexing and then sort results by docID. This avoids re-sorting the results outside Lucene after the fact, and avoids accessing any stored field value during the search.

What needs modifying is the HitCollector in IndexSearcher's search process:

 scorer.score(new HitCollector() {
        private float minScore = 0.0f;
        public final void collect(int doc, float score) {
          if (score > 0.0f &&                     // ignore zeroed buckets
              (bits == null || bits.get(doc))) {  // skip docs not in bits
            if (score >= minScore) {
              /* Original: Lucene puts the docID and its match score into the result hit list:
               *   hq.put(new ScoreDoc(doc, score));  // update hit queue
               * Using doc or 1/doc instead of score orders the results by docID,
               * ascending or descending.
               * If the data source was sorted by some field before indexing, sorting
               * results by docID effectively sorts them by that field -- or even by
               * more complex combinations of score and docID. */
              hq.put(new ScoreDoc(doc, (float) 1 / doc));
              if (hq.size() > nDocs) {            // if hit queue overfull
                hq.pop();                         // remove lowest in hit queue
                minScore = ((ScoreDoc)hq.top()).score; // reset minScore
              }
            }
          }
        }
      }, reader.maxDoc());

More general input/output interfaces

Although Lucene does not define a fixed input document format, more and more people are thinking of using a standard intermediate format as Lucene's data-input interface; other data, such as PDF, then only needs to be run through a parser that converts it to the intermediate format to be indexed. The intermediate format is mostly XML-based, and there are already some 4 or 5 similar implementations:

  Data source: WORD   PDF   HTML   DB   other
                  \     |     |    |    /
                   XML intermediate format
                            |
                      Lucene INDEX

There is currently no parser for MS Word documents: unlike RTF and the various ASCII-based document formats, Word documents can only be parsed through COM objects. This is the relevant information I found on Google:
Another approach is to convert the Word document to text first:

Optimizing the indexing process

Indexing generally falls into two cases: small incremental additions to an index, and batch reconstruction of a large index. During indexing, Lucene does not rewrite the index file every time a new DOC is added (file I/O is a very resource-consuming operation).

Instead, Lucene first builds the index in memory and writes it out to file in batches. The larger the batch interval, the fewer the file writes, but the more memory it takes; conversely, less memory is used but file I/O becomes frequent and indexing gets very slow. IndexWriter has a MERGE_FACTOR parameter that helps you, after constructing the indexer, make full use of memory and reduce file operations according to your application environment. In my experience, the default flushes the index once every 20 records; raising MERGE_FACTOR to 50 roughly doubles indexing speed.

Optimizing search

Lucene supports in-memory indexes: searching them is orders of magnitude faster than file-based I/O.
Meanwhile, minimizing the creation of IndexSearcher objects and caching the search results up front is also necessary.

Lucene's full-text search is optimized for returning the top results first: it does not read out the full details of all matching records; it only puts the IDs of the 100 best-matching results (TopDocs) into a cache and returns that result set. Compare this with a database: given a result set of 10,000 records, a database must materialize all of them before returning anything to the application. So even when the total number of matches is huge, Lucene's result set does not take much memory. For typical fuzzy-search applications, there is no need for so many results; the top 100 already satisfy more than 90% of needs.

If, after exhausting the first cache, the application still wants to read further results, the Searcher searches again and generates a cache twice the size of the previous one, crawling the results again. So if you construct one Searcher and ask for results 1-120, the Searcher actually performs the search twice: after the first 100, once the cache is exhausted, it re-searches and builds a 200-entry cache; and so on with a 400-entry cache, then an 800-entry cache. Since each cache becomes inaccessible when its Searcher object is gone, you may want to cache the result records yourself, keeping reads within the first 100 as far as possible to exploit the first cache, so that Lucene does not waste several searches, and the paged results can be cached as well.
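The doubling behavior described above can be sketched as follows (hypothetical names; in Lucene 1.x this caching lives inside the Hits object). A pager that re-runs the search with a doubled window whenever the caller reads past the cached results behaves the same way:

```java
import java.util.*;
import java.util.function.IntFunction;

// Sketch of Hits-style result caching: the first search caches only the top 100
// result IDs; reading past the cache re-runs the search with a doubled window
// (200, 400, 800, ...).
public class DoublingResultCache {
    private final IntFunction<List<Integer>> searchTopN; // re-runnable "search": n -> top-n IDs
    private List<Integer> cache = Collections.emptyList();
    private int window = 0;   // size of the last requested window
    private int searches = 0; // how many times the underlying search ran

    public DoublingResultCache(IntFunction<List<Integer>> searchTopN) {
        this.searchTopN = searchTopN;
        grow(100); // the first search caches the top 100
    }

    private void grow(int n) {
        searches++;
        window = n;
        cache = searchTopN.apply(n);
    }

    public int get(int i) {
        int n = 100;
        while (i >= n) n *= 2;   // smallest window of 100 * 2^k covering i
        if (n > window) grow(n); // cache exhausted: search again with doubled window
        return cache.get(i);
    }

    public int searchCount() { return searches; }
}
```

Reading results 0-99 costs a single search, while reading result 119 forces a second search for a 200-entry window: this is why keeping reads within the first 100 avoids wasted searches.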

Another characteristic of Lucene is that, while collecting results, it automatically filters out matches with low relevance. This, too, differs from database applications, which must return all search results.

Some of my own attempts:

  • A Chinese-capable Tokenizer: there are two versions. One is generated with JavaCC and indexes one CJK character per token; the other is rewritten from SimpleTokenizer, keeps English letters and digits as tokens, and indexes Chinese characters iteratively.
  • An XML-based data-source indexer, XMLIndexer: as long as all data sources are converted into XML conforming to a specified DTD, XMLIndexer can index them.
  • Sorting by a field by relying on the order of the index: IndexOrderSearcher. If you need search results sorted by a field, sort the data source by that field first (e.g. PriceField) and then index it; then, using this search engine that retrieves records in ID order, the results are effectively sorted by that field.

What more we can learn from Lucene

Lucene really is a model of object-oriented design:

  • Every problem gets an extra layer of abstraction, for the sake of later extension and reuse: you can reimplement any part for your own ends while leaving the other modules untouched;
  • The application entry points Searcher and Indexer are simple, and coordinate calls to a set of underlying components to carry out the indexing/search task;
  • Every object has a very specific task: in the search process, for example, the QueryParser turns the query statement into a combination of a series of exact sub-queries (Query); the IndexReader reads the index through the low-level index structure; and the corresponding scorer scores and ranks the search results. All modules are highly atomic, so any of them can be reimplemented without modifying the others.
  • Besides the flexible design of the application interfaces, Lucene also provides language-analyzer implementations suited to most applications (SimpleAnalyser, StandardAnalyser), which is one of the important reasons new users get started quickly.

These advantages are well worth studying and borrowing from in future development. As a general-purpose toolkit, Lucene does give developers who need to embed full-text search into their applications a great deal of convenience.

In addition, through studying and using Lucene, I also gained a deeper understanding of why many database optimization guidelines are what they are. For example:

  • Index as many of the queried fields as possible to speed up queries; but too many indexes slow down updates to the table, and too many sorting criteria on results are, in practice, often performance killers;
  • Many commercial databases provide optimization parameters for inserting data in large batches, whose role is similar to the indexer's merge_factor;
  • The 20%/80% rule: returning more results does not necessarily mean better quality, especially for large result sets; optimizing the quality of the first few dozen results is often what matters most;
  • Keep the result sets the application fetches from the database as small as possible, because even for a big database, random access into a large result set is a very resource-consuming operation.


References:

  • Apache: the Lucene project
  • Lucene developer/user mailing-list archives
  • "The Lucene Search Engine: Powerful, Flexible, and Free"
  • Lucene tutorial
  • Notes on distributed searching with Lucene
  • Chinese-language word segmentation
  • An introduction to search-engine tools
  • Several of Doug Cutting's papers and patents on Lucene
  • Lucene's .NET implementation: dotLucene
  • Another Doug Cutting project: the Java-based search engine Nutch
  • A comparison of dictionary-based and N-gram word segmentation
  • 2005-01-08: Cutting's lecture on Lucene at the University of Pisa — a very detailed explanation of the Lucene framework

Special thanks to:
Former NetEase CTO Xu Liangjie (Jack Xu), for guiding me: it was you who brought me into the search-engine industry.

Author: Che Dong. First posted: 2002-08-06 18:08; last updated: 2009-03-20 23:03.
Copyright: free to reprint; when reprinting, please be sure to mark the article with a hyperlink to the original source, the author information, and this statement.
