Annotated Lucene: Section 3, How an Index Is Created

4 How an Index Is Created

To index data with Lucene, you must first convert it into a stream of plain-text tokens and use that stream to build Document objects, whose Fields hold the text. Once the Document objects are ready, you pass them to Lucene by calling addDocument(Document) on IndexWriter, which writes them into the index. As part of this step, Lucene first runs the data through an analyzer to make it more suitable for indexing. See "Lucene in Action" for details.

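    // Imports needed by this snippet (Lucene 2.x package layout)
    import java.io.File;
    import java.io.FileReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.DateTools;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    // Store the index on disk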
    Directory directory = FSDirectory.getDirectory("/tmp/testindex");
    // Use standard analyzer
    Analyzer analyzer = new StandardAnalyzer();
    // Create IndexWriter object
    IndexWriter iwriter = new IndexWriter(directory, analyzer, true);
    iwriter.setMaxFieldLength(25000);
    // make a new, empty document
    Document doc = new Document();
    File f = new File("/tmp/test.txt");
    // Add the path of the file as a field named "path".  Use a field that is
    // indexed (i.e. searchable), but don't tokenize the field into words.
    doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.UN_TOKENIZED));
    String text = "This is the text to be indexed.";
    doc.add(new Field("fieldname", text, Field.Store.YES, Field.Index.TOKENIZED));
    // Add the last modified date of the file as a field named "modified".  Use
    // a field that is indexed (i.e. searchable), but don't tokenize the field
    // into words.
    doc.add(new Field("modified",
        DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE),
        Field.Store.YES, Field.Index.UN_TOKENIZED));
    // Add the contents of the file to a field named "contents".  Specify a Reader,
    // so that the text of the file is tokenized and indexed, but not stored.
    // Note that FileReader expects the file to be in the system's default encoding.
    // If that's not the case searching for special characters will fail.
    doc.add(new Field("contents", new FileReader(f)));
    iwriter.addDocument(doc);
    iwriter.optimize();
    iwriter.close();

The sections below describe in detail how each of these classes works.

4.1 The Index-Creation Class IndexWriter

An IndexWriter creates and maintains an index.

The create argument of its constructor determines whether a new index is created or an existing index is opened. Note that you can open an index with create=true even while other readers are using it. The old readers keep searching the "point in time" snapshot they already opened and do not see the newly created index until they re-open. There are also constructors without a create argument: they create a new index if none exists at the provided path, and otherwise open the existing one.

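In either case, documents are added with addDocument() and deleted with deleteDocuments(), and a document can be updated with updateDocument() (which simply performs a delete followed by an add). When you have finished adding, deleting, and updating documents, you should call close().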
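The "delete, then add" behavior of an update can be illustrated with a small toy model. This is a sketch only, not Lucene code: TinyIndexModel and its String-based documents are hypothetical stand-ins, with a plain id playing the role of the term that identifies the document.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (not Lucene code): an "index" that is just a list of {id, text} pairs.
class TinyIndexModel {
    private final List<String[]> docs = new ArrayList<>();

    void addDocument(String id, String text) {
        docs.add(new String[] {id, text});
    }

    // Removes every document with the given id and returns how many were removed.
    int deleteDocuments(String id) {
        int before = docs.size();
        docs.removeIf(d -> d[0].equals(id));
        return before - docs.size();
    }

    // An update is nothing more than a delete followed by an add.
    void updateDocument(String id, String text) {
        deleteDocuments(id);
        addDocument(id, text);
    }

    int size() {
        return docs.size();
    }

    // Returns the texts of all documents stored under the given id.
    List<String> textsFor(String id) {
        List<String> result = new ArrayList<>();
        for (String[] d : docs) {
            if (d[0].equals(id)) {
                result.add(d[1]);
            }
        }
        return result;
    }
}
```

One consequence worth noticing in this model: if several documents share the same id, an update replaces all of them with the single new document.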

These changes are buffered in memory and periodically flushed to the Directory (during the method calls above). A flush is triggered when, since the last flush, there are either enough buffered deletes (see setMaxBufferedDeleteTerms(int)) or enough added documents (see setMaxBufferedDocs(int)), whichever is sooner. When a flush occurs, both the pending deletes and the pending added documents are flushed to the index. A flush may also trigger one or more segment merges.
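The "whichever is sooner" rule can be sketched as a toy model. This is not Lucene code: BufferedWriterModel and its members are hypothetical names, the two constructor arguments merely play the roles of setMaxBufferedDocs(int) and setMaxBufferedDeleteTerms(int), and segment merges are left out entirely.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (not Lucene code): buffers adds and deletes in memory and
// flushes both together as soon as either threshold is reached.
class BufferedWriterModel {
    private final int maxBufferedDocs;        // role of setMaxBufferedDocs(int)
    private final int maxBufferedDeleteTerms; // role of setMaxBufferedDeleteTerms(int)
    private final List<String> pendingAdds = new ArrayList<>();
    private final List<String> pendingDeletes = new ArrayList<>();
    private int flushCount = 0;

    BufferedWriterModel(int maxBufferedDocs, int maxBufferedDeleteTerms) {
        this.maxBufferedDocs = maxBufferedDocs;
        this.maxBufferedDeleteTerms = maxBufferedDeleteTerms;
    }

    void addDocument(String doc) {
        pendingAdds.add(doc);
        flushIfNeeded();
    }

    void deleteDocuments(String term) {
        pendingDeletes.add(term);
        flushIfNeeded();
    }

    // Whichever threshold is reached first triggers the flush.
    private void flushIfNeeded() {
        if (pendingAdds.size() >= maxBufferedDocs
                || pendingDeletes.size() >= maxBufferedDeleteTerms) {
            flush();
        }
    }

    // A flush writes out pending adds and pending deletes together.
    void flush() {
        pendingAdds.clear();
        pendingDeletes.clear();
        flushCount++;
    }

    int flushCount() { return flushCount; }
    int pendingAddCount() { return pendingAdds.size(); }
    int pendingDeleteCount() { return pendingDeletes.size(); }
}
```

With thresholds of 3 documents and 2 delete terms, for example, two adds followed by two deletes cause a single flush on the second delete, and that flush also empties the two buffered adds.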

The optional autoCommit argument of the constructors controls the visibility of changes to IndexReader instances reading the same index. When it is false, changes are not visible until close() is called. Note that changes are still flushed to the Directory as new files, but they are not committed (no new segments_N file referencing the new files is written) until close() is called. If something goes terribly wrong before close() is called (for example, the JVM crashes), the index will reflect none of the changes made (it remains in its starting state). You can also call abort(), which closes the writer without committing any changes and removes any index files that were flushed but are now unreferenced. This mode is useful for preventing readers from refreshing at a bad time (for example, after you have done all your deletes but before you have done your adds). It can also be used to implement simple single-writer transactional semantics ("all or none").
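The "all or none" behavior can be made concrete with a toy model. This is a sketch only, not Lucene code: TransactionalWriterModel is a hypothetical class, file names are plain strings, and "committing" simply means moving flushed file names into the list that a segments_N file would reference.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (not Lucene code) of a writer opened with autoCommit=false.
class TransactionalWriterModel {
    private final List<String> committed;                   // files referenced by the current segments_N
    private final List<String> flushed = new ArrayList<>(); // written to disk, but not yet referenced
    private boolean open = true;

    TransactionalWriterModel(List<String> startingState) {
        this.committed = new ArrayList<>(startingState);
    }

    // A flush writes a new file but does not change what readers see.
    void flush(String fileName) {
        requireOpen();
        flushed.add(fileName);
    }

    // close() commits: the flushed files become referenced and therefore visible.
    void close() {
        requireOpen();
        committed.addAll(flushed);
        flushed.clear();
        open = false;
    }

    // abort() discards the flushed, unreferenced files and commits nothing.
    void abort() {
        requireOpen();
        flushed.clear();
        open = false;
    }

    List<String> visibleToReaders() {
        return new ArrayList<>(committed);
    }

    List<String> unreferencedFiles() {
        return new ArrayList<>(flushed);
    }

    private void requireOpen() {
        if (!open) {
            throw new IllegalStateException("writer is already closed");
        }
    }
}
```

In this model a crash before close() corresponds to never reaching the commit step, so readers keep seeing the starting state.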

When autoCommit is true, every flush is also a commit (IndexReader instances see each flush as a change to the index). This is the default, chosen to match the behavior before version 2.2. When running in this mode, be careful not to refresh your readers while an optimize or segment merges are taking place, since this can tie up substantial disk space.

When an index will not have more documents added for a while and optimal search performance is desired, optimize() should be called before the index is closed.

Opening an IndexWriter creates a lock file for the Directory in use. Trying to open another IndexWriter on the same Directory leads to a LockObtainFailedException. The same exception is thrown if an IndexReader on the same Directory is being used to delete documents from the index.

Expert: IndexWriter allows an optional IndexDeletionPolicy implementation to be specified. You can use it to control when prior commits are deleted from the index. The default policy is KeepOnlyLastCommitDeletionPolicy, which removes all prior commits as soon as a new commit is done (this matches the behavior before version 2.2). Creating your own policy lets you explicitly keep previous "point in time" commits alive in the index for some time, so that readers can refresh to the new commit without having the old commit deleted out from under them. This is necessary on filesystems such as NFS that do not support "delete on last close" semantics, which Lucene's "point in time" search normally relies on.
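The difference between the default policy and a custom one can be sketched with a toy model. This is not the Lucene IndexDeletionPolicy interface: DeletionPolicyModel, KeepOnlyLastModel, and KeepLastNModel are hypothetical names, and commits are represented simply by the names of their segments_N files, oldest first.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (not the Lucene interface): a policy decides which commits to delete.
interface DeletionPolicyModel {
    // Given all commits, oldest first, returns the commits that should be deleted.
    List<String> commitsToDelete(List<String> commits);
}

// Mirrors the idea of the default policy: keep only the most recent commit.
class KeepOnlyLastModel implements DeletionPolicyModel {
    public List<String> commitsToDelete(List<String> commits) {
        int keepFrom = Math.max(0, commits.size() - 1);
        return new ArrayList<>(commits.subList(0, keepFrom));
    }
}

// A custom policy: keep the last n commits alive so that readers still
// searching an older "point in time" commit do not lose their files.
class KeepLastNModel implements DeletionPolicyModel {
    private final int n;

    KeepLastNModel(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        this.n = n;
    }

    public List<String> commitsToDelete(List<String> commits) {
        int keepFrom = Math.max(0, commits.size() - n);
        return new ArrayList<>(commits.subList(0, keepFrom));
    }
}
```

A policy based on the number of commits is only one possible design. A time-based policy, keeping every commit younger than some age, would fit the NFS case equally well in principle.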

Annotated Lucene. Author: naven. Date: 2007-5-1

Posted on 2007-05-10 00:07 by Javen-Studio
