學著站在巨人的肩膀上 (Learning to Stand on the Shoulders of Giants)


Sorry to keep everyone waiting. I have been busy with exams for a while, and they are finally over. Enough small talk, let's get started!

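TSE packs all the crawled web pages into one large file and then builds its indexes over the data in that single file as a whole. The process consists of the following steps.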

1. The document index (Doc.idx) keeps information about each document.

It is a fixed-width ISAM (indexed sequential access method) index, ordered by docID.

The information stored in each entry includes a pointer into the repository,

a document length, a document checksum.

//Doc.idx columns: docID, document length (the values increase from 0, so they read as cumulative byte offsets), checksum hash

0   0   bc9ce846d7987c4534f53d423380ba70

1   76760   4f47a3cad91f7d35f4bb6b2a638420e5

2   141624 d019433008538f65329ae8e39b86026c

3   142350 5705b8f58110f9ad61b1321c52605795

//Doc.idx   end
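As a rough illustration of how an index like this can be used, the sketch below parses Doc.idx-style lines and derives each document's byte range. It assumes the second column is a byte offset into the raw file (the sample values grow monotonically from 0, which suggests this) and that a document ends where the next one begins. The names `DocEntry`, `parse_doc_idx` and `doc_ranges` are hypothetical and not part of TSE.

```python
from typing import NamedTuple


class DocEntry(NamedTuple):
    doc_id: int
    offset: int    # assumed: byte offset of the document in the raw file
    checksum: str  # 32-character hex digest, as in the sample above


def parse_doc_idx(lines):
    """Parse whitespace-separated 'docID offset checksum' lines."""
    entries = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        entries.append(DocEntry(int(fields[0]), int(fields[1]), fields[2]))
    return entries


def doc_ranges(entries):
    """Map docID -> (start, end); the last entry has no known end here."""
    ranges = {}
    for cur, nxt in zip(entries, entries[1:]):
        ranges[cur.doc_id] = (cur.offset, nxt.offset)
    return ranges


sample = [
    "0   0   bc9ce846d7987c4534f53d423380ba70",
    "1   76760   4f47a3cad91f7d35f4bb6b2a638420e5",
    "2   141624 d019433008538f65329ae8e39b86026c",
    "3   142350 5705b8f58110f9ad61b1321c52605795",
]
entries = parse_doc_idx(sample)
ranges = doc_ranges(entries)
print(ranges[1])  # (76760, 141624)
```

Because the entries are ordered by docID, a lookup by docID needs no search at all if the entries are stored with a fixed width.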

The url index (url.idx) is used to convert URLs into docIDs.

//url.idx

5c36868a9c5117eadbda747cbdb0725f    0

3272e136dd90263ee306a835c6c70d77    1

6b8601bb3bb9ab80f868d549b5c5a5f3    2

3f9eba99fa788954b5ff7f35a5db6e1f    3

//url.idx   end

It is a list of URL checksums with their corresponding docIDs and is sorted by

checksum. In order to find the docID of a particular URL, the URL's checksum

is computed and a binary search is performed on the checksums file to find its

docID.
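The lookup described above can be sketched in a few lines. This is only an illustration: it assumes the checksum is the MD5 hex digest of the URL string (the 32-character values in url.idx are consistent with that, but the listing does not confirm it), and the URLs and function names are made up for the example.

```python
import bisect
import hashlib


def url_checksum(url: str) -> str:
    # Assumption: the checksum is the MD5 hex digest of the URL string.
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def build_url_index(urls_with_ids):
    """Return a list of (checksum, docID) pairs sorted by checksum."""
    return sorted((url_checksum(url), doc_id) for url, doc_id in urls_with_ids)


def lookup_doc_id(index, url):
    """Binary-search the sorted index; return the docID, or None if absent."""
    key = url_checksum(url)
    # (key,) sorts before every (key, docID) pair, so bisect_left lands on
    # the first entry whose checksum is >= key.
    pos = bisect.bisect_left(index, (key,))
    if pos < len(index) and index[pos][0] == key:
        return index[pos][1]
    return None


# Hypothetical URLs standing in for the masked ones in the listing.
index = build_url_index([
    ("http://example.edu.cn/index.aspx", 0),
    ("http://example.edu.cn/showcontent1.jsp?NewsID=118", 1),
    ("http://example.edu.cn/0102.html", 2),
    ("http://example.edu.cn/0103.html", 3),
])
print(lookup_doc_id(index, "http://example.edu.cn/0102.html"))     # 2
print(lookup_doc_id(index, "http://example.edu.cn/missing.html"))  # None
```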

    ./DocIndex

        got Doc.idx, Url.idx, DocId2Url.idx //the Doc.idx, Url.idx and DocId2Url.idx files in the Data folder

//DocId2Url.idx

0   http://*.*.edu.cn/index.aspx

1   http://*.*.edu.cn/showcontent1.jsp?NewsID=118

2   http://*.*.edu.cn/0102.html

3   http://*.*.edu.cn/0103.html

//DocId2Url.idx end

2. sort Url.idx|uniq > Url.idx.sort_uniq    //Url.idx.sort_uniq in the Data folder

//Url.idx.sort_uniq

//sorted by hash value

000bfdfd8b2dedd926b58ba00d40986b    1111

000c7e34b653b5135a2361c6818e48dc    1831

0019d12f438eec910a06a606f570fde8    366

0033f7c005ec776f67f496cd8bc4ae0d    2103

3. Segment each document into terms (looking up the document by its URL)

    ./DocSegment Tianwang.raw.2559638448        //Tianwang.raw.2559638448 is the crawled file; each page includes its HTTP headers

        got Tianwang.raw.2559638448.seg

//Tianwang.raw.2559638448   the raw crawled pages; inside the file, individual documents appear to be delimited by the version line, </html> and a newline

version: 1.0

url: http://***.105.138.175/Default2.asp?lang=gb

origin: http://***.105.138.175/

date: Fri, 23 May 2008 20:01:36 GMT

ip: 162.105.138.175

length: 38413

HTTP/1.1 200 OK

Server: Microsoft-IIS/5.0

Date: Fri, 23 May 2008 11:17:49 GMT

Connection: keep-alive

Connection: Keep-Alive

Content-Length: 38088

Content-Type: text/html; Charset=gb2312

Expires: Fri, 23 May 2008 11:17:49 GMT

Set-Cookie: ASPSESSIONIDSSTRDCAB=IMEOMBIAIPDFCKPAEDJFHOIH; path=/

Cache-control: private

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"

"
<html>

<head>

<title>Apabi數字資源平臺</title>

<meta http-equiv="Content-Type" content="text/html; charset=gb2312">

<META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">

<META NAME="DESCRIPTION" CONTENT="數字圖書館方正數字圖書館電子圖書電子書 ebook e書 Apabi 數字資源平臺">

<link rel="stylesheet" type="text/css" href="css\common.css">

<style type="text/css">



</style>

<script LANGUAGE="vbscript">

...

</script>

<Script Language="javascript">

...

</Script>

</head>

<body leftmargin="0" topmargin="0">

</body>

</html>

//Tianwang.raw.2559638448   end
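To make the record layout above more concrete, here is a minimal sketch that collects the per-page metadata lines (version, url, origin, date, ip, length) that precede the HTTP response. It assumes the metadata block ends at the first blank line or at the HTTP status line; `parse_record_header` is a hypothetical helper, not a TSE function.

```python
def parse_record_header(lines):
    """Collect 'key: value' metadata lines that precede the HTTP response."""
    header = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("HTTP/"):
            break  # assumed end of the metadata block
        key, sep, value = line.partition(":")
        if not sep:
            break  # not a 'key: value' line
        header[key.strip()] = value.strip()
    return header


record = [
    "version: 1.0",
    "url: http://***.105.138.175/Default2.asp?lang=gb",
    "origin: http://***.105.138.175/",
    "date: Fri, 23 May 2008 20:01:36 GMT",
    "ip: 162.105.138.175",
    "length: 38413",
    "HTTP/1.1 200 OK",
    "Server: Microsoft-IIS/5.0",
]
header = parse_record_header(record)
print(header["url"], header["length"])
```

The `length` value is kept as a string here; a reader that wants to skip straight to the next record would convert it to an integer first.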

//Tianwang.raw.2559638448.seg   each page is written as a single line, as follows (note there are no line breaks in between as separators)

1

...

...

...

2

...

...

...

//Tianwang.raw.2559638448.seg   end

//The steps below are not strictly required by Tiny search

4. Create forward index (docid-->termid)     //build the forward index

    ./CrtForwardIdx Tianwang.raw.2559638448.seg > moon.fidx

//Tianwang.raw.2559638448.seg   each page is segmented onto a single line, as follows

//DocID, followed by its segmented terms

1

三星/ s/ 手機/ 論壇/ ,/ 手機/ 鈴聲/ 下載/ ,/ 手機/ 圖片/ 下載/ ,/ 手機/

2

...

...

...

//Tianwang.raw.2559638448.seg end

//moon.fidx

//each line pairs a term segmented from a document with that document's ID:    term DocID

都會 2391

使   2391

那些 2391

擁有 2391

它   2391

的   2391

人   2391

的   2391

視野 2391

變   2391

窄   2391

在   2180

研究生部    2180

主頁 2180

培養 2180

管理 2180

欄目 2180

下載 2180

）   2180

、   2180

關于 2180

做好 2180

年   2180

國家 2180

公派 2180

研究生 2180

項目 2180

//moon.fidx end
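The step from the .seg file to moon.fidx can be sketched as follows. This is an illustration under stated assumptions, not the CrtForwardIdx source: it assumes the .seg file alternates a docID line with a line of "/"-delimited terms, and that the output is one tab-separated `term docID` pair per line. The second sample document is invented for the example.

```python
def forward_index(seg_lines):
    """Turn alternating docID / terms lines into 'term<TAB>docID' lines."""
    out = []
    # Drop empty lines, then consume the rest two at a time.
    it = iter(line.strip() for line in seg_lines if line.strip())
    for doc_id, terms_line in zip(it, it):
        for term in terms_line.split("/"):
            term = term.strip()
            if term:  # skip the empty piece after the trailing '/'
                out.append(f"{term}\t{doc_id}")
    return out


seg = [
    "1",
    "三星/ s/ 手機/ 論壇/ ,/ 手機/",
    "2",
    "研究生部/ 主頁/",  # hypothetical second document
]
fidx = forward_index(seg)
for line in fidx:
    print(line)
```

Repeated terms and punctuation are emitted as-is, which matches the moon.fidx sample above, where 的 appears twice for document 2391 and punctuation marks have their own entries.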

5.# set | grep "LANG"

LANG=en; export LANG;

sort moon.fidx > moon.fidx.sort

6. Create inverted index (termid-->docid)    //build the inverted index

    ./CrtInvertedIdx moon.fidx.sort > sun.iidx

//sun.iidx //the file size shrinks by roughly half

花工   236

花海   2103

花卉   1018 1061 1061 1061 1730 1730 1730 1730 1730 1852 949 949

花蕾   447 447

花木   1061

花呢   1430

花期   447 447 447 447 447 525

花錢   174 236

花色   1730 1730

花色品種     1660

花生   450 526

花式   1428 1430 1430 1430

花紋   1430 1430

花序   447 447 447 447 447 450

花絮   136 137

花芽   450 450

//sun.iidx end
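Steps 5 and 6 amount to sorting the forward index and then merging adjacent lines that share a term. The sketch below shows the idea; it is not the CrtInvertedIdx source, and it assumes tab-separated `term docID` input lines. Python's plain string sort stands in for the `sort` command here.

```python
from itertools import groupby


def invert(sorted_fidx_lines):
    """Merge consecutive 'term<TAB>docID' lines into 'term<TAB>id id ...'."""
    pairs = (
        line.rstrip("\n").split("\t")
        for line in sorted_fidx_lines
        if line.strip()
    )
    result = []
    for term, group in groupby(pairs, key=lambda pair: pair[0]):
        doc_ids = [doc_id for _, doc_id in group]
        result.append(term + "\t" + " ".join(doc_ids))
    return result


fidx = ["花卉\t949", "花卉\t1018", "花蕾\t447", "花卉\t1061", "花蕾\t447"]
inverted = invert(sorted(fidx))  # lines are compared as text, not as numbers
for line in inverted:
    print(line)
```

Because whole lines are sorted as text, docID 949 ends up after 1061, the same ordering visible in the 花卉 row of sun.iidx above. Duplicate docIDs are kept, so a term that occurs twice in a document lists that document twice.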

TSESearch   CGI program for query

Snapshot    CGI program for page snapshot

author:http://hi.baidu.com/jrckkyy

author:http://blog.csdn.net/jrckkyy

posted on 2009-12-10 22:55 by 學者站在巨人的肩膀上 | Category: Chinese Text Information Processing
