sunrise

Only by learning every day can you keep improving yourself.


Stripping the tags from DBpedia data

Without further ado, here is the code.

#!/usr/bin/env python3
# coding=utf8

# Preprocess the extracted data: reduce each N-Triples line to bare labels

def pretreat(infile, outfile):
    with open(infile, 'r', encoding='utf-8') as rfile, \
         open(outfile, 'w', encoding='utf-8') as wfile:
        for raw in rfile:
            # Split the line into terms on '>', then split each term on '/'
            terms = [term.split('/') for term in raw.split('>')]
            for i, parts in enumerate(terms):
                # Handle the third element of the triple (a literal value)
                typed = False
                if '@zh' in parts[0]:
                    # Drop the language tag and any full-width slashes
                    parts[0] = parts[0].replace('@zh .', '').replace('／', '')
                if '^^<http:' in parts[0]:
                    # Typed literal: keep the value, drop the datatype URI
                    typed = True
                    parts[0] = parts[0].replace('^^<http:', '').replace('／', '')
                    print(parts[0])
                    wfile.write(parts[0].strip())
                # Index 3 is the trailing ' .' terminator, so skip it
                if i != 3 and not typed:
                    # The last path segment of a URI is the useful part
                    wfile.write(parts[-1].replace('／', '').strip() + ' ')
            wfile.write('\n')

# Check whether the text contains any ASCII letter
def is_alphabet(text):
    for uchar in text:
        if 'A' <= uchar <= 'Z' or 'a' <= uchar <= 'z':
            return True
    return False

# Drop triples whose country name contains letters
def removealp(infile, outfile):
    with open(infile, 'r', encoding='utf-8') as rfile, \
         open(outfile, 'w', encoding='utf-8') as wfile:
        for line in rfile:
            # The first space-separated field is the name to check
            if not is_alphabet(line.split(' ')[0]):
                wfile.write(line)

pretreat('article_categories_en_uris_zh.nt', 'tag_article_categories_en_uris_zh.txt')
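
To make the per-line transformation in pretreat concrete, here is a minimal sketch that applies the same splitting rules to a single string instead of a file. The helper name strip_triple and the sample lines are made up for illustration, and whitespace is normalized with a join, so the output spacing may differ slightly from what pretreat writes.

```python
def strip_triple(line):
    """Reduce one N-Triples line to space-separated labels.

    Hypothetical helper: it mirrors the splitting rules of pretreat
    but works on a single string and returns the result.
    """
    out = []
    for i, term in enumerate(line.split('>')):
        parts = term.split('/')
        head = parts[0]
        if '^^<http:' in head:
            # Typed literal: keep the value, drop the datatype URI
            out.append(head.replace('^^<http:', '').replace('／', '').strip())
            continue
        if '@zh' in head:
            # Language-tagged literal: drop the tag and line terminator
            parts[0] = head.replace('@zh .', '').replace('／', '')
        if i != 3:
            # Index 3 is the trailing ' .'; otherwise keep the last segment
            out.append(parts[-1].replace('／', '').strip())
    return ' '.join(out)


# Illustrative input lines, not taken from the real dump
uri_line = ('<http://dbpedia.org/resource/Beijing> '
            '<http://purl.org/dc/terms/subject> '
            '<http://dbpedia.org/resource/Category:Capitals_in_Asia> .\n')
zh_line = ('<http://dbpedia.org/resource/Beijing> '
           '<http://www.w3.org/2000/01/rdf-schema#label> "北京"@zh .\n')

print(strip_triple(uri_line))  # Beijing subject Category:Capitals_in_Asia
print(strip_triple(zh_line))   # Beijing rdf-schema#label "北京"
```

Because the logic splits on '>' and '/', a literal that itself contains either character would be cut in the wrong place; a real N-Triples parser would be the safer choice for such data.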

posted on 2012-09-13 17:29 by SunRise_at. Category: Lovely Python
