
Python Chinese word segmentation (pymmseg-cpp) and the garbled Chinese text problem

pymmseg-cpp

http://code.google.com/p/pymmseg-cpp/

pymmseg-cpp is a Python port of the rmmseg-cpp project. rmmseg-cpp is an MMSEG Chinese word segmenting algorithm implemented in C++ with a Ruby interface.

Download the binary release on the right sidebar and copy the pymmseg directory to your Python's path (e.g. /usr/lib/python2.5/site-packages/). Here's an example of usage:

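from pymmseg import mmseg

# Load the default dictionaries shipped with pymmseg
mmseg.dict_load_defaults()
text = '...'  # placeholder: the UTF-8 encoded text to segment
algor = mmseg.Algorithm(text)
# Print each token together with its start and end offsets
for tok in algor:
    print '%s [%d..%d]' % (tok.text, tok.start, tok.end)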

Or you can download the source tarball or check out the latest code from the git repo hosted at github. Then you'll need to build the mmseg-cpp module yourself: go to the mmseg-cpp subdirectory and run the build.py script. It will build the native module for you.

For more information, refer to the README file.

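Many people run into garbled output. The likely cause is that mmseg works with UTF-8, while the default local encoding on Windows is cp936, i.e. GBK, so printing a UTF-8 string directly to the console naturally produces garbled text.
Solution:
Transcode at the point where you print to the console. Write the print statement like this: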
print myname.decode('UTF-8').encode('GBK')

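from pymmseg import mmseg

# Load the default dictionaries shipped with pymmseg
mmseg.dict_load_defaults()
text = '...'  # placeholder: the UTF-8 encoded text to segment
algor = mmseg.Algorithm(text)
for tok in algor:
    # tok.text is UTF-8; re-encode it as GBK before printing to a cp936 console
    print '%s [%d..%d]' % (tok.text.decode('UTF-8').encode('GBK') , tok.start, tok.end)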
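
To make the cause and the fix concrete, here is a minimal sketch that does not need pymmseg at all. Note the assumptions: it is written for Python 3 (the snippets above are Python 2), the helper name transcode is hypothetical, and the sample string is made up for illustration. It shows what a cp936 console effectively does with UTF-8 bytes, and that the decode/encode round trip yields bytes the console can read.

```python
def transcode(data, source="utf-8", target="gbk"):
    """Hypothetical helper: re-encode a byte string from `source` to `target`."""
    return data.decode(source).encode(target, errors="replace")


# Sample text (an assumption for this sketch), held as UTF-8 bytes,
# the form in which the post says mmseg hands text back.
utf8_bytes = "中文分词".encode("utf-8")

# What a cp936 console effectively does: interpret the bytes as GBK.
# The result is not the original text, i.e. it is garbled.
misread = utf8_bytes.decode("gbk", errors="replace")

# The fix from the post: decode as UTF-8, then encode as GBK.
gbk_bytes = transcode(utf8_bytes)

# Reading the transcoded bytes as GBK recovers the original text.
recovered = gbk_bytes.decode("gbk")
```

In Python 3, print() on a str normally lets the interpreter encode for the console, so a manual round trip like this is mostly relevant when you are handling raw bytes, as the Python 2 code above does.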

posted on 2011-05-03 13:27 by 漂漂
