
Python: generating a corpus using the Dirichlet distribution

First, let's define the sample function:

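# imports needed by the functions in this post
from numpy import array, cumsum, zeros
from numpy.random import dirichlet, uniform

def sample(dist, num_samples=1):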
    """
    Uses the inverse CDF method to return samples drawn from an
    (unnormalized) discrete distribution.

    Arguments:

    dist -- (unnormalized) distribution

    Keyword arguments:

    num_samples -- number of samples to draw
    """

    cdf = cumsum(dist)
    r = uniform(size=num_samples) * cdf[-1]

    return cdf.searchsorted(r)

The sample function takes two parameters: dist, which may be an unnormalized distribution, and num_samples, the number of samples to draw. It returns the indices of the sampled outcomes.
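To make the inverse CDF idea concrete, here is a minimal sketch of the same technique using only the Python standard library. The function name sample_inverse_cdf and the example weights are illustrative assumptions, not part of the original code; bisect_left plays the role that searchsorted plays in the NumPy version.

```python
import bisect
import itertools
import random


def sample_inverse_cdf(weights, num_samples=1, rng=random):
    """Draw indices from an (unnormalized) discrete distribution.

    Hypothetical stdlib-only counterpart of sample(); not the post's code.
    """
    # Running totals play the role of the (unnormalized) CDF.
    cdf = list(itertools.accumulate(weights))
    total = cdf[-1]
    draws = []
    for _ in range(num_samples):
        r = rng.random() * total  # uniform point in [0, total)
        # First index whose running total is >= r.
        draws.append(bisect.bisect_left(cdf, r))
    return draws


random.seed(0)
counts = [0, 0, 0]
for v in sample_inverse_cdf([1.0, 2.0, 1.0], num_samples=4000):
    counts[v] += 1
print(counts)  # counts come out roughly in the ratio 1 : 2 : 1
```

Because the uniform draw is scaled by the last running total, the weights never need to be normalized first, which is the same reason the NumPy version multiplies by cdf[-1].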

Next, let's see how to generate a corpus from a Dirichlet--multinomial unigram language model:

def generate_corpus(beta, mean, N):
    """
    Returns a corpus of tokens drawn from a Dirichlet--multinomial
    unigram language model. Each token is an instance of one of V
    unique word types, represented by indices 0, ..., V - 1.

    Arguments:

    beta -- concentration parameter for the Dirichlet prior
    mean -- V-dimensional mean of the Dirichlet prior
    N -- number of tokens to generate
    """

    # draw phi ~ Dirichlet(beta * mean), a distribution over the V word types
    phi = dirichlet(beta * array(mean))
    # draw N token indices from phi
    return sample(phi, N)

Keep in mind that the dirichlet function is imported with "from numpy.random.mtrand import dirichlet" (it is also exposed as numpy.random.dirichlet). Its parameter vector is beta*array(mean), where beta is the concentration parameter and mean is a vector that sums to 1.
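To see what the concentration parameter does, here is a minimal sketch that draws from a Dirichlet using only the standard library (normalized gamma variates, a standard construction) instead of NumPy's dirichlet. The helper name dirichlet_draw and the values of mean and beta are illustrative assumptions.

```python
import random


def dirichlet_draw(concentration, rng=random):
    """Draw one sample from Dirichlet(concentration) via normalized gammas."""
    gammas = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(gammas)
    return [g / total for g in gammas]


random.seed(1)
mean = [0.5, 0.3, 0.2]      # illustrative mean vector; sums to 1
for beta in (1.0, 1000.0):   # illustrative concentration values
    phi = dirichlet_draw([beta * m for m in mean])
    print(beta, [round(p, 3) for p in phi])
```

Each draw sums to 1. With a small beta the draw can land far from mean, while with a large beta it stays close to mean.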

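Another way to generate the corpus is to use the predictive probability of the next token D' given the tokens D generated so far, with phi integrated out:

$$P(D' = v \mid D, H) = \frac{N_v + \beta n_v}{N + \beta}$$

Here N_v is the number of times word type v has been generated so far, N is the total number of tokens generated so far, and n_v is component v of the Dirichlet mean (the mean argument in the code).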

def generate_corpus_collapsed(beta, mean, N):
    """
    Returns a corpus of tokens drawn from a Dirichlet--multinomial
    unigram language model using the 'collapsed' generative process
    (i.e., phi is not explicitly represented). Each token is an
    instance of one of V unique word types.

    Arguments:

    beta -- concentration parameter for the Dirichlet prior
    mean -- V-dimensional mean of the Dirichlet prior
    N -- number of tokens to generate
    """

    V = len(mean) # vocabulary size

    corpus = zeros(N, dtype=int) # corpus

    Nv = zeros(V, dtype=int) # counts for each word type

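    for n in range(N):
        # predictive distribution over word types given the n tokens so far
        corpus[n] = sample((Nv + beta * array(mean)) / (n + beta), 1)[0]
        # update the count of the word type that was just drawn
        Nv[corpus[n]] += 1
    return corpus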
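The predictive rule used in the loop above can be checked by hand on a small case. The sketch below is an illustration only: the function name predictive_probs and the numbers are assumptions chosen so the arithmetic is easy to follow.

```python
def predictive_probs(counts, beta, mean):
    """P(next token = v | counts) = (N_v + beta * mean_v) / (N + beta)."""
    n_tokens = sum(counts)
    return [(c + beta * m) / (n_tokens + beta) for c, m in zip(counts, mean)]


mean = [0.5, 0.25, 0.25]   # illustrative prior mean over V = 3 word types
beta = 4.0                 # illustrative concentration parameter

# No tokens yet: the prediction equals the prior mean.
print(predictive_probs([0, 0, 0], beta, mean))  # [0.5, 0.25, 0.25]
# After four tokens (type 1 three times, type 2 once) the counts pull
# the prediction toward type 1.
print(predictive_probs([0, 3, 1], beta, mean))  # [0.25, 0.5, 0.25]
```

The probabilities always sum to 1, since the counts sum to N and beta * mean sums to beta.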

Now let's see how to generate a corpus from a mixture of Dirichlet--multinomial unigram language models:

def generate_corpus(alpha, m, beta, n, D, Nd):
    """
    Returns a grouped corpus drawn from a mixture of
    Dirichlet--multinomial unigram language models.

    Arguments:

    alpha -- concentration parameter for the Dirichlet prior over theta
    m -- T-dimensional mean of the Dirichlet prior over theta
    beta -- concentration parameter for the Dirichlet prior over phis
    n -- V-dimensional mean of the Dirichlet prior over phis
    D -- number of documents to generate
    Nd -- number of tokens to generate per document
    """
    # GroupedCorpus is not defined in this post and is assumed to be available
    corpus = GroupedCorpus()

    # theta: distribution over the T topics, drawn from Dirichlet(alpha * m)
    theta = dirichlet(alpha * array(m))
    # phis: one distribution over the V word types per topic, each drawn
    # from Dirichlet(beta * n); row t is the distribution for topic t
    phis = dirichlet(beta * array(n), len(m))
    for d in range(D):
        # draw one topic for document d, then draw all Nd tokens from its phi
        [t] = sample(theta, 1)
        corpus.add(str(d), str(t), [str(x) for x in sample(phis[t, :], Nd)])
    return corpus

Note that there are T topics (groups). phis=dirichlet(beta*array(n),len(m)) produces T draws from the Dirichlet, one per topic, and every document assigned to the same topic t should use the same draw, phis[t,:].
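Since GroupedCorpus is not defined in this post, the following sketch shows the same generative process with plain Python lists and the standard library only. Everything named here (dirichlet_draw, generate_grouped, the seed argument, and the parameter values) is a hypothetical stand-in, not the original implementation.

```python
import random


def dirichlet_draw(concentration, rng):
    """Draw one sample from Dirichlet(concentration) via normalized gammas."""
    gammas = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(gammas)
    return [g / total for g in gammas]


def generate_grouped(alpha, m, beta, n, D, Nd, seed=0):
    """Return a list of (doc_id, topic, tokens) tuples.

    A hypothetical stand-in for filling a GroupedCorpus.
    """
    rng = random.Random(seed)
    T, V = len(m), len(n)
    # One distribution over topics, shared by all documents.
    theta = dirichlet_draw([alpha * x for x in m], rng)
    # One distribution over word types per topic (T draws in total).
    phis = [dirichlet_draw([beta * x for x in n], rng) for _ in range(T)]
    docs = []
    for d in range(D):
        # Each document gets a single topic ...
        t = rng.choices(range(T), weights=theta)[0]
        # ... and all of its tokens come from that topic's phi.
        tokens = rng.choices(range(V), weights=phis[t], k=Nd)
        docs.append((d, t, tokens))
    return docs


docs = generate_grouped(alpha=2.0, m=[0.5, 0.5], beta=4.0, n=[0.25] * 4, D=3, Nd=5)
for doc_id, topic, tokens in docs:
    print(doc_id, topic, tokens)
```

The list phis is built once, outside the document loop, so two documents that draw the same topic t share the same word distribution phis[t].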

posted on 2012-10-28 10:13 by luis. Category: Python
