Topic Modeling of Short Texts: A Pseudo-Document View
2022-08-03 03:03:00 【eat 243】
- PTM assumes that the huge number of short texts is generated from a much smaller number of normal-sized latent documents. These latent documents are called pseudo documents.
- By learning the topic distributions of pseudo documents rather than of short texts, PTM has a fixed number of parameters and can avoid overfitting when the training corpus is relatively scarce.
2.1 Basic Model
We now give a formal description of PTM. We assume there are $K$ topics $\{\phi_z\}_{z=1}^{K}$, each a multinomial distribution over a vocabulary of size $V$. There are $D$ short texts $\{d_s\}_{s=1}^{D}$ and $P$ pseudo documents $\{d'_l\}_{l=1}^{P}$. **Short texts are the observed documents; pseudo documents are latent documents. A multinomial distribution $\psi$ is introduced to model the distribution of short texts over pseudo documents. We further assume that each short text belongs to one and only one pseudo document.** Each word in a short text is generated by first sampling a topic $z$ from the topic distribution $\theta$ of its pseudo document, and then sampling a word $w \sim \phi_z$.
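The generative story above can be sketched in a few lines of Python. This is only an illustration: the symmetric Dirichlet priors and their hyperparameters (`alpha`, `beta`, `lam`), the fixed document length `doc_len`, and the function name are assumptions made for the example and are not stated in this section.

```python
import numpy as np

def generate_ptm_corpus(D, P, K, V, doc_len, alpha=0.5, beta=0.1, lam=0.5, seed=0):
    """Sample a toy corpus following the PTM generative story (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Assumed symmetric Dirichlet priors; hyperparameter values are hypothetical.
    phi = rng.dirichlet(np.full(V, beta), size=K)     # K topic-word distributions
    theta = rng.dirichlet(np.full(K, alpha), size=P)  # topic proportions of each pseudo document
    psi = rng.dirichlet(np.full(P, lam))              # distribution of short texts over pseudo documents
    assignments, docs = [], []
    for _ in range(D):
        l = int(rng.choice(P, p=psi))                 # each short text belongs to exactly one pseudo document
        topics = rng.choice(K, size=doc_len, p=theta[l])       # one topic per token, drawn from theta_l
        words = [int(rng.choice(V, p=phi[z])) for z in topics]  # one word per token, drawn from phi_z
        assignments.append(l)
        docs.append(words)
    return assignments, docs

assignments, docs = generate_ptm_corpus(D=20, P=3, K=4, V=30, doc_len=5)
```

Note that the topic proportions live on the `P` pseudo documents, not on the `D` short texts, which is the point of the model.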
Remark 1 (PTM learns topics from $P$ pseudo documents rather than from $D$ short texts, with $P \ll D$.)
- The introduction of pseudo documents is the key factor in PTM's resistance to data sparsity. To understand this better, suppose there are $D$ short texts, each with $N$ tokens on average. It has been shown that **when $N$ is too small, LDA cannot learn topics accurately even if $D$ is very large.** This is because, in that case, the shortage of co-occurring words, which are scattered across different short texts, is not alleviated. However, **PTM learns topics from $P$ pseudo documents rather than from $D$ short texts, with $P \ll D$.** We can therefore roughly estimate that a pseudo document has $N'$ tokens on average, with $N' = DN/P \gg N$, which implies a potential improvement in word co-occurrence.
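To make the estimate $N' = DN/P$ concrete, here is a tiny worked example. The corpus sizes are hypothetical numbers chosen for illustration, not figures from the article.

```python
def avg_pseudo_doc_length(D, N, P):
    """Rough average number of tokens per pseudo document: N' = D * N / P."""
    return D * N / P

# Hypothetical corpus: one million short texts of 8 tokens each, 1,000 pseudo documents.
print(avg_pseudo_doc_length(D=1_000_000, N=8, P=1_000))  # 8000.0
```

A short text of 8 tokens gives very few co-occurring word pairs, whereas a pseudo document of several thousand tokens is comparable to a regular document.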
Remark 2 (Given the unique pseudo document a short text belongs to, PTM generates the short text following the LDA process.)
- Apart from the self-aggregation topic model (SATM), self-aggregation methods like PTM are still rare in the literature. Although both PTM and SATM aggregate short texts into pseudo documents, their generative processes are essentially different. SATM assumes that short texts are generated in two phases. The first phase follows standard LDA to generate regular-sized pseudo documents; the second phase generates each short text from its pseudo document through a mixture of unigrams process. The first phase means that sampling a word takes $O(PK)$ time, which is computationally expensive. The second phase means that inference must independently estimate, for each short text, the probability distribution over pseudo documents, so the number of parameters grows linearly with the size of the corpus, which may cause serious overfitting when training samples are scarce. In sharp contrast, given the unique pseudo document a short text belongs to, PTM generates the short text following the LDA process. This means that sampling a word takes only $O(K)$ time and that the number of parameters is fixed, which avoids overfitting.
Remark 3
- It is also interesting to discuss the similarities and differences between PTM and the so-called Pachinko Allocation Model (PAM). PAM was proposed to capture arbitrary correlations between topics using a directed acyclic graph, and is therefore regarded as a more general version of LDA. Although the four-level PAM (Figure 2b) shows a model structure similar to that of PTM (Figure 2a), the two are essentially different. In Figure 2b, the second layer of PAM consists of super-topics that capture the commonalities among the sub-topics in the third layer (all in blue). In this sense, the number of topics decreases from the third layer to the second. In contrast, the second-layer nodes of PTM represent pseudo documents (green), which outnumber the topic nodes in the third layer (blue) and should be viewed as hybrid topics that can better generate short texts from combinations of specific topics.
2.2 Sparsification
As mentioned above, a pseudo document in PTM is essentially a hybrid topic that generates short texts from a combination of specific topics. Along this line, it is natural to conjecture that as the number of pseudo documents becomes smaller, the topics they represent tend to become ambiguous. To address this, we propose SPTM, a sparse version of PTM that applies a Spike and Slab prior to the topic distributions of pseudo documents.
The "Spike and Slab" prior is a well-established method in mathematics. It can decouple the sparsity and the smoothness of a distribution. In detail, auxiliary Bernoulli variables are introduced into the prior to indicate the "on" or "off" state of particular variables. Thus, **a model can determine whether the corresponding variables appear or not**. In our case, this indicates whether a topic is selected in a particular pseudo document.
Note that the Spike and Slab prior may produce an empty selection, which leads to an ill-defined probability distribution. Wang and Blei introduced never-appearing terms into the topic distribution, which may bring more trouble to the inference process. We therefore apply the weak smoothing prior and smoothing prior proposed by Lin et al., which avoid ill-defined distributions while applying the Spike and Slab prior directly. Moreover, this leads to a simpler inference process, which ensures the scalability of our model. To describe our sparse model better, we first define topic selectors, the smoothing prior, and the weak smoothing prior.
- Definition 1: For pseudo document $d'_l$, a topic selector $b_{l,k}$, $k \in \{1, \dots, K\}$, is a binary variable that indicates whether topic $k$ is relevant to $d'_l$. $b_{l,k}$ is sampled from $\mathrm{Bernoulli}(\pi_l)$, where $\pi_l$ is the Bernoulli parameter of $d'_l$.
- Bernoulli distribution: a random variable $X$ follows a Bernoulli distribution with parameter $p$ ($0 < p < 1$) if it takes the values 1 and 0 with probabilities $p$ and $1 - p$, respectively.
- Definition 2: The smoothing prior is the Dirichlet hyperparameter $\alpha$, used to smooth the topics selected by the topic selectors. The weak smoothing prior is another Dirichlet hyperparameter $\overline{\alpha}$, used to smooth the topics that are not selected. Since $\overline{\alpha} \ll \alpha$, the hyperparameter $\overline{\alpha}$ is called the weak smoothing prior.
- The topic selectors are called the "spikes", while the smoothing prior and the weak smoothing prior correspond to the "slab".
In this way, the sparsity and the smoothness of the topic proportions of pseudo documents are decoupled. Given the topic selectors $\vec{b}_l = \{b_{l,k}\}_{k=1}^{K}$, the topic proportions of pseudo document $d'_l$ are sampled from $\mathrm{Dir}(\alpha \vec{b}_l + \overline{\alpha} \vec{1})$. The introduction of $\overline{\alpha}$ repairs the ill-definedness of the distribution while preserving the sparsity effect.
Fig. 1b shows the plate notation of SPTM. The complete generative process of the pseudo documents is as follows:
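The pseudo-document part of this process can be sketched from Definitions 1 and 2. This is a minimal sketch, not the authors' listing: it assumes that $\pi_l$ is drawn from a $\mathrm{Beta}(\gamma_1, \gamma_0)$ prior (the section only mentions the Beta hyperparameters $\gamma_1$ and $\gamma_0$), and the function names and default values are hypothetical.

```python
import numpy as np

def sample_selectors(K, gamma1, gamma0, rng):
    """Draw the 'spikes': pi_l from an assumed Beta(gamma1, gamma0), then K Bernoulli selectors."""
    pi_l = rng.beta(gamma1, gamma0)
    return rng.binomial(1, pi_l, size=K)

def sample_topic_proportions(b_l, alpha, alpha_bar, rng):
    """Draw theta_l from Dir(alpha * b_l + alpha_bar * 1), the 'slab'."""
    concentration = alpha * np.asarray(b_l, dtype=float) + alpha_bar
    return rng.dirichlet(concentration)

rng = np.random.default_rng(0)
b_l = np.array([1, 0, 1, 0, 0])  # topics 0 and 2 are switched on for this pseudo document
theta_l = sample_topic_proportions(b_l, alpha=1.0, alpha_bar=1e-7, rng=rng)
```

With $\overline{\alpha} = 10^{-7}$ the switched-off topics receive almost no probability mass, yet every concentration parameter stays strictly positive, so the Dirichlet remains well defined even if no topic is selected.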
2.3 Inference
- Exact posterior inference is intractable in our model, so we turn to a collapsed Gibbs sampling algorithm for approximate posterior inference. The algorithm is simple, comparable in speed to other estimators, and can approximate a global maximum. Due to space limits, we omit the derivation details and give only the sampling formulas.
- We describe the inference for SPTM here and the inference for PTM at the end of this section. With $\theta$, $\phi$, $\psi$ and $\pi$ integrated out analytically, the latent variables to be sampled are the pseudo-document assignments $l$, the topic assignments $z$, and the topic selectors $b$. We also sample the Dirichlet hyperparameter $\alpha$ and the Beta hyperparameter $\gamma_1$, and fix $\overline{\alpha} = 10^{-7}$ and $\gamma_0 = 1$.
- Sampling pseudo-document assignments $l$. Given the remaining variables, sampling $l$ is similar to the sampling method for the Dirichlet multinomial mixture. That is,
where $M_l$ is the number of short texts assigned to the $l$-th pseudo document $d'_l$; $N_{d_s}$ is the length of the $s$-th short text $d_s$; $N_{d_s}^{z}$ is the number of tokens in $d_s$ assigned to topic $z$; $N_l^{z}$ is the number of tokens in $d'_l$ assigned to topic $z$; and $N_l$ is the total number of tokens in $d'_l$. All counts marked with $\lnot d_s$ exclude the counts from $d_s$. $b_{l,z}$ is the topic selector of topic $z$ for pseudo document $d'_l$. $A_l = \{z : b_{l,z} = 1, z \in \{1, \dots, K\}\}$ is the set of "on" indices of $\vec{b}_l$, and $|A_l|$ is the size of $A_l$.
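Using the counts defined above, the update for $l$ can be sketched as an unnormalized log weight per candidate pseudo document. This is a reconstruction under assumptions, not the article's Equation 1: it follows the standard Dirichlet multinomial mixture form, assumes a symmetric Dirichlet prior with hypothetical hyperparameter `lam` on $\psi$, and uses illustrative argument names.

```python
import math

def log_weight_pseudo_doc(M_l, N_l_z, doc_topic_counts, on_topics, K, D, P,
                          alpha, alpha_bar=1e-7, lam=0.1):
    """Unnormalized log p(l_{d_s} = l | rest) for one candidate pseudo document l.

    M_l:              short texts assigned to l, excluding d_s
    N_l_z:            dict topic -> tokens in l assigned to that topic, excluding d_s
    doc_topic_counts: dict topic -> tokens of d_s assigned to that topic
    on_topics:        the set A_l of topics whose selector b_{l,z} equals 1
    """
    N_l = sum(N_l_z.values())
    # Mixture part: how popular pseudo document l is (assumed Dirichlet(lam) prior on psi).
    logw = math.log(M_l + lam) - math.log(D - 1 + P * lam)
    # Numerator: tokens of d_s added topic by topic, with prior b_{l,z} * alpha + alpha_bar.
    for z, n in doc_topic_counts.items():
        prior = (alpha if z in on_topics else 0.0) + alpha_bar
        for j in range(n):
            logw += math.log(N_l_z.get(z, 0) + prior + j)
    # Denominator: all tokens of d_s, with total prior |A_l| * alpha + K * alpha_bar.
    total_prior = len(on_topics) * alpha + K * alpha_bar
    for i in range(sum(doc_topic_counts.values())):
        logw -= math.log(N_l + total_prior + i)
    return logw
```

In a sampler, these log weights would be computed for every $l \in \{1, \dots, P\}$, exponentiated after subtracting the maximum, normalized, and used to draw the new assignment.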
- Sampling topic assignments $z$. The method for sampling topic assignments $z$ is similar to that of latent Dirichlet allocation. The difference is that $\theta$ no longer belongs to the original short text but to its pseudo document, and that $\theta$ is sampled from the Spike and Slab prior rather than from a symmetric Dirichlet prior. That is,
where $N_z^{w}$ is the number of times word $w$ is assigned to topic $z$, and $N_z = \sum_{w=1}^{V} N_z^{w}$.
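The topic update can be sketched in the same spirit. This is a reconstruction that follows the usual collapsed LDA form with the Spike and Slab term $b_{l,z}\alpha + \overline{\alpha}$ in place of a symmetric $\alpha$; the symmetric Dirichlet hyperparameter `beta` on $\phi$ and all argument names are assumptions for the example.

```python
import numpy as np

def topic_probabilities(N_l_z, b_l, N_z_w, N_z, V, alpha, alpha_bar=1e-7, beta=0.01):
    """Normalized p(z = k | rest) for one token w in a short text of pseudo document l.

    N_l_z: tokens in pseudo document l assigned to each topic (length K)
    b_l:   topic selectors of pseudo document l (length K, entries 0 or 1)
    N_z_w: times word w is assigned to each topic (length K)
    N_z:   total tokens assigned to each topic (length K)
    All counts exclude the token currently being resampled.
    """
    doc_part = np.asarray(N_l_z, float) + alpha * np.asarray(b_l, float) + alpha_bar
    word_part = (np.asarray(N_z_w, float) + beta) / (np.asarray(N_z, float) + V * beta)
    weights = doc_part * word_part
    return weights / weights.sum()

p = topic_probabilities(N_l_z=[3, 0, 1], b_l=[1, 0, 1],
                        N_z_w=[2, 5, 0], N_z=[10, 10, 10], V=20, alpha=0.1)
```

In this toy call, topic 1 is switched off and has no tokens in the pseudo document, so it receives a negligible probability even though the word is frequent under that topic.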
- Sampling topic selectors $b$. To sample $\vec{b}_l$, we follow Wang et al. and use $\pi_l$ as an auxiliary variable. Let
be the set of topic assignments of pseudo document $d'_l$. The joint conditional distribution of $\pi_l$ and $\vec{b}_l$ is given, where $I[\cdot]$ is an indicator function. With this joint distribution, we iteratively sample $\vec{b}_l$ conditioned on $\pi_l$ and $\pi_l$ conditioned on $\vec{b}_l$ to obtain samples of $\vec{b}_l$. Note that Wang et al., in view of slow convergence, integrate out $b$ and sample $\pi$. Since $V$ is large, searching for the optimal combination of terms is very expensive. In our case, however, $K$ is relatively small compared with $V$, and sampling $z$ conditioned on $\pi$ is very time consuming. Based on these considerations, we take the opposite approach and sample $b$ with $\pi$ integrated out.
For the hyperparameter $\alpha$, we use Metropolis-Hastings with a symmetric Gaussian distribution as the proposal distribution. For the concentration parameter $\gamma_1$, we use a previously developed method with a Gamma prior.
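A generic random-walk Metropolis-Hastings step with a symmetric Gaussian proposal looks roughly like this. It is a sketch of the general technique only: the conditional log density of $\alpha$ is not given in this section, so `log_target` is a hypothetical callable supplied by the user, and the step size is an arbitrary choice.

```python
import math
import random

def mh_step(alpha, log_target, step=0.1, rng=random):
    """One random-walk Metropolis-Hastings update for a positive scalar such as alpha."""
    proposal = alpha + rng.gauss(0.0, step)  # symmetric Gaussian proposal
    if proposal <= 0:                        # outside the support of alpha: reject
        return alpha
    log_ratio = log_target(proposal) - log_target(alpha)
    # The proposal is symmetric, so the acceptance ratio reduces to the target ratio.
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposal
    return alpha

# Toy target for demonstration only: an unnormalized Gamma(2, 1) log density.
toy_log_target = lambda a: math.log(a) - a
rng = random.Random(0)
alpha, samples = 1.0, []
for _ in range(2000):
    alpha = mh_step(alpha, toy_log_target, step=0.5, rng=rng)
    samples.append(alpha)
```

In the sampler, one or a few such steps would be interleaved with the Gibbs updates of $l$, $z$ and $b$, with `log_target` replaced by the log conditional of $\alpha$ given the current assignments.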
So far we have described the collapsed Gibbs sampling algorithm for SPTM. We now briefly describe the inference for PTM. After integrating out $\theta$, $\phi$ and $\psi$ analytically, the latent variables to be sampled are the pseudo-document assignments $l$ and the topic assignments $z$. Replacing $b_{l,z}\alpha + \overline{\alpha}$ with $\alpha$ and $|A_l|\alpha + K\overline{\alpha}$ with $K\alpha$ in Equation 1 gives the sampling equation for $l$. Similarly, replacing $b_{l,z}\alpha + \overline{\alpha}$ with $\alpha$ in Equation 2 gives the sampling equation for $z$.