Query Term Weighting: Computing Search Term Weights
2022-07-02 02:13:00 【人工智能曾小健】
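Query term weighting measures how important each term is once a query has been segmented into terms. The usual baseline is tf*idf. Within a query a term's tf is almost always 1, so the score is driven by idf: the more often a term appears across the corpus, the less information it carries, and the rarer it is, the more information it carries. However, a term's importance is not strictly monotonic in how often it appears, and idf takes no account of context. For example, "windows" is important in "windows应用软件" (windows application software) but much less so in "windows xp系统iphone xs导照片" (transferring iphone xs photos on a windows xp system). Term weights are a basic resource for tasks such as text relevance and term dropping, and the methods for improving them fall into three groups: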
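1) Based on corpus statistics
2) Based on click logs
3) Based on supervised learning
This article starts with some of the methods based on corpus statistics.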
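As a baseline for the corpus-statistics methods, the tf*idf weighting described above can be sketched as follows. This is a minimal illustration, not the author's implementation: the function names `compute_idf` and `tfidf_weights`, the toy corpus, and the smoothed idf formula `log((N + 1) / (df + 1)) + 1` are all assumptions made for the example.

```python
import math
from collections import Counter


def compute_idf(queries):
    """Return a smoothed idf score for every term in a tokenized query corpus."""
    n_queries = len(queries)
    # Document frequency: number of queries that contain the term at least once.
    df = Counter(term for q in queries for term in set(q))
    # Assumed smoothing; other idf variants would work equally well here.
    return {t: math.log((n_queries + 1) / (c + 1)) + 1.0 for t, c in df.items()}


def tfidf_weights(query, idf, default_idf=1.0):
    """Weight each term of one query by tf * idf, normalized to sum to 1."""
    tf = Counter(query)
    raw = {t: tf[t] * idf.get(t, default_idf) for t in tf}
    total = sum(raw.values())
    return {t: w / total for t, w in raw.items()}


# Toy corpus of already-segmented queries (hypothetical data).
corpus = [
    ["windows", "software"],
    ["windows", "xp", "system"],
    ["windows", "update"],
    ["iphone", "xs", "photos"],
]
idf = compute_idf(corpus)
weights = tfidf_weights(["windows", "software"], idf)
print(weights)
```

Because "windows" appears in three of the four toy queries, it gets a lower idf than "software" regardless of which query it occurs in. That is exactly the context-blindness the methods below try to address.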
I. imp (short for importance)
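One weakness of idf is that it relies only on comparing term frequencies. imp instead starts from the share of importance a term holds within each query it appears in, and refines the static term weights iteratively. The calculation is as follows (the formula image from the original post is missing from this copy):
where BT is the imp value of a term and can be initialized to 1, Tmp_i is the importance share of the i-th term within a query, and N is the number of queries that contain the i-th term.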
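One plausible reading of these definitions, offered as an assumption rather than as the original formula, is a two-step iteration: within each query, Tmp_i = BT_i divided by the sum of BT over that query's terms; then BT_i is reset to the average of Tmp_i over the N queries containing term i. The sketch below implements that reading. The function name `compute_imp`, the iteration count, and the toy queries are hypothetical.

```python
from collections import defaultdict


def compute_imp(queries, n_iters=10):
    """Iteratively estimate a static imp value (BT) for every term.

    Assumed update rule:
      Tmp_i(q) = BT_i / sum of BT_j over the terms j in query q
      BT_i     = mean of Tmp_i(q) over the N queries q that contain term i
    """
    # BT starts at 1 for every term, as the text suggests.
    bt = {t: 1.0 for q in queries for t in q}
    for _ in range(n_iters):
        share_sum = defaultdict(float)   # running sum of Tmp_i over queries
        n_containing = defaultdict(int)  # N: number of queries containing the term
        for q in queries:
            terms = set(q)
            total = sum(bt[t] for t in terms)
            for t in terms:
                share_sum[t] += bt[t] / total
                n_containing[t] += 1
        bt = {t: share_sum[t] / n_containing[t] for t in bt}
    return bt


# Toy segmented queries (hypothetical data).
queries = [
    ["iphone"],
    ["iphone", "case"],
    ["cheap", "iphone"],
    ["cheap", "case"],
]
imp = compute_imp(queries)
print(imp)
```

Under this reading, a term that often stands alone or dominates the queries it appears in ("iphone" here) ends up with a higher imp than terms that always share a query with others, even when their raw frequencies are close.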
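II. DIMP (Dynamic imp)
A weakness shared by idf and imp is that both assign static weights. DIMP computes a dynamic weight for each term from the context of the query. Its main assumption is that the term weights in any query can be computed from the term weights of its related queries. The calculation has two parts: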
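1) Top-down construction of the query tree
The construction method depends on the application scenario; the one described here is used in search. As shown in the figure below (the figure is not preserved in this copy), the given query is the root node. Its related queries are retrieved first and form the second layer. The sub-queries of each related query are then enumerated to form the third layer, and the last layer consists of the term nodes obtained by segmentation. Every node in the query tree is therefore a text string at some granularity, and every edge is a relatedness relation between two text strings. In the bid keyword recommendation task, user queries are fairly short keywords, so the corresponding query tree can be built from the co-purchase relations between bid keywords.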
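The four-layer structure described above can be sketched as a small tree builder. This is only an illustration under stated assumptions: the `Node` class, the `get_related` lookup, the whitespace tokenizer, and the choice to treat every contiguous term span as a sub-query are hypothetical stand-ins, since the text does not specify how related queries are retrieved or how sub-queries are enumerated.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    text: str
    level: str  # "query", "related", "subquery", or "term"
    children: list = field(default_factory=list)


def sub_queries(terms):
    """Enumerate contiguous term spans (an assumed definition of sub-query)."""
    n = len(terms)
    return [terms[i:j] for i in range(n) for j in range(i + 1, n + 1)]


def build_query_tree(query, get_related, tokenize):
    """Build the tree top-down: query -> related queries -> sub-queries -> terms."""
    root = Node(query, "query")
    for rel in get_related(query):
        rel_node = Node(rel, "related")
        for span in sub_queries(tokenize(rel)):
            sub_node = Node(" ".join(span), "subquery")
            # The last layer holds the segmented terms of each sub-query.
            sub_node.children = [Node(t, "term") for t in span]
            rel_node.children.append(sub_node)
        root.children.append(rel_node)
    return root


# Hypothetical related-query table; a real system might use click or session logs.
RELATED = {
    "windows xp system": ["windows xp install", "xp system download"],
}

tree = build_query_tree(
    "windows xp system",
    get_related=lambda q: RELATED.get(q, []),
    tokenize=str.split,
)
for rel in tree.children:
    print(rel.text, "->", [s.text for s in rel.children])
```

Passing `get_related` and `tokenize` in as functions keeps the sketch independent of any particular retrieval source, so the co-purchase relation mentioned for bid keywords could be plugged in the same way.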