NLP natural language processing - Introduction to machine learning and natural language processing (3)
2022-07-26 07:25:00 【Emperor Confucianism is supreme】
NLP natural language processing - Introduction to machine learning and natural language processing - New word discovery and TF-IDF
1. New word discovery
(1) Why do we need new word discovery
① Without an existing vocabulary, we need some way to find words in the first place;
② As the amount of data grows, an old vocabulary gradually fails to meet later needs;
③ A supplemented vocabulary helps downstream tasks;
④ A word is essentially a fixed collocation: its inside is stable, which is called internal solidity, while its outside is variable, which is measured by left and right entropy.
For example, the characters inside the word "Hebei" always appear together, but what follows the word is not fixed.
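The two measures above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names, the toy string (where "AB" plays the role of a stable word such as "Hebei"), and the PMI-style definition of solidity are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def ngram_stats(text, max_len=2):
    """Count n-grams up to max_len and record their left/right neighbor characters."""
    counts = Counter()
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            counts[gram] += 1
            if i > 0:
                left[gram][text[i - 1]] += 1
            if i + n < len(text):
                right[gram][text[i + n]] += 1
    return counts, left, right


def entropy(neighbor_counts):
    """Shannon entropy (in nats) of a neighbor-character distribution."""
    total = sum(neighbor_counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in neighbor_counts.values())


def solidity(gram, counts, total_chars):
    """Assumed PMI-style solidity: log of p(gram) over the product of p(char).

    Simplification: every probability is estimated as count / total_chars.
    """
    p_gram = counts[gram] / total_chars
    p_parts = 1.0
    for ch in gram:
        p_parts *= counts[ch] / total_chars
    return math.log(p_gram / p_parts)


# Toy text: "AB" always occurs as a unit, but is followed by x, y, or z.
text = "ABxAByABzxyzxyz"
counts, left, right = ngram_stats(text)
print("solidity AB:", solidity("AB", counts, len(text)))
print("solidity Bx:", solidity("Bx", counts, len(text)))
print("right entropy AB:", entropy(right["AB"]))
print("right entropy A:", entropy(right["A"]))
```

In this toy text, "AB" scores higher on solidity than "Bx", and its right entropy is high because three different characters follow it, whereas "A" is always followed by "B" and so has zero right entropy. A candidate with both high solidity and high left/right entropy is a plausible new word.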
(2) What is an important word
① After segmenting a text, we rely on its words to understand the document, so what we need are the important words in the document;
② If a word appears many times in one category of text (say class A) and rarely in the other categories (non-A), then it is an important (high-weight) word for class A.
Conversely, if a word appears in many categories, it carries little importance for any single one.
③ Word importance is described mathematically by TF-IDF. TF is the term frequency: the number of times a word appears in a category divided by the total number of words in that category. IDF is the inverse document frequency: a high IDF means the word rarely appears in other documents.
Calculation: each word gets one TF·IDF value per category; a high TF·IDF value means the word is highly important for that category.
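The calculation can be sketched as follows. The text does not give an explicit IDF formula, so this sketch assumes the common variant log(N / df), where N is the number of categories and df is the number of categories containing the word; the function name and the toy corpus are also hypothetical.

```python
import math
from collections import Counter


def tf_idf(category_tokens):
    """category_tokens: dict mapping category name -> list of already-segmented words."""
    n_categories = len(category_tokens)
    # Document frequency: in how many categories each word appears.
    df = Counter()
    for tokens in category_tokens.values():
        df.update(set(tokens))
    scores = {}
    for category, tokens in category_tokens.items():
        counts = Counter(tokens)
        total = len(tokens)
        # TF = count / total words in the category; IDF = log(N / df) (assumed variant).
        scores[category] = {
            word: (count / total) * math.log(n_categories / df[word])
            for word, count in counts.items()
        }
    return scores


corpus = {
    "sports": ["match", "goal", "team", "goal", "the"],
    "finance": ["stock", "market", "the", "stock"],
}
scores = tf_idf(corpus)
for category, weights in scores.items():
    print(category, sorted(weights.items(), key=lambda kv: -kv[1]))
```

With this variant, "the" appears in every category and gets a weight of zero, while "goal" and "stock" come out as the highest-weight words of their categories.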
2. Characteristics of the TF-IDF algorithm
(1) TF-IDF depends heavily on the result of word segmentation; if the segmentation is wrong, the statistics lose much of their meaning;
(2) Each word has a different TF-IDF value for each document, so TF-IDF cannot be discussed apart from the data;
(3) TF-IDF cannot be calculated if there is only one text;
(4) Balance of data across categories is important;
(5) It is easily affected by special symbols, so some preprocessing is advisable.
3. Applications of the TF-IDF algorithm
(1) TF-IDF application: search engine
① For all existing web pages (texts), calculate the TF-IDF value of every word;
② Segment the input query into words;
③ For a document D, sum the TF-IDF values in D of the words in the query, and use the sum as the relevance score between the query and the document.
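The three steps above can be sketched as a tiny in-memory search. This is only an illustration under stated assumptions: whitespace splitting stands in for real word segmentation, IDF is taken as log(N / df), and the function names and documents are made up for the example.

```python
import math
from collections import Counter


def build_index(docs):
    """docs: dict doc_id -> list of tokens. Returns doc_id -> {word: tf-idf}."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    index = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        index[doc_id] = {
            w: (c / len(tokens)) * math.log(n_docs / df[w])
            for w, c in counts.items()
        }
    return index


def search(query, index):
    """Rank documents by the summed TF-IDF of the query words they contain."""
    # Whitespace split is a placeholder for a real word segmenter.
    query_words = query.lower().split()
    ranked = [
        (doc_id, sum(weights.get(w, 0.0) for w in query_words))
        for doc_id, weights in index.items()
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


docs = {
    "d1": "python tutorial for text processing".split(),
    "d2": "cooking tutorial for pasta".split(),
    "d3": "python snake care guide".split(),
}
index = build_index(docs)
results = search("python text", index)
print(results)
```

Here "d1" ranks first because it contains both query words, "d3" second because it contains only "python", and "d2" scores zero.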
(2) TF-IDF application: text summarization
① Use TF-IDF values to obtain the keywords of each text;
② A sentence that contains many keywords is considered a key sentence;
③ Select some key sentences as the summary of the text.
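A rough sketch of this extractive approach follows. Everything beyond the three steps is an assumption for illustration: sentences are split on periods, words on whitespace, IDF is log(N / df) with the target text counted as one of the N documents, and the helper names and sample texts are hypothetical.

```python
import math
from collections import Counter


def tokenize(sentence):
    """Placeholder segmenter: lowercase and split on whitespace."""
    return sentence.lower().split()


def top_keywords(target_tokens, all_docs_tokens, k=3):
    """Return the k words of the target text with the highest TF-IDF.

    all_docs_tokens must include the target text itself.
    """
    n_docs = len(all_docs_tokens)
    df = Counter()
    for tokens in all_docs_tokens:
        df.update(set(tokens))
    counts = Counter(target_tokens)
    weights = {
        w: (c / len(target_tokens)) * math.log(n_docs / df[w])
        for w, c in counts.items()
    }
    ranked = sorted(weights, key=lambda w: (-weights[w], w))
    return set(ranked[:k])


def summarize(text, other_texts, k=3, n_sentences=1):
    """Pick the sentences containing the most keywords, kept in original order."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    target_tokens = [w for s in sentences for w in tokenize(s)]
    all_docs = [target_tokens] + [tokenize(t.replace(".", " ")) for t in other_texts]
    keywords = top_keywords(target_tokens, all_docs, k)
    scored = [(sum(w in keywords for w in tokenize(s)), i) for i, s in enumerate(sentences)]
    best = sorted(scored, key=lambda p: -p[0])[:n_sentences]
    return [sentences[i] for _, i in sorted(best, key=lambda p: p[1])]


text = (
    "Solar panels convert sunlight into electricity. "
    "The weather was nice. "
    "Solar panels need sunlight to work well."
)
other_texts = ["The weather was cold today.", "The market was calm."]
summary = summarize(text, other_texts, k=3, n_sentences=1)
print(summary)
```

The sentence about the weather contains none of the keywords, so it is never selected. Raising `n_sentences` to 2 returns both sentences about solar panels.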
(3) TF-IDF application: text similarity calculation
After calculating TF-IDF for all texts, select the top n words with the highest TF-IDF values from each text and merge them into a word set S. For each text D, count the frequency in D of each word in S and use these counts as the vector of the text. The cosine of the angle between two such vectors gives the vector similarity, which is used as the similarity of the texts.
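The cosine of the angle between two vectors $A$ and $B$ is calculated as:
$$\cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^2}\,\sqrt{\sum_{i} B_i^2}}$$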
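The vector construction and the cosine computation described above can be sketched as follows. The word set S is hard-coded here as a hypothetical `vocabulary` rather than derived from TF-IDF rankings, and the documents are toy examples.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def to_vector(tokens, vocabulary):
    """Word-frequency vector of a text over the shared word set S."""
    counts = Counter(tokens)
    return [counts[w] for w in vocabulary]


vocabulary = ["python", "text", "pasta"]  # hypothetical word set S
doc_a = "python text python".split()
doc_b = "python text".split()
doc_c = "pasta pasta".split()

vec_a = to_vector(doc_a, vocabulary)
vec_b = to_vector(doc_b, vocabulary)
vec_c = to_vector(doc_c, vocabulary)

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

Texts that share keywords in similar proportions score close to 1, while texts with no keywords in common score 0.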
4. Advantages of TF-IDF
① Good interpretability: the keywords are clearly visible, so even when a prediction is wrong the cause is easy to find;
② Fast to compute: word segmentation takes most of the time, and the rest is simple counting;
③ Little dependence on annotated data: part of the work can be done with an unlabeled corpus;
④ Combines well with many other algorithms: the values can be treated as word weights.
5. Disadvantages of TF-IDF
① Strongly affected by the quality of word segmentation;
② Captures no semantic similarity between words (this is a critical flaw);
③ Carries no word order information (it is a bag-of-words model);
④ Limited capability: it cannot handle complex tasks such as machine translation and entity mining;
⑤ Sample imbalance has a large impact on the results;
⑥ The distribution of a word across the samples within a class is not considered.