Natural language processing (NLP) roadmap - KDnuggets
2020-11-09 00:40:00 [On jdon]
Due to the development of big data over the past decade, enterprises now need to analyze large amounts of data from a variety of sources every day.
Natural language processing (NLP) is the field of artificial intelligence dedicated to processing and using text and speech data in order to create intelligent machines and derive insights.
Preprocessing Techniques
Some of the most common techniques used to prepare text data for inference are:
- Tokenization: used to split the input text into its constituent words (tokens). This way, it becomes easier to convert our data into a numerical format.
- Stop word removal: used to remove the most common function words from our text (for example, articles such as "a" and "the"), since they can be regarded purely as a source of noise in our data (they do not carry any additional information).
- Stemming: finally, used to remove all affixes (such as prefixes or suffixes) from the data. This way, it becomes much easier for our algorithm to treat words that in fact have a similar meaning (for example, "insight" and "insightful") as the same word.
Using standard Python NLP libraries (for example, NLTK and spaCy), all of these preprocessing techniques can easily be applied to different types of text, as in the sketch below.
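Here is a minimal sketch of the three steps above using NLTK; the sample sentence and the printed output are only illustrative, and the `punkt` and `stopwords` resources are assumed to be downloadable.

```python
# A minimal sketch of the three preprocessing steps with NLTK.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)       # tokenizer model
nltk.download("stopwords", quiet=True)   # stop word list

text = "The insightful reviews were analyzed by the researchers."

# 1. Tokenization: split the raw string into word tokens.
tokens = word_tokenize(text.lower())

# 2. Stop word removal: drop common function words.
stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t.isalpha() and t not in stop_words]

# 3. Stemming: strip affixes so related word forms share one root.
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in filtered])
# e.g. ['insight', 'review', 'analyz', 'research']
```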
In addition, in order to infer the grammatical structure of a text, we can use techniques such as part-of-speech (POS) tagging and shallow parsing (Figure 1). Using these techniques, we explicitly tag each word with its lexical category (based on the grammatical context of the phrase).
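A sketch of both techniques, using NLTK's default tagger and a noun-phrase grammar chosen purely for illustration:

```python
# POS tagging and shallow parsing (chunking) with NLTK.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")

# Tag each word with its lexical category (DT, JJ, NN, ...).
tagged = nltk.pos_tag(tokens)
print(tagged)  # [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ...]

# Shallow parsing: group tagged words into noun-phrase chunks using a
# simple illustrative grammar (optional determiner + adjectives + noun).
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
print(chunker.parse(tagged))
```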
Modeling Techniques
- Bag of Words
Bag of Words is a technique used in natural language processing and computer vision to create new features for training classifiers (Figure 2). This technique is implemented by constructing a histogram that counts all the words in a document (regardless of word order and grammar rules).
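A minimal sketch of this word-count histogram, using scikit-learn's CountVectorizer on two toy documents:

```python
# Bag of Words: build a document-term count matrix with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # the vocabulary
print(counts.toarray())                    # one count row per document
```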
One of the main problems that can limit the effectiveness of this technique is the presence of prepositions, pronouns, articles, and so on in our text. These can all be considered words that appear frequently in our text without being truly informative about the main characteristics and topics of our documents.
In order to solve this type of problem, a technique commonly referred to as "Term Frequency-Inverse Document Frequency" (TFIDF) is used. TFIDF aims to rescale the word counts of a text by weighing how frequently each word appears across a large collection of texts. Using this technique, we reward words that appear quite commonly in our text but rarely in other texts (their frequency value is scaled up proportionally), while penalizing words that appear frequently both in our text and in other texts, such as prepositions and pronouns (their frequency value is scaled down proportionally).
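The same toy documents re-weighted with TF-IDF, again as a minimal scikit-learn sketch:

```python
# TF-IDF: words shared by every document ('the', 'sat', 'on') are
# down-weighted, while document-specific words ('cat', 'dog') keep
# relatively higher weights.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)

print(tfidf.get_feature_names_out())
print(weights.toarray().round(2))
```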
- Latent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation (LDA) is a topic modeling technique. Topic modeling is a field of research focused on finding ways to cluster documents in order to discover latent distinguishing markers that can characterize them based on their content (Figure 3). Topic modeling can therefore also be considered a dimensionality reduction technique, since it allows us to reduce our initial data to a limited set of clusters.
Latent Dirichlet Allocation (LDA) is an unsupervised learning technique used to find latent topics that can characterize different documents and to cluster similar documents together. The algorithm takes as input the number N of topics believed to exist, and then groups the different documents into N clusters of closely related documents.
What distinguishes LDA from other clustering techniques (such as K-means clustering) is that LDA is a soft clustering technique (each document is assigned to a cluster based on a probability distribution). For example, a document can be assigned to cluster A because the algorithm determines that it is 80% likely to belong to that class, while still taking into account that some features embedded in the document (the remaining 20%) are more likely to belong to a second cluster B.
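A minimal sketch of such soft clustering with scikit-learn's LatentDirichletAllocation, on a hypothetical four-document corpus:

```python
# LDA: fit N=2 topics and inspect each document's topic distribution.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the stock market and trading prices",
    "football players scored in the match",
    "investors watch the stock market every day",
    "the team won the football game",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Each row is a probability distribution over the 2 topics, so a
# document can be e.g. 80% topic A and 20% topic B (soft clustering).
print(doc_topics.round(2))
```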
- Word embedding
Word embeddings are one of the most common ways to encode words as numerical vectors, which can then be fed into our machine learning models for inference. Word embeddings aim to reliably map our words into a vector space so that similar words are represented by similar vectors.
Nowadays, there are three main techniques used to create word embeddings: Word2Vec, GloVe, and fastText. All three techniques use shallow neural networks to create the desired word embeddings.
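A toy Word2Vec sketch using gensim (assuming the gensim 4.x API); real embeddings require a much larger corpus, so this only illustrates the mechanics:

```python
# Word2Vec: learn small vectors from a tiny illustrative corpus.
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "chases", "the", "cat"],
]

# vector_size: embedding dimension; window: context size;
# min_count=1 keeps every word of this toy corpus in the vocabulary.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)

print(model.wv["king"][:5])                   # first 5 components
print(model.wv.most_similar("king", topn=2))  # nearest neighbors
```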
- Sentiment analysis
Sentiment analysis is an NLP technique commonly used to understand whether some form of text expresses a positive, negative, or neutral sentiment about a subject. It can be particularly useful, for example, when trying to find out the general public opinion (through online reviews, tweets, and so on) about a topic, product, or company.
In sentiment analysis, the sentiment of a text is usually expressed as a value between -1 (negative sentiment) and 1 (positive sentiment), which is called the polarity.
Sentiment analysis can be considered an unsupervised learning technique, since we usually do not have hand-crafted labels for our data. To overcome this obstacle, we make use of pre-labeled lexicons (collections of words) that quantify the sentiment of a large number of words in different contexts. Two examples of lexicons widely used in sentiment analysis are TextBlob and VADER.
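Minimal sketches with the two tools just mentioned: TextBlob reports a polarity in [-1, 1], and VADER's "compound" score covers the same range.

```python
# Lexicon-based sentiment with TextBlob and VADER (via NLTK).
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from textblob import TextBlob

nltk.download("vader_lexicon", quiet=True)

text = "I love this product, it works great!"

print(TextBlob(text).sentiment.polarity)      # positive, close to 1

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores(text)["compound"])  # also strongly positive
```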
- Transformer
Transformers represent the current state of the art in NLP models for analyzing text data. BERT and GPT-3 are two well-known examples of Transformer models.
Before the creation of Transformers, recurrent neural networks (RNNs) were the most effective way to analyze text data for prediction, but they struggled to reliably exploit long-term dependencies (for example, our network might find it difficult to understand that a word fed in many iterations earlier could still be useful for the current iteration).
Transformers successfully overcame this limitation thanks to a mechanism called attention, which is used to determine which parts of the text to focus on and give the most weight to. Moreover, Transformers make it easier to process text data in parallel rather than sequentially (thereby improving execution speed).
Nowadays, thanks to the Hugging Face library, Transformers can easily be implemented in Python.
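For instance, here is a minimal sketch using Hugging Face's transformers library; the sentiment-analysis pipeline downloads a pretrained model (a distilled BERT, by default at the time of writing) on first use.

```python
# A pretrained Transformer in a few lines via Hugging Face pipelines.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("Transformers make NLP much easier!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```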