Natural language processing (NLP) roadmap - KDnuggets
2020-11-09 00:40:00 【On jdon】
Because of the growth of big data over the past ten years, enterprises now need to analyze large amounts of data from a variety of sources every day.
Natural language processing (NLP) is the field of artificial intelligence dedicated to processing text and speech data in order to build intelligent machines and extract insights.
Preprocessing techniques
Some of the most common techniques for preparing text data for inference are:
- Tokenization: splits the input text into its constituent words (tokens). This makes it easier to convert the data into a numerical format.
- Stop-word removal: removes very common function words such as articles and prepositions (for example, "an", "the"). These words can be regarded as noise in the data because they carry no additional information.
- Stemming: removes affixes (such as prefixes or suffixes) from words. This makes it much easier for an algorithm to treat words with a similar meaning (for example, "insight" and "insightful") as the same word.
With standard Python NLP libraries (for example NLTK and spaCy), all of these preprocessing techniques can easily be applied to different types of text.
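The three steps above can be sketched with the Python standard library alone. This is a minimal illustration, not how NLTK or spaCy implement them: the stop-word list, the suffix list and all function names are assumptions made up for this example, and the stemmer is a crude suffix stripper rather than a real algorithm such as Porter's.

```python
import re

# Tiny illustrative stop-word list (an assumption; real libraries ship much larger ones).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "is", "are", "to"}

# Suffixes stripped by the toy stemmer, checked in this order.
SUFFIXES = ("ful", "ing", "ed", "s")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("The insightful analysts are sharing insights"))
# ['insight', 'analyst', 'shar', 'insight']
```

Note how "insightful" and "insights" collapse to the same stem, while "sharing" becomes the non-word "shar". Stems do not have to be dictionary words; they only need to be consistent.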
In addition, to infer the grammar and text structure of a language, we can use techniques such as part-of-speech (POS) tagging and shallow parsing (Figure 1). These techniques explicitly tag each word with its lexical category, which is determined from the word's syntactic context.
Modeling techniques
- Bag of Words
Bag of Words is a technique used in natural language processing and computer vision to create new features for training classifiers (Figure 2). It is implemented by building a histogram that counts every word in a document, ignoring word order and grammar rules.
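As a rough sketch of that histogram, the following standard-library code turns each document into a vector of word counts over a shared vocabulary. The function name `bag_of_words` and the whitespace tokenization are simplifying assumptions for this example.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, count vectors) for whitespace-tokenized documents."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    # The vocabulary is every distinct word seen in any document, sorted for stable columns.
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for words a document does not contain.
    vectors = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, vectors


vocabulary, vectors = bag_of_words(["the cat sat on the mat", "the dog sat"])
print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)     # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

The two vectors have the same length, so they can be fed to a classifier, but nothing in them records that "cat" came before "sat".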
One of the main problems that can limit the effectiveness of this technique is the presence of prepositions, pronouns, articles and similar words in the text. These words appear very often, even though they say little about the main features and topics of a document.
To address this problem, a technique called term frequency-inverse document frequency (TF-IDF) is commonly used. TF-IDF adjusts the word counts of a text by taking into account how often each word appears across a large collection of texts. Words that are common in one text but rare in the others are rewarded (their frequency value is scaled up), while words that appear frequently both in that text and in the others, such as prepositions and pronouns, are penalized (their frequency value is scaled down).
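The reweighting can be sketched in a few lines. This example assumes the plain textbook variant, in which a word's weight is its relative frequency in the document multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the word. Libraries typically add smoothing and normalization, so their numbers will differ. The function name `tf_idf` is made up for this example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dict per whitespace-tokenized document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur at least once?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # term frequency * inverse document frequency
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


docs = ["the cat sat", "the dog barked", "the bird sang"]
for doc_weights in tf_idf(docs):
    print({word: round(w, 3) for word, w in doc_weights.items()})
```

Here "the" occurs in every document, so log(3 / 3) = 0 and its weight vanishes, while words unique to one document keep a positive weight.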
- Latent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation (LDA) is a topic modeling technique. Topic modeling is a field of research focused on finding ways to cluster documents in order to discover latent distinguishing markers that characterize them by their content (Figure 3). Topic modeling can therefore also be seen as a dimensionality reduction technique, because it reduces the initial data to a limited set of clusters.
LDA is an unsupervised learning technique used to find latent topics that can represent different documents and to cluster similar documents together. The algorithm takes as input the number N of topics believed to exist, then groups the documents into N clusters of closely related documents.
LDA differs from other clustering techniques (for example K-means clustering) in that it is a soft clustering technique: each document is assigned to clusters according to a probability distribution. For example, a document can be assigned to cluster A because the algorithm determines that it belongs there with 80% probability, while still taking into account that some features of the document (the remaining 20%) are more likely to belong to a second cluster B.
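To make the soft assignment concrete, here is a small sketch of LDA fitted by collapsed Gibbs sampling, written with the standard library only. Gibbs sampling is one common way to fit LDA and is an assumption of this example, as are the function name `lda_gibbs`, the hyperparameter values `alpha` and `beta`, and the toy corpus. A production system would normally use a library implementation instead.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA on tokenized documents; return one topic mixture per document."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})

    doc_topic = [[0] * n_topics for _ in docs]   # words in document d assigned to topic k
    topic_word = [{} for _ in range(n_topics)]   # times word w is assigned to topic k
    topic_total = [0] * n_topics                 # words assigned to topic k overall

    # Start from a random topic for every word occurrence.
    assignments = []
    for d, doc in enumerate(docs):
        topics = []
        for word in doc:
            k = rng.randrange(n_topics)
            topics.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word] = topic_word[k].get(word, 0) + 1
            topic_total[k] += 1
        assignments.append(topics)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][word] -= 1
                topic_total[k] -= 1

                # Resample in proportion to
                # (topic share in this document) * (word share in this topic).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t].get(word, 0) + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]

                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][word] = topic_word[k].get(word, 0) + 1
                topic_total[k] += 1

    # Smoothed, normalized topic mixture per document: the soft assignment.
    return [
        [(count + alpha) / (len(doc) + n_topics * alpha) for count in counts]
        for doc, counts in zip(docs, doc_topic)
    ]


docs = [
    ["cat", "dog", "pet", "dog"],
    ["stock", "market", "trade", "stock"],
    ["cat", "pet", "market"],
]
for mixture in lda_gibbs(docs, n_topics=2):
    print([round(p, 2) for p in mixture])
```

Each printed row is a probability distribution over the two topics and sums to 1. A hard clustering method such as K-means would instead return a single cluster label per document.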
- Word embeddings
Word embedding is one of the most common ways to encode words as numerical vectors, which can then be fed into machine learning models for inference. Word embeddings aim to map words into a vector space so that similar words are represented by similar vectors.
Today, there are three main techniques for creating word embeddings: Word2Vec, GloVe and fastText. Word2Vec and fastText train shallow neural networks to create the required word embeddings, while GloVe fits its vectors to global word co-occurrence counts.
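"Similar vectors" is usually measured with cosine similarity. The sketch below shows the computation on three hand-made 3-dimensional vectors. The numbers are invented for illustration and do not come from Word2Vec, GloVe or fastText, whose vectors typically have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, invented for this example only.
embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

print(cosine_similarity(embeddings["king"], embeddings["queen"]))
print(cosine_similarity(embeddings["king"], embeddings["apple"]))
```

With these toy values, "king" and "queen" score much closer to 1 than "king" and "apple", which is the property a trained embedding is meant to have for related words.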
- Sentiment analysis
Sentiment analysis is an NLP technique commonly used to determine whether a piece of text expresses a positive, negative or neutral sentiment about a subject. It can be particularly useful, for example, when trying to find out the general public opinion about a topic, product or company (through online reviews, tweets and so on).
In sentiment analysis, the sentiment of a text is usually expressed as a value between -1 (negative sentiment) and 1 (positive sentiment), called polarity.
Sentiment analysis can be regarded as an unsupervised learning technique, because we usually do not have hand-labeled data. To overcome this obstacle, we use pre-labeled lexicons (collections of words) that quantify the sentiment of a large number of words in different contexts. Two widely used lexicon-based tools for sentiment analysis are TextBlob and VADER.
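The lexicon idea can be sketched as follows. The six-word lexicon and its scores are hypothetical, and averaging the scores of known words is a deliberately naive rule chosen for this example. It is not how TextBlob or VADER compute polarity; tools like these also account for effects such as negation and intensifiers.

```python
import re

# Hypothetical mini-lexicon: word -> polarity in [-1, 1].
LEXICON = {
    "good": 0.7,
    "great": 0.8,
    "love": 0.9,
    "bad": -0.7,
    "terrible": -0.9,
    "hate": -0.9,
}


def polarity(text):
    """Average the lexicon scores of known words; 0.0 if no word is known."""
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = [LEXICON[t] for t in tokens if t in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


for review in ["I love this great phone", "Terrible battery, I hate it", "The phone arrived"]:
    print(review, "->", round(polarity(review), 2))
```

Because every lexicon score lies in [-1, 1], so does their average, which matches the polarity range described above. A text with no known words is treated as neutral.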
- Transformers
Transformers represent the current state of the art in NLP models for analyzing text data. BERT and GPT-3 are well-known examples of Transformer models.
Before Transformers were created, recurrent neural networks (RNNs) were the most effective way to analyze text data for prediction. However, RNNs find it hard to reliably exploit long-term dependencies: for example, the network may struggle to recognize that a word seen many iterations earlier is useful for the current iteration.
Transformers overcome this limitation with a mechanism called attention, which determines which parts of the text to focus on and give more weight. In addition, Transformers make it easy to process text data in parallel rather than sequentially, which improves execution speed.
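A minimal sketch of the attention computation, in its scaled dot-product form, is shown below. It is simplified on purpose: real Transformers first pass the token vectors through learned projection matrices to obtain queries, keys and values, and they run several attention heads in parallel. Here the raw toy vectors are used directly, and the function names are made up for this example.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        output = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Self-attention: the same three toy token vectors act as queries, keys and values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights says how much one token attends to every token in the sequence, regardless of distance. Each query is handled independently of the others, which is why the computation parallelizes well.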
Today, Transformers are easy to implement in Python with the help of the Hugging Face library.
Copyright notice
This article was written by [On jdon]. Please include a link to the original when reposting. Thank you.