Natural Language Processing Series (I): Introduction and Overview
2022-07-05 12:56:00 【Yunlord】
Introduction
Natural language processing (NLP) is in a period of rapid development.
With joint progress in sensors, communications, chips and AI algorithms, the era of the Internet of Things is arriving. In that era, almost any object can collect comprehensive information through sensors and transmit massive amounts of it at high speed over 5G. AI algorithms running on cloud services then analyze the data, greatly improving the efficiency of social production.
AI technology falls mainly into two areas, computer vision and natural language processing, and both affect every aspect of people's lives. Computer vision focuses on understanding and processing images, while natural language processing is widely used in scenarios involving speech or text. The two naturally overlap in applications that need to understand and process images and text at the same time.
Seen this way, wherever there is text data there is a need for NLP technology. Even fintech now has substantial demand for text analysis, for example gauging market sentiment by reading news and research reports, or analyzing stock market data.
Over the past few years, text data has clearly been growing exponentially, a trend inseparable from the data explosion brought by the mobile Internet. Just imagine how much text is carried by the social apps we use every day, such as WeChat and TikTok. A sharp increase in text data is bound to drive a rapid increase in industry demand for text analysis, and with it demand for NLP talent.
Although the field currently looks intensely competitive, I believe it is a track that will keep widening, and that there is real room to make a difference in natural language processing.
I. Introduction to the NLP Series
(1) Motivation
This series was written for the following reasons:
- To train and attract new NLP/AI practitioners. The articles explain natural language knowledge and case studies, moving from simple to advanced.
- NLP is currently the hottest direction in AI. It has become very popular in recent years; although it took off a few years later than computer vision, its momentum is strong and is expected to continue for the next five years.
- There is currently no dedicated, systematic and detailed NLP tutorial series, especially in Chinese, that covers both of the following:
  - Techniques that predate deep learning
  - Deep-learning-based methods
- NLP has developed very quickly in recent years, and its knowledge base is updated rapidly. Many of the blockbuster AI results of recent years have come from natural language processing, including BERT a few years ago.

For beginners, NLP is also a faster way into AI and a sound direction to develop in, with a lower barrier to entry than CV.
(2) NLP Job Prospects
With the rapid development of the AI industry, competition for AI talent is becoming increasingly fierce. A recently released report, the Research Report on Talent Management in the AI Industry, shows that turnover in the AI industry rose year on year and that the supply of technical talent is tight: the supply-demand ratio for both algorithm design and application development positions is below 0.2.
According to the Ministry of Industry and Information Technology, the effective talent gap in China's AI industry is estimated to reach 300,000, with a particularly large gap in technical talent. According to the report, the supply of talent for algorithm research and application development positions is extremely tight, with supply-demand ratios of only 0.13 and 0.17 respectively. Supply for practical-skill positions basically meets demand, with a ratio of 0.98, though a slight shortage remains. By technical direction, computer vision talent is the scarcest, with a supply-demand ratio of only 0.09; natural language processing, machine learning and AI chips are also in short supply, at 0.2, 0.23 and 0.37 respectively.
In short, entering NLP now is not a mistake. Opportunities in this industry should persist for at least the next five years, but the market's requirements for talent will keep rising. The sooner you enter, therefore, the greater your advantage.
(3) How to Learn NLP
The introductory knowledge system for natural language processing is shown in the figure above. Here is a brief overview; later articles in the series cover each part in turn, with case studies and code walkthroughs to aid understanding.
- Mathematical foundations
- Programming foundations
- Machine learning
- Text preprocessing
- Word segmentation and word embeddings
- Models
- Application scenarios
Mathematical foundations:
First, some mathematics is needed. A basic grasp of advanced mathematics (calculus), linear algebra, statistics and optimization theory is enough; the specific theory can be picked up gradually as you go deeper.
Programming foundations:
Deep learning code is written mainly in Python, and there are now many easy-to-learn, easy-to-use deep learning frameworks, including PyTorch and TensorFlow 2. Either Linux or Windows works as the operating system, but Linux is the better fit for real industrial deployment.
Machine learning:
We also need some basic machine learning concepts. Complex NLP applications are now mostly solved with deep learning, which generally outperforms traditional machine learning methods. But deep learning descends directly from machine learning, and many machine learning concepts carry over, such as datasets, loss functions and overfitting.
Text preprocessing:
The first step in an NLP task, and a critical one, is text preprocessing. Once you have a text dataset, it goes through word segmentation, stop-word filtering, conversion of text to vectors and other preprocessing steps; only then can downstream tasks achieve good results. If the dataset is small, it can also be expanded with data augmentation.
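As a rough illustration of the preprocessing steps just described, here is a minimal sketch using only the Python standard library. The stop-word list, the function names and the two-sentence corpus are assumptions made up for this example; the text-to-vector step is shown as a simple bag-of-words count, which is only one of several possible choices.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def preprocess(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

def build_vocab(docs):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({tok for doc in docs for tok in doc})
    return {tok: idx for idx, tok in enumerate(vocab)}

def to_count_vector(tokens, vocab):
    """Turn a token list into a bag-of-words count vector."""
    counts = Counter(tokens)
    return [counts.get(tok, 0) for tok in vocab]

corpus = ["The cat sat in the hat.", "A cat is a cat."]
docs = [preprocess(doc) for doc in corpus]
vocab = build_vocab(docs)
vectors = [to_count_vector(doc, vocab) for doc in docs]
print(docs)     # [['cat', 'sat', 'hat'], ['cat', 'cat']]
print(vectors)  # [[1, 1, 1], [2, 0, 0]]
```

For Chinese text, the regular-expression tokenizer would be replaced by a proper word segmenter, since words are not separated by spaces.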
Word segmentation and word embeddings:
Text representation is the most basic task underlying every NLP application. Raw text cannot be fed directly into a model, so it must first be converted into vectors and then passed to the model for training. Text representation is thus the study of how to express text as vectors or matrices. The basic procedure is to split the text into a sequence of words and then represent each word with a feature vector.
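The idea of "one feature vector per word" can be sketched as a lookup table. In the example below the vectors are random placeholders standing in for trained embeddings, and the names `build_embedding_table`, `embed` and `mean_pool` are hypothetical helpers invented for this illustration. Averaging the word vectors is just one simple way to get a fixed-length text vector.

```python
import random

def build_embedding_table(vocab, dim, seed=0):
    """Assign each word a random feature vector (a stand-in for trained embeddings)."""
    rng = random.Random(seed)
    return {word: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for word in vocab}

def embed(tokens, table, dim):
    """Look up one vector per token; unknown words get an all-zero vector."""
    return [table.get(tok, [0.0] * dim) for tok in tokens]

def mean_pool(vectors, dim):
    """Average the word vectors into a single fixed-length text vector."""
    if not vectors:
        return [0.0] * dim
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]

DIM = 4
table = build_embedding_table(["natural", "language", "processing"], DIM)
matrix = embed(["natural", "language", "models"], table, DIM)  # 3 rows, 4 columns
text_vector = mean_pool(matrix, DIM)                           # 4 numbers
print(len(matrix), len(matrix[0]), len(text_vector))
```

The text becomes a matrix (one row per word) and, after pooling, a single vector, which is the form a model can take as input.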
Models:
Once text is represented as feature vectors, the vectors can be fed into a model to perform downstream tasks such as classification or translation. Later articles introduce various model architectures and walk through their code.
Application scenarios:
NLP has many application scenarios, such as speech recognition, machine translation, text classification and summary generation. We will explain them one by one through industrial engineering practice.
II. What Is NLP
(1) NLP Overview
Three concepts in natural language processing:
- Natural language processing (NLP)
- Natural language understanding (NLU): understanding the meaning of text
- Natural language generation (NLG): generating text from a meaning
People exchange information through speech, images and text, so the core problem is how to understand that information. The main task of NLP is to understand text and to generate text, which gives the formula:
NLP = NLU + NLG
The path from NLU to NLG is shown in the diagram below:

Overall, NLP has two main sides: one studies how to better understand the meaning that a text conveys (NLU), and the other studies how to generate text from an intended meaning (NLG).
(2) The Difficulties of NLP
Why is text harder to understand than images?

Images are generally what-you-see-is-what-you-get: there is usually no deeper meaning behind them, and they rarely mean different things in different contexts.
Text is different. What we see directly is only the surface text; we also have to understand the deeper meaning behind it. Specifically:
- Multiple expressions for one meaning: the same meaning can be phrased in many ways.
- Polysemy: the same word expresses different meanings in different contexts.
Beyond the technology itself, an industrial NLP pipeline usually involves many modules, typically including text cleaning, word segmentation, feature engineering, named entity recognition and classification. Errors accumulate at every step and ultimately affect the performance of the real system. When designing an NLP system, therefore, every stage matters and none can be neglected.
III. NLP Applications
Natural language processing has many application scenarios, including intelligent question answering, text generation and machine translation.
(1) Intelligent Question Answering Systems
With the rapid growth of the Internet, the volume of online information keeps increasing and people need more precise answers. Traditional search engine technology can no longer keep up with these rising expectations, and automatic question answering has become an effective way to address the problem. Automatic question answering is the task of having a computer answer users' questions automatically in order to meet their information needs. To answer a question, the system must first understand it correctly and extract the key information, then search and match against an existing corpus or knowledge base, and finally return the answer it finds to the user.
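The understand, match and return loop described above can be sketched with a tiny retrieval-based example. Everything here is an assumption for illustration: the two-entry `FAQ` knowledge base, the word-overlap (Jaccard) similarity and the `threshold` value. Production systems typically use far richer question understanding and matching.

```python
import re

# Hypothetical knowledge base: stored question -> answer.
FAQ = {
    "how do i reset my password": "Use the 'Forgot password' link on the login page.",
    "what are your opening hours": "We are open from 9:00 to 18:00 on weekdays.",
}

def tokenize(text):
    """Extract the key information as a set of lowercase word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Word-overlap similarity between two token sets (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def answer(question, faq=FAQ, threshold=0.2):
    """Return the answer of the best-matching stored question, or None."""
    query = tokenize(question)
    best = max(faq, key=lambda stored: jaccard(query, tokenize(stored)))
    if jaccard(query, tokenize(best)) < threshold:
        return None  # nothing in the knowledge base is close enough
    return faq[best]

print(answer("How can I reset the password?"))
print(answer("Tell me a joke"))  # None
```

Returning `None` below the threshold lets the caller fall back to another strategy instead of giving a confidently wrong answer.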
(2) Text Generation
Text generation is another important technology in natural language processing. Given some input information and a text generation model, users can produce text sequences that meet a specific goal. Text generation models have many application scenarios, such as generative reading comprehension, human-machine dialogue and intelligent writing. Progress in deep learning has advanced the technology, producing more and more highly usable text generation models that raise industry efficiency and serve an intelligent society.
(3) Machine Translation
With the rapid development of communication and Internet technology, the rapid growth of information and ever closer international ties, the challenge of letting everyone in the world access information across language barriers has gone beyond what human translation can handle.
Because it is efficient and cheap, machine translation meets the worldwide need for fast translation of multilingual information. Machine translation is a branch of natural language processing in which a computer system automatically converts text in one natural language into another without human help. Translation platforms launched by AI giants, such as Google Translate, Baidu Translate and Sogou Translate, have gradually taken a leading position in the translation industry thanks to their efficiency and accuracy.
(4) Sentiment Analysis
In the digital age, information overload is a real phenomenon: our access to knowledge and information far exceeds our ability to make sense of it. This trend shows no sign of slowing, so the ability to summarize the meaning of documents and information becomes ever more important. Sentiment analysis, a common application of natural language processing, lets us identify and absorb the relevant information in large volumes of data and also understand its deeper meaning. For example, companies analyze consumer feedback on their products, or screen online reviews for negative comments.
(5) Chatbots
Chatbots can support functions such as unattended order placement.
They fall into chit-chat bots and task-oriented bots. Chit-chat bots use generative methods, including models such as seq2seq and Transformer; task-oriented bots tend to use slot filling.
A chatbot can also be built in a way similar to a question answering system, that is, by retrieval.
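Slot filling for a task-oriented bot can be sketched as follows. This is a toy illustration under stated assumptions: the two slots (`item`, `quantity`), the regular expressions and the prompts are all invented for the example, and real systems usually fill slots with trained models rather than hand-written patterns.

```python
import re

# Hypothetical slots for a drink-ordering bot, each with a pattern and a prompt.
SLOT_PATTERNS = {
    "item": re.compile(r"\b(coffee|tea|juice)s?\b"),
    "quantity": re.compile(r"\b(\d+)\b"),
}
PROMPTS = {
    "item": "What would you like to order?",
    "quantity": "How many would you like?",
}

def fill_slots(utterance, slots):
    """Return a copy of `slots` with any still-empty slot found in the utterance filled in."""
    updated = dict(slots)
    text = utterance.lower()
    for name, pattern in SLOT_PATTERNS.items():
        match = pattern.search(text)
        if match and updated.get(name) is None:
            updated[name] = match.group(1)
    return updated

def next_action(slots):
    """Ask for the first missing slot, or confirm the order once all are filled."""
    for name in ("item", "quantity"):
        if slots.get(name) is None:
            return PROMPTS[name]
    return f"Ordering {slots['quantity']} x {slots['item']}."

state = fill_slots("I'd like some tea", {})
print(next_action(state))   # How many would you like?
state = fill_slots("2 please", state)
print(next_action(state))   # Ordering 2 x tea.
```

The dialogue state is just a dictionary of slots, and the bot keeps asking until every required slot has a value.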
(6) Spam Filtering
Spam filters have become the first line of defense against junk email. Yet many email users have run into the same problems: unwanted messages still arrive, or important messages get filtered out.
By analyzing the text content of an email, natural language processing can judge fairly accurately whether it is spam. Bayesian spam filtering is one of the most widely followed techniques. It learns from large numbers of spam and non-spam messages, collects characteristic words from them to build a spam vocabulary and a non-spam vocabulary, and then uses the word frequencies in these vocabularies to compute the probability that a message is spam and classify it accordingly.
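A minimal sketch of the Bayesian approach described above is a naive Bayes classifier with add-one (Laplace) smoothing. The class name, the four training messages and the tokenizer are assumptions made for illustration, and the sketch assumes at least one training message per class.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split a message into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayesSpamFilter:
    """Toy naive Bayes filter; assumes both classes have been trained at least once."""

    def __init__(self):
        # One word-frequency table per class: the "spam" and "non-spam" vocabularies.
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def spam_probability(self, text):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            total_words = sum(self.word_counts[label].values())
            # Log prior plus log likelihood of each word, with add-one smoothing.
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab)))
            log_scores[label] = score
        # Convert the two log scores into a normalized probability of spam.
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        return exp["spam"] / (exp["spam"] + exp["ham"])

spam_filter = NaiveBayesSpamFilter()
spam_filter.train("win free money now", "spam")
spam_filter.train("free prize claim now", "spam")
spam_filter.train("meeting schedule for monday", "ham")
spam_filter.train("lunch on monday", "ham")
print(spam_filter.spam_probability("free money") > 0.5)      # True
print(spam_filter.spam_probability("monday meeting") > 0.5)  # False
```

Working in log space avoids numerical underflow on long messages, and the smoothing keeps a word never seen in one class from driving its probability to zero.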
(7) Personalized Recommendation
Drawing on big data and historical behavior records, natural language processing can learn a user's interests and predict the user's rating of or preference for a given item, giving a precise picture of user intent; language matching is then used to achieve accurate matches. In news services, for example, the system combines what users read, how long they read, their comments and other preferences with their social networks and even the mobile device models they use. It analyzes the information sources and core vocabulary each user cares about in depth and pushes news accordingly, delivering personalized news and ultimately improving user stickiness.
(8) Information Extraction
Many important decisions in financial markets are increasingly made without human supervision and control. Algorithmic trading, a form of financial investment controlled entirely by technology, is becoming more and more popular. Many of these financial decisions, however, are influenced by the news. One major task of natural language processing is therefore to take these plain-text announcements and extract the relevant information in a format that can be fed into algorithmic trading decisions. For example, news of a merger between companies can have a significant effect on trading decisions, and feeding the merger details (including the parties and the purchase price) into a trading algorithm can mean a profit impact of millions of dollars.
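To make "extract the relevant information in a format an algorithm can use" concrete, here is a small pattern-based sketch. The headline, the company names and the regular expression are all hypothetical, and a fixed pattern like this only handles one phrasing; real extraction systems usually rely on trained models such as named entity recognizers.

```python
import re

# Hypothetical pattern for headlines like "<Acquirer> will acquire <Target> for $<N> billion".
COMPANY = r"[A-Z][\w&]*(?: [A-Z][\w&]*)*"
MERGER_PATTERN = re.compile(
    rf"(?P<acquirer>{COMPANY}) (?:will acquire|to acquire|acquires) "
    rf"(?P<target>{COMPANY}) for \$(?P<amount>\d+(?:\.\d+)?) (?P<unit>million|billion)"
)

def extract_merger(headline):
    """Return the parties and price of a merger as a dict, or None if no match."""
    match = MERGER_PATTERN.search(headline)
    if match is None:
        return None
    scale = {"million": 1e6, "billion": 1e9}[match.group("unit")]
    return {
        "acquirer": match.group("acquirer"),
        "target": match.group("target"),
        "price_usd": float(match.group("amount")) * scale,
    }

print(extract_merger("Acme Corp will acquire Beta Labs for $2.5 billion in cash."))
print(extract_merger("Markets closed higher on Friday."))  # None
```

The output is a structured record (parties and price) rather than free text, which is the form a downstream trading algorithm could consume.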
Beyond the applications above, there are many more scenarios, and even a single application can give rise to many different tasks. You can pick the topic that interests you most and study that area systematically, or read through all the related articles. I hope that by the end of this tutorial series you will have a deep understanding of at least one area.
The following tutorials therefore cover these application scenarios from three angles: algorithm theory, code implementation and project deployment, so that everyone can independently develop models and deploy services.
IV. Core NLP Technologies
(1) Three Dimensions of NLP Technology

The three dimensions are:
- Morphology (words): the meaning of words, their parts of speech and so on.
- Syntax (sentence structure): analyzing the components of a sentence according to the grammar of the language to obtain a syntax tree, and from it the relationships between the parts of the sentence.
- Semantics: understanding the meaning behind the sentence.
These are upstream tasks because they play the most fundamental role. If they are not done well, analysis at the sentence level or of the whole text cannot succeed. Only with solid underlying techniques can downstream tasks be served well.
For example, a text classification task depends heavily on how the text is represented, and the representation of text depends heavily on the representation of words. The core question is how to represent words better, which leads back to the upstream tasks.
(2) Key NLP Technologies
1. Word Segmentation
Word segmentation breaks long text such as sentences, paragraphs and articles into word-level units so that it can be processed and analyzed further. Chinese word segmentation is clearly more complex than English word segmentation, but it can be done by calling tools such as jieba directly, so in practice it can be treated as a largely solved task.
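To show why Chinese segmentation needs more than splitting on spaces, here is a sketch of forward maximum matching, a classic dictionary-based method. It is not how jieba works internally; the function name, the tiny dictionary and the `max_len` parameter are assumptions made for this illustration.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedily take the longest dictionary word at each position, scanning left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the scan can never stall.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens

# Hypothetical mini dictionary for the example sentence.
DICTIONARY = {"自然", "语言", "自然语言", "处理", "自然语言处理", "很", "有趣"}

print(forward_max_match("自然语言处理很有趣", DICTIONARY, max_len=6))
# ['自然语言处理', '很', '有趣']
print(forward_max_match("自然语言处理很有趣", DICTIONARY, max_len=4))
# ['自然语言', '处理', '很', '有趣']
```

The two calls show that the result depends on the dictionary and the maximum word length, which is one reason practical tools combine dictionaries with statistical models.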
2. Part-of-Speech Tagging
Part-of-speech tagging assigns the most appropriate part-of-speech tag to each word in a given sentence. Whether the tags are correct directly affects the later syntactic and semantic analysis, which makes tagging one of the basic topics of Chinese information processing.
3. Semantic Understanding
Semantic understanding means interpreting the meaning of natural language at every level: words, phrases, sentences, paragraphs and whole documents.
4. Named Entity Recognition
Named entity recognition analyzes a real text dataset (corpus) to identify and label specific named entities. It usually involves two key points: (1) recognizing the boundaries of each named entity; (2) determining the category it belongs to (for example person names, place names or organization names).
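The two key points, boundaries and categories, can be illustrated with a dictionary (gazetteer) lookup. This is a deliberately simple sketch: the `GAZETTEER` entries and labels are made up for the example, and modern NER systems learn both boundaries and categories with sequence labeling models instead.

```python
# Hypothetical gazetteer: entity name -> category label.
GAZETTEER = {
    "Alan Turing": "PER",
    "Google": "ORG",
    "New York": "LOC",
    "York": "LOC",
}

def find_entities(text, gazetteer=GAZETTEER):
    """Return (start, end, label) spans; longer names take priority over shorter ones."""
    entities = []
    for name in sorted(gazetteer, key=len, reverse=True):
        start = text.find(name)
        while start != -1:
            end = start + len(name)
            # Skip a match that overlaps an entity already found (e.g. "York" in "New York").
            overlaps = any(s < end and start < e for s, e, _ in entities)
            if not overlaps:
                entities.append((start, end, gazetteer[name]))
            start = text.find(name, end)
    return sorted(entities)

text = "Alan Turing never worked at Google in New York."
for start, end, label in find_entities(text):
    print(text[start:end], label)
```

Each result carries a boundary (the `start` and `end` offsets) and a category (the label), matching the two sub-problems named above.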
5. Syntactic Parsing
Syntactic parsing analyzes an input sentence to obtain its syntactic structure. Analyzing syntactic structure is needed for language understanding in its own right, since parsing is an important part of understanding, and it also supports other natural language processing tasks.
(3) NLP Technology Overview

The above are some of the technical terms related to natural language processing; later tutorials will introduce them step by step.
Summary
NLP has developed beyond expectations over the past ten years. With the development of large-scale language models, computers will break through the boundaries of language and take in more and more sensory information. As Manning said, models that can understand more sensory information will also be used more widely, and for that reason people may, within the next decade, see a more basic form of artificial intelligence that is generally applicable.
NLP therefore has a promising future, and I hope more people will join the field.
References:
Introduction to Natural Language Processing (NLP) - Zhihu
Greedy College NLP Course
