Natural Language Processing Series (I): Introduction and Overview
2022-07-05 12:56:00 【Yunlord】
Introduction
Natural language processing (NLP) is in a period of rapid development.
With the joint progress of sensors, communications, chips, and AI algorithms, the era of the Internet of Things is arriving. In that era, almost everything can collect comprehensive information through sensors and transmit massive amounts of it at high speed over 5G, while AI algorithms running on cloud services analyze the data and greatly improve society's production efficiency.
AI technology is mainly divided into two areas, computer vision and natural language processing, and both affect every aspect of people's lives. Computer vision focuses on understanding and processing images, while natural language processing is widely used in scenarios involving speech or text. The two naturally overlap in applications that need to understand and process images and text at the same time.
From this point of view, wherever there is text data there is a need for NLP technology. Even in financial technology there is now a large demand for text analysis, for example analyzing market sentiment by reading news and research reports, or analyzing stock market data.
Over the past few years, one obvious trend has been that text data is growing exponentially. This is inseparable from the data explosion brought by the mobile Internet. Just imagine how much text is carried by the social apps we use every day, such as WeChat and TikTok. A sharp increase in text data is bound to bring a rapid increase in the industry's demand for text analysis, and with it a demand for NLP talent.
Although the field currently looks crowded and highly competitive, I believe it is a track that will keep widening, and that there is real room to make a difference in natural language processing.
One. About this NLP series
(One) Motivation
This series was written for the following reasons:
- To cultivate and attract new NLP/AI practitioners. The articles in this series explain NLP knowledge and cases, moving from the simple to the in-depth.
- NLP is currently the hottest direction in AI. It has become very popular in recent years; although it took off a few years later than computer vision, its momentum is strong and is expected to continue over the next five years.
- There is currently no dedicated, systematic, and detailed series of NLP tutorials, especially in Chinese.
  - Techniques that predate deep learning
  - Methods based on deep learning
- NLP has developed very quickly in recent years, and its knowledge is updated rapidly. Many of AI's high-profile successes in recent years have come from natural language processing, including BERT a few years ago.
At the same time, for beginners NLP is a faster entry point into AI and a good development direction, with a lower barrier to entry than computer vision (CV).
(Two) NLP job market
With the rapid development of the artificial intelligence industry, competition for AI talent is becoming increasingly fierce. The recently released "Research Report on Talent Management in the AI Industry" shows that the turnover rate in the AI industry rose year on year, that the supply of technical staff is tight, and that the supply-demand ratio for both algorithm design and application development posts is below 0.2.
According to the Ministry of Industry and Information Technology, the effective talent gap in China's AI industry is estimated at 300,000, with a particularly large gap in technical talent. According to the report, talent for algorithm research and application development posts is extremely scarce, with supply-demand ratios of only 0.13 and 0.17 respectively. Supply for practical-skill posts basically meets demand, with a ratio of 0.98, which is still a slight shortage. By technical direction, computer vision talent is extremely scarce, with a supply-demand ratio of only 0.09; natural language processing, machine learning, and AI chips are also in short supply, with ratios of 0.2, 0.23, and 0.37 respectively.
In short, there is nothing wrong with entering the NLP field now. Opportunities in this industry will persist for at least the next five years, but the market's requirements for talent will keep rising. So the earlier you enter the industry, the greater your advantage.
(Three) How to learn NLP
[Figure: introductory knowledge system for natural language processing]
The figure above lays out an introductory knowledge system for natural language processing. I will introduce it briefly here; later articles in the series will cover each area in turn, with cases and code walkthroughs to help you understand it better.
- Mathematical foundations
- Programming basics
- Machine learning
- Text preprocessing
- Word segmentation and word embedding
- Models
- Application scenarios
Mathematical foundations:
First, we need some mathematical knowledge. A basic grounding in advanced mathematics, linear algebra, statistics, and optimization theory is enough. Specific theory can be picked up gradually as you study further.
Programming basics:
Deep learning code is mainly written in Python, and there are now many useful deep learning frameworks, including PyTorch and TensorFlow 2, that are easy to learn and use. As for the operating system, either Linux or Windows will do, but Linux is better suited to real industrial deployment.
Machine learning:
We need to learn some basic machine learning concepts. Complex problems in natural language processing are now mostly solved with deep learning, which usually outperforms traditional machine learning methods. But deep learning grew directly out of machine learning, and many machine learning concepts carry over to it, such as datasets, loss functions, and overfitting.
Text preprocessing:
The first step in a natural language processing task, and a key one, is text preprocessing. Once you have a text dataset, it needs to go through a series of preprocessing steps such as word segmentation, stop-word filtering, and conversion of text to vectors; only then can the downstream tasks achieve good results. If the dataset is small, it can also be expanded through data augmentation.
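The preprocessing steps described above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the pipeline used later in the series: the stop-word list, the sample sentences, and the function names `preprocess` and `to_count_vector` are assumptions made up for the example, and English whitespace-style tokenization stands in for real word segmentation.

```python
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in"}

def preprocess(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def to_count_vector(tokens, vocabulary):
    """Turn a token list into a bag-of-words count vector over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

docs = ["The cat sat in the hat.", "A cat is a cat."]
token_lists = [preprocess(d) for d in docs]
vocabulary = sorted({t for tokens in token_lists for t in tokens})
vectors = [to_count_vector(tokens, vocabulary) for tokens in token_lists]

print(vocabulary)  # ['cat', 'hat', 'sat']
print(vectors)     # [[1, 1, 1], [2, 0, 0]]
```

Each document ends up as a vector of the same length, which is the form a model can consume.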
Word segmentation and word embedding:
For every application of natural language processing, the most basic task is text representation. Raw text cannot be fed directly into a model, so we first convert it into vectors and then pass those to the model for training. Text representation is the study of how to express text as a vector or matrix. The basic approach is to split the text into a sequence of words and then represent each word with a feature vector.
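To make "one feature vector per word" concrete, here is a small sketch. It assumes whitespace-separated English words, and the embedding table is filled with random numbers purely as a placeholder; in a real model those values would be learned during training. The names `build_vocab`, `one_hot`, and `make_embedding_table` are hypothetical.

```python
import random

def build_vocab(sentences):
    """Give each distinct word an integer id, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.split():
            vocab.setdefault(word, len(vocab))
    return vocab

def one_hot(word, vocab):
    """Sparse representation: a vector with a single 1 at the word's id."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec

def make_embedding_table(vocab, dim, seed=0):
    """Dense representation: one row of `dim` numbers per word (random placeholder values)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in vocab]

sentences = ["i like nlp", "i like vision"]
vocab = build_vocab(sentences)
table = make_embedding_table(vocab, dim=3)

print(one_hot("nlp", vocab))  # [0, 0, 1, 0]
# A sentence becomes a matrix: one embedding row per word.
matrix = [table[vocab[w]] for w in "i like nlp".split()]
print(len(matrix), len(matrix[0]))  # 3 3
```

The one-hot vector grows with the vocabulary, while the embedding keeps a fixed, small dimension, which is one reason dense word vectors are preferred in practice.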
Models:
Once text is represented as feature vectors, the vectors can be fed into a model to perform downstream tasks such as classification or translation. Later articles will introduce various models and walk through their code.
Application scenarios:
NLP has many application scenarios, such as speech recognition, machine translation, text classification, and summary generation. We will explain them one by one through industrial engineering practice.
Two. What is NLP
(One) NLP overview
Three concepts in natural language processing:
- Natural language processing (NLP)
- Natural language understanding (NLU): understanding the meaning of text
- Natural language generation (NLG): generating text from meaning
People exchange information through speech, images, and text, so the core question is how to understand that information. The main task of NLP is to understand text and to generate text, which gives the formula:
NLP = NLU + NLG
The path from NLU to NLG is shown in the following diagram:
[Figure: from NLU to NLG]
Overall, NLP has two main aspects: one studies how to better understand the meaning that a text conveys (NLU), and the other studies how to generate text from the meaning to be expressed (NLG).
(Two) Why NLP is difficult
Why is text harder to understand than images?
Because images are generally what-you-see-is-what-you-get: there is usually no deeper meaning behind them, and the same image rarely means different things in different settings.
Text is different. What we see directly is only the surface text, and we also have to understand the deeper meaning behind it. Specifically:
- Many expressions for one meaning: the same meaning can be expressed in many ways.
- Polysemy: the same word expresses different meanings in different contexts. For example, "bank" can mean a financial institution or the side of a river.
Beyond the techniques themselves, industrial NLP pipelines usually involve many modules, typically including text cleaning, word segmentation, feature engineering, named entity recognition, classification, and so on. Every step accumulates error, which ultimately affects the performance of the real system. So when designing an NLP system, every stage matters and none can be neglected.
Three. NLP applications
Natural language processing has many application scenarios, including intelligent question answering systems, text generation, machine translation, and more.
(One) Intelligent question answering systems
With the rapid development of the Internet, the amount of information online keeps growing and people need more precise information. Traditional search engine technology can no longer meet these rising needs, and automatic question answering has become an effective way to address the problem. Automatic question answering is the task of using a computer to answer users' questions automatically in order to meet their knowledge needs. To answer a question, the system must first understand it correctly, extract the key information, search and match against an existing corpus or knowledge base, and return the resulting answer to the user.
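The "search and match against a knowledge base" step can be sketched as a nearest-neighbour lookup. The example below is a toy under stated assumptions: the knowledge base entries are invented, the matching uses plain bag-of-words cosine similarity rather than any learned model, and `answer` is a hypothetical helper name.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base of (stored question, answer) pairs.
KNOWLEDGE_BASE = [
    ("what is natural language processing",
     "NLP is the study of how computers understand and generate text."),
    ("what is machine translation",
     "Machine translation converts text from one natural language into another."),
    ("how do spam filters work",
     "Many spam filters estimate the probability that an email is spam from its words."),
]

def bag_of_words(text):
    """Count the lowercase word tokens in a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def answer(question):
    """Return the answer whose stored question is most similar to the user's question."""
    query = bag_of_words(question)
    _, best_answer = max(KNOWLEDGE_BASE, key=lambda qa: cosine(query, bag_of_words(qa[0])))
    return best_answer

print(answer("What is machine translation?"))
```

A production system would replace the word-overlap similarity with better question understanding, but the retrieve-then-return structure stays the same.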
(Two) Text generation
Text generation is another important technology in natural language processing. Given some existing information, a text generation model can produce text sequences that meet a specific goal. Its application scenarios are rich, for example generative reading comprehension, human-machine dialogue, and intelligent writing. The development of deep learning has also pushed this technology forward, and more and more highly usable text generation models have appeared, raising industry efficiency and serving an intelligent society.
(Three) Machine translation
With the rapid development of communication and Internet technology, the rapid growth of information, and ever closer international ties, the challenge of letting everyone in the world access information across language barriers has gone beyond what human translation can handle.
Machine translation, with its high efficiency and low cost, meets the worldwide need for fast translation of multilingual information. It is a branch of natural language processing in which a computer system automatically converts one natural language into another without human help. Translation platforms launched by large AI companies, such as Google Translate, Baidu Translate, and Sogou Translate, have gradually taken a leading position in the translation industry thanks to the efficiency and accuracy of their translation.
(Four) Sentiment analysis
In the digital age, information overload is a real phenomenon: our ability to acquire knowledge and information far exceeds our ability to understand it, and this trend shows no sign of slowing. The ability to summarize the meaning of documents and information is therefore increasingly important. Sentiment analysis, a common application of natural language processing, lets us identify and absorb the relevant information in large volumes of data and understand its deeper meaning. For example, companies analyze consumer feedback on their products, or check online reviews for negative comments.
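As a rough illustration of checking reviews for negative comments, here is a lexicon-based scorer. It is a deliberately simple sketch, not the approach the series will necessarily use: the word lists are invented for the example, and the only context it handles is a negation word directly before a sentiment word.

```python
import re

# Hypothetical sentiment lexicon, for illustration only.
POSITIVE = {"good", "great", "excellent", "love", "fast"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "slow", "broken"}
NEGATIONS = {"not", "never", "no"}

def sentiment_score(review):
    """Sum +1 for positive words and -1 for negative words, flipping after a negation."""
    tokens = re.findall(r"[a-z']+", review.lower())
    score = 0
    for i, token in enumerate(tokens):
        polarity = (token in POSITIVE) - (token in NEGATIVE)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            polarity = -polarity
        score += polarity
    return score

def label(review):
    """Map the numeric score to a coarse sentiment label."""
    score = sentiment_score(review)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

reviews = [
    "Great phone, fast delivery",
    "The screen arrived broken and support was terrible",
    "Not good",
]
for review in reviews:
    print(label(review), "-", review)
```

Lexicon methods miss sarcasm and domain-specific wording, which is why trained classifiers are usually preferred once labeled data is available.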
(Five) Chatbots
Chatbots can support functions such as unattended order placement.
They fall into two types, chit-chat bots and task-oriented bots. Chit-chat bots use generative methods, including models such as seq2seq and Transformer; task-oriented bots tend to use slot filling.
A chatbot can also be built in a way similar to a question answering system, that is, by retrieval.
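The slot-filling approach mentioned above for task-oriented bots can be sketched as follows. Everything here is assumed for illustration: the drink-ordering scenario, the slot names, and the regular expressions are made up, and a real system would typically use a trained model to detect slot values rather than fixed patterns.

```python
import re

# Hypothetical slots for a drink-ordering bot; names and patterns are illustrative.
SLOT_PATTERNS = {
    "size": re.compile(r"\b(small|medium|large)\b"),
    "drink": re.compile(r"\b(coffee|tea|juice)\b"),
    "quantity": re.compile(r"\b(\d+)\b"),
}

def fill_slots(utterance, slots):
    """Copy any slot values found in the utterance into the slot dictionary."""
    text = utterance.lower()
    for name, pattern in SLOT_PATTERNS.items():
        match = pattern.search(text)
        if match and slots.get(name) is None:
            slots[name] = match.group(1)
    return slots

def next_prompt(slots):
    """Ask for the first missing slot, or confirm the order once all slots are filled."""
    for name in SLOT_PATTERNS:
        if slots.get(name) is None:
            return f"What {name} would you like?"
    return f"Ordering {slots['quantity']} {slots['size']} {slots['drink']}."

slots = {}
fill_slots("I'd like a large coffee", slots)
print(next_prompt(slots))  # What quantity would you like?
fill_slots("2 please", slots)
print(next_prompt(slots))  # Ordering 2 large coffee.
```

The dialogue keeps asking until every slot has a value, which is what lets the bot complete a task such as placing an order without a human operator.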
(Six) Spam filtering
Spam filters have become the first line of defense against spam. Yet many e-mail users have run into the same problems: unwanted messages still arrive, or important messages get filtered out.
NLP analyzes the text content of an e-mail and can judge fairly accurately whether it is spam. Bayesian spam filtering is currently one of the most widely followed techniques. It learns from a large number of spam and non-spam messages, collects the characteristic words in them to build a spam vocabulary and a non-spam vocabulary, and then uses the word frequencies in these vocabularies to compute the probability that a message is spam and make a decision.
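The Bayesian procedure just described (two word-frequency tables, then a probability) can be written as a small naive Bayes classifier. This is a minimal sketch under assumptions: the four training messages are invented, words are split on whitespace, add-one (Laplace) smoothing is my choice for handling unseen word and class combinations, and words never seen in training are simply ignored.

```python
import math
from collections import Counter

def train(messages):
    """Build per-class word counts from (text, label) pairs; label is 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in messages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def spam_probability(text, word_counts, label_counts):
    """Estimate P(spam | text) with naive Bayes and add-one smoothing."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    words = [w for w in text.lower().split() if w in vocab]  # ignore unseen words
    total_messages = sum(label_counts.values())
    log_scores = {}
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_messages)  # class prior
        for word in words:
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        log_scores[label] = score
    # Normalise the two log scores into a probability.
    top = max(log_scores.values())
    weights = {label: math.exp(score - top) for label, score in log_scores.items()}
    return weights["spam"] / (weights["spam"] + weights["ham"])

training = [
    ("win free money now", "spam"),
    ("free prize click now", "spam"),
    ("meeting at noon tomorrow", "ham"),
    ("lunch with the team tomorrow", "ham"),
]
word_counts, label_counts = train(training)
print(spam_probability("free money", word_counts, label_counts) > 0.5)             # True
print(spam_probability("team meeting tomorrow", word_counts, label_counts) > 0.5)  # False
```

Working in log space avoids numerical underflow when a message contains many words.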
(Seven) Personalized recommendation
Natural language processing can learn a user's interests from big data and historical behavior records, predict the user's rating of or preference for a given item, and so understand the user's intent precisely; matching computed over language then achieves accurate matching. In news services, for example, a system can use what users read, how long they read, their comments and other preferences, along with their social networks and even the model of mobile device they use, to analyze the information sources and core vocabulary they care about, and then push news accordingly. This delivers personalized news and ultimately strengthens user stickiness.
(Eight) Information extraction
Many important decisions in financial markets are increasingly made without human supervision and control. Algorithmic trading, a form of financial investment controlled entirely by technology, is becoming more and more popular. However, many of these financial decisions are influenced by news. One of the main tasks of natural language processing here is therefore to take these plain-text announcements and extract the relevant information in a format that can be incorporated into algorithmic trading decisions. For example, news of a merger between companies may have a significant effect on trading decisions; feeding the merger details (including the parties involved and the purchase price) into the trading algorithm could affect profits by millions of dollars.
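As a sketch of turning a plain-text announcement into structured fields, the example below pulls the parties and the purchase price out of one sentence with a regular expression. It is only an illustration: the company names and the amount are invented, the pattern covers just a few fixed phrasings, and `extract_merger` is a hypothetical helper. Real announcements vary far too much for a single pattern, which is why trained extraction models are used in practice.

```python
import re

# Illustrative pattern for a few announcement phrasings; not a general solution.
MERGER_PATTERN = re.compile(
    r"(?P<acquirer>[A-Z][\w&]*(?: [A-Z][\w&]*)*) "
    r"(?:will acquire|agreed to acquire|acquires) "
    r"(?P<target>[A-Z][\w&]*(?: [A-Z][\w&]*)*) "
    r"for \$(?P<amount>[\d.]+) (?P<unit>million|billion)"
)
UNIT_MULTIPLIER = {"million": 1_000_000, "billion": 1_000_000_000}

def extract_merger(announcement):
    """Return the acquirer, target, and price in dollars, or None if nothing matches."""
    match = MERGER_PATTERN.search(announcement)
    if match is None:
        return None
    return {
        "acquirer": match.group("acquirer"),
        "target": match.group("target"),
        "price_usd": float(match.group("amount")) * UNIT_MULTIPLIER[match.group("unit")],
    }

# Invented example sentence.
print(extract_merger("Acme Corp agreed to acquire Globex Ltd for $2.5 billion in cash."))
```

The returned dictionary is the kind of fixed-format record that a downstream trading system could consume.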
Beyond the applications above, there are many other scenarios, and even a single application can give rise to many different tasks. You can pick the topic that interests you most and study that area systematically, or work through all the related articles. I hope that by the end of this series you will have a deep understanding of at least one area.
The following tutorials will therefore take these application scenarios in turn and explain each from three perspectives: algorithm theory, code implementation, and project delivery, so that everyone can independently develop models and deploy services.
Four. Core NLP technologies
(One) Three dimensions of NLP technology
The three dimensions are:
- Morphology (words): the meaning of words, their parts of speech, and so on.
- Syntax (sentence structure): analyzing the components of a sentence according to the grammar of the language to obtain a syntax tree, and from it the relationships between the different parts of the sentence.
- Semantics: understanding the meaning behind a sentence.
These are upstream tasks because they play the most fundamental role. If they are not done well, analysis at the sentence level or of the whole text cannot be done well either. Only with solid underlying techniques can downstream tasks be served well.
For example, a text classification task depends heavily on the representation of text, and the representation of text depends heavily on the representation of words. The core issue is how to represent words better, which brings us back to the upstream tasks.
(Two) Key NLP technologies
1. Word segmentation
Word segmentation breaks long text such as sentences, paragraphs, and articles into a data structure whose unit is the word, which makes later processing and analysis easier. Chinese word segmentation is clearly more complex than English, because Chinese text has no spaces between words. In practice, however, segmentation can be done by calling tools such as jieba directly, so it can be treated as a largely solved task.
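To show what a segmenter has to do for text without spaces, here is a sketch of forward maximum matching, a classic dictionary-based method. It is not how jieba works internally and is far less accurate; the tiny dictionary and the `max_len` value are assumptions chosen for the example.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedily take the longest dictionary word at each position.

    Falls back to a single character when no dictionary word matches.
    """
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words

# Hypothetical mini dictionary.
dictionary = {"自然", "语言", "自然语言", "处理", "有趣"}
print(forward_max_match("自然语言处理很有趣", dictionary))
# ['自然语言', '处理', '很', '有趣']
```

Because the method always prefers the longest match, "自然语言" is kept whole instead of being split into "自然" and "语言". With the jieba package installed, the equivalent call would be `jieba.lcut(text)`.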
2. Part-of-speech tagging
Part-of-speech tagging assigns the most appropriate part-of-speech tag to each word in a given sentence. Whether the tagging is correct directly affects subsequent syntactic and semantic analysis, and it is one of the basic topics of Chinese information processing.
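A very simple baseline makes the task concrete: tag each word with the tag it carried most often in a labeled corpus. The sketch below assumes a tiny hand-made English corpus and an arbitrary default tag of NOUN for unknown words; real taggers also use the surrounding context, which this baseline ignores.

```python
from collections import Counter, defaultdict

# Tiny hand-made tagged corpus, for illustration only.
TAGGED_SENTENCES = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("the", "DET"), ("runs", "NOUN"), ("were", "VERB"), ("long", "ADJ")],
    [("she", "PRON"), ("runs", "VERB"), ("fast", "ADV")],
]

def train_tagger(sentences):
    """For every word, remember the tag it was given most often."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(words, model, default="NOUN"):
    """Tag each word with its most frequent tag, or the default if unseen."""
    return [(word, model.get(word.lower(), default)) for word in words]

model = train_tagger(TAGGED_SENTENCES)
print(tag(["The", "cat", "runs"], model))
# [('The', 'DET'), ('cat', 'NOUN'), ('runs', 'VERB')]
```

The word "runs" appears as both a verb and a noun in the corpus, and the baseline always picks the more frequent verb reading. That is exactly the kind of ambiguity a context-aware tagger is meant to resolve.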
3. Semantic understanding
Semantic understanding refers to interpreting the meaning of natural language at every level (words, phrases, sentences, paragraphs, and chapters).
4. Named entity recognition
Named entity recognition analyzes real text data (a corpus) to identify and label specific named entities. It usually involves two key points: (1) recognizing the boundaries of a named entity; (2) determining the category it belongs to (for example person names, place names, organization names, and so on).
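The two key points, boundaries and categories, can be seen in a dictionary-lookup sketch. This is an assumption-heavy toy: the gazetteer entries are hand-picked, matching is exact and case-insensitive, and real systems learn to recognize entities they have never seen instead of relying on a fixed list.

```python
# Hypothetical gazetteer mapping token sequences to entity categories.
GAZETTEER = {
    ("new", "york"): "LOC",
    ("beijing",): "LOC",
    ("alan", "turing"): "PER",
    ("united", "nations"): "ORG",
}
MAX_ENTITY_LEN = max(len(key) for key in GAZETTEER)

def recognize_entities(tokens):
    """Return (start, end, category) spans, end exclusive, preferring the longest match."""
    lowered = [t.lower() for t in tokens]
    spans = []
    i = 0
    while i < len(tokens):
        for size in range(min(MAX_ENTITY_LEN, len(tokens) - i), 0, -1):
            category = GAZETTEER.get(tuple(lowered[i:i + size]))
            if category:
                spans.append((i, i + size, category))  # boundary + category
                i += size
                break
        else:
            i += 1  # no entity starts at this token
    return spans

tokens = "Alan Turing visited New York".split()
for start, end, category in recognize_entities(tokens):
    print(category, " ".join(tokens[start:end]))
# PER Alan Turing
# LOC New York
```

Each span records where the entity starts and ends (the boundary) and which type it is (the category).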
5. Syntactic parsing
Syntactic parsing is the process of analyzing an input sentence to obtain its syntactic structure. On the one hand, it is needed for language understanding itself, of which it is an important part; on the other hand, it supports other natural language processing tasks.
(Three) NLP technology overview
The above are some of the technical terms used in natural language processing; later tutorials will introduce them step by step.
Summary
NLP has developed beyond expectations over the past ten years. With the development of large-scale language models, computers will break through the boundaries of language and be able to take in more and more sensory information. As Manning has said, models that can understand more sensory information will also be applied more widely, and for that reason people may see a more basic form of generally applicable artificial intelligence within the next decade.
The future of NLP is therefore promising, and I hope more people will join the field.
If you would like to learn natural language processing together and go from beginner to expert, you can subscribe to my column "Natural Language Processing: From Beginner to Proficient". The project deployments and complete code walkthroughs in it are all free.