Speech and Language Processing (3rd ed. draft) Chapter 2: regular expressions, text normalization, edit distance (reading notes)
2022-07-27 06:52:00 【Haulyn5】
Preface
Title of this chapter: REGULAR EXPRESSIONS, TEXT NORMALIZATION, EDIT DISTANCE
I learned about this book from the DLHLP 2020 course. It is free and openly available, an excellent book on speech and NLP, and it also appears in the references recommended on the cs224n website. I am a hoarder when it comes to books, so of course I downloaded it right away and then read only a few pages. I recently found that blogging on CSDN works well: posts sync, drafts can be saved, and my desktop stays less cluttered. I plan to post more study notes here. I also hope to write more and practice writing things that are comfortable for others to read, not just understandable to me (which takes effort and patience, and probably several times longer).
Textbook website:
Speech and Language Processing
https://web.stanford.edu/~jurafsky/slp3/
These notes mostly follow the order of the original textbook and are essentially a condensed version of it.
Main text
Lemmatization: determining whether two words share the same root, despite their surface differences.
Lemma: “sing” is the common lemma of “sang”, “sung” and “sings”.
Stemming: a simpler version of lemmatization in which we mainly just strip suffixes from the end of the word.
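To make the difference concrete, here is a minimal sketch of suffix stripping. The function `naive_stem` and its suffix list are hypothetical, made up for this illustration; it is not the Porter stemmer or any algorithm from the book.

```python
# Hypothetical suffix list, chosen only for this illustration.
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

stems = {w: naive_stem(w) for w in ["sing", "sings", "singing", "sang", "sung"]}
print(stems)
```

The regular forms "sings" and "singing" are reduced to "sing", but "sang" and "sung" come back unchanged. Linking those to "sing" is what lemmatization adds on top of simple suffix stripping.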
Regular expressions come in many variants; the one introduced in the book is called extended regular expressions.
Regular expressions are case sensitive.
square brackets "[ ]": a set of characters, any one of which may match. For example, to make a single letter case insensitive, I can use the following pattern.
[gG]reat matches great or Great.
The brackets can hold many characters: [0123456789] matches any single digit. Writing them all out is tedious, so a dash "-" can express a range. Unlike ranges in Python and some other programming languages, a regular expression range includes both endpoints.
So [1-3] matches 1, 2 or 3. [0-9] matches any digit, and [A-Z] and [a-z] match any uppercase and any lowercase letter respectively.
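A quick way to check these bracket rules is Python's `re` module. This is only a sketch; the variable names are mine, and `fullmatch` is used so that the whole string has to fit the pattern.

```python
import re

# [gG] allows either case for the first letter only.
great = re.compile(r"[gG]reat")
print(bool(great.fullmatch("great")))  # True
print(bool(great.fullmatch("Great")))  # True
print(bool(great.fullmatch("GREAT")))  # False: the rest is still case sensitive

# A range includes both endpoints: [1-3] covers 1, 2 and 3.
one_to_three = re.compile(r"[1-3]")
matched_digits = [d for d in "0123456789" if one_to_three.fullmatch(d)]
print(matched_digits)  # ['1', '2', '3']
```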
caret ^: at the start of a square-bracket set, it negates the set, so any character not listed matches. Anywhere else inside the brackets it is an ordinary character that matches itself.
[^123] matches any character except 1, 2 or 3. [1^] matches 1 or ^. In the book's notation, 1^ matches the two consecutive characters 1^; many engines (Python's re, for example) treat an unescaped ^ outside brackets as an anchor, so there it has to be written 1\^.
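The three caret cases can be tried out in Python's `re` module. This is a sketch with made-up sample strings; the last part shows that Python needs the caret escaped outside brackets.

```python
import re

# Leading ^ inside brackets negates the set.
negated = re.findall(r"[^123]", "a1b2c3d4")
print(negated)  # ['a', 'b', 'c', 'd', '4']

# A ^ that is not first in the set is a literal caret.
literal_in_set = re.findall(r"[1^]", "1^2")
print(literal_in_set)  # ['1', '^']

# Outside brackets, Python treats a bare ^ as the start-of-string anchor,
# so the literal two-character sequence "1^" needs an escaped caret.
escaped = re.search(r"1\^", "x1^y")
unescaped = re.search(r"1^", "x1^y")
print(escaped.group())  # 1^
print(unescaped)        # None
```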
question mark ?: the preceding character is optional; it may appear once or not at all.
For example, dogs? matches dog or dogs.
asterisk *: matches zero or more occurrences of the preceding character. Think of it as an upgraded question mark.
Ah*! matches A!, Ah!, Ahh! and so on. Interestingly, a* can match 123, because zero occurrences of "a" still count as a (empty) match. Combining this with the earlier points, a[12]* is an a followed by zero or more 1s or 2s, so it matches a111, a222 and a121 in full; in a345 it matches only the a.
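The examples for ? and * can be verified with a short Python sketch. The word lists are my own test inputs, and `re.match` is used where only the matched prefix is of interest.

```python
import re

# ? makes the preceding character optional.
plural = re.compile(r"dogs?")
dog_words = [w for w in ["dog", "dogs", "dogss"] if plural.fullmatch(w)]
print(dog_words)  # ['dog', 'dogs']

# * allows zero or more of the preceding character.
ah = re.compile(r"Ah*!")
ah_words = [w for w in ["A!", "Ah!", "Ahh!", "Ahx!"] if ah.fullmatch(w)]
print(ah_words)  # ['A!', 'Ah!', 'Ahh!']

# a* "matches" 123 with an empty match, because zero a's is allowed.
empty = re.search(r"a*", "123")
print(repr(empty.group()), empty.span())  # '' (0, 0)

# a[12]* consumes as many 1s and 2s as follow the a.
print(re.match(r"a[12]*", "a121").group())  # a121
print(re.match(r"a[12]*", "a345").group())  # a
```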
That covers the basics, so here is a quick quiz: how do you write a pattern for a positive integer?
[0-9][0-9]*. With [0-9]* alone, the empty string would match as well. This is not very elegant, so there is an upgraded version.
plus +: matches one or more occurrences of the preceding character.
So the usual, more elegant pattern for a string of digits is [0-9]+ .
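The sketch below compares the three digit patterns in Python. The sample strings are assumptions of mine; note that both "positive integer" patterns also accept 0 and leading zeros, since they only describe a run of digits.

```python
import re

star = re.compile(r"[0-9]*")
explicit = re.compile(r"[0-9][0-9]*")
plus = re.compile(r"[0-9]+")

# [0-9]* accepts the empty string; the other two require at least one digit.
print(bool(star.fullmatch("")))      # True
print(bool(explicit.fullmatch("")))  # False
print(bool(plus.fullmatch("")))      # False

# [0-9][0-9]* and [0-9]+ agree on every sample.
samples = ["", "7", "2022", "12a"]
agree = all(bool(explicit.fullmatch(s)) == bool(plus.fullmatch(s)) for s in samples)
print(agree)  # True

# Pulling every run of digits out of a sentence.
numbers = re.findall(r"[0-9]+", "Chapter 2, posted 2022-07-27")
print(numbers)  # ['2', '2022', '07', '27']
```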
period (/./): the English full stop is a wildcard that matches any single character except a carriage return.
For example, a.b matches aab, a1b, a'b and so on.
The wildcard is often combined with the asterisk: .* stands for a string of any length (possibly empty).
An interesting application: to find lines in which Tom appears twice, use the pattern Tom.*Tom.
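Here is a small Python sketch of the wildcard and of the Tom.*Tom search. The sample text is invented for the example. One detail specific to Python's `re`: by default the period matches everything except the newline character `\n`.

```python
import re

# . stands for exactly one character (not a newline by default in Python).
dot = re.compile(r"a.b")
dot_hits = [s for s in ["aab", "a1b", "a'b", "ab", "a\nb"] if dot.fullmatch(s)]
print(dot_hits)  # ['aab', 'a1b', "a'b"]

# Lines where Tom appears at least twice.
text = "Tom met Tom at noon\nTom went home\nJerry saw Tom; Tom waved"
twice = re.compile(r"Tom.*Tom")
lines_with_two = [line for line in text.splitlines() if twice.search(line)]
print(lines_with_two)  # ['Tom met Tom at noon', 'Jerry saw Tom; Tom waved']
```

Because .* cannot cross a newline by default, checking the text line by line and searching the whole text give the same answer here.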
Anchors are an interesting, more advanced kind of symbol. Let's go through them one by one.
^ marks the start of a line: “^The” matches The only at the beginning of a line, not anywhere else. Correspondingly, $ marks the end of a line. So “^The mouse\.$” matches only a line consisting of exactly “The mouse.” (the backslash makes the period a literal full stop rather than a wildcard).
\b denotes a word boundary. “\bwhat\b” matches only the standalone word "what", not “whatever”. Note that "word" here follows the programming-language definition: a sequence of letters, digits and underscores. So “\b88” matches “88” but not the 88 in “188”, because 1 is also a word character and there is no boundary before the 88. The 88 in "$88" does match, because "$" is not a word character.
\B is the opposite of \b: it matches at every position where \b does not, and vice versa.
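The anchor rules can be checked with the following Python sketch. The sample lines are my own, and each string is treated as one line, so ^ and $ refer to the start and end of that string.

```python
import re

lines = ["The mouse.", "The mouse ran.", "See The mouse."]

# ^The: only lines that begin with The.
starts_with_the = [l for l in lines if re.search(r"^The", l)]
print(starts_with_the)  # ['The mouse.', 'The mouse ran.']

# ^...$ with an escaped period: the whole line must be exactly "The mouse."
exact = [l for l in lines if re.search(r"^The mouse\.$", l)]
print(exact)  # ['The mouse.']

# \b: word boundary between a word character and a non-word character.
print(bool(re.search(r"\bwhat\b", "what is it")))  # True
print(bool(re.search(r"\bwhat\b", "whatever")))    # False
print(bool(re.search(r"\b88", "88")))              # True
print(bool(re.search(r"\b88", "188")))             # False
print(bool(re.search(r"\b88", "$88")))             # True

# \B: the complement, a position that is not a word boundary.
print(bool(re.search(r"\B88", "188")))  # True
print(bool(re.search(r"\B88", "88")))   # False
```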
Just taking notes up to 2.1.2 took this long... these notes are a bit too detailed.