Speech and Language Processing (3rd ed. draft) Chapter 2: Regular Expressions, Text Normalization, Edit Distance (reading notes)
2022-07-27 06:52:00 【Haulyn5】
Preface
Title of this chapter: REGULAR EXPRESSIONS, TEXT NORMALIZATION, EDIT DISTANCE
I learned about this book from DLHLP 2020. It is free to read online and is an excellent book on speech and NLP; it also appears in the references recommended on the cs224n website. I am a bit of a hamster when it comes to books, so of course I downloaded it right away and then read only a few pages. Recently I found that blogging on CSDN works well: posts sync, drafts can be saved, and my desktop stays less cluttered. From now on I can also keep other study notes here. I also hope to write more and practice my writing, so that what I write is comfortable for others to read rather than understandable only to me (which of course takes effort and patience, and may take several times longer).
The textbook is available online at:
Speech and Language Processing
https://web.stanford.edu/~jurafsky/slp3/
These notes mostly follow the order of the original textbook and are essentially a condensed version of it.
Text
Lemmatization: determining that two words share the same root despite their surface differences.
Lemma: "sing" is the common lemma of "sang", "sung", and "sings".
Stemming: a simpler version of lemmatization in which we mainly just strip suffixes from the end of the word.
Regular expressions come in many variants; the one introduced in the book is called extended regular expressions.
Regular expressions are case sensitive.
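To make the difference between stemming and lemmatization concrete, here is a toy sketch of suffix stripping. The function name naive_stem, the suffix list, and the minimum stem length are all my own assumptions for illustration; this is far cruder than a real stemmer and is not the book's algorithm.

```python
# Hypothetical suffix list, chosen only for this illustration.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = {w: naive_stem(w) for w in ["sings", "singing", "sing", "sang"]}
print(stems)
```

Suffix stripping maps "sings" and "singing" to "sing", but it leaves "sang" untouched, which is exactly the kind of case where lemmatization is needed.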
Square brackets [ ]: a set of characters, any one of which may match. For example, to make a single letter case insensitive, I can use the following pattern.
[gG]reat matches either great or Great.
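A quick way to check the bracket example is to try it in Python. Using Python's re module is my own choice here (the book's notation is not tied to any language), and the sample sentence is made up.

```python
import re

# [gG] matches exactly one character: either "g" or "G".
pattern = re.compile(r"[gG]reat")

matches = pattern.findall("Great job, that was great. GREAT!")
print(matches)

# Without brackets the pattern is case sensitive.
plain = re.search(r"great", "Great")
print(plain)
```

Note that GREAT is not matched: the brackets only relax the first letter, and the rest of the pattern is still case sensitive.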
Square brackets can hold many characters: [0123456789] matches any single digit. Writing that out is tedious, so inside brackets you can use a dash "-" to express a range. Unlike ranges in Python and some other programming languages, a regex range is inclusive at both ends.
So [1-3] matches 1, 2, or 3. [0-9] matches any digit, and [A-Z] and [a-z] match any uppercase and any lowercase letter, respectively.
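The inclusive ranges can be checked with a small sketch, again assuming Python's re module; the sample strings are invented for the demonstration.

```python
import re

text = "Room 4B, floor 3"

digits_found = re.findall(r"[0-9]", text)           # any single digit
upper_found = re.findall(r"[A-Z]", text)            # any uppercase letter
one_to_three = re.findall(r"[1-3]", "0123456")      # both ends included

# For contrast, Python's range() excludes its upper bound.
python_range = list(range(1, 3))

print(digits_found, upper_found, one_to_three, python_range)
```

The regex range [1-3] includes 3, while range(1, 3) stops at 2.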
Caret ^: as the first character inside square brackets it negates the set, so the pattern matches any character that is not listed. If it appears anywhere else inside the brackets, it simply stands for itself.
[^123] matches any character except 1, 2, or 3. [1^] matches 1 or ^. In the book's notation, 1^ matches the two consecutive characters 1^; in engines such as Python's re, however, an unescaped ^ outside brackets is an anchor, so write 1\^ to match a literal caret.
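Here is a small sketch of the caret cases, assuming Python's re module. One thing to watch: in Python an unescaped ^ outside brackets is a start-of-line anchor rather than a literal character, so the literal case is written with a backslash.

```python
import re

# Leading ^ inside brackets negates the set.
not_123 = re.findall(r"[^123]", "a1b2c3")

# ^ that is not first inside brackets is an ordinary character.
one_or_caret = re.findall(r"[1^]", "1^2")

# Outside brackets, escape the caret to match it literally in Python.
escaped = re.search(r"1\^", "x1^y")

# Unescaped, Python reads ^ as an anchor, so this cannot match.
unescaped = re.search(r"1^", "x1^y")

print(not_123, one_or_caret, escaped.group(), unescaped)
```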
Question mark ?: the preceding character is optional; it may appear once or not at all.
For example, dogs? matches dog or dogs.
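The optional character can be tried out like this (a sketch assuming Python's re module; the test words are my own):

```python
import re

optional_s = re.compile(r"dogs?")

found = optional_s.findall("one dog, two dogs")

# fullmatch requires the whole string to fit the pattern.
whole = {w: optional_s.fullmatch(w) is not None for w in ["dog", "dogs", "dogss"]}

print(found, whole)
```

The ? applies only to the s right before it, so dogss does not fit the pattern as a whole.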
Asterisk *: matches zero or more occurrences of the preceding character. Think of it as an upgraded question mark.
Ah*! matches A!, Ah!, Ahh!, and so on. Interestingly, a* can also match 123, because the string contains zero occurrences of "a", ha ha. Combining this with the earlier points, a[12]* is an a followed by zero or more 1s or 2s, so it matches a111, a222, and a121 in full; in a345 it matches only the a.
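The "zero or more" behaviour is easier to believe after seeing what actually gets matched. This is a sketch assuming Python's re module:

```python
import re

# Zero, one, or two h's all fit Ah*!
ah_ok = all(re.fullmatch(r"Ah*!", s) for s in ["A!", "Ah!", "Ahh!"])

# a* succeeds on "123" by matching the empty string at position 0.
empty_match = re.search(r"a*", "123")

# a[12]* matches as many 1s and 2s after the a as it can find.
full = re.search(r"a[12]*", "a121").group()
partial = re.search(r"a[12]*", "a345").group()

print(ah_ok, repr(empty_match.group()), full, partial)
```

So a* "matching" 123 really means matching an empty string, and a[12]* on a345 matches just the leading a.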
OK, that covers the basics. Pop quiz: how do you write a pattern for a non-negative integer (a string of digits)?
[0-9][0-9]*, because [0-9]* on its own can also match nothing at all (the empty string). That is not very elegant, so there is an upgraded version.
Plus +: matches one or more occurrences of the preceding character.
So the usual, more elegant way to express a string of digits is [0-9]+ .
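The difference between the three integer patterns shows up on the empty string. A sketch assuming Python's re module, with a made-up sample sentence:

```python
import re

# [0-9]* accepts the empty string; the other two patterns do not.
star_empty = re.fullmatch(r"[0-9]*", "") is not None
long_empty = re.fullmatch(r"[0-9][0-9]*", "") is not None
plus_empty = re.fullmatch(r"[0-9]+", "") is not None

# [0-9]+ pulls out every run of digits.
numbers = re.findall(r"[0-9]+", "order 66, aisle 7, shelf 123")

print(star_empty, long_empty, plus_empty, numbers)
```

[0-9][0-9]* and [0-9]+ behave the same; the second is just shorter to write.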
Period (/./): a wildcard that matches any single character except a line break (the book says carriage return).
For example, a.b matches aab, a1b, a'b, and so on.
The wildcard is often combined with the asterisk: .* matches a string of any length (including the empty string).
An interesting application: to find lines in which Tom appears twice, use the pattern Tom.*Tom.
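A sketch of the wildcard and the Tom.*Tom trick, assuming Python's re module. The sample lines are invented; the third one is there to show that the pattern works on characters, not on whole words.

```python
import re

# The period matches exactly one character, but not a newline.
dot_ok = [s for s in ["aab", "a1b", "a'b", "ab", "a\nb"] if re.fullmatch(r"a.b", s)]

lines = [
    "Tom met Tom at noon",
    "Tom was alone",
    "Tomorrow Tommy arrives",
]

# Keep the lines where "Tom" occurs twice with anything in between.
two_toms = [line for line in lines if re.search(r"Tom.*Tom", line)]

print(dot_ok, two_toms)
```

The third line is also selected because Tomorrow and Tommy both contain the characters Tom.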
Anchors are an interesting, more advanced kind of symbol. Let's go through them one by one.
^ marks the beginning of a line: "^The" matches The only at the start of a line and not anywhere else. Correspondingly, $ marks the end of a line. So "^The mouse\.$" matches only a line that consists of exactly "The mouse." and nothing else. The period is escaped with a backslash so that it means a literal full stop rather than the wildcard.
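The anchors can be sketched as follows, assuming Python's re module. In Python, ^ and $ only work line by line when the re.MULTILINE flag is passed, which is a detail of that library rather than something from the book. The sample text is made up.

```python
import re

start_hit = re.search(r"^The", "The cat") is not None
start_miss = re.search(r"^The", "See The cat") is not None

text = "The mouse.\nThe mouse ran.\nThe mousey\nSee The mouse."

# Escaped period: only the line that is exactly "The mouse." matches.
exact = re.findall(r"^The mouse\.$", text, flags=re.MULTILINE)

# Unescaped period: the wildcard also accepts the "y" in "The mousey".
loose = re.findall(r"^The mouse.$", text, flags=re.MULTILINE)

print(start_hit, start_miss, exact, loose)
```

This is why the full stop needs a backslash when it is meant literally.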
\b marks a word boundary: "\bwhat\b" matches only the standalone word "what" and does not match "whatever". Note that a "word" here is defined the way programming languages define it: any sequence of letters, digits, and underscores. So "\b88" matches "88" but not the 88 in "188", because 1 is also a word character and there is no boundary before the 88. The 88 in "$88" does match, because "$" is not a word character.
\B is the opposite of \b: it matches at every position that is not a word boundary, and fails wherever \b would match.
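A sketch of both boundary symbols, assuming Python's re module. The "cat concat" string is my own example for showing where \b and \B succeed.

```python
import re

word_hit = re.search(r"\bwhat\b", "what is it") is not None
word_miss = re.search(r"\bwhat\b", "whatever") is not None

# Digits count as word characters, "$" does not.
boundary_88 = {s: re.search(r"\b88", s) is not None for s in ["88", "188", "$88"]}

# Start positions of "cat" at a boundary versus inside a word.
at_boundary = [m.start() for m in re.finditer(r"\bcat", "cat concat")]
inside_word = [m.start() for m in re.finditer(r"\Bcat", "cat concat")]

print(word_hit, word_miss, boundary_88, at_boundary, inside_word)
```

The first cat starts at a boundary (index 0), while the cat inside concat (index 7) is preceded by a letter, so only \B matches there.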
Just taking notes up to section 2.1.2 took this long... These notes are a bit too detailed.