[Python crawler] Regular expressions
2022-06-11 13:38:00 【Riding a snail chasing a missile '】
In crawler development, you often need to extract useful information from a large block of text.
Regular expressions are one way to extract that information.
1、Regular expression basics
A regular expression (Regular Expression) is a string that describes a piece of information following a pattern. Python ships with a regular expression module, re, which can find, extract, and replace text that follows a pattern. Whenever a program needs to pick out what it wants from a large block of text, regular expressions can do the job.
Using a regular expression takes the following steps:
(1) Look for the pattern
(2) Express the pattern with regular expression symbols
(3) Extract the information
2、Basic regular expression symbols
2.1 Dot “.”
A dot can stand for any single character except the newline, including but not limited to English letters, digits, Chinese characters, and English or Chinese punctuation.
2.2 Asterisk “*”
An asterisk means that the subexpression before it (an ordinary character, or one or more other regular expression symbols) occurs 0 times to infinitely many times.
2.3 Question mark “?”
A question mark means that the subexpression before it occurs 0 times or 1 time. Note that it must be the English (half-width) question mark.
2.4 Backslash “\”
A backslash is not used on its own in a regular expression, nor on its own anywhere else in Python. It is combined with another character, either to turn a special symbol into an ordinary one or to turn an ordinary character into a special one.
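Both directions of the backslash can be sketched in a few lines. The sample strings are made up for illustration, and the patterns are written as raw strings (r"...") so that Python itself leaves the backslash alone; that choice is an assumption of this sketch, not something the article prescribes.

```python
import re

# Direction 1: special -> ordinary.
# An unescaped dot matches any character; an escaped dot matches only ".".
text = "a.b axb"
any_char = re.findall(r"a.b", text)
literal_dot = re.findall(r"a\.b", text)

# Direction 2: ordinary -> special.
# "d" is just the letter d; "\d" stands for any digit.
sample = "d1"
letter_d = re.findall(r"d", sample)
digit = re.findall(r"\d", sample)

print(any_char)     # ['a.b', 'axb']
print(literal_dot)  # ['a.b']
print(letter_d)     # ['d']
print(digit)        # ['1']
```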

2.5 Digit “\d”
Regular expressions use “\d” to represent one digit. Why the letter d? Because d is the first letter of the English word “digit”. Note that although “\d” is written as a backslash plus the letter d, it should be treated as a single regular expression symbol.
2.6 Parentheses “()”
Parentheses extract the content matched inside them.
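As a quick illustration of the symbols in this section, the sketch below applies each of them to a made-up string. It uses re.findall, which is introduced in section 3; the text and variable names are hypothetical.

```python
import re

text = "order 12, color colour, id=abc123;"

dot = re.findall(r"c.l", text)               # "." = any one character except newline
star = re.findall(r"ab*c", "ac abc abbbc")   # "b*" = zero or more "b"
optional = re.findall(r"colou?r", text)      # "u?" = zero or one "u"
digits = re.findall(r"\d", text)             # "\d" = one digit
group = re.findall(r"id=(.*?);", text)       # "()" = keep only the part in parentheses

print(dot)       # ['col', 'col']
print(star)      # ['ac', 'abc', 'abbbc']
print(optional)  # ['color', 'colour']
print(digits)    # ['1', '2', '1', '2', '3']
print(group)     # ['abc123']
```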
3、Using regular expressions in Python
Python has a powerful regular expression module that makes it easy to extract patterned information from a large block of text. The module is named “re”, short for “regular expression”. It must be imported before use. The import statement is:
import re  # in PyCharm, pressing Alt+Enter on an unresolved name imports it automatically
The commonly used APIs are introduced below:
3.1 findall
Python's regular expression module provides a findall function, which returns all matching strings as a list.
def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string. If one or more capturing groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)
pattern is the regular expression, string is the original text, and flags switches on some special behavior. The result of findall is a list containing every match. If nothing matches, an empty list is returned:
import re

content = 'my computer password is: 123456, my phone password is: 888888, my door password is: 000000, do not forget!'
pwd_list = re.findall('is: (.*?),', content)  # the three passwords
machine_list = re.findall('my (.*?) password is:', content)  # what each password belongs to
name_list = re.findall('name is (.*?),', content)  # nothing matches, so this is []
print('All passwords: {}'.format(pwd_list))
print('Belong to: {}'.format(machine_list))
print('User names: {}'.format(name_list))
As expected, the last pattern matches nothing, so its result is an empty list. One more detail deserves attention: if the text ends directly after the last password, one password goes missing from the result. The reason is the matching rule itself. The password pattern ends with a comma after “(.*?)”, so only a password that is followed by a comma fits the format and can be extracted. The trailing comma is the key point: once a comma and some closing text are appended after the last password, it also satisfies the rule and is extracted.
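The missing-comma effect is easy to reproduce. The sketch below uses an English sample sentence made up for illustration and compares the same pattern on text without and with the closing part.

```python
import re

pattern = r"is: (.*?),"

# The last password is not followed by a comma, so it cannot match.
without_tail = ("my computer password is: 123456, "
                "my phone password is: 888888, "
                "my door password is: 000000")
# Appending ", do not forget!" supplies the comma the pattern requires.
with_tail = without_tail + ", do not forget!"

short_list = re.findall(pattern, without_tail)
full_list = re.findall(pattern, with_tail)

print(short_list)  # ['123456', '888888']
print(full_list)   # ['123456', '888888', '000000']
```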
When only part of the match needs to be extracted, enclose that part in parentheses so that no irrelevant text is returned. If the pattern contains more than one “(.*?)”, the result is still a list, but each element becomes a tuple holding one string per pair of parentheses. For a pattern that captures an account and then a password, the 1st element of each tuple is the account and the 2nd element is the password.
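A minimal sketch of the tuple result, assuming a made-up text in which each record holds an account followed by a password:

```python
import re

# Hypothetical sample text: each record holds an account and a password.
content = "account: alice, password: 123456; account: bob, password: 888888;"

# Two pairs of parentheses, so every element of the result is a 2-tuple.
pairs = re.findall(r"account: (.*?), password: (.*?);", content)
print(pairs)  # [('alice', '123456'), ('bob', '888888')]

# Each tuple can be unpacked directly in a loop.
for account, password in pairs:
    print(account, password)
```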

The function prototype has a flags parameter. It can be omitted; when given, it switches on auxiliary behavior such as ignoring case or ignoring line breaks. Ignoring line breaks (re.S) is a common case: without it, “.” stops at a newline, so a pattern cannot capture content that spans several lines.
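The effect of re.S can be seen by running the same pattern with and without the flag. The text below is a hypothetical example in which the wanted part spans two lines.

```python
import re

# Hypothetical text in which the wanted part spans two lines.
big_text = "start here is the first line\nand the second line end"

# Without re.S the dot cannot cross the line break, so nothing matches.
without_flag = re.findall(r"start (.*?) end", big_text)
# With re.S the dot also matches "\n".
with_flag = re.findall(r"start (.*?) end", big_text, re.S)

print(without_flag)  # []
print(with_flag)     # ['here is the first line\nand the second line']
```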
Common flags:
- re.I (IGNORECASE): ignore letter case.
- re.L (LOCALE): make “\w”, “\W”, “\b” and “\B” depend on the current locale settings.
- re.M (MULTILINE): ‘^’ and ‘$’ also match at the start and end of each line, that is, just after and just before each line break.
- re.S (DOTALL): make “.” match any character at all, including line breaks; without this flag, “.” matches any character except the newline.
- re.X (VERBOSE): whitespace in the pattern is ignored unless it is inside a character class or preceded by a backslash. Comments can also be written in the pattern and are ignored by the engine; a comment starts with “#”, unless the “#” is inside a character class or preceded by a backslash.
Reference: Python regular expression flags parameter
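The remaining flags can be tried out in the same way. The sketch below is only an illustration with made-up strings: it shows re.I, re.M (combined with re.I through the | operator), and re.X.

```python
import re

text = "Python\npython\nPYTHON"

# re.I: letter case is ignored.
ignore_case = re.findall(r"python", text, re.I)

# re.M: "^" matches at the start of every line, not only at the start of the text.
line_starts = re.findall(r"^p.*", text, re.M | re.I)
no_multiline = re.findall(r"^p.*", text, re.I)

# re.X: the pattern can be spread over several lines and commented.
date_pattern = r"""
    (\d\d\d\d)  # year
    -           # literal hyphen
    (\d\d)      # month
"""
dates = re.findall(date_pattern, "released 2022-06 and 2023-01", re.X)

print(ignore_case)   # ['Python', 'python', 'PYTHON']
print(line_starts)   # ['Python', 'python', 'PYTHON']
print(no_multiline)  # ['Python']
print(dates)         # [('2022', '06'), ('2023', '01')]
```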
3.2 search
search() is used the same way as findall(), but it returns only the 1st match and stops looking as soon as it finds one. This is especially useful when only the first item is needed from a very large text, because the rest of the text is never scanned.
def search(pattern, string, flags=0):
    """Scan through string looking for a match to the pattern, returning a Match object, or None if no match was found."""
    return _compile(pattern, flags).search(string)
If the match succeeds, the result is a Match object; to get the matched text, call its .group() method. If nothing matches, the result is None.

Called without an argument, .group() returns the whole matched text. Only when .group() is called with the argument 1 does it return the content of the parentheses in the regular expression. The argument cannot exceed the number of pairs of parentheses in the pattern: 1 reads the content of the 1st pair of parentheses, 2 reads the content of the 2nd pair, and so on.
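A short sketch of search() and .group(), using a made-up sentence with two pairs of parentheses in the pattern:

```python
import re

content = "my computer password is: 123456, my phone password is: 888888,"

# search() stops at the first match and returns a Match object (or None).
match = re.search(r"my (.*?) password is: (.*?),", content)
if match is not None:
    print(match.group())   # my computer password is: 123456,
    print(match.group(1))  # computer
    print(match.group(2))  # 123456
    try:
        match.group(3)     # the pattern has only two pairs of parentheses
    except IndexError as error:
        print("no third group:", error)

# A pattern that matches nothing gives None, not an empty Match object.
no_match = re.search(r"name is (.*?),", content)
print(no_match)  # None
```

Checking for None before calling .group() avoids an AttributeError when nothing matches.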
3.3 The difference between “.*” and “.*?”
In crawler development, the 3 symbols “.*?” are used together in most cases.
- A dot stands for any character except the newline, and an asterisk matches the character before it 0 or more times. So “.*” matches a string of any length. Other symbols must be placed before and after “.*” to limit the range; otherwise the result is the whole original string.
- What happens if a question mark is added after “.*”, turning it into “.*?”? On its own, a question mark matches the symbol in front of it 0 times or 1 time; placed directly after “*”, it instead makes the asterisk match as few characters as possible. So “.*?” matches the shortest string that meets the requirements.

With “(.*)”, the result is a list with only one element: one long string running from the first password to the last.
With “(.*?)”, the result is a list of 3 elements, one for each password in the original text.
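The two results can be compared side by side. The sentence below is an English sample made up for illustration.

```python
import re

content = ("my computer password is: 123456, "
           "my phone password is: 888888, "
           "my door password is: 000000, do not forget!")

# Greedy: ".*" runs as far as possible, up to the LAST comma.
greedy = re.findall(r"password is: (.*),", content)
# Non-greedy: ".*?" stops at the FIRST comma after each "password is: ".
lazy = re.findall(r"password is: (.*?),", content)

print(len(greedy))  # 1
print(greedy[0])    # 123456, my phone password is: 888888, my door password is: 000000
print(lazy)         # ['123456', '888888', '000000']
```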
Summary:
- ① “.*”: greedy mode, gets the longest string that satisfies the condition.
- ② “.*?”: non-greedy mode, gets the shortest string that satisfies the condition.
4、 Regular expression extraction techniques
4.1 No need to use compile
def findall(pattern, string, flags=0):
    """Return a list of all non-overlapping matches in the string. If one or more capturing groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result."""
    return _compile(pattern, flags).findall(string)

def compile(pattern, flags=0):
    "Compile a regular expression pattern, returning a Pattern object."
    return _compile(pattern, flags)
When re.compile() is used, it calls the _compile() function internally. When re.findall() is used, the module first calls _compile() automatically and then calls findall() on the result. re.findall() therefore already includes what re.compile() does, so there is no need to call re.compile() yourself. The re module also caches recently compiled patterns, so repeated calls with the same pattern are not recompiled each time.
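The equivalence can be checked directly. This is only a sketch with a made-up input string; it compares the results of the two call styles and says nothing about their relative speed.

```python
import re

content = "a1b22c333"

# Module-level function: compiles the pattern internally.
direct = re.findall(r"\d", content)

# Explicit compile step followed by the method of the Pattern object.
compiled = re.compile(r"\d")
via_compile = compiled.findall(content)

print(direct)                 # ['1', '2', '2', '3', '3', '3']
print(direct == via_compile)  # True
```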
4.2 Big first, then small
Some invalid content may follow the same pattern as the valid content, which makes the two easy to mix up. Take the following text:
Valid users:
Name: Zhang San
Name: Li Si
Name: Wang Wu
Invalid users:
Name: Unknown Shrimp
Name: Hidden Hero Zhang
The names of both valid and invalid users start with “Name: ”. Matching with “Name: (.*?)\n” returns valid and invalid names together, and they are hard to tell apart.

To solve this, work from big to small: first match the block of valid users as a whole, then match the names inside that block.
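A minimal sketch of the big-then-small technique on this kind of text. The sample uses the label “Name: ” and English renderings of the names; the re.S flag is needed so that “.” can cross the line breaks inside the block.

```python
import re

text = """Valid users:
Name: Zhang San
Name: Li Si
Name: Wang Wu
Invalid users:
Name: Unknown Shrimp
Name: Hidden Hero Zhang
"""

# Naive approach: valid and invalid names come back mixed together.
mixed = re.findall(r"Name: (.*?)\n", text)

# Big: grab the whole block of valid users first.
valid_block = re.findall(r"Valid users:(.*?)Invalid users:", text, re.S)
# Small: extract the names from that block only.
valid_names = re.findall(r"Name: (.*?)\n", valid_block[0])

print(len(mixed))   # 5
print(valid_names)  # ['Zhang San', 'Li Si', 'Wang Wu']
```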
4.3 Inside and outside parentheses
In the examples above, parentheses always appear together with “.*?”, so some readers may think that only these 3 characters can go inside the parentheses. In fact, ordinary characters can go inside as well. Whatever is placed inside the parentheses becomes part of the extracted result, while what stays outside is only used for matching.
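The effect of moving ordinary characters into the parentheses can be sketched as follows, again with a made-up sentence:

```python
import re

content = "my computer password is: 123456, my phone password is: 888888,"

# Ordinary characters outside the parentheses: used for matching, not returned.
outside = re.findall(r"password is: (.*?),", content)
# The same characters inside the parentheses: returned as part of the result.
inside = re.findall(r"(password is: .*?),", content)

print(outside)  # ['123456', '888888']
print(inside)   # ['password is: 123456', 'password is: 888888']
```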

It is not hard to understand. Just remember: "search by the whole pattern, extract what is inside the parentheses."

