2022-07-05 21:49:00 A.way30

A regular expression is a syntax rule that uses an expression to match strings.

When to use them

In the data-parsing stage of a Python crawler, the content you want from the downloaded page text is stored between tags or in the attributes of those tags. To extract that stored information, you need regular expressions.


Use metacharacters to match strings


Special symbols with a fixed meaning (by default each one matches a single character)

Metacharacter  Meaning
.              Matches any character except a newline
\w             Matches a letter, digit, or underscore
\s             Matches any whitespace character
\d             Matches a digit
\n             Matches a line break
\t             Matches a tab
^              Matches the beginning of the string
$              Matches the end of the string
\W             Matches anything that is not a letter, digit, or underscore
\D             Matches a non-digit
\S             Matches a non-whitespace character
a|b            Matches character a or character b
()             Matches the expression inside the parentheses; also defines a group
[…]            Matches one character from the character set (a-z stands for all letters from a to z)
[^…]           Matches any character except those in the character set
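A few of the metacharacters above can be tried directly with re.findall(). This is only a small sketch; the sample strings are made up for illustration.

```python
import re

text = "id_7 cost 42\tyuan"  # made-up sample string

# \d matches one digit, \s one whitespace character (space, tab, ...)
digits = re.findall(r"\d", text)
whitespace = re.findall(r"\s", text)

# \w matches one letter, digit, or underscore; "!" is not matched
word_chars = re.findall(r"\w", "a_1!")

# a|b matches either alternative
alternatives = re.findall(r"cost|yuan", text)

# [...] matches one character from the set, [^...] one character outside it
vowels = re.findall(r"[aeiou]", "user")
non_vowels = re.findall(r"[^aeiou]", "user")

print(digits)        # ['7', '4', '2']
print(whitespace)    # [' ', ' ', '\t']
print(word_chars)    # ['a', '_', '1']
print(alternatives)  # ['cost', 'yuan']
print(vowels)        # ['u', 'e']
print(non_vowels)    # ['s', 'r']
```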


^\d\d\d\d\d  No match unless the string begins with five digits

\d\d\d\d\d$  No match unless the string ends with five digits
e.g. My phone number is :10010
[M10]  matches each character in the set wherever it occurs:  M  1 0 0 1 0
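The anchors and the character set can be checked in Python as follows. This is a sketch with hypothetical sample strings, chosen so that one begins with five digits and the other ends with them.

```python
import re

starts_with_digits = "10010 is the number"  # hypothetical sample
ends_with_digits = "the number is 10010"    # hypothetical sample

# ^ anchors the pattern to the beginning of the string
m_start_ok = re.search(r"^\d\d\d\d\d", starts_with_digits)
m_start_fail = re.search(r"^\d\d\d\d\d", ends_with_digits)

# $ anchors the pattern to the end of the string
m_end_ok = re.search(r"\d\d\d\d\d$", ends_with_digits)
m_end_fail = re.search(r"\d\d\d\d\d$", starts_with_digits)

# A character set matches any single character listed in it
set_hits = re.findall(r"[t10]", "tel:10010")

print(m_start_ok.group())  # 10010
print(m_start_fail)        # None
print(m_end_ok.group())    # 10010
print(m_end_fail)          # None
print(set_hits)            # ['t', '1', '0', '0', '1', '0']
```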


Controls the number of occurrences of the preceding metacharacter

Quantifier  Meaning
*           Repeat zero or more times
+           Repeat one or more times
?           Repeat zero or one time
{n}         Repeat n times
{n,}        Repeat n or more times
{n,m}       Repeat n to m times
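To see what each quantifier accepts, one option is to test a handful of candidate strings against each pattern with re.fullmatch(). The helper name accepted() and the candidate list are assumptions made for this sketch.

```python
import re

candidates = ["", "a", "aa", "aaa", "aaaa"]

def accepted(pattern):
    """Return the candidates that the pattern matches in full (hypothetical helper)."""
    return [c for c in candidates if re.fullmatch(pattern, c) is not None]

print(accepted(r"a*"))      # ['', 'a', 'aa', 'aaa', 'aaaa']
print(accepted(r"a+"))      # ['a', 'aa', 'aaa', 'aaaa']
print(accepted(r"a?"))      # ['', 'a']
print(accepted(r"a{2}"))    # ['aa']
print(accepted(r"a{2,}"))   # ['aa', 'aaa', 'aaaa']
print(accepted(r"a{2,3}"))  # ['aa', 'aaa']
```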

Matching modes

.*   Greedy matching (matches as much as possible)
.*?  Lazy matching (matches as little as possible)
Lazy matching is the mode commonly used in Python crawlers.
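The difference between the two modes shows up as soon as a page contains the same tag more than once. A minimal sketch, using a made-up HTML fragment:

```python
import re

html = "<span>one</span><span>two</span>"  # made-up fragment

# Greedy: .* runs to the LAST </span>, swallowing the tags in between
greedy = re.findall(r"<span>(.*)</span>", html)

# Lazy: .*? stops at the FIRST </span> it can, giving one result per tag
lazy = re.findall(r"<span>(.*?)</span>", html)

print(greedy)  # ['one</span><span>two']
print(lazy)    # ['one', 'two']
```

This is why lazy matching is usually the better fit when extracting the text between tags.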

The re module


pattern: the regular expression; string: the string to search; flags: flag bits that modify how matching works (re.S, re.M, etc.)

Function                                      Meaning
compile(pattern, flags=0)                     Compiles a regular expression and returns a pattern object (precompiles the expression for reuse)
match(pattern, string, flags=0)               Matches the string from its beginning; returns a match object on success, otherwise None
search(pattern, string, flags=0)              Searches the string for the first occurrence of the pattern; returns a match object on success, otherwise None
split(pattern, string, maxsplit=0, flags=0)   Splits the string using the pattern as the separator; returns a list
sub(pattern, repl, string, count=0, flags=0)  Replaces the parts of the string that match the pattern with repl; count limits the number of replacements
fullmatch(pattern, string, flags=0)           The exact-match version of match (the pattern must match the whole string, from beginning to end)
findall(pattern, string, flags=0)             Finds all matches of the pattern in the string; returns a list of strings
finditer(pattern, string, flags=0)            Finds all matches of the pattern in the string; returns an iterator of match objects
purge()                                       Clears the cache of implicitly compiled regular expressions
re.I / re.IGNORECASE                          Flag: ignore case when matching
re.M / re.MULTILINE                           Flag: multi-line matching
re.S / re.DOTALL                              Flag: single-line matching (. also matches newlines)
(?P<group name>pattern)                       Named group: extracts part of the matched content separately (the P must be uppercase)
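split(), sub(), and fullmatch() appear in the table but are easy to overlook. The following is a small sketch of how they behave; the sample strings are made up.

```python
import re

# split(): the pattern is the separator (here: commas, semicolons, whitespace)
parts = re.split(r"[,;\s]+", "a, b;c  d")

# sub(): replace matches with repl; count limits how many are replaced
masked = re.sub(r"\d", "*", "tel 10086", count=3)

# fullmatch(): the whole string must match, while match() only checks the start
full_ok = re.fullmatch(r"\d{5}", "10086")
full_fail = re.fullmatch(r"\d{5}", "10086x")
prefix_ok = re.match(r"\d{5}", "10086x")

print(parts)              # ['a', 'b', 'c', 'd']
print(masked)             # tel ***86
print(full_ok.group())    # 10086
print(full_fail)          # None
print(prefix_ok.group())  # 10086
```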


1. findall() returns a list

import re
ls = re.findall(r'\d+', "My phone number is: 10086, his phone number is: 10010.")  # ['10086', '10010']


2. finditer() returns an iterator

it = re.finditer(r'\d+', "My phone number is: 10086, his phone number is: 10010.")
for i in it:
    print(i.group())  # each item is a match object; prints 10086, then 10010


3. search() returns a match object

ls = re.search(r'\d+', "My phone number is: 10086, his phone number is: 10010.")
print(ls.group())  # 10086

Only the first result is returned.

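4. match() matches from the beginning of the string; otherwise it is similar to search()

ls = re.match(r'\d+', "10086, his phone number is: 10010.")  # match object for 10086
ls = re.match(r'\d+', "My phone number is: 10086, his phone number is: 10010.")  # None

# The second string does not start with a digit, so match() returns None, and calling ls.group() on it raises an AttributeError.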
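Because match() and search() return None when nothing matches, it is safer to check the result before calling group(). A minimal sketch; the function name first_number() is hypothetical.

```python
import re

def first_number(text):
    """Return the digits at the start of text, or None if there are none (hypothetical helper)."""
    m = re.match(r"\d+", text)
    # Guard against None so that .group() is never called on a failed match
    return m.group() if m is not None else None

print(first_number("10086, his phone number is: 10010."))  # 10086
print(first_number("My phone number is: 10086"))           # None
```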

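5. compile() precompiles a regular expression

obj = re.compile(r'\d+')  # \d+ so that whole numbers are matched, not single digits
ret = obj.finditer("My phone number is: 10086, my girlfriend's phone number is: 10010")
for it in ret:
    print(it.group())  # prints 10086, then 10010
ret = obj.findall("Population 1000000")
print(ret)  # ['1000000']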

6. (?P<group name>pattern) named groups

s = """<div class='a'><span id='1'>Zhang San</span></div>
<div class='b'><span id='2'>Li Si</span></div>
<div class='c'><span id='3'>Wang Wu</span></div>
<div class='d'><span id='4'>garage</span></div>
<div class='e'><span id='5'>Thompson</span></div>"""
obj = re.compile(r"<div class='.*?'><span id='(?P<id>\d+)'>(?P<name>.*?)</span></div>", re.S)
result = obj.finditer(s)
for it in result:
    # Read each named group by name
    print(it.group("name"), it.group("id"))

# Zhang San 1
# Li Si 2
# Wang Wu 3
# garage 4
# Thompson 5
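When a pattern has several named groups, a match object's groupdict() method returns all of them at once as a dictionary, which can be convenient for building records. A sketch with a shortened, made-up HTML string:

```python
import re

html = "<span id='1'>Zhang San</span><span id='2'>Li Si</span>"  # shortened sample

pattern = re.compile(r"<span id='(?P<id>\d+)'>(?P<name>.*?)</span>")

# groupdict() maps every group name to the text it captured
rows = [m.groupdict() for m in pattern.finditer(html)]

print(rows)
# [{'id': '1', 'name': 'Zhang San'}, {'id': '2', 'name': 'Li Si'}]
```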

7. re.S treats the string as a single whole, so . also matches newlines

a = "hello123\nworld"
b = re.findall('hello(.*?)world', a)
c = re.findall('hello(.*?)world', a, re.S)
print('b =', b)
print('c =', c)

# b = []  (an empty list: without re.S, . cannot cross the line break)
# c = ['123\n']
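The other two flags from the table, re.I and re.M, can be tried in the same way. A small sketch with a made-up three-line string; flags are combined with the | operator.

```python
import re

text = "Hello\nhello\nHELLO"  # made-up three-line sample

# re.I: ignore case
any_case = re.findall(r"hello", text, re.I)

# Without re.M, ^ and $ only match at the start and end of the whole string
whole_only = re.findall(r"^hello$", text)

# re.M: ^ and $ also match at the start and end of every line
per_line = re.findall(r"^hello$", text, re.M)

# Flags can be combined with |
both = re.findall(r"^hello$", text, re.I | re.M)

print(any_case)    # ['Hello', 'hello', 'HELLO']
print(whole_only)  # []
print(per_line)    # ['hello']
print(both)        # ['Hello', 'hello', 'HELLO']
```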
