Self-Taught Programming Series 1: Regular Expressions

2022-06-26 09:08:00 【ML_ python_ get√】

Regular expressions

1.1 Without regular expressions
1.2 Regular expressions
- Summary: steps for using regular expressions
1.3 Grouping
1.4 Greedy and non-greedy matching
1.5 findall and search
1.6 Character classes
1.7 Precise matching
1.8 The second argument to compile
1.9 Replacing text
1.10 Phone number and email address extractor

1.1 Without regular expressions


def isPhoneNumber(text):
    # A valid number looks like 415-555-4242: 12 characters in total.
    if len(text) != 12:
        return False
    for i in range(0, 3):
        if not text[i].isdecimal():
            # The area code must consist of decimal digits.
            return False
    if text[3] != '-':
        return False
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False
    if text[7] != '-':
        return False
    for i in range(8, 12):
        if not text[i].isdecimal():
            return False
    return True


# print("191-666-1234 is a phone number: ")
# print(isPhoneNumber('191-666-1234'))
# print("bilibili is a phone number: ")
# print(isPhoneNumber('bilibili'))
# # Find phone numbers in a longer string
# message = "call me at 415-555-1011 tomorrow. 415-555-9999 is my office."
# for i in range(len(message)):
#     chunk = message[i:i+12]
#     if isPhoneNumber(chunk):
#         print('phone number found: ' + chunk)
# print('Done')

1.2 Regular expressions

\d stands for any digit from 0 to 9, so the phone number format above can be written as \d\d\d-\d\d\d-\d\d\d\d.
This can be shortened to \d{3}-\d{3}-\d{4}, where {3} means "match the preceding item three times". A regular expression describes a pattern to match; re.compile() turns it into a regex object with its own attributes and methods.
The search() method of a regex object returns a match object (or None if nothing matches), and the match object's group() method returns the text that actually matched.


import re

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
# The r prefix creates a raw string. Without it, every backslash in the pattern
# would have to be escaped as \\, which is more cumbersome.
mo = phoneNumRegex.search('My number is 415-555-4242.')
print("phone number found: " + mo.group())

Summary: steps for using regular expressions

Import the re module.
Call re.compile() to create a regex object.
Call the regex object's search() method with the string to search; it returns a match object.
Call the match object's group() method to get the string that actually matched.
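
The four steps above can be sketched as a small helper. Because search() returns None when nothing matches, the sketch checks for that before calling group(). The names PHONE_REGEX and find_phone_number are made up for illustration.

```python
import re

# Step 2: compile the pattern once into a regex object.
PHONE_REGEX = re.compile(r'\d{3}-\d{3}-\d{4}')


def find_phone_number(text):
    """Return the first phone number found in text, or None if there is none."""
    mo = PHONE_REGEX.search(text)  # Step 3: a match object, or None
    if mo is None:
        return None
    return mo.group()              # Step 4: the text that actually matched


print(find_phone_number('My number is 415-555-4242.'))
print(find_phone_number('No number here.'))
```

Checking for None first avoids the AttributeError that appears when group() is called on a failed search.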

1.3 Grouping

Simple grouping with parentheses
Matching alternatives with the pipe |
Quantifier symbols such as ?, *, + and {}



# group(n) returns a single group; groups() returns all groups as a tuple
Regex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = Regex.search('my number is 123-456-8888.')
print(mo.group(0))
print(mo.group(1))
print(mo.groups())
a,b = mo.groups()
print(a)
print(b)

## To match literal parentheses in the text, escape them with a backslash so that they lose their special grouping meaning
Regex1 = re.compile(r' (\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
mo = Regex1.search('my number is (123) 456-8888.')
print(mo.group(1))

# The pipe | matches one of several alternatives; search() returns whichever occurs first in the text
Regex_hero = re.compile(r'Ironman|Batman')
mo = Regex_hero.search('Ironman and Batman!')
print(mo.group())
mo = Regex_hero.search('Batman and Ironman!')
print(mo.group())
## findall() can be used to find all matches
# A pipe inside a group matches the first occurrence of any alternative sharing the same prefix
Regex_a = re.compile(r'Bat(man|mobile|copter|bat)')
mo = Regex_a.search('Batbat and Batmobile are best!')
print(mo.group())

# (group)? makes the group optional: it matches zero or one time
Regex_choice = re.compile(r'Bat(wo)?man')
mo = Regex_choice.search(' I am Batman')
mo1 = Regex_choice.search('you are Batwoman!')
print(mo.group())
print(mo1.group())

# (group)* matches the group zero or more times
Regex_new = re.compile(r'Bat(wo)*man')
mo  = Regex_new.search('Batman is my lover!')
print(mo.group())
mo1 = Regex_new.search('my name is Batwowowowowoman!')
print(mo1.group())
mo2 = Regex_new.search('my name is Batman')
print(mo2.group())
# (group)+ matches the group one or more times
# Regex_add = re.compile(r'Bat(wo)+man')
# mo3 = Regex_add.search('my name is Batman')
# print(mo3.group())
# 'Batman' contains no 'wo', so search() returns None and group() raises:
# AttributeError: 'NoneType' object has no attribute 'group'

# {} specifies the number of repetitions: (group){3} exactly 3 times, {3,} 3 or more times, {,5} at most 5 times
Regex_ha = re.compile(r'(ha){3}')
mo = Regex_ha.search('hahaha')
print(mo.group())
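
The open-ended forms {3,} and {,5} are mentioned in the comment above but not shown. The following is a minimal sketch of how they behave; the variable names and test strings are illustrative only.

```python
import re

at_least_three = re.compile(r'(ha){3,}')   # three or more repetitions
at_most_two = re.compile(r'^(ha){,2}$')    # zero to two repetitions, whole string
three_to_five = re.compile(r'(ha){3,5}')   # between three and five repetitions

# Five repetitions satisfy "three or more", so the whole string matches.
print(at_least_three.search('hahahahaha').group())

# Three repetitions are too many for {,2} once ^ and $ anchor the whole string.
print(at_most_two.search('hahaha'))

# Seven repetitions are available, but the match stops at the upper bound of five.
print(three_to_five.search('hahahahahahaha').group())
```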

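1.4 Greedy and non-greedy matching

Python's regular expressions are greedy by default: when several match lengths are possible, they match the longest string. Appending a question mark to the quantifier, as in (group){3,5}?, makes the match non-greedy, so it matches the shortest possible string instead.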


Regex_nogreedy = re.compile(r'(ha){3,5}?')
mo = Regex_nogreedy.search('hahahahahaha')
print(mo.group())
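
To make the difference visible, the two forms can be run side by side on the same text. This is a small illustrative sketch; the variable names are not from the original.

```python
import re

text = 'hahahahahaha'  # six repetitions of "ha"

greedy = re.compile(r'(ha){3,5}').search(text).group()
non_greedy = re.compile(r'(ha){3,5}?').search(text).group()

print(greedy)      # longest allowed match: five repetitions
print(non_greedy)  # shortest allowed match: three repetitions
```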

1.5 findall and search

findall() finds every match in the string, while search() finds only the first one.
findall() returns a list of strings, while search() returns a match object. If the pattern contains groups, findall() returns a list of tuples instead, one tuple per match.



Regex_phone = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo =  Regex_phone.search('Cell:123-456-8888 Work:123-567-9999')
mo1 = Regex_phone.findall('Cell:123-456-8888 Work:123-567-9999')
print(mo.group())
print(mo1)
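
The effect of groups on findall() is easiest to see by comparing two versions of the same pattern. A minimal sketch, with illustrative variable names:

```python
import re

text = 'Cell:123-456-8888 Work:123-567-9999'

no_groups = re.compile(r'\d{3}-\d{3}-\d{4}')
with_groups = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

numbers = no_groups.findall(text)   # list of matched strings
parts = with_groups.findall(text)   # list of tuples, one item per group

print(numbers)
print(parts)
```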

1.6 Character classes

\d matches any digit from 0 to 9; \D matches any character that is not a digit.
\w matches any word character (a letter, digit, or underscore); \W matches any character that is not a word character.
\s matches any whitespace character; \S matches any non-whitespace character.

Regex_str = re.compile(r'\d+\s*\w+')
# + matches one or more times: one or more digits, optional whitespace, then one or more word characters
mo = Regex_str.findall('12 drummers, 11 pipers, 10 lords, 9 ladies, 8maids, 7swans, 6 geese, 5 rings, 4 birds, 3 hens, 2 doves, 1 partridge')
print(mo)

# Define your own character class
Regex_own = re.compile(r'[AEIOUaeiou]')
mo = Regex_own.findall('RoboCop eats baby food. BABY FOOD!')
print(mo)
# [...] defines a custom character class; [^...] matches any character not in the class; a hyphen expresses a range such as a-z
Regex_own1 = re.compile(r'[^AEIOUaeiou]')
mo1 = Regex_own1.findall('RoboCop eats baby food. BABY FOOD!')
print(mo1)
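
The hyphen ranges mentioned in the comment above are not demonstrated, so here is a short sketch. The sample sentence and variable names are invented for illustration.

```python
import re

sample = 'Room 42, Block B-7!'

# Ranges inside a class: lowercase letters, uppercase letters, and digits.
letters_and_digits = re.compile(r'[a-zA-Z0-9]+')
tokens = letters_and_digits.findall(sample)

# The negated class matches everything else except whitespace.
punctuation = re.compile(r'[^a-zA-Z0-9\s]')
marks = punctuation.findall(sample)

print(tokens)
print(marks)
```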

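1.7 Precise matching

The caret ^ anchors a match to the start of the string, and the dollar sign $ anchors it to the end.
The wildcard . matches any single character.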


Regex1 = re.compile(r'^Hello')
mo = Regex1.search('Hello world!')
mo1 = Regex1.search('he said hello!')
print(mo,'\n',mo1)
Regex2 = re.compile(r'^\d+$')
mo2 = Regex2.search('111111111x23333333')
mo3 = Regex2.search('222213232131')
print(mo2,'\n',mo3)
Regex3 = re.compile(r'\d+$')
mo4  = Regex3.search('my age is 26')
print(mo4)

# The wildcard . matches any character except a newline, but only a single character
Regex_at = re.compile(r'.at')
mo = Regex_at.findall('The cat in the hat sat on the flat mat.')
print(mo)

# .* matches any run of characters, for example to capture free-form input such as a name
Regex_name = re.compile(r'First name: (.*) Last name: (.*)')
mo = Regex_name.search('First name: Al Last name: Sweigart')
print(mo.group(1))
print(mo.group(2))

# Greedy (.*) versus non-greedy (.*?) matching
Regex_greed = re.compile(r'<.*>')
mo = Regex_greed.search('<To serve man> for dinner>')
print(mo.group())
Regex_nogreed = re.compile(r'<.*?>')
mo1 = Regex_nogreed.search('<To serve man> for dinner>')
print(mo1.group())

1.8 The second argument to compile

re.VERBOSE ignores whitespace and comments inside the pattern
re.I (re.IGNORECASE) ignores case
re.DOTALL lets the wildcard . match newlines as well

# For the wildcard . to match newline characters, pass re.DOTALL as the second argument
Regex_nonewline = re.compile(r'.*')
mo = Regex_nonewline.findall('Serve the public trust. \nProtect the innocent\nUphold the law')
print(mo)
Regex_newline = re.compile(r'.*',re.DOTALL)
mo1 = Regex_newline.findall('Serve the public trust. \nProtect the innocent\nUphold the law')
print(mo1)

#  Ignore case 
Regex_cop = re.compile(r'robocop', re.I)
mo=Regex_cop.search('RoboCop is part man, part machine,all cop.')
print(mo.group())

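# re.VERBOSE ignores whitespace and comments in the pattern itself, not in the searched text.
# A literal space in the pattern must therefore be escaped (or written as [ ]);
# otherwise the pattern becomes Agent(\w)\w* and never matches 'Agent Bob'.
Regex_group = re.compile(r'Agent\ (\w)\w*', re.VERBOSE)
mo1 = Regex_group.sub(r'\1****', 'A gent Alice gave the secret documents to Agent Bob')
print(mo1)
# Several flags can be passed in the same argument by combining them with bitwise OR (|)
Regex_group = re.compile(r'Agent\ (\w)\w*', re.VERBOSE | re.I | re.DOTALL)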
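
The main use of re.VERBOSE is to spread a pattern over several lines with comments. The sketch below assumes a made-up sentence and variable names; it writes the literal space as [ ], because a bare space would be ignored in verbose mode.

```python
import re

agent_regex = re.compile(r'''
    agent      # the literal word; re.I makes it case-insensitive
    [ ]        # a literal space (a bare space is ignored in verbose mode)
    (\w)       # group 1: the first letter of the name
    \w*        # the rest of the name
''', re.VERBOSE | re.I)

censored = agent_regex.sub(r'\1****', 'Agent Alice told AGENT Bob.')
print(censored)
```

Both spellings of "agent" are replaced because re.I is combined with re.VERBOSE using the | operator.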

1.9 Replacing text

# sub() replaces every piece of matched text with the given string
Regex_sub = re.compile(r'Agent \w+')
mo = Regex_sub.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob')
print(mo)
# To reuse part of the matched text in the replacement (for example, the first letter of a name), put that part in a group and refer to it as \1, \2, \3, and so on
Regex_group = re.compile(r'Agent (\w)\w*')
mo1 = Regex_group.sub(r'\1****', 'Agent Alice gave the secret documents to Agent Bob')
print(mo1)
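
The same back-reference mechanism works with several groups at once. As an illustrative sketch (the date format and sentence are invented for this example), \1, \2, and \3 can be used to reorder the parts of a match:

```python
import re

# Three groups: year, month, day.
date_regex = re.compile(r'(\d{4})-(\d{2})-(\d{2})')

# Rewrite year-month-day as day/month/year using back-references.
reordered = date_regex.sub(r'\3/\2/\1', 'Published on 2022-06-26.')
print(reordered)
```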

1.10 Phone number and email address extractor

Copy the text to the clipboard, either manually or from a program.
Get the text from the clipboard: use the pyperclip module to copy and paste strings, and create two regular expressions, one matching phone numbers and one matching email addresses.
Find all phone numbers and email addresses in the text: collect every match, not just the first one.
Paste them onto the clipboard: format the matched strings neatly and join them into a single string for pasting.
If no matches are found, print a message saying so.

import pyperclip, re

# Define two regular expressions
phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?              # area code
    (\s|-|\.)?                      # separator: space, - or .
    (\d{3})                         # first three digits
    (\s|-|\.)                       # separator
    (\d{4})                         # last four digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))? # optional extension; its digits are at index 8 of the tuple
    )''', re.VERBOSE)
# findall() returns one tuple per match: the outermost group first, 9 elements in total

emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+      # user name
    @
    [a-zA-Z0-9.-]+         # domain name
    (\.[a-zA-Z]{2,4})      # .com, .cn, etc.
    )''', re.VERBOSE)

# Find all matches in the clipboard text
text = str(pyperclip.paste())
matches = []
for groups in phoneRegex.findall(text):
    print(groups)
    phoneNum = '-'.join([groups[1], groups[3], groups[5]])
    # groups[8] is an empty string when there is no extension
    if groups[8] != '':
        phoneNum += ' x' + groups[8]
    # groups[0] is the raw matched text; append phoneNum instead to store the normalized form
    matches.append(groups[0])
for groups in emailRegex.findall(text):
    matches.append(groups[0])

# Join the matches into one string and copy it to the clipboard
if len(matches) > 0:
    pyperclip.copy('\n'.join(matches))
    print('Copied to clipboard: ')
    print('\n'.join(matches))
else:
    print('No phone numbers or email address found')
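
For experimenting without the clipboard, the matching step can be wrapped in a function that takes the text as an argument. This is only a sketch under stated assumptions: extract_contacts and the upper-case constant names are hypothetical, and unlike the script above it stores the normalized phone number and skips an empty area code when joining.

```python
import re

PHONE_REGEX = re.compile(r'''(
    (\d{3}|\(\d{3}\))?              # optional area code
    (\s|-|\.)?                      # optional separator
    (\d{3})                         # first three digits
    (\s|-|\.)                       # separator
    (\d{4})                         # last four digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))? # optional extension
)''', re.VERBOSE)

EMAIL_REGEX = re.compile(r'''(
    [a-zA-Z0-9._%+-]+      # user name
    @
    [a-zA-Z0-9.-]+         # domain name
    (\.[a-zA-Z]{2,4})      # top-level domain
)''', re.VERBOSE)


def extract_contacts(text):
    """Return normalized phone numbers followed by email addresses found in text."""
    matches = []
    for groups in PHONE_REGEX.findall(text):
        # Join only the non-empty parts so a missing area code leaves no leading dash.
        phone = '-'.join(part for part in (groups[1], groups[3], groups[5]) if part)
        if groups[8] != '':
            phone += ' x' + groups[8]
        matches.append(phone)
    for groups in EMAIL_REGEX.findall(text):
        matches.append(groups[0])
    return matches


sample = 'Call 415-555-1011 ext 123 or 415.555.9999; mail bob@example.com'
print(extract_contacts(sample))
```

Passing the text in as a parameter keeps the regular expressions testable independently of pyperclip.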

For example, running the program on text copied from the CSDN home page gives the following result:

Copied to clipboard: 
400-660-0108

999-2021
472464
1900

658
1101
[email protected].net

Copyright notice
This article was written by [ML_ python_ get√]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/02/202202170553131907.html