Self taught programming series - 1 regular expression

2022-06-26 09:08:00 ML_ python_ get√

1.1 Without regular expressions


def isPhoneNumber(text):
    # A number such as 415-555-4242 is exactly 12 characters long
    if len(text) != 12:
        return False
    for i in range(0, 3):
        # Each of the first three characters must be a decimal digit
        if not text[i].isdecimal():
            return False
    if text[3] != '-':
        return False
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False
    if text[7] != '-':
        return False
    for i in range(8, 12):
        if not text[i].isdecimal():
            return False
    return True
    
    
# print("191-666-1234 is a phone number: ")
# print(isPhoneNumber('191-666-1234'))
# print("bilibili is a phone number: ")
# print(isPhoneNumber('bilibili'))
# # Find phone numbers in a longer string
# message = "call me at 415-555-1011 tomorrow. 415-555-9999 is my office."
# for i in range(len(message)):
#     chunk = message[i:i+12]
#     if isPhoneNumber(chunk):
#         print('phone number found: ' + chunk)
# print('Done')

1.2 Regular expressions

  • \d stands for a digit from 0 to 9, so the phone numbers above can be described by \d\d\d-\d\d\d-\d\d\d\d
  • In \d{3}-\d{3}-\d{4}, {3} means "match the preceding item three times". A regular expression is a pattern to match against; the objects it produces have their own attributes and methods
  • The search() method returns a Match object (or None if nothing matches); a Match object has a group() method that returns the text actually matched

import re

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
# The r prefix creates a raw string. Without it, every backslash would have to be escaped as \\, which is more cumbersome
mo = phoneNumRegex.search('My number is 415-555-4242.')
print("phone number found: " + mo.group())

Summary: steps for using a regular expression

  • Import the re module
  • Call re.compile() to create a Regex object
  • Pass the string you want to search to the Regex object's search() method, which returns a Match object
  • Call the Match object's group() method to get the actual matched string
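
The four steps above can be sketched as a small helper. The name find_phone_number is hypothetical and not part of the original post; the sketch also covers the case where search() finds nothing and returns None, in which case calling group() directly would raise an AttributeError.

```python
import re

# Step 2: compile the pattern once (step 1 is the import above)
PHONE_REGEX = re.compile(r'\d{3}-\d{3}-\d{4}')


def find_phone_number(text):
    """Hypothetical helper: return the first phone number in text, or None."""
    mo = PHONE_REGEX.search(text)  # step 3: a Match object, or None
    if mo is None:
        return None
    return mo.group()  # step 4: the text that actually matched


print(find_phone_number('My number is 415-555-4242.'))
print(find_phone_number('No number here.'))
```

Checking for None before calling group() is one simple way to keep the script from crashing on text that contains no match.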

1.3 Searching with groups

  • Simple grouping
  • Matching alternatives with the pipe |
  • Symbols such as ?, * and .


# group(n) returns a single group; groups() returns all groups as a tuple
Regex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = Regex.search('my number is 123-456-8888.')
print(mo.group(0))
print(mo.group(1))
print(mo.groups())
a,b = mo.groups()
print(a)
print(b)
# To match literal parentheses in the text, escape them as \( and \) so they lose their grouping meaning
Regex1 = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
mo = Regex1.search('my number is (123) 456-8888.')
print(mo.group(1))

# The pipe | matches one of several alternatives; search() returns whichever occurs first in the text
Regex_hero = re.compile(r'Ironman|Batman')
mo = Regex_hero.search('Ironman and Batman!')
print(mo.group())
mo = Regex_hero.search('Batman and Ironman!')
print(mo.group())
# findall() would return every match instead of only the first one
# The pipe also works inside a group, for alternatives that share the same prefix
Regex_a = re.compile(r'Bat(man|mobile|copter|bat)')
mo = Regex_a.search('Batbat and Batmobile are best!')
print(mo.group())
# (group)? makes the group optional: it matches zero or one time
Regex_choice = re.compile(r'Bat(wo)?man')
mo = Regex_choice.search(' I am Batman')
mo1 = Regex_choice.search('you are Batwoman!')
print(mo.group())
print(mo1.group())
# (group)* matches the group zero or more times
Regex_new = re.compile(r'Bat(wo)*man')
mo  = Regex_new.search('Batman is my lover!')
print(mo.group())
mo1 = Regex_new.search('my name is Batwowowowowoman!')
print(mo1.group())
mo2 = Regex_new.search('my name is Batman')
print(mo2.group())
# (group)+ matches the group one or more times
# Regex_add = re.compile(r'Bat(wo)+man')
# mo3 = Regex_add.search('my name is Batman')
# print(mo3.group())
# AttributeError: 'NoneType' object has no attribute 'group' (search() found no match and returned None)
# {} sets the number of repetitions: (group){3} exactly 3 times, {3,} 3 or more times, {,5} at most 5 times
Regex_ha = re.compile(r'(ha){3}')
mo = Regex_ha.search('hahaha')
print(mo.group())
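
As a minimal sketch of the other quantifier forms mentioned in the comments above, the following compares {3}, {3,5}, {3,} and {,5} on short test strings, and shows + returning None when the group does not occur at all. The variable names are illustrative only.

```python
import re

text = 'ha' * 6  # 'hahahahahaha'

exactly_three = re.search(r'(ha){3}', text).group()
three_to_five = re.search(r'(ha){3,5}', text).group()
three_or_more = re.search(r'(ha){3,}', text).group()
up_to_five = re.search(r'(ha){,5}', 'haha').group()

print(exactly_three)  # hahaha
print(three_to_five)  # hahahahaha
print(three_or_more)  # hahahahahaha
print(up_to_five)     # haha

# + needs at least one 'wo', so there is no match in 'Batman'
no_match = re.search(r'Bat(wo)+man', 'my name is Batman')
print(no_match)       # None
```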

1.4 Greedy and non-greedy matching

  • Python's regular expressions are greedy by default: when several lengths are possible, they match the longest string. Adding ? after the quantifier, as in (group){3,5}?, makes the match non-greedy, so it matches the shortest string

Regex_nogreedy = re.compile(r'(ha){3,5}?')
mo = Regex_nogreedy.search('hahahahahaha')
print(mo.group())
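
To make the difference visible, here is a small side-by-side sketch that runs the greedy and the non-greedy version of the same pattern on the same string. The variable names are illustrative.

```python
import re

text = 'hahahahahaha'  # six repetitions of 'ha'

greedy = re.search(r'(ha){3,5}', text).group()       # takes as many as allowed
non_greedy = re.search(r'(ha){3,5}?', text).group()  # takes as few as allowed

print(greedy)      # hahahahaha
print(non_greedy)  # hahaha
```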

1.5 findall() and search()

  • findall() finds every match in the string; search() finds only the first one
  • findall() returns a list of strings, while search() returns a Match object. If the pattern contains groups, findall() returns a list of tuples instead, with one string per group


Regex_phone = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo =  Regex_phone.search('Cell:123-456-8888 Work:123-567-9999')
mo1 = Regex_phone.findall('Cell:123-456-8888 Work:123-567-9999')
print(mo.group())
print(mo1)
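
A short sketch of how groups change what findall() returns, using the same sample text as above. Without groups each list item is a string; with groups each item is a tuple of the group contents.

```python
import re

text = 'Cell:123-456-8888 Work:123-567-9999'

no_groups = re.findall(r'\d{3}-\d{3}-\d{4}', text)
with_groups = re.findall(r'(\d{3})-(\d{3})-(\d{4})', text)

print(no_groups)    # ['123-456-8888', '123-567-9999']
print(with_groups)  # [('123', '456', '8888'), ('123', '567', '9999')]
```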

1.6 Character classes

  • \d matches a digit 0-9; \D matches any character that is not a digit
  • \w matches a word character (a letter, digit or underscore); \W matches any character that is not a word character
  • \s matches a whitespace character; \S matches any non-whitespace character

# + matches one or more times: one or more digits, optional whitespace, then one or more word characters
Regex_str = re.compile(r'\d+\s*\w+')
mo = Regex_str.findall('12 drummers, 11 pipers, 10 lords, 9 ladies, 8maids, 7swans, 6 geese, 5 rings, 4 birds, 3 hens, 2 doves, 1 partridge')
print(mo)
# Define your own character class
Regex_own = re.compile(r'[AEIOUaeiou]')
mo = Regex_own.findall('RoboCop eats baby food. BABY FOOD!')
print(mo)
# [chars] matches any one of the listed characters; [^chars] matches any character not listed; a hyphen gives a range, e.g. [a-z]
Regex_own1 = re.compile(r'[^AEIOUaeiou]')
mo1 = Regex_own1.findall('RoboCop eats baby food. BABY FOOD!')
print(mo1)
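
The hyphen ranges mentioned in the comment above are not exercised by the listing, so here is a minimal sketch. The sample sentence is made up for illustration.

```python
import re

text = 'Order A7 shipped 2022-06-26, ref: x_91!'

# Ranges: any run of ASCII letters or digits
letters_digits = re.findall(r'[a-zA-Z0-9]+', text)

# Negated class: single characters that are neither letters, digits nor whitespace
others = re.findall(r'[^a-zA-Z0-9\s]', text)

print(letters_digits)
print(others)
```

Note that the underscore shows up in the second list: unlike \w, the range [a-zA-Z0-9] does not include it.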

1.7 Exact matching

  • ^ (caret) anchors the match at the start of the string and $ (dollar sign) at the end
  • The wildcard .

Regex1 = re.compile(r'^Hello')
mo = Regex1.search('Hello world!')
mo1 = Regex1.search('he said hello!')
print(mo,'\n',mo1)
Regex2 = re.compile(r'^\d+$')
mo2 = Regex2.search('111111111x23333333')
mo3 = Regex2.search('222213232131')
print(mo2,'\n',mo3)
Regex3 = re.compile(r'\d+$')
mo4  = Regex3.search('my age is 26')
print(mo4)
# The wildcard . matches any single character except a newline
Regex_at = re.compile(r'.at')
mo = Regex_at.findall('The cat in the hat sat on the flat mat.')
print(mo)
# .* matches any run of characters (except newlines), which is useful for free-form fields such as names
Regex_name = re.compile(r'First name: (.*) Last name: (.*)')
mo = Regex_name.search('First name: AI Last name: Sweigart')
print(mo.group(1))
print(mo.group(2))
# Greedy versus non-greedy matching with .*
Regex_greed = re.compile(r'<.*>')
mo = Regex_greed.search('<To serve man> for dinner>')
print(mo.group())
Regex_nogreed = re.compile(r'<.*?>')
mo1 = Regex_nogreed.search('<To serve man> for dinner>')
print(mo1.group())
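
Combining ^ and $ is a common way to check that a whole string fits a pattern. The sketch below wraps the ^\d+$ pattern from above in a hypothetical helper, is_all_digits, and also shows one subtlety: $ matches just before a trailing newline, whereas re.fullmatch() requires the entire string to match.

```python
import re

DIGITS_ONLY = re.compile(r'^\d+$')


def is_all_digits(text):
    """Hypothetical helper: True if text consists only of digits."""
    return DIGITS_ONLY.search(text) is not None


print(is_all_digits('222213232131'))        # True
print(is_all_digits('111111111x23333333'))  # False

# $ also matches just before a newline at the very end of the string
print(is_all_digits('123\n'))               # True

# re.fullmatch() is stricter: the newline is part of the string and is not a digit
print(re.fullmatch(r'\d+', '123\n') is None)  # True
```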

1.8 The second argument to compile()

  • re.VERBOSE: ignore whitespace and comments inside the pattern
  • re.I (re.IGNORECASE): ignore case
  • re.DOTALL: let the wildcard . match newlines as well

# For the wildcard . to match newline characters, pass re.DOTALL as the second argument
Regex_nonewline = re.compile(r'.*')
mo = Regex_nonewline.findall('Serve the public trust. \nProtect the innocent\nUphold the law')
print(mo)
Regex_newline = re.compile(r'.*',re.DOTALL)
mo1 = Regex_newline.findall('Serve the public trust. \nProtect the innocent\nUphold the law')
print(mo1)
#  Ignore case 
Regex_cop = re.compile(r'robocop', re.I)
mo=Regex_cop.search('RoboCop is part man, part machine,all cop.')
print(mo.group())

# re.VERBOSE makes the pattern ignore whitespace, so a literal space must be written as \s (or escaped)
Regex_group = re.compile(r'Agent \s (\w) \w*', re.VERBOSE)
mo1 = Regex_group.sub(r'\1****', 'Agent Alice gave the secret documents to Agent Bob')
print(mo1)
# Several flags can be passed in the same argument by combining them with bitwise OR (|)
Regex_group = re.compile(r'Agent \s (\w) \w*', re.VERBOSE | re.I | re.DOTALL)
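
As a rough sketch of the three flags working together, the pattern below is laid out over several lines with comments (re.VERBOSE), matches regardless of case (re.I), and lets .* run across line breaks (re.DOTALL). The pattern and variable names are made up for illustration.

```python
import re

text = 'Serve the public trust.\nProtect the innocent.\nUphold the LAW.'

span_regex = re.compile(r'''
    serve   # first word, in any case because of re.I
    .*      # anything in between, newlines included because of re.DOTALL
    law     # last word, in any case
    ''', re.VERBOSE | re.I | re.DOTALL)

with_flags = span_regex.search(text)
print(with_flags.group())

# Without re.DOTALL the wildcard stops at the first newline, so nothing matches
without_dotall = re.compile(r'serve.*law', re.I).search(text)
print(without_dotall)
```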

1.9 Replacing text

  • The sub() method

# sub() replaces every match with the given text
Regex_sub = re.compile(r'Agent \w+')
mo = Regex_sub.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob')
print(mo)
# To keep part of the matched text in the replacement (for example the first letter of a name), put it in a group and refer to it as \1, \2, \3, ...
Regex_group = re.compile(r'Agent (\w)\w*')
mo1 = Regex_group.sub(r'\1****', 'Agent Alice gave the secret documents to Agent Bob')
print(mo1)
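
The listing above uses only \1. As a small sketch with several groups, the pattern below captures the three parts of a phone number and keeps the first and third while masking the middle one. The sample text and names are illustrative.

```python
import re

phone_regex = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
text = 'Cell: 123-456-8888 Work: 123-567-9999'

# \1 and \3 insert the first and third groups; the middle group is replaced by ***
masked = phone_regex.sub(r'\1-***-\3', text)
print(masked)  # Cell: 123-***-8888 Work: 123-***-9999
```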

1.10 Phone number and email address extractor

  • Copy the text to the clipboard, either manually or from a program
  • Get the text from the clipboard: use the pyperclip module to copy and paste strings, and create two regular expressions, one for phone numbers and one for email addresses
  • Find all phone numbers and email addresses in the text: collect every match, not just the first one
  • Paste them to the clipboard: format the matched strings neatly and join them into a single string for pasting
  • If no match is found, print a message

import pyperclip, re

# Define two regular expressions
phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?                 # first three digits, optionally in parentheses
    (\s|-|\.)?                         # separator: whitespace, - or .
    (\d{3})                            # next three digits
    (\s|-|\.)                          # separator
    (\d{4})                            # last four digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?    # optional extension; the group at index 8 is the extension number
    )''', re.VERBOSE)
# findall() returns one tuple per match: the outermost group comes first, 9 elements in total

emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+      # user name
    @
    [a-zA-Z0-9.-]+         # domain name
    (\.[a-zA-Z]{2,4})      # .com, .cn, etc.
    )''', re.VERBOSE)
# Find all matches in the clipboard text
text = str(pyperclip.paste())
matches = []
for groups in phoneRegex.findall(text):
    # Join only the non-empty parts, so a missing first group leaves no leading '-'
    phoneNum = '-'.join(part for part in (groups[1], groups[3], groups[5]) if part)
    if groups[8] != '':
        phoneNum += ' x' + groups[8]
    matches.append(phoneNum)
for groups in emailRegex.findall(text):
    matches.append(groups[0])
# Join the matches into one string and copy it to the clipboard
if len(matches) > 0:
    pyperclip.copy('\n'.join(matches))
    print('Copied to clipboard: ')
    print('\n'.join(matches))
else:
    print('No phone numbers or email address found')

For example, running the script on the text of the CSDN home page gives the following result:

Copied to clipboard: 
400-660-0108

999-2021
472464
1900

658
1101
[email protected].net
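
For trying the extractor without a clipboard, the same idea can be sketched as a plain function that takes the text as an argument. The function name extract_contacts and the sample sentence are hypothetical, and the patterns are a cleaned-up reading of the two regular expressions above, not necessarily the author's exact ones.

```python
import re

PHONE_REGEX = re.compile(r'''(
    (\d{3}|\(\d{3}\))?               # first three digits, optionally in parentheses
    (\s|-|\.)?                       # optional separator
    (\d{3})                          # next three digits
    (\s|-|\.)                        # separator
    (\d{4})                          # last four digits
    (\s*(ext|x|ext\.)\s*(\d{2,5}))?  # optional extension
    )''', re.VERBOSE)

EMAIL_REGEX = re.compile(r'''(
    [a-zA-Z0-9._%+-]+    # user name
    @
    [a-zA-Z0-9.-]+       # domain name
    (\.[a-zA-Z]{2,4})    # .com, .cn, etc.
    )''', re.VERBOSE)


def extract_contacts(text):
    """Hypothetical helper: return phone numbers, then email addresses, found in text."""
    matches = []
    for groups in PHONE_REGEX.findall(text):
        # Index 0 is the whole match; 1, 3 and 5 are the digit groups, 8 the extension
        number = '-'.join(part for part in (groups[1], groups[3], groups[5]) if part)
        if groups[8]:
            number += ' x' + groups[8]
        matches.append(number)
    for groups in EMAIL_REGEX.findall(text):
        matches.append(groups[0])
    return matches


sample = 'Call 415-555-1011 or (415) 555-9999 ext 12, or write to info@example.com.'
found = extract_contacts(sample)
print('\n'.join(found) if found else 'No phone numbers or email addresses found')
```

Keeping the matching logic in a function that does not touch the clipboard makes it easy to test on fixed strings; the pyperclip calls can then be a thin wrapper around it.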
Copyright notice
This article was written by [ML_ python_ get√]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/02/202202170553131907.html