Crawler learning 5: getting past anti-crawling by recognizing image verification codes (ddddocr and pytesseract, tested results)

2022-06-27 06:06:00 【lufei0920】

Crawler learning 5: recognizing image verification codes (anti-crawling)

Name	Environment and version	Notes
ddddocr	Installed on Linux; Python 3.6.8; install command: python3 -m pip install ddddocr; installed version: ddddocr-1.4.3	See the project notes in /usr/local/lib/python3.6/site-packages/ddddocr-1.4.3-py3.6.egg/ddddocr/__init__.py. Recognition quality is good.
pytesseract	Installed on Linux; Python 3.6.8; requires tesseract to be installed	Recognition quality is mediocre; not recommended.

I. Example: recognizing an image verification code with ddddocr

First, install the ddddocr module: python3 -m pip install ddddocr
	The installation was rather bumpy and kept failing with errors. I ended up installing the dependencies named in the error messages one at a time, after which the installation completed.
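
When the install fails on one dependency after another, it can help to check which modules are still missing before retrying. The following is a minimal sketch: the module list is an assumption about what ddddocr 1.4.x imports (numpy, onnxruntime, Pillow as `PIL`, OpenCV as `cv2`), and `missing_modules` is a hypothetical helper name, not part of ddddocr.

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    missing = []
    for name in names:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumed import names of ddddocr's dependencies; adjust to the errors you see.
    assumed_deps = ["numpy", "onnxruntime", "PIL", "cv2", "ddddocr"]
    print("Missing modules:", missing_modules(assumed_deps))
```

Each name it reports can then be installed separately with python3 -m pip install, using the package name that corresponds to the import name.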

1. Sample code

from selenium import webdriver
from selenium.webdriver.common.by import By
from PIL import Image, ImageEnhance

import ddddocr

ocr = ddddocr.DdddOcr()  # load the recognition model once

url = "Page to visit"  # placeholder: the URL of the page to visit
options = webdriver.ChromeOptions()
options.add_argument("--headless")  # run Chrome without a visible window
options.add_argument("--no-sandbox")
options.add_argument("--disable-gpu")
options.add_argument("--disable-dev-shm-usage")  # the four options above are needed on Linux
# executable_path is the Selenium 3 form; newer Selenium 4 releases expect a Service object instead
driver = webdriver.Chrome(options=options, executable_path='/usr/bin/chromedriver')

driver.get(url)  # request the URL
driver.maximize_window()  # maximize the browser window
driver.save_screenshot('m3.png')  # screenshot the whole page and save it as an image
location = driver.find_element(By.XPATH, '//*[@id="login"]/div[5]/span')  # element that holds the verification code
# print(location.location)
size = location.size  # width and height of the element
# print(size)
# crop box for the verification code: (left, upper, right, lower)
rangle = (int(location.location['x']), int(location.location['y']),
          int(location.location['x'] + size['width']), int(location.location['y'] + size['height']))
driver.quit()  # the browser is no longer needed once the coordinates are known

i = Image.open('m3.png')  # open the saved page screenshot
imgry = i.crop(rangle)  # cut out the verification code area
imgry.save('getVerifyCode1.png')  # save the verification code image
im = Image.open('getVerifyCode1.png')  # reopen the cropped verification code image
sharpness = ImageEnhance.Contrast(im)  # contrast enhancer, makes the characters easier to recognize
sharp_img = sharpness.enhance(2.0)  # double the contrast
sharp_img.save("newVerifyCode1.png")  # save the enhanced verification code image

with open('newVerifyCode1.png', 'rb') as f:
    img_bytes = f.read()  # read the image as bytes
res = ocr.classification(img_bytes)  # recognize the characters in the image
print(res)
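
The crop rectangle above is built from the element's `location` and `size`. A small helper makes that arithmetic explicit. This is a sketch: `crop_box` and its `scale` parameter are hypothetical names, and `scale` covers the assumption that on some displays the screenshot has more pixels than the CSS coordinates Selenium reports (a device pixel ratio other than 1).

```python
def crop_box(location, size, scale=1.0):
    """Build a (left, upper, right, lower) tuple for PIL's Image.crop().

    location: dict with 'x' and 'y', as returned by WebElement.location
    size:     dict with 'width' and 'height', as returned by WebElement.size
    scale:    assumed ratio of screenshot pixels to CSS pixels (1.0 = no scaling)
    """
    left = int(location["x"] * scale)
    upper = int(location["y"] * scale)
    right = int((location["x"] + size["width"]) * scale)
    lower = int((location["y"] + size["height"]) * scale)
    return (left, upper, right, lower)


if __name__ == "__main__":
    # Hypothetical element position and size, for illustration only.
    print(crop_box({"x": 100, "y": 200}, {"width": 80, "height": 30}))
```

With it, the crop step could be written as i.crop(crop_box(location.location, location.size)). If the cropped image comes out offset or too small, trying a different `scale` is one thing to check.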

2. Result of running the code

The recognized text matched the characters shown in the verification code image.
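
A single matching image shows that the approach works, but comparing recognizers fairly needs more than one sample. Below is a minimal sketch of such a measurement, assuming you have a handful of verification code images whose text you have labeled by hand. `accuracy` and the stand-in recognizer are hypothetical; any callable that maps image bytes to a string (for example `ocr.classification`) could be passed in its place.

```python
def accuracy(recognize, samples, ignore_case=True):
    """Fraction of samples where recognize(image_bytes) equals the expected text.

    recognize: callable taking image bytes and returning a string
    samples:   list of (image_bytes, expected_text) pairs labeled by hand
    """
    if not samples:
        raise ValueError("samples must not be empty")
    hits = 0
    for image_bytes, expected in samples:
        got = recognize(image_bytes).strip()
        if ignore_case:
            got, expected = got.lower(), expected.lower()
        if got == expected:
            hits += 1
    return hits / len(samples)


if __name__ == "__main__":
    # Stand-in recognizer and data so the sketch runs without any OCR library.
    fake_answers = {b"img1": "ab12", b"img2": "XY9z", b"img3": "0000"}
    samples = [(b"img1", "AB12"), (b"img2", "xy9z"), (b"img3", "o0o0")]
    print(accuracy(fake_answers.get, samples))
```

Whether to ignore case depends on the site: some verification codes are case-sensitive, in which case `ignore_case=False` gives the stricter number.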

II. Recognizing the verification code with pytesseract

1. Install pytesseract

python3 -m pip install pytesseract

2. Install tesseract

For installation details, see: https://blog.csdn.net/weixin_44575268/article/details/117258508

3. Code example

from selenium import webdriver
from selenium.webdriver.common.by import By
from PIL import Image, ImageEnhance

import pytesseract

# tell pytesseract where the tesseract binary is installed
tesseract_cmd = r'/usr/local/bin/tesseract'
pytesseract.pytesseract.tesseract_cmd = tesseract_cmd

url = "Page to visit"  # placeholder: the URL of the page to visit
options = webdriver.ChromeOptions()
options.add_argument("--headless")  # run Chrome without a visible window
options.add_argument("--no-sandbox")
options.add_argument("--disable-gpu")
options.add_argument("--disable-dev-shm-usage")  # the four options above are needed on Linux
# executable_path is the Selenium 3 form; newer Selenium 4 releases expect a Service object instead
driver = webdriver.Chrome(options=options, executable_path='/usr/bin/chromedriver')

driver.get(url)  # request the URL
driver.maximize_window()  # maximize the browser window
driver.save_screenshot('m3.png')  # screenshot the whole page and save it as an image
location = driver.find_element(By.XPATH, '//*[@id="login"]/div[5]/span')  # element that holds the verification code
# print(location.location)
size = location.size  # width and height of the element
# print(size)
# crop box for the verification code: (left, upper, right, lower)
rangle = (int(location.location['x']), int(location.location['y']),
          int(location.location['x'] + size['width']), int(location.location['y'] + size['height']))
driver.quit()  # the browser is no longer needed once the coordinates are known

i = Image.open('m3.png')  # open the saved page screenshot
imgry = i.crop(rangle)  # cut out the verification code area
imgry.save('getVerifyCode1.png')  # save the verification code image
im = Image.open('getVerifyCode1.png')  # reopen the cropped verification code image
sharpness = ImageEnhance.Contrast(im)  # contrast enhancer, makes the characters easier to recognize
sharp_img = sharpness.enhance(2.0)  # double the contrast
sharp_img.save("newVerifyCode1.png")  # save the enhanced verification code image

newVerify = Image.open('newVerifyCode1.png')  # open the enhanced image for OCR
mm = pytesseract.image_to_string(newVerify, lang='eng')  # recognize text with the English model
print(mm)
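
Tesseract's raw output often carries whitespace or stray punctuation around the characters, so it is usually worth normalizing the string before submitting it. Below is a minimal sketch, assuming the verification code contains only ASCII letters and digits and has a known length. `clean_ocr_text`, `looks_valid`, and the length of 4 are hypothetical choices, not something the site or pytesseract defines.

```python
import re


def clean_ocr_text(raw):
    """Drop everything except ASCII letters and digits from OCR output."""
    return re.sub(r"[^0-9A-Za-z]", "", raw)


def looks_valid(text, expected_length=4):
    """Cheap sanity check used to decide whether to fetch a new code and retry."""
    return len(text) == expected_length


if __name__ == "__main__":
    raw = " a B1 c\n\x0c"  # made-up OCR output for illustration
    cleaned = clean_ocr_text(raw)
    print(cleaned, looks_valid(cleaned))
```

In the script above this would be applied as clean_ocr_text(mm) before using the result. When `looks_valid` returns False, requesting a fresh verification code is usually cheaper than trying to repair the string.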