Online Recognition of Website CAPTCHAs with Baidu OCR
2022-08-01 06:40:00 【Anuttarasamyasambodh】
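0. Problem:
Recognize a website's CAPTCHA dynamically so that subsequent operations can proceed
1. Approach:
1.1. Obtain the CAPTCHA image
1.2. Recognize the CAPTCHA online with the Baidu OCR API
2. Implementation:
2.1. Obtain the CAPTCHA image
2.1.1 Use webdriver to simulate a browser and load the page
2.1.2 Crop the CAPTCHA image out of a page screenshot using the image element's position and size, then save it
The code is as follows: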
from PIL import Image
from selenium import webdriver
from selenium.webdriver.common.by import By

def verifycode():
    driver = webdriver.Chrome()
    driver.set_page_load_timeout(5)
    driver.set_script_timeout(5)
    try:
        driver.get("https://query.ruankao.org.cn/certificate/main")
    except Exception:
        print('time out in search page')
    # 1. Save a screenshot of the page. The file name must end in .png; other image formats trigger a warning.
    driver.save_screenshot("scr_img.png")
    # 2. Locate the CAPTCHA image element
    # code_ele = driver.find_element(By.ID, "imgVerifyCode")
    code_ele = driver.find_element(By.ID, "pic")
    # 3. Position of the element, e.g. {'y': 478, 'x': 565}: the top-left corner of the image
    print(code_ele.location)
    # 4. Size of the element, e.g. {'height': 37, 'width': 135}
    print(code_ele.size)
    # 5. Compute the bounding box of the element
    x0 = code_ele.location["x"]  # 565
    y0 = code_ele.location["y"]  # 478
    x1 = code_ele.size["width"] + x0
    y1 = code_ele.size["height"] + y0
    img = Image.open("scr_img.png")
    image = img.crop((x0, y0, x1, y1))  # left, top, right, bottom
    image.save("code_img.png")  # save the CAPTCHA image as code_img.png
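One caveat with cropping a full-page screenshot: the element's location and size are reported in CSS pixels, while the screenshot can be rendered at a higher device pixel ratio on HiDPI displays, which shifts the crop box. Below is a minimal sketch of computing the box with a scale factor. `crop_box` and its `scale` parameter are hypothetical helpers, not part of Selenium or Pillow, and the scale itself would have to be obtained separately (for example from `window.devicePixelRatio`).

```python
def crop_box(location, size, scale=1.0):
    """Return (left, top, right, bottom) in screenshot pixels.

    location and size are dicts shaped like Selenium's element.location and
    element.size; scale is an assumed device pixel ratio (1.0 = no scaling).
    """
    left = location["x"] * scale
    top = location["y"] * scale
    right = (location["x"] + size["width"]) * scale
    bottom = (location["y"] + size["height"]) * scale
    return tuple(int(round(v)) for v in (left, top, right, bottom))


# Values taken from the example results noted in the code comments above.
box = crop_box({"x": 565, "y": 478}, {"height": 37, "width": 135})
print(box)  # (565, 478, 700, 515)
```

The resulting tuple can be passed straight to `Image.crop`. Selenium's `WebElement.screenshot(filename)` is another option: it captures only the element and avoids the crop entirely.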
Alternatively, use XPath to locate the CAPTCHA image's URL and download the image directly:
import requests
from lxml import etree

def verifycode():
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Connection': 'keep-alive',
        'Referer': 'https://query.ruankao.org.cn/certificate/main',
        'X-Requested-With': 'XMLHttpRequest',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36 Edg/88.0.705.74',
        'Cookie': 'PHPSESSID=trq1o40; acw_tc=784e8288dgh67f; SERVERID=f7154867dcfa|1618889640|1618887987'
    }
    # Request the CAPTCHA with headers that carry the Cookie first: the server then
    # stores the cookie-to-verification-code mapping and returns the CAPTCHA image.
    xpath_str = '//img[@name="pic"]/@src'
    base_url = "https://query.ruankao.org.cn/certificate/main"
    html_res = requests.get(base_url, headers=headers).text
    dom = etree.HTML(html_res)
    items = dom.xpath(xpath_str)
    if len(items) > 0:
        cap_url = items[0]
        print(cap_url)
        cap = requests.get(cap_url, headers=headers)
        # The with block closes the file automatically
        with open("cap.png", "wb") as f:
            f.write(cap.content)
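The `src` attribute of an image is often a relative path, in which case it has to be resolved against the page URL before it can be downloaded. The sketch below shows one way to do that with the standard library only. It swaps the lxml XPath lookup for `html.parser`, so it is an alternative rather than the approach used above. `CaptchaSrcParser` and `find_captcha_url` are hypothetical names, and the sample HTML is made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CaptchaSrcParser(HTMLParser):
    """Collect the src of the first <img> whose name attribute matches."""

    def __init__(self, name):
        super().__init__()
        self.name = name
        self.src = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("name") == self.name and self.src is None:
            self.src = attrs.get("src")


def find_captcha_url(html, base_url, name="pic"):
    """Return the absolute CAPTCHA URL, or None if no matching <img> exists."""
    parser = CaptchaSrcParser(name)
    parser.feed(html)
    if parser.src is None:
        return None
    # urljoin leaves absolute URLs untouched and resolves relative ones.
    return urljoin(base_url, parser.src)


# Made-up HTML fragment for illustration; the real page markup may differ.
sample = '<form><img name="pic" src="/certificate/captcha?t=1"></form>'
print(find_captcha_url(sample, "https://query.ruankao.org.cn/certificate/main"))
```

Using a `requests.Session` for both the page request and the image request would also let the server-issued cookies carry over without hard-coding a `Cookie` header.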
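2.2 Recognize the CAPTCHA online with the Baidu OCR API
2.2.1 Log in to Baidu AI Cloud, create an OCR application instance, and obtain the APP_ID and APP_KEY
https://cloud.baidu.com/product/ocr_general
Following the documentation step by step should work. At the time of writing, the free quota is 1,000 calls/month for individually verified accounts and 2,000 calls/month for enterprise-verified accounts; once the free test quota is used up, calls are billed according to Baidu's published price list.
Once you have the APP_ID and APP_KEY, you can call the API to recognize images online; see the technical documentation, Text Recognition OCR (baidu.com).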
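The OCR request in this section expects an `access_token`, which has to be obtained from Baidu's authentication endpoint first. The sketch below assumes the OAuth client-credentials endpoint at `https://aip.baidubce.com/oauth/2.0/token`, with the application's API key passed as `client_id` and its secret key as `client_secret`. The helper names are hypothetical, and the parameter mapping should be checked against the current Baidu documentation.

```python
from urllib.parse import urlencode

# Assumed token endpoint; verify against the current Baidu documentation.
TOKEN_URL = "https://aip.baidubce.com/oauth/2.0/token"


def build_token_url(api_key, secret_key):
    """Build the client-credentials token request URL."""
    query = urlencode({
        "grant_type": "client_credentials",
        "client_id": api_key,
        "client_secret": secret_key,
    })
    return f"{TOKEN_URL}?{query}"


def fetch_access_token(api_key, secret_key):
    """Request a token and return the access_token field of the JSON reply."""
    import requests  # third-party; imported here so the URL helper works without it

    resp = requests.post(build_token_url(api_key, secret_key), timeout=10)
    resp.raise_for_status()
    return resp.json()["access_token"]


# Placeholder credentials; no network request is made by this line.
print(build_token_url("YOUR_API_KEY", "YOUR_SECRET_KEY"))
```

The returned token is valid for a limited time, so it is reasonable to cache it and refresh only when a request is rejected.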
# encoding:utf-8
import requests
import base64

'''
General text recognition (high-accuracy version)
'''
request_url = "https://aip.baidubce.com/rest/2.0/ocr/v1/accurate_basic"
# Open the image file in binary mode
with open('[local file]', 'rb') as f:
    img = base64.b64encode(f.read())
params = {"image": img}
access_token = '[token obtained from the authentication API]'
request_url = request_url + "?access_token=" + access_token
headers = {'content-type': 'application/x-www-form-urlencoded'}
response = requests.post(request_url, data=params, headers=headers)
if response:
    print(response.json())
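The JSON reply still has to be reduced to the CAPTCHA string. The sketch below assumes the reply carries a `words_result` list of `{"words": ...}` items on success and `error_code` / `error_msg` fields on failure, and that the CAPTCHA consists of ASCII letters and digits only. `extract_code` is a hypothetical helper, and the sample reply is made up.

```python
import re


def extract_code(result):
    """Join the recognized text and keep only ASCII letters and digits."""
    if "error_code" in result:
        raise ValueError(
            f"OCR failed: {result.get('error_code')} {result.get('error_msg')}"
        )
    words = "".join(item.get("words", "") for item in result.get("words_result", []))
    # OCR output often contains stray spaces or punctuation; drop them.
    return re.sub(r"[^0-9A-Za-z]", "", words)


# Made-up reply for illustration.
sample = {"words_result_num": 1, "words_result": [{"words": "a 3K-9"}]}
print(extract_code(sample))  # a3K9
```

An empty string here means the OCR found no usable characters; requesting a fresh CAPTCHA and retrying is usually simpler than trying to repair a bad read.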