A Hands-On Guide to Defeating Image Camouflage in JS Reverse Engineering
2022-07-05 19:04:00 【VIP_ CQCRE】
This is the 655th technical post from 「Attacking Coder」
Author: Xinganguo
Source: AirPython
“Reading this article takes about 6 minutes.”
Recently I have been updating a series on anti-crawling techniques. This post covers the simplest one: 「image camouflage」.
Image camouflage mixes text and images together inside a page's elements, so that a crawler cannot read the page content directly from the markup.
Target (Base64-encoded):
aHR0cHM6Ly93d3cuZ3hyYy5jb20vam9iRGV0YWlsL2Q2NmExNjQxNzc2MjRlNzA4MzU5NWIzMjI1ZWJjMTBi
1 - Analysis
Open the page. Inspecting it shows that the phone number in the page source is masked by default.
To view the phone number, you must first log in with an account.

After logging in, clicking the view button on the page calls an interface, and the full phone number is then displayed:
https://**/getentcontacts/b2147f6a-6ec7-403e-a836-62978992841b
PS: The b2147f6a-6ec7-403e-a836-62978992841b part of the URL can be found in the page source; each value corresponds to exactly one enterprise.
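As a rough illustration of pulling that identifier out of the page source, the sketch below scans HTML for GUID-shaped strings. The function name `find_enterprise_ids` and the sample markup are hypothetical; the article does not show where in the page the identifier actually lives, so a real page may need a narrower pattern.

```python
import re
from typing import List

# Standard 8-4-4-4-12 hexadecimal GUID layout, as seen in the URL above.
GUID_PATTERN = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def find_enterprise_ids(page_source: str) -> List[str]:
    """Return each distinct GUID in the page source, in order of first appearance."""
    seen: List[str] = []
    for match in GUID_PATTERN.findall(page_source):
        guid = match.lower()  # treat upper- and lower-case spellings as the same id
        if guid not in seen:
            seen.append(guid)
    return seen


if __name__ == "__main__":
    # Hypothetical fragment; the real markup is not shown in this article.
    sample = '<a data-ent="b2147f6a-6ec7-403e-a836-62978992841b">View contact</a>'
    print(find_enterprise_ids(sample))
```

If a page contains several GUIDs, the surrounding attribute or tag would have to be used to pick the right one.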

As the screenshot of the response shows, the 「tel」 field returned by the interface above can be spliced into an image URL, and the content of that image matches the phone number.
So we only need to download this image and run OCR on it to recover the number.
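Before handing downloaded bytes to OCR, it can help to confirm that they really are an image. The sketch below checks the leading "magic" bytes of a few common formats. It assumes (this is not stated in the article) that the server might answer with something other than an image, such as an HTML login page, when the cookie has expired; `sniff_image_type` is a hypothetical helper name.

```python
from typing import Optional

# Leading signature bytes of some common image formats.
IMAGE_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}


def sniff_image_type(data: bytes) -> Optional[str]:
    """Return the image format suggested by the first bytes, or None if unrecognized."""
    for signature, name in IMAGE_SIGNATURES.items():
        if data.startswith(signature):
            return name
    return None


if __name__ == "__main__":
    print(sniff_image_type(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))  # a PNG header
    print(sniff_image_type(b"<!DOCTYPE html><html>"))            # not an image
```

A `None` result is a hint to re-check the login cookie rather than to blame the OCR step.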

2 - Implementation
Because the text and background of the images on this site are very clean, no additional training is needed to improve the recognition rate.
First, we call the interface to get the tel field, which maps one-to-one to a phone number.
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36',
    'Cookie': '***'
}


def get_tel_id():
    """Get the tel field id for the phone number (one-to-one correspondence)."""
    # b2147f6a-6ec7-403e-a836-62978992841b identifies the enterprise,
    # also one-to-one (taken from the page source)
    url = "https://**/getentcontacts/b2147f6a-6ec7-403e-a836-62978992841b"
    resp = requests.get(url, headers=headers, timeout=10).json()
    tel_id = resp.get("tel")
    return tel_id

Then, use the tel field above to compose the image URL.
Finally, run character recognition on the image.
Here are two ways to do that:
Baidu OCR
pytesseract
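The first two steps (reading the tel field and composing the image URL) can be sketched without any network access. The `/home/Phone/<tel>` path comes from the snippets in this article; the helper names, the `base_url` parameter, and the `example.com` host are hypothetical stand-ins for the elided site address. Percent-encoding the id is a precaution and an assumption, since the exact format of the tel field is not shown.

```python
import json
from urllib.parse import quote


def extract_tel_id(response_text: str) -> str:
    """Pull the "tel" field out of the interface's JSON response."""
    data = json.loads(response_text)
    tel_id = data.get("tel")
    if not tel_id:
        # An empty or missing field may mean the login cookie is no longer valid.
        raise ValueError("response has no 'tel' field")
    return str(tel_id)


def build_phone_image_url(base_url: str, tel_id: str) -> str:
    """Compose <base_url>/home/Phone/<tel_id>, escaping the id for use in a path."""
    return f"{base_url.rstrip('/')}/home/Phone/{quote(tel_id, safe='')}"


if __name__ == "__main__":
    # Made-up response body and host, for illustration only.
    tel = extract_tel_id('{"tel": "abc123"}')
    print(build_phone_image_url("https://example.com", tel))
```

Keeping these steps separate from the HTTP call makes it easier to tell whether a failure comes from the request itself or from an unexpected response body.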
2-1 Baidu OCR
First, install the dependency package
# Install the dependency package
pip3 install baidu-aip
Then, create a text-recognition application and obtain its APP_ID, API_KEY, and SECRET_KEY.
Finally, following the official documentation, call the method below to recognize the image and get the phone number.
Official documentation:
https://cloud.baidu.com/doc/OCR/s/wkibizyjk
from aip import AipOcr


def get_phone(tel_id):
    """
    Recognize the image with Baidu OCR and get its text content
    :param tel_id: the tel field returned by get_tel_id()
    :return: None (the raw OCR result is printed)
    """
    url = f'https://www.**.com/home/Phone/{tel_id}'
    APP_ID = '262**'
    API_KEY = '1btP8uUSzfDbji**'
    SECRET_KEY = 'NGm6NgAM5ajHcksKs0**'
    client = AipOcr(APP_ID, API_KEY, SECRET_KEY)
    # Pass the image URL to Baidu's general text recognition
    result = client.basicGeneralUrl(url)
    # {'words_result': [{'words': '0771-672**'}], 'words_result_num': 1, 'log_id': 1527210***}
    print('The recognized phone number is:', result)

2-2 pytesseract
As before, we first need to install the dependency packages for character recognition and image processing
# Install dependency packages
pip3 install pillow
pip3 install pytesseract
Then, fetch the image's byte stream from its URL and let pytesseract recognize the text in it. Note that pytesseract is only a wrapper around the Tesseract OCR engine, which must be installed separately.
import io

import pytesseract
import requests
from PIL import Image

if __name__ == '__main__':
    # Build the URL of the phone number image
    # (headers and get_tel_id() are defined in the first snippet)
    image_url = f'https://www.**.com/home/Phone/{get_tel_id()}'
    resp = requests.get(image_url, headers=headers)
    # resp.content: the binary byte stream of the image
    # io.BytesIO(): wraps the bytes in a file-like object
    # Image.open(): opens the byte stream and returns an image object
    images_c = Image.open(io.BytesIO(resp.content))
    # Use pytesseract to recognize the string in the image, which is the phone number
    phone = pytesseract.image_to_string(images_c)
    print(f'Contact information: {phone}')

The above is the usual way to deal with image camouflage: work out the rule by which the images are generated, use OCR to recognize them as text, and finally assemble the pieces together.
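Both approaches return raw OCR output rather than a clean number: Baidu OCR returns a dict shaped like the sample in the comment above, and pytesseract returns a plain string that can carry trailing whitespace or stray characters. A minimal normalization sketch follows; the helper names are hypothetical, and it assumes phone numbers consist only of digits and hyphens, as in the sample response.

```python
import re
from typing import List


def words_from_baidu(result: dict) -> List[str]:
    """Collect the recognized strings from a Baidu OCR response dict."""
    return [item["words"] for item in result.get("words_result", []) if "words" in item]


def clean_phone_text(raw: str) -> str:
    """Keep only digits and hyphens, dropping whitespace and stray OCR characters."""
    return re.sub(r"[^0-9-]", "", raw)


if __name__ == "__main__":
    # Made-up values that mirror the response format shown in the article.
    baidu_result = {"words_result": [{"words": "0771-1234567"}], "words_result_num": 1}
    print(words_from_baidu(baidu_result))
    print(clean_phone_text(" 0771-1234567\n"))
```

If the site also shows numbers with a leading `+` or parentheses, the character class in `clean_phone_text` would need to be widened accordingly.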
