Anti-crawling: logging in by recognizing verification codes (OCR character recognition)
2022-06-26 08:51:00 【Feng Dashao】
In today's big data era, data is transmitted and presented on the Internet in many different ways. How do we collect such messy data? Web crawlers are one answer. Crawling is one of the most widely used technologies on the Internet and has been applied in finance, real estate, trade, science and technology, and many other fields. Big data computing, data analysis, and machine learning all depend on crawlers. Crawling is often the foundation and main thread of a company's business development: once the crawled content has been cleaned and processed, what remains is valuable data.
To keep their servers running normally, many companies have anti-crawler engineers use a variety of techniques to stop crawler engineers from requesting server resources without restraint, for example JavaScript obfuscation, WebSocket, fonts, WebDriver, App, and verification-code (CAPTCHA) anti-crawling.
A verification code is an image containing characters (Chinese characters, English letters, digits, and so on). The user only needs to type the characters shown in the image into a text box. This simple form of verification code was quickly bypassed, so people added distracting elements to the image, such as slashes, colored spots, distorted characters, rotation, and overlapping text.
Some websites add distracting elements such as slashes, straight lines, and colored spots precisely to interfere with crawlers, and I ran into this anti-crawling mechanism on a website in a recent project. When I used Tesseract-OCR to recognize some of these verification codes, the results varied (see the figure). Some codes could be recognized, but the output came with assorted spaces and newline characters; some were completely unrecognizable; and some were misrecognized as other characters. In practice, Tesseract-OCR's recognition rate is not very high for characters with interference or irregular arrangement. Even after applying noise reduction and testing repeatedly, the result was the same.
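The post does not say which noise-reduction method was tried. As a rough illustration only, the sketch below shows one common preprocessing step, binarization followed by removal of isolated dark pixels, on a grayscale image represented as a plain list of rows. The function names, the threshold of 128, and the list-of-lists image format are all assumptions made for this example and are not taken from the original project.

```python
def binarize(gray, threshold=128):
    """Map each grayscale pixel (0-255) to 0 (ink) or 255 (background).

    The threshold value is an assumption; real captchas need tuning.
    """
    return [[0 if px < threshold else 255 for px in row] for row in gray]


def remove_isolated_pixels(binary, min_neighbors=1):
    """Erase ink pixels that have fewer than `min_neighbors` ink neighbours.

    Isolated specks are treated as noise; strokes of a character normally
    touch at least one other ink pixel in the 8-neighbourhood.
    """
    height = len(binary)
    width = len(binary[0]) if height else 0
    cleaned = [row[:] for row in binary]  # copy so the input is not mutated
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 0:
                continue  # background pixel, nothing to do
            neighbors = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width and binary[ny][nx] == 0:
                        neighbors += 1
            if neighbors < min_neighbors:
                cleaned[y][x] = 255
    return cleaned
```

A step like this removes scattered specks but does little against lines drawn across the characters, which is consistent with the observation that noise reduction did not change the result.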
As far as I understand, besides simple OCR recognition you can also use machine learning combined with a CNN. That approach is relatively complex, however, and hard to pull off without systematic study of machine learning and CNNs.
Besides character recognition, verification codes come in other anti-crawling forms, for example sliding a block, completing a puzzle, or clicking characters or objects in a specified order.
After many tests and some observation of the pattern, I found that the random verification code has a weakness: roughly every 2-3 refreshes, a normal code with no interference appears. So we can add some conditions. When a code has interference and cannot be recognized, automatically switch to the next verification code. When a normal code is recognized, testing showed that the output still contains spaces, line breaks, and the like, so use a regular expression or another method (split, replace, etc.) to delete all whitespace, then check whether the string length is 4. Some partially obscured codes are recognized with extra symbols, such as extra horizontal lines (the last example in the figure). Following this line of thought, we can wrap these conditions in a loop; once all conditions are met, break out of the loop and move on to the next step.
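The retry loop described above might look roughly like the following. This is a minimal sketch, not the author's code: `fetch_image` and `recognize` are hypothetical callables standing in for "request a new verification code" and "run OCR on it" (for example a Tesseract wrapper), the limit of 10 attempts is arbitrary, and the check assumes the code consists of ASCII letters and digits only.

```python
import re

CAPTCHA_LENGTH = 4  # the codes in this article are four characters long


def clean_ocr_text(raw):
    """Remove every whitespace character (spaces, tabs, newlines) from OCR output."""
    return re.sub(r"\s+", "", raw)


def is_plausible_code(text, length=CAPTCHA_LENGTH):
    """Accept only strings of exactly `length` ASCII letters or digits.

    Stray symbols produced by interference lines make the check fail.
    """
    return re.fullmatch(r"[A-Za-z0-9]{%d}" % length, text) is not None


def solve_captcha(fetch_image, recognize, max_attempts=10):
    """Keep requesting new codes until OCR yields a plausible one.

    fetch_image: hypothetical callable returning a fresh captcha image.
    recognize:   hypothetical callable mapping an image to raw OCR text.
    Returns the cleaned code, or None if every attempt failed.
    """
    for _ in range(max_attempts):
        image = fetch_image()
        candidate = clean_ocr_text(recognize(image))
        if is_plausible_code(candidate):
            return candidate  # conditions met: stop looping
    return None


if __name__ == "__main__":
    # Simulated OCR results: two noisy codes, then a clean one.
    fake_outputs = iter(["V R-Z\nD_", "", " VR ZD\n"])
    print(solve_captcha(lambda: None, lambda image: next(fake_outputs)))  # prints VRZD
```

Passing the two steps in as callables keeps the loop independent of any particular HTTP client or OCR engine, and the attempt limit prevents an endless loop if no clean code ever appears.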
After cleaning the verification code “VRZD” with a regular expression, we get a str of length 4, and with it the verification passes. This method is not the most direct, but when your options are limited, careful observation of the pattern will always reveal a way to get the result you want.

