Anti-crawling: verification code recognition for login (OCR character recognition)

2022-06-26 08:51:00 Feng Dashao

   In today's big data era, data is transmitted and presented on the Internet in many different ways. How do we collect such messy data? Web crawlers are one answer. Crawling is among the most widely used technologies on the Internet, with applications in finance, real estate, trade, technology, and many other fields. Big data computing, data analysis, and machine learning all depend on crawlers. Crawling is often the foundation and backbone of a company's business, and once the crawled content has been cleaned and processed, what remains is valuable data.

   To keep their servers running normally, many companies have anti-crawler engineers use a variety of techniques to stop crawler engineers from requesting server resources without restraint, for example JavaScript obfuscation, WebSocket, fonts, WebDriver, apps, and verification codes.

   A verification code (CAPTCHA) is an image containing characters (Chinese, English letters, digits, and so on); the user only has to type the characters from the image into a text box. Such simple verification codes were quickly bypassed, so people began adding confusing elements to the image, such as slashes, colored spots, character distortion, rotation, and overlapping text.

   Some websites add this kind of interference (slashes, straight lines, colored spots, and the like) specifically to hinder crawlers, and I ran into exactly this anti-crawling mechanism on a website in a recent project. I used Tesseract-OCR to recognize some of these verification codes and got mixed results, as the pictures show: some codes were recognized but came back with stray spaces and newlines, some could not be recognized at all, and some were misread as other characters. Tesseract-OCR's recognition rate is simply not very high for characters that carry interference or are irregularly arranged. Even with noise reduction applied, repeated tests gave the same result.
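
   As a rough sketch of the kind of attempt described above, the Python below binarizes the image before handing it to Tesseract. The pytesseract and Pillow packages, the threshold of 128, the `--psm 7` (single text line) setting, and the function names are all assumptions of this example; the article does not say which wrapper or noise reduction method was actually used.

```python
THRESHOLD = 128  # assumed cut-off between "ink" and background


def threshold_pixel(value: int, threshold: int = THRESHOLD) -> int:
    """Map a 0-255 grayscale value to pure black (0) or pure white (255)."""
    return 255 if value >= threshold else 0


def ocr_verification_code(path: str) -> str:
    """Return Tesseract's raw text for the image at `path` (hypothetical helper)."""
    # Imported here so the rest of the sketch runs without these packages.
    # Requires Pillow, pytesseract, and a local Tesseract-OCR installation.
    from PIL import Image
    import pytesseract

    image = Image.open(path).convert("L")  # convert to grayscale
    image = image.point(threshold_pixel)   # simple binarization
    # --psm 7 asks Tesseract to treat the image as a single line of text.
    return pytesseract.image_to_string(image, config="--psm 7")
```

   Thresholding can wash out faint interference, but a line that is as dark as the characters survives it, which may be one reason noise reduction made little difference here.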

   As far as I know, besides plain OCR you can also combine machine learning with a CNN, but that approach is relatively complex and hard to pull off without systematic study of machine learning and CNNs.

   Character recognition is not the only kind of verification code. Other anti-crawling challenges include dragging a slider, completing a puzzle, and clicking characters or objects in a specified order.

   After many tests I noticed a pattern: the random verification code has a weakness. Roughly every 2-3 refreshes, a normal code with no interference appears. So we can add some conditions. When a code has interference and cannot be recognized, automatically switch to the next one. When a normal code is recognized, testing shows the result still contains spaces, line breaks, and so on, so use a regular expression or another method (split, replace, etc.) to delete all whitespace and then check whether the string length is 4. Some codes with partial interference are also recognized, but with extra symbols; the last one in the pictures, for example, comes back with extra horizontal lines. Following this idea, we can wrap these conditions in a loop: once all of them are met, break out of the loop and move on to the next step.
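
   The loop described above might look roughly like the sketch below. `fetch_image` and `recognize` are hypothetical callables standing in for "request a new verification code" and "run OCR on it"; the attempt limit and the restriction to letters and digits are also assumptions, added so that half-recognized codes with extra symbols are rejected along with codes of the wrong length.

```python
import re
from typing import Any, Callable, Optional

CODE_LENGTH = 4  # the codes in this article are four characters long


def normalize(raw: str) -> str:
    """Delete every whitespace character (spaces, newlines, tabs, ...)."""
    return re.sub(r"\s+", "", raw)


def is_clean_code(text: str) -> bool:
    """True only for exactly CODE_LENGTH letters or digits (assumed alphabet)."""
    return re.fullmatch(r"[A-Za-z0-9]{%d}" % CODE_LENGTH, text) is not None


def solve_verification_code(
    fetch_image: Callable[[], Any],    # hypothetical: downloads a fresh code
    recognize: Callable[[Any], str],   # hypothetical: returns raw OCR text
    max_attempts: int = 10,            # assumed safety limit
) -> Optional[str]:
    """Keep requesting codes until one is recognized cleanly, else None."""
    for _ in range(max_attempts):
        candidate = normalize(recognize(fetch_image()))
        if is_clean_code(candidate):
            return candidate  # conditions met: leave the loop
        # Otherwise fall through and request the next verification code.
    return None


# Demo with canned OCR outputs instead of a real site and a real OCR engine.
samples = iter(["V~R Z\nD-", "", "VR ZD\n"])
code = solve_verification_code(lambda: next(samples), lambda image: image)
print(code)  # VRZD
```

   Returning from inside the loop plays the role of the `break` mentioned above, and the attempt limit keeps the crawler from looping forever if the site stops serving clean codes.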

   Cleaning the “VRZD” verification code with a regular expression gives a str of length 4, which passes verification. This method is not the most direct, but when your options are limited, observing the patterns in how things change will usually reveal a way to get the result you want.
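
   To make the cleanup step concrete, here is a small sketch of three ways to strip whitespace from an OCR result. The raw string is a made-up example of what the OCR output for “VRZD” might look like, not actual output from the project.

```python
import re

raw = " V R\nZ D\n"  # hypothetical OCR output containing spaces and newlines

by_regex = re.sub(r"\s+", "", raw)                   # regular expression
by_split = "".join(raw.split())                      # split on whitespace, rejoin
by_replace = raw.replace(" ", "").replace("\n", "")  # one replace per character

print(by_regex, len(by_regex))  # VRZD 4
```

   All three give the same string for this input. The chained replace only removes the characters it names, so the regular expression or split version is the safer choice if tabs or other whitespace can appear.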


 Insert picture description here


 Insert picture description here

Copyright notice
This article was written by [Feng Dashao]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/177/202206260830111824.html