
[1 Data Collection] A Complete Learning Path for Data Crawlers

2022-06-09 09:40:00 Ofter Data Science

Required reading: data ethics and regulations

Whatever your purpose, if you need to use a data crawler or similar technology, be sure to read and strictly comply with the Data Security Law of the People's Republic of China and related laws and regulations. For anything the regulations do not cover, act with social responsibility and caution, and never exploit potential vulnerabilities or technical means in a way that disrupts the normal work or life of any organization or individual.

1、 Crawler basics

1.1 What is a crawler

A crawler is a program that automatically fetches information from the Internet, grabbing the valuable parts of it.

1.2 Crawler architecture

It mainly consists of five parts: the crawler scheduler, the URL manager, the page downloader, the page parser, and data storage (the application).
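As a rough illustration of how the five parts cooperate, here is a minimal sketch. The class and function names are invented for this example, with requests and Beautiful Soup standing in for the downloader and parser; it is not the architecture of any particular framework.

```python
from collections import deque

import requests
from bs4 import BeautifulSoup


class UrlManager:
    """URL manager: tracks which URLs are waiting and which have been seen."""

    def __init__(self, seeds):
        self.pending = deque(seeds)
        self.seen = set(seeds)

    def add(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.pending.append(url)

    def pop(self):
        return self.pending.popleft() if self.pending else None


def download(url):
    """Page downloader: fetch raw HTML."""
    return requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10).text


def parse(html):
    """Page parser: pull out the title and the outgoing links."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string if soup.title else ""
    links = [a["href"] for a in soup.find_all("a", href=True)]
    return title, links


def crawl(seeds, limit=10):
    """Scheduler: drives the loop; the results dict stands in for data storage."""
    manager, results = UrlManager(seeds), {}
    while len(results) < limit:
        url = manager.pop()
        if url is None:
            break
        title, links = parse(download(url))
        results[url] = title
        for link in links:
            if link.startswith("http"):
                manager.add(link)
    return results


if __name__ == "__main__":
    print(crawl(["https://example.com"], limit=3))
```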

1.3 Ten commonly used Python crawler frameworks and libraries

  • Scrapy: an application framework for extracting structured data.
  • PySpider: ships with a powerful WebUI for writing scripts, scheduling jobs, and viewing crawl results in real time from the browser; the backend stores results in common databases, and scheduled tasks and task priorities are supported.
  • Crawley: crawls site content at high speed, supports relational and non-relational databases, and can export data as JSON, XML, etc.
  • Portia: lets you scrape websites without any programming knowledge; simply annotate the pages you are interested in and Portia creates a spider to extract data from similar pages.
  • Newspaper: used for extracting and analyzing news and articles; uses multithreading and supports more than 10 languages.
  • Beautiful Soup: extracts data from HTML or XML files; a common library for getting at HTML elements (see the sketch after this list).
  • Grab: an asynchronous framework for complex crawlers that can handle millions of pages.
  • Cola: a distributed crawler framework; users only write a few specific functions and need not worry about the details of distributed operation, since tasks are assigned to multiple machines automatically and the whole process is transparent to the user.
  • Selenium: an automated testing tool with bindings for multiple languages such as Java, C#, Ruby, and Python.
  • Python-goose: extracts the main body of an article, its main image, any embedded YouTube/Vimeo videos, the meta description, and meta tags.
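Since Beautiful Soup is the tool most beginners reach for first, here is the minimal sketch referenced above; example.com is only a placeholder URL.

```python
# Fetch a page with requests and extract elements with Beautiful Soup.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

print(soup.title.string)                  # the page <title> text
for link in soup.select("a[href]"):       # CSS selector for all links
    print(link.get_text(strip=True), link["href"])
```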

2、 The crawler learning path

In terms of technical complexity, data crawling can be roughly divided into three stages: beginner, intermediate, and advanced.

2.1 Beginner

1) Networking fundamentals: the TCP/IP protocol suite, socket programming, and the HTTP protocol;

2) Web front end: HTML, CSS, JavaScript, the DOM, Ajax, jQuery, JSON, etc.;

3) Regular expressions: being able to extract information from web pages with regexes (items 3) to 5) are combined in the sketch after this list);

4) HTML parsing: Beautiful Soup, XPath, and CSS selectors;

5) HTML downloading: simple data fetching with urllib or requests;

6) Other basics: Python syntax and some database knowledge.
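To make items 3) to 5) concrete, here is a minimal combined sketch; the URL is a placeholder and the extraction targets are assumptions about the page, not a recipe for any specific site.

```python
# Download with requests, parse with Beautiful Soup (CSS selectors),
# and fall back to a regular expression for a quick, narrow extraction.
import re

import requests
from bs4 import BeautifulSoup

url = "https://example.com"                          # placeholder target
html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10).text

# 4) HTML parsing: CSS selectors via Beautiful Soup
soup = BeautifulSoup(html, "html.parser")
title = soup.select_one("title").get_text(strip=True)
links = [a["href"] for a in soup.select("a[href]")]

# 3) Regular expressions: fine for small extractions, not for full HTML parsing
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", html)

print(title, links[:5], emails[:5])
```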

For a beginner-level practice project, see: "1896-2021 Dynamic Ranking Animation of Olympic Medals (Python Data Collection)", Ofter Data Science blog on CSDN, roughly a 5-minute read. After the first four data-analysis articles, OF starts bringing you hands-on data-analysis projects this week. The topics are chosen at random; if there is a problem you would like analyzed, feel free to get in touch by private message. The original plan was to present the medal data of all past Olympic Games dynamically using data found online (each Games from 1896 to 2021, with year, country/region, gold, silver, bronze, total, and ranking). Strangely, no source on the web met all of these conditions; at best there was data for 1896-2012, and even that was incomplete. So there was no choice but to crawl and collect the data by hand. Of course, if ready-made data exists, you are advised not to spend this time. https://blog.csdn.net/weixin_42341655/article/details/119961749?spm=1001.2014.3001.5501

2.2 Intermediate

1) Simulated login: be able to use MD5, hashing, and other encryption/decryption algorithms; set proxies, the User-Agent header, and protocol details for servers such as Nginx; simulate POST/GET requests; and grab the client cookie or log in with a session (sketches illustrating items 1), 3), and 5) follow this list);

2) Verification-code recognition: from the most basic captcha recognition, such as OCR, up to complex captchas that require calling a third-party service;

3) Dynamic page analysis: use selenium with phantomjs or chromedriver to grab information from dynamically rendered pages;

4) Multithreading and concurrency: inter-thread communication and synchronization, and speeding up crawling through parallel downloads;

5) AJAX data requests: use a packet-capture tool (or the browser's network panel) to capture the packets behind an AJAX request, extract the URL and request parameters from them, send the request (handling the parameters), and read the JSON response with response.json().
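Two hedged sketches follow. The first covers items 1) and 5): a session-based login followed by a direct call to a JSON (AJAX) endpoint; the URL, form fields, the MD5-hashed password, and the API parameters are all assumptions for illustration, since real sites differ widely. The second covers item 3) with Selenium and chromedriver.

```python
# 1) Simulated login with a requests.Session, then 5) calling a JSON (AJAX)
# endpoint directly once it has been identified in the browser's Network tab.
import hashlib

import requests

session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0"

payload = {
    "username": "alice",
    # some sites expect a hashed password in the form data (site-specific!)
    "password": hashlib.md5(b"secret").hexdigest(),
}
resp = session.post("https://example.com/login", data=payload, timeout=10)
print(resp.status_code, session.cookies.get_dict())   # login cookie / session id

# the AJAX endpoint and its parameters come from the captured request
api = session.get(
    "https://example.com/api/list",
    params={"page": 1, "size": 20},
    timeout=10,
)
data = api.json()                                      # JSON payload, no HTML parsing
print(str(data)[:200])
```

And for dynamically rendered pages:

```python
# 3) Rendering a JavaScript-heavy page with Selenium + chromedriver
# (Selenium 4; assumes a matching driver is available).
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = webdriver.ChromeOptions()
options.add_argument("--headless")                     # no visible browser window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")
    elem = WebDriverWait(driver, 10).until(            # wait for the dynamic element
        EC.presence_of_element_located((By.CSS_SELECTOR, "h1"))
    )
    print(elem.text)
    print(driver.page_source[:200])                    # fully rendered HTML
finally:
    driver.quit()
```

Note that PhantomJS development has been suspended for years, so headless Chrome or Firefox (via chromedriver or geckodriver) is the usual choice today.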

2.3 Advanced

1) Machine learning: use machine learning to handle some anti-crawling strategies and avoid being banned;

2) Data storage: use common databases to store and query data, and avoid repeated downloads through caching (sketches for items 2) and 3) follow this list);

3) Distributed crawlers: use open-source frameworks such as scrapy and scrapy-redis to deploy distributed crawlers for large-scale data collection;

4) Other related topics, such as crawling on mobile, and monitoring and operating crawlers...
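Two short sketches for items 2) and 3): a SQLite cache that prevents repeated downloads, and the usual scrapy-redis settings that let several Scrapy instances share one crawl. Table names, file names, and the Redis address are illustrative.

```python
# 2) A tiny SQLite cache so that already-downloaded URLs are never fetched twice;
# sqlite3 is in the standard library.
import sqlite3

import requests

conn = sqlite3.connect("crawl_cache.db")
conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT)")

def fetch_cached(url):
    row = conn.execute("SELECT html FROM pages WHERE url = ?", (url,)).fetchone()
    if row:                                            # cache hit: skip the download
        return row[0]
    html = requests.get(url, timeout=10).text
    conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, html))
    conn.commit()
    return html

print(len(fetch_cached("https://example.com")))        # downloads once
print(len(fetch_cached("https://example.com")))        # served from the cache
```

```python
# 3) Typical scrapy-redis additions to a Scrapy project's settings.py:
# all crawler instances share one Redis request queue and one dedup set.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True                  # keep the queue across restarts (resumable)
REDIS_URL = "redis://127.0.0.1:6379/0"    # point every instance at the same Redis
```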

3、 The hard parts of crawling

The biggest difficulty in data crawling is the back-and-forth game between crawling and anti-crawling: when we develop a crawling technique, a corresponding anti-crawling strategy appears, so we develop new crawling methods, and the cycle repeats. Even ordinary crawls involve plenty of moving parts: processes and threads, resumable (breakpoint) crawling, distribution, crawler monitoring, and exception notification.


Copyright notice
This article was created by [Ofter Data Science]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/160/202206090911219658.html