
[1 Data Collection] A Complete Learning Path for Data Crawlers

2022-06-09 09:40:00 Ofter Data Science

Required reading: data ethics and regulations

Whatever your purpose, if you need to use a data crawler or similar technology, be sure to read and strictly comply with the Data Security Law of the People's Republic of China and related laws and regulations. For anything the regulations do not cover, act with social responsibility and caution, and never exploit potential vulnerabilities or technical means in a way that disrupts the normal work or life of any organization or individual.

1、 Crawler basics

1.1 What is a crawler

A crawler is a program that automatically fetches information from the Internet, grabbing the valuable parts of it.

1.2 Crawler architecture

It mainly consists of five parts: the crawler scheduler, the URL manager, the page downloader, the page parser, and data storage (the application).
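As a rough illustration of how the five parts cooperate, here is a minimal sketch. The class and function names are invented for this example, with requests and Beautiful Soup standing in for the downloader and parser; it is not the architecture of any particular framework.

```python
from collections import deque

import requests
from bs4 import BeautifulSoup


class UrlManager:
    """URL manager: tracks which URLs are waiting and which have been seen."""

    def __init__(self, seeds):
        self.pending = deque(seeds)
        self.seen = set(seeds)

    def add(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.pending.append(url)

    def pop(self):
        return self.pending.popleft() if self.pending else None


def download(url):
    """Page downloader: fetch raw HTML."""
    return requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10).text


def parse(html):
    """Page parser: pull out the title and the outgoing links."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string if soup.title else ""
    links = [a["href"] for a in soup.find_all("a", href=True)]
    return title, links


def crawl(seeds, limit=10):
    """Scheduler: drives the loop; the results dict stands in for data storage."""
    manager, results = UrlManager(seeds), {}
    while len(results) < limit:
        url = manager.pop()
        if url is None:
            break
        title, links = parse(download(url))
        results[url] = title
        for link in links:
            if link.startswith("http"):
                manager.add(link)
    return results


if __name__ == "__main__":
    print(crawl(["https://example.com"], limit=3))
```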

1.3 Ten commonly used Python crawler frameworks and libraries

  • Scrapy: an application framework for extracting structured data.
  • PySpider: ships with a powerful WebUI for writing scripts, scheduling jobs, and viewing crawl results in real time from the browser; the backend stores results in common databases, and scheduled tasks and task priorities are supported.
  • Crawley: crawls site content at high speed, supports relational and non-relational databases, and can export data as JSON, XML, etc.
  • Portia: lets you scrape websites without any programming knowledge; simply annotate the pages you are interested in and Portia creates a spider to extract data from similar pages.
  • Newspaper: used for extracting and analyzing news and articles; uses multithreading and supports more than 10 languages.
  • Beautiful Soup: extracts data from HTML or XML files; a common library for getting at HTML elements (see the sketch after this list).
  • Grab: an asynchronous framework for complex crawlers that can handle millions of pages.
  • Cola: a distributed crawler framework; users only write a few specific functions and need not worry about the details of distributed operation, since tasks are assigned to multiple machines automatically and the whole process is transparent to the user.
  • Selenium: an automated testing tool with bindings for multiple languages such as Java, C#, Ruby, and Python.
  • Python-goose: extracts the main body of an article, its main image, any embedded YouTube/Vimeo videos, the meta description, and meta tags.
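Since Beautiful Soup is the tool most beginners reach for first, here is the minimal sketch referenced above; example.com is only a placeholder URL.

```python
# Fetch a page with requests and extract elements with Beautiful Soup.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

print(soup.title.string)                  # the page <title> text
for link in soup.select("a[href]"):       # CSS selector for all links
    print(link.get_text(strip=True), link["href"])
```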

2、 The crawler learning path

In terms of technical complexity, data crawling can be roughly divided into three stages: beginner, intermediate, and advanced.

2.1 Beginner

1) Networking fundamentals: the TCP/IP protocol suite, socket programming, and the HTTP protocol;

2) Web front end: HTML, CSS, JavaScript, the DOM, Ajax, jQuery, JSON, etc.;

3) Regular expressions: being able to extract information from web pages with regexes (items 3) to 5) are combined in the sketch after this list);

4) HTML parsing: Beautiful Soup, XPath, and CSS selectors;

5) HTML downloading: simple data fetching with urllib or requests;

6) Other basics: Python syntax and some database knowledge.
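To make items 3) to 5) concrete, here is a minimal combined sketch; the URL is a placeholder and the extraction targets are assumptions about the page, not a recipe for any specific site.

```python
# Download with requests, parse with Beautiful Soup (CSS selectors),
# and fall back to a regular expression for a quick, narrow extraction.
import re

import requests
from bs4 import BeautifulSoup

url = "https://example.com"                          # placeholder target
html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10).text

# 4) HTML parsing: CSS selectors via Beautiful Soup
soup = BeautifulSoup(html, "html.parser")
title = soup.select_one("title").get_text(strip=True)
links = [a["href"] for a in soup.select("a[href]")]

# 3) Regular expressions: fine for small extractions, not for full HTML parsing
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", html)

print(title, links[:5], emails[:5])
```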

For a beginner-level practice project, see: "1896-2021 Dynamic Ranking Animation of Olympic Medals (Python Data Collection)", Ofter Data Science blog on CSDN, roughly a 5-minute read. After the first four data-analysis articles, OF starts bringing you hands-on data-analysis projects this week. The topics are chosen at random; if there is a problem you would like analyzed, feel free to get in touch by private message. The original plan was to present the medal data of all past Olympic Games dynamically using data found online (each Games from 1896 to 2021, with year, country/region, gold, silver, bronze, total, and ranking). Strangely, no source on the web met all of these conditions; at best there was data for 1896-2012, and even that was incomplete. So there was no choice but to crawl and collect the data by hand. Of course, if ready-made data exists, you are advised not to spend this time. https://blog.csdn.net/weixin_42341655/article/details/119961749?spm=1001.2014.3001.5501

2.2 Intermediate

1) Simulated login: be able to use MD5, hashing, and other encryption/decryption algorithms; set proxies, the User-Agent header, and protocol details for servers such as Nginx; simulate POST/GET requests; and grab the client cookie or log in with a session (sketches illustrating items 1), 3), and 5) follow this list);

2) Verification-code recognition: from the most basic captcha recognition, such as OCR, up to complex captchas that require calling a third-party service;

3) Dynamic page analysis: use selenium with phantomjs or chromedriver to grab information from dynamically rendered pages;

4) Multithreading and concurrency: inter-thread communication and synchronization, and speeding up crawling through parallel downloads;

5) AJAX data requests: use a packet-capture tool (or the browser's network panel) to capture the packets behind an AJAX request, extract the URL and request parameters from them, send the request (handling the parameters), and read the JSON response with response.json().
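Two hedged sketches follow. The first covers items 1) and 5): a session-based login followed by a direct call to a JSON (AJAX) endpoint; the URL, form fields, the MD5-hashed password, and the API parameters are all assumptions for illustration, since real sites differ widely. The second covers item 3) with Selenium and chromedriver.

```python
# 1) Simulated login with a requests.Session, then 5) calling a JSON (AJAX)
# endpoint directly once it has been identified in the browser's Network tab.
import hashlib

import requests

session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0"

payload = {
    "username": "alice",
    # some sites expect a hashed password in the form data (site-specific!)
    "password": hashlib.md5(b"secret").hexdigest(),
}
resp = session.post("https://example.com/login", data=payload, timeout=10)
print(resp.status_code, session.cookies.get_dict())   # login cookie / session id

# the AJAX endpoint and its parameters come from the captured request
api = session.get(
    "https://example.com/api/list",
    params={"page": 1, "size": 20},
    timeout=10,
)
data = api.json()                                      # JSON payload, no HTML parsing
print(str(data)[:200])
```

And for dynamically rendered pages:

```python
# 3) Rendering a JavaScript-heavy page with Selenium + chromedriver
# (Selenium 4; assumes a matching driver is available).
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = webdriver.ChromeOptions()
options.add_argument("--headless")                     # no visible browser window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")
    elem = WebDriverWait(driver, 10).until(            # wait for the dynamic element
        EC.presence_of_element_located((By.CSS_SELECTOR, "h1"))
    )
    print(elem.text)
    print(driver.page_source[:200])                    # fully rendered HTML
finally:
    driver.quit()
```

Note that PhantomJS development has been suspended for years, so headless Chrome or Firefox (via chromedriver or geckodriver) is the usual choice today.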

2.3 Advanced

1) Machine learning: use machine learning to handle some anti-crawling strategies and avoid being banned;

2) Data storage: use common databases to store and query data, and avoid repeated downloads through caching (sketches for items 2) and 3) follow this list);

3) Distributed crawlers: use open-source frameworks such as scrapy and scrapy-redis to deploy distributed crawlers for large-scale data collection;

4) Other related topics, such as crawling on mobile, and monitoring and operating crawlers...
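Two short sketches for items 2) and 3): a SQLite cache that prevents repeated downloads, and the usual scrapy-redis settings that let several Scrapy instances share one crawl. Table names, file names, and the Redis address are illustrative.

```python
# 2) A tiny SQLite cache so that already-downloaded URLs are never fetched twice;
# sqlite3 is in the standard library.
import sqlite3

import requests

conn = sqlite3.connect("crawl_cache.db")
conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT)")

def fetch_cached(url):
    row = conn.execute("SELECT html FROM pages WHERE url = ?", (url,)).fetchone()
    if row:                                            # cache hit: skip the download
        return row[0]
    html = requests.get(url, timeout=10).text
    conn.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (url, html))
    conn.commit()
    return html

print(len(fetch_cached("https://example.com")))        # downloads once
print(len(fetch_cached("https://example.com")))        # served from the cache
```

```python
# 3) Typical scrapy-redis additions to a Scrapy project's settings.py:
# all crawler instances share one Redis request queue and one dedup set.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True                  # keep the queue across restarts (resumable)
REDIS_URL = "redis://127.0.0.1:6379/0"    # point every instance at the same Redis
```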

3、 The hard parts of crawling

The biggest difficulty in data crawling is the back-and-forth game between crawling and anti-crawling: when we develop a crawling technique, a corresponding anti-crawling strategy appears, so we develop new crawling methods, and the cycle repeats. Even ordinary crawls involve plenty of moving parts: processes and threads, resumable (breakpoint) crawling, distribution, crawler monitoring, and exception notification.


Copyright notice
This article was created by [Ofter Data Science]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/160/202206090911219658.html