[1. Data Collection] A Complete Learning Path for Data Crawlers
2022-06-09 09:40:00 【Ofter Data Science】
Required reading: data ethics and regulations
Whatever your purpose, before using a data crawler or any similar technology, be sure to read and strictly follow the Data Security Law of the People's Republic of China and related laws and regulations. For anything the regulations do not cover, act out of social morality and caution: never use potential vulnerabilities or technical means in ways that disrupt the normal work and life of any organization or individual.

1. Crawler basics

1.1 What is a crawler
A crawler is a program that automatically fetches information from the Internet, extracting valuable data from the web.
1.2 Crawler architecture
A crawler consists of five main components: the crawler scheduler, the URL manager, the web page downloader, the web page parser, and the data storage (application).
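Below is a minimal, runnable sketch of how these five components fit together, assuming requests and beautifulsoup4 are installed. All class and function names are illustrative, not from any particular framework.

```python
import requests
from bs4 import BeautifulSoup

class UrlManager:
    """URL manager: tracks which URLs are pending and which are done."""
    def __init__(self, seeds):
        self.todo = set(seeds)
        self.done = set()

    def has_next(self):
        return bool(self.todo)

    def next_url(self):
        url = self.todo.pop()
        self.done.add(url)
        return url

    def add(self, urls):
        self.todo |= set(urls) - self.done

def download(url):
    """Web page downloader: fetch the raw HTML."""
    return requests.get(url, timeout=10).text

def parse(html):
    """Web page parser: extract data and new links to follow."""
    soup = BeautifulSoup(html, "html.parser")
    data = soup.title.string if soup.title else ""
    links = [a["href"] for a in soup.find_all("a", href=True)
             if a["href"].startswith("http")]
    return data, links

def crawl(seeds, limit=10):
    """Crawler scheduler: drives the loop; `results` plays data storage."""
    manager, results = UrlManager(seeds), []
    while manager.has_next() and len(results) < limit:
        url = manager.next_url()
        data, links = parse(download(url))
        results.append((url, data))
        manager.add(links)
    return results
```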

1.3 Ten common Python crawler frameworks
- Scrapy: an application framework for extracting structured data (see the minimal spider after this list).
- PySpider: ships with a powerful WebUI for writing scripts in the browser, scheduling functions, and viewing crawl results in real time; the backend stores results in common databases, and tasks and task priorities can be scheduled.
- Crawley: crawls site content at high speed, supports relational and non-relational databases, and can export data as JSON, XML, etc.
- Portia: crawls websites without requiring any programming knowledge; simply annotate the pages you are interested in, and Portia creates a spider to extract data from similar pages.
- Newspaper: extracts news and articles and performs content analysis; uses multithreading and supports more than 10 languages.
- Beautiful Soup: extracts data from HTML or XML files; a common library for getting HTML elements.
- Grab: an asynchronous web crawler for complex sites, able to handle millions of pages.
- Cola: a distributed crawler framework; users only need to write a few specific functions without worrying about the details of distributed operation. Tasks are automatically assigned across multiple machines, and the whole process is transparent to the user.
- Selenium: an automated testing tool with bindings for multiple languages such as Java, C#, Ruby, and Python.
- Python-goose: extracts the main body of an article, the main image, any embedded YouTube/Vimeo videos, the meta description, and meta tags.
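To give a feel for the first framework on the list, here is a minimal Scrapy spider. It targets quotes.toscrape.com, a public sandbox site built for scraping practice; the spider name and output fields are our own choices.

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    """Minimal spider: collect every quote and its author."""
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        # CSS selectors pull structured data out of each quote block.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow pagination until the site runs out of pages.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Save it as quotes_spider.py and run `scrapy runspider quotes_spider.py -o quotes.json` to write the results as JSON.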
2. The path of crawler advancement
In terms of technical complexity, data crawling can be roughly divided into three stages: beginner, intermediate, and advanced.
2.1 Beginner
1) Networking fundamentals: the TCP/IP protocol, socket programming, the HTTP protocol;
2) Web front end: HTML, CSS, JavaScript, the DOM, Ajax, jQuery, JSON, etc.;
3) Regular expressions: use regular expressions to extract information from web pages;
4) HTML parsing: beautifulsoup, XPath, and CSS selectors;
5) HTML downloading: simple data fetching with urllib or requests (points 3 to 5 are sketched after this list);
6) Other essentials: Python language syntax and database knowledge.
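A minimal sketch of points 3) to 5) together, assuming requests and beautifulsoup4 are installed; the URL is a placeholder for a page you are permitted to fetch.

```python
import re
import requests
from bs4 import BeautifulSoup

# HTML downloading: fetch the raw page with requests (placeholder URL).
resp = requests.get("https://example.com", timeout=10)
resp.raise_for_status()

# Regular expressions: pull the <title> straight out of the raw HTML.
m = re.search(r"<title>(.*?)</title>", resp.text, re.S)
print(m.group(1) if m else "no title")

# HTML parsing: the same extraction via beautifulsoup and a CSS selector.
soup = BeautifulSoup(resp.text, "html.parser")
print(soup.select_one("title").get_text())
```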
2.2 Intermediate
1) Simulated login: use encryption and hashing algorithms such as MD5, configure proxies and the User-Agent header, simulate POST/GET requests, and capture the client's cookie or session to log in (see the first sketch after this list);
2) CAPTCHA recognition: from basic CAPTCHA recognition, e.g., via OCR, up to complex CAPTCHAs that require calling third-party services;
3) Dynamic page parsing: use selenium with phantomjs or chromedriver to scrape dynamically rendered pages;
4) Multithreading and concurrency: inter-thread communication and synchronization, and speeding up crawling through parallel downloading (see the second sketch after this list);
5) Ajax data requests: use a packet-capture tool to find the Ajax request, extract the URL and request parameters from the captured packet, send the request (handling the parameters), and read the JSON response with response.json().
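Points 1) and 5) often combine in practice: log in once with a session, then call the JSON endpoints directly. A minimal sketch using requests; the endpoints and form field names are hypothetical, standing in for whatever packet capture reveals on the real site.

```python
import requests

session = requests.Session()                    # keeps cookies between requests
session.headers["User-Agent"] = "Mozilla/5.0"   # mimic a browser

# Hypothetical login endpoint and form fields.
login = session.post(
    "https://example.com/login",
    data={"username": "user", "password": "secret"},
    timeout=10,
)
login.raise_for_status()

# The session now carries the login cookie, so the (hypothetical) Ajax
# endpoint can be called directly and its JSON parsed with response.json().
resp = session.get("https://example.com/api/items",
                   params={"page": 1}, timeout=10)
print(resp.json())
```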
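For point 4), the standard library's thread pool is often enough, since crawling is I/O-bound. A minimal sketch; the URLs are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
import requests

# Placeholder URLs; in a real crawler these come from the URL manager.
urls = [f"https://example.com/page/{i}" for i in range(1, 6)]

def fetch(url):
    """Download one page; report errors instead of raising them."""
    try:
        return url, requests.get(url, timeout=10).status_code
    except requests.RequestException as exc:
        return url, str(exc)

# Parallel downloading: five worker threads fetch pages concurrently.
with ThreadPoolExecutor(max_workers=5) as pool:
    for url, status in pool.map(fetch, urls):
        print(url, status)
```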
2.3 Advanced
1) Machine learning: use machine learning to handle certain anti-crawling strategies and avoid being banned;
2) Data storage: use common databases for storage and querying, and avoid repeated downloads through caching;
3) Distributed crawlers: use open-source frameworks such as scrapy and scrapy-redis to deploy distributed crawlers for large-scale data crawling (a configuration sketch follows this list);
4) Other related applications, such as mobile data crawling, and crawler monitoring and operations.
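As a taste of point 3), the scrapy-redis project turns a single Scrapy spider into a distributed one mostly through settings. The keys below are the standard ones documented by scrapy-redis; the Redis URL is a placeholder for your own deployment.

```python
# settings.py fragment for a scrapy-redis based distributed crawler.

# Share the request queue between all crawler processes via Redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Deduplicate requests cluster-wide, which also avoids repeated downloads.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Keep the queue across restarts, enabling resumable (breakpoint) crawling.
SCHEDULER_PERSIST = True

# Location of the shared Redis instance (placeholder).
REDIS_URL = "redis://localhost:6379"
```

Every machine runs the same spider process; Redis hands out requests, so adding machines scales the crawl.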
3. The hard parts of crawling
The biggest difficulty in data crawling is the ongoing game between crawling and anti-crawling: when we develop a crawling technique, a matching anti-crawling strategy appears; then we develop new crawling methods, and the cycle repeats. Even routine crawling brings its own complications: processes and threads, resumable crawling, distribution, crawler monitoring, and exception notification.
