Web Crawler Principle Analysis [Recommended for Bookmarking]
2022-07-25 20:05:00 【Full stack programmer webmaster】
Hello everyone, we meet again. I'm your friend Quan Jun.
1、The principle of web crawlers
A web crawler is a program that automatically fetches content from the web according to certain rules, simulating the way a person would visit pages in a browser. Put simply, it retrieves the content of the pages you can see online and stores it. Crawling strategies fall into depth-first and breadth-first traversal. In the figure below, a depth-first traversal visits A, then B, D, E, C, F (ABDECF), while a breadth-first traversal visits ABCDEF. A minimal sketch of the breadth-first version is shown below.
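The following is a minimal sketch of a breadth-first crawl driven by a URL queue. The seed URL and the extractLinks() helper are placeholders invented here for illustration; a real crawler would download each page and parse out its links.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Minimal breadth-first crawl skeleton: a queue of pending URLs and a
// set of visited URLs so the same page is never crawled twice.
public class BreadthFirstCrawler {

    public static void main(String[] args) {
        Queue<String> queue = new ArrayDeque<>();   // URLs waiting to be crawled
        Set<String> visited = new HashSet<>();      // URLs already seen

        String seed = "https://example.com/";       // hypothetical seed URL
        queue.offer(seed);
        visited.add(seed);

        while (!queue.isEmpty()) {
            String url = queue.poll();
            System.out.println("Crawling: " + url);

            // In a real crawler this would download the page and extract its links.
            for (String next : extractLinks(url)) {
                if (visited.add(next)) {            // add() returns false if already seen
                    queue.offer(next);
                }
            }
        }
    }

    // Placeholder: a real implementation would fetch the page and parse <a href> tags.
    private static List<String> extractLinks(String url) {
        return List.of();
    }
}
```

Swapping the queue for a stack turns the same skeleton into a depth-first crawl, which produces the ABDECF order described above.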
2、Reasons for writing web crawlers
(1) The Internet holds a huge amount of data, and collecting it manually would waste time and money. Crawlers can acquire and process data in bulk, automatically. For example, crawlers for the major auto forums, dianping.com, and TripAdvisor (a foreign site) have scraped tens of millions of records; if you copied them one by one, you would never finish in a lifetime.
(2) Crawlers are cool. A few days ago I saw someone who had crawled 30 million QQ accounts from Tencent, with detailed fields (QQ number, nickname, QQ Zone name, membership level, avatar, latest post, latest post time, Zone description, gender, birthday, province, city, marital status), and drawn all kinds of interesting charts from them.
(3) For graduate and PhD students doing data mining or data analysis, having no data for experiments is painful. Begging for data on forums every day gets annoying quickly.
3、The workflow of a web crawler
A simple web crawler can be built following the figure above. First there is a queue of URLs to be crawled. Then, through packet capture, we obtain the real request address of the data. Next we use HttpClient to simulate a browser and fetch the corresponding data (usually an HTML document or JSON data). Because a web page contains a lot of content, much of it irrelevant, the response has to be parsed. Parsing HTML is straightforward with Jsoup (a DOM parsing tool) or regular expressions. For parsing JSON data, I recommend the fast parser fastjson (an Alibaba open-source tool). A sketch of the fetch-and-parse step is shown below.
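The sketch below illustrates the fetch-and-parse step. The article names Apache HttpClient; to keep the example self-contained this sketch uses the JDK's built-in java.net.http.HttpClient (Java 11+) for the download and Jsoup for the HTML parsing. The target URL and the CSS selector are made up for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Download a page, then pull the text we care about out of it with Jsoup.
public class FetchAndParse {

    public static void main(String[] args) throws Exception {
        String url = "https://example.com/news";             // hypothetical target page

        // Fetch the HTML; a User-Agent header makes the request look like a browser.
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", "Mozilla/5.0")
                .GET()
                .build();
        String html = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

        // Parse the HTML and select the elements of interest.
        Document doc = Jsoup.parse(html);
        Elements titles = doc.select("h2.title");             // hypothetical selector
        titles.forEach(el -> System.out.println(el.text()));
    }
}
```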
4、 Network packet capture
Network packet capture (packet capture) means intercepting, retransmitting, editing, and storing the packets a network connection sends and receives; it is commonly used to inspect the data in transit. When the response is JSON, or the site requires a username and password to log in, packet capture is especially important, and it is the first step in writing a web crawler.
The figure shows the packet-capture result for Eastmoney (Oriental Fortune). You can see the real response address: the Request URL differs from the address shown in the page above. Looking at the stock data in the response, the response format is JSON. Here we can see that there are 61 pages in total, and the current page's records are in the data field [JSON data].
So packet capture is the first step of a web crawler: it shows directly the real address of the data request, the request method (POST or GET), and the type of the data (HTML or JSON). A fastjson parsing sketch for a response like the one above follows.
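The sketch below parses a paginated JSON response with fastjson, the Alibaba open-source parser recommended earlier. The field names ("pages", "data", "code", "name") and the sample string are assumptions modeled on the packet-capture result described above; adjust them to the real response.

```java
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;

// Parse a paginated JSON response: read the total page count,
// then iterate over the records of the current page.
public class ParseStockJson {

    public static void main(String[] args) {
        // Hypothetical response body shaped like the one described above.
        String body = "{\"pages\":61,\"data\":[{\"code\":\"600000\",\"name\":\"demo\"}]}";

        JSONObject root = JSON.parseObject(body);
        int totalPages = root.getIntValue("pages");        // total number of pages, e.g. 61
        JSONArray rows = root.getJSONArray("data");         // records on the current page

        System.out.println("total pages: " + totalPages);
        for (int i = 0; i < rows.size(); i++) {
            JSONObject row = rows.getJSONObject(i);
            System.out.println(row.getString("code") + " " + row.getString("name"));
        }
    }
}
```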
5、HTTP Status code description
An HTTP status code (HTTP Status Code) is a 3-digit code representing the state of a web server's HTTP response. When we open a web page and it returns data, that is, the response succeeds, the status code is usually 200. There are many status codes; the ones listed below, with their meanings, are the ones most often encountered (and mishandled) in crawlers:
100 Continue: the client should continue sending the rest of the request, or ignore this response if the request is already complete.
101 Switching Protocols: after sending the final blank line of this response, the server will switch to the protocols defined in the Upgrade header. This should only be done when switching to the new protocol is advantageous.
102 Processing: a WebDAV (RFC 2518) extension status code meaning processing will continue.
200 OK: the request succeeded. Handling: take the response body and process it.
201 Created: the request completed and a new resource was created; its URI is available in the response body. Handling: you will rarely encounter this.
202 Accepted: the request was accepted but processing is not finished. Handling: block and wait.
204 No Content: the server fulfilled the request but returned no new information. If the client is a user agent, it need not update its document view. Handling: discard.
300 Multiple Choices: not used directly by HTTP/1.0 applications, only as the default interpretation of 3XX responses; multiple versions of the requested resource are available. Handling: process further if the program can, otherwise discard.
301 Moved Permanently: the requested resource has been assigned a permanent URL and can be accessed through that URL from now on. Handling: redirect to the assigned URL.
302 Found: the requested resource temporarily resides at a different URL. Handling: redirect to the temporary URL.
304 Not Modified: the requested resource has not been updated. Handling: discard.
400 Bad Request: illegal request. Handling: discard.
401 Unauthorized. Handling: discard.
403 Forbidden. Handling: discard.
404 Not Found. Handling: discard.
500 Internal Server Error: the server hit an unexpected condition and could not complete the request; usually caused by a bug in the server-side code.
501 Not Implemented: the server does not support a feature required by the current request, or does not recognize the request and cannot serve it for any resource.
502 Bad Gateway: a server acting as a gateway or proxy received an invalid response from the upstream server while trying to fulfil the request.
503 Service Unavailable: the server is temporarily unable to handle the request due to maintenance or overload; this is temporary and will recover after some time.
A minimal sketch of branching on the status code is shown below.
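The sketch below branches on the status code roughly following the handling rules listed above (200: process, 301/302: follow the redirect, 4xx: discard, 5xx: retry later). It again uses the JDK's java.net.http.HttpClient, and the URL is a placeholder.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Inspect the status code of a response and decide how to handle the URL.
public class StatusCodeHandling {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/"))
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        int code = response.statusCode();
        if (code == 200) {
            System.out.println("OK, process the body: " + response.body().length() + " bytes");
        } else if (code == 301 || code == 302) {
            // Follow the redirect target given in the Location header.
            response.headers().firstValue("Location")
                    .ifPresent(loc -> System.out.println("Redirect to: " + loc));
        } else if (code >= 400 && code < 500) {
            System.out.println("Client error " + code + ", discard this URL");
        } else if (code >= 500) {
            System.out.println("Server error " + code + ", retry later");
        }
    }
}
```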
6、Java basics required for web crawlers
Publisher: Full-stack Programmer Webmaster. Please indicate the source when reprinting: https://javaforall.cn/127751.html Original link: https://javaforall.cn