First, the basic concepts of crawlers
2022-08-04 23:39:00 【beyond proverb】
First, crawler types by usage scenario
Crawler: a program that simulates a browser accessing the Internet and fetches data from it.
① General-purpose crawler: an important part of a crawling system; it fetches the data of an entire page.
② Focused crawler: built on top of the general-purpose crawler; it fetches only a specific part of the page's content.
③ Incremental crawler: monitors a website for data updates and fetches only the newly updated data.
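The difference between the three types can be sketched offline. This is a minimal illustration, not a real crawler: the PAGE string stands in for a downloaded page, and the names TitleParser, general_crawl, focused_crawl and IncrementalCrawler are hypothetical. The incremental crawler here detects updates by hashing the page content, which is only one possible way to do it.

```python
import hashlib
from html.parser import HTMLParser

# Hypothetical page content standing in for a downloaded response body.
PAGE = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"


class TitleParser(HTMLParser):
    """Collects only the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def general_crawl(page):
    """General-purpose crawler: keep the data of the entire page."""
    return page


def focused_crawl(page):
    """Focused crawler: keep only one specific part of the page (its title)."""
    parser = TitleParser()
    parser.feed(page)
    return parser.title


class IncrementalCrawler:
    """Incremental crawler: return a page only if its content is new."""

    def __init__(self):
        self.seen = set()  # digests of page content already fetched

    def crawl(self, page):
        digest = hashlib.sha256(page.encode("utf-8")).hexdigest()
        if digest in self.seen:
            return None  # nothing changed since the last crawl
        self.seen.add(digest)
        return page


print(general_crawl(PAGE) == PAGE)  # True: the whole page is kept
print(focused_crawl(PAGE))          # Demo
```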
Second, anti-crawling mechanisms and anti-anti-crawling strategies
Anti-crawling mechanism: a portal website adopts strategies or technical measures to prevent crawler programs from fetching its data.
Anti-anti-crawling strategy: a crawler program adopts strategies or technical measures to get around the portal website's anti-crawling mechanism and obtain the site's data.
Third, the robots.txt protocol
Also known as the gentleman's agreement, it specifies which data on a website crawlers may fetch and which they may not.
To view it, append /robots.txt to the site's domain name.
For example: https://www.baidu.com/robots.txt
There you can see the pages that crawlers are not allowed to fetch (Disallow) and those they are allowed to fetch (Allow). In general, anything that is not explicitly disallowed may be crawled.
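A crawler can check these rules programmatically with Python's standard urllib.robotparser module. The sketch below parses a made-up robots.txt held in a string so that it runs without network access; the rules, the example.com URLs and the user-agent name DemoCrawler are all assumptions for illustration, not the contents of any real site's file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one would be downloaded from
# https://<domain>/robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /public/
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network call."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "DemoCrawler" is a hypothetical user-agent name; it falls under "User-agent: *".
print(parser.can_fetch("DemoCrawler", "https://example.com/private/data.html"))  # False
print(parser.can_fetch("DemoCrawler", "https://example.com/public/index.html"))  # True
print(parser.can_fetch("DemoCrawler", "https://example.com/other.html"))         # True
```

The last call returns True because the path matches no Disallow rule, which is the "allowed unless disallowed" behaviour described above.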
Fourth, the HTTP and HTTPS protocols
Ⅰ, HTTP protocol
Hypertext Transfer Protocol (HTTP): a form of data exchange between a server and a client
Ⅱ, commonly used request and response headers
Request headers:
① User-Agent: identifies the request carrier, that is, the client software sending the request
② Connection: whether to close the connection or keep it open after the request completes
Response headers:
Content-Type: the type of the data that the server sends back to the client
For example: https://blog.csdn.net/qq_41264055
Press F12, click Network, and press F5 to refresh and re-request the page from the server; you can then see the request header and response header information.
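The same headers can be set and read from code. The following is a minimal sketch using the standard urllib.request module; the User-Agent value and the helper names build_request and fetch_content_type are hypothetical, and fetch_content_type is defined but not called because it needs network access.

```python
import urllib.request


def build_request(url):
    """Build a request that carries the two request headers described above."""
    headers = {
        # Hypothetical identity string for the request carrier.
        "User-Agent": "Mozilla/5.0 (demo crawler)",
        # Asks to keep the connection open after the request completes.
        "Connection": "keep-alive",
    }
    return urllib.request.Request(url, headers=headers)


def fetch_content_type(request):
    """Send the request and return the Content-Type response header.

    Requires network access, so it is not called in this sketch.
    """
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.headers.get("Content-Type")


request = build_request("https://blog.csdn.net/qq_41264055")
# urllib stores header names in capitalized form, e.g. "User-agent".
print(request.get_header("User-agent"))
print(request.get_header("Connection"))
```

Note that urllib.request itself closes the connection after each request, so the Connection value here only illustrates the header; a client with connection pooling would be needed to actually reuse connections.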
Ⅲ, HTTPS protocol
Hypertext Transfer Protocol over Secure Socket Layer (HTTPS): a secure version of HTTP, built on the HTTP protocol
Ⅳ, encryption methods
① Symmetric key encryption
② Asymmetric key encryption
③ Certificate key encryption
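The difference between the first two methods can be illustrated with a toy example. This is only a sketch under strong assumptions: the XOR cipher and the tiny textbook RSA numbers below are insecure and are not what HTTPS actually uses; they only show that symmetric encryption uses one shared key for both directions, while asymmetric encryption uses a public key to encrypt and a separate private key to decrypt. All names are hypothetical, and pow(E, -1, PHI) needs Python 3.8 or later.

```python
def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Toy symmetric cipher: the same key both encrypts and decrypts."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


# Toy asymmetric (RSA-style) key pair built from two small primes.
P, Q = 61, 53
N = P * Q                  # modulus, part of both keys
PHI = (P - 1) * (Q - 1)
E = 17                     # public exponent: (E, N) is the public key
D = pow(E, -1, PHI)        # private exponent: (D, N) is the private key


def rsa_encrypt(message: int) -> int:
    """Encrypt with the public key; anyone may do this."""
    return pow(message, E, N)


def rsa_decrypt(ciphertext: int) -> int:
    """Decrypt with the private key; only the key owner can do this."""
    return pow(ciphertext, D, N)


secret = xor_cipher(b"hello", b"k")
print(xor_cipher(secret, b"k"))      # b'hello'
print(rsa_decrypt(rsa_encrypt(65)))  # 65
```

Certificate key encryption builds on the asymmetric method by having a trusted third party vouch for the server's public key; that step is not shown here.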