First, the basic concepts of web crawlers
2022-08-04 23:39:00 【beyond proverb】
First, classification of crawlers by usage scenario
Crawler: a program that simulates a browser surfing the Internet and fetches data from it.
① Universal crawler: an important part of a search engine's crawling system; it grabs the data of a whole page
② Focused crawler: built on the universal crawler; it grabs only a specific part of the page's content
③ Incremental crawler: detects data updates on a website and grabs only the most recently updated data
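The difference between a focused crawler and an incremental crawler can be sketched in a few lines. This is only an illustration under assumptions that are not in the original article: the names `TitleExtractor`, `extract_titles` and `IncrementalStore` are hypothetical, the "specific part" of the page is assumed to be the `<h1>` text, and an update is assumed to mean a change in the SHA-256 hash of the page content.

```python
import hashlib
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Focused crawling: keep only the text inside <h1> tags, not the whole page."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        # Collect non-empty text only while we are inside an <h1> element.
        if self._in_h1 and data.strip():
            self.titles.append(data.strip())


def extract_titles(html):
    """Return the <h1> texts found in an HTML string (hypothetical helper)."""
    parser = TitleExtractor()
    parser.feed(html)
    return parser.titles


class IncrementalStore:
    """Incremental crawling: remember one fingerprint per URL, report only changes."""

    def __init__(self):
        self._fingerprints = {}

    def is_new_or_updated(self, url, content):
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if self._fingerprints.get(url) == digest:
            return False  # same content as last time, nothing to grab
        self._fingerprints[url] = digest
        return True
```

A universal crawler would store the whole HTML string; the focused version keeps only `extract_titles(html)`, and the incremental version processes a page only when `is_new_or_updated(url, html)` returns `True`.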
Second, anti-crawling mechanisms and anti-anti-crawling strategies
Anti-crawling mechanism: a portal website adopts strategies or technical means to prevent crawler programs from crawling its data
Anti-anti-crawling strategy: a crawler adopts strategies or technical means to get around the portal website's anti-crawling mechanism and obtain its data
Third, the robots.txt protocol
Also known as the gentleman's agreement, it specifies which data on a website may be crawled and which may not
To view it, append /robots.txt to the site's domain name
For example, at https://www.baidu.com/robots.txt you can see which pages are not allowed to be crawled (Disallow) and which are allowed (Allow). In general, anything that is not disallowed may be crawled.
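Checking a URL against Allow and Disallow rules can be sketched with Python's standard `urllib.robotparser` module. The rules in `SAMPLE_ROBOTS`, the `example.com` URLs, the `my-crawler` agent name and the `build_parser` helper are made up for illustration and are not taken from any real site; the sketch parses the text offline rather than downloading a real robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used instead of a real download.
SAMPLE_ROBOTS = """\
User-agent: *
Allow: /public/
Disallow: /private/
"""


def build_parser(robots_text):
    """Parse robots.txt text that is already in memory (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS)

# Explicitly allowed path.
print(parser.can_fetch("my-crawler", "https://example.com/public/page.html"))
# Explicitly disallowed path.
print(parser.can_fetch("my-crawler", "https://example.com/private/data.html"))
# A path that no rule mentions is treated as allowed.
print(parser.can_fetch("my-crawler", "https://example.com/other.html"))
```

For a live site, `RobotFileParser` also offers `set_url()` and `read()` to fetch the file before calling `can_fetch()`.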
Fourth, the HTTP and HTTPS protocols
Ⅰ, HTTP protocol
Hyper Text Transfer Protocol (HTTP): a form of data interaction between a server and a client
Ⅱ, commonly used request header and response header fields
Request header:
① User-Agent: identifies the request carrier, for example the browser or crawler that sends the request
② Connection: whether to close the connection or keep it alive after the request is completed
Response header:
Content-Type: the type of the data the server returns to the client
For example, open https://blog.csdn.net/qq_41264055
Press F12, click Network, then press F5 to refresh and request the page again; you can now inspect the request headers and response headers
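The same headers can also be set and read from a crawler. The sketch below uses only the standard library and sends nothing over the network: it builds a request object and parses a Content-Type value. The User-Agent string, the `build_request` and `parse_content_type` helpers, and the sample header value are assumptions chosen for illustration.

```python
from email.message import Message
from urllib.request import Request


def build_request(url):
    """Build (but do not send) a request carrying the two request headers above."""
    headers = {
        # Hypothetical identity string; a real crawler would choose its own.
        "User-Agent": "Mozilla/5.0 (example crawler sketch)",
        # Ask for the connection to be closed once the request is completed.
        "Connection": "close",
    }
    return Request(url, headers=headers)


def parse_content_type(header_value):
    """Split a Content-Type value into (media type, charset)."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_type(), msg.get_content_charset()


req = build_request("https://blog.csdn.net/qq_41264055")
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
print(parse_content_type("text/html; charset=UTF-8"))
```

Passing `req` to `urllib.request.urlopen` would actually send the request; the response's `headers` attribute then exposes the real Content-Type value.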
Ⅲ, HTTPS protocol
Hyper Text Transfer Protocol over Secure Socket Layer (HTTPS): a secure version of the HTTP protocol in which the transmitted data is encrypted
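Ⅳ, encryption methods
① Symmetric key encryption: the client and the server use the same key to encrypt and decrypt data. The key has to be sent along with the ciphertext, so anyone who intercepts both can decrypt the data.
② Asymmetric key encryption: the server sends its public key to the client, the client encrypts data with it, and only the server's private key can decrypt the result. The client, however, cannot be sure that the public key it received really came from the server.
③ Certificate key encryption: a certificate authority signs the server's public key, so the client can verify the certificate before using the key. This is the approach HTTPS uses.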