First, the basic concepts of web crawlers
2022-08-04 23:39:00 【beyond proverb】
First, crawlers are classified according to usage scenarios
Crawler: a program that simulates a browser accessing the Internet and fetches data from it.
① General-purpose (universal) crawler: an important part of a crawling system; it fetches the data of an entire page.
② Focused crawler: built on top of the general-purpose crawler; it fetches only the content of a specific region of the page.
③ Incremental crawler: detects data updates on a website and fetches only the most recently updated data.
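The three crawler types above can be sketched with the standard library alone. This is a minimal illustration, not a production crawler: the `PAGE` string, the `id="article"` region, and the helper names `extract_article` and `is_new` are all assumptions made up for this example, and a real crawler would download the page over HTTP instead of using an inline string.

```python
import hashlib
from html.parser import HTMLParser

# Hypothetical page content; a real crawler would download this over HTTP.
PAGE = """
<html><body>
  <div id="nav">Home | About</div>
  <div id="article"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""


class ArticleText(HTMLParser):
    """Focused extraction: collect text only inside <div id="article">."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # <div> nesting depth inside the target region; 0 = outside
        self.chunks = []  # text fragments found inside the target region

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth > 0:
                self.depth += 1
            elif dict(attrs).get("id") == "article":
                self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_article(html):
    """Focused crawler step: return only the text of the target region."""
    parser = ArticleText()
    parser.feed(html)
    return parser.chunks


def is_new(content, seen):
    """Incremental crawler step: True only the first time this content is seen."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if digest in seen:
        return False
    seen.add(digest)
    return True


whole_page = PAGE                # general-purpose: keep the entire page
article = extract_article(PAGE)  # focused: keep one region of the page
seen = set()
first_visit = is_new(PAGE, seen)   # content not seen before
second_visit = is_new(PAGE, seen)  # unchanged content is skipped
print(article, first_visit, second_visit)
```

Hashing the whole page is only one possible way to detect updates; sites that expose timestamps or version numbers allow cheaper checks.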
Second, anti-crawling mechanisms and anti-anti-crawling strategies
Anti-crawling mechanism: a website adopts strategies or technical measures to prevent crawler programs from fetching its data.
Anti-anti-crawling strategy: a crawler adopts strategies or technical measures of its own to get around the website's anti-crawling mechanism and obtain the site's data.
Third, robots.txt protocol
Also known as the gentleman's agreement, it specifies which data on a website crawlers may fetch and which they may not.
To view it, append /robots.txt to the site's domain name.
For example: https://www.baidu.com/robots.txt
There you can see the paths that crawlers are not allowed to fetch (Disallow) and those that are allowed (Allow). In general, anything that is not disallowed may be crawled.
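As a small sketch of how a crawler might honor these rules, Python's standard `urllib.robotparser` module can evaluate Disallow and Allow lines. The rules, the `MyCrawler` user-agent name, and the `example.com` URLs below are invented for illustration and are not taken from any real site; the rules are parsed from an inline string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, made up for this example.
RULES = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())  # parse rules from memory instead of downloading

# "MyCrawler" is an assumed crawler name; it falls under the "*" rules.
can_public = parser.can_fetch("MyCrawler", "https://example.com/index.html")
can_private = parser.can_fetch("MyCrawler", "https://example.com/private/data.html")

print(can_public)   # True
print(can_private)  # False
```

For a live site, `set_url()` followed by `read()` downloads the file before calling `can_fetch()`.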
Fourth, the HTTP and HTTPS protocols
Ⅰ, HTTP protocol
HyperText Transfer Protocol (HTTP): a protocol for exchanging data between a server and a client
Ⅱ, commonly used request and response headers
Request headers:
① User-Agent: identifies the carrier of the request (for example, the browser or program that sent it)
② Connection: whether to close the connection or keep it open after the request completes
Response headers:
Content-Type: the type of the data the server returns to the client
For example, open https://blog.csdn.net/qq_41264055
Press F12, click Network, and press F5 to refresh and re-request the page; you can then inspect the request headers and response headers.
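The same headers can be inspected from code. The sketch below builds a request with a custom User-Agent and parses a Content-Type value, both without touching the network. The URL, the `ExampleCrawler/0.1` string, and the sample Content-Type value are assumptions for illustration only.

```python
from email.message import Message
from urllib.request import Request

# Assumed User-Agent string; a real crawler would choose its own.
USER_AGENT = "Mozilla/5.0 (compatible; ExampleCrawler/0.1)"

# Build (but do not send) a request that carries the User-Agent header.
# Note: when urllib actually sends a request it sets "Connection: close" itself.
req = Request("https://example.com/", headers={"User-Agent": USER_AGENT})

# urllib stores header names capitalized, so the lookup key is "User-agent".
sent_user_agent = req.get_header("User-agent")

# Parse a hypothetical Content-Type response header value.
response_headers = Message()
response_headers["Content-Type"] = "text/html; charset=utf-8"
content_type = response_headers.get_content_type()
charset = response_headers.get_content_charset()

print(sent_user_agent)
print(content_type, charset)
```

The headers object returned by `urllib.request.urlopen` is a subclass of `email.message.Message`, so the same two calls should also work on a real response.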
Ⅲ, HTTPS protocol
Hypertext Transfer Protocol over Secure Socket Layer (HTTPS): a secure version of the protocol, built on HTTP
Ⅳ, encryption methods
① Symmetric key encryption
② Asymmetric key encryption
③ Certificate-based key encryption
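On the client side, the certificate-based approach shows up as certificate verification. As a small sketch, Python's standard `ssl` module creates a context whose defaults require the server to present a valid certificate and check that it matches the host name. The variable names are chosen for this example, and nothing is sent over the network.

```python
import ssl

# Default client-side context, as used for HTTPS connections to servers.
ctx = ssl.create_default_context()

# The server must present a certificate that chains to a trusted authority.
requires_certificate = ctx.verify_mode == ssl.CERT_REQUIRED

# The certificate must also match the host name being contacted.
checks_hostname = ctx.check_hostname

print(requires_certificate, checks_hostname)
```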