First, the basic concepts of web crawlers
2022-08-04 23:39:00 【beyond proverb】
First, classification of crawlers by usage scenario
Crawler: a program that simulates a browser accessing the Internet and fetches data from it.
① Universal crawler: an important part of a crawling system; it fetches the data of an entire page
② Focused crawler: built on the universal crawler; it extracts only a specific part of the page's content
③ Incremental crawler: monitors a website for data updates and fetches only the newly updated data
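The three crawler types can be illustrated with a minimal sketch. It runs entirely offline: `SAMPLE_PAGE`, `fake_fetch`, the `item` CSS class, and all function names are hypothetical stand-ins chosen for illustration, and a real crawler would replace `fake_fetch` with an actual HTTP request.

```python
import hashlib
from html.parser import HTMLParser

# Hypothetical page content standing in for a downloaded response body.
SAMPLE_PAGE = (
    "<html><head><title>Demo</title></head><body><h1>News</h1>"
    '<p class="item">First story</p><p class="item">Second story</p>'
    "</body></html>"
)


def fake_fetch(url):
    """Stand-in for a network request; always returns SAMPLE_PAGE."""
    return SAMPLE_PAGE


def universal_crawl(fetch, url):
    """Universal crawler: return the whole page unchanged."""
    return fetch(url)


class ItemParser(HTMLParser):
    """Collects the text of every <p class="item"> element."""

    def __init__(self):
        super().__init__()
        self.in_item = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and ("class", "item") in attrs:
            self.in_item = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_item = False

    def handle_data(self, data):
        if self.in_item:
            self.items.append(data.strip())


def focused_crawl(fetch, url):
    """Focused crawler: fetch the whole page, then keep one region of it."""
    parser = ItemParser()
    parser.feed(universal_crawl(fetch, url))
    return parser.items


def incremental_crawl(fetch, url, seen_hashes):
    """Incremental crawler: return only items not seen on earlier runs."""
    fresh = []
    for item in focused_crawl(fetch, url):
        digest = hashlib.sha256(item.encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            fresh.append(item)
    return fresh
```

Calling `incremental_crawl` twice with the same `seen_hashes` set returns both stories the first time and an empty list the second time, because nothing on the page has changed in between.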
Second, anti-crawling mechanisms and anti-anti-crawling strategies
Anti-crawling mechanism: a website adopts policies or technical measures to prevent crawler programs from fetching its data
Anti-anti-crawling strategy: a crawler adopts policies or technical measures to get around a website's anti-crawling mechanism and obtain its data
Third, the robots.txt protocol
Also known as the gentleman's agreement, it specifies which data on a website crawlers may fetch and which they may not
To view it, append /robots.txt to the site's domain name
For example: https://www.baidu.com/robots.txt
There you can see which pages are not allowed to be crawled (Disallow) and which are allowed (Allow). In general, anything that is not explicitly disallowed may be crawled.
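A crawler can check these rules in code before requesting a page. The sketch below uses Python's standard `urllib.robotparser` on an in-memory rule set so that it runs offline; `SAMPLE_ROBOTS`, the `my-crawler` agent name, and the example.com URLs are hypothetical. A real crawler would instead load the live file with `set_url(...)` followed by `read()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, not copied from any real site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Disallow: /search
"""


def build_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS)

# Not matched by any Disallow rule, so crawling is permitted.
allowed = parser.can_fetch("my-crawler", "https://example.com/index.html")

# Matched by "Disallow: /private/", so a polite crawler skips it.
blocked = parser.can_fetch("my-crawler", "https://example.com/private/data.html")
```

Here `allowed` is `True` and `blocked` is `False`, which matches the rule of thumb that anything not disallowed may be crawled.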
Fourth, the HTTP and HTTPS protocols
Ⅰ, HTTP protocol
Hypertext Transfer Protocol (HTTP): a protocol for exchanging data between a client and a server
Ⅱ, commonly used request and response headers
Request headers:
① User-Agent: identifies the client that is making the request
② Connection: whether to close the connection or keep it open after the request completes
Response headers:
Content-Type: the type of the data the server returns to the client
For example: https://blog.csdn.net/qq_41264055
Press F12, click Network, then press F5 to refresh and request the page again; you can then inspect the request headers and response headers
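The same headers can be set and read in code. The following is a minimal sketch using only the standard library: it builds a request object without sending it, so `URL` and the browser-like `User-Agent` string are placeholder assumptions, and `parse_content_type` is a hypothetical helper name.

```python
from email.message import Message
from urllib.request import Request

# Hypothetical target URL and browser-like User-Agent string.
URL = "https://example.com/"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Chrome/104.0 Safari/537.36",
    "Connection": "keep-alive",
}

# The request is only constructed here; nothing is sent over the network.
# Note: when urllib's default opener actually sends a request, it may
# override the Connection header, so it is shown only to illustrate the field.
request = Request(URL, headers=HEADERS)


def parse_content_type(value):
    """Split a Content-Type header value into (media type, charset)."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_type(), msg.get_content_charset()


# urllib stores header names with only the first letter capitalized.
user_agent = request.get_header("User-agent")
media_type, charset = parse_content_type("text/html; charset=utf-8")
```

For the sample value above, `media_type` is `"text/html"` and `charset` is `"utf-8"`; a crawler typically uses the charset to decode the response body.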
Ⅲ, HTTPS protocol
Hypertext Transfer Protocol over Secure Socket Layer (HTTPS): a secure version of HTTP that encrypts the data being transferred
Ⅳ, encryption methods
① Symmetric-key encryption
② Asymmetric-key encryption
③ Certificate-based encryption
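The difference between the first two methods can be shown with a toy sketch. Neither scheme below is secure or used by HTTPS as written: the XOR cipher and the textbook RSA numbers (primes 61 and 53) are assumptions chosen only because they fit in a few lines. Certificate-based encryption builds on the asymmetric case by having a trusted authority vouch for the server's public key, and is not shown here.

```python
# Toy illustration only: do not use either scheme for real data.


def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Symmetric: the same key both encrypts and decrypts."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


# Textbook RSA with tiny primes p = 61 and q = 53.
N = 61 * 53  # public modulus (3233)
E = 17       # public exponent, shared with everyone
D = 2753     # private exponent, kept secret; (E * D) % (60 * 52) == 1


def rsa_encrypt(message: int) -> int:
    """Asymmetric: anyone can encrypt with the public key (E, N)."""
    return pow(message, E, N)


def rsa_decrypt(ciphertext: int) -> int:
    """Asymmetric: only the holder of the private key D can decrypt."""
    return pow(ciphertext, D, N)


# Symmetric round trip: both sides must already share `secret`.
secret = b"shared-key"
scrambled = xor_cipher(b"hello", secret)
restored = xor_cipher(scrambled, secret)

# Asymmetric round trip: encrypt with the public key, decrypt with the private.
recovered = rsa_decrypt(rsa_encrypt(65))
```

After running this, `restored` equals `b"hello"` and `recovered` equals `65`. The practical difference is key distribution: the symmetric key must reach both sides secretly, while the public key `(E, N)` can be published openly.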