
First, the basic concepts of web crawlers

2022-08-04 23:39:00 【beyond proverb】

First, crawlers are classified according to usage scenarios

Crawler: a program that simulates a browser accessing the Internet and fetches data from it.
① General-purpose crawler: an important part of a crawling system; it fetches the data of an entire page
② Focused crawler: built on the general-purpose crawler; it fetches only the content of a specific part of the page
③ Incremental crawler: detects updates to the data on a website and fetches only the newly updated data
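
The three kinds of crawler above can be sketched on an in-memory page. This is only a minimal illustration, not a full crawler: `SAMPLE_PAGE`, `TitleParser`, `universal_crawl`, `focused_crawl`, and `IncrementalCrawler` are hypothetical names, the page is a hard-coded string rather than a real download, and detecting updates by hashing the page content is an assumption made for the example.

```python
import hashlib
from html.parser import HTMLParser

# Hypothetical page content; a real crawler would download this over HTTP.
SAMPLE_PAGE = (
    "<html><head><title>Demo</title></head>"
    "<body><h1>News</h1><p>Hello</p></body></html>"
)


class TitleParser(HTMLParser):
    """Collects only the text inside <title> and ignores the rest of the page."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def universal_crawl(page):
    # General-purpose crawler: keep the data of the whole page.
    return page


def focused_crawl(page):
    # Focused crawler: keep only one specific part of the page (here, the title).
    parser = TitleParser()
    parser.feed(page)
    return parser.title


class IncrementalCrawler:
    """Returns a page only if its content has not been seen before."""

    def __init__(self):
        self._seen = set()

    def crawl(self, page):
        # Assumption: a changed SHA-256 digest is treated as "updated data".
        digest = hashlib.sha256(page.encode("utf-8")).hexdigest()
        if digest in self._seen:
            return None
        self._seen.add(digest)
        return page
```

Calling `IncrementalCrawler.crawl` twice with the same page returns the page the first time and `None` the second time, which mirrors the idea of fetching only updated data.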

Second, anti-crawling mechanisms and anti-anti-crawling strategies

Anti-crawling mechanism: a portal website adopts strategies or technical measures to prevent crawler programs from fetching its data.

Anti-anti-crawling strategy: a crawler program adopts strategies or technical measures of its own to get around the website's anti-crawling mechanism and obtain the website's data.

Third, the robots.txt protocol

Also known as the gentleman's agreement, it specifies which data on a website may be crawled and which may not. Compliance is voluntary: the file states the site's rules but does not enforce them.
To view it, append /robots.txt to the site's domain name.
For example, https://www.baidu.com/robots.txt lists the paths that crawlers are not allowed to fetch (Disallow) and those they are allowed to fetch (Allow). In general, anything that is not disallowed may be crawled.
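
As a rough sketch of how a crawler might check these rules, Python's standard `urllib.robotparser` module can parse Disallow and Allow lines. The rules in `SAMPLE_ROBOTS`, the crawler name `MyCrawler`, and the `example.com` URLs are made up for illustration and are not the contents of any real site's robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
# parse() accepts the lines of a robots.txt file, so no network access is needed.
parser.parse(SAMPLE_ROBOTS.splitlines())

# can_fetch(useragent, url) reports whether the rules permit fetching the URL.
allowed = parser.can_fetch("MyCrawler", "https://example.com/index.html")
blocked = parser.can_fetch("MyCrawler", "https://example.com/private/data.html")

print("index.html allowed:", allowed)
print("private/data.html allowed:", blocked)
```

To check a live site instead, `parser.set_url(...)` followed by `parser.read()` downloads and parses the site's actual robots.txt.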

Fourth, the HTTP and HTTPS protocols

Ⅰ, HTTP protocol

Hypertext Transfer Protocol (HTTP): a protocol for exchanging data between a client and a server

Ⅱ, commonly used request header and response header information

Request header:
① User-Agent: identifies the client (the request carrier) that is sending the request
② Connection: whether to close the connection or keep it open after the request completes
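
The two request headers above can be set explicitly when building a request with the standard `urllib.request` module. This is a minimal sketch: `build_request`, the User-Agent string, and the `example.com` URL are hypothetical, and the request is only constructed here, not sent.

```python
from urllib.request import Request


def build_request(url):
    # Hypothetical header values chosen for illustration.
    headers = {
        # Identifies the client making the request.
        "User-Agent": "Mozilla/5.0 (compatible; ExampleCrawler/0.1)",
        # Ask for the connection to be closed after the response.
        # urllib.request closes the connection after each request anyway,
        # so this value mirrors its default behaviour.
        "Connection": "close",
    }
    return Request(url, headers=headers)


req = build_request("https://example.com/")

# Request stores header names capitalized as "User-agent" and "Connection".
print(req.get_header("User-agent"))
print(req.get_header("Connection"))
```

Passing `req` to `urllib.request.urlopen` would send the request with these headers attached.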

Response header:
Content-Type: the type of the data that the server returns to the client
For example, open https://blog.csdn.net/qq_41264055, press F12, click Network, and press F5 to refresh and send the request to the server again. You can then see the request header and response header information.
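
A Content-Type value seen in the response headers can also be split into its media type and character set in code. The sketch below uses the standard `email.message.Message` class to do the parsing; `parse_content_type` is a hypothetical helper, and the header value is a typical example rather than one copied from a real response.

```python
from email.message import Message


def parse_content_type(header_value):
    # Message knows how to parse MIME-style headers such as Content-Type.
    msg = Message()
    msg["Content-Type"] = header_value
    # Returns (media type, charset); charset is None when not specified.
    return msg.get_content_type(), msg.get_content_charset()


media_type, charset = parse_content_type("text/html; charset=UTF-8")
print(media_type, charset)
```

A crawler can use the media type to decide how to handle the body (for example HTML versus JSON) and the charset to decode the bytes into text.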