How is crawler data collected and organized?
2022-07-30 06:05:00 【oHuangBing】
Some users have been curious about how crawler identification websites collect and organize their crawler data. Today we will uncover how that data is collected and organized.
Reverse DNS (rDNS) lookup by IP address
We can do a reverse lookup of the rDNS record for a crawler's IP address. For example, using a reverse DNS lookup tool on the IP 116.179.32.160 returns the rDNS name baiduspider-116-179-32-160.crawl.baidu.com.
From this we can roughly judge that it is a Baidu search engine spider. However, a hostname can be faked, so a reverse lookup alone is not reliable. We also need a forward lookup: we use the ping command to check whether baiduspider-116-179-32-160.crawl.baidu.com resolves back to 116.179.32.160. It does resolve to 116.179.32.160, which confirms that this is a Baidu search engine crawler.
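The reverse-then-forward check described above can be sketched in Python. This is a minimal sketch, not the site's actual implementation: the function name `verify_crawler_ip`, the `allowed_suffixes` parameter, and the injectable `reverse`/`forward` resolvers are assumptions introduced for illustration, and the suffix `.crawl.baidu.com` is simply taken from the example hostname above. It handles IPv4 only.

```python
import socket


def verify_crawler_ip(ip, allowed_suffixes, reverse=None, forward=None):
    """Forward-confirmed reverse DNS check (IPv4 only).

    `reverse` maps an IP to a hostname and `forward` maps a hostname to a
    list of IPs. Both default to the system resolver; they are parameters
    only so the logic can be exercised without network access.
    """
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])

    # Step 1: reverse lookup (IP -> hostname).
    try:
        hostname = reverse(ip)
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return False

    # Step 2: the hostname must belong to a domain the crawler operator uses.
    hostname = hostname.rstrip(".").lower()
    if not any(hostname.endswith(suffix) for suffix in allowed_suffixes):
        return False

    # Step 3: forward lookup (hostname -> IPs) must include the original IP,
    # otherwise the PTR record could have been faked.
    try:
        addresses = forward(hostname)
    except OSError:
        return False
    return ip in addresses
```

With network access, calling `verify_crawler_ip("116.179.32.160", [".crawl.baidu.com"])` would run both lookups against the system resolver; the result depends on the DNS records at the time of the query.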

Look up ASN information
Not all crawlers follow the convention above; for most crawlers a reverse lookup returns no result. In that case we query the ASN information of the IP address to judge whether the crawler is what it claims to be.
Example: for the IP 74.119.118.20, an IP information query shows that the address is located in Sunnyvale, California, USA.

From the ASN information we can see that the IP belongs to Criteo Corp.

The screenshot above shows a log record of the Criteo crawler. The part highlighted in yellow is its User-agent, followed by its IP. The record is consistent: this IP is indeed an IP address of CriteoBot.
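The consistency check between the User-agent in a log record and the ASN owner of its IP can be sketched as follows. This is only an illustrative sketch: the `EXPECTED_ORG` table and the function name `ua_matches_asn_org` are hypothetical, and the ASN organization name is assumed to have already been obtained from an IP information query.

```python
# Hypothetical table: User-agent token -> keyword expected in the ASN
# organization name. The single entry mirrors the Criteo example above.
EXPECTED_ORG = {"criteobot": "criteo"}


def ua_matches_asn_org(user_agent, asn_org, expected=EXPECTED_ORG):
    """Return True/False if the User-agent claims a known crawler and the
    ASN organization does/does not match; None if no known crawler is claimed."""
    ua = user_agent.lower()
    org = asn_org.lower()
    for token, org_keyword in expected.items():
        if token in ua:
            return org_keyword in org
    return None
```

A record whose User-agent contains "CriteoBot" and whose IP belongs to "Criteo Corp." is accepted, while the same User-agent arriving from an unrelated network is flagged as suspicious.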
IP address ranges published in the crawler's official documentation
Some crawlers publish their IP address ranges. We save these official ranges directly to the database, which is a simple and fast method.
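Checking an IP against stored ranges could look like the sketch below, which uses Python's standard `ipaddress` module. The `PUBLISHED_RANGES` table stands in for the database, and both the crawler name `examplebot` and its ranges are placeholders (reserved documentation prefixes), not real published data.

```python
import ipaddress

# Placeholder data standing in for the database of official ranges.
# "examplebot" and these documentation prefixes are assumptions.
PUBLISHED_RANGES = {
    "examplebot": ["192.0.2.0/24", "2001:db8::/32"],
}


def find_crawler_by_ip(ip, published=PUBLISHED_RANGES):
    """Return the crawler whose published ranges contain `ip`, or None."""
    addr = ipaddress.ip_address(ip)
    for name, cidrs in published.items():
        for cidr in cidrs:
            # Membership is False when the IP versions differ (IPv4 vs IPv6).
            if addr in ipaddress.ip_network(cidr):
                return name
    return None
```

A linear scan is enough for a handful of ranges; a larger collection would probably call for an indexed structure, which is outside the scope of this sketch.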
Public logs
Public access logs can often be found on the Internet. For example, the following image is a public log record I found:

We can parse the log records and use the User-agent to judge which requests come from crawlers and which from visitors, which greatly enriches our crawler record database.
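A rough sketch of this parsing step is shown below. It assumes the logs are in the common "combined" access log format, which the original post does not confirm, and it uses a simple keyword heuristic (`bot`, `spider`, `crawl`) that is an assumption rather than the site's actual rule. The names `LOG_PATTERN`, `CRAWLER_HINT`, and `classify_log_line` are hypothetical.

```python
import re

# Assumed "combined" log format:
# ip ident user [time] "request" status size "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

# Heuristic only: many crawlers include one of these words in their User-agent.
CRAWLER_HINT = re.compile(r"bot|spider|crawl", re.IGNORECASE)


def classify_log_line(line):
    """Return (ip, user_agent, kind) for a log line, or None if it does not parse."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    ua = match.group("ua")
    kind = "crawler" if CRAWLER_HINT.search(ua) else "visitor"
    return match.group("ip"), ua, kind
```

Because a User-agent can be forged, records classified this way are best treated as candidates and confirmed with the rDNS, ASN, or published-range checks described earlier.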
Summary
The four methods above explain how a crawler identification website collects and organizes crawler data, and how it ensures that the data is accurate and reliable. In practice these are not the only methods in use, but the others are used relatively rarely, so I will not introduce them here.