[Experience] Practical Applications of Web Crawlers at Work (Theory)
2022-06-30 19:11:00 【Little fire dragon said data】
Estimated reading time: 5 min
Pain point addressed: Many readers are unsure about web crawlers. Little Fire Dragon hopes to explain the basic principles of crawlers in simple terms, and to show how to implement one with a short piece of code, so that you can get started quickly. This article is aimed at crawler beginners.
00
Preface
What is a crawler? What are its application scenarios? What steps does an implementation take? How is it implemented in code?
If you have any of these questions, this article should help. For reasons of space, this article covers the first three points; the code implementation will follow in the next article.
01
What is a crawler?
First, let's talk about what a crawler is. We live in an era of information explosion. If you want to collect information comprehensively, you need to fetch all kinds of information from the network to your local machine and integrate it. A "program that automatically requests websites and extracts information from them" is called a crawler.
Two questions come up here:
1. What can a crawler crawl?
In theory, any content you can see on a website can be crawled, for example text, images, audio, and video.
2. Are crawlers illegal?
A crawler is a technology, and a technology is a tool; the tool itself is not illegal. If someone uses the tool to do something illegal, however, that is another matter. Crawlers should follow these rules:
- Comply with the Robots protocol: the protocol is a file (robots.txt) stored in the root directory of a website that tells crawlers which content may be fetched and which may not. It plays a role similar to a "legal instrument".
- Stay away from illegal profit: maliciously crawling a competitor's data to seek illegitimate gain may violate the law.
- Avoid damaging the server: if the crawl volume is large enough to paralyze the other party's website, this counts as an attack on the website and may be illegal.
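The first and third rules above can be checked in code before any page is requested. The following is a minimal sketch, not the author's implementation: it uses Python's standard `urllib.robotparser` module, and the robots.txt content, the `my-crawler` user agent, and the `example.com` URLs are all hypothetical values chosen for illustration. A real crawler would download the file from the target site instead of using an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a real download.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text into a rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "my-crawler" is an assumed user-agent name for this sketch.
allowed = parser.can_fetch("my-crawler", "https://example.com/listings/1")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/data")

# Seconds the site asks crawlers to wait between requests (None if not given).
delay = parser.crawl_delay("my-crawler")

print(allowed, blocked, delay)
```

Honoring the requested delay, for example with `time.sleep(delay)` between requests, is one simple way to avoid overloading the server.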
02
Crawler application scenarios
What are crawlers used for, and how do they help in our daily work and life? Here are a few common directions:
- Search engines: crawlers are one link in the search engines we all know. They fetch the latest pages from websites; the pages are then recalled, ranked, and presented to users. Examples: Baidu, Google.
- Platform information integration: when shopping online, some websites show prices from multiple platforms. This is done with crawler technology, which aggregates prices from other platforms to help the platform set its own prices and to give consumers a reference. Examples: JD.COM, Suning.
- Data analysis: when we want to capture a website's information and analyze something of interest, a crawler is essential. Example: crawling Lianjia data to analyze price trends for second-hand homes.
- Ticket grabbing: have you ever found that tickets for the Spring Festival travel rush or a concert were sold out? Scalpers may be involved, using crawler software that simulates human behavior to grab tickets. To prevent this, many websites apply anti-crawler measures that raise the cost of crawling.
03
Common steps for a crawler
By now, are you eager to build a crawler yourself? Here Little Fire Dragon shares a fairly general set of crawler steps for your reference:
Step 1: Find the URL of the website you want to crawl. Example: Lianjia.
Step 2: View the page source (HTML). You can open it with the F12 shortcut key.
Step 3: Locate the content you want to crawl. Example: house prices.
Step 4: Use Python code to request the website, fetch the page, and parse it. The code will be shared in the next "Implementation" article.
Step 5: Store the crawled content locally.
That is all for this issue.
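Until the implementation article arrives, the steps above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the author's code: it uses only the Python standard library, the HTML fragment and its `title`/`price` class names are invented for the example, and `fetch_html`, `ListingParser`, `parse_listings`, and `save_rows` are hypothetical helper names. The network request is defined but not called, so the example runs offline against the sample fragment.

```python
import csv
import urllib.request
from html.parser import HTMLParser

# Hypothetical page fragment (steps 2-3); a real page will be structured differently.
SAMPLE_HTML = """
<ul>
  <li class="listing"><span class="title">Flat A</span><span class="price">320</span></li>
  <li class="listing"><span class="title">Flat B</span><span class="price">415</span></li>
</ul>
"""


def fetch_html(url: str) -> str:
    """Step 4 (request): download a page. Defined for reference, not called here."""
    request = urllib.request.Request(url, headers={"User-Agent": "my-crawler"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


class ListingParser(HTMLParser):
    """Step 4 (parse): collect the text of <span class="title"> and <span class="price">."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None   # name of the span currently being read, if any
        self._current = {}   # fields collected for the current <li>

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("title", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current:
            self.rows.append(self._current)
            self._current = {}


def parse_listings(html: str) -> list:
    parser = ListingParser()
    parser.feed(html)
    return parser.rows


def save_rows(rows, path):
    """Step 5: store the crawled rows locally as a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "price"])
        writer.writeheader()
        writer.writerows(rows)


rows = parse_listings(SAMPLE_HTML)
print(rows)
```

To try it against a live page, you would pass the result of `fetch_html(url)` to `parse_listings` and then call `save_rows(rows, "listings.csv")`, after adapting the tag and class names to what you find in the page source.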