"Experience": Practical applications of crawlers at work (theory part)
2022-06-30 19:11:00 【Little fire dragon said data】
Estimated reading time: 5 min
Pain point addressed: Many readers have questions about crawlers. Little fire dragon hopes to explain the basic principles of crawlers in plain language, and to show how one can be implemented with a simple piece of code, so that you can get started as quickly as possible. This article is aimed at crawler beginners.
00
Preface
What is a crawler? What are its application scenarios? What steps does an implementation take? How is it implemented in code?
If you have these questions, I believe this article can help. For reasons of space, this article covers the first three points; the code implementation will follow in the next article.
01
What is a crawler?
First, let's talk about what a crawler is. We live in an era of information explosion. If you want to collect information comprehensively, you need to fetch all kinds of information from the web to your local machine and consolidate it. A "program that automatically requests websites and extracts their information" of this kind is called a crawler.
Two questions come up here:
1. What can a crawler crawl?
In theory, anything you can see on a website can be crawled, for example: text, images, audio, video, and so on.
2. Is crawling illegal?
A crawler is a technology, and a technology is a tool. The tool itself is not illegal, but using it to do something illegal is another matter. Crawlers need to follow these rules:
- Comply with the Robots protocol: The protocol is a file (robots.txt) stored in the root directory of the website. It tells crawlers which content may be crawled and which may not, similar to a "legal document".
- Stay away from illegal profit: Maliciously crawling competitors' data to seek illegitimate gains may violate the law.
- Avoid damaging the server: If the crawl volume is large enough to paralyze the other party's website, it falls into the category of attacking a website and may be illegal.
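To make the first rule more concrete, here is a minimal sketch of checking the Robots protocol from Python with the standard library's `urllib.robotparser`. The robots.txt content, the `my-crawler` user agent, and the example.com URLs are all made up for illustration; a real crawler would load the file from the target site instead of an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
# A real crawler would call set_url(".../robots.txt") and read() instead.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "my-crawler" is an assumed user-agent name, not one from the article.
allowed = parser.can_fetch("my-crawler", "https://example.com/public/page.html")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/data.html")

# Seconds the site asks crawlers to wait between requests (None if unset).
delay = parser.crawl_delay("my-crawler")

print(allowed, blocked, delay)
```

Sleeping for the reported delay between requests (for example with `time.sleep(delay)`) is one simple way to respect the third rule and keep the load on the server low.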
02
Crawler application scenarios
What are the application scenarios for crawlers? How do they help in our daily work and life? Here are a few common directions:
- Search engines: One stage of the search engines we all know is the web crawler. It fetches the latest pages from various websites so that they can be recalled, ranked, and presented to users. Examples: Baidu, Google.
- Platform information aggregation: When shopping online, some websites show the prices of N platforms. This relies on crawler technology to aggregate the prices of other platforms, which helps the platform set its own prices and gives consumers a reference. Examples: JD.COM, Suning.
- Data analysis: When we want to capture information from a website and analyze it for something we care about, a crawler is essential. Example: crawl Lianjia data and analyze the price trend of second-hand houses.
- Grabbing tickets: Have you ever found that tickets for Spring Festival travel or a concert are sold out? Scalpers may be involved, using crawler software to simulate human behavior and grab tickets. To prevent this, many websites add anti-crawler measures that raise the cost of crawling.
03
Common steps for a crawler
By now, are you eager to build a crawler yourself? Here Little fire dragon shares a fairly common set of crawler steps for your reference:
Step one: Find the URL of the website you need to crawl. Example: Lianjia.
Step two: View the page source (HTML). You can open it with the F12 shortcut key.
Step three: Find the location of the content you want to crawl. Example: house prices.
Step four: Use Python code to request the website, fetch the page, and parse it. The code will be shared in the next article, the "implementation part".
Step five: Store the crawled content locally.
That is all for this issue.
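The five steps above can be sketched roughly as follows. This is only an illustrative outline using Python's standard library, not the author's implementation. The sample HTML, the `price` class name, the `my-crawler` user agent, and the output file name are all assumptions. The network request in `fetch_html` is defined but not called, so the sketch runs offline on the sample page.

```python
import csv
import os
import tempfile
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_html(url):
    """Steps 1 and 4: request the page at `url` and return its HTML.

    Not called below, so the example runs without network access.
    """
    req = Request(url, headers={"User-Agent": "my-crawler"})
    with urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


class PriceParser(HTMLParser):
    """Steps 3 and 4: collect the text of every <span class="price">."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "price" in classes:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


def extract_prices(html):
    """Parse the HTML and return the list of price strings found."""
    parser = PriceParser()
    parser.feed(html)
    return parser.prices


def save_prices(prices, path):
    """Step 5: store the crawled content locally as a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["price"])
        writer.writerows([p] for p in prices)


# Step 2 stand-in: a made-up page source, as you might see it via F12.
SAMPLE_HTML = """
<ul>
  <li><span class="title">Listing A</span> <span class="price">320</span></li>
  <li><span class="title">Listing B</span> <span class="price">455</span></li>
</ul>
"""

prices = extract_prices(SAMPLE_HTML)
out_path = os.path.join(tempfile.gettempdir(), "prices.csv")
save_prices(prices, out_path)
```

To run this against a real site, you would replace `SAMPLE_HTML` with `fetch_html(url)` and adjust the tag and class checks in `PriceParser` to match the location found in step three.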