Crawlers from Getting Started to Giving Up 01: Hello, Crawler!
2022-06-24 13:22:00 【Call me ah Qi】
Preface
At the beginning of 2018, during my internship, I started working with Java crawlers because the job required it, and crawled 1.63 million POI records from a website. That was the first crawler of my life, and also my only Java crawler. Those POI data later became part of my graduation project. After that I began learning Python crawlers and the crawler framework Scrapy. Scrapy in particular took me more than a month to study, and I used it to build an ICA (Internet content recognition) repository holding tens of millions of records.
The main purpose of writing this crawler series is to record my own experience of learning crawlers and the problems I ran into, and hopefully to offer some inspiration to crawler beginners. When I introduced crawlers to my colleagues a while back, I made the first PPT of my life, so this series will be organized around that PPT.
Series structure
As shown in the figure, the series covers crawlers from four aspects.
- Crawler basics: the basic concepts of crawlers, the technology stack, crawler development, and so on.
- Anti-crawling techniques: common anti-crawler measures and how to deal with them.
- The Scrapy framework: the best crawler framework available at the moment, and the focus of this series.
- Risk avoidance: how to write a compliant crawler and how to avoid data-related risks.
Introduction
Many people, myself included, find crawlers hazy and out of reach when they first hear about them. Many also think that only programmers ever need crawlers, but that is not the case. At the very least, Python's abilities in document processing and crawling are useful in everyday work.
For example: someone needs to paste hundreds of records from various websites into Excel. With a crawler, requests plus pandas or xlwt gets it done in a few dozen lines of code. I used to have to fill in three documents from a template and upload them every day, which took four or five minutes of copying back and forth; out of laziness I later wrote a Python program and packaged it as an exe, and now one click finishes the job in seconds. So Python makes daily work more efficient, and it is worth learning for more people, not just developers.
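As a rough sketch of the first scenario — the URL and the CSS selectors below are made up purely for illustration — a crawler that collects a list from a page and dumps it into Excel can be as short as this:

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical list page; replace with the site you actually copy data from
url = 'https://example.com/list'
html = requests.get(url).text

# Extract the fields we care about (the selectors are placeholders)
soup = BeautifulSoup(html, 'html.parser')
rows = [
    {
        'title': item.select_one('.title').get_text(strip=True),
        'price': item.select_one('.price').get_text(strip=True),
    }
    for item in soup.select('.item')
]

# Write everything into an Excel file instead of pasting by hand
pd.DataFrame(rows).to_excel('result.xlsx', index=False)
```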
This article covers the first part: an introduction to crawlers.
The concept of a crawler
What is a crawler?
This was the first question I had when I started learning crawler development. However the Internet describes it — spider, crawler, or robot — my understanding is this: a crawler is a program that simulates human behavior to get data from web pages. More concretely: in Java the crawler is jsoup.jar, in Python it is the requests module, and even the curl command in a shell can be regarded as a crawler.
A crawler library can be divided into two parts. The first is the request part, responsible for requesting the data, for example Python's requests; the second is the parsing part, responsible for parsing the HTML to extract the data, for example Python's BS4 (BeautifulSoup).
What does a crawler do?
It gets data from web pages by imitating human behavior. A person first opens a browser and enters a URL; the browser fetches the page from the website's backend and renders it, and only then can the person read the data. The request part of a crawler plays the role of the browser: it fetches the HTML from the website's backend for the URL you give it, and the parsing part then extracts the data from that HTML according to preset rules.
The developer's work is therefore twofold. The first is dressing up the request part, for example adding a User-Agent, Cookie, and so on to the request headers, so that the website thinks a person is visiting through a browser rather than a program. The second is writing extraction rules with selectors to get the data out of the page.
The figure shows the request headers sent by a browser.
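As a minimal sketch of the first task — the User-Agent string is a typical Chrome value, and the cookie is a placeholder you would copy from your own browser via F12 — setting request headers with requests looks like this:

```python
import requests

url = 'https://v.qq.com/detail/m/m441e3rjq9kwpsc.html'

# Make the program look like a browser; the values are placeholders copied from F12
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/103.0.0.0 Safari/537.36'),
    'Cookie': 'paste_your_cookie_here',
}

response = requests.get(url, headers=headers)
print(response.status_code)
```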
Technology stack
What skills do you need to write a crawler? Is it something only experts can do? Not at all. The requirements fall into two levels.
Basic requirements
Programming language: a foundation in Java or Python is enough, together with the ability to read basic HTML and to use CSS selectors, XPath selectors, and regular expressions.
Data storage: crawled data is only meaningful once it is stored. Data can be saved to files or to a database, which requires the developer to be able to read and write files or operate a database. For databases, basic table design plus create/read/update/delete (CRUD) skills are enough.
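For instance, here is a minimal sketch of saving crawled records with Python's built-in sqlite3 module; the table name and fields are made up for illustration:

```python
import sqlite3

# Open (or create) a local SQLite database file
conn = sqlite3.connect('crawler.db')
cur = conn.cursor()

# A made-up table for crawled video information
cur.execute('CREATE TABLE IF NOT EXISTS video (name TEXT, category TEXT, area TEXT)')

# Insert one record; in a real crawler this would run once per parsed page
cur.execute('INSERT INTO video VALUES (?, ?, ?)',
            ('Douluo Continent', 'Animation', 'Chinese mainland'))
conn.commit()
conn.close()
```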
Developer tools: the tool crawler developers use most; press F12 in any browser and it pops up. It is usually used to inspect requests, locate elements, and view JS source files.
Advanced requirements
All kinds of problems come up during crawler development, so you need to be able to think and solve problems independently. At present, many websites load data asynchronously or encrypt it with JS, so knowledge of Ajax and JS is required.
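When data is loaded asynchronously, it usually comes from a separate JSON interface that you can spot under the Network tab of the developer tools and request directly. A rough sketch, with a hypothetical endpoint and field name:

```python
import requests

# Hypothetical Ajax endpoint found under the Network tab (F12)
api_url = 'https://example.com/api/list?page=1'

response = requests.get(api_url)
data = response.json()              # the interface returns JSON rather than HTML
for item in data.get('items', []):  # 'items' is a made-up field name
    print(item)
```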
Network knowledge. The basic status codes: 20x success, 30x redirection, 40x the requested resource does not exist, 50x server-side problems. Sometimes TCP knowledge is needed as well, for example what TCP connection states such as ESTABLISHED and TIME_WAIT mean.
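In code this mostly comes down to checking the status code before parsing; a small sketch with requests (which follows 30x redirects by default):

```python
import requests

response = requests.get('https://v.qq.com/detail/m/m441e3rjq9kwpsc.html')
code = response.status_code

if 200 <= code < 300:       # 20x: success, safe to parse
    html = response.text
elif 300 <= code < 400:     # 30x: redirect (rare here, since requests follows them)
    print('redirected to', response.url)
elif 400 <= code < 500:     # 40x: the resource does not exist or the request was rejected
    print('client error', code)
else:                       # 50x: server-side problem, worth retrying later
    print('server error', code)
```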
Crawler development
Now that the basic concepts are out of the way, how do you actually develop a crawler? Let's take an example:
As shown in the figure, this is the broadcast page of the animation Douluo Continent on Tencent Video. Let's take it as the example and develop a crawler to get the page data.
Java crawler
Java crawler development mainly uses Jsoup.
Add the Jsoup dependency:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.2</version>
</dependency>

Application development:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class JavaCrawler {
    public static void main(String[] args) throws IOException {
        String url = "https://v.qq.com/detail/m/m441e3rjq9kwpsc.html";
        // Send the request and fetch the page
        Document document = Jsoup.connect(url).get();
        // Parse the HTML and extract the data
        Element body = document.body();
        Element common = body.getElementsByClass("video_title_cn").get(0);
        String name = common.getElementsByAttribute("_stat").text();
        String category = common.getElementsByClass("type").text();
        Elements type_txt = body.getElementsByClass("type_txt");
        String alias = type_txt.get(0).text();
        String area = type_txt.get(1).text();
        String parts = type_txt.get(2).text();
        String date = type_txt.get(3).text();
        String update = type_txt.get(4).text();
        String tag = body.getElementsByClass("tag").text();
        String describe = body.getElementsByClass("_desc_txt_lineHight").text();
        System.out.println(name + "\n" + category + "\n" + alias + "\n" + area + "\n"
                + parts + "\n" + date + "\n" + update + "\n" + tag + "\n" + describe);
    }
}

Python crawler
For Python crawler development, we use the requests and bs4 (BeautifulSoup) modules.
Install the modules:
pip install requests beautifulsoup4
Application development:
import requests
from bs4 import BeautifulSoup

url = 'https://v.qq.com/detail/m/m441e3rjq9kwpsc.html'
# Send the request and fetch the page
response = requests.get(url)
# Parse the HTML and extract the data
soup = BeautifulSoup(response.text, 'html.parser')
name = soup.select(".video_title_cn a")[0].string
category = soup.select("span.type")[0].string
alias = soup.select("span.type_txt")[0].string
area = soup.select("span.type_txt")[1].string
parts = soup.select("span.type_txt")[2].string
date = soup.select("span.type_txt")[3].string
update = soup.select("span.type_txt")[4].string
tag = soup.select("a.tag")[0].string
describe = soup.select("span._desc_txt_lineHight")[0].string
print(name, category, alias, area, parts, date, update, tag, describe, sep='\n')

The two programs above print the same result:
With that, the Douluo Continent crawler is finished. As you can see from the code, the request part takes only one line; most of the code is the parsing part, which uses CSS selectors to extract the data.
Let's also take a look at the content of the request part:
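Besides the browser view, a small sketch of how to see it from the code side — requests keeps the headers it actually sent on the response object:

```python
import requests

response = requests.get('https://v.qq.com/detail/m/m441e3rjq9kwpsc.html')

# Headers that requests actually attached to this request
print(response.request.headers)
# Final URL and status code of the response
print(response.url, response.status_code)
```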
Of course, besides the modules above, a complete crawler also needs a storage module, and a proxy-pool module when necessary. Furthermore, crawling the data of an entire large website requires a depth-first or breadth-first traversal of the site, and you also have to consider how to resume from a breakpoint if the crawler is interrupted. These topics will be covered in later articles.
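As a very rough illustration of the breakpoint idea — the checkpoint file name and the URL list here are made up for the sketch — one simple approach is to persist the URLs that have already been crawled and skip them after a restart:

```python
import requests

DONE_FILE = 'done_urls.txt'   # hypothetical checkpoint file

# Load the URLs finished in previous runs
try:
    with open(DONE_FILE) as f:
        done = {line.strip() for line in f}
except FileNotFoundError:
    done = set()

# Made-up list of pages to crawl
urls = ['https://example.com/page/%d' % i for i in range(1, 101)]

with open(DONE_FILE, 'a') as f:
    for url in urls:
        if url in done:
            continue                    # already crawled before the interruption
        html = requests.get(url).text
        # ... parse and store `html` here ...
        f.write(url + '\n')             # record progress so a restart can resume
```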
Conclusion
This article does not go deep into program development; it only covers the concept of a crawler and a demonstration program. The next article will build on the programs above and dig into Jsoup, the requests and bs4 modules, and the use of CSS selectors. See you next time.