
Crawlers: From Getting Started to Giving Up 01 - Hello, Crawler!

2022-06-24 13:22:00 Call me ah Qi

Preface

At the beginning of 2018, during my internship, work requirements led me to Java crawlers, and I crawled 1.63 million POI records from a website. That was the first crawler I ever wrote, and also my only Java crawler; those POI data later became part of my graduation project. After that I started learning Python crawlers and the crawler framework Scrapy. Scrapy in particular took me more than a month to learn, and with it we built an ICA (Internet content analysis) repository holding tens of millions of records.

The main purpose of writing this crawler series is to record my own experience of learning crawlers and the problems I ran into along the way, and hopefully to offer some inspiration to crawler beginners. When I introduced crawlers to my colleagues earlier, I made the first PPT of my life, so this series will roughly follow that PPT.

Series structure

Contents

As shown in the figure, we will introduce crawlers from four aspects.

  1. Getting started with crawlers: basic crawler concepts, the technology stack, and crawler program development.
  2. Anti-crawling techniques: common anti-crawler techniques and how to deal with them.
  3. The Scrapy framework: the best crawler framework available at the moment, and the focus of this series.
  4. Risk avoidance: how to write a well-behaved crawler and how to avoid data-related risks.

Preface

Many people, including me, feel that crawlers are something hazy and out of reach when they first hear about them. Many people also think that only programmers need crawlers, but that is not the case. At the very least, Python's ability to process documents and crawl data is useful in everyday work.

For example: someone needs to copy hundreds of pieces of data from various websites into Excel. With a crawler, a bit of requests plus pandas or xlwt is enough, a few dozen lines of code. I used to have to fill in three documents from a template and upload them every day, which took four or five minutes of copying and pasting; later, out of laziness, I wrote a Python program and packaged it as an exe, and now it is done with one click in a few seconds. So Python makes everyday work more efficient, and it is worth learning for more people.
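
As a rough sketch of how short such a task can be, the snippet below (the URL, table layout, and file name are made-up placeholders, not from any real site) pulls an HTML table with pandas and writes it straight to an Excel file:

import pandas as pd

# Hypothetical page containing an HTML <table>; replace with the real source.
# read_html needs an HTML parser installed (lxml or html5lib).
url = "https://example.com/report.html"

# read_html returns a list of DataFrames, one per <table> on the page.
tables = pd.read_html(url)
df = tables[0]

# Write the first table to an Excel file (needs openpyxl installed).
df.to_excel("report.xlsx", index=False)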

This article starts with the first chapter: getting started with crawlers.

Getting started with crawlers

The concept of a crawler

Concept

What is a crawler?

This was the first question I had when I started learning crawler development. No matter how the Internet describes crawlers, whether as spiders, crawlers, or robots, my understanding is this: a program that simulates human behavior and fetches data from web pages. More concretely: in Java a crawler means Jsoup.jar, in Python it means the requests module, and even the curl command in the shell can be seen as a crawler.

A crawler library can be split into two parts. One is the request part, responsible for requesting data, for example Python's requests; the other is the parsing part, responsible for parsing the HTML and extracting data, for example Python's BS4.
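
A minimal sketch of the two halves working together, assuming a placeholder URL rather than any page from this article:

import requests
from bs4 import BeautifulSoup

# Request part: fetch the raw HTML of the page.
response = requests.get("https://example.com")

# Parsing part: build a document tree and pull data out with a selector.
soup = BeautifulSoup(response.text, "html.parser")
print(soup.select_one("title").text)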

What does a crawler do?

A crawler gets data from web pages by imitating human behavior. A person has to open a browser and enter a URL first; the browser fetches the page from the website's backend and renders it, and only then can the person see the data. The request part of a crawler plays the browser's role: it fetches the HTML from the website's backend according to the URL you give it. The parsing part then extracts data from that HTML according to preset rules.

The developer's work is, first, to dress up the request part, for example by adding a User-Agent, Cookie, and so on to the request headers, so that the website thinks it is a person visiting through a browser rather than a program; and second, to write extraction rules with selectors that pull data out of the page.
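
As a hedged sketch of what dressing up the request looks like with requests (the header values are placeholders; a real Cookie would be copied from your own browser session):

import requests

headers = {
    # A desktop-browser User-Agent; copy a real one from your browser's developer tools.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0 Safari/537.36",
    # Optional: a Cookie copied from a logged-in browser session.
    "Cookie": "session_id=placeholder",
}

response = requests.get("https://example.com", headers=headers)
print(response.status_code)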

These are the request headers sent by a browser.

Request header

Technology stack

Technology stack

What skills do you need to write crawlers? Is it something only experts can do? Not at all. There are two levels of requirements.

The basic requirements

Programming language: a basic grounding in Java or Python is enough, plus the ability to read basic HTML and to use CSS selectors, XPath selectors, and regular expressions.
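
As a small illustration of what those three extraction skills look like on the same HTML fragment (the fragment is made up; the XPath line additionally assumes lxml is installed):

import re
from bs4 import BeautifulSoup
from lxml import html

doc = '<div class="item"><a href="/detail/1">Douluo Continent</a></div>'

# CSS selector via BeautifulSoup
print(BeautifulSoup(doc, "html.parser").select_one("div.item a").text)

# XPath via lxml
print(html.fromstring(doc).xpath('//div[@class="item"]/a/text()')[0])

# Regular expression
print(re.search(r'<a href="[^"]*">([^<]+)</a>', doc).group(1))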

Data storage: crawled data only has value once it is stored. Data can be saved to files or to a database, which requires the ability to read and write files or operate a database. For databases, the ability to design a basic table structure and to do basic inserts, deletes, updates, and queries is enough.
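
A minimal storage sketch, assuming SQLite (bundled with Python) and a hypothetical video table whose columns mirror the fields crawled later in this article:

import sqlite3

# Connect to (or create) a local database file.
conn = sqlite3.connect("crawler.db")
cur = conn.cursor()

# Basic table design: one row per crawled video.
cur.execute("""
    CREATE TABLE IF NOT EXISTS video (
        id       INTEGER PRIMARY KEY AUTOINCREMENT,
        name     TEXT,
        category TEXT,
        area     TEXT
    )
""")

# Insert one crawled record (the values here are placeholders).
cur.execute(
    "INSERT INTO video (name, category, area) VALUES (?, ?, ?)",
    ("Douluo Continent", "Animation", "Chinese mainland"),
)
conn.commit()
conn.close()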

Developer tools: the tool crawler developers use most is the developer tools panel that pops up when you press F12 in any browser. It is usually used to capture requests, locate elements, and view JS source files.

Developer tools

Advanced requirements

All kinds of problems come up during crawler development, so you need the ability to think through and solve problems independently. Nowadays many websites load data asynchronously or encrypt it with JS, so knowledge of Ajax and JS is required.

Network knowledge. Basic status codes: 2xx success, 3xx redirect, 4xx the requested resource does not exist (client error), 5xx server-side problem. Sometimes TCP knowledge is also needed, for example what TCP connection states such as ESTABLISHED and TIME_WAIT mean.
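
For example, a small hedged sketch of acting on the status code before parsing (the URL is a placeholder; note that requests follows most redirects automatically, so 3xx codes are rarely seen directly):

import requests

response = requests.get("https://example.com", timeout=10)
code = response.status_code

if 200 <= code < 300:
    print("2xx: success, safe to parse the body")
elif 300 <= code < 400:
    print("3xx: redirect (requests usually follows these for you)")
elif 400 <= code < 500:
    print("4xx: bad request or the resource does not exist")
else:
    print("5xx: something went wrong on the server side")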

Developing a crawler

With the basic concepts covered, how do you actually develop a crawler? Let's work through an example:

Douluo Continent

As shown in the picture, this is the playback page of the animation Douluo Continent, with its desolate starfields and hazy moonlight. We will take this page as an example and develop crawlers to fetch its data.

Java crawler

Java crawler development mainly uses Jsoup.

Add the Jsoup dependency:

    <dependency>
      <groupId>org.jsoup</groupId>
      <artifactId>jsoup</artifactId>
      <version>1.11.2</version>
    </dependency>

Application development:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;

public class JavaCrawler {
    public static void main(String[] args) throws IOException {
        String url = "https://v.qq.com/detail/m/m441e3rjq9kwpsc.html";
        // Send the request and fetch the page
        Document document = Jsoup.connect(url).get();
        // Parse the HTML and extract the data
        Element body = document.body();
        Element common = body.getElementsByClass("video_title_cn").get(0);
        String name = common.getElementsByAttribute("_stat").text();
        String category = common.getElementsByClass("type").text();
        Elements type_txt = body.getElementsByClass("type_txt");
        String alias = type_txt.get(0).text();
        String area = type_txt.get(1).text();
        String parts = type_txt.get(2).text();
        String date = type_txt.get(3).text();
        String update = type_txt.get(4).text();
        String tag = body.getElementsByClass("tag").text();
        String describe = body.getElementsByClass("_desc_txt_lineHight").text();
        System.out.println(name + "\n" + category + "\n" + alias + "\n" + area + "\n" + parts + "\n" + date + "\n" + update + "\n" + tag + "\n" + describe);

    }
}

Python crawler

Python crawler development uses the requests and bs4 modules.

Install the modules:

pip install requests bs4

Application development:

import requests
from bs4 import BeautifulSoup

url = 'https://v.qq.com/detail/m/m441e3rjq9kwpsc.html'
# Send the request and fetch the page
response = requests.get(url)
# Parse the HTML and extract the data
soup = BeautifulSoup(response.text, 'html.parser')
name = soup.select(".video_title_cn a")[0].string
category = soup.select("span.type")[0].string
alias = soup.select("span.type_txt")[0].string
area = soup.select("span.type_txt")[1].string
parts = soup.select("span.type_txt")[2].string
date = soup.select("span.type_txt")[3].string
update = soup.select("span.type_txt")[4].string
tag = soup.select("a.tag")[0].string
describe = soup.select("span._desc_txt_lineHight")[0].string
print(name, category, alias, area, parts, date, update, tag, describe, sep='\n')

Both programs produce the same output:

Output results

With that, the Douluo Continent crawler is done. As you can see from the code, the request part is only one line; most of the code is the parsing part, which here uses CSS selectors to extract the data.

Now let's take a look at what the request part returns:

Request response content

Of course, a complete crawler program needs more than the modules above: it also needs a storage module, and a proxy-pool module when necessary. Furthermore, crawling the data of an entire large website requires depth-first or breadth-first traversal of the site, and you also have to consider how to resume from a breakpoint if the crawler is interrupted, among other design questions. These topics will be covered later in the series.
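
As one hedged example of the "when necessary" pieces, requests can route traffic through a proxy pool via its proxies parameter; the pool below is a hypothetical hard-coded list, whereas a real project would fetch live proxies from a service or a self-built pool:

import random
import requests

# Hypothetical proxy pool; replace with addresses from a real proxy source.
PROXY_POOL = [
    "http://127.0.0.1:8001",
    "http://127.0.0.1:8002",
]

def fetch(url):
    proxy = random.choice(PROXY_POOL)
    # Route both http and https traffic through the chosen proxy.
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)

response = fetch("https://example.com")
print(response.status_code)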

Conclusion

This article does not go deep into program development; it only covers the concept of a crawler and a demonstration program. The next article will pick up where these programs leave off and dig into Jsoup, the requests and bs4 modules, and the use of CSS selectors. See you next time.


Copyright notice: this article was created by [Call me ah Qi]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2021/05/20210525115039634r.html