Getting started with Scrapy
2022-07-25 04:43:00 【HHYZBC】
Install Scrapy

pip install scrapy

Scrapy development workflow:
- Create a project
- Generate a spider
- Extract the data
- Save the data
Create a project
Command to create a Scrapy project:
scrapy startproject <project_name>
Example:
scrapy startproject myspider
The generated directory and file structure is as follows:
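With a recent Scrapy version, the generated layout looks roughly like this (the exact set of files can vary by version; the comments are my annotations, not part of the output):

```
myspider/
├── scrapy.cfg          # deployment configuration
└── myspider/
    ├── __init__.py
    ├── items.py        # item definitions
    ├── middlewares.py  # downloader/spider middlewares
    ├── pipelines.py    # item pipelines
    ├── settings.py     # project settings
    └── spiders/
        └── __init__.py
```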
Create a spider
Command (run inside the project directory):
scrapy genspider <spider_name> <allowed_domain>
Spider name: used as the argument when the spider is run.
Allowed domain: the crawl scope set for the spider. Once set, it is used to filter URLs: any crawled URL outside the allowed domain is filtered out.
Example:
scrapy genspider itcast itcast.cn
The generated directory and file structure is as follows:

Complete the spider
Modify /myspider/myspider/spiders/itcast.py as follows:

import scrapy

class ItcastSpider(scrapy.Spider):  # inherits from scrapy.Spider
    # spider name
    name = 'itcast'
    # allowed crawl scope
    allowed_domains = ['itcast.cn']
    # starting URL addresses
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml']

    # data-extraction method; receives the response from the downloader middleware
    def parse(self, response):
        # Scrapy's response object supports xpath directly
        names = response.xpath('//div[@class="tea_con"]//li/div/h3/text()')
        print(names)

        # to get the concrete text data, do the following
        # group the nodes
        li_list = response.xpath('//div[@class="tea_con"]//li')
        for li in li_list:
            # create a data dictionary
            item = {}
            # locate elements with Scrapy's wrapped xpath selector,
            # then get the result via extract() or extract_first()
            item['name'] = li.xpath('.//h3/text()').extract_first()   # teacher's name
            item['level'] = li.xpath('.//h4/text()').extract_first()  # teacher's title
            item['text'] = li.xpath('.//p/text()').extract_first()    # teacher's introduction
            print(item)

Note:
- A scrapy.Spider subclass must have a parsing method named parse
- If the site structure is complex, you can also define additional parsing functions
- A URL extracted in a parsing function must belong to allowed_domains before a request can be sent for it; the URLs in start_urls, however, are not subject to this restriction
- When starting the spider, mind the starting location: it is started from the project path
- Use yield in parse() to return data; note that the only objects yield may return in a parsing function are BaseItem, Request, dict, or None
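To see why parse() is written with yield, here is a minimal pure-Python sketch with no Scrapy dependency (the parse function and its input rows are stand-ins, not Scrapy's API): a generator hands each item back to the caller one at a time instead of building a full list first.

```python
# Hypothetical stand-in for a Scrapy parse() method: a generator that
# yields one item dict at a time, as the engine expects.
def parse(rows):
    for row in rows:
        item = {'name': row.get('name')}
        yield item  # each yielded dict is handed over one by one

# A generator produces nothing until it is iterated:
items = list(parse([{'name': 'Alice'}, {'name': 'Bob'}]))
print(items)  # [{'name': 'Alice'}, {'name': 'Bob'}]
```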
Methods for locating elements and extracting data or attribute values
- The response.xpath method returns a list-like type containing selector objects; it can be operated on like a list, but has some extra methods
- Extra method extract(): returns a list of strings
- Extra method extract_first(): returns the first string in the list; if the list is empty, it returns None
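To make the difference concrete, here is a hedged pure-Python sketch of the semantics (not Scrapy's actual implementation): extract() corresponds to the whole list of strings, while extract_first() takes the first element, falling back to None when nothing matched.

```python
def extract_first(results, default=None):
    # mimic SelectorList.extract_first(): first match, or the default (None)
    return results[0] if results else default

print(extract_first(['Alice', 'Bob']))  # Alice
print(extract_first([]))                # None
```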
Common attributes of the response object
- response.url: URL of the current response
- response.request.url: URL of the request corresponding to the current response
- response.headers: response headers
- response.request.headers: request headers of the current response
- response.body: response body, i.e. the HTML code, of type bytes
- response.status: response status code
Save the data
Operations on the data are defined in the pipelines.py file:
- Define a pipeline class
- Override the process_item method
- After process_item finishes processing the item, it must return it to the engine
Note that after the pipeline is defined, it also needs to be enabled in the configuration.
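A minimal sketch of such a pipeline class, assuming the item is a plain dict with a 'name' key as produced by the spider above (the cleanup it does is illustrative; the essential part is that process_item returns the item):

```python
class ItcastPipeline:
    def process_item(self, item, spider):
        # any per-item processing goes here; this sketch just strips whitespace
        item['name'] = (item.get('name') or '').strip()
        # must return the item so the engine can pass it on
        return item

pipeline = ItcastPipeline()
print(pipeline.process_item({'name': '  Alice  '}, spider=None))  # {'name': 'Alice'}
```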
Enable the pipeline in settings.py:

ITEM_PIPELINES = {
    'myspider.pipelines.ItcastPipeline': 400
}

The key of the configuration entry is the pipeline class to use, written as a dotted path: the first part is the project directory, the second the file, and the third the pipeline class defined there.
The value of the configuration entry is the order in which the pipelines run; the smaller the value, the higher the priority. This value is usually set within 1000.
Run Scrapy
Command (run in the project directory): scrapy crawl <spider_name>
Example: scrapy crawl itcast