
Scrapy crawler framework

2022-07-03 18:50:00 Hua Weiyun

Scrapy

The crawling workflow of the scrapy framework

(screenshot omitted)

An introduction to the components of the scrapy framework

Each of the steps above corresponds to one component. The components have no direct connection to each other; the scrapy engine links them together and passes data between them. The engine is already implemented by the scrapy framework. What you normally write by hand are the spider and the pipeline; for complex crawling projects you can also write downloader and spider middlewares to meet more complex business needs.

(screenshot omitted)

Basic usage of the scrapy framework

After installing the scrapy package, enter the following commands directly in a terminal.

  1. Create a scrapy project

scrapy startproject myspider

  2. Generate a spider

scrapy genspider itcast itcast.cn

  3. Extract the data

Flesh out the spider, using xpath selectors and the like

  4. Save the data

Implement it in the pipeline

  5. Start the crawler

scrapy crawl itcast

The basic workflow of using the scrapy framework

  1. Create the scrapy project; a series of py files and configuration files are generated automatically

(screenshot omitted)
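For reference, the project layout generated by startproject typically looks like this (annotations added; the exact files can vary slightly with the scrapy version):

myspider/
    scrapy.cfg            # deployment configuration
    myspider/
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # the spiders you generate live here
            __init__.py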

  2. Create a spider with a custom name, specifying the domain to crawl (optional)

(screenshot omitted)
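The generated spider is only a skeleton. For the itcast example above it looks roughly like this (the exact template differs slightly between scrapy versions):

import scrapy


class ItcastSpider(scrapy.Spider):
    name = 'itcast'
    allowed_domains = ['itcast.cn']
    start_urls = ['http://itcast.cn/']

    def parse(self, response):
        pass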

  3. Write code to flesh out the custom spider and achieve the desired result

(screenshot omitted)

  4. Use yield to pass the parsed data to the pipeline

(screenshot omitted)
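A minimal sketch of steps 3 and 4 combined; the page url and xpath selectors below are illustrative placeholders, not taken from the original screenshots:

import scrapy


class ItcastSpider(scrapy.Spider):
    name = 'itcast'
    allowed_domains = ['itcast.cn']
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml']

    def parse(self, response):
        # one item per content block (the selector is an assumed example)
        for node in response.xpath("//div[@class='li_txt']"):
            item = {}
            item["name"] = node.xpath("./h3/text()").extract_first()
            item["title"] = node.xpath("./h4/text()").extract_first()
            # yield hands the parsed data to the engine, which passes it to the pipeline
            yield item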

  5. Use the pipeline to store the data (to operate on data in a pipeline it must be enabled in settings.py; it is off by default)

(screenshot omitted)
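Step 5 in sketch form: a pipeline that receives the yielded items, together with the settings.py entry that turns it on (class and module names assume the myspider project above):

# pipelines.py
class MyspiderPipeline:
    def process_item(self, item, spider):
        print(item)      # store or transform the item here
        return item      # return the item so later pipelines can still see it


# settings.py -- pipelines stay disabled until this dict is present
ITEM_PIPELINES = {
    'myspider.pipelines.MyspiderPipeline': 300,   # lower number = runs earlier
}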

  6. A few points to watch out for when using pipelines

(screenshot omitted)
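As a general summary of standard scrapy pipeline behaviour, the points usually amount to the following (shown as a commented sketch; the class name is a placeholder):

class CheckPipeline:
    def process_item(self, item, spider):
        # 1. process_item must return the item, otherwise pipelines that run
        #    later (larger number in ITEM_PIPELINES) never receive it.
        # 2. several pipelines can coexist; the integer weight in ITEM_PIPELINES
        #    decides the order (smaller runs first).
        # 3. when one project contains several spiders, use spider.name to give
        #    each spider its own handling.
        if spider.name == "itcast":
            print(item)
        return item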

Using the logging module

In scrapy:

In settings, set LOG_LEVEL = "WARNING"

In settings, set LOG_FILE = "./a.log"  # path and file name of the log file; the log output then no longer appears in the terminal

import logging, instantiate a logger, and use that logger to emit output from any file
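A short sketch of that pattern inside a spider (module and spider names here are placeholders):

import logging

import scrapy

logger = logging.getLogger(__name__)


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # with LOG_LEVEL = "WARNING", only warnings and above are shown or stored
        logger.warning("parsed %s", response.url)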

In an ordinary (non-scrapy) project:

import logging

logging.basicConfig(...)  # configure the style and format of the log output

Instantiate a logger: logger = logging.getLogger(__name__)

Then call the logger in any py file.
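A minimal sketch for an ordinary project (the format string is just an example):

import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(filename)s [%(levelname)s]: %(message)s",
)
logger = logging.getLogger(__name__)

logger.info("logging is configured")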

Implementing page-turning requests in scrapy

Case study: crawling Tencent recruitment

Because the mainstream trend is now to separate front end and back end, a plain GET of the website only returns a pile of html tags with no data in them; the data shown on the page is fetched by js from a back-end interface and then spliced into the html. So you cannot simply request the site's address. Instead, use the chrome developer tools to find the back-end interface address the site requests, and request that address.

By comparing the query strings of the site's requests to the back-end interface, determine the url you need to request.

On the Tencent recruitment site, turning pages through the job listings is also done by requesting the back-end interface, so crawling page by page really means requesting the back-end interface with a different query string each time.

Spider code:

import scrapy
import random
import json


class TencenthrSpider(scrapy.Spider):
    name = 'tencenthr'
    allowed_domains = ['tencent.com']
    start_urls = ['https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1614839354704&parentCategoryId=40001&pageIndex=1&pageSize=10&language=zh-cn&area=cn']
    # start_urls = "https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1614839354704&parentCategoryId=40001&pageIndex=1&pageSize=10&language=zh-cn&area=cn"

    def parse(self, response):
        # The request goes to a back-end interface, so the response is json:
        # take response.text and convert it to a dict for easier handling.
        gr_list = response.text
        gr_dict = json.loads(gr_list)

        # Paging works by changing pageIndex in the query string, so read the
        # current index from the request url; the next request just adds one.
        start_url = str(response.request.url)
        start_index = int(start_url.find("Index") + 6)
        mid_index = int(start_url.find("&", start_index))
        num_ = start_url[start_index:mid_index]

        # The returned json reports how many records there are in total.
        temp = gr_dict["Data"]["Count"]

        # Define a dictionary to carry the data.
        item = {}
        for i in range(10):
            # Fill in the required fields by indexing into the dict.
            item["Id"] = gr_dict["Data"]["Posts"][i]["PostId"]
            item["Name"] = gr_dict["Data"]["Posts"][i]["RecruitPostName"]
            item["Content"] = gr_dict["Data"]["Posts"][i]["Responsibility"]
            item["Url"] = "https://careers.tencent.com/jobdesc.html?postid=" + gr_dict["Data"]["Posts"][i]["PostId"]
            # Hand the item to the engine.
            yield item

        # Next url: the timestamp in the query string is just a 13-digit number,
        # so fake one from two random integers.
        rand_num1 = random.randint(100000, 999999)
        rand_num2 = random.randint(1000000, 9999999)
        rand_num = str(rand_num1) + str(rand_num2)

        # Work out the pageIndex value of the next page.
        nums = int(start_url[start_index:mid_index]) + 1
        if nums > int(temp) / 10:
            pass
        else:
            nums = str(nums)
            next_url = 'https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=' + rand_num + '&parentCategoryId=40001&pageIndex=' + nums + '&pageSize=10&language=zh-cn&area=cn'
            # Wrap the next url in a Request object and hand it to the engine.
            yield scrapy.Request(next_url, callback=self.parse)

Pipeline code:

import csv
import os


class TencentPipeline:
    def process_item(self, item, spider):
        # Append every item to a csv file.
        file_path = './tencent_hr.csv'
        # Write the header only when the file does not exist yet or is empty.
        write_header = not os.path.exists(file_path) or os.path.getsize(file_path) == 0
        with open(file_path, 'a+', encoding='utf-8') as file:
            fieldnames = ['Id', 'Name', 'Content', 'Url']
            writer = csv.DictWriter(file, fieldnames=fieldnames)
            if write_header:
                writer.writeheader()
            print(item)
            writer.writerow(item)
        return item

A supplement on scrapy.Request

(screenshot omitted)
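For reference, the commonly used arguments of scrapy.Request are sketched below in a small self-contained example (urls and names are placeholders); meta is how data such as an item is passed between callbacks:

import scrapy


class DetailSpider(scrapy.Spider):
    name = 'detail_example'
    start_urls = ['http://example.com/']

    def parse(self, response):
        item = {"page": response.url}
        yield scrapy.Request(
            "http://example.com/detail",   # url to request (placeholder)
            callback=self.parse_detail,    # method that handles the response
            meta={"item": item},           # passed to the callback via response.meta
            dont_filter=False,             # True bypasses the duplicate-request filter
            method="GET",
            headers=None,
            cookies=None,
        )

    def parse_detail(self, response):
        item = response.meta["item"]       # recover the data passed through meta
        yield item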

Using scrapy's item

(screenshot omitted)
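A minimal sketch of defining an item and filling it in a spider (field names are placeholders; the Yangguang case below shows the same pattern in full):

import scrapy


class ExampleItem(scrapy.Item):
    # declare every field the spider will fill
    title = scrapy.Field()
    url = scrapy.Field()


class ExampleItemSpider(scrapy.Spider):
    name = 'example_item'
    start_urls = ['http://example.com/']

    def parse(self, response):
        item = ExampleItem()
        # items behave like dicts, but only declared fields may be assigned
        item["title"] = response.xpath("//title/text()").extract_first()
        item["url"] = response.url
        yield item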

Case study: crawling political affairs information from the Sunshine Government network

To crawl the Sunshine Government network: the chrome developer tools show that this site's data is filled into the html in the normal way, so crawling it simply means parsing the html tag data.

Note, however, that pictures and other information on each item's detail page also have to be crawled, so pay attention to how the parse methods are written in the spider.

Spider code:

import scrapy
from yangguang.items import YangguangItem


class YangguanggovSpider(scrapy.Spider):
    name = 'yangguanggov'
    allowed_domains = ['sun0769.com']
    start_urls = ['http://wz.sun0769.com/political/index/politicsNewest?page=1']

    def parse(self, response):
        start_url = response.url
        # Parse the list page group by group.
        li_list = response.xpath("/html/body/div[2]/div[3]/ul[2]")
        for li in li_list:
            # Use the item class defined in items.py to carry the data.
            item = YangguangItem()
            item["Id"] = str(li.xpath("./li/span[1]/text()").extract_first())
            item["State"] = str(li.xpath("./li/span[2]/text()").extract_first()).replace(" ", "").replace("\n", "")
            item["Content"] = str(li.xpath("./li/span[3]/a/text()").extract_first())
            item["Time"] = li.xpath("./li/span[5]/text()").extract_first()
            item["Link"] = "http://wz.sun0769.com" + str(li.xpath("./li/span[3]/a[1]/@href").extract_first())
            # Visit the detail page of each entry and handle it in parse_detail,
            # passing the item along via scrapy's meta parameter.
            yield scrapy.Request(
                item["Link"],
                callback=self.parse_detail,
                meta={"item": item}
            )
        # Request the next page.
        start_url_page = int(str(start_url)[str(start_url).find("=") + 1:]) + 1
        next_url = "http://wz.sun0769.com/political/index/politicsNewest?page=" + str(start_url_page)
        yield scrapy.Request(
            next_url,
            callback=self.parse
        )

    # Parse the data of the detail page.
    def parse_detail(self, response):
        item = response.meta["item"]
        item["Content_img"] = response.xpath("/html/body/div[3]/div[2]/div[2]/div[3]/img/@src")
        yield item

Items code:

import scrapy


# Define the required fields in the item class.
class YangguangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    Id = scrapy.Field()
    Link = scrapy.Field()
    State = scrapy.Field()
    Content = scrapy.Field()
    Time = scrapy.Field()
    Content_img = scrapy.Field()

Pipeline code:

class YangguangPipeline:
    # Simply print the collected data.
    def process_item(self, item, spider):
        print(item)
        return item

Reading scrapy's debug information

(screenshot omitted)

By reading the debug information that scrapy prints, you can see the order in which scrapy starts its components; when something goes wrong, it helps you locate and solve the problem.


Going deeper into scrapy: scrapy shell

(screenshot omitted)

scrapy shell lets you try out and debug code without starting a spider; when you are not sure how an operation will behave, you can verify it in the shell first.
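A typical interactive session might look like this (the url and selectors are just examples):

scrapy shell "http://www.itcast.cn"
>>> response.url                                         # the url that was fetched
>>> response.status                                      # HTTP status code
>>> response.xpath("//title/text()").extract_first()     # try selectors interactively
>>> fetch("http://www.itcast.cn/channel/teacher.shtml")  # fetch another page in the same session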

Going deeper into scrapy: settings and pipelines

settings

(screenshot omitted)

An annotated look at a scrapy project's settings file:

# Scrapy settings for yangguang project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

# Project name
BOT_NAME = 'yangguang'

# Where the spider modules live
SPIDER_MODULES = ['yangguang.spiders']
# Where newly generated spiders are created
NEWSPIDER_MODULE = 'yangguang.spiders'

# Log output level
LOG_LEVEL = 'WARNING'

# The User-Agent header carried with every request
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.72 Safari/537.36 Edg/89.0.774.45'

# Whether to obey the robots.txt rules
ROBOTSTXT_OBEY = True

# Maximum number of simultaneous requests
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# Interval between requests; generally less used
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Whether cookies are enabled (enabled by default)
#COOKIES_ENABLED = False

# Whether the Telnet console is enabled (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Default request headers; do not also put the User-Agent here
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Whether the spider middlewares are enabled
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'yangguang.middlewares.YangguangSpiderMiddleware': 543,
#}

# Whether the downloader middlewares are enabled
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'yangguang.middlewares.YangguangDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Whether the item pipelines are enabled
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'yangguang.pipelines.YangguangPipeline': 300,
}

# AutoThrottle (automatic rate limiting) settings (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# HTTP cache settings (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

The pipeline

Besides the process_item method that is generated when the project is created, a pipeline can also define methods such as open_spider and close_spider, which are executed once when the spider starts and once when it closes.

Example code:

class YangguangPipeline:
    def process_item(self, item, spider):
        print(item)
        # If you do not return the item, pipelines with lower priority
        # (a larger number in ITEM_PIPELINES) will never receive it.
        return item

    def open_spider(self, spider):
        # Executed once when the spider is opened.
        # Add an attribute to the spider; afterwards it can be used in the
        # pipeline's process_item or inside the spider itself.
        spider.test = "hello"

    def close_spider(self, spider):
        # Executed once when the spider is closed.
        spider.test = ""

A supplement on mongodb

Items can be stored in MongoDB with the help of the third-party pymongo package.

(screenshot omitted)
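A sketch of a pipeline that writes items into MongoDB with pymongo (the database and collection names are made up; it assumes a MongoDB instance on localhost:27017):

import pymongo


class MongoPipeline:
    def open_spider(self, spider):
        # connect once when the spider starts
        self.client = pymongo.MongoClient("localhost", 27017)
        self.collection = self.client["spider_db"]["items"]

    def process_item(self, item, spider):
        # insert_one expects a plain dict
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()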

The crawlspider in scrapy

The command to generate a crawlspider:

scrapy genspider -t crawl <spider name> <domain to crawl>

(screenshots omitted)
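The generated crawlspider template looks roughly like this (names come from the generic template; the exact comments differ between scrapy versions):

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleCrawlSpider(CrawlSpider):
    name = 'example_crawl'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        # fill the item from the response here
        return item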

Using crawlspider

  • Create the spider: scrapy genspider -t crawl <spider name> <allowed domain>

  • Specify start_urls; the corresponding responses are run through rules to extract url addresses

  • Flesh out rules by adding Rule objects, for example:

Rule(LinkExtractor(allow=r'/web/site0/tab5240/info\d+.htm'), callback='parse_item'),

  • Notes:

If a url address is incomplete, crawlspider completes it automatically before sending the request
Do not define a parse method; crawlspider uses parse internally for its own logic
callback: the response of each url extracted by the link extractor is handed to this function
follow: whether the responses of the extracted urls are themselves run through the rules again to extract further urls

LinkExtractors (link extractors):

With LinkExtractors the programmer no longer has to extract the wanted urls by hand and then send the requests; that work is handed to the LinkExtractor, which finds every url on the page that matches the rules, enabling automatic crawling. The LinkExtractor class:

class scrapy.linkextractors.LinkExtractor(
    allow=(),
    deny=(),
    allow_domains=(),
    deny_domains=(),
    deny_extensions=None,
    restrict_xpaths=(),
    tags=('a', 'area'),
    attrs=('href',),
    canonicalize=True,
    unique=True,
    process_value=None
)

Explanation of the main parameters:

  • allow: allowed urls. Every url that matches this regular expression is extracted.
  • deny: forbidden urls. Urls that match this regular expression are not extracted.
  • allow_domains: allowed domain names. Only urls under the domains specified here are extracted.
  • deny_domains: forbidden domain names. Urls under the domains specified here are never extracted.
  • restrict_xpaths: a restricting xpath, used together with allow to filter links.

The Rule class:

Rule defines a crawling rule for the spider. A brief look at the class:

class scrapy.spiders.Rule(
    link_extractor,
    callback=None,
    cb_kwargs=None,
    follow=None,
    process_links=None,
    process_request=None
)

Explanation of the main parameters:

  • link_extractor: a LinkExtractor object that defines the crawling rule.
  • callback: the callback to run for urls that match this rule. Because CrawlSpider uses parse for its own logic, do not use parse as your callback.
  • follow: whether links extracted from the response by this rule should be followed up.
  • process_links: a function that receives the links obtained from link_extractor and can be used to filter out links that should not be crawled.

Case study: crawling a joke website

Analysis of xiaohua.zol.com.cn shows that the page data is embedded directly in the HTML: a request to the site returns html tags that already contain all the information visible on the page, so we simply parse the html the server responds with.
While paging through the data it also turns out that the next-page url is embedded in the html, so crawlspider makes it very convenient to extract the next page's url.

Spider code:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
import re


class XhzolSpider(CrawlSpider):
    name = 'xhzol'
    allowed_domains = ['xiaohua.zol.com.cn']
    start_urls = ['http://xiaohua.zol.com.cn/lengxiaohua/1.html']

    rules = (
        # Extract urls matching the pattern from the response (they are completed
        # automatically). callback names the function that handles the response;
        # follow controls whether matching urls keep being requested.
        Rule(LinkExtractor(allow=r'/lengxiaohua/\d+\.html'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        # item["title"] = response.xpath("/html/body/div[6]/div[1]/ul/li[1]/span/a/text()").extract_first()
        # print(re.findall("<span class='article-title'><a target='_blank' href='.*?\d+\.html'>(.*?)</a></span>", response.body.decode("gb18030"), re.S))
        # Search for the joke titles with a regular expression.
        for i in re.findall(r'<span class="article-title"><a target="_blank" href="/detail\d+/\d+\.html">(.*?)</a></span>', response.body.decode("gb18030"), re.S):
            item["titles"] = i
            yield item
        return item

Pipeline code:

class XiaohuaPipeline:
    def process_item(self, item, spider):
        print(item)
        return item

Here we simply print the items to check the result.

Case study: crawling penalty information from the CBIRC website

Analysis of the page shows that its data is loaded via Ajax: the page requests a back-end interface, receives json data, and js then embeds the data into the html and renders it. So you cannot simply request the site's domain; you have to request the back-end api interface instead. By comparing how the requested api url changes when turning pages, you can determine the url of the next page.

Spider code:

import scrapy
import re
import json


class CbircSpider(scrapy.Spider):
    name = 'cbirc'
    allowed_domains = ['cbirc.gov.cn']
    start_urls = ['https://www.cbirc.gov.cn/']

    def parse(self, response):
        start_url = "http://www.cbirc.gov.cn/cbircweb/DocInfo/SelectDocByItemIdAndChild?itemId=4113&pageSize=18&pageIndex=1"
        yield scrapy.Request(
            start_url,
            callback=self.parse1
        )

    def parse1(self, response):
        # Handle the json data.
        json_data = response.body.decode()
        json_data = json.loads(json_data)
        for i in json_data["data"]["rows"]:
            item = {}
            item["doc_name"] = i["docSubtitle"]
            item["doc_id"] = i["docId"]
            item["doc_time"] = i["builddate"]
            item["doc_detail"] = "http://www.cbirc.gov.cn/cn/view/pages/ItemDetail.html?docId=" + str(i["docId"]) + "&itemId=4113&generaltype=" + str(i["generaltype"])
            yield item
        # Paging: work out the url of the next page.
        str_url = response.request.url
        page = re.findall(r'.*?pageIndex=(\d+)', str_url, re.S)[0]
        mid_url = str(str_url).strip(str(page))
        page = int(page) + 1
        # The requested url changes only in its pageIndex value.
        if page <= 24:
            next_url = mid_url + str(page)
            yield scrapy.Request(
                next_url,
                callback=self.parse1
            )

Pipeline code:

import csv


class CircplusPipeline:
    def process_item(self, item, spider):
        with open('./circ_gb.csv', 'a+', encoding='gb2312') as file:
            fieldnames = ['doc_id', 'doc_name', 'doc_time', 'doc_detail']
            writer = csv.DictWriter(file, fieldnames=fieldnames)
            writer.writerow(item)
        return item

    def open_spider(self, spider):
        with open('./circ_gb.csv', 'a+', encoding='gb2312') as file:
            fieldnames = ['doc_id', 'doc_name', 'doc_time', 'doc_detail']
            writer = csv.DictWriter(file, fieldnames=fieldnames)
            writer.writeheader()

The data is saved to a csv file.



Download Middleware

Learning to use the downloader middleware: it pre-processes the request urls that the scheduler hands to the downloader, and post-processes the responses the downloader obtains.
(screenshot omitted)

There is also a process_exception method, which handles the exceptions raised while a request is being processed in the middleware.

Basic use of the downloader middleware

(screenshot omitted)

Define a custom middleware class and implement the process_* methods in it. Remember to enable it in settings.py by registering the class under DOWNLOADER_MIDDLEWARES.

Example code:

import random

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class RandomUserArgentMiddleware:
    # Process the request.
    def process_request(self, request, spider):
        # USER_ARENT_LIST is a custom settings entry; this assumes it holds
        # plain user-agent strings.
        ua = random.choice(spider.settings.get("USER_ARENT_LIST"))
        request.headers["User-Agent"] = ua


class SelectRequestUserAgent:
    # Process the response.
    def process_response(self, request, response, spider):
        print(request.headers["User-Agent"])
        # Must return a response (the engine hands it to the spider), a request
        # (the engine hands it to the scheduler) or None.
        return response


class HandleMiddlewareEcxeption:
    # Handle exceptions.
    def process_exception(self, request, exception, spider):
        print(exception)

Settings code:

DOWNLOADER_MIDDLEWARES = {
    'suningbook.middlewares.RandomUserArgentMiddleware': 543,
    'suningbook.middlewares.SelectRequestUserAgent': 544,
    'suningbook.middlewares.HandleMiddlewareEcxeption': 544,
}

Simulated login with scrapy

Logging in by carrying cookies

In scrapy, the start_urls are not filtered by allowed_domains; they are always requested. A look at the scrapy source code shows that the start_urls are requested by the start_requests method, so by overriding start_requests yourself you can attach cookie information (and more) to those initial requests and implement features such as simulated login.

(screenshot omitted)

By overriding the start_requests method and attaching cookie information to our requests, we implement the simulated login.

(screenshot omitted)

Supplementary information:
Cookie handling is enabled in scrapy by default, so requests carry cookies automatically. Turning on COOKIES_DEBUG = True lets you see the details of how cookies are passed along.
(screenshot omitted)

Case study: logging in to Renren by carrying cookies

By overriding start_requests and attaching cookie information to the request, we visit a page that can only be accessed after logging in and extract information from it, which amounts to a simulated login.

import scrapy
import re


class LoginSpider(scrapy.Spider):
    name = 'login'
    allowed_domains = ['renren.com']
    start_urls = ['http://renren.com/975252058/profile']

    # Override start_requests so that the request carries the cookie information.
    def start_requests(self):
        cookies = "anonymid=klx1odv08szk4j; depovince=GW; _r01_=1; taihe_bi_sdk_uid=17f803e81753a44fe40be7ad8032071b; taihe_bi_sdk_session=089db9062fdfdbd57b2da32e92cad1c2; ick_login=666a6c12-9cd1-433b-9ad7-97f4a595768d; _de=49A204BB9E35C5367A7153C3102580586DEBB8C2103DE356; t=c433fa35a370d4d8e662f1fb4ea7c8838; societyguester=c433fa35a370d4d8e662f1fb4ea7c8838; id=975252058; xnsid=fadc519c; jebecookies=db5f9239-9800-4e50-9fc5-eaac2c445206|||||; JSESSIONID=abcb9nQkVmO0MekR6ifGx; ver=7.0; loginfrom=null; wp_fold=0"
        cookie = {i.split("=")[0]: i.split("=")[1] for i in cookies.split("; ")}
        yield scrapy.Request(
            self.start_urls[0],
            callback=self.parse,
            cookies=cookie
        )

    # Check the response body to verify whether the simulated login succeeded.
    def parse(self, response):
        print(re.findall(" The user has not opened ", response.body.decode(), re.S))

Simulated login by sending a post request in scrapy

With the FormRequest object that scrapy provides you can send a POST request, and you can set parameters such as formdata, headers and cookies.
(screenshot omitted)
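A minimal sketch of such a post request (the url and form fields are placeholders, not a real login endpoint):

import scrapy


class PostLoginSpider(scrapy.Spider):
    name = 'post_login_example'
    start_urls = ['http://example.com/login']

    def parse(self, response):
        # submit the login form as a POST request
        yield scrapy.FormRequest(
            "http://example.com/login",                         # placeholder endpoint
            formdata={"username": "user", "password": "pass"},  # form body
            callback=self.after_login,
        )

    def after_login(self, response):
        # check here whether the login succeeded
        self.logger.info("landed on %s after login", response.url)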

Case study: scrapy simulated login to GitHub

Log in to GitHub by visiting github.com/login, extracting the form parameters, and then requesting /session to verify the account and password, after which the login succeeds.

Spider code:

import scrapy
import re
import random


class GithubSpider(scrapy.Spider):
    name = 'github'
    allowed_domains = ['github.com']
    start_urls = ['https://github.com/login']

    def parse(self, response):
        # First get authenticity_token and commit from the login page response;
        # they are required when submitting the login request.
        authenticity_token = response.xpath("//*[@id='login']/div[4]/form/input[1]/@value").extract_first()
        rand_num1 = random.randint(100000, 999999)
        rand_num2 = random.randint(1000000, 9999999)
        rand_num = str(rand_num1) + str(rand_num2)
        commit = response.xpath("//*[@id='login']/div[4]/form/div/input[12]/@value").extract_first()
        form_data = dict(
            commit=commit,
            authenticity_token=authenticity_token,
            login="[email protected]",
            password="tcc062556",
            timestamp=rand_num,
            # rusted_device="",
        )
        # form_data["webauthn-support"] = ""
        # form_data["webauthn-iuvpaa-support"] = ""
        # form_data["return_to"] = ""
        # form_data["allow_signup"] = ""
        # form_data["client_id"] = ""
        # form_data["integration"] = ""
        # form_data["required_field_b292"] = ""
        headers = {
            "referer": "https://github.com/login",
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'accept-language': 'zh-CN,zh;q=0.9',
            'accept-encoding': 'gzip, deflate, br',
            'origin': 'https://github.com'
        }
        # Send the post request with FormRequest.from_response to log in.
        yield scrapy.FormRequest.from_response(
            response,
            formdata=form_data,
            headers=headers,
            callback=self.login_data
        )

    def login_data(self, response):
        # Print the user name to verify whether the login succeeded.
        print(re.findall("xiangshiersheng", response.body.decode()))
        # Save the html locally.
        with open('./github.html', 'a+', encoding='utf-8') as f:
            f.write(response.body.decode())

Summary:

Three ways to simulate a login:

1. Carry cookies

Use scrapy.Request(url, callback=..., cookies={...})
Fill in the cookies dict so that the request to the url carries the cookie information.

2. Use FormRequest

scrapy.FormRequest(url, formdata={...}, callback=...)
formdata is the request body; fill it with the form data to be submitted.

3. Use from_response

scrapy.FormRequest.from_response(response, formdata={...}, callback=...)
from_response automatically finds the form submission address in the response (provided the response contains a form with a submission address).


Copyright notice

This article was written by Hua Weiyun. Please include a link to the original when reposting: https://yzsam.com/2022/184/202207031844469538.html