Scrapy crawler framework
2022-07-29 10:25:00 【Star and Dream Star_ dream】
1. Create a project
scrapy startproject <project name>

After running the command, the project directory looks like this:

D:.
│  scrapy.cfg
│
└─firstSpider
    │  items.py
    │  middlewares.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │
    └─spiders
            __init__.py

(firstSpider and spiders are folders; the rest are files.)
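For reference, the tree above is the layout you would typically get when the project is named firstSpider, i.e. after running the command below (the exact files may vary slightly between Scrapy versions):

scrapy startproject firstSpider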
The role of each main file (do not delete any of them!):
- scrapy.cfg: the project's configuration file
- firstSpider: the project's Python module; the code is imported from here
- items.py: the project's item (target data) definitions
- pipelines.py: the project's pipeline file
- settings.py: the project's settings file
- spiders: the directory that stores the crawler code
- __init__.py: the project's initialization file
2. Define the fields of the target data
Write the code in the items.py file.
Go into the firstSpider folder, find the items.py file, and add the following two lines of code (the field names are up to you):
import scrapy

# Define the fields of the target data
class FirstspiderItem(scrapy.Item):
    title = scrapy.Field()  # Chapter title
    link = scrapy.Field()   # Link to the chapter
3. Write crawler code
In the project root directory (the one that contains the firstSpider folder and the scrapy.cfg file), enter the following command in a cmd window to create the crawler file:

scrapy genspider <file name> <host of the site to crawl>

For example: scrapy genspider novelSpider www.shucw.com

A crawler file will be added to your spiders directory. The contents of the file:
import scrapy


class NovelspiderSpider(scrapy.Spider):
    name = 'novelSpider'  # Crawler name
    allowed_domains = ['www.shucw.com']  # Host name of the site to crawl
    start_urls = ['http://www.shucw.com/']  # Page to crawl (you can modify this)

    def parse(self, response):
        pass
4. Write the crawler code in the novelSpider crawler file
The crawler code is simply written inside the parse method.
import scrapy
from bs4 import BeautifulSoup
from firstSpider.items import FirstspiderItem  # Import the item class

# Indent everything with the Tab key
class NovelspiderSpider(scrapy.Spider):
    name = 'novelSpider'  # Crawler identification name
    allowed_domains = ['www.shucw.com']  # Range of pages to crawl
    start_urls = ['http://www.shucw.com/html/13/13889/']  # Start URL

    def parse(self, response):
        soup = BeautifulSoup(response.body, 'lxml')
        titles = []  # Used to save the chapter titles (stored in a list)
        for i in soup.select('dd a'):
            titles.append(i.get_text())  # Append to titles
        links = []  # Used to save the chapter links
        for i in soup.select('dd a'):
            link = "http://www.shucw.com" + i.attrs['href']
            links.append(link)
        for i in range(0, len(titles)):
            item = FirstspiderItem()
            item["title"] = titles[i]
            item["link"] = links[i]
            yield item  # Yield each item
5. Save each item to a local file in pipelines.py
from itemadapter import ItemAdapter

# Use the Tab key throughout to avoid mixing spaces with tabs
# Pipeline file: responsible for post-processing or saving each item
class FirstspiderPipeline:
    # Define the parameters that need to be initialized
    def __init__(self):
        # The file path used here is the article folder in the root directory (you need to create it manually)
        self.file = open("article/novel.txt", "a")

    # Method executed every time the pipeline receives an item
    def process_item(self, item, spider):
        content = str(item) + "\n"
        self.file.write(content)  # Write the data to the local file
        return item

    # Method executed when the crawl ends
    def close_spider(self, spider):
        self.file.close()
It is not enough to write the code in pipelines.py; you also have to enable the pipeline in settings.py.
Enable the pipeline and give it a priority in the range 0-1000 (the smaller the number, the higher the priority).
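A minimal sketch of what that settings.py entry typically looks like (the module path assumes the firstSpider project created above):

# settings.py -- enable the pipeline; 300 is an arbitrary priority in the 0-1000 range
ITEM_PIPELINES = {
    'firstSpider.pipelines.FirstspiderPipeline': 300,
}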
6. Run the crawler
In the project root directory, enter the following command in a cmd window to run the crawler:

scrapy crawl <crawler name>

For example: scrapy crawl novelSpider

Go back to the root directory, open the article folder, and open novel.txt; it contains the data the crawler collected.
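Because the pipeline writes str(item) plus a newline, each saved record looks roughly like the line below; the title and link here are placeholders, not real crawled data:

{'link': 'http://www.shucw.com/...', 'title': 'Chapter title ...'}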
7. How to make a POST request and add request headers
Enter the following code in the crawler file (youdaoSpider.py). (This is a separate project.)
import scrapy
import random


class TranslateSpider(scrapy.Spider):
    name = 'translate'
    allowed_domains = ['fanyi.youdao.com']
    # start_urls = ['http://fanyi.youdao.com/']

    agent1 = "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 " \
             "Mobile/10A5376e Safari/8536.25 "
    agent2 = "Mozilla/5.0 (Windows NT 5.2) AppleWebKit/534.30 (KHTML, like Gecko) Chrome/12.0.742.122 Safari/534.30"
    agent3 = "Mozilla/5.0 (Linux; Android 9; LON-AL00 Build/HUAWEILON-AL00; wv) AppleWebKit/537.36 (KHTML, like Gecko) " \
             "Version/4.0 Chrome/76.0.3809.89 Mobile Safari/537.36 T7/11.25 SP-engine/2.17.0 flyflow/4.21.5.31 lite " \
             "baiduboxapp/4.21.5.31 (Baidu; P1 9) "
    agent4 = "Mozilla/5.0 (Linux; Android 10; MIX 2S Build/QKQ1.190828.002; wv) AppleWebKit/537.36 (KHTML, like Gecko) " \
             "Version/4.0 Chrome/76.0.3809.89 Mobile Safari/537.36 T7/12.5 SP-engine/2.26.0 baiduboxapp/12.5.1.10 (Baidu; " \
             "P1 10) NABar/1.0 "
    agent5 = "Mozilla/5.0 (Linux; U; Android 10; zh-CN; TNY-AL00 Build/HUAWEITNY-AL00) AppleWebKit/537.36 (KHTML, " \
             "like Gecko) Version/4.0 Chrome/78.0.3904.108 UCBrowser/13.2.0.1100 Mobile Safari/537.36 "
    agent6 = "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 " \
             "Safari/533.21.1 "
    agent_list = [agent1, agent2, agent3, agent4, agent5, agent6]
    header = {
        "User-Agent": random.choice(agent_list)  # Pick a random User-Agent
    }

    def start_requests(self):
        url = "https://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule"
        key = "hello"  # Text to translate; the original code referenced `key` without defining it
        # Add a POST request with form data to the request queue
        yield scrapy.FormRequest(
            url=url,
            formdata={
                "i": key,
                "from": "AUTO",
                "to": "AUTO",
                "smartresult": "dict",
                "client": "fanyideskweb",
                "salt": "16568305467837",
                "sign": "684b7fc03a39eebebf045749a7759621",
                "lts": "1656830546783",
                "bv": "38d2f7b6370a18835effaf2745b8cc28",
                "doctype": "json",
                "version": "2.1",
                "keyfrom": "fanyi.web",
                "action": "FY_BY_REALTlME"
            },
            headers=header,
            callback=self.parse
        )

    def parse(self, response):
        pass
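The parse method above is left empty. As a minimal sketch (not part of the original article), you could read the JSON response as shown below; the 'translateResult' field name is an assumption about Youdao's response format, so inspect the actual response before relying on it:

    # Add `import json` at the top of youdaoSpider.py, then replace the empty parse method:
    def parse(self, response):
        data = json.loads(response.text)  # the interface returns a JSON string
        # 'translateResult' is an assumed field name -- check the real response first
        print(data.get("translateResult"))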
Bonus tip for this article:
Type scrapy in cmd to see the various available commands.
If you don't know what a command means, add -h to get the details.
For example: scrapy runspider -h
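As a concrete example, scrapy crawl -h lists, among other options, -o for dumping the scraped items straight to a file, which is a handy alternative to a custom pipeline:

scrapy crawl novelSpider -o novel.json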
End