The first Scrapy crawler
2022-07-25 12:20:00 【Tota King Li】
The Scrapy project directory structure is as follows:
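A sketch of the typical layout produced by scrapy startproject douban (move.py is the spider file added by hand later; the exact files can vary slightly with the Scrapy version):

douban/
    scrapy.cfg            # deployment configuration
    douban/
        __init__.py
        items.py          # item models for the scraped data
        middlewares.py    # spider / downloader middlewares
        pipelines.py      # item pipelines (persistence)
        settings.py       # project settings
        spiders/
            __init__.py
            move.py       # the spider written below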

What we want to crawl is the title, author, and description of each book on the Dushu book site (dushu.com).
First, we define the model for the scraped data in the items.py file:
import scrapy

class MoveItem(scrapy.Item):
    # Define the model of the crawled data
    title = scrapy.Field()
    auth = scrapy.Field()
    desc = scrapy.Field()

An Item behaves like a dictionary, so the spider can fill these fields with plain item['title'] = ... assignments. The main work happens in the move.py file under the spiders directory:
import scrapy
from douban.items import MoveItem

class MovieSpider(scrapy.Spider):
    # The name of the spider; every spider's name must be unique
    name = 'movie'
    # Domains that crawled requests are restricted to
    allowed_domains = ['dushu.com']
    # The first URL(s) to crawl
    start_urls = ['https://www.dushu.com/book/1188.html']

    def parse(self, response):
        li_list = response.xpath('/html/body/div[6]/div/div[2]/div[2]/ul/li')
        for li in li_list:
            item = MoveItem()
            item['title'] = li.xpath('div/h3/a/text()').extract_first()
            item['auth'] = li.xpath('div/p[1]/a/text()').extract_first()
            item['desc'] = li.xpath('div/p[2]/text()').extract_first()
            # parse is a generator, so each item is yielded back to the engine
            yield item

        href_list = response.xpath('/html/body/div[6]/div/div[2]/div[3]/div/a/@href').extract()
        for href in href_list:
            # Complete the relative pagination link into an absolute URL
            url = response.urljoin(href)
            # Yield a new Request with parse as its callback, so the
            # following pages are crawled recursively
            yield scrapy.Request(url=url, callback=self.parse)
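With the spider in place, it can be run from the project root with Scrapy's command-line tool; a quick way to check the output, assuming the project was generated as douban (the -o flag simply dumps the items to a file for inspection):

scrapy crawl movie
# or dump the scraped items to a JSON file for a quick look
scrapy crawl movie -o books.json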
To persist the scraped data, the item pipeline must first be enabled in the settings.py file.
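A minimal sketch of that setting, assuming the project is named douban and the DoubanPipeline class shown below is used (300 is an arbitrary priority value):

ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}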
Then, in the pipelines.py file:
import pymongo

class DoubanPipeline(object):
    def __init__(self):
        self.mongo_client = pymongo.MongoClient('mongodb://39.108.188.19:27017')

    def process_item(self, item, spider):
        db = self.mongo_client.data
        message = db.messages
        # insert_one replaces the legacy insert() call, which was removed in PyMongo 4.x
        message.insert_one(dict(item))
        return item
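After a crawl with the pipeline enabled, the stored documents can be inspected directly with pymongo; a minimal sketch, assuming the same host, database (data) and collection (messages) as above:

import pymongo

client = pymongo.MongoClient('mongodb://39.108.188.19:27017')
collection = client.data.messages
# Print how many books were stored and show one sample document
print(collection.count_documents({}))
print(collection.find_one())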