Incremental crawling in distributed crawlers
2022-07-25 07:05:00 【范之度】
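Incremental crawler: detects when a website's data has been updated, then crawls only the newly added data.
Core: deduplication.
Record table: stores identifiers of the data that has already been crawled. A Redis set serves as this record table.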
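As an illustration of what a "data identifier" could be, here is a minimal sketch that hashes a record's fields into a fixed-length fingerprint. The post itself uses the raw URL as the identifier; the `fingerprint` and `is_new` helpers and the in-memory `seen` set (standing in for the Redis set) are assumptions made for this example only.

```python
import hashlib


def fingerprint(*fields: str) -> str:
    """Return a hex SHA-256 digest identifying a record by its fields."""
    h = hashlib.sha256()
    for field in fields:
        data = field.encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc") differ.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


# Hypothetical in-memory record table; a Redis set would play this role.
seen = set()


def is_new(*fields: str) -> bool:
    """Record the fingerprint and report whether it was unseen before."""
    fp = fingerprint(*fields)
    if fp in seen:
        return False
    seen.add(fp)
    return True
```

Hashing the URL together with a content field would let a changed page under an unchanged URL count as new data, at the cost of storing one entry per version.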
The idea is to check Redis at crawl time to see whether the URL already exists, as follows:
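import scrapy

# Method of the spider class; self.conn is assumed to be a Redis client created elsewhere.
def parse(self, response):
    li_list = response.xpath('./span[3]/ul/li')
    for li in li_list:
        # Relative XPath: the original '/li/@href' was absolute and would not match from li.
        # Adjust it (for example './a/@href') to the actual page structure.
        detail_url = "http://baidu.com" + li.xpath('./@href').extract_first()
        # sadd returns 1 if the URL was newly added, 0 if it was already in the set.
        ex = self.conn.sadd('urls', detail_url)
        if ex == 1:
            # The URL was not in the set: new data, no duplicate, so request it.
            yield scrapy.Request(detail_url)
        else:
            print("No new data")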
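The deduplication step can be tried without a Redis server. The sketch below is not the author's code: `InMemorySetStore` is a hypothetical stand-in that mimics only the return value of `sadd` (the number of members actually added), and `new_urls` is an illustrative helper name.

```python
from typing import Iterable, Iterator


class InMemorySetStore:
    """Hypothetical stand-in for a Redis client; mimics only sadd."""

    def __init__(self) -> None:
        self._sets = {}  # maps a key name to a Python set of members

    def sadd(self, name: str, *values: str) -> int:
        """Add values to the named set; return how many were new."""
        bucket = self._sets.setdefault(name, set())
        added = 0
        for value in values:
            if value not in bucket:
                bucket.add(value)
                added += 1
        return added


def new_urls(conn, urls: Iterable[str], key: str = "urls") -> Iterator[str]:
    """Yield only the URLs that were not yet recorded under key."""
    for url in urls:
        # 1 means the URL was just added, so it has not been crawled before.
        if conn.sadd(key, url) == 1:
            yield url


if __name__ == "__main__":
    conn = InMemorySetStore()
    print(list(new_urls(conn, ["http://baidu.com/a", "http://baidu.com/b"])))
    print(list(new_urls(conn, ["http://baidu.com/b", "http://baidu.com/c"])))
```

Because the check and the insert are a single `sadd` call, swapping in a real Redis client keeps the test atomic, which matters when several crawler processes share one set.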