Pagination in Scrapy (handling page turns)
2022-07-06 01:07:00 【Keep a low profile】
import scrapy
from bs4 import BeautifulSoup


class BookSpiderSpider(scrapy.Spider):
    name = 'book_spider'
    allowed_domains = ['17k.com']
    start_urls = ['https://www.17k.com/all/book/2_0_0_0_0_0_0_0_1.html']

    # Why start_requests is overridden is explained in the
    # pagination notes further down.
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse
            )

    def parse(self, response, **kwargs):
        print(response.url)
        soup = BeautifulSoup(response.text, 'lxml')
        # Skip the first <tr>, which is the table header row.
        trs = soup.find('div', attrs={'class': 'alltable'}).find('tbody').find_all('tr')[1:]
        for tr in trs:
            book_type = tr.find('td', attrs={'class': 'td2'}).find('a').text
            book_name = tr.find('td', attrs={'class': 'td3'}).find('a').text
            book_words = tr.find('td', attrs={'class': 'td5'}).text
            book_author = tr.find('td', attrs={'class': 'td6'}).find('a').text
            print(book_type, book_name, book_words, book_author)
            # Only the first row of each page is printed; remove this
            # break to process every row.
            break

        # The same parsing written with XPath:
        # trs = response.xpath("//div[@class='alltable']/table/tbody/tr")[1:]
        # for tr in trs:
        #     type = tr.xpath("./td[2]/a/text()").extract_first()
        #     name = tr.xpath("./td[3]/span/a/text()").extract_first()
        #     words = tr.xpath("./td[5]/text()").extract_first()
        #     author = tr.xpath("./td[6]/a/text()").extract_first()
        #     print(type, name, words, author)

        # 1. Linear pagination.
        # Find the URL of the next page and request it. This is the
        # simplest pagination logic: keep crawling one page after another.
        # ('下一页' is the "next page" link text.)
        # next_page_url = soup.find('a', text='下一页')['href']
        # if 'javascript' not in next_page_url:
        #     yield scrapy.Request(
        #         url=response.urljoin(next_page_url),
        #         method='get',
        #         callback=self.parse
        #     )

        # 2. Brute-force pagination.
        # Collect every URL in the pager and simply send a request for
        # each one. Every request goes through the engine to the scheduler
        # (a set plus a queue), which drops duplicates, then on to the
        # downloader, and the response comes back to the spider.
        # The start URL would still be requested twice, though: the
        # start_requests method inherited from Spider creates its requests
        # with dont_filter=True. That is why start_requests is overridden
        # above, so that the start URL is filtered like any other request
        # and the duplicate start URL problem goes away.
        a_list = soup.find('div', attrs={'class': 'page'}).find_all('a')
        for a in a_list:
            if 'javascript' not in a['href']:
                yield scrapy.Request(
                    url=response.urljoin(a['href']),
                    method='get',
                    callback=self.parse
                )
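The effect of duplicate filtering on the brute-force approach can be illustrated without Scrapy. The sketch below is a toy model, not Scrapy's actual scheduler: `ToyScheduler`, `enqueue`, `page_links`, `crawl_start`, and the sample hrefs are all hypothetical names made up for illustration. It assumes, in line with the comments in the spider, that a request sent with dont_filter=True skips the seen-set entirely, so its URL is never recorded there.

```python
from urllib.parse import urljoin


class ToyScheduler:
    """Toy stand-in for a crawler scheduler: a seen-set plus a queue."""

    def __init__(self):
        self.seen = set()
        self.queue = []

    def enqueue(self, url, dont_filter=False):
        # Unfiltered requests skip the seen-set, so their URL is not recorded.
        if not dont_filter:
            if url in self.seen:
                return False  # duplicate, dropped
            self.seen.add(url)
        self.queue.append(url)
        return True


def page_links(base_url, hrefs):
    """Resolve relative hrefs against the page URL and skip javascript: links."""
    return [urljoin(base_url, h) for h in hrefs if 'javascript' not in h]


START = 'https://www.17k.com/all/book/2_0_0_0_0_0_0_0_1.html'
# Hypothetical pager hrefs; note that the pager links back to page 1.
HREFS = [
    '/all/book/2_0_0_0_0_0_0_0_1.html',
    '/all/book/2_0_0_0_0_0_0_0_2.html',
    'javascript:void(0)',
]


def crawl_start(dont_filter):
    """Queue the start URL, then every pager link found on the start page."""
    scheduler = ToyScheduler()
    scheduler.enqueue(START, dont_filter=dont_filter)
    for link in page_links(START, HREFS):
        scheduler.enqueue(link)
    return scheduler.queue


unfiltered = crawl_start(dont_filter=True)
filtered = crawl_start(dont_filter=False)
print(unfiltered.count(START), filtered.count(START))
```

When the start request is unfiltered, the pager's link back to page 1 is not recognised as a repeat and the start page is queued twice. When the start request goes through the seen-set like every other request, the repeat is dropped.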