Pagination in Scrapy (handling page turning)
2022-07-06 01:07:00 【Keep a low profile】
import scrapy
from bs4 import BeautifulSoup


class BookSpiderSpider(scrapy.Spider):
    name = 'book_spider'
    allowed_domains = ['17k.com']
    start_urls = ['https://www.17k.com/all/book/2_0_0_0_0_0_0_0_1.html']

    # Why start_requests is overridden here is explained below, in the
    # notes on pagination logic 2.
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse
            )

    def parse(self, response, **kwargs):
        print(response.url)
        soup = BeautifulSoup(response.text, 'lxml')
        trs = soup.find('div', attrs={'class': 'alltable'}).find('tbody').find_all('tr')[1:]
        for tr in trs:
            book_type = tr.find('td', attrs={'class': 'td2'}).find('a').text
            book_name = tr.find('td', attrs={'class': 'td3'}).find('a').text
            book_words = tr.find('td', attrs={'class': 'td5'}).text
            book_author = tr.find('td', attrs={'class': 'td6'}).find('a').text
            print(book_type, book_name, book_words, book_author)
            break

        # Alternative: parsing with XPath
        # trs = response.xpath("//div[@class='alltable']/table/tbody/tr")[1:]
        # for tr in trs:
        #     type = tr.xpath("./td[2]/a/text()").extract_first()
        #     name = tr.xpath("./td[3]/span/a/text()").extract_first()
        #     words = tr.xpath("./td[5]/text()").extract_first()
        #     author = tr.xpath("./td[6]/a/text()").extract_first()
        #     print(type, name, words, author)

        # Pagination logic 1: find the url of the next page and request it.
        # This is the simplest approach -- keep crawling linearly, one page
        # after another.
        # next_page_url = soup.find('a', text='下一页')['href']  # '下一页' = "next page"
        # if 'javascript' not in next_page_url:
        #     yield scrapy.Request(
        #         url=response.urljoin(next_page_url),
        #         method='get',
        #         callback=self.parse
        #     )

        # Pagination logic 2: brute force.
        # Grab every url in the pagination bar and request them all. Each
        # request goes through the engine to the scheduler (a set plus a
        # queue), which deduplicates it before handing it to the downloader;
        # the response then comes back to the spider. The catch: the start
        # url would still be crawled twice, because the parent Spider class
        # issues its initial requests with dont_filter=True. That is why
        # start_requests is overridden above -- the rewritten version lets
        # the start url be filtered like any other request, which fixes the
        # duplicate start_url problem.
        a_list = soup.find('div', attrs={'class': 'page'}).find_all('a')
        for a in a_list:
            if 'javascript' not in a['href']:
                yield scrapy.Request(
                    url=response.urljoin(a['href']),
                    method='get',
                    callback=self.parse
                )