Scrapy pagination (page-turning handling)
2022-07-06 01:07:00 【Keep a low profile】
import scrapy
from bs4 import BeautifulSoup


class BookSpiderSpider(scrapy.Spider):
    name = 'book_spider'
    allowed_domains = ['17k.com']
    start_urls = ['https://www.17k.com/all/book/2_0_0_0_0_0_0_0_1.html']

    # Why start_requests is overridden here is explained below, in the
    # notes on the brute-force pagination logic.
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse
            )

    def parse(self, response, **kwargs):
        print(response.url)
        soup = BeautifulSoup(response.text, 'lxml')
        trs = soup.find('div', attrs={'class': 'alltable'}).find('tbody').find_all('tr')[1:]
        for tr in trs:
            book_type = tr.find('td', attrs={'class': 'td2'}).find('a').text
            book_name = tr.find('td', attrs={'class': 'td3'}).find('a').text
            book_words = tr.find('td', attrs={'class': 'td5'}).text
            book_author = tr.find('td', attrs={'class': 'td6'}).find('a').text
            print(book_type, book_name, book_words, book_author)
            break  # stop after the first row while debugging

        # The same parsing done with XPath instead of BeautifulSoup:
        # trs = response.xpath("//div[@class='alltable']/table/tbody/tr")[1:]
        # for tr in trs:
        #     book_type = tr.xpath("./td[2]/a/text()").extract_first()
        #     book_name = tr.xpath("./td[3]/span/a/text()").extract_first()
        #     book_words = tr.xpath("./td[5]/text()").extract_first()
        #     book_author = tr.xpath("./td[6]/a/text()").extract_first()
        #     print(book_type, book_name, book_words, book_author)

        # Pagination, option 1: the simplest logic. Find the URL of the
        # "next page" link and request it, crawling linearly page by page.
        # next_page_url = soup.find('a', text='下一页')['href']  # the site's "next page" anchor text
        # if 'javascript' not in next_page_url:
        #     yield scrapy.Request(
        #         url=response.urljoin(next_page_url),
        #         method='get',
        #         callback=self.parse
        #     )

        # Pagination, option 2: brute-force logic. Collect every URL in the
        # pagination bar and yield a request for each. Requests pass through
        # the engine to the scheduler, whose set-based queue deduplicates
        # them before handing them to the downloader; responses then come
        # back to the spider. The start_urls would normally be requested
        # twice, because the parent Spider class's default start_requests
        # sends them with dont_filter=True. Overriding start_requests (as
        # above) restores the default filtering and so removes the
        # duplicate-start_url problem.
        a_list = soup.find('div', attrs={'class': 'page'}).find_all('a')
        for a in a_list:
            if 'javascript' not in a['href']:
                yield scrapy.Request(
                    url=response.urljoin(a['href']),
                    method='get',
                    callback=self.parse
                )