Pagination in Scrapy (handling page turns)
2022-07-06 01:07:00 【Keep a low profile】
import scrapy
from bs4 import BeautifulSoup


class BookSpiderSpider(scrapy.Spider):
    name = 'book_spider'
    allowed_domains = ['17k.com']
    start_urls = ['https://www.17k.com/all/book/2_0_0_0_0_0_0_0_1.html']

    # start_requests is overridden on purpose; the reason is explained
    # in the paging notes inside parse().
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse
            )
    def parse(self, response, **kwargs):
        print(response.url)
        soup = BeautifulSoup(response.text, 'lxml')
        # Skip the first row (presumably the table header).
        trs = soup.find('div', attrs={'class': 'alltable'}).find('tbody').find_all('tr')[1:]
        for tr in trs:
            book_type = tr.find('td', attrs={'class': 'td2'}).find('a').text
            book_name = tr.find('td', attrs={'class': 'td3'}).find('a').text
            book_words = tr.find('td', attrs={'class': 'td5'}).text
            book_author = tr.find('td', attrs={'class': 'td6'}).find('a').text
            print(book_type, book_name, book_words, book_author)
            # Only the first row is printed; remove this break to print every row.
            break
        # The same parsing done with XPath:
        # trs = response.xpath("//div[@class='alltable']/table/tbody/tr")[1:]
        # for tr in trs:
        #     book_type = tr.xpath("./td[2]/a/text()").extract_first()
        #     book_name = tr.xpath("./td[3]/span/a/text()").extract_first()
        #     book_words = tr.xpath("./td[5]/text()").extract_first()
        #     book_author = tr.xpath("./td[6]/a/text()").extract_first()
        #     print(book_type, book_name, book_words, book_author)
        # Paging option 1: find the URL of the next page and request it.
        # This is the simplest kind of paging logic: the crawl advances
        # linearly, one page at a time.
        # '下一页' ("next page") is the pager link text on the Chinese-language site.
        # next_page_url = soup.find('a', text='下一页')['href']
        # if 'javascript' not in next_page_url:
        #     yield scrapy.Request(
        #         url=response.urljoin(next_page_url),
        #         method='GET',
        #         callback=self.parse
        #     )
        # Paging option 2: brute-force paging.
        # Collect every URL in the pager and simply request all of them. Each
        # request goes through the engine to the scheduler (a set plus a queue),
        # which does the deduplication, then on to the downloader, and the
        # response comes back to the spider.
        # The start URL would still be fetched twice, because the start_requests
        # method inherited from scrapy.Spider creates its requests with
        # dont_filter=True. Overriding start_requests as done above, without
        # dont_filter, leaves filtering on by default and fixes the repeated
        # start URL.
        a_list = soup.find('div', attrs={'class': 'page'}).find_all('a')
        for a in a_list:
            if 'javascript' not in a['href']:
                yield scrapy.Request(
                    url=response.urljoin(a['href']),
                    method='GET',
                    callback=self.parse
                )
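
To see why the start URL gets fetched twice with the inherited start_requests, here is a small standard-library sketch of the brute-force paging flow. It is a simplified model, not Scrapy's actual implementation: the `crawl` and `schedule` functions, the `PAGER_LINKS` list and the three-page pager are all hypothetical, and a plain set of URLs stands in for the scheduler's duplicate filter.

```python
from collections import deque
from urllib.parse import urljoin

# Hypothetical pager: assume every page links to all three pages
# plus one javascript: pseudo-link that must be skipped.
PAGER_LINKS = [
    "/all/book/2_0_0_0_0_0_0_0_1.html",
    "/all/book/2_0_0_0_0_0_0_0_2.html",
    "/all/book/2_0_0_0_0_0_0_0_3.html",
    "javascript:void(0)",
]


def crawl(start_url, filter_start_url):
    """Return the URLs that get downloaded, in order.

    filter_start_url=False models the inherited start_requests
    (start request sent with dont_filter=True).
    filter_start_url=True models the overridden start_requests.
    """
    seen = set()      # stands in for the duplicate filter
    queue = deque()   # stands in for the scheduler queue
    downloaded = []

    def schedule(url, dont_filter=False):
        # A request with dont_filter=True skips the filter entirely,
        # so its URL is never recorded as seen.
        if not dont_filter:
            if url in seen:
                return  # duplicate, dropped
            seen.add(url)
        queue.append(url)

    schedule(start_url, dont_filter=not filter_start_url)

    while queue:
        page_url = queue.popleft()
        downloaded.append(page_url)
        # Same filtering as in parse(): skip javascript links,
        # resolve relative hrefs against the current page URL.
        for href in PAGER_LINKS:
            if "javascript" not in href:
                schedule(urljoin(page_url, href))
    return downloaded


if __name__ == "__main__":
    start = "https://www.17k.com/all/book/2_0_0_0_0_0_0_0_1.html"
    print(crawl(start, filter_start_url=False))
    print(crawl(start, filter_start_url=True))
```

In this model the first call downloads page 1 twice, because the unfiltered start request never registers its URL and the pager link back to page 1 then passes the filter once. The second call downloads each page exactly once.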