Pagination in Scrapy (page-turning handling)
2022-07-06 01:07:00 【Keep a low profile】
import scrapy
from bs4 import BeautifulSoup


class BookSpiderSpider(scrapy.Spider):
    name = 'book_spider'
    allowed_domains = ['17k.com']
    start_urls = ['https://www.17k.com/all/book/2_0_0_0_0_0_0_0_1.html']

    # Why start_requests is overridden is explained later, in the notes on
    # the second paging approach inside parse().
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse
            )

    def parse(self, response, **kwargs):
        print(response.url)
        soup = BeautifulSoup(response.text, 'lxml')
        # Skip the first row of the table body.
        trs = soup.find('div', attrs={'class': 'alltable'}).find('tbody').find_all('tr')[1:]
        for tr in trs:
            book_type = tr.find('td', attrs={'class': 'td2'}).find('a').text
            book_name = tr.find('td', attrs={'class': 'td3'}).find('a').text
            book_words = tr.find('td', attrs={'class': 'td5'}).text
            book_author = tr.find('td', attrs={'class': 'td6'}).find('a').text
            print(book_type, book_name, book_words, book_author)
            # Stops after the first row; remove this break to print every row.
            break

        # The same parsing done with XPath:
        # trs = response.xpath("//div[@class='alltable']/table/tbody/tr")[1:]
        # for tr in trs:
        #     type = tr.xpath("./td[2]/a/text()").extract_first()
        #     name = tr.xpath("./td[3]/span/a/text()").extract_first()
        #     words = tr.xpath("./td[5]/text()").extract_first()
        #     author = tr.xpath("./td[6]/a/text()").extract_first()
        #     print(type, name, words, author)

        # 1. Find the URL of the next page and request it.
        # This is the simplest kind of paging logic: the crawl just keeps
        # moving forward linearly, one page after another.
        # The link text must match the site's own label ('下一页', "next page").
        # next_page_url = soup.find('a', text='下一页')['href']
        # if 'javascript' not in next_page_url:
        #     yield scrapy.Request(
        #         url=response.urljoin(next_page_url),
        #         method='get',
        #         callback=self.parse
        #     )

        # 2. Brute-force paging logic
        # Collect every URL in the pagination bar and simply send a request
        # for each one. The requests go through the engine to the scheduler
        # (a set plus a queue), which does the de-duplication, then on to the
        # downloader, and the responses come back to the spider.
        # The start URL would still be fetched twice, though: the
        # start_requests method inherited from Spider builds its requests
        # with dont_filter=True, so the start URL bypasses the filter.
        # That is why start_requests is overridden in this class: its
        # requests are filtered by default, which solves the duplicated
        # start URL.
        a_list = soup.find('div', attrs={'class': 'page'}).find_all('a')
        for a in a_list:
            if 'javascript' not in a['href']:
                yield scrapy.Request(
                    url=response.urljoin(a['href']),
                    method='get',
                    callback=self.parse
                )
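
The de-duplication behind the brute-force approach can be sketched without Scrapy. This is only a rough model of what the scheduler does, not Scrapy's actual implementation: the FAKE_SITE mapping, the example.com URLs, and the crawl and schedule functions are all made up for illustration. It assumes that a request sent with dont_filter=True skips the "seen" set entirely, so its URL is not recorded there.

```python
from urllib.parse import urljoin

# Hypothetical site: page URL -> hrefs found in its pagination bar.
FAKE_SITE = {
    "https://example.com/list_1.html": ["/list_1.html", "/list_2.html", "/list_3.html", "javascript:void(0)"],
    "https://example.com/list_2.html": ["/list_1.html", "/list_2.html", "/list_3.html"],
    "https://example.com/list_3.html": ["/list_1.html", "/list_2.html", "/list_3.html"],
}


def crawl(start_url, filter_start_request):
    """Simulate the crawl and return the URLs fetched, in order."""
    seen = set()    # stands in for the scheduler's duplicate filter
    queue = []      # stands in for the scheduler's request queue
    fetched = []

    def schedule(url, dont_filter=False):
        if not dont_filter:
            if url in seen:
                return          # duplicate: dropped before download
            seen.add(url)
        queue.append(url)

    # An unfiltered start request models the inherited start_requests;
    # a filtered one models the overridden version.
    schedule(start_url, dont_filter=not filter_start_request)
    while queue:
        url = queue.pop(0)
        fetched.append(url)
        for href in FAKE_SITE.get(url, []):
            if "javascript" not in href:
                # urljoin resolves the relative href against the page URL
                schedule(urljoin(url, href))
    return fetched


if __name__ == "__main__":
    start = "https://example.com/list_1.html"
    print(crawl(start, filter_start_request=False))
    print(crawl(start, filter_start_request=True))
```

In this model the unfiltered start request makes the first page get fetched twice, because page 1 links to itself and its URL is not yet in the seen set. Filtering the start request lets every page be fetched exactly once.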