Pagination in Scrapy (handling page turns)
2022-07-06 01:07:00 【Keep a low profile】
import scrapy
from bs4 import BeautifulSoup


class BookSpiderSpider(scrapy.Spider):
    name = 'book_spider'
    allowed_domains = ['17k.com']
    start_urls = ['https://www.17k.com/all/book/2_0_0_0_0_0_0_0_1.html']

    # start_requests is overridden on purpose; the reason is explained in the
    # comment on brute-force pagination further down.
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse
            )
    def parse(self, response, **kwargs):
        print(response.url)
        soup = BeautifulSoup(response.text, 'lxml')
        # [1:] skips the first row (presumably the table header).
        trs = soup.find('div', attrs={'class': 'alltable'}).find('tbody').find_all('tr')[1:]
        for tr in trs:
            book_type = tr.find('td', attrs={'class': 'td2'}).find('a').text
            book_name = tr.find('td', attrs={'class': 'td3'}).find('a').text
            book_words = tr.find('td', attrs={'class': 'td5'}).text
            book_author = tr.find('td', attrs={'class': 'td6'}).find('a').text
            print(book_type, book_name, book_words, book_author)
            # Debugging aid: stop after the first row. Remove the break to print every book.
            break
        # The same extraction written with XPath instead of BeautifulSoup:
        # trs = response.xpath("//div[@class='alltable']/table/tbody/tr")[1:]
        # for tr in trs:
        #     book_type = tr.xpath("./td[2]/a/text()").extract_first()
        #     book_name = tr.xpath("./td[3]/span/a/text()").extract_first()
        #     book_words = tr.xpath("./td[5]/text()").extract_first()
        #     book_author = tr.xpath("./td[6]/a/text()").extract_first()
        #     print(book_type, book_name, book_words, book_author)
        # 1. Linear pagination: find the URL of the next page and request it.
        #    This is the simplest kind of pagination logic; the spider walks the
        #    pages one after another.
        #    The link text on the site is presumably '下一页' ("next page").
        # next_page_url = soup.find('a', string='下一页')['href']
        # if 'javascript' not in next_page_url:
        #     yield scrapy.Request(
        #         url=response.urljoin(next_page_url),
        #         method='get',
        #         callback=self.parse
        #     )
        # 2. Brute-force pagination: collect every URL in the pager and request
        #    them all. Each request goes through the engine to the scheduler (a set
        #    plus a queue), which drops duplicates, then on to the downloader, and
        #    the response comes back to the spider. The start URL would still be
        #    fetched twice, because the start_requests inherited from scrapy.Spider
        #    creates its requests with dont_filter=True. Overriding start_requests
        #    (above) so that the default filtering applies fixes that duplication.
        a_list = soup.find('div', attrs={'class': 'page'}).find_all('a')
        for a in a_list:
            # Skip anchors without an href as well as javascript: pseudo-links.
            href = a.get('href', '')
            if href and 'javascript' not in href:
                yield scrapy.Request(
                    url=response.urljoin(href),
                    method='get',
                    callback=self.parse
                )
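
The duplicate start URL described in the comment above can be reproduced without Scrapy. The following is only a minimal sketch of the idea, not Scrapy's actual scheduler: `ToyScheduler`, `crawl`, and the three-page site are hypothetical names made up for illustration, and raw URL strings stand in for the request fingerprints that Scrapy really compares. It assumes that a request sent with `dont_filter=True` bypasses the duplicate filter and is therefore never recorded as seen.

```python
from collections import deque


class ToyScheduler:
    """Toy stand-in for a crawler scheduler: a 'seen' set plus a FIFO queue.

    Hypothetical class for illustration only; it is not Scrapy's Scheduler.
    """

    def __init__(self):
        self.seen = set()     # URLs already recorded by the duplicate filter
        self.queue = deque()  # requests waiting to be downloaded

    def enqueue(self, url, dont_filter=False):
        # With dont_filter=True the duplicate filter is skipped entirely,
        # so the URL is queued but never recorded in `seen`.
        if not dont_filter:
            if url in self.seen:
                return False  # duplicate: dropped
            self.seen.add(url)
        self.queue.append(url)
        return True


def crawl(start_url, links_by_page, start_dont_filter):
    """Return the URLs in the order they would be downloaded."""
    scheduler = ToyScheduler()
    scheduler.enqueue(start_url, dont_filter=start_dont_filter)
    downloaded = []
    while scheduler.queue:
        url = scheduler.queue.popleft()
        downloaded.append(url)
        # Every pager link found on the page is requested with the default filter.
        for link in links_by_page.get(url, []):
            scheduler.enqueue(link)
    return downloaded


# Hypothetical three-page site: each page's pager links to all three pages.
pages = ["page1", "page2", "page3"]
links_by_page = {page: pages for page in pages}

# Start request sent the way the inherited start_requests does it.
unfiltered_start = crawl("page1", links_by_page, start_dont_filter=True)
# Start request sent the way the overridden start_requests does it.
filtered_start = crawl("page1", links_by_page, start_dont_filter=False)

print(unfiltered_start)
print(filtered_start)
```

In the first run the start page is downloaded twice, because its first request was never recorded and the pager link to it looks new. In the second run the start request passes through the filter, so the later pager link to the same page is dropped.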