
Pagination in Scrapy (handling page turning)

2022-07-06 01:07:00 Keep a low profile

import scrapy
from bs4 import BeautifulSoup


class BookSpiderSpider(scrapy.Spider):
    name = 'book_spider'
    allowed_domains = ['17k.com']
    start_urls = ['https://www.17k.com/all/book/2_0_0_0_0_0_0_0_1.html']

	"""  This will be explained later start_requests Is the method swollen or fat  """

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse
            )

    def parse(self, response, **kwargs):
        print(response.url)
        soup = BeautifulSoup(response.text, 'lxml')
        trs = soup.find('div', attrs={'class': 'alltable'}).find('tbody').find_all('tr')[1:]
        for tr in trs:
            book_type = tr.find('td', attrs={'class': 'td2'}).find('a').text
            book_name = tr.find('td', attrs={'class': 'td3'}).find('a').text
            book_words = tr.find('td', attrs={'class': 'td5'}).text
            book_author = tr.find('td', attrs={'class': 'td6'}).find('a').text
            print(book_type, book_name, book_words, book_author)
            break  # only inspect the first row while testing the parser
        """  This is xpath The way of parsing  """
        # trs = response.xpath("//div[@class='alltable']/table/tbody/tr")[1:]
        # for tr in trs:
        # type = tr.xpath("./td[2]/a/text()").extract_first()
        # name = tr.xpath("./td[3]/span/a/text()").extract_first()
        # words = tr.xpath("./td[5]/text()").extract_first()
        # author = tr.xpath("./td[6]/a/text()").extract_first()
        # print(type, name, words, author)

        
        """ 1  Find... On the next page url, Request to the next page   Pagination logic   This logic is the simplest kind   Keep climbing linearly  """
        # next_page_url = soup.find('a', text=' The next page ')['href']
        # if 'javascript' not in next_page_url:
        # yield scrapy.Request(
        # url=response.urljoin(next_page_url),
        # method='get',
        # callback=self.parse
        # )

        # 2. "Brute force" pagination logic: grab every url from the pager and request them all.
        #    Each request passes through the engine and the scheduler (a queue plus a set of
        #    request fingerprints used for de-duplication), goes to the downloader, and the
        #    response comes back to the spider. The catch: the start_url would be requested
        #    twice, because the parent Spider class's start_requests sends it with
        #    dont_filter=True, which bypasses the duplicate filter. That is why start_requests
        #    is overridden above -- our requests use the default filtering, so the duplicate
        #    start_url is dropped. (See the sketch after the spider for what the parent does.)
        a_list = soup.find('div', attrs={'class': 'page'}).find_all('a')
        for a in a_list:
            if 'javascript' not in a['href']:
                yield scrapy.Request(
                    url=response.urljoin(a['href']),
                    method='get',
                    callback=self.parse
                )
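
For context, the snippet below is a minimal sketch of what the inherited Spider.start_requests roughly does by default (an illustration based on recent Scrapy versions, not the library's exact source): every start_url is sent with dont_filter=True, which is why it would otherwise be crawled a second time when the pager links back to page 1.

import scrapy


class DefaultStartRequestsSketch(scrapy.Spider):
    # Illustrative sketch only: shows the dont_filter=True behaviour the parent class
    # applies to start_urls, which the spider above avoids by overriding start_requests.
    name = 'default_start_requests_sketch'
    start_urls = ['https://www.17k.com/all/book/2_0_0_0_0_0_0_0_1.html']

    def start_requests(self):
        for url in self.start_urls:
            # dont_filter=True skips the scheduler's duplicate filter, so if a later
            # callback yields this same url again it is downloaded a second time.
            yield scrapy.Request(url, dont_filter=True)

    def parse(self, response, **kwargs):
        print(response.url)

With the overridden start_requests in the spider above, the start_url goes through the duplicate filter like every other request, so the extra copy produced by the brute-force pager is simply dropped (Scrapy logs it as a filtered duplicate request). Either spider is run the usual way, e.g. scrapy crawl book_spider.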

Copyright notice
This article was written by [Keep a low profile]. If you repost it, please include a link to the original. Thanks.
https://yzsam.com/2022/02/202202140145080398.html