Asynchronous coroutines: scrape an entire novel in half a second
2022-07-25 09:26:00 【Ride Hago to travel】
A while ago I heard a classmate say he wanted to crawl hundreds of thousands of records from some website for data analysis, but it was painfully slow and he was getting worried about it... The usual ways to speed a crawler up are multithreading, multiprocessing, and asynchronous coroutines; what this post covers is speeding up a crawler with asynchronous coroutines.
Crawlers are slow mostly because the program blocks while waiting on IO; the typical culprits are network IO and disk IO. Take network IO: if you send requests with requests and the site responds slowly, the program just sits there waiting for the response, and the crawler ends up extremely inefficient!
So what is an asynchronous crawler?
Roughly speaking: whenever the program detects that a task is blocked on IO, it automatically switches to another task, so the time spent idling on IO is kept to a minimum and there is always ready work to run. From the operating system's point of view the process spends little time blocked on IO, so it keeps getting CPU time, and overall execution efficiency goes up.
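To make that concrete, here is a minimal sketch of my own (not from the original post) contrasting a blocking sleep with an awaitable one; the function names are made up for illustration:

import asyncio
import time

def blocking_job():
    time.sleep(1)            # blocks the whole thread; nothing else can run meanwhile

async def async_job():
    await asyncio.sleep(1)   # suspends only this coroutine; the event loop runs other tasks

async def main():
    # three blocking_job() calls in a row would take about 3 s;
    # three async_job() coroutines awaited together finish in about 1 s
    await asyncio.gather(async_job(), async_job(), async_job())

asyncio.run(main())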
1. Coroutines
import asyncio
import time

async def func1():                 # async def defines a coroutine
    print('Lao Wang next door!')
    await asyncio.sleep(3)         # simulate a blocking IO operation; await suspends this coroutine
    print('Lao Wang next door')

async def func2():                 # async def defines a coroutine
    print('Siberian Husky')
    await asyncio.sleep(2)         # simulate a blocking IO operation; await suspends this coroutine
    print('Siberian Husky')

async def func3():                 # async def defines a coroutine
    print('Alaskan Malamute')
    await asyncio.sleep(1)         # simulate a blocking IO operation; await suspends this coroutine
    print('Alaskan Malamute')

async def main():
    tasks = [                      # tasks wrap the coroutine objects and track each task's state
        asyncio.create_task(func1()),
        asyncio.create_task(func2()),
        asyncio.create_task(func3()),
    ]
    await asyncio.wait(tasks)

if __name__ == '__main__':
    start_time = time.time()
    asyncio.run(main())            # run several tasks (coroutines) at once
    print('Total time: \033[31;1m%s\033[0m s' % (time.time() - start_time))
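If the three tasks really do run concurrently, the elapsed time printed at the end should be roughly 3 s (the longest single sleep) rather than the 6 s a purely sequential version would take.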

2. Asynchronous requests with aiohttp and asynchronous file writes with aiofiles
Grab a few image URLs off the web as an exercise:
import aiohttp
import aiofiles
import asyncio
import os

if not os.path.exists('./Bantu'):
    os.mkdir('./Bantu')

async def umei_picture_download(url):
    name = url.split('/')[-1]
    picture_path = './Bantu/' + name
    async with aiohttp.ClientSession() as session:           # aiohttp.ClientSession() plays the role of requests
        async with session.get(url) as resp:                  # session.get() / session.post() send the request asynchronously
            # resp.content.read() reads binary data (video, images, ...); resp.text() reads text; resp.json() reads JSON
            down_pict = await resp.content.read()             # resp.content.read() corresponds to requests.get(xxx).content
            async with aiofiles.open(picture_path, 'wb') as f:    # aiofiles.open() opens the file asynchronously
                await f.write(down_pict)                      # writing is asynchronous too, so it must be awaited
            print('Image downloaded!')

async def main():
    tasks = []
    for url in urls:
        tasks.append(asyncio.create_task(umei_picture_download(url)))
    await asyncio.wait(tasks)

if __name__ == '__main__':
    urls = [
        'https://tenfei02.cfp.cn/creative/vcg/veer/1600water/veer-158109176.jpg',
        'https://alifei03.cfp.cn/creative/vcg/veer/1600water/veer-151526132.jpg',
        'https://tenfei05.cfp.cn/creative/vcg/veer/1600water/veer-141027139.jpg',
        'https://tenfei03.cfp.cn/creative/vcg/veer/1600water/veer-132395407.jpg'
    ]
    # asyncio.run(main())
    loop = asyncio.get_event_loop()      # get_event_loop() returns (creating if necessary) an event loop
    loop.run_until_complete(main())      # run_until_complete() registers the coroutine on the loop and runs it to completion
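A practical note of my own, not from the original post: if you point this pattern at hundreds of thousands of URLs, firing every request at once will exhaust sockets and can get you banned. A common remedy is to cap concurrency with asyncio.Semaphore; here is a minimal sketch, assuming a urls list like the one above:

import asyncio
import aiohttp

async def fetch(session, sem, url):
    async with sem:                              # wait for a free slot before sending the request
        async with session.get(url) as resp:
            return await resp.content.read()

async def main(urls):
    sem = asyncio.Semaphore(10)                  # at most 10 requests in flight at once
    async with aiohttp.ClientSession() as session:
        data = await asyncio.gather(*(fetch(session, sem, u) for u in urls))
        print('downloaded', len(data), 'responses')

# asyncio.run(main(urls))

Ten concurrent requests is an arbitrary number here; tune it to whatever the target site tolerates.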

3. Scraping a whole novel in half a second with asynchronous coroutines
The target this time is a novel on a certain 'Du' site; the URL is: http://dushu.baidu.com/pc/detail?gid=4308271440
1. A quick look at the page
Copy the URL into a browser and open the page. Only a handful of chapter titles are displayed at first; when you click 'view all', the URL does not change, yet every chapter title gets loaded.
So the first guess is that the chapter list is loaded dynamically via AJAX!
Press F12 to open the developer tools, switch to the XHR filter under the Network tab, then click the 'view all' button, as shown below:

A request gets captured; open it and have a look:
OK, the chapter titles are here! Next, the content of each chapter.
Click any chapter name to open its detail page and watch the requests the browser captures:
Besides the first request we just saw, look carefully through the other three and the clue is right there:
The chapter content has been found!!!
2. Writing the code
Before starting, pull out the URL for the chapter list and the URL for a chapter detail page and compare them.
URL for the chapter list:
# http://dushu.baidu.com/api/pc/getCatalog?data={%22book_id%22:%224308271440%22}
URL for a chapter detail page:
# http://dushu.baidu.com/api/pc/getChapterContent?data={%22book_id%22:%224308271440%22,%22cid%22:%224308271440|20222925%22,%22need_bookinfo%22:1}
Hmm, that is the 'Du' site for you! The URLs look a bit messy, but they still can't escape being taken apart:
As you can see, both URLs carry book_id, so pull book_id out into its own variable.
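A side note of my own (not part of the original code): the data query parameter is just a URL-encoded JSON object, with %22 being an encoded double quote, so the URLs can also be built with json.dumps plus urllib.parse.quote instead of string concatenation. A small sketch; the helper names are mine, and the fully percent-encoded form should decode to the same JSON on the server:

import json
from urllib.parse import quote

def catalog_url(book_id):
    payload = json.dumps({"book_id": book_id}, separators=(',', ':'))
    return 'http://dushu.baidu.com/api/pc/getCatalog?data=' + quote(payload)

def chapter_url(book_id, cid):
    payload = json.dumps({"book_id": book_id,
                          "cid": f"{book_id}|{cid}",
                          "need_bookinfo": 1},
                         separators=(',', ':'))
    return 'http://dushu.baidu.com/api/pc/getChapterContent?data=' + quote(payload)

print(catalog_url('4308271440'))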
① Start with a basic skeleton
import aiohttp
import asyncio
import requests
import aiofiles
import pprint
import os
import time

def get_chapter_content(url, headers):
    pass

if __name__ == '__main__':
    book_id = '4308271440'
    url = 'http://dushu.baidu.com/api/pc/getCatalog?data={%22book_id%22:%22' + book_id + '%22}'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'
    }
    get_chapter_content(url, headers)
② Get each chapter's title and its cid
First dump the JSON response to see what it contains:
def get_chapter_content(url, headers):
    response = requests.get(url=url, headers=headers).json()
    pprint.pprint(response)
Running it prints the catalog JSON, as shown below:
Now take out the cid and title of each chapter:
def get_chapter_content(url, headers):
    response = requests.get(url=url, headers=headers).json()
    # pprint.pprint(response)
    chapter_details = response['data']['novel']['items']   # list of items holding each chapter's cid and title
    for chapter in chapter_details:
        chapter_cid = chapter['cid']         # the chapter's cid
        chapter_title = chapter['title']     # the chapter's title
        print(chapter_cid, chapter_title)
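For reference, the key lookups above imply a catalog response of roughly the following shape; this is an illustrative guess based on the keys the code uses, not the site's documented schema:

{
    'data': {
        'novel': {
            'items': [
                {'cid': '...', 'title': 'Chapter 1 ...'},
                {'cid': '...', 'title': 'Chapter 2 ...'},
            ]
        }
    }
}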

③ Make it asynchronous
async def get_chapter_content(url, headers):
    response = requests.get(url=url, headers=headers).json()
    # pprint.pprint(response)
    tasks = []
    chapter_details = response['data']['novel']['items']   # list of items holding each chapter's cid and title
    for chapter in chapter_details:
        chapter_cid = chapter['cid']         # the chapter's cid
        chapter_title = chapter['title']     # the chapter's title
        # print(chapter_cid, chapter_title)
        tasks.append(asyncio.create_task(aio_download_novel(headers, chapter_cid, chapter_title, book_id)))
    await asyncio.wait(tasks)

async def aio_download_novel(headers, chapter_cid, chapter_title, book_id):
    pass

if __name__ == '__main__':
    book_id = '4308271440'
    url = 'http://dushu.baidu.com/api/pc/getCatalog?data={%22book_id%22:%22' + book_id + '%22}'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'
    }
    asyncio.run(get_chapter_content(url, headers))
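One caveat worth mentioning (my addition, not in the original): the catalog itself is still fetched with the blocking requests.get inside an async function, which stalls the event loop for that single request. It is harmless here because it happens once, before any tasks are created, but if you want the whole pipeline non-blocking you could fetch the catalog with aiohttp as well. A minimal sketch:

async def get_chapter_content(url, headers):
    # fetch the catalog asynchronously instead of with requests.get
    async with aiohttp.ClientSession() as session:
        async with session.get(url, headers=headers) as resp:
            response = await resp.json(content_type=None)   # content_type=None skips the MIME-type check
    # ...then build the task list from response exactly as before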
④ Fetch the chapter content
async def aio_download_novel(headers, chapter_cid, chapter_title, book_id):
    details_url = 'http://dushu.baidu.com/api/pc/getChapterContent?data={%22book_id%22:%22' + book_id + '%22,%22cid%22:%22' + book_id + '|' + chapter_cid + '%22,%22need_bookinfo%22:1}'
    # print(details_url)
    async with aiohttp.ClientSession() as session:
        async with session.get(url=details_url, headers=headers) as response:
            content = await response.json()
            # pprint.pprint(content)
            details_content = content['data']['novel']['content']
            print(details_content)

⑤ Persistent storage
import aiohttp
import asyncio
import requests
import aiofiles
import pprint
import os
import time

if not os.path.exists("./Don't make cannon fodder, sister"):     # folder named after the novel
    os.mkdir("./Don't make cannon fodder, sister")

async def get_chapter_content(url, headers):
    response = requests.get(url=url, headers=headers).json()
    # pprint.pprint(response)
    tasks = []
    chapter_details = response['data']['novel']['items']   # list of items holding each chapter's cid and title
    for chapter in chapter_details:
        chapter_cid = chapter['cid']         # the chapter's cid
        chapter_title = chapter['title']     # the chapter's title
        # print(chapter_cid, chapter_title)
        tasks.append(asyncio.create_task(aio_download_novel(headers, chapter_cid, chapter_title, book_id)))
    await asyncio.wait(tasks)

async def aio_download_novel(headers, chapter_cid, chapter_title, book_id):
    details_url = 'http://dushu.baidu.com/api/pc/getChapterContent?data={%22book_id%22:%22' + book_id + '%22,%22cid%22:%22' + book_id + '|' + chapter_cid + '%22,%22need_bookinfo%22:1}'
    # print(details_url)
    novel_path = "./Don't make cannon fodder, sister/" + chapter_title
    async with aiohttp.ClientSession() as session:
        async with session.get(url=details_url, headers=headers) as response:
            content = await response.json()
            # pprint.pprint(content)
            details_content = content['data']['novel']['content']
            # print(details_content)
            async with aiofiles.open(novel_path, mode='w', encoding='utf-8') as f:
                await f.write(details_content)
            print(chapter_title, '\033[31;1mdownloaded!!!\033[0m')

if __name__ == '__main__':
    book_id = '4308271440'
    url = 'http://dushu.baidu.com/api/pc/getCatalog?data={%22book_id%22:%22' + book_id + '%22}'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'
    }
    start_time = time.time()
    asyncio.run(get_chapter_content(url, headers))
    print('\n')
    print('Total crawl time: \033[31;1m%s\033[0m s' % (time.time() - start_time))
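One more caveat of my own (not in the original post): chapter titles are used directly as file names, so a title containing characters such as / ? * : would make the open call fail, at least on some platforms. A small sketch of a sanitizer you could run the title through first:

import re

def safe_filename(title):
    # replace characters that are illegal in Windows/Unix file names
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip() + '.txt'

# novel_path = "./Don't make cannon fodder, sister/" + safe_filename(chapter_title)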


That's it for this one!