A crawler career from scratch (3): crawling pretty-girl photos ③ (the website has since been taken down)
2022-07-03 09:18:00 【fishfuck】
Preface
Starting with this article, we will spend several installments crawling all of the girl photos on the site (URL: https://imoemei.com/). Through this example, let's learn simple Python crawling.
Please read the previous articles first:
A crawler career from scratch (1): crawling pretty-girl photos ①
A crawler career from scratch (2): crawling pretty-girl photos ②
Approach analysis
1. Page source analysis
After the first two crawls, we have obtained the girl photos under the selfie (zipai) category.
But our crawler still has two problems:
first, the site has more sub-categories than just selfies;
second, in the last article we only fetched the URLs of the first 30 pages, which is still a long way from our goal of crawling the photos of the whole site.
The first problem is solved simply by crawling the sub-category URLs from the home page:
just grab the href of each <a> tag in the home-page menu and we're done.
For the second problem, after flipping through a few pages we find that pagination (e.g. https://imoemei.com/meinv/page/2) is controlled by the number that follows page,
so we can turn pages simply by changing that number; see the sketch below.
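For instance, a throwaway illustration (the category URL is taken from the example above; the page range here is arbitrary):

base = "https://imoemei.com/meinv"
# build page URLs for one sub-category following the /page/<n> pattern
page_urls = ["{}/page/{}".format(base, n) for n in range(2, 6)]
print(page_urls)  # ['https://imoemei.com/meinv/page/2', ...]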
That raises a question, though: each sub-category has a different total number of pages, so how do we control the loop?
Again, use the crawler to fetch the page count.
Just crawl the label tag that holds the page numbers, right?
But it's not that simple.
Crawling for the label turned up nothing, and crawling its parent tag came up empty as well.
After consulting some material, I found that requests only returns the page's original HTML, while the label tags on the target site are loaded dynamically via Ajax.
Fortunately, I found another div block with the page count hidden inside it.
But the count is stored in an attribute, so what now?
That's when I thought of using a regular expression to extract the page count, and it finally worked.
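Here is the idea in miniature, as a sketch on a hard-coded fragment (the attribute value below is made up for illustration; the real div is located by the code further down):

import re

fragment = '<div class="b2-pagenav" pages="30"></div>'  # stand-in for the real pagination div
match = re.search(r'(?<=pages=")\d+', fragment)  # lookbehind: digits right after pages="
if match:
    print(match.group())  # 30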
2. Crawler approach
For problem 1, crawl the sub-category URLs exactly as in the previous article.
For problem 2, take the div block, convert it to a string first, then use a regular expression to extract the page count.
The crawler code
1. Development environment
Development environment: Win10, Python 3.6.8
Tool: PyCharm
Libraries: requests, BeautifulSoup (bs4), and html5lib as third-party packages, plus the standard-library os and re
2. Code walkthrough
(1) Crawling each category URL
import requests
from bs4 import BeautifulSoup

ind = []
target_url = "https://imoemei.com"
r = requests.get(url=target_url)
html = BeautifulSoup(r.text, 'html5lib')
menu = html.find('ul', class_='menu')  # the nav menu holds one <a> per category
indexs = menu.find_all('a')
# print(indexs)
for index in indexs:
    temp = index.get('href')
    ind.append(temp)
del ind[0]  # drop the first menu entry (presumably the link back to the home page)
print(ind)
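A quick sanity check: after the loop, ind should hold one URL per remaining menu entry. Note that del ind[0] assumes the first <a> in the menu points back to the home page rather than to a category; if the menu layout differs, uncomment print(indexs) and adjust.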
(2) Crawling the page count
import requests
import re
from bs4 import BeautifulSoup

target_url = "https://imoemei.com/zipai/"
r = requests.get(url=target_url)
html = BeautifulSoup(r.text, 'html5lib')
b2_gap = html.find('div', class_='b2-pagenav post-nav box mg-t b2-radius')  # this div block contains the page count
b2_gap = str(b2_gap)  # convert it to a string first
regex = '(?<=pages=").[0-9_]*'  # pattern that matches the page count
str_select = re.findall(regex, b2_gap)
print(str_select[0])
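A note on the pattern: (?<=pages=") is a lookbehind, so the match starts at the character immediately after pages=" without including it, and .[0-9_]* then consumes that character plus any following digits. If the attribute value is purely numeric, the tighter (?<=pages=")\d+ does the same job.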
3. The overall code
import requests
import os
import re
from bs4 import BeautifulSoup

# step 1: collect every category URL from the home-page menu
ind = []
target_url = "https://imoemei.com"
r = requests.get(url=target_url)
html = BeautifulSoup(r.text, 'html5lib')
menu = html.find('ul', class_='menu')
indexs = menu.find_all('a')
# print(indexs)
for index in indexs:
    temp = index.get('href')
    ind.append(temp)
del ind[0]  # drop the first menu entry (presumably the home-page link)

for l in range(len(ind)):  # was range(len(ind) + 1), which overruns the list
    target_url = ind[l]
    r = requests.get(url=target_url)
    html = BeautifulSoup(r.text, 'html5lib')
    b2_gap = html.find('div', class_='b2-pagenav post-nav box mg-t b2-radius')  # this div block contains the page count
    b2_gap = str(b2_gap)  # convert it to a string first
    regex = '(?<=pages=").[0-9_]*'  # pattern that matches the page count
    str_select = re.findall(regex, b2_gap)
    # pages run from 1 to the extracted total (the original range(total + 1) started at a nonexistent page 0)
    for v in range(1, int(str_select[0]) + 1):
        # was ind[l] + "//page//" + str(v), which malforms the URL
        target_url = ind[l].rstrip('/') + "/page/" + str(v)
        r = requests.get(url=target_url)
        html = BeautifulSoup(r.text, 'html5lib')
        b2_gap = html.find('ul', class_='b2_gap')
        print('page {} is OK'.format(v))  # was str(1), which printed "1" for every page
        img_main = b2_gap.find_all('a', class_='thumb-link')  # one link per photo set
        img_main_urls = []
        for img in img_main:
            img_main_url = img.get('href')
            img_main_urls.append(img_main_url)
        for j in range(len(img_main_urls)):  # was len(...) + 1, which overruns the list
            print(img_main_urls[j])
            r = requests.get(url=img_main_urls[j])
            html = BeautifulSoup(r.text, 'html5lib')
            entry_content = html.find('div', class_='entry-content')
            img_list = entry_content.find_all('img')
            num = 0
            name = html.find('h1').text  # the photo-set title, used as the file-name prefix
            print(name)
            for img in img_list:
                img_url = img.get('src')
                result = requests.get(img_url).content
                path = 'picture'
                if not os.path.exists(path):
                    os.mkdir(path)
                # a with-block closes the file handle automatically
                with open(path + '/' + name + str(num) + '.jpg', 'wb') as f:
                    f.write(result)
                num += 1
                print('Downloading {}, image {}'.format(name, num))
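The download step can also be factored into a small helper, sketched below; save_image and its parameters are my own additions, not part of the original code. Adding a request timeout and a short pause between downloads makes the crawler more robust and more polite to the server:

import os
import time
import requests

def save_image(img_url, folder, name, num, delay=1.0):
    # hypothetical helper: fetch one image and write it to disk
    os.makedirs(folder, exist_ok=True)  # replaces the manual exists()/mkdir() check
    result = requests.get(img_url, timeout=10).content  # fail fast on a dead link
    with open(os.path.join(folder, '{}{}.jpg'.format(name, num)), 'wb') as f:
        f.write(result)
    time.sleep(delay)  # be polite: pause between downloads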
Crawling results