A Crawler Career from Scratch (I): Crawling Photos of Young Women ① (the website has been disabled)

2022-07-03 09:18:00 【fishfuck】

Table of contents

Preface
The page to be crawled
Analysis
- 1. Page source analysis
- 2. Crawler approach
The crawler code
Crawling results

Preface

Starting with this article, we will spend several posts in a row crawling all the pictures of young women on one site (URL: https://imoemei.com/). We use this example to learn how to write a simple Python crawler.

Related articles

A Crawler Career from Scratch (II): Crawling Photos of Young Women ②
A Crawler Career from Scratch (III): Crawling Photos of Young Women ③

The page to be crawled

[Screenshot of the target page]

Analysis

1. Page source analysis

First, let's look at the source code of the page:

[Screenshot of the page source]

The picture URLs all sit inside a div with the class entry-content. Our goal is therefore to read the src attribute of each img tag under the p tags in that div. Each src is the address of one picture, which we then save to the computer.
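To make the selector concrete, here is a minimal sketch run against a hand-written HTML fragment. The fragment, the example.com URLs, and the choice of Python's built-in html.parser backend are assumptions for illustration; the real page's markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the structure described above;
# the real page's markup may differ.
SAMPLE_HTML = """
<div class="sidebar"><img src="https://example.com/banner.png"></div>
<div class="entry-content">
  <p><img src="https://example.com/a.jpg"></p>
  <p><img src="https://example.com/b.jpg"></p>
</div>
"""

# html.parser ships with Python, so no extra parser package is needed here.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Restrict the search to the entry-content block so unrelated
# images elsewhere on the page (such as the banner) are skipped.
entry_content = soup.find("div", class_="entry-content")
img_urls = [img.get("src") for img in entry_content.find_all("img")]

print(img_urls)
```

Searching inside the div first, rather than calling find_all('img') on the whole page, is what keeps sidebar and banner images out of the result.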

2. Crawler approach

Use requests to fetch the whole page, parse it with BeautifulSoup, extract all the picture links, and finally save each picture to disk.

The crawler code

1. Development environment

Development environment: Windows 10, Python 3.6.8
Tools: PyCharm
Libraries: requests and BeautifulSoup (third party, used here with the html5lib parser); os (standard library)

2. Code walkthrough

(1) Import the libraries

import requests
import os
from bs4 import BeautifulSoup

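(2) Get the address of each picture

# Page to crawl (the site has since been disabled)
target_url = "https://imoemei.com/zipai/6288.html"
r = requests.get(url=target_url)
# html5lib is a separate package (pip install html5lib)
html = BeautifulSoup(r.text, 'html5lib')
# All picture tags live inside <div class="entry-content">
entry_content = html.find('div', class_='entry-content')
img_list = entry_content.find_all('img')
for img in img_list:
    img_url = img.get('src')
    # Raw bytes of the picture, written to disk in step (3)
    result = requests.get(img_url).content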
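The loop above passes each src straight to requests.get, which only works when the src is an absolute URL. As a hedged sketch, relative or protocol-relative paths can be resolved against the page URL with the standard library's urljoin. The helper name resolve_img_url and the sample src values are hypothetical and not taken from the original page.

```python
from urllib.parse import urljoin

PAGE_URL = "https://imoemei.com/zipai/6288.html"


def resolve_img_url(page_url, src):
    """Return an absolute URL for an <img> src, or None if src is empty."""
    if not src:
        return None
    # urljoin leaves absolute URLs untouched and resolves relative ones
    # against the page URL.
    return urljoin(page_url, src.strip())


# Hypothetical src values, for illustration only.
print(resolve_img_url(PAGE_URL, "/uploads/a.jpg"))
print(resolve_img_url(PAGE_URL, "https://cdn.example.com/b.jpg"))
print(resolve_img_url(PAGE_URL, "//cdn.example.com/c.jpg"))
```

Returning None for a missing src lets the caller skip img tags that carry no usable address instead of crashing inside requests.get.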

(3) Save the picture to the specified folder

num = 0
name = html.find('h1').text
path = 'picture'
if not os.path.exists(path):
    os.mkdir(path)
# The lines below run inside the for loop from step (2), once per picture
with open(path + '/' + name + str(num) + '.jpg', 'wb') as f:
    f.write(result)
num += 1
print('Downloading {}: picture {}'.format(name, num))
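The file name above is built directly from the page's h1 text. On Windows, a title containing characters such as : or ? would make open() fail, so a small sanitizing step can help. The following is a minimal sketch under that assumption; the helper names safe_filename and save_bytes are hypothetical, and the demo writes placeholder bytes to a temporary folder rather than downloading anything.

```python
import os
import re
import tempfile


def safe_filename(title, index, ext=".jpg"):
    """Build a file name from a page title, replacing characters Windows forbids."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return "{}{}{}".format(cleaned, index, ext)


def save_bytes(folder, filename, data):
    """Write data to folder/filename, creating the folder if it is missing."""
    os.makedirs(folder, exist_ok=True)
    file_path = os.path.join(folder, filename)
    # The with block closes the file even if the write raises an error.
    with open(file_path, "wb") as f:
        f.write(data)
    return file_path


# Demo with placeholder bytes in a temporary folder; no network access.
demo_dir = os.path.join(tempfile.mkdtemp(), "picture")
saved = save_bytes(demo_dir, safe_filename('Title: "demo"?', 0), b"fake image bytes")
print(saved)
```

In the crawler, result from step (2) would take the place of the placeholder bytes, and name and num would be passed to safe_filename.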

3. The complete code

import os

import requests
from bs4 import BeautifulSoup

target_url = "https://imoemei.com/zipai/6288.html"
r = requests.get(url=target_url)
html = BeautifulSoup(r.text, 'html5lib')
entry_content = html.find('div', class_='entry-content')
img_list = entry_content.find_all('img')
name = html.find('h1').text

# Create the output folder once, before the download loop
path = 'picture'
if not os.path.exists(path):
    os.mkdir(path)

num = 0
for img in img_list:
    img_url = img.get('src')
    result = requests.get(img_url).content

    # The with block closes the file once the bytes are written
    with open(path + '/' + name + str(num) + '.jpg', 'wb') as f:
        f.write(result)
    num += 1
    print('Downloading {}: picture {}'.format(name, num))
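The complete code saves every file with a .jpg suffix regardless of what the URL points to. If the site mixes formats, one option is to take the suffix from the URL path instead. This is a rough sketch, not part of the original crawler; guess_extension is a hypothetical helper and the example.com URLs are placeholders.

```python
import os
from urllib.parse import urlparse


def guess_extension(img_url, default=".jpg"):
    """Return the file extension found in the URL path, or default if none."""
    # urlparse(...).path drops the query string and fragment,
    # so "?x=1" does not end up in the extension.
    url_path = urlparse(img_url).path
    ext = os.path.splitext(url_path)[1].lower()
    return ext if ext else default


print(guess_extension("https://example.com/a/b.PNG?x=1"))
print(guess_extension("https://example.com/pic.webp"))
print(guess_extension("https://example.com/img"))
```

The result could replace the literal '.jpg' when building the file name; URLs without any suffix fall back to the default.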