A Crawler Career from Scratch (II): Crawling Pictures of Cute Girls ② (the site is no longer online)
2022-07-03 09:18:00 【fishfuck】
Preface
Starting with this article, over the next few posts we will crawl all the girl pictures from https://imoemei.com/. Through this example, let's learn simple Python web crawling.
Related articles:
A Crawler Career from Scratch (I): Crawling Pictures of Cute Girls ①
A Crawler Career from Scratch (III): Crawling Pictures of Cute Girls ③
Approach
1. Page source analysis
Last time we crawled all the pictures on a single page, so now we just need to collect each detail page's URL and then crawl every page the same way.
Let's do it!
First, let's check the page's source code.

We find a URL; let's open it and have a look.

It turns out to be just the cover image... Looking closer, the picture we just found is the cover! Woohoo!
Observing the code of the whole page:
The URLs of all the detail pages sit inside these li blocks.
So we just need to extract each page's URL and we're done!
2. Crawler approach
Use requests to fetch the whole listing page, parse it with BeautifulSoup, extract all the detail-page links, then iterate over those links and save the pictures using the method from the previous article.
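The link-extraction step can be tried offline on a small HTML fragment that mimics the listing page's structure (the fragment and its URLs below are made up for illustration; html.parser is used instead of html5lib to avoid an extra dependency):

```python
from bs4 import BeautifulSoup

# Made-up fragment with the same structure the listing page uses:
# detail-page links live in <a class="thumb-link"> inside <ul class="b2_gap">.
sample = """
<ul class="b2_gap">
  <li><a class="thumb-link" href="https://imoemei.com/post/1.html">post 1</a></li>
  <li><a class="thumb-link" href="https://imoemei.com/post/2.html">post 2</a></li>
</ul>
"""

soup = BeautifulSoup(sample, 'html.parser')
b2_gap = soup.find('ul', class_='b2_gap')
links = [a.get('href') for a in b2_gap.find_all('a', class_='thumb-link')]
print(links)
```

This is exactly the find / find_all pattern the full code below uses, just without the network round trip.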
The crawler code
1. Development environment
Development environment: Windows 10, Python 3.6.8
IDE: PyCharm
Third-party libraries: requests, BeautifulSoup (from bs4), html5lib; os comes from the standard library
2. Code decomposition
(1). Import libraries
import requests
import os
from bs4 import BeautifulSoup
(2) Get the address of each page
# Fetch the listing page and collect the URL of every detail page
target_url = "https://imoemei.com/zipai/"
r = requests.get(url=target_url)
html = BeautifulSoup(r.text, 'html5lib')
b2_gap = html.find('ul', class_='b2_gap')
print("page 1 is OK")
img_main = b2_gap.find_all('a', class_='thumb-link')
img_main_urls = []
for img in img_main:
    img_main_url = img.get('href')
    img_main_urls.append(img_main_url)
(3). Get the address of each picture
# Visit each detail page and collect its pictures
for j in range(len(img_main_urls)):  # not len(...) + 1, which would overrun the list
    print(img_main_urls[j])
    r = requests.get(url=img_main_urls[j])
    html = BeautifulSoup(r.text, 'html5lib')
    entry_content = html.find('div', class_='entry-content')
    img_list = entry_content.find_all('img')
    num = 0
    name = html.find('h1').text
    print(name)
    for img in img_list:
        img_url = img.get('src')
        result = requests.get(img_url).content
(4). Save the picture to the specified folder
        # Still inside the inner loop above: save each downloaded picture
        path = 'pictures'
        if not os.path.exists(path):
            os.mkdir(path)
        with open(path + '/' + name + str(num) + '.jpg', 'wb') as f:
            f.write(result)
        num += 1
        print('Downloading {}: picture {}'.format(name, num))
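One caveat when saving on Windows: the post title goes straight into the filename, and characters like `?`, `:` or `/` are illegal there. A small helper (not part of the original code, suggested here as an addition) can strip them first:

```python
import re

def safe_filename(title, num):
    """Replace characters Windows forbids in filenames, then append the index."""
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', title).strip()
    return '{}{}.jpg'.format(cleaned, num)

print(safe_filename('cute girl: set 3?', 0))
```

In the loop above you would then write `open(path + '/' + safe_filename(name, num), 'wb')` instead of concatenating the raw title.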
3. The overall code
import requests
import os
from bs4 import BeautifulSoup

# Fetch the listing page and collect the URL of every detail page
target_url = "https://imoemei.com/zipai/"
r = requests.get(url=target_url)
html = BeautifulSoup(r.text, 'html5lib')
b2_gap = html.find('ul', class_='b2_gap')
print("page 1 is OK")
img_main = b2_gap.find_all('a', class_='thumb-link')
img_main_urls = []
for img in img_main:
    img_main_url = img.get('href')
    img_main_urls.append(img_main_url)

# Visit each detail page and download its pictures
for j in range(len(img_main_urls)):  # not len(...) + 1, which would raise IndexError
    print(img_main_urls[j])
    r = requests.get(url=img_main_urls[j])
    html = BeautifulSoup(r.text, 'html5lib')
    entry_content = html.find('div', class_='entry-content')
    img_list = entry_content.find_all('img')
    num = 0
    name = html.find('h1').text
    print(name)
    for img in img_list:
        img_url = img.get('src')
        result = requests.get(img_url).content
        # Save the picture to the target folder
        path = 'pictures'
        if not os.path.exists(path):
            os.mkdir(path)
        with open(path + '/' + name + str(num) + '.jpg', 'wb') as f:
            f.write(result)
        num += 1
        print('Downloading {}: picture {}'.format(name, num))
Crawling results


This time we only crawled the first listing page of the selfie (zipai) category; in the next article we'll crawl every page of every category. Stay tuned.
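As a preview of crawling multiple listing pages: WordPress-style sites often paginate as `/zipai/page/2/`, `/zipai/page/3/`, and so on. That exact pattern is an assumption here (the site is gone, so it can't be verified), but the page URLs could be built like this:

```python
from urllib.parse import urljoin

base = "https://imoemei.com/zipai/"
# Assumed WordPress-style pagination; page 1 is the base URL itself.
page_urls = [base] + [urljoin(base, 'page/{}/'.format(n)) for n in range(2, 4)]
print(page_urls)
```

Each generated URL would then be fed through the same listing-page logic as above.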