Crawler career from scratch (II): crawling photos of young women ② (the website has since been disabled)
2022-07-03 09:18:00 【fishfuck】
Preface
Starting with this article, we will spend several articles crawling all the pictures on one site (URL: https://imoemei.com/). We use this example to learn how to write a simple Python crawler.
Related articles:
Crawler career from scratch (I): crawling photos of young women ①
Crawler career from scratch (III): crawling photos of young women ③
Approach
1. Page source analysis
Last time we downloaded all the pictures on a single page, so now we only need to get the URL of each page and then crawl every page in turn.
Let's get to it!
First, let's inspect the source code of the listing page.
Find a URL and open it to have a look.
It turns out to be just the cover image... Looking again, the link to the page is right there on the cover. Woohoo!
Now look at the code of the whole listing page.
The URLs of all the pages sit inside these li blocks.
So we just need to take out each page's URL and we are done!
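The structure described above can be tried offline. The following is a minimal sketch, assuming markup shaped like the list block the article describes. The HTML snippet, the example.com URLs, and the `ThumbLinkParser` class are hypothetical, and the sketch uses Python's standard-library `html.parser` instead of the BeautifulSoup calls used later in this article, purely so that it runs without a network connection or extra packages. Only the class names `b2_gap` and `thumb-link` come from the crawler code.

```python
from html.parser import HTMLParser

# Hypothetical markup imitating the list block described above; the class
# names b2_gap and thumb-link are the ones the crawler code looks for.
SAMPLE_HTML = """
<ul class="b2_gap">
  <li><a class="thumb-link" href="https://example.com/post/1.html"><img src="cover1.jpg"></a></li>
  <li><a class="thumb-link" href="https://example.com/post/2.html"><img src="cover2.jpg"></a></li>
</ul>
<a class="other" href="https://example.com/about.html">About</a>
"""


class ThumbLinkParser(HTMLParser):
    """Collect the href of every <a class="thumb-link"> inside <ul class="b2_gap">."""

    def __init__(self):
        super().__init__()
        self.in_list = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "ul" and "b2_gap" in classes:
            self.in_list = True
        elif tag == "a" and self.in_list and "thumb-link" in classes:
            href = attrs.get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        # Simplification: assumes the target list contains no nested <ul>.
        if tag == "ul":
            self.in_list = False


parser = ThumbLinkParser()
parser.feed(SAMPLE_HTML)
page_links = parser.links
print(page_links)
```

The link outside the list is ignored, which mirrors why the crawler first narrows the search to the `ul` block before collecting `a` tags.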
2. Crawler approach
Use requests to fetch the whole listing page, parse it with BeautifulSoup, and take out all the page links. Then loop over the links and save the pictures on each page with the method from the previous article.
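The two-level traversal in this plan can be sketched independently of any particular site. This is only an illustration under stated assumptions: the `crawl` function, its parameter names, and the `FAKE_SITE` data are hypothetical and not part of the article's code. Fetching and parsing are passed in as functions so the loop structure can be run without a network connection.

```python
def crawl(list_url, fetch, extract_page_links, extract_image_urls):
    """Return {page_url: [image_url, ...]} for every page linked from list_url."""
    results = {}
    listing_html = fetch(list_url)
    for page_url in extract_page_links(listing_html):
        page_html = fetch(page_url)
        results[page_url] = extract_image_urls(page_html)
    return results


# Stand-in data so the sketch runs offline: each "page" is just a
# whitespace-separated list of the URLs it contains.
FAKE_SITE = {
    "list": "page-a page-b",
    "page-a": "a1.jpg a2.jpg",
    "page-b": "b1.jpg",
}

found = crawl(
    "list",
    fetch=FAKE_SITE.__getitem__,
    extract_page_links=str.split,
    extract_image_urls=str.split,
)
print(found)
```

In a real crawler, `fetch` would wrap a call such as `requests.get(url).text`, and the two extractors would wrap the BeautifulSoup lookups shown in the code walkthrough.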
Crawler code
1. Development environment
Development environment: Windows 10, Python 3.6.8
Tools: PyCharm
Libraries: requests and BeautifulSoup (third party, with the html5lib parser), plus os from the standard library
2. Code walkthrough
(1). Import the libraries
import requests
import os
from bs4 import BeautifulSoup
(2). Get the address of each page
target_url = "https://imoemei.com/zipai/"
r = requests.get(url=target_url)
html = BeautifulSoup(r.text, 'html5lib')
# The <ul class="b2_gap"> block holds the links to every page
b2_gap = html.find('ul', class_='b2_gap')
print(str(1) + " page is OK")
# Each cover is wrapped in an <a class="thumb-link"> that points to the page
img_main = b2_gap.find_all('a', class_='thumb-link')
img_main_urls = []
for img in img_main:
    img_main_url = img.get('href')
    img_main_urls.append(img_main_url)
(3). Get the address of each picture
# Visit each page URL collected in step (2)
for j in range(len(img_main_urls)):
    print(img_main_urls[j])
    r = requests.get(url=img_main_urls[j])
    html = BeautifulSoup(r.text, 'html5lib')
    # The pictures live inside <div class="entry-content">
    entry_content = html.find('div', class_='entry-content')
    img_list = entry_content.find_all('img')
    num = 0
    # The page title is used as the base of the file name
    name = html.find('h1').text
    print(name)
    for img in img_list:
        img_url = img.get('src')
        result = requests.get(img_url).content
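The loop above passes every `src` value straight to `requests.get`. If an `img` tag has no `src` attribute, or the value is a relative path, that request will fail. The following is a minimal sketch of one way to guard against this, assuming the page URL is known; `normalize_image_urls` and the example URLs are hypothetical names introduced here for illustration, not part of the original crawler.

```python
from urllib.parse import urljoin


def normalize_image_urls(page_url, srcs):
    """Drop missing src values and resolve the rest against page_url."""
    urls = []
    for src in srcs:
        if not src:  # img.get('src') returns None when the attribute is absent
            continue
        urls.append(urljoin(page_url, src.strip()))
    return urls


example = normalize_image_urls(
    "https://example.com/post/1.html",
    ["https://cdn.example.com/a.jpg", "/uploads/b.jpg", "c.jpg", None, ""],
)
print(example)
```

Absolute URLs pass through unchanged, so this step is harmless when a site already uses full image addresses.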
(4). Save the picture to the specified folder
        # Still inside the inner for loop from step (3)
        path = 'picture'
        if not os.path.exists(path):
            os.mkdir(path)
        # The with block closes the file even if the write fails
        with open(path + '/' + name + str(num) + '.jpg', 'wb') as f:
            f.write(result)
        num += 1
        print('Downloading {}, picture {}'.format(name, num))
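Because the file name is built from the page title, a title containing characters such as `/` or `?` can make `open` fail, especially on Windows. Below is a hedged sketch of a safer save step. The helper names `safe_filename` and `save_bytes` are hypothetical, and the demonstration writes dummy bytes to a temporary directory instead of downloading anything.

```python
import os
import re
import tempfile


def safe_filename(title, index, ext=".jpg"):
    """Build a file name from a page title, replacing characters that are
    invalid in Windows file names and trimming surrounding whitespace."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', "_", title).strip()
    return "{}_{}{}".format(cleaned or "untitled", index, ext)


def save_bytes(folder, filename, data):
    """Create folder if needed and write data to folder/filename."""
    os.makedirs(folder, exist_ok=True)
    target = os.path.join(folder, filename)
    with open(target, "wb") as f:
        f.write(data)
    return target


# Demonstration with dummy bytes in a temporary directory (no download).
demo_dir = os.path.join(tempfile.mkdtemp(), "picture")
saved = save_bytes(demo_dir, safe_filename(" A/B: test? ", 0), b"not a real image")
print(saved)
```

`os.makedirs(..., exist_ok=True)` replaces the separate existence check, and `os.path.join` avoids hard-coding the path separator.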
3. Complete code
import requests
import os
from bs4 import BeautifulSoup

# Step 1: collect the link to every page from the listing page
target_url = "https://imoemei.com/zipai/"
r = requests.get(url=target_url)
html = BeautifulSoup(r.text, 'html5lib')
b2_gap = html.find('ul', class_='b2_gap')
print(str(1) + " page is OK")
img_main = b2_gap.find_all('a', class_='thumb-link')
img_main_urls = []
for img in img_main:
    img_main_url = img.get('href')
    img_main_urls.append(img_main_url)

# Step 2: visit each page and download every picture in its content block
for j in range(len(img_main_urls)):
    print(img_main_urls[j])
    r = requests.get(url=img_main_urls[j])
    html = BeautifulSoup(r.text, 'html5lib')
    entry_content = html.find('div', class_='entry-content')
    img_list = entry_content.find_all('img')
    num = 0
    # The page title is used as the base of the file name
    name = html.find('h1').text
    print(name)
    for img in img_list:
        img_url = img.get('src')
        result = requests.get(img_url).content
        # Step 3: save the picture into the target folder
        path = 'picture'
        if not os.path.exists(path):
            os.mkdir(path)
        with open(path + '/' + name + str(num) + '.jpg', 'wb') as f:
            f.write(result)
        num += 1
        print('Downloading {}, picture {}'.format(name, num))
Crawling results
This time I only crawled the first page of the selfie (zipai) category. In the next article we will crawl all pages of all categories. Stay tuned.