Crawler career from scratch (3): crawl the photos of my little sister ③ (the website has been disabled)
2022-07-03 09:18:00 【fishfuck】
Preface
Starting with this article, we will spend several posts crawling all of the pictures of my little sister on https://imoemei.com/. Through this example, we will learn simple Python crawling.
Please read the previous articles first:
A crawler career from scratch (1): Crawling pictures of my little sister ①
A crawler career from scratch (2): Crawling pictures of my little sister ②
Thought analysis
1. Page source analysis
After the previous two articles, we can already crawl the photos of my little sister under the selfie (zipai) category.
But our crawler still has two problems:
First, the site has more sub-categories than just the selfie one.
Second, in the previous article we only fetched the URLs of the first 30 pages, which is still far from our goal of crawling the photos of the whole site.
For the first problem, we just need to crawl the sub-category URLs from the home page.
Grabbing the href attribute of each a tag in the home-page menu does the job.
For the second problem, after flipping through a few pages we find that pagination (e.g. https://imoemei.com/meinv/page/2) is determined by the number after page.
So as long as we change that number, we can control which page we get.
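To make this concrete, here is a minimal sketch of building those paged URLs by hand (the meinv category URL comes from the example above; the page range is just for illustration):

# Paged URLs differ only in the number after /page/ (illustrative sketch).
base = "https://imoemei.com/meinv"
for n in range(1, 4):
    print(base + "/page/" + str(n))  # .../page/1, .../page/2, .../page/3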
Here comes the question, though: the total page count of each sub-category is different, so how should we control it?
We can use the crawler to get the page count as well.
Just crawl the label tag that shows the page numbers, right?
But it's not that simple.
Crawling label found nothing, and crawling label's parent tag turned up nothing either.
After consulting some material, I found that requests only returns the original HTML page, while this site's label tag is loaded dynamically via Ajax.
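You can confirm this with a quick check like the following minimal sketch (it uses the selfie category URL from later in this article; since the site has been disabled, the request will no longer succeed today):

import requests
from bs4 import BeautifulSoup

r = requests.get("https://imoemei.com/zipai/")
html = BeautifulSoup(r.text, 'html5lib')
print(html.find('label'))  # prints None: the tag is injected later via Ajax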
Fortunately, I found another div block that also contains the page count.
But the page count is stored in an attribute.
What should I do?
So I thought of using a regular expression to extract the page count, and in the end it worked.
2. Crawler approach
For problem 1, crawl the sub-category URLs directly, following the idea of the previous article.
For problem 2, take the div block, convert it to a string first, then extract the page count with a regular expression.
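Here is a minimal sketch of that extraction. The sample string below is made up for illustration and stands in for the real str(b2_gap) output; all that matters is that the markup contains an attribute like pages="30":

import re

sample = '<div class="b2-pagenav" pages="30">...</div>'  # stand-in for str(b2_gap)
pages = re.findall('(?<=pages=")[0-9]+', sample)
print(pages[0])  # -> 30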
The crawler code
1. Development environment
Development environment: Win10, Python 3.6.8
Tool: PyCharm
Libraries used: requests and BeautifulSoup (bs4), plus the standard-library os and re
2. Code breakdown
(1) Crawl each sub-category URL
import requests
from bs4 import BeautifulSoup

ind = []
target_url = "https://imoemei.com"
r = requests.get(url=target_url)
html = BeautifulSoup(r.text, 'html5lib')
menu = html.find('ul', class_='menu')  # the home-page menu holds the sub-category links
indexs = menu.find_all('a')
# print(indexs)
for index in indexs:
    temp = index.get('href')
    ind.append(temp)
del ind[0]  # drop the first menu entry (the home-page link itself)
print(ind)
(2) Crawl the page count
import requests
import re
from bs4 import BeautifulSoup

target_url = "https://imoemei.com/zipai/"
r = requests.get(url=target_url)
html = BeautifulSoup(r.text, 'html5lib')
b2_gap = html.find('div', class_='b2-pagenav post-nav box mg-t b2-radius')  # this div block contains the page count
b2_gap = str(b2_gap)  # convert it to a string first
regex = '(?<=pages=")[0-9]+'  # regex that matches the page count
str_select = re.findall(regex, b2_gap)
print(str_select[0])
3. The overall code
import requests
import os
import re
from bs4 import BeautifulSoup

# Step 1: collect the sub-category URLs from the home-page menu
ind = []
target_url = "https://imoemei.com"
r = requests.get(url=target_url)
html = BeautifulSoup(r.text, 'html5lib')
menu = html.find('ul', class_='menu')
indexs = menu.find_all('a')
for index in indexs:
    temp = index.get('href')
    ind.append(temp)
del ind[0]  # drop the first menu entry (the home-page link itself)

for l in range(len(ind)):
    # Step 2: read the total page count of this sub-category
    target_url = ind[l]
    r = requests.get(url=target_url)
    html = BeautifulSoup(r.text, 'html5lib')
    b2_gap = html.find('div', class_='b2-pagenav post-nav box mg-t b2-radius')  # this div block contains the page count
    b2_gap = str(b2_gap)  # convert it to a string first
    regex = '(?<=pages=")[0-9]+'  # regex that matches the page count
    str_select = re.findall(regex, b2_gap)
    for v in range(1, int(str_select[0]) + 1):
        # Step 3: visit every page (sub-category URLs end with '/', e.g. .../zipai/page/2)
        target_url = ind[l] + 'page/' + str(v)
        r = requests.get(url=target_url)
        html = BeautifulSoup(r.text, 'html5lib')
        b2_gap = html.find('ul', class_='b2_gap')
        print(str(v) + ' page is OK')
        img_main = b2_gap.find_all('a', class_='thumb-link')
        img_main_urls = []
        for img in img_main:
            img_main_url = img.get('href')
            img_main_urls.append(img_main_url)
        for j in range(len(img_main_urls)):
            # Step 4: open each gallery page and download every image in it
            print(img_main_urls[j])
            r = requests.get(url=img_main_urls[j])
            html = BeautifulSoup(r.text, 'html5lib')
            entry_content = html.find('div', class_='entry-content')
            img_list = entry_content.find_all('img')
            num = 0
            name = html.find('h1').text
            print(name)
            for img in img_list:
                img_url = img.get('src')
                result = requests.get(img_url).content
                path = 'picture'
                if not os.path.exists(path):
                    os.mkdir(path)
                with open(path + '/' + name + str(num) + '.jpg', 'wb') as f:
                    f.write(result)
                num += 1
                print('Downloading {}: picture {}'.format(name, num))
Crawling results