A crawler career from scratch (3): crawling pretty-girl photos ③ (the website has since been taken down)
2022-07-03 09:18:00 【fishfuck】
Preface
Starting with this article, we will spend several installments crawling all of the girl photos on the site (URL: https://imoemei.com/). Through this example, let's learn simple Python crawling.
Please read the previous articles first:
A crawler career from scratch (1): crawling pretty-girl photos ①
A crawler career from scratch (2): crawling pretty-girl photos ②
Approach analysis
1. Page source analysis
After the first two crawls, we have obtained the girl photos under the selfie (zipai) category.
But our crawler still has two problems:
first, the site has more sub-categories than just selfies;
second, in the last article we only fetched the URLs of the first 30 pages, which is still a long way from our goal of crawling the photos of the whole site.
The first problem is solved simply by crawling the sub-category URLs from the home page:
just grab the href of each <a> tag in the home-page menu and we're done.
For the second problem, after flipping through a few pages we find that pagination (e.g. https://imoemei.com/meinv/page/2) is controlled by the number that follows page,
so we can turn pages simply by changing that number; see the sketch below.
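For instance, a throwaway illustration (the category URL is taken from the example above; the page range here is arbitrary):

base = "https://imoemei.com/meinv"
# build page URLs for one sub-category following the /page/<n> pattern
page_urls = ["{}/page/{}".format(base, n) for n in range(2, 6)]
print(page_urls)  # ['https://imoemei.com/meinv/page/2', ...]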
That raises a question, though: each sub-category has a different total number of pages, so how do we control the loop?
Again, use the crawler to fetch the page count.
Just crawl the label tag that holds the page numbers, right?
But it's not that simple.
Crawling for the label turned up nothing, and crawling its parent tag came up empty as well.
After consulting some material, I found that requests only returns the page's original HTML, while the label tags on the target site are loaded dynamically via Ajax.
Fortunately, I found another div block with the page count hidden inside it.
But the count is stored in an attribute, so what now?
That's when I thought of using a regular expression to extract the page count, and it finally worked.
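Here is the idea in miniature, as a sketch on a hard-coded fragment (the attribute value below is made up for illustration; the real div is located by the code further down):

import re

fragment = '<div class="b2-pagenav" pages="30"></div>'  # stand-in for the real pagination div
match = re.search(r'(?<=pages=")\d+', fragment)  # lookbehind: digits right after pages="
if match:
    print(match.group())  # 30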
2. Crawler approach
For problem 1, crawl the sub-category URLs exactly as in the previous article.
For problem 2, take the div block, convert it to a string first, then use a regular expression to extract the page count.
The crawler code
1. Development environment
Development environment: Win10, Python 3.6.8
Tool: PyCharm
Libraries: requests, BeautifulSoup (bs4), and html5lib as third-party packages, plus the standard-library os and re
2. Code walkthrough
(1) Crawling each category URL
import requests
from bs4 import BeautifulSoup

ind = []
target_url = "https://imoemei.com"
r = requests.get(url=target_url)
html = BeautifulSoup(r.text, 'html5lib')
menu = html.find('ul', class_='menu')  # the nav menu holds one <a> per category
indexs = menu.find_all('a')
# print(indexs)
for index in indexs:
    temp = index.get('href')
    ind.append(temp)
del ind[0]  # drop the first menu entry (presumably the link back to the home page)
print(ind)
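A quick sanity check: after the loop, ind should hold one URL per remaining menu entry. Note that del ind[0] assumes the first <a> in the menu points back to the home page rather than to a category; if the menu layout differs, uncomment print(indexs) and adjust.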
(2) Crawling the page count
import requests
import re
from bs4 import BeautifulSoup

target_url = "https://imoemei.com/zipai/"
r = requests.get(url=target_url)
html = BeautifulSoup(r.text, 'html5lib')
b2_gap = html.find('div', class_='b2-pagenav post-nav box mg-t b2-radius')  # this div block contains the page count
b2_gap = str(b2_gap)  # convert it to a string first
regex = '(?<=pages=").[0-9_]*'  # pattern that matches the page count
str_select = re.findall(regex, b2_gap)
print(str_select[0])
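A note on the pattern: (?<=pages=") is a lookbehind, so the match starts at the character immediately after pages=" without including it, and .[0-9_]* then consumes that character plus any following digits. If the attribute value is purely numeric, the tighter (?<=pages=")\d+ does the same job.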
3. The overall code
import requests
import os
import re
from bs4 import BeautifulSoup

# step 1: collect every category URL from the home-page menu
ind = []
target_url = "https://imoemei.com"
r = requests.get(url=target_url)
html = BeautifulSoup(r.text, 'html5lib')
menu = html.find('ul', class_='menu')
indexs = menu.find_all('a')
# print(indexs)
for index in indexs:
    temp = index.get('href')
    ind.append(temp)
del ind[0]  # drop the first menu entry (presumably the home-page link)

for l in range(len(ind)):  # was range(len(ind) + 1), which overruns the list
    target_url = ind[l]
    r = requests.get(url=target_url)
    html = BeautifulSoup(r.text, 'html5lib')
    b2_gap = html.find('div', class_='b2-pagenav post-nav box mg-t b2-radius')  # this div block contains the page count
    b2_gap = str(b2_gap)  # convert it to a string first
    regex = '(?<=pages=").[0-9_]*'  # pattern that matches the page count
    str_select = re.findall(regex, b2_gap)
    # pages run from 1 to the extracted total (the original range(total + 1) started at a nonexistent page 0)
    for v in range(1, int(str_select[0]) + 1):
        # was ind[l] + "//page//" + str(v), which malforms the URL
        target_url = ind[l].rstrip('/') + "/page/" + str(v)
        r = requests.get(url=target_url)
        html = BeautifulSoup(r.text, 'html5lib')
        b2_gap = html.find('ul', class_='b2_gap')
        print('page {} is OK'.format(v))  # was str(1), which printed "1" for every page
        img_main = b2_gap.find_all('a', class_='thumb-link')  # one link per photo set
        img_main_urls = []
        for img in img_main:
            img_main_url = img.get('href')
            img_main_urls.append(img_main_url)
        for j in range(len(img_main_urls)):  # was len(...) + 1, which overruns the list
            print(img_main_urls[j])
            r = requests.get(url=img_main_urls[j])
            html = BeautifulSoup(r.text, 'html5lib')
            entry_content = html.find('div', class_='entry-content')
            img_list = entry_content.find_all('img')
            num = 0
            name = html.find('h1').text  # the photo-set title, used as the file-name prefix
            print(name)
            for img in img_list:
                img_url = img.get('src')
                result = requests.get(img_url).content
                path = 'picture'
                if not os.path.exists(path):
                    os.mkdir(path)
                # a with-block closes the file handle automatically
                with open(path + '/' + name + str(num) + '.jpg', 'wb') as f:
                    f.write(result)
                num += 1
                print('Downloading {}, image {}'.format(name, num))
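The download step can also be factored into a small helper, sketched below; save_image and its parameters are my own additions, not part of the original code. Adding a request timeout and a short pause between downloads makes the crawler more robust and more polite to the server:

import os
import time
import requests

def save_image(img_url, folder, name, num, delay=1.0):
    # hypothetical helper: fetch one image and write it to disk
    os.makedirs(folder, exist_ok=True)  # replaces the manual exists()/mkdir() check
    result = requests.get(img_url, timeout=10).content  # fail fast on a dead link
    with open(os.path.join(folder, '{}{}.jpg'.format(name, num)), 'wb') as f:
        f.write(result)
    time.sleep(delay)  # be polite: pause between downloads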
Crawling results