Crawler practice website image batch download
2022-07-04 02:12:00 【Computer Trainee】
Key difficulties
1. URL pattern of the listing pages
# ---------------------------- Hands-on walkthrough -----------------------------
from urllib.request import HTTPHandler, build_opener, Request, urlretrieve, urlopen
from lxml import etree
# First page: https://www.aigei.com/s?dim=cartoon_124_animatio&detailTab=file&type=pic
# Second page: https://www.aigei.com/s?dim=cartoon_124_animatio&detailTab=file&type=pic&page=2
# Third page: https://www.aigei.com/s?dim=cartoon_124_animatio&detailTab=file&type=pic&page=3
base_url = 'https://www.aigei.com/s?dim=cartoon_124_animatio&detailTab=file&type=pic'
Copying the URLs of the first three pages side by side shows what they share and where they differ. The difference follows a rule: the first page has no page parameter, and every later page appends &page=N.
So we first write down the common part as base_url.
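The pattern above can be sketched as a small helper. This is only an illustration: `page_url` is a hypothetical name that does not appear in the original script, and it assumes the `&page=N` rule holds for every page after the first.

```python
BASE_URL = 'https://www.aigei.com/s?dim=cartoon_124_animatio&detailTab=file&type=pic'


def page_url(page):
    """Return the listing URL for a 1-based page number (hypothetical helper)."""
    if page < 1:
        raise ValueError('page numbers start at 1')
    # Page 1 is the bare base URL; later pages append the page parameter.
    return BASE_URL if page == 1 else f'{BASE_URL}&page={page}'


for n in range(1, 4):
    print(page_url(n))
```

Printing pages 1 to 3 should reproduce the three URLs listed in the comments above.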
2. Setting the request headers
# header is a dict; separate each key-value pair with a comma!
# Do not include 'accept-encoding': 'gzip, deflate, br'. With it, decoding the response fails with:
# "'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte"
header = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
'cache-control': 'max-age=0',
'cookie': 'gei_d_u=f207f2256e16481ca4d3d1367f0b17b2; oOO0OO0oOO00oo0o=true; geiweb-v=zZ+S93HA1QdGFy45oCXriWSc5taZy6ttcpWu5ieWhjSvCacb/kEBZke/G4OoYAWt; OooOO000oOOO00o=58c7f4f8054c4989866fc23f7032f13d; SESSION=18b24c86-ed10-4c96-8988-9304bc08c766; SERVERID=31234a7bd8ff50a386f3e53c5f85a5fd|1644824152|1644824122',
'referer': 'https://www.aigei.com/design/',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.76'
}
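All of these values can be copied directly from the Network panel of the browser's developer tools. The fields that are needed are the user-agent, referer, and cookie. The one to leave out is accept-encoding: if it is sent, the server can return a compressed body, and decoding that body as UTF-8 raises the error shown above.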
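The error message quoted above is consistent with a gzip-compressed body: gzip data begins with the bytes 0x1f 0x8b, and 0x8b at position 1 is not a valid UTF-8 start byte. The sketch below reproduces this locally without any network access. `decode_body` is a hypothetical helper and the sample HTML is made up; it handles gzip only, so `deflate` and `br` would need their own handling, which is one reason dropping the header is the simpler option.

```python
import gzip


def decode_body(raw, content_encoding=''):
    """Decode response bytes to text, un-gzipping first if needed (hypothetical helper)."""
    if content_encoding.lower() == 'gzip':
        raw = gzip.decompress(raw)
    return raw.decode('utf-8')


sample_html = '<html><body>示例</body></html>'
compressed = gzip.compress(sample_html.encode('utf-8'))

try:
    # This mirrors calling response.read().decode('utf-8') on a gzipped body.
    compressed.decode('utf-8')
except UnicodeDecodeError as err:
    print('decoding compressed bytes fails:', err)

# Decompressing first recovers the original text.
print(decode_body(compressed, 'gzip') == sample_html)
```

With urllib, the encoding value would likely come from `response.headers.get('Content-Encoding', '')`.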
3. Finding the image URLs (mainly the node path)

Following these steps, the code is:
# Locate the images: find the div by its id, then read the title and src attributes beneath it
name_list = html_content.xpath('//div[@id="resContainer"]//li/@title')
src_list = html_content.xpath('//div[@id="resContainer"]//img/@src')
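To see what these two XPath expressions return, here is a minimal sketch run against made-up markup. The real page structure is only shown in the screenshot, so the sample HTML and the example.com URLs are assumptions for illustration.

```python
from lxml import etree

# Made-up markup that mimics the assumed layout: a container div whose
# list items carry a title attribute and wrap an img element.
sample = '''
<div id="resContainer">
  <ul>
    <li title="cat"><img src="//cdn.example.com/cat.jpg"></li>
    <li title="dog"><img src="https://cdn.example.com/dog.jpg"></li>
  </ul>
</div>
<div id="other"><img src="/ignored.png"></div>
'''

tree = etree.HTML(sample)
names = tree.xpath('//div[@id="resContainer"]//li/@title')
srcs = tree.xpath('//div[@id="resContainer"]//img/@src')

# zip pairs each title with its image and stops at the shorter list,
# so a missing title or src cannot cause an IndexError.
pairs = list(zip(names, srcs))
print(pairs)
```

Only the images inside the `resContainer` div are matched; the image in the other div is skipped.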
4. Writing the functions
def Build_request(page):
    # Page 1 uses the bare URL; later pages append &page=N
    if page == 1:
        urll = base_url
    else:
        urll = base_url + '&page=' + str(page)
    request = Request(url=urll, headers=header)
    return request


def Build_response(request):
    handler = HTTPHandler()
    opener = build_opener(handler)
    response = opener.open(request)
    return response


def require_content(response):
    content = response.read().decode('utf-8')
    return content


def download_content(content):
    # etree.HTML parses the server response
    html_content = etree.HTML(content)
    # Locate the images: find the div by its id, then read the title and src attributes beneath it
    name_list = html_content.xpath('//div[@id="resContainer"]//li/@title')
    src_list = html_content.xpath('//div[@id="resContainer"]//img/@src')
    # zip stops at the shorter list, so mismatched lengths cannot raise IndexError
    for name, src in zip(name_list, src_list):
        # The download URL needs the https: prefix; src occasionally comes without it
        if src.startswith('//'):
            src = 'https:' + src
        # The target directory must already exist
        urlretrieve(url=src, filename='F://python/ Reptile practice / chart /' + name + '.jpg')


if __name__ == '__main__':
    start_page = int(input('Please enter the starting page number to download: '))
    end_page = int(input('Please enter the ending page number to download: '))
    for page in range(start_page, end_page + 1):
        request = Build_request(page)
        response = Build_response(request)
        content = require_content(response)
        download_content(content)
# ---------------------------------------------------------------
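The download step notes that src sometimes arrives without the https: prefix. One way to handle every case uniformly is to resolve src against the page URL, and to clean the title before using it as a file name. This is a sketch under assumptions: `normalize_src` and `safe_filename` are hypothetical helpers that are not in the original script, and the example.com URLs are made up.

```python
import re
from urllib.parse import urljoin

PAGE_URL = 'https://www.aigei.com/'  # base used to resolve relative src values


def normalize_src(src, base=PAGE_URL):
    """Turn '//host/a.jpg' or '/a.jpg' into a full https URL; leave full URLs alone."""
    return urljoin(base, src)


def safe_filename(name, ext='.jpg'):
    """Replace characters that Windows forbids in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', name).strip()
    return (cleaned or 'untitled') + ext


print(normalize_src('//cdn.example.com/a.jpg'))
print(normalize_src('https://cdn.example.com/b.jpg'))
print(safe_filename('cat/dog: v2'))
```

In the loop, these would be applied to src and name before they are passed to `urlretrieve`, so a title containing a slash or colon does not break the output path.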
That covers a hands-on batch download of images from a website with a crawler. If you have any questions, feel free to contact the author.
Follow the author for more hands-on programming walkthroughs.