Crawler practice: batch downloading images from a website
2022-07-04 02:12:00 【Computer Trainee】
Difficulties
1. URL pattern of the listing pages
# ---------------------------- Hands-on analysis -----------------------------
from urllib.request import HTTPHandler, build_opener, Request, urlretrieve, urlopen
from lxml import etree
# First page:  https://www.aigei.com/s?dim=cartoon_124_animatio&detailTab=file&type=pic
# Second page: https://www.aigei.com/s?dim=cartoon_124_animatio&detailTab=file&type=pic&page=2
# Third page:  https://www.aigei.com/s?dim=cartoon_124_animatio&detailTab=file&type=pic&page=3
base_url = 'https://www.aigei.com/s?dim=cartoon_124_animatio&detailTab=file&type=pic'
Copying the URLs of the first three pages side by side shows what they share and where they differ. The only difference is the trailing page parameter, which is absent on the first page and follows a regular pattern afterwards.
So we first write down the common part as base_url.
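The URL pattern described above can be sketched as a small helper. This is only an illustration: the name page_url is hypothetical and not part of the original code, and it assumes the site keeps using the page query parameter exactly as in the three URLs listed.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

BASE_URL = 'https://www.aigei.com/s?dim=cartoon_124_animatio&detailTab=file&type=pic'


def page_url(page, base_url=BASE_URL):
    """Return the listing URL for a 1-based page number (hypothetical helper)."""
    if page < 1:
        raise ValueError('page must be >= 1')
    parts = urlsplit(base_url)
    query = parse_qsl(parts.query)
    # Page 1 is the bare URL; later pages append page=N to the query string.
    if page > 1:
        query.append(('page', str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


for n in (1, 2, 3):
    print(page_url(n))
```

Building the query with urllib.parse instead of string concatenation keeps the URL valid if the common part ever changes.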
2. Set the request headers
# header is a dict; separate each key-value pair with a comma!
# Do not include 'accept-encoding': 'gzip, deflate, br'; it causes this error:
# '''utf-8' codec can't decode byte 0x8b in position 1: invalid start byte'''
header = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
'cache-control': 'max-age=0',
'cookie': 'gei_d_u=f207f2256e16481ca4d3d1367f0b17b2; oOO0OO0oOO00oo0o=true; geiweb-v=zZ+S93HA1QdGFy45oCXriWSc5taZy6ttcpWu5ieWhjSvCacb/kEBZke/G4OoYAWt; OooOO000oOOO00o=58c7f4f8054c4989866fc23f7032f13d; SESSION=18b24c86-ed10-4c96-8988-9304bc08c766; SERVERID=31234a7bd8ff50a386f3e53c5f85a5fd|1644824152|1644824122',
'referer': 'https://www.aigei.com/design/',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.76'
}
The headers above can be copied directly from the Network panel of the browser's developer tools. The ones that are needed are user-agent (UA), referer, and cookie. accept-encoding must be left out, because adding it causes the decoding error shown in the comment.
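The byte 0x8b at position 1 in that error message is the second byte of the gzip magic number (0x1f 0x8b), which suggests the server answered with a gzip-compressed body once accept-encoding was sent. If you do want to keep that header, one option is to decompress before decoding. The sketch below is an assumption-laden illustration: decode_body is a hypothetical helper, not part of the original code, and it handles gzip only, so the header should then advertise just 'gzip' rather than 'deflate' or 'br'.

```python
import gzip


def decode_body(raw, content_encoding=None, charset='utf-8'):
    """Decode a response body, decompressing it first if it is gzip data.

    Hypothetical helper: handles gzip only, not deflate or br.
    """
    # Trust the Content-Encoding header, and fall back to the gzip magic bytes.
    is_gzip = (content_encoding or '').lower() == 'gzip' or raw[:2] == b'\x1f\x8b'
    if is_gzip:
        raw = gzip.decompress(raw)
    return raw.decode(charset)


# Local demonstration without any network access.
packed = gzip.compress('<html>ok</html>'.encode('utf-8'))
print(decode_body(packed, 'gzip'))
```

With a urllib response object this would be called as decode_body(response.read(), response.headers.get('Content-Encoding')). Leaving accept-encoding out, as the article does, remains the simpler choice.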
3. Find the image URLs (mainly the node path)
Following the steps, the code is:
# Locate the images: find the div by its id, then look up the src attributes beneath it
name_list = html_content.xpath('//div[@id="resContainer"]//li/@title')
src_list = html_content.xpath('//div[@id="resContainer"]//img/@src')
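One thing to watch with these two expressions is that they return two independent lists, so a list item without an image would shift every later name onto the wrong src. A possible way around that, sketched here on made-up markup, is to walk each li and read its own img. The HTML fragment and the extract_pairs helper are hypothetical and are not taken from the real page.

```python
from lxml import etree

# Hypothetical markup for illustration only, NOT copied from the real site.
SAMPLE = '''
<div id="resContainer">
  <ul>
    <li title="cat"><img src="//img.example.com/cat.jpg"></li>
    <li title="dog"><img src="https://img.example.com/dog.jpg"></li>
    <li title="no-image"></li>
  </ul>
</div>
'''


def extract_pairs(html_text):
    """Return (title, src) pairs, one per <li> that actually contains an <img>."""
    tree = etree.HTML(html_text)
    pairs = []
    for li in tree.xpath('//div[@id="resContainer"]//li[@title]'):
        # A relative XPath keeps the title and src from the same list item.
        srcs = li.xpath('.//img/@src')
        if srcs:
            pairs.append((li.get('title'), srcs[0]))
    return pairs


print(extract_pairs(SAMPLE))
```

If every li on the real page is guaranteed to hold exactly one image, the two flat lists behave the same and the simpler version is fine.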
4. Write the functions
def Build_request(page):
    # The first page has no page parameter; later pages append &page=N
    if page == 1:
        urll = base_url
    else:
        urll = base_url + '&page=' + str(page)
    request = Request(url=urll, headers=header)
    return request


def Build_resopnse(request):
    # Build an opener from an HTTPHandler and send the request
    handler = HTTPHandler()
    opener = build_opener(handler)
    response = opener.open(request)
    return response


def require_content(response):
    # Read the raw bytes and decode them as UTF-8 text
    content = response.read().decode('utf-8')
    return content


def download_content(content):
    # etree.HTML parses the server response
    html_content = etree.HTML(content)
    # Locate the images: find the div by its id, then look up the src attributes beneath it
    name_list = html_content.xpath('//div[@id="resContainer"]//li/@title')
    src_list = html_content.xpath('//div[@id="resContainer"]//img/@src')
    for i in range(len(name_list)):
        name = name_list[i]
        src = src_list[i]
        # Downloading requires the https: prefix, but src only occasionally lacks it.
        # Assumption: a src without the prefix is protocol-relative and starts with '//'.
        if src.startswith('//'):
            src = 'https:' + src
        # The target directory must already exist, otherwise urlretrieve fails
        urlretrieve(url=src, filename='F://python/ Reptile practice / chart /' + name + '.jpg')


if __name__ == '__main__':
    start_page = int(input('Please enter the starting page number to download: '))
    end_page = int(input('Please enter the ending page number to download: '))
    for page in range(start_page, end_page + 1):
        request = Build_request(page)
        response = Build_resopnse(request)
        content = require_content(response)
        download_content(content)
# ---------------------------------------------------------------
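The download step depends on two details that are easy to get wrong: a src value may come without the https: prefix, and a title may contain characters that are not allowed in a file name. The sketch below shows one way to handle both with the standard library. The helper names absolute_src and safe_filename are hypothetical, and the set of replaced characters is an assumption based on what Windows file names forbid.

```python
import re
from urllib.parse import urljoin

PAGE_URL = 'https://www.aigei.com/s?dim=cartoon_124_animatio&detailTab=file&type=pic'


def absolute_src(src, page_url=PAGE_URL):
    """Resolve a src attribute against the page URL (hypothetical helper).

    urljoin leaves absolute URLs unchanged and fills in the scheme
    for protocol-relative ones such as '//host/a.jpg'.
    """
    return urljoin(page_url, src)


def safe_filename(title, ext='.jpg'):
    """Replace characters that Windows forbids in file names (hypothetical helper)."""
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', title).strip()
    return (cleaned or 'untitled') + ext


print(absolute_src('//img.example.com/a.jpg'))
print(safe_filename('cat: run/jump?'))
```

Both helpers are pure functions, so they can be checked without touching the network before being used in the download loop.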
That is the complete walkthrough of batch downloading images from a website with a crawler. If you have any questions, feel free to contact the author.
Follow the author for more hands-on programming walkthroughs.