Crawler practice: batch-downloading website images
2022-07-04 02:12:00 【Computer Trainee】
Difficulties
1. URL pattern of the listing pages
# ---------------------------- Hands-on analysis -----------------------------
from urllib.request import HTTPHandler, build_opener, Request, urlretrieve, urlopen
from lxml import etree
# Page 1: https://www.aigei.com/s?dim=cartoon_124_animatio&detailTab=file&type=pic
# Page 2: https://www.aigei.com/s?dim=cartoon_124_animatio&detailTab=file&type=pic&page=2
# Page 3: https://www.aigei.com/s?dim=cartoon_124_animatio&detailTab=file&type=pic&page=3
base_url = 'https://www.aigei.com/s?dim=cartoon_124_animatio&detailTab=file&type=pic'
Copying the URLs of the first three pages shows what they share and where they differ, and the difference follows a rule: every page after the first appends &page=N to the same address.
So the common part is written down first, as base_url.
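As a minimal sketch of this rule, assuming the page number is the only part of the URL that changes (the helper name build_page_url is hypothetical and does not appear in the original code):

```python
BASE_URL = 'https://www.aigei.com/s?dim=cartoon_124_animatio&detailTab=file&type=pic'


def build_page_url(page, base_url=BASE_URL):
    """Return the listing URL for a 1-based page number."""
    if page < 1:
        raise ValueError('page must be >= 1')
    # Page 1 is the bare base URL; later pages append &page=N.
    return base_url if page == 1 else f'{base_url}&page={page}'


for page in range(1, 4):
    print(build_page_url(page))
```

The loop prints the three URLs listed in the comments above, which is a quick way to check that the rule was read correctly.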
2. Setting the request headers
# header is a dict; separate each key-value pair with a comma!
# Do not include 'accept-encoding': 'gzip, deflate, br'. With it, the response body can come back compressed, and decoding it as UTF-8 fails with:
# 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
header = {
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
'cache-control': 'max-age=0',
'cookie': 'gei_d_u=f207f2256e16481ca4d3d1367f0b17b2; oOO0OO0oOO00oo0o=true; geiweb-v=zZ+S93HA1QdGFy45oCXriWSc5taZy6ttcpWu5ieWhjSvCacb/kEBZke/G4OoYAWt; OooOO000oOOO00o=58c7f4f8054c4989866fc23f7032f13d; SESSION=18b24c86-ed10-4c96-8988-9304bc08c766; SERVERID=31234a7bd8ff50a386f3e53c5f85a5fd|1644824152|1644824122',
'referer': 'https://www.aigei.com/design/',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36 Edg/97.0.1072.76'
}
The values above can be copied directly from the Network panel of the browser's developer tools. The headers that are needed are user-agent (UA), referer, and cookie. The one to leave out is accept-encoding: adding it causes the decoding error shown above.
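The decoding error can be reproduced without any network access. The sketch below compresses a small string with gzip, shows that decoding the compressed bytes as UTF-8 fails on byte 0x8b, and then decompresses before decoding. The helper decode_body is hypothetical, and it handles gzip only; deflate and br would need their own handling.

```python
import gzip

html = '<html><body>示例</body></html>'
compressed = gzip.compress(html.encode('utf-8'))

# gzip data starts with the bytes 0x1f 0x8b, and 0x8b is not a valid UTF-8 start byte.
try:
    compressed.decode('utf-8')
except UnicodeDecodeError as exc:
    print('decode failed:', exc)


def decode_body(raw, content_encoding=''):
    """Decompress a gzip body if needed, then decode it as UTF-8."""
    if content_encoding.lower() == 'gzip' or raw[:2] == b'\x1f\x8b':
        raw = gzip.decompress(raw)
    return raw.decode('utf-8')


print(decode_body(compressed, 'gzip'))
```

Leaving accept-encoding out of the request, as the article does, is the simpler route; decompressing is an alternative if the header has to be sent.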
3. Finding the image URLs (mainly the node path)
Following the steps, the code is:
# Locate the images: find the div by its id, then read the src beneath it
name_list = html_content.xpath('//div[@id="resContainer"]//li/@title')
src_list = html_content.xpath('//div[@id="resContainer"]//img/@src')
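The two XPath expressions can be tried on a small inline snippet before running them against the live page. The markup below is hypothetical: it only assumes a div with id resContainer whose li items carry a title and usually contain an img, which is not confirmed by the original page. The sketch also reads title and src from the same li, so that a name is not paired with the wrong image when an item has no picture.

```python
from lxml import etree

# Hypothetical markup that mimics the structure the XPath expects.
sample = '''
<div id="resContainer">
  <ul>
    <li title="cat"><img src="//example.com/cat.jpg"></li>
    <li title="no image here"></li>
    <li title="dog"><img src="https://example.com/dog.jpg"></li>
  </ul>
</div>
'''

tree = etree.HTML(sample)
pairs = []
for li in tree.xpath('//div[@id="resContainer"]//li[@title]'):
    # A relative XPath (leading dot) searches only inside this li.
    srcs = li.xpath('.//img/@src')
    if srcs:  # skip items without an image
        pairs.append((li.get('title'), srcs[0]))

print(pairs)
```

With two independent lists, the item without an image would leave three names against two src values, and the pairing would drift by one.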
4. Writing the functions
def build_request(page):
    # Page 1 uses the base URL as-is; later pages append &page=N.
    if page == 1:
        url = base_url
    else:
        url = base_url + '&page=' + str(page)
    return Request(url=url, headers=header)


def build_response(request):
    handler = HTTPHandler()
    opener = build_opener(handler)
    return opener.open(request)


def require_content(response):
    # The body is decoded as UTF-8; this works because accept-encoding was not sent.
    return response.read().decode('utf-8')


def download_content(content):
    # etree.HTML parses the server response
    html_content = etree.HTML(content)
    # Locate the images: find the div by its id, then read the src beneath it
    name_list = html_content.xpath('//div[@id="resContainer"]//li/@title')
    src_list = html_content.xpath('//div[@id="resContainer"]//img/@src')
    # zip stops at the shorter list, so a length mismatch cannot raise IndexError
    for name, src in zip(name_list, src_list):
        # Downloading requires the https: prefix, and src sometimes comes without it
        if src.startswith('//'):
            src = 'https:' + src
        # The target directory must already exist; adjust the path for your machine
        urlretrieve(url=src, filename='F://python/ Reptile practice / chart /' + name + '.jpg')


if __name__ == '__main__':
    start_page = int(input('Enter the first page to download: '))
    end_page = int(input('Enter the last page to download: '))
    for page in range(start_page, end_page + 1):
        request = build_request(page)
        response = build_response(request)
        content = require_content(response)
        download_content(content)
# ---------------------------------------------------------------
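Two details of the download step can be made more robust: src values that come without the https: prefix, and titles containing characters that are not allowed in Windows file names. The sketch below is one possible approach, not the author's method; the helpers absolute_url and safe_filename are hypothetical, and the example.com URLs are placeholders.

```python
import re
from urllib.parse import urljoin

PAGE_URL = 'https://www.aigei.com/s?dim=cartoon_124_animatio&detailTab=file&type=pic'


def absolute_url(src, page_url=PAGE_URL):
    """Resolve a src value against the page URL so it always has a scheme."""
    return urljoin(page_url, src)


def safe_filename(name, ext='.jpg'):
    """Replace characters that Windows forbids in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', name).strip()
    return (cleaned or 'untitled') + ext


print(absolute_url('//example.com/a.jpg'))
print(absolute_url('https://example.com/b.jpg'))
print(safe_filename('cat/dog: 1?'))
```

urljoin leaves an already absolute URL untouched and borrows the scheme from the page URL when src starts with //, so no manual prefix check is needed.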
That is the complete walkthrough of batch-downloading website images with a crawler. If you have any questions, feel free to contact the author.
Follow the author for more hands-on programming walkthroughs.