PPT template crawler example
2022-06-26 06:13:00 【An Muxi】
Crawling PPT templates with Python
This example crawls the PPT templates listed at http://www.ypppt.com/moban/. The site has some anti-crawling measures in place, so the URLs need to be analyzed carefully before the templates can be downloaded correctly.
# -*- coding: utf-8 -*-
# @Time: 2020-08-13 16:43
# @Author: Have a bottle of anmuxi
# @File: Free resume crawling.py
# Start a good day
import os

import requests
from lxml import etree

if __name__ == "__main__":
    url = 'http://www.ypppt.com/moban/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers)
    response.encoding = 'utf-8'
    page_text = response.text
    # Create the folder that stores the PPT templates
    save_dir = './ppt Templates'
    if not os.path.exists(save_dir):
        os.mkdir(save_dir)
    # Build the etree object
    tree = etree.HTML(page_text)
    # li_list holds the <li> element of every PPT template on the first page
    li_list = tree.xpath('//ul[@class="posts clear"]/li')
    # Parse each <li> and extract the detail-page URL and the name of its PPT
    for li in li_list:
        ppt_url = 'http://www.ypppt.com' + li.xpath('./a[1]/@href')[0]
        ppt_name = li.xpath('./a[2]/text()')[0]
        # print(ppt_url)
        # print(ppt_name)
        # Fetch the detail page of each PPT and find the URL of its download page
        ppt_response = requests.get(url=ppt_url, headers=headers)
        ppt_response.encoding = 'utf-8'
        ppt_text = ppt_response.text
        ppt_tree = etree.HTML(ppt_text)
        load_path = 'http://www.ypppt.com' + ppt_tree.xpath('//div[@class="button"]/a/@href')[0]
        # Fetch the download page and locate the download button
        load_response = requests.get(url=load_path, headers=headers)
        load_response.encoding = 'utf-8'
        final_text = load_response.text
        final_tree = etree.HTML(final_text)
        final_url = final_tree.xpath('//ul[@class="down clear"]/li[1]/a/@href')[0]
        # The site uses a simple anti-crawling trick here. Some download links are relative:
        #   /uploads/soft/200810/1-200Q0113H8.zip
        # while others are absolute:
        #   http://www.ypppt.com/uploads/soft/200810/1-200Q0113H8.zip
        # so prepend the domain only when the link does not already start with "http"
        if not final_url.startswith('http'):
            final_url = 'http://www.ypppt.com' + final_url
        # Request the download; the zip is binary, so read .content
        final_ppt = requests.get(url=final_url, headers=headers).content
        # Save the downloaded PPT archive
        with open(os.path.join(save_dir, ppt_name + '.zip'), 'wb') as fp:
            fp.write(final_ppt)
        print(ppt_name + ' ---- Download complete')
    print('Have a bottle of anmuxi: crawl finished!')
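The relative-versus-absolute link check in the script can also be left to the standard library. The following is a minimal sketch, assuming the same base URL as the script; `normalize_link` and `BASE_URL` are hypothetical names introduced here for illustration and are not part of the original code.

```python
from urllib.parse import urljoin

# Assumption: the same site root that the script prepends by hand.
BASE_URL = 'http://www.ypppt.com'


def normalize_link(href, base=BASE_URL):
    """Return an absolute URL whether href is relative or already absolute."""
    # urljoin leaves an absolute href untouched and resolves a relative one
    # against the base, so no explicit "starts with http" check is needed.
    return urljoin(base, href)


if __name__ == "__main__":
    print(normalize_link('/uploads/soft/200810/1-200Q0113H8.zip'))
    print(normalize_link('http://www.ypppt.com/uploads/soft/200810/1-200Q0113H8.zip'))
```

Both calls yield the same absolute address, so the download request can use the result directly regardless of which form the page served.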
Result after crawling:
The resulting folder is shown in the figure above.
Note: do not crawl maliciously; use this only for learning how to write crawlers.
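One way to keep a learning crawler gentle on the site is to space out its requests. The sketch below is an assumption-laden illustration, not something the original script does: `Throttle` is a hypothetical helper, and the one-second interval in the usage comment is an arbitrary choice.

```python
import time


class Throttle:
    """Hypothetical helper: enforce a minimum pause between consecutive calls."""

    def __init__(self, min_interval):
        self.min_interval = min_interval  # seconds between requests
        self._last = None                 # time of the previous call, if any

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


# Usage idea (interval is an assumed value): create one Throttle before the
# loop and call throttle.wait() right before each requests.get(...) call.
# throttle = Throttle(1.0)
```

The first call returns immediately; each later call sleeps only for whatever part of the interval has not already elapsed.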