PPT template crawler case study
2022-06-26 06:13:00 【An Muxi】
Crawling PPT templates with Python
This post crawls the PPT templates listed at http://www.ypppt.com/moban/. The site has some anti-crawling measures in place, so the URLs have to be analyzed carefully before the templates can be downloaded correctly.
# -*- coding: utf-8 -*-
# @Time: 2020-08-13 16:43
# @Author: Have a bottle of anmuxi
# @File: Free resume crawling.py
# @ Start a good day
import requests
import os
from lxml import etree
import re
if __name__ == "__main__":
    url = 'http://www.ypppt.com/moban/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers)
    response.encoding = 'utf-8'
    page_text = response.text
    # Create the folder that stores the PPT templates
    if not os.path.exists('./ppt_templates'):
        os.mkdir('./ppt_templates')
    # Build the etree object
    tree = etree.HTML(page_text)
    # li_list holds the <li> element of every PPT template on the first page
    li_list = tree.xpath('//ul[@class="posts clear"]/li')
    # Parse each <li> and extract the template's page URL and name
    for li in li_list:
        ppt_url = 'http://www.ypppt.com' + li.xpath('./a[1]/@href')[0]
        ppt_name = li.xpath('./a[2]/text()')[0]
        # print(ppt_url)
        # print(ppt_name)
        # Fetch the template's page and find the URL of its download page
        ppt_response = requests.get(url=ppt_url, headers=headers)
        ppt_response.encoding = 'utf-8'
        ppt_text = ppt_response.text
        ppt_tree = etree.HTML(ppt_text)
        load_path = 'http://www.ypppt.com' + ppt_tree.xpath('//div[@class="button"]/a/@href')[0]
        # Fetch the download page and locate the link behind the download button
        load_response = requests.get(url=load_path, headers=headers)
        load_response.encoding = 'utf-8'
        final_text = load_response.text
        final_tree = etree.HTML(final_text)
        final_url = final_tree.xpath('//ul[@class="down clear"]/li[1]/a/@href')[0]
        # The site applies a simple anti-crawling trick here.
        # Some download links are relative: /uploads/soft/200810/1-200Q0113H8.zip
        # Others are absolute: http://www.ypppt.com/uploads/soft/200810/1-200Q0113H8.zip
        # So a regular expression checks whether the link already starts with a scheme
        if not re.match(r'https?:', str(final_url)):
            final_url = 'http://www.ypppt.com' + final_url
        # Request the download; the zip is binary data, so read it through .content
        final_ppt = requests.get(url=final_url, headers=headers).content
        # Save the downloaded template
        with open('./ppt_templates/' + ppt_name + '.zip', 'wb') as fp:
            fp.write(final_ppt)
        print(ppt_name + ' ---- download complete')
    print('Have a bottle of anmuxi: crawl finished!')
Result of the crawl:
The output folder is shown in the figure above.
Note: don't crawl maliciously. Use this only for learning how crawlers work.
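The script above handles the mix of relative and absolute download links by checking the link and prepending the host by hand. As an alternative, here is a minimal sketch that uses the standard library's `urllib.parse.urljoin`, which resolves both kinds of link against a base URL. The helper name `normalize_download_url` and the `BASE_URL` constant are hypothetical and are not part of the original script.

```python
from urllib.parse import urljoin

# Assumed base URL, taken from the host used in the crawler script.
BASE_URL = 'http://www.ypppt.com/'


def normalize_download_url(href, base=BASE_URL):
    """Return an absolute URL for href (hypothetical helper).

    A relative href is resolved against base; an href that is
    already absolute is returned unchanged by urljoin.
    """
    return urljoin(base, href)


if __name__ == "__main__":
    # The two link shapes mentioned in the crawler's comments.
    samples = (
        '/uploads/soft/200810/1-200Q0113H8.zip',
        'http://www.ypppt.com/uploads/soft/200810/1-200Q0113H8.zip',
    )
    for href in samples:
        print(normalize_download_url(href))
```

Both sample links resolve to the same absolute URL, so the result could be passed straight to `requests.get` in place of the manual check.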