Scraping Baidu Wenku Text with Python and Writing It to a Word Document
2022-06-27 17:52:00 【北岛末巷】
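Introduction
This script only supports scraping Word documents on Baidu Wenku. The extracted text is written to a Word document or a text file (.txt), and the work is done mainly with the requests library.
requests is one of the more popular and convenient request libraries for Python web scraping; the urllib package is also widely used. Beyond these, the Python scraping ecosystem includes the parsing libraries lxml and Beautiful Soup, and the Scrapy framework.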
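For comparison with requests, here is a minimal sketch of how the standard-library urllib package attaches a header to a request. The URL and the helper name build_request are made up for illustration, and nothing is sent over the network; the request object is only built and inspected.

```python
from urllib.request import Request

# Hypothetical URL, used only to build the request object; nothing is sent.
url = "https://wenku.baidu.com/view/example.html"


def build_request(url, user_agent):
    """Build a GET request that carries only a User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


req = build_request(url, "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
print(req.get_method())
print(req.host)
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

Passing the resulting object to urllib.request.urlopen would actually send the request; the rest of this article uses requests instead.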
Requesting the URL
This section covers how to use headers and how to scrape a document page by page. In most cases, User-Agent alone is enough in headers.
def get_url(self):
    url = input("Enter the Baidu Wenku document URL to download: ")
    headers = {
        # Content types the client is willing to accept
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        # Content encodings the client supports ('br' is only decoded if a Brotli package is installed)
        'Accept-Encoding': 'gzip, deflate, br',
        # Languages the client prefers
        'Accept-Language': 'zh-CN,zh;q=0.9',
        # Caching directive: revalidate cached responses with the server
        'Cache-Control': 'max-age=0',
        # Keep the connection open for further requests until one side closes it
        'Connection': 'keep-alive',
        # Host (the server's domain name)
        'Host': 'wenku.baidu.com',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'same-origin',
        'Sec-Fetch-User': '?1',
        'Upgrade-Insecure-Requests': '1',
        # Client identification string (a bit like an ID card)
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
    }
    response = self.session.get(url=url, headers=headers)
    # Extract the JSON array of page descriptors embedded in the HTML
    json_data = re.findall('"json":(.*?}])', response.text)[0]
    json_data = json.loads(json_data)
    # print(json_data)
    for index, page_load_urls in enumerate(json_data):
        # print(page_load_urls)
        page_load_url = page_load_urls['pageLoadUrl']
        # print(index)
        self.get_data(index, page_load_url)
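The page-list extraction in get_url can be tried offline. The following is a minimal sketch, assuming the page HTML embeds a "json" array whose entries carry a pageLoadUrl field, as the regular expression above expects. The function name extract_page_urls and the sample HTML are invented for illustration and are not taken from a real Baidu Wenku page.

```python
import json
import re

# Same pattern as in get_url: the first '"json":' array that ends in '}]'.
PAGE_LIST_PATTERN = re.compile(r'"json":(.*?}])')


def extract_page_urls(html):
    """Return the pageLoadUrl of every page listed in the document HTML."""
    match = PAGE_LIST_PATTERN.search(html)
    if match is None:
        return []  # pattern not found, e.g. the page layout differs
    pages = json.loads(match.group(1))
    return [page["pageLoadUrl"] for page in pages]


# Made-up HTML fragment that mimics the structure the regex expects.
sample_html = (
    '<script>var pageData = {"json":['
    '{"pageIndex":1,"pageLoadUrl":"https://example.com/0.json?x=1"},'
    '{"pageIndex":2,"pageLoadUrl":"https://example.com/1.json?x=2"}'
    '],"other":1};</script>'
)
print(extract_page_urls(sample_html))
```

Returning an empty list when the pattern is missing avoids the IndexError that re.findall(...)[0] raises when nothing matches.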
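Scraping the data
Take the server's response, extract the document text, and write it to a Word document. You can also change .docx to .txt in with open('百度文库.docx', 'a', encoding='utf-8'), in which case the output is a text file. Note that open() writes plain UTF-8 text whatever the extension, so the .docx file produced here is not a genuine Word file and Word may refuse to open it. Line breaks between pages are not yet added when writing!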
def get_data(self, index, url):
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'Host': 'wkbjcloudbos.bdimg.com',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'none',
        'Sec-Fetch-User': '?1',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
    }
    response = self.session.get(url=url, headers=headers)
    # print(response.content.decode('unicode_escape'))
    data = response.content.decode('unicode_escape')
    # Each page comes back as JSONP: wenku_<page number>({...})
    command = 'wenku_' + str(index + 1)
    json_data = re.findall(command + r"\((.*?}})\)", data)[0]
    # print(json_data)
    json_data = json.loads(json_data)
    result = []
    # Collect the text fragment ("c") of every entry in the page body
    for i in json_data['body']:
        data = i["c"]
        # print(data)
        result.append(data)
    print(''.join(result).replace(' ', '\n'))
    print("")
    with open('百度文库.docx', 'a', encoding='utf-8') as f:
        f.write('')
        f.write(''.join(result).replace(' ', '\n'))
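One way the missing line break could be added is sketched below, under the assumption that each page arrives as a wenku_N({...}) JSONP payload whose body entries hold text in a "c" field, as in get_data. The helper names parse_page and append_page and the sample payload are hypothetical. This sketch passes the payload straight to json.loads, which decodes \uXXXX escapes by itself, instead of decoding with unicode_escape first.

```python
import json
import re


def parse_page(jsonp_text, page_number):
    """Extract the text fragments from one 'wenku_N(...)' JSONP payload."""
    pattern = r"wenku_" + str(page_number) + r"\((.*)\)"
    match = re.search(pattern, jsonp_text, re.S)
    if match is None:
        return []
    payload = json.loads(match.group(1))
    # Keep only string fragments; the "c" field is assumed to hold the text.
    return [item["c"] for item in payload["body"] if isinstance(item.get("c"), str)]


def append_page(path, fragments):
    """Append one page of text to a file, ending the page with a newline."""
    with open(path, "a", encoding="utf-8") as f:
        f.write("".join(fragments) + "\n")


# Made-up payload; the raw string keeps the \u escapes for json.loads to decode.
sample = r'wenku_1({"body":[{"c":"\u4f60\u597d"},{"c":"world"}],"page":{"v":1}})'
print(parse_page(sample, 1))
```

Calling append_page('百度文库.txt', parse_page(text, index + 1)) for each page would then put every page on its own line in the output file.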
Complete code
import requests
import re
import json


class WenKu:
    def __init__(self):
        # One session is reused for every request
        self.session = requests.Session()

    def get_url(self):
        url = input("Enter the Baidu Wenku document URL to download: ")
        headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Cache-Control': 'max-age=0',
            'Connection': 'keep-alive',
            'Host': 'wenku.baidu.com',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'same-origin',
            'Sec-Fetch-User': '?1',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
        }
        response = self.session.get(url=url, headers=headers)
        # Extract the JSON array of page descriptors embedded in the HTML
        json_data = re.findall('"json":(.*?}])', response.text)[0]
        json_data = json.loads(json_data)
        # print(json_data)
        for index, page_load_urls in enumerate(json_data):
            # print(page_load_urls)
            page_load_url = page_load_urls['pageLoadUrl']
            # print(index)
            self.get_data(index, page_load_url)

    def get_data(self, index, url):
        headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Cache-Control': 'max-age=0',
            'Connection': 'keep-alive',
            'Host': 'wkbjcloudbos.bdimg.com',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'none',
            'Sec-Fetch-User': '?1',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
        }
        response = self.session.get(url=url, headers=headers)
        # print(response.content.decode('unicode_escape'))
        data = response.content.decode('unicode_escape')
        # Each page comes back as JSONP: wenku_<page number>({...})
        command = 'wenku_' + str(index + 1)
        json_data = re.findall(command + r"\((.*?}})\)", data)[0]
        # print(json_data)
        json_data = json.loads(json_data)
        result = []
        # Collect the text fragment ("c") of every entry in the page body
        for i in json_data['body']:
            data = i["c"]
            # print(data)
            result.append(data)
        print(''.join(result).replace(' ', '\n'))
        print("")
        with open('百度文库.docx', 'a', encoding='utf-8') as f:
            f.write('')
            f.write(''.join(result).replace(' ', '\n'))


if __name__ == '__main__':
    wk = WenKu()
    wk.get_url()