Scraping Baidu Wenku Text with Python and Writing It to a Word Document
2022-06-27 17:52:00 【北岛末巷】
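Introduction
This script only supports documents on Baidu Wenku that are in Word format. The extracted text is written to a Word document or a plain text file (.txt), and the work is done mainly with the Python requests library.
requests is one of the most popular and convenient HTTP request libraries for Python scrapers; the standard-library urllib package is also widely used. Beyond these, the Python scraping ecosystem includes the parsing libraries lxml and Beautiful Soup, and the scrapy crawling framework.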
Requesting the URL
This part shows how to use headers and how to fetch the document page by page. In most cases a User-Agent entry alone is enough for headers.
def get_url(self):
    url = input("Enter the Wenku document URL to download: ")
    headers = {
        # Media types the client can accept
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        # Content encodings the client supports
        # (requests can only decode 'br' when a Brotli package is installed)
        'Accept-Encoding': 'gzip, deflate, br',
        # Languages the client prefers
        'Accept-Language': 'zh-CN,zh;q=0.9',
        # Cache directive: max-age=0 asks caches to revalidate stored responses
        'Cache-Control': 'max-age=0',
        # Reuse the same connection for later requests until one side closes it
        'Connection': 'keep-alive',
        # Domain name of the target server
        'Host': 'wenku.baidu.com',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'same-origin',
        'Sec-Fetch-User': '?1',
        'Upgrade-Insecure-Requests': '1',
        # Identifies the client (much like an ID card)
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
    }
    response = self.session.get(url=url, headers=headers)
    # The page embeds a "json" array with one entry per document page
    json_data = re.findall(r'"json":(.*?}])', response.text)[0]
    json_data = json.loads(json_data)
    # print(json_data)
    for index, page_load_urls in enumerate(json_data):
        # print(page_load_urls)
        page_load_url = page_load_urls['pageLoadUrl']
        # print(index)
        self.get_data(index, page_load_url)
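The extraction step in get_url can be tried offline. The sketch below runs the same regular expression against a made-up HTML fragment. The fragment, the example.com URLs and the helper name extract_page_urls are illustrative assumptions, not part of the original script, and the real page layout may differ or change over time.

```python
import json
import re

# Hypothetical fragment imitating the embedded page data; the real markup may differ.
SAMPLE_HTML = (
    '<script>var pageData = {"title":"demo","json":'
    '[{"pageIndex":1,"pageLoadUrl":"https://example.com/0.json?x=1"},'
    '{"pageIndex":2,"pageLoadUrl":"https://example.com/1.json?x=2"}],'
    '"other":1};</script>'
)


def extract_page_urls(html):
    """Return the pageLoadUrl values found in the embedded "json" array."""
    matches = re.findall(r'"json":(.*?}])', html)
    if not matches:
        # No match: the layout changed or the request was blocked.
        return []
    pages = json.loads(matches[0])
    return [page["pageLoadUrl"] for page in pages]


if __name__ == "__main__":
    for index, page_url in enumerate(extract_page_urls(SAMPLE_HTML)):
        print(index, page_url)
```

Returning an empty list when nothing matches avoids the IndexError that a bare `[0]` raises when the page does not contain the expected data.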
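Scraping the data
Read the server response, extract the document text, and write it to a Word document. You can also change .docx to .txt in with open('百度文库.docx', 'a', encoding='utf-8'), in which case the output is a text file. Note that open() writes plain UTF-8 text whatever the extension, so the .docx file produced here is not a real Word (OOXML) file and Word may refuse to open it. Line breaks are not yet added when writing!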
def get_data(self, index, url):
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'Host': 'wkbjcloudbos.bdimg.com',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'none',
        'Sec-Fetch-User': '?1',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
    }
    response = self.session.get(url=url, headers=headers)
    # print(response.content.decode('unicode_escape'))
    # Turn the \uXXXX escapes in the response into readable characters
    data = response.content.decode('unicode_escape')
    # Each page is wrapped in a callback named wenku_<page number>(...)
    comand = 'wenku_' + str(index + 1)
    json_data = re.findall(comand + r"\((.*?}})\)", data)[0]
    # print(json_data)
    json_data = json.loads(json_data)
    result = []
    for i in json_data['body']:
        # "c" holds the text fragment of each item
        data = i["c"]
        # print(data)
        result.append(data)
    print(''.join(result).replace(' ', '\n'))
    print("")
    # Append this page's text to the output file
    with open('百度文库.docx', 'a', encoding='utf-8') as f:
        f.write('')
        f.write(''.join(result).replace(' ', '\n'))
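The unwrapping step in get_data can also be tried offline. The sketch below parses a made-up wenku_N(...) payload. Two things differ from the code above and are assumptions, not the author's method: it leaves the \uXXXX escapes for json.loads to decode instead of calling decode('unicode_escape'), and it skips any item whose "c" value is not a string. SAMPLE_PAGE and parse_page are hypothetical names, and real responses may be structured differently.

```python
import json
import re

# Hypothetical payload imitating one page response; real responses may differ.
SAMPLE_PAGE = (
    'wenku_1({"body":[{"c":"Hello","t":"word"},{"c":" ","t":"word"},'
    '{"c":"\\u4f60\\u597d","t":"word"}],"page":{"pn":1}})'
)


def parse_page(index, payload):
    """Return the text fragments of page `index` (0-based) from a wenku_N(...) wrapper."""
    callback = "wenku_" + str(index + 1)
    # Capture everything between the callback's opening and the last closing parenthesis.
    match = re.search(re.escape(callback) + r"\((.*)\)", payload, re.S)
    if match is None:
        return []
    body = json.loads(match.group(1)).get("body", [])
    # Keep only string fragments; items with other "c" types are skipped.
    return [item["c"] for item in body if isinstance(item.get("c"), str)]


if __name__ == "__main__":
    print("".join(parse_page(0, SAMPLE_PAGE)))
```

Letting json.loads handle the escapes keeps escaped quotes inside strings intact, which a prior unicode_escape pass could turn into bare quotes and break the JSON.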
Complete code
import requests
import re
import json


class WenKu:
    def __init__(self):
        # One session is reused for every request
        self.session = requests.Session()

    def get_url(self):
        url = input("Enter the Wenku document URL to download: ")
        headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Cache-Control': 'max-age=0',
            'Connection': 'keep-alive',
            'Host': 'wenku.baidu.com',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'same-origin',
            'Sec-Fetch-User': '?1',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
        }
        response = self.session.get(url=url, headers=headers)
        # The page embeds a "json" array with one entry per document page
        json_data = re.findall(r'"json":(.*?}])', response.text)[0]
        json_data = json.loads(json_data)
        # print(json_data)
        for index, page_load_urls in enumerate(json_data):
            # print(page_load_urls)
            page_load_url = page_load_urls['pageLoadUrl']
            # print(index)
            self.get_data(index, page_load_url)

    def get_data(self, index, url):
        headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Cache-Control': 'max-age=0',
            'Connection': 'keep-alive',
            'Host': 'wkbjcloudbos.bdimg.com',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'none',
            'Sec-Fetch-User': '?1',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
        }
        response = self.session.get(url=url, headers=headers)
        # print(response.content.decode('unicode_escape'))
        # Turn the \uXXXX escapes in the response into readable characters
        data = response.content.decode('unicode_escape')
        # Each page is wrapped in a callback named wenku_<page number>(...)
        comand = 'wenku_' + str(index + 1)
        json_data = re.findall(comand + r"\((.*?}})\)", data)[0]
        # print(json_data)
        json_data = json.loads(json_data)
        result = []
        for i in json_data['body']:
            # "c" holds the text fragment of each item
            data = i["c"]
            # print(data)
            result.append(data)
        print(''.join(result).replace(' ', '\n'))
        print("")
        # Append this page's text to the output file
        with open('百度文库.docx', 'a', encoding='utf-8') as f:
            f.write('')
            f.write(''.join(result).replace(' ', '\n'))


if __name__ == '__main__':
    wk = WenKu()
    wk.get_url()
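The write step does not yet separate pages with line breaks. One possible way to add that, sketched with the standard library only: append each page's text followed by a newline. The helper append_page and the output file name wenku.txt are hypothetical, and the sketch writes a plain text file rather than a Word document.

```python
import os
import tempfile


def append_page(path, fragments):
    """Append one page of text to `path`, ending the page with a newline."""
    text = "".join(fragments)
    with open(path, "a", encoding="utf-8") as f:
        f.write(text + "\n")


if __name__ == "__main__":
    # Write two small pages to a temporary file and show the result.
    out_path = os.path.join(tempfile.mkdtemp(), "wenku.txt")
    append_page(out_path, ["first ", "page"])
    append_page(out_path, ["second page"])
    with open(out_path, encoding="utf-8") as f:
        print(f.read())
```

In the class above, a call such as append_page('百度文库.txt', result) could stand in for the with open(...) block at the end of get_data.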