Python crawls Baidu Wenku text and writes it into a Word document
2022-06-27 19:52:00 【Beidao end Lane】
Introduction
This script only supports crawling Word-type documents from Baidu Wenku, writing the extracted text into a Word document or a plain-text file (.txt). It mainly relies on the Python requests library.
requests is one of the most popular, convenient, and practical HTTP libraries in the Python crawler ecosystem; the standard library's urllib package is also widely used. Beyond request libraries, the Python crawling toolbox includes the parsing libraries lxml and Beautiful Soup, as well as the crawler framework Scrapy.
Request URL
This section shows how to build the request headers and how to crawl the document page by page. In most cases a User-Agent header alone is enough; the full set below simply mimics a real browser more closely.
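As a small sketch of that claim, here is a request built with only a User-Agent header using the standard library's urllib. The URL is a placeholder for illustration, and no network request is actually made:

```python
import urllib.request

# In many cases a User-Agent alone is enough for a server to treat
# the request as coming from a real browser.
ua = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36')

# Placeholder URL -- the Request object is only constructed, not sent.
req = urllib.request.Request('https://wenku.baidu.com/view/example',
                             headers={'User-Agent': ua})

# urllib normalizes header names via str.capitalize()
print(req.get_header('User-agent'))
```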
def get_url(self):
    url = input("Please enter the URL of the Baidu Wenku document to download: ")
    headers = {
        # Content types the browser will accept
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        # Compression encodings the browser supports
        'Accept-Encoding': 'gzip, deflate, br',
        # Languages preferred by the client browser
        'Accept-Language': 'zh-CN,zh;q=0.9',
        # Caching directive
        'Cache-Control': 'max-age=0',
        # Reuse the same connection for subsequent requests until one side closes it
        'Connection': 'keep-alive',
        # Domain name of the target server
        'Host': 'wenku.baidu.com',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'same-origin',
        'Sec-Fetch-User': '?1',
        'Upgrade-Insecure-Requests': '1',
        # Client identification (like an ID card for the browser)
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
    }
    response = self.session.get(url=url, headers=headers)
    # Pull the embedded JSON array of per-page load URLs out of the page source
    json_data = re.findall('"json":(.*?}])', response.text)[0]
    json_data = json.loads(json_data)
    for index, page_load_urls in enumerate(json_data):
        page_load_url = page_load_urls['pageLoadUrl']
        self.get_data(index, page_load_url)
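The re.findall pattern above extracts a JSON array from the page's inline JavaScript. Here is a sketch of that extraction against a mock page fragment; the HTML below is fabricated for illustration, and the real Wenku markup may differ:

```python
import re
import json

# Fabricated fragment mimicking inline JS that embeds the page-load URLs.
html = '''var pageData = {"json":[{"pageIndex":1,"pageLoadUrl":"https://example.com/page1"},
{"pageIndex":2,"pageLoadUrl":"https://example.com/page2"}]};'''

# Non-greedy match up to the first "}]" recovers just the JSON array.
# re.S lets "." span the line break inside the array.
json_data = json.loads(re.findall('"json":(.*?}])', html, re.S)[0])
for index, page in enumerate(json_data):
    print(index, page['pageLoadUrl'])
```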
Crawl data
Take the server response, extract the document text, and append it to a file. In with open('百度文库.docx', 'a', encoding='utf-8') you can change .docx to .txt to write a plain-text file instead. Note that appending plain text this way does not produce a true Word document (a real .docx is a zipped XML package; a library such as python-docx would be needed for that), and no extra line breaks are inserted while writing.
def get_data(self, index, url):
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'Host': 'wkbjcloudbos.bdimg.com',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'none',
        'Sec-Fetch-User': '?1',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
    }
    response = self.session.get(url=url, headers=headers)
    # The response is a JSONP payload with escaped unicode; decode it first
    data = response.content.decode('unicode_escape')
    command = 'wenku_' + str(index + 1)
    json_data = re.findall(command + r"\((.*?}})\)", data)[0]
    json_data = json.loads(json_data)
    result = []
    for i in json_data['body']:
        result.append(i["c"])
    print(''.join(result).replace(' ', '\n'))
    print("")
    with open('百度文库.docx', 'a', encoding='utf-8') as f:
        f.write(''.join(result).replace(' ', '\n'))
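Each page response is a JSONP payload wrapped in wenku_N(...), which the code above unwraps with a regex before parsing. Here is a sketch of that unwrapping on a fabricated payload; its structure is inferred from the parsing code, and real responses may carry more fields:

```python
import re
import json

# Fabricated JSONP payload in the shape the parsing code expects:
# the text runs live under body[i]["c"], and the trailing "}}" before
# the closing parenthesis is what the regex anchors on.
payload = 'wenku_1({"body":[{"c":"Hello "},{"c":"world"}],"style":{"t":{}}})'

index = 0
command = 'wenku_' + str(index + 1)
# Strip the wenku_1( ... ) wrapper, leaving plain JSON.
json_data = json.loads(re.findall(command + r"\((.*?}})\)", payload)[0])
text = ''.join(item["c"] for item in json_data['body'])
print(text.replace(' ', '\n'))
```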
Complete code
import requests
import re
import json


class WenKu():
    def __init__(self):
        self.session = requests.Session()

    def get_url(self):
        url = input("Please enter the URL of the Baidu Wenku document to download: ")
        headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Cache-Control': 'max-age=0',
            'Connection': 'keep-alive',
            'Host': 'wenku.baidu.com',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'same-origin',
            'Sec-Fetch-User': '?1',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
        }
        response = self.session.get(url=url, headers=headers)
        json_data = re.findall('"json":(.*?}])', response.text)[0]
        json_data = json.loads(json_data)
        for index, page_load_urls in enumerate(json_data):
            page_load_url = page_load_urls['pageLoadUrl']
            self.get_data(index, page_load_url)

    def get_data(self, index, url):
        headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Cache-Control': 'max-age=0',
            'Connection': 'keep-alive',
            'Host': 'wkbjcloudbos.bdimg.com',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'none',
            'Sec-Fetch-User': '?1',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
        }
        response = self.session.get(url=url, headers=headers)
        data = response.content.decode('unicode_escape')
        command = 'wenku_' + str(index + 1)
        json_data = re.findall(command + r"\((.*?}})\)", data)[0]
        json_data = json.loads(json_data)
        result = []
        for i in json_data['body']:
            result.append(i["c"])
        print(''.join(result).replace(' ', '\n'))
        print("")
        with open('百度文库.docx', 'a', encoding='utf-8') as f:
            f.write(''.join(result).replace(' ', '\n'))


if __name__ == '__main__':
    wk = WenKu()
    wk.get_url()