Python crawls Baidu Wenku text and writes it into a Word document
2022-06-27 19:52:00 【Beidao end Lane】
Introduction
This script only supports crawling Word-type documents on Baidu Wenku (Baidu Library) and writing their text into a Word document or a plain-text file (.txt). It is built mainly on the Python requests library.
requests is a popular, convenient, and practical request library in the Python crawling ecosystem; the standard-library urllib package is also widely used. Beyond request libraries, Python crawling also draws on parsing libraries such as lxml and Beautiful Soup, and on the crawler framework Scrapy.
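Before the real code, here is a minimal, self-contained sketch of the requests workflow this script builds on. The target URL is just a placeholder; the point is that, as noted below, a bare User-Agent header is usually enough to get a normal page back:

import requests

# Minimal GET request: a User-Agent header alone usually suffices
# to avoid being served a degraded anti-crawler page.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
response = requests.get('https://wenku.baidu.com/', headers=headers)
print(response.status_code)   # 200 on success
print(response.text[:200])    # first 200 characters of the HTML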
Request URL
This section shows how to use request headers and how to crawl the document page by page. Generally speaking, the headers only need a User-Agent to work.
def get_url(self):
    url = input("Please enter the Baidu Wenku document URL: ")
    headers = {
        # Content types the client will accept
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        # Encodings the client supports
        'Accept-Encoding': 'gzip, deflate, br',
        # Languages the client accepts
        'Accept-Language': 'zh-CN,zh;q=0.9',
        # Ask for a fresh response rather than a cached one
        'Cache-Control': 'max-age=0',
        # Keep the connection open for follow-up requests until one side closes it
        'Connection': 'keep-alive',
        # Target host (the server's domain name)
        'Host': 'wenku.baidu.com',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'same-origin',
        'Sec-Fetch-User': '?1',
        'Upgrade-Insecure-Requests': '1',
        # Client identification (works like the browser's ID card)
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
    }
    response = self.session.get(url=url, headers=headers)
    # Extract the embedded JSON array that lists each page's load URL
    json_data = re.findall('"json":(.*?}])', response.text)[0]
    json_data = json.loads(json_data)
    # Fetch each page of the document in order
    for index, page_load_urls in enumerate(json_data):
        page_load_url = page_load_urls['pageLoadUrl']
        self.get_data(index, page_load_url)
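The step that makes get_url work is that re.findall pattern: the document's HTML embeds a JSON array of per-page pageLoadUrl entries, and the regex captures it for json.loads. A toy sketch of the same extraction on a made-up snippet (the real embedded array is much longer):

import re
import json

# Synthetic stand-in for the HTML returned by wenku.baidu.com; the real
# page embeds "json":[{...}, ...] inside an inline script.
html = 'var pageData = {"json":[{"pageLoadUrl": "https://example.com/page1"}]};'
raw = re.findall('"json":(.*?}])', html)[0]  # non-greedy, stops at the first "}]"
pages = json.loads(raw)
for index, page in enumerate(pages):
    print(index, page['pageLoadUrl'])  # prints: 0 https://example.com/page1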
Crawl data
get_data reads each page's server response and appends the document text to a file. You can also change the .docx in open('百度文库.docx', 'a', encoding='utf-8') to .txt to write a plain-text file instead; either way the script writes raw UTF-8 text, and no line break is added between pages. Note that raw text in a file merely named .docx is not a genuine Word document; see the python-docx sketch after the code below for one way to produce one.
def get_data(self, index, url):
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        # The page data is served from Baidu's CDN, not wenku.baidu.com
        'Host': 'wkbjcloudbos.bdimg.com',
        'Sec-Fetch-Dest': 'document',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-Site': 'none',
        'Sec-Fetch-User': '?1',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
    }
    response = self.session.get(url=url, headers=headers)
    # The response is a JSONP callback such as wenku_1({...}); decode the
    # escaped Unicode first, then strip the callback wrapper with a regex
    data = response.content.decode('unicode_escape')
    command = 'wenku_' + str(index + 1)
    json_data = re.findall(command + r"\((.*?}})\)", data)[0]
    json_data = json.loads(json_data)
    result = []
    for i in json_data['body']:
        # Each body item carries a fragment of page text in its "c" field
        result.append(i["c"])
    print(''.join(result).replace(' ', '\n'))
    print("")
    with open('百度文库.docx', 'a', encoding='utf-8') as f:
        f.write(''.join(result).replace(' ', '\n'))
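As noted above, appending raw UTF-8 text to a file named .docx produces plain text, not a real Word document. Here is a minimal sketch of a proper alternative using the third-party python-docx package (pip install python-docx); the save_pages_as_docx helper and its pages argument are illustrative, not part of the original script:

from docx import Document  # third-party package: pip install python-docx

def save_pages_as_docx(pages, path='百度文库.docx'):
    # pages: a list of page-text strings collected by the crawler
    doc = Document()
    for page_text in pages:
        for line in page_text.split('\n'):
            doc.add_paragraph(line)  # one Word paragraph per line
    doc.save(path)

# Hypothetical usage: gather every page's text first, then save once,
# instead of appending to the file page by page.
# save_pages_as_docx(all_page_texts)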
Complete code
import requests
import re
import json


class WenKu():
    def __init__(self):
        # One session so cookies persist across requests
        self.session = requests.Session()

    def get_url(self):
        url = input("Please enter the Baidu Wenku document URL: ")
        headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Cache-Control': 'max-age=0',
            'Connection': 'keep-alive',
            'Host': 'wenku.baidu.com',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'same-origin',
            'Sec-Fetch-User': '?1',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
        }
        response = self.session.get(url=url, headers=headers)
        json_data = re.findall('"json":(.*?}])', response.text)[0]
        json_data = json.loads(json_data)
        for index, page_load_urls in enumerate(json_data):
            page_load_url = page_load_urls['pageLoadUrl']
            self.get_data(index, page_load_url)

    def get_data(self, index, url):
        headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Cache-Control': 'max-age=0',
            'Connection': 'keep-alive',
            'Host': 'wkbjcloudbos.bdimg.com',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'none',
            'Sec-Fetch-User': '?1',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36'
        }
        response = self.session.get(url=url, headers=headers)
        data = response.content.decode('unicode_escape')
        command = 'wenku_' + str(index + 1)
        json_data = re.findall(command + r"\((.*?}})\)", data)[0]
        json_data = json.loads(json_data)
        result = []
        for i in json_data['body']:
            result.append(i["c"])
        print(''.join(result).replace(' ', '\n'))
        print("")
        with open('百度文库.docx', 'a', encoding='utf-8') as f:
            f.write(''.join(result).replace(' ', '\n'))


if __name__ == '__main__':
    wk = WenKu()
    wk.get_url()
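One caveat about the complete script: Baidu Wenku's page layout changes over time, and re.findall(...)[0] raises an IndexError when the pattern no longer matches. A small guard is sketched below as a standalone helper, under the assumption (not confirmed by the original) that a missing match means a redesigned page or an unsupported document type:

import re
import json

def extract_page_list(html):
    # Fail with a clear message instead of an IndexError when the
    # embedded "json":[...] block is missing from the page HTML.
    matches = re.findall('"json":(.*?}])', html)
    if not matches:
        raise RuntimeError('Embedded page data not found; the page layout may have changed.')
    return json.loads(matches[0])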