当前位置:网站首页>书包网小说多线程爬虫[通俗易懂]
书包网小说多线程爬虫[通俗易懂]
2022-07-02 14:43:00 【全栈程序员站长】
大家好,又见面了,我是你们的朋友全栈君。
书包网是个很好的小说网站,提供了小说txt下载,并且网站后端高并发,不用担心随便抓一下把网站抓崩了
既然如此,何不拿来练手爬虫项目呢。
直接上代码把,此多线程爬虫支持爬取各种这样类似的网站,关键需要网站支持高并发,否则分分钟崩了。
毕竟5分钟一本18mb的小说,属于超级快的那种了
from lxml import etree
import requests
from threading import Thread,enumerate
import os
from time import sleep,time
headers={
# ':authority':'www.bookbao8.com',
# ':method': 'GET',
# ':path': '/book/201506/04/id_XNDMyMjA1.html',
# ':scheme': 'https',
'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'zh-CN,zh;q=0.9',
'cache-control': 'max-age=0',
'cookie': 'Hm_lvt_79d6c18dfed73a9524dc37b056df45ec=1577182135; Hm_lpvt_79d6c18dfed73a9524dc37b056df45ec=1577182135; Hm_lvt_9e424f40a62d01a6b9036c7d25ce9a05=1577182142; trustedsite_visit=1; bk_ad=2; __cm_warden_uid=840a745a752905060cd14982b4bbc922coo; __cm_warden_upi=MTE5LjQuMjI4LjE1Nw%3D%3D; Hm_lpvt_9e424f40a62d01a6b9036c7d25ce9a05=1577185720',
'referer': 'https://www.bookbao8.com/book/201506/04/id_XNDMyMjA1.html',
'sec-fetch-mode': 'navigate',
'sec-fetch-site': 'same-origin',
'sec-fetch-user': '?1',
'upgrade-insecure-requests': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
def thread_it(func,*args):
t = Thread(target=func,args=args)
t.setDaemon(True)
t.start()
def getAll(url = "https://www.bookbao8.com/book/201506/04/id_XNDMyMjA1.html"):
r = requests.get(url,headers=headers)
print(r.text)
if r.status_code == 200:
r.encoding = r.apparent_encoding
ret = r.text
page_source = etree.HTML(ret)
name = page_source.xpath('//*[@id="info"]/h1/text()')
author = page_source.xpath('//*[@id="info"]/p[1]/a/text()')
novel_type = page_source.xpath('//*[@id="info"]/p[2]/a/text()')
title = page_source.xpath('/html/body/div[7]/ul/li/a/text()')
link = page_source.xpath('/html/body/div[7]/ul/li/a/@href')
link = map(lambda x: 'https://www.bookbao8.com'+x, link) #向列表中每个元素都加入前缀
novel_list = list(zip(title,link)) #将两个列表用zip打包成新的zip对象并转为列表对象
if len(novel_list) > 0:
return name[0], author[0], novel_type[0], novel_list
else:
return None,None,None,None
def getOne(link=('第0001章 绝地中走出的少年', 'https://www.bookbao8.com/views/201506/04/id_XNDMyMjA1_1.html')):
r = requests.get(link[1], headers=headers)
if r.status_code == 200:
r.encoding = r.apparent_encoding
ret = r.text
page_source = etree.HTML(ret)
node_title = link[0]
node_content = page_source.xpath('//*[@id="contents"]/text()')
node_content = "".join(node_content).replace("\n \xa0 \xa0","")
if len(node_title) > 0:
return node_title, node_content
else:
return None, None
def writeOne(title,content):
txt = "\t\t"+title+"\n"+content+"\n\n"
return txt
def runApp(novel_list,name,t1,cwd=''):
article_num = len(novel_list)
xc_num = article_num//20+1
print(f"待开启线程数量为{xc_num}")
def inter(link,f,i):
try:
title, content = getOne(link)
txt = writeOne(title, content)
f.write(txt)
print(f"\r线程{i}正在写入 {title}", end="")
except Exception as e:
print("\n爬得太快被拒绝连接,等1s递归继续")
sleep(1)
inter(link,f,i)
def inner(name,i,begin,end,cwd):
f = open(f"{cwd}downloads/{name}/{i}.txt", mode='w+', encoding='utf-8')
for link in novel_list[begin:end]:
inter(link, f,i)
if link == novel_list[end - 1]:
print(f"\n线程{i}执行完毕")
print(f"\n剩余线程数量{len(enumerate())}")
base_xc = 2 if not cwd else 4
if len(enumerate()) <= base_xc:
print(enumerate())
print("\n全本下载完毕")
t2 = time()
print(f"\n本次下载小说总共耗时{round(t2 - t1)}s")
hebing(f"{cwd}downloads/{name}")
f.close()
for i in range(1,xc_num+1):
begin = 20*(i-1)
end = 20*i if i != xc_num else article_num
if i == xc_num:
print(f"\n全部线程开启完毕")
thread_it(inner,name,i,begin,end,cwd)
sleep(0.5)
def paixuRule(elem):
return int(elem.split(".")[0])
def hebing(path):
dirs = os.listdir(path)
dirs.sort(key=paixuRule, reverse=False)
f = open(path+".txt",mode='w+',encoding='utf-8')
for file in dirs:
with open(path+"/"+file,mode="r",encoding="utf-8") as f1:
f.write(f1.read())
f.close()
print("小说合并完成")
if __name__ == '__main__':
t1 = time()
name, _, _, novel_list = getAll(url="https://www.bookbao8.com/book/201506/04/id_XNDMyMjA1.html")
print(name)
if not os.path.exists("downloads/" + name):
os.mkdir("downloads/" + name)
runApp(novel_list, name, t1)
while True:
pass
发布者:全栈程序员栈长,转载请注明出处:https://javaforall.cn/148031.html原文链接:https://javaforall.cn
边栏推荐
- js删除字符串中的子串
- 871. Minimum refueling times
- 宝宝巴士创业板IPO被终止:曾拟募资18亿 唐光宇控制47%股权
- Shutter: action feedback
- [shutter] dart data type (dynamic data type)
- 剑指 Offer 21. 调整数组顺序使奇数位于偶数前面
- Vscode setting delete line shortcut [easy to understand]
- 移动应用性能工具探索之路
- Detailed explanation of @accessories annotation of Lombok plug-in
- 伟立控股港交所上市:市值5亿港元 为湖北贡献一个IPO
猜你喜欢
绿竹生物冲刺港股:年期内亏损超5亿 泰格医药与北京亦庄是股东
Configure ARP table entry restrictions and port security based on the interface (restrict users' private access to fool switches or illegal host access)
Green bamboo biological sprint Hong Kong stocks: loss of more than 500million during the year, tiger medicine and Beijing Yizhuang are shareholders
Configure MySQL under Linux to authorize a user to access remotely, which is not restricted by IP
QStyle实现自绘界面项目实战(二)
A few lines of code to complete RPC service registration and discovery
Easy language ABCD sort
你想要的宏基因组-微生物组知识全在这(2022.7)
TCP拥塞控制详解 | 2. 背景
Cell: Tsinghua Chenggong group revealed an odor of skin flora. Volatiles promote flavivirus to infect the host and attract mosquitoes
随机推荐
几行代码搞定RPC服务注册和发现
ssb门限_SSB调制「建议收藏」
871. 最低加油次数
OpenPose的使用
GeoServer:发布PostGIS数据源
Does digicert SSL certificate support Chinese domain name application?
【Leetcode】14. 最长公共前缀
Chapter 3 of hands on deep learning - (1) linear regression is realized from scratch_ Learning thinking and exercise answers
什么是泛型?- 泛型入门篇
Changwan group rushed to Hong Kong stocks: the annual revenue was 289million, and Liu Hui had 53.46% voting rights
Ap和F107数据来源及处理
Use of openpose
Detailed explanation of @accessories annotation of Lombok plug-in
上传代码到远程仓库报错error: remote origin already exists.
js删除字符串中的子串
七张图,学会做有价值的经营分析
C语言中sprintf()函数的用法
Smart trash can (V) - light up OLED
Un an à dix ans
Baobab's gem IPO was terminated: Tang Guangyu once planned to raise 1.8 billion to control 47% of the equity