A multithreaded novel crawler for 书包网 (bookbao8.com) [easy to understand]
2022-07-02 14:43:00 【全栈程序员站长】
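Hello everyone, good to see you again. I'm your friend 全栈君.
书包网 (bookbao8.com) is a good novel site: it offers novels as txt downloads, and its backend handles high concurrency, so a casual crawl is unlikely to bring the site down.
That makes it a handy target for practicing a crawler project.
The multithreaded crawler listed below should also work on other sites with a similar structure. The key requirement is that the site can handle high concurrency; otherwise it may go down within minutes.
For reference, the crawler fetched an 18 MB novel in about 5 minutes, which is very fast.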
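The listing that follows gives each thread a batch of 20 consecutive chapters. The batching arithmetic can be sketched on its own, with no network access. This is a minimal illustration rather than the author's code: `batch_ranges` is a hypothetical helper name, and the chapter count of 45 is made up.

```python
def batch_ranges(total, batch_size=20):
    """Return (begin, end) slice bounds that cover range(total) in order.

    batch_ranges is a hypothetical helper for illustration only.
    """
    return [(begin, min(begin + batch_size, total))
            for begin in range(0, total, batch_size)]


# Made-up example: 45 chapters split into batches of 20.
chapters = [f"chapter {n}" for n in range(1, 46)]
for begin, end in batch_ranges(len(chapters)):
    batch = chapters[begin:end]
    print(f"slice [{begin}:{end}] holds {len(batch)} chapters")
```

Stepping `range` by the batch size means a chapter count that is an exact multiple of 20 produces no empty trailing batch.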
from lxml import etree
import requests
from threading import Thread,enumerate
import os
from time import sleep,time
headers = {
    # ':authority':'www.bookbao8.com',
    # ':method': 'GET',
    # ':path': '/book/201506/04/id_XNDMyMjA1.html',
    # ':scheme': 'https',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'zh-CN,zh;q=0.9',
    'cache-control': 'max-age=0',
    'cookie': 'Hm_lvt_79d6c18dfed73a9524dc37b056df45ec=1577182135; Hm_lpvt_79d6c18dfed73a9524dc37b056df45ec=1577182135; Hm_lvt_9e424f40a62d01a6b9036c7d25ce9a05=1577182142; trustedsite_visit=1; bk_ad=2; __cm_warden_uid=840a745a752905060cd14982b4bbc922coo; __cm_warden_upi=MTE5LjQuMjI4LjE1Nw%3D%3D; Hm_lpvt_9e424f40a62d01a6b9036c7d25ce9a05=1577185720',
    'referer': 'https://www.bookbao8.com/book/201506/04/id_XNDMyMjA1.html',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
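def thread_it(func, *args):
    # Run func(*args) in a daemon thread.
    t = Thread(target=func, args=args)
    t.daemon = True  # replaces setDaemon(), which is deprecated since Python 3.10
    t.start()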
def getAll(url="https://www.bookbao8.com/book/201506/04/id_XNDMyMjA1.html"):
    # Fetch the book's index page and return (name, author, novel_type, novel_list).
    # Every failure path returns four Nones so the caller can always unpack the result.
    r = requests.get(url, headers=headers, timeout=10)
    if r.status_code != 200:
        return None, None, None, None
    r.encoding = r.apparent_encoding
    page_source = etree.HTML(r.text)
    name = page_source.xpath('//*[@id="info"]/h1/text()')
    author = page_source.xpath('//*[@id="info"]/p[1]/a/text()')
    novel_type = page_source.xpath('//*[@id="info"]/p[2]/a/text()')
    title = page_source.xpath('/html/body/div[7]/ul/li/a/text()')
    link = page_source.xpath('/html/body/div[7]/ul/li/a/@href')
    link = ['https://www.bookbao8.com' + x for x in link]  # prefix every relative href with the site root
    novel_list = list(zip(title, link))  # pair each chapter title with its URL
    if len(novel_list) > 0:
        return name[0], author[0], novel_type[0], novel_list
    return None, None, None, None
def getOne(link=('第0001章 绝地中走出的少年', 'https://www.bookbao8.com/views/201506/04/id_XNDMyMjA1_1.html')):
    # link is a (chapter title, chapter URL) pair; returns (title, content).
    r = requests.get(link[1], headers=headers, timeout=10)
    if r.status_code != 200:
        return None, None
    r.encoding = r.apparent_encoding
    page_source = etree.HTML(r.text)
    node_title = link[0]
    node_content = page_source.xpath('//*[@id="contents"]/text()')
    node_content = "".join(node_content).replace("\n \xa0 \xa0", "")  # strip the paragraph indent markers
    if len(node_title) > 0:
        return node_title, node_content
    return None, None
def writeOne(title, content):
    # Format one chapter: tab-indented title line, body text, then a blank line.
    txt = "\t\t" + title + "\n" + content + "\n\n"
    return txt
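def runApp(novel_list, name, t1, cwd=''):
    article_num = len(novel_list)
    xc_num = (article_num + 19) // 20  # one thread per batch of 20 chapters (ceiling division)
    print(f"Number of threads to start: {xc_num}")
    def inter(link, f, i):
        # Fetch and write one chapter; retry in a loop (not by recursion) until it succeeds.
        while True:
            try:
                title, content = getOne(link)
                txt = writeOne(title, content)
                f.write(txt)
                print(f"\rThread {i} is writing {title}", end="")
                return
            except Exception:
                print("\nConnection refused (crawling too fast); retrying in 1s")
                sleep(1)
    def inner(name, i, begin, end, cwd):
        # Each thread writes its batch to its own numbered file; hebing() merges them at the end.
        with open(f"{cwd}downloads/{name}/{i}.txt", mode='w+', encoding='utf-8') as f:
            for link in novel_list[begin:end]:
                inter(link, f, i)
        # The file is closed (and flushed) here, before any merge can read it.
        print(f"\nThread {i} finished")
        print(f"\nThreads remaining: {len(enumerate())}")
        # Live-thread threshold: main thread + this worker = 2 (the original uses 4 when cwd is set).
        base_xc = 2 if not cwd else 4
        if len(enumerate()) <= base_xc:
            print(enumerate())
            print("\nWhole book downloaded")
            t2 = time()
            print(f"\nTotal download time: {round(t2 - t1)}s")
            hebing(f"{cwd}downloads/{name}")
    for i in range(1, xc_num + 1):
        begin = 20 * (i - 1)
        end = 20 * i if i != xc_num else article_num
        thread_it(inner, name, i, begin, end, cwd)
        if i == xc_num:
            print("\nAll threads started")
        sleep(0.5)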
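def paixuRule(elem):
    # Sort key: "12.txt" -> 12, so files merge in numeric rather than lexicographic order.
    return int(elem.split(".")[0])
def hebing(path):
    # Merge the per-thread files under path/ into a single path.txt.
    dirs = os.listdir(path)
    dirs.sort(key=paixuRule, reverse=False)
    with open(path + ".txt", mode='w+', encoding='utf-8') as f:
        for file in dirs:
            with open(path + "/" + file, mode="r", encoding="utf-8") as f1:
                f.write(f1.read())
    print("Novel merge complete")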
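if __name__ == '__main__':
    t1 = time()
    name, _, _, novel_list = getAll(url="https://www.bookbao8.com/book/201506/04/id_XNDMyMjA1.html")
    print(name)
    if name is None:
        raise SystemExit("Could not read the chapter list")
    os.makedirs("downloads/" + name, exist_ok=True)  # also creates downloads/ if it is missing
    runApp(novel_list, name, t1)
    # Workers are daemon threads, so keep the main thread alive until they finish.
    # Sleeping in the loop avoids the CPU-burning `while True: pass`.
    while len(enumerate()) > 1:
        sleep(0.5)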
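The listing above starts one thread per 20 chapters, so the thread count grows with the book, and it detects completion by counting live threads. For a site that cannot absorb that much parallel traffic, a capped worker pool is one possible alternative. The sketch below is not the author's method: it assumes the standard-library `concurrent.futures.ThreadPoolExecutor`, and `download_all`, `fake_fetch`, and the `example.invalid` URLs are hypothetical stand-ins (in real use, `getOne` would take the place of `fake_fetch`).

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(links, fetch, max_workers=8):
    """Fetch every (title, url) pair using at most max_workers threads.

    Executor.map yields results in input order, so chapters come back
    sorted without per-thread files or a merge step.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, links))


def fake_fetch(link):
    # Hypothetical stand-in for getOne: no network access.
    title, url = link
    return title, f"body of {url}"


# Made-up chapter list for illustration.
links = [(f"Chapter {n}", f"https://example.invalid/{n}.html") for n in range(1, 6)]
results = download_all(links, fake_fetch, max_workers=3)

# Same chapter layout as writeOne: tab-indented title, body, blank line.
book = "".join(f"\t\t{title}\n{content}\n\n" for title, content in results)
print(book)
```

Leaving the `with` block waits for all workers, so no thread counting or keep-alive loop is needed, and `max_workers` becomes the single knob for how hard the site is hit.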
Publisher: 全栈程序员栈长. Please credit the source when reposting: https://javaforall.cn/148031.html Original link: https://javaforall.cn