Schoolbag novel multithreaded crawler [easy to understand]
2022-07-02 17:19:00 【Full stack programmer webmaster】
Hello everyone, good to see you again. I'm your friend Quan Jun.
Schoolbag (www.bookbao8.com) is a good novel website that offers novels as txt downloads. Its back end handles high concurrency, so there is little risk of overloading the site while crawling it.
In that case, why not use it for a practice crawler project?
Straight to the code. This multithreaded crawler can be adapted to other, similar websites. The key requirement is that the website tolerates high concurrency; otherwise it will collapse within minutes.
After all, downloading an 18 MB novel in 5 minutes is very fast.
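Before the full listing, here is a minimal sketch of the batching idea it relies on: the chapter list is cut into slices of 20 chapters, and each slice is handed to one thread. The helper name `batch_ranges` is hypothetical and does not appear in the original code, which computes its slices inline in `runApp`.

```python
def batch_ranges(total, size=20):
    """Return (begin, end) index pairs that cover range(total) in slices of `size`.

    Hypothetical helper for illustration only; `size=20` mirrors the batch
    size used by the crawler.
    """
    return [(begin, min(begin + size, total)) for begin in range(0, total, size)]


if __name__ == "__main__":
    # 45 chapters are split into two full batches and one partial batch.
    for begin, end in batch_ranges(45):
        print(f"chapters [{begin}, {end})")
```

Stepping through `range(0, total, size)` avoids creating an empty trailing batch when the chapter count is an exact multiple of 20.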
from lxml import etree
import requests
from threading import Thread,enumerate
import os
from time import sleep,time
headers = {
    # HTTP/2 pseudo-headers copied from the browser; requests sets these itself.
    # ':authority':'www.bookbao8.com',
    # ':method': 'GET',
    # ':path': '/book/201506/04/id_XNDMyMjA1.html',
    # ':scheme': 'https',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    # 'br' is left out: requests only decodes Brotli when the optional brotli package is installed.
    'accept-encoding': 'gzip, deflate',
    'accept-language': 'zh-CN,zh;q=0.9',
    'cache-control': 'max-age=0',
    'cookie': 'Hm_lvt_79d6c18dfed73a9524dc37b056df45ec=1577182135; Hm_lpvt_79d6c18dfed73a9524dc37b056df45ec=1577182135; Hm_lvt_9e424f40a62d01a6b9036c7d25ce9a05=1577182142; trustedsite_visit=1; bk_ad=2; __cm_warden_uid=840a745a752905060cd14982b4bbc922coo; __cm_warden_upi=MTE5LjQuMjI4LjE1Nw%3D%3D; Hm_lpvt_9e424f40a62d01a6b9036c7d25ce9a05=1577185720',
    'referer': 'https://www.bookbao8.com/book/201506/04/id_XNDMyMjA1.html',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
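
def thread_it(func, *args):
    # Run func(*args) in a daemon thread so that it never blocks interpreter exit.
    t = Thread(target=func, args=args)
    t.daemon = True  # replaces setDaemon(), which is deprecated since Python 3.10
    t.start()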

def getAll(url="https://www.bookbao8.com/book/201506/04/id_XNDMyMjA1.html"):
    # Fetch the novel's index page and return
    # (name, author, type, [(chapter title, chapter URL), ...]).
    r = requests.get(url, headers=headers)
    if r.status_code != 200:
        # Return the same shape as the "no chapters" case so callers can always unpack.
        return None, None, None, None
    r.encoding = r.apparent_encoding
    ret = r.text
    page_source = etree.HTML(ret)
    name = page_source.xpath('//*[@id="info"]/h1/text()')
    author = page_source.xpath('//*[@id="info"]/p[1]/a/text()')
    novel_type = page_source.xpath('//*[@id="info"]/p[2]/a/text()')
    title = page_source.xpath('/html/body/div[7]/ul/li/a/text()')
    link = page_source.xpath('/html/body/div[7]/ul/li/a/@href')
    link = ['https://www.bookbao8.com' + x for x in link]  # make every chapter link absolute
    novel_list = list(zip(title, link))  # pair each chapter title with its URL
    if len(novel_list) > 0:
        return name[0], author[0], novel_type[0], novel_list
    else:
        return None, None, None, None
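
def getOne(link=(' The first 0001 Chapter The boy who came out of the Jedi ', 'https://www.bookbao8.com/views/201506/04/id_XNDMyMjA1_1.html')):
    # link is a (chapter title, chapter URL) tuple; return (title, chapter text).
    r = requests.get(link[1], headers=headers)
    if r.status_code != 200:
        # Raise a real error so the caller's retry logic does not fail on unpacking None.
        raise requests.HTTPError(f"HTTP {r.status_code} for {link[1]}")
    r.encoding = r.apparent_encoding
    ret = r.text
    page_source = etree.HTML(ret)
    node_title = link[0]
    node_content = page_source.xpath('//*[@id="contents"]/text()')
    # Join the text nodes and drop the indentation that starts each paragraph.
    node_content = "".join(node_content).replace("\n \xa0 \xa0","")
    if len(node_title) > 0:
        return node_title, node_content
    else:
        return None, None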

def writeOne(title, content):
    # Format one chapter: indented title, body, then a blank line.
    txt = "\t\t" + title + "\n" + content + "\n\n"
    return txt

def runApp(novel_list, name, t1, cwd=''):
    article_num = len(novel_list)
    # One thread per batch of 20 chapters. Ceiling division avoids an extra,
    # empty batch when the chapter count is an exact multiple of 20.
    xc_num = (article_num + 19) // 20
    print(f"Number of threads to start: {xc_num}")

    def inter(link, f, i):
        # Download one chapter and append it to this thread's file. Retry in a
        # loop rather than by recursion, so a long outage cannot exhaust the stack.
        while True:
            try:
                title, content = getOne(link)
                txt = writeOne(title, content)
                f.write(txt)
                print(f"\rThread {i} wrote {title}", end="")
                return
            except Exception as e:
                print(f"\nRequest failed ({e}); probably refused for crawling too fast. Retrying in 1 s")
                sleep(1)

    def inner(name, i, begin, end, cwd):
        # Write chapters [begin, end) to downloads/<name>/<i>.txt.
        with open(f"{cwd}downloads/{name}/{i}.txt", mode='w+', encoding='utf-8') as f:
            for link in novel_list[begin:end]:
                inter(link, f, i)
        # The file is closed and flushed here, before any merge can read it.
        print(f"\nThread {i} finished")
        # enumerate is threading.enumerate (see the imports), not the builtin.
        print(f"\nThreads remaining: {len(enumerate())}")
        # Baseline count of live threads: the main thread plus this worker.
        # The original uses 4 instead of 2 when cwd is set.
        base_xc = 2 if not cwd else 4
        if len(enumerate()) <= base_xc:
            print(enumerate())
            print("\nDownload complete")
            t2 = time()
            print(f"\nTotal download time: {round(t2 - t1)}s")
            hebing(f"{cwd}downloads/{name}")

    for i in range(1, xc_num + 1):
        begin = 20 * (i - 1)
        end = 20 * i if i != xc_num else article_num
        thread_it(inner, name, i, begin, end, cwd)
        if i == xc_num:
            print("\nAll threads started")
        sleep(0.5)

def paixuRule(elem):
    # Sort key: "12.txt" -> 12, so files merge in numeric rather than lexicographic order.
    return int(elem.split(".")[0])

def hebing(path):
    # Merge the per-thread files in path/ into a single path.txt.
    dirs = os.listdir(path)
    dirs.sort(key=paixuRule, reverse=False)
    with open(path + ".txt", mode='w+', encoding='utf-8') as f:
        for file in dirs:
            with open(path + "/" + file, mode="r", encoding="utf-8") as f1:
                f.write(f1.read())
    print("Novel merged")
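
if __name__ == '__main__':
    t1 = time()
    name, _, _, novel_list = getAll(url="https://www.bookbao8.com/book/201506/04/id_XNDMyMjA1.html")
    print(name)
    if name is None:
        raise SystemExit("Could not fetch the chapter list")
    # makedirs also creates the parent downloads/ directory if it is missing.
    os.makedirs("downloads/" + name, exist_ok=True)
    runApp(novel_list, name, t1)
    # The workers are daemon threads, so the main thread has to stay alive.
    # Stop the script with Ctrl+C once the merge message has been printed.
    while True:
        sleep(1)  # sleep instead of `pass` so the loop does not pin a CPU core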
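The listing above decides when to merge by counting live threads, retries a failed chapter without limit, and keeps the main thread alive with an endless loop. One possible alternative, sketched here under assumptions and not taken from the original article, uses the standard library's `concurrent.futures.ThreadPoolExecutor`. The names `fetch_with_retry`, `download_all` and `fake_fetch` are hypothetical, and `fake_fetch` stands in for `getOne` so the sketch runs without network access.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_with_retry(fetch, link, retries=3, delay=1.0):
    """Call fetch(link); on any exception retry, up to `retries` attempts in total."""
    for attempt in range(1, retries + 1):
        try:
            return fetch(link)
        except Exception:
            if attempt == retries:
                raise  # give up and surface the last error
            time.sleep(delay)


def download_all(fetch, links, workers=8, retries=3, delay=1.0):
    """Fetch every link concurrently and return the results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the order of `links`, so no per-thread
        # files or numeric-sort merge step is needed. Leaving the `with`
        # block waits for all workers to finish.
        return list(pool.map(
            lambda link: fetch_with_retry(fetch, link, retries, delay), links))


if __name__ == "__main__":
    def fake_fetch(link):
        # Stand-in for getOne: formats a chapter without touching the network.
        title, url = link
        return f"\t\t{title}\n(text of {url})\n\n"

    links = [(f"Chapter {n}", f"https://example.invalid/{n}.html") for n in range(1, 6)]
    print("".join(download_all(fake_fetch, links)))
```

With this shape the script exits on its own when the last chapter is done, and a chapter that keeps failing raises an error instead of retrying forever. The worker count of 8 is an arbitrary assumption and should be tuned to what the target site tolerates.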
Publisher: Full stack programmer webmaster. When reprinting, please credit the source: https://javaforall.cn/148031.html Link to the original text: https://javaforall.cn