Schoolbag novel multithreaded crawler [easy to understand]
2022-07-02 17:19:00 【Full stack programmer webmaster】
Hello everyone, good to see you again. I'm your friend Quan Jun.
Schoolbag (bookbao8.com) is a good novel site that offers novels as txt downloads. Its back end handles high concurrency, so there is little risk of ordinary crawling overloading it.
That makes it a good target for a practice crawler project.
Let's go straight to the code. This multithreaded crawler can be adapted to similar sites, provided the site tolerates high concurrency; otherwise the site will go down within minutes.
After all, it downloads an 18 MB novel in about 5 minutes, which is on the very fast side.
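That speed depends on the site tolerating many parallel requests. For a site that cannot, one option is to cap how many requests are in flight at once. A minimal sketch using the standard library's `ThreadPoolExecutor`; `fetch_chapter`, `crawl`, the example.com URLs, and the limit of 8 workers are hypothetical stand-ins for illustration, not names or values from the script in this post:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_chapter(url):
    # Hypothetical stand-in: a real crawler would download and parse the page here
    return f"content of {url}"

def crawl(urls, max_workers=8):
    # At most max_workers calls run at any moment;
    # pool.map yields results in the same order as urls
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_chapter, urls))

chapters = crawl([f"https://example.com/chapter/{n}" for n in range(1, 6)])
print(len(chapters))
```

Because `map` preserves input order, the chapters could be written to a single file in sequence, with no separate merge step.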

from lxml import etree
import requests
# Note: threading.enumerate (the list of live threads) shadows the built-in enumerate in this script
from threading import Thread, enumerate
import os
from time import sleep, time

headers = {
    # ':authority':'www.bookbao8.com',
    # ':method': 'GET',
    # ':path': '/book/201506/04/id_XNDMyMjA1.html',
    # ':scheme': 'https',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    # requests can decode 'br' responses only when a Brotli package (brotli or brotlicffi) is installed
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'zh-CN,zh;q=0.9',
    'cache-control': 'max-age=0',
    'cookie': 'Hm_lvt_79d6c18dfed73a9524dc37b056df45ec=1577182135; Hm_lpvt_79d6c18dfed73a9524dc37b056df45ec=1577182135; Hm_lvt_9e424f40a62d01a6b9036c7d25ce9a05=1577182142; trustedsite_visit=1; bk_ad=2; __cm_warden_uid=840a745a752905060cd14982b4bbc922coo; __cm_warden_upi=MTE5LjQuMjI4LjE1Nw%3D%3D; Hm_lpvt_9e424f40a62d01a6b9036c7d25ce9a05=1577185720',
    'referer': 'https://www.bookbao8.com/book/201506/04/id_XNDMyMjA1.html',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}

def thread_it(func, *args):
    # Run func(*args) in a daemon thread, so it ends when the main thread exits
    t = Thread(target=func, args=args, daemon=True)
    t.start()

def getAll(url="https://www.bookbao8.com/book/201506/04/id_XNDMyMjA1.html"):
    # Fetch the book's index page and return (name, author, novel_type, [(chapter title, chapter URL), ...])
    r = requests.get(url, headers=headers)
    if r.status_code == 200:
        r.encoding = r.apparent_encoding
        ret = r.text
        page_source = etree.HTML(ret)
        name = page_source.xpath('//*[@id="info"]/h1/text()')
        author = page_source.xpath('//*[@id="info"]/p[1]/a/text()')
        novel_type = page_source.xpath('//*[@id="info"]/p[2]/a/text()')
        title = page_source.xpath('/html/body/div[7]/ul/li/a/text()')
        link = page_source.xpath('/html/body/div[7]/ul/li/a/@href')
        link = ['https://www.bookbao8.com' + x for x in link]  # prefix every relative link with the site root
        novel_list = list(zip(title, link))  # pair each chapter title with its URL
        if len(novel_list) > 0:
            return name[0], author[0], novel_type[0], novel_list
    # Non-200 response or no chapters found: still return a 4-tuple so callers can unpack it
    return None, None, None, None

def getOne(link=(' The first 0001 Chapter The boy who came out of the Jedi ', 'https://www.bookbao8.com/views/201506/04/id_XNDMyMjA1_1.html')):
    # link is a (chapter title, chapter URL) pair; returns (title, chapter text)
    r = requests.get(link[1], headers=headers)
    if r.status_code != 200:
        # Raise so the caller's retry logic handles the failure explicitly
        raise RuntimeError(f"HTTP {r.status_code} for {link[1]}")
    r.encoding = r.apparent_encoding
    ret = r.text
    page_source = etree.HTML(ret)
    node_title = link[0]
    node_content = page_source.xpath('//*[@id="contents"]/text()')
    node_content = "".join(node_content).replace("\n \xa0 \xa0", "")
    if len(node_title) > 0:
        return node_title, node_content
    else:
        return None, None

def writeOne(title, content):
    # Format one chapter: indented title, then the chapter body
    txt = "\t\t" + title + "\n" + content + "\n\n"
    return txt

def runApp(novel_list, name, t1, cwd=''):
    article_num = len(novel_list)
    xc_num = (article_num + 19) // 20  # one thread per 20 chapters, rounded up
    print(f"Number of threads to start: {xc_num}")

    def inter(link, f, i):
        # Download one chapter and append it to f; on any error, wait 1 s and try again.
        # Retrying in a loop rather than recursively means a long outage cannot exhaust the call stack.
        while True:
            try:
                title, content = getOne(link)
                txt = writeOne(title, content)
                f.write(txt)
                print(f"\rThread {i} is writing {title}", end="")
                return
            except Exception:
                print("\nCrawling too fast, connection refused; retrying in 1 s")
                sleep(1)

    def inner(name, i, begin, end, cwd):
        # Write chapters [begin, end) to downloads/<name>/<i>.txt
        with open(f"{cwd}downloads/{name}/{i}.txt", mode='w+', encoding='utf-8') as f:
            for link in novel_list[begin:end]:
                inter(link, f, i)
        # The file is closed and flushed at this point, so the merge below reads complete data
        print(f"\nThread {i} finished")
        print(f"\nThreads remaining: {len(enumerate())}")
        base_xc = 2 if not cwd else 4
        if len(enumerate()) <= base_xc:
            print(enumerate())
            print("\nDownload complete")
            t2 = time()
            print(f"\nTotal download time: {round(t2 - t1)}s")
            hebing(f"{cwd}downloads/{name}")

    for i in range(1, xc_num + 1):
        begin = 20 * (i - 1)
        end = 20 * i if i != xc_num else article_num
        if i == xc_num:
            print("\nAll threads started")
        thread_it(inner, name, i, begin, end, cwd)
        sleep(0.5)

def paixuRule(elem):
    # Sort key: the numeric part of a file name such as "12.txt"
    return int(elem.split(".")[0])

def hebing(path):
    # Merge the per-thread files in path/ into path.txt, in numeric order
    dirs = os.listdir(path)
    dirs.sort(key=paixuRule, reverse=False)
    with open(path + ".txt", mode='w+', encoding='utf-8') as f:
        for file in dirs:
            with open(path + "/" + file, mode="r", encoding="utf-8") as f1:
                f.write(f1.read())
    print("Novel merged")

if __name__ == '__main__':
    t1 = time()
    name, _, _, novel_list = getAll(url="https://www.bookbao8.com/book/201506/04/id_XNDMyMjA1.html")
    print(name)
    if name is None:
        raise SystemExit("Could not read the chapter list")
    # makedirs also creates the parent "downloads" directory if it is missing
    os.makedirs("downloads/" + name, exist_ok=True)
    runApp(novel_list, name, t1)
    # Keep the main thread alive for the daemon workers; sleeping leaves the CPU free for them
    while True:
        sleep(1)

Publisher: Full stack programmer webmaster. When reprinting, please credit the source: https://javaforall.cn/148031.html Original link: https://javaforall.cn
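
The retry-on-failure idea in `inter` can also be given an upper bound, so that a chapter that never loads does not block its thread forever. A minimal sketch, assuming a hypothetical helper `retry` with illustrative limits (5 attempts, delays doubling from 1 s); none of these names or values come from the original script:

```python
import time

def retry(func, attempts=5, first_delay=1.0, sleep=time.sleep):
    # Call func() until it succeeds or `attempts` calls have failed.
    # The delay doubles after each failure (1 s, 2 s, 4 s, ...).
    delay = first_delay
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # give up and let the caller decide what to do
            sleep(delay)
            delay *= 2

# Demo with a function that fails twice and then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("refused")
    return "chapter text"

result = retry(flaky, sleep=lambda seconds: None)  # no-op sleep keeps the demo instant
print(result, calls["n"])
```

In the crawler, this could wrap the per-chapter download, for example `retry(lambda: getOne(link))`, with the caller logging and skipping a chapter once the attempts run out.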