Schoolbag novel multithreaded crawler [easy to understand]
2022-07-02 17:19:00 【Full stack programmer webmaster】
Hello everyone, nice to see you again. I'm your friend Quan Jun.
Schoolbag (bookbao8.com) is a good novel website that offers novels as txt downloads. Its back end handles high concurrency, so a casual crawl is unlikely to bring the site down.
That makes it a good target for a practice crawler project.
Let's go straight to the code. The same multithreaded approach works for similar websites; the key requirement is that the target site can handle many concurrent requests, otherwise it will collapse within minutes.
For reference, the crawler downloads an 18 MB novel in about 5 minutes, which is very fast.
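The listing below gives each thread a block of 20 chapters (`xc_num = article_num//20+1`). Here is a minimal sketch of that arithmetic in isolation; `chunk_bounds` is a hypothetical helper name and is not part of the original code:

```python
def chunk_bounds(article_num, size=20):
    """Return one (begin, end) index pair per worker thread.

    Mirrors the arithmetic in runApp: article_num // size + 1 threads,
    each taking `size` chapters, with the last one taking whatever is left.
    """
    thread_count = article_num // size + 1
    bounds = []
    for i in range(1, thread_count + 1):
        begin = size * (i - 1)
        end = size * i if i != thread_count else article_num
        bounds.append((begin, end))
    return bounds


if __name__ == "__main__":
    print(chunk_bounds(45))  # [(0, 20), (20, 40), (40, 45)]
    print(chunk_bounds(40))  # [(0, 20), (20, 40), (40, 40)]
```

When the chapter count is an exact multiple of 20, the last range is empty, so the crawler starts one thread that has nothing to download. That is harmless.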
from lxml import etree
import requests
from threading import Thread,enumerate
import os
from time import sleep,time
headers = {
    # ':authority':'www.bookbao8.com',
    # ':method': 'GET',
    # ':path': '/book/201506/04/id_XNDMyMjA1.html',
    # ':scheme': 'https',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'zh-CN,zh;q=0.9',
    'cache-control': 'max-age=0',
    'cookie': 'Hm_lvt_79d6c18dfed73a9524dc37b056df45ec=1577182135; Hm_lpvt_79d6c18dfed73a9524dc37b056df45ec=1577182135; Hm_lvt_9e424f40a62d01a6b9036c7d25ce9a05=1577182142; trustedsite_visit=1; bk_ad=2; __cm_warden_uid=840a745a752905060cd14982b4bbc922coo; __cm_warden_upi=MTE5LjQuMjI4LjE1Nw%3D%3D; Hm_lpvt_9e424f40a62d01a6b9036c7d25ce9a05=1577185720',
    'referer': 'https://www.bookbao8.com/book/201506/04/id_XNDMyMjA1.html',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
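def thread_it(func, *args):
    # Run func(*args) in a daemon thread so it never blocks interpreter exit.
    t = Thread(target=func, args=args)
    t.daemon = True  # setDaemon() is deprecated since Python 3.10
    t.start()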
def getAll(url="https://www.bookbao8.com/book/201506/04/id_XNDMyMjA1.html"):
    # Fetch the book's index page and return (name, author, novel_type, novel_list).
    r = requests.get(url, headers=headers)
    if r.status_code == 200:
        r.encoding = r.apparent_encoding
        ret = r.text
        page_source = etree.HTML(ret)
        name = page_source.xpath('//*[@id="info"]/h1/text()')
        author = page_source.xpath('//*[@id="info"]/p[1]/a/text()')
        novel_type = page_source.xpath('//*[@id="info"]/p[2]/a/text()')
        title = page_source.xpath('/html/body/div[7]/ul/li/a/text()')
        link = page_source.xpath('/html/body/div[7]/ul/li/a/@href')
        link = map(lambda x: 'https://www.bookbao8.com' + x, link)  # prefix every relative link with the site root
        novel_list = list(zip(title, link))  # pair each chapter title with its URL
        if len(novel_list) > 0:
            return name[0], author[0], novel_type[0], novel_list
    # Non-200 response or no chapters found: always return four values so callers can unpack.
    return None, None, None, None
def getOne(link=(' The first 0001 Chapter The boy who came out of the Jedi ', 'https://www.bookbao8.com/views/201506/04/id_XNDMyMjA1_1.html')):
    # link is a (chapter title, chapter URL) pair; returns (title, content).
    r = requests.get(link[1], headers=headers)
    if r.status_code == 200:
        r.encoding = r.apparent_encoding
        ret = r.text
        page_source = etree.HTML(ret)
        node_title = link[0]
        node_content = page_source.xpath('//*[@id="contents"]/text()')
        node_content = "".join(node_content).replace("\n \xa0 \xa0", "")
        if len(node_title) > 0:
            return node_title, node_content
    # Non-200 response or empty title: the caller treats this as a failure and retries.
    return None, None
def writeOne(title, content):
    # Format one chapter: indented title, then the body, then a blank line.
    txt = "\t\t" + title + "\n" + content + "\n\n"
    return txt
def runApp(novel_list, name, t1, cwd=''):
    article_num = len(novel_list)
    xc_num = article_num // 20 + 1  # one thread per block of 20 chapters
    print(f"Starting {xc_num} threads")

    def inter(link, f, i):
        # Fetch one chapter and append it to this thread's file.
        # A loop replaces the original recursive retry, which could hit the recursion limit.
        while True:
            try:
                title, content = getOne(link)
                txt = writeOne(title, content)
                f.write(txt)
                print(f"\rThread {i} is writing {title}", end="")
                return
            except Exception as e:
                print(f"\nRequest failed ({e}), probably crawling too fast; retrying in 1s")
                sleep(1)

    def inner(name, i, begin, end, cwd):
        # Download chapters [begin, end) into downloads/<name>/<i>.txt.
        with open(f"{cwd}downloads/{name}/{i}.txt", mode='w+', encoding='utf-8') as f:
            for link in novel_list[begin:end]:
                inter(link, f, i)
        # The file is closed and flushed here, before any merge reads it.
        print(f"\nThread {i} finished")
        print(f"\nThreads remaining: {len(enumerate())}")
        base_xc = 2 if not cwd else 4  # live-thread count expected when this is the last worker
        if len(enumerate()) <= base_xc:
            print(enumerate())
            print("\nDownload complete")
            t2 = time()
            print(f"\nDownloading the novel took {round(t2 - t1)}s in total")
            hebing(f"{cwd}downloads/{name}")

    for i in range(1, xc_num + 1):
        begin = 20 * (i - 1)
        end = 20 * i if i != xc_num else article_num
        if i == xc_num:
            print("\nAll threads are started")
        thread_it(inner, name, i, begin, end, cwd)
        sleep(0.5)
def paixuRule(elem):
    # Sort key: "12.txt" -> 12, so files merge in numeric rather than lexical order.
    return int(elem.split(".")[0])
def hebing(path):
    # Merge the per-thread files in path/ into a single path.txt.
    dirs = os.listdir(path)
    dirs.sort(key=paixuRule, reverse=False)
    with open(path + ".txt", mode='w+', encoding='utf-8') as f:
        for file in dirs:
            with open(path + "/" + file, mode="r", encoding="utf-8") as f1:
                f.write(f1.read())
    print("The novel has been merged")
if __name__ == '__main__':
    t1 = time()
    name, _, _, novel_list = getAll(url="https://www.bookbao8.com/book/201506/04/id_XNDMyMjA1.html")
    print(name)
    if name is None:
        raise SystemExit("Could not read the book's index page")
    os.makedirs("downloads/" + name, exist_ok=True)  # also creates downloads/ if it is missing
    runApp(novel_list, name, t1)
    # The workers are daemon threads, so keep the main thread alive; stop with Ctrl+C.
    while True:
        sleep(1)
Publisher: Full stack programmer webmaster. When reprinting, please credit the source: https://javaforall.cn/148031.html Link to the original text: https://javaforall.cn
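The listing above manages its threads by hand and keeps retrying a failed chapter. As an alternative sketch (not the author's method), the standard library's `concurrent.futures.ThreadPoolExecutor` caps the number of concurrent requests and returns results in input order, which removes the need for per-thread files and a merge step. The names `fetch_with_retry`, `download_all`, and `fake_fetch` are hypothetical, and `fake_fetch` stands in for `getOne` so the example runs without network access:

```python
from concurrent.futures import ThreadPoolExecutor
from time import sleep


def fetch_with_retry(fetch, link, attempts=3, delay=1.0):
    """Call fetch(link), retrying a limited number of times on any exception."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(link)
        except Exception:
            if attempt == attempts:
                raise  # give up instead of retrying forever
            sleep(delay)


def download_all(fetch, links, max_workers=8):
    """Fetch every link concurrently and return the results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `links`.
        return list(pool.map(lambda link: fetch_with_retry(fetch, link), links))


if __name__ == "__main__":
    # Stand-in for getOne (an assumption for this sketch): no network access needed.
    def fake_fetch(link):
        title, url = link
        return title, f"content of {url}"

    links = [(f"Chapter {n}", f"https://example.invalid/{n}.html") for n in range(1, 6)]
    for title, content in download_all(fake_fetch, links):
        print(title, content)
```

Choosing a small `max_workers` value also makes it easier to stay within what the target site can tolerate.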