Schoolbag novel multithreaded crawler [easy to understand]
2022-07-02 17:19:00 【Full stack programmer webmaster】
Hello everyone, good to see you again. I'm your friend Quan Jun.
Schoolbag (www.bookbao8.com) is a good novel site: it offers novels as txt downloads, and its back end handles high concurrency, so you don't need to worry about taking the site down by crawling it.
So why not use it as a practice crawler project?
Straight to the code. This multithreaded crawler can be adapted to similar sites; the key requirement is that the site can handle high concurrency, otherwise it will go down within minutes.
After all, downloading an 18 MB novel in 5 minutes is very fast.
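The core idea is to split the chapter list into fixed-size batches and give each batch its own thread. The following is a minimal sketch of that idea, not the crawler itself: it assumes a batch size of 20, and `split_batches`, `fetch_chapter`, and `crawl_in_batches` are hypothetical names, with `fetch_chapter` standing in for the real HTTP request so the example runs without a network.

```python
from threading import Thread


def split_batches(items, size=20):
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def fetch_chapter(title):
    # Placeholder for the real download (assumption): returns fake chapter text.
    return f"{title}\n(text of {title})\n"


def crawl_in_batches(titles, size=20):
    batches = split_batches(titles, size)
    results = [None] * len(batches)  # one slot per thread keeps the chapter order

    def worker(index, batch):
        results[index] = "".join(fetch_chapter(t) for t in batch)

    threads = [Thread(target=worker, args=(i, b)) for i, b in enumerate(batches)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every batch to finish before merging
    return "".join(results)


titles = [f"Chapter {n}" for n in range(1, 46)]
book = crawl_in_batches(titles)
print(len(split_batches(titles)))  # 3 batches: 20 + 20 + 5 chapters
```

Joining the threads gives a definite point at which all batches are done, so the merge step does not have to guess from the number of live threads.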
from lxml import etree
import requests
from threading import Thread,enumerate
import os
from time import sleep,time
headers = {
    # ':authority':'www.bookbao8.com',
    # ':method': 'GET',
    # ':path': '/book/201506/04/id_XNDMyMjA1.html',
    # ':scheme': 'https',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'zh-CN,zh;q=0.9',
    'cache-control': 'max-age=0',
    'cookie': 'Hm_lvt_79d6c18dfed73a9524dc37b056df45ec=1577182135; Hm_lpvt_79d6c18dfed73a9524dc37b056df45ec=1577182135; Hm_lvt_9e424f40a62d01a6b9036c7d25ce9a05=1577182142; trustedsite_visit=1; bk_ad=2; __cm_warden_uid=840a745a752905060cd14982b4bbc922coo; __cm_warden_upi=MTE5LjQuMjI4LjE1Nw%3D%3D; Hm_lpvt_9e424f40a62d01a6b9036c7d25ce9a05=1577185720',
    'referer': 'https://www.bookbao8.com/book/201506/04/id_XNDMyMjA1.html',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
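def thread_it(func, *args):
    # Run func(*args) in a daemon thread, which does not keep the process alive on its own.
    t = Thread(target=func, args=args)
    t.daemon = True  # setDaemon() is deprecated since Python 3.10
    t.start()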
def getAll(url="https://www.bookbao8.com/book/201506/04/id_XNDMyMjA1.html"):
    # Fetch the book's index page and return (name, author, type, [(title, link), ...]).
    r = requests.get(url, headers=headers)
    if r.status_code != 200:
        return None, None, None, None
    r.encoding = r.apparent_encoding
    page_source = etree.HTML(r.text)
    name = page_source.xpath('//*[@id="info"]/h1/text()')
    author = page_source.xpath('//*[@id="info"]/p[1]/a/text()')
    novel_type = page_source.xpath('//*[@id="info"]/p[2]/a/text()')
    title = page_source.xpath('/html/body/div[7]/ul/li/a/text()')
    link = page_source.xpath('/html/body/div[7]/ul/li/a/@href')
    link = map(lambda x: 'https://www.bookbao8.com' + x, link)  # make every chapter link absolute
    novel_list = list(zip(title, link))  # pair each chapter title with its link
    if len(novel_list) > 0:
        return name[0], author[0], novel_type[0], novel_list
    return None, None, None, None
def getOne(link=(' The first 0001 Chapter The boy who came out of the Jedi ', 'https://www.bookbao8.com/views/201506/04/id_XNDMyMjA1_1.html')):
    # link is a (chapter title, chapter URL) pair; returns (title, chapter text).
    r = requests.get(link[1], headers=headers)
    if r.status_code != 200:
        return None, None
    r.encoding = r.apparent_encoding
    page_source = etree.HTML(r.text)
    node_title = link[0]
    node_content = page_source.xpath('//*[@id="contents"]/text()')
    node_content = "".join(node_content).replace("\n \xa0 \xa0", "")
    if len(node_title) > 0:
        return node_title, node_content
    return None, None
def writeOne(title, content):
    # Format one chapter: indented title line, then the text, then a blank line.
    txt = "\t\t" + title + "\n" + content + "\n\n"
    return txt
def runApp(novel_list, name, t1, cwd=''):
    article_num = len(novel_list)
    xc_num = (article_num + 19) // 20  # one thread per 20 chapters, rounded up
    print(f"Number of threads to start: {xc_num}")
    def inter(link, f, i):
        # Download one chapter and append it to this thread's file, retrying until it succeeds.
        while True:
            try:
                title, content = getOne(link)
                txt = writeOne(title, content)
                f.write(txt)
                print(f"\rThread {i} wrote {title}", end="")
                return
            except Exception as e:
                print(f"\nRequest failed ({e}), probably refused for crawling too fast; retrying in 1s")
                sleep(1)
    def inner(name, i, begin, end, cwd):
        # Write chapters [begin, end) to downloads/<name>/<i>.txt.
        with open(f"{cwd}downloads/{name}/{i}.txt", mode='w+', encoding='utf-8') as f:
            for link in novel_list[begin:end]:
                inter(link, f, i)
        # The part file is closed and flushed here, before any merge can read it.
        print(f"\nThread {i} finished")
        print(f"\nThreads remaining: {len(enumerate())}")
        base_xc = 2 if not cwd else 4  # thread count at which this worker is treated as the last one
        if len(enumerate()) <= base_xc:
            print(enumerate())
            print("\nDownload complete")
            t2 = time()
            print(f"\nDownloading the novel took {round(t2 - t1)}s in total")
            hebing(f"{cwd}downloads/{name}")
    for i in range(1, xc_num + 1):
        begin = 20 * (i - 1)
        end = 20 * i if i != xc_num else article_num
        if i == xc_num:
            print("\nAll threads started")
        thread_it(inner, name, i, begin, end, cwd)
        sleep(0.5)
def paixuRule(elem):
    # Sort key: "12.txt" -> 12, so the part files are merged in numeric order.
    return int(elem.split(".")[0])
def hebing(path):
    # Merge the numbered part files in `path` into a single file named `path`.txt.
    dirs = os.listdir(path)
    dirs.sort(key=paixuRule, reverse=False)
    with open(path + ".txt", mode='w+', encoding='utf-8') as f:
        for file in dirs:
            with open(path + "/" + file, mode="r", encoding="utf-8") as f1:
                f.write(f1.read())
    print("Novel merged")
if __name__ == '__main__':
    t1 = time()
    name, _, _, novel_list = getAll(url="https://www.bookbao8.com/book/201506/04/id_XNDMyMjA1.html")
    print(name)
    if not novel_list:
        raise SystemExit("Could not read the chapter list")
    os.makedirs("downloads/" + name, exist_ok=True)  # also creates downloads/ if it is missing
    runApp(novel_list, name, t1)
    # The workers are daemon threads, so keep the main thread alive until they are done.
    while len(enumerate()) > 1:
        sleep(0.5)
Publisher: Full stack programmer webmaster. Please credit the source when reprinting: https://javaforall.cn/148031.html Original link: https://javaforall.cn
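The listing retries a failed chapter without any limit, which can loop forever if a page is permanently broken. One possible alternative is a bounded retry with a growing delay. This is only a sketch under assumptions: `retry`, `max_attempts`, and `base_delay` are hypothetical names, and `flaky` simulates a request that fails twice before succeeding.

```python
import time


def retry(func, *args, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(*args); on failure wait and try again, at most max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args)
        except Exception as exc:
            if attempt == max_attempts:
                raise  # give up and re-raise the last error
            delay = base_delay * attempt  # 1s, 2s, 3s, ...
            print(f"attempt {attempt} failed ({exc}); retrying in {delay}s")
            sleep(delay)


calls = {"count": 0}


def flaky(link):
    # Simulated download (assumption): fails twice, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("connection refused")
    return f"content of {link}"


# sleep is replaced by a no-op here so the demo finishes instantly
print(retry(flaky, "chapter-1", sleep=lambda s: None))
```

In the crawler, a call such as `retry(getOne, link)` could take the place of the unlimited retry, provided `getOne` raises on failure (for example by calling `r.raise_for_status()` on the response).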