A multithreaded novel crawler for bookbao8.com [easy to understand]
2022-07-02 17:19:00 【Full stack programmer webmaster】
Hello everyone, nice to see you again. I'm your friend Quan Jun.
bookbao8.com (the name translates as "Schoolbag") is a good novel site that offers novels as txt downloads. Its back end handles high concurrency, so there is little risk of bringing the site down by crawling it.
That makes it a good target for a practice crawler project.
Let's go straight to the code. This multithreaded crawler can crawl various similar sites. The key requirement is that the site supports high concurrency; otherwise it will collapse within minutes.
For reference, it downloads an 18 MB novel in about 5 minutes, which is very fast.
from lxml import etree
import requests
from threading import Thread,enumerate
import os
from time import sleep,time
headers = {
    # HTTP/2 pseudo-headers copied from the browser; requests sets these itself.
    # ':authority':'www.bookbao8.com',
    # ':method': 'GET',
    # ':path': '/book/201506/04/id_XNDMyMjA1.html',
    # ':scheme': 'https',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    # 'br' is left out: requests can only decode Brotli when an optional Brotli package is installed.
    'accept-encoding': 'gzip, deflate',
    'accept-language': 'zh-CN,zh;q=0.9',
    'cache-control': 'max-age=0',
    'cookie': 'Hm_lvt_79d6c18dfed73a9524dc37b056df45ec=1577182135; Hm_lpvt_79d6c18dfed73a9524dc37b056df45ec=1577182135; Hm_lvt_9e424f40a62d01a6b9036c7d25ce9a05=1577182142; trustedsite_visit=1; bk_ad=2; __cm_warden_uid=840a745a752905060cd14982b4bbc922coo; __cm_warden_upi=MTE5LjQuMjI4LjE1Nw%3D%3D; Hm_lpvt_9e424f40a62d01a6b9036c7d25ce9a05=1577185720',
    'referer': 'https://www.bookbao8.com/book/201506/04/id_XNDMyMjA1.html',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
}
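def thread_it(func, *args):
    # Run func(*args) in a daemon thread, which stops when the main thread exits.
    t = Thread(target=func, args=args)
    t.daemon = True  # replaces setDaemon(), deprecated since Python 3.10
    t.start()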
def getAll(url="https://www.bookbao8.com/book/201506/04/id_XNDMyMjA1.html"):
    # Fetch the book's index page and return (name, author, novel_type, novel_list).
    r = requests.get(url, headers=headers)
    if r.status_code != 200:
        return None, None, None, None
    r.encoding = r.apparent_encoding
    page_source = etree.HTML(r.text)
    name = page_source.xpath('//*[@id="info"]/h1/text()')
    author = page_source.xpath('//*[@id="info"]/p[1]/a/text()')
    novel_type = page_source.xpath('//*[@id="info"]/p[2]/a/text()')
    title = page_source.xpath('/html/body/div[7]/ul/li/a/text()')
    link = page_source.xpath('/html/body/div[7]/ul/li/a/@href')
    # Chapter hrefs are site-relative, so prefix each one with the host.
    link = ['https://www.bookbao8.com' + x for x in link]
    # Pair each chapter title with its URL.
    novel_list = list(zip(title, link))
    if len(novel_list) > 0:
        return name[0], author[0], novel_type[0], novel_list
    return None, None, None, None
def getOne(link=('Chapter 0001 The boy who came out of the Jedi', 'https://www.bookbao8.com/views/201506/04/id_XNDMyMjA1_1.html')):
    # link is a (title, url) pair; return (title, chapter text).
    r = requests.get(link[1], headers=headers)
    # Raise on an HTTP error status (4xx/5xx) so the caller's retry logic handles it.
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    page_source = etree.HTML(r.text)
    node_title = link[0]
    node_content = page_source.xpath('//*[@id="contents"]/text()')
    # Join the text nodes and strip the padding at the start of each paragraph.
    node_content = "".join(node_content).replace("\n \xa0 \xa0", "")
    if len(node_title) > 0:
        return node_title, node_content
    return None, None
def writeOne(title, content):
    # Format one chapter: indented title, body, then a blank line.
    txt = "\t\t" + title + "\n" + content + "\n\n"
    return txt
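The runApp function that follows gives every 20 chapters their own thread. The slicing arithmetic it uses can be sketched in isolation as below. This is only an illustration: `batch_ranges` and its `batch_size` parameter are hypothetical names that do not appear in the original code.

```python
def batch_ranges(article_num, batch_size=20):
    """Return (thread_index, begin, end) per worker, mirroring runApp's arithmetic.

    Hypothetical helper for illustration; not part of the original crawler.
    """
    thread_count = article_num // batch_size + 1
    ranges = []
    for i in range(1, thread_count + 1):
        begin = batch_size * (i - 1)
        # The last worker takes whatever chapters are left over.
        end = batch_size * i if i != thread_count else article_num
        ranges.append((i, begin, end))
    return ranges


if __name__ == "__main__":
    for i, begin, end in batch_ranges(45):
        print(f"thread {i}: novel_list[{begin}:{end}]")
```

Note that when the chapter count is an exact multiple of 20, the last range is empty (for 40 chapters it is `[40:40]`), so one thread starts with nothing to do. That is harmless but worth knowing when reading the thread count the script prints.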
def runApp(novel_list, name, t1, cwd=''):
    article_num = len(novel_list)
    xc_num = article_num // 20 + 1  # one thread per 20 chapters
    print(f"Number of threads to start: {xc_num}")

    def inter(link, f, i):
        # Download one chapter and append it to this thread's part file.
        try:
            title, content = getOne(link)
            txt = writeOne(title, content)
            f.write(txt)
            print(f"\rThread {i} wrote {title}", end="")
        except Exception:
            print("\nCrawling too fast, connection refused; waiting 1s before retrying")
            sleep(1)
            inter(link, f, i)  # retry recursively

    def inner(name, i, begin, end, cwd):
        # Each thread writes its slice of chapters to downloads/<name>/<i>.txt.
        with open(f"{cwd}downloads/{name}/{i}.txt", mode='w+', encoding='utf-8') as f:
            for link in novel_list[begin:end]:
                inter(link, f, i)
        # The part file is closed (and flushed) here, before any merge reads it.
        if begin < end:
            print(f"\nThread {i} finished")
            print(f"\nThreads remaining: {len(enumerate())}")
            # Threads expected to remain: the main thread plus this one (4 when cwd is set).
            base_xc = 2 if not cwd else 4
            if len(enumerate()) <= base_xc:
                print(enumerate())
                print("\nDownload complete")
                t2 = time()
                print(f"\nTotal download time: {round(t2 - t1)}s")
                hebing(f"{cwd}downloads/{name}")

    for i in range(1, xc_num + 1):
        begin = 20 * (i - 1)
        end = 20 * i if i != xc_num else article_num
        if i == xc_num:
            print("\nAll threads started")
        thread_it(inner, name, i, begin, end, cwd)
        sleep(0.5)
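The inter helper in runApp above retries a failed chapter by calling itself again after one second, with no upper limit, so a chapter that never loads would eventually hit Python's recursion limit. One possible alternative is a bounded loop with exponential backoff, sketched below. The name `fetch_with_retry` and its parameters are assumptions made for this illustration, and `flaky` is a stand-in for a real fetcher such as getOne.

```python
import time


def fetch_with_retry(fetch, link, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(link), retrying with exponential backoff.

    Hypothetical helper: re-raises the last error after max_attempts tries.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(link)
        except Exception:
            if attempt == max_attempts:
                raise  # give up instead of retrying forever
            # Wait 1s, 2s, 4s, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky(link):
        # Stand-in fetcher that fails twice, then succeeds (demo only).
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("refused")
        return f"content of {link}"

    # Pass a no-op sleep so the demo finishes instantly.
    print(fetch_with_retry(flaky, "chapter-1", sleep=lambda s: None))
```

Inside inter, this could wrap the getOne call, for example `title, content = fetch_with_retry(getOne, link)`, so that a permanently failing chapter surfaces as an error rather than an endless retry.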
def paixuRule(elem):
    # Sort key: the numeric part of "<n>.txt", so 10.txt sorts after 9.txt.
    return int(elem.split(".")[0])
def hebing(path):
    # Merge the per-thread part files, in numeric order, into <path>.txt.
    dirs = os.listdir(path)
    dirs.sort(key=paixuRule, reverse=False)
    with open(path + ".txt", mode='w+', encoding='utf-8') as f:
        for file in dirs:
            with open(path + "/" + file, mode="r", encoding="utf-8") as f1:
                f.write(f1.read())
    print("Novel merged")
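The part files and the hebing merge step above exist so that chapters end up in reading order even though threads finish at different times. A different approach, shown here only as a sketch and not as the author's method, is the standard library's `ThreadPoolExecutor`, whose `map` returns results in input order. `download_in_order` and `fake_fetch` are hypothetical names, and `fake_fetch` replaces the real network call so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def download_in_order(chapters, fetch, max_workers=8):
    """Fetch chapters concurrently and return their text in the original order.

    chapters: list of (title, url) pairs, shaped like novel_list.
    fetch: any callable that takes one pair and returns a string.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, whichever worker finishes first.
        return list(pool.map(fetch, chapters))


def fake_fetch(chapter):
    # Stand-in for a real network fetch (hypothetical, demo only).
    title, url = chapter
    return f"\t\t{title}\n(body from {url})\n\n"


if __name__ == "__main__":
    demo = [(f"Chapter {n}", f"https://example.invalid/{n}.html") for n in range(1, 6)]
    parts = download_in_order(demo, fake_fetch, max_workers=3)
    print("".join(parts))
```

The trade-off is memory: every chapter is held in a list until the download finishes, whereas the part-file approach writes each chapter to disk as it arrives. `max_workers` also caps concurrency directly instead of tying the thread count to the chapter count.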
if __name__ == '__main__':
    t1 = time()
    name, _, _, novel_list = getAll(url="https://www.bookbao8.com/book/201506/04/id_XNDMyMjA1.html")
    if name is None:
        raise SystemExit("Could not fetch the chapter list")
    print(name)
    # makedirs also creates the parent "downloads" directory if it is missing.
    os.makedirs("downloads/" + name, exist_ok=True)
    runApp(novel_list, name, t1)
    # Keep the main thread alive: the workers are daemon threads and stop when it exits.
    while True:
        sleep(1)
Publisher: Full stack programmer webmaster. Please credit the source when reprinting: https://javaforall.cn/148031.html Original link: https://javaforall.cn