Beginner Web Scraping: A Biquge (笔趣阁) Scraper
2022-07-02 04:35:00 【weixin_43446292】
import requests
from lxml import etree

# Example: the index page for 春日宴 is https://www.xbiquge.la/20/20671/
base_url = input("Enter the novel URL: ")
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0"
}
response = requests.get(base_url, headers=headers)
r = response.content.decode()
r1 = etree.HTML(r)  # parse the HTML string into an element tree
r2 = r1.xpath('//*[@id="info"]/h1/text()')        # novel title
r3 = r1.xpath('//*[@id="list"]/dl/dd/a/text()')   # chapter titles
r4 = r1.xpath('//*[@id="list"]/dl/dd/a/@href')    # chapter links
r5 = r1.xpath('//*[@id="info"]/p[1]/text()')      # author line
r6 = r1.xpath('//*[@id="intro"]/p[2]/text()')     # synopsis
# Author name: the text after the colon (ASCII or full-width) in the author line
r7 = ''.join(r5).replace(':', ':').split(':')[1].strip()

# Build the full URL of every chapter
chapter_list = []
for i in r4:
    url = "https://www.xbiquge.la" + i
    chapter_list.append(url)

# Name of the output .txt file: <title>by<author>.txt
title = '{}by{}.txt'.format(r2[0], r7)

with open(title, "a", encoding="utf-8") as f:
    f.writelines(r6)
    f.write('\n')
    for (x, y) in zip(chapter_list, r3):
        response2 = requests.get(x, headers=headers)
        res = response2.content.decode()
        res1 = etree.HTML(res)
        res3 = res1.xpath('//*[@id="content"]/text()')  # chapter body text
        f.writelines(y)  # write the chapter title
        f.write('\n')
        f.writelines(res3)  # write the chapter body
        f.write('\n')

print("{} scraped, {} chapters in total".format(title, len(chapter_list)))
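
The script builds each chapter URL by prepending the site root to the href taken from the index page, which only works when every href is root-relative. As a minimal sketch of an alternative, assuming the hrefs may be either root-relative or relative to the index page, the standard library's `urljoin` can resolve both forms. The helper name `build_chapter_urls` and the sample hrefs are hypothetical and used only for illustration.

```python
from urllib.parse import urljoin


def build_chapter_urls(base_url, hrefs):
    """Resolve each chapter href against the novel's index page URL."""
    return [urljoin(base_url, href) for href in hrefs]


# Hypothetical sample hrefs: one root-relative, one relative to the index page.
urls = build_chapter_urls(
    "https://www.xbiquge.la/20/20671/",
    ["/20/20671/1.html", "2.html"],
)
for u in urls:
    print(u)
```

In the script, this would replace the loop that fills `chapter_list`, with `base_url` and `r4` passed as the two arguments.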