Beginner web scraping: a Biquge (笔趣阁) scraper
2022-07-02 04:35:00 【weixin_43446292】
import requests
from lxml import etree

# Index page of the novel, e.g. https://www.xbiquge.la/20/20671/ for 春日宴
base_url = input("Enter the novel URL: ")
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0"
}
response = requests.get(base_url, headers=headers)
r = response.content.decode()
r1 = etree.HTML(r)  # parse the HTML string into an element tree
r2 = r1.xpath('//*[@id="info"]/h1/text()')  # novel title
r3 = r1.xpath('//*[@id="list"]/dl/dd/a/text()')  # title of each chapter
r4 = r1.xpath('//*[@id="list"]/dl/dd/a/@href')  # link to each chapter
r5 = r1.xpath('//*[@id="info"]/p[1]/text()')  # author line
r6 = r1.xpath('//*[@id="intro"]/p[2]/text()')  # synopsis
# Author name: the text after the colon (the page may use a full-width colon)
r7 = ''.join(r5).replace('：', ':').split(':')[1]

# Build the absolute URL of each chapter
chapter_list = []
for i in r4:
    url = "https://www.xbiquge.la" + i
    chapter_list.append(url)

# Name of the output file: "<title>by<author>.txt"
title = '{}by{}.txt'.format(r2[0], r7)

with open(title, "a", encoding="utf-8") as f:
    f.writelines(r6)  # synopsis first
    f.write('\n')
    for (x, y) in zip(chapter_list, r3):
        response2 = requests.get(x, headers=headers)
        res = response2.content.decode()
        res1 = etree.HTML(res)
        res3 = res1.xpath('//*[@id="content"]/text()')  # chapter body text
        f.write(y)  # chapter title
        f.write('\n')
        f.writelines(res3)  # chapter body
        f.write('\n')
print("{} finished, {} chapters in total".format(title, len(chapter_list)))
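The script joins chapter links to a hard-coded host and writes the raw text nodes straight to the file. As a minimal sketch of two optional refinements, the helpers below resolve each link against the index URL with the standard library's `urljoin` and trim whitespace from the extracted text. The names `build_chapter_urls` and `clean_lines`, and the sample hrefs, are hypothetical and not part of the original script; it is also only an assumption that chapter text on the site carries padding worth stripping.

```python
from urllib.parse import urljoin


def build_chapter_urls(index_url, hrefs):
    """Resolve chapter hrefs (absolute-path or relative) against the index URL."""
    return [urljoin(index_url, href) for href in hrefs]


def clean_lines(text_nodes):
    """Strip surrounding whitespace (including non-breaking spaces) and drop empty nodes."""
    stripped = (node.strip() for node in text_nodes)
    return [line for line in stripped if line]


if __name__ == "__main__":
    index_url = "https://www.xbiquge.la/20/20671/"
    # Hypothetical hrefs, standing in for the values in r4
    for url in build_chapter_urls(index_url, ["/20/20671/1.html", "2.html"]):
        print(url)
    # Hypothetical text nodes, standing in for the values in res3
    print(clean_lines(["\xa0\xa0First paragraph.\r\n", "   ", "Second paragraph."]))
```

In the script, `build_chapter_urls(base_url, r4)` could stand in for the loop that fills `chapter_list`, and `f.write('\n'.join(clean_lines(res3)))` for `f.writelines(res3)`, if one paragraph per line is wanted.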