Beginner Web Scraping: A Biquge (笔趣阁) Crawler
2022-07-02 04:35:00 【weixin_43446292】
import requests
from lxml import etree

base_url = input("Enter the novel URL: ")  # e.g. for 春日宴: https://www.xbiquge.la/20/20671/
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0"
}
response = requests.get(base_url, headers=headers)
r = response.content.decode()
r1 = etree.HTML(r)  # parse the HTML string into an element tree
r2 = r1.xpath('//*[@id="info"]/h1/text()')  # novel title
r3 = r1.xpath('//*[@id="list"]/dl/dd/a/text()')  # title of each chapter
r4 = r1.xpath('//*[@id="list"]/dl/dd/a/@href')  # link to each chapter
r5 = r1.xpath('//*[@id="info"]/p[1]/text()')  # line containing the author name
r6 = r1.xpath('//*[@id="intro"]/p[2]/text()')  # synopsis
# Author name is the text after the colon; accept a full-width or an ASCII colon.
r7 = ''.join(r5).replace('：', ':').split(':')[1]

chapter_list = []
for i in r4:
    url = "https://www.xbiquge.la" + i
    chapter_list.append(url)  # build the full URL of each chapter

title = '{}by{}.txt'.format(r2[0], r7)  # name of the output txt file

with open(title, "a", encoding="utf-8") as f:
    f.writelines(r6)
    f.write('\n')
    for (x, y) in zip(chapter_list, r3):
        response2 = requests.get(x, headers=headers)  # send the same User-Agent as for the index page
        res = response2.content.decode()
        res1 = etree.HTML(res)
        res3 = res1.xpath('//*[@id="content"]/text()')  # body text of the chapter
        f.writelines(y)  # write the chapter title
        f.write('\n')
        f.writelines(res3)  # write the chapter body
        f.write('\n')
print("{} finished, {} chapters in total".format(title, len(chapter_list)))
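The script builds chapter URLs by string concatenation and uses the scraped title directly as a file name. Below is a minimal sketch of two slightly more defensive helpers, using only the standard library. The names `build_chapter_urls` and `safe_filename` and the sample hrefs are hypothetical, made up for illustration; they are not part of the original script.

```python
import re
from urllib.parse import urljoin


def build_chapter_urls(index_url, hrefs):
    """Resolve each chapter href against the URL of the index page."""
    # urljoin handles both site-absolute hrefs ("/20/20671/1.html")
    # and relative ones ("1.html").
    return [urljoin(index_url, href) for href in hrefs]


def safe_filename(book, author):
    """Build '<book>by<author>.txt', replacing characters Windows forbids in file names."""
    name = "{}by{}.txt".format(book.strip(), author.strip())
    return re.sub(r'[\\/:*?"<>|]', "_", name)


# Made-up hrefs, only to show the two forms being resolved.
urls = build_chapter_urls(
    "https://www.xbiquge.la/20/20671/",
    ["/20/20671/10000001.html", "10000002.html"],
)
print(urls)
print(safe_filename("春日宴", "a/b"))
```

In the script, `build_chapter_urls(base_url, r4)` could stand in for the `chapter_list` loop, and `safe_filename(r2[0], r7)` for the `title` format call, assuming the page structure matches the XPath expressions used there.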