Beginner Web Scraping: A Biquge (笔趣阁) Scraper
2022-07-02 04:35:00 【weixin_43446292】
import requests
from lxml import etree

# e.g. the URL for 春日宴 is https://www.xbiquge.la/20/20671/
base_url = input("Enter the novel URL: ")
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0"
}
response = requests.get(base_url, headers=headers)
r = response.content.decode()
r1 = etree.HTML(r)  # parse the HTML string into an element tree
r2 = r1.xpath('//*[@id="info"]/h1/text()')        # novel title
r3 = r1.xpath('//*[@id="list"]/dl/dd/a/text()')   # title of each chapter
r4 = r1.xpath('//*[@id="list"]/dl/dd/a/@href')    # link to each chapter
r5 = r1.xpath('//*[@id="info"]/p[1]/text()')      # line containing the author's name
r6 = r1.xpath('//*[@id="intro"]/p[2]/text()')     # synopsis
# Keep only the text after the colon (accepts a full-width or an ASCII colon)
r7 = ''.join(r5).replace('：', ':').split(':', 1)[-1].strip()

chapter_list = []
for i in r4:
    url = "https://www.xbiquge.la" + i
    chapter_list.append(url)  # build the full URL of each chapter

for i in r2:
    title = '{}by{}.txt'.format(i, r7)  # name of the output .txt file
    with open(title, "a", encoding="utf-8") as f:
        f.writelines(r6)
        f.write('\n')
        for (x, y) in zip(chapter_list, r3):
            response2 = requests.get(x, headers=headers)
            res = response2.content.decode()
            res1 = etree.HTML(res)
            res3 = res1.xpath('//*[@id="content"]/text()')  # text of the chapter
            f.write(y)  # write the chapter title
            f.write('\n')
            f.writelines(res3)  # write the chapter text
            f.write('\n')
    print("{} scraped, {} chapters in total".format(title, len(chapter_list)))
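The script builds chapter URLs by prepending a hard-coded host to each href and takes the author's name by splitting on a colon. Both steps are a little fragile, so here is a minimal sketch of how they could be made more tolerant, using only the standard library. The helper names `build_chapter_urls` and `parse_author`, and the sample chapter paths and author text in the usage note, are hypothetical and not part of the original script.

```python
from urllib.parse import urljoin


def build_chapter_urls(base_url, hrefs):
    """Resolve each chapter href against the novel's index URL.

    urljoin handles both root-relative hrefs ("/20/20671/1.html")
    and page-relative ones ("1.html"), so no host is hard-coded.
    """
    return [urljoin(base_url, href) for href in hrefs]


def parse_author(info_text):
    """Return the text after the first colon in an 'author: name' line.

    Accepts a full-width or an ASCII colon; if there is no colon at all,
    the stripped input is returned instead of raising IndexError.
    """
    normalized = info_text.replace("：", ":")
    return normalized.split(":", 1)[-1].strip()
```

For example, `build_chapter_urls("https://www.xbiquge.la/20/20671/", ["/20/20671/1.html"])` resolves the href against the index page, and `parse_author("作者：某作者")` returns the part after the colon. These helpers could stand in for the `chapter_list` loop and the `r7` line, assuming the site keeps the page layout the XPath expressions expect.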