50 lines of code to crawl the Top 500 books and save them to a TXT file
2022-06-26 18:17:00 【Little fox dreams of going to fairy tale town】
import json
import re  # regular expressions, used to extract fields from the HTML

import requests


def main(page):
    # Build the URL of the ranking page to crawl
    baseurl = "http://bang.dangdang.com/books/fivestars/01.00.00.00.00.00-recent30-0-0-1-" + str(page)
    # Fetch the page and parse its contents
    datalist = getData(baseurl)
    # Append the parsed records to the output file
    savepath = "Top500_book.txt"
    saveData(datalist, savepath)


# Fetch one page and return its parsed records
def getData(baseurl):
    html = askURL(baseurl)
    if html is None:  # request failed or returned a non-200 status
        return []
    return parse_result(html)


# Parse the page source; yields one dict per book
def parse_result(html):
    # Raw string so that \d and \s reach the regex engine unchanged
    pattern = re.compile(r'<li>.*?list_num.*?(\d+).</div>.*?<img src="(.*?)".*?class="name".*?title="(.*?)">.*?class="star">.*?class="tuijian">(.*?)</span>.*?class="publisher_info">.*?target="_blank">(.*?)</a>.*?class="biaosheng">.*?<span>(.*?)</span></div>.*?<p><span\sclass="price_n">¥(.*?)</span>.*?</li>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'range': item[0],
            'image': item[1],
            'title': item[2],
            'recommend': item[3],
            'author': item[4],
            'times': item[5],
            'price': item[6]
        }


# Download the page source; returns None on any failure
def askURL(url):
    try:
        response = requests.get(url, timeout=10)  # timeout so a stalled request cannot hang the crawl
        if response.status_code == 200:
            return response.text
    except requests.RequestException:
        pass
    return None


# Append the records to a txt file, one JSON object per line
def saveData(datalist, savepath):
    print("save....")
    # Open the file once per page; the with block closes it automatically
    with open(savepath, 'a', encoding='UTF-8') as f:
        for item in datalist:
            f.write(json.dumps(item, ensure_ascii=False) + '\n')


if __name__ == '__main__':
    # Loop over pages 1 to 25, which together cover the Top 500
    for i in range(1, 26):
        main(i)
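Because saveData writes one JSON object per line, the resulting Top500_book.txt can be read back line by line. The following is only a minimal sketch of how that might look, not part of the original script: `load_books` and `cheapest` are hypothetical helper names, and it assumes each record's 'price' field is a plain decimal string such as "35.00".

```python
import json


def load_books(path):
    """Read a file that holds one JSON object per non-empty line."""
    books = []
    with open(path, encoding="UTF-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                books.append(json.loads(line))
    return books


def cheapest(books, n=5):
    """Return the n lowest-priced records (assumes 'price' parses as a float)."""
    return sorted(books, key=lambda b: float(b["price"]))[:n]


# Example usage, once the crawler has produced the file:
#     for book in cheapest(load_books("Top500_book.txt")):
#         print(book["title"], book["price"])
```

Keeping one record per line means the file can be appended to page by page during the crawl and still be parsed later without loading a single large JSON document.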