Crawl the Top 500 books in 50 lines of code and save them to a TXT file
2022-06-26 18:17:00 【Little fox dreams of going to fairy tale town】
import re  # Regular expressions, used for text extraction
import json
import requests


def main(page):
    # Build the URL of the ranking page to crawl
    baseurl = "http://bang.dangdang.com/books/fivestars/01.00.00.00.00.00-recent30-0-0-1-" + str(page)
    # Crawl and parse the page content
    datalist = getData(baseurl)
    # Save the parsed data
    savepath = "Top500_book.txt"
    saveData(datalist, savepath)


# Get data
def getData(baseurl):
    html = askURL(baseurl)
    datalist = parse_result(html)
    return datalist


# Parse the page source
def parse_result(html):
    if html is None:  # The request failed, so there is nothing to parse
        return
    # The raw page source is assumed to encode the yen sign as &yen;
    # a literal ¥ is accepted as well.
    pattern = re.compile(r'<li>.*?list_num.*?(\d+).</div>.*?<img src="(.*?)".*?class="name".*?title="(.*?)">.*?class="star">.*?class="tuijian">(.*?)</span>.*?class="publisher_info">.*?target="_blank">(.*?)</a>.*?class="biaosheng">.*?<span>(.*?)</span></div>.*?<p><span\sclass="price_n">(?:&yen;|¥)(.*?)</span>.*?</li>', re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            'range': item[0],      # rank in the list
            'image': item[1],      # cover image URL
            'title': item[2],
            'recommend': item[3],  # recommendation rate
            'author': item[4],
            'times': item[5],      # number of five-star ratings
            'price': item[6]
        }


# Get the page source; returns None if the request fails
def askURL(url):
    try:
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            return response.text
    except requests.RequestException:
        pass
    return None


# Save data to a TXT file, one JSON object per line
def saveData(datalist, savepath):
    print("save....")
    # Open the file once; the with block closes it automatically
    with open(savepath, 'a', encoding='UTF-8') as f:
        for item in datalist:
            f.write(json.dumps(item, ensure_ascii=False) + '\n')


if __name__ == '__main__':
    # Loop over the pages: 25 pages of 20 books each
    for i in range(1, 26):
        main(i)
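The parsing step can be tried offline, without sending any request. The sketch below applies a shortened version of the pattern (rank, title and price only) to a hand-written HTML fragment. The fragment, the names SAMPLE_HTML, PATTERN and parse_sample, and the conversion to int and float are all assumptions made for illustration; the real page markup may differ.

```python
import re

# Hypothetical fragment that imitates the structure the crawler expects.
SAMPLE_HTML = """
<li>
<div class="list_num red">1.</div>
<div class="name"><a href="#" title="Example Book One">Example Book One</a></div>
<div class="price"><p><span class="price_n">&yen;39.80</span></p></div>
</li>
<li>
<div class="list_num">2.</div>
<div class="name"><a href="#" title="Example Book Two">Example Book Two</a></div>
<div class="price"><p><span class="price_n">&yen;25.00</span></p></div>
</li>
"""

# re.S lets "." match newlines, so one pattern can span a whole <li> block.
# Every ".*?" is non-greedy, so each match stops at the first closing </li>.
PATTERN = re.compile(
    r'<li>.*?list_num.*?(\d+)\.</div>'
    r'.*?class="name".*?title="(.*?)">'
    r'.*?class="price_n">(?:&yen;|¥)(.*?)</span>.*?</li>',
    re.S,
)


def parse_sample(html):
    """Yield one dict per <li> block found in the fragment."""
    for rank, title, price in PATTERN.findall(html):
        yield {'rank': int(rank), 'title': title, 'price': float(price)}


books = list(parse_sample(SAMPLE_HTML))
print(books)
```

Checking the pattern against a small fixed fragment like this makes it easier to notice when a change in the page layout silently stops the pattern from matching.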
Running results: each crawled book is appended to Top500_book.txt as one JSON object per line.
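Because saveData writes one JSON object per line, the output file can be read back line by line. The following is a minimal sketch of that, using a temporary file with placeholder records instead of real crawl output; load_books and the sample records are hypothetical names and data introduced only for this example.

```python
import json
import os
import tempfile


def load_books(path):
    """Read a file with one JSON object per line into a list of dicts."""
    books = []
    with open(path, encoding='UTF-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                books.append(json.loads(line))
    return books


# Placeholder records in the same shape the crawler writes (values are strings).
sample = [
    {'range': '1', 'title': 'Example Book One', 'price': '39.80'},
    {'range': '2', 'title': 'Example Book Two', 'price': '25.00'},
]

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo_books.txt')
    # Write the records the same way saveData does.
    with open(demo_path, 'a', encoding='UTF-8') as f:
        for item in sample:
            f.write(json.dumps(item, ensure_ascii=False) + '\n')
    loaded = load_books(demo_path)

# The price is stored as text, so convert it before comparing.
cheapest = min(loaded, key=lambda b: float(b['price']))
print(len(loaded), cheapest['title'])
```

Since the script opens the file in append mode, running it twice would store every book twice; deleting Top500_book.txt before a new run avoids duplicate lines.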