Crawling Douban Top 250 movie information with Requests + BS4
2022-07-05 13:48:00 【Weichi Begonia】
"""Scrape the Douban Top 250 movie list."""
import requests
import bs4


def open_url(url):
    """Fetch a page, sending a browser-like User-Agent header."""
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}
    # A timeout keeps the script from hanging on a stalled connection.
    res = requests.get(url, headers=headers, timeout=10)
    return res


def find_movies(res):
    """Parse the page with bs4 and return one formatted line per movie.

    :param res: response returned by open_url()
    :return: list of strings, each holding title, score and details
    """
    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # Movie titles
    movies = []
    targets = soup.find_all('div', class_="hd")
    for each in targets:
        movies.append(each.a.span.text)

    # Scores
    ranks = []
    targets = soup.find_all('span', class_='rating_num')
    for each in targets:
        ranks.append(each.text)

    # Details: lines 2 and 3 of the first <p> inside each 'bd' block
    messages = []
    targets = soup.find_all('div', class_='bd')
    for each in targets:
        try:
            lines = each.p.text.split('\n')
            messages.append(lines[1].strip() + lines[2].strip())
        except (AttributeError, IndexError):
            # Skip 'bd' blocks that have no <p> or too few lines.
            continue

    # Join the three fields, separated by spaces, one movie per line.
    result = []
    length = len(movies)
    for i in range(length):
        result.append(f'{movies[i]} {ranks[i]} {messages[i]}\n')
    return result


def find_depth(res):
    """Find out how many pages there are.

    :param res: response for the first page
    :return: number of pages as an int
    """
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    # The last page number sits two siblings before the "next" span
    # (the sibling in between is a whitespace text node).
    depth = soup.find('span', class_='next').previous_sibling.previous_sibling.text
    return int(depth)


def main():
    host = 'https://movie.douban.com/top250'
    res = open_url(host)
    depth = find_depth(res)

    result = []
    for i in range(depth):
        # Each page lists 25 movies; 'start' is the offset of the first one.
        url = host + "?start=" + str(25 * i)
        res = open_url(url)
        result.extend(find_movies(res))

    with open('douban_250.txt', 'w', encoding='utf-8') as f:
        for each in result:
            f.write(each)


if __name__ == '__main__':
    main()
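The paging logic in main() can be checked on its own, without touching the network. The sketch below is only an illustration: `page_urls` and `page_size` are hypothetical names that do not appear in the script above, and the page count of 10 is an assumption (250 movies at 25 per page) rather than a value read from the site. It builds the same `?start=` offsets that main() produces with `25 * i`.

```python
from urllib.parse import urlencode

HOST = 'https://movie.douban.com/top250'  # same host as in main()


def page_urls(host, depth, page_size=25):
    """Return one URL per result page, using the ?start=<offset> parameter.

    page_urls and page_size are illustrative names, not part of the script.
    """
    return [f"{host}?{urlencode({'start': page_size * i})}" for i in range(depth)]


if __name__ == '__main__':
    urls = page_urls(HOST, 10)  # 10 pages is an assumption: 250 / 25
    print(urls[0])   # first page, offset 0
    print(urls[-1])  # last page, offset 225
```

Printing the list before running the crawler is a quick way to confirm the offsets run from 0 to 225 in steps of 25.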