[Notes] 2022.5.28 Fetching data from a web page and writing it to a database
2022-06-30 03:27:00 【Sprite.Nym】
"""
example11 - Persisting crawler data in a database

400 - Bad Request.
401 - Unauthorized.
403 - Forbidden.
404 - Not Found.
405 - Method Not Allowed.
418 - I'm a teapot.

create table `tb_top_movie`
(
    `mov_id` bigint unsigned auto_increment comment 'ID',
    `mov_title` varchar(200) not null comment 'title',
    `mov_rating_num` decimal(3,1) not null comment 'rating',
    `mov_comments_count` bigint not null comment 'number of comments',
    primary key (`mov_id`)
) engine=innodb auto_increment=1001 comment 'movie data table';

Author: Hao
Date: 2022/5/28
"""
import bs4
import pymysql
import requests
from pymysql.cursors import Cursor


def fetch_page(session, url):
    """Fetch a page.

    :param session: Session object
    :param url: URL of the page
    :return: HTML of the page, or an empty string if the status code is not 200
    """
    resp = session.get(url=url)
    return resp.text if resp.status_code == 200 else ''


def parse_page(html_code):
    """Parse a page.

    :param html_code: HTML of the page
    :return: list of (title, rating, comments count) tuples parsed from the page
    """
    soup = bs4.BeautifulSoup(html_code, 'html.parser')
    # Each <li> in the ordered list is one movie entry
    movie_items_list = soup.select('#content > div > div.article > ol > li')
    data = []
    for movie_item in movie_items_list:
        title = movie_item.select_one('div > div.info > div.hd > a > span.title').text
        rating_num = movie_item.select_one('div > div.info > div.bd > div > span.rating_num').text
        # Drop the last three characters (the text suffix after the number) to keep only the count
        comments_count = movie_item.select_one('div > div.info > div.bd > div > span:nth-child(4)').text[:-3]
        data.append((title, rating_num, comments_count))
    return data


def save_to_db(conn, data):
    """Save data to the database.

    :param conn: database connection
    :param data: rows to insert
    """
    with conn.cursor() as cursor:  # type: Cursor
        # Insert all rows of one page with a single batched statement
        cursor.executemany(
            'insert into tb_top_movie (mov_title, mov_rating_num, mov_comments_count) '
            'values (%s, %s, %s)',
            data
        )
    conn.commit()


def main():
    session = requests.Session()
    session.headers = {'User-Agent': 'Baiduspider'}
    conn = pymysql.connect(host='localhost', port=3306,
                           user='guest', password='Guest.618',
                           database='hrs', charset='utf8mb4')
    try:
        for page in range(10):
            url = f'https://movie.douban.com/top250?start={25 * page}'
            html_code = fetch_page(session, url)
            data = parse_page(html_code)
            save_to_db(conn, data)
    finally:
        conn.close()


if __name__ == '__main__':
    main()
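
The script above slices the comment count with `[:-3]` and commits each page without handling a failed insert. The following is a minimal sketch of two defensive variations, not the author's method. It assumes the standard-library `sqlite3` module as a stand-in for MySQL so that it runs without a server, which means the placeholders are `?` rather than pymysql's `%s` and the table definition is simplified. The helper names `parse_comments_count`, `save_rows` and `count_rows` are hypothetical, and the sample row is made up for illustration.

```python
import re
import sqlite3


def parse_comments_count(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None


def save_rows(conn, rows):
    """Insert all rows in one transaction and roll back if any row fails."""
    try:
        conn.executemany(
            'insert into tb_top_movie '
            '(mov_title, mov_rating_num, mov_comments_count) values (?, ?, ?)',
            rows,
        )
        conn.commit()
    except sqlite3.Error:
        # Undo the rows inserted before the failure, then re-raise for the caller
        conn.rollback()
        raise


def count_rows(conn):
    """Return the number of rows currently stored in tb_top_movie."""
    return conn.execute('select count(*) from tb_top_movie').fetchone()[0]


# In-memory database with a simplified version of the table (assumption)
conn = sqlite3.connect(':memory:')
conn.execute(
    'create table tb_top_movie ('
    'mov_id integer primary key autoincrement, '
    'mov_title text not null, '
    'mov_rating_num real not null, '
    'mov_comments_count integer not null)'
)

# Made-up sample row, not real scraped data
save_rows(conn, [('Movie A', '9.7', parse_comments_count('2900000人评价'))])
print(count_rows(conn))
```

With pymysql the same pattern would wrap `cursor.executemany(...)` and `conn.commit()` in a `try` block and call `conn.rollback()` in the `except` branch, so that a page which fails halfway does not leave partial rows behind.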