[Note] May 28, 2022: fetching data from a web page and writing it to a database
2022-06-30 03:40:00 【Sprite. Nym】
"""
example11 - Persist crawler data with a database

400 - Bad Request.
401 - Unauthorized.
403 - Forbidden.
404 - Not Found.
405 - Method Not Allowed.
418 - I am a teapot.

create table `tb_top_movie`
(
    `mov_id` bigint unsigned auto_increment comment 'number',
    `mov_title` varchar(200) not null comment 'title',
    `mov_rating_num` decimal(3,1) not null comment 'score',
    `mov_comments_count` bigint not null comment 'comments',
    primary key (`mov_id`)
) engine=innodb auto_increment=1001 comment 'movie data table';

Author: Hao
Date: 2022/5/28
"""
import bs4
import pymysql
import requests
from pymysql.cursors import Cursor

def fetch_page(session, url):
    """
    Fetch a page

    :param session: Session object
    :param url: Uniform resource locator
    :return: HTML code of the page
    """
    resp = session.get(url=url)
    return resp.text if resp.status_code == 200 else ''

def parse_page(html_code):
    """
    Parse a page

    :param html_code: HTML code of the page
    :return: Data parsed from the page
    """
    soup = bs4.BeautifulSoup(html_code, 'html.parser')
    movie_items_list = soup.select('#content > div > div.article > ol > li')
    data = []
    for movie_item in movie_items_list:
        title = movie_item.select_one('div > div.info > div.hd > a > span.title').text
        rating_num = movie_item.select_one('div > div.info > div.bd > div > span.rating_num').text
        # Drop the last three characters (the text suffix) to keep only the number
        comments_count = movie_item.select_one('div > div.info > div.bd > div > span:nth-child(4)').text[:-3]
        data.append((title, rating_num, comments_count))
    return data

def save_to_db(conn, data):
    """
    Save the data to the database

    :param conn: Database connection
    :param data: Rows to insert
    """
    with conn.cursor() as cursor:  # type: Cursor
        cursor.executemany(
            'insert into tb_top_movie (mov_title, mov_rating_num, mov_comments_count) '
            'values (%s, %s, %s)',
            data
        )
    conn.commit()

def main():
    session = requests.Session()
    session.headers = {'User-Agent': 'Baiduspider'}
    conn = pymysql.connect(host='localhost', port=3306,
                           user='guest', password='Guest.618',
                           database='hrs', charset='utf8mb4')
    try:
        # 10 pages of 25 movies each
        for page in range(10):
            url = f'https://movie.douban.com/top250?start={25 * page}'
            html_code = fetch_page(session, url)
            data = parse_page(html_code)
            save_to_db(conn, data)
    finally:
        conn.close()


if __name__ == '__main__':
    main()
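
The persistence step can be tried without a MySQL server. What follows is only a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for pymysql. The names save_rows, demo and SAMPLE_ROWS, and the sample values, are hypothetical and do not come from the script above. The SQLite column types only approximate the MySQL schema in the docstring, and sqlite3 uses ? placeholders where pymysql uses %s. The sketch also rolls back on failure, which the script above does not do.

```python
import sqlite3

# Hypothetical sample rows in the same (title, rating, comment count) shape
# that parse_page returns; the values are made up for illustration.
SAMPLE_ROWS = [
    ('Movie A', '9.7', '100'),
    ('Movie B', '9.6', '200'),
]


def save_rows(conn, rows):
    """Insert all rows in one transaction; roll back if any insert fails."""
    try:
        conn.executemany(
            'insert into tb_top_movie '
            '(mov_title, mov_rating_num, mov_comments_count) '
            'values (?, ?, ?)',  # sqlite3 placeholder style (pymysql uses %s)
            rows
        )
        conn.commit()
    except sqlite3.Error:
        conn.rollback()
        raise


def demo():
    """Create an in-memory table, save the sample rows, and read them back."""
    conn = sqlite3.connect(':memory:')
    # Approximation of the MySQL table from the docstring (assumed mapping).
    conn.execute(
        'create table tb_top_movie ('
        'mov_id integer primary key autoincrement, '
        'mov_title text not null, '
        'mov_rating_num real not null, '
        'mov_comments_count integer not null)'
    )
    try:
        save_rows(conn, SAMPLE_ROWS)
        return conn.execute(
            'select mov_title, mov_rating_num, mov_comments_count '
            'from tb_top_movie order by mov_id'
        ).fetchall()
    finally:
        conn.close()


if __name__ == '__main__':
    for row in demo():
        print(row)
```

Because the inserts and the commit share one transaction, a page that fails partway through leaves no partial rows behind in this sketch.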