[Note] May 28, 2022: Fetching data from a web page and writing it to a database
2022-06-30 03:40:00 【Sprite. Nym】
"""
example11 - Persisting crawler data with a database

400 - Bad Request.
401 - Unauthorized.
403 - Forbidden.
404 - Not Found.
405 - Method Not Allowed.
418 - I'm a teapot.

create table `tb_top_movie`
(
    `mov_id` bigint unsigned auto_increment comment 'Number',
    `mov_title` varchar(200) not null comment 'Title',
    `mov_rating_num` decimal(3,1) not null comment 'Score',
    `mov_comments_count` bigint not null comment 'Comments',
    primary key (`mov_id`)
) engine=innodb auto_increment=1001 comment 'Movie data sheet';

Author: Hao
Date: 2022/5/28
"""
import bs4
import pymysql
import requests
from pymysql.cursors import Cursor


def fetch_page(session, url):
    """Fetch a page.

    :param session: Session object
    :param url: Uniform resource locator
    :return: Page HTML code, or an empty string if the status is not 200
    """
    resp = session.get(url=url)
    return resp.text if resp.status_code == 200 else ''


def parse_page(html_code):
    """Parse the page.

    :param html_code: Page HTML code
    :return: List of (title, rating, comment count) tuples parsed from the page
    """
    soup = bs4.BeautifulSoup(html_code, 'html.parser')
    movie_items_list = soup.select('#content > div > div.article > ol > li')
    data = []
    for movie_item in movie_items_list:
        title = movie_item.select_one('div > div.info > div.hd > a > span.title').text
        rating_num = movie_item.select_one('div > div.info > div.bd > div > span.rating_num').text
        # Drop the 3-character suffix that follows the number in this span
        comments_count = movie_item.select_one('div > div.info > div.bd > div > span:nth-child(4)').text[:-3]
        data.append((title, rating_num, comments_count))
    return data


def save_to_db(conn, data):
    """Save the data to the database.

    :param conn: Database connection
    :param data: Rows to insert
    """
    with conn.cursor() as cursor:  # type: Cursor
        cursor.executemany(
            'insert into tb_top_movie (mov_title, mov_rating_num, mov_comments_count) '
            'values (%s, %s, %s)',
            data
        )
    conn.commit()


def main():
    session = requests.Session()
    session.headers = {'User-Agent': 'Baiduspider'}
    conn = pymysql.connect(host='localhost', port=3306,
                           user='guest', password='Guest.618',
                           database='hrs', charset='utf8mb4')
    try:
        # 10 pages of 25 entries each cover the full Top 250 list
        for page in range(10):
            url = f'https://movie.douban.com/top250?start={25 * page}'
            html_code = fetch_page(session, url)
            data = parse_page(html_code)
            save_to_db(conn, data)
    finally:
        conn.close()


if __name__ == '__main__':
    main()
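
The script above passes the scraped strings to MySQL as they are and trims the comment count by slicing off a fixed three characters. The following is a minimal sketch of one way to normalize a row before inserting it, so the values line up with the `decimal(3,1)` and `bigint` columns of `tb_top_movie`. The helper name `normalize_row` and the sample input are hypothetical and not part of the original script; the sketch also assumes the comment-count text starts with the digits, which is what the `[:-3]` slice implies.

import re
from decimal import Decimal


def normalize_row(title, rating_text, comments_text):
    """Convert raw scraped strings into (str, Decimal, int).

    Hypothetical helper, not part of the original script.
    """
    # Take the leading run of digits instead of slicing a fixed suffix length,
    # so a change in the suffix text does not silently corrupt the number.
    match = re.match(r'\s*(\d+)', comments_text)
    if match is None:
        raise ValueError(f'no comment count in {comments_text!r}')
    return title.strip(), Decimal(rating_text.strip()), int(match.group(1))


# Made-up sample input, for illustration only
row = normalize_row(' Sample Movie ', '9.7', '12345人评价')
print(row)  # ('Sample Movie', Decimal('9.7'), 12345)

If used, such a helper could be applied to each tuple inside `parse_page` before `data.append(...)`; a row that fails to parse then raises an error instead of reaching the database with a malformed value.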