Crawling Baidu Baike Data and Classifying Its Content
2022-07-06 09:11:00 【CHQIUU】
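Preface
I have recently been studying knowledge graphs and needed to crawl some structured data. This post describes how to crawl data from Baidu Baike and extract the useful content, with a code implementation.
I. Analyzing the page structure
The page can be divided into five regions, as annotated in the figure below (the page structure of the polypropylene entry).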
https://baike.baidu.com/wikitag/taglist?tagId=76613
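Splitting a page into regions can be sketched roughly as follows. This is only a minimal illustration using the standard-library html.parser: the three region names and the CSS class names (entry-title, entry-summary, entry-infobox) are hypothetical placeholders, not the five regions from the figure and not the real Baidu Baike markup, and the sample HTML is made up. Inspect the live page to find the actual selectors before adapting it.

```python
from html.parser import HTMLParser

# Hypothetical region name -> hypothetical CSS class name.
# Placeholders only; the real page uses different class names.
REGION_CLASSES = {
    "title": "entry-title",
    "summary": "entry-summary",
    "infobox": "entry-infobox",
}

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class RegionExtractor(HTMLParser):
    """Collect the text found inside elements carrying a known class."""

    def __init__(self, region_classes):
        super().__init__()
        self.class_to_region = {c: name for name, c in region_classes.items()}
        self.regions = {name: [] for name in region_classes}
        self.current = None  # name of the region being collected, if any
        self.depth = 0       # nesting depth inside the current region

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.current is not None:
            self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for cls in classes:
            if cls in self.class_to_region:
                self.current = self.class_to_region[cls]
                self.depth = 1
                break

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.current is None:
            return
        self.depth -= 1
        if self.depth == 0:
            self.current = None  # left the region's root element

    def handle_data(self, data):
        text = data.strip()
        if self.current is not None and text:
            self.regions[self.current].append(text)


def extract_regions(html, region_classes=REGION_CLASSES):
    """Return {region name: text} for every configured region."""
    parser = RegionExtractor(region_classes)
    parser.feed(html)
    parser.close()
    return {name: " ".join(parts) for name, parts in parser.regions.items()}


# Made-up sample page, used only to exercise the parser.
SAMPLE_HTML = """
<html><body>
  <h1 class="entry-title">Polypropylene</h1>
  <div class="entry-summary">Polypropylene is a <b>thermoplastic</b> resin.</div>
  <dl class="entry-infobox"><dt>Chemical formula</dt><dd>(C3H6)n</dd></dl>
</body></html>
"""

regions = extract_regions(SAMPLE_HTML)
for name, text in regions.items():
    print(name, "->", text)
```

Keeping the class names in one dictionary means only that mapping has to change once the real page structure is known.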
II. Steps
1. Import the libraries
The code is as follows (example):
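# Data handling and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Silence library warnings to keep the output readable
import warnings
warnings.filterwarnings('ignore')
# Note: this disables HTTPS certificate verification for the whole process;
# use it only with sources you trust
import ssl
ssl._create_default_https_context = ssl._create_unverified_context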
2. Read in the data
The code is as follows (example):
data = pd.read_csv(
    'https://labfile.oss.aliyuncs.com/courses/1283/adult.data.csv')
print(data.head())
The data here is fetched over the network from the URL.
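The same request-by-URL idea could be applied to the Baike page from section I. The following is a rough sketch using only the standard-library urllib.request. The helper names build_request and fetch_html, the User-Agent string, and the timeout are assumptions made for illustration; the site may require other headers or may render its content with JavaScript, in which case this alone will not return the entry data.

```python
import urllib.request

# URL taken from section I of this post.
TAG_LIST_URL = "https://baike.baidu.com/wikitag/taglist?tagId=76613"


def build_request(url, user_agent="Mozilla/5.0 (compatible; example-crawler)"):
    """Build a GET request with an explicit User-Agent (hypothetical value)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download a page and decode it with the charset the server declares."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


# Example call (needs network access):
# html = fetch_html(TAG_LIST_URL)
# print(len(html))
```

Building the request separately from sending it keeps the headers easy to inspect and adjust without touching the network.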