Baidu Encyclopedia data crawling and content classification and recognition
2022-07-06 10:26:00 【CHQIUU】
Preface
Recently I have been studying knowledge graphs and needed to crawl some structured data. This post shows how to crawl data from Baidu Encyclopedia (Baidu Baike) and gives the code for extracting the useful parts.
I. Analyze the page structure
The page can be divided into 5 regions, as shown in the illustration below (the page structure of the polypropylene entry).
https://baike.baidu.com/wikitag/taglist?tagId=76613
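Once the regions of a page are identified, extraction usually comes down to collecting the text inside each region's container. The following is only a minimal sketch of that idea using the Python standard library. The class names `summary` and `infobox` and the sample HTML are hypothetical placeholders, not Baidu Baike's real markup and not necessarily the five regions in the illustration. The only detail taken from this post is the tag-list URL and its `tagId` parameter.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Base of the tag-list URL shown above; tagId is its query parameter.
TAGLIST_BASE = "https://baike.baidu.com/wikitag/taglist"


def build_taglist_url(tag_id):
    """Build a tag-list URL for the given tagId."""
    return TAGLIST_BASE + "?" + urlencode({"tagId": tag_id})


class RegionTextExtractor(HTMLParser):
    """Collect visible text grouped by the class of the enclosing <div>."""

    def __init__(self, region_classes):
        super().__init__()
        self.region_classes = set(region_classes)
        self.regions = {name: [] for name in region_classes}
        # One entry per open <div>: the matched region name, or None.
        self._stack = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        match = next((c for c in classes if c in self.region_classes), None)
        self._stack.append(match)

    def handle_endtag(self, tag):
        if tag == "div" and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Attribute the text to the nearest enclosing region, if any.
        for name in reversed(self._stack):
            if name is not None:
                self.regions[name].append(text)
                break


def extract_regions(html, region_classes):
    """Parse an HTML string and return {region class: [text fragments]}."""
    parser = RegionTextExtractor(region_classes)
    parser.feed(html)
    parser.close()
    return parser.regions


# Hypothetical markup used only to exercise the parser.
SAMPLE_HTML = """
<div class="page">
  <div class="summary"><p>Polypropylene is a thermoplastic polymer.</p></div>
  <div class="infobox"><span>Abbreviation</span><span>PP</span></div>
</div>
"""

if __name__ == "__main__":
    print(build_taglist_url(76613))
    print(extract_regions(SAMPLE_HTML, ["summary", "infobox"]))
```

For a real page, the class names would have to be read from the page source in the browser's developer tools, and the HTML would come from an HTTP request rather than a string literal.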
II. Usage steps
1. Import the libraries
The code is as follows (example):
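# Data handling and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Silence library warnings to keep the output readable
import warnings
warnings.filterwarnings('ignore')
# Skip HTTPS certificate verification for urllib-based downloads (unsafe outside of demos)
import ssl
ssl._create_default_https_context = ssl._create_unverified_context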
2. Read in the data
The code is as follows (example):
# Load the CSV straight from its URL into a DataFrame
data = pd.read_csv(
    'https://labfile.oss.aliyuncs.com/courses/1283/adult.data.csv')
# Preview the first five rows
print(data.head())
Here the data is requested over the network from a URL.
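As a possible alternative to switching off certificate verification globally, the download and the parsing can be separated and done with the standard library alone. This is a sketch under assumptions, not the method used in this post: `fetch_text` and `parse_csv_text` are hypothetical helper names, the file is assumed to be UTF-8 encoded, and the sample rows are made up for illustration.

```python
import csv
import io
import ssl
import urllib.request
from typing import Dict, List


def fetch_text(url: str, timeout: float = 10.0) -> str:
    """Download a URL and decode the body as UTF-8 (assumed encoding).

    The default SSL context verifies the server certificate, so no
    global override of the ssl module is needed.
    """
    context = ssl.create_default_context()
    with urllib.request.urlopen(url, timeout=timeout, context=context) as resp:
        return resp.read().decode("utf-8")


def parse_csv_text(text: str) -> List[Dict[str, str]]:
    """Parse CSV text with a header row into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


# Made-up sample data, used so the parsing step runs without network access.
SAMPLE_CSV = "name,category\npolypropylene,material\n"

if __name__ == "__main__":
    for row in parse_csv_text(SAMPLE_CSV):
        print(row)
    # With network access, the same parser can be fed a downloaded file:
    # rows = parse_csv_text(fetch_text(
    #     "https://labfile.oss.aliyuncs.com/courses/1283/adult.data.csv"))
```

Keeping the download separate from the parsing also makes the parsing step easy to test offline. The downloaded text can still be handed to pandas through `pd.read_csv(io.StringIO(text))`, since `read_csv` accepts file-like objects.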