Crawl information on universities nationwide in 1 minute and turn it into a visualization dashboard!
2022-07-02 15:49:00 【Junhong's road of data analysis】
I remember that after I finished the college entrance examination and was choosing a school, I bought a book of university rankings at the bookstore, but the information in it was long out of date and not much help. Today I'll show you how to crawl something genuinely useful: information on universities nationwide, covering the vast majority of schools, and then turn it into a visualization dashboard. Without further ado, let's get to it!
Data crawling
Address: https://www.gaokao.cn/school/140
Press F12 to open the developer tools; with the network panel it is easy to find the JSON file behind the page.
Requesting that link directly returns the information for the corresponding university.
Comparing a few schools shows the pattern https://static-data.eol.cn/www/2.0/school/140/info.json, where the key parameter 140 is the school ID. The IDs are not continuous, so we can only crawl by iterating over an approximate range of school IDs.
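Before building the async crawler, a quick single-request check confirms the endpoint works. This is only a minimal sketch using the requests library (not used in the post itself), and the fields printed are the ones parsed later in save_to_csv:

import requests

# Quick sanity check of the info.json endpoint for one school ID (140)
url = 'https://static-data.eol.cn/www/2.0/school/140/info.json'
resp = requests.get(url, timeout=6)
resp.raise_for_status()
data = resp.json().get('data', {})
print(data.get('name'), data.get('province_name'), data.get('type_name'))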
Crawling code
Import the modules
import aiohttp
import asyncio
import pandas as pd
from pathlib import Path
from tqdm import tqdm
import time

# Directory where the CSV is saved (this definition is not shown in the
# original post; the current working directory is assumed here)
current_path = Path.cwd()

Briefly, the purpose of the main modules:
aiohttp: single-threaded, concurrent IO. It is used on the client side here because requests does not support asynchronous calls; used on the server side (for example in a web server), HTTP connections are IO operations, so a single thread plus coroutines can serve many users concurrently.
asyncio: provides full asynchronous IO support; multiple coroutines can be wrapped into a group of Tasks and executed concurrently.
pandas: converts the crawled data into a DataFrame and writes it to a csv file.
pathlib: an object-oriented way of representing filesystem paths.
tqdm: wrap any iterable with tqdm(iterable) and the loop gets a smart progress bar.
Generate URL Sequence
From the given URL template and max_id, generate the URL sequence. I added deduplication here: if you have collected university information before, the school IDs already present in the CSV file in the same directory are removed, so only universities that have not been collected yet are requested.
def get_url_list(max_id):
    url = 'https://static-data.eol.cn/www/2.0/school/%d/info.json'
    not_crawled = set(range(max_id))
    # Drop school IDs that were already collected in a previous run
    if Path(current_path, 'college_info.csv').exists():
        df = pd.read_csv(Path(current_path, 'college_info.csv'))
        not_crawled -= set(df['School ID'].unique())
    return [url % id for id in not_crawled]

Collect the JSON data
Send requests to the URL sequence with coroutines. Be careful to limit the amount of concurrency: roughly 500 on Windows and 1024 on Linux (matching the typical per-process socket/file-descriptor limits).
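As a side note, the limit could be chosen from the platform instead of being hard-coded; here is a minimal sketch (the 500/1024 values simply mirror the figures above, and MAX_CONCURRENCY is a name introduced here, not taken from the original post):

import sys

# Pick a concurrency limit matching the rough per-platform limits quoted above
MAX_CONCURRENCY = 500 if sys.platform.startswith('win') else 1024
semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

The request coroutine itself looks like this: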
async def get_json_data(url, semaphore):
    async with semaphore:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
        }
        async with aiohttp.ClientSession(connector=aiohttp.TCPConnector(ssl=False), trust_env=True) as session:
            try:
                async with session.get(url=url, headers=headers, timeout=6) as response:
                    # Set the encoding of the response data
                    response.encoding = 'utf-8'
                    # The IO request suspends the current task; the code after the await
                    # runs once the IO completes. While this coroutine is suspended,
                    # the event loop can run other tasks.
                    json_data = await response.json()
                    if json_data != '':
                        return save_to_csv(json_data['data'])
            except Exception:
                return None

Data analysis and storage
The JSON data contains many fields; parse and save the ones you need.
def save_to_csv(json_info):
    save_info = {}
    save_info['School ID'] = json_info['school_id']
    save_info['School name'] = json_info['name']
    level = ""
    if json_info['f985'] == '1' and json_info['f211'] == '1':
        level += "985 211"
    elif json_info['f211'] == '1':
        level += "211"
    else:
        level += json_info['level_name']
    save_info['School level'] = level
    save_info['Ruanke ranking'] = json_info['rank']['ruanke_rank']  # Ruanke (ShanghaiRanking)
    save_info['Alumni Association ranking'] = json_info['rank']['xyh_rank']  # CUAA Alumni Association ranking
    save_info['Wu Shulian ranking'] = json_info['rank']['wsl_rank']
    save_info['QS world ranking'] = json_info['rank']['qs_world']
    save_info['US world ranking'] = json_info['rank']['us_rank']
    save_info['School type'] = json_info['type_name']
    save_info['Province'] = json_info['province_name']
    save_info['City'] = json_info['city_name']
    save_info['Location'] = json_info['town_name']
    save_info['Admissions office phone'] = json_info['phone']
    save_info['Admissions office website'] = json_info['site']
    df = pd.DataFrame(save_info, index=[0])
    # Write the header only when the file does not exist yet, then append
    header = False if Path(current_path, 'college_info.csv').exists() else True
    df.to_csv(Path(current_path, 'college_info.csv'), index=False, mode='a', header=header)

The scheduler
Schedule the whole collection program: get the URLs >> limit the concurrency >> create the task objects >> await the tasks.
async def main(loop):
    # Get the url list
    url_list = get_url_list(5000)
    # Limit the amount of concurrency
    semaphore = asyncio.Semaphore(500)
    # Create a task object for each url and add it to the task list
    tasks = [loop.create_task(get_json_data(url, semaphore)) for url in url_list]
    # Await the tasks, with a progress bar
    for t in tqdm(asyncio.as_completed(tasks), total=len(tasks)):
        await t

The above is the main code of the program.
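The post does not show the entry point that actually starts the event loop; a minimal sketch of what it could look like (an assumption, not code from the original):

if __name__ == '__main__':
    # Start the event loop and run the scheduler (assumed entry point)
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main(loop))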
Running effect
Here we collect information for universities with IDs below 5000. To collect as many universities as possible, it is recommended to run the script several times until no new data comes in.
First run (2,140 rows collected)
Second run (680 rows collected)
That makes 2,820 rows of data in total. Now let's move on to the visualization part.
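Before importing the CSV into Tableau, a quick pandas check (a sketch, assuming the column names written by save_to_csv above) helps confirm the data looks right:

import pandas as pd

# Quick sanity check of the collected CSV before loading it into Tableau
df = pd.read_csv('college_info.csv')
print(df.shape)                                # e.g. (2820, 14)
print(df['Province'].value_counts().head(10))  # universities per province
print(df['School level'].value_counts())       # 985 / 211 / other levels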
Tableau visualization
Compared with other visualization tools and third-party plotting libraries, I prefer Tableau: it is easy to get started with. If you want to get a feel for it, take a look at Tableau Public, where many experts have published their work.
https://public.tableau.com/app/discover
Its only drawback is that it costs money. Students can use it for free; otherwise, it is recommended to start with the free Tableau Public and only consider paying once you know the tool well.
This visualization consists of four charts in total.
Distribution map of the number of colleges and universities

The top three provinces by number of universities are Jiangsu, Guangdong, and Henan (for reference only).
TOP10 universities in the Ruanke ranking
According to the Ruanke (ShanghaiRanking) list, the vast majority of the national TOP10 universities are comprehensive; the only science-and-engineering school is the University of Science and Technology of China, in seventh place.
Distribution of university levels
From the collected data, 211 universities account for about 9.5% and 985 universities for about 3.5%; they really are rare.
Distribution of University types
School types are dominated by science & engineering and comprehensive universities; the two are roughly equal in number and far ahead of the other types. The next tier by count is finance, teacher education, and medicine.
Combined dashboard
Combine the worksheets above into one dashboard. The process is very simple: just drag each chart to the desired position.
Then add a filter action so that clicking a province on the map filters the other worksheets.
The dashboard has been published to Tableau Public, where it can be edited online or downloaded as a complete workbook. Link: https://public.tableau.com/shared/ZCXWTK6SP?:display_count=n&:origin=viz_share_link
See the attachment for the complete code:
Source code for collecting national university information.
Link: https://pan.baidu.com/s/1FCXwAyeeqkoH6M_ITWWAcw
Extraction code: 6cbf

- END -
The "Compared with Excel" book series has sold a cumulative 150,000 copies and makes it easy to master data analysis skills; you can search for the title online to learn more.