Crawl the information of national colleges and universities in 1 minute and make it into a large screen for visualization!
2022-07-02 15:49:00 【Junhong's road of data analysis】
I remember when I finished the college entrance exam: while choosing a school, I bought a book of university rankings at the bookstore, but the information in it was badly out of date and not much help. Today let's crawl something genuinely useful: information on universities nationwide, covering the vast majority of schools, and turn it into a visual dashboard. Without further ado, let's get to it!
Data crawling
Address :https://www.gaokao.cn/school/140
Press F12 to open the developer tools; the JSON file is easy to spot in the network panel. Requesting that link directly returns the corresponding university's information. Comparing a few requests reveals the pattern https://static-data.eol.cn/www/2.0/school/140/info.json, where the key parameter 140 is the school ID. The IDs are not continuous, so we can only crawl over a range based on the approximate number of schools.
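As a quick sanity check, the ID-to-URL pattern above can be captured in a tiny helper. The template and ID 140 come from the page; the function name is mine:

```python
# Hypothetical helper: build the per-school info.json URL from a school ID.
URL_TEMPLATE = 'https://static-data.eol.cn/www/2.0/school/%d/info.json'

def build_info_url(school_id):
    return URL_TEMPLATE % school_id

print(build_info_url(140))
# https://static-data.eol.cn/www/2.0/school/140/info.json
```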
Crawling code
Importing the modules
import aiohttp
import asyncio
import pandas as pd
from pathlib import Path
from tqdm import tqdm
import time

Briefly, what the main modules are for:
aiohttp: single-threaded, concurrent HTTP I/O. As a pure client it is nothing special; it is used here together with asyncio because requests does not support async. On the server side (for example a web server), HTTP connections are I/O operations, so a single thread plus coroutines can support high concurrency for many users.
asyncio: provides full asynchronous I/O support; multiple coroutines can be wrapped into a group of Tasks and executed concurrently.
pandas: converts the crawled data into a DataFrame and writes it out as a CSV file.
pathlib: an object-oriented way to represent file-system paths.
tqdm: wrap any iterable in tqdm(iterable) and the loop gets a smart progress bar.
Generating the URL sequence
The URL sequence is generated from the URL template and max_id. I added deduplication here: if you have collected university information before, the school IDs already present in the CSV in the same directory are excluded, so only universities not yet fetched are requested.
def get_url_list(max_id):
    url = 'https://static-data.eol.cn/www/2.0/school/%d/info.json'
    not_crawled = set(range(max_id))
    # current_path is the directory the CSV lives in, e.g. Path(__file__).parent
    if Path(current_path, 'college_info.csv').exists():
        df = pd.read_csv(Path(current_path, 'college_info.csv'))
        not_crawled -= set(df[' School id'].unique())
    return [url % id for id in not_crawled]

Collecting the JSON data
Coroutines send the requests for the URL sequence. Be careful to limit the number of concurrent connections (Windows: 500, Linux: 1024).
async def get_json_data(url, semaphore):
    async with semaphore:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
        }
        async with aiohttp.ClientSession(connector=aiohttp.TCPConnector(ssl=False), trust_env=True) as session:
            try:
                async with session.get(url=url, headers=headers, timeout=6) as response:
                    # Parse the response as JSON, decoding it as UTF-8.
                    # Awaiting the I/O suspends the current task until it
                    # completes, so the event loop can run other tasks.
                    json_data = await response.json(encoding='utf-8')
                    if json_data:
                        return save_to_csv(json_data['data'])
            except Exception:
                return None

Data analysis and storage
The JSON data contains many fields; parse and save whichever ones you need.
def save_to_csv(json_info):
    save_info = {}
    save_info[' School id'] = json_info['school_id']
    save_info['School name'] = json_info['name']
    level = ""
    if json_info['f985'] == '1' and json_info['f211'] == '1':
        level += "985 211"
    elif json_info['f211'] == '1':
        level += "211"
    else:
        level += json_info['level_name']
    save_info['School level'] = level
    save_info['Ruanke (Shanghai) ranking'] = json_info['rank']['ruanke_rank']
    save_info['Alumni Association ranking'] = json_info['rank']['xyh_rank']
    save_info['Wu Shulian ranking'] = json_info['rank']['wsl_rank']
    save_info['QS world ranking'] = json_info['rank']['qs_world']
    save_info['U.S. News world ranking'] = json_info['rank']['us_rank']
    save_info['School type'] = json_info['type_name']
    save_info['Province'] = json_info['province_name']
    save_info['City'] = json_info['city_name']
    save_info['Location'] = json_info['town_name']
    save_info['Admissions office phone'] = json_info['phone']
    save_info['Admissions office website'] = json_info['site']
    df = pd.DataFrame(save_info, index=[0])
    # Write the header only if the CSV does not exist yet, then append
    header = False if Path(current_path, 'college_info.csv').exists() else True
    df.to_csv(Path(current_path, 'college_info.csv'), index=False, mode='a', header=header)

The scheduler
This schedules the whole collection program: get the URL list >> limit concurrency >> create the task objects >> await the tasks.
async def main(loop):
    # Get the URL list
    url_list = get_url_list(5000)
    # Limit the number of concurrent requests
    semaphore = asyncio.Semaphore(500)
    # Create task objects and add them to the task list
    tasks = [loop.create_task(get_json_data(url, semaphore)) for url in url_list]
    # Await the tasks as they complete, with a progress bar
    for t in tqdm(asyncio.as_completed(tasks), total=len(tasks)):
        await t

That is the main code of the program.
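The article never shows how the event loop is actually started. Here is one minimal, self-contained way to do it; fetch is a stand-in for get_json_data, and the names and numbers are illustrative, not from the original:

```python
import asyncio

async def fetch(i, semaphore):
    # Stand-in for get_json_data(url, semaphore): acquire the
    # semaphore, await some I/O, return a result.
    async with semaphore:
        await asyncio.sleep(0)
        return i

async def main(loop):
    semaphore = asyncio.Semaphore(500)
    tasks = [loop.create_task(fetch(i, semaphore)) for i in range(10)]
    # as_completed yields the awaitables in completion order
    return [await t for t in asyncio.as_completed(tasks)]

loop = asyncio.new_event_loop()
try:
    results = loop.run_until_complete(main(loop))
finally:
    loop.close()
print(len(results))  # prints 10
```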
Running effect
This run collects information for universities with IDs below 5000. If you want to gather as many universities as possible, it is recommended to run the script several times, until no new data comes back.
First run (2,140 rows collected)
Second run (680 more rows collected)
2,820 rows of data in total. Now let's start the visualization part.
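Because get_url_list skips IDs already in the CSV, repeated runs should not produce duplicate rows, but a quick post-hoc check with pandas is cheap. This is a sketch on toy data, reusing the ' School id' column name from the code above:

```python
import pandas as pd

# Toy frames standing in for two runs appended to college_info.csv
run1 = pd.DataFrame({' School id': [140, 141], 'School name': ['A', 'B']})
run2 = pd.DataFrame({' School id': [141, 142], 'School name': ['B', 'C']})
combined = pd.concat([run1, run2], ignore_index=True)
# Keep the first occurrence of each school ID
deduped = combined.drop_duplicates(subset=[' School id'], keep='first')
print(len(deduped))  # prints 3
```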
Tableau visualization
Compared with other visualization tools and third-party plotting libraries, I prefer Tableau: it is easy to pick up. If you want to learn more, browse Tableau Public, where many experts have published their work:
https://public.tableau.com/app/discover
Its only drawback is the price. Students can use it for free; otherwise, it is best to start with the free Tableau Public and consider paying once you know the tool well.
This visualization has four charts in total.
Map of university counts by province

The top three provinces by number of universities are Jiangsu, Guangdong, and Henan (for reference only).
Top 10 universities by Ruanke (Shanghai) ranking
By the Ruanke ranking, the vast majority of the national top 10 are comprehensive universities; the only science-and-engineering school is the University of Science and Technology of China, in seventh place.
Distribution of universities by level
In the collected data, 211 universities account for about 9.5% and 985 universities for about 3.5%; they really are rare.
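Proportions like these can be reproduced with value_counts. This sketch uses made-up counts chosen to match the article's percentages (7 dual 985/211 schools, 12 additional 211 schools, 181 others, 200 total), not the real data:

```python
import pandas as pd

levels = pd.Series(['985 211'] * 7 + ['211'] * 12 + ['Regular'] * 181)
counts = levels.value_counts(normalize=True) * 100
# 985 share: 3.5%; 211 share (985 schools are also 211): 3.5% + 6.0% = 9.5%
print(counts.to_dict())
```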
Distribution of university types
Science-and-engineering and comprehensive universities dominate; their numbers are roughly equal, and both are far ahead of the other types. The next tier by count is finance, teacher training, and medicine.
Combined dashboard
Combine the worksheets above into one dashboard. The process is very simple: just drag each chart to the desired position.
Then add a filter action, so that clicking a province on the map cross-filters the other worksheets.
The dashboard has been published to Tableau Public, where it can be edited online or downloaded as a complete workbook. Link: https://public.tableau.com/shared/ZCXWTK6SP?:display_count=n&:origin=viz_share_link
See the attachment for the complete code: the source for the national university information crawler.
Link: https://pan.baidu.com/s/1FCXwAyeeqkoH6M_ITWWAcw
Extraction code: 6cbf
- END -
My "Compare Excel" (对比Excel) book series has sold a cumulative 150,000 copies and makes it easy to master data analysis skills; search the titles online to learn more.