Crawl the information of national colleges and universities in 1 minute and make it into a large screen for visualization!
2022-07-02 15:49:00 【Junhong's road of data analysis】
I remember that after I finished the gaokao (college entrance examination) and had to choose a school, I bought a book of university rankings at the bookstore. The information in it was already outdated, so it was not much help. Today I will show you how to scrape something genuinely useful: information on universities nationwide, covering the vast majority of schools, and then build a visual dashboard from it. Without further ado, let's get to the point!
Data crawling
Address: https://www.gaokao.cn/school/140
Press F12 to open the developer tools; with the network panel it is easy to find the JSON file the page loads. Requesting that link directly returns the information for the corresponding university.
Comparing a few schools reveals the pattern https://static-data.eol.cn/www/2.0/school/140/info.json, where the key parameter 140 is the school ID. The IDs are not contiguous, so we can only crawl over a range based on the approximate number of schools.
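As a small sketch of that URL pattern (build_info_url is a hypothetical helper name, not part of the original script):

```python
# Build the static info.json URL for a given school ID
# (pattern taken from the link observed in the developer tools).
URL_TEMPLATE = 'https://static-data.eol.cn/www/2.0/school/%d/info.json'

def build_info_url(school_id):
    """Return the static JSON endpoint for one school."""
    return URL_TEMPLATE % school_id

print(build_info_url(140))
```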
Crawling code
The imports

import aiohttp
import asyncio
import pandas as pd
from pathlib import Path
from tqdm import tqdm

# Directory for the output CSV; the functions below rely on this name
# (the original post uses current_path without defining it).
current_path = Path(__file__).parent
A brief note on the main modules:
aiohttp: single-threaded concurrent IO. Used on the client alone it is not especially powerful; it is paired with asyncio here because requests does not support async. On the server side (for example a web server), HTTP connections are IO operations, so a single thread plus coroutines can support many concurrent users.
asyncio: provides full asynchronous IO support; multiple coroutines can be wrapped into a group of Tasks and executed concurrently.
pandas: converts the scraped data into a DataFrame and writes the CSV file.
pathlib: object-oriented file-system paths.
tqdm: wrap any iterable in tqdm(iterable) and the loop gets a smart progress bar.
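As a minimal illustration of the tqdm(iterable) pattern (a toy loop, not part of the scraper):

```python
from tqdm import tqdm

# Wrapping any iterable in tqdm() turns the loop into a progress bar.
total = 0
for i in tqdm(range(100)):
    total += i
```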
Generating the URL sequence
The URL sequence is generated from the URL template and max_id. I added deduplication here: if university information was collected on a previous run, the school IDs already present in the CSV file in the same directory are excluded, so only universities that have not yet been fetched are requested.
def get_url_list(max_id):
    url = 'https://static-data.eol.cn/www/2.0/school/%d/info.json'
    not_crawled = set(range(max_id))
    csv_path = Path(current_path, 'college_info.csv')
    if csv_path.exists():
        # Skip the school IDs already collected on a previous run
        df = pd.read_csv(csv_path)
        not_crawled -= set(df[' School id'].unique())
    return [url % id for id in not_crawled]
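The resume logic can be shown in isolation; here a small in-memory frame stands in for college_info.csv, and the IDs are made up:

```python
import pandas as pd

# Rows "already collected" on a previous run (the real script reads the CSV).
collected = pd.DataFrame({' School id': [140, 141, 140]})

# Candidate IDs minus the ones already on disk.
not_crawled = set(range(145)) - set(collected[' School id'].unique())
```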
Collecting the JSON data
Send requests to the URL sequence with coroutines. Be careful to limit the concurrency: roughly 500 connections on Windows and 1024 on Linux.
async def get_json_data(url, semaphore):
    async with semaphore:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
        }
        async with aiohttp.ClientSession(connector=aiohttp.TCPConnector(ssl=False), trust_env=True) as session:
            try:
                async with session.get(url=url, headers=headers, timeout=6) as response:
                    # await suspends this task while the IO request is pending;
                    # the event loop runs other tasks in the meantime.
                    json_data = await response.json(encoding='utf-8')
                    if json_data:
                        return save_to_csv(json_data['data'])
            except Exception:
                # Non-existent IDs, timeouts and malformed responses are skipped.
                return None
Parsing and saving the data
The JSON contains many fields; parse and save whichever fields you need.
def save_to_csv(json_info):
    # 985/211 status takes precedence over the generic level name
    if json_info['f985'] == '1' and json_info['f211'] == '1':
        level = '985 211'
    elif json_info['f211'] == '1':
        level = '211'
    else:
        level = json_info['level_name']
    save_info = {
        ' School id': json_info['school_id'],  # key kept verbatim: get_url_list reads this column
        'School name': json_info['name'],
        'School level': level,
        'Ruanke ranking': json_info['rank']['ruanke_rank'],  # Ruanke / Shanghai Ranking
        'Alumni Association ranking': json_info['rank']['xyh_rank'],
        'Wu Shulian ranking': json_info['rank']['wsl_rank'],
        'QS world ranking': json_info['rank']['qs_world'],
        'US News world ranking': json_info['rank']['us_rank'],
        'School type': json_info['type_name'],
        'Province': json_info['province_name'],
        'City': json_info['city_name'],
        'District': json_info['town_name'],
        'Admissions office phone': json_info['phone'],
        'Admissions office website': json_info['site'],
    }
    df = pd.DataFrame(save_info, index=[0])
    csv_path = Path(current_path, 'college_info.csv')
    # Append, writing the header only when the file does not exist yet
    df.to_csv(csv_path, index=False, mode='a', header=not csv_path.exists())
The scheduler
Schedules the whole collection: get the URLs >> limit the concurrency >> create the task objects >> await the tasks.
async def main(loop):
    # Build the URL list (school IDs 0..4999)
    url_list = get_url_list(5000)
    # Limit the concurrency
    semaphore = asyncio.Semaphore(500)
    # Create the task objects and add them to the task list
    tasks = [loop.create_task(get_json_data(url, semaphore)) for url in url_list]
    # Await the tasks, with a progress bar
    for t in tqdm(asyncio.as_completed(tasks), total=len(tasks)):
        await t

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main(loop))
That covers the main code of the program.
Results
Here we collect the universities with IDs below 5000. To capture as many universities as possible, it is best to run the script several times, until no new data appears.
First run: 2140 rows collected. Second run: another 680 rows.
2820 rows in total. Now for the visualization.
Tableau visualization
Compared with other visualization tools and third-party plotting libraries, I prefer Tableau: it is easy to pick up. If you are curious, have a look at Tableau Public, where many experts publish their work:
https://public.tableau.com/app/discover
Its only drawback is the price. It is free for students; otherwise, start with the free Tableau Public and consider paying once you know the tool well.
The visualization consists of four charts.
Distribution of the number of universities by province

The three provinces with the most universities are Jiangsu, Guangdong, and Henan (for reference only).
Top 10 universities by Ruanke (Shanghai) ranking
By the Ruanke ranking, the vast majority of the national top 10 are comprehensive universities; the only science-and-engineering school among them is the University of Science and Technology of China, in seventh place.
Distribution of university levels
From the collected data, 211 universities account for about 9.5% and 985 universities for about 3.5%. They really are rare.
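The percentages above are just a normalized value_counts over the level column; a toy version of the computation (the rows here are made up):

```python
import pandas as pd

# Made-up sample, only to show how the level shares are computed.
levels = pd.Series(['985 211', '211', '211', 'Regular undergraduate'])
share = levels.value_counts(normalize=True)
print(share)
```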
Distribution of university types
Science-and-engineering and comprehensive are the dominant types, roughly equal in number and far ahead of the rest. The next tier by count is finance, teacher training, and medicine.
Combined dashboard
Combine the worksheets above into one dashboard. The process is simple: just drag each chart to the desired position. Then add a filter action so that clicking a province on the map filters the other worksheets.
The dashboard has been published to Tableau Public, where it can be edited online, or the whole workbook can be downloaded: https://public.tableau.com/shared/ZCXWTK6SP?:display_count=n&:origin=viz_share_link
The complete source code (national university information collection) is attached:
link :https://pan.baidu.com/s/1FCXwAyeeqkoH6M_ITWWAcw
Extraction code :6cbf
- END -