Crawl the information of national colleges and universities in 1 minute and make it into a large screen for visualization!
2022-07-02 15:49:00 【Junhong's road of data analysis】
I remember that after I finished the gaokao (college entrance exam), I bought a book of university rankings at the bookstore to help me choose a school. But the information in such books is always out of date, so it wasn't much help. Today I'll show you how to crawl something genuinely useful: information on universities nationwide, covering the vast majority of schools, and then turn it into a visual dashboard. Without further ado, let's get to it!
Data crawling
Address: https://www.gaokao.cn/school/140
Press F12 to open the developer tools; with the network panel it is easy to find the JSON file behind the page. Requesting that link directly returns the information for the corresponding university.
Comparing a few pages reveals the pattern https://static-data.eol.cn/www/2.0/school/140/info.json, where the key parameter 140 is the school ID. The IDs are not continuous, so we can only crawl over a range based on the approximate number of schools.
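As a quick sanity check of the pattern above, the info.json URL for any school is just the template with the ID substituted (the small helper below is illustrative, not part of the original crawler):

```python
# Build the info.json URL for a given school ID (140 is the example from the page above)
url_template = 'https://static-data.eol.cn/www/2.0/school/%d/info.json'

def build_info_url(school_id):
    return url_template % school_id

print(build_info_url(140))
# https://static-data.eol.cn/www/2.0/school/140/info.json
```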
Crawling code
Importing the modules
import aiohttp
import asyncio
import pandas as pd
from pathlib import Path
from tqdm import tqdm

current_path = Path.cwd()  # directory where college_info.csv is read and written
A brief note on what the main modules are for:
- aiohttp: single-threaded concurrent IO over HTTP. On the client side it is not powerful on its own; it is used together with asyncio, because requests does not support async. On the server side (for example a web server), HTTP connections are IO operations, so a single thread plus coroutines can support many concurrent users.
- asyncio: provides full asynchronous IO support; multiple coroutines can be wrapped into a group of Tasks and executed concurrently.
- pandas: converts the crawled data into a DataFrame and writes the csv file.
- pathlib: an object-oriented way of representing filesystem paths.
- tqdm: wrap any iterable in tqdm(iterable) and the loop gets a smart progress bar.
Generating the URL sequence
The URL sequence is generated from the URL template and max_id. I added a deduplication step here: if university information was collected on an earlier run, the script reads the csv file in the same directory and removes the school IDs that have already been collected, so only universities not yet fetched are requested.
def get_url_list(max_id):
    url = 'https://static-data.eol.cn/www/2.0/school/%d/info.json'
    not_crawled = set(range(max_id))
    # Deduplication: drop school IDs already present in the csv from earlier runs
    csv_path = Path(current_path, 'college_info.csv')
    if csv_path.exists():
        df = pd.read_csv(csv_path)
        not_crawled -= set(df['school_id'].unique())
    return [url % school_id for school_id in not_crawled]
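The deduplication step can be sketched in isolation, with made-up IDs instead of a real csv: a set difference removes whatever was collected before, and only the remaining IDs become URLs.

```python
url_template = 'https://static-data.eol.cn/www/2.0/school/%d/info.json'

# Pretend max_id is 5 and that IDs 1 and 3 were already collected on a previous run
not_crawled = set(range(5)) - {1, 3}
urls = [url_template % school_id for school_id in sorted(not_crawled)]
print(urls)
# URLs for IDs 0, 2 and 4 only
```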
Collecting the JSON data
Requests are sent over the URL sequence with coroutines. Be careful to limit the amount of concurrency (the practical limit on open connections is roughly 500 on Windows and 1024 on Linux).
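To see what the semaphore buys, here is a small self-contained sketch with toy workers standing in for real requests: however many tasks are created, at most the semaphore's limit run inside the guarded section at once.

```python
import asyncio

async def worker(sem, state):
    async with sem:
        state['active'] += 1
        state['peak'] = max(state['peak'], state['active'])
        await asyncio.sleep(0)  # stand-in for the real network IO
        state['active'] -= 1

async def run_demo():
    sem = asyncio.Semaphore(3)            # at most 3 workers inside at once
    state = {'active': 0, 'peak': 0}
    await asyncio.gather(*(worker(sem, state) for _ in range(10)))
    return state['peak']

peak = asyncio.run(run_demo())
print(peak)  # never exceeds the semaphore limit of 3
```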
async def get_json_data(url, semaphore):
    async with semaphore:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
        }
        async with aiohttp.ClientSession(connector=aiohttp.TCPConnector(ssl=False), trust_env=True) as session:
            try:
                async with session.get(url=url, headers=headers, timeout=6) as response:
                    # await suspends this task while the IO request is in flight;
                    # the event loop runs other tasks until the response arrives
                    json_data = await response.json(encoding='utf-8', content_type=None)
                    if json_data:
                        return save_to_csv(json_data['data'])
            except Exception:
                # IDs are not continuous, so many requests simply fail; skip them
                return None
Parsing and saving the data
The JSON data contains many fields; parse and save whichever fields you need for your own purposes.
def save_to_csv(json_info):
    save_info = {}
    save_info['school_id'] = json_info['school_id']              # school ID
    save_info['name'] = json_info['name']                        # school name
    if json_info['f985'] == '1' and json_info['f211'] == '1':
        level = '985 211'
    elif json_info['f211'] == '1':
        level = '211'
    else:
        level = json_info['level_name']
    save_info['level'] = level                                   # school level
    save_info['ruanke_rank'] = json_info['rank']['ruanke_rank']  # Ruanke (Shanghai) ranking
    save_info['xyh_rank'] = json_info['rank']['xyh_rank']        # Alumni Association ranking
    save_info['wsl_rank'] = json_info['rank']['wsl_rank']        # Wu Shulian ranking
    save_info['qs_world'] = json_info['rank']['qs_world']        # QS world ranking
    save_info['us_rank'] = json_info['rank']['us_rank']          # U.S. News world ranking
    save_info['type'] = json_info['type_name']                   # school type
    save_info['province'] = json_info['province_name']           # province
    save_info['city'] = json_info['city_name']                   # city
    save_info['town'] = json_info['town_name']                   # district / town
    save_info['phone'] = json_info['phone']                      # admissions office phone
    save_info['site'] = json_info['site']                        # admissions office website
    df = pd.DataFrame(save_info, index=[0])
    csv_path = Path(current_path, 'college_info.csv')
    # Write the header only the first time, then append rows without it
    header = not csv_path.exists()
    df.to_csv(csv_path, index=False, mode='a', header=header)
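The header flag at the end is worth a closer look: to_csv in append mode would otherwise repeat the header row on every call. A minimal sketch with a temporary file and made-up rows (the column names here are just illustrative):

```python
import tempfile
from pathlib import Path
import pandas as pd

csv_path = Path(tempfile.mkdtemp()) / 'college_info.csv'
rows = [{'school_id': 140, 'name': 'University A'},
        {'school_id': 141, 'name': 'University B'}]

for row in rows:
    df = pd.DataFrame(row, index=[0])
    header = not csv_path.exists()       # header only for the very first write
    df.to_csv(csv_path, index=False, mode='a', header=header)

result = pd.read_csv(csv_path)
print(len(result))  # 2 data rows, one header
```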
The scheduler
This drives the whole collection run: build the URL list >> limit the concurrency >> create the task objects >> await the tasks.
async def main(loop):
    # Build the url list
    url_list = get_url_list(5000)
    # Limit the amount of concurrency
    semaphore = asyncio.Semaphore(500)
    # Create the task objects and collect them in a list
    tasks = [loop.create_task(get_json_data(url, semaphore)) for url in url_list]
    # Await the tasks as they complete, with a progress bar
    for t in tqdm(asyncio.as_completed(tasks), total=len(tasks)):
        await t

# Entry point (not shown in the original snippet)
if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main(loop))
The above is the main code of the program .
Running it
Here I collect the universities with IDs below 5000. To gather as many universities as possible, it is best to run the script several times until no new rows appear.
First run (2140 rows collected); second run (680 more rows).
That makes 2820 rows in total. Now for the visualization.
Tableau visualization
Compared with other visualization tools and third-party plotting libraries, I prefer Tableau: it is easy to pick up. If you want to learn more, have a look at Tableau Public, where many experts publish their work.
https://public.tableau.com/app/discover
Its only drawback is the price. Students can use it for free; otherwise, start with the free Tableau Public edition and consider paying once you know the tool well.
The visualization consists of four charts.
Map of universities by province

The three provinces with the most universities are Jiangsu, Guangdong and Henan (for reference only).
Top 10 universities by Ruanke ranking
By the Ruanke (Shanghai) ranking, the vast majority of the national top 10 are comprehensive universities; the only science-and-engineering school is the University of Science and Technology of China, in seventh place.
Distribution of university levels
In the collected data, 211 universities account for about 9.5% and 985 universities for about 3.5%; they really are rare.
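The shares quoted above are simply value counts over the level column. A sketch with pandas, using an illustrative series rather than the real csv (the counts below are made up):

```python
import pandas as pd

# Illustrative level column; the real one comes from college_info.csv
levels = pd.Series(['Regular'] * 16 + ['211'] * 3 + ['985 211'] * 1)
shares = levels.value_counts(normalize=True)
print(shares['211'])      # 0.15
print(shares['985 211'])  # 0.05
```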
Distribution of university types
Science-and-engineering and comprehensive universities dominate: their counts are roughly equal, and both are far ahead of the other types. The next tier by count is finance, teacher training and medicine.
Combined dashboard
Combining the worksheets above into a dashboard is straightforward: just drag each chart to the position you want. Then add a filter action so that clicking a province on the map filters the other worksheets.
The dashboard has been published to Tableau Public, where it can be edited online or downloaded as a complete workbook. Link: https://public.tableau.com/shared/ZCXWTK6SP?:display_count=n&:origin=viz_share_link
See the attachment for the complete code: source code for collecting national university information.
Link: https://pan.baidu.com/s/1FCXwAyeeqkoH6M_ITWWAcw
Extraction code: 6cbf
- END -