Crawl the information of national colleges and universities in 1 minute and make it into a large screen for visualization!
2022-07-02 15:49:00 【Junhong's road of data analysis】
I remember that after I finished the gaokao (college entrance exam), I bought a book of university rankings at the bookstore to help me choose a school. But the information in such books is always out of date, so it wasn't much help. Today I'll show you how to crawl something genuinely useful: information on universities nationwide, covering the vast majority of schools, and then turn it into a visual dashboard. Without further ado, let's get to it!
Data crawling
Address: https://www.gaokao.cn/school/140
Press F12 to open the developer tools; with the network panel it is easy to find the JSON file behind the page. Requesting that link directly returns the information for the corresponding university.
Comparing a few pages reveals the pattern https://static-data.eol.cn/www/2.0/school/140/info.json, where the key parameter 140 is the school ID. The IDs are not continuous, so we can only crawl over a range based on the approximate number of schools.
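As a quick sanity check of the pattern above, the info.json URL for any school is just the template with the ID substituted (the small helper below is illustrative, not part of the original crawler):

```python
# Build the info.json URL for a given school ID (140 is the example from the page above)
url_template = 'https://static-data.eol.cn/www/2.0/school/%d/info.json'

def build_info_url(school_id):
    return url_template % school_id

print(build_info_url(140))
# https://static-data.eol.cn/www/2.0/school/140/info.json
```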
Crawling code
Importing the modules
import aiohttp
import asyncio
import pandas as pd
from pathlib import Path
from tqdm import tqdm

current_path = Path.cwd()  # directory where college_info.csv is read and written
A brief note on what the main modules are for:
- aiohttp: single-threaded concurrent IO over HTTP. On the client side it is not powerful on its own; it is used together with asyncio, because requests does not support async. On the server side (for example a web server), HTTP connections are IO operations, so a single thread plus coroutines can support many concurrent users.
- asyncio: provides full asynchronous IO support; multiple coroutines can be wrapped into a group of Tasks and executed concurrently.
- pandas: converts the crawled data into a DataFrame and writes the csv file.
- pathlib: an object-oriented way of representing filesystem paths.
- tqdm: wrap any iterable in tqdm(iterable) and the loop gets a smart progress bar.
Generating the URL sequence
The URL sequence is generated from the URL template and max_id. I added a deduplication step here: if university information was collected on an earlier run, the script reads the csv file in the same directory and removes the school IDs that have already been collected, so only universities not yet fetched are requested.
def get_url_list(max_id):
    url = 'https://static-data.eol.cn/www/2.0/school/%d/info.json'
    not_crawled = set(range(max_id))
    # Deduplication: drop school IDs already present in the csv from earlier runs
    csv_path = Path(current_path, 'college_info.csv')
    if csv_path.exists():
        df = pd.read_csv(csv_path)
        not_crawled -= set(df['school_id'].unique())
    return [url % school_id for school_id in not_crawled]
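The deduplication step can be sketched in isolation, with made-up IDs instead of a real csv: a set difference removes whatever was collected before, and only the remaining IDs become URLs.

```python
url_template = 'https://static-data.eol.cn/www/2.0/school/%d/info.json'

# Pretend max_id is 5 and that IDs 1 and 3 were already collected on a previous run
not_crawled = set(range(5)) - {1, 3}
urls = [url_template % school_id for school_id in sorted(not_crawled)]
print(urls)
# URLs for IDs 0, 2 and 4 only
```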
Collecting the JSON data
Requests are sent over the URL sequence with coroutines. Be careful to limit the amount of concurrency (the practical limit on open connections is roughly 500 on Windows and 1024 on Linux).
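To see what the semaphore buys, here is a small self-contained sketch with toy workers standing in for real requests: however many tasks are created, at most the semaphore's limit run inside the guarded section at once.

```python
import asyncio

async def worker(sem, state):
    async with sem:
        state['active'] += 1
        state['peak'] = max(state['peak'], state['active'])
        await asyncio.sleep(0)  # stand-in for the real network IO
        state['active'] -= 1

async def run_demo():
    sem = asyncio.Semaphore(3)            # at most 3 workers inside at once
    state = {'active': 0, 'peak': 0}
    await asyncio.gather(*(worker(sem, state) for _ in range(10)))
    return state['peak']

peak = asyncio.run(run_demo())
print(peak)  # never exceeds the semaphore limit of 3
```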
async def get_json_data(url, semaphore):
    async with semaphore:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
        }
        async with aiohttp.ClientSession(connector=aiohttp.TCPConnector(ssl=False), trust_env=True) as session:
            try:
                async with session.get(url=url, headers=headers, timeout=6) as response:
                    # await suspends this task while the IO request is in flight;
                    # the event loop runs other tasks until the response arrives
                    json_data = await response.json(encoding='utf-8', content_type=None)
                    if json_data:
                        return save_to_csv(json_data['data'])
            except Exception:
                # IDs are not continuous, so many requests simply fail; skip them
                return None
Parsing and saving the data
The JSON data contains many fields; parse and save whichever fields you need for your own purposes.
def save_to_csv(json_info):
    save_info = {}
    save_info['school_id'] = json_info['school_id']              # school ID
    save_info['name'] = json_info['name']                        # school name
    if json_info['f985'] == '1' and json_info['f211'] == '1':
        level = '985 211'
    elif json_info['f211'] == '1':
        level = '211'
    else:
        level = json_info['level_name']
    save_info['level'] = level                                   # school level
    save_info['ruanke_rank'] = json_info['rank']['ruanke_rank']  # Ruanke (Shanghai) ranking
    save_info['xyh_rank'] = json_info['rank']['xyh_rank']        # Alumni Association ranking
    save_info['wsl_rank'] = json_info['rank']['wsl_rank']        # Wu Shulian ranking
    save_info['qs_world'] = json_info['rank']['qs_world']        # QS world ranking
    save_info['us_rank'] = json_info['rank']['us_rank']          # U.S. News world ranking
    save_info['type'] = json_info['type_name']                   # school type
    save_info['province'] = json_info['province_name']           # province
    save_info['city'] = json_info['city_name']                   # city
    save_info['town'] = json_info['town_name']                   # district / town
    save_info['phone'] = json_info['phone']                      # admissions office phone
    save_info['site'] = json_info['site']                        # admissions office website
    df = pd.DataFrame(save_info, index=[0])
    csv_path = Path(current_path, 'college_info.csv')
    # Write the header only the first time, then append rows without it
    header = not csv_path.exists()
    df.to_csv(csv_path, index=False, mode='a', header=header)
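The header flag at the end is worth a closer look: to_csv in append mode would otherwise repeat the header row on every call. A minimal sketch with a temporary file and made-up rows (the column names here are just illustrative):

```python
import tempfile
from pathlib import Path
import pandas as pd

csv_path = Path(tempfile.mkdtemp()) / 'college_info.csv'
rows = [{'school_id': 140, 'name': 'University A'},
        {'school_id': 141, 'name': 'University B'}]

for row in rows:
    df = pd.DataFrame(row, index=[0])
    header = not csv_path.exists()       # header only for the very first write
    df.to_csv(csv_path, index=False, mode='a', header=header)

result = pd.read_csv(csv_path)
print(len(result))  # 2 data rows, one header
```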
The scheduler
This drives the whole collection run: build the URL list >> limit the concurrency >> create the task objects >> await the tasks.
async def main(loop):
    # Build the url list
    url_list = get_url_list(5000)
    # Limit the amount of concurrency
    semaphore = asyncio.Semaphore(500)
    # Create the task objects and collect them in a list
    tasks = [loop.create_task(get_json_data(url, semaphore)) for url in url_list]
    # Await the tasks as they complete, with a progress bar
    for t in tqdm(asyncio.as_completed(tasks), total=len(tasks)):
        await t

# Entry point (not shown in the original snippet)
if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main(loop))
The above is the main code of the program .
Running it
Here I collect the universities with IDs below 5000. To gather as many universities as possible, it is best to run the script several times until no new rows appear.
First run (2140 rows collected); second run (680 more rows).
That makes 2820 rows in total. Now for the visualization.
Tableau visualization
Compared with other visualization tools and third-party plotting libraries, I prefer Tableau: it is easy to pick up. If you want to learn more, have a look at Tableau Public, where many experts publish their work.
https://public.tableau.com/app/discover
Its only drawback is the price. Students can use it for free; otherwise, start with the free Tableau Public edition and consider paying once you know the tool well.
The visualization consists of four charts.
Map of universities by province

The three provinces with the most universities are Jiangsu, Guangdong and Henan (for reference only).
Top 10 universities by Ruanke ranking
By the Ruanke (Shanghai) ranking, the vast majority of the national top 10 are comprehensive universities; the only science-and-engineering school is the University of Science and Technology of China, in seventh place.
Distribution of university levels
In the collected data, 211 universities account for about 9.5% and 985 universities for about 3.5%; they really are rare.
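The shares quoted above are simply value counts over the level column. A sketch with pandas, using an illustrative series rather than the real csv (the counts below are made up):

```python
import pandas as pd

# Illustrative level column; the real one comes from college_info.csv
levels = pd.Series(['Regular'] * 16 + ['211'] * 3 + ['985 211'] * 1)
shares = levels.value_counts(normalize=True)
print(shares['211'])      # 0.15
print(shares['985 211'])  # 0.05
```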
Distribution of university types
Science-and-engineering and comprehensive universities dominate: their counts are roughly equal, and both are far ahead of the other types. The next tier by count is finance, teacher training and medicine.
Combined dashboard
Combining the worksheets above into a dashboard is straightforward: just drag each chart to the position you want. Then add a filter action so that clicking a province on the map filters the other worksheets.
The dashboard has been published to Tableau Public, where it can be edited online or downloaded as a complete workbook. Link: https://public.tableau.com/shared/ZCXWTK6SP?:display_count=n&:origin=viz_share_link
See the attachment for the complete code: source code for collecting national university information.
Link: https://pan.baidu.com/s/1FCXwAyeeqkoH6M_ITWWAcw
Extraction code: 6cbf
- END -