Detailed explanation of the ranking of the best universities

2022-07-05 04:49:00 【Prosperity comes to an end and the city is ruined 891】

# 2021/10/16, Saturday
# Crawl https://www.shanghairanking.cn/rankings/bcur/202111 for the top 20 Chinese universities and extract four fields: rank, university name, province, and total score.
# Before crawling, inspect the page source carefully, including the tags that hold the target content (<tbody><tr><td><div><a>), so that content from the same kind of tag can be collected together.
# The site changes every year, and so does its source. In recent years the tags we need contain a lot of whitespace, which has to be handled.

import requests  # Sends the HTTP request for the page URL so its content can be fetched
import bs4  # Needed below for the bs4.element.Tag type
from bs4 import BeautifulSoup  # Parses the raw HTML into a tree of tags that can be searched

def getHTMLText(url):  # Fetch the university ranking page
    try:  # See remark 1
        r = requests.get(url, timeout=30)  # Send a GET request for the URL
        r.raise_for_status()  # Raise an exception if the status code indicates an error
        r.encoding = r.apparent_encoding  # Use the encoding detected from the content (usually utf-8 here) to avoid garbled text
        return r.text  # On success, return the page text
    except requests.RequestException:
        return ""  # On a failed request, return an empty string

def fillUnivList(ulist, html):  # Extract the needed fields from the ranking page and append them to ulist
    soup = BeautifulSoup(html, "html.parser")  # Parse the page with Python's built-in HTML parser
    for tr in soup.find('tbody').children:  # See remark 2
        if isinstance(tr, bs4.element.Tag):  # See remark 3 (skips child nodes that are not tags, such as whitespace strings)
            a = tr('a')  # All <a> tags in this row, as a list
            tds = tr('td')  # All <td> tags in this row, as a list
            ulist.append([tds[0].text.strip(), a[0].text.strip(), tds[2].text.strip(), tds[4].text.strip()])
            # The text in the <td> tags is padded with whitespace; strip() removes leading and trailing characters (whitespace by default).

def printUnivList(ulist, num):  # Print the results as an aligned table
    tplt = "{0:^10}\t{1:{4}^10}\t{2:^10}\t{3:^10}"
    # tplt stores the output format: ^ means center-aligned and 10 is the field width in characters; shorter values are padded with spaces.
    # {4} tells format() to use its fifth argument (index 4) as the fill character, so the name column is padded with a full-width Chinese space.
    print(tplt.format("Rank", "School name", "Province", "Total score", chr(12288)))
    # Python formats the output with str.format().
    # chr(12288) is the full-width (ideographic) space U+3000; padding Chinese text with it keeps the columns aligned.
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2], u[3], chr(12288)))
    print("Suc" + str(num))

def main():
    uinfo = []  # Holds the university records
    url = "https://www.shanghairanking.cn/rankings/bcur/202111"
    html = getHTMLText(url)  # Fetch the page content
    fillUnivList(uinfo, html)  # Parse the page and store the rows in uinfo
    printUnivList(uinfo, 20)  # Print the top 20 entries

main()
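
The alignment template used in printUnivList can be tried without touching the network. The sketch below reuses the same template string; the helper name format_row and the sample values are made up for illustration and are not taken from the ranking site.

```python
# Same template as printUnivList: field 1 takes its fill character from argument 4.
TPLT = "{0:^10}\t{1:{4}^10}\t{2:^10}\t{3:^10}"
FULL_WIDTH_SPACE = chr(12288)  # U+3000 IDEOGRAPHIC SPACE


def format_row(rank, name, province, score):
    """Return one table row; only the name column is padded with U+3000."""
    return TPLT.format(rank, name, province, score, FULL_WIDTH_SPACE)


# Hypothetical sample values, not real ranking data.
row = format_row("1", "示例大学", "北京", "100.0")
print(row)
```

Because a full-width space occupies the same width as a Chinese character in most monospaced terminals, padding the name column with it keeps the following columns lined up; the other columns still use the ordinary ASCII space.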
# Remark 1:
# A try/except statement executes as follows:
# The code in the try block runs first. If an exception occurs during execution, Python creates an exception object and raises it.
# The interpreter then looks for an except block that can handle that exception. If a matching except block is found, the exception object is passed to it;
# this step is called exception handling. If no matching except block is found, the program terminates with a traceback.
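
The flow in remark 1 can be seen in a small, self-contained sketch. The function safe_divide is a hypothetical helper introduced only for this illustration; it handles one exception type and lets any other propagate to the caller.

```python
def safe_divide(a, b):
    """Return a / b, or None when b is zero (hypothetical helper)."""
    try:
        return a / b
    except ZeroDivisionError:
        # Only this exception type is handled here.
        return None


print(safe_divide(6, 3))
print(safe_divide(1, 0))

# A TypeError has no matching except block inside safe_divide,
# so it propagates to the caller, which handles it here.
try:
    safe_divide("6", 3)
except TypeError as exc:
    print("propagated:", type(exc).__name__)
```

Naming the exception type in the except clause, rather than using a bare except, keeps unrelated errors visible instead of silently turning them into an empty result.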

# Remark 2:
# fillUnivList has to be written against the page source (use Ctrl+F in the source view to find the tags). The source shows that each <tr></tr> holds all the information for one university, and each
# <td></td> inside it holds one field: rank, name, province, and so on. The parent tag of <tr> is <tbody>, so iterating over the children of tbody visits every tr. Within each tr,
# find the td and a tags, then store tds[0] (rank), a[0] (name), tds[2] (province), and tds[4] (total score) in ulist.
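
The traversal in remark 2 can be exercised offline on a small inline table. The markup below is a hypothetical imitation of the layout described above, not a copy of the real page, and parse_rows is an illustrative name; it assumes the bs4 package is installed.

```python
import bs4
from bs4 import BeautifulSoup

# Hypothetical markup imitating the described layout; the real page
# has more columns, attributes, and nested tags.
SAMPLE_HTML = """
<table><tbody>
<tr><td> 1 </td><td><a> University A </a></td><td> Beijing </td><td> x </td><td> 99.5 </td></tr>
<tr><td> 2 </td><td><a> University B </a></td><td> Shanghai </td><td> y </td><td> 98.0 </td></tr>
</tbody></table>
"""


def parse_rows(html):
    """Collect [rank, name, province, score] for each <tr> under <tbody>."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find("tbody").children:
        if not isinstance(tr, bs4.element.Tag):
            continue  # skip the newline strings between rows
        tds = tr("td")  # calling a tag is shorthand for find_all
        a = tr("a")
        rows.append([tds[0].text.strip(), a[0].text.strip(),
                     tds[2].text.strip(), tds[4].text.strip()])
    return rows


for row in parse_rows(SAMPLE_HTML):
    print(row)
```

Testing against a fixed snippet like this makes it easier to adjust the column indexes when the site layout changes, since the real page does not need to be fetched on every run.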

# Remark 3:
# isinstance(object, classinfo) checks whether an object is an instance of a given type. It is similar to type(), but it also accepts instances of subclasses. object: the instance to test.
# classinfo: a class, a built-in type, or a tuple of them. The result says whether the instance belongs to any of them.
# bs4.element.Tag: the tag type defined in the bs4 library.
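
The filtering described in remark 3 can be shown with the standard library alone. In this sketch the classes Node and TagNode are hypothetical stand-ins for the node types in bs4, used only to show how isinstance differs from a type() comparison.

```python
class Node:
    """Hypothetical base class standing in for a generic parse-tree node."""


class TagNode(Node):
    """Hypothetical subclass standing in for bs4.element.Tag."""


# A mixed list, similar to tbody.children containing tags and newline strings.
children = [TagNode(), "\n", TagNode(), "\n"]

# Keep only the tag-like items, as fillUnivList does with bs4.element.Tag.
tags = [child for child in children if isinstance(child, TagNode)]
print(len(tags))

tag = TagNode()
print(isinstance(tag, Node))        # True: instances of subclasses count
print(type(tag) is Node)            # False: type() compares the exact class
print(isinstance("\n", (int, str)))  # True: classinfo may be a tuple
```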

Original site

Copyright notice
This article was written by [Prosperity comes to an end and the city is ruined 891]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/02/202202140628462806.html