Detailed explanation of the ranking of the best universities
2022-07-05 04:49:00 【Prosperity comes to an end and the city is ruined 891】
# 2021/10/16, Saturday
# Scrape four fields (rank, university name, province, total score) for the top 20 universities in the Chinese university ranking at https://www.shanghairanking.cn/rankings/bcur/202111
# Before scraping, inspect the page source around the target content, including the tags that enclose it (<tbody><tr><td><div><a>), so that values held in the same kind of tag can be extracted together.
# The site is updated every year and its markup changes with it. In recent versions the text inside the target tags is padded with whitespace, which has to be stripped.
import requests  # sends the HTTP request for the page URL
import bs4  # needed for bs4.element.Tag, used in the type check in fillUnivList (note 3)
from bs4 import BeautifulSoup  # parses the raw HTML into a tree of tags that can be searched
def getHTMLText(url):  # fetch the ranking page and return its HTML as text
    try:  # note 1
        r = requests.get(url, timeout=30)  # send a GET request; give up after 30 seconds
        r.raise_for_status()  # raise an exception if the status code signals an error (4xx or 5xx)
        r.encoding = r.apparent_encoding  # use the encoding detected from the content (usually utf-8) to avoid garbled text
        return r.text  # on success, return the page content
    except requests.RequestException:
        return ""  # on any request failure, return an empty string
def fillUnivList(ulist, html):  # extract the required fields from the page and append them to ulist
    soup = BeautifulSoup(html, "html.parser")  # parse the page with Python's built-in html.parser
    tbody = soup.find('tbody')
    if tbody is None:  # no table found, e.g. because the download failed and html is empty
        return
    for tr in tbody.children:  # note 2
        if isinstance(tr, bs4.element.Tag):  # note 3: skip children that are not tags (e.g. whitespace strings)
            a = tr('a')  # all <a> tags in this row, as a list (shorthand for tr.find_all('a'))
            tds = tr('td')  # all <td> tags in this row, as a list
            ulist.append([tds[0].text.strip(), a[0].text.strip(), tds[2].text.strip(), tds[4].text.strip()])
            # The text inside the <td> tags is padded with whitespace; strip() removes leading and trailing characters (whitespace by default).
def printUnivList(ulist, num):  # print the first num entries of ulist as an aligned table
    tplt = "{0:^10}\t{1:{4}^10}\t{2:^10}\t{3:^10}"
    # tplt holds the output format: ^ centers the value and 10 is the field width, padded when the value is shorter.
    # {4} takes the fill character from the fifth argument of format() (index 4), here a full-width space, so that Chinese names line up.
    print(tplt.format("Rank", "University", "Province", "Total score", chr(12288)))
    # chr(12288) is the full-width (ideographic) space U+3000, which is as wide as a Chinese character.
    shown = ulist[:num]  # avoids an IndexError when fewer than num rows were scraped
    for u in shown:
        print(tplt.format(u[0], u[1], u[2], u[3], chr(12288)))
    print("Suc" + str(len(shown)))
def main():
    uinfo = []  # holds the university records
    url = "https://www.shanghairanking.cn/rankings/bcur/202111"
    html = getHTMLText(url)  # fetch the page content
    fillUnivList(uinfo, html)  # parse the page and store the rows in uinfo
    printUnivList(uinfo, 20)  # print the top 20 entries
main()
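The alignment trick in printUnivList can be tried in isolation, without any network access. The sketch below is only an illustration: the helper name center and the sample string are assumptions, not part of the original script. It shows how a nested replacement field passes the fill character into the format specification, and how chr(12288) differs from an ordinary space.

```python
# Full-width (ideographic) space, U+3000; as wide as one CJK character.
FULLWIDTH_SPACE = chr(12288)

def center(text, width, fill=" "):
    """Center text in a field of `width` characters, padded with `fill`.

    Hypothetical helper for illustration only.
    """
    # {1} and {2} are nested fields: they are substituted first, so the
    # effective format specification becomes "<fill>^<width>".
    return "{0:{1}^{2}}".format(text, fill, width)

ascii_padded = center("清华大学", 10)                   # padded with ASCII spaces
wide_padded = center("清华大学", 10, FULLWIDTH_SPACE)   # padded with U+3000

# repr() makes the two kinds of padding visible.
print(repr(ascii_padded))
print(repr(wide_padded))
```

Both results are 10 characters long, but only the second keeps a constant display width in a monospaced CJK font, because its padding is as wide as the Chinese characters it surrounds.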
# Note 1:
# A try/except statement runs as follows:
# The code in the try block is executed first. If an error occurs, Python creates an exception object and raises it, which interrupts the rest of the try block.
# The interpreter then looks for an except block that can handle that exception. If a matching except block is found, the exception object is passed to it;
# this is called handling (catching) the exception. If no except block matches, the program terminates with a traceback and the interpreter exits.
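The flow described in note 1 can be seen in a small stand-alone sketch. The function parse_score and its inputs are hypothetical and not part of the scraper; they only show that a matching except block handles the exception, while an exception of another type passes through it.

```python
def parse_score(text):
    """Convert a score string to float; return None if it is not numeric.

    Hypothetical helper for illustration only.
    """
    try:
        return float(text)  # raises ValueError for a non-numeric string
    except ValueError:
        # Only ValueError is handled here; other exceptions propagate.
        return None

print(parse_score("969.2"))  # numeric string: the try block completes
print(parse_score("n/a"))    # ValueError is raised and handled

try:
    parse_score(None)        # float(None) raises TypeError, not ValueError
except TypeError as exc:
    # The except ValueError block above did not match, so the
    # exception reached this outer handler instead.
    print("unhandled inside parse_score:", type(exc).__name__)
```

Catching a specific exception type, rather than using a bare except, keeps unrelated errors visible instead of silently turning them into an empty result.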
# Note 2:
# This function has to be written against the page source (open the source view and use Ctrl+F to find the tags). Each <tr></tr> holds all the information for one university, and each
# <td></td> inside it holds one field, such as rank, name, or province. The parent of the tr tags is tbody, so iterating over the children of tbody visits every tr. Within each tr,
# the code collects the td tags and stores the rank (tds[0]), the name (a[0], the first a tag in the row), the province (tds[2]), and the total score (tds[4]) in ulist.
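The row and cell structure described in note 2 can be explored offline. The sketch below deliberately swaps BeautifulSoup for the standard library's html.parser.HTMLParser so that it runs without third-party packages; it is not the method used in the script above. The HTML sample, the class name RowCollector, and the university names are invented for illustration, and the real page has more columns.

```python
from html.parser import HTMLParser

# Invented sample; the real ranking page has more columns and attributes.
SAMPLE = """
<table><tbody>
  <tr><td> 1 </td><td><a> Example University A </a></td><td> Beijing </td></tr>
  <tr><td> 2 </td><td><a> Example University B </a></td><td> Zhejiang </td></tr>
</tbody></table>
"""

class RowCollector(HTMLParser):
    """Collect the stripped text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # one list of cell strings per <tr>
        self._cell = None  # text fragments of the <td> currently open

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._cell = []

    def handle_data(self, data):
        # Text outside a <td> (such as the whitespace between rows) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # strip() removes the padding around the cell text.
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None

collector = RowCollector()
collector.feed(SAMPLE)
for row in collector.rows:
    print(row)
```

Each printed list corresponds to one tr, and each element to one td, which is the same shape that fillUnivList builds by indexing tds and a.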
# Note 3:
# isinstance(object, classinfo) checks whether an object is an instance of a given type. It is similar to type(), but it also returns True for instances of subclasses. object: the instance to test.
# classinfo: a class, a built-in type, or a tuple of these. The result is True if the instance belongs to any of them.
# bs4.element.Tag: the tag type defined by the bs4 library.
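A short sketch of the filtering idea in note 3 follows. The classes Node, Tag, and Text are hypothetical stand-ins written for this example, not the bs4 classes; they mimic the situation where the children of tbody contain whitespace strings between the tr tags.

```python
# Hypothetical stand-ins for illustration; these are NOT the bs4 classes.
class Node:
    pass

class Tag(Node):
    def __init__(self, name):
        self.name = name

class Text(Node):
    def __init__(self, value):
        self.value = value

# Mixed children, similar to what iterating over a parsed <tbody> yields:
# whitespace text nodes interleaved with row tags.
children = [Text("\n"), Tag("tr"), Text("\n"), Tag("tr")]

# Keep only the tags, as the isinstance check in fillUnivList does.
tags = [child for child in children if isinstance(child, Tag)]
print([tag.name for tag in tags])

first = tags[0]
print(isinstance(first, Node))       # True: Tag is a subclass of Node
print(type(first) is Node)           # False: type() ignores inheritance
print(isinstance(first, (Text, Tag)))  # True: matches one class in the tuple
```

The last three lines show the difference from type(): isinstance accepts subclasses and tuples of classes, which makes it the safer choice for this kind of filter.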