Detailed explanation of the ranking of the best universities
2022-07-05 04:49:00 【Prosperity comes to an end and the city is ruined 891】
# 2021/10/16, Saturday
# Scrape four fields (rank, university name, province, total score) for the top 20 Chinese universities listed at https://www.shanghairanking.cn/rankings/bcur/202111
# Before scraping, inspect the page source and find the tags that hold the target content (<tbody><tr><td><div><a>), noting which fields sit under the same tag.
# The site changes every year and its markup changes with it. In recent versions the text inside the tags we need is padded with whitespace, which has to be stripped.
import requests  # HTTP library: sends the request for the page URL and fetches the response
import bs4  # needed for bs4.element.Tag, the tag type checked in fillUnivList
from bs4 import BeautifulSoup  # parses the raw HTML into a tree that can be searched by tag
def getHTMLText(url):  # Fetch the ranking page and return its HTML as text
    try:  # see Remark 1
        r = requests.get(url, timeout=30)  # send a GET request, giving up after 30 seconds
        r.raise_for_status()  # raise requests.HTTPError if the status code is 4xx or 5xx
        r.encoding = r.apparent_encoding  # decode with the encoding detected from the body (usually utf-8 here) to avoid garbled text
        return r.text  # on success, return the page content
    except requests.RequestException:
        return ""  # on any request failure, return an empty string
def fillUnivList(ulist, html):  # Extract the required fields from the ranking page and append them to ulist
    soup = BeautifulSoup(html, "html.parser")  # parse the page with Python's built-in html.parser
    tbody = soup.find('tbody')
    if tbody is None:  # nothing to parse, e.g. getHTMLText returned ""
        return
    for tr in tbody.children:  # see Remark 2
        if isinstance(tr, bs4.element.Tag):  # see Remark 3: skip children that are not tags, such as whitespace strings
            a = tr('a')  # shorthand for tr.find_all('a'): all <a> tags in this row
            tds = tr('td')  # shorthand for tr.find_all('td'): all <td> cells in this row
            # The text inside each tag is padded with whitespace. strip() removes leading and
            # trailing characters (whitespace, including newlines, by default).
            ulist.append([tds[0].text.strip(), a[0].text.strip(), tds[2].text.strip(), tds[4].text.strip()])
def printUnivList(ulist, num):  # Print the first num entries of ulist as an aligned table
    # tplt holds the output format: ^ centers the value and 10 is the field width, padded
    # when the value is shorter. {4} tells format() to use its argument at index 4 (the fifth
    # argument) as the fill character for the name column.
    tplt = "{0:^10}\t{1:{4}^10}\t{2:^10}\t{3:^10}"
    # chr(12288) is the full-width (ideographic) space U+3000. Padding Chinese names with it
    # instead of the ASCII space keeps the columns aligned.
    print(tplt.format("Rank", "University name", "Province", "Total score", chr(12288)))
    for u in ulist[:num]:  # slicing avoids an IndexError when fewer than num rows were parsed
        print(tplt.format(u[0], u[1], u[2], u[3], chr(12288)))
    print("Suc" + str(num))
def main():
    uinfo = []  # holds the university records
    url = "https://www.shanghairanking.cn/rankings/bcur/202111"
    html = getHTMLText(url)  # fetch the page content
    fillUnivList(uinfo, html)  # parse the page and store the rows in uinfo
    printUnivList(uinfo, 20)  # print the top 20 entries
main()
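The alignment template used in printUnivList can be tried on its own, without any network access. The following is a minimal sketch under stated assumptions: format_row is a hypothetical helper, and the rows are made-up placeholders rather than scraped data.

```python
# Template copied from printUnivList: the argument at index 4 supplies the
# fill character for the centered name column.
TPLT = "{0:^10}\t{1:{4}^10}\t{2:^10}\t{3:^10}"
FULLWIDTH_SPACE = chr(12288)  # U+3000 IDEOGRAPHIC SPACE


def format_row(rank, name, province, score):
    """Format one table row (hypothetical helper for this sketch)."""
    return TPLT.format(rank, name, province, score, FULLWIDTH_SPACE)


# Placeholder rows for illustration only, not real ranking data.
rows = [("1", "甲大学", "北京", "100.0"), ("2", "乙科技大学", "上海", "99.5")]
for row in rows:
    print(format_row(*row))
```

Only the name column is padded with the full-width space. The other three columns keep the default ASCII space, as in the original template.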
# Remark 1:
# A try/except statement runs as follows:
# The code in the try block runs first. If an error occurs, Python creates an exception object and raises it, which interrupts the rest of the try block.
# The interpreter then looks for an except clause that matches the exception type. If it finds one, the exception object is passed to that clause. This step is called catching (handling) the exception.
# If no matching except clause exists anywhere up the call stack, the interpreter prints a traceback and the program terminates.
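The try/except flow described in this remark can be seen in a small offline example. This is only a sketch: parse_score is a hypothetical helper that does not appear in the script above.

```python
def parse_score(text):
    """Convert a score string to float, or return None if it is not a number.

    parse_score is a hypothetical helper used only to illustrate try/except.
    """
    try:
        return float(text.strip())
    except ValueError:
        # float() raised ValueError. This clause matches it, so the program
        # keeps running instead of terminating.
        return None


print(parse_score(" 969.2 "))
print(parse_score("n/a"))
```

An exception of a different type is not matched by `except ValueError`. For example, `parse_score(None)` raises AttributeError, which propagates to the caller.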
# Remark 2:
# fillUnivList has to be written against the page source (open the source view and use Ctrl+F to find the tags). Each <tr></tr> holds all the information for one university, and each
# <td></td> inside it holds one field: rank, name, province, and so on. The parent of every <tr> is <tbody>, so iterating over the children of <tbody> visits every row. Within each row
# the code collects the <td> and <a> tags, then stores tds[0] (rank), a[0] (university name), tds[2] (province), and tds[4] (total score) in ulist.
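The traversal described in this remark can be checked offline against a hand-written fragment. The sketch below assumes a table layout like the one described. SAMPLE_HTML and its rows are placeholders, not the live page markup, and the fourth column is only a stand-in for whatever sits between province and total score.

```python
import bs4
from bs4 import BeautifulSoup

# Hand-written fragment that imitates the described table layout.
# The rows are placeholders, not data from the ranking site.
SAMPLE_HTML = """
<table><tbody>
  <tr><td> 1 </td><td><a> University A </a></td><td> Beijing </td><td> Comprehensive </td><td> 100.0 </td></tr>
  <tr><td> 2 </td><td><a> University B </a></td><td> Shanghai </td><td> Comprehensive </td><td> 99.5 </td></tr>
</tbody></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = []
for child in soup.find("tbody").children:
    # Whitespace between <tr> elements shows up as NavigableString children,
    # which are not Tag instances and have no <td> cells to read.
    if not isinstance(child, bs4.element.Tag):
        continue
    tds = child.find_all("td")
    links = child.find_all("a")
    rows.append([tds[0].text.strip(), links[0].text.strip(),
                 tds[2].text.strip(), tds[4].text.strip()])

print(rows)
```

Because the live site's markup changes from year to year, the column indices should be confirmed against the current page source before reuse.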
# Remark 3:
# isinstance(object, classinfo) checks whether an object is an instance of a given type. It is similar to type(), but it also returns True for instances of subclasses.
# object: the instance to test.
# classinfo: a class, a built-in type, or a tuple of them. With a tuple, the result is True if the object is an instance of any entry.
# bs4.element.Tag: the tag type defined in the bs4 library.
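The difference between isinstance() and type() can be shown with two small classes. Node and Element are hypothetical names invented for this sketch and are unrelated to bs4's own classes.

```python
class Node:
    """Hypothetical base class for this sketch."""


class Element(Node):
    """Hypothetical subclass of Node."""


item = Element()

print(isinstance(item, Element))       # True: exact class
print(isinstance(item, Node))          # True: instances of subclasses count
print(type(item) is Node)              # False: type() reports the exact class only
print(isinstance("text", (int, str)))  # True: any entry in the tuple may match
```

This is why the script filters with isinstance(tr, bs4.element.Tag): the check accepts Tag instances and rejects the plain strings that sit between rows.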