Using the selenium automated testing tool to scrape gaokao admission score lines and rankings by university and major
2022-07-01 03:19:00 【黄钢】
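With the gaokao scores released, choosing a university and a major has become the most important task for parents. Over the past two days several relatives and friends asked me about choosing majors, and I came across a website with good content: it lists the minimum admission score and the minimum admission rank for each major at each school. The page scraped below shows admissions for computer science majors in Zhejiang; the major can be swapped for another.
The page content itself is simple, but its pagination (one page per year) is not reflected in the GET request, so the site was probably built with a separated front end and back end. Scraping it with plain HTTP requests would likely be hard, so I used selenium to extract the data and switch pages automatically.
The code is as follows: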
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import pandas as pd

# Selenium 4 style: the chromedriver path is passed through a Service object.
driver = webdriver.Chrome(service=Service(r'C:\Users\HP\Downloads\chromedriver_win32\chromedriver.exe'))
driver.get("https://www.zjut.cc/zhuanye/fsx-0809-33.html")

for j in range(4):
    # The first two iterations read the table under div[1], the last two under div[3].
    panel = 1 if j < 2 else 3
    row_xpath = "/html/body/div[3]/div[1]/div/div/div[%d]/div/div[2]/table/tbody/tr[{}]" % panel
    schools = []
    for i in range(100):
        row = row_xpath.format(1 + i)
        series = driver.find_element(By.XPATH, row + "/th").text
        school_name = driver.find_element(By.XPATH, row + "/td[1]/a").text
        # The major is looked up under the pills-2021 id in every iteration, as in the original script.
        major = driver.find_element(By.XPATH, '//*[@id="pills-2021"]/div/div[2]/table/tbody/tr[{}]/td[1]/small[2]'.format(1 + i)).text
        min_score = driver.find_element(By.XPATH, row + "/td[2]").text
        min_rank = driver.find_element(By.XPATH, row + "/td[3]").text
        plan = driver.find_element(By.XPATH, row + "/td[4]").text
        schools.append([series, school_name, major, min_score, min_rank, plan])
    # Columns: rank, school, major, minimum score, minimum rank, planned enrollment.
    df = pd.DataFrame(schools, columns=['排序', '院校', '专业', '最低分', '最低排名', '计划招录人数'])
    # One workbook per iteration: 2021.xlsx, 2020.xlsx, 2019.xlsx, 2018.xlsx.
    df.to_excel("%d.xlsx" % (2021 - j), index=False)
    # Click the pagination link through JavaScript, then give the page time to update.
    a = driver.find_element(By.XPATH, "/html/body/div[3]/div[1]/div/ul/li[{}]/a".format(1 + j))
    driver.execute_script("arguments[0].click();", a)
    time.sleep(3)
As you can see, almost everything is located by XPath. Some details still need explaining; I will come back to them when I have time.
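Since the XPath strings in the script differ only in the panel index and the row number, they can be generated from a single template. The following is a minimal sketch of that idea, not part of the original script: the names `TABLE_ROW_XPATH`, `row_xpaths`, and `scrape_plan` are hypothetical, and the panel choice (div[1] for the first two iterations, div[3] for the last two) and the file naming simply mirror the loop above.

```python
# Hypothetical helpers (not in the original script) that reproduce the
# XPath pattern and the per-iteration settings used by the scraper.
TABLE_ROW_XPATH = (
    "/html/body/div[3]/div[1]/div/div/div[{panel}]"
    "/div/div[2]/table/tbody/tr[{row}]"
)


def row_xpaths(panel, row):
    """Return the XPath of each cell for one table row (row is 1-based)."""
    base = TABLE_ROW_XPATH.format(panel=panel, row=row)
    return {
        "series": base + "/th",
        "school_name": base + "/td[1]/a",
        "min_score": base + "/td[2]",
        "min_rank": base + "/td[3]",
        "plan": base + "/td[4]",
    }


def scrape_plan(iterations=4, first_year=2021):
    """List (output file name, panel index) for each loop iteration."""
    plan = []
    for j in range(iterations):
        panel = 1 if j < 2 else 3  # same rule as the script's if/else
        plan.append(("%d.xlsx" % (first_year - j), panel))
    return plan


if __name__ == "__main__":
    print(row_xpaths(1, 1)["school_name"])
    print(scrape_plan())
```

Each value returned by `row_xpaths` could then be passed to `driver.find_element(By.XPATH, ...)`, which keeps the long path in one place if the page layout changes.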