Fetching Baidu Index data with seleniumwire
2022-07-28 22:59:00 【小小明-代码实体】
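In an earlier article, "How to download Baidu Index data with Python" (《如何用Python下载百度指数的数据》), I showed how to fetch Baidu Index data through its API. This year, however, Baidu Index added a new verification step. For example, take the following code: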
import requests

# Placeholder: paste the Cookie header value of a logged-in Baidu Index session.
cookie = ""

headers = {
    "Connection": "keep-alive",
    "Accept": "application/json, text/plain, */*",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
    "Sec-Fetch-Site": "same-origin",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Dest": "empty",
    "Referer": "https://index.baidu.com/v2/main/index.html",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Cookie": cookie,
}
# One inner list per keyword to compare.
words = '[[{"name":"python","wordType":1}],[{"name":"java","wordType":1}]]'
start, end = "2022-06-28", "2022-07-27"
url = (
    "https://index.baidu.com/api/SearchApi/index"
    f"?area=0&word={words}&startDate={start}&endDate={end}"
)
res = requests.get(url, headers=headers)
print(res.json())
{'status': 10018,
'data': '',
'logid': 2631899650,
'message': '您好,百度指数监测到疑似存在xx访问行为,如您未有类似行为,可能是由于您使用公共网络或访问频次过高,\n 您可以通过邮箱[email protected]联系我们'}
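Baidu Index returns no data; instead, the message says that suspected abnormal access was detected. A quick check shows that the request headers now carry an additional Cipher-Text parameter. Anyone experienced in JavaScript reverse engineering can analyze the site's JS directly, generate this parameter correctly, and pass the verification.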
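Today, though, I will demonstrate a very simple and practical alternative: use seleniumwire to capture the data straight from the browser and then decrypt it.
For an introduction to seleniumwire, see my previous article, "Using a proxy with selenium and accessing the developer tools Network panel with seleniumwire" (《selenium对接代理与seleniumwire访问开发者工具NetWork》).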
Logging in to Baidu Index automatically
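A browser driven by selenium starts without a session, so having to log in to Baidu Index on every run is tedious. If we cache the cookies in a local file, every later run can log in to Baidu automatically.
Code that saves the cookies: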
import json
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get("https://index.baidu.com/v2/index.html")
# Clicking the user-name label opens the login dialog.
browser.find_element(By.CSS_SELECTOR, "span.username-text").click()
print("Waiting for login...")
# The label reads "登录" ("Log in") until the user has logged in manually.
while browser.find_element(By.CSS_SELECTOR, "span.username-text").text == "登录":
    time.sleep(3)
print("Logged in, saving cookies...")
with open('cookie.txt', 'w', encoding='u8') as f:
    json.dump(browser.get_cookies(), f)
browser.close()
print("Cookies saved, the browser has been closed.")
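Running the code above opens the login dialog. Once you have logged in manually, the script saves the cookies to a local file and closes the browser.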
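The polling loop above waits forever if nobody logs in. As a minimal sketch (the helper name `wait_until` and its default timeout are my own assumptions, not part of the original script), the same polling can be given a deadline:

```python
import time


def wait_until(predicate, timeout=300.0, interval=3.0):
    """Call predicate() every `interval` seconds until it returns a truthy value.

    Returns True as soon as the predicate succeeds, or False once `timeout`
    seconds have passed without success.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False


# Hypothetical usage with the browser from the script above:
# logged_in = wait_until(
#     lambda: browser.find_element(By.CSS_SELECTOR, "span.username-text").text != "登录"
# )
# if not logged_in:
#     raise SystemExit("Login was not completed in time")
```

Returning a boolean rather than raising leaves it to the caller to decide whether a missed login should abort the script or simply trigger another attempt.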
After that, opening Baidu Index as follows logs us in automatically:
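import json

from seleniumwire import webdriver

browser = webdriver.Chrome()
with open('cookie.txt', 'r', encoding='u8') as f:
    cookies = json.load(f)
# Cookies can only be added for the domain of the page that is currently open,
# so load the site once, add the cookies, then load it again as a logged-in user.
browser.get('https://index.baidu.com/v2/index.html')
for cookie in cookies:
    browser.add_cookie(cookie)
browser.get('https://index.baidu.com/v2/index.html')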
See also: "Five levels of extracting Google Chrome cookies" (《提取谷歌游览器Cookie的五重境界》)
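Cached cookies eventually expire, and at that point the automatic login silently stops working. The sketch below is one way to notice this before reusing cookie.txt. It assumes the usual shape of the dicts returned by selenium's `get_cookies()`, where a persistent cookie carries an `expiry` field in seconds since the epoch and a session cookie has none. The helper name `split_expired` and the sample cookie names are made up for illustration.

```python
import time


def split_expired(cookies, now=None):
    """Split cookie dicts into (still_valid, expired) lists.

    A cookie counts as expired when it has an 'expiry' timestamp that is not
    in the future. Cookies without 'expiry' (session cookies) are kept.
    """
    now = time.time() if now is None else now
    valid, expired = [], []
    for cookie in cookies:
        expiry = cookie.get("expiry")
        if expiry is not None and expiry <= now:
            expired.append(cookie)
        else:
            valid.append(cookie)
    return valid, expired


# Illustrative data only; real cookie names and values will differ.
sample = [
    {"name": "login_token", "value": "x", "expiry": 1_000},
    {"name": "prefs", "value": "y", "expiry": 5_000},
    {"name": "session_only", "value": "z"},
]
valid, expired = split_expired(sample, now=2_000)
print([c["name"] for c in valid])    # ['prefs', 'session_only']
print([c["name"] for c in expired])  # ['login_token']
```

If anything the login depends on shows up in the expired list, rerun the cookie-saving script instead of loading the stale file.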
Searching and retrieving the data
Have the browser search for a specific keyword, for example Python:
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(browser, 30)
edit = wait.until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, "#search-input-form > input.search-input")))
print("Captured requests before clearing:", len(browser.requests))
del browser.requests  # clear the captured request history
# Select and delete any existing text, then type the keyword.
edit.send_keys(Keys.CONTROL + 'a')
edit.send_keys(Keys.DELETE)
edit.send_keys("Python")
# "cancle" is the class name as spelled on the page.
submit = browser.find_element(By.CSS_SELECTOR, "span.search-input-cancle")
submit.click()
print("Captured requests after clearing and searching:", len(browser.requests))
Captured requests before clearing: 87
Captured requests after clearing and searching: 3
Once the search has run, we can read the data from the requests the browser has captured:
import gzip
import json
import zlib

import brotli


def auto_decompress(res):
    """Return the response body, decompressed according to Content-Encoding."""
    content_encoding = res.headers["content-encoding"]
    if content_encoding == "gzip":
        return gzip.decompress(res.body)
    if content_encoding == "deflate":
        return zlib.decompress(res.body)
    if content_encoding == "br":
        return brotli.decompress(res.body)
    return res.body


def fetch_data(rule, encoding="u8", is_json=True):
    """Return the newest captured response whose URL contains `rule`, or "" if none."""
    for request in reversed(browser.requests):
        # Skip requests that have not received a response yet.
        if rule in request.url and request.response is not None:
            result = auto_decompress(request.response).decode(encoding)
            return json.loads(result) if is_json else result
    return ""


def decrypt(ptbk, index_data):
    """Decode index_data: each character in the first half of ptbk maps to the
    character at the same position in the second half."""
    n = len(ptbk) // 2
    a = dict(zip(ptbk[:n], ptbk[n:]))
    return "".join([a[s] for s in index_data])


ptbk = fetch_data("Interface/ptbk")['data']
data = fetch_data("api/SearchApi/index")['data']
for user_index in data['userIndexes']:
    name = user_index['word'][0]['name']
    index_data = user_index['all']['data']
    r = decrypt(ptbk, index_data)
    print(name, r)
python 21077,21093,21186,19643,14612,13961,21733,21411,21085,21284,18591,13211,12753,27225,20302,19772,20156,17647,12018,11745,19535,19300,20075,20136,18153,12956,12406,17098,16259,18707
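Comparing this with the values shown on the page confirms that the data is correct. So seleniumwire lets us fetch Baidu Index data. To get a specific date range or province, simply have selenium perform the same query steps a person would, then read the result from the requests the browser captured in the background.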
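To make the decryption step concrete, here is a small worked example of the substitution performed by `decrypt`. The key and ciphertext below are invented for illustration and are not real Baidu Index values. The only thing taken from the article is the rule that the first half of the key lists the cipher characters and the second half lists the plain characters at the same positions.

```python
def decrypt(ptbk, index_data):
    """Map each cipher character through the table built from the two key halves."""
    n = len(ptbk) // 2
    table = dict(zip(ptbk[:n], ptbk[n:]))
    return "".join(table[ch] for ch in index_data)


# Made-up key: 'X' -> '0', 'q' -> '1', 'Z' -> '2', 'b' -> '3', ..., 'k' -> ','
demo_key = "XqZbwcTpLmk" + "0123456789,"
demo_cipher = "ZqXppkZqXmb"

plain = decrypt(demo_key, demo_cipher)
values = [int(v) for v in plain.split(",")]
print(plain)   # 21077,21093
print(values)  # [21077, 21093]
```

Because the key is fetched alongside each query, the pairing of a key with its data matters: decoding with a key captured from a different search would give wrong characters or a `KeyError`.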
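To parse the data for the different client types, see the code in the earlier article "How to download Baidu Index data with Python" (《如何用Python下载百度指数的数据》).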
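The decrypted result is just a comma-separated string, so it usually needs to be lined up with dates before it is useful. The following sketch assumes that the series has one value per day and that you know the first day of the queried range. The function name `to_daily_series` is hypothetical, and treating an empty field as a missing value is also an assumption.

```python
from datetime import date, timedelta


def to_daily_series(decrypted, start):
    """Pair each comma-separated value with consecutive dates starting at `start`.

    `start` is an ISO date string such as "2022-06-28". Empty fields become None.
    """
    first = date.fromisoformat(start)
    values = [int(v) if v else None for v in decrypted.split(",")]
    return {
        (first + timedelta(days=i)).isoformat(): value
        for i, value in enumerate(values)
    }


series = to_daily_series("21077,21093,21186,19643", "2022-06-28")
for day, value in series.items():
    print(day, value)
```

A quick sanity check is to compare the number of values with the number of days in the range: the 30 values printed earlier match the 30 days from 2022-06-28 to 2022-07-27.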