Obtaining Baidu Index data with seleniumwire
2022-07-29 00:50:00 【Xiaoming - code entity】
In 《How to use Python to Download Baidu Index Data》 I shared how to obtain Baidu Index data through its API, but this year Baidu Index added a new verification step. Take the following code, for example:
import requests
import json
from datetime import date, timedelta
headers = {
    "Connection": "keep-alive",
    "Accept": "application/json, text/plain, */*",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
    "Sec-Fetch-Site": "same-origin",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Dest": "empty",
    "Referer": "https://index.baidu.com/v2/main/index.html",
    "Accept-Language": "zh-CN,zh;q=0.9",
    'Cookie': cookie,  # cookie: your own Baidu cookie string copied after logging in
}
words = '[[{"name":"python","wordType":1}],[{"name":"java","wordType":1}]]'
start, end = "2022-06-28", "2022-07-27"
url = f'https://index.baidu.com/api/SearchApi/index?area=0&word={words}&startDate={start}&endDate={end}'
res = requests.get(url, headers=headers)
res.json()
{'status': 10018,
 'data': '',
 'logid': 2631899650,
 'message': 'Hello! Baidu Index has detected suspected xx access behaviour. If you have no such behaviour, it may be because you are using a public network or visiting too frequently.\nYou can contact us via email [email protected]'}
Baidu Index no longer returns the data; instead it reports abnormal access. A quick inspection shows that the request headers now need an extra Cipher-Text parameter. JS reverse-engineering experts can analyse the JavaScript directly so that this parameter is generated correctly and passes the verification.
Today, however, I will demonstrate a very simple and practical alternative for obtaining the Baidu Index: use seleniumwire to capture the data directly and decrypt it.
For an introduction to seleniumwire, please refer to my previous article: 《Using selenium with a proxy and accessing the developer tools Network via seleniumwire》
Automatic login to Baidu Index
Since it is troublesome to log in every time selenium operates the Baidu Index site, we can cache the cookies in a local file so that every new session logs in to Baidu automatically.
Code to save the cookies automatically:
from selenium import webdriver
import json
import time

browser = webdriver.Chrome()
browser.get("https://index.baidu.com/v2/index.html")
# Click the username area ("登录" / Sign in) to open the login dialog
browser.find_element_by_css_selector("span.username-text").click()
print("Waiting for login...")
while True:
    # After a successful login the element no longer shows "登录" (Sign in)
    if browser.find_element_by_css_selector("span.username-text").text != "登录":
        break
    else:
        time.sleep(3)
print("Logged in, saving the cookies for you...")
with open('cookie.txt', 'w', encoding='u8') as f:
    json.dump(browser.get_cookies(), f)
browser.close()
print("Cookies saved, the browser has exited automatically...")
After running the code above, the login page opens automatically and waits for you to log in manually; the cookies are then saved to a local file and the browser is closed.
From then on we can open Baidu Index as follows and be logged in automatically:
from seleniumwire import webdriver
import json

browser = webdriver.Chrome()
with open('cookie.txt', 'r', encoding='u8') as f:
    cookies = json.load(f)
# Open the site first so the cookies can be attached to the correct domain
browser.get('https://index.baidu.com/v2/index.html')
for cookie in cookies:
    browser.add_cookie(cookie)
# Reload the page, this time with the saved login cookies
browser.get('https://index.baidu.com/v2/index.html')
Reference: 《Five levels of extracting Google Chrome cookies》
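If you want to confirm that the cached cookies really logged you in, a minimal check could look like the sketch below. It assumes the page still exposes the same span.username-text element used in the login code above.

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

wait = WebDriverWait(browser, 30)
username = wait.until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, "span.username-text")))
# If the element still reads "登录" (Sign in), the cookies have expired
if username.text == "登录":
    print("Cookies expired, please log in again and re-save cookie.txt")
else:
    print("Logged in as:", username.text)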
Search and get data
Make the browser perform a search for a specific keyword, for example Python:
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(browser, 30)
edit = wait.until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, "#search-input-form > input.search-input")))
print("Number of captured requests before clearing:", len(browser.requests))
del browser.requests  # clear the captured request history
edit.send_keys(Keys.CONTROL + 'a')
edit.send_keys(Keys.DELETE)
edit.send_keys("Python")
submit = browser.find_element_by_css_selector("span.search-input-cancle")
submit.click()
print("Number of captured requests after clearing and searching:", len(browser.requests))
Number of captured requests before clearing: 87
Number of captured requests after clearing and searching: 3
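As an optional refinement (not part of the original flow), seleniumwire can be told to capture only the requests we care about, which keeps browser.requests small, and can wait explicitly until the response has arrived. The two URL patterns below are assumptions based on the API paths used later in this article.

# Only capture the two Baidu Index APIs we are interested in
browser.scopes = [
    '.*api/SearchApi/index.*',
    '.*Interface/ptbk.*',
]
# Block until the search API response has actually been captured
request = browser.wait_for_request('api/SearchApi/index', timeout=30)
print(request.url)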
Once the search has been performed, we can extract the data from the requests captured by the browser:
import gzip
import zlib
import brotli
import json

def auto_decompress(res):
    # Decompress the response body in place according to its Content-Encoding
    content_encoding = res.headers["content-encoding"]
    if content_encoding == "gzip":
        res.body = gzip.decompress(res.body)
    elif content_encoding == "deflate":
        res.body = zlib.decompress(res.body)
    elif content_encoding == "br":
        res.body = brotli.decompress(res.body)

def fetch_data(rule, encoding="u8", is_json=True):
    # Return the body of the most recent captured request whose URL contains `rule`
    result = ""
    for request in reversed(browser.requests):
        if rule in request.url:
            res = request.response
            auto_decompress(res)
            result = res.body.decode(encoding)
            if is_json:
                result = json.loads(result)
            break
    return result

def decrypt(ptbk, index_data):
    # The first half of ptbk maps character by character onto the second half
    n = len(ptbk) // 2
    a = dict(zip(ptbk[:n], ptbk[n:]))
    return "".join([a[s] for s in index_data])

ptbk = fetch_data("Interface/ptbk")['data']
data = fetch_data("api/SearchApi/index")['data']
for userIndexe in data['userIndexes']:
    name = userIndexe['word'][0]['name']
    index_data = userIndexe['all']['data']
    r = decrypt(ptbk, index_data)
    print(name, r)
python 21077,21093,21186,19643,14612,13961,21733,21411,21085,21284,18591,13211,12753,27225,20302,19772,20156,17647,12018,11745,19535,19300,20075,20136,18153,12956,12406,17098,16259,18707
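If you intend to analyse the values rather than just eyeball them, an optional extra step is to convert the decrypted string into integers; the guard below is a small sketch that also tolerates empty fields, should any appear in the raw data.

# Convert the decrypted comma-separated string into integers
values = [int(v) if v else 0 for v in r.split(",")]
print(len(values), "days, average index:", sum(values) / len(values))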
Comparing these results with the chart on the website confirms that the data was captured correctly, so we can indeed obtain Baidu Index data through seleniumwire. If you need a specific date range or a specific province, simply use selenium to simulate the corresponding manual query operations in the UI and then read the result from the browser's captured requests, as sketched below.
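The generic pattern could look like the following sketch. The interact() callback is hypothetical: fill it with whatever clicks select your province or date range in the UI; everything else reuses the helpers defined above.

def query_and_fetch(interact):
    del browser.requests                # drop previously captured requests
    interact()                          # simulate the manual query in the UI
    # Wait until Baidu Index has answered the new query
    browser.wait_for_request('api/SearchApi/index', timeout=30)
    browser.wait_for_request('Interface/ptbk', timeout=30)
    ptbk = fetch_data("Interface/ptbk")['data']
    data = fetch_data("api/SearchApi/index")['data']
    return {
        u['word'][0]['name']: decrypt(ptbk, u['all']['data'])
        for u in data['userIndexes']
    }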
For parsing multi-client data, refer to the code in the earlier article 《How to use Python to Download Baidu Index Data》.