Obtaining Baidu Index data with seleniumwire
2022-07-29 00:50:00 【Xiaoming - code entity】
In 《How to use Python to Download Baidu Index Data》 I shared how to obtain Baidu Index data through its API, but this year Baidu Index added a new verification step. Take the following code, for example:
import requests
import json
from datetime import date, timedelta
headers = {
    "Connection": "keep-alive",
    "Accept": "application/json, text/plain, */*",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
    "Sec-Fetch-Site": "same-origin",
    "Sec-Fetch-Mode": "cors",
    "Sec-Fetch-Dest": "empty",
    "Referer": "https://index.baidu.com/v2/main/index.html",
    "Accept-Language": "zh-CN,zh;q=0.9",
    'Cookie': cookie,  # cookie: your own Baidu cookie string copied after logging in
}
words = '[[{"name":"python","wordType":1}],[{"name":"java","wordType":1}]]'
start, end = "2022-06-28", "2022-07-27"
url = f'https://index.baidu.com/api/SearchApi/index?area=0&word={words}&startDate={start}&endDate={end}'
res = requests.get(url, headers=headers)
res.json()
{'status': 10018,
 'data': '',
 'logid': 2631899650,
 'message': 'Hello! Baidu Index has detected suspected xx access behaviour. If you have no such behaviour, it may be because you are using a public network or visiting too frequently.\nYou can contact us via email [email protected]'}
Baidu Index no longer returns the data; instead it reports abnormal access. A quick inspection shows that the request headers now need an extra Cipher-Text parameter. JS reverse-engineering experts can analyse the JavaScript directly so that this parameter is generated correctly and passes the verification.
Today, however, I will demonstrate a very simple and practical alternative for obtaining the Baidu Index: use seleniumwire to capture the data directly and decrypt it.
For an introduction to seleniumwire, please refer to my previous article: 《Using selenium with a proxy and accessing the developer tools Network via seleniumwire》
Automatic login to Baidu Index
Since it is troublesome to log in every time selenium operates the Baidu Index site, we can cache the cookies in a local file so that every new session logs in to Baidu automatically.
Code to save the cookies automatically:
from selenium import webdriver
import json
import time

browser = webdriver.Chrome()
browser.get("https://index.baidu.com/v2/index.html")
# Click the username area ("登录" / Sign in) to open the login dialog
browser.find_element_by_css_selector("span.username-text").click()
print("Waiting for login...")
while True:
    # After a successful login the element no longer shows "登录" (Sign in)
    if browser.find_element_by_css_selector("span.username-text").text != "登录":
        break
    else:
        time.sleep(3)
print("Logged in, saving the cookies for you...")
with open('cookie.txt', 'w', encoding='u8') as f:
    json.dump(browser.get_cookies(), f)
browser.close()
print("Cookies saved, the browser has exited automatically...")
After running the code above, the login page opens automatically and waits for you to log in manually; the cookies are then saved to a local file and the browser is closed.
From then on we can open Baidu Index as follows and be logged in automatically:
from seleniumwire import webdriver
import json

browser = webdriver.Chrome()
with open('cookie.txt', 'r', encoding='u8') as f:
    cookies = json.load(f)
# Open the site first so the cookies can be attached to the correct domain
browser.get('https://index.baidu.com/v2/index.html')
for cookie in cookies:
    browser.add_cookie(cookie)
# Reload the page, this time with the saved login cookies
browser.get('https://index.baidu.com/v2/index.html')
Reference: 《Five levels of extracting Google Chrome cookies》
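If you want to confirm that the cached cookies really logged you in, a minimal check could look like the sketch below. It assumes the page still exposes the same span.username-text element used in the login code above.

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

wait = WebDriverWait(browser, 30)
username = wait.until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, "span.username-text")))
# If the element still reads "登录" (Sign in), the cookies have expired
if username.text == "登录":
    print("Cookies expired, please log in again and re-save cookie.txt")
else:
    print("Logged in as:", username.text)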
Search and get data
Make the browser perform a search for a specific keyword, for example Python:
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(browser, 30)
edit = wait.until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, "#search-input-form > input.search-input")))
print("Number of captured requests before clearing:", len(browser.requests))
del browser.requests  # clear the captured request history
edit.send_keys(Keys.CONTROL + 'a')
edit.send_keys(Keys.DELETE)
edit.send_keys("Python")
submit = browser.find_element_by_css_selector("span.search-input-cancle")
submit.click()
print("Number of captured requests after clearing and searching:", len(browser.requests))
Number of captured requests before clearing: 87
Number of captured requests after clearing and searching: 3
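As an optional refinement (not part of the original flow), seleniumwire can be told to capture only the requests we care about, which keeps browser.requests small, and can wait explicitly until the response has arrived. The two URL patterns below are assumptions based on the API paths used later in this article.

# Only capture the two Baidu Index APIs we are interested in
browser.scopes = [
    '.*api/SearchApi/index.*',
    '.*Interface/ptbk.*',
]
# Block until the search API response has actually been captured
request = browser.wait_for_request('api/SearchApi/index', timeout=30)
print(request.url)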
Once the search has been performed, we can extract the data from the requests captured by the browser:
import gzip
import zlib
import brotli
import json

def auto_decompress(res):
    # Decompress the response body in place according to its Content-Encoding
    content_encoding = res.headers["content-encoding"]
    if content_encoding == "gzip":
        res.body = gzip.decompress(res.body)
    elif content_encoding == "deflate":
        res.body = zlib.decompress(res.body)
    elif content_encoding == "br":
        res.body = brotli.decompress(res.body)

def fetch_data(rule, encoding="u8", is_json=True):
    # Return the body of the most recent captured request whose URL contains `rule`
    result = ""
    for request in reversed(browser.requests):
        if rule in request.url:
            res = request.response
            auto_decompress(res)
            result = res.body.decode(encoding)
            if is_json:
                result = json.loads(result)
            break
    return result

def decrypt(ptbk, index_data):
    # The first half of ptbk maps character by character onto the second half
    n = len(ptbk) // 2
    a = dict(zip(ptbk[:n], ptbk[n:]))
    return "".join([a[s] for s in index_data])

ptbk = fetch_data("Interface/ptbk")['data']
data = fetch_data("api/SearchApi/index")['data']
for userIndexe in data['userIndexes']:
    name = userIndexe['word'][0]['name']
    index_data = userIndexe['all']['data']
    r = decrypt(ptbk, index_data)
    print(name, r)
python 21077,21093,21186,19643,14612,13961,21733,21411,21085,21284,18591,13211,12753,27225,20302,19772,20156,17647,12018,11745,19535,19300,20075,20136,18153,12956,12406,17098,16259,18707
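If you intend to analyse the values rather than just eyeball them, an optional extra step is to convert the decrypted string into integers; the guard below is a small sketch that also tolerates empty fields, should any appear in the raw data.

# Convert the decrypted comma-separated string into integers
values = [int(v) if v else 0 for v in r.split(",")]
print(len(values), "days, average index:", sum(values) / len(values))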
Comparing these results with the chart on the website confirms that the data was captured correctly, so we can indeed obtain Baidu Index data through seleniumwire. If you need a specific date range or a specific province, simply use selenium to simulate the corresponding manual query operations in the UI and then read the result from the browser's captured requests, as sketched below.
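The generic pattern could look like the following sketch. The interact() callback is hypothetical: fill it with whatever clicks select your province or date range in the UI; everything else reuses the helpers defined above.

def query_and_fetch(interact):
    del browser.requests                # drop previously captured requests
    interact()                          # simulate the manual query in the UI
    # Wait until Baidu Index has answered the new query
    browser.wait_for_request('api/SearchApi/index', timeout=30)
    browser.wait_for_request('Interface/ptbk', timeout=30)
    ptbk = fetch_data("Interface/ptbk")['data']
    data = fetch_data("api/SearchApi/index")['data']
    return {
        u['word'][0]['name']: decrypt(ptbk, u['all']['data'])
        for u in data['userIndexes']
    }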
For parsing multi-client data, refer to the code in the earlier article 《How to use Python to Download Baidu Index Data》.