Connecting Selenium to a proxy server, and using selenium-wire to access the developer-tools Network data
2022-07-29 00:50:00 【Xiaoming - code entity】
In the article "Using MitmProxy to cache a 360-degree panoramic web page offline", I demonstrated how to set up MitmProxy, a proxy server written in Python.
That article, however, relied on visiting the pages by hand to fill the cache. To visit pages automatically and download the data through the proxy, we can drive the browser with selenium.
One way to connect selenium to a proxy server is browsermobproxy, which is written in Java; the corresponding files need to be downloaded from https://chromedevtools.github.io/devtools-protocol/tot/Network/ .
Reference: https://blog.csdn.net/u010741112/article/details/118674293
However, after trying it myself, I found that this Java-based proxy server does not handle character encodings accurately: garbled text appears often and is hard to recover.
To fix the garbled text, you have to add an interceptor written in Java that sets the encoding, for example forcing text to be decoded as GBK:
proxy = server.create_proxy()
# Set the charset according to the actual website
proxy.response_interceptor(''' if (contents.isText() ) { response.headers().set("Content-Type", "text/json;charset=GBK"); } ''')
Overall, I find it hard to use, and far less convenient than connecting selenium directly to MitmProxy.
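The garbling described above is what happens when bytes written in one encoding are decoded with another. Here is a minimal sketch in plain Python, independent of browsermobproxy; the sample string is chosen purely for illustration and is not data from a real site:

```python
# Illustrative sample text (an assumption, not taken from any real response).
text = "行情中心"
raw = text.encode("gbk")  # the bytes a GBK-encoded server would send

# Decoding with the wrong charset produces replacement characters.
wrong = raw.decode("utf-8", errors="replace")
# Decoding with the charset the server actually used restores the text.
right = raw.decode("gbk")

print(repr(wrong))
print(right)
```

This is why forcing the declared charset in the interceptor can help: the proxy then decodes the body with the encoding the site really uses.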
Example of selenium using a proxy server:
from selenium import webdriver
option = webdriver.ChromeOptions()
# Address of the running proxy server (MitmProxy listens on port 8080 by default)
option.add_argument('--proxy-server=127.0.0.1:8080')
browser = webdriver.Chrome(options=option)
browser.get('https://www.baidu.com/')
This way, the MitmProxy proxy server sees all the data requested by the selenium-controlled browser, which also decouples automation from data capture: the script loaded by mitmdump is solely responsible for intercepting and processing the data, while the selenium code is dedicated to automation.
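As a rough sketch of the data-capture half of that split, a script loaded by mitmdump could look like the following. The file name `save_responses.py`, the output directory `captured`, the URL prefix, and the helper `save_body` are all hypothetical choices for illustration; only the `response(flow)` hook and the `flow.request.pretty_url`, `flow.response.status_code` and `flow.response.content` attributes come from mitmproxy itself.

```python
# save_responses.py (hypothetical name); run with: mitmdump -s save_responses.py -p 8080
import hashlib
from pathlib import Path

OUT_DIR = Path("captured")  # hypothetical output directory
URL_PREFIX = "http://quotes.money.163.com/hs/service/"  # assumed filter


def save_body(url, body, out_dir=OUT_DIR):
    """Write body to a file named after the SHA-256 hash of the URL."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".bin"
    path = out_dir / name
    path.write_bytes(body)
    return path


def response(flow):
    """mitmproxy event hook, called once for each completed response."""
    url = flow.request.pretty_url
    if not url.startswith(URL_PREFIX) or flow.response.status_code != 200:
        return
    body = flow.response.content  # body with Content-Encoding already decoded
    if body:
        save_body(url, body)
```

With this script running, the selenium code only needs the `--proxy-server` option and never touches the response data itself.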
Today I want to introduce the seleniumwire library. For its complete usage, see: https://pypi.org/project/selenium-wire/
Installation:
pip install selenium-wire
Note: on Linux and Mac, openssl must also be installed in order to decode HTTPS traffic.
CentOS:
yum install openssl
Mac:
brew install openssl
This library gives access to essentially the entire request history of the controlled browser, and any of it can be saved as needed. By default, however, only the raw bytes are stored, so compressed data has to be decompressed by yourself.
Now let's demonstrate with the market center data from NetEase Finance. The complete code is as follows:
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from seleniumwire import webdriver
import json
import gzip
import time
import pandas as pd

browser = webdriver.Chrome()
browser.get('http://quotes.money.163.com/old/#query=dy019000&DataType=HS_RANK&sort=PERCENT&order=desc&count=24&page=0')
wait = WebDriverWait(browser, 30)
# Wait for the table component to load
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "table.ID_table")))


def fetch_data():
    # Scan the captured requests from newest to oldest
    for request in reversed(browser.requests):
        if not request.url.startswith("http://quotes.money.163.com/hs/service/diyrank.php"):
            continue
        res = request.response
        if not res or res.status_code != 200:
            continue
        if res.headers["content-encoding"] == "gzip":
            # gzip decompression
            res.body = gzip.decompress(res.body)
        return json.loads(res.body.decode('u8'))


def wait_loading():
    """Wait for the loading overlay to disappear."""
    for _ in range(150):
        # find_elements(By...) replaces find_elements_by_css_selector, which newer selenium releases removed
        if not browser.find_elements(By.CSS_SELECTOR, "div.loading-cover"):
            break
        time.sleep(0.2)


data = []
for i in range(100):
    wait_loading()  # Wait for loading to finish
    page = fetch_data()  # Get the data from the captured request history
    if page:
        print(i, len(page['list']), page['list'][0]['CODE'])  # Check the data
        data.extend(page['list'])  # Append the data
    a_tag = browser.find_element(
        By.CSS_SELECTOR, "div.ID_pages a:nth-last-of-type(1)")
    del browser.requests  # Clear the captured requests
    if a_tag.text != "下一页":  # link text meaning "Next page"
        break
    a_tag.click()
df = pd.DataFrame(data)
df

1048 rows × 25 columns
Explanation of the key code:
if res.headers["content-encoding"] == "gzip":
    res.body = gzip.decompress(res.body)
This code decompresses gzip-compressed response data. Other compression types need their own handling; a more general decompression snippet:
import gzip
import zlib
import brotli  # third-party package: pip install brotli

content_encoding = res.headers["content-encoding"]
if content_encoding == "gzip":
    res.body = gzip.decompress(res.body)
elif content_encoding == "deflate":
    res.body = zlib.decompress(res.body)
elif content_encoding == "br":
    res.body = brotli.decompress(res.body)
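The branching above can also be wrapped in a small helper that does not mutate the response object. The following is a sketch under a few assumptions: `decode_body` is a hypothetical name, a missing header is treated as uncompressed, and the fallback for raw deflate streams is an addition that the original snippet does not have.

```python
import gzip
import zlib

try:
    import brotli  # third-party; only needed for "br" responses
except ImportError:
    brotli = None


def decode_body(body, content_encoding):
    """Return body decompressed according to a Content-Encoding header value."""
    encoding = (content_encoding or "identity").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        try:
            return zlib.decompress(body)
        except zlib.error:
            # Assumption: some servers send a raw deflate stream without the zlib header
            return zlib.decompress(body, -zlib.MAX_WBITS)
    if encoding == "br":
        if brotli is None:
            raise RuntimeError("the brotli package is not installed")
        return brotli.decompress(body)
    return body  # identity or unknown encoding: return the bytes unchanged
```

In the scraping code it would be called as `decode_body(res.body, res.headers["content-encoding"])`. The selenium-wire documentation also describes a built-in `seleniumwire.utils.decode` helper, which may be preferable if your installed version provides it.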
The wait_loading function detects whether the data has finished loading. It checks every 0.2 seconds whether the loading component still exists; once it no longer exists, the data has finished loading.
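The same polling idea can be written as a reusable helper that is not tied to one selector. This is only a sketch: `wait_until` and its parameters are hypothetical names, and the 30-second default mirrors the 150 × 0.2 s budget of wait_loading.

```python
import time


def wait_until(condition, timeout=30.0, interval=0.2):
    """Poll condition() until it returns a truthy value or the timeout expires.

    Returns True if the condition became true, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

With the browser from the article it could be used as `wait_until(lambda: not browser.find_elements(By.CSS_SELECTOR, "div.loading-cover"))`. Unlike wait_loading, the return value tells the caller whether the wait actually succeeded or merely timed out.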
browser.requests stores all the requests captured while the browser was browsing, much like the Network panel of the browser's developer tools. reversed(browser.requests) iterates over them in reverse order, so that the newest data is checked first. This way, the correct data is more likely to be found even if the history is not cleared.
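To see why the reverse scan picks the newest usable entry, here is a small sketch that runs without a browser. The `SimpleNamespace` objects are stand-ins for selenium-wire request objects (only the `url`, `response` and `status_code` attributes are imitated), and the example.com URLs and the name `latest_request` are made up for illustration.

```python
from types import SimpleNamespace


def latest_request(requests, url_prefix):
    """Return the newest request matching url_prefix that has a 200 response."""
    for request in reversed(requests):
        if not request.url.startswith(url_prefix):
            continue
        res = request.response
        if res is None or res.status_code != 200:
            continue
        return request
    return None


ok = SimpleNamespace(status_code=200)
history = [  # oldest first, the same order as browser.requests
    SimpleNamespace(url="http://example.com/api?page=0", response=ok),
    SimpleNamespace(url="http://example.com/logo.png", response=ok),
    SimpleNamespace(url="http://example.com/api?page=1", response=ok),
    SimpleNamespace(url="http://example.com/api?page=2", response=None),  # no response yet
]

print(latest_request(history, "http://example.com/api").url)
# prints http://example.com/api?page=1
```

The request for page 2 is skipped because it has no response yet, so the scan falls back to the most recent completed one.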
del browser.requests clears the captured requests. Clearing the history before the next click means fewer entries have to be scanned afterwards, which makes the data lookup faster.