Connecting Selenium to a proxy server, and using selenium-wire to access the developer tools Network data
2022-07-29 00:50:00 【Xiaoming - code entity】
In my earlier article "Using MitmProxy to cache 360-degree panoramic web pages offline", I showed how to set up the Python proxy server MitmProxy.
Back then, however, the cached web data was collected by browsing purely by hand. To visit pages automatically while the proxy downloads the data, we can use selenium to drive the browser.
One way to connect selenium to a proxy server is browsermobproxy. It is developed in Java, and the corresponding files need to be downloaded from https://chromedevtools.github.io/devtools-protocol/tot/Network/ .
Reference: https://blog.csdn.net/u010741112/article/details/118674293
After trying it myself, however, I found that this Java-based proxy server does not handle character encodings accurately: garbled text appears often and is hard to recover.
To fix the garbled text, you have to register an interceptor script with the Java proxy that sets the encoding. For example, to have text encoded and decoded as GBK:
proxy = server.create_proxy()
# Set the charset to match the actual website
proxy.response_interceptor(''' if (contents.isText() ) { response.headers().set("Content-Type", "text/json;charset=GBK"); } ''')
Overall, I personally find it awkward to use, and far less convenient than connecting selenium directly to a MitmProxy proxy.
Example of selenium using a proxy server:
from selenium import webdriver
option = webdriver.ChromeOptions()
option.add_argument('--proxy-server=127.0.0.1:8080')  # address MitmProxy listens on
browser = webdriver.Chrome(options=option)
browser.get('https://www.baidu.com/')
This way, the MitmProxy proxy server receives all the data accessed by the selenium-controlled browser. It also decouples automation from data collection: the script loaded by mitmdump is solely responsible for intercepting and processing data, while the selenium code is dedicated to driving the browser.
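To make the division of labour concrete, here is a minimal sketch of what the mitmdump side could look like. It is not the script from the earlier article: the file name, the output folder, the URL prefix and the helper save_response are all assumptions made for illustration. It relies only on mitmproxy's `response(flow)` event hook and on `flow.request.pretty_url`, `flow.response.status_code` and `flow.response.content`.

```python
# save_responses.py: a hypothetical addon script, started with
#   mitmdump -s save_responses.py
import hashlib
from pathlib import Path

OUT_DIR = Path("captured")             # assumed output folder
URL_PREFIX = "https://www.baidu.com/"  # assumed filter; adjust to the target site


def save_response(flow, out_dir):
    """Write one matching response body to out_dir; return the path or None."""
    url = flow.request.pretty_url
    if flow.response.status_code != 200 or not url.startswith(URL_PREFIX):
        return None
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so the file name is safe on every file system
    path = out_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    path.write_bytes(flow.response.content or b"")
    return path


def response(flow):
    """Event hook: mitmproxy calls this for each completed response."""
    save_response(flow, OUT_DIR)
```

The selenium script shown above stays unchanged; it only needs to point the browser at the proxy address.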
Today I want to introduce the selenium-wire library (imported as seleniumwire). For its complete usage, see: https://pypi.org/project/selenium-wire/
Installation:
pip install selenium-wire
Note: on Linux and Mac, openssl must additionally be installed so that HTTPS data can be decoded.
CentOS:
yum install openssl
Mac:
brew install openssl
This library gives access to essentially the whole request history left behind by the controlled browser, and any of it can be saved. By default, however, only the raw bytes are stored, so compressed data has to be decompressed by yourself.
Let's demonstrate with the quotes center data of NetEase Finance. The complete code is as follows:
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from seleniumwire import webdriver
import json
import gzip
import time
import pandas as pd

browser = webdriver.Chrome()
browser.get('http://quotes.money.163.com/old/#query=dy019000&DataType=HS_RANK&sort=PERCENT&order=desc&count=24&page=0')
wait = WebDriverWait(browser, 30)
# Wait for the table component to load
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "table.ID_table")))


def fetch_data():
    "Return the parsed JSON of the most recent ranking response, or None"
    for request in reversed(browser.requests):
        if not request.url.startswith("http://quotes.money.163.com/hs/service/diyrank.php"):
            continue
        res = request.response
        if not res or res.status_code != 200:
            continue
        if res.headers["content-encoding"] == "gzip":
            # gzip decompression
            res.body = gzip.decompress(res.body)
        return json.loads(res.body.decode('utf-8'))


def wait_loading():
    "Wait for the loading component to disappear"
    for _ in range(150):
        # find_elements(By..., ...) replaces find_elements_by_css_selector,
        # which was removed in Selenium 4.3
        if not browser.find_elements(By.CSS_SELECTOR, "div.loading-cover"):
            break
        time.sleep(0.2)


data = []
for i in range(100):
    wait_loading()  # Wait for loading to finish
    page = fetch_data()  # Get the data from the captured request history
    if page:
        print(i, len(page['list']), page['list'][0]['CODE'])  # Sanity check
        data.extend(page['list'])  # Append this page's rows
    a_tag = browser.find_element(
        By.CSS_SELECTOR, "div.ID_pages a:nth-last-of-type(1)")
    del browser.requests  # Clear the captured request history
    if a_tag.text != "下一页":  # the pager's "next page" label
        break
    a_tag.click()
df = pd.DataFrame(data)
df

1048 rows × 25 columns
Explanation of the key code:
if res.headers["content-encoding"] == "gzip":
    res.body = gzip.decompress(res.body)
This code decompresses response data that was compressed with gzip. Other compression types need their own handling. A general-purpose decompression snippet:
import gzip
import zlib
import brotli  # third-party package: pip install brotli

content_encoding = res.headers["content-encoding"]
if content_encoding == "gzip":
    res.body = gzip.decompress(res.body)
elif content_encoding == "deflate":
    res.body = zlib.decompress(res.body)
elif content_encoding == "br":
    res.body = brotli.decompress(res.body)
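The same branching can be wrapped in a small function that returns the decoded bytes instead of overwriting res.body. The sketch below is only an illustration: decode_body is a hypothetical name, brotli is imported lazily so the function still works for gzip and deflate when that package is not installed, and the raw-deflate fallback is an addition to cover servers that send deflate data without the zlib wrapper.

```python
import gzip
import zlib


def decode_body(body, content_encoding):
    """Return body decompressed according to a Content-Encoding header value."""
    encoding = (content_encoding or "identity").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        try:
            return zlib.decompress(body)  # zlib-wrapped deflate (the usual case)
        except zlib.error:
            # Some servers send a raw deflate stream without the zlib header
            return zlib.decompress(body, -zlib.MAX_WBITS)
    if encoding == "br":
        import brotli  # third-party, only needed for brotli responses
        return brotli.decompress(body)
    return body  # "identity" or an unknown encoding: return the bytes unchanged


# Usage with a locally compressed payload
payload = b'{"list": []}'
print(decode_body(gzip.compress(payload), "gzip") == payload)
```

With selenium-wire this would be called as decode_body(res.body, res.headers["content-encoding"]).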
The wait_loading function detects whether the data has finished loading. It checks every 0.2 seconds whether the loading component still exists, for at most 150 checks (about 30 seconds). Once the component is gone, the loading process has ended.
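The polling idea does not depend on selenium and can be written as a reusable helper. The following is a minimal sketch; wait_until_gone and its parameters are hypothetical names, and the sleep function is injectable only so that the helper can be tried without really waiting.

```python
import time


def wait_until_gone(is_present, interval=0.2, max_checks=150, sleep=time.sleep):
    """Poll is_present() until it returns a falsy value.

    Returns True if the thing disappeared within max_checks polls,
    False if it was still present after the last poll.
    """
    for _ in range(max_checks):
        if not is_present():
            return True
        sleep(interval)
    return False


# Usage with a stand-in condition: "present" twice, then gone
states = iter([True, True, False])
print(wait_until_gone(lambda: next(states), sleep=lambda seconds: None))
```

In the scraper above, the condition would be lambda: browser.find_elements(By.CSS_SELECTOR, "div.loading-cover"). Unlike wait_loading, this version reports a timeout through its return value, so the caller can decide whether to continue.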
browser.requests holds every request the browser made during the session together with its response, much like the Network panel of the browser developer tools. reversed(browser.requests) iterates over it in reverse order, so the most recent data is checked first. This way, the right data is more likely to be picked up even if the history has not been cleared.
del browser.requests clears that captured history. Clearing it before clicking through to the next page keeps the list short, so looking up the data becomes faster.
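The effect of iterating in reverse can be tried without a browser. In this sketch the request objects are stand-ins built with SimpleNamespace that only mimic the url, response.status_code and response.body attributes used above; latest_ok_response, fake_request and the example.com URLs are hypothetical.

```python
from types import SimpleNamespace


def latest_ok_response(requests, url_prefix):
    """Return the newest response with status 200 whose URL matches, else None."""
    for request in reversed(requests):
        if not request.url.startswith(url_prefix):
            continue
        res = request.response
        if res is not None and res.status_code == 200:
            return res
    return None


def fake_request(url, status, body):
    """Build a stand-in for a captured request (illustration only)."""
    response = SimpleNamespace(status_code=status, body=body)
    return SimpleNamespace(url=url, response=response)


# Oldest first, like browser.requests
history = [
    fake_request("http://example.com/api/list?page=0", 200, b"page 0"),
    fake_request("http://example.com/static/app.js", 200, b"js"),
    fake_request("http://example.com/api/list?page=1", 200, b"page 1"),
    fake_request("http://example.com/api/list?page=2", 500, b""),
]
latest = latest_ok_response(history, "http://example.com/api/list")
print(latest.body)  # b'page 1'
```

The failed request for page 2 is skipped and the newest successful match is returned, which is why stale entries further back in the history do no harm.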