Playwright in Practice: Scraping JS-Encrypted Data
2022-07-29 08:23:00 【Make Jun Huan】
Preface
Playwright is a powerful browser automation library with a Python API. With a single API it can automate Chromium, Firefox, WebKit and other mainstream browsers, and it supports both headless and headed modes. The automation Playwright provides is evergreen, capable, reliable and fast, and it runs on Linux, Mac and Windows.
One. Installing and using Playwright
1. Installation
- Playwright requires Python 3.7 or later, so check that your Python version meets this requirement first.
Install Playwright with the following commands:
# Install the playwright library
pip install playwright
# Install the browser binaries (this step is a little slow)
playwright install
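To confirm that the interpreter meets the version requirement stated above, a small check like the following can help. This is only a sketch: the helper name `python_is_supported` is made up for illustration and is not part of Playwright.

```python
import sys

# Minimum version stated in this article.
MIN_VERSION = (3, 7)


def python_is_supported(version_info=None, minimum=MIN_VERSION):
    """Return True when the (major, minor) version meets the minimum."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= minimum


print("Python version is new enough:", python_is_supported())
```

Passing an explicit tuple such as `python_is_supported((3, 6))` makes it easy to check a version other than the running one.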
2. Recording
- Playwright can produce a script without us writing a single line of code: we operate the browser by hand, it records our actions, and it then generates the code automatically.
Enter the following commands:
# Show the help for codegen
playwright codegen --help
# Start a Firefox browser and write the recorded actions to script.py
playwright codegen -o script.py -b firefox
Two. Case implementation
1. Approach
- Open the browser with Playwright and read the data after JavaScript has rendered it, so we never have to decrypt the JavaScript-encrypted data ourselves. As with Selenium, this simulates a person opening the browser to get the data.
2. Importing the libraries
The code is as follows (example):
from playwright.sync_api import sync_playwright
from lxml import etree
import pymongo
# pymongo has its own connection pool and automatic reconnection, but we still need to catch AutoReconnect and reissue the request.
from pymongo.errors import AutoReconnect
from retry import retry
# logging is used to output information
import logging
import time
# Start time
start = time.time()
# Log output format
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s: %(message)s')
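The `retry` import above is used later as a decorator that reissues a call when `AutoReconnect` is raised. To make the idea concrete, here is a minimal sketch of such a decorator in plain Python. It is not the `retry` package's own code: `retry_on`, `FakeReconnectError` and `flaky_write` are hypothetical names, and `FakeReconnectError` only stands in for `pymongo.errors.AutoReconnect`.

```python
import functools
import time


def retry_on(exception_type, tries=4, delay=1.0):
    """Call the wrapped function up to `tries` times when it raises exception_type."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except exception_type:
                    if attempt == tries:
                        raise  # out of attempts: let the error propagate
                    time.sleep(delay)
        return wrapper
    return decorator


class FakeReconnectError(Exception):
    """Placeholder for pymongo.errors.AutoReconnect (assumption for this sketch)."""


calls = {"count": 0}


@retry_on(FakeReconnectError, tries=4, delay=0)
def flaky_write():
    """Fail on the first two calls, then succeed."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise FakeReconnectError("connection dropped")
    return "saved"


result = flaky_write()
print(result, "after", calls["count"], "attempts")
```

Only the named exception type triggers another attempt; any other exception propagates immediately.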
3. Driving the browser to visit the page
- Open https://www.oklink.com/zh-cn/btc/tx-list?limit=100&pageNum=1 in the browser, get the page source, and yield it from a generator so the next step can consume it.
The code is as follows (example):
BTC_URL = 'https://www.oklink.com/zh-cn/btc/tx-list?limit=100&pageNum={pageNum}'
def run(playwright):
    # Launch the browser in headless mode
    browser = playwright.chromium.launch(headless=True)
    # Open a new page
    page = browser.new_page()
    # Register the response handler once; registering it inside the loop
    # would attach one more copy of the handler for every page visited
    page.on('response', on_response)
    for Num in range(1, 3):
        # Visit the URL
        page.goto(BTC_URL.format(pageNum=Num))
        # wait_for_load_state waits for the page to reach a given state;
        # the state passed here is networkidle, i.e. the network is idle
        page.wait_for_load_state('networkidle')
        # Yield the rendered page source to the caller
        yield page.content()
        # html = page.content()
        # Get_the_data(page.content())
        # print(html)
    browser.close()
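Note that `range(1, 3)` visits pages 1 and 2 only, because the end of a range is exclusive. The URL template can be tried on its own without launching a browser. In the sketch below, `page_urls` is a hypothetical helper added for illustration; only `BTC_URL` comes from the code above.

```python
BTC_URL = 'https://www.oklink.com/zh-cn/btc/tx-list?limit=100&pageNum={pageNum}'


def page_urls(first_page, last_page, template=BTC_URL):
    """Return the list-page URLs for first_page..last_page, both inclusive."""
    return [template.format(pageNum=n) for n in range(first_page, last_page + 1)]


urls = page_urls(1, 2)
for url in urls:
    print(url)
```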
4. Handling the response event
- The on_response function checks whether the response to a particular request ( /api/explorer/v1/btc/transactionsNoRestrict ) has status 200. If it does, we can read the data returned by that request, which is the encrypted data.
The code is as follows (example):
def on_response(response):
    try:
        # Filter the requests and check the status
        if '/api/explorer/v1/btc/transactionsNoRestrict' in response.url and response.status == 200:
            logging.info('got status code %s while scraping %s',
                         response.status, response.url)
            # data_set = response.json().get('data').get('hits')
            # for item in data_set:
            #     Transaction_hashing = item.get('hash')
            #     The_block= item.get('blockHeight')
            #     print(Transaction_hashing)
            #     print(The_block)
            # Return the data in JSON format
            return response.json()
        if '/api/explorer/v1/btc/transactionsNoRestrict' in response.url and response.status != 200:
            # If the status is not 200, log the response code and the URL
            logging.error('get invalid status code %s while scraping %s',
                          response.status, response.url)
    except Exception:
        # exc_info=True appends the exception information to the log message
        logging.error('error occurred while scraping %s',
                      response.url, exc_info=True)
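One caveat: a value returned from an event callback is not handed back to our own code, so the JSON returned by the handler is effectively dropped when it runs as a `response` listener. If the intercepted JSON is needed, one option is to have the handler append it to a list. The sketch below shows that pattern under stated assumptions: `make_collector` and `FakeResponse` are hypothetical names, and `FakeResponse` only imitates the `url`, `status` and `json()` members used above so the example runs without a browser.

```python
API_PATH = '/api/explorer/v1/btc/transactionsNoRestrict'


def make_collector(store, api_path=API_PATH):
    """Build a handler that appends matching JSON payloads to `store`."""
    def handler(response):
        if api_path in response.url and response.status == 200:
            store.append(response.json())
    return handler


class FakeResponse:
    """Minimal stand-in exposing url, status and json()."""

    def __init__(self, url, status, payload=None):
        self.url = url
        self.status = status
        self._payload = payload

    def json(self):
        return self._payload


captured = []
handler = make_collector(captured)
handler(FakeResponse('https://example.com' + API_PATH + '?t=1', 200, {'data': {'hits': []}}))
handler(FakeResponse('https://example.com' + API_PATH + '?t=2', 500))
handler(FakeResponse('https://example.com/other', 200, {'ignored': True}))
print(captured)
```

In the real script the handler would be registered with `page.on('response', handler)` and `captured` read after the page has finished loading.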
5. Extracting the data
- We already obtained the page source with run(), and the data we want is in that source, so we can extract it with XPath.
The code is as follows (example):
def Get_the_data(html):
    # Parse the page source
    selector = etree.HTML(html)
    # Select the table rows; [1:] skips the first row
    data_set = selector.xpath(
        '//*[@id="root"]/main/div/div[3]/div/div[2]/section/div/div/div/div/table/tbody/tr')[1:]
    for data in data_set:
        Transaction_hashing = data.xpath('td[1]/div/a/text()')[0]
        The_block = data.xpath('td[2]/a/text()')[0]
        Trading_hours = data.xpath('td[3]/div/span/text()')[0]
        The_input = data.xpath('td[4]/span/text()')[0]
        The_output = data.xpath('td[5]/span/text()')[0]
        quantity = data.xpath('td[6]/span/span/text()')[0]
        premium = data.xpath('td[7]/span/span/text()')[0]
        BTC_data = {
            ' Transaction hash ': Transaction_hashing,
            ' Block ': The_block,
            ' Trading hours ': Trading_hours,
            ' Input ': The_input,
            ' Output ': The_output,
            ' Number (BTC)': quantity,
            ' Service Charge (BTC)': premium
        }
        yield BTC_data
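Each `xpath(...)[0]` above raises `IndexError` when a cell is missing, for example if the page layout changes. A small guard avoids that. `first_or` is a hypothetical helper for illustration, not part of lxml; it works on any list, including the list of strings that a `text()` XPath query returns.

```python
def first_or(values, default=None):
    """Return the first item of `values` (stripped if it is a string), else `default`."""
    if not values:
        return default
    first = values[0]
    return first.strip() if isinstance(first, str) else first


print(first_or(['  0.0015 ']))
print(first_or([], default='N/A'))
```

With it, a line such as `The_block = first_or(data.xpath('td[2]/a/text()'))` would yield `None` instead of failing on an empty result.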
6. Saving data to MongoDB
The code is as follows (example):
# Assumption: the original never shows how `collection` is created; the
# connection string, database name and collection name here are placeholders.
client = pymongo.MongoClient('mongodb://localhost:27017')
collection = client['btc']['transactions']
@retry(AutoReconnect, tries=4, delay=1)
def save_data(data):
    """Save data to MongoDB.

    The first argument of update_one() is the query condition and the second
    is the fields to modify. upsert is a special kind of update: if no
    document matches the condition, a new document is created from the
    condition and the update; if a matching document is found, it is updated
    normally. This is convenient because the collection does not need to be
    seeded in advance, and the same code both creates and updates documents.
    """
    # Update the document if it exists, otherwise create a new one
    collection.update_one({
        # The transaction hash keeps each document unique
        ' Transaction hash ': data.get(' Transaction hash ')
    }, {
        '$set': data
    }, upsert=True)
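The upsert behaviour described in the docstring can be illustrated without a database. The sketch below is an in-memory stand-in, not pymongo: `upsert_one` is a hypothetical function that imitates only the `$set` plus `upsert=True` case on a plain list of dicts.

```python
def upsert_one(collection, query, update):
    """Apply a {'$set': ...} update to the first match, or insert a new document."""
    fields = update['$set']
    for doc in collection:
        if all(doc.get(key) == value for key, value in query.items()):
            doc.update(fields)
            return 'updated'
    # No match: build the new document from the query plus the update fields
    new_doc = dict(query)
    new_doc.update(fields)
    collection.append(new_doc)
    return 'inserted'


records = []
first = upsert_one(records, {'hash': 'abc'}, {'$set': {'hash': 'abc', 'block': 1}})
second = upsert_one(records, {'hash': 'abc'}, {'$set': {'hash': 'abc', 'block': 2}})
print(first, second, records)
```

Running the same record through twice leaves a single document, which is why re-scraping a page does not create duplicates.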

7. Calling the functions
The code is as follows (example):
with sync_playwright() as playwright:
    for html in run(playwright):
        for data in Get_the_data(html):
            logging.info('get detail data %s', data)
            logging.info('saving data to mongodb')
            save_data(data)
            logging.info('data saved successfully')
# End time
end = time.time()
print('Cost time: ', end - start)
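The nested loops work because both `run()` and `Get_the_data()` are generators: each page is parsed and saved before the next one is fetched. The control flow can be tried with stand-ins that need neither a browser nor a database. All names in this sketch (`fake_pages`, `fake_rows`, `saved`) are hypothetical.

```python
import time


def fake_pages():
    """Stand-in for run(): yields one HTML string per page."""
    for n in (1, 2):
        yield f'<page {n}>'


def fake_rows(html):
    """Stand-in for Get_the_data(): yields one record per table row."""
    for i in range(2):
        yield {'page': html, 'row': i}


saved = []
start = time.perf_counter()
for html in fake_pages():
    for row in fake_rows(html):
        saved.append(row)  # stand-in for save_data(row)
elapsed = time.perf_counter() - start
print(len(saved), 'records in', elapsed, 'seconds')
```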
8. Running the code

Summary
Compared with existing automated testing tools, Playwright has many advantages, such as:
- Cross-browser: it supports Chromium, Firefox and WebKit
- Cross-platform: it supports Linux, Mac and Windows
- It can record actions and generate code, which saves manual work
At present, its drawback is that the ecosystem and documentation are not yet very complete.