Selenium in practice: crawling JS-encrypted data
2022-07-29 08:23:00 【Make Jun Huan】
Preface
Selenium is a tool for testing web applications. Selenium tests run directly in the browser, just as a real user would operate it. Supported browsers include IE (7, 8, 9, 10, 11), Mozilla Firefox, Safari, Google Chrome, Opera, and Edge. The tool has two main uses. The first is browser compatibility testing: checking that an application works well across different browsers and operating systems. The second is functional testing: creating regression tests that verify software functionality and user requirements. It also supports recording actions automatically and generating test scripts in languages such as .NET, Java, and Perl.
Note: the main body of the article follows; the examples below are for reference.
I. Selenium
1. What it does
Under the hood, the framework uses JavaScript to simulate a real user operating the browser. When a test script runs, the browser performs clicks, input, page opens, verification, and other operations according to the script code, just as a real user would, so the application is tested from the end user's perspective.
This makes automated browser compatibility testing possible, although subtle differences between browsers remain. It is easy to use, and test scripts can be written in Java, Python, and other languages. For crawling, data that has been encrypted by JavaScript normally has to be decrypted before it can be used, and decryption is rarely easy. If you use Selenium to drive the browser and load the page instead, you can read the result of the JavaScript rendering directly, without caring which encryption scheme the site uses.
2. Installing Selenium
- chromedriver download address:
http://chromedriver.storage.googleapis.com/index.html
- Check the version of your Chrome browser, then download the chromedriver with the same version.
3. Unzip the chromedriver package and copy chromedriver.exe into your Python installation directory, for example Python3.8 (use whichever version you have installed).
4. Then copy chromedriver.exe into the directory where the Chrome browser is installed.
Right-click the Chrome browser shortcut and choose "Open file location".
Copy chromedriver.exe into that directory.
5. Configure the environment variable: open This PC → System Properties → System Information → Advanced system settings → Environment Variables → System variables → double-click Path → Edit → New, paste the path of the Chrome browser directory, and remember to click OK in every dialog that follows.
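The installation steps above can be sanity-checked from Python before any crawling code is written. The following is a minimal sketch, not part of the original setup: the helper names are made up for illustration, the version strings in the usage example are placeholders, and comparing only the major version number is an assumption (the article simply says to download the same version).

```python
import shutil
from typing import Optional


def major_version(version: str) -> int:
    """Return the leading (major) number of a dotted version string."""
    return int(version.strip().split(".")[0])


def versions_match(chrome_version: str, driver_version: str) -> bool:
    """Return True when Chrome and chromedriver share the same major version."""
    return major_version(chrome_version) == major_version(driver_version)


def find_chromedriver() -> Optional[str]:
    """Return the full path of chromedriver if it is on PATH, otherwise None."""
    return shutil.which("chromedriver")


if __name__ == "__main__":
    # Placeholder version strings; substitute the ones reported on your machine.
    print(versions_match("103.0.5060.134", "103.0.5060.53"))
    print(find_chromedriver() or "chromedriver was not found on PATH")
```

If find_chromedriver() returns None, the Path entry from step 5 is most likely missing or points at the wrong directory.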
II. Usage steps
1. Import the libraries
- Make sure the Selenium library for Python is installed correctly.
pip install selenium
The code is as follows (example):
from selenium import webdriver
from selenium.webdriver import ChromeOptions
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
import time
from lxml import etree
from selenium.webdriver.common.keys import Keys
import pymongo
# pymongo has its own connection pool and automatic reconnection, but you still need to catch the AutoReconnect exception and reissue the request.
from pymongo.errors import AutoReconnect
from retry import retry
# logging is used to output information
import logging
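The comment on AutoReconnect above says the exception must be caught and the request reissued, which is what the imported retry decorator is used for later in the article. As a rough illustration of that behaviour, here is a plain-Python sketch; retry_on is a hypothetical stand-in written for this example and is not the implementation of the retry package.

```python
import functools
import time


def retry_on(exception_type, tries=4, delay=1.0):
    """Retry the decorated function when exception_type is raised.

    The function is called at most `tries` times, sleeping `delay` seconds
    between attempts. The last failure is re-raised to the caller.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except exception_type:
                    if attempt == tries:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator
```

Applied to a function that writes to the database, this means a temporary connection loss costs a short delay instead of a lost record.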
2. Anti-detection settings and headless mode
Without anti-detection settings, the crawler is easy to spot. In most cases the check works by testing whether the window.navigator object of the current browser window contains the webdriver property. In a normally used browser this property is undefined, but when Selenium is in use, Selenium sets the webdriver property on window.navigator. Many websites use JavaScript to test whether the webdriver property exists and block the visitor if it does.

CDP (the Chrome DevTools Protocol) can be used to solve this problem. It lets us execute JavaScript code as soon as each page starts loading. The CDP method to call is Page.addScriptToEvaluateOnNewDocument; pass it JavaScript that redefines navigator.webdriver, and the webdriver property is set to undefined before each page loads. In addition, we can add a few options to hide the WebDriver prompt bar and the automation extension information.

The code is as follows (example):
option = ChromeOptions()
# Enable headless mode
option.add_argument('--headless')
# Hide the automation prompt bar and disable the automation extension
option.add_experimental_option('excludeSwitches', ['enable-automation'])
option.add_experimental_option('useAutomationExtension', False)
browser = webdriver.Chrome(options=option)
# Redefine navigator.webdriver before each page loads
browser.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {
    'source': 'Object.defineProperty(navigator, "webdriver", {get: () => undefined})'
})
browser.get('https://www.endata.com.cn/BoxOffice/BO/Year/index.html')
# Explicit wait with a 10-second timeout
wait = WebDriverWait(browser, 10)
# Wait up to 10 seconds for the element matching the XPath to appear
wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="OptionDate"]')))
time.sleep(2)
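One way to confirm that the anti-detection script took effect is to read navigator.webdriver back from the page. The helper below is a small sketch with a made-up name; it assumes only that driver offers Selenium's execute_script method, as the browser object above does.

```python
def webdriver_flag_hidden(driver) -> bool:
    """Return True when navigator.webdriver is undefined or false in the page.

    `driver` is expected to be a Selenium WebDriver instance; a JavaScript
    `undefined` result comes back to Python as None.
    """
    value = driver.execute_script("return navigator.webdriver")
    return not value
```

Calling webdriver_flag_hidden(browser) after browser.get(...) should return True when the CDP script has been applied and False when the property is still exposed.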
3. Getting the data
- Use browser.page_source to get the rendered page source and pass it to the Get_the_data method, where the data can be extracted in the usual way.
The code is as follows (example):
# Log output format
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s: %(message)s')


def Get_the_data(html):
    # Parse the HTML source
    selector = etree.HTML(html)
    # Each tr is one row of the box office table
    data_set = selector.xpath('//*[@id="TableList"]/table/tbody/tr')
    for data in data_set:
        movie_name = data.xpath('td[2]/a/p/text()')[0]
        movie_type = data.xpath('td[3]/text()')[0]
        Total_box_office = data.xpath('td[4]/text()')[0]
        Average_ticket_price = data.xpath('td[5]/text()')[0]
        sessions = data.xpath('td[6]/text()')[0]
        country = data.xpath('td[7]/text()')[0]
        Release_date = data.xpath('td[8]/text()')[0]
        movie_data = {
            ' Film name ': movie_name,
            ' type ': movie_type,
            ' Total box office ( ten thousand )': Total_box_office,
            ' Average fare ': Average_ticket_price,
            ' Average person time ': sessions,
            ' Countries and regions ': country,
            ' Release date ': Release_date
        }
        logging.info('get detail data %s', movie_data)
        logging.info('saving data to mongodb')
        save_data(movie_data)
        logging.info('data saved successfully')
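Each field above is read with xpath(...)[0], which raises IndexError whenever a table cell is empty. A small guard such as the following sketch avoids that; first_text is a hypothetical helper introduced here, and returning an empty string by default is an assumption about how missing values should be stored.

```python
def first_text(values, default=""):
    """Return the first XPath text result with whitespace stripped.

    `values` is the list returned by an lxml xpath() call. If the list is
    empty, `default` is returned instead of raising IndexError.
    """
    if not values:
        return default
    return str(values[0]).strip()
```

Inside the loop, a line such as movie_type = first_text(data.xpath('td[3]/text()')) would then replace the direct [0] access.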
4. Turning pages
- The Perform_the_action method simulates the page-turning action: first click the drop-down arrow, then press the down arrow key on the keyboard, then press Enter.

The code is as follows (example):
def Perform_the_action():
    for i in range(1, 15):
        # Locate the drop-down box and click it
        action = browser.find_element(By.XPATH, '//*[@id="OptionDate"]')
        time.sleep(1)
        action.click()
        # Then use send_keys with Keys to send key presses
        time.sleep(1)
        # Press the down arrow
        action.send_keys(Keys.ARROW_DOWN)
        time.sleep(1)
        # Press Enter
        action.send_keys(Keys.ENTER)
        time.sleep(2)
        # Get the HTML source of the new page
        response = browser.page_source
        Get_the_data(response)
        # print(i)
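The fixed time.sleep(2) after pressing Enter assumes the new table always arrives within two seconds. As an alternative, the sketch below polls until a value changes; wait_for_change and its parameters are hypothetical names, and treating any change in the observed value as "the new page has loaded" is an assumption that may not hold for pages that update for other reasons.

```python
import time


def wait_for_change(read_value, old_value, timeout=10.0, poll=0.5):
    """Poll read_value() until its result differs from old_value.

    Returns the new value, or raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        current = read_value()
        if current != old_value:
            return current
        if time.monotonic() >= deadline:
            raise TimeoutError("value did not change within the timeout")
        time.sleep(poll)
```

For example, saving old_html = browser.page_source before the key presses and then calling wait_for_change(lambda: browser.page_source, old_html) would return the updated source as soon as it differs.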
5. Saving the data
- Save the data to a MongoDB database.

The code is as follows (example):
# MongoDB connection string (IP), database name, and collection
MONGO_CONNECTION_STRING = 'mongodb://192.168.27.101:27017'
client = pymongo.MongoClient(MONGO_CONNECTION_STRING)
db = client['movie_data']
collection = db['movie_data']


@retry(AutoReconnect, tries=4, delay=1)
def save_data(data):
    """
    Save data to MongoDB.

    Uses update_one() to modify a single document. The first argument is the
    query condition and the second is the fields to modify.
    upsert is a special kind of update: if no document matches the condition,
    a new document is created from the condition and the update; if a matching
    document is found, it is updated normally. upsert is convenient because the
    collection does not need to be prepared in advance, and the same code both
    creates and updates documents.
    """
    # Update the document if it exists; otherwise create a new one
    collection.update_one({
        # Use the film name as the unique key
        ' Film name ': data.get(' Film name ')
    }, {
        '$set': data
    }, upsert=True)
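To make the upsert behaviour described in the docstring concrete without a running database, the sketch below mimics it on a plain Python list. upsert_by_key is a stand-in written for this illustration and is not a pymongo API; the field names used with it are placeholders.

```python
def upsert_by_key(store, key_field, document):
    """Mimic update_one({key: value}, {'$set': document}, upsert=True).

    `store` is a list of dicts standing in for a collection. The first
    document whose key_field matches is updated in place; if none matches,
    a copy of `document` is appended. Returns 'updated' or 'inserted'.
    """
    key = document.get(key_field)
    for existing in store:
        if existing.get(key_field) == key:
            existing.update(document)
            return "updated"
    store.append(dict(document))
    return "inserted"
```

Running the crawler twice therefore does not duplicate rows: the second pass finds each film by its key and overwrites the stored fields.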
6. Calling the methods
The code is as follows (example):
if __name__ == '__main__':
    # Get the HTML source of the first page
    response = browser.page_source
    Get_the_data(response)
    Perform_the_action()
    # End the session and close the browser
    browser.quit()
Summary
- This section covered the general usage of Selenium. With Selenium, handling JavaScript-rendered pages is no longer difficult.