How to crawl web pages with Playwright?
2022-07-29 01:58:00 【Oxylabs Chinese station】

Playwright web scraping tutorial
In recent years, as the Internet industry has grown, so has its influence. Improved technology has played a part: more and more applications with a good user experience are being developed. In addition, automation is increasingly used throughout the life cycle of web applications, from development to testing, and web crawlers are becoming more widely used as well.
Efficient tools for testing web applications are critical. Libraries such as Playwright open a web application in a browser and interact with it, for example by clicking elements and typing text, and they can also extract public data from the web, which speeds up the whole process.
This tutorial explains what Playwright is and how to use it for automation and web scraping.
What is Playwright?
Playwright is a testing and automation framework that automates interaction with web browsers. In short, you can write code that opens a browser and then use code to do everything a user can do in it. Automated scripts can navigate to URLs, enter text, click buttons, extract text, and so on. Playwright's most notable feature is that it can work with multiple pages at the same time, without one page blocking or waiting on another.
Playwright supports most browsers, for example Google Chrome, Firefox, Microsoft Edge (which uses the Chromium engine), and Safari (which uses the WebKit engine). Cross-browser web automation is Playwright's strength: the same code runs on all of these browsers. In addition, Playwright supports several programming languages, such as Node.js, Python, Java, and .NET. You can write code to open a website and interact with it in any of these languages.
Playwright's documentation is detailed and wide-ranging. It covers all classes and methods, from beginner to advanced use.
Proxy support in Playwright
Playwright supports the use of proxies. The following Node.js and Python snippets show, step by step, how to use a proxy with Chromium. Both start from a plain launch:
Node.js:
const { chromium } = require('playwright');
const browser = await chromium.launch();
Python:
from playwright.async_api import async_playwright
import asyncio
# Inside an async function:
async with async_playwright() as pw:
    browser = await pw.chromium.launch()
The above code needs only a small change to use a proxy. In Node.js, the launch function accepts an optional parameter of type LaunchOptions. This LaunchOptions object can carry several other parameters, for example headless. The other parameter needed here is proxy, which is itself an object with properties such as server, username, and password. The first step is to create an object that specifies these parameters.
// Node.js
const launchOptions = {
    proxy: {
        server: '123.123.123.123:80'
    },
    headless: false
}
The second step is to pass this object to the launch function:
const browser = await chromium.launch(launchOptions);
In Python, the situation is slightly different. There is no need to create a LaunchOptions object. Instead, all values are passed as separate keyword arguments. The proxy dictionary is passed as follows:
# Python
proxy_to_use = {
    'server': '123.123.123.123:80'
}
browser = await pw.chromium.launch(proxy=proxy_to_use, headless=False)
When deciding which proxy to use for scraping, residential proxies are the better choice, because they do not leave a footprint and are unlikely to trigger security alarms. Oxylabs residential proxies form a stable proxy network with wide coverage. With them you can access sites from specific countries, provinces, and even cities. Most importantly, Oxylabs proxies integrate easily with Playwright.
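The proxy object described above has the properties server, username, and password. As a minimal sketch, assuming a hypothetical helper named build_proxy (it is not part of Playwright) and placeholder credentials, the dictionary passed to launch could be assembled like this:

```python
from typing import Dict, Optional


def build_proxy(server: str,
                username: Optional[str] = None,
                password: Optional[str] = None) -> Dict[str, str]:
    """Assemble a proxy dictionary with the keys the article lists.

    build_proxy is a hypothetical helper, not a Playwright function.
    """
    if not server:
        raise ValueError("a proxy server address is required")
    proxy = {"server": server}
    # Credentials are optional; only include the ones that were supplied.
    if username is not None:
        proxy["username"] = username
    if password is not None:
        proxy["password"] = password
    return proxy


# Placeholder address and credentials, for illustration only.
proxy_to_use = build_proxy("123.123.123.123:80", "USERNAME", "PASSWORD")
print(proxy_to_use)
```

The resulting dictionary can then be passed as launch(proxy=proxy_to_use, headless=False), in the same way as the hand-written dictionary in the Python snippet.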
01. Basic scraping with Playwright
Next, we show how to use Playwright from Node.js and Python.
If you are using Node.js, create a new project and install the Playwright library. These two commands do it:
npm init -y
npm install playwright
A basic script that opens a dynamic page looks like this:
const playwright = require('playwright');
(async () => {
    const browser = await playwright.chromium.launch({
        headless: false // Show the browser.
    });
    const page = await browser.newPage();
    await page.goto('https://books.toscrape.com/');
    await page.waitForTimeout(1000); // Wait for 1 second.
    await browser.close();
})();
Let's walk through the code. The first line imports Playwright. Then a Chromium instance is launched, which lets the script automate Chromium. Note that this script runs with a visible user interface, because headless: false is passed to launch. Next, a new browser page is opened, and page.goto navigates to the Books to Scrape web page. The script then waits for 1 second so that the page is visible to the end user. Finally, the browser is closed.
The same code is just as easy to write in Python. First, install Playwright with pip, then download the browser binaries that Playwright drives:
pip install playwright
playwright install
Note that Playwright supports two variants: synchronous and asynchronous. The following example uses the asynchronous API:
from playwright.async_api import async_playwright
import asyncio
async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(
            headless=False  # Show the browser
        )
        page = await browser.new_page()
        await page.goto('https://books.toscrape.com/')
        # Data extraction code goes here
        await page.wait_for_timeout(1000)  # Wait for 1 second
        await browser.close()
if __name__ == '__main__':
    asyncio.run(main())
This code is similar to the Node.js code. The biggest difference is the use of the asyncio library. Another difference is that function names change from camelCase to snake_case.
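Only the asynchronous API is shown here, so for comparison the following is a minimal sketch of the synchronous variant. The helper name get_page_title is an assumption made for illustration, and Playwright is imported inside main() only so that the helper can be read and tried without a browser installed.

```python
def get_page_title(browser, url):
    """Open url in a new page of browser and return the page title.

    get_page_title is an illustrative helper; browser is expected to be a
    Browser object from Playwright's synchronous API.
    """
    page = browser.new_page()
    try:
        page.goto(url)
        return page.title()
    finally:
        page.close()  # Always release the page, even if navigation fails.


def main():
    # Imported here so that the helper above can be used without
    # Playwright being importable at module load time.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=False)  # Show the browser
        print(get_page_title(browser, "https://books.toscrape.com/"))
        browser.close()
```

Calling main() runs the script. No await or asyncio is needed in this variant, which can be convenient for short scripts that handle one page at a time.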
If you want to create more than one browser context, or want finer control, you can create a context object and create multiple pages in that context. This code opens each page in a new tab:
const context = await browser.newContext();
const page1 = await context.newPage();
const page2 = await context.newPage();
You may also want to work with a page's context in your code. Use the page.context() function to get the browser context that a page belongs to.
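In Python, the same idea can be combined with asyncio to load several pages of one context concurrently. The following is a minimal sketch; fetch_title and fetch_titles are helper names chosen for illustration, and context is expected to be the object returned by browser.new_context() in the asynchronous API.

```python
import asyncio


async def fetch_title(context, url):
    """Open url in a new page (tab) of context and return its title."""
    page = await context.new_page()
    try:
        await page.goto(url)
        return await page.title()
    finally:
        await page.close()  # Close the tab whether or not loading succeeded.


async def fetch_titles(context, urls):
    """Load all urls concurrently, one page per URL, in the same context.

    Results are returned in the same order as urls.
    """
    return await asyncio.gather(*(fetch_title(context, url) for url in urls))
```

Inside an async function, after context = await browser.new_context(), a call such as await fetch_titles(context, list_of_urls) returns the titles as a list. Because the pages are awaited together, a slow page does not hold up the others.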
02. Locating elements
To extract information from an element or click it, the first step is to locate the element. Playwright supports both CSS and XPath selectors.
A practical example makes this clearer. Open the URL of the page to be scraped in Chrome, right-click the first book, and select Inspect to view its source.

You can see that each book sits in an article element, and this element has the class product_pod.
To select all books, you need to loop over all the article elements. They can be selected with the CSS selector:
.product_pod
Equivalently, you can use the XPath selector:
//*[@class="product_pod"]
To use these selectors, the most commonly used functions are:
● $eval(selector, function) – selects the first element, passes it to the function, and returns the function's result;
● $$eval(selector, function) – same as above, except that it selects all matching elements;
● querySelector(selector) – returns the first element (page.$ in Node.js, query_selector in Python);
● querySelectorAll(selector) – returns all elements (page.$$ in Node.js, query_selector_all in Python).
These methods work with both CSS and XPath selectors.
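One difference between the two selectors shown earlier is worth illustrating: .product_pod matches any element whose class list contains product_pod, while //*[@class="product_pod"] matches only elements whose class attribute is exactly that string. The sketch below uses Python's standard-library ElementTree rather than Playwright, and a made-up HTML fragment, so that it runs without a browser. The CSS side is emulated by a hypothetical helper, has_class.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment for illustration (not the real page source).
SNIPPET = """
<ol>
  <li><article class="product_pod"><h3>Book A</h3></article></li>
  <li><article class="product_pod featured"><h3>Book B</h3></article></li>
  <li><article class="sidebar"><h3>Not a book</h3></article></li>
</ol>
"""

root = ET.fromstring(SNIPPET)


def has_class(element, name):
    """Emulate the CSS class selector: match name anywhere in the class list."""
    return name in element.get("class", "").split()


# XPath with an exact attribute comparison, as in the article.
xpath_matches = root.findall(".//*[@class='product_pod']")

# CSS-style matching, which also accepts elements with extra classes.
css_like_matches = [el for el in root.iter() if has_class(el, "product_pod")]

print([el.find("h3").text for el in xpath_matches])
print([el.find("h3").text for el in css_like_matches])
```

In this fragment the XPath expression finds only Book A, while the CSS-style check finds Book A and Book B. The two selectors agree only when the class attribute contains nothing but product_pod.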
03. Scraping text
Continuing with the Books to Scrape page: after the page loads, you can use a selector and the $$eval function to extract all book containers.
const books = await page.$$eval('.product_pod', all_items => {
    // run a loop here
})
Then, inside the loop, you can pick out the elements that contain the book data:
all_items.forEach(book => {
    const name = book.querySelector('h3').innerText;
})
Finally, the innerText property extracts the text of each data point. Here is the complete Node.js code:
const playwright = require('playwright');
(async () => {
    const browser = await playwright.chromium.launch();
    const page = await browser.newPage();
    await page.goto('https://books.toscrape.com/');
    const books = await page.$$eval('.product_pod', all_items => {
        const data = [];
        all_items.forEach(book => {
            const name = book.querySelector('h3').innerText;
            const price = book.querySelector('.price_color').innerText;
            const stock = book.querySelector('.availability').innerText;
            data.push({ name, price, stock });
        });
        return data;
    });
    console.log(books);
    await browser.close();
})();
The Python code is a little different. Python has a function eval_on_selector, similar to $eval in Node.js, but it is not a good fit for this scenario, because its second argument still has to be JavaScript. Using JavaScript may be fine in some situations, but here it is more appropriate to write all the code in Python.
It is best to use query_selector and query_selector_all, which return a single element and a list of elements respectively.
from playwright.async_api import async_playwright
import asyncio
async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        await page.goto('https://books.toscrape.com')
        all_items = await page.query_selector_all('.product_pod')
        books = []
        for item in all_items:
            book = {}
            name_el = await item.query_selector('h3')
            book['name'] = await name_el.inner_text()
            price_el = await item.query_selector('.price_color')
            book['price'] = await price_el.inner_text()
            stock_el = await item.query_selector('.availability')
            book['stock'] = await stock_el.inner_text()
            books.append(book)
        print(books)
        await browser.close()
if __name__ == '__main__':
    asyncio.run(main())
Finally, the Node.js and Python code produce the same output.
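The list of dictionaries printed by either script contains raw strings. As a sketch of a possible next step, the hypothetical helpers below (clean_book and to_csv are names chosen for illustration) convert the price to a number and write the records as CSV. The sample records are invented stand-ins that assume prices formatted like '£12.34'; they are not actual output from the site.

```python
import csv
import io
import re


def clean_book(book):
    """Return a copy of book with a numeric price and trimmed text fields."""
    match = re.search(r"\d+(?:\.\d+)?", book["price"])
    if match is None:
        raise ValueError(f"no number found in price {book['price']!r}")
    return {
        "name": book["name"].strip(),
        "price": float(match.group()),
        "stock": book["stock"].strip(),
    }


def to_csv(books):
    """Serialize cleaned book records to a CSV string with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["name", "price", "stock"],
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(books)
    return buffer.getvalue()


# Invented sample records shaped like the scraper's output.
raw_books = [
    {"name": "Example Book One", "price": "£12.34", "stock": "  In stock "},
    {"name": "Example Book Two", "price": "£5.00", "stock": "In stock"},
]
cleaned = [clean_book(b) for b in raw_books]
print(to_csv(cleaned))
```

Keeping the cleaning step separate from the scraping step means it can be tested on sample data without launching a browser.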
Playwright vs Puppeteer and Selenium
Besides Playwright, you can also use Selenium and Puppeteer to scrape data.
With Puppeteer, the choice of browsers and programming languages is very limited. At present the only language you can use is JavaScript, and the only compatible browser is Chromium.
Selenium has good browser and language compatibility. However, it is slow and not very developer-friendly.
It is also worth noting that Playwright can intercept network requests. See the Playwright documentation for more details.
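As a minimal sketch of request interception with the Python asynchronous API: a handler registered through page.route receives each matching request and decides whether to let it through. The handler name and the choice of blocked resource types are assumptions made for illustration.

```python
# Resource types to skip; an illustrative choice, adjust as needed.
BLOCKED_RESOURCE_TYPES = {"image", "media", "font"}


async def block_heavy_resources(route):
    """Abort requests for heavy resources and let everything else continue."""
    if route.request.resource_type in BLOCKED_RESOURCE_TYPES:
        await route.abort()
    else:
        await route.continue_()
```

Registered with await page.route('**/*', block_heavy_resources) before calling page.goto, a handler like this can reduce load time when only the text of a page is needed.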
The table below compares the three tools:
Feature | Playwright | Puppeteer | Selenium |
Speed | Fast | Fast | Slower |
Documentation | Good | Good | Fair |
Developer experience | Best | Good | Fair |
Programming languages | JavaScript, Python, C# and Java | JavaScript | Java, Python, C#, Ruby, JavaScript and Kotlin |
Backed by | Microsoft | Google | Community and sponsors |
Community | Small and active | Large and active | Large and active |
Available browsers | Chromium, Firefox and WebKit | Chromium | Chrome, Firefox, IE, Edge, Opera, Safari and others |
Conclusion
This article discussed Playwright's ability, as a testing tool, to scrape dynamic sites, and presented code examples in Node.js and Python. Thanks to its asynchronous nature and cross-browser support, Playwright is a popular alternative to other tools.
Playwright can navigate to URLs, enter text, click buttons, extract text, and more. It can also extract dynamically rendered text. Other tools such as Puppeteer and Selenium can do these things too, but if you need to work with multiple browsers, or with a language other than JavaScript/Node.js, Playwright is the better choice.
If you are interested in similar topics, check out our articles on web scraping with Selenium or our Puppeteer tutorial. You can also visit our website at any time to contact customer service.