
How to crawl web pages with Playwright?

2022-07-29 01:58:00 【Oxylabs Chinese station】

Playwright web scraping tutorial

In recent years, as the Internet industry has grown, the influence of the Internet has risen steadily. Thanks to improving technology, more and more applications with a good user experience are being developed. Automation is also increasingly used throughout the life cycle of web applications, from development to testing, and web crawlers are being used more and more widely.

Having efficient tools to test web applications is critical. Libraries such as Playwright open a web application in a browser and interact with it, for example by clicking elements and typing text, and can also extract public data from the web, which speeds up the whole process.

This tutorial explains what Playwright is and how to use it for automation and even web scraping.

What is Playwright?

Playwright is a testing and automation framework that automates interaction with web browsers. In short, you can write code that opens a browser and does anything a user could do in it. Automated scripts can navigate to URLs, enter text, click buttons, extract text, and so on. Playwright's most striking feature is that it can work with multiple pages at the same time, without one page blocking or waiting for another.

Playwright supports most browsers, for example Google Chrome, Firefox, Microsoft Edge (which uses the Chromium engine), and Safari (which uses the WebKit engine). Cross-browser web automation is Playwright's strength: the same code can be run against all of these browsers. In addition, Playwright supports several programming languages, such as Node.js, Python, Java, and .NET. You can write code that opens a website and interacts with it in any of these languages.

Playwright's documentation is detailed and extensive. It covers all classes and methods, from beginner to advanced usage.
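As a rough illustration of the cross-browser point, the Python sketch below runs the same steps against all three engines. The helper names fetch_title and main are made up for this example, the target URL is the demo site used later in this tutorial, and the sketch assumes Playwright and its browsers are already installed (installation is covered in section 01).

```python
import asyncio

# Attribute names of the three engines on the Playwright object.
BROWSER_NAMES = ("chromium", "firefox", "webkit")


async def fetch_title(pw, browser_name, url):
    """Launch one engine by name, open the URL, and return the page title."""
    browser = await getattr(pw, browser_name).launch()
    try:
        page = await browser.new_page()
        await page.goto(url)
        return await page.title()
    finally:
        await browser.close()


async def main(url="https://books.toscrape.com/"):
    # Imported inside main() so the helper above does not depend on
    # Playwright being importable.
    from playwright.async_api import async_playwright

    async with async_playwright() as pw:
        for name in BROWSER_NAMES:
            print(name, await fetch_title(pw, name, url))

# To run the sketch: asyncio.run(main())
```

The only thing that changes between browsers is the attribute looked up on the Playwright object; the page-level calls stay the same.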

Proxy support in Playwright

Playwright supports the use of proxies. We will use the following Node.js and Python code snippets as a starting point to show, step by step, how to use a proxy with Chromium:

Node.js：

const { chromium } = require('playwright');
(async () => {
    const browser = await chromium.launch();
})();

Python：

import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()

asyncio.run(main())

The above code needs only a small modification to integrate a proxy. In Node.js, the launch function accepts an optional parameter of type LaunchOptions. The LaunchOptions object can carry several other parameters, for example headless. The other parameter needed here is proxy, which is itself an object with properties such as server, username, and password. The first step is to create an object that specifies these parameters.

// Node.js
const launchOptions = {
    proxy: {
        server: '123.123.123.123:80'
    },
    headless: false
};

The second step is to pass this object to the launch function:

const browser = await chromium.launch(launchOptions);

For Python, the situation is slightly different. There is no need to create a LaunchOptions object. Instead, all values can be passed as separate parameters. The following shows how to pass the proxy dictionary:

# Python
proxy_to_use = {
        'server': '123.123.123.123:80'
}
browser = await pw.chromium.launch(proxy=proxy_to_use, headless=False)
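The proxy dictionary can also carry the username and password properties mentioned above. A minimal Python sketch, assuming a proxy that requires authentication; build_proxy is a hypothetical helper written for this example, and the server address and credentials are placeholders:

```python
import asyncio


def build_proxy(server, username=None, password=None):
    """Return a proxy dictionary; credentials are included only if given."""
    proxy = {"server": server}
    if username is not None:
        proxy["username"] = username
    if password is not None:
        proxy["password"] = password
    return proxy


async def main():
    # Imported inside main() so build_proxy can be used on its own.
    from playwright.async_api import async_playwright

    # Placeholder address and credentials; replace them with real values.
    proxy = build_proxy("123.123.123.123:80", "USERNAME", "PASSWORD")
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(proxy=proxy, headless=False)
        page = await browser.new_page()
        await page.goto("https://books.toscrape.com/")
        await browser.close()

# To run the sketch: asyncio.run(main())
```

Keeping the dictionary construction in a small function makes it easy to switch between authenticated and unauthenticated proxies without touching the launch call.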

When deciding which proxy to use for scraping, it is best to use residential proxies, because they are less likely to be detected or to trigger security alerts. Oxylabs residential proxies are a stable proxy network with wide coverage. With Oxylabs residential proxies you can access sites in a specific country, province, or even city. Most importantly, Oxylabs proxies are easy to integrate with Playwright.

01. Basic scraping with Playwright

Next, we will show how to use Playwright with Node.js and Python.

If you are using Node.js, you need to create a new project and install the Playwright library. You can do this with two simple commands:

npm init -y
npm install playwright

A basic script for opening a dynamic page looks like this:

const playwright = require('playwright');
(async () => {
    const browser = await playwright.chromium.launch({
        headless: false // Show the browser.
    });

    const page = await browser.newPage();
    await page.goto('https://books.toscrape.com/');
    await page.waitForTimeout(1000); // Wait for 1 second.
    await browser.close();
})();

Let's walk through the code above. The first line imports Playwright. Then a Chromium instance is launched, which lets the script automate Chromium. Note that this script runs with a visible user interface, because headless: false is passed. Next, a new browser page is opened, and the page.goto function navigates to the Books to Scrape web page. The script then waits for 1 second so that the page is shown to the end user. Finally, the browser is closed.

The same script is just as simple to write in Python. First, install Playwright with pip, then download the browser binaries:

pip install playwright
playwright install

Note that Playwright supports two variants: synchronous and asynchronous. The following example uses the asynchronous API:

from playwright.async_api import async_playwright
import asyncio

async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(
            headless=False  # Show the browser
        )
        page = await browser.new_page()
        await page.goto('https://books.toscrape.com/')
        # Data Extraction Code Here
        await page.wait_for_timeout(1000)  # Wait for 1 second
        await browser.close()

if __name__ == '__main__':
    asyncio.run(main())

This code is similar to the Node.js code. The biggest difference is the use of the asyncio library. Another difference is that function names change from camelCase to snake_case.

If you want to create multiple browser contexts, or you want finer control, you can create a context object and create multiple pages in that context. The following code opens each page in a new tab:
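For comparison, here is a sketch of a similar script using the synchronous API, which needs neither asyncio nor await. The function name get_title is invented for this example.

```python
def get_title(browser, url):
    """Open the URL in a new page, read its title, and close the page."""
    page = browser.new_page()
    page.goto(url)
    title = page.title()
    page.close()
    return title


def main():
    # Imported inside main() so get_title does not depend on Playwright
    # being importable.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=False)  # Show the browser
        print(get_title(browser, "https://books.toscrape.com/"))
        browser.close()

# To run the sketch: main()
```

The synchronous variant is often easier to read in short scripts, while the asynchronous one fits better when several pages need to be handled at once.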

const context = await browser.newContext();
const page1 = await context.newPage();
const page2 = await context.newPage();

You may also want to work with a page's context in your code. You can use the page.context() function to get the browser context that a page belongs to.

02. Locating elements

To extract information from an element or click an element, the first step is to locate the element. Playwright supports both CSS and XPath selectors.

This is easier to understand with a practical example. Open the URL of the page to be scraped in Chrome, right-click the first book, and select Inspect.

You can see that every book is in an article element with the class product_pod.

To select all the books, you need to loop over all the article elements. The article elements can be selected with a CSS selector:
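A Python sketch of the same idea, assuming the async API: one context, several pages, loaded concurrently with asyncio.gather. The names fetch_title and fetch_titles are illustrative, and the second URL is assumed to be the next catalogue page of the demo site. Note that in Python, page.context is a property rather than a method call.

```python
import asyncio


async def fetch_title(context, url):
    """Open the URL in a new page of the given context; return its title."""
    page = await context.new_page()
    try:
        await page.goto(url)
        return await page.title()
    finally:
        await page.close()


async def fetch_titles(context, urls):
    """Load all URLs concurrently; results keep the order of urls."""
    return await asyncio.gather(*(fetch_title(context, u) for u in urls))


async def main():
    # Imported inside main() so the helpers above do not depend on
    # Playwright being importable.
    from playwright.async_api import async_playwright

    urls = [
        "https://books.toscrape.com/",
        "https://books.toscrape.com/catalogue/page-2.html",  # assumed URL
    ]
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        context = await browser.new_context()
        print(await fetch_titles(context, urls))
        await context.close()
        await browser.close()

# To run the sketch: asyncio.run(main())
```

Because every page comes from the same context, the pages share cookies and storage; creating a second context would give an isolated session.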

.product_pod

Alternatively, you can use an XPath selector:

//*[@class="product_pod"]

To use these selectors, the most commonly used functions are as follows:

●$eval(selector, function) – selects the first matching element, passes it to the function, and returns the function's result;

●$$eval(selector, function) – same as above, except that it selects all matching elements;

●querySelector(selector) – returns the first matching element (exposed as page.$ in Node.js and page.query_selector in Python);

●querySelectorAll(selector) – returns all matching elements (exposed as page.$$ in Node.js and page.query_selector_all in Python).

These methods work with both CSS and XPath selectors.

03. Scraping text

Continuing with the Books to Scrape page as an example: after the page loads, you can use a selector and the $$eval function to extract all the book containers.
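To make the selector discussion concrete, here is a Python sketch that passes the CSS selector and the XPath selector from above to the same method. The helper name count_matches is invented for this example, and the explicit xpath= prefix is used to make the selector type unambiguous.

```python
import asyncio

CSS_SELECTOR = ".product_pod"
XPATH_SELECTOR = 'xpath=//*[@class="product_pod"]'


async def count_matches(page, selector):
    """Return how many elements on the page match the selector."""
    return len(await page.query_selector_all(selector))


async def main():
    # Imported inside main() so count_matches does not depend on
    # Playwright being importable.
    from playwright.async_api import async_playwright

    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://books.toscrape.com/")
        # Both selectors target the same elements, so the counts should match.
        print(await count_matches(page, CSS_SELECTOR))
        print(await count_matches(page, XPATH_SELECTOR))
        await browser.close()

# To run the sketch: asyncio.run(main())
```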

const books = await page.$$eval('.product_pod', all_items => {
    // run a loop here
});

Then, inside the loop, you can extract the elements that contain the book data:

all_items.forEach(book => {
    const name = book.querySelector('h3').innerText;
});

Finally, the innerText property can be used to extract the data from each data point. Here is the complete code in Node.js:

const playwright = require('playwright');
(async () => {
    const browser = await playwright.chromium.launch();
    const page = await browser.newPage();
    await page.goto('https://books.toscrape.com/');
    const books = await page.$$eval('.product_pod', all_items => {
        const data = [];
        all_items.forEach(book => {
            const name = book.querySelector('h3').innerText;
            const price = book.querySelector('.price_color').innerText;
            const stock = book.querySelector('.availability').innerText;
            data.push({ name, price, stock });
        });
        return data;
    });
    console.log(books);
    await browser.close();
})();

The Python code is a little different. Python has a function eval_on_selector, which is similar to $eval in Node.js, but it is not well suited to this scenario, because the second parameter still has to be JavaScript. Using JavaScript may be fine in some situations, but in this case it makes more sense to write all of the code in Python.

It is better to use query_selector and query_selector_all, which return a single element and a list of elements, respectively.

from playwright.async_api import async_playwright
import asyncio

async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        await page.goto('https://books.toscrape.com')

        all_items = await page.query_selector_all('.product_pod')
        books = []
        for item in all_items:
            book = {}
            name_el = await item.query_selector('h3')
            book['name'] = await name_el.inner_text()
            price_el = await item.query_selector('.price_color')
            book['price'] = await price_el.inner_text()
            stock_el = await item.query_selector('.availability')
            book['stock'] = await stock_el.inner_text()
            books.append(book)
        print(books)
        await browser.close()

if __name__ == '__main__':
    asyncio.run(main())

In the end, the Node.js and Python code produce the same output.
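One detail worth knowing: on the demo site, the visible text of a book's h3 heading may be shortened with an ellipsis, while the link inside it appears to carry the full name in its title attribute. The Python sketch below reads that attribute and falls back to the visible text. The helper name full_title and the assumption about the page markup are ours, not part of the scripts above.

```python
import asyncio


async def full_title(item):
    """Return the link's title attribute, or its visible text as a fallback."""
    link = await item.query_selector("h3 a")
    if link is None:
        return None
    title = await link.get_attribute("title")
    return title or await link.inner_text()


async def main():
    # Imported inside main() so full_title does not depend on Playwright
    # being importable.
    from playwright.async_api import async_playwright

    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://books.toscrape.com/")
        for item in await page.query_selector_all(".product_pod"):
            print(await full_title(item))
        await browser.close()

# To run the sketch: asyncio.run(main())
```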

Playwright vs. Puppeteer and Selenium

For scraping data, you can also use Selenium and Puppeteer instead of Playwright.

With Puppeteer, the choice of browsers and programming languages is very limited. At present, the only language that can be used is JavaScript, and the only compatible browser is Chromium.

Selenium has good browser and language compatibility. However, it is slow and not very developer-friendly.

Another point worth mentioning is that Playwright can intercept network requests. See the Playwright documentation for more details.
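As a sketch of what request interception can look like in Python, the example below registers a route handler that aborts image and font requests and lets everything else through. The handler name handle_route and the choice of blocked resource types are assumptions made for this example.

```python
import asyncio

# Resource types to block; chosen only for illustration.
BLOCKED_TYPES = {"image", "font"}


async def handle_route(route):
    """Abort requests for blocked resource types; let the rest through."""
    if route.request.resource_type in BLOCKED_TYPES:
        await route.abort()
    else:
        await route.continue_()


async def main():
    # Imported inside main() so handle_route does not depend on Playwright
    # being importable.
    from playwright.async_api import async_playwright

    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        # Send every request made by this page through the handler.
        await page.route("**/*", handle_route)
        await page.goto("https://books.toscrape.com/")
        print(await page.title())
        await browser.close()

# To run the sketch: asyncio.run(main())
```

Skipping images and fonts can reduce bandwidth when only text is being scraped.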

Here is a comparison of the three tools:

Feature	Playwright	Puppeteer	Selenium
Speed	Fast	Fast	Slower
Documentation	Good	Good	Fair
Developer experience	Best	Good	Fair
Programming languages	JavaScript, Python, C#, and Java	JavaScript	Java, Python, C#, Ruby, JavaScript, and Kotlin
Backed by	Microsoft	Google	Community and sponsors
Community	Small but active	Large and active	Large and active
Supported browsers	Chromium, Firefox, and WebKit	Chromium	Chrome, Firefox, IE, Edge, Opera, Safari, and others

Conclusion

This article discussed Playwright's ability, as a testing tool, to scrape dynamic sites, and presented code examples in Node.js and Python. Thanks to its asynchronous nature and cross-browser support, Playwright is a popular alternative to other tools.

Playwright can navigate to URLs, enter text, click buttons, extract text, and more. It can also extract dynamically rendered text. These things can be done with other tools such as Puppeteer and Selenium as well, but if you need to work with multiple browsers, or with a language other than JavaScript/Node.js, Playwright is the better choice.

If you are interested in similar topics, see our articles on web scraping with Selenium or our Puppeteer tutorial. You can also visit our website at any time to get in touch with customer service.

Copyright notice
This article was created by [Oxylabs Chinese station]. Please include a link to the original article when reposting. Thank you.
https://yzsam.com/2022/196/202207130554309647.html