Node crawler: using puppeteer
2022-06-12 11:40:00 【Jioho_】
Using puppeteer as a node crawler
puppeteer is a node crawler framework built on Chromium. Its strength is that it has all the capabilities of a real browser, all controllable from Node.js, which makes it a perfect fit for the crawling we want to do (complete code attached ~).

Installing puppeteer also downloads Chromium alongside it. If your network is slow, just install with cnpm instead.

The project also publishes a second package; the original wording is:
Since version 1.7.0 we publish the puppeteer-core package, a version of Puppeteer that doesn't download any browser by default.
Roughly, puppeteer-core gives you the core puppeteer API without downloading a bundled browser; you point it at a browser you already have installed. I haven't used it this time; I'll go into detail next time I do ~
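Since I haven't tried it, here is only a rough sketch of how puppeteer-core would be wired up; the executablePath below is an assumption and has to point at a browser binary already installed on your machine:

```javascript
// Hypothetical sketch: puppeteer-core reuses an existing browser install
// instead of downloading Chromium. The path below is an assumption -- adjust it.
const launchLocalBrowser = () => {
  const puppeteer = require('puppeteer-core')
  return puppeteer.launch({
    executablePath: '/usr/bin/google-chrome', // assumed location of your Chrome
    headless: true
  })
}
```

Everything after `launch` then works the same as with the full puppeteer package.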
Recording a real-world demo

A small node command-line tool. The idea: fetch the IP list and response times for a URL from https://ping.chinaz.com/, then ping each IP locally and pick the best node for that address.

Running effect:

PS: why write this myself instead of just using the website? Because an IP that looks fine on the website is not necessarily reachable from my local machine, and trying them one by one is impractical, hence this tool ~

Why puppeteer? The site offers no API, so the only way to get the data we need is to scrape it from the rendered page.
Read the documentation first

If the docs won't open, add `185.199.110.133 raw.githubusercontent.com` to your hosts file (the docs are loaded from GitHub), and tick Disable cache in the Network panel.

Launch the browser

The official getting-started demo:

```js
const puppeteer = require('puppeteer')
;(async () => {
  // create a browser instance
  const browser = await puppeteer.launch()
  // open a new page
  const page = await browser.newPage()
  // navigate the page to a URL
  await page.goto('https://www.google.com')
  // other actions...
  // finally, close the browser (otherwise the node process never exits)
  await browser.close()
})()
```
puppeteer.launch

Start with the launch documentation ~

Of launch's many options I only used one: headless, which controls whether the browser UI is shown. For the initial debugging I chose to show it:

```js
const browser = await puppeteer.launch({
  headless: false // show the browser UI while debugging
})
```
page.goto

```js
await page.goto('https://ping.chinaz.com/www.baidu.com', {
  timeout: 0, // no navigation timeout
  waitUntil: 'networkidle0'
})
```

page is the current browser tab; see the page.goto documentation.

Once https://ping.chinaz.com/ is opened with a URL appended, it automatically starts firing requests, so what I need to do is wait until those requests finish before reading the content.

goto takes a few options. The first is timeout: how many milliseconds to wait for the page to load before giving up; setting it to 0 disables the timeout and waits indefinitely. The second is waitUntil; I use networkidle0, meaning navigation is considered finished once there have been no network connections for at least 500 ms.
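A possible alternative to networkidle0 — sketched here as my own assumption, not what the tool actually does — is to wait for the specific results selector (#speedlist .listw, the same one queried later) to appear:

```javascript
// Sketch: wait for the results rows instead of network idle.
// The selector comes from this article; the 60s cap is an arbitrary choice.
const waitForResults = async page => {
  await page.goto('https://ping.chinaz.com/www.baidu.com', { timeout: 0 })
  await page.waitForSelector('#speedlist .listw', { timeout: 60000 })
}
```

This returns as soon as the table renders, even if other background requests are still in flight.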

Get elements on the page

In the console we usually query with the $ helper. puppeteer provides similar functions, split more finely.

You'll notice page.$ and page.$$: many of the APIs come in $ and $$ variants. The meaning matches the DevTools console: $ returns only the first matching element, while the $$ variants find every matching element and hand them back as an array.

page.$$(selector) and page.$$eval(selector, pageFunction[, ...args]) look similar; both query elements. The difference: page.$$ returns all matching element handles for you to operate on afterwards (for example, by running page.$eval against them), whereas page.$$eval runs your callback directly against the matched nodes inside the page.

So I used page.$$eval to grab every row of the table and split each row's innerText inside the callback:
```js
let list = await page.$$eval('#speedlist .listw', options =>
  options.map(option => {
    let [city, ip, ipaddress, responsetime, ttl] = option.innerText.split(/[\n]/g)
    return { city, ip, ipaddress, responsetime, ttl }
  })
)
```
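Note that the callback handed to page.$$eval is serialized and runs inside the page, so it can only work with plain DOM data. The row-splitting itself is ordinary string handling and can be sanity-checked outside the browser; the sample row below is fabricated for illustration:

```javascript
// Same parsing as the $$eval callback, applied to a made-up innerText sample.
const parseRow = text => {
  const [city, ip, ipaddress, responsetime, ttl] = text.split(/[\n]/g)
  return { city, ip, ipaddress, responsetime, ttl }
}

const row = parseRow('Beijing\n220.181.38.150\nChina Telecom\n32ms\n55')
// row.ip === '220.181.38.150', row.responsetime === '32ms'
```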
The rest is business logic

Implementing the ping

The ping part is especially simple: just bring in npm - ping.

ping.sys.probe(ipAddress, callback)

The wrapper below builds on that asynchronous call: pass in an IP and it measures this machine's ping time. It's wrapped in a Promise so it's convenient to use with async/await:
```js
const pingIp = ip =>
  new Promise((resolve, reject) => {
    let startTime = new Date().getTime()
    ping.sys.probe(ip, function (isAlive) {
      if (isAlive) {
        resolve({ ip: ip, time: new Date().getTime() - startTime })
      } else {
        reject({ ip: ip, time: -1 })
      }
    })
  })
```
Handling await rejections elegantly

If our Promise rejects, normally we can only handle it with try/catch, which is not very elegant. See my earlier article: [How to elegantly handle errors thrown by async](http://jioho.gitee.io/blog/JavaScript/%E5%A6%82%E4%BD%95%E4%BC%98%E9%9B%85%E5%A4%84%E7%90%86async%E6%8A%9B%E5%87%BA%E7%9A%84%E9%94%99%E8%AF%AF.html)

```js
const awaitWrap = promise => {
  return promise.then(data => [data, null]).catch(err => [null, err])
}
```
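In use, awaitWrap turns every await into a [data, err] pair, so a rejection becomes a truthy second element instead of a thrown exception. The promises below are stand-ins for pingIp:

```javascript
const awaitWrap = promise => {
  return promise.then(data => [data, null]).catch(err => [null, err])
}

// Demo: one resolving and one rejecting promise, no try/catch needed.
const demo = async () => {
  const [ok, err1] = await awaitWrap(Promise.resolve(42))
  const [none, err2] = await awaitWrap(Promise.reject(new Error('host down')))
  return { ok, err1, none, err2: err2.message }
}
```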
Adding a loading effect

The whole run is fairly long, with a lot of waiting, so a good loading hint really matters. I used npm - ora and added a spinner with a matching message at each step.
Complete project code

Dependencies:

| npm package | Purpose |
|---|---|
| ora | friendly loading spinner |
| puppeteer | the crawler framework |
| ping | run pings from node |
index.js

```js
const ora = require('ora')
let inputUrl = process.argv[2]
if (!inputUrl) {
  ora().warn('Please enter a URL to check')
  process.exit()
}
const puppeteer = require('puppeteer')
const { awaitWrap, pingIp, moveHttp } = require('./utils')
let url = moveHttp(inputUrl)
const baseUrl = 'https://ping.chinaz.com/'
;(async () => {
  const init = ora('Initializing the browser environment').start()
  const browser = await puppeteer.launch({
    headless: true // hide the browser UI
  })
  init.succeed('Initialization complete')
  const loading = ora(`Resolving ${url}`).start()
  const page = await browser.newPage() // open a new page
  await page.goto(baseUrl + url, {
    timeout: 0, // no navigation timeout
    waitUntil: 'networkidle0'
  })
  loading.stop()
  let list = await page.$$eval('#speedlist .listw', options =>
    options.map(option => {
      let [city, ip, ipaddress, responsetime, ttl] = option.innerText.split(/[\n]/g)
      return { city, ip, ipaddress, responsetime, ttl }
    })
  )
  if (list.length == 0) {
    ora().fail('Please enter a valid URL or IP')
    process.exit()
  }
  ora().succeed('Got the IP list, trying to connect to each IP')
  let ipObj = {}
  let success = []
  let failList = []
  let fast = Infinity
  let fastIp = ''
  for (let i = 0; i < list.length; i++) {
    let item = list[i]
    let time = parseInt(item.responsetime)
    if (!isNaN(time) && !ipObj[item.ip]) {
      const tryIp = ora(`Trying ${item.ip}`).start()
      let [res, error] = await awaitWrap(pingIp(item.ip))
      if (!error) {
        success.push(res.ip)
        if (res.time < fast) {
          fast = res.time
          fastIp = res.ip
        }
        tryIp.succeed(`${res.ip} connected, took ${res.time}ms`)
      } else {
        failList.push(error.ip)
        tryIp.fail(`${error.ip} failed to connect`)
      }
      ipObj[item.ip] = time
    }
  }
  if (success.length > 0) {
    ora().succeed(`Reachable: ${JSON.stringify(success)}`)
  }
  if (failList.length > 0) {
    ora().fail(`Unreachable: ${JSON.stringify(failList)}`)
  }
  if (fastIp) {
    ora().info(`Recommended node: ${fastIp}, time: ${fast}ms`)
    ora().info(`hosts entry: ${fastIp} ${url}`)
  }
  browser.close() // close the browser
})()
```
utils/index.js

```js
var ping = require('ping')
module.exports = {
  awaitWrap: promise => {
    return promise.then(data => [data, null]).catch(err => [null, err])
  },
  pingIp: ip =>
    new Promise((resolve, reject) => {
      let startTime = new Date().getTime()
      ping.sys.probe(ip, function (isAlive) {
        if (isAlive) {
          resolve({ ip: ip, time: new Date().getTime() - startTime })
        } else {
          reject({ ip: ip, time: -1 })
        }
      })
    }),
  moveHttp: val => {
    val = val.replace(/http(s)?:\/\//i, '')
    var temp = val.split('/')
    if (temp.length <= 2) {
      if (val[val.length - 1] == '/') {
        val = val.substring(0, val.length - 1)
      }
    }
    return val
  }
}
```
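For reference, moveHttp normalizes the user's input: it strips the protocol and, for bare hostnames, a trailing slash, but leaves deeper paths alone. A quick check of that behavior (the sample inputs are my own):

```javascript
// Same moveHttp as in utils/index.js, exercised with sample inputs.
const moveHttp = val => {
  val = val.replace(/http(s)?:\/\//i, '')
  var temp = val.split('/')
  if (temp.length <= 2) {
    if (val[val.length - 1] == '/') {
      val = val.substring(0, val.length - 1)
    }
  }
  return val
}

// moveHttp('https://www.baidu.com/') → 'www.baidu.com'
// moveHttp('http://example.com/a/b') → 'example.com/a/b'
```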
Summary

Looking back over the whole project, the key skills are launching the browser and learning to read the event options in the documentation. What's shown here is only waiting on the browser's network requests; there are many other facilities:

- callbacks that wait for a node to appear
- request interception
- callbacks that fire after a request finishes
- screenshots
- site performance analysis
- etc. …
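As one sketch of the request-interception item in the list above (the image filter is my own assumption for illustration, not something this tool does):

```javascript
// Sketch: block image requests on a puppeteer page via setRequestInterception.
const blockImages = async page => {
  await page.setRequestInterception(true)
  page.on('request', req => {
    if (req.resourceType() === 'image') {
      req.abort() // drop images to speed up scraping
    } else {
      req.continue()
    }
  })
}
```

Once interception is on, every request must be explicitly continued or aborted, or the page will hang.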
Overall, the arrival of puppeteer gives node many more possibilities. I hope the node ecosystem keeps getting better and better ~

Of course, when debugging a node program, especially one with this many node handles and prototype-chain methods, the terminal alone is clearly not enough. To debug your node scripts more comfortably, have a look at my comparison of node debugging tools.

Finally, here is the IP situation when checking the documentation's own URL (getIp is my folder; index.js is the same index.js code as above ~)


End ~