How to use a crawler to capture danmaku (bullet-screen comments) and comment data from Bilibili (Station B)?
2022-06-25 03:54:00 【Blockchain research】
Tool preparation:
Cloud Gathering crawler: http://www.yuncaix.com/
The goal:
As an example, we capture the data of this video:
https://www.bilibili.com/video/BV1q741167KA?from=search
We mainly capture the comments, the title, and the danmaku (bullet-screen comments), as shown in the picture:

Data modeling:
A preview of the whole data structure is as follows:

There are two key fields: danmaku and comments. We set their data type to "data grouping". The comments field, for example, is used to store multiple comments.

Crawling process
Designing the flow chart:
We are crawling only a single detail page, so we just drag in one detail page component, as shown in the picture:

Capture video information
The video information is stored in the page source, so we just need a way to extract it. However, it sits in a piece of JSON defined inside JS code. How do we extract it?

We drag in a 【Data processing】 component. As the name suggests, it processes data.

The rule is as follows:
window.__INITIAL_STATE__=[data];(function()
We use string matching to cut out the JSON in the middle. You can work this out by reading the page source, though it takes a sharp eye.
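For readers who want to see what this rule does outside the tool, here is a minimal Python sketch of the same string cut. The function name `extract_initial_state` and the tiny `sample_html` string are made up for illustration; only the two markers come from the rule above.

```python
import json

# The two markers from the rule: everything between them is the JSON we want.
START = "window.__INITIAL_STATE__="
END = ";(function()"


def extract_initial_state(html: str) -> dict:
    """Cut out the text between START and END and parse it as JSON."""
    begin = html.index(START) + len(START)
    finish = html.index(END, begin)
    return json.loads(html[begin:finish])


# Made-up page source, for demonstration only (not real Bilibili output).
sample_html = (
    '<script>window.__INITIAL_STATE__={"aid":85246284,'
    '"videoData":{"title":"demo title"}};(function(){})();</script>'
)

state = extract_initial_state(sample_html)
print(state["videoData"]["title"])
```

If either marker is missing, `str.index` raises `ValueError`, which is a reasonable signal that the page layout has changed.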
Then we click Debug and get a piece of JSON data:

Let's paste this data into a JSON parsing tool to see what's in it: https://json.cn/

From the picture, we can see what this JSON contains.
It includes the video data and the uploader's (UP master's) information, such as the follower count and play counts. It's all in there.
Let's drag in another component, a 【Data extractor】, to extract these fields, as shown in the picture:

For example, to get the title field:

Its JSON path is:
videoData.title
The rule is as follows:
Choose the JSON method.

Test result:

That gives us the video title. The follower count, the danmaku count, and so on are all obtained the same way. There are too many of these to describe one by one.
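The JSON-path lookup can be sketched in a few lines of Python. This is only an illustration of the idea, not how the tool implements it. The helper `get_by_path` is hypothetical, and the `stat` and `danmaku` keys in the sample data are made up; only `videoData.title` comes from the text above.

```python
from functools import reduce


def get_by_path(data: dict, path: str):
    """Follow a dotted path such as 'videoData.title' through nested dicts."""
    return reduce(lambda node, key: node[key], path.split("."), data)


# Made-up sample; only the videoData.title path is taken from the article.
state = {"videoData": {"title": "demo title", "stat": {"danmaku": 12}}}

title = get_by_path(state, "videoData.title")
count = get_by_path(state, "videoData.stat.danmaku")
print(title, count)
```

Other fields can be read the same way by swapping in their path.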
Let's move on to the second stage: grabbing the comments.
Grab comments
First we need to work out where the comments come from.

First, scroll down the page so that the comment data gets loaded, and analyze the requests in the Chrome browser. You can filter by the keyword reply. (It took me a long while to find this.)
That gives us this JS request address: https://api.bilibili.com/x/v2/reply?callback=jQuery172007409334557635194_1587209947092&jsonp=jsonp&pn=1&type=1&oid=85246284&sort=2&_=1587209958023
If we access it directly, it fails:

What to do?
Then I found that it is enough to remove the parameter callback=jQuery172007409334557635194_1587209947092&.
The timestamp parameter at the end (_=1587209958023) should also be removed.
So the address becomes: https://api.bilibili.com/x/v2/reply?jsonp=jsonp&pn=1&type=1&oid=85246284&sort=2 Look, the data came out.

You ask me why? I don't know either. Sometimes intuition is just like that. Interested readers can analyze the reason.
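Removing the two parameters by hand is fine for one address. As a rough sketch, the same cleanup can be done with Python's standard `urllib.parse`. The helper name `strip_params` is made up here. The parameter names `callback` and `_` are the ones visible in the address above (they look like a jQuery JSONP callback name and a cache-busting timestamp, but that reading is an assumption).

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_params(url: str, drop=("callback", "_")) -> str:
    """Return the URL with the named query parameters removed, order preserved."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in drop
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


raw = (
    "https://api.bilibili.com/x/v2/reply"
    "?callback=jQuery172007409334557635194_1587209947092"
    "&jsonp=jsonp&pn=1&type=1&oid=85246284&sort=2&_=1587209958023"
)
clean = strip_params(raw)
print(clean)
```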
Let's format the data:
The comments are in data.replies.

Now let's think about how to capture these comments.
First, we need to construct this address so that the crawler can fetch the comments from it. How do we construct it?
Let's go back and analyze the structure of the comment address:
https://api.bilibili.com/x/v2/reply?jsonp=jsonp&pn=1&type=1&oid=85246284&sort=2&_=1587209958023
The parameter pn=1 is the page number.
As for oid=85246284, where can we get this oid?
Searching the page source for 85246284 shows that this value appears in many places. Let's call it aid.

Once we have this aid, we can construct the comment address.
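Searching the source by eye works, but the same search can be run over the parsed JSON. Below is a small sketch that lists every path where a value occurs. The helper `find_paths` and the sample structure are made up for illustration, and the real page data may place the value elsewhere.

```python
def find_paths(node, target, prefix=""):
    """Yield the dotted path of every place where `target` occurs in nested JSON."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_paths(value, target, f"{prefix}.{key}" if prefix else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_paths(value, target, f"{prefix}[{index}]")
    elif node == target:
        yield prefix


# Made-up structure; only the number 85246284 comes from the article.
state = {"aid": 85246284, "videoData": {"aid": 85246284, "title": "demo"}}

paths = list(find_paths(state, 85246284))
print(paths)
```

Any of the reported paths can then be used as the JSON path for the aid variable.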
We need a 【Temporary data】 component. Variables stored in it temporarily can be used anywhere later.
Add a new variable named aid.

Now the flow chart is as follows:

Test result:

After getting this aid, how do we construct the address?
Let's introduce another 【Data processing】 component to turn the data into an address.

For the source we choose "none", and for the rule acquisition method we choose the template language.
https://api.bilibili.com/x/v2/reply?jsonp=jsonp&pn=1&type=1&oid={{{tempVar.aid}}}&sort=2
The aid variable we defined above can be embedded in the template with tempVar.aid.
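In plain Python, the template step amounts to filling the aid (and the page number pn) into the query string. The sketch below is only an equivalent of the idea, not the tool's template language. The function names `comment_url` and `comment_urls` are made up, and the fixed parameters are copied from the address above.

```python
from urllib.parse import urlencode

API = "https://api.bilibili.com/x/v2/reply"


def comment_url(aid: int, page: int = 1) -> str:
    """Build the comment address for one page; `aid` goes into the oid parameter."""
    query = {"jsonp": "jsonp", "pn": page, "type": 1, "oid": aid, "sort": 2}
    return f"{API}?{urlencode(query)}"


def comment_urls(aid: int, pages: int) -> list:
    """Build the addresses for pages 1..pages, which is what page turning needs."""
    return [comment_url(aid, page) for page in range(1, pages + 1)]


first = comment_url(85246284)
print(first)
print(comment_urls(85246284, 3)[-1])
```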
The test results are as follows:

We have the address. What's next?
We jump to the new comment page.
For this we need to introduce a jump component, as shown below:

For the acquisition rule we choose 【Original value】. As the name suggests, the original value uses the data from the upstream component unchanged.
To jump to the comments page, drag in a 【Details page】 component.

Now we need to extract the data we want from this page.
Put the comment address into the debugger, as shown in the picture:

There are multiple comments, so how do we extract them?
We need to introduce a 【Field loop area】 component.
The field bound here is a grouping field. Remember to select the comment field.

For data acquisition, simply fill in data.replies

The test results are as follows:

If this is not intuitive enough, click 【trace】 at the bottom to see more details:

Once we have this loop data, we can use a 【Data extractor】 to extract the values directly.

Fill in:
content.message
Why fill in this?
Go back to the JSON data structure we analyzed above:

Test result:

OK.
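The loop-plus-extractor pair corresponds to a simple loop in code: walk data.replies and read content.message from each entry. Here is a minimal sketch, using a made-up response that contains only the two paths named above (a real response carries many more fields).

```python
import json

# Made-up response body; only data.replies and content.message are from the article.
sample_response = json.loads("""
{"data": {"replies": [
  {"content": {"message": "first comment"}},
  {"content": {"message": "second comment"}}
]}}
""")


def extract_messages(response: dict) -> list:
    """Collect content.message from every entry under data.replies."""
    replies = response.get("data", {}).get("replies") or []
    return [reply["content"]["message"] for reply in replies]


messages = extract_messages(sample_response)
print(messages)
```

The `or []` guard is there so that a page with no replies yields an empty list instead of an error.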
Grab danmaku (barrage) data
Using the same analysis method as above, we find the AJAX request address for the danmaku:
https://api.bilibili.com/x/v1/dm/list.so?oid=145727684
All the danmaku data is in there.

How do we get it?
In the same way. It would take too long, so I won't go through it again.
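Since the article skips this step, here is a rough sketch of what the extraction could look like in Python. It rests on an assumption the article does not state: that the response is XML in which each danmaku is a `<d>` element whose text is the message. The sample below is made up to match that assumption, and the `p` attribute is left unparsed.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the ASSUMED format: one <d> element per danmaku.
sample_xml = """<i>
  <d p="12.5,1,25,16777215,1587209958,0,abc123,1">first danmaku</d>
  <d p="48.0,1,25,16777215,1587209960,0,def456,2">second danmaku</d>
</i>"""


def parse_danmaku(xml_text: str) -> list:
    """Return the text of every <d> element in document order."""
    root = ET.fromstring(xml_text)
    return [element.text or "" for element in root.iter("d")]


danmaku = parse_danmaku(sample_xml)
print(danmaku)
```

If the real response turns out to have a different shape, only `parse_danmaku` needs to change.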
The whole flow chart finally looks like this,
including the page-turning settings and so on.

This is a preview of the captured data:

Viewing the comment data separately:

The data can be exported easily:


In the end, we didn't have to write any code or use any advanced techniques, and the whole crawl was completed easily. Once you master this tool, you can put together a complex web crawler application in about ten minutes.
Interested users can use this product to capture other data, including Weibo, Zhihu, and so on.
Cloud Gathering crawler: http://www.yuncaix.com/