Data Lake: Flume, a massive log collection engine
2022-07-28 03:00:00 【YoungerChina】
Special topic: Data Lake series articles
1. Summary
Flume is a distributed, highly available, highly reliable system for collecting, aggregating, and transporting massive volumes of log data. It supports custom data senders in the logging system for collecting data, and it can apply simple processing to the data before writing it to a variety of data receivers.
Flume's design is based on data flows: it efficiently collects, aggregates, and moves massive log data from many different sources and finally stores it in a centralized data store. Flume can push data in near real time and copes with continuous, very large data volumes. For example, it can collect the huge volume of logs produced by the web servers of a social networking site and store them in HDFS or in HBase, a distributed database.
Flume official website: http://flume.apache.org/
Flume official documentation: http://flume.apache.org/FlumeUserGuide.html
2. Basic framework

First, a Flume agent is deployed on each data source; the agent is responsible for picking up the data.
An agent consists of three components: source, channel, and sink. The basic unit of data transfer in Flume is the event.
(1) source
Collects data from a data source and passes it to the channel. Sources support many collection methods, for example listening on a port, reading from a file, reading from a directory, or receiving data over HTTP.
(2) channel
Sits between the source and the sink and serves as temporary storage for the data. The rate at which the source produces data generally differs from the rate at which the sink drains it, so space is needed to hold data that has not yet been handed to the sink. The channel therefore acts like a buffer, or a queue.
(3) sink
Takes data from the channel and writes it to the destination. Many destinations are supported, such as local files, HDFS, Kafka, or the source of the next Flume agent.
(4) event
The basic unit of transmission in Flume. An event consists of headers and a body: the headers carry optional metadata, and the body carries the data itself.
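The relationship between the four pieces above can be sketched as a small toy model. This is only an illustration in Python, not Flume's actual Java API: the names `Event`, `Channel`, `run_source`, and `run_sink`, the `host` header, and the sample log lines are all made up for the example. It shows a source wrapping log lines in events, a bounded channel buffering them, and a slower sink draining them in batches.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    """Toy stand-in for a Flume event: optional headers plus a byte body."""
    body: bytes
    headers: dict = field(default_factory=dict)


class Channel:
    """Bounded in-memory buffer sitting between a source and a sink."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._queue = deque()

    def put(self, event):
        if len(self._queue) >= self.capacity:
            raise OverflowError("channel is full")
        self._queue.append(event)

    def take(self):
        # Return the oldest buffered event, or None when the channel is empty.
        return self._queue.popleft() if self._queue else None

    def __len__(self):
        return len(self._queue)


def run_source(lines, channel, host):
    """Source: wrap each raw log line in an Event and hand it to the channel."""
    for line in lines:
        channel.put(Event(body=line.encode("utf-8"), headers={"host": host}))


def run_sink(channel, destination, batch_size):
    """Sink: move up to batch_size events from the channel to the destination."""
    delivered = 0
    while delivered < batch_size:
        event = channel.take()
        if event is None:
            break
        destination.append(event)
        delivered += 1
    return delivered


channel = Channel(capacity=100)
stored = []  # stands in for HDFS, Kafka, a local file, and so on
run_source(
    ["GET /index.html 200", "GET /missing 404", "POST /login 302"],
    channel,
    host="web-01",
)
first_batch = run_sink(channel, stored, batch_size=2)   # the sink lags the source
backlog = len(channel)                                  # events still buffered
second_batch = run_sink(channel, stored, batch_size=2)  # the sink catches up
```

Because the sink takes at most two events per call, one event waits in the channel after the first batch, which is the buffering role described in (2).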
3. Flume characteristics
1) Reliability
When a node fails, logs can be delivered to other nodes without loss. Flume offers three levels of reliability guarantee, from strongest to weakest:
(1) End-to-end: the agent that receives the data first writes the event to disk and deletes it only after the data has been delivered successfully; if delivery fails, the event can be resent.
(2) Store on failure: the strategy also used by Scribe. When the data receiver crashes, the data is written locally, and sending resumes once the receiver has recovered.
(3) Best effort: once the data has been sent to the receiver, no confirmation is performed.
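The difference between the three levels can be illustrated with a rough Python sketch. It does not reproduce Flume's implementation: `FlakyReceiver`, `best_effort`, `store_on_failure`, `replay`, and `end_to_end` are hypothetical names, plain lists stand in for the local disk, and the receiver is assumed to fail a fixed number of times so that the run is deterministic.

```python
class FlakyReceiver:
    """Hypothetical receiver that rejects the first `failures` deliveries."""

    def __init__(self, failures):
        self.failures = failures
        self.received = []

    def deliver(self, event):
        if self.failures > 0:
            self.failures -= 1
            return False          # delivery failed, no acknowledgement
        self.received.append(event)
        return True               # delivery acknowledged


def best_effort(events, receiver):
    """Send each event once and ignore the outcome."""
    for event in events:
        receiver.deliver(event)


def store_on_failure(events, receiver, local_store):
    """Send each event once; keep the failed ones for a later replay."""
    for event in events:
        if not receiver.deliver(event):
            local_store.append(event)


def replay(local_store, receiver):
    """After the receiver recovers, resend what was stored locally."""
    pending = list(local_store)
    local_store.clear()
    store_on_failure(pending, receiver, local_store)


def end_to_end(events, receiver, journal, max_attempts=5):
    """Persist first, delete only after an acknowledgement, resend otherwise."""
    journal.extend(events)                    # stands in for the on-disk write
    for event in list(journal):
        for _ in range(max_attempts):
            if receiver.deliver(event):
                journal.remove(event)         # acknowledged, safe to delete
                break


events = ["e1", "e2", "e3"]

be_receiver = FlakyReceiver(failures=1)
best_effort(events, be_receiver)              # "e1" is silently lost

sof_receiver = FlakyReceiver(failures=1)
local_store = []
store_on_failure(events, sof_receiver, local_store)   # "e1" is kept locally
replay(local_store, sof_receiver)                     # "e1" arrives late

e2e_receiver = FlakyReceiver(failures=1)
journal = []
end_to_end(events, e2e_receiver, journal)     # "e1" is retried immediately
```

In this run best effort delivers only two of the three events, store on failure delivers all three but out of order, and end-to-end delivers all three in order and leaves the journal empty.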
2) Scalability
Flume uses a three-tier architecture of agent, collector, and storage, and each tier can be scaled horizontally. All agents and collectors are managed centrally by a master, which makes the system easy to monitor and maintain. Multiple masters are allowed (managed and load-balanced with ZooKeeper), which avoids a single point of failure. Note that this master-managed, three-tier layout is that of the original Flume OG (0.9.x) releases; Flume NG (1.x) dropped the master and collector roles in favor of the agent model described in section 2.
3) Manageability
(1) All agents and collectors are managed centrally by the master, which makes the system easy to maintain.
(2) With multiple masters, Flume uses ZooKeeper and gossip to keep dynamic configuration data consistent.
(3) On the master, users can check the execution status of each data source or data flow, and can configure and load data sources dynamically.
(4) Flume provides two ways to manage data flows: a web interface and shell script commands.
4) Functional extensibility
(1) Users can add their own agents, collectors, or storage as needed.
(2) In addition, Flume ships with many components, including various agents (file, syslog, etc.), collectors, and storage (file, HDFS, etc.).
5) Rich documentation and an active community
Flume is a top-level Apache project and has become a standard part of the Hadoop ecosystem. Its documentation is fairly rich and its community fairly active, which makes it convenient to learn.
4. Other questions
Will Flume lose the data it collects?
Given Flume's architecture, data loss is unlikely in normal operation. Flume has a thorough internal transaction mechanism: the transfer from Source to Channel is transactional, and so is the transfer from Channel to Sink, so no data is lost in these two links. Data can be lost in only two cases: the Channel is a memoryChannel and the agent goes down, losing whatever was held in memory; or the Channel's storage is full, so the Source can no longer write and the unwritten data is lost. Outside these cases Flume does not lose data, but it may duplicate it. For example, if the Sink sends data successfully but receives no response, it sends the data again, which may produce duplicates.
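The duplication case can be made concrete with a minimal sketch of a transactional take. It is written in Python for illustration and is not Flume's Java transaction API: `TransactionalChannel`, `sink_process`, and the `ack_received` flag are hypothetical, and the lost response is simulated by passing `False`.

```python
from collections import deque


class TransactionalChannel:
    """Toy channel whose takes only become permanent on commit."""

    def __init__(self, events):
        self._queue = deque(events)
        self._in_flight = []

    def take(self):
        event = self._queue.popleft()
        self._in_flight.append(event)
        return event

    def commit(self):
        # The taken events are gone for good.
        self._in_flight.clear()

    def rollback(self):
        # Put uncommitted events back at the head, in their original order.
        self._queue.extendleft(reversed(self._in_flight))
        self._in_flight.clear()

    def __len__(self):
        return len(self._queue)


def sink_process(channel, destination, ack_received):
    """Take one event, write it out, then commit or roll back."""
    event = channel.take()
    destination.append(event)          # the write itself succeeds
    if ack_received:
        channel.commit()
    else:
        channel.rollback()             # no response: the event stays queued


channel = TransactionalChannel(["a", "b"])
destination = []
sink_process(channel, destination, ack_received=False)  # "a" written, ack lost
sink_process(channel, destination, ack_received=True)   # "a" written again
sink_process(channel, destination, ack_received=True)   # "b" written once
```

The rollback keeps "a" from being lost, at the price of writing it twice. This is at-least-once delivery, so a destination that cannot tolerate duplicates needs its own deduplication.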