Data Lake: Flume, a massive log collection engine
2022-07-28 03:00:00 【YoungerChina】
Special topic: Data Lake series articles
1. Overview
Flume is a distributed, highly available, and highly reliable system for collecting, aggregating, and transporting massive volumes of log data. It supports custom data senders within a logging system for collecting data, and it can apply simple processing to the data before writing it to a variety of data receivers.
Flume's design is based on data flows. It can efficiently collect, aggregate, and move large amounts of log data from many different sources and finally store it in a centralized data store. Flume can push data in near real time and copes with data that arrives continuously and in very large volumes. For example, it can collect the logs generated on the web servers of a social networking site and store them in HDFS or in the HBase distributed database.
Flume official website: http://flume.apache.org/
Flume official documentation: http://flume.apache.org/FlumeUserGuide.html
2. Basic architecture

A Flume agent is first deployed on each data source; the agent is responsible for pulling data from that source.
An agent consists of three components: source, channel, and sink. The basic unit of data transfer in Flume is the event.
(1) source
Collects data from the data source and passes it to the channel. Sources support many collection methods, for example listening on a port, reading from a file, reading from a directory, or receiving data from an HTTP service.
(2) channel
Sits between the source and the sink and acts as temporary storage for data. In general, the rate at which data leaves the source differs from the rate at which the sink writes it out, so space is needed to hold data that has not yet been handed to the sink. The channel therefore works like a buffer, or a queue.
(3) sink
Takes data from the channel and writes it to the destination. Many destinations are supported, such as local files, HDFS, Kafka, or the source of the next Flume agent.
(4) event
The basic unit of transfer in Flume. It consists of two parts, headers and body: the headers carry optional header information, and the body carries the data.
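The relationship between the four pieces above can be sketched in a few lines of Python. This is only an illustration of the data flow, not Flume's API: the `Event` class, `run_source`, `run_sink`, the `origin` header, and the sample log lines are all made up for the example, and a standard-library queue stands in for the channel.

```python
import queue
from dataclasses import dataclass, field


@dataclass
class Event:
    """Illustrative stand-in for a Flume event: optional headers plus a byte body."""
    body: bytes
    headers: dict = field(default_factory=dict)


def run_source(lines, channel):
    """Source: wrap each raw log line in an Event and put it on the channel."""
    for line in lines:
        channel.put(Event(body=line.encode("utf-8"), headers={"origin": "demo"}))


def run_sink(channel, destination):
    """Sink: drain the channel and write each event body to the destination."""
    while not channel.empty():
        event = channel.get()
        destination.append(event.body.decode("utf-8"))


channel = queue.Queue(maxsize=100)  # the channel buffers events between source and sink
stored = []                         # hypothetical destination (a file, HDFS, Kafka, ...)

run_source(["GET /index.html 200", "GET /missing 404"], channel)
run_sink(channel, stored)
print(stored)
```

Because the source and the sink only ever touch the channel, either side can run faster or slower than the other without the two being coupled directly, which is the buffering role described in (2).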
3. Flume features
1) Reliability
When a node fails, logs can be delivered to other nodes without being lost. Flume offers three levels of reliability guarantee, from strongest to weakest:
(1) End-to-end: the agent that receives the data first writes the event to disk and deletes it only after the transfer succeeds; if delivery fails, the event can be resent.
(2) Store on failure: the strategy also used by Scribe. When the data receiver crashes, the data is written locally and sending resumes once the receiver recovers.
(3) Best effort: once the data has been sent to the receiver, no acknowledgement is made.
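The end-to-end level is the easiest one to make concrete. The sketch below is a simplified model of "write to disk, delete on success, resend on failure", not Flume's implementation: the function `deliver_end_to_end`, the `.evt` file naming, and the `send` callback (assumed to return True when the receiver acknowledges the event) are all hypothetical.

```python
import os
import tempfile


def deliver_end_to_end(event_id, payload, spool_dir, send):
    """Persist the event, try to send it, and delete the local copy only on success."""
    path = os.path.join(spool_dir, f"{event_id}.evt")
    if not os.path.exists(path):
        with open(path, "wb") as f:   # 1. write the event to disk before sending
            f.write(payload)
    with open(path, "rb") as f:
        data = f.read()
    if send(data):                    # 2. acknowledged: the local copy can go
        os.remove(path)
        return True
    return False                      # 3. not acknowledged: keep the file for a retry


received = []
attempts = {"count": 0}


def flaky_send(data):
    """Fail on the first call, succeed afterwards, to simulate a receiver outage."""
    attempts["count"] += 1
    if attempts["count"] == 1:
        return False
    received.append(data)
    return True


with tempfile.TemporaryDirectory() as spool:
    first = deliver_end_to_end("e1", b"login ok", spool, flaky_send)
    pending_after_failure = os.listdir(spool)   # the event is still on disk
    second = deliver_end_to_end("e1", b"login ok", spool, flaky_send)
    pending_after_success = os.listdir(spool)   # the event has been removed

print(first, pending_after_failure, second, pending_after_success, received)
```

Best effort would correspond to calling `send` once and ignoring its result, which is why it is the weakest of the three levels.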
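2) Scalability
Flume uses a three-tier architecture made up of agent, collector, and storage, and each tier can be scaled horizontally. All agents and collectors are managed centrally by the master, which makes the system easy to monitor and maintain. More than one master is allowed (ZooKeeper is used for management and load balancing), which avoids a single point of failure. This tiered, master-managed design is that of the original Flume 0.9.x line (Flume OG); Flume 1.x (Flume NG) has no master or separate collector component and builds tiers by chaining agents.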
3) Manageability
(1) All agents and collectors are managed centrally by the master, which makes the system easy to maintain.
(2) When there are multiple masters, Flume uses ZooKeeper and gossip to keep dynamic configuration data consistent.
(3) On the master, users can check how each data source or data flow is running, and can configure and load data sources dynamically.
(4) Flume provides two ways to manage data flows: a web interface and shell script commands.
4) Functional extensibility
(1) Users can add their own agents, collectors, or storage as needed.
(2) In addition, Flume ships with many components, including a variety of agents (file, syslog, etc.), collectors, and storage (file, HDFS, etc.).
5) Rich documentation and an active community
Flume is a top-level Apache project and has become a standard part of the Hadoop ecosystem. Its documentation is fairly rich and its community fairly active, which makes it convenient to learn.
4. Other questions
Will Flume lose the data it collects?
By design, Flume does not normally lose data, because it has a well-developed internal transaction mechanism: the transfer from source to channel is transactional, and so is the transfer from channel to sink, so no data is lost in either step. Loss is possible in only two cases: the channel is a memoryChannel and the agent goes down, losing whatever was held in memory; or the channel's storage is full, so the source cannot write any more and the unwritten data is lost. Although Flume does not normally lose data, it may duplicate it. For example, if the sink has sent the data successfully but receives no response, it sends the data again, which can produce duplicates.
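Both effects described here, rejected writes on a full channel and duplicates after a lost response, can be reproduced with a toy model. The class below is a sketch written for this article, not Flume's memoryChannel: `MemoryChannelSketch`, its methods, the `drain` helper, and the scripted acknowledgement sequence are all assumptions made for illustration.

```python
from collections import deque


class MemoryChannelSketch:
    """Bounded in-memory buffer with a take/commit/rollback cycle (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()
        self.in_flight = []

    def put(self, event):
        """Source side: reject the event when the channel is full."""
        if len(self.events) + len(self.in_flight) >= self.capacity:
            return False
        self.events.append(event)
        return True

    def take(self):
        """Sink side: hand out an event but remember it until commit or rollback."""
        event = self.events.popleft()
        self.in_flight.append(event)
        return event

    def commit(self):
        """The receiver acknowledged the events, so they can be forgotten."""
        self.in_flight.clear()

    def rollback(self):
        """Return uncommitted events to the front of the queue in their original order."""
        self.events.extendleft(reversed(self.in_flight))
        self.in_flight.clear()


def drain(channel, destination, acks):
    """Sink loop: write each event, then commit or roll back depending on the ack."""
    while channel.events:
        event = channel.take()
        destination.append(event)   # the write itself succeeds
        if next(acks):
            channel.commit()
        else:
            channel.rollback()      # no response received: the event is sent again


channel = MemoryChannelSketch(capacity=2)
accepted = [channel.put(e) for e in ["e1", "e2", "e3"]]  # the third put is rejected

destination = []
drain(channel, destination, iter([False, True, True]))   # the first response is lost
print(accepted, destination)
```

In this run the third event never enters the channel, and `e1` reaches the destination twice. Downstream consumers that need exactly-once results therefore have to deduplicate on their own, for instance by a unique event ID.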