When log collection comes up, Flume is probably the first tool that comes to mind. Let's start with how the official Flume user guide describes it: http://flume.apache.org/releases/content/1.7.0/FlumeUserGuide.html#overview

A Flume event is defined as a unit of data flow having a byte payload and an optional set of string attributes. A Flume agent is a (JVM) process that hosts the components through which events flow from an external source to the next destination (hop).
A Flume source consumes events delivered to it by an external source like a web server. The external source sends events to Flume in a format that is recognized by the target Flume source. For example, an Avro Flume source can be used to receive Avro events from Avro clients or other Flume agents in the flow that send events from an Avro sink. A similar flow can be defined using a Thrift Flume Source to receive events from a Thrift Sink or a Flume Thrift Rpc Client or Thrift clients written in any language generated from the Flume thrift protocol. When a Flume source receives an event, it stores it into one or more channels. The channel is a passive store that keeps the event until it’s consumed by a Flume sink. The file channel is one example – it is backed by the local filesystem. The sink removes the event from the channel and puts it into an external repository like HDFS (via Flume HDFS sink) or forwards it to the Flume source of the next Flume agent (next hop) in the flow. The source and sink within the given agent run asynchronously with the events staged in the channel.

1. Flume's basic architecture
Here is the architecture diagram from the documentation:

[Figure: Flume agent architecture (source → channel → sink)]
As the diagram shows, the central component is the Agent, which contains a source, a channel, and a sink.
Flume moves data through agent processes in units of events; a single agent instance contains the following three major components.
Source:
Faces the data upstream; it receives log data of various types (avro, exec, spooling directory, jms, etc.) and writes it into the channel.
Sink:
Faces the data downstream; it pulls data from the channel and delivers it to other components (avro, hdfs, logger, file, etc.).
Channel:
A data buffer introduced to reconcile the differing processing rates of the source and the sink; the main types are memory channel, file channel, and kafka channel.
Event: the basic unit of data transfer in Flume, and also the basic unit of a transaction.
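The wiring of these three components is declared in a properties file. A minimal sketch follows; the agent and component names (a1, r1, c1, k1) and the netcat/logger choices are illustrative assumptions, not taken from this post:

```properties
# Hypothetical minimal agent: netcat source -> memory channel -> logger sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listens on a TCP port, one event per line of input
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# Channel: in-memory buffer between source and sink
a1.channels.c1.type = memory

# Sink: logs each event (useful for smoke-testing a flow)
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

Such a file would typically be launched with `flume-ng agent --name a1 --conf-file <file>`; note that a source names its `channels` (plural, it can fan out) while a sink names a single `channel`.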

2. Flume transactions
Flume tries to keep data as safe as possible: both the source's push into the channel and the sink's pull from the channel are performed as transactions. Each hand-off inside an agent involves both transferring the data and removing it from the upstream side. For example, after the sink pulls a batch from the channel and delivers it downstream, it must delete that batch from the channel; this spans several atomic steps, so they are bound together in a transaction that can be rolled back if any step fails, preventing data loss. With a file channel, therefore, data is generally not lost.
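The take-then-commit-or-rollback pattern described above can be illustrated with a toy Python sketch. This models the idea only; it is not Flume's actual Transaction API, and all names here are invented for illustration:

```python
# Toy model of Flume's transactional hand-off between channel and sink:
# events taken inside a transaction are only forgotten on commit; on a
# downstream failure they are rolled back into the channel instead of lost.
from collections import deque

class Channel:
    def __init__(self):
        self.queue = deque()
        self.taken = []          # events taken in the current transaction

    def put(self, event):
        self.queue.append(event)

    def take(self):
        event = self.queue.popleft()
        self.taken.append(event)
        return event

    def commit(self):
        self.taken.clear()       # batch is safely downstream; forget it

    def rollback(self):
        # push un-committed events back to the front, preserving order
        while self.taken:
            self.queue.appendleft(self.taken.pop())

def drain(channel, batch_size, deliver):
    """Take up to batch_size events, try to deliver, commit or roll back."""
    batch = []
    while channel.queue and len(batch) < batch_size:
        batch.append(channel.take())
    try:
        deliver(batch)
        channel.commit()
    except Exception:
        channel.rollback()
```

If `deliver` raises, the batch reappears at the head of the channel in its original order, which is exactly why a durable (file-backed) channel plus transactions prevents loss on sink failures.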

3. Common Flume data-flow topologies
1. Setting multi-agent flow (chaining multiple agents)
[Figure: multi-agent chained flow]

2. Consolidation (aggregation)
[Figure: consolidation of many agents into one]

3. Multiplexing the flow (replication and multiplexing; a simple example is tailing a log file and fanning the events out to HDFS, Kafka, and the local filesystem)

[Figure: multiplexed flow fanning out to multiple channels]

4. Load balancing
[Figure: load-balanced sink group]

4. On channel selectors (if the type is not specified, it defaults to "replicating")
Flume commonly uses two channel selectors:
replicating (default): sends every event to all of the source's channels
multiplexing: routes each event to a specific channel according to user configuration
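A multiplexing selector is configured by naming a header and mapping its values to channels. The sketch below follows the pattern in the Flume user guide; the header name `state`, its values, and the channel names c1/c2/c3 are illustrative (the header would be set upstream, e.g. by an interceptor):

```properties
# Hypothetical multiplexing selector: route by the "state" header
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2
# events whose header matches no mapping go to the default channel
a1.sources.r1.selector.default = c3
```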

5. Flume monitoring tool: Ganglia

6. Flume tuning notes
1. The capacity parameter sets the maximum number of events the channel can hold.
2. The transactionCapacity parameter sets the maximum number of events the source can write to the channel per transaction, and the maximum number the sink can read from it per transaction.
transactionCapacity must be at least as large as the batchSize of the source and sink.
3. Choosing a channel type:
If the data is critical and must not be lost, use the file channel; if speed matters more than safety, use the memory channel. When the destination is Kafka, the kafka channel can greatly improve throughput, because it acts as both producer and consumer; as its configuration shows, it needs no sink component at all.
4. Within reason, increasing the number of sources and sinks to match the actual workload can raise Flume's throughput.
5. Custom sources, channels, and sinks can be implemented as needed.
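The sizing rule and the sink-less kafka channel from the notes above can be sketched in configuration. All names, sizes, and addresses below are illustrative assumptions:

```properties
# Illustrative sizing: batchSize <= transactionCapacity <= capacity
a1.channels.c1.type = file
a1.channels.c1.capacity = 1000000
a1.channels.c1.transactionCapacity = 10000
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.batchSize = 10000

# Kafka channel as the terminal store: it is both producer and consumer,
# so no sink section is needed for this channel
a1.channels.kc.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.kc.kafka.bootstrap.servers = broker1:9092
a1.channels.kc.kafka.topic = flume-events
```

Violating the sizing rule (a batchSize larger than transactionCapacity) causes the per-transaction put or take to fail, so it is worth checking these three numbers together.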

For a more detailed treatment, see this blogger's post: Flume整体架构总结
