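By default, Cloud Insight's Kafka monitoring collects the performance metrics listed in the conf section of the configuration file shown below.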
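Visualizing Kafka performance usually means building your own operations system, for example by assembling a monitoring platform from open-source tools such as Zabbix. That tends to involve a great deal of work and a tedious debugging process.
Alerting, metric arithmetic, aggregating data across hosts, and visualizing custom metrics each require integrating yet another open-source tool, which adds further time and staffing costs.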
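Installing the Cloud Insight agent takes a single command, and Puppet support is provided for managing agents in bulk. Monitoring Kafka only requires enabling the Kafka configuration file, so the process is simple.
In addition, Cloud Insight collects and uploads data automatically, offers rich visualizations, and supports alerting over multiple channels, sparing you the trouble of building your own monitoring system.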
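Cloud Insight's data management features can aggregate, filter, and group Kafka performance metrics from the different hosts in a cluster.
With a simple metric query, you can quickly see the maximum, average, and minimum of Kafka performance metrics across different functional modules, regions, and network segments, which makes operations work simpler and more agile.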
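The grouping and aggregation described above can be illustrated with a minimal Python sketch. It is not Cloud Insight's implementation or query syntax; the host names, the region tag, and the sample values are all hypothetical, and the sketch only shows what "group by a tag, then take max, average, and min" means for one metric.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-host samples of a single Kafka metric.
# Host names, the "region" tag and the values are made up for illustration.
samples = [
    {"host": "kafka-01", "region": "beijing", "value": 120.0},
    {"host": "kafka-02", "region": "beijing", "value": 80.0},
    {"host": "kafka-03", "region": "shanghai", "value": 200.0},
]


def aggregate_by_tag(samples, tag):
    """Group samples by the given tag and return max, average and min per group."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[tag]].append(sample["value"])
    return {
        key: {"max": max(values), "avg": mean(values), "min": min(values)}
        for key, values in groups.items()
    }


summary = aggregate_by_tag(samples, "region")
print(summary)
```

Filtering would correspond to dropping samples before grouping, for example keeping only those whose tag matches a given region.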
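The OneAPM Cloud Insight Agent collects Kafka performance metrics through JMX.
Each instance can monitor at most 350 performance metrics, so you need to edit the configuration file as described below to choose the metrics you need.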
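As a rough way to reason about the 350-metric limit, the sketch below counts the attributes listed in a configuration and compares the total with the limit. It assumes that every listed attribute of an exactly named bean produces one metric, which may not hold if a bean pattern matches several beans. The conf structure is a hand-written Python copy of two include entries from the sample configuration, and the function name is hypothetical.

```python
# Limit quoted in the text above.
MAX_METRICS_PER_INSTANCE = 350

# Two `include` entries from the sample kafka.yaml, written out by hand as a
# Python structure (an illustration, not the output of a YAML parser).
conf = [
    {"include": {
        "domain": "kafka.server",
        "bean": "kafka.server:type=BrokerTopicMetrics,name=AllTopicsBytesOutPerSec",
        "attribute": {
            "MeanRate": {"metric_type": "counter", "alias": "kafka.net.bytes_out"},
        },
    }},
    {"include": {
        "domain": "kafka.network",
        "bean": "kafka.network:type=RequestMetrics,name=Produce-TotalTimeMs",
        "attribute": {
            "Mean": {"metric_type": "counter",
                     "alias": "kafka.request.produce.time.avg"},
            "99thPercentile": {"metric_type": "counter",
                               "alias": "kafka.request.produce.time.99percentile"},
        },
    }},
]


def count_configured_metrics(conf):
    """Count attributes across all include entries (assumed: one metric each)."""
    return sum(len(entry["include"].get("attribute", {})) for entry in conf)


metric_count = count_configured_metrics(conf)
within_limit = metric_count <= MAX_METRICS_PER_INSTANCE
print(metric_count, within_limit)
```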
Edit the configuration file conf.d/kafka.yaml so that the Cloud Insight Agent can communicate with Kafka.
WARNING
This sample works only for Kafka >= 0.8.2.
instances:
  - host: localhost
    port: 9999
    name: jmx_instance
    user: username
    password: password
    #java_bin_path: /path/to/java  # Optional, should be set if the agent cannot find your java executable
    #trust_store_path: /path/to/trustStore.jks  # Optional, should be set if ssl is enabled
    #trust_store_password: password
init_config:
  is_jmx: true
  # Metrics collected by this check. You should not have to modify this.
  conf:
    #
    # Aggregate cluster stats
    #
    - include:
        domain: 'kafka.server'
        bean: 'kafka.server:type=BrokerTopicMetrics,name=AllTopicsBytesOutPerSec'
        attribute:
          MeanRate:
            metric_type: counter
            alias: kafka.net.bytes_out
    - include:
        domain: 'kafka.server'
        bean: 'kafka.server:type=BrokerTopicMetrics,name=AllTopicsBytesInPerSec'
        attribute:
          MeanRate:
            metric_type: counter
            alias: kafka.net.bytes_in
    - include:
        domain: 'kafka.server'
        bean: 'kafka.server:type=BrokerTopicMetrics,name=AllTopicsMessagesInPerSec'
        attribute:
          MeanRate:
            metric_type: gauge
            alias: kafka.messages_in
    #
    # Request timings
    #
    - include:
        domain: 'kafka.server'
        bean: 'kafka.server:type=BrokerTopicMetrics,name=AllTopicsFailedFetchRequestsPerSec'
        attribute:
          MeanRate:
            metric_type: gauge
            alias: kafka.request.fetch.failed
    - include:
        domain: 'kafka.server'
        bean: 'kafka.server:type=BrokerTopicMetrics,name=AllTopicsFailedProduceRequestsPerSec'
        attribute:
          MeanRate:
            metric_type: gauge
            alias: kafka.request.produce.failed
    - include:
        domain: 'kafka.network'
        bean: 'kafka.network:type=RequestMetrics,name=Produce-TotalTimeMs'
        attribute:
          Mean:
            metric_type: counter
            alias: kafka.request.produce.time.avg
          99thPercentile:
            metric_type: counter
            alias: kafka.request.produce.time.99percentile
    - include:
        domain: 'kafka.network'
        bean: 'kafka.network:type=RequestMetrics,name=Fetch-TotalTimeMs'
        attribute:
          Mean:
            metric_type: counter
            alias: kafka.request.fetch.time.avg
          99thPercentile:
            metric_type: counter
            alias: kafka.request.fetch.time.99percentile
    - include:
        domain: 'kafka.network'
        bean: 'kafka.network:type=RequestMetrics,name=UpdateMetadata-TotalTimeMs'
        attribute:
          Mean:
            metric_type: counter
            alias: kafka.request.update_metadata.time.avg
          99thPercentile:
            metric_type: counter
            alias: kafka.request.update_metadata.time.99percentile
    - include:
        domain: 'kafka.network'
        bean: 'kafka.network:type=RequestMetrics,name=Metadata-TotalTimeMs'
        attribute:
          Mean:
            metric_type: counter
            alias: kafka.request.metadata.time.avg
          99thPercentile:
            metric_type: counter
            alias: kafka.request.metadata.time.99percentile
    - include:
        domain: 'kafka.network'
        bean: 'kafka.network:type=RequestMetrics,name=Offsets-TotalTimeMs'
        attribute:
          Mean:
            metric_type: counter
            alias: kafka.request.offsets.time.avg
          99thPercentile:
            metric_type: counter
            alias: kafka.request.offsets.time.99percentile
    #
    # Replication stats
    #
    - include:
        domain: 'kafka.server'
        bean: 'kafka.server:type=ReplicaManager,name=ISRShrinksPerSec'
        attribute:
          MeanRate:
            metric_type: counter
            alias: kafka.replication.isr_shrinks
    - include:
        domain: 'kafka.server'
        bean: 'kafka.server:type=ReplicaManager,name=ISRExpandsPerSec'
        attribute:
          MeanRate:
            metric_type: counter
            alias: kafka.replication.isr_expands
    - include:
        domain: 'kafka.server'
        bean: 'kafka.server:type=ControllerStats,name=LeaderElectionRateAndTimeMs'
        attribute:
          MeanRate:
            metric_type: counter
            alias: kafka.replication.leader_elections
    - include:
        domain: 'kafka.server'
        bean: 'kafka.server:type=ControllerStats,name=UncleanLeaderElectionsPerSec'
        attribute:
          MeanRate:
            metric_type: counter
            alias: kafka.replication.unclean_leader_elections
    #
    # Log flush stats
    #
    - include:
        domain: 'kafka.log'
        bean: 'kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs'
        attribute:
          MeanRate:
            metric_type: counter
            alias: kafka.log.flush_rate
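Because the agent reaches Kafka over the JMX host and port set under instances, it can help to confirm that the port accepts connections before going further. The following is a minimal sketch using only Python's standard library. It checks plain TCP reachability and does not perform a JMX handshake or verify the user and password. The function name is hypothetical, and localhost:9999 is taken from the sample configuration.

```python
import socket


def jmx_port_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError (refused, timed out, unresolved host).
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Host and port as configured in the sample kafka.yaml.
print(jmx_port_reachable("localhost", 9999))
```

A result of False usually means that JMX is not enabled on the broker, that the port differs from the one in the configuration, or that a firewall blocks the connection.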
Edit the consumer configuration file conf.d/kafka_consumer.yaml.
init_config:
instances:
  - kafka_connect_str: localhost:19092
    zk_connect_str: localhost:2181
    zk_prefix: /0.8
    consumer_groups:
      my_consumer:
        my_topic: [0, 1, 4, 12]
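The consumer_groups setting nests consumer group, topic, and partition list. The sketch below expands that structure into one entry per monitored partition. The ZooKeeper path it builds follows the conventional layout used by Kafka 0.8 consumers (/consumers/GROUP/offsets/TOPIC/PARTITION under the configured prefix); whether the agent reads exactly these paths is an assumption, and the function name is hypothetical.

```python
# Values copied from the sample kafka_consumer.yaml.
zk_prefix = "/0.8"
consumer_groups = {"my_consumer": {"my_topic": [0, 1, 4, 12]}}


def offset_paths(zk_prefix, consumer_groups):
    """Yield (group, topic, partition, zk_path) for every configured partition."""
    for group, topics in consumer_groups.items():
        for topic, partitions in topics.items():
            for partition in partitions:
                # Assumed layout: conventional Kafka 0.8 consumer offset path.
                path = f"{zk_prefix}/consumers/{group}/offsets/{topic}/{partition}"
                yield group, topic, partition, path


paths = list(offset_paths(zk_prefix, consumer_groups))
for entry in paths:
    print(entry)
```

With the sample values this produces four entries, one for each of partitions 0, 1, 4, and 12 of my_topic.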
Restart the OneAPM Cloud Insight Agent for the configuration to take effect.