Elastic Stack Best Practices Series: A filebeat Memory Leak Analysis and Tuning
2022-06-21 15:36:00 【Three ignition cycles】
Since release 5.0, ELK has bundled the Beats suite and become known as the Elastic Stack. Beats is a set of lightweight agents that offer a simple, fast way to collect more data in real time to support analysis. But because Beats run outside the ELK cluster, on the hosts themselves, their impact on host performance — rather than the value they provide — often decides whether they can be used at all. Stable business operation is the core KPI, and data generated for operations purposes always takes lower priority. Host performance can be affected in many ways: CPU usage, network throughput, disk IO, memory, and so on. Here we discuss the memory-leak problem in detail.
filebeat is one of the core components of the Beats suite (the other is metricbeat). It is generally installed alongside the programs whose files (mainly logs) it collects; per the official advice, filebeat is not recommended for collecting data on NFS (network shared disks). Once filebeat runs, it necessarily consumes CPU, memory, network, and other resources, and as long as that consumption stays within an acceptable range, nobody will stop you from using filebeat in production. But if something unexpected happens — say it grabs a huge amount of memory — the operations team will prioritize resources for the core business and kill the filebeat process. Unfortunately, memory leaks have plagued filebeat since its birth and have never been completely resolved in any version (though not yet observed in the latest 6.5): under various scenarios and configurations, memory usage grows too large. Here I mainly describe the problems I ran into on filebeat 6.0.
Problem scenarios and configurations
We used one uniform, simple configuration to monitor a large number of hosts, and it was precisely this simple, undifferentiated configuration that caused the problems. Wrong, wrong, wrong!!! The reasonable approach is to analyze each problem specifically and build a customized configuration for each scenario.
Two settings in our configuration matter here:

- multiline: the multi-line configuration. When a log file does not conform to the expected format and a large number of lines match the pattern, a memory leak can occur.
- max_procs: limits the number of processes filebeat uses — in effect the number of CPU cores — and is best set to 1 manually.

The full configuration:
filebeat.prospectors:
- type: log
  enabled: true
  paths:
    - /qhapp/*/*.log
  tail_files: true
  multiline.pattern: '^[[:space:]]+|^Caused by:|^.+Exception:|^\d+\serror'
  multiline.negate: false
  multiline.match: after
  fields:
    app_id: bi_lass
    service: "{{ hostvars[inventory_hostname]['service'] }}"
    ip_address: "{{ hostvars[inventory_hostname]['ansible_host'] }}"
    topic: qh_app_raw_log
filebeat.config.modules:
  path: ${path.config}/modules.d/*.yml
  reload.enabled: false
setup.template.settings:
  index.number_of_shards: 3
  #index.codec: best_compression
  #_source.enabled: false
output.kafka:
  enabled: true
  hosts: [{{kafka_url}}]
  topic: '%{[fields][topic]}'
max_procs: 1

Be careful: the configuration above only limits the number of CPU cores; it places no particular limit on memory. At the configuration level, two settings chiefly determine filebeat's memory usage:
- queue.mem.events: the size of the internal message queue, default 4096. Before version 6.0 this parameter was called spool-size and could be configured on the command line at startup.
- max_message_bytes: the size of a single message, default 10MB.
filebeat's largest possible memory footprint is therefore max_message_bytes * queue.mem.events = 10MB * 4096 ≈ 40GB. And since the queue stores the encoded data alongside the raw data, without a memory limit the actual usage can exceed 80GB.

The suggestion, then, is to limit both filebeat's CPU and its memory at the same time.
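On the configuration side, the same formula can be worked backwards to bound the queue. A minimal sketch — the numbers below are illustrative, not official recommendations:

queue.mem:
  events: 2048                   # 2048 events * 1MB per message ≈ 2GB worst case
output.kafka:
  max_message_bytes: 1000000     # cap a single message at about 1MB
max_procs: 1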
Below, let's look at the conditions under which memory leaks were observed with the configuration above.
Monitoring too many files
For files that accumulate a large amount of content in real time, such as logs, the common practice is log rotation: depending on the strategy, the log is rotated at fixed intervals or once it reaches a fixed size. This can leave a large number of log files in the directory.
If we monitor such a directory with a wildcard, filebeat starts a harvester instance for every matching file. Please keep in mind, though: I am not claiming this is bound to cause a memory leak, only that a leak was observed under these conditions — which is not the same as saying this is its cause.
After running for a few months, filebeat had occupied more than 10GB of memory.
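If wildcards are unavoidable, filebeat's prospector options can at least cap how many harvesters run at once. A sketch with illustrative values (option names as documented for filebeat 6.x):

filebeat.prospectors:
- type: log
  paths:
    - /qhapp/*/*.log
  harvester_limit: 10     # at most 10 concurrent harvesters for this prospector
  ignore_older: 48h       # skip files not modified in the last 48 hours
  close_inactive: 5m      # release a file handle after 5 minutes without new data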
Very frequently rotated logs
Another possibility: filebeat is configured to monitor only a single file, say test2.log, but test2.log is constantly being rotated into new files. Even though no wildcard is used to collect the whole directory, Linux identifies a file by its inode number, and rotation does not change the inode of the rotated-out file. So filebeat keeps tracking the rotated files, and in practice ends up monitoring many files at the same time.
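For heavy rotation, the harvester and registry lifecycle options help release rotated-out files promptly. Again a sketch — the values are illustrative, and close_renamed in particular may drop lines that were still unread at rotation time:

filebeat.prospectors:
- type: log
  paths:
    - /path/to/test2.log
  close_renamed: true     # stop harvesting as soon as rotation renames the file
  close_removed: true     # stop when the file is deleted (the default)
  clean_removed: true     # also drop registry state for deleted files (the default)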
Excessive memory consumption caused by multiline
Take multiline.pattern: '^[[:space:]]+|^Caused by:|^.+Exception:|^\d+\serror' as an example: a line beginning with a space or tab is treated as a continuation of the previous line and is folded, in multiline mode, into the same event. If the file being monitored happens to start every line with a space, lines are merged incorrectly, and after filebeat's parsing a single event can run to thousands of lines and reach 10MB in size. Regular expressions are applied throughout this process, so handling each such event consumes a great deal of memory. And since most filebeat outputs require acknowledgement, buffering these events inevitably consumes a great deal of memory as well.
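filebeat does expose guards against runaway multiline events. A sketch with illustrative values (option names as in the filebeat 6.x documentation):

filebeat.prospectors:
- type: log
  multiline.pattern: '^[[:space:]]+|^Caused by:'
  multiline.negate: false
  multiline.match: after
  multiline.max_lines: 500   # fold at most 500 lines into one event (the default)
  multiline.timeout: 5s      # flush a pending multiline event after 5s of silence
  max_bytes: 1048576         # truncate any single event beyond 1MB (default 10MB)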
Simulation scenario
Without further ado, let's simulate it with some simple Python. First, a logging.ini that rotates aggressively:
[loggers]
keys=root
[handlers]
keys=NormalHandler
[formatters]
keys=formatter
[logger_root]
level=DEBUG
handlers=NormalHandler
[handler_NormalHandler]
class=logging.handlers.TimedRotatingFileHandler
formatter=formatter
args=('./test2.log', 'S', 10, 200)
[formatter_formatter]
format=%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s

With the configuration above, the log rotates every 10 seconds ('S'; 'M' would be minutes, 'D' days), keeping at most 200 rotated files.
Then take any log line and write it in a tight loop — the script below produces roughly 330 lines per second:
import logging
from logging.config import fileConfig
import os
import time

CURRENT_FOLDER = os.path.dirname(os.path.realpath(__file__))
fileConfig(CURRENT_FOLDER + '/logging.ini')
logger = logging.getLogger()

# one representative application log line, written over and over
MSG = ("DEBUG 2018-11-26 09:31:35 com.sunyard.insurance.server.GetImage 43 - Resource request :date=20181126&file_name=/imagedata/imv2/pool1/images/GXTB/2017/11/14/57/06b6bcdd31763b70b20f56c689e51f5e_1/06b6bcdd31763b70b20f56c689e51f5e_2.syd&file_encrypt=0&token=HUtGGG20GH4GAqq209R9tc9UGtAURR0b DEBUG 2018-11-26 09:31:40 com.sunyard.insurance.scheduler.job.DbEroorHandleJob 26 - [Database operation exception handling JOB] Handling exception files, this machine does not run, exit task!")

while True:
    for _ in range(9):
        logger.debug(MSG)
    # every tenth line carries a long tail of arbitrary junk characters to
    # stress multiline matching (the exact junk content doesn't matter)
    logger.debug(MSG + "!@#" * 40 + "#" * 150)
    time.sleep(0.03)  # 10 lines per 0.03s ≈ 330 lines/second

How to observe filebeat's memory
Before the 6.5 release, we could not observe the Beats suite through X-Pack monitoring. So the question is: with no monitoring in place, how do we examine filebeat's performance? The simple way is the operating system's top, ps, and similar commands, but these are instantaneous readings that cannot show trends, and they work at the process level, revealing nothing about what is really happening inside filebeat. Below, then, is how to observe memory usage through filebeat's own logs and through the pprof tool.
Via the filebeat log
Interpreting the filebeat log
filebeat's log actually already includes many metrics for observing its resource usage in real time. Below is a log fragment (from version 6.0; after 6.5 the whole log format changed from kv pairs to JSON objects, and X-Pack can monitor filebeat directly):
2018-11-02T17:40:01+08:00 INFO Non-zero metrics in the last 30s: beat.memstats.gc_next=623475680 beat.memstats.memory_alloc=391032232 beat.memstats.memory_total=155885103371024 filebeat.events.active=-402 filebeat.events.added=13279 filebeat.events.done=13681 filebeat.harvester.closed=1 filebeat.harvester.open_files=7 filebeat.harvester.running=7 filebeat.harvester.started=2 libbeat.config.module.running=0 libbeat.output.events.acked=13677 libbeat.output.events.batches=28 libbeat.output.events.total=13677 libbeat.outputs.kafka.bytes_read=12112 libbeat.outputs.kafka.bytes_write=1043381 libbeat.pipeline.clients=1 libbeat.pipeline.events.active=0 libbeat.pipeline.events.filtered=4 libbeat.pipeline.events.published=13275 libbeat.pipeline.events.total=13279 libbeat.pipeline.queue.acked=13677 registrar.states.cleanup=1 registrar.states.current=8 registrar.states.update=13681 registrar.writes=28
The metrics fall into four groups:

- beat.*, including memstats.gc_next, memstats.memory_alloc, and memstats.memory_total. These metrics are common to all Beats components and inherited by filebeat; they are mainly memory-related. Pay particular attention to memstats.memory_alloc: the more allocated, the more memory in use.
- filebeat.*, metrics specific to filebeat. The event-related metrics tell us the throughput; the harvester metrics tell us how many files are being monitored. The more unconsumed events pile up, and the more harvesters are created, the more memory is consumed.
- libbeat.*, also common to the Beats components, covering outputs, the pipeline, and so on. The important point here: when the output blocks, it directly stalls the consumption of events from the queue, and memory accumulates.
- registrar.*: filebeat keeps the state of every monitored file in the registry file. With many monitored files — say 100,000 — and no sensible close_inactive setting, this file can reach 100MB, and once loaded it occupies that much memory directly.
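Even before wiring the log into a pipeline, a quick script can pull the trend out of it. A sketch, assuming the 6.0 kv log format shown above and a hypothetical log path:

import re

LOG_PATH = "/var/log/filebeat/filebeat"  # hypothetical; adjust to your installation

alloc_re = re.compile(r"beat\.memstats\.memory_alloc=(\d+)")
with open(LOG_PATH) as f:
    for line in f:
        m = alloc_re.search(line)
        if m:
            # the first 19 characters are the timestamp, e.g. 2018-11-02T17:40:01
            print(line[:19], "memory_alloc=%.1f MB" % (int(m.group(1)) / 1024 / 1024))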
Parsing the filebeat log
Of course, we are not going to read this log by eye; since we run ELK, we let ELK read it. The kv format makes that convenient with logstash's kv plugin:
filter {
kv {}
}

kv cannot specify the type of the parsed properties, so we need to tweak the index template slightly:
PUT _template/template_1
{
"index_patterns": ["filebeat*"],
"settings": {
"number_of_shards": 1
},
"mappings": {
"doc": {
"_source": {
"enabled": false
},
"dynamic_templates": [
{
"longs_as_strings": {
"match_mapping_type": "string",
"path_match": "*beat.*",
"path_unmatch": "*.*name",
"mapping": {
"type": "long"
}
}
}
]
}
}
}

The template above maps all the kv-parsed properties to the long type. Note, however, that "path_match": "*beat.*" cannot match the registrar metrics; readers can write a more complete mapping.
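For instance, one more dynamic_templates entry along the following lines could cover them — a sketch, not tested against a live cluster:

{
  "registrar_as_longs": {
    "match_mapping_type": "string",
    "path_match": "registrar.*",
    "path_unmatch": "*.*name",
    "mapping": {
      "type": "long"
    }
  }
}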
With that in place, Kibana visualizations let us clearly watch the memory leak unfold and track the resource usage. Once charted, it is plain that the memory spikes coincide with ERR entries in the log, namely the following error:
2018-11-27T09:05:44+08:00 ERR Harvester could not be started on new file: /qhapp/logs/bd-etl/logs/test2.log, Err: Error setting up harvester: Harvester setup failed. Unexpected file opening error: file info is not identical with opened file. Aborting harvesting and retrying file later again
which can cause filebeat to suddenly request additional memory. See the related issue for details.
Via pprof
As everyone knows, filebeat is implemented in Go, and Go's standard tooling includes pprof, an extremely powerful profiling tool. It is meant for debugging: in normal operation filebeat does not enable it, and regrettably the official documentation never mentions that pprof can be used to observe filebeat. We can, however, learn how to analyze filebeat with pprof from a memory-leak issue fixed in 6.3.
Starting pprof monitoring
First, filebeat must be started with pprof enabled, by adding the parameter -httpprof localhost:6060, i.e. /usr/share/filebeat/bin/filebeat -c /etc/filebeat/filebeat.yml -path.home /usr/share/filebeat -path.config /etc/filebeat -path.data /var/lib/filebeat -path.logs /var/log/filebeat -httpprof localhost:6060. Binding to localhost alone makes it unreachable remotely; for remote access, bind to 0.0.0.0 instead.
At this point you can run commands such as curl http://localhost:6060/debug/pprof/heap > profile.txt to grab filebeat's live heap profile.
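To track growth over time, one approach — a sketch; the paths and interval are arbitrary — is to snapshot the heap periodically and then diff two snapshots with pprof's -base flag, which shows only the allocations that appeared between them:

# snapshot every few minutes, e.g. from cron
curl -s http://localhost:6060/debug/pprof/heap > /tmp/heap.$(date +%s)

# show only what grew between the two snapshots
go tool pprof -base /tmp/heap.1543190400 /tmp/heap.1543194000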
Remote connection
Of course, you can also install Go on your local machine and connect to pprof remotely via go tool.
Since we are investigating a memory problem, the connection below accesses the /heap subpath:
go tool pprof http://10.60.x.x:6060/debug/pprof/heap
top command
Once connected, the top command shows which allocations consume the most memory:
33159.58kB of 33159.58kB total ( 100%)
Dropped 308 nodes (cum <= 165.80kB)
Showing top 10 nodes out of 51 (cum >= 512.04kB)
flat flat% sum% cum cum%
19975.92kB 60.24% 60.24% 19975.92kB 60.24% runtime.malg
7680.66kB 23.16% 83.40% 7680.66kB 23.16% github.com/elastic/beats/filebeat/channel.SubOutlet
2048.19kB 6.18% 89.58% 2048.19kB 6.18% github.com/elastic/beats/filebeat/prospector/log.NewHarvester
1357.91kB 4.10% 93.68% 1357.91kB 4.10% runtime.allgadd
1024.08kB 3.09% 96.76% 1024.08kB 3.09% runtime.acquireSudog
544.67kB 1.64% 98.41% 544.67kB 1.64% github.com/elastic/beats/libbeat/publisher/queue/memqueue.NewBroker
528.17kB 1.59% 100% 528.17kB 1.59% regexp.(*bitState).reset
0 0% 100% 528.17kB 1.59% github.com/elastic/beats/filebeat/beater.(*Filebeat).Run
0 0% 100% 512.04kB 1.54% github.com/elastic/beats/filebeat/channel.CloseOnSignal.func1
0 0% 100% 512.04kB 1.54% github.com/elastic/beats/filebeat/channel.SubOutlet.func1

View the stack call graph
Entering the web command produces an SVG graph of the stack call relationships. Combined with top — where we already saw that github.com/elastic/beats/filebeat/channel.SubOutlet occupies a lot of memory — the graph shows the call stack through which this type is instantiated and how memory is distributed across the whole heap. Most intuitively, the larger an instance's rectangle, the more memory it uses.
View source code
The list command quickly shows the problematic source code for a symbol. In the top10 output we already saw that instances of github.com/elastic/beats/filebeat/channel.SubOutlet occupy a lot of memory; list lets us analyze further and see exactly which statements in this type start occupying memory:
(pprof) list SubOutlet
Total: 32.38MB
ROUTINE ======================== github.com/elastic/beats/filebeat/channel.SubOutlet in /home/jeremy/src/go/src/github.com/elastic/beats/filebeat/channel/util.go
7.50MB 7.50MB (flat, cum) 23.16% of Total
. . 15:// SubOutlet create a sub-outlet, which can be closed individually, without closing the
. . 16:// underlying outlet.
. . 17:func SubOutlet(out Outleter) Outleter {
. . 18: s := &subOutlet{
. . 19: isOpen: atomic.MakeBool(true),
1MB 1MB 20: done: make(chan struct{}),
2MB 2MB 21: ch: make(chan *util.Data),
4.50MB 4.50MB 22: res: make(chan bool, 1),
. . 23: }
. . 24:
. . 25: go func() {
. . 26: for event := range s.ch {
. . 27: s.res <- out.OnEvent(event)

How to tune
Tuning is, in essence, the process of adjusting parameters. As noted earlier, the memory-related parameters are max_message_bytes, queue.mem.events, and queue.mem.flush.min_events, and the queue's worst-case memory follows the formula max_message_bytes * queue.mem.events:
output.kafka:
enabled: true
# max_message_bytes: 1000000
hosts: ["10.60.x.x:9092"]
topic: '%{[fields][topic]}'
max_procs: 1
#queue.mem.events: 256
#queue.mem.flush.min_events: 128

But in practice, different circumstances cause filebeat to use too much memory for different reasons, so you need to identify your own context carefully:
- Are wildcards causing a huge number of files to be monitored at once? If so, avoid wildcards that match files you don't need.
- Are single lines of the file enormous? Decide whether the file's content needs to be changed, or filtered out.
- Does the multiline pattern match too much, making single multiline events oversized? If so, disable multiline temporarily and fix the file's content or the multiline pattern.
- Is the output frequently blocked, so the event queue is perpetually full of cached events? Then check whether your network environment or message-queue middleware is healthy.