
How to configure a virtualized Munin server to monitor 100+ servers in CentOS/RHEL

We use Munin primarily to gather historical data, which in turn is used for capacity planning (e.g. server upgrades). The graphs are also a good tool for spotting unusual server behavior (e.g. spikes in memory or CPU usage), and they serve as pointers to what caused a server crash.

When we consolidated our servers and migrated them to virtual machines, our Munin server was affected as well. The first few days after virtualizing it were a disaster: it simply couldn’t handle the load because the disk I/O it required was too great!

To figure out which parts we can tweak to improve performance, it’s important to first look at how Munin generates those lovely graphs. Each Munin master run, kicked off by munin-cron every five minutes, has four steps (see the cron entry after the list):

  1. munin-update -> updates the RRD files; if you have a lot of nodes, the disk I/O will be hammered!
  2. munin-limits
  3. munin-graph -> generates graphs out of the RRD files; multiple CPU cores are a must!
  4. munin-html
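
For reference, all four steps are driven by a single munin-cron run. On our CentOS/EPEL install the schedule lives in /etc/cron.d/munin and looks roughly like this (the exact content may differ between package versions):

[root@munin ~]# cat /etc/cron.d/munin
*/5 * * * *     munin test -x /usr/bin/munin-cron && /usr/bin/munin-cron
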
We only need to tweak steps #1 and #3 to increase performance. But before I get into the details, here are the specs of our Munin server:
  • OS: CentOS 6.2 x86_64
  • CPU: 4 cores
  • RAM: 3.5GB
  • HDD: 10GB
  • Munin: version 1.4.6

Note: Add the EPEL repository to install Munin 1.4.6 using yum.
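
As a minimal sketch, assuming the EPEL release RPM is already in place (its download URL changes over time), the master can then be installed with:

[root@munin ~]# yum install munin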

Yup, I need that much RAM to address #1. Since it’s way cheaper to buy more memory than an SSD or an array of 10k/15k RPM drives, I used tmpfs to solve the disk I/O problem. This makes all RRD updates happen in memory. It’s not a new idea; this approach has been in use for years.

I added these lines in /etc/fstab:

# tmpfs for munin files
/var/lib/munin /var/lib/munin tmpfs size=1280M,nr_inodes=1m,mode=775,uid=munin,gid=munin,noatime 0 0
/var/www/munin /var/www/munin tmpfs size=768M,nr_inodes=1m,mode=775,uid=munin,gid=munin,noatime 0 0
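
If you’d rather not reboot just to pick up the new entries, mounting them by hand should work as well. Just make sure a munin-cron run isn’t in progress, and copy any existing RRD files aside first, since mounting over the directories hides whatever is already in them:

[root@munin ~]# mount /var/lib/munin
[root@munin ~]# mount /var/www/munin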

And this is what it looks like in production once mounted and in use:

[root@munin ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda1       9.6G  6.3G  3.0G  69% /
tmpfs           1.8G     0  1.8G   0% /dev/shm
/var/lib/munin  1.3G  937M  344M  74% /var/lib/munin
/var/www/munin  768M  510M  259M  67% /var/www/munin

Since all RRD files are now stored in RAM, they will simply vanish into oblivion if the server is rebooted for any reason. To compensate, I added these maintenance scripts to root’s cron:

[root@munin ~]# crontab -l
# create RRD files backup
*/15 * * * * mkdir -p $HOME/munin-files/munin-lib/ &&  rsync --archive /var/lib/munin/* $HOME/munin-files/munin-lib/ > /dev/null 2>&1

# restore RRD files at reboot
@reboot mkdir -p /var/www/munin/ /var/lib/munin/ && chown -R munin.munin /var/www/munin/ /var/lib/munin/ && cp -a -r $HOME/munin-files/munin-lib/* /var/lib/munin/

# cleanup: remove inactive rrd and png files
@daily find /var/lib/munin/ -type f -mtime +7 -name '*.rrd' | xargs rm -f
@daily find $HOME/munin-files/munin-lib/ -type f -mtime +7 -name '*.rrd' | xargs rm -f
@daily find /var/www/munin/ -type f -mtime +7 -name '*.png' | xargs rm -f

What these do:

  1. creates a backup of the RRD files every 15 minutes
  2. restores the RRD files from #1 in case the server was rebooted or crashed (a quick sanity check is shown after this list)
  3. deletes inactive RRD and PNG (graphs) files to reduce tmpfs usage
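
As a quick sanity check after a reboot, the RRD count on the tmpfs and in the on-disk backup should roughly match:

[root@munin ~]# find /var/lib/munin/ -type f -name '*.rrd' | wc -l
[root@munin ~]# find $HOME/munin-files/munin-lib/ -type f -name '*.rrd' | wc -l
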
As of this writing, our Munin server is monitoring 131 servers, which equates to 18,000+ RRD files, and disk I/O is no longer an issue during munin-update, thanks to tmpfs.

[root@munin ~]# pcregrep '^\s*\[' /etc/munin/munin.conf | wc -l
131
[root@munin ~]# find /var/lib/munin/ -type f -name '*.rrd' | wc -l
18635
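
That works out to roughly 50 KB per RRD file (937 MB for 18,635 files), which is a handy rule of thumb when sizing the tmpfs. Assuming GNU find, something like this prints the actual average and total for your own setup:

[root@munin ~]# find /var/lib/munin/ -type f -name '*.rrd' -printf '%s\n' | \
    awk '{ sum += $1; n++ } END { printf "%d files, avg %.0f KB, total %.0f MB\n", n, sum/n/1024, sum/1024/1024 }'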

This is the typical CPU usage of our Munin server over a day; iowait is negligible.

As for #3, the munin-graph step, this simply requires raw CPU power: multiple cores and a few configuration tweaks. As reflected in the CPU graph above, I allotted 4 cores to our Munin server and about 75% of that is constantly in use. The KVM hypervisor hosting our Munin server has a Xeon E5504, not the best there is, but it gets the job done.

Since I allotted 4 cores for the Munin server VM, I set max_graph_jobs to 4:

[root@munin ~]# grep max_graph_jobs /etc/munin/munin.conf
# max_graph_jobs.
max_graph_jobs 4

Note: munin-graph ran as a single process in older versions of Munin, so I recommend using version 1.4.6.

Test your configuration and see how it behaves. You will have to calibrate this value depending on what your CPU is and how many cores it has (e.g. if you have a Xeon X56xx, 4 cores may be overkill).
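
A quick way to confirm how many cores the VM actually sees before settling on a max_graph_jobs value:

[root@munin ~]# grep -c ^processor /proc/cpuinfo
4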

This graph contains enough information to determine which steps of the Munin master you need to tweak…

As reflected in the graph above, munin-graph takes about 200 seconds at most to finish. If this value goes beyond 300 (Munin’s master process runs every 5 minutes), I may have to add a core and bump max_graph_jobs to 5, or move the VM to a better hypervisor; otherwise the graphs will be 5+ minutes late or filled with gaps.
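
If you don’t have that graph handy, the timings can also be pulled from the Munin master logs. On our CentOS install they live under /var/log/munin/, though the exact log wording may vary between versions:

[root@munin ~]# grep -i finished /var/log/munin/munin-update.log | tail -3
[root@munin ~]# grep -i finished /var/log/munin/munin-graph.log | tail -3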

That’s it. This is how I manage our Munin server to monitor 100+ servers. Of course, this only applies to Munin 1.4.x; I’ve read that Munin 2.0 will be quite different. Hopefully Munin 2.0 can support hundreds of nodes out of the box, no tweaking needed… I guess we’ll see… 🙂