We use Munin primarily to gather historical data, which in turn feeds capacity planning (e.g. server upgrades). The graphs are also a good tool for spotting unusual server behavior (e.g. spikes in memory or CPU usage), and we use them as pointers to what caused a server crash.
When we consolidated our servers and migrated them to virtual machines, our Munin server was affected too. The first few days after we virtualized it were a disaster: it simply couldn’t handle the load because the disk I/O required was too great!
To determine which part to tweak for better performance, it’s important to first look at how Munin generates those lovely graphs. The Munin server process has four steps:
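- munin-update -> updates the RRD files; if you have a lot of nodes, the disk I/O will be hammered!
- munin-limits -> checks the collected values against the warning and critical thresholds
- munin-graph -> generates graphs out of the RRD files; multiple CPU cores are a must!
- munin-html -> generates the HTML pages that show the graphs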
The specs of our Munin server VM:
- OS: CentOS 6.2 x86_64
- CPU: 4 cores
- RAM: 3.5GB
- HDD: 10GB
- Munin: version 1.4.6
Note: Add the EPEL repository to install Munin 1.4.6 using yum.
Yup, I need that much RAM to address #1 (munin-update). Since buying more memory is way cheaper than buying an SSD or an array of 10k/15k RPM drives, I used tmpfs to solve the disk I/O problem, so that all RRD updates are done in memory. This is not a new idea; the approach has been used for years. These are the entries in /etc/fstab:
# tmpfs for munin files
/var/lib/munin /var/lib/munin tmpfs size=1280M,nr_inodes=1m,mode=775,uid=munin,gid=munin,noatime 0 0
/var/www/munin /var/www/munin tmpfs size=768M,nr_inodes=1m,mode=775,uid=munin,gid=munin,noatime 0 0
[root@munin ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/vda1 9.6G 6.3G 3.0G 69% /
tmpfs 1.8G 0 1.8G 0% /dev/shm
/var/lib/munin 1.3G 937M 344M 74% /var/lib/munin
/var/www/munin 768M 510M 259M 67% /var/www/munin
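Since both mounts have a fixed size, it helps to keep an eye on how full they are. The following is a minimal sketch of such a check and was not part of my original setup: the function names `tmpfs_usage_pct` and `check_headroom` are hypothetical, and the default threshold of 90% is an arbitrary assumption you should adjust.

```shell
#!/bin/sh
# Print the integer usage percentage of the filesystem holding a path.
# df -P (POSIX format) keeps every filesystem on a single line.
tmpfs_usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn when a mount reaches the threshold (default 90, an assumed value).
# Returns 1 when the threshold is reached, 0 otherwise.
check_headroom() {
    path=$1
    limit=${2:-90}
    pct=$(tmpfs_usage_pct "$path")
    if [ "$pct" -ge "$limit" ]; then
        echo "WARNING: $path is ${pct}% full"
        return 1
    fi
    echo "OK: $path is ${pct}% full"
}
```

For example, `check_headroom /var/lib/munin 85` could be run from cron so that a warning arrives by mail before the tmpfs fills up.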
Since all RRD files are now stored in RAM, they will simply disappear into oblivion if the server is rebooted for any reason. To compensate, I added these maintenance jobs to root’s crontab:
[root@munin ~]# crontab -l
# create RRD files backup
*/15 * * * * mkdir -p $HOME/munin-files/munin-lib/ && rsync --archive /var/lib/munin/* $HOME/munin-files/munin-lib/ > /dev/null 2>&1
# restore RRD files at reboot
@reboot mkdir -p /var/www/munin/ /var/lib/munin/ && chown -R munin:munin /var/www/munin/ /var/lib/munin/ && cp -a $HOME/munin-files/munin-lib/* /var/lib/munin/
# cleanup: remove inactive rrd and png files
@daily find /var/lib/munin/ -type f -mtime +7 -name '*.rrd' -delete
@daily find $HOME/munin-files/munin-lib/ -type f -mtime +7 -name '*.rrd' -delete
@daily find /var/www/munin/ -type f -mtime +7 -name '*.png' -delete
What these jobs do:
- create a backup of the RRD files every 15 minutes
- restore the RRD files from that backup after a reboot or crash
- delete inactive RRD and PNG (graph) files to reduce tmpfs usage
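The backup and restore jobs can also be wrapped in a small script, which keeps the crontab lines short and lets the restore refuse to run when there is nothing to restore. This is only a sketch under my own assumptions: `backup_rrd` and `restore_rrd` are hypothetical names, and the paths are passed as arguments rather than taken from the crontab above.

```shell
#!/bin/sh
# Hypothetical helpers mirroring the backup and restore cron jobs.

# Copy the RRD files from the tmpfs mount to persistent disk.
backup_rrd() {
    src=$1
    dst=$2
    mkdir -p "$dst" && rsync --archive "$src"/ "$dst"/
}

# Copy the backup onto the freshly mounted (empty) tmpfs.
# Refuses to run when the backup directory is missing or empty.
restore_rrd() {
    src=$1
    dst=$2
    if [ ! -d "$src" ] || [ -z "$(ls -A "$src")" ]; then
        echo "no backup found in $src" >&2
        return 1
    fi
    mkdir -p "$dst" && cp -a "$src"/. "$dst"/
}
```

With this in place, the @reboot entry could call something like `restore_rrd "$HOME/munin-files/munin-lib" /var/lib/munin`, and a failed restore would show up in cron's mail instead of passing silently.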
[root@munin ~]# pcregrep '^\s*\[' /etc/munin/munin.conf | wc -l
[root@munin ~]# find /var/lib/munin/ -type f -name '*.rrd' | wc -l
This is the typical CPU usage of our Munin server for a day; iowait is negligible.
As for #3, the munin-graph step, this simply requires brute CPU power: multiple cores and some configuration tweaks. As reflected in the CPU graph above, I allotted 4 cores to our Munin server and about 75% of that is constantly in use. The KVM hypervisor hosting our Munin server has a Xeon E5504, not the best there is, but it gets the job done.
Since I allotted 4 cores for the Munin server VM, I set max_graph_jobs to 4:
[root@munin ~]# grep max_graph_jobs /etc/munin/munin.conf
Note: munin-graph ran as a single process in older versions of Munin. I recommend using version 1.4.6.
Test your configuration and see how it behaves. You have to calibrate this value depending on what your CPU is and how many cores it has (e.g. if you have a Xeon X56xx, 4 cores may be overkill).
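As a starting point for that calibration, the core count can be read from the system instead of being typed by hand. The sketch below is an illustration only: `cpu_cores` and `suggest_graph_jobs` are hypothetical helper names, and "one graph job per core" is just the rule of thumb used in this post, not a Munin recommendation.

```shell
#!/bin/sh
# Print the number of online CPU cores, trying nproc first and
# falling back to getconf where nproc is unavailable.
cpu_cores() {
    nproc 2>/dev/null || getconf _NPROCESSORS_ONLN
}

# Print a candidate munin.conf line: one graph job per core by default,
# or fewer when a cap is given as the first argument.
suggest_graph_jobs() {
    cores=$(cpu_cores)
    cap=${1:-$cores}
    if [ "$cap" -lt "$cores" ]; then
        cores=$cap
    fi
    echo "max_graph_jobs $cores"
}
```

Running `suggest_graph_jobs` prints a line you can compare against munin.conf, while `suggest_graph_jobs 2` caps the value at 2 for a fast CPU that does not need every core.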
This graph contains enough information to check which steps of the Munin server you need to tweak…
As reflected in the graph above, munin-graph took about 200 seconds at most to finish. If this value goes beyond 300 seconds (Munin’s master process runs every 5 minutes), I may have to add a core and change max_graph_jobs to 5, or move the VM to a better hypervisor; otherwise the graphs will be 5+ minutes late or filled with gaps.
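The 300-second budget can also be checked from the command line while testing a new setting. This is a rough sketch and not something Munin ships: `run_within_budget` is a hypothetical wrapper that times whatever command you hand it, with one-second resolution.

```shell
#!/bin/sh
# Run a command, print how long it took, and fail if it exceeded a budget.
# Usage: run_within_budget SECONDS command [args...]
run_within_budget() {
    budget=$1
    shift
    start=$(date +%s)
    "$@"
    status=$?
    elapsed=$(( $(date +%s) - start ))
    echo "elapsed: ${elapsed}s (budget: ${budget}s)"
    if [ "$elapsed" -gt "$budget" ]; then
        echo "WARNING: over budget, graphs may lag or show gaps" >&2
        return 1
    fi
    # Within budget: pass through the command's own exit status.
    return "$status"
}
```

Call it as `run_within_budget 300 <your munin-graph invocation>`, substituting however munin-graph is started on your system. A non-zero exit means either the command failed or it ran longer than the budget.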
That’s it. This is how I set up our Munin server to monitor 100+ servers. Of course, this only applies to Munin 1.4.x; I read that Munin 2.0 will be a lot different. Hopefully, Munin 2.0 can support hundreds of nodes out of the box, no tweaking needed… I guess we’ll see… 🙂