With Nagios you can monitor almost anything, and the philosophy is simple.
Nagios runs a plug-in, say a Perl or shell script, checks its return value, and determines the host or service state from it. Nagios neither knows nor cares what the plug-in actually monitors.
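To make the return-value idea concrete, here is a minimal sketch of the logic every plug-in needs. The helper name check_value and the status wording are made up for illustration. Only the return values 0, 1, 2 and 3 (OK, WARNING, CRITICAL, UNKNOWN) come from the Nagios plug-in convention.

```shell
#!/bin/sh
# Minimal plug-in logic: print one status line, return a Nagios state.
# check_value is a hypothetical helper, not something Nagios ships.

# usage: check_value reading warning critical
# returns 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN)
check_value() {
    value=$1
    warn=$2
    crit=$3
    # anything that is not a plain integer cannot be compared
    case "$value" in
        ''|*[!0-9]*) echo "UNKNOWN: no numeric reading"; return 3 ;;
    esac
    if [ "$value" -gt "$crit" ]; then
        echo "CRITICAL: value is $value"
        return 2
    elif [ "$value" -gt "$warn" ]; then
        echo "WARNING: value is $value"
        return 1
    fi
    echo "OK: value is $value"
    return 0
}

# A real plug-in would finish with something like:
#   check_value "$reading" 25 27; exit $?
```

Nagios only looks at the printed line and the exit code, so the same skeleton works for any resource.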
Here is a plug-in that monitors the ambient temperature around a machine. It supports the following servers: Sun Enterprise T5240 and SunFire X4200/X4500.
The script uses the tool 'ipmitool' to connect to the ILOM of a supported system. In my case, the ILOM interface is named hostname.alom or hostname-alom, so the script checks for both. The file .passwd.alom contains the ILOM password.
#!/usr/bin/sh
#set -x

# Nagios plugin : determine ambient temperature around a server
# by zdudic

# -- supported systems
# Sun Enterprise T5240 and SunFire X4200/X4500

# Nagios plugin return values
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3
STATE_DEPENDENT=4

# variables
WARNTEMP=$2
CRITTEMP=$3
ILOMUSER=admin
PASSWDFILE=/opt/csw/libexec/nagios-plugins/ipmitool/.passwd.alom

# Function : report an error on one line and exit with UNKNOWN,
# so Nagios does not mistake a plugin failure for WARNING
err() {
   echo "ERROR: $*"
   exit ${STATE_UNKNOWN}
}

# check if arguments are provided (hostname, warning, critical temperature)
if [ $# -ne 3 ]
then
   echo "USAGE : `basename $0` hostname warn_tmp(C) crit_tmp(C)"
   exit ${STATE_UNKNOWN}
fi

# check if critical temp is higher than warning
if [ "$2" -ge "$3" ]
then
   echo "NOTE : Critical temperature must be higher than warning temperature."
   exit ${STATE_UNKNOWN}
fi

# Function: end script with output, with performance data for NagiosGrapher
endscript () {
   echo "${RESULT} | PerfData=${TEMP};${WARNTEMP};${CRITTEMP}"
   exit ${EXIT_STATUS}
}

# find if ilom name has -alom or .alom (hostname-alom or hostname.alom)
if host $1.alom > /dev/null 2>&1
then
   ILOMNAME=$1.alom
else
   ILOMNAME=$1-alom
fi

PNAME=`ipmitool -H ${ILOMNAME} -U ${ILOMUSER} -f ${PASSWDFILE} fru | head | grep "Product Name" \
   | nawk -F":" '{print $2}' | nawk '{print $1}'`
# a pipeline returns the status of its last command, so test the result itself
[ -n "${PNAME}" ] || err "Cannot find what system type is $1"

case ${PNAME} in
T5240)
   TEMP=`ipmitool -H ${ILOMNAME} -U ${ILOMUSER} -f ${PASSWDFILE} sdr type temperature \
      | grep T_AMB \
      | awk -F"|" '{print $5}' | awk '{print $1}'`
   ;;
ILOM)
   # can be X4500 or X4200
   BOARD=`ipmitool -H ${ILOMNAME} -U ${ILOMUSER} -f ${PASSWDFILE} fru | head | grep "Board Product" \
      | nawk -F"ASSY,SERV PROCESSOR," '{print $2}' | nawk '{print $1}'`
   [ -n "${BOARD}" ] || err "Cannot find what Board Product is."
   if [ "${BOARD}" = "G1/2" ]
   then
      # X4200
      TEMP=`ipmitool -H ${ILOMNAME} -U ${ILOMUSER} -f ${PASSWDFILE} sdr type temperature \
         | grep fp.t_amb \
         | nawk -F"|" '{print $5}' | nawk '{print $1}'`
   elif [ "${BOARD}" = "X4500" ]
   then
      # X4500
      TEMP=`ipmitool -H ${ILOMNAME} -U ${ILOMUSER} -f ${PASSWDFILE} sdr type temperature \
         | grep dbp.t_amb \
         | nawk -F"|" '{print $5}' | nawk '{print $1}'`
   else
      err "Unsupported Board Product: ${BOARD}"
   fi
   ;;
*)
   err "Unsupported system type: ${PNAME}"
   ;;
esac

# make sure a numeric temperature was found before comparing it
case "${TEMP}" in
''|*[!0-9]*) err "Cannot read ambient temperature of $1" ;;
esac

# compare with thresholds (same logic for all supported systems)
if [ ${TEMP} -le ${WARNTEMP} ]
then
   RESULT="Host: $1 : Ambient Temp(C): ${TEMP} : OK"
   EXIT_STATUS=${STATE_OK}
elif [ ${TEMP} -gt ${WARNTEMP} ] && [ ${TEMP} -le ${CRITTEMP} ]
then
   RESULT="Host: $1 : Ambient Temp(C): ${TEMP} : WARNING"
   EXIT_STATUS=${STATE_WARNING}
else
   RESULT="Host: $1 : Ambient Temp(C): ${TEMP} : CRITICAL"
   EXIT_STATUS=${STATE_CRITICAL}
fi

# provide output and nagios return value
endscript
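The temperature extraction in the script is a plain text pipeline: pick the sensor line, take the fifth '|'-separated field, and keep its first word. The sketch below runs the same pipeline on a made-up sample line. The helper name extract_temp and the sample values are assumptions for illustration, not output captured from a real ILOM.

```shell
#!/bin/sh
# extract_temp is a hypothetical helper that mirrors the grep/awk pipeline
# used in the plug-in. It reads 'ipmitool ... sdr type temperature' style
# lines on stdin and prints the reading of the named sensor.
extract_temp() {
    sensor=$1
    grep "$sensor" | awk -F'|' '{print $5}' | awk '{print $1}'
}

# Illustrative line only; the real sensor id and entity may differ.
sample='T_AMB            | 0Ah | ok  |  7.0 | 24 degrees C'

printf '%s\n' "$sample" | extract_temp T_AMB
```

If the sensor name is not present, the pipeline prints nothing, which is why the result should be checked before it is compared with the thresholds.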
This executable shell script is located under the directory /opt/csw/libexec/nagios-plugins on the machine to be monitored.
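This article is not about NRPE, but I have to mention it: the Nagios server calls check_nrpe, the NRPE daemon on the monitored machine runs the plug-in locally, and the plug-in's output and return value travel back to Nagios.
Now Nagios knows the state of the resource, such as OK or Critical, and it does not care what the resource is.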
That said, you need this line in nrpe.cfg (the configuration file for the cswnrpe service) on the machine that is monitored:

# plugin for ambient temperature
command[check_amb_temp]=/opt/csw/libexec/nagios-plugins/ipmitool/amb_temp.sh $ARG1$
The Nagios machine needs a service definition, something like:

define servicegroup{
        servicegroup_name       amb_temp_mvo
        alias                   MVO Ambient Temperature
        }

define service{
        use                     gen-service     ; Name of service template to use
        host_name               srv-1,srv-2
        servicegroups           amb_temp_mvo
        service_description     MVO Ambient Temperature
        # The "$HOSTNAME$ X Y" is 1 argument for command, but actually simulates 3 of them
        check_command           check-nrpe!check_amb_temp!"$HOSTNAME$ 25 27" -t 60
        }
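The "1 argument that simulates 3" trick relies on ordinary shell word splitting. As far as I understand it, NRPE substitutes $ARG1$ into the command line and hands it to a shell, which then splits it on spaces. NRPE also has to be built with command-argument support and have dont_blame_nrpe=1 in nrpe.cfg to accept arguments at all. The sketch below only imitates the splitting step. show_args is a hypothetical stand-in for the plug-in, and the variable ARG1 just plays the role of the NRPE macro.

```shell
#!/bin/sh
# show_args is a hypothetical stand-in for amb_temp.sh: it reports how many
# arguments it received and what they were.
show_args() {
    echo "count=$# host=$1 warn=$2 crit=$3"
}

# What check_nrpe sends as a single argument
ARG1="srv-1 25 27"

# Unquoted on purpose: the shell splits the value into three words
show_args $ARG1
```

This is also why the plug-in can insist on exactly three arguments even though Nagios passes only one.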
There are many solutions for graphing Nagios data; one of them is NagiosGrapher from Netways. I won't cover how to set it up, but here is, in short, how to configure a graph for this plug-in.
Look at the script's function that returns results to Nagios (endscript): it also provides performance data. This is what NagiosGrapher needs.
After installing NagiosGrapher, look in the directory ngraph.d. Say that I monitor the ambient temperature of 2 servers in the Mountain View (MVO) server room. The NagiosGrapher configuration file is:
# NagiosGrapherTemplate for check_amb_temp

# ---------- Help ------------------------------------
# service_name =
#    regular expression used to identify service
#
# graph_perf_regex =
#    regular expression used to find searched value in performance data
#    must be in round brackets ()
#
# graph_value = variable name in rrd database, no empty space
#
# graph_units = units on Y axis, X axis is time
#
# graph_legend = it contains key for variable, shows under graph
#
# page = optional
#
# rrd_plottype = LINE1 is simple line, AREA is filled out surface
#
# -----------------------------------------------

# Amb Temp in MVO
define ngraph{
        service_name            MVO Ambient Temperature
        graph_perf_regex        PerfData=([0-9]*)
        graph_value             amb_temp
        graph_units             C
        graph_legend            MVO Ambient Temperature
        graph_upper_limit       30
        graph_lower_limit       15
        rrd_plottype            LINE2
        rrd_color               FF9900  # orange
}

# AVERAGE of ambient temperature
define ngraph{
        service_name            MVO Ambient Temperature
        type                    VDEF
        graph_value             vdef_amb_temp_average
        graph_legend            Amb temp Average
        graph_calc              amb_temp,AVERAGE
        rrd_plottype            LINE1
        rrd_color               0000ff
        hide                    no
}

define ngraph{
        service_name            MVO Ambient Temperature
        # HRULE draws horizontal line
        type                    HRULE
        hrule_value             25
        rrd_color               FF0000:Warning level  # red
}

define ngraph{
        service_name            MVO Ambient Temperature
        type                    HRULE
        hrule_value             27
        rrd_color               000000:Critical level  # black
}
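To see what graph_perf_regex picks out of the plug-in output, the same pattern can be tried by hand. This is only a stand-in using sed on a made-up status line. NagiosGrapher applies the regular expression itself, and perf_value is a hypothetical helper name.

```shell
#!/bin/sh
# perf_value is a hypothetical helper that mimics
#   graph_perf_regex PerfData=([0-9]*)
# with sed: it prints the digits that follow "PerfData=".
perf_value() {
    sed -n 's/.*PerfData=\([0-9]*\).*/\1/p'
}

# Illustrative plug-in output in the format produced by endscript
line='Host: srv-1 : Ambient Temp(C): 24 : OK | PerfData=24;25;27'

printf '%s\n' "$line" | perf_value
```

Only the first number after PerfData= is captured. The warning and critical values that follow are ignored by this pattern, which is why the graph draws them separately as HRULE lines.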
Here is the weekly graph. Besides this, you also get current, daily, monthly and yearly graphs.
There is also a multigraph if you want to compare a service across several systems. For example, I compare the ambient temperature of 6 systems.
# NOTE : it is nmgraph, not ngraph
# -------------------------------
define nmgraph{
        host_name       Multigraph
        service_name    .* DCO.* Ambient Temperature
        # RegEX
        hosts           [a-zA-Z]+
        # RegEX
        services        .* DCO.* Ambient Temperature
        # This matches 'graph_value' from the ngraph definition
        graph_values    amb_temp
        # line or stack or area
        graph_type      LINE2
        colors          f0e68c,fff000,cd5c5c,ffa500,ff0000,ff1493
}
And the graph is: