Nagios plug-in : Ambient Temperature

Nagios plug-in (+ Graph): Ambient Temperature

With Nagios you can monitor almost anything, and the philosophy is simple.

Nagios uses plug-ins, for example Perl or shell scripts, checks the value each one returns, and determines the host or service state from it. Nagios neither knows nor cares what the plug-in actually monitors.
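As a minimal sketch of that convention, the standard plug-in return values (the same ones the script below defines as STATE_OK and so on) map to states like this. The function name state_name is my own and not part of Nagios, and treating every other value as UNKNOWN is a simplification.

```shell
# Map a plug-in exit code to the state Nagios would report.
# state_name is a hypothetical helper, for illustration only.
state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;   # 3, and in this sketch anything else
    esac
}

# Example: a plug-in that exited with status 2
state_name 2
```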

Here is a plug-in that monitors the ambient temperature around a machine. It supports the following servers: Sun Enterprise T5240 and SunFire X4200/X4500.

The script uses the 'ipmitool' utility to connect to the ILOM of a supported system. In my environment the ILOM interface is named either hostname.alom or hostname-alom, so the script checks for both. The file .passwd.alom contains the ILOM password.
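To make the parsing step concrete, here is a small sketch. It assumes that ipmitool prints one sensor per line, with fields separated by '|' and the reading in the fifth field, which is what the script relies on. The sample line and the helper name parse_reading are illustrative and were not captured from a real system.

```shell
# parse_reading SENSOR : read sensor lines on stdin, print the numeric reading
# (hypothetical helper; mirrors the grep/awk pipeline used in the script)
parse_reading() {
    grep "$1" | awk -F'|' '{print $5}' | awk '{print $1}'
}

# Illustrative line in the format assumed above
SAMPLE='T_AMB            | 12h | ok  |  7.0 | 24 degrees C'
READING=`echo "${SAMPLE}" | parse_reading T_AMB`
echo "Ambient temperature: ${READING}"
```

The second awk strips the leading blank and the "degrees C" suffix, so only the number is left for the numeric comparisons.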

#!/usr/bin/sh
#set -x

# Nagios plugin : determine ambient temperature around a server
# by zdudic
# -- supported systems
# Sun Enterprise T5240 and SunFire X4200/X4500

# Nagios plugin return values
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3
STATE_DEPENDENT=4

# variables
WARNTEMP=$2
CRITTEMP=$3
ILOMUSER=admin
PASSWDFILE=/opt/csw/libexec/nagios-plugins/ipmitool/.passwd.alom

# Function : print error and exit with UNKNOWN
# (exit 1 would be reported by Nagios as WARNING)
err() {
        echo "ERROR: $*"
        exit ${STATE_UNKNOWN}
}

# check if arguments are provided (hostname, warning, critical temperature)
if [ $# -ne 3 ]
then
        echo "USAGE : `basename $0` hostname warn_tmp(C) crit_tmp(C)"
        # exit 2 would be reported by Nagios as CRITICAL
        exit ${STATE_UNKNOWN}
fi

# check if critical temp is higher than warning
if [ ${WARNTEMP} -ge ${CRITTEMP} ]
then
        echo "NOTE : Critical temperature must be higher than warning temperature."
        exit ${STATE_UNKNOWN}
fi


# Function: end script with output, with performance data for NagiosGraph
endscript () {
        echo "${RESULT} | PerfData=${TEMP};${WARNTEMP};${CRITTEMP}"
        exit ${EXIT_STATUS}
}

# find if ilom name has -alom or .alom (hostname-alom or hostname.alom)
if host $1.alom > /dev/null 2>&1
then
        ILOMNAME=$1.alom
else
        ILOMNAME=$1-alom
fi

PNAME=`ipmitool -H ${ILOMNAME} -U ${ILOMUSER} -f ${PASSWDFILE} fru | head | grep "Product Name" \
        | nawk -F":" '{print $2}' | nawk '{print $1}'`
# the pipeline's exit status is that of the last nawk, so test the result instead
[ -n "${PNAME}" ] || err "Cannot find what system type $1 is"

case ${PNAME} in
T5240)
        TEMP=`ipmitool -H ${ILOMNAME} -U ${ILOMUSER} -f ${PASSWDFILE} sdr type temperature \
        | grep T_AMB \
        | awk -F"|" '{print $5}' | awk '{print $1}'`
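        # without a reading the numeric tests below would fail
        [ -n "${TEMP}" ] || err "Cannot read ambient temperature of $1"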
        if [ ${TEMP} -le ${WARNTEMP} ]
        then
                RESULT="Host: $1 : Ambient Temp(C): ${TEMP} : OK"
                EXIT_STATUS=${STATE_OK}
        elif [ ${TEMP} -gt ${WARNTEMP} ] && [ ${TEMP} -le ${CRITTEMP} ]
        then
                RESULT="Host: $1 : Ambient Temp(C): ${TEMP} : WARNING"
                EXIT_STATUS=${STATE_WARNING}
        else
                RESULT="Host: $1 : Ambient Temp(C): ${TEMP} : CRITICAL"
                EXIT_STATUS=${STATE_CRITICAL}
        fi
        #
        ;;

ILOM)
        # can be X4500 or X4200

        BOARD=`ipmitool -H ${ILOMNAME} -U ${ILOMUSER} -f ${PASSWDFILE} fru | head | grep "Board Product" \
        | nawk -F"ASSY,SERV PROCESSOR," '{print $2}' | nawk '{print $1}'`
        [ -n "${BOARD}" ] || err "Cannot find what the Board Product is."

        if [ "${BOARD}" = "G1/2" ]
        then
                #  X4200
                TEMP=`ipmitool -H ${ILOMNAME} -U ${ILOMUSER} -f ${PASSWDFILE} sdr type temperature \
                | grep fp.t_amb \
                | nawk -F"|" '{print $5}' | nawk '{print $1}'`

        elif [ "${BOARD}" = "X4500" ]
        then
                # X4500
                TEMP=`ipmitool -H ${ILOMNAME} -U ${ILOMUSER} -f ${PASSWDFILE} sdr type temperature \
                | grep dbp.t_amb \
                | nawk -F"|" '{print $5}' | nawk '{print $1}'`
        else
                err "Unsupported Board Product: ${BOARD}"
        fi

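        # without a reading the numeric tests below would fail
        [ -n "${TEMP}" ] || err "Cannot read ambient temperature of $1"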

        if [ ${TEMP} -le ${WARNTEMP} ]
        then
                RESULT="Host: $1 : Ambient Temp(C): ${TEMP} : OK"
                EXIT_STATUS=${STATE_OK}
        elif [ ${TEMP} -gt ${WARNTEMP} ] && [ ${TEMP} -le ${CRITTEMP} ]
        then
                RESULT="Host: $1 : Ambient Temp(C): ${TEMP} : WARNING"
                EXIT_STATUS=${STATE_WARNING}
        else
                RESULT="Host: $1 : Ambient Temp(C): ${TEMP} : CRITICAL"
                EXIT_STATUS=${STATE_CRITICAL}
        fi

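        ;;

*)
        # any other product name is not supported by this plug-in
        err "Unsupported system type: ${PNAME}"
        ;;
esac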

# provide output and nagios return value
endscript
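The threshold comparison appears in both branches of the case statement. One possible way to avoid the repetition, sketched here with a hypothetical function name classify_temp, is to move it into a function that sets the state text and the Nagios return value. Like the script, it assumes whole-number readings.

```shell
# classify_temp TEMP WARN CRIT : set STATE and EXIT_STATUS from the thresholds
# (hypothetical helper; same comparisons as in the script)
classify_temp() {
    if [ "$1" -le "$2" ]
    then
        STATE="OK"; EXIT_STATUS=0
    elif [ "$1" -le "$3" ]
    then
        STATE="WARNING"; EXIT_STATUS=1
    else
        STATE="CRITICAL"; EXIT_STATUS=2
    fi
}

# Example with made-up values: 26 C against warning 25 and critical 27
classify_temp 26 25 27
echo "Ambient Temp(C): 26 : ${STATE} (return value ${EXIT_STATUS})"
```

Each case branch would then only need to obtain TEMP and call the function once.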

This executable shell script is saved as amb_temp.sh in the directory /opt/csw/libexec/nagios-plugins/ipmitool on the machine to be monitored.

This article is not about NRPE, but a short summary is needed:

1. The machine that runs Nagios also runs the check_nrpe plug-in.
2. On Solaris, the NRPE service runs on the remote machine (the package from Blastwave.org is cswnrpe).
3. The NRPE service (on the remote machine) runs a specific plug-in to determine the status of a local resource, such as CPU load or the ambient temperature (the script in this article).
4. The NRPE service then sends the plug-in result back to check_nrpe (from step 1).

Now Nagios knows the state of the resource, such as OK or Critical. Nagios doesn't care what the resource is.

With that in mind, you need the following lines in nrpe.cfg (the configuration file for the cswnrpe service) on the machine that is monitored.

# plugin for ambient temperature
command[check_amb_temp]=/opt/csw/libexec/nagios-plugins/ipmitool/amb_temp.sh $ARG1$

The Nagios machine needs a service definition, something like:

define servicegroup{
        servicegroup_name       amb_temp_mvo
        alias                   MVO Ambient Temperature
        }

define service{
        use                             gen-service         ; Name of service template to use
        host_name                       srv-1,srv-2
        servicegroups                   amb_temp_mvo
        service_description             MVO Ambient Temperature
        # "$HOSTNAME$ X Y" is passed as 1 argument to the command, but becomes 3 arguments for the script
        check_command                   check-nrpe!check_amb_temp!"$HOSTNAME$ 25 27" -t 60
        }
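Why one argument behaves like three: NRPE substitutes $ARG1$ into the command line as text and, as far as I know, runs that line through a shell, so the unquoted string is split on spaces. The sketch below shows the same effect in plain shell; count_args and the sample values are illustrative. Note that NRPE only accepts arguments when dont_blame_nrpe=1 is set in nrpe.cfg and the daemon was built with command-argument support.

```shell
# count_args prints how many arguments it received (illustrative helper)
count_args() {
    echo "$#"
}

ARG1="srv-1 25 27"

# Unquoted, as after the $ARG1$ substitution: the shell splits on spaces
SPLIT=`count_args ${ARG1}`
# Quoted: the string stays a single argument
WHOLE=`count_args "${ARG1}"`

echo "unquoted: ${SPLIT} argument(s), quoted: ${WHOLE} argument(s)"
```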

NETWAYS Nagios Grapher

There are many solutions for presenting Nagios data graphically; one of them is NagiosGrapher from NETWAYS. I am not going to describe how to set it up, but here is, in short, how to configure a graph for this plug-in.

Look at the script's function that returns the result to Nagios (endscript): it also provides performance data. This is what NagiosGrapher needs.
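To see what the grapher picks up, the sketch below applies the pattern PerfData=([0-9]*), the one used as graph_perf_regex in the configuration further down, to a sample output line. It uses sed with the equivalent basic regular expression; the sample line is made up to match the format printed by endscript.

```shell
# Sample plug-in output in the format produced by endscript (values are illustrative)
LINE='Host: srv-1 : Ambient Temp(C): 24 : OK | PerfData=24;25;27'

# Keep only the digits that follow "PerfData="
VALUE=`echo "${LINE}" | sed -n 's/.*PerfData=\([0-9]*\).*/\1/p'`
echo "graphed value: ${VALUE}"
```

Only the first number is captured; the warning and critical thresholds after the semicolons are ignored by this pattern.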

After installing NagiosGrapher, check the directory ngraph.d. Say that I monitor the ambient temperature of 2 servers in the Mountain View (MVO) server room. The NagiosGrapher configuration file is:

#NagiosGrapherTemplate for check_amb_temp

# ---------- Help ------------------------------------
# service_name =
#       regular expression used to identify service
#
# graph_perf_regex =
#       regular expression used to find searched value in performance data
#       must be in round brackets ()
#
# graph_value = variable name in rrd database, no empty space
#
# graph_units = units on Y axis, X axis is time
#
# graph_legend = it contains key for variable, shows under graph
#
# page = optional
#
# rrd_plottype = LINE1 is simple line, AREA is filled out surface
#
# -----------------------------------------------

# Amb Temp in MVO
define ngraph{
        service_name            MVO Ambient Temperature
        graph_perf_regex        PerfData=([0-9]*)
        graph_value             amb_temp
        graph_units             C
        graph_legend            MVO Ambient Temperature
        graph_upper_limit       30
        graph_lower_limit       15
        rrd_plottype            LINE2
        rrd_color               FF9900 # orange
        }

# AVERAGE of ambient temperature
define ngraph{
        service_name            MVO Ambient Temperature
        type                    VDEF
        graph_value             vdef_amb_temp_average
        graph_legend            Amb temp Average
        graph_calc              amb_temp,AVERAGE
        rrd_plottype            LINE1
        rrd_color               0000ff
        hide                    no
}

define ngraph{
        service_name            MVO Ambient Temperature
        # HRULE draws horizontal line
        type                    HRULE
        hrule_value             25
        rrd_color               FF0000:Warning level  # red
        }

define ngraph{
        service_name            MVO Ambient Temperature
        type                    HRULE
        hrule_value             27
        rrd_color               000000:Critical level  # black
        }

Here is the weekly graph. Besides this, you'll also have current, daily, monthly and yearly graphs.

There is also a multigraph if you want to compare a service across several systems. For example, I compare the ambient temperature of 6 systems.

# NOTE : it is nmgraph, not ngraph
# -------------------------------
define nmgraph{
       host_name                Multigraph
       service_name             .* DCO.* Ambient Temperature
       # RegEX
       hosts                   [a-zA-Z]+
       # RegEX
       services                  .* DCO.* Ambient Temperature
       # This matches 'graph_value' from the ngraph definition
       graph_values             amb_temp
       # line or stack or area
       graph_type               LINE2
       colors                   f0e68c,fff000,cd5c5c,ffa500,ff0000,ff1493

}

And the graph is:
