Solaris Fault Manager

Fault Manager is part of the Solaris self-healing functionality, which provides fault isolation and component restart. Fault Manager
covers hardware components (SMF takes care of software components).

Make sure that the required packages are installed and that the service is running.
# pkginfo |grep fmd
system SUNWfmd Fault Management Daemon and Utilities
system SUNWfmdr Fault Management Daemon and Utilities (Root)

# svcs fmd
STATE STIME FMRI
online Jun_29 svc:/system/fmd:default

Display the Fault Manager configuration:
# fmadm config
MODULE VERSION STATUS DESCRIPTION
cpumem-diagnosis 1.6 active CPU/Memory Diagnosis
cpumem-retire 1.1 active CPU/Memory Retire Agent
eft 1.16 active eft diagnosis engine
fmd-self-diagnosis 1.0 active Fault Manager Self-Diagnosis
io-retire 1.0 active I/O Retire Agent
sysevent-transport 1.0 active SysEvent Transport Agent
syslog-msgs 1.0 active Syslog Messaging Agent
zfs-diagnosis 1.0 active ZFS Diagnosis Engine
zfs-retire 1.0 active ZFS Retire Agent

For example, the kernel sends an error to the fault manager daemon (fmd), and fmd forwards the error to a module. There are two types of module:

1. Diagnosis engines : provide a diagnosis based on symptoms
2. Agents : respond to a given diagnosis and take action, for example taking a faulty CPU offline.

The fault manager maintains two log files:

1. error log - list of errors sent to the fault manager daemon
2. fault log - list of diagnosed and repaired problems

View the fault log with:

# fmdump

View the error log with:

# fmdump -e

Useful fmdump options:
-u - limits the output to a specific event UUID
-T - displays events that occurred at or BEFORE a specific time, e.g. yyyy-mm-dd
-t - displays events that occurred at or AFTER a specific time, e.g. yyyy-mm-dd
-V - very verbose output (-v gives a shorter verbose form)
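
The options above can be combined on one command line. The following is a minimal sketch, not a tested procedure: errors_between and fault_detail are hypothetical helper names, and the dates and UUID are whatever you pass in.

```shell
# Hypothetical helper: verbose error-log events inside a time window.
# $1 = start date (events at or after it), $2 = end date (events at or before it),
# both in yyyy-mm-dd form.
errors_between() {
    fmdump -e -V -t "$1" -T "$2"
}

# Hypothetical helper: verbose fault-log entry for one event UUID.
fault_detail() {
    fmdump -V -u "$1"
}
```

For example, "fault_detail 2578e639-38cd-4cd8-9c16-87e96116f41e" would show the fault-log entry for the event used in the fmadm faulty example on this page.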

Run the command below to see whether Fault Manager reports any failed resources.

In this example we see that memory module DIMM 3 on CPU 1 is faulty.
# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME EVENT-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Jun 23 02:30:30 2578e639-38cd-4cd8-9c16-87e96116f41e AMD-8000-2F Major

Fault class : fault.memory.dimm_sb
Affects : mem:///motherboard=0/chip=1/memory-controller=0/dimm=3/rank=0
degraded but still in service
FRU : "CPU 1 DIMM 3" (hc://:product-id=Sun-Fire-X4200-Server:chassis-id=0000000000:server-id=oryx/motherboard=0
/chip=1/memory-controller=0/dimm=3)

Description : The number of errors associated with this memory module has
exceeded acceptable levels. Refer to
http://sun.com/msg/AMD-8000-2F for more information.

Response : Pages of memory associated with this memory module are being
removed from service as errors are reported.

Impact : Total system memory capacity will be reduced as pages are
retired.

Action : Schedule a repair procedure to replace the affected memory
module. Use fmdump -v -u <EVENT_ID> to identify the module.

Note the link in the Description field. It leads to more information (like a knowledge base article), including the recommended resolution.
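
The EVENT-ID is needed again for fmdump -u and fmadm repair, so it can be handy to pull it out of the fmadm faulty output. A minimal sketch, assuming the summary line is laid out as in the example above (time in the first three fields, a 36-character UUID in the fourth, MSG-ID in the fifth); list_fault_ids is a hypothetical helper name, not a Solaris command:

```shell
# Hypothetical helper: read `fmadm faulty` output on stdin and print
# "EVENT-ID MSG-ID" for every fault summary line.
# Assumption: the EVENT-ID is the fourth whitespace-separated field and is
# a 36-character string of lowercase hex digits and hyphens.
list_fault_ids() {
    awk 'length($4) == 36 && $4 ~ /^[0-9a-f-]+$/ { print $4, $5 }'
}
```

On the affected system it would be used as "fmadm faulty | list_fault_ids". The header and separator lines are skipped because their fourth field is not 36 characters long.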

Okay, so say you are replacing the DIMM now.

Once the DIMM is replaced, you need to update the resource cache to indicate that there is no longer an issue.
# fmadm repair 2578e639-38cd-4cd8-9c16-87e96116f41e
fmadm: recorded repair to 2578e639-38cd-4cd8-9c16-87e96116f41e

Reset the relevant Fault Manager module.

If you don't know which one, the web link mentioned earlier will tell you.
# fmadm reset eft
fmadm: eft module has been reset

Verify that there are no more faulty resources.
# fmadm faulty

No output, super! That means Fault Manager has no more hardware faults to report.
