OCI compute monitoring (metrics and alarm)

Oracle cloud agent

Requirements
Compute - Instance - Instance details - Metrics
Compute - Instance - Instance details - OS Management

Overview
Monitored resources
Top processes

Custom compute metrics

Disk usage
Service state

Metric query
Metric alarm

Oracle cloud agent

Oracle Cloud Agent is a process (or service) that manages plugins running on compute. Plugins collect performance metrics, install OS updates, and perform other instance management tasks.

Requirements

The Oracle Cloud Agent software (rpm: oracle-cloud-agent, repo: ol#_oci_included) must be installed on the instance. It runs as the oracle-cloud-agent service.

Compute - Instance - Instance details - Metrics

These Ansible tasks list the service metrics in the oci_computeagent namespace.

---
- name: Collect metrics
  oracle.oci.oci_monitoring_metric_actions:
    compartment_id: "your-compartment-ocid"
    action: list
    namespace: oci_computeagent
    dimension_filters: { "resourceId": "your-compute-instance-ocid" }
  register: result

- name: Set list of indexes (number of metrics)
  ansible.builtin.set_fact:
    l_index: "{{ range(result.metric | length) | list }}"

- name: Show metrics
  ansible.builtin.debug:
    msg:
      - "{{ result.metric[index].name }}"
  loop: "{{ l_index }}"  # loop through list indexes
  loop_control:
    index_var: index
...

# The role's tests main.yml looks like:
---
- name: Test role
  connection: local
  hosts: localhost
  roles:
    - role: ../../oci-metrics
...

# Run role: ansible-playbook -i inventory test.yml

# Expect ten metrics in result:

ok: [localhost] => (item=0) => "CpuUtilization"
ok: [localhost] => (item=1) => "DiskBytesRead"
ok: [localhost] => (item=2) => "DiskBytesWritten"
ok: [localhost] => (item=3) => "DiskIopsRead"
ok: [localhost] => (item=4) => "DiskIopsWritten"
ok: [localhost] => (item=5) => "LoadAverage"
ok: [localhost] => (item=6) => "MemoryAllocationStalls"
ok: [localhost] => (item=7) => "MemoryUtilization"
ok: [localhost] => (item=8) => "NetworksBytesIn"
ok: [localhost] => (item=9) => "NetworksBytesOut"
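
The same listing can be sketched with the OCI Python SDK. This is a minimal sketch, not the role's implementation: it assumes credentials in ~/.oci/config and that "resourceId" is the dimension carrying the instance OCID. The helper names are hypothetical. The listing can return one entry per metric name and dimension combination, so names are de-duplicated.

```python
from types import SimpleNamespace


def unique_metric_names(metrics):
    """Return sorted, de-duplicated metric names from metric definitions."""
    return sorted({m.name for m in metrics})


def list_agent_metric_names(compartment_id, instance_id):
    """List oci_computeagent metric names for one instance (needs OCI credentials)."""
    import oci  # imported lazily so the helper above works without the SDK

    config = oci.config.from_file()  # default profile in ~/.oci/config
    client = oci.monitoring.MonitoringClient(config)
    details = oci.monitoring.models.ListMetricsDetails(
        namespace="oci_computeagent",
        # Assumption: "resourceId" is the dimension that holds the instance OCID
        dimension_filters={"resourceId": instance_id},
    )
    # Follow pagination so no metric definition is missed
    response = oci.pagination.list_call_get_all_results(
        client.list_metrics, compartment_id, details
    )
    return unique_metric_names(response.data)


if __name__ == "__main__":
    # Offline demonstration with stand-in objects instead of a live API call
    sample = [
        SimpleNamespace(name=n)
        for n in ("LoadAverage", "CpuUtilization", "LoadAverage")
    ]
    print(unique_metric_names(sample))
```

Calling list_agent_metric_names("your-compartment-ocid", "your-compute-instance-ocid") should return the same names as the Ansible tasks, given the same compartment and instance.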

Compute - Instance - Instance details - OS Management

Overview

OS Management manages updates and patches on the compute instance. It shows the available updates by type: security, bug fix, enhancement, and other.

Monitored resources

Predefined applications are discovered automatically and then monitored for CPU and memory usage (certified applications such as Apache v2.4 and MySQL v5.7+).
OSMS config folder: /etc/oracle-cloud-agent/plugins/osms (files present: config.yml, systemid, up2date)

Top processes

Lists the top processes by CPU and memory utilization, similar to the Linux top command.

Custom compute metrics

Custom metrics can be sent to the OCI Monitoring service (Observability & Management - Monitoring - Metrics Explorer).
One use case is migrating existing monitoring plugins (such as Nagios or NRPE checks) to OCI metrics.
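
When migrating a Nagios-style plugin, the numeric values usually sit in the performance data after the "|" in the plugin output, in the form label=value[UOM];warn;crit;min;max. The sketch below extracts label, value and unit so they can be sent as a metric. The function name parse_perfdata is hypothetical, and the parser only covers the common cases.

```python
import re
import shlex

# Leading number, then an optional unit up to the first ';'
_VALUE_RE = re.compile(r"^(-?\d+(?:\.\d+)?)([^;]*)")


def parse_perfdata(plugin_output):
    """Return {label: (value, unit)} from the part after '|' in plugin output."""
    if "|" not in plugin_output:
        return {}
    perfdata = plugin_output.split("|", 1)[1]
    metrics = {}
    for token in shlex.split(perfdata):  # shlex keeps quoted labels together
        label, sep, rest = token.rpartition("=")
        match = _VALUE_RE.match(rest)
        if not sep or not match:
            continue  # skip anything that is not label=value
        metrics[label] = (float(match.group(1)), match.group(2))
    return metrics


if __name__ == "__main__":
    line = "DISK OK - free space: / 3326 MB (56%) | /=2643MB;5948;5958;0;5968"
    print(parse_perfdata(line))
```

The label maps naturally to a metric dimension and the value to the datapoint. Thresholds (warn, crit) are dropped here because they belong in the alarm definition.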

Disk usage

This Python script ingests a disk usage metric into the OCI Monitoring service.

#!/bin/python3

import socket
import argparse
import oci
import psutil
from datetime import datetime

# arguments are location (on-prem or OCI region) and partition (ex, /, /boot)
parser = argparse.ArgumentParser(description=f"OCI metric disk_usage for {socket.gethostname()}")
parser.add_argument("--location", help="Location", required=True)
parser.add_argument("--partition", help="Partition", required=True)
args = parser.parse_args()
location = args.location
partition = args.partition

comp_ocid = "...your-compartment-ocid..."
hostname = socket.gethostname()
metric_name = "disk_usage"
metric_namespace = "your-team"  # ex. custom metrics for your team

# Use default config file ~/.oci/config
config = oci.config.from_file()

# create monitoring service client
monitoring_client = \
    oci.monitoring.MonitoringClient(config, service_endpoint="https://telemetry-ingestion..oraclecloud.com")


def get_disk_usage():
    """ Return the used-space percentage of the partition given on the command line """
    return psutil.disk_usage(partition).percent

# Metric value (disk usage) for a specific timestamp
datapoints_info = \
    [oci.monitoring.models.Datapoint(timestamp=f"{datetime.utcnow().isoformat('T')}Z", value=get_disk_usage())]

# A metric object and its details
metric_data_details = oci.monitoring.models.MetricDataDetails(compartment_id=comp_ocid, datapoints=datapoints_info,
                                                              dimensions={"location": f"{location}",
                                                                          "hostname": f"{hostname}",
                                                                          "partition": f"{partition}"},
                                                              name=f"{metric_name}",
                                                              namespace=f"{metric_namespace}")

# Metric object to be posted to Monitoring service
post_metric_data_details = oci.monitoring.models.PostMetricDataDetails(metric_data=[metric_data_details])

# Publishes metric data to Monitoring service.
post_metric_data_response = monitoring_client.post_metric_data(post_metric_data_details)

Example hourly cron jobs that ingest data for / and /boot into the metric service; the location is the IAD region.

15 * * * * disk-usage.py --location IAD --partition /
15 * * * * disk-usage.py --location IAD --partition /boot
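
Because metric_data in the script above is a list, several partitions could be sent in one call instead of one cron entry per partition. The sketch below only builds the per-partition entries as plain dicts. It swaps psutil for the standard library's shutil.disk_usage, whose percentage can differ slightly from psutil's. The names build_entries and disk_usage_percent are hypothetical.

```python
import shutil
import socket
from datetime import datetime, timezone


def disk_usage_percent(partition):
    """Used space of a partition in percent, via the standard library."""
    usage = shutil.disk_usage(partition)
    return round(usage.used / usage.total * 100, 1) if usage.total else 0.0


def build_entries(partitions, location, hostname=None, usage_fn=disk_usage_percent):
    """Build one plain-dict metric entry per partition, all with the same timestamp."""
    hostname = hostname or socket.gethostname()
    now = datetime.now(timezone.utc)  # timezone-aware UTC timestamp
    return [
        {
            "name": "disk_usage",
            "dimensions": {"location": location, "hostname": hostname, "partition": p},
            "timestamp": now,
            "value": usage_fn(p),
        }
        for p in partitions
    ]


if __name__ == "__main__":
    for entry in build_entries(["/"], location="IAD"):
        print(entry["dimensions"]["partition"], entry["value"])
```

Each dict carries the fields the script passes to Datapoint and MetricDataDetails (name, dimensions, timestamp, value), so converting the list and posting it once is a small step.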

Service state

For the service state the script is similar; the function that returns the value is:

import subprocess

#
# OCI metrics only accept numeric values, so the state string returned by
# the check is mapped to a number: 0 = active, 1 = anything else
#
def check_service_state(service_name):
    try:
        # systemctl is-active exits non-zero when the service is not active
        subprocess.check_output(['systemctl', 'is-active', service_name])
        return 0
    except subprocess.CalledProcessError:
        return 1

Example hourly cron job for the mysqld service; the location is the IAD region.

15 * * * * service-state.py --location IAD --service mysqld
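
If more detail than up/down is wanted, the state string printed by systemctl is-active can be mapped to several numeric codes. This is only a sketch: the code values are hypothetical choices, and 0 is kept for the healthy state so that a "not 0" alarm condition still works.

```python
import subprocess

# Hypothetical numeric codes chosen for this sketch; 0 means healthy
STATE_CODES = {
    "active": 0,
    "reloading": 0,
    "activating": 1,
    "deactivating": 1,
    "inactive": 2,
    "failed": 3,
}
UNKNOWN_CODE = 4  # any state string not listed above


def state_to_code(state):
    """Map a systemd state string to a number that can be sent as a metric."""
    return STATE_CODES.get(state.strip().lower(), UNKNOWN_CODE)


def service_state_code(service_name):
    """Run systemctl is-active and return the numeric code of its output."""
    try:
        # The exit status is non-zero for non-active units, so it is not checked
        result = subprocess.run(
            ["systemctl", "is-active", service_name],
            capture_output=True,
            text=True,
        )
    except FileNotFoundError:
        return UNKNOWN_CODE  # systemctl is not available on this host
    return state_to_code(result.stdout)
```

service_state_code("mysqld") can replace check_service_state in the script if the extra states are useful in charts.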

Metric query

This example script queries the disk usage of a partition. The arguments are the partition and the start and end dates of the metric window; the OCI compartment and namespace are hard-coded.

#!/bin/python3
# Returns aggregated data from query.
from datetime import datetime
import oci
import argparse

parser = argparse.ArgumentParser(description="Disk usage metrics")
parser.add_argument("--partition", help="Partition", required=True)
parser.add_argument("--start", help="Start time for metric yyyy-mm-dd", required=True)
parser.add_argument("--end", help="End time for metric yyyy-mm-dd", required=True)
args = parser.parse_args()
partition = args.partition
start = args.start
end = args.end

# Use default config file ~/.oci/config
config = oci.config.from_file()

# Initialize service client with default config file
monitoring_client = oci.monitoring.MonitoringClient(config)

comp_id = "ocid1.compartment.oc1..your-id"
namespace = "your-namespace"
query = f"disk_usage[1h]{{partition = \"{partition}\"}}.mean()"

summarize_metrics_data_response = monitoring_client.summarize_metrics_data(
    compartment_id=f"{comp_id}",
    summarize_metrics_data_details=oci.monitoring.models.SummarizeMetricsDataDetails(
        namespace=f"{namespace}",
        query=f"{query}",
        start_time=f"{start}T00:00:00+00",
        end_time=f"{end}T00:00:00+00"))

# Get the data from response
print(summarize_metrics_data_response.data)

The result example is:

$ python3 summarize-disk-usage.py --partition / --start 2023-06-19 --end 2023-06-21

[{
  "aggregated_datapoints": [
    {
      "timestamp": "2023-06-19T00:00:00+00:00",
      "value": 17.4
    },
    
    -- shortened --
    {
      "timestamp": "2023-06-21T00:00:00+00:00",
      "value": 17.0
    }
  ],
  "compartment_id": "ocid1.compartment.oc1...a",
  "dimensions": {
    "hostname": "myhostname.domain.com",
    "partition": "/"
  },
  "metadata": {},
  "name": "disk_usage",
  "namespace": "your-namespace",
  "resolution": null,
  "resource_group": null
}]
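
To reduce such a result to a few numbers, the datapoints can be summarized per host and partition. The sketch below assumes the response data has already been converted to plain dicts shaped like the output above (the SDK's oci.util.to_dict helper is one way to do that). The function name summarize_datapoints is hypothetical.

```python
def summarize_datapoints(metric_data):
    """Return min, max and latest value per (hostname, partition) stream."""
    summary = {}
    for stream in metric_data:
        # Order by time so the last element is the most recent datapoint
        points = sorted(stream["aggregated_datapoints"], key=lambda p: p["timestamp"])
        if not points:
            continue
        values = [p["value"] for p in points]
        dims = stream["dimensions"]
        key = (dims.get("hostname"), dims.get("partition"))
        summary[key] = {"min": min(values), "max": max(values), "latest": values[-1]}
    return summary


if __name__ == "__main__":
    # The two datapoints shown in the shortened result above
    data = [{
        "aggregated_datapoints": [
            {"timestamp": "2023-06-19T00:00:00+00:00", "value": 17.4},
            {"timestamp": "2023-06-21T00:00:00+00:00", "value": 17.0},
        ],
        "dimensions": {"hostname": "myhostname.domain.com", "partition": "/"},
    }]
    print(summarize_datapoints(data))
```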

Metric alarm

To create an alarm: Metrics Explorer - Query - Create alarm, then set Name, Severity (crit/warn), Body, Trigger rule, Destination (the Notification service sends the messages; a Topic is the communication channel that delivers a message to its subscriptions), and Repeat. Example of an Ansible task to get alarm info:

---
- name: Get alarm mysqld-down
  oracle.oci.oci_monitoring_alarm_facts:
    alarm_id: "your-alarm-ocid"
...

# The expected result looks like:

    "alarms": [
        {
            "body": "Mysqld service is not online. ",
            "display_name": "mysql-down",
            "is_enabled": true,
            "lifecycle_state": "ACTIVE",
            "namespace": "your-custom-team-namespace",
            "query": "service_state[1h]{service = \"mysqld\"}.mean() not in (0, 0)",
            "severity": "CRITICAL",
            "suppression": null
        }
    ]
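
The query strings on this page (metric query and alarm) share one shape: metric[interval]{dimension = "value"}.statistic(), optionally followed by a trigger predicate. A small helper can build them consistently. This is a sketch; build_mql is a hypothetical name, and it does not escape quotes inside dimension values.

```python
def build_mql(metric, interval, dimensions=None, statistic="mean", predicate=""):
    """Build a query string such as disk_usage[1h]{partition = "/"}.mean()."""
    filters = ", ".join(f'{key} = "{value}"' for key, value in (dimensions or {}).items())
    # Leave the braces out when there is nothing to filter on
    selector = f"{{{filters}}}" if filters else ""
    query = f"{metric}[{interval}]{selector}.{statistic}()"
    return f"{query} {predicate}".strip()


if __name__ == "__main__":
    print(build_mql("disk_usage", "1h", {"partition": "/"}))
    print(build_mql("service_state", "1h", {"service": "mysqld"}, predicate="not in (0, 0)"))
```

The two calls reproduce the disk usage query and the mysqld alarm query used above.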
