Important Concepts

This page lists the important concepts and ideas you should understand to make full use of Opsview.

Hosts and Services

A service is something important to you whose status you want to know. Services can be “active” (checked on a regular basis) or “passive” (they wait to be given data). This document focuses only on “Active” Service Checks.

Every Service (also called a Check or a Service Check) has a status, one line of output and (optionally) some performance data.

Note

Hosts without any services will not be shown in the monitoring status pages.

For more details, see Hosts, Host Groups and Host Check Commands.

States

Services have one of four possible states: OK, WARNING, CRITICAL, or UNKNOWN.

The last three states are collectively called Problem States.

Hosts have one of three possible states: UP, DOWN, or UNREACHABLE.

If a Host is DOWN, then the Services on the Host will be marked as CRITICAL with the summary text of “Dependency failure: Host X is DOWN”. Service checks will no longer be executed until the Host has returned to an UP state.

If a Host is UNREACHABLE, then it will be marked with the summary text of “Dependency failure: Host X is DOWN”. The Host will not be checked again until at least one of its parents is UP. None of the Services on the Host will be checked until the Host returns to an UP state.

Plugins

All active checks use a plugin. The plugin contains the actual logic for checking something and determining its status. For example, a plugin knows how to communicate with a DNS server, how to query free filesystem space, or how to fetch a web page.

Opsview supports Nagios compatible plugins. For more information, see Active Checks.
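To make the plugin idea concrete, here is a minimal sketch of a Nagios-style plugin written in Python. It relies on the common Nagios plugin convention: the exit code carries the state (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and the plugin prints one line of output, optionally followed by "|" and performance data. The function names, the "DISK" label and the thresholds are assumptions made for illustration, not part of Opsview.

```python
import shutil

# Exit codes used by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate_free_space(free_pct, warn_pct=20.0, crit_pct=10.0):
    """Map a free-space percentage to (exit_code, output_line).

    The thresholds are illustrative defaults, not Opsview settings.
    """
    if free_pct < crit_pct:
        code = CRITICAL
    elif free_pct < warn_pct:
        code = WARNING
    else:
        code = OK
    # One line of output; the optional performance data follows the "|".
    line = f"DISK {STATE_NAMES[code]} - {free_pct:.1f}% free | free_pct={free_pct:.1f}%"
    return code, line


def run_plugin(path="/"):
    """Check free space on `path`, print the status line, return the exit code."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        print(f"DISK UNKNOWN - cannot read {path}: {exc}")
        return UNKNOWN
    code, line = evaluate_free_space(100.0 * usage.free / usage.total)
    print(line)
    return code

# A real plugin script would finish with: sys.exit(run_plugin("/"))
```

The threshold logic is kept in a separate pure function so that it can be exercised without touching the filesystem.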

Check Intervals and State Types

The active check for a service is executed at a set frequency (every 5 minutes by default). This is called the check interval.

Usually, services are in an OK state, showing that the service is stable. However, if a problem occurs and the service changes to a different state, we need to have confidence that this is the correct state. We use state types to highlight this confidence factor.

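Services can have one of two state types: soft and hard. A soft state has not yet been confirmed by repeated checks; a hard state has.

There are two important parameters to determine the soft and hard state types: the maximum check attempts and the retry interval.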

The check attempts will be displayed as 3/5, which means this is the third check out of a maximum of five before the state becomes hard.

When a service has gone into a hard state type, then the check attempts will revert to 1.

Note

If a service changes from one problem state to another, the check attempts are reset.

This same logic also applies for an OK state.

The main reason for state types is that notifications are sent on hard states only. This avoids sending notifications for temporary problems.

Notifications

Notifications are sent on hard state changes only. This means notifications will be sent for hosts or services only once they have remained in a particular state for a period of “check attempts * retry interval”.

Notifications are also sent when a host or service returns to a hard UP or OK state. This is called a hard recovery notification.
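The rule described above can be sketched as a small decision function. This is only an illustration of "notify on hard state changes, including recovery"; the function name and the return values "problem" and "recovery" are hypothetical, and the sketch deliberately ignores any suppression rules.

```python
def should_notify(last_hard_state, new_state, state_type):
    """Return the kind of notification to send, or None if none is due.

    `state_type` is "soft" or "hard"; `last_hard_state` is the most recent
    state that was confirmed as hard.
    """
    if state_type != "hard":
        return None  # soft states never trigger a notification
    if new_state == last_hard_state:
        return None  # still the same hard state, so no state change
    if new_state in ("OK", "UP"):
        return "recovery"  # hard recovery notification
    return "problem"
```

For example, a service moving from a hard OK to a soft WARNING yields no notification, while the same WARNING confirmed as hard yields a problem notification.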

Notifications are executed in parallel.

Notifications are suppressed if:

For more information, see Notifications.

Event Handlers

An event handler is an external script that is executed when a check result is returned. There are three possible options:

Event handlers are executed in parallel.

For more information, see Event Handlers.

Lifecycle of a Service

The following shows the lifecycle of a service that transitions from OK to WARNING, then to CRITICAL, and back to OK. It assumes a check interval of 31s, a retry interval of 20s, and a maximum of 3 check attempts:

Time      State     Check Attempt
14:08:08  OK        1/3
14:08:39  WARNING   1/3
14:08:59  WARNING   2/3
14:09:19  WARNING   3/3
14:09:50  WARNING   1/3
14:10:21  CRITICAL  1/3
14:10:41  CRITICAL  2/3
14:11:01  CRITICAL  3/3
14:11:32  CRITICAL  1/3
14:12:03  CRITICAL  1/3
14:12:34  OK        1/3
14:12:54  OK        2/3
14:13:14  OK        3/3
14:13:45  OK        1/3
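The timeline above can be reproduced with a short simulation. This is a sketch of the scheduling rules as they can be inferred from the table (a state change restarts the count as soft, soft states are rechecked at the retry interval, hard states at the check interval); it is not Opsview's actual scheduler, and the function name and the calendar date used in the example are arbitrary.

```python
from datetime import datetime, timedelta


def simulate(results, start, check_interval=31, retry_interval=20, max_attempts=3):
    """Return (time, state, "attempt/max", state_type) for each check result.

    `results` is the sequence of states returned by successive checks; the
    first result is assumed to confirm a state that is already hard.
    """
    now = start
    previous = None
    attempt, hard = 1, True
    rows = []
    for state in results:
        if previous is None:
            attempt, hard = 1, True
        elif state != previous:
            # Any state change restarts the count in a soft state.
            attempt, hard = 1, max_attempts == 1
        elif not hard:
            attempt += 1
            hard = attempt >= max_attempts
        else:
            # Already hard: the displayed attempt reverts to 1.
            attempt = 1
        rows.append((now.strftime("%H:%M:%S"), state,
                     f"{attempt}/{max_attempts}", "hard" if hard else "soft"))
        previous = state
        # Soft states are rechecked sooner than hard ones.
        now += timedelta(seconds=check_interval if hard else retry_interval)
    return rows


results = ["OK"] + ["WARNING"] * 4 + ["CRITICAL"] * 5 + ["OK"] * 4
for row in simulate(results, datetime(2024, 1, 1, 14, 8, 8)):
    print(*row)
```

With these inputs, the printed times, states and check attempts match the rows of the table, with the inferred state type added as a fourth column.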

Dependency Failure

When setting up Hosts and Host Services in Opsview Monitor, it is possible to set up dependency relationships such that the state of one object can affect the state of the objects below it in a dependency tree. To learn more about setting up object dependencies, see Active Checks > Details Tab: Advanced.

Parent/Child relationship example

A Parent and Child relationship is useful, for example, between a Virtual Machine (VM) management server host and the VMs running under that host. In this case the VM management server host would be set as the parent host, and the VMs running on it would be the children. If the parent management host goes DOWN, the child VMs would be set to UNREACHABLE, with a message on the Investigate Window indicating the dependency failure and the parent host responsible. Likewise, the service checks of both parent and child hosts will go into dependency failure and will be set to CRITICAL.

An example of a useful Parent and Child relationship applied to service checks would be one between a parent SNMP agent check and child SNMP interface checks. The parent service checks that the SNMP agent is up, while the children monitor the SNMP interfaces. Should the SNMP agent fail, there is no point in monitoring the interfaces, as these will not be reachable. Thus, by making the SNMP agent check the parent, we ensure that when it goes CRITICAL the monitoring of the child services is halted and they go to an UNKNOWN state.

What triggers a dependency failure?

A dependency failure is triggered on a monitoring object when its direct parent or a parent object higher up the dependency tree changes to a failure state. For a parent Host, this means going into a DOWN or UNREACHABLE state and for a parent Service, this means going into a CRITICAL state.

Dependency failure behavior

Summary Messages and State Change

A monitoring object in dependency failure will have its status information updated to display the reason the object entered this state. When the parent host goes DOWN, the child hosts go into an UNREACHABLE state and the status information on the Investigate Window will show “Dependency failure: {Parent Host Name} is DOWN”.

VM instance showing state change

When the parent host goes DOWN or UNREACHABLE, the service checks go to CRITICAL state and the status information will say “Dependency failure: {Parent Host Name} is DOWN”. Opsview tracks the dependency failure to the highest parent host that caused it, so even service checks belonging to the child hosts will reference that parent as the cause of the dependency failure.

Checker status critical

When a parent service goes CRITICAL, the child services will go to UNKNOWN state and the status information will say “Dependency failure: {Parent Service Name} is CRITICAL”.

Checker status unknown
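The state changes and summary messages described in this section can be summarized as a small lookup function. This is a sketch of the documented behavior only; the function and parameter names are hypothetical, and `cause_name` stands for the highest parent that caused the failure.

```python
def dependency_failure(child_kind, cause_kind, cause_name, cause_state):
    """Return (child_state, summary) for a child object, or None if the
    parent's state does not trigger a dependency failure.

    `child_kind` and `cause_kind` are "host" or "service".
    """
    if cause_kind == "host" and cause_state == "DOWN":
        # Child hosts become UNREACHABLE; service checks become CRITICAL.
        child_state = "UNREACHABLE" if child_kind == "host" else "CRITICAL"
        return child_state, f"Dependency failure: {cause_name} is DOWN"
    if cause_kind == "service" and cause_state == "CRITICAL" and child_kind == "service":
        # Child services of a CRITICAL parent service become UNKNOWN.
        return "UNKNOWN", f"Dependency failure: {cause_name} is CRITICAL"
    return None


print(dependency_failure("service", "service", "SNMP Agent", "CRITICAL"))
```

The host and service names passed in are placeholders; a parent service in a WARNING state returns None because only CRITICAL triggers the failure.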

No active checks

An object under dependency failure will not have any active checks run. This means it will stay under dependency failure until its parent(s) exit their failure states.

Triggering a manual recheck on the object will run the check a single time. When the check is next due to run, the object re-enters the dependency failure state.

Freshness Checking & Stale Service Checks

When enabled, freshness checking will ensure that results have been received recently for service checks. If a result has not recently been received (see Passive Checks > Details Tab > Advanced), then the service check will be marked as Stale.

For active service checks and SNMP checks, freshness checking will only occur in the time period configured for the service check.

For passive checks, freshness checking will only occur in the time period configured for the host that the service check belongs to.
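As a rough illustration of the freshness rule, the sketch below marks a service check as stale when its last result is older than a timeout, but only while the relevant time period is active. The names `is_stale`, `freshness_timeout` and `in_time_period` are assumptions for this example and do not correspond to actual Opsview settings.

```python
from datetime import datetime, timedelta


def is_stale(last_result_at, now, freshness_timeout, in_time_period=True):
    """Return True if no result has been received within `freshness_timeout`.

    `in_time_period` says whether `now` falls inside the time period that
    applies (the service check's for active and SNMP checks, the host's for
    passive checks). Outside that period no freshness checking occurs.
    """
    if not in_time_period:
        return False
    return now - last_result_at > freshness_timeout


now = datetime(2024, 1, 1, 12, 0, 0)
print(is_stale(now - timedelta(minutes=30), now, timedelta(minutes=15)))
```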

Notifications and Event Handlers

Because a Host or Service under Dependency Failure is disabled, no new checks are scheduled for them. This means that no notifications or event handlers will run for these objects.

Recovery from dependency failure

When a host or service recovers (returns to a hard UP/OK state), its child services return to active checking and lose the “Dependency failure:” summary message. When this happens to a service, the state that it was in prior to going into dependency failure is restored as per the following diagram:

Diagram about dependency failure

The restored state will contain the added prefix “Restored status: ”, with the last check time matching the time when the service recovered from dependency failure:

Checker status OK

The restored state will not include any performance data from the prior state:

Checker status information

However, if a passive check result comes in while the service is in dependency failure, then that state will overwrite the dependency failure and saved state, since the result is newer than both of these.
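The restore behavior described above can be sketched as follows. This is a simplified model under stated assumptions: the `ServiceStatus` class and the `recover` function are hypothetical, and a passive result received during the failure is passed in as an argument rather than applied at the moment it arrives.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServiceStatus:
    state: str
    output: str
    perfdata: str = ""


def recover(saved: Optional[ServiceStatus],
            passive_result: Optional[ServiceStatus] = None) -> Optional[ServiceStatus]:
    """Return the status a service shows after leaving dependency failure.

    `saved` is the state stored before the failure; `passive_result` is a
    passive check result received while the service was in failure.
    """
    if passive_result is not None:
        # A newer passive result overwrites both the failure and the saved state.
        return passive_result
    if saved is None:
        # Nothing stored: wait for the next active or passive result.
        return None
    # Restore the prior state with the prefix and without performance data.
    return ServiceStatus(saved.state, "Restored status: " + saved.output, perfdata="")


before = ServiceStatus("WARNING", "Load is 4.2", "load=4.2")
print(recover(before))
```

The example status values are invented; the point is that the restored output carries the prefix while the performance data is dropped.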

Note

If a host moves to another cluster while it has service checks in dependency failure, they will lose any stored state information and will be unable to restore in this manner until an active check next occurs, or a passive result is received.