Business Service Monitoring
Business Service Monitoring, or BSM, gives you a much-enhanced view of your IT infrastructure, as opposed to looking at it on a Host-by-Host basis. Your monitoring solution will understand resiliency, service/operational availability (SLA/OLA), and more.
BSM is the advanced monitoring of Hosts and Services in Opsview Monitor: you can group similar Hosts into Components, set resiliency on these Components, and combine multiple Components to form a top-level Business Service.
You can also undertake actions against every layer of the Business Service from the user interface, such as setting downtime, acknowledging problems, adding comments against Business Services, etc.
The purpose of BSM is to give real-world views where Hosts are grouped together into Components (e.g. h-scaled clusters), and Components are then grouped together to form the overall Business Service unit (e.g. all the Hosts and Services that need to run so that, for example, my website is operational).
Using BSM, you can have multiple Hosts of the same type grouped together in Components, and if you have configured the resiliency level correctly, you can allow one Host to fail but the Component to still be operational.
A Business Service can then be composed of multiple Components, each with its own resiliency level, giving a true end-to-end view of the Business Service, as shown below.
You can then undertake operations at a Component and a BSM level, such as:
- See who or which team is responsible for it.
- Add notes to the BSM/Component.
- Schedule downtime against an entire Component.
- Acknowledge all issues in a Component.
- View events related to this BSM/Component.
- Send alerts at a Component/BSM level, e.g. “I only want to know when a Component is critical, not when a Host in it has failed”, because resiliency is in place.
- Be alerted when a BSM’s availability has gone below a certain level, i.e. it is below your agreed SLA/OLA.
- Run historical reports against BSMs and Components, automated or manual, and have them emailed in your company’s branding to yourself or a customer at a predetermined time and date.
For example, in an Opsview Monitor system you can have a Business Service for a website called ‘Website.com’ with six Components, including an ‘Apache Servers’ cluster, a ‘Linux cluster’, etc.
With BSM, you can now monitor and display your entire stack in a single view, so you can see that ‘one Host has failed in the Linux cluster; it hasn’t affected my website yet, but I will need to fix that soon’.
Business Service Monitoring takes existing Hosts, Services and Host templates and lets you create a hierarchy of Components and Business Services. This hierarchy shows the relationship between Hosts and the Business Services they support, and provides availability (SLA/OLA) at each layer, reporting, notifications, access control and more.
Note that for the purposes of BSM, the Host consists of the Service Checks related to the Host template used by the Component — the Host state is not taken into account.
The Host can be one of three calculated states:
- OPERATIONAL: if no Service Checks are in a CRITICAL state.
- FAILED: at least one Service Check is in a CRITICAL state. Service Checks in downtime are ignored.
- DOWNTIME: all CRITICAL Service Checks are in a DOWNTIME state.
Additionally, there is one calculated flag:
- ACKNOWLEDGED: if all the CRITICAL service checks are acknowledged, then the Component Host is acknowledged.
The soft or hard state of the Service Check is not considered — the latest state is always used.
If you set DOWNTIME and there are no failed services, then an operational state is used. This is to cover scenarios where downtime of two hours is recorded, but only 15 minutes is used. This allows the Host to be marked as DOWNTIME only during the time there were actual failures.
It is possible that for a Host, the Service Checks are UNKNOWN yet the Host is DOWN. From a BSM perspective, the Host is considered to be OPERATIONAL because there are no CRITICAL Service Checks. This would be an error in the configuration as the Service Check should be CRITICAL to show a severe error.
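The Host state rules above can be sketched in a few lines of Python. This is only an illustration of the rules as described here, not Opsview Monitor's actual implementation: the `ServiceCheck` class and both function names are hypothetical, and the sketch assumes the ACKNOWLEDGED flag is only set when there is at least one CRITICAL Service Check to acknowledge.

```python
from dataclasses import dataclass


@dataclass
class ServiceCheck:
    """Hypothetical record of the latest result of one Service Check."""
    state: str                  # "OK", "WARNING", "CRITICAL" or "UNKNOWN"
    in_downtime: bool = False
    acknowledged: bool = False


def host_state(checks):
    """Return the calculated BSM state of a Component Host."""
    critical = [c for c in checks if c.state == "CRITICAL"]
    if not critical:
        # No CRITICAL checks: OPERATIONAL, even if downtime is set
        # or every check is UNKNOWN.
        return "OPERATIONAL"
    if all(c.in_downtime for c in critical):
        # Every CRITICAL check is covered by downtime.
        return "DOWNTIME"
    return "FAILED"


def host_acknowledged(checks):
    """Return the ACKNOWLEDGED flag of a Component Host."""
    critical = [c for c in checks if c.state == "CRITICAL"]
    # Assumption: with no CRITICAL checks there is nothing to acknowledge.
    return bool(critical) and all(c.acknowledged for c in critical)
```

Only the latest state of each check is passed in, which mirrors the rule that soft and hard states are not distinguished.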
The Component state is calculated from the Component Host states and can be one of three calculated states:
- OPERATIONAL: if no Hosts are failed, or there are enough Hosts to satisfy the operational level, then the Component is operational.
- DOWNTIME: all Hosts are in a DOWNTIME state.
- FAILED: otherwise, the Component is failed.
Additionally, there are two calculated flags:
- ACKNOWLEDGED: if all the failed Hosts are acknowledged, then the Component is acknowledged.
- IMPACTED: if the state is operational and there is at least one failed Host, then the Component is impacted.
The operational zone percent is calculated as (hosts_required_online / hosts_total) × 100. If there are not enough operational Hosts, then the Component is failed. Hosts in DOWNTIME are not counted, but have the effect of making failed Hosts more important.
Note: Due to the operational zone percentage, it is possible that a Component is in an operational state with failed Hosts. If those failed Hosts are acknowledged, then the Component will also be acknowledged, so you could have an acknowledged icon on an operational Component.
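As a rough sketch, the Component calculation could look like the following Python. The `HostStatus` class and function names are hypothetical. The sketch also takes one possible reading of "Hosts in DOWNTIME are not counted": they are left out of both the operational count and the total, so each remaining Host carries a larger share. The real product may evaluate this differently.

```python
from dataclasses import dataclass


@dataclass
class HostStatus:
    """Hypothetical calculated state of one Component Host."""
    state: str                  # "OPERATIONAL", "FAILED" or "DOWNTIME"
    acknowledged: bool = False


def operational_zone_percent(hosts_required_online, hosts_total):
    """(hosts_required_online / hosts_total) x 100, as described in the text."""
    return hosts_required_online / hosts_total * 100


def component_status(hosts, zone_percent):
    """Return (state, impacted, acknowledged) for a Component."""
    failed = [h for h in hosts if h.state == "FAILED"]
    # Assumption: Hosts in DOWNTIME are excluded from the calculation.
    counted = [h for h in hosts if h.state != "DOWNTIME"]
    if hosts and not counted:
        state = "DOWNTIME"      # every Host is in DOWNTIME
    elif not failed:
        state = "OPERATIONAL"
    else:
        online = sum(1 for h in counted if h.state == "OPERATIONAL")
        online_percent = online / len(counted) * 100
        # Small tolerance so that 2 of 3 Hosts satisfies a 2-of-3 zone.
        enough = online_percent + 1e-9 >= zone_percent
        state = "OPERATIONAL" if enough else "FAILED"
    impacted = state == "OPERATIONAL" and bool(failed)
    acknowledged = bool(failed) and all(h.acknowledged for h in failed)
    return state, impacted, acknowledged
```

For example, a three-Host Component that needs two Hosts online has an operational zone of about 66.7%. With one failed, acknowledged Host it stays OPERATIONAL but is flagged both IMPACTED and ACKNOWLEDGED, which is the situation the note above describes.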
The Business Service state can be one of three calculated states:
- OFFLINE: at least one Component has failed.
- DOWNTIME: at least one Component is in a DOWNTIME state.
- OPERATIONAL: otherwise, everything is working fine.
Additionally, there are two calculated flags:
- ACKNOWLEDGED: if all the impacted and failed Components are acknowledged, then the Business Service is acknowledged.
- IMPACTED: the Business Service is impacted if any Component is impacted. This means it is possible to have a Business Service in an OFFLINE state and be impacted.
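The Business Service rules can be sketched the same way. The `ComponentStatus` class and function name below are hypothetical. The sketch assumes that the states are checked in the order listed, so OFFLINE takes precedence when one Component has failed and another is in DOWNTIME. The text does not state that precedence explicitly.

```python
from dataclasses import dataclass


@dataclass
class ComponentStatus:
    """Hypothetical calculated status of one Component."""
    state: str                  # "OPERATIONAL", "FAILED" or "DOWNTIME"
    impacted: bool = False
    acknowledged: bool = False


def business_service_status(components):
    """Return (state, impacted, acknowledged) for a Business Service."""
    states = [c.state for c in components]
    if "FAILED" in states:
        state = "OFFLINE"       # assumed to take precedence over DOWNTIME
    elif "DOWNTIME" in states:
        state = "DOWNTIME"
    else:
        state = "OPERATIONAL"
    impacted = any(c.impacted for c in components)
    # Components that need acknowledging: impacted or failed ones.
    problems = [c for c in components if c.impacted or c.state == "FAILED"]
    acknowledged = bool(problems) and all(c.acknowledged for c in problems)
    return state, impacted, acknowledged
```

A Business Service with one failed Component and one operational but impacted Component comes out as OFFLINE and IMPACTED at the same time, matching the case mentioned above.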
The diagram below shows when a user will or will not receive a notification based on status changes: