Automation in IT monitoring

Overview

Warning

This documentation refers to tools and platforms, such as Puppet, Chef, and NRPE, that may have changed or become outdated in the latest Opsview release. For the latest information and guidance, refer to the Concept of IT Monitoring or contact ITRS Support.

Automation refers to the use of technology, such as machines and control systems, to make processes more efficient and productive.

In a monitoring context, automation lets Opsview react automatically when specific conditions or criteria are met. For example, when something goes wrong, Opsview can attempt to fix it automatically (proactive monitoring) and alert users via notifications; it can also create a ticket in a service desk and assign it to a queue (service desk connectors).

These scenarios are examples of problem automation: event-driven actions that are triggered when something happens in the system. For example, when the DHCP service on WINSRV003 stops, Opsview sends an email notification or a push notification via Opsview Mobile. If configured to do so, it can also execute an event handler that restarts the service.

Conversely, operational automation takes place without any problem or error occurring. This involves items such as deployment tool integration. For example, when a new host is deployed via Puppet, it can be configured to be added to your monitoring tool automatically.

Additionally, reports can be scheduled in Opsview to be automatically sent at certain times and stopped on certain dates. Monitoring solutions within the platform can also be configured to automatically back up network device configurations, such as those of Cisco routers. This way, if a device is stolen or damaged, a fresh copy of the configuration can be retrieved from the monitoring system and loaded onto new hardware to bring the site back up, which minimizes downtime.

Problem automation

Alerting

When a problem occurs on your monitored equipment, you will want to be notified just in case you are not watching the display when the event happens, or it is 2 AM and you are the stand-by engineer.

This is where notifications play a significant role. Most modern APM or IT monitoring systems allow users to set up alerts so that they receive an email when a problem occurs. More advanced solutions let users choose from multiple notification methods and specify which hosts, or which services on those hosts, they are alerted about.

In Opsview, notification profiles allow users to specifically choose the alerts they receive, which includes the schedule, methods, and objects that are within the user’s permissions.

New Notification Profiles window

In the preceding example, notifications are sent by email, during work hours only, for network devices that are down or unreachable and for service checks on those hosts that are critical.

This automation allows administrators to be informed immediately about the problems they want to be aware of, while avoiding notification spam for every state change.
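The filtering that a notification profile performs can be sketched in a few lines of Python. This is only an illustration of the idea, not Opsview's implementation: the `Event` and `NotificationProfile` classes, the `network-devices` group name, and the 09:00 to 17:30 Monday-to-Friday window are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class Event:
    """A state change reported by the monitoring system (hypothetical shape)."""
    kind: str          # "host" or "service"
    state: str         # e.g. "down", "unreachable", "critical", "warning"
    host_group: str    # e.g. "network-devices"
    when: datetime


@dataclass(frozen=True)
class NotificationProfile:
    """Hypothetical stand-in for a notification profile."""
    host_groups: frozenset
    host_states: frozenset
    service_states: frozenset
    start: time
    end: time
    weekdays: frozenset  # 0 = Monday ... 6 = Sunday
    method: str = "email"

    def matches(self, event: Event) -> bool:
        """Return True when the event should trigger a notification."""
        if event.host_group not in self.host_groups:
            return False
        wanted = self.host_states if event.kind == "host" else self.service_states
        if event.state not in wanted:
            return False
        in_hours = self.start <= event.when.time() < self.end
        return in_hours and event.when.weekday() in self.weekdays


# Assumed work hours: 09:00 to 17:30, Monday to Friday.
work_hours_profile = NotificationProfile(
    host_groups=frozenset({"network-devices"}),
    host_states=frozenset({"down", "unreachable"}),
    service_states=frozenset({"critical"}),
    start=time(9, 0),
    end=time(17, 30),
    weekdays=frozenset(range(5)),
)
```

With this profile, a router going down at 10:00 on a Monday matches, while a warning-level service check or a 2 AM outage does not, so the out-of-hours event would need a separate stand-by profile.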

Service desks

Some higher-end APM and IT monitoring solutions can integrate with service desks through an API to automatically create tickets when problems occur.

This removes the need for an administrator to view the problem and raise a ticket manually. With service desk automation, Opsview automatically creates the ticket in the service desk group or desired project, depending on the CRM or service desk in use, and adds the ticket number as a comment next to the original error within the system. Administrators who log in later can view the details of the ticket, which avoids duplicate tickets.

This also allows administrators to keep an eye on the ticket queue specific to the monitoring system (the Opsview Monitor queue) and work on raised tickets based on the criteria defined during setup.
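The ticketing flow described above can be sketched as follows. The `InMemoryServiceDesk` class is a hypothetical stand-in for a real service desk API client, and `Problem`, `raise_ticket`, and the `TICKET-n` numbering are invented for illustration; a real connector would call the service desk's own API instead.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Optional


@dataclass
class Problem:
    """A failed check as seen in the monitoring system (hypothetical shape)."""
    host: str
    service: str
    output: str
    comments: list = field(default_factory=list)
    ticket_id: Optional[str] = None


class InMemoryServiceDesk:
    """Hypothetical stand-in for a real service desk API client."""

    def __init__(self):
        self._ids = count(1)
        self.tickets = {}

    def create_ticket(self, queue, summary, description):
        ticket_id = f"TICKET-{next(self._ids)}"
        self.tickets[ticket_id] = {
            "queue": queue,
            "summary": summary,
            "description": description,
        }
        return ticket_id


def raise_ticket(problem, desk, queue="Opsview Monitor"):
    """Create one ticket per problem and record its number as a comment."""
    if problem.ticket_id is not None:
        return problem.ticket_id  # already ticketed, so do not duplicate
    ticket_id = desk.create_ticket(
        queue=queue,
        summary=f"{problem.service} on {problem.host} is CRITICAL",
        description=problem.output,
    )
    problem.ticket_id = ticket_id
    problem.comments.append(f"Ticket raised: {ticket_id}")
    return ticket_id
```

Storing the ticket number on the problem is what prevents duplicates in this sketch: a second call for the same problem returns the existing number rather than opening another ticket.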

Note

It is not recommended to raise tickets for common or minor issues, such as high pagefile or swap usage.

For more information, see Service Desk Connector Module.

Proactive monitoring

Proactive monitoring is one of the key benefits of automation. It involves monitoring systems for potential problems, which reduces the time spent assessing their cause.

By using event handlers, you can specify in Opsview what must happen automatically when a service check turns critical. For problem states, Opsview can be configured to both send notifications and restart the service.

Restart apache service configuration

In the preceding example, when the service goes critical, the event handler restart_service -s apache2 is run, which restarts the Apache service.

These scripts are executed by Nagios Remote Plugin Executor (NRPE) or its Windows counterpart, NSClient++. This opens up a vast range of possibilities. For example, when a service check that monitors the temperature of a server goes critical, a script can be called to automatically ramp up the fans for 20 minutes.
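A minimal sketch of the decision an event handler script might make is shown below. It assumes the handler is told the service state and the state type (SOFT or HARD), as Nagios-style event handlers commonly are; whether your Opsview configuration passes exactly these arguments is an assumption, and the function names `plan_action` and `handle_event` are hypothetical.

```python
import subprocess


def plan_action(state, state_type, restart_command):
    """Return the command to run, or None when no action is needed.

    Acting only on a HARD CRITICAL state avoids restarting the service
    while the check is still being retried.
    """
    if state.upper() == "CRITICAL" and state_type.upper() == "HARD":
        return list(restart_command)
    return None


def handle_event(state, state_type, restart_command, run=subprocess.run):
    """Run the restart command if the state calls for it.

    `run` is injectable so the logic can be exercised without
    executing anything on the host.
    """
    command = plan_action(state, state_type, restart_command)
    if command is None:
        return False
    run(command, check=True)
    return True
```

For the Apache example above, the handler would be invoked as `handle_event("CRITICAL", "HARD", ["restart_service", "-s", "apache2"])`. Passing the command as a list rather than a shell string avoids shell quoting problems.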

Operational automation

Deployment

Nearly all high-end APM and IT monitoring solutions offer deployment assistance. The more cloud-focused tools in the marketplace have cloud agents or an agent-register technology, in which the agent is configured with the details of the monitoring server. The agent automatically registers itself when it comes online and unregisters itself when it goes offline. This can be useful for cloud and dynamic environments, such as those used by developers and QA teams.
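The register-on-start, unregister-on-stop lifecycle can be illustrated with a Python context manager. `MonitoringRegistry` is a hypothetical in-memory stand-in for the monitoring server's host list, and `Agent` is an invented name; a real agent would contact the monitoring server over the network.

```python
class MonitoringRegistry:
    """Hypothetical in-memory stand-in for a monitoring server's host list."""

    def __init__(self):
        self.hosts = {}

    def register(self, name, address):
        self.hosts[name] = address

    def unregister(self, name):
        # pop with a default so unregistering twice is harmless
        self.hosts.pop(name, None)


class Agent:
    """Registers the host on entry and unregisters it on exit."""

    def __init__(self, registry, name, address):
        self.registry = registry
        self.name = name
        self.address = address

    def __enter__(self):
        self.registry.register(self.name, self.address)
        return self

    def __exit__(self, exc_type, exc, tb):
        self.registry.unregister(self.name)
        return False  # do not swallow exceptions from the workload


registry = MonitoringRegistry()
with Agent(registry, "qa-vm-01", "10.0.0.5"):
    print(sorted(registry.hosts))  # host is monitored while it is up
print(sorted(registry.hosts))      # host is gone once it shuts down
```

Using a context manager means the host is unregistered even if the workload raises an exception, which mirrors the way short-lived cloud instances should disappear from monitoring when they terminate.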

Additionally, integration with deployment tools such as Red Hat Satellite, OpenStack, Puppet, or Chef reduces operational overhead, namely the time taken to register a newly deployed server or virtual machine in your monitoring solution.

Opsview integrates with Puppet and Chef, so that when an administrator deploys a new server or VM, it is automatically registered in Opsview.

Reporting

Opsview can automate the creation and sending of reports. For example, your Daily SLA Report can be set to be delivered to your inbox at a certain time, such as 8:30 AM from Monday to Friday, without user involvement.

This functionality can automate any report through the GUI using the pre-defined report formats, from Daily SLA Report to Yearly cost of downtime. The GUI is highly configurable: you can specify what the report is generated against, to whom it is sent, in which format, and at what time, date, and timezone.
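To make the scheduling rule concrete, the sketch below computes the next delivery time for a report sent at 8:30 AM from Monday to Friday. It is not how Opsview schedules reports internally; the function name `next_send_time` and its parameters are assumptions for illustration.

```python
from datetime import datetime, time, timedelta


def next_send_time(now, send_at=time(8, 30), weekdays=frozenset(range(5))):
    """Return the first scheduled send time strictly after `now`.

    `weekdays` uses Python's convention: 0 = Monday ... 6 = Sunday.
    The result keeps the timezone of `now`, if it has one.
    """
    if not weekdays:
        raise ValueError("weekdays must not be empty")
    candidate = datetime.combine(now.date(), send_at, tzinfo=now.tzinfo)
    # Step forward one day at a time until the slot is in the future
    # and falls on an allowed weekday.
    while candidate <= now or candidate.weekday() not in weekdays:
        candidate += timedelta(days=1)
    return candidate
```

Called on a Friday at 9:00 AM, after that day's report has gone out, the function skips the weekend and returns the following Monday at 8:30 AM.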

Business Service SLA Report

For more information, see Scheduling a Report.

Network backup and auditing

Advanced monitoring tools provide network administrators with the option to not only monitor a network device, but also take a copy of its configuration and alert on changes. This allows efficient monitoring of device performance and gives a view into the actual configuration.

For example, consider a Cisco 6509 running one supervisor with 1000 lines of IOS configuration on it. A fire breaks out and destroys the chassis. Replacement hardware can be onsite within 4 hours on the right Cisco SLA or contract; however, without a backup of that configuration, it will take a very long time to recreate.

With Opsview, this risk can be removed using a module called NetAudit, which is enabled on network devices to monitor their configuration and make it viewable from a central console.

This allows you to view any changes in the configuration and to be alerted when something changes, for example as a result of a malicious attack.
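The general idea of detecting a configuration change can be sketched with the Python standard library: compare a fingerprint of the stored backup with the current configuration, and produce a diff when they differ. This illustrates the technique only and makes no claim about how NetAudit works internally; `fingerprint` and `detect_change` are hypothetical names.

```python
import difflib
import hashlib


def fingerprint(config_text):
    """Return a SHA-256 digest that changes whenever the config changes."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


def detect_change(stored, current):
    """Return a unified diff if the config changed, otherwise None."""
    if fingerprint(stored) == fingerprint(current):
        return None
    diff = difflib.unified_diff(
        stored.splitlines(),
        current.splitlines(),
        fromfile="stored",
        tofile="current",
        lineterm="",
    )
    return "\n".join(diff)
```

A non-empty result is what an alert would carry: the diff shows exactly which lines were added or removed, so an administrator can judge whether the change was expected.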
