The approach provides the ability to carefully define how Dotcom-Monitor interprets responses as either “Up” or “Down” responses. This is accomplished using filters.
Filtering defines the Up/Down states using the following adjustable criteria:
- Error is reported for a specified number of minutes
- Error is confirmed by a specified number of agents
- Error is detected in a specified number of tasks.
All filters and their settings are available at Configure > Filters. After a filter is applied to a device all of the device’s notifications are based on the filter’s criteria. “Default Filter” is assigned to all new devices. The default filter has a balanced configuration and is suitable for most monitoring devices.
Uptime/Downtime Calculations
The formula for the Downtime calculation is as follows:
1. Downtime Duration is tied directly to the configurations within the filter:
- The Downtime period starts when a filter’s conditions are met. For example, when the number of agents which report a failure equals the number of agents specified in the filter, and as also specified the conditions are met for the number of minutes and tasks, then a downtime alert is sent.
- The Uptime period starts when the filter’s conditions are no longer met. Specifically, Uptime starts when the number of agents, minutes, or tasks, which have reported “up” success, no longer meet the conditions needed for the filtered “down” conditions. For example, an “up” state is indicated when the number of error (“down”) responses received by agents becomes less than the number of error (“down”) responses that agents need, as set in the filter, in order to indicate a “down” condition.
2. Duration of “Undefined” state. An Undefined state can be set when the status of each agent involved in monitoring becomes Undefined. An agent status is considered as Undefined if the agent does NOT provide any response (error or success) in a certain amount of time:
Response Wait Time Duration = (the overall agents number+1) × monitoring frequency + 15 min
For example, we use three monitoring agents and a monitoring frequency of 5 minutes. Each agent will wait for a response for Response Wait Time Duration = (3+1)×5+15 min = 35 min. Once the time expired and no response received, an agent reports Undefined.
3. Duration of “Postponed” state. Postponing a device at any moment will stop any monitoring activity until it is re-enabled.
4. Duration Excluded by Schedule. Another entity that can significantly affect Uptime/Downtime calculations is Schedules. This is an option for managing your monitoring during routine maintenance. Monitoring can be postponed for specific days of the week as well as specific hours and minutes during a day. To set up a schedule, follow the instruction.
EXAMPLE:
Let’s say we have device monitored from 7 locations and filter set that 3 locations must report an error for Downtime condition. First, monitoring node (agent 1) detects an error while the rest are still reporting success, then the second (agent 2) and at last third one (agent 4) detects an error at T4 which triggers filter to set Downtime beginning right from this moment. The Down state will remain until you set hypothetical Postpone at T5 because of the number of agent reporting errors higher than adjusted 3 throughout all this time. The time gap between T6 and T7 is an illustration of the fact we get the first response with a delay (monitoring session processing time includes network transfer delays and the execution itself), so “Postponed” time is being calculated as ∆ (T7–T5) (Postponed 2nd). Again, we fall into Downtime only on 3rd error from Agent 3 and get in the Up state only on the T9 response, when the number of failing agents becomes less than adjusted in the filter. Here comes the final downtime % calculation formula for this case: