Geneos

Gateway Hub

Overview

You can monitor Gateway Hub via Gateway by using the metrics collected from the internal Netprobes running on each Gateway Hub node.

Configuring Gateway Hub self monitoring allows you to:

  • View Gateway Hub metrics in your Active Console or Web Console.

  • Investigate historical Gateway Hub metrics in your Active Console or Web Console.

  • Set Gateway alerting rules on Gateway Hub metrics.

You can also use the Gateway Hub data Gateway plugin to monitor the status of the connection between Gateway and Gateway Hub. For more information, see Gateway Hub data in Gateway Plug-Ins.

Prerequisites

The following requirements must be met prior to the installation and setup of this integration:

  • Gateway version 5.5.x or newer.

  • Gateway Hub version 2.3.0 or newer.

You must have a valid Gateway license to use this feature outside of demo mode. Any licensing errors are reported by the Health Plugin.

Gateway Hub self monitoring

Caution: Gateway Hub self monitoring relies on the new simplified mappings for Dynamic Entities, which is a pilot feature and subject to change.

Each Gateway Hub node runs an internal Netprobe and Collection Agent to collect metrics on its own performance. You can connect to the internal Netprobe to monitor Gateway Hub from your Active Console. You must connect to each node individually.

Gateway Hub data points must be processed by configuring Dynamic Entities. To simplify this process, you can download an include file that provides this configuration. The Gateway Hub self monitoring include file uses the new simplified Dynamic Entities configuration options; this is a pilot feature and is subject to change.

Note: In Gateway Hub version 2.4.0, the Collection Agent plugin used to collect self-monitoring metrics was updated from linux-infra to system. This requires an updated include file with the correct plugin name and other changes. Ensure you have the correct include file for the version of Gateway Hub you are using.

Configure default self monitoring

To enable Gateway Hub self monitoring in Active Console:

  1. Download the Gateway Hub Integration from ITRS Downloads. This contains the geneos-integration-gateway-hub-<version>.xml include file, which you should save to a location accessible to your Gateway.

  2. Open the Gateway Setup Editor.

  3. Right-click the Includes top level section, then select New Include.

  4. Set the following options:

    • Priority — Any value above 1.

    • Location — Specify the path to the location of the geneos-integration-gateway-hub-<version>.xml include file.

  5. Right-click the include file in the State Tree and select Expand all.
  6. Select Click to load. The new include file will load.
  7. Right-click the Probes top-level section, then select New Probe.
  8. Set the following options in the Basic tab:
    • Name — Specify the name that will appear in Active Console, for example Gateway_Hub_Self_Monitoring.
    • Hostname — Specify the hostname of your Gateway Hub node.
    • Port — 7036.
    • Secure — Enabled.
  9. Set the following options in the Dynamic Entities tab:
    • Mapping type — Hub.
  10. Repeat steps 7 to 9 for each Gateway Hub node you wish to monitor.
  11. Click Save current document.

In your Active Console, Dynamic Entities are created for metrics from Gateway Hub self monitoring.

Dataviews are automatically populated from the available metrics. In general, an entity is created for each Gateway Hub component. Components that use Java also include metrics on JVM performance.

Depending on how you have configured your Active Console, you may see repeated entities in the State Tree. This is because there are multiple Gateway Hub nodes, each running the same components. To organise your State Tree by node, perform the following steps:

  1. Open your Active Console.

  2. Navigate to Tools > Settings > General.

  3. Set the Viewpath option to hostname.

  4. Click Apply.

Configure log file monitoring

You can additionally monitor Gateway Hub log files using the FKM plugin. To do this you must configure a new Managed Entity using the hub-logs sampler provided as part of the geneos-integration-gateway-hub-<version>.xml include file.

To enable Gateway Hub log monitoring:

  1. Open the Gateway Setup Editor.

  2. Right-click the Managed Entities top level section, then select New Managed Entity.

  3. Set the following options:

    • Name — Specify the name that will appear in Active Console, for example Hub_Log_Monitoring.

    • Options > Probe — Specify the name of the internal Gateway Hub probe.

    • Sampler — hub-logs.

  4. Repeat steps 2 to 3 for each Gateway Hub node you wish to monitor.

  5. Click Save current document.

In your Active Console, an additional Managed Entity with the name you specified shows the status of the Gateway Hub log files on that node.

If you have configured Gateway Hub to store log files in a directory other than the default, you must direct the sampler to your logs directory. To do this, specify the hub-logs-dir variable from the Advanced tab of your Managed Entity. For more information about setting variables, see managedEntities > managedEntity > var in Managed Entities and Managed Entity Groups.
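For illustration only, a Managed Entity carrying this variable might look like the sketch below. The /opt/hub/logs path and entity name are placeholders, and the Gateway Setup Editor generates the exact XML for you:

```xml
<!-- Hypothetical snippet: points the hub-logs sampler at a non-default
     log directory; the path shown is a placeholder. -->
<managedEntity name="Hub_Log_Monitoring">
    <probe ref="Gateway_Hub_Self_Monitoring"/>
    <var name="hub-logs-dir">
        <string>/opt/hub/logs</string>
    </var>
    <sampler ref="hub-logs"/>
</managedEntity>
```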

Important metrics

If you have configured your Gateway to receive Gateway Hub self monitoring metrics, you may want to set up alerts for changes in the most important metrics. This section outlines the key metrics for the major Gateway Hub components and provides advice to help create meaningful alerts.

JVM memory

You can observe JVM memory with the following metrics:

Metric Source Description
jvm_memory_heap_committed StatsD

Amount of memory (in bytes) that is guaranteed to be available for use by the JVM.

The amount of committed memory may change over time (increase or decrease). The value of jvm_memory_heap_committed may be less than jvm_memory_heap_max but will always be greater than jvm_memory_pool_heap_used.

 

jvm_memory_heap_max StatsD

Maximum amount of memory (in bytes) that can be used for memory management. Its value may be undefined.

The maximum amount of memory may change over time if defined. The value of jvm_memory_pool_heap_used and jvm_memory_heap_committed will always be less than or equal to jvm_memory_heap_max if defined.

Memory allocation may fail if it attempts to increase the used memory such that the used memory is greater than the committed memory, even if the used memory would still be less than or equal to the maximum memory (for example, when the system is low on virtual memory).

jvm_memory_pool_heap_used StatsD Amount of memory currently used (in bytes) by the JVM.
jvm_memory_gc_collection_count StatsD Number of garbage collections that have occurred in the JVM life cycle.
jvm_memory_gc_collection_time StatsD Time spent in garbage collection.

Observing high heap usage, where jvm_memory_pool_heap_used is nearing jvm_memory_heap_max, can be an indicator that the heap memory allocation may not be sufficient.
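As an illustration, an alert on this condition could be sketched as follows (Python rather than Gateway rule syntax; the 90 percent threshold is an assumption, not a documented default):

```python
def heap_usage_alert(heap_used: float, heap_max: float,
                     threshold: float = 0.9) -> bool:
    """Return True when JVM heap usage is nearing its maximum.

    heap_used  -- jvm_memory_pool_heap_used (bytes)
    heap_max   -- jvm_memory_heap_max (bytes); may be undefined
    threshold  -- fraction of heap_max treated as "nearing" (assumed 90%)
    """
    if heap_max <= 0:  # jvm_memory_heap_max may be undefined
        return False
    return heap_used / heap_max >= threshold

# Example: 3.7 GiB used out of a 4 GiB maximum heap (92.5% used)
print(heap_usage_alert(3.7 * 2**30, 4 * 2**30))  # True
```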

Note: While there might be a temptation to increase the memory allocation, the Gateway Hub installer calculates the ideal Gateway Hub memory settings based on the size of the machine being used. It is important not to over-allocate memory to any Gateway Hub component, as this may result in an over-commitment that can produce unexpected behaviours, including failures or swapping.

JVM garbage collection

You can observe JVM garbage collection with the following metrics:

Metric Source Description
jvm_memory_gc_collection_count StatsD Total number of collections that have occurred.
jvm_memory_gc_collection_time StatsD Approximate accumulated collection elapsed time in milliseconds.

When creating alerts, note the following:

  • Long pauses due to garbage collection will negatively impact any JVM-based process, particularly if it is latency sensitive.

Kafka consumer metrics

Several Gateway Hub daemons contain Kafka consumers that consume messages from Kafka topic partitions and process them.

Key consumer metrics

You can observe Kafka consumers with the following metrics:

Metric Source Dimensions Description
kafka_consumer_bytes_consumed_rate StatsD app, client-id, hostname, topic

Average bytes consumed per topic, per second.

Example: kafka_consumer_bytes_consumed_rate Dimensions = app=snapshotd, topic=hub-metrics-normalised, hostname=hub-vm.internal, client-id=consumer-1

kafka_consumer_bytes_consumed_rate StatsD app, client-id, hostname

Average bytes consumed per second.

Example: kafka_consumer_bytes_consumed_rate Dimensions = app=snapshotd, hostname=hub-vm.internal, client-id=consumer-1

kafka_consumer_bytes_consumed_total StatsD app, client-id, hostname

Total bytes consumed.

Example: kafka_consumer_bytes_consumed_total Dimensions = app=snapshotd, hostname=hub-vm.internal, client-id=consumer-1

kafka_consumer_bytes_consumed_total StatsD app, client-id, hostname, topic

Total bytes consumed by topic.

Example: kafka_consumer_bytes_consumed_total Dimensions = app=snapshotd, topic=hub-metrics-normalised, hostname=hub-vm.internal, client-id=consumer-1

kafka_consumer_records_consumed_rate StatsD app, client-id, hostname, topic

The average number of records consumed per second.

Example: kafka_consumer_records_consumed_rate Dimensions = app=snapshotd, topic=hub-metrics-normalised, hostname=hub-vm.internal, client-id=consumer-1

kafka_consumer_records_consumed_rate StatsD app, client-id, hostname

The average number of records consumed per second.

Example: kafka_consumer_records_consumed_rate Dimensions = app=snapshotd, hostname=hub-vm.internal, client-id=consumer-1

kafka_consumer_records_consumed_total StatsD app, client-id, hostname, topic

Total records consumed.

Example: kafka_consumer_records_consumed_total Dimensions = app=snapshotd, topic=hub-metrics-normalised, hostname=hub-vm.internal, client-id=consumer-1

When creating alerts, note the following:

  • The kafka_consumer_bytes_consumed_rate metric is a measure of network bandwidth. This should stay largely constant; if it does not, this may indicate network problems.

  • The kafka_consumer_records_consumed_rate metric is a measure of the actual records consumed. This may fluctuate depending on the message size and may not correlate with bytes consumed. In a healthy application, you would expect this metric to stay fairly constant. If this measure drops to zero it may indicate a consumer failure.

Consumer lag

If a Kafka topic fills faster than it is consumed, the consumer accumulates what is known as "lag". High lag means that your system is not keeping up with messages; near-zero lag means that it is. In operational terms, high lag means that there may be significant latency between a message being ingested into Gateway Hub and that message being reflected to a user (via a query or some other means). You should aim to keep lag close to zero; steadily increasing lag is a problem.

You can observe lag with the following metrics:

Metric Source Dimensions Description
kafka_consumer_records_lag StatsD app, client-id, hostname, partition, topic

Number of messages the consumer is behind the producer on this partition.

Example: kafka_consumer_records_lag Dimensions = app=snapshotd, topic=hub-metrics-normalised, hostname=hub-vm.internal, partition=1, client-id=consumer-1

kafka_consumer_records_lag_avg StatsD app, client-id, hostname, partition, topic

Average number of messages the consumer is behind the producer on this partition.

Example: kafka_consumer_records_lag_avg Dimensions = app=snapshotd, topic=hub-metrics-normalised, hostname=hub-vm.internal, partition=1, client-id=consumer-1

kafka_consumer_records_lag_max StatsD app, client-id, hostname, partition, topic

Max number of messages the consumer is behind the producer on this partition.

Example: kafka_consumer_records_lag_max Dimensions = app=snapshotd, topic=hub-metrics-normalised, hostname=hub-vm.internal, partition=1, client-id=consumer-1

These Kafka consumer metrics work at the consumer level and not at the consumer group level. To get a complete picture you should watch the equivalent metrics across all nodes in the cluster.

When creating alerts, note the following:

  • The kafka_consumer_records_lag metric is the actual lag between the specific consumer in the daemon and the producer for the specified topic/partition. You should monitor this metric closely as it is a key indicator that the system may not be processing records quickly enough.

  • If lag for the same topic across all nodes is roughly zero, there is no problem.

  • If lag is significantly higher for the same topics on different nodes, then a problem is likely present on specific nodes.

  • If lag is high across all nodes, then it may be an indicator that the Gateway Hub is overloaded across all nodes, possibly because the load present is higher than the node hardware is rated for.
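The "increasing lag is a problem" guidance above can be sketched as a simple trend check over recent kafka_consumer_records_lag samples for one consumer (illustrative Python, not Gateway rule syntax; the window and growth threshold are assumptions to tune for your message rates):

```python
def lag_is_increasing(samples, min_growth=100):
    """Return True if consumer lag is trending upwards.

    samples    -- recent kafka_consumer_records_lag values, oldest first
    min_growth -- minimum growth over the window to count as a trend
                  (assumed value, not a documented threshold)
    """
    if len(samples) < 2:
        return False
    # Require both overall growth and a mostly monotonic rise,
    # so a single noisy sample does not trigger the alert.
    rising = sum(1 for a, b in zip(samples, samples[1:]) if b > a)
    return (samples[-1] - samples[0] >= min_growth
            and rising >= len(samples) // 2)

# Near-zero, stable lag: healthy
print(lag_is_increasing([3, 0, 5, 2, 1]))           # False
# Steadily growing lag: the consumer is falling behind
print(lag_is_increasing([10, 120, 260, 390, 540]))  # True
```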

Fetch rate

You can observe fetch rates with the following metrics:

Metric Source Dimensions Description
kafka_consumer_fetch_rate StatsD app, client-id, hostname

Number of fetch requests per second.

Example: kafka_consumer_fetch_rate Dimensions = app=snapshotd, hostname=hub-vm.internal, client-id=consumer-1

kafka_consumer_fetch_size_avg StatsD app, client-id, hostname

Average number of bytes fetched per request.

Example: kafka_consumer_fetch_size_avg Dimensions = app=snapshotd, hostname=hub-vm.internal, client-id=consumer-1

kafka_consumer_fetch_size_avg StatsD app, client-id, hostname, topic

Average number of bytes fetched per request.

Example: kafka_consumer_fetch_size_avg Dimensions = app=snapshotd, topic=hub-metrics-normalised, hostname=hub-vm.internal, client-id=consumer-1

kafka_consumer_fetch_size_max StatsD app, client-id, hostname, topic

Max number of bytes fetched per request.

Example: kafka_consumer_fetch_size_max Dimensions = app=snapshotd, hostname=hub-vm.internal, client-id=consumer-1

When creating alerts, note the following:

  • The kafka_consumer_fetch_rate metric indicates that the consumer is performing fetches; this should be fairly constant for a healthy consumer. If it drops, there may be a problem in the consumer.

Kafka producer metrics

Kafka producers publish records to Kafka topics. When producers publish messages in a reliable system, they must be sure that messages have been received (unless explicitly configured not to care). To do this, producers receive acknowledgements from brokers. In some configurations, a producer does not require acknowledgements from all brokers; it merely needs to receive a minimum number (to achieve a quorum). In other configurations, it may need acknowledgements from all brokers. In either case, the act of receiving acknowledgements is latency-sensitive and affects how fast a producer can push messages.

When configuring Kafka producers, consider the following:

  • Producers can send messages in batches. This is generally more efficient than sending individual messages because the conversation with the broker is less extensive and fewer acknowledgements are required.

  • Producers can compress messages. Compression makes messages smaller, which requires less network bandwidth but more CPU.
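These batching and compression trade-offs correspond to standard Kafka producer settings. The values below are illustrative assumptions only, not Gateway Hub defaults; Gateway Hub configures its internal producers itself:

```
# Illustrative Kafka producer settings (assumed values, not Gateway Hub defaults)
# Larger batches amortise per-request overhead and compress better
batch.size=65536
# Wait up to 50 ms for a batch to fill before sending
linger.ms=50
# Compress batches: less network bandwidth, more CPU
compression.type=lz4
# Require acknowledgement from all in-sync replicas before success
acks=all
```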

You can observe Kafka producer behaviour with the following metrics:

Metric Source Dimensions Description
kafka_producer_batch_size_avg StatsD app, client-id, hostname

Average number of bytes sent per partition, per request.

Example: kafka_producer_batch_size_avg Dimensions = app=snapshotd, hostname=hub-vm.internal, client-id=producer-1

kafka_producer_compression_rate_avg StatsD app, client-id, hostname

Average compression rate of record batches.

Example: kafka_producer_compression_rate_avg Dimensions = app=snapshotd, hostname=hub-vm.internal, client-id=producer-1

kafka_producer_node_response_rate StatsD app, client-id, hostname, node-id

Average number of responses received per second from the broker.

Example: kafka_producer_node_response_rate Dimensions = app=snapshotd, node-id=node--1, hostname=hub-vm.internal, client-id=producer-1

kafka_producer_request_rate StatsD app, client-id, hostname

Average number of requests sent per second to the broker.

Example: kafka_producer_request_rate Dimensions = app=snapshotd, hostname=hub-vm.internal, client-id=producer-1

kafka_producer_node_request_latency_avg StatsD app, client-id, hostname, node-id

Average request latency in milliseconds for a node.

Example: kafka_producer_node_request_latency_avg Dimensions = app=snapshotd, node-id=node--1, hostname=hub-vm.internal, client-id=producer-1

kafka_producer_io_wait_time_ns_avg StatsD app, client-id, hostname

Average length of time the I/O thread spends waiting for a socket ready for reads or writes in nanoseconds.

Example: kafka_producer_io_wait_time_ns_avg Dimensions = app=snapshotd, hostname=hub-vm.internal, client-id=producer-1

kafka_producer_outgoing_byte_rate StatsD app, client-id, hostname

Average number of bytes sent per second to the broker.

Example: kafka_producer_outgoing_byte_rate Dimensions = app=snapshotd, hostname=hub-vm.internal, client-id=producer-1

When creating alerts, note the following:

  • The kafka_producer_batch_size_avg metric indicates the average size of batches sent to the broker. Large batches are preferred, since small batches do not compress well and must be sent more often, requiring more network traffic. This value should not vary greatly under a reasonably constant load.

  • If the kafka_producer_node_response_rate metric is low, this may indicate that the producer is falling behind and that data cannot be consumed at an ideal rate. This value should not vary greatly under a reasonably constant load.

  • If the kafka_producer_request_rate is low under high load, this could indicate an issue with the producer. An extremely high rate could also indicate a problem, as it may mean consumers struggle to keep up (which could require throttling).

  • Generally, batches should be large. However, large batches may also increase the kafka_producer_node_request_latency_avg metric. This is because a producer may wait until it builds up a big enough batch before it initiates a send operation (this behaviour is controlled by the linger.ms Kafka setting). You should generally prefer throughput over latency; however, this is a trade-off, and too much latency can also be problematic. Large batches are most likely to cause a problem in high-load scenarios.

  • If the kafka_producer_io_wait_time_ns_avg metric is high this means that the producer is spending a lot of time waiting on network resources while CPU is essentially idle. This may point to resource saturation, a slow network, or similar problems.

Kafka broker metrics

All Kafka messages pass through a broker, so if the broker is encountering problems this can have a wider impact on performance and reliability.

Note that Kafka broker metrics are collected via the Kafka plugin, which uses JMX. The consumer and producer metrics listed above are gathered by the StatsD plugin for each specific process.

You can observe Kafka broker behaviour with the following metrics:

Metric Source Dimensions Description
server_replica_under_replicated_partitions Kafka app, broker_id, cluster_id, hostname

Number of under-replicated partitions.

Each Kafka partition may be replicated in order to provide reliability guarantees. In the set of replicas for a given partition, one is chosen as the leader. The leader is always considered in sync. The remaining replicas are also considered in sync, provided they are not too far behind the leader. Synchronised replicas form the ISR (In-Sync Replica) pool. If a partition lags too far behind, it is removed from the ISR pool. Producers may require a minimum number of ISRs in order to operate reliably. When the ISR pool shrinks, you will see a corresponding increase in under-replicated partitions.

Example: server_replica_under_replicated_partitions Dimensions = broker_id=0, hostname=hub-vm.internal, cluster_id=a1ZajBN8QiS60Hs_AZte3Q, app=kafka

server_replica_isr_expands_per_sec_<attribute> Kafka app, broker_id, cluster_id, hostname

Rate at which the pool of in-sync replicas (ISRs) expands.

Example: server_replica_isr_expands_per_sec_count Dimensions = broker_id=0, hostname=hub-vm.internal, cluster_id=a1ZajBN8QiS60Hs_AZte3Q, app=kafka

server_replica_isr_shrinks_per_sec_<attribute> Kafka app, broker_id, cluster_id, hostname

Rate at which the pool of in-sync replicas (ISRs) shrinks.

Example: server_replica_isr_shrinks_per_sec_count Dimensions = broker_id=0, hostname=hub-vm.internal, cluster_id=a1ZajBN8QiS60Hs_AZte3Q, app=kafka

controller_active_controller_count Kafka app, broker_id, cluster_id, hostname

Number of active controllers in the cluster.

Example: controller_active_controller_count Dimensions = broker_id=0, hostname=hub-vm.internal, cluster_id=a1ZajBN8QiS60Hs_AZte3Q, app=kafka

controller_offline_partitions_count Kafka app, broker_id, cluster_id, hostname

Number of partitions that do not have an active leader and are hence not writable or readable.

Example: controller_offline_partitions_count Dimensions = broker_id=0, hostname=hub-vm.internal, cluster_id=a1ZajBN8QiS60Hs_AZte3Q, app=kafka

controller_leader_election_rate_and_time_ms Kafka app, broker_id, cluster_id, hostname Rate of leader elections per second and the overall duration the cluster went without a leader.
controller_unclean_leader_elections_per_sec_count Kafka app, broker_id, cluster_id, hostname

Unclean leader election rate.

If a broker goes offline, some partitions will be leaderless and Kafka will elect a new leader from the ISR pool. Gateway Hub does not allow unclean elections, hence the new leader must come from the ISR pool.

Example: controller_unclean_leader_elections_per_sec_count Dimensions = broker_id=0, hostname=hub-vm.internal, cluster_id=a1ZajBN8QiS60Hs_AZte3Q, app=kafka

When creating alerts, note the following:

  • The server_replica_under_replicated_partitions metric should never be greater than zero.

  • The server_replica_isr_expands_per_sec_<attribute> metric should not vary significantly.

  • The controller_active_controller_count metric must be equal to one. There should be exactly one controller per cluster.

  • The controller_offline_partitions_count metric should never be greater than zero.

  • A high rate of leader elections, as indicated by the controller_leader_election_rate_and_time_ms metric, suggests brokers are fluctuating between offline and online statuses. Additionally, taking too long to elect a leader will result in partitions being inaccessible for long periods.

  • The controller_unclean_leader_elections_per_sec_count metric should never be greater than zero.

Zookeeper metrics

Kafka uses Zookeeper to store metadata about topics and brokers. It plays a critical role in ensuring Kafka's performance and stability. If Zookeeper is not available Kafka cannot function.

Zookeeper is very sensitive to I/O latency. In particular, disk latency can have severe impacts on Zookeeper because quorum operations must be completed quickly.

You can observe Zookeeper with the following metrics:

Metric Source Dimensions Description
zookeeper_outstanding_requests Zookeeper app, hostname, port

Number of requests from followers that have yet to be acknowledged.

Example: zookeeper_outstanding_requests Dimensions = hostname=hub-vm.internal, port=5181, app=zookeeper

zookeeper_avg_request_latency Zookeeper app, hostname, port

Average time to respond to a client request.

Example: zookeeper_avg_request_latency Dimensions = hostname=hub-vm.internal, port=5181, app=zookeeper

zookeeper_max_client_cnxns_per_host Zookeeper app, hostname, port

Maximum number of concurrent connections a single client host is allowed to make to the server.

Example: zookeeper_max_client_cnxns_per_host Dimensions = hostname=hub-vm.internal, port=5181, app=zookeeper

zookeeper_num_alive_connections Zookeeper app, hostname, port

Number of connections currently open. Should be well under the configured maximum connections for safety.

Example: zookeeper_num_alive_connections Dimensions = hostname=hub-vm.internal, port=5181, app=zookeeper

zookeeper_fsync_threshold_exceed_count Zookeeper app, hostname, port

Count of instances where the fsync time has exceeded the warning threshold.

Example: zookeeper_fsync_threshold_exceed_count Dimensions = hostname=hub-vm.internal, port=5181, app=zookeeper

When creating alerts, note the following:

  • The zookeeper_outstanding_requests metric should be low. Confluent suggests that this value should be below 10.

  • The zookeeper_avg_request_latency metric should be as low as possible (typical values are less than 10 ms), and ideally fairly constant. You should investigate if this number spikes or shows high variability.

  • If the zookeeper_fsync_threshold_exceed_count metric increases steadily, there may be a problem with disk latency. Ideally, this metric should be zero, or static in the case of recovery after a latency problem.

etcd metrics

etcd is used by Gateway Hub as a key-value store. Disk latency is the most important etcd metric, but CPU starvation can also cause problems.

Many etcd metrics are provided as histograms composed of several gauges. Etcd histogram buckets are cumulative. See below for an example:

disk_wal_fsync_duration_seconds_bucket_0.001 = 2325
disk_wal_fsync_duration_seconds_bucket_0.002 = 4642
disk_wal_fsync_duration_seconds_bucket_0.004 = 5097
disk_wal_fsync_duration_seconds_bucket_0.008 = 5187
disk_wal_fsync_duration_seconds_bucket_0.016 = 5248
disk_wal_fsync_duration_seconds_bucket_0.032 = 5253
disk_wal_fsync_duration_seconds_bucket_0.064 = 5254
disk_wal_fsync_duration_seconds_bucket_0.128 = 5254
disk_wal_fsync_duration_seconds_bucket_0.256 = 5254
disk_wal_fsync_duration_seconds_bucket_0.512 = 5254
disk_wal_fsync_duration_seconds_bucket_1.024 = 5254
disk_wal_fsync_duration_seconds_bucket_2.048 = 5254
disk_wal_fsync_duration_seconds_bucket_4.096 = 5254
disk_wal_fsync_duration_seconds_bucket_8.192 = 5254
disk_wal_fsync_duration_seconds_sum = 7.362459756999995
disk_wal_fsync_duration_seconds_count = 5254

The value of the disk_wal_fsync_duration_seconds_bucket_<x.y> metric is the cumulative count of observations that took at most the number of seconds given by the <x.y> postfix. In this example, the bucket values increase up to the 0.064 bucket and then remain constant, meaning no fsync took longer than 64 ms.
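Because the buckets are cumulative, a percentile can be estimated by finding the first bucket whose count reaches the desired fraction of the total. A sketch in Python, using the sample values above:

```python
def percentile_from_buckets(buckets, total, q):
    """Estimate the q-th quantile from cumulative histogram buckets.

    buckets -- {upper_bound_seconds: cumulative_count}, as exposed by the
               disk_wal_fsync_duration_seconds_bucket_<x.y> metrics
    total   -- disk_wal_fsync_duration_seconds_count
    q       -- quantile in [0, 1], e.g. 0.99
    """
    target = q * total
    for bound in sorted(buckets):
        if buckets[bound] >= target:
            return bound  # upper bound of the first bucket covering q
    return float("inf")  # more observations than any finite bucket records

# Cumulative bucket values from the example above
wal_fsync = {
    0.001: 2325, 0.002: 4642, 0.004: 5097, 0.008: 5187,
    0.016: 5248, 0.032: 5253, 0.064: 5254,
}
# 99% of the 5254 fsyncs completed within 16 ms
print(percentile_from_buckets(wal_fsync, 5254, 0.99))  # 0.016
```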

Key metrics

You can observe etcd with the following metrics:

Metric Source Description
server_health_failures prometheus-target Total number of failed health checks.
server_heartbeat_send_failures_total prometheus-target Total number of leader heartbeat send failures. If non-zero and increasing, this is likely due to a slow disk and can be a prelude to a cluster failover.

When creating alerts, note the following:

  • The rate of server health failures should be low.

  • If the server_heartbeat_send_failures_total metric is increasing, this may indicate a slow disk.

etcd latency metrics

Most issues with etcd relate to slow disks. High disk latency can cause long pauses that lead to missed heartbeats, and potentially failovers in the etcd cluster. Disk latency also contributes to high request latency.

You can observe etcd latency with the following metrics:

Metric Source Description
disk_backend_commit_duration_seconds_bucket_<bucket> prometheus-target Presented as a histogram. The latency distribution of commit calls made by the backend.
disk_wal_fsync_duration_seconds_bucket_<bucket> prometheus-target Presented as a histogram. The latency distribution of fsync calls made by the Write Ahead Log (WAL).

When creating alerts, note the following:

  • The 99th percentile of the disk_backend_commit_duration_seconds_bucket_<bucket> metric should be less than 25 ms.

  • The disk_wal_fsync_duration_seconds_bucket_<bucket> metric should be fairly constant and ideally, as low as possible.

PostgreSQL metrics

Gateway Hub stores metrics in a PostgreSQL database.

You can observe PostgreSQL performance with the following metrics:

Metric Source Dimensions Description
processes_cpu_load System comm, hostname, pid CPU usage for a specific process identified by the pid dimension.
memory_used_swap_percent System hostname Percentage of swap memory used on the host.
processes_virtual_size System hostname, process_id, process_name Sum of all mapped memory used by a specific process, including swap.
processes_resident_set_size System hostname, process_id, process_name Sum of all physical memory used by a specific process, excluding swap. Known as the resident set.
disk_free System hostname, volume Total free space on the specified volume.
disk_total System hostname, volume Total space on the specified volume.
active_connections PostgreSQL app, hostname Number of connections currently active on the server.
app_connections PostgreSQL app, hostname, username Number of connections currently active on the server, grouped by application.
max_connections PostgreSQL app, hostname Maximum number of connections for the PostgreSQL server.

When creating alerts, note the following:

  • Ensure that excess CPU usage by PostgreSQL is monitored. For example, a runaway query could consume excessive CPU, which can affect the whole Gateway Hub node. Other causes of excessive CPU usage include expensive background workers, expensive queries, high ingestion rates, and misconfiguration of background worker counts.

  • Do not manually configure PostgreSQL after installation.

  • PostgreSQL, or any other process, should never swap; the memory budget should ensure that the overall memory allocation does not exceed the total physical memory.

  • It is possible that PostgreSQL may run at or near to its upper memory allocation. Storage systems will buffer pages into memory for caching and other efficiency reasons, so it is normal to see high but fairly constant memory usage.

  • Disk usage (disk_total minus disk_free) should not exceed 90 percent of disk_total for any volume.

  • PostgreSQL sets an upper limit on concurrent connections. If this limit is exceeded, database clients (including application code), may be blocked while waiting to be allocated a connection. A blocked client can result in a timeout which can significantly impact performance or throughput. During Gateway Hub installation PostgreSQL is configured with a maximum connection limit based on the available hardware.

  • The likely causes of a connection limit problem are connection leaks in application code or manual access to the PostgreSQL database. Both should be avoided.

  • The active_connections metric should not exceed 80 percent of max_connections.
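The disk and connection thresholds above can be combined into a simple derived check (illustrative Python, not Gateway rule syntax; the 90 percent disk-usage and 80 percent connection figures are taken from the guidance in this section):

```python
def postgres_alerts(disk_free, disk_total, active_conns, max_conns):
    """Evaluate the two PostgreSQL capacity rules described above.

    Returns a list naming each rule that is currently breached.
    Thresholds (90% disk usage, 80% of max_connections) follow the
    guidance in this section.
    """
    alerts = []
    used_fraction = 1 - disk_free / disk_total
    if used_fraction > 0.9:
        alerts.append("disk")         # usage above 90% of the volume
    if active_conns >= 0.8 * max_conns:
        alerts.append("connections")  # nearing the connection limit
    return alerts

# 95% full volume, 85 of 100 connections in use: both rules breached
print(postgres_alerts(disk_free=5, disk_total=100,
                      active_conns=85, max_conns=100))  # ['disk', 'connections']
```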