
Gateway Hub

Overview

You can monitor Gateway Hub via Gateway by using the metrics collected from the internal Netprobes running on each Gateway Hub node.

Configuring Gateway Hub self monitoring allows you to:

  • View Gateway Hub metrics in your Active Console or Web Console.

  • Investigate historical Gateway Hub metrics in your Active Console or Web Console.

  • Set Gateway alerting rules on Gateway Hub metrics.

You can also use the Gateway Hub data Gateway plugin to monitor the status of the connection between Gateway and Gateway Hub. For more information see Gateway Hub data in Gateway Plug-Ins.

Prerequisites

The following requirements must be met prior to the installation and setup of this integration:

  • Gateway version 5.5.x or newer.

  • Gateway Hub version 2.3.0 or newer.

You must have a valid Gateway license to use this feature outside of demo mode. Any licensing errors are reported by the Health Plugin.

Gateway Hub self monitoring

Caution: Gateway Hub self monitoring relies on the simplified mappings for Dynamic Entities, which is a pilot feature and is subject to change.

Each Gateway Hub node runs an internal Netprobe and Collection Agent to collect metrics on its own performance. You can connect to the internal Netprobe to monitor Gateway Hub from your Active Console. You must connect to each node individually.

Gateway Hub data points must be processed by configuring Dynamic Entities. To simplify this process, you can download an include file that provides this configuration. The Gateway Hub self monitoring include file uses the new simplified Dynamic Entities configuration options; this is a pilot feature and is subject to change.

Note: In Gateway Hub version 2.4.0, the Collection Agent plugin used to collect self-monitoring metrics was updated from linux-infra to system. This requires an updated include file with the correct plugin name and other changes. Ensure you have the correct include file for the version of Gateway Hub you are using.

Configure default self monitoring

To enable Gateway Hub self monitoring in Active Console:

  1. Download the geneos-integration-gateway-hub-<version>.xml include file from (missing or bad snippet). You should save this file to a location accessible to your Gateway.

  2. Open the Gateway Setup Editor.

  3. Right-click the Includes top level section, then select New Include.

  4. Set the following options:

    • Priority — Any value above 1.

    • Location — Specify the path to the location of the geneos-integration-gateway-hub-<version>.xml include file.

  5. Right-click the include file in the State Tree and select Expand all.
  6. Select Click to load. The new include file will load.
  7. Right-click the Probes top-level section, then select New Probe.
  8. Set the following options in the Basic tab:
    • Name — Specify the name that will appear in Active Console, for example Gateway_Hub_Self_Monitoring.
    • Hostname — Specify the hostname of your Gateway Hub node.
    • Port — 7036.
    • Secure — Enabled.
  9. Set the following options in the Dynamic Entities tab:
    • Mapping type — Hub.
  10. Repeat steps 7 to 9 for each Gateway Hub node you wish to monitor.
  11. Click Save current document.

In your Active Console, Dynamic Entities are created for metrics from Gateway Hub self monitoring.

Dataviews are automatically populated from the available metrics. In general, an entity is created for each Gateway Hub component. Components that use Java also include metrics on JVM performance.

Depending on how you have configured your Active Console, you may see repeated entities in the State Tree. This is because there are multiple Gateway Hub nodes, each running the same components. To organise your State Tree by node, perform the following steps:

  1. Open your Active Console.

  2. Navigate to Tools > Settings > General.

  3. Set the Viewpath option to hostname.

  4. Click Apply.

Configure log file monitoring

You can additionally monitor Gateway Hub log files using the FKM plugin. To do this you must configure a new Managed Entity using the hub-logs sampler provided as part of the geneos-integration-gateway-hub-<version>.xml include file.

To enable Gateway Hub log monitoring:

  1. Open the Gateway Setup Editor.

  2. Right-click the Managed Entities top level section, then select New Managed Entity.

  3. Set the following options:

    • Name — Specify the name that will appear in Active Console, for example Hub_Log_Monitoring.

    • Options > Probe — Specify the name of the internal Gateway Hub probe.

    • Sampler — hub-logs.

  4. Repeat steps 2 to 3 for each Gateway Hub node you wish to monitor.

  5. Click Save current document.

In your Active Console, an additional Managed Entity with the name you specified shows the status of the Gateway Hub log files on that node.

If you have configured Gateway Hub to store log files in a directory other than the default, you must direct the sampler to your logs directory. To do this, set the hub-logs-dir variable in the Advanced tab of your Managed Entity. For more information about setting variables, see managedEntities > managedEntity > var in Managed Entities and Managed Entity Groups.

Important metrics

If you have configured your Gateway to receive Gateway Hub self monitoring metrics, you may want to set up alerts for changes in the most important metrics. This section outlines the key metrics for the major Gateway Hub components and provides advice to help create meaningful alerts.

JVM memory

You can observe JVM memory with the following metrics:

Metric Description
jvm_memory_heap_memory_usage_Init

Initial amount of memory (in bytes) that the JVM requests from the operating system for memory management during startup.

The JVM may request additional memory from the operating system and may also release memory to the system over time.

The value of jvm_memory_heap_memory_usage_Init may be undefined.

jvm_memory_heap_memory_usage_Committed

Amount of memory (in bytes) that is guaranteed to be available for use by the JVM.

The amount of committed memory may change over time (increase or decrease). The value of jvm_memory_heap_memory_usage_Committed may be less than jvm_memory_heap_memory_usage_Init but will always be greater than jvm_memory_heap_memory_usage_Used.


jvm_memory_heap_memory_usage_Max

Maximum amount of memory (in bytes) that can be used for memory management. Its value may be undefined.

The maximum amount of memory may change over time if defined. The value of jvm_memory_heap_memory_usage_Used and jvm_memory_heap_memory_usage_Committed will always be less than or equal to jvm_memory_heap_memory_usage_Max if defined.

Memory allocation may fail if it attempts to increase the used memory such that the used memory is greater than the committed memory, even if the used memory would still be less than or equal to the maximum memory (for example, when the system is low on virtual memory).

jvm_memory_heap_memory_usage_Used

Amount of memory (in bytes) currently used by the JVM.

Observing high heap usage, where jvm_memory_heap_memory_usage_Used is nearing jvm_memory_heap_memory_usage_Max, can be an indicator that the heap memory allocation may not be sufficient.
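
To make this check concrete, the following minimal Python sketch flags a sample where used heap approaches the maximum. The sample values, the helper name, and the 90 percent threshold are illustrative assumptions rather than Gateway defaults; in practice you would express the same condition as a Gateway rule on the corresponding dataview cell.

# Minimal sketch: flag high heap usage from self-monitoring samples.
# The sample dict and the 90% threshold are illustrative assumptions.
def heap_usage_ratio(sample):
    """Return used/Max heap ratio, or None if Max is undefined."""
    used = sample.get("jvm_memory_heap_memory_usage_Used")
    maximum = sample.get("jvm_memory_heap_memory_usage_Max")
    if not maximum or maximum <= 0:
        return None  # Max may be undefined
    return used / maximum

sample = {
    "jvm_memory_heap_memory_usage_Used": 1_550_000_000,
    "jvm_memory_heap_memory_usage_Max": 1_600_000_000,
}
ratio = heap_usage_ratio(sample)
if ratio is not None and ratio > 0.90:
    print(f"WARNING: heap usage at {ratio:.0%} of Max")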

Note: While there might be a temptation to increase the memory allocation, the Gateway Hub installer calculates the ideal Gateway Hub memory settings based on the size of the machine being used. It is important not to over-allocate memory to any Gateway Hub component, as it may result in an over-commitment that can produce unexpected behaviours including failures or swapping.

JVM garbage collection

You can observe JVM garbage collection with the following metrics:

Metric Description
jvm_gc_G1_[Young | Old]_Generation_collection_count Total number of collections that have occurred.
jvm_gc_G1_[Young | Old]_Generation_collection_time Approximate accumulated collection elapsed time in milliseconds.

When creating alerts, note the following:

  • The jvm_gc_G1_Young_Generation_collection_time metric should be as small as possible. The JVM completely halts the execution of a program when running a young generation collection cycle, so these pauses should be as short as possible to avoid performance impacts on latency-sensitive components such as Kafka consumers or Zookeeper. A sketch of deriving a pause-time rate from this accumulated counter follows this list.
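
Because this metric is an accumulated counter, alerting is usually based on how quickly it grows rather than on its absolute value. The following minimal Python sketch derives the fraction of wall-clock time spent in garbage collection pauses from two consecutive samples; the sample values, the helper name, and the 5 percent budget are illustrative assumptions rather than recommended Gateway Hub settings.

# Minimal sketch: jvm_gc_*_collection_time is cumulative (in ms), so derive
# a pause-time fraction from the change between two samples.
def gc_pause_fraction(prev_ms, curr_ms, interval_s):
    """Fraction of wall-clock time spent in GC pauses over the interval."""
    delta_ms = max(curr_ms - prev_ms, 0.0)  # guard against counter resets
    return delta_ms / (interval_s * 1000.0)

# Two consecutive samples of jvm_gc_G1_Young_Generation_collection_time,
# taken 60 seconds apart (illustrative values).
fraction = gc_pause_fraction(prev_ms=12_400, curr_ms=15_700, interval_s=60)
if fraction > 0.05:
    print(f"WARNING: {fraction:.1%} of the last minute spent in young generation GC pauses")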

Kafka consumer metrics

Several Gateway Hub daemons contain Kafka consumers that consume messages from Kafka topic partitions and process them.

Key consumer metrics

You can observe Kafka consumers with the following metrics:

Metric Dimensions Description
kafka_consumer_bytes_consumed_rate app, client-id, hostname, topic

Average bytes consumed per topic, per second.

Example: kafka_consumer_bytes_consumed_rate Dimensions = app=snapshotd, topic=hub-metrics-normalised, hostname=hub-vm.internal, client-id=consumer-1

kafka_consumer_bytes_consumed_rate app, client-id, hostname

Average bytes consumed per second.

Example: kafka_consumer_bytes_consumed_rate Dimensions = app=snapshotd, hostname=hub-vm.internal, client-id=consumer-1

kafka_consumer_bytes_consumed_total app, client-id, hostname

Total bytes consumed.

Example: kafka_consumer_bytes_consumed_total Dimensions = app=snapshotd, hostname=hub-vm.internal, client-id=consumer-1

kafka_consumer_bytes_consumed_total app, client-id, hostname, topic

Total bytes consumed by topic.

Example: kafka_consumer_bytes_consumed_total Dimensions = app=snapshotd, topic=hub-metrics-normalised, hostname=hub-vm.internal, client-id=consumer-1

kafka_consumer_records_consumed_rate app, client-id, hostname, topic

Average number of records consumed per topic, per second.

Example: kafka_consumer_records_consumed_rate Dimensions = app=snapshotd, topic=hub-metrics-normalised, hostname=hub-vm.internal, client-id=consumer-1

kafka_consumer_records_consumed_rate app, client-id, hostname

The average number of records consumed per second.

Example: kafka_consumer_records_consumed_rate Dimensions = app=snapshotd, hostname=hub-vm.internal, client-id=consumer-1

kafka_consumer_records_consumed_total app, client-id, hostname, topic

Total records consumed.

Example: kafka_consumer_records_consumed_total Dimensions = app=snapshotd, topic=hub-metrics-normalised, hostname=hub-vm.internal, client-id=consumer-1


When creating alerts, note the following:

  • The kafka_consumer_bytes_consumed_rate metric is a measure of network bandwidth. This should stay largely constant; if it does not, this may indicate network problems.

  • The kafka_consumer_records_consumed_rate metric is a measure of actual records consumed. It may fluctuate depending on the message size and may not correlate with bytes consumed. In a healthy application, you would expect this metric to stay fairly constant. If this measure drops to zero, it may indicate a consumer failure.

Consumer lag

If a Kafka topic fills faster than it is consumed, the consumer falls behind; this is known as "lag". High lag means that your system is not keeping up with messages; near-zero lag means that it is keeping up. In operational terms, high lag means that there may be significant latency between a message being ingested into Gateway Hub and that message being reflected to a user (via a query or some other means). You should try to ensure that lag is close to zero; increasing lag is a problem.

You can observe lag with the following metrics:

Metric Dimensions Description
kafka_consumer_records_lag app, client-id, hostname, partition, topic

Number of messages the consumer is behind the producer on this partition.

Example: kafka_consumer_records_lag Dimensions = app=snapshotd, topic=hub-metrics-normalised, hostname=hub-vm.internal, partition=1, client-id=consumer-1

kafka_consumer_records_lag_avg app, client-id, hostname, partition, topic

Average number of messages the consumer is behind the producer on this partition.

Example: kafka_consumer_records_lag_avg Dimensions = app=snapshotd, topic=hub-metrics-normalised, hostname=hub-vm.internal, partition=1, client-id=consumer-1

kafka_consumer_records_lag_max app, client-id, hostname, partition, topic

Maximum number of messages the consumer is behind the producer on this partition.

Example: kafka_consumer_records_lag_max Dimensions = app=snapshotd, topic=hub-metrics-normalised, hostname=hub-vm.internal, partition=1, client-id=consumer-1


These Kafka consumer metrics work at the consumer level and not at the consumer group level. To get a complete picture you should watch the equivalent metrics across all nodes in the cluster.

When creating alerts, note the following:

  • The kafka_consumer_records_lag metric is the actual lag between the specific consumer in the daemon and the producer for the specified topic/partition. You should monitor this metric closely as it is a key indicator that the system may not be processing records quickly enough. A sketch of comparing lag across nodes follows this list.

  • If lag for the same topic across all nodes is roughly zero, there is no problem.

  • If lag is significantly higher for the same topics on different nodes, then a problem is likely present on specific nodes.

  • If lag is high across all nodes, then it may be an indicator that the Gateway Hub is overloaded across all nodes, possibly because the load present is higher than the node hardware is rated for.
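
The following minimal Python sketch illustrates this reasoning: it compares kafka_consumer_records_lag for the same topic across nodes and separates a node-specific problem from cluster-wide overload. The node names, lag values, and the threshold are illustrative assumptions.

# Minimal sketch: classify lag for one topic across Gateway Hub nodes.
from statistics import median

def classify_lag(lag_by_node, high=1000.0):
    values = list(lag_by_node.values())
    if all(v < high for v in values):
        return "ok: lag is close to zero on all nodes"
    if all(v >= high for v in values):
        return "cluster-wide: all nodes are lagging, Gateway Hub may be overloaded"
    worst = max(lag_by_node, key=lag_by_node.get)
    return f"node-specific: {worst} is far behind the median ({median(values):.0f})"

lag = {"hub-node-1": 120, "hub-node-2": 150, "hub-node-3": 48_000}
print(classify_lag(lag))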

Fetch rate

You can observe fetch rates with the following metrics:

Metric Dimensions Description
kafka_consumer_fetch_rate app, client-id, hostname

Number of fetch requests per second.

Example: kafka_consumer_fetch_rate Dimensions = app=snapshotd, hostname=hub-vm.internal, client-id=consumer-1

kafka_consumer_fetch_size_avg app, client-id, hostname

Average number of bytes fetched per request.

Example: kafka_consumer_fetch_size_avg Dimensions = app=snapshotd, hostname=hub-vm.internal, client-id=consumer-1

kafka_consumer_fetch_size_avg app, client-id, hostname, topic

Average number of bytes fetched per request.

Example: kafka_consumer_fetch_size_avg Dimensions = app=snapshotd, topic=hub-metrics-normalised, hostname=hub-vm.internal, client-id=consumer-1

kafka_consumer_fetch_size_max app, client-id, hostname, topic

Max number of bytes fetched per request.

Example: kafka_consumer_fetch_size_max Dimensions = app=snapshotd, hostname=hub-vm.internal, client-id=consumer-1


When creating alerts, note the following:

  • The kafka_consumer_fetch_rate metric indicates that the consumer is performing fetches; it should be fairly constant for a healthy consumer. If it drops, this could mean there is a problem with the consumer.

Kafka producer metrics

Kafka producers publish records to Kafka topics. When producers publish messages in a reliable system, they must be sure that messages have been received (unless explicitly configured not to care). To do this, producers receive acknowledgements from brokers. In some configurations, a producer does not require acknowledgements from all brokers; it merely needs to receive a minimum number (to achieve a quorum). In other configurations, it may need acknowledgements from all brokers. In either case, the act of receiving acknowledgements is somewhat latency-sensitive and will impact how fast a producer can push messages.

When configuring Kafka producers, consider the following:

  • Producers can send messages in batches. This is generally more efficient than sending individual messages because the conversation with the broker is less extensive and fewer acknowledgements are required.

  • Producers can compress messages. Compression makes messages smaller, which requires less network bandwidth but needs more CPU power.

You can observe Kafka producer behaviour with the following metrics:

Metric Dimensions Description
kafka_producer_batch_size_avg app, client-id, hostname

Average number of bytes sent per partition, per request.

Example: kafka_producer_batch_size_avg Dimensions = app=snapshotd, hostname=hub-vm.internal, client-id=producer-1

kafka_producer_compression_rate_avg app, client-id, hostname

Average compression rate of record batches.

Example: kafka_producer_compression_rate_avg Dimensions = app=snapshotd, hostname=hub-vm.internal, client-id=producer-1

kafka_producer_node_response_rate app, client-id, hostname, node-id

Average number of responses received per second from the broker.

Example: kafka_producer_node_response_rate Dimensions = app=snapshotd, node-id=node--1, hostname=hub-vm.internal, client-id=producer-1

kafka_producer_request_rate app, client-id, hostname

Average number of requests sent per second to the broker.

Example: kafka_producer_request_rate Dimensions = app=snapshotd, hostname=hub-vm.internal, client-id=producer-1

kafka_producer_node_request_latency_avg app, client-id, hostname, node-id

Average request latency in milliseconds for a node.

Example: kafka_producer_node_request_latency_avg Dimensions = app=snapshotd, node-id=node--1, hostname=hub-vm.internal, client-id=producer-1

kafka_producer_io_wait_time_ns_avg app, client-id, hostname

Average length of time the I/O thread spends waiting for a socket ready for reads or writes in nanoseconds.

Example: kafka_producer_io_wait_time_ns_avg Dimensions = app=snapshotd, hostname=hub-vm.internal, client-id=producer-1

kafka_producer_outgoing_byte_rate app, client-id, hostname

Average number of bytes sent per second to the broker.

Example: kafka_producer_outgoing_byte_rate Dimensions = app=snapshotd, hostname=hub-vm.internal, client-id=producer-1


When creating alerts, note the following:

  • The kafka_producer_batch_size_avg metric indicates the average size of batches sent to the broker. Large batches are preferred: small batches do not compress as well and need to be sent more often, which requires more network traffic. This value should not vary greatly under a reasonably constant load.

  • If the kafka_producer_node_response_rate metric is low this may indicate that the producer is falling behind and that data can't be consumed at an ideal rate. This value should not vary greatly under a reasonably constant load.

  • If the kafka_producer_request_rate is low under high load, this could indicate an issue with the producer. An extremely high rate could also indicate a problem, as it may mean consumers struggle to keep up (which could require throttling).

  • Generally batches should be large. However, large batches may also increase the kafka_producer_node_request_latency_avg metric. This is because a producer may wait until it builds up a big enough batch before it initiates a send operation (this behaviour is controlled by the linger.ms Kafka setting). You should prefer throughput over latency; however, this is a trade-off and too much latency can also be problematic. Large batches are most likely to cause a problem in high load scenarios.

  • If the kafka_producer_io_wait_time_ns_avg metric is high, this means that the producer is spending a lot of time waiting on network resources while the CPU is essentially idle. This may point to resource saturation, a slow network, or similar problems.

Kafka broker metrics

All Kafka messages pass through a broker, so if the broker is encountering problems this can have a wider impact on performance and reliability.

You can observe Kafka broker behaviour with the following metrics:

Metric Dimensions Description
server_replica_under_replicated_partitions app, broker_id, cluster_id, hostname

Number of under-replicated partitions.

Each Kafka partition may be replicated in order to provide reliability guarantees. In the set of replicas for a given partition, one is chosen as the leader. The leader is always considered in sync. The remaining replicas are also considered in sync, provided they are not too far behind the leader. Synchronised replicas form the ISR (In Sync Replica) pool. If a partition lags too far behind, it is removed from the ISR pool. Producers may require a minimum number of ISRs in order to operate reliably. When the ISR pool shrinks, you will see an increase in under-replicated partitions.

Example: server_replica_under_replicated_partitions Dimensions = broker_id=0, hostname=hub-vm.internal, cluster_id=a1ZajBN8QiS60Hs_AZte3Q, app=kafka

server_replica_isr_expands_per_sec_<attribute> app, broker_id, cluster_id, hostname

Rate at which the pool of in-sync replicas (ISRs) expands.

Example: server_replica_isr_expands_per_sec_count Dimensions = broker_id=0, hostname=hub-vm.internal, cluster_id=a1ZajBN8QiS60Hs_AZte3Q, app=kafka

server_replica_isr_shrinks_per_sec_<attribute> app, broker_id, cluster_id, hostname

Rate at which the pool of in-sync replicas (ISRs) shrinks.

Example: server_replica_isr_shrinks_per_sec_count Dimensions = broker_id=0, hostname=hub-vm.internal, cluster_id=a1ZajBN8QiS60Hs_AZte3Q, app=kafka

controller_active_controller_count app, broker_id, cluster_id, hostname

Number of active controllers in the cluster.

Example: controller_active_controller_count Dimensions = broker_id=0, hostname=hub-vm.internal, cluster_id=a1ZajBN8QiS60Hs_AZte3Q, app=kafka

controller_offline_partitions_count app, broker_id, cluster_id, hostname

Number of partitions that do not have an active leader and are hence not writable or readable.

Example: controller_offline_partitions_count Dimensions = broker_id=0, hostname=hub-vm.internal, cluster_id=a1ZajBN8QiS60Hs_AZte3Q, app=kafka

controller_leader_election_rate_and_time_ms app, broker_id, cluster_id, hostname Rate of leader elections per second and the overall duration the cluster went without a leader.
controller_unclean_leader_elections_per_sec_count app, broker_id, cluster_id, hostname

Unclean leader election rate.

If a broker goes offline, some partitions will be leaderless and Kafka will elect a new leader from the ISR pool. Gateway Hub does not allow unclean elections, hence the new leader must come from the ISR pool.

Example: controller_unclean_leader_elections_per_sec_count Dimensions = broker_id=0, hostname=hub-vm.internal, cluster_id=a1ZajBN8QiS60Hs_AZte3Q, app=kafka


When creating alerts, note the following:

  • The server_replica_under_replicated_partitions metric should never be greater than zero.

  • The server_replica_isr_expands_per_sec_<attribute> metric should not vary significantly.

  • The controller_active_controller_count metric must be equal to one. There should be exactly one controller per cluster.

  • The controller_offline_partitions_count metric should never be greater than zero.

  • A high rate of leader elections, as indicated by the controller_leader_election_rate_and_time_ms metric, suggests brokers are fluctuating between offline and online statuses. Additionally, taking too long to elect a leader will result in partitions being inaccessible for long periods.

  • The controller_unclean_leader_elections_per_sec_count metric should never be greater than zero.

Zookeeper metrics

Kafka uses Zookeeper to store metadata about topics and brokers. It plays a critical role in ensuring Kafka's performance and stability. If Zookeeper is not available Kafka cannot function.

Zookeeper is very sensitive to I/O latency. In particular, disk latency can have severe impacts on Zookeeper because quorum operations must be completed quickly.

You can observe Zookeeper with the following metrics:

Metric Dimensions Description
zookeeper_outstanding_requests app, hostname, port

Number of requests from followers that have yet to be acknowledged.

Example: zookeeper_outstanding_requests Dimensions = hostname=hub-vm.internal, port=5181, app=zookeeper

zookeeper_avg_request_latency app, hostname, port

Average time to respond to a client request.

Example: zookeeper_avg_request_latency Dimensions = hostname=hub-vm.internal, port=5181, app=zookeeper

zookeeper_max_client_cnxns_per_host app, hostname, port

Maximum number of concurrent client connections allowed from a single host.

Example: zookeeper_max_client_cnxns_per_host Dimensions = hostname=hub-vm.internal, port=5181, app=zookeeper

zookeeper_num_alive_connections app, hostname, port

Number of connections currently open. Should be well under the configured maximum connections for safety.

Example: zookeeper_num_alive_connections Dimensions = hostname=hub-vm.internal, port=5181, app=zookeeper

zookeeper_fsync_threshold_exceed_count app, hostname, port

Count of instances where fsync time has exceeded the warning threshold.

Example: zookeeper_fsync_threshold_exceed_count Dimensions = hostname=hub-vm.internal, port=5181, app=zookeeper


When creating alerts, note the following:

  • The zookeeper_outstanding_requests metric should be low. Confluent suggests that this value should be below 10.

  • The zookeeper_avg_request_latency metric should be as low as possible (typical values should be less than 10 ms), and ideally fairly constant. You should investigate if this number spikes or shows high variability.

  • If the zookeeper_fsync_threshold_exceed_count metric increases steadily, there may be a problem with disk latency. Ideally, this metric should be zero or static (in the case of recovery after a latency problem).

etcd metrics

etcd is used by Gateway Hub as a key-value store. Disk latency is the most important factor to monitor for etcd, but CPU starvation can also cause problems.

Many etcd metrics are provided as histograms composed of several gauges. Etcd histogram buckets are cumulative. See below for an example:

disk_wal_fsync_duration_seconds_bucket_0.001 = 2325
disk_wal_fsync_duration_seconds_bucket_0.002 = 4642
disk_wal_fsync_duration_seconds_bucket_0.004 = 5097
disk_wal_fsync_duration_seconds_bucket_0.008 = 5187
disk_wal_fsync_duration_seconds_bucket_0.016 = 5248
disk_wal_fsync_duration_seconds_bucket_0.032 = 5253
disk_wal_fsync_duration_seconds_bucket_0.064 = 5254
disk_wal_fsync_duration_seconds_bucket_0.128 = 5254
disk_wal_fsync_duration_seconds_bucket_0.256 = 5254
disk_wal_fsync_duration_seconds_bucket_0.512 = 5254
disk_wal_fsync_duration_seconds_bucket_1.024 = 5254
disk_wal_fsync_duration_seconds_bucket_2.048 = 5254
disk_wal_fsync_duration_seconds_bucket_4.096 = 5254
disk_wal_fsync_duration_seconds_bucket_8.192 = 5254
disk_wal_fsync_duration_seconds_sum = 7.362459756999995
disk_wal_fsync_duration_seconds_count = 5254

The value of the disk_wal_fsync_duration_seconds_bucket_<x.y> metric is the cumulative count of fsync operations that completed within the duration given by the <x.y> suffix. In this example, the bucket values increase up to the 0.064 second bucket and then remain static, meaning every recorded fsync completed within 0.064 seconds.
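
As an illustration of reading these cumulative buckets, the following minimal Python sketch estimates an upper bound for a chosen percentile of fsync duration from the example values above. The helper name and the bucket dictionary are illustrative assumptions, not an etcd or Gateway Hub API.

# Minimal sketch: estimate a percentile upper bound from cumulative buckets.
buckets = {  # upper bound (seconds) -> cumulative count
    0.001: 2325, 0.002: 4642, 0.004: 5097, 0.008: 5187,
    0.016: 5248, 0.032: 5253, 0.064: 5254, 0.128: 5254,
}
total = 5254  # disk_wal_fsync_duration_seconds_count

def percentile_upper_bound(buckets, total, q):
    """Smallest bucket bound whose cumulative count covers quantile q."""
    target = q * total
    for bound in sorted(buckets):
        if buckets[bound] >= target:
            return bound
    return max(buckets)

# 99% of recorded fsync calls completed within this many seconds.
print(percentile_upper_bound(buckets, total, 0.99))  # -> 0.016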

Key metrics

You can observe etcd with the following metrics:

Metric Description
server_health_failures Total number of failed health checks.
server_heartbeat_send_failures_total Total number of leader heartbeat send failures (likely caused by an overloaded or slow disk). If non-zero and increasing, this could be due to a slow disk and can be a prelude to a cluster failover.

When creating alerts, note the following:

  • The rate of server health failures should be low.

  • If the server_heartbeat_send_failures_total metric is increasing, this may indicate a slow disk.

Etcd latency metrics

Most issues with etcd relate to slow disks. High disk latency can cause long pauses that will lead to missed heartbeats, and potentially failovers in the etcd cluster. Disk latency will also contribute to high request latency.

You can observe etcd latency with the following metrics:

Metric Description
disk_backend_commit_duration_seconds_bucket_<bucket> Presented as a histogram. The latency distribution of commits called by the backend.
disk_wal_fsync_duration_seconds_bucket_<bucket> Presented as a histogram. The latency distribution of fsync calls made by the Write Ahead Log (WAL).

When creating alerts, note the following:

  • The 99th percentile of the disk_backend_commit_duration_seconds_bucket_<bucket> metric should be less than 25 ms.

  • The disk_wal_fsync_duration_seconds_bucket_<bucket> metric should be fairly constant and ideally, as low as possible.

PostgreSQL metrics

Gateway Hub stores metrics in a PostgreSQL database.

You can observe PostgreSQL performance with the following metrics:

Metric Dimensions Description
processes_cpu_usage comm, hostname, pid CPU usage for a specific process identified by the pid dimension.
processes_vmswap comm, hostname, pid Swap memory used for a specific process.
processes_vmsize comm, hostname, pid Sum of all mapped memory used by a specific process, including swap.
processes_vmrss comm, hostname, pid Sum of all physical memory used by a specific process, excluding swap. Known as the resident set.
fs_free device, hostname, volume Total free space on the specified volume.
fs_used device, hostname, volume Total used space on the specified volume.
fs_total device, hostname, volume Total space on the specified volume.
active_connections app, hostname Number of connections currently active on the server.
app_connections app, hostname, username Number of connections currently active on the server, grouped by application.
max_connections app, hostname Maximum number of connections for the PostgreSQL server.

When creating alerts, note the following:

  • Ensure that excess CPU usage by PostgreSQL is monitored. For example, a runaway query could consume excessive CPU, which can affect the whole Gateway Hub node. Other causes of excessive CPU usage include expensive background workers, expensive queries, high ingestion rates, and misconfigured background worker counts.

  • Do not manually configure PostgreSQL after installation.

  • At no time should PostgreSQL or any other process use swap; the memory budget should ensure that overall memory allocation does not exceed the total memory.

  • PostgreSQL may run at or near its upper memory allocation. Storage systems buffer pages into memory for caching and other efficiency reasons, so it is normal to see high but fairly constant memory usage.

  • The fs_used metric should not exceed 90 percent of fs_total for any volume.

  • PostgreSQL sets an upper limit on concurrent connections. If this limit is exceeded, database clients (including application code) may be blocked while waiting to be allocated a connection. A blocked client can result in a timeout, which can significantly impact performance or throughput. During Gateway Hub installation, PostgreSQL is configured with a maximum connection limit based on the available hardware.

  • The likely causes of a connection limit problem are connection leaks in application code or manual access to the PostgreSQL database. Both should be avoided.

  • The active_connections metric should not exceed 80 percent of max_connections. A sketch of checking these headroom thresholds follows this list.
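
The following minimal Python sketch expresses the disk and connection headroom thresholds above as simple checks. The sample values are illustrative assumptions; in practice you would implement the same conditions as Gateway rules on the corresponding dataview cells.

# Minimal sketch: check the 90% disk and 80% connection headroom thresholds.
def check_disk(fs_used, fs_total):
    """True if used space stays within 90% of the volume."""
    return fs_used <= 0.90 * fs_total

def check_connections(active_connections, max_connections):
    """True if active connections stay within 80% of max_connections."""
    return active_connections <= 0.80 * max_connections

print(check_disk(fs_used=410_000_000_000, fs_total=500_000_000_000))  # True (82% used)
print(check_connections(active_connections=95, max_connections=100))  # False (95% used)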