Hardware requirements

Overview

A Gateway Hub installation consists of a number of individual servers called nodes. A set of nodes collectively forms a cluster, and all nodes in a cluster should be co-located.

You will also need to consider the number of clusters that are required and where these should be physically located. For a global organisation, you should deploy one cluster in each geographical region (for example, Americas, EMEA, and APAC) to reduce network latency.

For more information about deploying multiple clusters, see Cluster sizing tool.

Cluster size and consensus

Gateway Hub requires a quorum of available nodes to achieve distributed consensus. If an insufficient number of nodes are available, then distributed consensus cannot be achieved and Gateway Hub will not operate correctly.

For a cluster with n nodes, quorum is given by floor(n/2) + 1, that is, n/2 rounded down to a whole number, plus one. As a result, an odd-sized cluster tolerates the same number of failures as the next larger even-sized cluster, but with fewer nodes. Fault tolerance scales with cluster size as follows:

Node count	Nodes required for consensus	Node failures tolerated
1	1	0
2	2	0
3	2	1
4	3	1
5	3	2
6	4	2

Note: The recommended number of nodes in production deployments is three. For testing or proof-of-concept deployments, a single node can be used.
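
The quorum rule above can be reproduced with a few lines of Python. This is a minimal sketch for illustration only; the function names `quorum` and `failures_tolerated` are assumptions and are not part of Gateway Hub.

```python
def quorum(node_count):
    """Nodes required for consensus: floor(n/2) + 1."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    return node_count // 2 + 1


def failures_tolerated(node_count):
    """Nodes that can fail while a quorum is still available."""
    return node_count - quorum(node_count)


# Print the same columns as the table above for clusters of 1 to 6 nodes.
for n in range(1, 7):
    print(n, quorum(n), failures_tolerated(n))
```

Growing a cluster from three nodes to four raises the quorum from two to three without increasing the number of failures tolerated, which is why odd cluster sizes are preferred.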

Hardware guidelines

Example scenarios

The specific requirements of your deployment will depend on the expected workload.

The following table shows hardware specifications for a range of indicative production scenarios using the default 3 days of Kafka storage and 90 days of metric storage.

Note: The test environments for all scenarios were created without the use of virtual machines. This is because virtual machines, by definition, execute on shared hardware and as a result it is difficult to make assumptions about CPU usage and disk contention. Running Gateway Hub on virtual machines may have different requirements to those listed here. Additionally, virtual machine environments may not be appropriate for extremely large estates.

Scenario	CPU cores	Memory (GiB)	Storage per node (GiB)	Disk IO (megabytes/sec)	Network IO (megabytes/sec)	Number of nodes	Disk configuration	Equivalent AWS instance	Equivalent AWS EBS volume	Test environment
10 probes with 1 user	8	16	24	102	0.02	1	Single disk	c5a.2xlarge	standard	AWS
100 probes with 3 users	16	16	186	107	0.05	1	Single disk	c5a.2xlarge	standard	AWS
1000 probes with 5 users	16	16	1854	203	0.28	1	Dedicated disks	c5a.2xlarge	standard	AWS
3000 probes with 50 users	32	128	3972	305	1.46	3	Dedicated disks	m5a.8xlarge	gp2	Bare metal
6000 probes with 50 users	32	128	7842	312	2.32	3	Dedicated disks	m5a.8xlarge	gp2	Bare metal

Disks

Disk configuration can have a large impact on the overall performance and stability of Gateway Hub.

Depending on the requirements of your deployment, we recommend one of the following strategies:

Single disk: A single disk for the whole installation. Use in low-stress scenarios.
Dedicated disks: Dedicated disks for etcd, Kafka, and PostgreSQL to maintain disk IO performance in high-stress scenarios.

Dedicated disks

In high-stress situations some Gateway Hub components, such as etcd and Zookeeper, benefit from independent disks.

Gateway Hub uses etcd as a distributed key-value store and Zookeeper to maintain Kafka state across the whole cluster. These components rely on the fsync system call to ensure that data is safely written to disk as part of reaching distributed consensus.

A failure to perform an fsync operation in less than one second typically causes an etcd distributed consensus failure followed by a failover. This can result in Gateway Hub being unresponsive for a noticeable amount of time. Zookeeper is similarly sensitive to disk latency, and a Zookeeper failure will affect Kafka ingestion. As a result, it is important to ensure that fsync operations complete quickly and that their duration remains predictable. The etcd documentation recommends that the 99th percentile of the fsync duration should be less than 10 ms for storage to be considered fast enough.

Performing an fsync operation flushes all pending writes to the relevant disk. In any shared environment this may include writes from other applications and in a virtualised environment this may also include writes from other guests sharing the same physical disk. As a result, the duration of the fsync operation can be increased significantly over the time required to flush the etcd or Zookeeper data.

To achieve predictable performance, you should ensure that etcd and Zookeeper are isolated from disk writes performed by other applications, including each other where possible.
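
As a rough way to compare a candidate disk against the 10 ms guideline mentioned above, the following sketch times repeated `os.fsync` calls and reports the 99th percentile. It is not the method used by etcd or Gateway Hub. The function names, the write count, and the block size are illustrative assumptions, and a dedicated benchmarking tool such as fio will give more reliable numbers.

```python
import math
import os
import sys
import tempfile
import time


def fsync_latencies(directory, writes=200, block_size=2048):
    """Append small blocks to a temporary file and time each fsync call."""
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        block = b"\0" * block_size
        for _ in range(writes):
            os.write(fd, block)
            start = time.perf_counter()
            os.fsync(fd)  # flush pending writes for this file to the device
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.remove(path)
    return latencies


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Pass a directory on the disk under test; defaults to the current directory.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    p99_ms = percentile(fsync_latencies(target), 99) * 1000
    print("p99 fsync latency: %.2f ms" % p99_ms)
```

Run the script against a directory on the disk intended for etcd or Zookeeper, ideally while the other workloads that share the disk are active. Results from a memory-backed file system are not meaningful.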

When configuring dedicated disks consider the following options:

Using a dedicated disk for each of etcd and Zookeeper. These disks do not need to exceed 10 GB in size.
Using SSDs instead of rotational disks.
Using a storage solution that supports write-back caching. Use this option with care; see Write-back caching.
Using dedicated disks for Gateway Hub's disk-intensive components, such as Kafka and PostgreSQL.

Write-back caching

When using write-back caching, performing an fsync operation no longer ensures that data has been durably written to disk. Instead, it only ensures that data has been received into a memory buffer. As a result, for production systems it is vital that write-back caching be paired with battery backup or non-volatile memory to ensure that data is not lost or corrupted if a power failure occurs. For more information, see Reliability and the Write-Ahead Log in the PostgreSQL documentation.

Cluster sizing tool

When deploying Gateway Hub you must determine the number of clusters that are required and where these should be physically located.

This is done based on the work that Gateway Hub is expected to perform in each region and the specifications of the nodes. The hubsize tool provides an estimate of the hardware requirements for a specified workload. Different regions may have different workloads and therefore require different cluster sizes.

To determine whether an individual server is capable of contributing to the cluster, check Hardware requirements and Software requirements. One of the most common reasons for installation failure is the unsuitability of a node in the cluster.

You can use the hubsize tool to estimate the hardware requirements of your installation environment.

Prerequisites

The hubsize tool can be downloaded from ITRS Downloads, and has the following dependencies:

Requirement	Versions Supported
Python	3.6 or newer
PyYAML	5.3.1 or newer

You can install all dependencies using the included requirements.txt file:

pip3 install --user --requirement requirements.txt
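
To confirm that an environment meets the versions listed above before running the tool, a check along the following lines can be used. This is a minimal sketch and not part of hubsize. The helper names are assumptions, and the simple version parser ignores any non-numeric components, so it does not handle pre-release version strings correctly.

```python
import sys


def version_tuple(text):
    """Convert a dotted version string such as '5.3.1' to a tuple of ints."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def check_prerequisites():
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    if sys.version_info < (3, 6):
        problems.append("Python 3.6 or newer is required")
    try:
        import yaml  # provided by the PyYAML package
    except ImportError:
        problems.append("PyYAML is not installed")
    else:
        if version_tuple(yaml.__version__) < (5, 3, 1):
            problems.append("PyYAML 5.3.1 or newer is required")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(problem)
```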

Define the requirements

The requirements of a Gateway Hub installation vary between use cases. The hubsize tool provides an estimate of the requirements of your installation based on the following criteria:

Parameter	Description
`configuration > hardware > clusterSize`	Number of nodes in the cluster.
`configuration > geneos > estimatedNrGeneosMessagesPerSec`	Estimated rate of Geneos metric messages. Alternatively, specify the expected number of Netprobes.
`configuration > geneos > estimatedNrGeneosEventMessagesPerSec`	Estimated rate of Geneos event messages. Alternatively, specify the expected number of Netprobes.
`configuration > geneos > numberOfGeneosProbes`	Number of connected Netprobes. Specifying message estimates provides more accurate results.
`configuration > hub > metrics > retentionPeriodDays`	Duration, in days, for which historical event and metric data is stored.
`configuration > hub > storageBuffer`	Specify a buffer factor that `hubsize` will use to account for growth when performing calculations.
`configuration > hub > kafka > retentionPeriodHours`	Duration, in hours, for which historical Kafka data is stored. You can alternatively specify the retention period in minutes using the `retentionPeriodMinutes` parameter or days using the `retentionPeriodDays` parameter.
`configuration > hub > kafka > replicationFactor`	Number of Kafka replicas.

You must specify these parameters in a definition.yml file that the hubsize tool will read. A human-readable example file is included at /hubsize/definition.yml.
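
To illustrate how the parameter paths in the table might map onto a YAML document, the sketch below builds a nested structure from `a > b > c` style paths and prints it as indented YAML. The nesting is inferred from the parameter paths only, and every value is a placeholder chosen for illustration (90 days and 72 hours mirror the default retention periods mentioned earlier; the other values, including the form of `storageBuffer`, are assumptions). Treat the example file shipped with the tool as the authoritative reference for the format.

```python
# Placeholder values for illustration only; not recommendations.
PARAMETERS = {
    "configuration > hardware > clusterSize": 3,
    "configuration > geneos > numberOfGeneosProbes": 3000,
    "configuration > hub > metrics > retentionPeriodDays": 90,
    "configuration > hub > storageBuffer": 1.2,  # assumed to be a multiplier
    "configuration > hub > kafka > retentionPeriodHours": 72,
    "configuration > hub > kafka > replicationFactor": 3,
}


def nest(paths):
    """Build a nested dict from {'a > b > c': value} entries."""
    root = {}
    for path, value in paths.items():
        keys = [key.strip() for key in path.split(">")]
        node = root
        for key in keys[:-1]:
            node = node.setdefault(key, {})
        node[keys[-1]] = value
    return root


def to_yaml(node, indent=0):
    """Render a nested dict of scalar values as indented YAML lines."""
    lines = []
    pad = "  " * indent
    for key, value in node.items():
        if isinstance(value, dict):
            lines.append("%s%s:" % (pad, key))
            lines.extend(to_yaml(value, indent + 1))
        else:
            lines.append("%s%s: %s" % (pad, key, value))
    return lines


definition = nest(PARAMETERS)
print("\n".join(to_yaml(definition)))
```

The example specifies the number of Netprobes rather than the two message-rate estimates, which the table describes as alternatives.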

Run the hubsize tool

To estimate the cluster requirements run:

./hubsize definition.yml

The hubsize tool has the following command line options:

Option	Description
`--help`	Print help text to standard output.
`--version`	Print the tool version to standard output.
`--out`	Specify a file to write the estimated hardware requirements to.
`--metrics`	Write the performance metrics used to a file.