A Gateway Hub installation consists of a number of individual servers called nodes. A set of nodes collectively forms a cluster, and all nodes in a cluster should be co-located.
You will also need to consider the number of clusters that are required and where these should be physically located. For a global organisation, you should deploy one cluster in each geographical region (for example, Americas, EMEA, and APAC) to reduce network latency.
For more information about deploying multiple clusters, see Cluster sizing tool.
Gateway Hub requires a quorum of available nodes to achieve distributed consensus. If an insufficient number of nodes are available, then distributed consensus cannot be achieved and Gateway Hub will not operate correctly.
For a cluster with n nodes, quorum is given by (n/2)+1, with n/2 rounded down. As a result, an odd-sized cluster tolerates the same number of failures as the next larger even-sized cluster, but with fewer nodes. Fault tolerance scales with cluster size as follows:
|Node count||Nodes required for consensus||Node failures tolerated|
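The quorum formula above can be worked through with a short script. This is a minimal sketch, assuming integer (rounded-down) division for n/2; the function names `quorum` and `failures_tolerated` are illustrative and are not part of Gateway Hub.

```python
def quorum(node_count: int) -> int:
    """Nodes required for consensus: floor(n / 2) + 1."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    return node_count // 2 + 1


def failures_tolerated(node_count: int) -> int:
    """Nodes that can fail while a quorum is still available."""
    return node_count - quorum(node_count)


if __name__ == "__main__":
    # Print one row per cluster size, in the same column order as the table.
    print("nodes  quorum  failures tolerated")
    for n in range(1, 8):
        print(f"{n:>5}  {quorum(n):>6}  {failures_tolerated(n):>18}")
```

For example, a three-node cluster needs 2 nodes for consensus and tolerates 1 failure, while a four-node cluster needs 3 nodes and still tolerates only 1 failure.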
Note: The recommended number of nodes in production deployments is three. However, for testing or proof-of-concept deployments, a single node can be used.
The specific requirements of your deployment will depend on the expected workload.
The following table shows hardware specifications for a range of indicative production scenarios using the default 3 days of Kafka storage and 90 days of metric storage.
Note: The test environments for all scenarios were created without the use of virtual machines. This is because virtual machines, by definition, execute on shared hardware and as a result it is difficult to make assumptions about CPU usage and disk contention. Running Gateway Hub on virtual machines may have different requirements to those listed here. Additionally, virtual machine environments may not be appropriate for extremely large estates.
|Scenario||CPU cores||Memory (GiB)||Storage per node (GiB)||Disk IO (megabytes/sec)||Network IO (megabytes/sec)||Number of nodes||Disk configuration||Equivalent AWS instance||Equivalent AWS EBS volume||Test environment|
|10 probes with 1 user||8||16||24||102||0.02||1||Single disk||c5a.2xlarge||standard||AWS|
|100 probes with 3 users||16||16||186||107||0.05||1||Single disk||c5a.2xlarge||standard||AWS|
|1000 probes with 5 users||16||16||1854||203||0.28||1||Dedicated disks||c5a.2xlarge||standard||AWS|
|3000 probes with 50 users||32||128||3972||305||1.46||3||Dedicated disks||m5a.8xlarge||gp2||Bare metal|
|6000 probes with 50 users||32||128||7842||312||2.32||3||Dedicated disks||m5a.8xlarge||gp2||Bare metal|
Disk configuration can have a large impact on the overall performance and stability of Gateway Hub.
Depending on the requirements of your deployment, we recommend one of the following strategies:
- Single disk: A single disk for the whole installation. Use in low-stress scenarios.
- Dedicated disks: Dedicated disks for etcd, Kafka, and PostgreSQL to maintain disk IO performance in high-stress scenarios.
In high-stress situations some Gateway Hub components, such as etcd and Zookeeper, benefit from independent disks.
Gateway Hub uses etcd as a distributed key-value store and Zookeeper to maintain Kafka state across the whole cluster. These components rely on the fsync system call to ensure that data is safely written to disk as part of reaching distributed consensus.
A failure to perform an fsync operation in less than one second typically causes an etcd distributed consensus failure followed by a failover. This can result in Gateway Hub being unresponsive for a noticeable amount of time. Similarly, Zookeeper is also sensitive to disk latency, and a Zookeeper failure will affect Kafka ingestion. As a result, it is important to ensure that fsync operations complete quickly and that their duration remains predictable. The etcd documentation recommends that the 99th percentile of the fsync duration should be less than 10 ms for storage to be considered fast enough.
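To get a rough feel for the 99th percentile figure mentioned above, fsync latency can be sampled with a short script. This is a minimal sketch and not a substitute for a dedicated disk benchmarking tool: the helper names, the sample count, and the 2048-byte payload are all assumptions chosen for illustration, and the script should be pointed at a directory on the disk you intend to use.

```python
import math
import os
import tempfile
import time


def fsync_latencies_ms(directory: str, samples: int = 200, payload_size: int = 2048) -> list:
    """Write small payloads and time each fsync call, in milliseconds."""
    payload = b"\0" * payload_size
    latencies = []
    with tempfile.NamedTemporaryFile(dir=directory) as handle:
        for _ in range(samples):
            handle.write(payload)
            handle.flush()  # move data from Python's buffer to the OS
            start = time.perf_counter()
            os.fsync(handle.fileno())  # ask the OS to flush to the device
            latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def percentile(values: list, fraction: float) -> float:
    """Nearest-rank percentile; fraction is between 0 and 1."""
    ordered = sorted(values)
    rank = max(1, math.ceil(fraction * len(ordered)))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Assumption: replace this with a directory on the disk under test.
    target = tempfile.gettempdir()
    p99 = percentile(fsync_latencies_ms(target), 0.99)
    print(f"p99 fsync latency: {p99:.2f} ms (guideline from the text: under 10 ms)")
```

Results from a shared or virtualised disk can vary widely between runs, which is itself a sign that fsync duration is not predictable on that storage.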
Performing an fsync operation flushes all pending writes to the relevant disk. In any shared environment this may include writes from other applications and in a virtualised environment this may also include writes from other guests sharing the same physical disk. As a result, the duration of the fsync operation can be increased significantly over the time required to flush the etcd or Zookeeper data.
To achieve predictable performance, you should ensure that etcd and Zookeeper are isolated from disk writes performed by other applications, including each other where possible.
When configuring dedicated disks consider the following options:
- Using a dedicated disk for each of etcd and Zookeeper. These disks do not need to exceed 10 GB in size.
- Using SSDs instead of rotational disks.
- Using a storage solution that supports write-back caching. Use this option with care, for the reasons described below.
- Using dedicated disks for Gateway Hub’s disk-intensive utilities, such as Kafka and PostgreSQL.
When using write-back caching, performing an fsync operation no longer ensures that data has been durably written to disk. Instead, it ensures only that data has been received into a memory buffer. As a result, for production systems it is vital that write-back caching be paired with battery backup or non-volatile memory to ensure that data is not lost or corrupted if a power failure occurs. For more information, see Reliability and the Write-Ahead Log in the PostgreSQL documentation.
When deploying Gateway Hub you must determine the number of clusters that are required and where these should be physically located.
This is done based on the work that Gateway Hub is expected to perform in each region and the specifications of the nodes. The hubsize tool provides an estimate of the hardware requirements for a specified workload. Different regions may have different workloads and therefore require different cluster sizes.
To determine whether an individual server is capable of contributing to the cluster, check Hardware Requirements and Software Requirements. One of the most common reasons for installation failure is the unsuitability of a node in the cluster.
You can use the hubsize tool to estimate the hardware requirements of your installation environment.
The hubsize tool can be downloaded from ITRS Downloads and has the following dependencies:
|Python||3.6 or newer|
|PyYAML||5.3.1 or newer|
You can install all dependencies using the included requirements.txt file:
pip3 install --user --requirement requirements.txt
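Before running the tool, the two dependency versions listed above can be checked with a short script. This is a sketch only: the helper names `parse_version`, `meets_minimum`, and `check_dependencies` are hypothetical, and the version comparison is deliberately simple (numeric components only), so it is not a full implementation of Python version ordering rules.

```python
import sys


def parse_version(text: str) -> tuple:
    """Turn '5.3.1' into (5, 3, 1); parsing stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


def check_dependencies() -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if sys.version_info < (3, 6):
        problems.append("Python 3.6 or newer is required")
    try:
        import yaml  # provided by the PyYAML package
    except ImportError:
        problems.append("PyYAML is not installed")
    else:
        if not meets_minimum(yaml.__version__, "5.3.1"):
            problems.append(f"PyYAML {yaml.__version__} is older than 5.3.1")
    return problems


if __name__ == "__main__":
    for problem in check_dependencies():
        print(problem)
```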
The requirements of a Gateway Hub installation vary between use cases. The hubsize tool provides an estimate of the requirements of your installation based on the following criteria:
You must specify these parameters in a definition.yml file that the hubsize tool will read. A human-readable example file is included in the
||Number of nodes in the cluster.|
||Estimated rate of Geneos metric messages. Alternatively, specify the expected number of Netprobes.|
||Estimated rate of Geneos event messages. Alternatively, specify the expected number of Netprobes.|
||Number of connected Netprobes.|
||Duration, in days, for which historical event and metric data is stored.|
||Specify a buffer factor that
||Duration, in hours, for which historical Kafka data is stored. You can alternatively specify the retention period in minutes using the
||Number of Kafka replicas.|
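To see why parameters such as message rate, Kafka retention, and replica count drive storage requirements, a back-of-the-envelope calculation can help. The sketch below is not how the hubsize tool computes its estimate. The formula, the function names, the average message size, and the `headroom` multiplier are all assumptions made for illustration.

```python
def kafka_storage_gib(messages_per_second: float,
                      avg_message_bytes: float,
                      retention_hours: float,
                      replicas: int,
                      headroom: float = 1.0) -> float:
    """Rough cluster-wide Kafka storage, in GiB, for a steady message rate.

    Assumes every message is kept for the full retention period and
    copied to every replica. `headroom` is a hypothetical safety multiplier.
    """
    retention_seconds = retention_hours * 3600
    total_bytes = (messages_per_second * avg_message_bytes
                   * retention_seconds * replicas * headroom)
    return total_bytes / 2**30


def per_node_gib(total_gib: float, node_count: int) -> float:
    """Spread the cluster-wide figure evenly across nodes."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    return total_gib / node_count


if __name__ == "__main__":
    # Illustrative inputs only: 1000 messages/s of 1 KiB each,
    # 72 hours of retention, 3 replicas, 3 nodes.
    total = kafka_storage_gib(1000, 1024, 72, 3)
    print(f"cluster: {total:.1f} GiB, per node: {per_node_gib(total, 3):.1f} GiB")
```

Use the output of the hubsize tool, not this approximation, when sizing a real deployment.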
To estimate the cluster requirements, run:
The hubsize tool has the following command-line options:
||Print help text to standard output.|
||Print the tool version to standard output.|
||Specify a file to write the estimated hardware requirements to.|
||Write the performance metrics used to a file.|