Sample configuration for AWS EC2 with NGINX Ingress controller (small)
Download this sample configuration provided by ITRS.
# Example ITRS Analytics configuration for AWS EC2 handling ~250,000 entities, ~1,000,000 time series,
# ~10,000 datapoints/sec (tested with approximately 8,100 metrics/sec, 2,000 logs/sec, 50 signals/sec, and
# 50 audit events/sec), and ~5,000 spans/sec (pre-sampling).
# Actual ingestion composition may vary by deployment.
#
# NODE REQUIREMENTS:
# - Total capacity needed: ~44 cores / ~112 GiB requests (~105 cores / ~156 GiB limits)
# - These totals include optional Linkerd sidecar resources
# - Minimum per node: 16 cores / 32 GiB
# - Example: (5) c5.4xlarge (16 cores / 32 GiB) or equivalent
#
# HA CONFIGURATION NOTE:
# This configuration provides seamless HA for service layer workloads (2 replicas minimum for stateless services).
#
# ClickHouse workloads (chmetrics, chplatform, chlogs, chtraces) each run 2 replicas in an active-active
# configuration — both replicas serve reads and writes simultaneously. There is no primary/standby distinction.
# Data is replicated asynchronously between replicas via ReplicatedMergeTree using ClickHouse Keeper (chkeeper).
#
# With 2 replicas, losing one replica has no immediate impact — reads and writes continue on the surviving
# replica without failover. The failed replica re-syncs automatically when it restarts (~2-5 minutes depending
# on data volume accumulated during the outage).
#
# ClickHouse Keeper (chkeeper) runs as a 3-node consensus cluster. Losing 1 of 3 keepers maintains quorum
# and has no impact on reads or writes. Losing 2 of 3 keepers breaks quorum, blocking replication and
# distributed DDL but NOT local reads or writes on individual ClickHouse nodes.
#
# There is no benefit to adding a 3rd ClickHouse replica purely for HA —
# 2 replicas already provide full read/write availability during single-node failure.
#
# DISK REQUIREMENTS:
# Estimated disk requirements based on default retention and the ingestion rate above
# (actual size will vary depending on the shape of the data being ingested).
# - Kafka broker: 100 GiB for each replica (x3)
# - Kafka controller: 10 GiB for each replica (x3)
# - Postgres: 3 GiB for each replica (x2)
# - ClickHouse Keeper: 2 GiB for each replica (x3)
# - ClickHouse Platform: 50 GiB for each replica (x2)
# - ClickHouse Metrics: 400 GiB for each replica (x2)
# - ClickHouse Logs: 200 GiB for each replica (x2)
# - ClickHouse Traces: 50 GiB for each replica (x2)
# - etcd: 16 GiB for each replica (x3)
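#
# As a rough cross-check derived from the per-replica figures above (not an ITRS-published total),
# the persistent volumes add up as follows:
#   Kafka broker 3 x 100 + Kafka controller 3 x 10 + Postgres 2 x 3 + ClickHouse Keeper 3 x 2
#   + ClickHouse Platform 2 x 50 + ClickHouse Metrics 2 x 400 + ClickHouse Logs 2 x 200
#   + ClickHouse Traces 2 x 50 + etcd 3 x 16
#   = 300 + 30 + 6 + 6 + 100 + 800 + 400 + 100 + 48 = 1,790 GiB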
#
# The configuration references a default storage class named `gp3`, which uses EBS gp3 volumes. This storage class
# should be configured with the default minimum gp3 settings of 3000 IOPS and 125 MiB/s throughput. You can create
# this class or change the config to use a class of your own, but it should offer similar performance.
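#
# A minimal sketch of such a storage class, assuming the AWS EBS CSI driver (ebs.csi.aws.com) is installed
# in the cluster. The binding mode and expansion settings are illustrative assumptions, not ITRS requirements.
# This is a separate Kubernetes manifest and is not part of this configuration file:
#
#   apiVersion: storage.k8s.io/v1
#   kind: StorageClass
#   metadata:
#     name: gp3
#   provisioner: ebs.csi.aws.com
#   parameters:
#     type: gp3
#     iops: "3000"        # gp3 baseline IOPS
#     throughput: "125"   # gp3 baseline throughput in MiB/s
#   volumeBindingMode: WaitForFirstConsumer
#   allowVolumeExpansion: true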
#
# This configuration is based upon a certain number of IAX entities, average metrics per entity, and
# average metrics collection interval. The following formula can be used to estimate the expected load:
#
# metrics/sec = (IAX entities * metrics/entity) / average metrics collection interval
#
# In this example configuration, we have the following:
#
# 10,000 metrics/sec = (250,000 IAX entities * 1 metric/entity) / 25 seconds average metrics collection interval
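#
# As a further illustration with hypothetical numbers (not a tested configuration), a deployment with
# 100,000 IAX entities, 5 metrics/entity, and a 60-second average collection interval would produce:
#
# ~8,333 metrics/sec = (100,000 IAX entities * 5 metrics/entity) / 60 seconds average metrics collection interval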
#
# NOTE: Ingestion, storage, and retrieval of OpenTelemetry spans is a beta feature.
#
# Additionally, the configuration is based upon a certain number of OpenTelemetry spans per second that are sampled
# based upon the following rules:
# - Error traces are always sampled
# - Target sampling probability per endpoint (corresponds to the name of the root span) is 0.01
# - Target sampling rate / second / endpoint (corresponds to the name of the root span) is 0.5
# - Root span duration outlier quantile is 0.95. The durations of all root spans are tracked and used to estimate
#   which spans are abnormally long
#
# UPGRADE NOTE: Timescale and Loki are no longer required for fresh installs (v2.18+).
# If upgrading from a pre-2.18 deployment, these workloads must remain and require additional resources:
# - Timescale: additional resources of ~4 cores / ~16 GiB (requests = limits)
# - Loki: additional resource requests of ~500m / ~1 GiB, with limits of ~1 core / ~8 GiB
# Additional disk requirements (sizes will vary based on existing deployment):
# - Timescale:
#   - 4 x timeseries data disks for each replica (x2)
#   - 1 x data disk for each replica (x2)
#   - 1 x WAL disk for each replica (x2)
# - Loki: 1 x data disk
#
# If upgrading from a pre-2.18 deployment, uncomment the timescale and loki sections at the bottom of this file
# and include the additional resources and disks listed under "UPGRADE NOTE" above.
#
defaultStorageClass: "gp3"
# Enforce even replica distribution of critical workloads across availability zones in multi-zone clusters.
enforceZoneSpread: true
apps:
  externalHostname: "iax.mydomain.internal"
  ingress:
    className: "nginx"
ingestion:
externalHostname: "iax-ingestion.mydomain.internal"
replicas: 2
ingress:
className: "nginx"
annotations:
nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
nginx.ingress.kubernetes.io/use-regex: "true"
usePathRegex: true
producerProperties:
buffer.memory: 67108864
resources:
requests:
memory: "512Mi"
cpu: "500m"
limits:
memory: "1Gi"
cpu: "750m"
traces:
jvmOpts: "-XX:MaxDirectMemorySize=120M -XX:MaxRAMPercentage=75"
producerProperties:
buffer.memory: 67108864
resources:
requests:
memory: "1500Mi"
cpu: "1"
limits:
memory: "2500Mi"
cpu: "2"
iam:
keycloak:
replicas: 2
ingress:
className: "nginx"
iamd:
replicas: 2
kafka:
  replicas: 3
  diskSize: "100Gi"
  resources:
    requests:
      memory: "3Gi"
      cpu: "1"
    limits:
      memory: "3Gi"
      cpu: "2"
  controller:
    replicas: 3
sinkd:
metrics:
jvmOpts: "-XX:MaxDirectMemorySize=200M"
replicas: 2
consumerProperties:
fetch.max.bytes: 20971520
fetch.max.wait.ms: 250
fetch.min.bytes: 5242880
max.partition.fetch.bytes: 5242880
max.poll.records: 100000
resources:
limits:
memory: "1200Mi"
entities:
resources:
limits:
memory: "1200Mi"
logs:
resources:
requests:
memory: "512Mi"
signals:
consumerProperties:
max.partition.fetch.bytes: 1048576
traces:
consumerProperties:
max.poll.records: 20000
resources:
requests:
memory: "756Mi"
cpu: "100m"
limits:
memory: "1200Mi"
cpu: "1"
platformd:
  replicas: 2
  resources:
    requests:
      memory: "1536Mi"
      cpu: "1"
    limits:
      memory: "2Gi"
      cpu: "2250m"
dpd:
  replicas: 2
  jvmOpts: "-XX:MaxRAMPercentage=70"
  secondLevelEntityCacheHeapPercent: 10
  hazelcast:
    jetIdleCooperativeMinMicroSeconds: 1000
    jetIdleCooperativeMaxMicroSeconds: 10000
    jetIdleNonCooperativeMinMicroSeconds: 1000
    jetIdleNonCooperativeMaxMicroSeconds: 10000
  consumerProperties:
    fetch.min.bytes: 524288
  metricsMultiplexer:
    maxFilterResultCacheSize: 500000
    maxConcurrentOps: 100
  resources:
    requests:
      memory: "4Gi"
      cpu: "1"
    limits:
      memory: "5Gi"
      cpu: "2"
entityStream:
intermediate:
consumerProperties:
max.partition.fetch.bytes: 1048576
producerProperties:
buffer.memory: 67108864
storedEntitiesCacheSize: 1000
final:
consumerProperties:
max.partition.fetch.bytes: 1048576
producerProperties:
buffer.memory: 67108864
resources:
requests:
memory: "1350Mi"
cpu: "300m"
limits:
memory: "2Gi"
cpu: "3"
signalsStream:
  consumerProperties:
    max.partition.fetch.bytes: 1048576
  resources:
    requests:
      memory: "768Mi"
      cpu: "150m"
    limits:
      memory: "1536Mi"
      cpu: "1200m"
etcd:
  replicas: 3
  diskSize: "16Gi"
kvStore:
  replicas: 2
licenced:
  replicas: 2
clickhouse:
  traces:
    replicas: 2
    diskSize: "50Gi"
    resources:
      limits:
        cpu: "2"
        memory: "10Gi"
      requests:
        cpu: "2"
        memory: "10Gi"
  metrics:
    replicas: 2
    diskSize: "400Gi"
    resources:
      limits:
        cpu: "3"
        memory: "8Gi"
      requests:
        cpu: "3"
        memory: "8Gi"
  platform:
    replicas: 2
    diskSize: "50Gi"
    resources:
      limits:
        cpu: "3"
        memory: "10Gi"
      requests:
        cpu: "3"
        memory: "10Gi"
  logs:
    replicas: 2
    diskSize: "200Gi"
    resources:
      limits:
        cpu: "4"
        memory: "8Gi"
      requests:
        cpu: "4"
        memory: "8Gi"
  keeper:
    replicas: 3
postgres:
  clusterSize: 2
statusMetricsStream:
  resources:
    limits:
      memory: "1280Mi"
    requests:
      memory: "768Mi"
#
# The Timescale configs need to be enabled if upgrading from a pre-2.18 deployment.
# The following is an example config. The actual configs should match the existing deployment,
# except that the resource values below have been reduced since Timescale only serves reads during migration.
#
#timescale:
#  sharedBuffersPercentage: 40
#  clusterSize: 2
#  dataDiskSize: "50Gi"
#  timeseriesDiskCount: 4
#  timeseriesDiskSize: "512Gi"
#  walDiskSize: "50Gi"
#  resources:
#    requests:
#      memory: "8Gi"
#      cpu: "2"
#    limits:
#      memory: "8Gi"
#      cpu: "2"
#
# The Loki configs need to be enabled if upgrading from a pre-2.18 deployment.
# The following is an example config. The actual configs should match the existing deployment,
# except that the resource values below may be increased depending on the volume of the logs.
# Loki memory limits depend on chunk data volume. A reasonable guideline is 2-4x the chunk data size,
# with a minimum of 4 GiB. The 8 GiB limit below assumes chunk data of ~2-4 GiB.
# Adjust if your deployment has significantly larger log volume.
#
#loki:
#  diskSize: "30Gi"
#  retentionTime: "168h"
#  resources:
#    limits:
#      cpu: "1"
#      memory: "8Gi"
#    requests:
#      cpu: "500m"
#      memory: "1Gi"