Sample configuration for AWS EC2 with NGINX Ingress controller (small)

Download this sample configuration provided by ITRS.

# Example ITRS Analytics configuration for AWS EC2 handling ~250,000 entities, ~1,000,000 time series,
# ~10,000 datapoints/sec (tested with approximately 8100 metrics/sec, 2000 logs/sec, 50 signals/sec, and 50 audit events/sec),
# and ~5000 spans/sec (pre-sampling).
# Actual ingestion composition may vary by deployment.
#
# NODE REQUIREMENTS:
# - Total capacity needed: ~44 cores / ~112 GiB requests (~105 cores / ~156 GiB limits)
# - These totals include optional Linkerd sidecar resources
# - Minimum per node: 16 cores / 32 GiB
# - Example: 5 x c5.4xlarge instances (16 cores / 32 GiB each) or equivalent
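#   (5 x c5.4xlarge provides 80 cores / 160 GiB in total, which covers the ~44 core / ~112 GiB of requests above with headroom.)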
#
# HA CONFIGURATION NOTE:
# This configuration provides seamless HA for service layer workloads (2 replicas minimum for stateless services).
#
# ClickHouse workloads (chmetrics, chplatform, chlogs, chtraces) each run 2 replicas in an active-active
# configuration — both replicas serve reads and writes simultaneously. There is no primary/standby distinction.
# Data is replicated asynchronously between replicas via ReplicatedMergeTree using ClickHouse Keeper (chkeeper).
#
# With 2 replicas, losing one replica has no immediate impact — reads and writes continue on the surviving
# replica without failover. The failed replica re-syncs automatically when it restarts (~2-5 minutes depending
# on data volume accumulated during the outage).
#
# ClickHouse Keeper (chkeeper) runs as a 3-node consensus cluster. Losing 1 of 3 keepers maintains quorum
# and has no impact on reads or writes. Losing 2 of 3 keepers breaks quorum, blocking replication and
# distributed DDL but NOT local reads or writes on individual ClickHouse nodes.
#
# There is no benefit to adding a 3rd ClickHouse replica purely for HA —
# 2 replicas already provide full read/write availability during single-node failure.
#
# DISK REQUIREMENTS:
# Estimated disk requirements based on default retention and the ingestion rate above
# (actual size will vary depending on the shape of the data being ingested).
# - Kafka broker: 100 GiB for each replica (x3)
# - Kafka controller: 10 GiB for each replica (x3)
# - Postgres: 3 GiB for each replica (x2)
# - ClickHouse Keeper: 2 GiB for each replica (x3)
# - ClickHouse Platform: 50 GiB for each replica (x2)
# - ClickHouse Metrics: 400 GiB for each replica (x2)
# - ClickHouse Logs: 200 GiB for each replica (x2)
# - ClickHouse Traces: 50 GiB for each replica (x2)
# - etcd: 16 GiB for each replica (x3)
#
# The configuration references a default storage class named `gp3`, which uses EBS gp3 volumes. This storage class should
# be configured with the gp3 baseline of 3000 IOPS and 125 MiB/s throughput. You can create this class yourself or change
# the config to use a class of your own, but it should offer comparable performance.
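#
# For reference, a StorageClass manifest matching the class described above might look like the following. This is a
# sketch only; it assumes the AWS EBS CSI driver is installed, and it is not part of this configuration file.
#
# apiVersion: storage.k8s.io/v1
# kind: StorageClass
# metadata:
#   name: gp3
# provisioner: ebs.csi.aws.com
# parameters:
#   type: gp3
#   iops: "3000"
#   throughput: "125"
# volumeBindingMode: WaitForFirstConsumer
# allowVolumeExpansion: true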
#
# This configuration is based upon a certain number of IAX entities, average metrics per entity, and
# average metrics collection interval. The following formula can be used to estimate the expected metrics load:
#
# metrics/sec = (IAX entities * metrics/entity) / average metrics collection interval
#
# In this example configuration, we have the following:
#
# 10,000 metrics/sec = (250,000 IAX entities * 1 metric/entity) / 25 seconds average metrics collection interval
#
# NOTE: Ingestion, storage, and retrieval of OpenTelemetry spans is a beta feature.
#
# Additionally, the configuration assumes a certain number of OpenTelemetry spans per second, sampled
# according to the following rules:
# - Error traces are always sampled
# - Target sampling probability per endpoint (corresponds to the name of the root span) is 0.01
# - Target sampling rate / second / endpoint (corresponds to the name of the root span) is 0.5
# - Root span duration outlier quantile is 0.95. The durations of all root spans are tracked and used to identify
#   abnormally long spans.
#
# UPGRADE NOTE: Timescale and Loki are no longer required for fresh installs (v2.18+).
# If upgrading from a pre-2.18 deployment, these workloads must remain and require additional resources:
# - Timescale: additional resources of ~4 cores / ~16 GiB (requests = limits)
# - Loki: additional resource requests of ~500m / ~1 GiB, with limits of ~1 core / ~8 GiB
# Additional disk requirements (sizes will vary based on existing deployment):
# - Timescale:
#   - 4 x timeseries data disks for each replica (x2)
#   - 1 x data disk for each replica (x2)
#   - 1 x WAL disk for each replica (x2)
# - Loki: 1 x data disk
#
# If upgrading from a pre-2.18 deployment, uncomment the timescale and loki sections at the bottom of this file
# and include the additional resources and disks listed under "UPGRADE NOTE" above.
#
defaultStorageClass: "gp3"
# Enforce even replica distribution of critical workloads across availability zones in multi-zone clusters.
enforceZoneSpread: true
apps:
  externalHostname: "iax.mydomain.internal"
  ingress:
    className: "nginx"
ingestion:
  externalHostname: "iax-ingestion.mydomain.internal"
  replicas: 2
  ingress:
    className: "nginx"
    annotations:
      nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
      nginx.ingress.kubernetes.io/use-regex: "true"
    usePathRegex: true
  producerProperties:
    buffer.memory: 67108864
  resources:
    requests:
      memory: "512Mi"
      cpu: "500m"
    limits:
      memory: "1Gi"
      cpu: "750m"
  traces:
    jvmOpts: "-XX:MaxDirectMemorySize=120M -XX:MaxRAMPercentage=75"
    producerProperties:
      buffer.memory: 67108864
    resources:
      requests:
        memory: "1500Mi"
        cpu: "1"
      limits:
        memory: "2500Mi"
        cpu: "2"
iam:
  keycloak:
    replicas: 2
  ingress:
    className: "nginx"
iamd:
  replicas: 2
kafka:
  replicas: 3
  diskSize: "100Gi"
  resources:
    requests:
      memory: "3Gi"
      cpu: "1"
    limits:
      memory: "3Gi"
      cpu: "2"
  controller:
    replicas: 3
sinkd:
  metrics:
    jvmOpts: "-XX:MaxDirectMemorySize=200M"
    replicas: 2
    consumerProperties:
      fetch.max.bytes: 20971520
      fetch.max.wait.ms: 250
      fetch.min.bytes: 5242880
      max.partition.fetch.bytes: 5242880
      max.poll.records: 100000
    resources:
      limits:
        memory: "1200Mi"
  entities:
    resources:
      limits:
        memory: "1200Mi"
  logs:
    resources:
      requests:
        memory: "512Mi"
  signals:
    consumerProperties:
      max.partition.fetch.bytes: 1048576
  traces:
    consumerProperties:
      max.poll.records: 20000
    resources:
      requests:
        memory: "756Mi"
        cpu: "100m"
      limits:
        memory: "1200Mi"
        cpu: "1"
platformd:
  replicas: 2
  resources:
    requests:
      memory: "1536Mi"
      cpu: "1"
    limits:
      memory: "2Gi"
      cpu: "2250m"
dpd:
  replicas: 2
  jvmOpts: "-XX:MaxRAMPercentage=70"
  secondLevelEntityCacheHeapPercent: 10
  hazelcast:
    jetIdleCooperativeMinMicroSeconds: 1000
    jetIdleCooperativeMaxMicroSeconds: 10000
    jetIdleNonCooperativeMinMicroSeconds: 1000
    jetIdleNonCooperativeMaxMicroSeconds: 10000
  consumerProperties:
    fetch.min.bytes: 524288
  metricsMultiplexer:
    maxFilterResultCacheSize: 500000
    maxConcurrentOps: 100
  resources:
    requests:
      memory: "4Gi"
      cpu: "1"
    limits:
      memory: "5Gi"
      cpu: "2"
entityStream:
  intermediate:
    consumerProperties:
      max.partition.fetch.bytes: 1048576
    producerProperties:
      buffer.memory: 67108864
    storedEntitiesCacheSize: 1000
  final:
    consumerProperties:
      max.partition.fetch.bytes: 1048576
    producerProperties:
      buffer.memory: 67108864
    resources:
      requests:
        memory: "1350Mi"
        cpu: "300m"
      limits:
        memory: "2Gi"
        cpu: "3"
signalsStream:
  consumerProperties:
    max.partition.fetch.bytes: 1048576
  resources:
    requests:
      memory: "768Mi"
      cpu: "150m"
    limits:
      memory: "1536Mi"
      cpu: "1200m"
etcd:
  replicas: 3
  diskSize: "16Gi"
kvStore:
  replicas: 2
licenced:
  replicas: 2
clickhouse:
  traces:
    replicas: 2
    diskSize: "50Gi"
    resources:
      limits:
        cpu: "2"
        memory: "10Gi"
      requests:
        cpu: "2"
        memory: "10Gi"
  metrics:
    replicas: 2
    diskSize: "400Gi"
    resources:
      limits:
        cpu: "3"
        memory: "8Gi"
      requests:
        cpu: "3"
        memory: "8Gi"
  platform:
    replicas: 2
    diskSize: "50Gi"
    resources:
      limits:
        cpu: "3"
        memory: "10Gi"
      requests:
        cpu: "3"
        memory: "10Gi"
  logs:
    replicas: 2
    diskSize: "200Gi"
    resources:
      limits:
        cpu: "4"
        memory: "8Gi"
      requests:
        cpu: "4"
        memory: "8Gi"
  keeper:
    replicas: 3
postgres:
  clusterSize: 2
statusMetricsStream:
  resources:
    limits:
      memory: "1280Mi"
    requests:
      memory: "768Mi"
#
# The Timescale configs need to be enabled if upgrading from a pre-2.18 deployment.
# The following is an example config. The actual configs should match the existing deployment,
# except that the resource values below have been reduced since Timescale only serves reads during migration.
#
#timescale:
#  sharedBuffersPercentage: 40
#  clusterSize: 2
#  dataDiskSize: "50Gi"
#  timeseriesDiskCount: 4
#  timeseriesDiskSize: "512Gi"
#  walDiskSize: "50Gi"
#  resources:
#    requests:
#      memory: "8Gi"
#      cpu: "2"
#    limits:
#      memory: "8Gi"
#      cpu: "2"
#
# The Loki configs need to be enabled if upgrading from a pre-2.18 deployment.
# The following is an example config. The actual configs should match the existing deployment,
# except that the resource values below may be increased depending on the volume of the logs.
# Loki memory limits depend on chunk data volume. A reasonable guideline is 2-4x the chunk data size,
# with a minimum of 4 GiB. The 8 GiB limit below assumes chunk data of ~2-4 GiB.
# Adjust if your deployment has significantly larger log volume.
#
#loki:
#  diskSize: "30Gi"
#  retentionTime: "168h"
#  resources:
#    limits:
#      cpu: "1"
#      memory: "8Gi"
#    requests:
#      cpu: "500m"
#      memory: "1Gi"
["ITRS Analytics"] ["User Guide", "Technical Reference"]

Was this topic helpful?