Entity monitoring and eviction

Entity monitoring and eviction are important considerations for the Obcerv platform. This is because large numbers of entities can lead to excessive CPU and memory usage, which may cause instability or unresponsiveness of the UI.

Entities are at the core of the Obcerv platform. Many features rely on being able to access entities, get notified of entity changes, or iterate over all entities in a timely manner. This requires an amount of CPU and memory that is proportional to the number of entities in the system. As such, the number of entities is an important parameter to keep in mind when sizing an Obcerv cluster.

Monitor the number of entities in the system

Once the system is in steady state (past its initialisation phase and after it is connected to its data sources), entities should in principle be long-lived and the number of entities in the system should be stable. You can verify this by either:

Total entity count

*** Entity sizing statistics ***

┌──────────────────────────────┬──────────────────────────────────────────────────────────────────────────────────┐
│      Total entity count      │                                     102,922                                      │
├──────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────┤
│      Total metric count      │                                     627,626                                      │
├──────────────────────────────┼───────┬──────────┬───────────┬─────────┬─────────┬─────────┬─────────┬───────────┤
│                              │  Min  │   Max    │  Average  │   50%   │   75%   │   90%   │   95%   │    99%    │
├──────────────────────────────┼───────┼──────────┼───────────┼─────────┼─────────┼─────────┼─────────┼───────────┤
│  Serialized entity size (B)  │  273  │  72,082  │   1,235   │  1,004  │  1,007  │  2,229  │  2,657  │   3,443   │
├──────────────────────────────┼───────┼──────────┼───────────┼─────────┼─────────┼─────────┼─────────┼───────────┤
│        Number of dimensions  │   1   │    8     │     6     │    6    │    6    │    6    │    6    │     6     │
├──────────────────────────────┼───────┼──────────┼───────────┼─────────┼─────────┼─────────┼─────────┼───────────┤
│          Dimension key size  │   2   │    25    │     7     │    7    │    8    │   13    │   13    │    13     │
├──────────────────────────────┼───────┼──────────┼───────────┼─────────┼─────────┼─────────┼─────────┼───────────┤
│        Dimension value size  │   1   │   167    │    18     │   17    │   21    │   37    │   39    │    51     │
├──────────────────────────────┼───────┼──────────┼───────────┼─────────┼─────────┼─────────┼─────────┼───────────┤
│        Number of attributes  │   4   │    39    │     9     │    6    │    6    │   24    │   24    │    35     │
├──────────────────────────────┼───────┼──────────┼───────────┼─────────┼─────────┼─────────┼─────────┼───────────┤
│           Number of metrics  │   0   │   648    │     6     │    6    │    6    │    6    │   12    │    16     │
├──────────────────────────────┼───────┼──────────┼───────────┼─────────┼─────────┼─────────┼─────────┼───────────┤
│    Number of status metrics  │   0   │    23    │     4     │    5    │    5    │    5    │    5    │    11     │
├──────────────────────────────┼───────┼──────────┼───────────┼─────────┼─────────┼─────────┼─────────┼───────────┤
│              Number of logs  │   0   │    1     │     0     │    0    │    0    │    0    │    0    │     1     │
├──────────────────────────────┼───────┼──────────┼───────────┼─────────┼─────────┼─────────┼─────────┼───────────┤
│          Number of children  │   0   │  75,601  │     1     │    0    │    0    │    0    │    0    │     4     │
└──────────────────────────────┴───────┴──────────┴───────────┴─────────┴─────────┴─────────┴─────────┴───────────┘

Entities with the most children (top 20)
	- (75,601 children) {dataview=TCP Links, managedEntity=colprdgengw2 Hardware, probe=colprdgengw2-7047, sampler=TCP Links, type=Hardware Components}
	- (2,000 children) {dataview=text_table, managedEntity=ME_Kenneth, probe=Kenneth Probe, sampler=SQL Toolkit}
	- (1,668 children) {dataview=event_table, managedEntity=MarkME, probe=MarkProbe, sampler=SQL Sampler}
	- (489 children) {dataview=InvalidMetrics, managedEntity=health-and-fkm, probe=aws-probe, sampler=dme-health}
	- (344 children) {dataview=Entities, managedEntity=Demo Gateway Info, probe=Demo Gateway Info, sampler=Entities}
	- (332 children) {namespace=nightly, node=nightly, pod=kafka-0}
	- (302 children) {dataview=Entities, managedEntity=health-and-fkm, probe=aws-probe, sampler=dme-health}
	- (296 children) {dataview=Services, managedEntity=Windows-Flare-1, probe=MLG-FLARE1, sampler=Services, type=Windows}
	- (291 children) {dataview=Services, managedEntity=Windows10_donet, probe=MLGPC008, sampler=Services, type=Windows}
	- (284 children) {dataview=Network, managedEntity=CENTOS7-MLG7, probe=CENTOS7-MLG7, sampler=Network, type=ALL}
	- (281 children) {dataview=Network, managedEntity=CENTOS7-MLG8, probe=CENTOS7-MLG8, sampler=Network, type=ALL}
	- (273 children) {dataview=Services, managedEntity=Windows-Flare-2, probe=MLG-FLARE2, sampler=Services, type=Windows}
	- (272 children) {dataview=Services, managedEntity=Windows-Flare-3, probe=MLG-FLARE3, sampler=Services, type=Windows}
	- (179 children) {dataview=Network, managedEntity=CENTOS7-MLG5, probe=CENTOS7-MLG5, sampler=Network, type=ALL}
	- (157 children) {namespace=nightly, node=nightly, pod=platformd-7755d64df5-7fglf}
	- (157 children) {dataview=Network, managedEntity=CENTOS7-MLG3, probe=CENTOS7-MLG3, sampler=Network, type=ALL}
	- (141 children) {dataview=Network, managedEntity=hub-build-3, probe=hub-build-3, sampler=Network, type=ALL}
	- (117 children) {namespace=nightly, node=nightly, pod=platformd-78bc784b88-rd5nl}
	- (101 children) {namespace=nightly, node=nightly, pod=platformd-78bc784b88-lh22c}
	- (85 children) {namespace=nightly, node=nightly, pod=platformd-7755d64df5-5hv87}
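The "most children" lines in the statistics above are regular enough to parse. The following is a minimal sketch, not an Obcerv tool: it assumes the log format is exactly as shown in the sample, and the function names `parse_children_line` and `share_of_total` are hypothetical. It extracts the child count and dimensions from one line and expresses the count as a share of the total entity count.

```python
import re

# Matches lines such as "- (75,601 children) {dataview=TCP Links, ...}".
# Assumption: the format is exactly as in the sample log above.
LINE_RE = re.compile(r"-\s*\(([\d,]+) children\)\s*\{(.*)\}")

def parse_children_line(line):
    """Return (child_count, dimensions) for one 'most children' line, or None."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    count = int(match.group(1).replace(",", ""))
    dimensions = {}
    # Assumption: dimension values never contain ", " themselves.
    for pair in match.group(2).split(", "):
        key, _, value = pair.partition("=")
        dimensions[key.strip()] = value.strip()
    return count, dimensions

def share_of_total(child_count, total_entities):
    """Percentage of all entities that are children of one parent."""
    return 100.0 * child_count / total_entities

sample = ("\t- (75,601 children) {dataview=TCP Links, "
          "managedEntity=colprdgengw2 Hardware, probe=colprdgengw2-7047, "
          "sampler=TCP Links, type=Hardware Components}")
count, dims = parse_children_line(sample)
# 102,922 is the total entity count reported in the same sample.
print(count, dims["dataview"], round(share_of_total(count, 102_922), 1))
# prints: 75601 TCP Links 73.5
```

A parent that accounts for a large share of all entities, as here, is a good candidate for closer inspection.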

Entity eviction

Even for a correctly sized system in steady state, the total number of entities is likely to grow over time. Among other reasons, it may happen due to:

Over a long period of time, even modest growth in the number of entities can lead to excessive resource consumption. Entity eviction automatically purges entities that have not been updated for a long period of time. By default, an entity is evicted only if it has not received any update (metric, log, event, and so on) in the past 60 days. Note that eviction considers only the entity’s last update timestamp, not its creation date.
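The default rule can be illustrated with a short sketch. This is not Obcerv's implementation; the function name `is_evictable` is hypothetical, and treating exactly 60 days of inactivity as expired is an assumption. It shows that the decision depends only on the last update time.

```python
from datetime import datetime, timedelta, timezone

# Default from the text above: 60 days without any update.
DEFAULT_EXPIRATION = timedelta(days=60)

def is_evictable(last_update, now, expiration=DEFAULT_EXPIRATION):
    """True when the entity has gone at least `expiration` without an update.

    Only the last update time matters; the creation date is deliberately
    not a parameter. Treating exactly 60 days as expired is an assumption.
    """
    return now - last_update >= expiration

now = datetime(2022, 7, 21, 13, 0, tzinfo=timezone.utc)
old_but_active = now - timedelta(days=10)  # created long ago, updated 10 days ago
stale = now - timedelta(days=61)           # no update for 61 days

print(is_evictable(old_but_active, now))  # False
print(is_evictable(stale, now))           # True
```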

Once an entity is evicted from the system, its metrics are no longer available. Should an entity get evicted and, some time later, be re-created (maybe because a monitored system was down for an extended period of time and came back up), then that entity and its metrics will be available again, but metrics collected for that entity prior to its eviction will remain unavailable.

Ephemeral entities

The default eviction scheme – which is 60 days for entities that have not been updated – won’t be sufficient to cope with ephemeral entities. These are typically produced by Geneos when certain plugins are used, such as TCP Links, TOP, Trapmon, and X-traffic. Such plugins produce highly volatile dataview rows (the TOP plugin, for example, creates one row for each monitored process). Since Geneos publishes dataview rows as Obcerv entities, this can quickly result in hundreds of thousands of entities in Obcerv.

In this scenario, the first remedy is to stop publishing dataviews with volatile rows. To delete the entities that such dataviews have already published, mark them as ephemeral through the Classification page in the Overview app.

Looking at the sample logs above, you’ll see that a large number of entities (75,601 of the 102,922 entities in the system, or about 73%) come from the TCP Links plugin. Once Geneos is configured to stop publishing data from that plugin, you need to create a new classification rule that decorates all entities coming from the TCP Links dataview with a new attribute called ephemeral. To learn how, see Create a new classification policy.

Ephemeral classification policy

When such a classification is in place, entities produced by the TCP Links dataview are considered ephemeral and get evicted after 2 hours of inactivity. If Geneos no longer publishes such entities, they are effectively inactive. As a result, the 75,601 TCP Links entities are deleted once 2 hours have passed since publishing stopped.
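The exact moment of eviction also depends on when evictions are evaluated. As a rough sketch, assuming evictions are evaluated only at the top of each hour (which is what a cron expression of `0 * * * *` means) and that an entity becomes eligible exactly 2 hours after its last update, the earliest run that can evict an entity can be estimated as follows. The function names are hypothetical.

```python
from datetime import datetime, timedelta

def first_hourly_run_at_or_after(moment):
    """First top-of-the-hour instant that is not before `moment`."""
    top = moment.replace(minute=0, second=0, microsecond=0)
    return top if top == moment else top + timedelta(hours=1)

def earliest_eviction_run(last_update, expiration=timedelta(hours=2)):
    """Earliest hourly run at which the entity is already expired.

    Assumption: eligibility starts exactly `expiration` after the last update.
    """
    return first_hourly_run_at_or_after(last_update + expiration)

# Last update at 10:17, eligible from 12:17, so the 13:00 run is the first candidate.
print(earliest_eviction_run(datetime(2022, 7, 21, 10, 17)))
# prints: 2022-07-21 13:00:00
```

Under these assumptions an ephemeral entity disappears between 2 and 3 hours after its last update.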

Review eviction logs

Entities getting evicted are logged by platformd:

...
2022-07-21 13:00:09.658 [grpc-server-7162-0] INFO  com.itrsgroup.platform.service.entity.InternalEntityManagementGrpcService(Component-3) - Evicting entity {pod=entity-containment-rules--1-kh2bg, kind=Pod, namespace=itrs}
2022-07-21 13:00:09.658 [grpc-server-7162-0] INFO  com.itrsgroup.platform.service.entity.InternalEntityManagementGrpcService(Component-3) - Evicting entity {pod=entity-containment-rules--1-ff2sv, kind=Pod, node=docker-desktop, container=entity-containment-rules, namespace=itrs}
2022-07-21 13:00:09.658 [grpc-server-7162-0] INFO  com.itrsgroup.platform.service.entity.InternalEntityManagementGrpcService(Component-3) - Evicting entity {pod=entity-containment-rules--1-glmg6, kind=Pod, node=docker-desktop, namespace=itrs}
2022-07-21 13:00:09.658 [grpc-server-7162-0] INFO  com.itrsgroup.platform.service.entity.InternalEntityManagementGrpcService(Component-3) - Evicting entity {pod=entity-containment-rules--1-r2ls7, kind=Pod, node=docker-desktop, container=entity-containment-rules, namespace=itrs}
2022-07-21 13:00:09.658 [grpc-server-7162-0] INFO  com.itrsgroup.platform.service.entity.InternalEntityManagementGrpcService(Component-3) - Evicting entity {pod=entity-containment-rules--1-j5jl4, kind=Pod, namespace=itrs}
2022-07-21 13:00:09.658 [grpc-server-7162-0] INFO  com.itrsgroup.platform.service.entity.InternalEntityManagementGrpcService(Component-3) - Evicting entity {pod=entity-containment-rules--1-z6jt6, kind=Pod, namespace=itrs}
2022-07-21 13:00:09.658 [grpc-server-7162-0] INFO  com.itrsgroup.platform.service.entity.InternalEntityManagementGrpcService(Component-3) - Evicting entity {pod=entity-containment-rules--1-fss7p, kind=Pod, namespace=itrs}
...
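To summarise a large number of eviction log lines, they can be parsed and grouped by a dimension. The sketch below assumes the log line format shown above stays stable; the helper names `parse_eviction` and `count_by_dimension` are hypothetical and not part of Obcerv.

```python
import re
from collections import Counter

# Captures the timestamp and the dimension list of an "Evicting entity" line.
# Assumption: the format is exactly as in the sample log above.
EVICT_RE = re.compile(r"^(\S+ \S+) .* - Evicting entity \{(.*)\}\s*$")

def parse_eviction(line):
    """Return (timestamp, dimensions) for an eviction line, or None otherwise."""
    match = EVICT_RE.match(line)
    if match is None:
        return None
    timestamp, body = match.groups()
    # Assumption: every pair is "key=value" and values never contain ", ".
    dimensions = dict(pair.split("=", 1) for pair in body.split(", "))
    return timestamp, dimensions

def count_by_dimension(lines, key):
    """Count evicted entities per value of the given dimension key."""
    counts = Counter()
    for line in lines:
        parsed = parse_eviction(line)
        if parsed is not None:
            counts[parsed[1].get(key, "<none>")] += 1
    return counts

PREFIX = ("2022-07-21 13:00:09.658 [grpc-server-7162-0] INFO  "
          "com.itrsgroup.platform.service.entity."
          "InternalEntityManagementGrpcService(Component-3) - ")
sample = [
    "...",
    PREFIX + "Evicting entity {pod=entity-containment-rules--1-kh2bg, kind=Pod, namespace=itrs}",
    PREFIX + "Evicting entity {pod=entity-containment-rules--1-ff2sv, kind=Pod, "
             "node=docker-desktop, container=entity-containment-rules, namespace=itrs}",
]
print(count_by_dimension(sample, "namespace"))  # Counter({'itrs': 2})
print(count_by_dimension(sample, "node"))
```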

Configure eviction rules

You can configure eviction rules using the Obcerv operator. The default configuration of an Obcerv instance is as follows:

# Entity evictions cron job
entityEvictions:
  image: "obcerv/entity-evictions"

  resources:
    requests:
      memory: "32Mi"
      cpu: "10m"
    limits:
      memory: "64Mi"
      cpu: "50m"

  cron: "0 * * * *"

  # Eviction rules.  There must be at least one rule defined.  Rules have an optional attribute
  # definition and a required expiration, expressed as an ISO-8601 duration. An attribute is used to
  # restrict a rule to entities where the attribute is present.  The default rules apply an expiration
  # of 2 hours to entities having an "ephemeral" attribute defined, and an expiration of 60 days to
  # all other entities.
  evictionRules:
  - attribute:
      namespace: "itrsgroup.com/classification"
      name: "ephemeral"
    expiration: "PT2H"
  - expiration: "P60D"

The default configuration applies an expiration of 2 hours to entities marked as ephemeral and 60 days for all other entities.
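The way these rules select an expiration can be sketched in a few lines. This is an illustration only, not the eviction job's actual code. It assumes that rules are evaluated in order and that the first matching rule wins, and it uses a deliberately small ISO-8601 duration parser that handles only days, hours, minutes, and seconds. The names `parse_duration` and `expiration_for` are hypothetical.

```python
import re
from datetime import timedelta

# Minimal ISO-8601 duration parser: supports forms such as "P60D" and "PT2H".
# Years, months, weeks, and fractional values are intentionally not supported.
DURATION_RE = re.compile(r"^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$")

def parse_duration(text):
    match = DURATION_RE.match(text)
    if match is None or not any(match.groups()):
        raise ValueError(f"unsupported duration: {text!r}")
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)

# The default rules from the configuration above, as plain Python data.
DEFAULT_RULES = [
    {"attribute": ("itrsgroup.com/classification", "ephemeral"), "expiration": "PT2H"},
    {"expiration": "P60D"},
]

def expiration_for(entity_attributes, rules=DEFAULT_RULES):
    """Pick the expiration for an entity given its (namespace, name) attributes.

    Assumption: rules are checked in order and the first match wins; a rule
    without an attribute matches every entity.
    """
    for rule in rules:
        attribute = rule.get("attribute")
        if attribute is None or attribute in entity_attributes:
            return parse_duration(rule["expiration"])
    raise ValueError("no eviction rule matched")

ephemeral = {("itrsgroup.com/classification", "ephemeral")}
print(expiration_for(ephemeral))  # 2:00:00
print(expiration_for(set()))      # 60 days, 0:00:00
```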
