Geneos

Data Schema User Guide

Overview

Data schemas give Gateway Hub information about the type of data being published from Gateway.

Each dataview that is published to Gateway Hub requires a data schema definition.

The data schema for a dataview specifies:

  • The name, data type, and any units of measure of each headline.
  • The name, data type, and any units of measure of each column.
  • If the dataview requires pivoting.
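As a rough mental model, the information carried by a data schema can be sketched as a small Python data structure. This is only an illustration: the class and field names (SchemaField, DataviewSchema, problems) and the example column name are hypothetical and do not reflect the Gateway's XML format. The type and unit names are taken from the tables later in this guide, and the rule that units belong only on numeric types is an assumption based on the manual schema steps.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Data types and unit names as listed later in this guide.
DATA_TYPES = {"boolean", "date", "dateTime", "float32", "float64",
              "int32", "int64", "string"}
NUMERIC_TYPES = {"float32", "float64", "int32", "int64"}
UNITS = {"percent", "seconds", "milliseconds", "microseconds", "nanoseconds",
         "days", "per second", "megahertz", "bytes", "kibibytes", "mebibytes",
         "gibibytes", "bytes per second", "megabits", "megabits per second"}


@dataclass
class SchemaField:
    """One headline or column (hypothetical model, not the Gateway format)."""
    name: str
    data_type: str
    unit: Optional[str] = None

    def problems(self) -> List[str]:
        found = []
        if self.data_type not in DATA_TYPES:
            found.append(f"{self.name}: unknown type {self.data_type!r}")
        if self.unit is not None:
            if self.unit not in UNITS:
                found.append(f"{self.name}: unknown unit {self.unit!r}")
            # Assumption: units only make sense on numeric types.
            if self.data_type not in NUMERIC_TYPES:
                found.append(f"{self.name}: unit on non-numeric type")
        return found


@dataclass
class DataviewSchema:
    """Schema for one dataview: headlines, columns, and the pivot flag."""
    dataview: str
    headlines: List[SchemaField] = field(default_factory=list)
    columns: List[SchemaField] = field(default_factory=list)
    pivot: bool = False

    def problems(self) -> List[str]:
        return [p for f in self.headlines + self.columns for p in f.problems()]


# Hypothetical column name, used only to show the shape of a definition.
cpu_schema = DataviewSchema(
    "cpu", columns=[SchemaField("percentUtilisation", "float64", "percent")]
)
```

Calling cpu_schema.problems() returns an empty list when every type and unit is recognised.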

Schemas are defined in the Publishing tab of the sampler in the Gateway Setup Editor (GSE).

The Gateway comes packaged with some schemas. See Sampler schema types.

Metrics collected using the StatsD client Java library provide their own schemas automatically, so long as a valid Dynamic Entities mapping is defined.

When a data schema does not exist, you can create user-defined schemas and add them to the sampler. See Create a data schema.

Caution: Data schemas describe dataviews sent from Gateway to Gateway Hub. This should not be confused with the configuration schema that describes the correct XML formatting of Gateway setup files.

Built-in schemas

The Gateway is packaged with built-in schemas for plug-ins with dataviews containing a set of known column names and data types.

Built-in schemas specify pivoting for dataviews whose rows have only one value column and where the data type of this column varies between rows. This allows Gateway Hub to treat the rows of these dataviews as columns. For an example, see the Hardware plug-in.
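To illustrate what pivoting means, the following sketch turns the rows of a dataview with a single value column into the columns of one record. The function name, row names, and values are invented for illustration, and this is not a description of how Gateway Hub implements pivoting.

```python
from typing import Dict, List, Tuple


def pivot_rows(rows: List[Tuple[str, str]]) -> Dict[str, str]:
    """Treat each (row name, value) pair as a column of a single record."""
    record: Dict[str, str] = {}
    for row_name, value in rows:
        record[row_name] = value
    return record


# Hypothetical dataview: one value column whose data type varies by row.
rows = [("cpuCount", "8"), ("osName", "Linux"), ("memoryMiB", "16384")]
record = pivot_rows(rows)
```

After pivoting, each former row name can be given its own data type in the schema, exactly as a column would be.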

If a built-in schema exists for a sampler, the message "This <sampler name> sampler has predefined schema(s)" is displayed in the Publishing tab of the sampler in the GSE. For a list of samplers with built-in schemas, see Sampler schema types.

The GSE also has a command that can be used to view the schemas currently defined for dataviews on a sampler.

If you make any changes to the dataviews in these samplers, for example by adding headlines and columns using the Compute Engine, you must add these additions to the existing schema.

User-defined schemas

Some plug-ins do not come with any data schema definitions. These are plug-ins where the columns and data types are unknown. You must define the data schema for:

  • All toolkit-like plug-ins. Examples of these include the SQL-TOOLKIT, JMX, and TOOLKIT plug-ins.
  • Any pre-existing dataviews with any user-defined headlines or columns.

Consider the following examples where a data schema definition is required for pre-existing dataviews:

  • Additional columns added to an FKM sampler (see Change FKM dataview columns). If the columns are not defined in the schema, an error is generated in Gateway Hub.
  • Additional columns or headlines added to a CPU sampler via the Compute Engine (see Adding to existing dataviews). If the columns or headlines are not defined in the schema, an error is generated in the GSE.

For how to create a schema, see Create a data schema.

Pivoted dataviews

If you add rows to pre-existing pivoted dataviews, you must define them as if they were additional columns. You do not specify the pivot option if there is a built-in schema for the sampler. If you do specify it and your setting conflicts with the built-in schema, the built-in setting is discarded in favour of the user-defined schema.

View existing schemas

To view existing schemas currently defined for dataviews on a sampler, open Active Console and select Show Current Schema.

In the GSE, the Show Current Schema command is available in the Publishing tab of the sampler.

When the command is run, a window opens showing one table per dataview, describing the schema defined on the sampler. There is a table for every dataview that has a defined schema, irrespective of whether the dataview is in use. Only dataviews that have a schema definition are shown.

Each table combines information from the built-in schema shipped with the Gateway and any additional information added in the GSE.

A description of the columns shown in each table is below:

Column Name Description
Component Describes whether this component is a headline or a column.
Name Name of the headline or column.
Type Data type of the headline or column. The data types are: boolean, date, dateTime, float32, float64, int32, int64, string.
Units Unit of measure assigned to this headline or column (if present). For a list of units of measure, see Data schema parameters in Data Schema User Guide.
Source Origin of the headline or column and the schema. Cells in this column display one of the following:

  • Base — The built-in schema shipped with the Gateway.
  • Overridden — A built-in schema exists for this headline or column, but has been superseded by the user-defined schema in the Publishing tab of the sampler.
  • Enriched by Compute Engine — Headline or column added to the dataview via the Compute Engine. Schema has been defined in the Publishing tab of the sampler.
  • Defined by User in plugin configuration — Headline or column added to the dataview via a method other than the Compute Engine e.g. a Toolkit plug-in. Schema has been defined in the Publishing tab of the sampler.
   

Create a data schema

When connected to a running Gateway, you should use the Propose Schema command to create data schemas.

When creating and editing XML configuration files without being connected to a Gateway, you should create schemas manually.

The Propose Schema command

Before using the Propose Schema command, please note the following:

  • The command attempts to deduce the data types of columns and headlines from each dataview. The deduced data types may be incorrect.
  • You must always supply any units of measure post-generation.
  • Any types or units of measure already defined by you in the Publishing tab of a sampler take precedence over those inferred from dataviews by the command.
  • Tables with only two columns are pivoted in the generated XML schema. A comment is included in the XML highlighting this.
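The precedence and pivoting notes above can be sketched as follows. This is a simplified illustration, not the command's actual logic: the function and parameter names are hypothetical, and it assumes that the row-name column counts as one of the "two columns".

```python
from typing import Dict, List, Tuple


def propose_schema(
    column_names: List[str],
    inferred_types: Dict[str, str],
    user_types: Dict[str, str],
) -> Tuple[Dict[str, str], bool]:
    """Combine inferred and user-defined types; flag two-column tables as pivoted."""
    # Types already defined in the Publishing tab win over inferred ones.
    types = {
        name: user_types.get(name, inferred_types[name]) for name in column_names
    }
    # Assumption: the row-name column counts as one of the two columns.
    pivot = len(column_names) == 2
    return types, pivot


types, pivot = propose_schema(
    ["name", "value"],
    inferred_types={"name": "string", "value": "string"},
    user_types={"value": "int64"},
)
```

Here the user-defined int64 replaces the inferred string type for the value column, and the table is flagged for pivoting because it has only two columns.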

You can run the Propose Schema command from the Active Console or the GSE. The command behaves differently depending on which component you use.

When using the Gateway Setup Editor:

  • The schema is automatically applied to the sampler.
  • The command runs against all the samplers using the same plugin on the Gateway. This process can take some time if you have many samplers using the same plugin.
  • The command times out after 30 seconds.
  • You can cancel waiting for the command result, but this does not stop the command running on the Gateway.

When using the Active Console:

  • The schema appears as XML in a new window. You must then copy and paste this XML to your sampler.
    • You can also use this to create schema definitions as static variables.

Examples

The following are examples of the output of Propose Schema:

  • If you have added two columns to the CPU sampler, this command generates only the schema definition for those additional columns.
  • A toolkit-like sampler does not have a built-in schema, and therefore the command generates a complete schema definition for the sampler.

Propose a schema in the GSE

When you run the Propose Schema command in the Gateway Setup Editor, it queries the Gateway and all its running instances of the sampler in order to build the schema.

Caution: Proposing a schema is computationally intensive and may impact the performance of a Gateway.

To generate a schema definition using the Propose Schema command in the GSE, follow these steps:

  1. Select the desired sampler in the GSE.
  2. Navigate to the Publishing tab of the sampler.
  3. Select Propose Schema.

    Note: If a user-defined schema is already present for the sampler, a dialog opens asking whether you wish to overwrite the schema information.

The generated schema is directly added to the Publishing section. Any existing schema definitions are overwritten.

After using the command, perform the following:

  1. Select Schema > Dataviews > Data.
  2. Review the schema for errors because the data types and pivoting inferred by the Propose Schema command may be incorrect.

    Note: If the value for Pivot is incorrect you must create the schema manually.

  3. Add any units of measure to the headlines and/or columns.

For more information about types, units of measure, and pivoting, see Gateway Hub Configuration.

Propose a schema in Active Console

In Active Console, you can use the Propose Schema command to build schemas for specific dataviews. This does not affect any connected Gateways.

To generate a schema definition using the Propose Schema command in Active Console, follow these steps:

  1. Make sure Gateway Hub is enabled in the GSE.
  2. Right-click a sampler or dataview.
  3. Navigate to Sampler Schema.
  4. Select Propose Schema. The generated XML schema definition for the sampler appears in a new window.
  5. Right-click the window with the generated XML in the Active Console and select Copy All.
  6. Navigate to the GSE, right-click the correct sampler and select Paste Schema.
    Warning: No checks are performed when using Paste Schema on a sampler. Any existing schema is overwritten, and any copied schema can be pasted on to any sampler.

After using Paste Schema, perform the following:

  1. Navigate to the Publishing tab of the sampler to view the schema definition.
  2. Select Schema > Dataviews > Data.
  3. Review the schema for errors because the data types inferred by the Propose Schema command may be incorrect.
  4. Add any units of measure to the headlines and/or columns.
  5. (Optional) Specify pivoting.

For more information about types, units of measure, and pivoting, see Gateway Hub Configuration.

Note: Paste Schema is only available when valid XML has been copied to the clipboard.

How to use Paste Schema to create static variables

The generated XML output of the Propose Schema command in the Active Console can also be used to create sampler schemas as static variables. Schemas saved as static variables can be selected in the Publishing tab of a sampler.

To use the output of the Propose Schema command to produce schemas as static variables, follow these steps:

  1. Right-click on the window with the generated XML in the Active Console and select Copy All.
  2. Navigate to the GSE.
  3. Right-click Static variables > Sampler-schemas and select Paste Schema.
  4. A static variable containing a schema definition for each dataview in the copied sampler is created. The name used for the static variable is the name of the dataview.

After using Paste Schema, perform the following for each static variable:

  1. Review the schema for errors because the data types inferred by the Propose Schema command may be incorrect.
  2. Add any units of measure to the headlines and/or columns.
  3. (Optional) Specify pivoting.

For more information about types, units of measure, and pivoting, see Gateway Hub Configuration.

Note: Paste Schema is only available when valid XML has been copied to the clipboard.

How to define a schema manually

To define a schema manually, perform the following:

  1. Open your Gateway Setup Editor.
  2. Navigate to the Publishing tab of the sampler you want to make a schema for.
  3. In Schema > Dataviews, click Add new.
    You must provide an entry for each dataview in the sampler that requires a schema definition.
  4. In the Dataview field, enter the name of the dataview.
  5. In the Schema field, choose data.
  6. Click Data.
  7. Add as many new Headlines and Columns entries as you require.
  8. In the Name field, add the name of the headline or column in the dataview.
  9. Under options, choose the type of data represented by the headline or column.
    If you choose Int32, Int64, Float32 or Float64, select the appropriate Unit of measure.
  10. (Optional) If the dataview is pivoted, tick the box by Pivot.
  11. Close the tab.
  12. Click Validate current document to review any errors.
  13. Click Save current document.

For more information about types, units of measure, and pivoting, see Gateway Hub Configuration.

Schema inference

Beginning with version 2.4.x, Gateway Hub can use the schema inference feature to infer data schemas for data published by a Gateway when a built-in or user-defined data schema does not exist. This is useful when you have a large number of toolkits and creating user-defined schemas for each would take a long time.

Inference produces only a best guess of the appropriate data schema and the ultimate quality of an inferred data schema is dependent on the quality and consistency of the published data.

Note the following limitations:

  • Inference only recognises dates in the yyyy-mm-dd format.

  • Inference does not capture units, and interprets values with a unit as a raw number. For example, if a cell has a value of 987 MB, then schema inference interprets this as 987.

  • Headline data schemas are inferred separately from table data schemas.
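The first two limitations can be illustrated with a short sketch. It is not the Gateway Hub implementation: the regular expressions and function names are hypothetical, and it assumes that a value with a unit is a leading number followed by whitespace and the unit text.

```python
import re
from datetime import datetime

# Assumption: a value with a unit looks like "<number> <unit>", for example "987 MB".
_LEADING_NUMBER = re.compile(r"^\s*([-+]?\d+(?:\.\d+)?)(?:\s+\S.*)?$")
_ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def infer_type(value: str) -> str:
    """Classify a cell as 'date', 'float64', or 'string'."""
    text = value.strip()
    if _ISO_DATE.match(text):
        try:
            datetime.strptime(text, "%Y-%m-%d")  # only yyyy-mm-dd is recognised
            return "date"
        except ValueError:
            pass  # looks like a date but is not a valid calendar date
    if _LEADING_NUMBER.match(text):
        return "float64"
    return "string"


def raw_number(value: str) -> float:
    """Return the number with any trailing unit discarded."""
    match = _LEADING_NUMBER.match(value)
    if match is None:
        raise ValueError(f"not numeric: {value!r}")
    return float(match.group(1))
```

With this sketch, infer_type("05/01/2024") yields string because only the yyyy-mm-dd form is treated as a date, and raw_number("987 MB") yields 987.0 with the unit discarded.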

In most cases, it is highly recommended that you provide your own data schemas, since this is the best way to ensure data schema accuracy and prevent loss of data due to misconfiguration.

Inference modes

You can configure schema inference to run in one of three modes: Naive, Basic, or Smart. By default, schema inference runs in Smart mode, and this is the recommended setting.

Caution: In all cases the data schema inferred is only as good as the data that has been observed. If the data structure changes after inference, then the user must update the data schema manually or risk dropping further datapoints.

Naive

Naive inference uses a single datapoint to infer a very basic data schema where all fields are type string.

Advantages:

  • Simplicity means that the generated data schema ensures that data is always ingested, provided that the structure of the data does not change after inference.

  • Using one datapoint ensures that you do not lose data during inference.

Disadvantages:

  • Some metric functionality may be unavailable, since all fields are typed as string.

  • If the data structure changes after inference, then ingestion is impacted.

Basic

Basic inference uses a single datapoint to infer a more detailed data schema than Naive inference. Where field data is parsed as numeric, it is assigned the type float64, making it accessible to all metric query functionality. Non-numeric fields are assigned the type string.

Advantages:

  • Improved numerical data handling compared to Naive inference.

  • Using one datapoint ensures that you do not lose data during inference.

Disadvantages:

  • Increased likelihood of errors resulting from using a single datapoint, especially if new fields are added after inference.

Smart

Smart inference uses a multi-datapoint inference model. This is the default and recommended setting.

You must configure the minimum number of datapoints to use over a defined inference period. Once the inference period is over (measured using sample time, not clock time), if the inference engine has at least the minimum number of datapoints, it performs a smart evaluation of property types. All supported types are inferred, and any numerics are always float64.

When setting the minimum number of datapoints, consider that you lose the datapoints used for inference. Additionally, any datapoints currently in use for inference are lost if the normaliser is shut down. This restarts the inference period and requires that datapoints are collected again. The higher the number of datapoints used for inference, the higher the accuracy, but this also increases the number of datapoints that can be lost.

Variations in inference duration are small, as no inference is performed until the full period is complete.

Advantages:

  • Significantly improved inference compared to Naive and Basic methods. Includes increased user control.

  • Can handle new fields added during the inference period. Where any field cannot be inferred confidently, the engine reverts to a Naive inference and assigns the string type.

Disadvantages:

  • Additional configuration required.

  • Datapoints used to infer the schema are lost.

As an example, consider a sequence of four datapoints received over an inference period of, say, 10 minutes. Each inference mode treats the same data differently.

Datapoint | Naive | Basic | Smart (using 3 samples for inference)
123.45 days (string) | 123.45 days (string) | 123.45 (float64) | Used as training data, not stored.
250.56 days (string) | 250.56 days (string) | 250.56 (float64) | Used as training data, not stored.
unavailable (string) | unavailable (string) | ingestion error | Used as training data, not stored.
4.36 days (string) | 4.36 days (string) | 4.36 (float64) | 4.36 days (string)

Missing data can change the inferences made.

Datapoint | Naive | Basic | Smart (using 3 samples for inference)
no data | omitted | omitted | Used as training data, not stored.
no data | omitted | omitted | Used as training data, not stored.
321.54 days (string) | ingestion error | ingestion error | Used as training data, not stored.
4.36 days (string) | ingestion error | ingestion error | 4.36 (float64)
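The two tables above can be reproduced with a small simulation of a single field. This is a simplified sketch under stated assumptions, not the Gateway Hub inference engine: the function names are hypothetical, the inference period and threshold settings are ignored, and Smart mode is assumed to pick float64 only when every non-missing training value is numeric.

```python
import re
from typing import List, Optional

_NUMBER = re.compile(r"^\s*([-+]?\d+(?:\.\d+)?)(?:\s+\S.*)?$")


def _as_number(value: str) -> Optional[str]:
    """Return the leading number of a value such as '4.36 days', else None."""
    match = _NUMBER.match(value)
    return match.group(1) if match else None


def _store(value: Optional[str], field_type: Optional[str]) -> str:
    """Describe what happens to one value under an already inferred type."""
    if value is None:
        return "omitted"
    if field_type is None:  # the field was absent when the schema was inferred
        return "ingestion error"
    if field_type == "string":
        return f"{value} (string)"
    number = _as_number(value)
    return f"{number} (float64)" if number is not None else "ingestion error"


def simulate(values: List[Optional[str]], mode: str, min_samples: int = 3) -> List[str]:
    """Simulate one field across a sequence of datapoints (None means no data)."""
    if mode in ("Naive", "Basic"):
        first = values[0]  # single-datapoint inference
        if first is None:
            field_type = None
        elif mode == "Basic" and _as_number(first) is not None:
            field_type = "float64"
        else:
            field_type = "string"
        return [_store(v, field_type) for v in values]
    # Smart: the first min_samples datapoints are consumed as training data.
    training = [v for v in values[:min_samples] if v is not None]
    numeric = bool(training) and all(_as_number(v) is not None for v in training)
    field_type = "float64" if numeric else "string"
    results = ["Used as training data, not stored."] * min(min_samples, len(values))
    results += [_store(v, field_type) for v in values[min_samples:]]
    return results


first_table = ["123.45 days", "250.56 days", "unavailable", "4.36 days"]
second_table = [None, None, "321.54 days", "4.36 days"]
```

Calling simulate(first_table, "Basic") shows the ingestion error on the unavailable value, while simulate(second_table, "Smart") shows the fourth datapoint stored as float64 because the only non-missing training value was numeric.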

Gateway Hub configuration

You can configure data schema inference in Gateway Hub during installation or using hubctl with your installation descriptor.

For the most up to date information about configuration options, see Install Gateway Hub and hubctl tool.

The following configuration options are available:

Option Description
hub_normaliserd_inference_enabled Enable or disable data schema inference. Choose from true or false.
hub_normaliserd_inference_mode Set the inference mode. Choose from Naive, Basic or Smart.
hub_normaliserd_inference_smart_min_samples Minimum number of samples required before Smart inference can be used. This setting only applies if hub_normaliserd_inference_mode is set to Smart. Smart inference occurs after a duration set by inferenceWaitDurationSeconds. If at that time Gateway Hub has received at least minSamplesForInference samples, then Smart inference is performed. Otherwise, Naive inference is used.
hub_normaliserd_inference_smart_duration_seconds Duration in seconds to wait before performing Smart inference. This setting only applies if hub_normaliserd_inference_mode is set to Smart.
hub_normaliserd_inference_smart_threshold Percentage of samples received inside the inference duration that must be of a specific type, for a field to be matched to that type. For example, if Gateway Hub has received 10 samples by the end of the inference duration, and the threshold is 0.5, then 5 samples must be type numeric and the remainder null (or effectively null) in order for the associated field to be considered type numeric.
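The threshold example can be expressed as a short check. This is a sketch of one reading of the rule, not the Gateway Hub algorithm: the function name is hypothetical, and it assumes that "5 samples must be type numeric" means at least that many, with every remaining sample null.

```python
from typing import List, Union

Sample = Union[float, str, None]  # None stands for a null sample


def matches_numeric(samples: List[Sample], threshold: float) -> bool:
    """Return True if the field would be considered numeric under the assumed rule."""
    if not samples:
        return False
    numeric = sum(isinstance(s, float) for s in samples)
    # Assumption: every non-numeric sample must be null.
    others_null = all(s is None for s in samples if not isinstance(s, float))
    return others_null and numeric >= threshold * len(samples)


# 10 samples, threshold 0.5: five numeric samples and five nulls qualify.
qualifies = matches_numeric([1.0] * 5 + [None] * 5, 0.5)
```

Under the same assumptions, four numeric samples out of ten fall short of the threshold, and a single non-null text value prevents the field from being matched as numeric.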

   

Gateway configuration

Gateway version 5.7.x or later is required in order to publish dataviews without a schema. If dataviews without a schema are published to a Gateway Hub version that does not include schema inference, then a large number of ingestion errors will be reported.

Gateway will try to publish with a data schema if possible and will not publish data if it has a data schema with errors.

As a result, the following scenarios are possible:

  • If a dataview has a data schema and Gateway detects no errors, then it will be published with the provided data schema.

  • If the publish setting of a dataview is set to false, then it will not be published.

  • If a dataview has an empty data schema or a data schema that contains errors, then it will not be published.

  • If a dataview has no data schema, then the dataview will be published without a data schema.
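The scenarios above amount to a small decision table, sketched below. The enum and function names are hypothetical, and checking the publish setting before the schema state is an assumption about the order of evaluation.

```python
from enum import Enum


class SchemaState(Enum):
    VALID = "valid"
    EMPTY = "empty"
    HAS_ERRORS = "has errors"
    MISSING = "missing"


def publish_outcome(publish_enabled: bool, schema: SchemaState) -> str:
    """Return what happens to a dataview for the scenarios listed above."""
    if not publish_enabled:
        return "not published"
    if schema is SchemaState.VALID:
        return "published with schema"
    if schema is SchemaState.MISSING:
        # Requires a Gateway Hub version that supports schema inference.
        return "published without schema"
    # Empty schemas and schemas with errors block publishing.
    return "not published"
```

Note that a dataview with no schema at all is published, whereas one with an empty or invalid schema is not.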

Data schema parameters

Units of measure used in schemas

Name Symbol
percent %
seconds s
milliseconds ms
microseconds μs
nanoseconds ns
days d
per second s-1
megahertz MHz
bytes B
kibibytes KiB
mebibytes MiB
gibibytes GiB
bytes per second B/s
megabits Mbit
megabits per second Mbit/s

Sampler schema types

Below is a list of samplers, showing whether each has an entirely built-in schema, a partially built-in schema, or an entirely user-defined schema.

Plugin Type Comments
Gateway-breachPredictor Built-in  
Gateway-clientConnectionData Built-in  
Gateway-databaseLogging Built-in  
Gateway-exportedData Built-in  
Gateway-gatewayHubData Built-in  
Gateway-gatewayLoad Built-in  
Gateway-importedData Built-in  
Gateway-includesData Built-in  
Gateway-licenceUsage Built-in  
Gateway-managedEntitiesData Partial  
Gateway-probeData Built-in  
Gateway-scheduledCommandData Built-in  
Gateway-scheduledCommandsHistoryData Built-in  
Gateway-severityCount Built-in  
Gateway-severityData Built-in  
Gateway-snoozeData Built-in  
Gateway-sql User-defined  
Gateway-userAssignmentData Built-in  
api User-defined  
api-streams Built-in  
bloomberg-bpipe Built-in  
citrix-apps Built-in  
citrix-processes Built-in  
citrix-sessions Built-in  
citrix-summary Built-in  
clearvision-status Built-in  
combo User-defined  
component-versions Built-in  
control-m Built-in  
cpu Built-in  
desktop-pc-monitoring Built-in  
deviceio Built-in  
disk Built-in  
e4jms-bridges Built-in  
e4jms-connections Built-in  
e4jms-durables Built-in  
e4jms-non-durables Built-in  
e4jms-queues Built-in  
e4jms-routes Built-in  
e4jms-server Built-in  
e4jms-topics Built-in  
e4jms-usersummary Built-in  
euem Built-in  
extractor User-defined  
fidessa Built-in  
fidessa-dq User-defined  
fix Built-in  
fix-analyser2 Partial Admin dataview schema provided; user must define schema for all other dataviews.

fkm Partial  
flm Partial User must define schema for additional data displayed based on configuration.

ftm Built-in  
gl-greffon Built-in  
gl-lostorders User-defined  
gl-orderbook User-defined  
gl-permissions Built-in  
gl-router Built-in  
gl-slc Partial User must define schema for additional data displayed based on configuration or SLC log file.

gl-slc-relay Built-in  
gl-sle Built-in  
gl-sle-tcp Built-in  
hardware Built-in  
ibmi-job Built-in  
ibmi-message Built-in  
ibmi-pool Built-in  
ibmi-queue Built-in  
ibmi-subsystem Built-in  
ibmi-system Built-in  
informix Built-in  
ipc Built-in  
ix-ma User-defined  
jmx-server User-defined  
jmx-threadinfo Built-in  
market-data-monitor User-defined  
message-tracker Built-in  
mibmon User-defined  
miss-x Built-in  
mq-channel Built-in  
mq-qinfo Built-in  
mq-queue Built-in  
net-ping Built-in  
network Built-in  
nyxt-papastats Built-in  
oracle Built-in  
orc Built-in  
pats-status Built-in  
pats-trading-breaches Built-in  
pats-users Built-in  
perfmon User-defined  
processes Built-in  
rest-extractor User-defined  
rmc-interface User-defined  
sets-slc Built-in  
sql-toolkit User-defined  
stateTracker User-defined User must define schema for user-defined custom column names.

su Built-in  
sybase Built-in  
sybase-server Built-in  
tcp-links Built-in  
tib-rv Built-in  
tib-rvpublisher Built-in  
tib-rvstream Built-in  
toolkit User-defined  
top Built-in  
trading-technologies Built-in  
trapmon Partial User must define schema for user-defined columns in custom view.

unix-users Built-in  
veritas-cluster-server Built-in  
web-mon Built-in  
win-cluster Built-in  
win-services Built-in  
winapps Built-in  
wmi User-defined  
wts-licenses Built-in  
wts-processes Built-in  
wts-sessions Built-in  
wts-summary Built-in  
x-broadcast Built-in  
x-mcast Built-in  
x-multicast Built-in  
x-ping Built-in  
x-route Built-in  
x-services Built-in  
x-top Built-in  
x-traffic Built-in