The Message Tracker plug-in allows users to measure the latency of message propagation (e.g. orders and acknowledgements) through their systems and to monitor messages that fail to arrive at specified destinations. The monitored data can be viewed in real time or stored in a database for offline report generation.
In order to monitor latency and missed messages, points on the message path at which to monitor the messages must first be defined. These points are referred to as Checkpoints. Each checkpoint is monitored by a Message Tracker plug-in. Multiple checkpoints can be monitored by the same Message Tracker plug-in.
A Message Tracker plug-in is composed of two parts:
- Plug-In - this is the generic part of the product that deals with abstract messages. It routes information to other Message Tracker plug-ins and stores data in the database.
- Adapter - this decodes and parses the messages into a normalised form for processing by the plug-in.
Adapters can be internal or external. Internal Adapters are packaged as part of the Netprobe executable and do not run as a separate process. External Adapters are independent of the Netprobe and run as separate processes.
There is an API for third parties to write their own external adapters. In addition, ITRS provides a custom adapter development service.
The plug-in accepts data from a set of Adapters, which it then uses to perform real time analysis of the messages and/or log the message data to a database.
If the plug-in is configured to perform real time analysis, it measures the time taken for each message to traverse the link from a previous checkpoint. To perform real time monitoring, the plug-in is configured with part of the message path: it knows which checkpoints are immediately before the checkpoint that it is monitoring (referred to as Parent Checkpoints) and which checkpoints are immediately after it (referred to as Child Checkpoints).
Once a message has been detected by the plug-in and by one of its Parent Checkpoints, the time taken for the message to pass between these two checkpoints can be calculated, and the latency of the link is displayed in the Latency View.
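The link latency calculation above can be sketched in Python. This is a minimal illustration, not the plug-in's implementation: the function `link_latency` and the per-checkpoint dictionaries of observation times keyed by message ID are hypothetical.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def link_latency(
    parent_seen: dict[str, datetime],
    current_seen: dict[str, datetime],
    message_id: str,
) -> timedelta | None:
    """Latency of one message over the link from a parent checkpoint.

    Hypothetical sketch: each dict maps a message ID to the time the message
    was observed at that checkpoint. Returns None until the message has been
    seen at both checkpoints.
    """
    if message_id not in parent_seen or message_id not in current_seen:
        return None
    return current_seen[message_id] - parent_seen[message_id]


parent = {"ORD1": datetime(2024, 1, 2, 9, 30, 0, 0)}
current = {"ORD1": datetime(2024, 1, 2, 9, 30, 0, 2500)}
print(link_latency(parent, current, "ORD1"))  # 0:00:00.002500
```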
Messages that do not get delivered to children can also be recorded. Any message that has been seen at a Parent Checkpoint and at the Current Checkpoint, but is not seen at a Child Checkpoint is considered a lost message. The user can specify how long the system should wait before it considers a message lost. The plug-in displays the ID and description of the message so that users are notified and can carry out remedial action.
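A similar sketch for lost-message detection follows. The `TrackedMessage` model, the `find_lost` function and its `lost_timeout` parameter are illustrative assumptions. In particular, measuring the wait from the moment the message was seen at the current checkpoint is an assumption, since the starting point of the wait period is not specified here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TrackedMessage:
    """State of one message at the current checkpoint (hypothetical model)."""
    message_id: str
    description: str
    seen_here_at: float          # seconds; when the message was seen at this checkpoint
    seen_at_parent: bool = False
    seen_at_child: bool = False


def find_lost(messages: list[TrackedMessage], now: float,
              lost_timeout: float) -> list[tuple[str, str]]:
    """Return (ID, description) for each message seen at a parent and at this
    checkpoint but not at any child within lost_timeout seconds."""
    return [
        (m.message_id, m.description)
        for m in messages
        if m.seen_at_parent
        and not m.seen_at_child
        and now - m.seen_here_at > lost_timeout
    ]
```

Returning the description alongside the ID mirrors what the plug-in is described as displaying for each lost message.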
The plug-in can also record all messages seen at a checkpoint to a database for later processing.
The Adapter passively listens to the message flow through a defined checkpoint and extracts the Message Identifiers, Message Attributes and the timestamp at which the message was observed. It then passes this information to the plug-in for real-time analysis and database storage.
The plug-in is supplied with a set of internal adapters to read data from the client machines. The adapters perform the following operations on the data: data extraction, message filtering and data normalisation.
In order to extract data from a file or data stream, the source of that data needs to be supplied along with the format of the data. The complexity of each of the format types differs considerably. FIX requires a few simple settings, while Regex requires a regular expression to be defined for each piece of data that the user wishes to extract from the data source. A manual is available to describe the setup and scope of each internal adapter. These manuals concentrate on the data extraction, as this is what differentiates one adapter from another.
Not all messages that are visible at a checkpoint should be monitored. Filtering allows unwanted messages to be ignored. The user can specify a single Message Regex that is applied to the message before it is processed in any way; messages that do not match this expression are ignored. The user can also specify any number of regular expressions that are applied to tagged data extracted from the message; again, if the tag does not match its regular expression, the message is ignored.
The adapters also allow inverted regular expressions, which match when the non-inverted regular expression would not match, and vice versa. Inversion applies to the message as a whole, not just to the tag value. So if the tag filter ITRS applied to tag 11 were inverted, messages whose tag 11 did not include ITRS would match the inverted filter, and so would messages that lacked tag 11 entirely (i.e. only messages with ITRS in tag 11 would be ignored).
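The filtering rules, including the inverted tag filter example, can be expressed as a short Python sketch. `TagFilter` and `accept` are hypothetical names, and Python's `re.search` semantics (match anywhere in the value) are assumed; the adapters' actual regular-expression dialect may differ.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class TagFilter:
    tag: str
    pattern: str
    inverted: bool = False

    def passes(self, tags: dict[str, str]) -> bool:
        value = tags.get(self.tag)
        # A missing tag counts as "no match", so an inverted filter accepts it.
        matched = value is not None and re.search(self.pattern, value) is not None
        return not matched if self.inverted else matched


def accept(raw: str, tags: dict[str, str], message_regex: str | None,
           tag_filters: list[TagFilter]) -> bool:
    """Apply the message-level regex first, then every tag filter.

    In the adapter the Message Regex runs before any extraction; here the
    tags are passed in already extracted to keep the sketch short.
    """
    if message_regex is not None and re.search(message_regex, raw) is None:
        return False
    return all(f.passes(tags) for f in tag_filters)
```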
Once the data has been extracted and filtered, it is presented to the adapter's normalisation system as a set of named tags. These names may be defined by the data extractor, as is the case with the FIX adapter, where the tag names are the FIX tag numbers, or they may be defined by the user, as is the case with the REGEX adapter, where the user defines a tag name for each regular expression applied to the message.
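The two ways of naming tags can be illustrated as follows. This is a sketch under stated assumptions, not the adapters' code: `fix_to_tags` splits a FIX message on the standard SOH delimiter and uses tag numbers as names, and `regex_to_tags` applies one user-named expression per tag, keeping the first capture group. Both function names are hypothetical.

```python
from __future__ import annotations

import re

SOH = "\x01"  # standard FIX field delimiter


def fix_to_tags(raw: str, delimiter: str = SOH) -> dict[str, str]:
    """Named tags from a FIX message: the tag number is the tag name.
    Repeating groups are not handled; a repeated tag keeps its last value."""
    tags: dict[str, str] = {}
    for field in raw.split(delimiter):
        tag, sep, value = field.partition("=")
        if sep:  # skips empty fields, e.g. after a trailing delimiter
            tags[tag] = value
    return tags


def regex_to_tags(raw: str, expressions: dict[str, str]) -> dict[str, str]:
    """Named tags from user-defined regexes: one user-chosen name per regex.
    The first capture group is used if present, otherwise the whole match."""
    tags: dict[str, str] = {}
    for name, pattern in expressions.items():
        match = re.search(pattern, raw)
        if match:
            tags[name] = match.group(1) if match.groups() else match.group(0)
    return tags
```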
Data normalisation allows the user to take one or more tags and map them into IDs and Attributes. For example, when monitoring FIX messages from multiple clients, the ClientOrderID (Tag 11) alone may not be unique, because different clients can use the same value; the user may need a Message Identifier composed of the ClientID (Tag 109) and the ClientOrderID. Data normalisation is used to set up such mappings.
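A composite identifier such as ClientID plus ClientOrderID could be built as shown below. `map_tags`, the mapping format and the `-` separator are hypothetical; the sketch only shows the idea of deriving named IDs and Attributes from one or more tags.

```python
from __future__ import annotations


def map_tags(tags: dict[str, str], mappings: dict[str, list[str]],
             separator: str = "-") -> dict[str, str]:
    """Build named IDs or attributes from one or more source tags.

    Each mapping joins its source tag values in order. A mapping is skipped
    when any of its source tags is absent, so that named ID is simply not
    present on this message.
    """
    result: dict[str, str] = {}
    for name, source_tags in mappings.items():
        if all(tag in tags for tag in source_tags):
            result[name] = separator.join(tags[tag] for tag in source_tags)
    return result
```

Joining strings with a separator can make two different tag combinations produce the same ID if the values themselves contain the separator; keeping the values as a tuple avoids that at the cost of a less readable ID.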
Both Message IDs and Attributes are set up this way. A named ID must be unique across all the messages sent. It does not have to be present on the message at each checkpoint, but the same value cannot be present on two different messages. For example, consider the following messages:
Seen at Checkpoint 1:
- Message (ClientID=1111, InternalID=734234)
- Message (ClientID=1112, InternalID=838432)
Seen at Checkpoint 2:
- Message (ClientID=1111, InternalID=98343)
- Message (ClientID=1113, InternalID=1112)
- Message (ClientID=9283, InternalID=838432)
The ClientID on the 1st message at Checkpoint 1 matches the ClientID on the 1st message at Checkpoint 2, so these are both part of the same message. The InternalID on the 2nd message at Checkpoint 1 matches the InternalID on the 3rd message at Checkpoint 2, so these are also part of the same message. Although the ClientID on the 2nd message at Checkpoint 1 matches the InternalID on the 2nd message at Checkpoint 2 (both are 1112), these are not part of the same message, because Message Identifiers are only compared if they have the same name.
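The matching rule in this example can be checked with a small Python sketch. Indexing every ID by its (name, value) pair ensures that a value is only compared with IDs of the same name. The `correlate` function is illustrative, not the plug-in's algorithm; it returns zero-based positions in each checkpoint's list.

```python
from __future__ import annotations


def correlate(seen_at_1: list[dict[str, str]],
              seen_at_2: list[dict[str, str]]) -> list[tuple[int, int]]:
    """Pair observations from two checkpoints that share a named ID.

    IDs are indexed by (name, value), so a value only matches the same value
    under the same ID name.
    """
    index: dict[tuple[str, str], set[int]] = {}
    for j, ids in enumerate(seen_at_2):
        for name, value in ids.items():
            index.setdefault((name, value), set()).add(j)
    pairs: set[tuple[int, int]] = set()
    for i, ids in enumerate(seen_at_1):
        for name, value in ids.items():
            for j in index.get((name, value), ()):
                pairs.add((i, j))
    return sorted(pairs)


checkpoint_1 = [
    {"ClientID": "1111", "InternalID": "734234"},
    {"ClientID": "1112", "InternalID": "838432"},
]
checkpoint_2 = [
    {"ClientID": "1111", "InternalID": "98343"},
    {"ClientID": "1113", "InternalID": "1112"},
    {"ClientID": "9283", "InternalID": "838432"},
]
print(correlate(checkpoint_1, checkpoint_2))  # [(0, 0), (1, 2)]
```

The output pairs the 1st messages at both checkpoints, and the 2nd message at Checkpoint 1 with the 3rd at Checkpoint 2. The value 1112 is not matched, because it appears under ClientID at one checkpoint and under InternalID at the other.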
Attributes have no uniqueness requirement and can be used to record message data such as "Buy Price" and "Client Name".
Once the message has been normalised and filtered, it is passed to the plug-in for database insertion and real time tracking. Real Time Tracking does not pass all of the message's information between checkpoints. Instead, the user can select an attribute (or ID) to use as the message description (used when displaying lost messages) and one to use as its category (used when displaying latency measurements).
The user can provide an external adapter that communicates directly with Real Time Tracking. The adapter uses an XMLRPC API that is described in the MESSAGE TRACKER PLUGIN API. The external adapter is responsible for data extraction, data normalisation and message filtering. Because the adapter communicates directly with Real Time Checkpoint Tracking, only two message attributes can be supplied (Description and Category). It is not currently possible to use an external adapter to submit data to the database via the Message Tracker plug-in.
In this example, orders arrive from clients at the FIX engine. The FIX engine sends the orders to the Order Management System, which in turn sends the orders to the appropriate exchange. The confirmations received from the exchange are sent back to the clients.
Four checkpoints are defined in this example:
- Checkpoint1 is where an order arrives into the bank system from the client via FIX.
- Checkpoint2 is where the order leaves the bank system and is delivered to the exchange.
- Checkpoint3 is where an order confirmation is returned to the bank system from the exchange.
- Checkpoint4 is where the order confirmation leaves the bank system and returns to the client via FIX.
Figure 1 Message Flow
There are two approaches to monitoring the latency in this example:
- Link Latency Monitoring
- Turnaround Latency Monitoring
The difference is in the way the monitoring path is configured. Message Tracker supports both approaches.
Turnaround Latency Monitoring allows monitoring of the time taken between Checkpoint1 and Checkpoint4, and between Checkpoint2 and Checkpoint3. This provides round-trip latency monitoring of the orders at the FIX server and at the Order Management System.
This monitoring approach does not require the clocks on all the machines to be in sync, because the latency is measured between checkpoints that are on the same physical box.
Figure 3 Setup with unsynchronised Clocks
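Why clock synchronisation matters can be shown with a short derivation. Assume, for illustration only, that each machine's clock differs from true time by a constant offset during the measurement (drift is ignored): $\delta_A$ on the FIX server hosting Checkpoint1 and Checkpoint4, and $\delta_B$ on the Order Management System hosting Checkpoint2 and Checkpoint3. Writing $t_i$ for the true time a message passes Checkpoint $i$, the recorded timestamps are $t_1 + \delta_A$, $t_2 + \delta_B$, $t_3 + \delta_B$ and $t_4 + \delta_A$, so the measured latencies are:

```latex
\begin{aligned}
\hat{L}_{1\to 2} &= (t_2 + \delta_B) - (t_1 + \delta_A) = (t_2 - t_1) + (\delta_B - \delta_A) \\
\hat{T}_{1\to 4} &= (t_4 + \delta_A) - (t_1 + \delta_A) = t_4 - t_1 \\
\hat{T}_{2\to 3} &= (t_3 + \delta_B) - (t_2 + \delta_B) = t_3 - t_2
\end{aligned}
```

A link latency measured across the two machines is biased by $\delta_B - \delta_A$, which can even make it negative, whereas both turnaround measurements are unaffected by the offsets.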
Example Setup for Turnaround Latency Monitoring
Below is an example setup for real time tracking of messages using turnaround latency monitoring. To keep it simple, the setup is data agnostic, and the settings for filters, source and tag mapping have been omitted.
```
message-tracker
    adapter
        name: FIX-OUT
        checkpoint: CHECKPOINT1
        source: …
        filters: …
        format type: …
        tag mapping: …
        real time tracking: record: …
    adapter
        name: FIX-RETURN
        checkpoint: CHECKPOINT4
        source: …
        filters: …
        format type: …
        tag mapping: …
        real time tracking: record: …
    rtt
        name: CHECKPOINT1
        child name: CHECKPOINT4
    rtt
        name: CHECKPOINT4
        parent name: CHECKPOINT1
```

```
message-tracker
    adapter
        name: OMS-OUT
        checkpoint: CHECKPOINT2
        source: …
        filters: …
        format type: …
        tag mapping: …
        real time tracking: record: …
    adapter
        name: OMS-RETURN
        checkpoint: CHECKPOINT3
        source: …
        filters: …
        format type: …
        tag mapping: …
        real time tracking: record: …
    rtt
        name: CHECKPOINT2
        child name: CHECKPOINT3
    rtt
        name: CHECKPOINT3
        parent name: CHECKPOINT2
```
This plug-in creates five views, examples of which are shown below.
This view is used to show the latency of all messages arriving at monitored checkpoints from a configured parent checkpoint.
Note: Unsolicited messages are ignored unless the monitored checkpoint has no parents.
The view has one row per connection pair. Sub-rows are created using the category specified in the message. If no category is specified in the plug-in for that checkpoint but one was specified at a previous checkpoint, the category from the previous checkpoint is used. If no category is defined at any checkpoint, no sub-rows are created.
Figure 4 FIX Latency View
Figure 5 OMS Latency View
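The category fallback described above might look like this. `effective_category` is a hypothetical helper, and the rule that the nearest previous checkpoint wins, when several earlier checkpoints supplied a category, is an assumption not stated in the text.

```python
from __future__ import annotations

from typing import Optional, Sequence


def effective_category(categories: Sequence[Optional[str]]) -> Optional[str]:
    """Category used for the latency sub-row.

    `categories` holds the Category attribute value (None if not specified)
    at each checkpoint the message has passed through, oldest first and
    ending with the current checkpoint. The nearest defined value wins;
    None means no sub-row is created.
    """
    for category in reversed(categories):
        if category is not None:
            return category
    return None
```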
Latency Table Legend
|link||Message Set identifier composed of the start and end checkpoints in the link together with an optional category|
|parentCheckpoint||Checkpoint name of the 1st checkpoint in the link|
|childCheckpoint||Checkpoint name of the last (2nd) checkpoint in the link.|
|category||Value of the Category Attribute in the message|
|linkStatus||Status of the monitoring link between the two checkpoints.|
|maxLatency||Maximum latency of the messages monitored over a user defined period (See average > interval). This will be blank if no messages have been seen.|
|averageLatency||Average latency of the messages monitored over a user defined period (See average > interval). This will be blank if no messages have been seen.|
|stdDevLatency||Standard deviation of the latency of the messages monitored over a user defined period (See average > interval). This will be blank if no messages have been seen.|
|numMessages||Number of messages seen over a user defined period (See average > interval).|
|lastMessageTime||Time the last message was detected.|
|totalMessages||Total number of messages seen today.|
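The interval-based statistics in this table can be illustrated with Python's `statistics` module. `latency_stats` and the (arrival time, latency) sample format are assumptions, as is the use of the population standard deviation; the plug-in may use the sample form instead.

```python
from __future__ import annotations

import statistics


def latency_stats(samples: list[tuple[float, float]], now: float,
                  interval: float) -> dict[str, float | int | None]:
    """Latency-table statistics over the last `interval` seconds.

    `samples` holds (arrival_time, latency) pairs. Latency columns are None
    (shown blank) when no messages fall inside the window.
    """
    window = [latency for arrived, latency in samples if now - arrived <= interval]
    if not window:
        return {"maxLatency": None, "averageLatency": None,
                "stdDevLatency": None, "numMessages": 0}
    return {
        "maxLatency": max(window),
        "averageLatency": statistics.mean(window),
        # Population standard deviation (assumption; see lead-in).
        "stdDevLatency": statistics.pstdev(window),
        "numMessages": len(window),
    }
```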
The lost, unacknowledged and slow message views are used to indicate messages which may need further investigation. These views show the following conditions:
- Lost - messages which have failed to propagate correctly through the system.
- Unacknowledged - messages older than a threshold still awaiting acknowledgement.
- Slow - acknowledged messages with a latency above the configured threshold.
The lost message view is always enabled and is controlled by the lost message timeout. The unacknowledged and slow message views are enabled by configuring the unacknowledged timeout and the slow message timeout, respectively, in the Real Time Tracking section.
All of these views have a configurable maximum number of rows (see the reportingListSize setting). Messages in these views can be cleared using the context-menu plug-in commands.
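The three conditions can be summarised as a classification function. `classify`, its parameters and the precedence between the checks are assumptions made for this sketch: a message that has reached a child is judged only on slowness, and one that has not is reported as lost once the lost message timeout passes, before which it may appear as unacknowledged. A threshold left as `None` models a view that is not configured.

```python
from __future__ import annotations


def classify(age: float, child_latency: float | None, lost_timeout: float,
             unacknowledged_timeout: float | None = None,
             slow_threshold: float | None = None) -> str | None:
    """Which messages view (if any) a message belongs in.

    age: seconds since the message was seen at this checkpoint.
    child_latency: latency to the child that acknowledged it, or None if no
    child has seen it yet.
    """
    if child_latency is not None:
        if slow_threshold is not None and child_latency > slow_threshold:
            return "slow"
        return None
    if age > lost_timeout:
        return "lost"
    if unacknowledged_timeout is not None and age > unacknowledged_timeout:
        return "unacknowledged"
    return None
```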
Figure 6 Lost Messages View
The number of lost messages seen. The count may be greater than the number of messages displayed if the display is limited by the reporting list size, and it can also be altered by executing plug-in commands. Lost Messages view only.
|cyclicMessages||The number of cyclic messages seen (a message is cyclic if seen multiple times at the same checkpoint). Lost Messages view only.|
|unacknowledgedMessages||The number of unacknowledged messages seen. Unacknowledged Messages view only.|
|maxShownUnacknowledgedPerQueue||The configured maximum number of unacknowledged messages to display per checkpoint queue. See the maxShownUnacknowledgedPerQueue setting. Unacknowledged Messages view only.|
|slowMessages||The number of slow messages seen. Slow Messages view only.|
|lastCheckpointVisited||Name of the Last Checkpoint to see the message.|
|seenAt||List of the Checkpoints the message successfully passed through.|
|expectedAt||List of child Checkpoint names. The message was expected to be seen by at least one of these checkpoints.|
|problemAt||Checkpoint name reporting a slow message. This is the name of the parent checkpoint that received a slow response from a child checkpoint. Slow Messages view only.|
|description||Description of the message as provided by the source adapter.|
|lastSeenTime||Time the failed message was seen at this Checkpoint.|
|Latency||The message latency - Slow Messages view only.|
The view has one row per adapter. It is used to check that each adapter is running and to display the number of messages that each adapter has detected. The view can be turned off using the hideAdminViews setting, provided that at least one real time checkpoint is configured. If the plug-in is not supporting any real time checkpoints, this view is always present, as it is the only view published.
Figure 7 Admin Adapter View
Links Headline Legend
|databaseQueueSize||The number of messages waiting to be submitted to the database.|
Links Table Legend
|source||Data source file name or stream description.|
|checkpoint||Name of the checkpoint that the adapter is reading messages for|
|messagesLastSample||Number of messages seen by the adapter in the last sample|
|messagesPerDay||Number of messages seen by the adapter in the current day|
|realTimeTracking||ON if the adapter is submitting the messages to a real time checkpoint for real time monitoring|
|DBlogging||ON if the adapter is placing the message on the database queue for submission to the database|
The view has one row per Real Time Checkpoint the plug-in is supporting. It is used to check that the checkpoints can communicate with one another. The view can be turned off using the hideAdminViews setting.
Figure 8 Admin Checkpoint View
Links Table Legend
|totalParents||Number of parents (active and inactive)|
|activeParents||Number of parent checkpoints with which the named real time checkpoint can communicate|
|inactiveParents||Number of parent checkpoints with which the named real time checkpoint cannot communicate|
|totalChildren||Number of children (active and inactive)|
|activeChildren||Number of child checkpoints with which the named real time checkpoint can communicate|
|inactiveChildren||Number of child checkpoints with which the named real time checkpoint cannot communicate|
To visualise the latency of the overall system, the latency of multiple checkpoints from multiple samplers can be combined into a single view. This uses the Active Console Custom Metrics View. In Active Console, it is possible to specify a set of paths to different data views for a single Metric View.
An example of a Custom Metrics View is shown below, where the latency between Checkpoints is shown in a single view.
Figure 10 Custom Metrics in Active Console
Each messages view (for Lost Messages, Slow Messages and Unacknowledged Messages) provides three plug-in commands to clear the messages. These are presented in a context menu available when right-clicking on a message in the view.
Figure 11 Lost Messages menu showing 2 of the 3 Probe commands
These commands are described in more detail below.
Note: For unacknowledged message views, clearing a message entry will only prevent that message from being displayed. If the message is detected as either lost or slow, it will subsequently be displayed in one of these views even if it was previously cleared from the Unacknowledged Messages view.
This command will remove all messages from the view. It is available on the dataview and every cell and headline inside it. Clearing the view will also reset the count headline for the view to 0.
This command will remove all messages that were seen at a specific checkpoint. It is only available by right-clicking on the cells in the lastCheckpointVisited column for Lost and Unacknowledged messages, or the problemAt column for Slow messages.
All messages with the same checkpoint as the cell the command was triggered from are removed from the dataview, including the message the command was triggered on.
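The two clear commands described above can be modelled with a small view class. `MessageView`, `MessageRow` and `max_rows` (standing in for reportingListSize) are hypothetical, and two behaviours are assumed rather than documented here: the oldest row is dropped when the list is full, and clearing by checkpoint leaves the headline count unchanged.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MessageRow:
    message_id: str
    checkpoint: str  # lastCheckpointVisited, or problemAt for slow messages


@dataclass
class MessageView:
    """Hypothetical model of a lost, slow or unacknowledged messages view."""
    max_rows: int  # stands in for the reportingListSize setting
    rows: list[MessageRow] = field(default_factory=list)
    count: int = 0  # headline count of all messages seen

    def add(self, row: MessageRow) -> None:
        self.rows.append(row)
        self.count += 1
        # Assumption: drop the oldest row when full, so the count can
        # exceed the number of rows displayed.
        if len(self.rows) > self.max_rows:
            self.rows.pop(0)

    def clear_all(self) -> None:
        """Remove every row and reset the headline count to 0."""
        self.rows.clear()
        self.count = 0

    def clear_checkpoint(self, checkpoint: str) -> None:
        """Remove every row with the same checkpoint as the clicked cell,
        including the clicked row. The count is left unchanged (assumption)."""
        self.rows = [r for r in self.rows if r.checkpoint != checkpoint]
```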