Monitor a node's health and performance

Watching log files for errors


Log structure

By default (depending on the log configuration), all log messages with a severity of WARN, ERROR or FATAL are written to a dedicated file: var/log/node-error.log.

This file is configured to rotate automatically when it grows over 10 MB (see description of log files).

The default format for these logs is org.apache.log4j.TTCCLayout (time, thread, category and nested diagnostic context information):

time [thread] logLevel category nestedDiagnosticContext - message
  • The first field is the number of milliseconds elapsed since the start of the program.
  • The second field is the thread outputting the log statement.
  • The third field is the level (WARN, ERROR, FATAL).
  • The fourth field is the category to which the statement belongs.
  • The fifth field (just before the '-') is the nested diagnostic context. Note the nested diagnostic context may be empty.
  • The text after the '-' is the message of the statement.
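
The field layout above can be turned into a small parser for scripts that watch the error log. The following is a minimal sketch, not part of the product: the names TtccRecord and parse_ttcc_line are hypothetical, and it assumes each field is separated by a single space, as in the format line above. Lines that do not match (for example stack trace lines) return None.

```python
import re
from typing import NamedTuple, Optional


class TtccRecord(NamedTuple):
    """One parsed log statement (hypothetical helper type)."""
    elapsed_ms: int
    thread: str
    level: str
    category: str
    ndc: str
    message: str


# time [thread] logLevel category nestedDiagnosticContext - message
_TTCC_LINE = re.compile(
    r"^(?P<elapsed>\d+) \[(?P<thread>.+?)\] (?P<level>[A-Z]+) "
    r"(?P<category>\S+) (?P<ndc>.*?)- (?P<message>.*)$"
)


def parse_ttcc_line(line: str) -> Optional[TtccRecord]:
    """Return the parsed fields, or None for continuation lines."""
    match = _TTCC_LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    return TtccRecord(
        elapsed_ms=int(match["elapsed"]),
        thread=match["thread"],
        level=match["level"],
        category=match["category"],
        ndc=match["ndc"].strip(),  # empty string when no context is set
        message=match["message"],
    )
```

For example, parse_ttcc_line("176 [main] ERROR com.example.Router - Unable to connect") yields a record with an empty ndc field; the sample line and its category name are invented for illustration.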

If an error results from an unchecked exception, the full stack trace of that exception is written to the log to provide diagnostic information.

Log content

The error log can contain three types of errors:

  • Application-level errors from Data Integration: incorrect data (missing mandatory field, corrupted input data), an incorrect mapping (use case not handled), incorrect logic in a route, ...
    In that case, the integration specialist needs to enhance the application to handle these cases: ignore invalid data where appropriate, handle the missing use cases, ...
  • Configuration issues or technical failures of the environment (hardware, network, ...): unable to connect to a data source, unable to connect to the LDAP server, out of disk space, insufficient memory for the application, ...
    In that case, the system administrator needs to take corrective action to fix the issue.
  • Defects of the Decision Insight node or of the deployment itself.
    In that case, the administrator needs to contact Support.

Well-known errors include:

  • FATAL errors correspond to an invalid state of the node and lead to a crash. A crash also generates a JVM fatal error log in the var/log directory: hs_err_pidXXXX.log, where XXXX is the PID of the process that crashed.
  • java.lang.OutOfMemoryError: the node has run out of memory. Check the memory configuration, or look for analyses that generate abnormal memory consumption.
  • com.systar.calcium.AbsorptionException: an error occurred while injecting data into the node. This is usually an application error.
  • java.io.FileNotFoundException: a file was not found. This can be either a file required by Data Integration or a node configuration file. Investigate why the corresponding file is missing.
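
The list above can be used as a first-pass triage table when scanning the error log. The sketch below is illustrative only: KNOWN_ERRORS and triage are hypothetical names, the hints merely restate the list above, and matching is a plain substring search on the message text, which is an assumption rather than a documented behavior.

```python
from typing import Optional

# Signature substrings taken from the list of well-known errors above,
# each paired with a short hint for the operator.
KNOWN_ERRORS = [
    ("java.lang.OutOfMemory",
     "node ran out of memory: check the memory configuration"),
    ("com.systar.calcium.AbsorptionException",
     "error while injecting data: usually an application error"),
    ("java.io.FileNotFoundException",
     "file not found: check Data Integration and node configuration files"),
]


def triage(level: str, message: str) -> Optional[str]:
    """Return a hint for a well-known error, or None if it is not listed."""
    if level == "FATAL":
        return "invalid node state: look for hs_err_pidXXXX.log in var/log"
    for signature, hint in KNOWN_ERRORS:
        if signature in message:
            return hint
    return None
```

Errors that return None are not necessarily harmless; they simply fall outside the short list above.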

Note that the list of possible errors is virtually unlimited and depends heavily on the deployment environment (JVM version, node(s), embedded libraries, security configuration, etc.).

Monitoring system through JMX

Because the node is a standard Java application, it can be monitored through JMX.

By default, the Java Virtual Machine exposes a lot of information about its process and internal metrics through its Platform MXBeans (CPU, Memory, Threads, Classloading, etc.).

Specific activity metrics about Decision Insight nodes are also available. Beyond the default metrics described hereafter, most metrics exposed through JMX are currently reserved for in-depth investigation and troubleshooting by Support and as such are undocumented.

Accessing JMX

The JMX interface can be accessed using any standard JMX monitoring tool, for example:

  • A graphical console such as jconsole (bundled by default with the Oracle Java Development Kit) or VisualVM.
  • jmxtrans, to dump metrics into a file and keep their history, as documented here.
  • Your usual monitoring tool, provided it supports JMX.

To connect to the JMX interface, follow the steps described in Java Management Extensions (JMX).

Essential metrics to monitor

These metrics are the ones included in the default jmxtrans configuration file.

Category 'java.lang'

This category corresponds to the generic metrics from the Java Virtual Machine, including metrics from the operating system.

Memory 

java.lang:type=Memory,*

Attribute: HeapMemoryUsage
Description: Amount of RAM used by the JVM heap. This is a composite metric, with the following values:
  • committed: amount of RAM currently reserved / allocated at the operating system level.
  • init: amount of RAM the JVM requests at startup (configured).
  • max: maximum amount of RAM that the JVM can allocate (configured).
  • used: amount of RAM currently used by the heap (within the committed amount).
Health indication: At risk when the 'used' value is > 90% of the 'max' value for more than 2 minutes.

Attribute: NonHeapMemoryUsage
Description: Amount of RAM used by the JVM in addition to the heap. This is a composite metric, with the following values:
  • committed: amount of RAM currently reserved / allocated at the operating system level.
  • init: amount of RAM the JVM requests at startup (configured).
  • max: maximum amount of RAM that the JVM can allocate (configured).
  • used: amount of RAM currently used outside the heap (within the committed amount).
Health indication: At risk when the 'used' value is > 90% of the 'max' value for more than 2 minutes.
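
The "more than 2 minutes above 90%" rule can be evaluated over a series of collected samples. The following is a minimal sketch under stated assumptions: samples are (timestamp in seconds, used, max) tuples ordered by time, for example read from a metrics history file, and the function name heap_at_risk is hypothetical.

```python
from typing import Iterable, Tuple


def heap_at_risk(
    samples: Iterable[Tuple[float, int, int]],
    ratio: float = 0.90,
    hold_seconds: float = 120.0,
) -> bool:
    """Return True if, as of the last sample, 'used' has stayed above
    ratio * 'max' continuously for more than hold_seconds."""
    breach_start = None
    at_risk = False
    for timestamp, used, maximum in samples:
        if maximum > 0 and used > ratio * maximum:
            if breach_start is None:
                breach_start = timestamp  # first sample of this breach
            at_risk = (timestamp - breach_start) > hold_seconds
        else:
            # Back under the threshold: the breach is over.
            breach_start = None
            at_risk = False
    return at_risk
```

A single dip below the threshold resets the timer, so short spikes separated by normal readings do not raise an alert.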


Operating System 

java.lang:type=OperatingSystem

Attribute: FreePhysicalMemorySize
Description: Free physical memory available at the operating system level.
Health indication: Not determined; information used for diagnostics and troubleshooting.

Attribute: SystemLoadAverage
Description: System load average at the operating system level over the last minute. On a system with X CPUs, a value of about X.0 roughly corresponds to all CPUs being fully busy; the value can exceed X.0 when more tasks are runnable than the CPUs can serve.
Health indication: Not determined; information used for diagnostics and troubleshooting.
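
Because SystemLoadAverage scales with the number of CPUs, it is often easier to compare across hosts once divided by the CPU count. This is a small sketch, assuming the CPU count is read from the AvailableProcessors attribute of the same bean and that a negative load value means the platform does not provide the metric; the function name load_per_cpu is hypothetical.

```python
from typing import Optional


def load_per_cpu(system_load_average: float, cpu_count: int) -> Optional[float]:
    """Normalize the load average by the number of CPUs.

    Returns None when the load average is unavailable (negative value)
    or when the CPU count is not usable.
    """
    if system_load_average < 0 or cpu_count <= 0:
        return None
    return system_load_average / cpu_count
```

A result near or above 1.0 suggests that the host has little CPU headroom left.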

Category 'com.systar'

This category corresponds to metrics specific to Decision Insight.

Absorption Manager

com.systar:type=com.systar.calcium.impl.absorption.AbsorptionManagerMXBean,name=calcium.absorptionManager

Attribute: AbsorbedInstanceOperationCount
Description: Number of 'Instance Operations' processed by the absorption engine. These operations can be generated either by Data Integration or by computing results. This value is cumulative since system start.
Health indication: Not determined; information used for diagnostics and troubleshooting.

Attribute: AbsorbedRequestCount
Description: Number of 'Requests' processed by the absorption engine. A single request can contain multiple instance operations. These requests can be generated either by Data Integration or by computing results. This value is cumulative since system start.
Health indication: Not determined; information used for diagnostics and troubleshooting.
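
Counters that are cumulative since system start are usually easier to read as a rate between two samples. The sketch below applies to any of the cumulative attributes in this section; the function name counter_rate is hypothetical, and it assumes that a counter going backwards means the node restarted between the two samples.

```python
from typing import Optional


def counter_rate(
    prev_ts: float, prev_value: int, curr_ts: float, curr_value: int
) -> Optional[float]:
    """Per-second rate between two samples of a cumulative counter."""
    elapsed = curr_ts - prev_ts
    if elapsed <= 0:
        return None  # samples out of order or taken at the same time
    if curr_value < prev_value:
        # Assumed restart: the counter started again from zero, so only
        # the current value is known to belong to this interval.
        return curr_value / elapsed
    return (curr_value - prev_value) / elapsed
```

For instance, a counter moving from 100 to 700 over 60 seconds gives a rate of 10 operations per second.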


Absorption Queue

com.systar:type=com.systar.calcium.impl.absorption.queue.AbsorptionQueueMXBean,name=calcium.queue

Attribute: Size
Description: Size of the processing queue of the absorption engine.
Health indication: At risk when Size equals Limit for more than X minutes. X must be adjusted based on the application behavior and data sources:
  • X=1 for applications with a continuous flow of incoming events.
  • X can be greater for applications that consume batches, so that no alert is raised while those batches are being processed.
The default value of Limit is 16. The current configuration can be determined using the listLimits operation on the bean:
com.systar:type=com.systar.calcium.impl.absorption.AbsorptionManagerMXBean,name=calcium.absorptionManager


Workflow Engine

com.systar:type=com.systar.calcium.impl.absorption.workflow.WorkflowEngineMXBean,name=calcium.workflowEngine

Attribute: ProcessedEventCount
Description: Number of internal events processed by the workflow engine, which is in charge of intercepting and routing those events. This value is cumulative since system start.
Health indication: Not determined; information used for diagnostics and troubleshooting.


MetaModel Repository

com.systar:type=com.systar.calcium.impl.metamodelreader.MetamodelRepositoryMXBean,name=calcium.metamodelRepository

Attribute: CacheInvalidationCount
Description: Number of times the meta model cache has been invalidated. This value is cumulative since system start.
Health indication: Not determined; information used for diagnostics and troubleshooting.


Query Execution

com.systar:type=com.systar.nickel.impl.jmx.QueryExecutionMXBean,name=nickel.queryEngine

Attribute: QueriesCount
Description: Number of queries executed by the query engine. This value is cumulative since system start.
Health indication: Not determined; information used for diagnostics and troubleshooting.


Computing Executor

com.systar:type=com.systar.krypton.scheduler.impl.executor.ComputingExecutorMXBean,name=krypton-scheduler.computingExecutor

Attribute: ActiveTasksSize
Description: Number of computing tasks currently active, and waiting for execution.
Health indication: If > 0 over a long period of time, the system is busy performing computations, which might delay the availability of up-to-date information in dashboards. The thresholds need to be adjusted based on the application data sources and configuration.
  • If the value is very high, there is probably a high number of recomputations in progress (late data, recently added indicators) that might require investigation.
  • If the value is decreasing slowly, very costly indicators are probably being computed, whose usage and configuration might require investigation.

Attribute: PendingSchedulingTasksSize
Description: Number of computing tasks currently waiting to be scheduled.
Health indication: Not determined; information used for diagnostics and troubleshooting.

Attribute: PendingTasksSize
Description: Number of computing tasks currently waiting to be executed.
Health indication: Not determined; information used for diagnostics and troubleshooting.
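
To judge whether ActiveTasksSize is "decreasing slowly", one option is to estimate how long the backlog would take to drain at the observed pace. This is a rough sketch, assuming samples are (timestamp in seconds, ActiveTasksSize) pairs ordered by time and that the drain rate stays constant; the function name drain_estimate is hypothetical.

```python
from typing import List, Optional, Tuple


def drain_estimate(samples: List[Tuple[float, int]]) -> Optional[float]:
    """Estimated seconds until the backlog reaches zero.

    Uses only the first and last samples. Returns None when there are
    fewer than two samples or when the backlog is stable or growing.
    """
    if len(samples) < 2:
        return None
    (start_ts, start_size), (end_ts, end_size) = samples[0], samples[-1]
    if end_ts <= start_ts:
        return None
    drain_per_second = (start_size - end_size) / (end_ts - start_ts)
    if drain_per_second <= 0:
        return None  # not draining: the estimate is undefined
    return end_size / drain_per_second
```

What counts as too long depends on the application, in line with the note above that thresholds must be adjusted to the data sources and configuration.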


Late Data Handler

com.systar:type=com.systar.krypton.scheduler.impl.latedata.LateDataHandlerMXBean,name=krypton-scheduler.lateDataHandler

Attribute: NumberOfEventsFlushed
Description: Number of events indicating late data that have been flushed. This value is cumulative since system start.
Health indication: Not determined; information used for diagnostics and troubleshooting.

Attribute: NumberOfFlushes
Description: Number of times late data events have been flushed. This value is cumulative since system start.
Health indication: Not determined; information used for diagnostics and troubleshooting.

Attribute: LongestFlushedDuration
Description: Longest time interval over which late data have been flushed. This value is tracked since system start.
Health indication: Not determined; information used for diagnostics and troubleshooting.


Dashboards Manager

com.systar:type=com.systar.nitrogen.dashboards.impl.jmx.DashboardsManagerMXBean,name=dashboards.dashboardManager

Attribute: DashboardsDisplayedCount
Description: Number of dashboards visited / displayed. This value is cumulative since system start.
Health indication: Not determined; information used for diagnostics and troubleshooting.


Flush Task

com.systar:type=com.systar.titanium.temporal.impl.table.monitoring.FlushTaskMXBean,*

Attribute: FlushCount
Description: Number of times the memory tables have been flushed to disk. This value is cumulative since system start.
Health indication: Not determined; information used for diagnostics and troubleshooting.


Memory Limiter

com.systar:type=com.systar.titanium.temporal.impl.table.monitoring.MemtableMemoryLimiterMXBean,name=titanium-temporal.memoryLimiter

Attribute: CurrentOccupiedSize
Description: Memory size currently used by the active memory tables.
Health indication: Not determined; information used for diagnostics and troubleshooting.


Merge Task

com.systar:type=com.systar.titanium.temporal.impl.table.monitoring.SSTableMergeTaskMXBean,*

Attribute: MergeCountLevel0
Description: Number of data file merges at level 0. This value is cumulative since system start.
Health indication: Not determined; information used for diagnostics and troubleshooting.

Attribute: MergeCountLevel1
Description: Number of data file merges at level 1. This value is cumulative since system start.
Health indication: Not determined; information used for diagnostics and troubleshooting.

Attribute: MergeCountLevel2
Description: Number of data file merges at level 2. This value is cumulative since system start.
Health indication: Not determined; information used for diagnostics and troubleshooting.


Temporal Store

com.systar:type=com.systar.titanium.temporal.impl.table.monitoring.TemporalStoreMXBean,*

Attribute: ScanFromStartCount
Description: Number of data scans that read the full history. This value is cumulative since system start.
Health indication: Not determined; information used for diagnostics and troubleshooting.


Plan Executor

com.systar:type=com.systar.tungsten.impl.plans.physical.PlanExecutorMXBean,name=tungsten.planExecutor

Attribute: ExecutedQueriesCount
Description: Number of queries executed by the node. This value is cumulative since system start.
Health indication: Not determined; information used for diagnostics and troubleshooting.

Monitoring at the operating system level

You can also gather statistics about the performance of the process corresponding to the node.

Please refer to the corresponding JMX monitoring description for details on the metrics to observe and the corresponding health status indications.

Monitoring attribute computing performance

You can also create an attribute computing heartbeat and monitor this heartbeat to detect a potential lag in computing.

If the heartbeat occurrences drift (for example, each occurrence arrives later and later compared to the expected one-minute interval), the node is overloaded and is falling behind with its analysis.
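
Heartbeat drift can be measured by comparing each observed occurrence with the time at which it was expected. The following is a minimal sketch, assuming heartbeat timestamps are collected in seconds and that the heartbeat is expected every 60 seconds; the function names and the 5-second tolerance are hypothetical choices, not product settings.

```python
from typing import List, Sequence


def heartbeat_lags(timestamps: Sequence[float], interval: float = 60.0) -> List[float]:
    """Lag of each heartbeat relative to a perfect schedule that starts
    at the first observed heartbeat."""
    if not timestamps:
        return []
    first = timestamps[0]
    return [ts - (first + index * interval) for index, ts in enumerate(timestamps)]


def is_drifting(
    timestamps: Sequence[float], interval: float = 60.0, tolerance: float = 5.0
) -> bool:
    """True when the lag never shrinks across at least three heartbeats
    and the latest lag exceeds the tolerance."""
    lags = heartbeat_lags(timestamps, interval)
    if len(lags) < 3:
        return False
    never_shrinks = all(later >= earlier for earlier, later in zip(lags, lags[1:]))
    return never_shrinks and lags[-1] > tolerance
```

Heartbeats observed at 0, 62, 126 and 192 seconds give lags of 0, 2, 6 and 12 seconds, which this sketch reports as drifting.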
