Cluster behavior

Whenever a node is added to or removed from the cluster, the cluster membership changes. In the event of a cluster change, a “cluster membership change” event message is dispatched to the nodes. This information is generated and sent by the Axway Clustering Framework.

In B2Bi, the trading engine has a central role in the cluster because it receives the notifications of these events. When changes occur in the cluster, the trading engine communicates this information to the integration engine. The trading engine and the integration engine are closely linked in a 1:1 relationship. Starting or stopping a trading engine starts or stops the corresponding tasks within the integration engine.

This topic describes the behavior and recovery mechanisms of B2Bi in the event of a cluster membership change.

Definitions

Configuration Adapter Node

The Configuration Adapter Node provides services to the integration engines which enable them to search and query for Agreements and Metadata Profiles to assist in the runtime message processing. It also provides a number of other B2Bi-related search services for looking up and validating Messaging ID existence, Detector functionality and other information. This node is also responsible for publishing the current Application state to the integration engines on startup.

Control Node

The Control Node manages the start and stop of the integration engine server, and runs the user interface. It logs these activities in <host_name>_cn.log. Each host in a cluster has a single Control Node.

Trading Engine Node

Instance of the trading engine on a single cluster node. Each trading engine is paired with an instance of the integration engine on a host.

Integration Engine Node

Instance of the integration engine on a single cluster node. Each integration engine is paired with an instance of the trading engine on a host.

System Node

The System Node is the integration engine instance that runs the cluster singleton tasks. Singleton tasks are always running and can only run on one integration engine instance in the cluster.

Singleton

Code that supports a single instance of a service in the cluster. Singletons in B2Bi clusters are classified as:

  • Host singletons – There is a single instance of a host singleton for a specific host machine. If the trading engine is stopped on a host, then the host singleton service is stopped on that node, and an attempt is made to restart it on another trading engine in the cluster.
  • Cluster singletons – Unique instances of service controllers in the entire B2Bi cluster. Each cluster singleton can only run on a specific node type. When the node that hosts a cluster singleton stops (for example, due to a system failure), the oldest node of the appropriate type that can host the singleton takes over its work.
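The "oldest eligible node" takeover rule for cluster singletons can be sketched as follows. This is an illustrative model only, not the actual B2Bi implementation; the `Node` structure and function names are assumptions.

```python
# Hypothetical sketch of the "oldest eligible node" rule used for
# cluster-singleton failover. Names and structures are illustrative,
# not the actual B2Bi implementation.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    node_type: str      # e.g. "control" or "trading_engine"
    started_at: float   # cluster join time; lower means older

def pick_failover_node(nodes, singleton_node_type):
    """Return the oldest running node of the type that may host the singleton."""
    candidates = [n for n in nodes if n.node_type == singleton_node_type]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n.started_at)

# Usage: after the hosting node stops, the oldest eligible node takes over.
nodes = [Node("host-b_te", "trading_engine", 20.0),
         Node("host-c_te", "trading_engine", 35.0),
         Node("host-a_cn", "control", 10.0)]
print(pick_failover_node(nodes, "trading_engine").name)  # host-b_te
```

If no node of the required type remains, the singleton cannot be restarted until such a node rejoins the cluster.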

Adding nodes

When you add a node to the cluster, it immediately starts to receive and process workload. The way the B2Bi engines manage the change is different for the trading engine and the integration engine.

Trading engine

When a trading engine is stopped, a cluster membership change event message is received by each cluster node.

Trading engine services are managed and distributed through cluster singletons. There can only be one instance of any singleton-controlled service in the entire B2Bi cluster. Each cluster singleton can only run on a specific node type: some start only on Control Nodes, and some only on Trading Engine Nodes.

In the case of an abrupt trading engine stop, B2Bi does not wait for the in-process messages to finish. Processing work is immediately distributed to the remaining running trading engine nodes. If a cluster singleton was running on the trading engine that stopped, it is restarted on the oldest node that supports that specific type of singleton.

In addition to cluster singletons, B2Bi has host singletons. There is only one instance of a host singleton for a specific host machine (however, note that you can have multiple nodes on each host: Control Node, Profile Node, Trading Engine Node). If the trading engine is stopped on a host, then the host singleton service is stopped on that node.

Integration engine

An integration engine node can run in system mode and/or in primary or secondary mode. A single integration engine can act in both system and in primary roles. When there is more than one active integration engine in the cluster, B2Bi distributes the roles as described in the following paragraphs:

System role

Only one integration engine node can have the system role assigned to it. The integration engine node that is assigned the system role runs the cluster singleton tasks. Singleton tasks always run, and can only run, on one integration engine node in the cluster. The system role is assigned to the first node that is started in supervised mode.

When the integration engine node that is assigned the system role fails, the remaining integration engine nodes try to take over the system role. Only one node can succeed and become the new node running in the system role. Once the new system role assignment has been established, all user tasks on the remaining nodes are restarted, which may cause a temporary interruption of service.
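The guarantee that only one node can win the takeover is typically enforced with an atomic claim on shared state. The following is a hedged sketch of that idea, using an in-process lock to stand in for a shared transactional store; the class and method names are assumptions, not B2Bi internals.

```python
# Illustrative sketch (not B2Bi's actual code) of how "only one node can
# succeed" in claiming the system role: an atomic compare-and-set on a
# shared record, e.g. a row in the shared database.
import threading

class SystemRoleRecord:
    """Stands in for a shared, transactional store (e.g. the cluster database)."""
    def __init__(self):
        self._lock = threading.Lock()
        self.holder = None

    def try_claim(self, node_name, failed_holder):
        """Atomically take the system role if it is still held by the failed node."""
        with self._lock:
            if self.holder == failed_holder:
                self.holder = node_name
                return True
            return False

record = SystemRoleRecord()
record.holder = "node-1"           # node-1 held the system role, then failed
winners = [n for n in ("node-2", "node-3", "node-4")
           if record.try_claim(n, "node-1")]
print(winners)  # ['node-2'] -- exactly one node wins the takeover
```

Once a node has won the claim, the losers continue as ordinary nodes and restart their user tasks against the new system node.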

Primary/secondary role

There is only one primary node, which is typically the oldest (first started) node in the cluster. The primary node runs certain singletons that do not run on the secondary nodes, and also manages the “pull” protocols on the application side. When the node acting in the primary role fails, another node must run these particular singletons. When a node is added to a running cluster, the new node is initially assigned a secondary role and receives workload assignments from the primary node.

Removing nodes

The removal of a node can occur in two general ways:

  • graceful / controlled
  • unexpected / uncontrolled

Controlled / graceful removal

In this case the engines on the cluster node are intentionally stopped through the B2Bi user interface (after which the services can be stopped, and then the machine can be powered down if necessary).

This type of stop is initiated when the user stops the trading engine. The integration engine owns a monitoring connection that detects the loss of service when the trading engine is stopped.

Trading engine:

Whenever a trading engine is stopped, a cluster membership change event message is received by each node. If the stopped trading engine is paired with an integration system node, and if there are remaining nodes in the cluster, another node becomes the system node.

Integration engine:

When you intentionally stop a trading engine, a cluster membership change event message is received by each node. Each node verifies whether its host name is still among the valid cluster members. If not, the user tasks of that node are stopped. The stop is graceful, which means that all active sessions within the integration engine are completed. If the stopped node is the primary node, the singletons are stopped as well. If there are remaining nodes in the cluster, the next oldest member becomes the primary node.
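The membership check each node performs on such an event can be sketched as below. This is an illustrative model under assumed names, not the actual B2Bi code.

```python
# Illustrative sketch (not the actual B2Bi code) of the check each
# integration engine node performs on a cluster membership change event:
# if its host is no longer a valid member, it stops its own user tasks.
def handle_membership_change(my_host, valid_members, stop_user_tasks):
    """Return True if this node stays in the cluster, False if it stopped its tasks."""
    if my_host not in valid_members:
        stop_user_tasks()       # graceful: active sessions complete first
        return False
    return True

stopped = []
# host-b is no longer listed, so it stops its user tasks:
print(handle_membership_change("host-b", {"host-a", "host-c"},
                               lambda: stopped.append("host-b")))  # False
print(stopped)  # ['host-b']
```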

In the case where you are intentionally stopping the supporting machine, you must stop the integration engines before powering down.

Unexpected / uncontrolled removal

Unexpected / uncontrolled node removal occurs when:

  • The node was stopped by the user (not following the procedure as described in the graceful method, but by killing the service or another vital main process).
  • The server on which the node is running is powered down.
  • Hardware failure occurs (local drive, CPU, and so on).
  • The node loses the connection to the network.
  • The node cannot connect to the database or to the shared file system.

Of this list, the first two can be avoided with correct procedures and access rights. The third (hardware failure) is less common.

The following paragraphs describe the remaining two items.

Network failure between B2Bi and applications/partners

B2Bi supports automatic retry mechanisms to deal with temporarily unavailable partners or applications. When the maximum number of retries is reached, an error is displayed in Message Tracker for partners, or in the Message log for applications.

Network failure between B2Bi and its database or file system (or the database / file system are not available)

Failure cause: Database not available for the trading engine (Interchange)
Recovery routine:
  1. The node shuts itself down.
  2. The node becomes an unavailable node.
  3. The node tries to reconnect to the database, repeating attempts until successful.
  4. When the database becomes available again, the node rejoins the cluster.

Failure cause: File system not available for the trading engine (Interchange)
Recovery routine:
  1. The node invokes the system throttle.
  2. The node turns off all the transport exchanges (no more messages are accepted).
  3. The node tries to finish processing the messages it is currently handling. Messages are potentially rejected.

Failure cause: File system or database not available for the integration engine (Integrator)
Recovery routine:
  1. The node shuts itself down.
  2. The node becomes an unavailable node.
  3. The node attempts to restart, repeating attempts until successful.
  4. When the file system or database becomes available again, the node rejoins the cluster.
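The reconnect-and-rejoin routines above share the same shape: retry until the resource is reachable, then rejoin. A minimal sketch, assuming hypothetical callback names and a retry interval that is not a documented B2Bi setting:

```python
# Hedged sketch of the reconnect-and-rejoin routine described above: the
# node keeps retrying the connection until it succeeds, then rejoins the
# cluster. The retry interval and callback names are assumptions.
import time

def rejoin_when_available(check_connection, rejoin,
                          retry_interval_s=30, max_attempts=None):
    """Repeat connection attempts until successful, then rejoin the cluster."""
    attempt = 0
    while max_attempts is None or attempt < max_attempts:
        attempt += 1
        if check_connection():
            rejoin()            # resource is back: rejoin the cluster
            return True
        time.sleep(retry_interval_s)
    return False

# Usage with stand-in callbacks: the third connection attempt succeeds.
results = iter([False, False, True])
ok = rejoin_when_available(lambda: next(results),
                           lambda: print("rejoined cluster"),
                           retry_interval_s=0)
```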

Note: B2Bi also supports retry mechanisms between engines:

  • Trading engine retry send to the integration engine
  • Integration engine retry send to the trading engine

Cluster membership change on other B2Bi nodes

In addition to the trading engine (Interchange) and the integration engine (Integrator), each B2Bi node has:

  • A control node that manages the B2Bi user interface. An external load balancer provides scalability for the B2Bi user interface and the integration engine client user interface. If a node fails, the load balancer detects the failure and redirects the user to an available node. In this case, the user must reconnect to the user interface and may lose any configuration data entered in the browser that was not saved by the failed node.
  • A single ISDN router node. This node can only run in active-passive mode.

Impact of cluster change on runtime data

Both controlled and uncontrolled node removals have a direct impact on the runtime environment. There are differences between a controlled and an uncontrolled removal.

New messages

For messages “pushed” to B2Bi (through protocols like HTTP and other listening protocols), the external load-balancer detects that the B2Bi node (trading engine or integration engine) is unavailable and redirects all new incoming messages to another available B2Bi node.

For messages “pulled” by B2Bi (through the FTP client), the B2Bi cluster detects that the B2Bi node (trading engine or integration engine) is unavailable and assigns the pulling tasks to another available node.

Messages being processed - behavior on cluster membership change

For messages not completely received when the node failed (that is, the last byte has not been received, or an expected synchronous acknowledgement has not been generated for a partner):

  • Messages pushed to B2Bi – The partner or application must resend the message.
  • Messages pulled by B2Bi – B2Bi (trading engine or integration engine) automatically restarts the pulling tasks on a different node to try to retrieve the message again from the partner or the application.

Trading engine behavior on cluster membership change

This behavior applies to messages fully received and in progress in the trading engine when the particular node fails. This includes synchronous acknowledgements if they are expected by a partner.

  • All messages are processed.
  • There is a limited risk of message duplicates, compensated by the trading engine, depending on the format being used.
    • For the protocols ebMS/EDIINT/RosettaNet, the trading engine checks received message IDs for duplicates.
    • If duplicates represent a risk, the administrator must check Message Tracker to see if messages were processed twice, or alternatively, specific code must be implemented to check for duplicates.
  • Processing automatically continues (restart) on a different node:
    • Inbound message: start from the beginning (unpackaging…)
    • Outbound message:
      • If packaging is completed: send to the partner
      • If packaging is not completed: create the package
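The duplicate check by received message ID mentioned above can be sketched as follows. This is a minimal in-memory model; a real cluster would keep the seen-ID store in shared, persistent storage, and the class name here is an assumption.

```python
# Minimal sketch of duplicate detection by received message ID, as the
# trading engine performs for protocols such as ebMS/EDIINT/RosettaNet.
# The store here is an in-memory set; a real cluster would use shared,
# persistent storage so all nodes see the same IDs.
class DuplicateDetector:
    def __init__(self):
        self._seen_ids = set()

    def accept(self, message_id):
        """Return True if the message is new; False if it is a duplicate."""
        if message_id in self._seen_ids:
            return False
        self._seen_ids.add(message_id)
        return True

detector = DuplicateDetector()
print(detector.accept("msg-001"))  # True  -- first delivery, process it
print(detector.accept("msg-001"))  # False -- resent after failover, discard
```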

Integration engine behavior on cluster membership change

In almost all failover situations, there is a short interruption of services. This is necessary to establish a new system node. The only exception is when the non-system node is shut down in a controlled way. If the non-system node is interrupted, the singleton tasks on the system node must be restarted to properly close all open socket connections.

The following behavior is applied to messages fully received and in progress in the integration engine when a particular node fails. This includes synchronous acknowledgements if they are expected by a partner.

  • All messages are processed.
  • No risk of message duplication.
  • The integration flow automatically restarts on an alternative node in the cluster, from the last processing step of the execution chain (registered in the service) that was reached when the node failed.
  • Note: For each B2Bi processing step, the integration engine reads input queues, executes the activity, stores the output in internal queues, and then commits the message read. Each B2Bi step not fully completed is automatically restarted from scratch, polling the uncommitted message again from the internal queue.
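The read/execute/store/commit pattern described in the note above can be sketched as follows. This is an assumed illustration, not B2Bi internals: because the input message is committed only after the output is stored, an interrupted step is simply re-run from the uncommitted message.

```python
# Illustrative sketch (assumed, not B2Bi internals) of the
# read/execute/store/commit pattern: the input message is committed only
# after the output is stored, so an interrupted step restarts from scratch.
from collections import deque

def run_step(in_queue, out_queue, activity, crash_before_commit=False):
    message = in_queue[0]                  # read without removing (no commit yet)
    result = activity(message)             # execute the activity
    out_queue.append(result)               # store the output in the internal queue
    if crash_before_commit:
        return                             # node fails: message stays uncommitted
    in_queue.popleft()                     # commit the read

in_q, out_q = deque(["doc-1"]), deque()
run_step(in_q, out_q, str.upper, crash_before_commit=True)
# After failover, the step restarts from scratch with the same message
# (a real store would also roll back the uncommitted output transactionally):
out_q.clear()
run_step(in_q, out_q, str.upper)
print(list(in_q), list(out_q))  # [] ['DOC-1']
```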

Trading engine throttling on integration engine failure

On any given node in a cluster, when the integration engine stops, the trading engine directs traffic to another integration engine.

If another integration engine is running on another cluster node and is already consuming messages from the trading engine on that node, that integration engine begins consuming messages from both trading engines, until the other integration engine returns to service.

If no other integration engine is ready to consume messages, the trading engine waits for one of these conditions:

  • 60 second timeout (default setting) for detection of an active integration engine
  • Queue limit of 2000 messages (default setting) to send to an integration engine

When either of these conditions is met, the trading engine throttles message delivery, holding messages in the queue until an integration engine becomes active.
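The two throttle conditions can be expressed as a simple predicate. The defaults below mirror the documented settings (60-second detection timeout, 2000-message queue limit), but the function itself is a hedged sketch, not the actual B2Bi API.

```python
# Hedged sketch of the two throttle conditions described above. The
# defaults mirror the documented settings (60 s timeout, 2000-message
# queue limit), but the function name and shape are assumptions.
def should_throttle(seconds_without_integration_engine, queued_messages,
                    timeout_s=60, queue_limit=2000):
    """Throttle once either condition is met while no integration engine is active."""
    return (seconds_without_integration_engine >= timeout_s
            or queued_messages >= queue_limit)

print(should_throttle(10, 500))    # False -- keep waiting
print(should_throttle(61, 500))    # True  -- detection timeout reached
print(should_throttle(10, 2000))   # True  -- queue limit reached
```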

You can view the status of all trading engine and integration nodes on a cluster from the B2Bi user interface, on the System Management page.

From the System Management menu, click System Management to view the page. The following image illustrates a cluster node in which the integration engine is stopped, and in which the trading engine found no other integration engines to direct messages to. As a result, the trading engine has gone into "throttled" status.

Image of the System Management page which shows a single integration engine node that has the status "not all started", and a single trading engine node that has the status "throttled".

Acknowledgements (outbound)

Documents that did not successfully complete the step that detects, splits, and maps the message and generates the acknowledgement are re-executed completely on another node, which then generates the acknowledgement.

Enveloping

Messages waiting to be enveloped (by the integration engine) on a node that fails before enveloping is executed are enveloped on another available node.

Acknowledgements (inbound)

When an acknowledgement is pending on a node that fails before the acknowledgement is actually received, the incoming acknowledgement is handled and correlated to the original entry by one of the alternative nodes.

Message duplication

When nodes are removed from the cluster, there is a risk in the integration engine that messages may be duplicated and processed twice. The system administrator should check the Message Log and Message Tracker in the B2Bi user interface to look for duplicates and take action.

Node rejoining clusters

Typically a node automatically rejoins the cluster as soon as it becomes available again.

  • The B2Bi node that failed automatically recovers under certain conditions: the trading engine and integration engine automatically restart and process new messages and documents.
  • For messages “pushed” to B2Bi (HTTP, …), the external load-balancer detects that the B2Bi node (trading engine or integration engine) is available and pushes new incoming messages to this node.
  • For messages “pulled” by B2Bi (FTP, …), the B2Bi cluster detects that an additional B2Bi node is available and load balances pulling tasks to this node again (this only applies to the trading engine protocols, the integration protocols continue to run on the primary node).

Monitor erroneous messages

The following B2Bi monitoring tools display a consolidated view of all messages and documents processed by all the nodes of the cluster:

  • Message Tracker (trading engine)
  • Message log (integration engine)
  • Document Tracker (integration engine)
  • Sentinel

Resubmit messages

You can resubmit messages from Message Tracker even if the message fails on a node that is no longer available.

Note   The use of the Message Tracker resend and reprocess options is dependent on whether B2Bi is configured to back up message payloads.

You can resubmit documents from Message Log even if the documents have failed on a node that is no longer available.
