Failure Detection and Membership Views

GemFire uses failure detection to remove unresponsive members from membership views.

Failure Detection

Network partition detection relies on a failure detection protocol that does not hang when NICs or machines fail. Failure detection works by detecting missing datagram heartbeats from the peer to the left in the membership view (see "Membership Views" below for the view layout), then attempting to form a TCP/IP connection to that peer, and then sending a VERIFY_SUSPECT datagram message to all other processes. Those processes all quickly send several ARE_YOU_DEAD datagram messages to the suspect process. If the suspect process answers none of these messages with an I_AM_NOT_DEAD response, it is removed from membership and sent a message telling it to disconnect from the distributed system and close its cache.
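
The escalation order described above can be sketched as a small simulation. This is only an illustration of the message sequence, not GemFire code: the function name, the callback parameters, and the probe count of 3 are all hypothetical, and the sketch assumes that a successful TCP/IP connection ends the check and that a single I_AM_NOT_DEAD reply is enough to keep the suspect in the view.

```python
from typing import Callable, List

PROBES_PER_PEER = 3  # the text says "several"; 3 is an arbitrary stand-in


def run_failure_check(
    suspect: str,
    other_members: List[str],
    heartbeat_seen: bool,
    tcp_connect: Callable[[str], bool],
    suspect_replies: Callable[[str], bool],
) -> List[str]:
    """Return the ordered events for one failure-detection pass (hypothetical helper)."""
    events: List[str] = []
    if heartbeat_seen:
        return events  # heartbeat arrived, nothing to escalate

    events.append(f"missed heartbeat from {suspect}")

    # Assumption: a successful TCP/IP connection ends the check.
    if tcp_connect(suspect):
        events.append(f"TCP/IP connection to {suspect} succeeded")
        return events
    events.append(f"TCP/IP connection to {suspect} failed")

    # Every other process is told about the suspect and probes it several times.
    answered = False
    for peer in other_members:
        events.append(f"VERIFY_SUSPECT about {suspect} sent to {peer}")
        for _ in range(PROBES_PER_PEER):
            events.append(f"ARE_YOU_DEAD from {peer} to {suspect}")
            if suspect_replies(peer):
                events.append(f"I_AM_NOT_DEAD from {suspect} to {peer}")
                answered = True
                break

    # Assumption: one I_AM_NOT_DEAD reply is enough to stay in the view.
    if not answered:
        events.append(f"{suspect} removed from membership")
    return events


# An unreachable suspect that never replies is removed.
dead = run_failure_check(
    "ent(5829)", ["ent(5767)", "ent(5875)"],
    heartbeat_seen=False,
    tcp_connect=lambda member: False,
    suspect_replies=lambda peer: False,
)
# A suspect that answers the probes stays in the view.
alive = run_failure_check(
    "ent(5829)", ["ent(5767)", "ent(5875)"],
    heartbeat_seen=False,
    tcp_connect=lambda member: False,
    suspect_replies=lambda peer: True,
)
print("\n".join(dead))
```

Returning the events as a list keeps the ordering of the steps visible, which is the point of the sketch; timeouts and real datagram traffic are deliberately left out.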

Failure detection processing is also initiated on a member if the ack-wait-threshold elapses before receiving a response to a message, if a TCP/IP connection cannot be made to the member for peer-to-peer (P2P) messaging, and if no other traffic is detected from the member. For this kind of failure detection, the operator must also have set the ack-severe-alert-threshold in

Note: The TCP connection ping is not used for connection keep-alive purposes; it is only used to detect failed members. See TCP/IP KeepAlive Configuration for TCP keep-alive configuration.

If a new membership view is sent out that includes one or more failed processes, the locator logs new quorum weight calculations. If quorum loss is detected at any point due to unresponsive processes, the locator also logs a severe-level message that identifies the failed processes:
Possible loss of quorum detected due to loss of {0} cache processes: {1}
where {0} is the number of processes that failed and {1} lists the processes.

Membership Views

The following is a sample membership view:
[info 2012/01/06 11:44:08.164 PST bridgegemfire1 <UDP Incoming Message Handler> tid=0x1f]
Membership: received new view  [ent(5767)<v0>:8700|16] [ent(5767)<v0>:8700/44876,
ent(5829)<v1>:48034/55334, ent(5875)<v2>:4738/54595, ent(5822)<v5>:49380/39564,
ent(8788)<v7>:24136/53525]
The components of the membership view are as follows:
  • The first part of the view ([ent(5767)<v0>:8700|16] in the example above) is the view ID. It identifies:
    • the address and processId of the membership coordinator: ent(5767) in the example above.
    • the view-number (<vXX>) of the membership view in which that member first appeared: <v0> in the example above.
    • the membership-port of the membership coordinator: 8700 in the example above.
    • the view-number of the current view: 16 in the example above.
  • The second part of the view lists all of the member processes in the current view: [ent(5767)<v0>:8700/44876, ent(5829)<v1>:48034/55334, ent(5875)<v2>:4738/54595, ent(5822)<v5>:49380/39564, ent(8788)<v7>:24136/53525] in the example above.
  • The overall format of each listed member is Address(processId)<vXX>:membership-port/distribution-port. The membership coordinator is almost always the first member in the view, and the rest are ordered by age.
  • The membership-port is the JGroups UDP port that the member uses to send datagrams. The distribution-port is the TCP/IP port that is used for cache messaging.
  • Each member watches the member to its left for failure detection purposes.
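
As a rough illustration of the layout described in the list above, the following sketch parses a view string into its view number and member entries. The regular expressions, the Member type, and the parse_view and watched_member helpers are hypothetical and derived only from the sample view on this page; they are not part of any GemFire API. Treating "left" as the preceding entry in the list, wrapping from the first member to the last, is also an assumption.

```python
import re
from typing import List, NamedTuple, Tuple


class Member(NamedTuple):
    address: str
    process_id: int
    first_view: int          # <vXX>: view in which the member first appeared
    membership_port: int
    distribution_port: int


# Patterns inferred from the sample view only (hypothetical, not an official grammar).
VIEW_ID_RE = re.compile(r"\[([^\s,()\[\]]+)\((\d+)\)<v(\d+)>:(\d+)\|(\d+)\]")
MEMBER_RE = re.compile(r"([^\s,()\[\]]+)\((\d+)\)<v(\d+)>:(\d+)/(\d+)")


def parse_view(text: str) -> Tuple[int, List[Member]]:
    """Return (view-number, members) parsed from a membership view string."""
    head = VIEW_ID_RE.search(text)
    if head is None:
        raise ValueError("no view ID found")
    view_number = int(head.group(5))
    # Member entries follow the view ID, so start scanning after it.
    members = [
        Member(m.group(1), int(m.group(2)), int(m.group(3)),
               int(m.group(4)), int(m.group(5)))
        for m in MEMBER_RE.finditer(text, head.end())
    ]
    return view_number, members


def watched_member(members: List[Member], index: int) -> Member:
    """Member watched by members[index].

    Assumption: "left" means the previous list entry, wrapping around.
    """
    return members[(index - 1) % len(members)]


SAMPLE = (
    "[ent(5767)<v0>:8700|16] [ent(5767)<v0>:8700/44876, "
    "ent(5829)<v1>:48034/55334, ent(5875)<v2>:4738/54595, "
    "ent(5822)<v5>:49380/39564, ent(8788)<v7>:24136/53525]"
)

view_number, members = parse_view(SAMPLE)
for i, member in enumerate(members):
    target = watched_member(members, i)
    print(f"{member.address}({member.process_id}) watches "
          f"{target.address}({target.process_id})")
```

Under these assumptions the first entry, which is usually the coordinator, is the one whose membership-port also appears in the view ID.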