Pivotal GemFire® v9.2

Troubleshooting and System Recovery

This section provides strategies for handling common errors and failure situations.

  • Producing Artifacts for Troubleshooting

    Several types of files are critical for troubleshooting.

  • Diagnosing System Problems

    This section provides possible causes and suggested responses for system problems.

  • System Failure and Recovery

    This section describes the alerts raised for various kinds of system failures and the appropriate responses to them. It also helps you plan a strategy for data recovery.

  • Handling Forced Cache Disconnection Using Autoreconnect

    A GemFire member may be forcibly disconnected from a GemFire distributed system if the member is unresponsive for a period of time, or if a network partition separates one or more members into a group that is too small to act as the distributed system.

  • Recovering from Application and Cache Server Crashes

    When the application or cache server crashes, its local cache is lost, and any resources it owned (for example, distributed locks) are released. The member must recreate its local cache upon recovery.

  • Recovering from Machine Crashes

    When a machine crashes because of a shutdown, power loss, hardware failure, or operating system failure, all of its applications and cache servers and their local caches are lost.

  • Recovering from ConflictingPersistentDataExceptions

    A ConflictingPersistentDataException thrown while starting persistent members indicates that multiple copies of some persistent data exist and GemFire cannot determine which copy to use.

  • Preventing and Recovering from Disk Full Errors

    Monitor the disk usage of GemFire members. If a member lacks sufficient disk space for a disk store, the member attempts to shut down the disk store and its associated cache, and logs an error message. A shutdown caused by a member running out of disk space can lead to data loss, data file corruption, log file corruption, and other error conditions that can negatively impact your applications.

  • Understanding and Recovering from Network Outages

    The safest response to a network outage is to restart all the processes and bring up a fresh data set.
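
As an illustration of the disk-usage monitoring recommended under Preventing and Recovering from Disk Full Errors, the following is a minimal sketch that uses only the standard Java library, not a GemFire API. The class name DiskSpaceCheck, the 10% free-space threshold, and the directory argument are assumptions made for this example; point it at the directory that holds a member's disk store files and choose a threshold that suits your deployment.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for this example; not part of GemFire.
public class DiskSpaceCheck {

    // Returns true when the usable fraction of the volume is below minFreeFraction.
    public static boolean isLow(long usableBytes, long totalBytes, double minFreeFraction) {
        if (totalBytes <= 0) {
            // An unknown or zero-sized volume is treated as a warning condition.
            return true;
        }
        return (double) usableBytes / totalBytes < minFreeFraction;
    }

    // Looks up the volume that holds dir and applies the same check.
    public static boolean isLow(Path dir, double minFreeFraction) throws IOException {
        FileStore store = Files.getFileStore(dir);
        return isLow(store.getUsableSpace(), store.getTotalSpace(), minFreeFraction);
    }

    public static void main(String[] args) throws IOException {
        // Assumed input: the directory containing the disk store files.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        double minFreeFraction = 0.10; // assumed threshold: warn below 10% free

        if (isLow(dir, minFreeFraction)) {
            System.err.println("WARNING: low disk space on the volume holding " + dir.toAbsolutePath());
        } else {
            System.out.println("Disk space OK on the volume holding " + dir.toAbsolutePath());
        }
    }
}
```

A check like this could be run periodically by a scheduler or an external monitoring agent, so that operators are alerted before a member is forced to shut down its disk store.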