Troubleshooting and System Recovery
This section provides strategies for handling common errors and failure situations.
There are several types of files that are critical for troubleshooting.
This section provides possible causes and suggested responses for system problems.
This section describes alerts for and appropriate responses to various kinds of system failures. It also helps you plan a strategy for data recovery.
A GemFire member may be forcibly disconnected from a GemFire distributed system if the member is unresponsive for a period of time, or if a network partition separates one or more members into a group that is too small to act as the distributed system.
When the application or cache server crashes, its local cache is lost, and any resources it owned (for example, distributed locks) are released. The member must recreate its local cache upon recovery.
When a machine crashes because of a shutdown, power loss, hardware failure, or operating system failure, all of its applications and cache servers and their local caches are lost.
A ConflictingPersistentDataException while starting up persistent members indicates that you have multiple copies of some persistent data, and GemFire cannot determine which copy to use.
It is important to monitor the disk usage of GemFire members. If a member lacks sufficient disk space for a disk store, the member attempts to shut down the disk store and its associated cache, and logs an error message. A shutdown due to a member running out of disk space can cause loss of data, data file corruption, log file corruption and other error conditions that can negatively impact your applications.
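As a minimal sketch of this kind of monitoring, the following example uses only standard JDK calls to check free space on the file system that holds a disk store directory. It does not use any GemFire API. The class name DiskSpaceCheck, the 10% threshold, and the default directory are assumptions chosen for illustration; substitute the directory your disk store actually uses and a threshold that suits your deployment.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of GemFire: warns when the file system
// backing a directory has less free space than a chosen fraction.
class DiskSpaceCheck {

    // Returns true when the usable fraction of the volume is below minFreeFraction.
    // An unknown or zero capacity is treated as low so that it gets investigated.
    static boolean isBelowThreshold(long usableBytes, long totalBytes, double minFreeFraction) {
        if (totalBytes <= 0) {
            return true;
        }
        return (double) usableBytes / totalBytes < minFreeFraction;
    }

    // Looks up the file store that holds dir and applies the threshold check.
    static boolean isLowOnSpace(Path dir, double minFreeFraction) throws IOException {
        FileStore store = Files.getFileStore(dir);
        return isBelowThreshold(store.getUsableSpace(), store.getTotalSpace(), minFreeFraction);
    }

    public static void main(String[] args) throws IOException {
        // Assumed default: the current directory. Pass the disk store directory as an argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        double minFreeFraction = 0.10; // assumed threshold: warn below 10% free

        if (isLowOnSpace(dir, minFreeFraction)) {
            System.err.println("WARNING: low disk space on the volume holding " + dir.toAbsolutePath());
        } else {
            System.out.println("Disk space OK on the volume holding " + dir.toAbsolutePath());
        }
    }
}
```

A check like this could run periodically from a scheduler or an external monitoring agent, so that an operator is alerted before a member reaches the point of shutting down its disk store.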
The safest response to a network outage is to restart all the processes and bring up a fresh data set.