Cluster Lessons
Last night there was a large power outage in our colocation facility. All the Mosuki machines lost power. All the machines came back up, but the cluster had some problems:
- ntp-servers failed on the directors and app machines
- app machines failed to mount the network fs
- app machines failed to make a connection to the DB server
failed ntp-servers
This seemed unrelated to the power outage. The ntp-servers were bailing because the clock drift was too large. It appears they did not deal well with the time change on April 3rd (spring forward). Why they had a problem is not clear; the DB servers adjusted fine.
failed network fs mount
This may be due to NFS starting before DRBD was available, which shouldn't happen. See below for an explanation.
failed db connections
This was due to postgres being started before DRBD was available. This should not have happened. It occurred because a new version of postgres was installed recently and the package recreated its symlinks in /etc/rc2.d. Those symlinks should not exist, because heartbeat controls when postgres is started and stopped on this server. We need to find a way to prevent new package installations from changing the symlinks. This may also explain the problems with mounting NFS (see above).
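Until a proper fix is found, one stopgap would be a check that runs after package upgrades and flags any start links that have reappeared. The following is only a minimal sketch in Python, not something we run today. The function name stray_start_links and the HEARTBEAT_MANAGED prefixes are hypothetical; the real init script names on our servers may differ and would need to be filled in.

```python
import os
import re

# Hypothetical prefixes of init scripts that heartbeat, not the runlevel,
# is supposed to start. Adjust to the script names actually installed.
HEARTBEAT_MANAGED = ("postgresql", "nfs-kernel-server")

# SysV runlevel links are named S<NN><script> (start) or K<NN><script> (kill).
START_LINK = re.compile(r"^S(\d\d)(.+)$")


def stray_start_links(rc_dir, managed=HEARTBEAT_MANAGED):
    """Return the sorted names of start links in rc_dir whose script name
    begins with one of the heartbeat-managed prefixes."""
    stray = []
    for name in sorted(os.listdir(rc_dir)):
        match = START_LINK.match(name)
        if match and match.group(2).startswith(tuple(managed)):
            stray.append(name)
    return stray
```

Calling stray_start_links("/etc/rc2.d") on a DB server should return an empty list when things are set up correctly. Any name it returns is a service that would start at boot on its own, possibly before DRBD is available.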