The intention of VDisk copies is to protect servers against hardware failures. For this reason, it is important that the mirrors of the VDisk copies are stored on different storage systems. This creates additional challenges for system administration and monitoring.
This document describes how to ensure that all volumes are stored on different sites.
A wrong setup in this area can reduce the chance of surviving a failover to only 50%. If one mirrored pair is located entirely on Site 1 and, by unlucky coincidence, another entirely on Site 2, it is certain that the failover will not work as expected: whichever site fails, one of the pairs loses both copies. So in the worst case, nothing will work.
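To make these odds concrete, here is a minimal sketch (volume and site names are hypothetical) that enumerates which volumes survive each possible site failure. A correctly mirrored volume survives either failure; a volume with both copies on one site is lost when that site fails.

```python
# Hypothetical illustration of the failover odds described above.
# Each volume has two copies; each copy is placed on one site.
volumes = {
    "vol_ok":   ("site1", "site2"),  # correctly mirrored across sites
    "vol_bad1": ("site1", "site1"),  # both copies on Site 1 (rule violation)
    "vol_bad2": ("site2", "site2"),  # both copies on Site 2 (rule violation)
}

def survives(copies, failed_site):
    """A volume survives a site failure if at least one copy is elsewhere."""
    return any(site != failed_site for site in copies)

# Whichever site fails, one of the two badly placed volumes is lost.
for failed in ("site1", "site2"):
    lost = [name for name, copies in volumes.items()
            if not survives(copies, failed)]
    print(f"{failed} fails -> lost volumes: {lost}")
```

With two such misplaced pairs on opposite sites, every failover scenario loses data somewhere, which is the "100% certainty" described above.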
At the server level, the problem cannot be recognized at all, and it is also difficult to detect using the SVC GUI.
To detect wrong configurations like this quickly, I have developed a method in BVQ and stored it as a favorite. It consists of:
- The classification of the hardware (nodes, hosts, storage) into locations, such as named sites and rooms within sites.
- A site rule that defines that the mirrors of a volume must be stored in specific groups on storage systems located in different sites.
- An alert that fires when a volume violates the site rule.
- An analysis method, stored as a favorite, that gives a good overview of the situation and helps administrators assess and solve the problem.
This method is somewhat more complex, but it is also generic and can be reused in any installation.
- I took it one step further and composed applications and application groups from the volume groups.
  With this I can determine which applications will be affected during a failover.
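The core of the site rule can be sketched in a few lines of code. This is not how BVQ implements it internally; it is a minimal illustration, assuming we have already extracted (volume, copy ID, MDisk group) tuples, for example from the SVC CLI command `lsvdiskcopy`, plus a mapping of MDisk groups to sites. All pool and volume names are hypothetical.

```python
# Site-to-pool mapping (hypothetical names). In practice this comes from
# the location classification of the storage systems behind each pool.
mdiskgrp_site = {"pool_A": "site1", "pool_B": "site2", "pool_C": "site1"}

# (volume, copy_id, mdisk_group) tuples, e.g. parsed from "lsvdiskcopy".
vdisk_copies = [
    ("vol01", 0, "pool_A"), ("vol01", 1, "pool_B"),  # compliant: two sites
    ("vol02", 0, "pool_A"), ("vol02", 1, "pool_C"),  # violation: both on site1
]

def violations(copies, site_of):
    """Return volumes whose mirror copies do not span two different sites."""
    sites_per_volume = {}
    for vol, _copy_id, group in copies:
        sites_per_volume.setdefault(vol, set()).add(site_of[group])
    return [vol for vol, sites in sites_per_volume.items() if len(sites) < 2]

print(violations(vdisk_copies, mdiskgrp_site))  # -> ['vol02']
```

The rule simply collects the set of sites used by each volume's copies and flags any volume whose copies resolve to fewer than two sites.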
The starting point should be the alert (1), which indicates that there are problems in this area.
The analysis method then provides us with easy-to-understand data.
The overview screen shows all results in one place.
- (1) The alerts
- (2) The list of volumes that are not compliant with the rules
- (3) The list of applications that are affected when the volumes cannot fail over properly
- (4) The list of hosts that use at least one of these volumes and can therefore be affected in some way
  We could extend this list with further values such as capacity or performance
- (5) The list of rules in my installation as an overview
- (6) A treemap that shows in which applications these same hosts are used.
  These applications are also at risk.
  The red color of the application "Production" shows that this application owns one of the failing volumes.
  The red hosts in the other applications simply show that those applications use a host at risk.
- (7) This treemap represents the affected volumes graphically and helps with orientation in the system.
  The little red spots show that both volume copies are located in two mirrored MDisk groups.