Watch a video of GMCV Monitoring with BVQ Dashboards



BVQ gives you the insight to run a smooth Global Mirror with Change Volumes (GMCV) implementation that contributes to a successful disaster recovery plan.

This whitepaper discusses the challenges administrators face when implementing GMCV and shows how BVQ helps you monitor and fine-tune the implementation, so that you reach a recovery point that you can guarantee on a consistent basis and that meets your SLAs.

The Storage Administrator's Challenges

Administrators face these specific challenges when trying to monitor GMCV:

1. Setting a cycle period that fits the characteristics of the workloads. How do you know which cycle period to set for the replication? The cycle period is the main setting the administrator needs to configure, and it determines the recovery point that can be achieved. The time a cycle takes to complete determines when there is good data to recover from. Not knowing which recovery point can be met on a consistent basis can jeopardize your disaster recovery plan. Cycle periods are often set too short, so many copy processes cannot be completed in the scheduled time, and entire consistency groups then fail to finish the remote copy within the scheduled time window.

2. Monitoring hundreds of consistency groups and quickly determining, for each one, when there is a good copy to recover from. Manually checking which consistency group missed the recovery point, and why, is cumbersome and can take a huge amount of your time.

3. Figuring out what caused the missed recovery point, how often it was missed, and why. You can spend a large amount of time trying to find out what is preventing the replication from completing successfully.

BVQ simplifies the monitoring of GMCV environments with ready-to-go, easy-to-understand dashboards


For a customer project, we have developed special dashboards that display all the information an administrator needs to monitor the cycle periods of Global Mirror with Change Volumes (GMCV). The easy-to-understand GMCV dashboards can be customized to display exactly what the administrator requires. They can be accessed from anywhere by anyone on the team.

 

Imagine that a Service Level Agreement guarantees a customer that data can be restored with a maximum of 10 minutes of data loss (RPO 10 minutes). This goal can only be achieved if the cycle period of the consistency groups is set to 5 minutes.

Now, when it comes to recovery, you find that the last recovery point of a consistency group is not 10 minutes old, but 20 minutes or even older, because the system was technically unable to maintain the set cycle period.
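The relationship between cycle period and recovery point in the example above can be sketched in a few lines. This is a minimal illustration and not BVQ functionality: the function name is hypothetical, and the factor of two is an assumed rule of thumb (worst-case data loss of about two cycles) chosen because it is consistent with the 5-minute cycle period and 10-minute RPO quoted above.

```python
def worst_case_rpo_minutes(cycle_period_min, copy_duration_min):
    """Estimate the worst-case data loss in minutes (assumed rule of thumb)."""
    # A new cycle cannot start before the previous copy has finished, so the
    # effective cycle length is the larger of the configured period and the
    # time the copy actually needed.
    effective_cycle = max(cycle_period_min, copy_duration_min)
    # Assumption: the worst-case RPO is about two effective cycles.
    return 2 * effective_cycle


print(worst_case_rpo_minutes(5, 4))   # copy fits into the 5-minute period: 10
print(worst_case_rpo_minutes(5, 10))  # copy needs 10 minutes: 20
```

Under these assumptions, a copy that needs 10 minutes instead of 5 already doubles the achievable recovery point, which is the situation described above.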


Three dashboards that help you successfully monitor your GMCV implementation:

To help you monitor GMCV and configure it to meet your DR plan, BVQ offers three specialized dashboards:

  • Dashboard 1, GMCV Overview: Immediately identifies which consistency groups are in trouble, that is, which consistency groups did not meet the RPO 100% of the time during a monitored period.

  • Dashboard 2, GMCV Week: Shows how the copy operation behaves for each consistency group over a period of seven days, and indicates when the RPO was not met, how often, and by how long it was exceeded.

  • Dashboard 3, GMCV Analysis: Helps answer the questions of why the replication is not completing successfully, what is preventing it, and what should be done to fix it.

Dashboard GMCV Overview

Find out quickly which consistency groups did not meet the RPO during a selected period. The dashboard displays the results of many consistency groups on a single web page. For each consistency group, you can see the quality of the RPO in a traffic light format as "% RPO met". This indicator is the percentage of time during which the cycle period was met successfully.


 

Figure 1. Displays the status of each consistency group in a heatmap view. Status is shown as a percentage of RPO met.


The heatmap overview above shows that, for consistency group 7k05/CG01, all expected recovery points were formed 94.4% of the time over the selected period. The acceptable percentage levels are adjusted to the customer's requirements. The current setting rates 98% to 100% as good (green), 70% to 98% as warning (orange), and below 70% as error (red). In this case the status is therefore shown as warning, because the set cycle period was met only 94.4% of the time.
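The traffic light rating described above can be sketched as follows. This is only an illustration of the thresholds, not BVQ's implementation: the function names are hypothetical, the indicator is approximated as the share of cycles that finished within the cycle period (the dashboard reports a percentage of time), and a value of exactly 98% is assumed to count as good.

```python
def rpo_percent_met(cycle_durations_min, cycle_period_min):
    """Share of cycles, in percent, that finished within the cycle period."""
    if not cycle_durations_min:
        raise ValueError("no cycles in the selected period")
    met = sum(1 for d in cycle_durations_min if d <= cycle_period_min)
    return 100.0 * met / len(cycle_durations_min)


def rpo_status(percent_met, good_from=98.0, warning_from=70.0):
    """Map a percentage to the traffic light levels quoted in the text."""
    if percent_met >= good_from:
        return "good"      # green
    if percent_met >= warning_from:
        return "warning"   # orange
    return "error"         # red


# Made-up sample: 17 cycles within a 5-minute period, one cycle of 9 minutes.
pct = rpo_percent_met([4] * 17 + [9], cycle_period_min=5)
print(round(pct, 1), rpo_status(pct))  # 94.4 warning
```

Keeping the thresholds as parameters mirrors the statement above that the acceptable levels are adjusted per customer.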

Dashboard GMCV Week

This dashboard helps you quickly answer two questions: how often and when was the RPO missed, and by how long was the cycle period exceeded?

It is important to see what is going on with the replication in order to spot variations and learn at which times of the day or week the RPO was missed. This dashboard shows the status of the copy quality over a period of seven consecutive days for each consistency group. It displays the chronological progression of each day, from which you can recognize the individual problems over the day.




Figure 2. View of all consistency groups. Each row displays seven days, with status indicators showing when and by how long the cycle time has been exceeded.


The area below each heatmap provides a detailed view of how often the RPO was not met and when. The red line spikes indicate precisely when the RPO was missed. The yellow lines show by how long the cycle time was exceeded.

From each of the views, you can call up a detailed view for each consistency group, as shown in Figure 3.


Figure 3. Detailed view of a consistency group that could not keep its cycles. It helps you recognize the problem periods and, by hovering over the yellow line, see by how long the cycle time was exceeded (24 minutes and 2 minutes).
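To illustrate the kind of question the weekly view answers (when was the cycle period exceeded, and by how long), the following sketch groups exceeded cycles by weekday and hour. The data layout, the function name, the dates and the 5-minute cycle period are assumptions made for this example; only the 24-minute and 2-minute overruns are taken from Figure 3.

```python
from collections import defaultdict
from datetime import datetime


def exceedances_by_hour(cycles, cycle_period_min):
    """Map (weekday, hour) to the minutes by which the period was exceeded.

    cycles is a list of (start time, duration in minutes) pairs.
    weekday follows datetime.weekday(): 0 is Monday, 6 is Sunday.
    """
    buckets = defaultdict(list)
    for started_at, duration_min in cycles:
        excess = duration_min - cycle_period_min
        if excess > 0:
            buckets[(started_at.weekday(), started_at.hour)].append(excess)
    return dict(buckets)


# Hypothetical cycles of one consistency group with a 5-minute cycle period.
cycles = [
    (datetime(2024, 1, 8, 2, 0), 29),   # Monday 02:00, 24 minutes over
    (datetime(2024, 1, 8, 2, 30), 7),   # Monday 02:30, 2 minutes over
    (datetime(2024, 1, 8, 9, 0), 4),    # Monday 09:00, within the period
]
print(exceedances_by_hour(cycles, cycle_period_min=5))  # {(0, 2): [24, 2]}
```

Grouping by weekday and hour makes recurring problem windows, such as a nightly batch job, stand out from one-off overruns.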


Dashboard GMCV Analysis

This Analysis Dashboard can be called from Dashboard GMCV Week. It looks at three important areas of the replication process to help you find out what is preventing the RPO from being met.

The first area is the replication process. It is important to know whether the replication is creating a high load because of the characteristics of the workload, the behavior of the copy process, and the type of data access (sequential or random). You can follow this process in the three top views of the dashboard, which show it from the primary (master) volumes all the way through to the secondary (auxiliary) volumes. As the auxiliary-side view below shows, this workload has created a very high replication load. The high and low loads that the volumes produce are thus influenced by the FlashCopy operations (see the example in Figure 4). You now have insight into why these high loads occur, and BVQ helps you better understand what to expect during the copy operation process.

Characteristic of workload impacts the copy operation process

The characteristics of certain workloads, together with their type of data access (sequential or random), have a huge impact on the load generated during copy operations (FlashCopy).

Figure 4. Example of a workload that can create a high load during the copy operation to the Auxiliary site (third row)


The second area is the volume load. In this next view, you can see whether a sporadic spike of high load from a volume or set of volumes could be interfering with the copy operation process, and you can identify the volume responsible. In this example, the volume creating the high load is arc0_A.

Figure 5. The identified volume arc0_A is creating a high load at 16:00 hours


The third area is the cluster-to-cluster connectivity. From this view, you can check for bandwidth problems and determine whether the link may be too small to handle the load. You can see whether the line is at the limit of the available bandwidth, because the view shows the maximum load going through the link as well as latency peaks. You can also see whether write problems are caused by saturation of a path between the clusters. The data rate (MB/s) is indicated by the blue line and latency spikes by the red line.

The cluster-to-cluster connectivity view below shows several latency peaks. The threshold for latency peaks has been set to 5 ms. If latency peaks occur frequently during critical business hours, the bandwidth will need to be resized.


Figure 6. Cluster-to-cluster connectivity. Data rate and latency on the line
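The check described above, latency peaks above the 5 ms threshold during critical business hours, can be sketched as follows. The sample structure, the function names, the sample values and the 08:00 to 18:00 business window are assumptions made for this illustration and are not taken from BVQ.

```python
from typing import NamedTuple


class LinkSample(NamedTuple):
    hour: int               # hour of day, 0 to 23
    data_rate_mb_s: float   # blue line in the view
    latency_ms: float       # red line in the view


def latency_peaks(samples, threshold_ms=5.0):
    """Return the samples whose latency exceeds the threshold."""
    return [s for s in samples if s.latency_ms > threshold_ms]


def peak_share_in_business_hours(samples, threshold_ms=5.0,
                                 start_hour=8, end_hour=18):
    """Fraction of business-hour samples that are latency peaks."""
    in_hours = [s for s in samples if start_hour <= s.hour < end_hour]
    if not in_hours:
        return 0.0
    return len(latency_peaks(in_hours, threshold_ms)) / len(in_hours)


# Made-up samples for one day on the inter-cluster link.
samples = [
    LinkSample(7, 40.0, 1.2),
    LinkSample(9, 95.0, 6.5),
    LinkSample(10, 60.0, 2.0),
    LinkSample(16, 98.0, 7.1),
    LinkSample(17, 50.0, 3.0),
]
print(peak_share_in_business_hours(samples))  # 0.5
```

A consistently high share during business hours would support the conclusion above that the link needs to be resized, while isolated peaks may point to a single volume or job instead.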


Next steps that should be considered:

The insight provided by BVQ gives you knowledge that you can share with the disaster recovery team to build a disaster recovery plan that meets the company's SLA. You can now confidently implement the following changes:

  1. Use BVQ to create alerts to reduce the risks even further.
  2. Tune the cycle period so that the RPO is met consistently.
  3. Accurately report the RPO that can be guaranteed on a consistent basis, based on the characteristics of the workload.
  4. Size the bandwidth correctly, or confirm that the link has the bandwidth needed during critical business hours.

Summary

BVQ helps you determine the optimal cycle period to fit the replication load and the existing bandwidth. BVQ allows you to accurately determine an RPO that you can guarantee on a consistent basis. This is important because an RPO that is stated but only met by chance does not amount to good quality of service. BVQ can help you save time and effort when implementing GMCV in your environment.

