This document is an example of using the BVQ Structured Performance Problem Analysis to solve a performance problem caused by Metro Mirror. The same approach can also be used for Global Mirror problems.
Download this analysis document as a PDF file

Description of the performance problem solved in this document


A latency problem was reported for VDisk CA-CL1-Disk04-N at 02/05/15 8:09.

The environment consists of two clusters connected via Metro Mirror. The first aim of this document is to show how the root cause of this problem was found in the link between the two clusters.
The second aim is to describe how the root cause was found by using the BVQ Structured Performance Problem Analysis Method. It demonstrates that successful analysis work needs both a structured method and a tool that supports this method and delivers the necessary technical insight. Our conviction is that everybody should be able to conduct a performance analysis. This is important because service levels are being lowered day by day, and especially small customers are more and more reliant on their own skills or on the skills of their partners. This is a common problem across all vendors!

When the BVQ Structured Performance Problem Analysis Method is used in combination with all the information made available in the BVQ Library, it becomes very easy for customers to detect the root causes of at least 80% of all typical performance problems on their own! Solving a problem becomes much easier once the root causes are uncovered!

This is also a perfect opportunity for partners to use BVQ when servicing their customers. Nothing is more impressive than demonstrating problem-solving skills to customers.

Just contact us: https://www.bvq-software.de/en/contact/

What is the BVQ Structured Performance Problem Analysis Method?

Some background on the BVQ Structured Performance Problem Analysis Method, which makes it much easier to find performance problems inside storage systems.

  • A step-by-step approach to quickly identify the root causes of performance problems
  • The method was developed by the BVQ team, drawing on the experience of hundreds of solved performance problems
  • It prevents poking around in the dark when looking for the problem cause
  • The BVQ User Interface is aligned to support this analysis method
  • You can read more about this method in the BVQ Library

 

The idea of the BVQ structured performance problem analysis is to track the problem down the data path until we either reach the root cause or the problem is no longer visible. At each storage layer there are these decision points:

  • The problem is visible in this layer
    --> go one layer deeper and check whether the problem is visible there as well
    --> if it is not, check whether a root cause can be found in the current layer
    --> if no root cause can be found here, go to the next layer up
  • The problem is not visible in this layer
    --> the problem should be found in the previous (higher) layer

The analysis normally starts where the symptom of the problem becomes visible, which is usually the VDisk layer.

An experienced performance analyst knows shortcuts and will probably take other routes, but at the end of the analysis he will also have identified the layer where the problem occurs.
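To make the decision flow concrete, here is a minimal Python sketch of it. The layer names and the is_visible()/root_cause() helpers are simplifications introduced for this illustration only; they are not BVQ functions, and the real checks are the manual steps described below.

```python
# Minimal sketch of the decision flow above. The layer names are simplified,
# and is_visible()/root_cause() stand for the manual checks performed with
# BVQ in steps 0 to 5 of this document - nothing here is a BVQ API.

LAYERS = ["remote_copy", "upper_cache", "vdisk",
          "vdisk_copy_and_lower_cache", "mdisk", "drive"]   # top to bottom

def walk_data_path(is_visible, root_cause, start="vdisk"):
    i = LAYERS.index(start)                 # start where the symptom shows up
    while True:
        layer = LAYERS[i]
        if not is_visible(layer):
            # problem not visible here -> it was brought in by a layer above
            return LAYERS[max(i - 1, 0)], "continue in the layers above"
        cause = root_cause(layer)
        if cause:
            return layer, cause             # root cause found in this layer
        if i + 1 == len(LAYERS):
            return layer, "deepest layer reached, root cause still open"
        i += 1                              # go one layer deeper and repeat

# In the case documented here, the symptom is visible at the VDisk but not in
# the VDisk copies, so the search continues above the VDisk layer:
print(walk_data_path(is_visible=lambda l: l == "vdisk",
                     root_cause=lambda l: None))
# -> ('vdisk', 'continue in the layers above')
```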

It may sound odd, but the first step of the structured performance analysis is always to prove that a problem exists in the storage system at all. We do this by looking up the problematic disk in the given time frame and checking for high latencies. We try to find out whether the problem is a peak or an overload problem, whether it affects read or write operations, and whether it is caused by the SAN connections or by the host. Do we perhaps find overload problems in the nodes?

This is documented in more detail here.

BVQ Library references (access for customers and partners)

 

Step 0: Prove that the problem is detectable in the storage system


A latency problem was reported for VDisk CA-CL1-Disk04-N at 02/05/15 8:09.

The first thing we should always do is prove that the problem can be tracked down in the storage system at all. This is the easiest part of the analysis: we just use the given time frame and the volume name to look up whether we can find latency issues there. The first valuable pieces of information here are:

  • Is it a peak latency or do we have an overload situation? (see the short sketch below)
  • Is only this single volume affected or do we see the same peaks on other volumes as well?

Pict. 1: A sudden latency peak of 400 ms without any obvious reason usually indicates a problem that has to be looked for in the area of the caches or the infrastructure.

This latency peak is the starting point of our analysis example.
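As a side note, the distinction between a short peak and a sustained overload can be pictured with a small, hypothetical sketch over an exported latency series. The 30 ms threshold, the run length of four samples and the sample values are illustrative assumptions, not BVQ defaults.

```python
# Hypothetical sketch: classify an exported latency series as a short peak or
# a sustained overload. Threshold and run length are illustrative only.

def classify_latency(samples_ms, threshold_ms=30.0, overload_run=4):
    """samples_ms: equidistant write-latency samples (ms) for one VDisk."""
    above = [s > threshold_ms for s in samples_ms]
    if not any(above):
        return "no latency issue"
    longest = run = 0
    for is_high in above:                 # longest run of high samples
        run = run + 1 if is_high else 0
        longest = max(longest, run)
    return "sustained overload" if longest >= overload_run else "single peak"

# One 400 ms outlier in an otherwise quiet series -> "single peak"
print(classify_latency([2, 3, 2, 400, 3, 2, 2]))
```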

BVQ Library references (access for customers and partners)

 

Step 1: Perform a quick health check


It is recommended to carry out a quick health check of the nodes prior to the analysis to exclude an internal problem in the node (CPU, caches, node ports). An issue here would steer the analysis in a different direction.

Since the health check is a documented standard procedure, the individual results are not further explained here.

 

BVQ Library references (access for customers and partners)

 

Summary of the results:

  • Node CPU check and node port check 

    It is a common mistake to think that the average CPU load provides meaningful information. The CPU is divided into individual cores, and every overloaded core can be the reason for a performance problem, because the cores serve different storage partitions within the system. For example, a heavily loaded core at about 80% and another at 20% average out to an inconspicuous 50%, and the problem would be missed (a small numeric sketch of this pitfall follows after this health check).

    The second picture, the node port check, reviews overloaded ports, buffer credit wait times and SAN errors.

    (tick) No overload in CPU cores in the relevant period of time
    (info) Relatively high load in one core two hours later (orange box)
    (tick) Buffer credit waits always below 1% and uncritical
    (tick) No SAN errors found
    (info) Higher port load on one port two hours later

Pict. 2: CPU cores are shown in the upper picture, node ports in the lower picture. Neither picture shows an issue at the time the performance problem happened (red box). It is remarkable, however, that approximately two hours after the problem occurred, suddenly high values become visible (orange box). Later we will see that the recovery phase was responsible for this: in the orange time frame one CPU core shows high values because part of the storage had to be recovered.

  • Global cache check

Upper and lower cache are busy within normal limits. Two remarks about this result:

    1. The upper cache would react if compression were involved in the latency, but no compression is used in this system. An upper cache latency would therefore be an event that perhaps indicates a bug in the code.
    2. The lower cache would indicate a performance issue in the storage backend or a volume that is overloading the cache with long-lasting heavy write operations.

Both global caches show no abnormalities. These results are taken as helpful information which can now be used during the next steps.

(tick) No overload situation in upper cache in the considered time period
(tick) No overload situation in lower cache in the considered time period
(info) Higher usage of upper cache two hours later (recovery phase)
(warning) The lower cache does not show any recovery phase - interesting, the recovery obviously does not happen below the VDisk layer!

Pict. 3: Upper cache (first picture), lower cache (second picture). No issues at the time the problem happened, but the upper cache shows some reaction in the orange box, where we found indicators of a recovery phase. The lower cache is not affected at all and shows a medium write load in the system.
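As announced in the CPU core check above, here is a tiny numeric sketch of the averaging pitfall. The load figures simply mirror the 80%/20% example from the text; they are not measurements from this system.

```python
# Made-up per-core load figures mirroring the 80%/20% example from the text.
core_load = {"core0": 80.0, "core1": 20.0}           # percent busy per core

average = sum(core_load.values()) / len(core_load)   # 50% - looks harmless
hottest = max(core_load, key=core_load.get)          # core0 at 80% - the real issue

print(f"average: {average:.0f}%, hottest core: {hottest} at {core_load[hottest]:.0f}%")
```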

 

Step 2: A deeper look into the failing VDisk


The VDisk is not really failing, but it shows very high latencies that disturb the dependent systems. We have to take a deeper look into the volume to find out whether we can exclude some of the typical error causes.

BVQ Library references (access for customers and partners)

We select the highest latency peak with the mouse and use the right mouse button menu to start the readout of this measurement point.

We find the following intermediate results:

  • The problem is a write problem, not a read problem
    • (warning) The highest measured mean write latency was 441 ms (this points to the storage backend and/or the lower cache)
    • (warning) The peak latency seen at the servers was 7974 ms (this is enough to kill servers and processes)

  • The transfer latencies to the host are normal to low 
    • (tick) This is not a problem influenced by the connected server. So we can exclude the server as being part of the problem.
    • (tick) This is also not a communication problem between server and storage.

Pict. 4: A deeper look into the volume uncovers very high write latencies but no issue with the server the volume belongs to. We now know that we have a write problem and that we can exclude the server as the cause of the problem, because the transfer time of 0.16 ms is within normal limits.

 

Step 3: Go deeper from VDisk to VDisk copy


After looking into the VDisk layer we found the symptoms of the problem and also gained some additional knowledge.

Having seen that the problem exists in the storage system, we go one step deeper to find out whether we can find traces of the problem in the VDisk copies. In V7.3 the VDisk copies are the more interesting part compared to the VDisk itself, because the VDisk copy is where the lower cache is located. Comparing I/O before and after the lower cache gives us very detailed information about how the VDisks are hitting the backend storage.

 

BVQ Library references (access for customers and partners)

How to start the VDisk copy analysis from this point

BVQ supports you in easily drilling down or up between storage layers. You just use the right mouse button on a measurement point and select the level you want to go to. The intelligent part is that BVQ always keeps an eye on the topology to lead you to the right objects.

This is very helpful, for example, in an Easy Tier environment. When drilling down from a VDisk, SSDs are only displayed if the VDisk was actually using SSDs at the chosen point in time. Thus, analysis steps are no longer distorted and it becomes easier to reach the right conclusion.

There is another analysis method called 'backtracking' which is based on this characteristic. It is used to find the VDisks that cause peak behavior inside the storage system (node ports, MDisks, CPU cores, ...). Backtracking is performed from the suspicious peak up to the VDisk layer to find out which VDisk is causing the peak.

Pict. 5: Drilling up and down is easy with BVQ. Using the right mouse button it is possible to drill up or down between all available layers in the storage system. One of the most exceptional abilities of BVQ is that these activities are guided by the topology: when drilling down from a VDisk to the MDisk layer, only the MDisks that are really used by the VDisk are shown, instead of all MDisks of the managed disk group. There is also excellent guidance when drilling up from a latency peak inside the system to the VDisk layer to find the volumes causing the peak; this is called 'backtracking'.
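The topology-guided drill-down described above can be illustrated with a small, hypothetical sketch: only the objects the selected VDisk really used at the chosen point in time are offered as drill-down targets. The data structure and the MDisk names are made up for this illustration and are not BVQ internals.

```python
# Hypothetical sketch of a topology-guided drill-down (not BVQ internals).
# For each VDisk we keep the MDisks it actually had extents on per point in
# time, so a drill-down only offers objects that were relevant at that moment.

from datetime import datetime

# made-up topology snapshots: vdisk -> timestamp -> MDisks in use at that time
topology = {
    "CA-CL1-Disk04-N": {
        datetime(2015, 2, 5, 8, 0): {"mdisk3", "mdisk7"},
        datetime(2015, 2, 5, 9, 0): {"mdisk3"},
    }
}

def drill_down_targets(vdisk, when):
    """Return only the MDisks the VDisk really used at (or just before) 'when'."""
    snapshots = topology.get(vdisk, {})
    past = [t for t in snapshots if t <= when]
    return snapshots[max(past)] if past else set()

# At 8:09 only the two MDisks in use are offered, not every MDisk of the group.
print(drill_down_targets("CA-CL1-Disk04-N", datetime(2015, 2, 5, 8, 9)))
```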

 

Checking the VDisk copy layer involves two steps. First we look into the latency of the VDisk copy, and then we check the fill level of the lower cache, which is part of the VDisk copy layer.

When opening the VDisk copy we find two objects in the options panel which have been automatically selected by BVQ based on the topology.

  • Check the latency of the VDisk copies

We check the latencies of the VDisk copies in read and write mode.

In this case we do not find higher latencies in the VDisk copies, so we assume that we will not find the root cause of this problem in the VDisk copies or any deeper layer.

(tick) No elevated R/W latencies found in the VDisk copies

The VDisk copy check is not finished yet. The lower cache also belongs to the VDisk copies and has to be checked too, because the problem can also be brought in by another VDisk filling up the cache.

Pict. 6: No higher latencies in the VDisk copy layer. This means that the latency problem is not caused by any layer deeper than the VDisk. We also have to check the cache partition of the VDisk to find out whether the problem was caused by a cache overflow.

  • Check the lower cache partition for overflow

There is a second possible cause of problems in the VDisk copy layer: an overflow of the lower cache. Mostly this is caused by another VDisk, or even a group of VDisks, filling up the cache.

Using the right mouse button menu, BVQ automatically opens the MDisk group cache partitions of both VDisk copies.

In this situation we do not find a problem in the cache partitions either, so the second element of the VDisk copy layer shows no symptoms related to the problem.

(tick) No lower cache problem found during the monitored period
(info) High cache load situation found two hours later (recovery phase)

Pict. 7: No problem in the partition cache (red box). We have a much higher cache utilization when the storage system recovers from the problem (orange box).

 

The results of the VDisk copy analysis make it necessary to look above the VDisk layer:

(tick) No elevated R/W latencies found in the VDisk copies
(tick) No lower cache problem found during the monitored period 
(info) High cache load situation found 2 hours later (recovery phase)

No symptom of the problem was visible in the VDisk copy layer. So the symptoms only reach down to the VDisk layer and do not extend further into the MDisk or drive layer. Now we have to go in the opposite direction, looking into the upper cache and then into the remote copy layer, which is located above the upper cache.

Step 4: Going upwards and looking into the upper cache


It is very unlikely that we will find upper cache problems, because this customer does not use compression. A problem here could then only point to an overload from another volume (very seldom for the upper cache) or to some code fault (as seen in V7.3 levels).

The results are:

(tick) No upper cache problem found during the monitored period  

(info) High cache load situation found 2 hours later (recovery phase)

Pict. 8: At the time the problem happened the upper cache load is low (red box). We see the recovery phase in the orange box.

So far we have found out that the symptoms of the problem are only visible in the VDisks. The VDisk copies, the lower cache and the upper cache do not show any symptoms, so the backend storage and overload situations brought in from other volumes can be excluded as root causes.

The only layer left is the remote copy layer.

Here are two possible causes for the problem:
    • The target system has a performance problem: the latency of the target VDisk associated with our disk is too high.
    • There is a problem in the connection between the two storage systems.

Step 5: Remote copy analysis


Since we did not find the problem in or below the upper cache, we have to look into the remote copy layer. To do this, the BVQ Copy Services Package needs to be installed. Possible sources of the problem are an issue in the communication line between the clusters or a VDisk latency problem on the target side.

BVQ Library references (access for customers and partners)

    • Check the VDisk on the remote copy target side

The treemap allows us to visualize all three available copies of the VDisk

      • Primary and secondary copies are the source VDisk with its two VDisk copies
      • The remote copy is the target VDisk on the other cluster

Pict. 9: This treemap shows all three copies of the VDisk with performance information. The disks are connected via the remote copy relationship object, which is the same for all copies. The remote copy carries less load (only write traffic), which is expressed by the smaller size of this object in the treemap. Please do not be confused by the equal performance figures of VDisk and VDisk copy; performance differences only become visible at the VDisk copy layer.

We now start an analysis of the source and the target VDisk and find that the remote VDisk is performing very well, with a latency of only 1.02 ms. So the only remaining possible cause of the problem is the line between the clusters.

(tick) Target VDisk is performing well with only 1.02 ms latency.

Pict. 10: This picture shows target and source VDisk in one analysis screen. We see the very high latency of the source disk and, in the same time period, the normal latency of the target disk. This screen shows us beyond any doubt that the target side of the remote copy is not the cause of the problem.

 

    • Check the connection between the two clusters

We now start an analysis of the cluster-to-cluster connections. Here we find the root cause of the problem: a disturbed connection between the two clusters. For a period of more than one hour the lines show very high latencies.
(warning) Latency problem in the connection between cluster CA_SVC and storage cluster NY_SVC.
This is finally the root cause of the problem (a short sketch of this exclusion logic follows at the end of this document).

Pict. 11: The problem was caused by high latency in the communication between the two clusters. The mean latency on all lines went up to 700 ms.

Pict. 12: This picture shows what the remote mirror links should look like when everything is working correctly.
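As announced above, here is a short, hypothetical sketch of the exclusion logic used in this last step: if the source VDisk shows high write latency while the target VDisk is healthy, the remaining suspect is the inter-cluster link. The threshold is an illustrative assumption; the latency figures simply mirror the values from this case.

```python
# Hypothetical sketch of the exclusion logic in this step. The 30 ms threshold
# is an illustrative assumption; the figures mirror the case (441 ms on the
# source, about 1 ms on the target) but are hard-coded purely for illustration.

def locate_remote_copy_problem(source_write_ms, target_write_ms, threshold_ms=30.0):
    source_high = source_write_ms > threshold_ms
    target_high = target_write_ms > threshold_ms
    if source_high and target_high:
        return "target cluster (or its backend) is the suspect"
    if source_high and not target_high:
        return "inter-cluster link is the remaining suspect"
    return "no remote copy problem visible"

print(locate_remote_copy_problem(source_write_ms=441.0, target_write_ms=1.02))
# -> inter-cluster link is the remaining suspect
```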