Performance Monitoring and Bottleneck Analysis with BVQ V5 analysis dashboards
SVC, Storwize V5000, Storwize V3700, Storwize V9000,
Flash 840, Flash 900

 

 

 

How analysis dashboards support storage performance analysis

 

BVQ version 5 introduces a new interface which allows logically connected views to be grouped into analysis dashboards. These groupings can be stored as favorites and used like recipes to help solve problems quickly.

The following picture shows an analysis dashboard which gives an overview of a complete storage system. This analysis dashboard describes the data path and the load on the hardware in a way that allows technical issues to be identified quickly.

 



 

  • In BVQ these multiview dashboards are simply called analysis dashboards. An analysis dashboard can have one or more views, which can be treemaps, tables or graphs.

  • Views in an analysis dashboard can be connected so that selecting any object or object group changes the content of all views (see the sketch after this list).

  • These analysis dashboards can be stored as favorites and act like analysis recipes.

  • In other words, analyzing an Easy Tier group is no longer a question of many clicks: select the managed disk group and start the right favorite.

  • Favorites carry descriptions and an HTTP link back to the BVQ service pages, where the usage of the favorite is briefly described.
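
As a rough illustration of this idea (a sketch only, not BVQ's actual code or API), the connected views of an analysis dashboard behave as if every selection change were pushed to all of them:

  # Minimal sketch of "connected views": a selection change updates every view.
  # Class and object names are invented for illustration.
  class View:
      """A single view (treemap, table or graph) in an analysis dashboard."""
      def __init__(self, name):
          self.name = name

      def reload(self, selection):
          # The real tool would re-query performance data for the selected
          # objects; here we only print what would be shown.
          print(f"{self.name}: showing data for {selection}")

  class AnalysisDashboard:
      """Groups views; a selection change reloads every connected view."""
      def __init__(self, views):
          self.views = views

      def select(self, selection):
          for view in self.views:
              view.reload(selection)

  dashboard = AnalysisDashboard([View("VDisks"), View("Caches"), View("MDisks")])
  dashboard.select("managed disk group 'Pool_SAS_01'")  # hypothetical object name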

 

 

This special analysis favorite displays the data path on the right:
VDisk -> lower cache partitions -> MDisk

At the top left is the Treemap, which controls the favorite's content, together with two hardware elements: node ports and node CPU.
The data displayed depends on the objects selected in the Treemap.

 

The selection can be changed at any time. A reload command will then bring up the new data.

Select the cluster in the Treemap:

  • VDisk – all VDisks of the cluster
  • Caches – all cache partitions
  • MDisks – all MDisks
  • Node ports – all node ports
  • CPU – all CPUs

Select only a single VDisk in the Treemap:

  • VDisk – only the selected VDisk
  • Caches – the cache partition of the managed disk group the VDisk belongs to
  • MDisks – all MDisks the VDisk is using
  • Node ports – all node ports of the preferred node
  • CPU – all CPUs of the preferred node
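
The two selection modes above amount to a simple scoping rule over the object relations of the cluster. The following sketch expresses that rule with an assumed, simplified object model; the class names and relations are illustrative only and are not BVQ's real data model:

  from dataclasses import dataclass
  from typing import Dict, List, Optional

  # Simplified object model -- names and relations are assumptions for this sketch.
  @dataclass
  class VDisk:
      name: str
      mdisk_group: str        # managed disk group (pool) the VDisk belongs to
      mdisks: List[str]       # MDisks the VDisk is using
      preferred_node: str     # node that normally serves this VDisk's I/O

  @dataclass
  class Cluster:
      vdisks: List[VDisk]
      cache_partitions: Dict[str, str]   # managed disk group -> cache partition
      node_ports: Dict[str, List[str]]   # node -> node ports
      cpus: Dict[str, List[str]]         # node -> CPUs

  def scope_views(cluster: Cluster, selected: Optional[VDisk] = None) -> dict:
      """Return the objects each view shows for the current Treemap selection."""
      if selected is None:
          # Cluster selected: every view shows all objects.
          return {
              "vdisks": [v.name for v in cluster.vdisks],
              "caches": list(cluster.cache_partitions.values()),
              "mdisks": sorted({m for v in cluster.vdisks for m in v.mdisks}),
              "node_ports": [p for ports in cluster.node_ports.values() for p in ports],
              "cpus": [c for cpus in cluster.cpus.values() for c in cpus],
          }
      # Single VDisk selected: narrow every view to that VDisk's data path.
      return {
          "vdisks": [selected.name],
          "caches": [cluster.cache_partitions[selected.mdisk_group]],
          "mdisks": selected.mdisks,
          "node_ports": cluster.node_ports[selected.preferred_node],
          "cpus": cluster.cpus[selected.preferred_node],
      }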
 

 


First bottleneck analysis example (cache problem)

 

The following example shows how the BVQ analysis dashboards support the analysis of a latency peak.

The following story can be read directly from the analysis dashboard:
"The problem is caused by a cache overload resulting from massive write activity, with or without an overload in the storage backend."




  • (1) Problem identification: a problem is occurring in the storage frontend. The overall latency increases from 1.5 ms to 3.5 ms.
  • (2) Following the data path, the problem is visible in the caches: they first react with a very high load and then a full de-stage, which points towards a cache overflow.
  • (3) Following the data path, the problem is also visible in the storage backend, where latency increases from 1.4 ms to 2 ms.
  • (4) Following the data path, the problem is not visible in the connection to the SAN: the node ports are not overloaded, and when the problem occurred there was no indication of a blockage in the SAN. (A smaller blockage comes later – interesting and good to know for a proactive administrator.)
  • (5) A deeper look into the hardware: the problem is not visible in the CPUs of the nodes (the step-by-step localization is sketched below).
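
The localization logic of steps (1) to (5) can be pictured as a comparison of each data path stage against its baseline. The sketch below uses the latencies quoted above and invented values for the remaining stages (cache load, buffer credit wait, CPU load); the 25% threshold is an arbitrary choice for the illustration:

  # Baseline vs. problem-window value per data path stage. Frontend and backend
  # latencies come from the example text; the other numbers are invented.
  stages = {
      "frontend latency (ms)":        {"baseline": 1.5, "problem": 3.5},
      "cache write load (%)":         {"baseline": 40,  "problem": 95},
      "backend latency (ms)":         {"baseline": 1.4, "problem": 2.0},
      "port buffer credit wait (%)":  {"baseline": 1,   "problem": 1},
      "node CPU load (%)":            {"baseline": 35,  "problem": 38},
  }

  for stage, v in stages.items():
      change = (v["problem"] - v["baseline"]) / v["baseline"] * 100
      visible = change > 25   # arbitrary threshold: where is the problem visible?
      print(f"{stage:30s} {change:6.0f}% change -> {'VISIBLE' if visible else 'not visible'}")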

Investigate the problem deeper

It should be pointed out that none of the analysis steps so far required any special activity; it was just reading and interpreting the facts from the analysis dashboard.
Now the deeper analysis starts with the investigation of only two possible causes:

  1. What is causing the cache overflow from the VDisk side?
  2. Is the cache overflow a result of an overloaded storage backend?

What could be causing the cache overflow from the VDisk side?


This is easy to diagnose: enlarge the cache window and identify the cache partitions which went into a high-load condition. By clicking on the lines the cache partitions can easily be identified. There is no need to take notes here – just use the right mouse button and start a VDisk analysis from the marked line. This starts a VDisk analysis of only the VDisks which are connected to this one line, in other words only the VDisks which use this area of the cache. This mechanism is called backtracking in BVQ and is one of the standard analysis steps that is used again and again.
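
In essence, backtracking walks the object relations in the opposite direction: from the marked cache partition line back to the VDisks that use this area of the cache. A minimal sketch, assuming that a cache partition belongs to a managed disk group and each VDisk is provisioned from one such group (all names except rtrtv899-01 are invented):

  # Assumed relations: cache partition -> managed disk group, VDisk -> managed disk group.
  cache_partition_to_pool = {"cache_partition_3": "Pool_SAS_01"}
  vdisk_to_pool = {
      "rtrtv899-01": "Pool_SAS_01",
      "appsrv_data": "Pool_SAS_01",
      "backup_vol":  "Pool_FLASH_01",
  }

  def backtrack_vdisks(cache_partition):
      """Return the VDisks that use the cache area behind the marked line."""
      pool = cache_partition_to_pool[cache_partition]
      return [vdisk for vdisk, p in vdisk_to_pool.items() if p == pool]

  print(backtrack_vdisks("cache_partition_3"))   # -> ['rtrtv899-01', 'appsrv_data']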



The result in the lower picture is now only a few clicks away. We just need to choose the right metric (VDisk write) and switch to a display which shows one line per VDisk.

It now becomes very clear that rtrtv899-01 is the VDisk performing these massive writes.
Another option would be not to analyze the VDisk behavior but the behavior of the VDisk copy above and below the cache, which is not shown in this document.
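
Put differently, the "one line per VDisk" display answers the question of which VDisk dominates the write load in the problem window. A sketch with invented write rates (only the name rtrtv899-01 comes from the example):

  # Write data rate (MB/s) per VDisk during the problem window -- invented values.
  write_rate_mb_s = {
      "rtrtv899-01": 620.0,
      "appsrv_data": 45.0,
      "backup_vol":  12.5,
  }

  top_writer = max(write_rate_mb_s, key=write_rate_mb_s.get)
  share = write_rate_mb_s[top_writer] / sum(write_rate_mb_s.values()) * 100
  print(f"{top_writer} drives {share:.0f}% of the write load")   # -> rtrtv899-01 drives 92% ...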

Is the cache overflow also a result of an overloaded storage backend?


Again the same backtracking analysis steps are used. Instead of going upwards from the caches to the VDisks in the storage frontend, we now go downwards from the caches to the MDisks in the storage backend.


The following results can be found here: there was an increase in the latencies of the storage backend, but only after the problem happened. So the higher latency is caused by a recovery action of the storage system.
The overall increase was from 0.4 ms to 0.8 ms, which is 100%! Going deeper, it became obvious that the latency increase of the flash devices was small, whilst the increase of the SAS devices was significant.
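
The percentage figure is simply (new − old) / old. The backend values come from the text above; the per-device-class split is invented, since the text only states that the flash increase was small and the SAS increase significant:

  def increase_pct(old, new):
      """Relative latency increase in percent."""
      return (new - old) / old * 100

  print(increase_pct(0.4, 0.8))    # storage backend overall: 100.0 (%)
  print(increase_pct(0.30, 0.35))  # flash MDisks (assumed values): ~17 (%)
  print(increase_pct(0.60, 1.40))  # SAS MDisks (assumed values): ~133 (%)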


 Summary

The BVQ analysis dashboard helped to clearly identify the region where the problem was coming from. The analysis steps are standard and well described in the BVQ documentation.

The problem was caused by one volume which had a longer phase of massive writes that ended in a cache overflow situation. Typically, the SVC or Storwize reacts by reducing frontend I/O to protect the cache, which leads to higher latency. The storage backend ran into higher latency after the problem happened, when the system started to recover. In this phase the MDisks were not able to cover the load completely because the SAS devices of the managed disk group ran into higher latencies.

The recommendation is to first figure out what happened on the host that caused the very high load on the volume, and then either stretch or reduce the demands of this volume or move the volume to a more performant place inside the storage system.

If things like this happen more often, a clear recommendation would also be to increase the performance of the backend storage.


Second bottleneck analysis example

 

A latency increase from 1.2 ms to 2.8 ms occurs in the storage frontend. There is no higher load which could explain it.

This story can be read directly from the analysis dashboard:
The problem is clearly caused by a blockage in the SAN. The yellow buffer credit wait % line is far too high at more than 30%. There is no issue in the data path, so we do not have to check caches or MDisks. The CPU load is also OK – nothing to be done here.

 

 
Problem identification:

  • (1) A problem is occurring in the storage frontend. The overall latency increases from 1.2 ms to 2.8 ms.
  • (2) Following the data path, the problem is not visible in the caches.
  • (3) Following the data path, the problem is not visible in the storage backend.
  • (4) Following the data path, the problem is clearly visible in the connection to the SAN: the yellow line describes blockages in the SAN where ports are not available for data transfer. Its value of 30% is a severe problem!
  • (5) A deeper look into the hardware: the problem is not visible in the CPUs of the system.

Investigate the problem deeper

 

We again start the analysis from a point where we know exactly where to look!
Deeper analysis step:

  1. What is the reason for the blockage in the SAN?

What is the reason for the blockage in the SAN?


The identified node port is the symptom of the problem and not the root cause, so we perform backtracking from the node port to identify the VDisks which are utilizing this port.


By reading the details of the measurement points we found out that the direction of data flow points to the hosts. The node port is mainly sending data to hosts when the blockage occurs, which means the hosts are reading at this time.



We open an analysis of all VDisks which can use the node port and switch the metric to the read data rate. With this we identify a single VDisk, named udxb01_004, which is reading data at this moment.
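
The read/write conclusion follows directly from comparing the port's send and receive rates, and the same top-consumer search as in the first example then points to the reading VDisk. A sketch with invented rates (only the VDisk name udxb01_004 comes from the example):

  # Node port data rates (MB/s) at the time of the blockage -- invented values.
  send_to_hosts_mb_s = 780.0       # data leaving the port towards the hosts
  receive_from_hosts_mb_s = 40.0   # data arriving from the hosts

  direction = ("hosts are reading"
               if send_to_hosts_mb_s > receive_from_hosts_mb_s
               else "hosts are writing")
  print(direction)

  # Read data rate per VDisk that can use this node port -- invented values.
  read_rate_mb_s = {"udxb01_004": 710.0, "db_logs_01": 35.0, "web_tmp_02": 20.0}
  print(max(read_rate_mb_s, key=read_rate_mb_s.get))   # -> udxb01_004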

 

Do we always identify only single disks? And what is a bully?
In the two examples we only had a single VDisk which was performing at such a high rate that we could pin the problem to it. This is not always the case; a larger number of VDisks can also cause the trouble. We call these troublemakers bullies and the other VDisks, which are only affected by the problem, victims. The term for this is "bully / victim behavior". In most cases the victims are the ones we start to analyze: the victim leads to the symptom in caches, CPUs, MDisks or node ports, and backtracking then identifies the root cause or "bully".
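
One way to picture the distinction: during the problem window a bully's own load rises sharply, while a victim only sees its latency rise. A purely illustrative classification over invented sample data (the thresholds are arbitrary):

  # Change per VDisk during the problem window, relative to baseline -- invented data.
  vdisk_changes = {
      # name:        (load change %, latency change %)
      "udxb01_004":  (450, 20),    # drives the load -> bully
      "appsrv_data": (5,   130),   # only suffers    -> victim
      "web_tmp_02":  (-2,  110),   # only suffers    -> victim
  }

  for name, (load_delta, latency_delta) in vdisk_changes.items():
      if load_delta > 100:
          role = "bully"
      elif latency_delta > 50:
          role = "victim"
      else:
          role = "unaffected"
      print(f"{name}: {role}")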


 

Identifying the causing VDisk leads to the host and the owner of the host


It is now relatively easy to find the host that owns the VDisk. In this case we would like to identify the administrator of this host and tell him that the host or its HBAs need to be checked, because BVQ gave a clear indication that this host is causing slowdowns in the SAN.


Summary

The BVQ analysis dashboard again helped to clearly identify the region where the problem was coming from. The analysis steps are standard and well described in the BVQ documentation.

The problem was caused by a blockage in the SAN and became visible as a latency increase in the storage frontend, while the caches, the storage backend and the CPUs showed no issue. The buffer credit wait % of a node port was above 30%, which means the port was frequently not available for data transfer. Backtracking from this node port identified a single VDisk (udxb01_004) whose read activity dominated the port at that time, and this VDisk leads to the host that owns it.

The recommendation is to contact the administrator of this host and have the host and its HBAs checked, because BVQ gave a clear indication that this host is causing slowdowns in the SAN.


Add-on information: BVQ host analysis

The following picture shows details about the problem host that was identified. Half of the node port connections of this host are displayed in grey, which means they are inactive. They are not part of this problem because the host only uses VDisks of IOGrp0.



Add-on information: VDisk transfer latencies


Transfer latency shows the elapsed time from sending data to the host until the acknowledgement comes back from the host that the data was received. Typically, the values should be below 0.5 ms (red line). The VDisks of the host uvx004 are highlighted in this graph. One can easily see that this host runs into problems every night.
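
With the 0.5 ms guideline, spotting affected VDisks is a simple threshold check per measurement point. A sketch with invented samples (only the host name uvx004 and the 0.5 ms line come from the text):

  THRESHOLD_MS = 0.5   # typical upper bound for transfer latency per the text

  # Transfer latency samples (ms) per VDisk -- invented values for illustration.
  samples = {
      "uvx004_vol1": [0.2, 0.3, 1.8, 2.1, 0.4],   # nightly peaks above the line
      "other_vol":   [0.1, 0.2, 0.2, 0.3, 0.2],
  }

  for vdisk, values in samples.items():
      violations = [v for v in values if v > THRESHOLD_MS]
      if violations:
          print(f"{vdisk}: {len(violations)} samples above {THRESHOLD_MS} ms")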

 


The picture shows the transfer latencies of all 620 volumes of the cluster over one week. It took less than one minute to display this picture, which contains 900,000 measurement points. BVQ has been tested with more than 5 million measurement points in one graph.

 

 

Become an analysis champion

 

BVQ is not only software. BVQ comes with a lot of additional information on how to use the software to uncover risks and to solve storage problems. The new favorites will carry help information and will also be connected to the BVQ library so that you can read the help pages online.