BVQ analysis VAAI: undercover write operations

BUSINESS VOLUME QUALICISION (BVQ)
White Paper

 

Performance Bottleneck Analysis - VAAI undercover write operations
In this example we will see that VAAI usage is hidden from ordinary monitoring methods because the VAAI data streams are not visible on the top SCSI layer. A system may therefore appear overloaded for no apparent reason. The reason can be uncovered only by looking deeper into the system with an analysis tool that is tailored closely enough to SVC/Storwize to show the necessary performance indicators.
"vStorage APIs for Array Integration is a feature introduced in ESXi/ESX 4.1 that provides hardware acceleration functionality. It enables your host to offload specific virtual machine and storage management operations to compliant storage hardware. With the storage hardware assistance, your host performs these operations faster and consumes less CPU, memory, and storage fabric bandwidth." (from http://kb.vmware.com)

Customer situation:
The customer had been experiencing performance issues in his high-performance managed disk groups (mdgs) since adding two new mdgs for a new type of storage. With these new mdgs, the SVC had to divide the available write cache into more partitions, so the cache limit of each existing mdg was halved from 60% of the global cache to 30%.
In the same timeframe the customer also started to use VAAI, so it was unclear whether the performance issue came from the VAAI usage or from the new mdgs.
The correct answer is that the performance issue comes from an overload situation in the mdg that arose when the customer added the new managed disk groups. The best solution is to add new nodes.
In some situations the overload of the mdg could be explained by high write data rates, which led to a cache-full condition in the mdg cache partition. In several situations this explanation did not work. So the question came up: what is responsible for the mdg cache overload?
In Pict 4 we found that the volume with the highest write activity is working with a cache size near zero. This by itself would not be a problem, because this volume does not benefit from cache. What is totally unusual is that a volume with no visible activity owns the complete write cache!
Pict 6 shows the explanation: the volume with no visible write activity is in fact writing heavily, but this is only visible underneath the cache layer. Astonishingly, this volume consumes the complete cache. The writes are not performed by a host; they are performed by VAAI, which offloads operations from the VMware host into the storage hardware. This is why the activity cannot be seen on the volume layer and can only be found with tools that allow deep analysis.
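The check described here, a volume that holds most of the write cache while showing almost no host writes, can be sketched in a few lines of Python. This is only an illustration and not how BVQ works internally: the names VolumeSample and find_hidden_writers and both thresholds are hypothetical, and the partition size is assumed to equal the 10080 MB reported in Pict 4. The two sample volumes use the RTT1 values from Pict 4.

```python
from dataclasses import dataclass


@dataclass
class VolumeSample:
    name: str
    host_write_mb_s: float  # write data rate seen above the cache (SCSI top layer)
    cache_used_mb: float    # write cache currently held by this volume


def find_hidden_writers(samples, partition_cache_mb,
                        max_host_write_mb_s=1.0, min_cache_share=0.5):
    """Return volumes that hold a large cache share without matching host writes.

    Both thresholds are illustrative assumptions, not product defaults.
    """
    suspects = []
    for s in samples:
        cache_share = s.cache_used_mb / partition_cache_mb
        if s.host_write_mb_s <= max_host_write_mb_s and cache_share >= min_cache_share:
            suspects.append(s.name)
    return suspects


# RTT1 values from Pict 4; the partition size is an assumption.
samples = [
    VolumeSample("vmvdi03", host_write_mb_s=70.0, cache_used_mb=0.0),
    VolumeSample("vmvdi03_v02", host_write_mb_s=0.02, cache_used_mb=10080.0),
]
suspects = find_hidden_writers(samples, partition_cache_mb=10080.0)
print(suspects)  # ['vmvdi03_v02']
```

A volume flagged this way is only a candidate. As in the analysis above, the write rate below the cache layer has to be checked to confirm that the cache is being filled by offloaded operations.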


Pict 1: The cache partition of the managed disk group in this customer example is already heavily loaded, so only a little more load is needed to drive it into response time issues for all 257 volumes in this managed disk group. The mean partition fullness is typically 80% or more, and the max values reach 100% full for longer timeframes. The constant mean cache fullness of 80% and more shows that the mdg is already on the edge.
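As a rough illustration of the criteria in this caption, the following Python sketch summarizes a series of partition fullness readings. The function names and the readings are made up for the example; only the 80% mean and 100% max figures come from the text.

```python
from statistics import mean


def longest_full_run(fullness_pct, full_at=100.0):
    """Length of the longest run of consecutive intervals at or above full_at."""
    longest = current = 0
    for value in fullness_pct:
        current = current + 1 if value >= full_at else 0
        longest = max(longest, current)
    return longest


def partition_on_the_edge(fullness_pct, mean_limit=80.0, full_at=100.0):
    """True if the mean fullness is at least mean_limit and the partition was ever full."""
    return mean(fullness_pct) >= mean_limit and max(fullness_pct) >= full_at


# Hypothetical fullness readings in percent, one per measurement interval.
readings = [82.0, 85.0, 100.0, 100.0, 91.0, 80.0]
on_edge = partition_on_the_edge(readings)
full_run = longest_full_run(readings)
print(on_edge, full_run)  # True 2
```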

 

 

Pict 2: This is an aggregate curve of the 257 volumes in the managed disk group. The curve shows more or less steady read and write IO (green) and two response time peaks (red, the mean response time of all 257 volumes) that are not explained by higher IO load.
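One simple way to find peaks of this kind, where response time rises while IO stays steady, is to compare each interval against the median of the series. The Python sketch below is a minimal illustration under that assumption. The function name, the factor of 3, the 20% IO tolerance and all sample values are hypothetical and are not taken from the customer data.

```python
from statistics import median


def unmotivated_peaks(iops, response_ms, rt_factor=3.0, io_tolerance=0.2):
    """Return indices where response time spikes while IO stays near its median."""
    io_med = median(iops)
    rt_med = median(response_ms)
    peaks = []
    for i, (io, rt) in enumerate(zip(iops, response_ms)):
        io_steady = abs(io - io_med) <= io_tolerance * io_med
        if rt >= rt_factor * rt_med and io_steady:
            peaks.append(i)
    return peaks


# Made-up aggregate series, one value per measurement interval.
iops = [5000, 5100, 4900, 5050, 4950, 5000, 5100, 4900]
response_ms = [4, 5, 60, 5, 4, 75, 80, 5]
peaks = unmotivated_peaks(iops, response_ms)
print(peaks)  # [2, 5, 6]
```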

 


Pict 3: This graph shows the response times of all volumes. In the timeframe in question nearly all volumes have higher response times, up to 1100 ms. These peaks look like towers, so I like to call them response time towers (RTT). The winners are the two SRD volumes with 520 ms. The even higher response times belong to volumes with close to 0 IOPS.

 


Pict 4: Analysis of the volume write cache sizes (pink) together with the write data rates (blue). At first glance these curves look very common: a high data rate on single volumes together with high cache usage. The astonishing thing here in RTT1 is that the volume with the high cache usage (vmvdi03_v02: 0.02 MB/s, cache size 10080 MB) is not the volume with the high data rate (vmvdi03: 70 MB/s, cache size 0 MB). This means that something is happening on vmvdi03_v02 that we cannot see in the SCSI top layer and that is reserving the complete write cache of the managed disk group.




Pict 5: The same analysis for the second peak, RTT2. In the first 5-minute measurement period we find very typical behavior: volume vmvdi03_v00 has high write data rates and a cache usage of 9734 MB. This is understandable. In the second 5 minutes, however, the situation changes back to what we saw in RTT1: volume vmvdi03_v02 again reserves the complete cache, and volume vmvdi03_v00 works with the same data rate but no cache.

 


Pict 6: A deeper look into vmvdi03_v02 in RTT1. We did not find write activity in the SCSI layer above the cache, but we find high activity in the layer below the SVC cache and a high cache size for this volume. This specific activity came from VAAI copy tasks. The tricky part is that this load is not visible where we expect it, yet it uses SVC resources like any other write operation.

 



Pict 7: This is what a typical response time peak should look like: the active volume reserves the majority of the cache.
Thanks to Thomas who did a first description of this!
 

