Businesses that run their most critical applications on VMware and store their data on IBM Spectrum Virtualize storage face a challenge: meeting the performance requirements of a variety of production applications and users while still meeting SLAs. When problems arise, IT managers expect the IT teams to work together to resolve them quickly.

In this whitepaper, we look at the day-to-day challenges in today's IT environment and at how true end-to-end visibility of the VMware and IBM Spectrum Virtualize environments helps IT teams overcome these obstacles.

The Challenge

IT teams working to resolve performance issues in the VMware environment face many challenges. One is that each team works in its own silo, with its own set of technical skills and its own tools. In addition, each team uses more than one tool set to get the job done. This can cause discrepancies in the results when looking for answers, which can delay solving the performance issues and can also lead to a lot of finger pointing.

There is an additional layer of complexity due to the existing VMware tools used:

  • Difficult to customize performance views to meet troubleshooting needs
  • Lack of correlation between VMware and Storage layers
  • Limited monitoring and performance views:
    • Only one object can be viewed and analyzed at a time
    • Only one metric can be shown in a performance chart

When the administrator lacks this insight and flexibility across both layers, finding the slow drainer in the VMware environment is difficult, and tracing the root of the performance problem consumes valuable time.

The Solution

The solution is a tool that provides an efficient way of monitoring and resolving performance issues across both environments. This requires deep integration between the two environments, with performance correlated between the two layers.

BVQ addresses these IT challenges because it offers:

  • A bridge between silos, with high-quality performance analytics and a consistent monitoring and alerting solution
  • Complete integration between VMware and Storage
  • Correlation of information across layers
  • Ready-to-go Dashboards to monitor and perform root-cause analysis to find the slow drainer

              

The dashboards offer the following functions to help you pinpoint where the bottleneck is occurring:

  • See the activity levels of all hosts you are interested in monitoring in a single view
  • Easily identify saturation in memory, CPU or IO on the ESXi host and quickly pinpoint the particular VM causing it
  • Compare several objects, by viewing all VMs across several ESXi hosts, to understand whether just one VM is impacted or whether it is a chronic issue affecting more than one VM
  • See the IO load on the VM LUN (data rate and latency) to check for spikes. From spikes that surpass threshold levels, determine whether the issue is coming from the storage side and identify which VM is causing that peak load

The dashboards provide views of both environments in a unified interface.

Customer Use Case - Users are experiencing high response times on production applications

The administrator follows these troubleshooting steps using BVQ. BVQ enables the IT team to get to the root of the performance issue quickly because, from a single dashboard, the administrator can view the performance of all ESX hosts in the cluster, determine which resource is in contention, and then determine which set of VMs is creating the high load.


Step one, access the ready-to-go dashboard called "General VM Performance Overview" with one click from the Favorites menu.

-The treemap gives a hierarchical view of the VMware environment. The path shows VM vCenter / VM Cluster / VM Host / Datastore / Virtual Machines.

-The table below it allows you to search directly for any object.

-The performance views on the right give the resource performance information for CPU usage, Memory Utilization and Data Rate, all on a single screen. The two performance views are ESX Host Aggregate and VM Aggregate.

-To load the performance views, select all ESX hosts at once by clicking on the VM Cluster (this highlights it in orange, as shown below). Keyboard shortcuts make it easy to move around the dashboard; for example, press Ctrl+Shift+R to load the performance views.

-The performance view at the top, called ESX Host Aggregate, has four tabs. The first tab lets you check resource contention in all areas (Memory, CPU and Data Rate) from a single view.



Figure 1.


Step two, the subsequent tabs each show one metric (CPU, Memory or Data Rate), with the performance of each ESX host plotted. You can compare the load of all ESX hosts in a single view and quickly identify which host is experiencing a high load based on the peak (Figure 2).


Figure 2. Identify the ESX host experiencing storage contention


Figure 2 above shows the Data Rate that each ESX host is experiencing. One ESX host (ucs-esx1) shows a high Data Rate load on July 17 at 1:19 PM. This narrows the resource constraint down to storage.
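The comparison made visually in Figure 2 amounts to finding the highest single sample across all host series. The following is a minimal sketch of that idea only, not BVQ's implementation or API: the dictionary layout, the function name `busiest_host`, the second host name and every number are made-up illustrations; only the host name "ucs-esx1" and the 1:19 PM peak time come from the walkthrough.

```python
# Illustrative samples only: data rate in MB/s per ESX host, keyed by time of day.
# "ucs-esx1" is named in the walkthrough; "ucs-esx2" and all values are assumptions.
host_data_rate = {
    "ucs-esx1": {"13:17": 120.0, "13:18": 135.0, "13:19": 910.0, "13:20": 400.0},
    "ucs-esx2": {"13:17": 110.0, "13:18": 105.0, "13:19": 118.0, "13:20": 112.0},
}


def busiest_host(series_by_host):
    """Return (host, time, value) for the highest single sample across all hosts."""
    best = None
    for host, series in series_by_host.items():
        for timestamp, value in series.items():
            if best is None or value > best[2]:
                best = (host, timestamp, value)
    return best


peak_host, peak_time, peak_value = busiest_host(host_data_rate)
print(f"{peak_host} peaks at {peak_time} with {peak_value} MB/s")
```

With these sample values the sketch reports ucs-esx1 at 13:19, matching the peak read off the chart.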

-You can isolate this ESX host by clicking on it (its line becomes bold) and pressing F5. The view then displays the Data Rate of this host only, and the bottom view is filtered to the VMs running on it.

-Go to the Data Rate tab (Figure 3, bottom right performance view) and look for the peak, which will help identify the particular VM causing the high load. Hover over the peak measurement point to see the name of the VM causing the high data rate at 1:19 PM: it is Master02.
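Hovering over the peak is equivalent to asking which VM on the isolated host contributes most at that timestamp. A small sketch under stated assumptions: the data layout, the helper `top_vm_at`, the VM names other than "Master02" and all values are hypothetical and only illustrate the reasoning.

```python
# Illustrative samples only: data rate in MB/s per VM on the isolated host.
# "Master02" is named in the walkthrough; the other VMs and all values are assumptions.
vm_data_rate = {
    "Master02": {"13:18": 40.0, "13:19": 820.0, "13:20": 300.0},
    "Worker01": {"13:18": 50.0, "13:19": 55.0, "13:20": 60.0},
    "Worker02": {"13:18": 45.0, "13:19": 35.0, "13:20": 40.0},
}


def top_vm_at(series_by_vm, timestamp):
    """Return (vm, value, share) for the VM with the highest value at the timestamp.

    share is that VM's fraction of the summed value of all VMs at the timestamp.
    """
    at_peak = {vm: series.get(timestamp, 0.0) for vm, series in series_by_vm.items()}
    total = sum(at_peak.values())
    vm = max(at_peak, key=at_peak.get)
    share = at_peak[vm] / total if total else 0.0
    return vm, at_peak[vm], share


vm, value, share = top_vm_at(vm_data_rate, "13:19")
print(f"{vm} drives {share:.0%} of the data rate at 13:19")
```

Reporting the share alongside the name helps distinguish a single dominant VM from a load spread evenly over many VMs.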


 

Figure 3. Identify the VM causing high Data Rate load


Step three, transition into the storage layer using the favorite "Issue identified from the storage side" (Figure 4).

This dashboard enables the administrator to transition seamlessly to the storage side and look at the VDisk that is servicing this particular VM. Search for the VM "Master02" using the table and load the performance views.

The performance view now shows the VDisk that is servicing this VM, "Prod-Unity1", and displays its performance: data rate and latency (red line). It also keeps the same time frame you were working with in the prior favorite.

At this point you want to learn whether the issue is coming from this VDisk, so check whether it is experiencing high latency in that timeframe.

You can immediately see high latency at the peak, which occurs at 1:20 PM.

The root of the performance problem is found. The high response times on the VM's production applications are caused by the VDisk, which cannot handle the load: it shows high latency at the same measurement peak seen on the VM side.



Figure 4. Transition into the Storage layer using this favorite


You can now implement remediation steps to improve the response time experienced by users. Consider moving this VM to a different host or datastore that uses a different storage back-end.

Longer timeframes can be selected to check whether this VDisk has been experiencing high response times over a long period; if so, this could affect the storage performance of all other ESX hosts and VMs.
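One way to tell a one-off spike from a sustained problem over a longer timeframe is to look at the fraction of samples above a latency threshold. This is only a sketch of that idea: the helper names, the 20 ms threshold, the 25% cut-off and the sample values are assumptions chosen for illustration.

```python
def fraction_above(samples, threshold):
    """Return the fraction of samples strictly above the threshold."""
    if not samples:
        return 0.0
    return sum(1 for value in samples if value > threshold) / len(samples)


def is_chronic(samples, threshold, min_fraction=0.25):
    """Treat latency as chronic if at least min_fraction of samples exceed the threshold."""
    return fraction_above(samples, threshold) >= min_fraction


# Illustrative latency samples in ms over a longer window; all values are assumptions.
single_spike = [3.0, 4.0, 3.0, 42.0, 5.0, 3.0, 4.0, 3.0]
sustained = [25.0, 30.0, 28.0, 42.0, 35.0, 3.0, 4.0, 31.0]

print(is_chronic(single_spike, threshold=20.0))
print(is_chronic(sustained, threshold=20.0))
```

With these sample values, the first series has a single excursion and is not flagged, while the second stays above the threshold for most of the window and is.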

Conclusion of root-cause analysis

This root-cause analysis identified the performance issue. We were able to pinpoint the particular VM driving the high IO load and find the root of the saturation. The production application's high response time was due to the saturated VDisk servicing the VM, so the bottleneck was on the storage resource. Finding it was possible because of the ability to transition seamlessly to the storage layer. The administrator can now make an informed decision to move this production-critical VM to another VDisk that is not saturated in order to meet expected SLAs.



