Businesses that run their most critical applications on VMware and store their data on IBM Spectrum Virtualize storage face a challenge: meeting the performance requirements of a variety of production applications and users while also meeting SLAs. When problems arise, IT managers expect the IT teams to work together and resolve them quickly.

In this whitepaper, we look at the day-to-day challenges in today's IT environment and at how true end-to-end visibility of the VMware and IBM Spectrum Virtualize environments helps IT teams overcome these obstacles.

The Challenge

IT teams working to resolve performance issues in the VMware environment face many challenges. One is that each team works in its own silo, with its own set of technical skills and its own tools. In addition, each team often uses more than one tool set to get the job done. This can produce discrepancies in the results, which delays solving the performance issue and can lead to a lot of finger-pointing.

There is an additional layer of complexity due to the existing VMware tools used:

  • Difficult to customize performance views to meet troubleshooting needs
  • Lack of correlation between VMware and Storage layers
  • Limited monitoring and performance views:
    • Only one object can be viewed and analyzed at a time
    • Only one metric can be shown in a performance chart

When the administrator lacks this insight and flexibility across both layers, finding the slow drainer in the VMware environment is difficult, and tracking down the root of the performance problem can consume a great deal of valuable time.

The Solution

Monitoring and resolving performance issues efficiently across both environments requires deep integration between them, with performance data correlated between the two layers.

BVQ addresses these IT challenges because it offers:

  • A bridge between silos, with high-quality performance analytics and a consistent monitoring and alerting solution
  • Complete integration between VMware and Storage
  • Correlation of information across layers
  • Ready-to-go Dashboards to monitor and perform root-cause analysis to find the slow drainer

              

The dashboards offer the following functionality to help you pinpoint where the bottleneck is occurring:

  • See the activity levels of all hosts you want to monitor in a single view
  • Easily identify saturation of memory, CPU or IO on the ESXi host and quickly pinpoint the particular VM causing it
  • Compare several objects, viewing all VMs across several ESXi hosts, to understand whether just one VM is impacted or whether it is a chronic issue affecting more than one VM
  • See the IO load on the VM LUN (data rate and latency) to check for spikes. From spikes that surpass threshold levels, determine whether the issue is coming from the storage side and identify which VM is causing the peak load

The dashboards provide views of both environments in a unified interface.
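The threshold check on data rate and latency described in the list above can be illustrated with a minimal sketch. It does not use any BVQ interface; the `Sample` type, the `find_spikes` function, the sample values and both thresholds are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    time: str              # sample timestamp, e.g. "16:35"
    data_rate_mb_s: float  # data rate in MB/s
    latency_ms: float      # latency in milliseconds


def find_spikes(samples, max_data_rate_mb_s, max_latency_ms):
    """Return the samples where data rate or latency exceeds its threshold."""
    return [
        s for s in samples
        if s.data_rate_mb_s > max_data_rate_mb_s or s.latency_ms > max_latency_ms
    ]


# Hypothetical measurements for one VM LUN (not taken from the figures).
lun_samples = [
    Sample("16:25", 40.0, 2.1),
    Sample("16:30", 55.0, 2.4),
    Sample("16:35", 310.0, 3.0),
    Sample("16:40", 60.0, 2.2),
]

# Assumed thresholds; suitable values depend on the environment.
spikes = find_spikes(lun_samples, max_data_rate_mb_s=200.0, max_latency_ms=10.0)
for s in spikes:
    print(f"{s.time}: {s.data_rate_mb_s} MB/s, {s.latency_ms} ms")
```

In this sketch only the 16:35 sample is flagged, because its data rate exceeds the assumed limit while its latency stays below the latency threshold.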

Customer Use Case - End users are experiencing high response times on VM production applications

The administrator follows these troubleshooting steps using BVQ. BVQ enables the IT team to get to the root of the performance issue quickly because, from a single Dashboard, the administrator can view the performance of all ESX hosts in the cluster, determine which resource is in contention, and then determine which set of VMs is creating the high load.

Step one, access the ready-to-go dashboard called "General VM Performance Overview" with one click from the Favorites menu.

-The treemap gives a hierarchical view of the VMware environment. The path shows VM vCenter / VM Cluster / VM Host / Datastore / Virtual Machines.

-The table below allows you to search directly for any object.

-The performance views on the right show resource performance information for CPU usage, Memory Utilization and Data Rate, all on a single screen. The two performance views are ESX Host Aggregate and VM Aggregate.

-To load the performance views, select all ESX hosts at once by clicking on the VM Cluster (this highlights it in orange, as shown below). Then use keyboard shortcuts to move around the dashboard easily, such as Ctrl+Shift+R to load the performance views.

-The performance view at the top, called ESX Host Aggregate, has four tabs. The first tab lets you check resource contention in all areas (Memory, CPU and Data Rate) from a single view.



Figure 1.


Step two, each of the subsequent tabs shows a single metric (CPU, Memory or Data Rate) for every ESX host, so you can compare the load of all ESX hosts in a single view and quickly determine which one is experiencing a high load (Figure 2).

Figure 2. Data Rate performance of all ESX hosts in a single view, to easily compare them and identify which one is experiencing contention

Figure 2 above shows the Data Rate that each ESX host is experiencing. One ESX host (dl560-esx05) shows a high Data Rate load on Wed 17 at around 16:30. This indicates that the resource constraint is on the storage side.

-You can isolate this ESX host by clicking on it (its line becomes bold) and pressing F5 on the keyboard. The performance view now shows the Data Rate for this host only, and the bottom view shows only the VMs running on this particular ESX host.

-Go to the Data Rate tab (Figure 3, bottom right performance view) and you will immediately see the name of the VM that is causing the high data rate load at 16:36: docker_tes2.

 

Figure 3. Identify the VM causing high Data Rate load
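The drill-down performed in step two (find the host with the highest data rate peak, then find the VM on that host that dominates at the time of the peak) can be sketched as follows. This is an illustration of the reasoning only, not of how BVQ works internally; the host names, VM names, timestamps and values are hypothetical, as are the helper functions `busiest_host` and `top_vm_at`.

```python
# Hypothetical data-rate samples in MB/s, keyed by host and then by timestamp.
host_data_rate = {
    "esx-a": {"16:30": 40.0, "16:35": 45.0, "16:40": 42.0},
    "esx-b": {"16:30": 50.0, "16:35": 320.0, "16:40": 60.0},
}

# Hypothetical per-VM samples, grouped by the host each VM runs on.
vm_data_rate = {
    "esx-a": {"vm-1": {"16:30": 40.0, "16:35": 45.0, "16:40": 42.0}},
    "esx-b": {
        "vm-2": {"16:30": 30.0, "16:35": 25.0, "16:40": 35.0},
        "vm-3": {"16:30": 20.0, "16:35": 295.0, "16:40": 25.0},
    },
}


def busiest_host(host_series):
    """Return (host, time, value) for the highest single sample across all hosts."""
    samples = (
        (host, time, value)
        for host, series in host_series.items()
        for time, value in series.items()
    )
    return max(samples, key=lambda sample: sample[2])


def top_vm_at(vm_series, time):
    """Return (vm, value) for the VM with the highest data rate at the given time."""
    candidates = ((vm, series.get(time, 0.0)) for vm, series in vm_series.items())
    return max(candidates, key=lambda candidate: candidate[1])


host, peak_time, peak_value = busiest_host(host_data_rate)
vm, vm_value = top_vm_at(vm_data_rate[host], peak_time)
print(f"Peak of {peak_value} MB/s on {host} at {peak_time}, driven mainly by {vm} ({vm_value} MB/s)")
```

Restricting the VM search to the host that showed the peak mirrors the isolation done with F5 in the dashboard: only VMs running on that host are considered.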


Step three, transition into the Storage layer to confirm that the specific ESX host is generating a high IO load on the disk because of the particular VM, and to determine the VDisk that is servicing this storage for the ESX host. Also check whether this VDisk is experiencing a high response time, which could affect the storage performance of all other ESX hosts and VMs.

-Use the Favorite "Storage Performance _Backend SCSi LUN _VDisk" (Figure 4).

-This lets the administrator transition seamlessly to the storage side and displays the same time frame you were working with in the prior favorite.

-Zoom in to the time frame by selecting an area with the mouse and dragging, then press star "*" on the keyboard to apply the same zoomed-in time frame to all performance views (Figure 4 below).

Figure 4. Select and drag mouse over area to zoom in


Description of each performance view, starting at the top left and going counter-clockwise:

  • The first view shows ESX Host performance: Data Rate and Disk Latency. Click on a high measurement point on either the blue or the red line and press F5 on the keyboard to load only the performance of that specific ESX host in all other views.

  • The second view, "Backend SCSi LUN...", shows the IO load that the particular host is generating on the backend SCSI LUN, which is experiencing a spike in IO demand (Figure 5).

  • "VDisk (Storage Volume) the VM LUN is running on": identify the VDisk that is servicing this SCSI LUN and check whether this VDisk is experiencing a high response time during the same time interval.

  • "Virtual Machines running on SCSI LUN...": confirm the VM that is driving the high storage load.

This confirms that the bottleneck comes from saturation of the SCSI LUN the ESX host is using. In this case the VDisk is not experiencing high latency during this particular time interval; however, it does about one hour later, as indicated by the spike in the red line in the bottom right performance view in Figure 5.

The administrator can now decide to look at a longer historical view to verify whether high latency has been occurring over longer periods, and to keep monitoring it. If the VDisk is experiencing high latency on a regular basis, remediation steps can be implemented to either move this VM to a different host or to a datastore that uses a different storage back-end, so that it does not impact the performance of other production VMs. In addition, storage troubleshooting can be performed with a structured performance analysis on Spectrum Virtualize.
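As a rough illustration of what "high latency on a regular basis" could mean when reviewing a longer historical view, the sketch below counts the days on which a VDisk's peak latency exceeded a limit. The function `is_recurring`, the 20 ms threshold, the 50% cut-off and the sample values are all assumptions made for this example; they are not BVQ defaults or measured data.

```python
def is_recurring(daily_peak_latency_ms, threshold_ms, min_fraction=0.5):
    """Return True if the daily peak latency exceeded threshold_ms
    on at least min_fraction of the observed days."""
    if not daily_peak_latency_ms:
        return False
    days_over = sum(1 for peak in daily_peak_latency_ms if peak > threshold_ms)
    return days_over / len(daily_peak_latency_ms) >= min_fraction


# Hypothetical daily peak latencies (ms) for one VDisk over a week.
week = [4.0, 22.0, 5.0, 25.0, 30.0, 6.0, 21.0]

if is_recurring(week, threshold_ms=20.0):
    print("High latency is recurring; consider remediation")
else:
    print("High latency looks like an isolated event; keep monitoring")
```

With these assumed values, four of the seven days exceed the threshold, so the check reports a recurring problem.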


Figure 5. Transition into the Storage layer using this favorite

Conclusion of root-cause analysis 

This root-cause analysis has identified where the bottleneck is occurring (the storage resource) and pinpointed the particular VM driving the high IO load that causes the saturation: VM "docker-test2". This was possible because of the ability to transition seamlessly to the storage layer, see the IO load on the back-end SCSI LUN, and drill into the Storage VDisk to determine peaks in latency caused by the demands of the particular VM. The administrator can now make an informed decision to fix the bottleneck in the VMware environment and meet expected SLAs.



