Look for slow drainers in the SAN that could be causing high front-end storage latency

When a performance issue is experienced on an SVC volume servicing a VMware environment, it can be challenging and time consuming to narrow down and identify the layer where the root of the problem lies. You can find yourself spending hours backtracking the performance issue to its source, and the longer this takes, the longer your critical applications are down.

The ability for the administrator to have full visibility into all layers of the environment (Storage, SAN and VMware) is critical to keeping your business up and running optimally.

Solution

BVQ provides full visibility into the performance impact across all areas and helps to efficiently narrow down and identify the layer where the root of the problem lies, whether on the storage side or on the SAN side.

Full visibility into all layers of the environment (Storage, SAN and VMware) is especially critical in situations where the performance issue experienced on the production volumes is caused by a bottleneck in the SAN.

BVQ's pre-defined dashboard provides the performance view of all the layers necessary to determine the origin of the performance issue, detect the bottleneck, and identify whether it is occurring on the SAN side. This dashboard was designed with expert knowledge and includes the performance views that are relevant to this troubleshooting process, containing the metrics that matter most in this situation.


How the ready-to-go dashboard is used to find the slow drainer:

We have a situation where an SVC volume (VDisk) that is servicing a VMware ESX host suddenly shows high latency.


Several ESX hosts are sharing the same volumes on the SVC; they are all using the same VDisk. This is where the dashboard is valuable: it integrates the three layers in a single view and allows you to work through the troubleshooting process.

The root-cause troubleshooting process in this particular case consists of pinpointing the following:

  1. On the identified VDisk, high latency is detected, exceeding 10 ms, which is the threshold limit determined for this particular business (see the sketch after this list).
  2. Identify the particular ESX host that is driving this high load.
  3. From there, view the load on the SAN ports servicing the ESX host and identify any bottlenecks the SAN switch ports might be experiencing.
  4. Identify which virtual machine or machines are responsible for the high load.
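As a simple illustration of the first step, the following is a minimal sketch of the 10 ms threshold check, assuming the VDisk latency samples have already been exported to a CSV file. The file name vdisk_perf.csv, the column names and the load logic are hypothetical examples for illustration; they are not part of any BVQ interface.

```python
# Minimal sketch, assuming VDisk latency samples were exported to CSV
# with hypothetical columns: timestamp, vdisk, latency_ms.
import csv
from datetime import datetime

LATENCY_THRESHOLD_MS = 10.0  # business threshold for this workload


def find_latency_violations(csv_path, threshold_ms=LATENCY_THRESHOLD_MS):
    """Return (timestamp, vdisk, latency) samples that exceed the threshold."""
    violations = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            latency = float(row["latency_ms"])
            if latency > threshold_ms:
                violations.append(
                    (datetime.fromisoformat(row["timestamp"]),
                     row["vdisk"], latency)
                )
    return violations


if __name__ == "__main__":
    for ts, vdisk, latency in find_latency_violations("vdisk_perf.csv"):
        print(f"{ts}  {vdisk}: {latency:.1f} ms exceeds the threshold")
```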

The dashboard below (Figure 1) integrates expert knowledge for a true end-to-end root-cause analysis. From top to bottom, you see the VDisk performance, the host load on the SAN switch ports and, in the third view, the SAN switch port performance of the particular switch ports servicing the host.


Figure 1. Dashboard for true end-to-end SAN, VDisk and virtual machine analysis. From the identified data rate peak and latency on the storage volume side, see how this performance load appears on the SAN switch side.


How to use this dashboard in the troubleshooting process: from the identified data rate peak and latency on the storage volume side, see this performance load on the SAN switch layer.

  1. The top view helps identify the high load and latency issue on the VDisk side. The situation became critical because the VDisk showed latencies reaching 10 ms, which is the maximum threshold for this particular workload in accordance with the business requirements. Because many ESX hosts are sharing the same volume, there is no difference in performance between these hosts as seen from the SVC storage side; VDisk performance looks the same for all of them.
  2. Therefore, as a next step, it is important to identify which host is creating the highest load and how it impacts the storage volume as well as the SAN switch ports. The second (middle) view provides this insight: it shows the load the hosts are creating on the SAN side, that is, the data rate coming from the hosts on the switch ports. The identified peak shows where this high load originates; hovering over the high measurement point gives you the name of the particular host creating it.
  3. From the third (bottom) view, you can see the SAN port performance and identify whether there is a bottleneck on the SAN switch side. You see the load of the switch ports used by the ESX host and the buffer credit wait percentage (BCW%) of each port. A high BCW% indicates that there are not enough buffer credits to handle the load (a simple sketch of such a check follows this list).
  4. The second part of the dashboard (Figure 2) is the virtual machine performance view. This view helps identify the VM or VMs that consistently demand the most storage resources. It shows all the VMs hosted on the identified ESX host. Look at the peak; this is the VM creating the high load. You can now decide to move this particular VM to a different host that uses a different VDisk so that it does not impact other production VMs, and also increase the buffer credits on the switch ports servicing this particular host.
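As a rough illustration of the check described in step 3, the following sketch flags switch ports whose buffer credit wait percentage exceeds a chosen limit. The port_stats sample data, the field names and the 20 percent limit are hypothetical examples only; they are not values or an API taken from BVQ or a switch vendor.

```python
# Minimal sketch, assuming per-port samples were collected elsewhere as
# dictionaries with hypothetical keys: port, data_rate_MBps, bcw_percent.
BCW_LIMIT_PERCENT = 20.0  # example limit; tune to your environment

port_stats = [
    {"port": "switch1/port4", "data_rate_MBps": 310.0, "bcw_percent": 35.2},
    {"port": "switch1/port5", "data_rate_MBps": 120.0, "bcw_percent": 2.1},
]


def slow_drain_candidates(samples, bcw_limit=BCW_LIMIT_PERCENT):
    """Return ports whose buffer credit wait % exceeds the limit,
    worst offender first."""
    flagged = [s for s in samples if s["bcw_percent"] > bcw_limit]
    return sorted(flagged, key=lambda s: s["bcw_percent"], reverse=True)


for port in slow_drain_candidates(port_stats):
    print(f"{port['port']}: BCW {port['bcw_percent']:.1f}% "
          f"at {port['data_rate_MBps']:.0f} MB/s - possible slow drain")
```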

Now you have the complete picture of what is going on.



Figure 2. Virtual Machine Performance view 


Conclusion of root-cause analysis 

This root-cause analysis identified the performance issue and pinpointed the particular VM driving the high I/O load, finding the root of the saturation. The production application's high response time was initially identified on the storage side, where the VDisk latencies were seen reaching close to 10 ms. The bottleneck was found on the VM side: this VM generates a high workload and shares the same storage resource with many other VMs. The administrator can now make an informed decision to move this production-critical VM to another VDisk that is not saturated in order to meet the expected SLAs.

 

Note that a bottleneck on an ISL can be either a congestion bottleneck or a latency bottleneck. If you have a latency bottleneck, your ISL will not be running at its maximum bandwidth; on the contrary, it lacks the buffer credits to ensure proper utilization. If you see a latency bottleneck on an ISL, it is often back pressure from a slow-drain device attached to the adjacent switch.
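As a back-of-the-envelope illustration of why a credit-starved ISL cannot reach full bandwidth, the following sketch estimates how many buffer-to-buffer credits a link needs to stay fully utilized, assuming roughly 5 microseconds of one-way light travel per kilometre of fibre and full-size Fibre Channel frames. The numbers are illustrative only; the real requirement depends on the frame size mix and the switch hardware.

```python
# Back-of-the-envelope sketch: estimate the buffer-to-buffer credits an ISL
# needs to keep a link fully utilized over distance. Assumed values only.
import math

ONE_WAY_US_PER_KM = 5.0   # ~5 microseconds of light travel per km of fibre
FULL_FRAME_BYTES = 2148   # maximum Fibre Channel frame size


def credits_needed(link_speed_MBps, distance_km, frame_bytes=FULL_FRAME_BYTES):
    """Credits = round-trip time divided by the time to serialize one frame."""
    round_trip_s = 2 * distance_km * ONE_WAY_US_PER_KM * 1e-6
    frame_time_s = frame_bytes / (link_speed_MBps * 1e6)
    return math.ceil(round_trip_s / frame_time_s)


# Example: an 8 Gbps FC ISL (about 800 MB/s of payload) over 10 km needs
# roughly 4 credits per km with full-size frames; smaller frames need more.
print(credits_needed(800, 10))    # ~38 credits
print(credits_needed(1600, 10))   # 16 Gbps FC: ~75 credits
```

With full-size frames this works out to roughly four credits per kilometre on an 8 Gbps link, which is why long-distance links or heavy small-frame traffic can quickly exhaust the default port credits and show up as a latency bottleneck.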