
 

Abstract (this white paper is still in draft mode)

This white paper belongs to the series "Understanding the reasons for performance issues in large SVC environments" (part 2). The analysis of multi I/O group clusters often reveals the same kinds of problems: the technical resources in larger SVC or Storwize environments are utilized very unevenly, which can lead to unnecessary bottlenecks.

The cause of these bottlenecks is that some areas of the system are typically overloaded while other areas are almost idle. Such performance bottlenecks are often very hard to understand without the appropriate tools and the help of deeply skilled people. This white paper shows how to recognize an unbalanced situation and what to do to improve it.

 

 

Uneven utilization of SVC / Storwize hardware

 

The following document describes a typical situation that is often found in larger SVC and Storwize environments with more than one I/O group.
The load on the I/O groups differs quite widely. Very often the first and oldest I/O group is overloaded while the other I/O groups only have to handle small loads.

The impact of this unequal load distribution can be recognized in various areas. CPU and cache utilization differ greatly from node to node. Some SAN ports are permanently overloaded while other SAN ports are underloaded most of the time.

Some technical resources are barely utilized, while others are overloaded all the time and may thereby provoke performance bottlenecks.

Performance problems on the application side become trickier and harder to understand. For example, two identical volumes in the same managed disk group may show completely different I/O behavior.

The complexity of such multi I/O group environments is hard to understand when they are administered without appropriate tools. This leads to high investment costs, blind troubleshooting, and possible frustration.

 

Start a project to solve this early

Will more compute capacity help here?

The best idea is to start rebalancing the load across the system. Simply installing more powerful hardware is a bad answer, because it can provoke a 'bottleneck chasing' situation in which some bottlenecks are closed while new ones are opened. The bad thing about this is that established problems are exchanged for new ones, which can cause even more trouble.

So this can also be the trigger to clean up first before starting with new, more powerful nodes.

Proactiveness improves in the course of the project

The target of the rebalancing project is not a completely rebalanced system. It should at least take care of the known bottlenecks and bring the load on all technical layers back within acceptable limits.

The rest of the work can then be replaced by improved proactiveness.

In the course of this project the administrators gain such deep knowledge about the risk areas of the system that they can foresee problems even before they happen.

The first step in a multi I/O group cluster is therefore always to take a deeper look at the load of the I/O groups. The following example was taken from an 8-node cluster where the load was distributed very unevenly across the I/O groups.


Pict. 1: A simple way to visualize the different loads on the I/O groups. We use the BVQ Performance View to show the maximum and minimum cache utilization of all cache partitions belonging to the I/O groups. This gives a clearer picture of the different load situations.

In this example io_grp0 is sometimes overloaded while the load on io_grp3 is very low. Also observe the information about the cache partitions: it shows that some managed disk groups do not use all I/O groups. There are situations where this is intended, but then the numbers of cache partitions per I/O group (18, 7, 10, 9) would not differ this much.

The problem with the uneven load develops slowly over time

One reason for the different utilization of the I/O groups is probably that the SVC cluster started with just one or two I/O groups. Over time, performance bottlenecks were recognized, and to solve them the cluster was extended to three and later four I/O groups.

At the time of each extension, some volumes were moved to the new I/O groups, but the target was only to solve the acute problems, not a fair distribution of the load with good usage of all technical resources.

Once an acceptable performance level was achieved, this process stopped, but good load balancing was still not reached.

 

Without intervention this situation will become even worse

As a result, the cluster is poorly organized, and the situation keeps getting worse because SVC now tries to automatically distribute new volumes into the I/O groups with less load. By default the preferred node, which owns a volume within an I/O group, is selected on a load-balancing basis.
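A minimal CLI sketch of this default behavior (the pool name MDG1 and the volume name vol_app1 are hypothetical examples):

    # Create a 100 GB volume in io_grp2 without the -node parameter.
    # SVC then selects the preferred node within io_grp2 on a
    # load-balancing basis.
    svctask mkvdisk -mdiskgrp MDG1 -iogrp io_grp2 -size 100 -unit gb -name vol_app1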

You could say this is not a problem. But is it really worth the effort to rebalance this cluster?

What happens when a new application with a set of new volumes is added to the SVC cluster? Most of these new volumes will be placed into the one I/O group with the smallest load.

This leads to even worse balancing of technical resources such as SAN ports, cache, and CPU. Load-balancing strategies become more and more difficult until the system is brought back to a better balance.

 

The volumes are organized in a multi-dimensional matrix

A volume always belongs to a managed disk group and to a node in an I/O group.
These are two dimensions, and with multiple managed disk groups the matrix becomes multi-dimensional.

 

    • The managed disk group determines the kind of storage that is used by a volume
    • The managed disk group has to be assigned at creation time of a volume
    • The volume can be moved to other managed disk groups by volume migration
    • As long as the path settings from the hosts are correct, this action is completely non-disruptive
    • The node in an I/O group determines the hardware the volume is served from (SAN ports, cache, CPU)
    • Node and I/O group can be assigned at creation time of a vDisk
    • A volume can be moved to another I/O group, or to the other node inside the I/O group, with the 'movevdisk' command (see the sketch below)
    • When a volume is moved to another I/O group, the volume will use other SAN ports. Plan for it and prepare your server to find the volume under the new path.
    • Quote taken from the CLI guide: 'You can migrate a volume to a new I/O group to manually balance the workload across the nodes in the clustered system. You might end up with a pair of nodes that are overworked and another pair that are underworked.'
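A minimal sketch of both movements on the CLI, assuming a hypothetical volume vol_app1; the exact syntax and the degree of disruptiveness depend on the SVC code level, so always check the CLI guide for your release:

    # Move a volume to another managed disk group (changes the storage used):
    svctask migratevdisk -mdiskgrp MDG_TARGET -vdisk vol_app1

    # Move a volume to another I/O group and preferred node
    # (changes the hardware that serves it):
    svctask movevdisk -iogrp io_grp2 -node node5 vol_app1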



Pict. 2: Treemap of a multi I/O group cluster, organized by capacity. The treemap shows first the cluster, second the managed disk groups inside the cluster, third the I/O groups inside the managed disk groups, and fourth the volumes belonging to the MDGs and I/O groups. io_grp1 is selected and highlighted in orange. It can be seen that this I/O group is used by only seven managed disk groups in this cluster.

Pict. 3: Surprised? The treemap has been re-sorted by performance. Only three of the seven managed disk groups have relevant load on I/O group 1. This also shows the limits of the automated I/O group balancing of SVC: there is no way to foresee the future load of the volumes, and therefore no way to arrive at a balanced system automatically.


Pict. 4: CPU load in % for all nodes. It is obvious that the CPU load differs a lot from node to node.

Pict. 5: Node SAN port data throughput shows large deviations between I/O group 0 and I/O group 3: the ports of I/O group 0 transfer five times more data than the ports of I/O group 3. There is a big risk that one of the highly loaded ports will saturate, which would cause a large number of new performance bottlenecks.

Managed disk groups and I/O groups together form cache partitions

The combination of a managed disk group and an I/O group forms a cache partition.

A cache partition for a managed disk group in an I/O group is formed when the first primary volume is assigned to one of the nodes of the I/O group.

The cache partitions are an overlay across the complete matrix. The nodes of an I/O group come with built-in cache. This cache is divided fairly among all managed disk groups that use this I/O group; these shares are called cache partitions.

The cache partitions can have different sizes depending on the cache available in the nodes and the number of cache partitions in the I/O group.

 

Pict. 6: Taken from the IBM redpaper IBM SAN Volume Controller 4.2.1 Cache Partitioning.
The cache has to be divided between all cache partitions.

An excellent redpaper about cache partitioning, valid at least up to version 7.2.7 of the SVC code, can be found here:
http://bvqwiki.sva.de/x/XoCr
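For orientation, the upper limits per cache partition from that redpaper (assuming the 4.2.1 rules, which also match the 4-partition example below) are:

    Cache partitions in the I/O group    Upper limit per partition
    1                                    100%
    2                                     66%
    3                                     40%
    4                                     30%
    5 or more                             25%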

 

A managed disk group that uses several I/O groups receives cache capacity from all of these I/O groups. BVQ shows these available capacities in the property sheet of the managed disk group.
This mechanism allows the overcommitment of cache: as soon as an I/O group has more than one cache partition, the sum of all partition limits is more than 100%.

If an I/O group has 4 cache partitions, each of these cache partitions receives an upper limit of 30% of the complete cache. This makes 120% in total, and a big problem for the I/O group if all cache partitions run full at the same time.

The idea behind this is that at any given time some cache partitions are filled higher and some lower, so that together they never use more than 100%.


Pict. 7: BVQ shows the amount of cache a managed disk group receives from the I/O groups. BVQ counts mirrored write cache here, because the cache analysis works on the basis of write caching. In this example the MDG T2_V71XT_600GBR5 receives a total of 12 GB of mirrored write cache from the I/O groups.



Pict. 8: Cache analysis in the BVQ Performance View. The cache analysis can display all cache partitions sorted in any desired combination to compare results. In this case the cache analysis helps to illustrate that the cache for a managed disk group is relatively highly utilized, but only one of the four cache partitions is highly utilized; the others are not. This is a risk, because adding more demand to this managed disk group could bring this cache into overflow. The idea here is to plan a rebalancing of this situation or to place new demanding volumes into the other I/O groups.


Pict. 9: The same managed disk group displayed in the BVQ Treemap. It shows that the volumes with high performance are mainly using io_grp1, which makes it easier to understand why this is the only I/O group where we see cache load. The correct way to balance this is to first create a candidate list, second check the load on the different I/O groups, and third start moving the volumes to the target I/O groups. After this the cache load will be better distributed and lower overall.

An I/O group is responsible for all volumes of all managed disk groups that are assigned to it.

This is a point which has to be observed in any case: the relocation of a volume into another I/O group always has an impact on the overall system.

In other words, in an unfavorable situation where several highly demanding processes hit the same I/O group, the system can run into bottlenecks on all volumes of all managed disk groups assigned to this I/O group.

This kind of situation results in very complex behavior whose cause is nearly impossible to understand without good analysis tools.

These kinds of bottlenecks can easily be recognized by looking at the global cache utilization of the nodes. From there it is possible to drill down through the cache partitions to single volumes in order to find the volumes that are causing the problem.

This also explains how two identical volumes of one managed disk group can show completely different response behavior. If the volumes are served by different I/O groups, these differences are easily explained by the different loads on those I/O groups.


Procedure to get back to a better balanced system

What needs to be done in the long term is to perform load balancing across all nodes and I/O groups.

It is not enough to simply distribute the volumes randomly across all I/O groups. With every change, one has to keep an eye on the managed disk groups and the I/O groups so as not to run into new performance bottlenecks caused by an unbalanced load.

A proven way to do this is to first find relocation candidates in an analysis step, then perform the changes while keeping an eye on the limits, and finally compare the results with the expectations. (A small CLI sketch after the following list shows how a first overview can be collected.)

  • Step 1: Get enough knowledge about the load on the different I/O groups

    • CPU and global cache
    • Node ports
    • Cache partitions

  • Step 2: Analysis step – find candidate volumes for relocation

    • Use partition cache analysis to find managed disk groups with high load or overload
    • Use treemap and VDisk analysis to find candidates in these MDGs
    • Choose target I/O groups for these candidate volumes
    • Keep an eye on load overlaps in target I/O groups

  • Step 3: Perform volume relocations

    • Keep an eye on I/O group load

  • Step 4: Analysis step – compare results with expectations

    • Go back to step 1 or step 2
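As a minimal sketch, a first overview for step 1 and step 2 can also be collected on the CLI (BVQ performs these steps graphically; io_grp1 stands for the suspected hot I/O group):

    # Show the I/O groups with their node and volume counts:
    svcinfo lsiogrp

    # List all volumes served by the suspected I/O group
    # as a raw candidate list for the analysis:
    svcinfo lsvdisk -filtervalue IO_group_name=io_grp1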

 

A practical approach to reorganizing the cluster


Pict. 10: A managed disk group with its I/O groups and volumes in the BVQ Treemap. The treemap viewing aspect is capacity, where the size of a volume determines the size of its object. The managed disk group clearly uses all I/O groups of the cluster. There is only one volume placed in io_grp2 and io_grp3, some 2 TB volumes are in io_grp0, and most of the volumes in this managed disk group are located in io_grp1.


Pict. 11: The same managed disk group in the BVQ Treemap, now organized by performance. The treemap viewing aspect is now set to I/O performance, where the performance of a volume determines the size of its object. We only find relevant performance in io_grp1 and io_grp3. We know from the previous analysis steps that especially I/O group 2 and I/O group 3 carry very little load. So the idea is to find candidate volumes in io_grp1 and move them to io_grp2 or io_grp3.

The volumes with the highest performance have already been selected in this treemap (orange). The next step is to start a vDisk analysis to take a deeper look at the behavior of these three selected candidates.


Pict. 12: A plot of the three selected volumes. The chosen metrics are read and write data rate. With this picture it is very easy to figure out at which times the volumes are working.



  • Volume #1

    The volume which is already in io_grp3 is painted bold.
    The question is whether to add another volume to this I/O group or not. It looks like this volume was moved to io_grp3 earlier to solve another bottleneck, because it is the only volume in io_grp3 and it is active in the same timeframe as the other two volumes. So the decision is not to add one of the other volumes to io_grp3.

  • Volume #2

    The second volume starts to work at 9 pm and stops at 1 am.
    We have already decided not to use io_grp3. We know from the previous analysis steps that io_grp0 is overloaded and io_grp1 is highly loaded at night. The load on io_grp2 is smaller, so it could be a target I/O group for this volume.

  • Volume #3

    The third volume starts at midnight and stops at 2 pm.
    We see that this volume creates more load than volume #2. So with the idea of balancing, it is best to move this volume to io_grp2, because this step takes more load off io_grp1 than moving volume #2 would.

 

Decision (made on the basis of facts):
Volume #3 will be moved to I/O group 2.
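Carried out on the CLI, the relocation could look like the following sketch (the volume name VOLUME_3 is hypothetical; on code levels without non-disruptive volume move, prepare the hosts to discover the new paths first):

    # Move volume #3 to io_grp2. If -node is omitted, SVC chooses
    # the new preferred node within io_grp2 itself.
    svctask movevdisk -iogrp io_grp2 VOLUME_3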


More white papers with comparable content

BVQ analysis: performance bottleneck analysis


Saving costs with BVQ: ROI-related documents


 

BVQ web pages



BVQ website


Website of SVA GmbH


International BVQ blog



If you are interested in further information, or if you want to become a partner and distribute BVQ to your customers, please contact us at
bvq@sva.de or use the contact form on the web.


BVQ is a product of SVA System Vertrieb Alexander GmbH.