Channel: Live Migration – Working Hard In IT

Introducing 10Gbps Networking In Your Hyper-V Failover Cluster Environment


This is the 1st post in a series of 4. Here’s a list of all parts:

  1. Introducing 10Gbps Networking In Your Hyper-V Failover Cluster Environment (Part 1/4)
  2. Introducing 10Gbps With A Dedicated CSV & Live Migration Network (Part 2/4)
  3. Introducing 10Gbps & Thoughts On Network High Availability For Hyper-V (Part 3/4)
  4. Introducing 10Gbps & Integrating It Into Your Network Infrastructure (Part 4/4)

A lot of early and current Hyper-V clusters are built on 1Gbps network architectures. That’s fine and works very well for a large number of environments. Perhaps at this moment in time you’re running solutions using blades with 10Gbps mezzanine cards/switches, with or without cutting up the bandwidth available for all the different network needs a Hyper-V cluster has or should have for optimal performance and availability. This depends on the vendor and the type of blades you’re using. It also matters when you bought the hardware (W2K8 or W2K8R2 era) and whether you built the solution yourself or bought a fast track or reference architecture kit, perhaps even including all Microsoft software and complete with installation services.

I’ve been looking into some approaches to introducing a 10Gbps network for use with Hyper-V clusters, mainly for Cluster Shared Volume (CSV) and Live Migration (LM) traffic. In brown field environments that are already running Hyper-V clusters there are several scenarios to achieve this, but I’m not offering the “definitive guide” on how to do it. This is not a best practices story. There is no one size fits all. Depending on your capabilities, needs & budget you’ll approach things differently, reflecting what’s best for your environment. There are some “don’t do this in production whatever your environment is” warnings that you should take note of, but apart from that you’re free to choose what suits you best.

The 10Gbps implementations I’m dealing with are driven by one very strong operational requirement: reduce the live migration time for virtual machines with a lot of memory running under a decent to heavy load. So here it is all about bandwidth and speed. The train of thought we’re trying to follow is that we do not want to introduce 10Gbps just to share its bandwidth between 4 or more VLANs as you might see in some high density blade solutions. There that often has to do with the limited number of NIC/switch ports in environments that also want high availability. In high density scenarios the need to reduce cabling is also more urgent. All this is also often driven by a desire to cut costs or keep them down as much as possible. But as technology evolves fast, my guess is that within a few years we won’t be discussing the cost of 10Gbps switches anymore, and even today there are very good deals to be made. The reduction of cabling saves on labor & helps achieve high density in the racks. I do need to stress however that way too often discussions around density, cooling and power consumption in existing data centers or server rooms are not as simple as they appear. I would state that to achieve real and optimal results from an investment in blades you have to have the server room, cooling, power and UPS designed around them. I won’t even go into the discussion over when blade servers become a cost effective solution for SMB needs.

So back to 10Gbps networking. You should realize that Live Migration and Redirected Access with CSV absolutely benefit from getting a 10Gbps pipe just for their needs. For VMs consuming 16 GB to 32 GB of memory this is significant. Think about it. Bringing 16 seconds back to 4 seconds might not be too big of a deal for a node with 10 to 15 VMs. But when you have a dozen SQL Servers that take 180 to 300 seconds to live migrate and you reduce that to 20 to 30 seconds, that helps. Perhaps not so much during automated maintenance, but when it needs to be done fast (i.e. on a node indicating serious hardware issues) those times add up. To achieve such results we gave the Live Migration & CSV networks each a dedicated 10Gbps pipe. They consume about 50% of the available bandwidth, so even a failover of the CSV traffic to our Live Migration network or vice versa should be easily handled. On top of the “Big Pipes” you can test jumbo frames, VMQ, …
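To put some rough numbers on that, here’s a quick back-of-envelope sketch. It is purely illustrative: it assumes the link is the bottleneck, ignores protocol overhead and the re-copying of dirty pages, and the 75% efficiency figure is just an assumption you should replace with what you measure on your own kit.

```powershell
# Rough estimate of the memory copy time: assigned memory divided by usable bandwidth.
# Purely illustrative; real migrations re-copy dirty pages and never hit full line rate.
function Get-LiveMigrationEstimate {
    param(
        [int]$MemoryGB,             # assigned memory of the VM
        [int]$LinkGbps,             # 1 or 10
        [double]$Efficiency = 0.75  # assumed usable fraction of the link, replace with what you measure
    )
    $bytesPerSec = ($LinkGbps * 1e9 / 8) * $Efficiency
    [math]::Round(($MemoryGB * 1GB) / $bytesPerSec, 0)
}

Get-LiveMigrationEstimate -MemoryGB 32 -LinkGbps 1    # roughly 6 minutes
Get-LiveMigrationEstimate -MemoryGB 32 -LinkGbps 10   # roughly 37 seconds
```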

Now the biggest part of that Live Migration time is the “Brown-Out” phase (event id 22508 in the Hyper-V-Worker log) during which the memory transfer happens. Those are the times we reduce significantly by moving to 10Gbps. The “Black-Out” phase, during which the virtual machine is brought online on the other node, creates a snapshot with the last remaining delta of “dirty memory pages”, then quiesces the virtual machine for the last memory copy to be performed and finally unquiesces the virtual machine, which is then running on the other node. This is normally measured in hundreds of milliseconds (event id 22509 in the Hyper-V-Worker log). We do have a couple of very network intensive applications that sometimes have a GUI issue after a live migration (the services are fine but the consoles monitoring those services act up). We plan on moving those VMs to 10Gbps to find out if this reduces the “Black-Out” phase a bit and prevents that GUI from acting up. When I can give you more feedback on this, I’ll let you know how it worked out.

An example of these events in the Hyper-V-Worker event log is listed below:

Event ID 22508:

‘XXXXXXXX-YYYY-ZZZZ-QQQQ-DC12222DE1’ migrated with a Brown-Out time of 64.975 seconds.

Event ID 22509:

‘XXXXXXXX-YYYY-ZZZZ-QQQQ-DC12222DE1’ migrated with a Black-Out time of 0.811 seconds, 842 dirty page and 4841 KB of saved state.

Event ID 22507:

Migration completed successfully for ‘XXXXXXXX-YYYY-ZZZZ-QQQQ-DC12222DE1’ in 66.581 seconds.
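If you want to pull these timings for all recent live migrations instead of browsing the event viewer, a quick Get-WinEvent query does the trick. This is a minimal sketch; the channel name below is the one I’d expect on W2K8R2, so verify it first with -ListLog on your own hosts.

```powershell
# Find the exact Hyper-V-Worker log name on this host first (names can differ per OS version).
Get-WinEvent -ListLog *Hyper-V-Worker* | Format-Table LogName, RecordCount -AutoSize

# Pull the Brown-Out (22508), Black-Out (22509) and completion (22507) events.
Get-WinEvent -FilterHashtable @{
    LogName = 'Microsoft-Windows-Hyper-V-Worker-Admin'   # assumed channel name, take it from -ListLog above
    Id      = 22507, 22508, 22509
} -MaxEvents 50 |
    Sort-Object TimeCreated |
    Format-Table TimeCreated, Id, Message -AutoSize -Wrap
```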

In these 10Gbps efforts I’m also aiming for high availability, but not when that would mean sacrificing performance because I need to keep costs down and perhaps use approaches that are only really economical in large environments. The scenarios I’m dealing with are not about large hosting environments or cloud providers. We’re talking about providing the best network performance to some Hyper-V clusters that will be running SQL Server for example, or other high resource applications. These are relatively small environments compared to hosting and cloud providers. The economics and the needs are very different. As a small example of this: saving ten thousand switch ports means saving something like 500 times the price of a switch. To them that matters a lot more, not just in volume but also in relation to the other costs. They’re probably running services with an architecture that survives losing servers and doesn’t require clustering. It all runs on cheap hardware with high energy efficiency as they don’t care about losing nodes when the service has been designed with that in mind. Economies of scale are what they are all about. They’d go broke building all that on highly redundant hardware and fail at achieving their needs. But most of us don’t work in such an environment.

I would also like to remind you that high availability introduces complexity. And complexity that you can’t manage will sink your high availability faster than a torpedo amidships sinks a cruiser. So know what you do, why and when to do it. One final piece of advice: TEST!

So to conclude this part, take note of the fact that I’m not discussing the design of a “fast track” setup that I’d resell for all kinds of environments, where I’d need a very cost effective rinse & repeat solution with Small, Medium & Large varieties and all bases covered. I’m not saying those aren’t good or valuable, far from it, a lot of people will benefit from them, but I’m serving other needs. If you wonder why they want to virtualize the applications at all, it has to do with disaster recovery & business continuity and replicating the environment to a remote site.

I intend to follow up on this in future blog posts when I have more information and some time to write it all up.



Introducing 10Gbps With A Dedicated CSV & Live Migration Network (Part 2/4)


This is the 2nd post in a series of 4. Here’s a list of all parts:

  1. Introducing 10Gbps Networking In Your Hyper-V Failover Cluster Environment (Part 1/4)
  2. Introducing 10Gbps With A Dedicated CSV & Live Migration Network (Part 2/4)
  3. Introducing 10Gbps & Thoughts On Network High Availability For Hyper-V (Part 3/4)
  4. Introducing 10Gbps & Integrating It Into Your Network Infrastructure (Part 4/4)

Introduction

In this post we continue along the train of thought we set out in a previous blog post “Introducing 10Gbps Networking In Your Hyper-V Failover Cluster Environment (Part 1/4)”. Let’s say you want to set up a Hyper-V cluster for SQL Server virtualization. Your business & IT managers told you they need the best performance you can get, and they follow up on that statement with a real budget, so you can buy high end servers (blades or rack) and spec them out optimally for SQL Server. You take into consideration NUMA issues, vCPU:pCPU ratios, SQL memory demands, the current 4 vCPU limit in Hyper-V, etc. By the way, this will be > 16 vCPU with Windows Server 8, which leads me to believe the 64GB memory ceiling for virtual machines will also be broken. But for now this means that with regard to CPU & memory you’ve done all you can. That leaves only networking and IO to deal with. Now the IO is food for another & very extensive discussion, but basically you have to design that around the needs of the application(s) or you’ll be toast. The network part is what we’ll tackle here.

Without going into details, what does a Hyper-V cluster need in terms of networking?

Who/What | Function | Traffic | Connection Type
---|---|---|---
Host Management | Hyper-V host connectivity. | Relatively low bandwidth, but don’t forget about deploying VMs or backups. | Public
VM Network | Provides network connectivity to the VMs. | Very dependent on the VMs using it. | Dedicated Hyper-V
Cluster Heartbeat | Internal cluster communication to determine the status of the other cluster nodes. | Not much traffic, but it needs low latency or the cluster might think it’s in trouble due to dropped packets. OK to combine with CSV. | Private Cluster Network
Cluster Shared Volume (CSV) | For updating CSV metadata & scenarios where redirected I/O is required. | Mostly idle. When in redirected I/O it demands high bandwidth & low latency. | Private Cluster Network
Live Migration | Used to transfer running VMs from one cluster node to another. | Mostly idle. When live migrating it demands high bandwidth & low latency. | Private Cluster Network

Host Management: It is fine to leave this on 1Gbps, unless you have a need to deploy massive amounts of VMs or your backups are consuming all bandwidth. If so, consider dedicated NICs for those roles and/or 10Gbps. Also note that you might be able to leverage your SAN for virtual machine deployment/backups.

VM Network: Use multiple “single” NICs or NIC teams to spread both the load and the risk. Remember that you can lose the host management or CSV network of a node without affecting your virtual machine connectivity, but losing the virtual machine network(s) hits your VMs directly. So don’t put all your eggs in one basket; do consider multiple NICs and NIC teaming. Do remember that there are other bottlenecks than bandwidth to a virtual machine running apps, so don’t go completely overboard as there is no single magic bullet here for virtual machine performance. 2 or 3 will do perfectly fine. What about backups in the guest? Yes, that’s an extra burden, but there are better solutions than that, and if you hit a bandwidth issue with guest based backups it’s time to investigate those seriously. As you will see in this series I’m not a miser with NIC ports, but there’s no need to have one for every 2 virtual machines. If you have really high bandwidth needs consider 10Gbps, not a truck load of NIC ports.

Heartbeat: Due to the mostly moderate needs it is often combined with the CSV traffic.

Cluster Shared Volume (CSV): You have the metadata traffic of the cluster shared volumes, but that’s not all. You also have redirected access when you’re doing backups, defragmenting your CSV storage or when the storage paths are unavailable. So go for 10Gbps when you can, especially since this is your backup path for Live Migration traffic!

Side Note: Don’t say that Redirected Access over the CSV network will never happen when you have redundant storage paths. We’ve seen it happen in an environment with dual FC HBA cards, dual SAN controllers and the works. Redirected Access saved our service availability during that event! What happened exactly and how it all ties together is a long and complicated story, but in essence an arbitrated loop management module went haywire and caused a loop, and the root cause of this was a defective disk. When that event was over, one of the controllers went nuts, decided this wasn’t its cup of tea and called it a day. Guess what? Some servers could not fail over to the other controller as something went wrong in the internal workings of the SAN itself; dual HBAs didn’t help here. How did our services stay available? Thanks to Redirected Access. It was at 1Gbps speeds so that hurt a little, but we kept ‘m running. Our vendor worked through this with us, but things were pretty bad and it was pucker time. However this is one example where we kept our services running for 24 hours (whilst working at the issue with the vendor) via redirected access. The bad thing was we needed to take the spare controller off line & restart both to get the replacement controller to be recognized, yes, a complete shutdown of the cluster nodes to restart both SAN controllers. I still remember the mail I sent and the call I made to management that I was shutting down the business for 30 minutes. But it was not because of Hyper-V, quite the opposite; it helped us out a lot!

Also note that when you run software VSS based backups and disk defragmentation on your CSV storage you’ll be running in Redirected Access mode. Also see “Some Feedback On How To Defrag A Hyper-V R2 Cluster Shared Volume” (http://workinghardinit.wordpress.com/2011/06/02/some-feedback-on-how-to-defrag-a-hyper-v-r2-cluster-shared-volume/).

Live Migration: The bigger and better the pipe, the faster Live Migration gets done. With high density or resource (memory) intensive servers this becomes a lot more important. Think of SQL Server or Exchange consuming 16, 24, 32 or more GB of memory. So do consider 10Gbps.

iSCSI: As we are using Fiber Channel in our SAN we did not include iSCSI in the networking needs table above. I do want to draw your attention to the need for iSCSI in the virtual machines themselves. This is needed for clustering within the virtual machines. Today this is almost a requirement as clustering in the guest becomes more and more important. You’ll need at least two NIC ports in production for this, if possible on two separate cards for ultimate redundancy. As a best practice we won’t share the iSCSI NICs between the hosts and the guests. I do this in the lab but won’t have it in production. So that could mean at least two more NIC ports. With 10Gbps you’ll have ample performance, but depending on your IO needs you might want 4 ports if you’re using 1Gbps, so those NIC numbers are rising fast.

What | Function | Traffic | Connection Type
---|---|---|---
iSCSI Guest | Virtual machine shared storage. | High bandwidth need; low latency is required to get good I/O. | Dedicated to Hyper-V
iSCSI Host | Host shared storage. | High bandwidth need; low latency is required to get good I/O. | Excluded from the cluster, dedicated to the host.
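For reference, here’s how you can check how the cluster itself sees those networks and which ones it’s allowed to use, with the Failover Clustering PowerShell module that ships with W2K8R2. A small sketch; the network names are obviously placeholders for whatever you called yours.

```powershell
Import-Module FailoverClusters

# How does the cluster see these networks? Role 0 = not used by the cluster,
# Role 1 = cluster communication only (heartbeat, CSV, Live Migration), Role 3 = cluster and client.
Get-ClusterNetwork | Format-Table Name, Role, Address, Metric -AutoSize

# Example: keep the host iSCSI network out of cluster use and make the CSV & LM networks cluster-only.
# "iSCSI-Host", "CSV-Net" and "LM-Net" are placeholder names.
(Get-ClusterNetwork "iSCSI-Host").Role = 0
(Get-ClusterNetwork "CSV-Net").Role    = 1
(Get-ClusterNetwork "LM-Net").Role     = 1
```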

What to move to 10Gbps?

Cool, you think, let’s throw some 10Gbps NICs & switches into our network. After that, depending on the rest of your network equipment & components, your virtual machines might be able to talk to other virtual and physical servers on the network at speeds up to 10Gbps, or at least 1Gbps. I kind of hope that none of you are running 100Mbps in your server racks today. And last but not least, with your 10Gbps network you’ll be able to get the best performance for your CSV and Live Migration traffic. Life is good!

Until your network engineer hears about your plans. All of a sudden it’s not so cool anymore. You certainly woke the network people up! They’re nervous now that they have seen all the double (redundancy) lines you’ve drawn on your copy of the schema representing the rack/server room network. They start mumbling things about redundancy, loops, RSTP, MSTP, LAG, stacking and a boatload of other acronyms that sound like you’ve heard ‘m before but can’t quite place. They also talk about doom and gloom scenarios that might very well bring down the network. So unless you are the network admin you should dust off your communication skills and get them on board. For your sake I hope they’re not the kind of engineers who state that most network problems can be solved by removing the servers and applications that ruin the nirvana of their network design. If so, they’ll be very wary of that “virtual switch” you’re talking about as well.

The Easy Way Out – A Dedicated CSV & Live Migration Network

Let’s say that you need a lot more time to get a fully integrated solution for the 10Gbps network architecture figured out and set up. But your manager states you need to improve the Live Migration and other cluster network speeds today. What are your options? Based on the above information your boss is right: the networks that will benefit the most from a move to 10Gbps are CSV and Live Migration (and Heartbeat, which piggybacks along with CSV). Now you have to remember that those cluster networks (subnets/VLANs) are for the Heartbeat, CSV and Live Migration cluster traffic only. So basically the only requirement you have is that these run on separate subnets/VLANs (to present them as distinct networks to your failover cluster) and that every node of the cluster can communicate over those subnets/VLANs. This means that you can leave the switches for those networks completely isolated from the rest of the network as shown in the picture below. I used some very common and often used DELL PowerConnect switches (5424, 6248, 8024F) in the scenario drawings for this blog series. DELL could make that 8024F an unbeatable price/quality deal if they would make it stackable. The sweet thing about stackable switches is that you can do active-active NIC teaming across switches rather than active-passive. I never went that way as I’m waiting to see what virtual switch innovations Hyper-V 3.0 will bring us. You see, I’m a little cheap after all.

But naturally, feel free to think about these scenarios with your preferred ProCurve, Cisco, Juniper, NetGear … switches in mind. Smile

clip_image002[1]

Suddenly things are cool again. The network people get time to figure out an integrated & complete long term solution and you can provide your nodes with 10Gbps for cluster-only traffic. Buy a couple of 10Gbps switches & NICs and you’re on your way. Is this a good idea? I can’t make that call for you. I just provide some ideas. You decide.

The Case For Physically Isolating Them

Now you might wonder if this isn’t very wasteful in resources. Not necessarily. If your cluster is big enough, let’s say 12-16 nodes, or if you have a couple of clusters (4 clusters with 5 nodes for example), this might not be overly expensive. Unless you’re on a converged network, you do (I hope) the same for your storage networks, isolate them that is. You have to when you’re using fiber and you’d better do it when using iSCSI. It provides for the best performance and less complex switch configurations. Remember I mentioned that high availability requires some complexity. Try to keep that complexity as low as possible and when you introduce complexity make sure you can manage it. This serves two purposes. One is making sure that the complexity doesn’t ruin your high availability, and two is that you’ll be happy you did it when it comes to troubleshooting and fixing issues.

Now you might say that this ruins the concept of converged networks. Academically this is true, but when you are filling up ports on switches for a single purpose there is no room for anything else anyway. Don’t lose sight of the aim of a converged network. That is to have the ability to use the same hardware/technology when possible for multiple needs. This gives you options and capabilities where and when needed. It’s not about always using all technology and protocols on each and every switch. Don’t forget also that you’ll need to address QoS/performance on a converged network per type of traffic. There is also the fact that in brown field scenarios you’re dealing with replacing a part of the infrastructure, and this example is a good way to get 10Gbps where needed without making any change to the existing network infrastructure. This reduces risk and impact. As a matter of fact, if you plan this right you can do this without service interruption. That means going node by node (maintenance mode, evacuate all VMs), moving the CSV network first for example and only then the Live Migration network. You’re leveraging the ability of the cluster networks to take on each other’s role here to achieve this.
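To actually steer the CSV traffic onto that new 10Gbps network, and keep the Live Migration network as its backup, you can play with the cluster network metrics: the cluster sends CSV traffic over the enabled cluster network with the lowest metric. A hedged sketch, again with placeholder network names:

```powershell
Import-Module FailoverClusters

# AutoMetric = True means the cluster chose the value itself; setting Metric by hand overrides that.
Get-ClusterNetwork | Format-Table Name, Metric, AutoMetric -AutoSize

# Give the dedicated CSV network the lowest metric so CSV traffic prefers it,
# with the Live Migration network next in line ("CSV-Net" and "LM-Net" are placeholder names).
(Get-ClusterNetwork "CSV-Net").Metric = 900
(Get-ClusterNetwork "LM-Net").Metric  = 1000
```

The preferred network order for Live Migration itself is set separately in Failover Cluster Manager (the “Network for live migration” tab on a virtual machine resource), so check that as well once the new networks are in place.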

Another good reason to physically isolate the networks is security. There was an exploit for manipulating VMs during live migrations in 2008 (http://www.eecs.umich.edu/techreports/cse/2007/CSE-TR-539-07.pdf). You can protect against this via very careful switch configuration and VLAN design. But isolating the switches is very easy, clean and effective as well. Overkill? I don’t know, but perhaps not if you do work for intelligence agencies.

Ethernet Out-of-Band (OOB) Port For Management

Don’t forget you still need to be able to manage those switches, but today, in this class of equipment, you get an Ethernet Out-of-Band (OOB) port for that. This one you can safely uplink to your regular management network. So if you really don’t need communication with the rest of the network, you have no functional reason not to isolate them.

Money, Cost? No Value!

Still you think, isn’t this very expensive? Well, look at the purpose. Manageable complexity, high availability, and your management stated to eliminate, where possible, any limitation on performance and approved the budget for it all. Put this into perspective. The SQL Server Datacenter editions running on these clusters, combined with the cost of development & maintenance of the databases and applications relying on this infrastructure, put those extra € spent on a couple of switches really into perspective. On top of that you’re not wasting those switches. When the network people get their plans finished they’ll be integrated into the final solution if still needed and possible. Don’t forget that you might use all ports for just cluster traffic depending on the number of hosts you have! So even without integrating them into the rest of the network, you’re still getting very solid results. On top of that, sometimes you get to build solutions where budget is not the first, last and only concern. Sweet! I do know some people who’ll call me a money wasting nut case. But get real, when you’re building highly available, highly performing failover clusters and you’re in a discussion about the cost of a couple of NIC ports and you are going to adjust your design over that, perhaps you have a sponsorship issue. Put it into perspective. Hyper-V clusters are not a competition where the one who uses the least NIC ports/cards and switch ports/switches wins. That’s why it hurts when I see designs like this claiming victory:

image

What I want to see is more like this:

image

But that will never fit into a blade design! Really? Have you seen blades like the DELL M910? It’s a beast, comparable to the R810. It was the first blade I really felt like buying. Cisco also entered that market with guns drawn and is pushing HP to keep performing. So again, put the NIC/switch count and the NIC port to switch port ratio into perspective against what you’re trying to achieve. To quote Anton Ego: “… you know what I’m craving? A little perspective, that’s it. I’d like some fresh, clear, well-seasoned perspective.”


Introducing 10Gbps & Thoughts On Network High Availability For Hyper-V (Part 3/4)


This is the 3rd post in a series of 4. Here’s a list of all parts:

  1. Introducing 10Gbps Networking In Your Hyper-V Failover Cluster Environment (Part 1/4)
  2. Introducing 10Gbps With A Dedicated CSV & Live Migration Network (Part 2/4)
  3. Introducing 10Gbps & Thoughts On Network High Availability For Hyper-V (Part 3/4)
  4. Introducing 10Gbps & Integrating It Into Your Network Infrastructure (Part 4/4)

As you saw in my previous blog post “Introducing 10Gbps With A Dedicated CSV & Live Migration Network (Part 2/4)” we created an isolated network for the Hyper-V cluster networking needs, i.e. Heartbeat, Cluster Shared Volume and Live Migration traffic. When you set up failover clustering you’re doing so to achieve some level of high availability. We did this by using 2 switches and setting up redundant paths to them, making use of the fault tolerance the cluster networks offer us. The dark side of high availability is that it always exposes the next single point of failure, and when it comes to networking that means you’ll need redundant NICs, NIC ports, cabling and switches. That’s what we’ll discuss in this blog post. All the options below are just that: options. There is never an obligation to use them everywhere and they might not be needed depending on the type of network and the business needs we’re talking about. But one thing I have learned is to build options into your solutions. You want ways and opportunities to work around issues while you fix them.

Redundant Switches

The first thing you’ll need to address is the loss of a switch. The better ones have redundant power supplies but that’s about it. So you’ll need to have (at least) two switches and make sure you have redundant connections to both switches. That implies both switches can talk to each other as they form one functional unit even when it is an isolated network as in our example.

One of the ways we can achieve this is by setting up a Link Aggregation Group (LAG) over Inter Switch Links (ISL). The LAG makes all the connections available between the switches for the VLANs you define. There are different types of LAG but one of the better ones is a LAG with LACP.

Stacking your switches might also be a solution if they support it. You might need stacking modules for that. Basically this turns two or more switches into one big switch. One switch in the stack acts as the master switch that maintains the entire stack and provides a single configuration and monitoring point. If a switch in the stack fails, the remaining switches will bypass the failed switch via the stacking modules. Depending on the quality of your network equipment you can have some disruption during the failure of the master switch, as another switch then needs to take on that role, and this can take anything between 3 seconds and a minute depending on vendor, type, firmware, etc. Network people like this. And as each switch contains the entire stack configuration it’s very easy to replace a dead switch in a stack. Just rip out the dead one, plug in the replacement one and the stack will do the rest.

Note that more people have access to switches that can handle LAGs than to stackable ones. The reason for this is that the latter tend to be more pricey.

Redundant Network Cards & Ports

Now whether you’re using LAGs or stacking, the idea is that you connect your NICs to different switches for redundancy. The question is: do we need to do something with the NIC configuration or not to benefit from this? Do we get redundancy via a cluster wide virtual switch or not? If not, can we use NIC teaming? Is NIC teaming always needed or a good idea? OK, let’s address some of these questions.

First of all, Hyper-V in the current Windows Server 2008 R2 SP1 version has no cluster wide virtual switch that can provide redundancy for your virtual machine network(s). But please allow me to dream about Hyper-V 3.0. To achieve redundancy for the virtual machine networks you’ll need to turn to NIC teaming. NIC teaming has various possible configurations depending on vendor and the capabilities of the switches in use. You might be familiar with terminology like Switch Fault Tolerance (SFT), Adaptive Fault Tolerance (AFT), Link Aggregation Control Protocol (LACP), etc. Apart from all that, the biggest thing to remember is that NIC teaming support has to come from the hardware vendor(s). Microsoft doesn’t support it directly for Hyper-V, and Hyper-V gets access to a teamed NIC via the Windows operating system.

On NIC Teaming

I’m going to make a controversial statement. NIC teaming can be and often is a cause of issues, and it can be expensive in time to both set up and fix when it fails. Apart from a lot of misconceptions and terminology confusion with all the possible configurations, we have another issue. NIC teaming introduces complexity with drivers & software that is at least a hundredfold more likely to cause failures than today’s high quality network cards. On top of that, sometimes people forget about the proper switch configuration. Ouch!

Do a search on Hyper-V and NIC teaming and you’ll see the headaches it causes so many people. Do you need to stay away from it? Is it evil? No, I’m not saying that. Far from it, NIC teaming is great. You need to decide carefully where and when to use it and in what form. Remember: when you can handle & manage the complexity needed to achieve high availability, generally speaking you’re good to go. If complexity becomes a risk in itself, you’re on the wrong track.

Where do I stand on NIC teaming? Use it when it really provides the benefits you seek. Make sure you have the proper NICs, Switches and software/drivers for what you’re planning to do. Do your research and test. I’ve done NIC Teaming that went so smooth I never would have realized the headaches it can give people. I’ve done NIC teaming where buggy software and drivers drove me crazy.

I’d like to mention security here. Some people tend to do a lot of funky, tedious configurations with VLANs in an attempt to enhance security. VLANs are not security mechanisms. They can be used in a secure implementation but by themselves they achieve nothing. If you’re doing this via NIC teaming/VLANs I’d like to note that once someone has access to your Hyper-V management console and/or the switches, you’re toast. Logical and physical security cannot be replaced or ignored.

NIC Teaming To Enhance Throughput

You can use NIC teaming to enhance bandwidth/throughput. If this is your major or only goal, you might not even be worried about using multiple switches. NIC teaming does help to provide better bandwidth, sure, but nothing beats buying 10Gbps switches & NICs. Really, switches with LAGs or stacks and NIC teaming are great, but bigger pipes are always better for raw throughput. If you need twice or quadruple the ports only for extra bandwidth this gets expensive very fast. And if, on top of that, you need consultants because you don’t have a network engineer to set it all up just for that purpose, save your money and invest it in hardware.

NIC Teaming For Redundancy

Do you use NIC teaming for redundancy? Yes, this is a very good reason when it fits the needs. Do you do this for all networks? No, it depends. Just for heartbeat, CSV & Live migration traffic it might be overkill. The nature of these networks in a Hyper-V Cluster is such that you don’t really need it as they can mutually provide redundancy for each other. But what if a NIC port fails when I’m doing a live migration? Won’t that mean the live migration will fail? Yes. But once the NIC is out of the picture Live Migration will just work over the CSV network if you set it up that way. And you’re back in business while you fix the issue. Have I seen live migration fail? Yes, sure. But it never left the VM messed up, that kept running. So you fix the issue and Live Migrate it again.

The same goes for the other networks. CSV should not give you worries. That traffic gets queued and sent over the next network available for CSV. Heartbeat is also not an issue. You can afford the little “down time” until it is sent over the next available network for cluster communication. Really, a properly set up cluster doesn’t go down when a cluster network fails if you have multiple of them.

But NIC teaming could/would prevent even this ever so slight interruption, you say. It can, yes, depending on how you set it up, so not always by definition. But it’s not needed. You’re preventing something benign at great cost. Have you tested it? Is it always a lossless, completely transparent failover? Not a single packet dropped? Not one ping failed? If so, well done! At what cost and for what profit did you do it? How often do your NICs and switch ports fail? Not very often. Also remember the extra complexity and the risk of (human) configuration errors. As always, trust but verify, testing is your friend.
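A low-tech way to do that verification is to simply watch a guest from the outside while you live migrate it and while you pull a cable or fail a switch. A tiny sketch; replace the address with one of your own test VMs:

```powershell
# Time-stamped connectivity check against a test VM while you live migrate it or fail a path.
# A dropped reply or two during the Black-Out phase is normal; long gaps are not.
$vm = "192.168.1.50"   # placeholder: IP address or name of a test VM
while ($true) {
    $ok = Test-Connection -ComputerName $vm -Count 1 -Quiet
    "{0:HH:mm:ss.fff}  {1}" -f (Get-Date), $(if ($ok) { "reply" } else { "TIMEOUT" })
    Start-Sleep -Milliseconds 500
}
```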

Paranoia Is Your Friend

If you set up NIC teaming without separate NIC cards (not ports) and the PCIe slot goes bad NIC teaming won’t save you. So you need multiple network cards. On top of that, if you decide to run all networks over that team you put all your eggs in that one basket. So perhaps you might need 2 teams distributed over multiple NIC cards. Oh boy redundancy and high availability do make for expensive setups.

Combine NIC Teaming & VLANs To Work Around Limited NIC Ports

This can be a good idea. As you’ll be pushing multiple networks (VLANs) over the same pipe you want redundancy. So NIC teaming here can definitely help out. You’ll need to consider the amount of network traffic in this case as well. If you use load balancing NIC teaming you can get some extra bandwidth, but don’t expect miracles. Think about the potential for bottlenecks, QoS, and try to separate bandwidth hogs onto separate teams. And remember, bigger pipes are always better, so consider 10Gbps when you are in a bandwidth crunch.

Don’t Forget About The Switches

As a friendly reminder about what we already mentioned above, don’t forget to use different switches for uplinking the NIC ports. If you do forget, your switch is the single point of failure (SPOF). Welcome to high availability: always hitting the next SPOF and figuring out how big the risk is versus the cost in money and complexity to deal with it. Switches don’t often fail, but I’ve seen sysadmins pull out the wrong PDU cables. Yes, human error lurks in all corners in all possible variations. I know this would never happen to you, and certainly not twice, but other people are not so skillful. And for those who’d rather be lucky than good I have bad news. Luck runs out. Inevitably bad things happen to all of our systems.

Some Closing Thoughts

One rule of thumb I have is not to use NIC teaming to save money by reducing NIC cards, NIC ports, number of switches or switch ports. Use it when it serves your needs and procure adequate hardware to achieve your goals. You should do it because you have a real need to provide the absolute best availability and then you put down the money to achieve it. If you talk the talk, you need to walk the walk. And while not the subject of this post, your Active Directory or other core infrastructure services are not single points of failure, are they? Winking smile

If you do want to use it to save money or work around a lack of NIC ports, there is nothing wrong with that, but say so and accept the risk. It’s a valid decision when you have your needs covered and are happy with what that solution provides.

When you take all of these options into consideration, where do you end up with NIC teaming and network solutions for Hyper-V clusters? You end up with the “business ready” or “reference architecture” offerings from DELL or HP. They weigh all pros and cons against each other and make a choice based on providing the best possible solution for the largest number of customers at acceptable costs. Is this the best for you? That could very well be. It all depends. They make pretty good configurations.

I tend to use NIC teaming only for the virtual machine networks. That’s where the biggest potential service interruption exists. In certain environments where NIC teaming was not chosen, I have mitigated that risk by providing 2 or 3 single NICs for 2 or 3 virtual networks in Hyper-V. That reduced the impact to 1/3 of the virtual machines. And the fix for a broken NIC is easy: just attach the VMs to a different virtual network. You can do this while the virtual machines are running, so no shutdown is required. As an added benefit you balance the network traffic over multiple NICs.

10Gbps with NIC teaming and VLANs provides for some very nice scenarios. This is especially true if you have bandwidth hungry applications running in a boatload of VMs. This all means that we need to start thinking and talking about integrating the 10Gbps switches into our network infrastructure. So that means we’re entering the network engineers’ turf and we’ll need to address some of their concerns. But this is not bad news as they’ll help us prevent some bad scenarios. That will be discussed in the next blog post.


Hyper-V Cluster Nodes Upgrade: Zero Down Time With Intel VT FlexMigration


Well, the oldest Hyper-V cluster nodes are 3+ years old. They’ve been running Hyper-V clusters since the RTM of Hyper-V for Windows 2008 RTM. Yes, you needed to update the “beta” version to the RTM version of Hyper-V that came later Smile Bit of a messy decision back then, but all in all that experience was painless.

These nodes/clusters were upgraded to W2K8R2 Hyper-V clusters very soon after that SKU went RTM, but now they have reached the end of their “Tier 1” production life. The need for more capacity (CPU, memory) was felt. Scaling out was not really an option. The cost of fiber channel cards is big enough, but fiber channel switch ports need activation licenses and the cost for those borders on legalized extortion.

So upgrading to more capable nodes was the standing order. Those nodes became DELL R810 servers. The entire node upgrade process itself is actually quite easy. You just live migrate the virtual machines over to clear a host, which you then evict from the cluster. You recuperate the fiber channel HBAs to use in the new node that you then add to the cluster. You just rinse and repeat until you’re done with all nodes. Thank you Microsoft for the easy clustering experience in Windows 2008 (R2)! Those nodes now also have 10Gbps networking kit to work with (Intel X520 DA SFP+).
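For those who prefer to script the rinse & repeat part, it looks roughly like this with the FailoverClusters module. It’s a simplified sketch (node and VM names are placeholders) and it obviously skips the physical work of moving the HBAs and the rest of the homework:

```powershell
Import-Module FailoverClusters

$oldNode = "Node1"      # placeholder: node being replaced
$target  = "Node2"      # placeholder: node to evacuate to

# 1. Clear the old node: live migrate every VM group it currently owns to another node.
Get-ClusterNode $oldNode | Get-ClusterGroup |
    Where-Object { $_ | Get-ClusterResource | Where-Object { $_.ResourceType.Name -eq "Virtual Machine" } } |
    ForEach-Object { Move-ClusterVirtualMachineRole -Name $_.Name -Node $target }

# 2. Evict the now empty node from the cluster.
Remove-ClusterNode -Name $oldNode

# 3. After racking the new server (and moving the HBAs over), join it to the cluster.
Add-ClusterNode -Name "NewNode1"   # placeholder name for the new R810
```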

If you do your homework this process works very well. The cool thing is that there is not much to do on the SAN/HBA/fiber switch configuration side, as you recuperate the HBAs with their World Wide Names. You just need to update some names/descriptions to represent the new nodes. The only thing to note is that the cluster validation wizard nags about inconsistencies in node configuration, service packs. That’s because the new nodes are installed with SP1 integrated, as opposed to the original ones having been upgraded to SP1, etc.

The beauty is that by sticking to Intel CPUs we could live migrate the virtual machines between nodes having Intel E5430 2.66GHz CPUs (5400-series "Harpertown") and those having the new X7560 2.27GHz CPUs (Nehalem EX “Beckton”). There was no need to use the “Allow migration to a virtual machine with a different processor” option. Intel’s investment (and ours) in VT FlexMigration is paying off, as we had a zero down time upgrade process thanks to this.

image

You can read more about Intel VT FlexMigration here

And in case you’re wondering: those PE2950 III servers are getting a second life. Believe it or not, there are software vendors that don’t have application life cycle management, virtualization support or roadmaps to support it. So some hardware comes in handy to transplant those servers to when needed. Yes, it’s 2011 and we’re still dealing with that crap in the cloud era. I do hope the vendors of those applications get the message, or that management cuts the rope and lets them fall.


Introducing 10Gbps & Integrating It into Your Network Infrastructure (Part 4/4)


This is the 4th post in a series of 4. Here’s a list of all parts:

  1. Introducing 10Gbps Networking In Your Hyper-V Failover Cluster Environment (Part 1/4)
  2. Introducing 10Gbps With A Dedicated CSV & Live Migration Network (Part 2/4)
  3. Introducing 10Gbps & Thoughts On Network High Availability For Hyper-V (Part 3/4)
  4. Introducing 10Gbps & Integrating It Into Your Network Infrastructure (Part 4/4)

In my blog post “Introducing 10Gbps & Thoughts On Network High Availability For Hyper-V (Part 3/4)” in this series of thoughts on 10Gbps and Hyper-V networking, a discussion on NIC teaming brought up the subject of 10Gbps for virtual machine networks. This means our switches will probably no longer exist in isolation, unless those virtual machines don’t ever need to talk to anything outside of what’s connected to those switches. This is very unlikely. That means we need to start thinking and talking about integrating the 10Gbps switches into our network infrastructure. So we’re entering the network engineers’ turf again and we’ll need to address some of their concerns. But this is not bad news as they’ll help us prevent some bad scenarios.

Optimizing the use of your 10Gbps switches

Not everyone runs clusters big enough, or enough smaller clusters, to warrant an isolated network approach for just cluster networking. As a result you might want to put some of the remaining 10Gbps ports to work for virtual machine traffic. We’ve already pointed out that your virtual machines will not only want to talk amongst themselves (it’s a cluster, and private/internal networks tend to defeat the purpose of a cluster as they are limited to a single node), but also need to talk to other servers on the network, both physical and virtual ones. So you have to hook up the 10Gbps switches from the previous example to the rest of the network. Now there are some scenarios where you can keep the virtual machine networks isolated as well within a cluster. In your POC lab for example, where you are running a small 100% virtualized test domain on a cluster in a separate management domain, but these are not the predominant use case.

But you don’t just have to integrate with the rest of your network, you may very well want to! You’ve seen 10Gbps in action for CSV and Live Migration and you’ve got a taste for 10Gbps now; you’re hooked and dream of moving each and every VM network to 10Gbps as well. And while you’re at it, your management network and such as well. This is nothing different from the time you first got hold of 1Gbps networking kit in a 100Mbps world. Speed is addictive, once you’re hooked you crave more Smile

How to achieve this? You could do it by replacing the existing 1Gbps switches. That takes money, no question about it. But think ahead, 10Gbps will be commonplace in a couple of years’ time (read: prices will drop even more). Servers with 10Gbps LOM cards are here or will be here very soon from any major vendor. For Dell this means that the LOM NICs will be like mezzanine cards and you decide whether to plug in 10Gbps SFP+ or Ethernet jacks. When you opt to replace some current 1Gbps switches with 10Gbps ones you don’t have to throw them away. What we did at one location is recuperate the 1Gbps switches for out of band remote access (ILO/DRAC cards) that in today’s servers also run at 1Gbps speeds. Their older 100Mbps switches were taken out of service. No emotional attachment here. You could also use them to give some departments or branch offices 1Gbps to the desktop if they don’t have that yet.

When you have ports left over on the now isolated 10Gbps switches and you don’t have any additional hosts arriving in the near future requiring CSV & LM networking, you might as well use those free ports. If you still need extra ports you can always add more 10Gbps switches. But whatever the case, this means uplinking those cluster network 10Gbps switches to the rest of the network. We already mentioned in a previous post that the network people might have some concerns that need to be addressed, and rightly so.

Protect the Network against Loops & Storms

The last thing you want to do is bring down your entire production network with a loop and the resulting broadcast storm. You also don’t want the otherwise rather useful Spanning Tree Protocol locking out part of your network and ruining your sweet cluster setup, or traffic specifically intended for your 10Gbps network being routed over a 1Gbps network instead.

So let us discuss some of the ways in which we can prevent all these bad things from happening. Now mind you, I’m far from an expert network engineer so to all CCIE holders stumbling on to this blog post, please forgive me my prosaic network insights. Also keep in mind that this is not a networking or switch configuration course. That would lead us astray a bit too far and it is very dependent on your exact network layout, needs, brand and model of switches etc.

As discussed in the blog post Introducing 10Gbps With A Dedicated CSV & Live Migration Network (Part 2/4), you need a LAG between your switches, as the traffic for the VLANs serving heartbeat, CSV, Live Migration and virtual machines, but now perhaps also the host management and optional backup network, must flow between the switches. As long as you have only two switches that have a LAG between them or that are stacked, you don’t have much risk of creating a loop on the network. Unless you uplink two ports directly with a network cable. Yes, that happens. I once witnessed a loop/broadcast storm caused by someone who was “tidying up” spare CAT5E cables by plugging all the loose ends into free switch ports. Don’t ask. Lesson learned: disable every switch port not in use.

Now once you uplink those two or more 10Gbps switches to your other switches in a redundant way you have a loop. That’s where the Spanning Tree Protocol comes in. Without going into detail, this prevents loops by blocking the redundant paths. If the operational path becomes unavailable, a new path is established to keep network traffic flowing. There are some variations on STP. One of them is the Rapid Spanning Tree Protocol (RSTP) that does the same job as STP but a lot faster. Think a couple of seconds to establish a path versus 30 seconds or so. That’s a nice improvement over the early days. Another one that is very handy is the Multiple Spanning Tree Protocol (MSTP). The sweet thing about the latter is that you have blocking per VLAN, and in the case of Hyper-V or storage networks this can come in quite handy.

Think about it. Apart from preventing loops, which are very, very bad, you’d also like to make sure that the network traffic doesn’t travel along unnecessarily long paths or over links that are not suited to its needs. Imagine the Live Migration traffic between two nodes on different 10Gbps switches travelling over the 1Gbps uplinks to the 1Gbps switches because STP blocked the 10Gbps LAG to prevent a loop. You might be saturating the 1Gbps infrastructure and that’s not good.

I said MSTP could be very handy, so let’s address this. You only need the uplink to the rest of the network for the host management and virtual machine traffic. The heartbeat, CSV and Live Migration traffic also stops flowing when the LAG between the two 10Gbps switches is blocked by RSTP. This is because RSTP works at the LAG level for all VLANs travelling across that LAG and doesn’t discriminate between VLANs. MSTP is smarter and only blocks the required VLANs. In this case that’s the host management and virtual machine VLANs, as these are the only ones travelling across the link to the rest of the network.

We’ll illustrate this with some pictures based on our previous scenarios. In this example we have the cluster networks going to the 10Gbps switches over non-teamed NICs. The same goes for the virtual machine traffic, but those NICs are teamed, as are the host management NICs. Let’s first show the normal situation.

 clip_image002

Now look at a situation where RSTP blocks the purple LAG. Please do note that if the other network switches are not 10Gbps, the traffic for the virtual machines would be travelling over more hops and at 1Gbps. This should be avoided, but if it does happen, MSTP would prevent an even worse scenario. Now if you were to define the VLANs for cluster network traffic on those (orange) uplink LAGs, you could use RSTP with a high cost, but in the event that RSTP blocks the purple LAG you’d be sending all heartbeat, CSV and Live Migration traffic over those main switches. That could saturate them. It’s your choice.

clip_image004

In the picture below MSTP saves the day, providing loop free network connectivity even if spanning tree for some reason needs to block the LAG between the two 10Gbps switches. MSTP saves your cluster network traffic connectivity as those VLANs are not defined on the orange LAG uplinks, and MSTP prevents loops by blocking VLAN IDs in LAGs, not by blocking entire LAGs.

clip_image006

To conclude I’ll also mention a more “star like” approach to uplinking switches. This has a couple of benefits, especially when you use stackable switches to link up to. They provide the best bandwidth available for upstream connections and they provide good redundancy because you can uplink the LAG to separate switches in the stack. There is no possibility for a loop this way and you have great performance on top. What’s not to like?

clip_image008

Well, we’ve shown that each network setup has optimal, preferred network traffic paths. We can enforce these by proper LAG & STP configuration. Other, less optimal, paths can become active to provide resiliency for our network. Such a situation must be addressed as soon as possible and should be considered running on “emergency backup”. You can avoid such events, except for the most extreme situations, by configuring the RSTP/MSTP costs for the LAGs correctly and by using multiple inter switch links in every LAG. This does not only provide for extra bandwidth but also protects against cable or port failure.

Conclusion

And there you have it. Over a couple of blog posts I’ve taken you on a journey through considerations about not only using 10Gbps in your Hyper-V cluster environments, but also about cluster networks as a whole. Some notes from the field, so to speak. As I told you, this was not a deployment or best practices guide. The major aim was to think out loud, share thoughts and ideas. There are many ways to get the job done and it all depends on your needs and existing environment. If you don’t have a network engineer on hand and you can’t do this yourself, you might be ready by now to get one of those business ready configurations for your Hyper-V clustering. Things can get pretty complex quite fast. And we haven’t even touched on storage design, management, etc. The purpose of these blog posts was to think about how Hyper-V cluster networks function and behave and to investigate what is possible. When you’re new to all this but need to make the jump into virtualization with both feet (and you really do), a lot of help is available. Most hardware vendors have fast tracks, reference architectures that have a list of components to order to build a Hyper-V cluster, and more often than not they or a partner will come set it all up for you. This reduces both risk and time to production. I hope that if you don’t have a green field scenario but want to start taking advantage of 10Gbps networking, this has given you some food for thought.

I’ll try to share some real life experiences, what improvements we actually see with 10Gbps speeds, in a future blog post.


Data Protection & Disaster Recovery in Windows 8 Server Hyper-V 3.0


The news coming in from the Build Windows conference is awesome. The speculation of the last months is being validated by what is being told and on top of that more goodness is thrown at us Hyper-V techies.

On the data protection and disaster recovery front some great new weapons are at our disposal. Let’s take a look at some of them.

Live Migration & Storage Live Migration.

Among the goodies are the improvements in Live Migration and the introduction of Storage Live Migration. Hyper-V 3.0 supports multiple concurrent Live Migrations now, which combined with adequate bandwidth will provide for fast evacuation of problematic hosts. Storage Live Migration means you can move a VM (configuration, VHD & snapshots) to different storage while the guest remains online, so the users are not hindered by this. I’m trying to find out if they will support multiple networks/NICs with this.

Now to make this shine even more, MSFT has another ace up its sleeve. You can do Live Migration and Storage Live Migration without the requirement of shared storage on the backend. This combination is a big one. It means “shared nothing” high availability. Even now, when prices for entry level shared storage have plummeted, we see SMBs being wary of SAN technology. It’s foreign to them and the fact they haven’t yet gained any confidence with the technology makes them hesitant. Also the real or perceived complexity might hold ‘m back. For that segment of the market it is now possible to have high availability anyway with the combo Live Migration / Storage Live Migration. Add to this that Hyper-V now supports running virtual machines on a file share and you can see the possibilities of NAS appliances in this space of the market for achieving some very nice solutions.

Replication to complete the picture

To top this off you have replication built in, meaning we have the possibility to provide reasonably fast disaster recovery. It might not be real time data center failover but a lot of clients don’t need that. However, they do need easy recoverability and here it is. To give you even more options, especially if you only have one location, you can replicate to the cloud.

So now I start dreaming Smile We have shared nothing Live & Storage Live Migration, we have replication. What could we achieve with this? Do synchronous replication locally over 10Gbps, for example, and use that to build something like continuous availability. There we go, we already have requirements for “Windows 8 Server R2”!

NIC Teaming in the OS

No more worries about third party NIC teaming woes. It has arrived in the OS (finally!) and it will support load balancing & failover. I welcome this, again it makes this a lot more feasible for the SMB shops.

IP Virtualization / Address Mobility

Another thing that will aid with any kind of off site disaster recovery / high availability is IP address mobility. You have an IP for the hosting of the VM and one for internal use by the VM. That means you can migrate to other environments (cloud, remote site) with other addresses, as the VM can change the hosted IP address while the internal IP address remains the same. Just imagine the flexibility this gives us during maintenance, recovery, troubleshooting network infrastructure issues, and all this without impacting the users who depend on the VM to get their job done.

Conclusion

Everything we described is out of the box with Windows 8 Server Hyper-V. To a lot of businesses this can mean a huge improvement in their current availability and disaster recovery situation. More than ever there is now no reason for any company to go down or even out of business due to catastrophic data loss, as all this technology is available on site, in hybrid scenarios and in the cloud with the providers.


Optimizing Live Migrations with a 10Gbps Network in a Hyper-V Cluster


Introduction

You’ll find the following recommendations online about optimizing Live Migrations:

  1. Use bigger pipes (10Gbps is better than 1Gbps)
  2. Enable Jumbo Frames
  3. Up the Receive Buffer to 8192 (Exchange 2010 virtualization recommendation for Live Migration)

As we’ve been building Hyper-V clusters since the early betas, let me share some experiences with this. For the curious, I used Intel® Ethernet X520 SFP+ Direct Attach Server Adapters & DELL PowerConnect 8024F 10Gbps switches for my testing. See my blog posts on considerations about the use of 10Gbps in Hyper-V clusters here:

  1. Introducing 10Gbps Networking In Your Hyper-V Failover Cluster Environment (Part 1/4)
  2. Introducing 10Gbps With A Dedicated CSV & Live Migration Network (Part 2/4)
  3. Introducing 10Gbps & Thoughts On Network High Availability For Hyper-V (Part 3/4)
  4. Introducing 10Gbps & Integrating It Into Your Network Infrastructure (Part 4/4)

Bigger pipes are better

On bigger pipes I can only say that if you can afford them and need them you should get them. End of discussion.

Jumbo frames rock

Jumbo frames help out a lot (roughly 20%), especially with the larger memory virtual machines.
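On Windows Server 2008 R2 you set jumbo frames in the NIC driver’s Advanced properties; on Windows Server 2012 and later you could script it. A minimal sketch, assuming the NetAdapter module, a hypothetical adapter name and a driver that accepts 9014:

# Enable jumbo frames on the (hypothetical) Live Migration NIC "LM1"; the accepted
# value depends on the NIC vendor and driver.
Set-NetAdapterAdvancedProperty -Name "LM1" -RegistryKeyword "*JumboPacket" -RegistryValue 9014
# Verify the setting and test the path end to end with a large, non-fragmented ping.
Get-NetAdapterAdvancedProperty -Name "LM1" -RegistryKeyword "*JumboPacket"
ping 10.10.180.2 -f -l 8000

Remember that jumbo frames only pay off when they are enabled end to end: on the NICs of both hosts and on every switch port in the path.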

The golden nugget

So far so good, but there is one golden nugget of information I want to share. There is a little trip wire that can prevent you from getting optimal performance: the advanced power settings in the BIOS. If you read my blogs you might have come across the post Consider CPU Power Optimization Versus Performance When Virtualizing. I encourage you to go and read it, as it holds a lot of good info and is very relevant to this post, because we have yet another reason to make sure your BIOS is set right to achieve a decent return on your investment in quality hardware.

In our experience those power saving settings, the C states and the C1E state, are also not very helpful when it comes to Live Migration & such. I went from a meager 20% bandwidth use all the way up to 35-45% at best with jumbo frames enabled and the power settings set to ”Full Power”. A lot better, but still not very impressive.

clip_image002

Now go ahead and disable the C states AND the C1E state to achieve 55% to 65%.

clip_image004

Now the speed of a live migration varies greatly between virtual machines that are idle and those running a full load, both CPU & memory wise. It also depends on the load of the hosts you’re migrating from and to, but this impact is smaller when you disable those advanced CPU power settings.

Look at the following screen shots

clip_image005

A SQL Server with 50GB of RAM being live migrated over 10Gbps. Jumbo frames enabled, Power Settings optimized but with C1E & C States enabled.

clip_image006

A SQL Server with 50GB of RAM being live migrated over 10Gbps. Jumbo frames enabled, Power Settings optimized but with C1E & C States disabled.

The live migration of this virtual SQL Server takes between 74-78 seconds. Not bad!

image

By the way, these settings also help with 1Gbps, but the effect isn’t as spectacular. You use 99% instead of 75-80% of your bandwidth. An improvement, yes, but not on the same scale as the Live Migration speedup you get with 10Gbps.

As you can see in this post in the TechNet support forums, this seems to be a common occurrence. It’s not just me who’s seeing things: Live Migration on 10GbE only 16%. Even Dell chimed in there, confirming these findings in their labs.

Receive Buffer

There is one setting that’s been advised for Exchange 2010 virtualization with Hyper-V that I have not seen improve speeds, and that’s upping the Receive Buffers to 8192. You can read this in Best Practices for Virtualizing Exchange Server 2010 with Windows Server® 2008 R2 Hyper V™. In some of my tests this even reduced the results, especially when C1E & C states are enabled. It is also a confusing recommendation, as they state to set the Receive Buffers to 8192. This value, however, depends on the NIC type and driver, so you might only be able to set it to 4096 or so. The guidance should state to set it as high as possible, but I have not seen any benefits. Do mind that I did not test this with a Hyper-V cluster running a virtualized Exchange 2010 guest. Your mileage may vary; trust but verify is the age-old adage. Also keep in mind I’m running 10Gbps, so the effect of this setting might not be what it could do for a 1Gbps network, but on the whole I’m not convinced. If you implement all the other recommendations you’ll saturate a 1Gbps link already.
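If you do want to experiment with it anyway, here’s a hedged sketch (Windows Server 2012 or later, hypothetical adapter name); query the driver first, since the maximum accepted value differs per NIC:

# Check what the driver currently uses for receive buffers.
Get-NetAdapterAdvancedProperty -Name "LM1" -RegistryKeyword "*ReceiveBuffers"
# Raise the receive buffers; 4096 is an illustrative value, not a recommendation.
Set-NetAdapterAdvancedProperty -Name "LM1" -RegistryKeyword "*ReceiveBuffers" -RegistryValue 4096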

What does this mean?

The sad news is that in virtual environments or other high performance configurations the penguins have to give way to performance. I wish it were different, but unfortunately it isn’t.

By the way, this is vendor agnostic. You’ll see this with HP, DELL and Cisco in all form factors, whether they are tower, rack or blade servers. The main thing you need to make sure of is that the BIOS allows you to disable the C states and tune the power settings. Not all vendors/BIOS versions allow for this, I read, so make sure you check this. Some Cisco blades have been annoying on this front, ruining the performance of VDI projects with less than optimal CPU performance, but they have since released an updated BIOS to fix this.

Look, it makes no sense saving on power if it means you’ll buy more servers to compensate for the lack of performance per unit. In my honest opinion a lot of the hardware power optimizations are awesome, but they still have a long way to go in making sure they don’t incur such a hit on performance. Right sizing servers in number & type of CPUs, power supplies etc. still seems the best way to avoid wasting energy and money. Buying more power than needed and counting on the power consumption optimizations to reduce operating cost can be a good idea to protect your investment against expected future increases in resource demand within the service life of your hardware. On average that is 3 to 5 years, depending on the environment & needs.

Conclusion

Three things are needed for lightning fast Live Migrations, plus one caveat:

  1. Bandwidth. Hence the 10Gbps network. There is no substitute for bigger pipes.
  2. Jumbo Frames. Configure them right & you’ll reap the benefits.
  3. Disable C1E & C states. Also configure your servers’ power options for maximum performance.
  4. I have not been able to confirm that the receive buffer has a big impact on Live Migration speed or does any good at all. Test this to find out if it works for you.

Remember that you’ll be able to do multiple Live Migrations in parallel with Windows 8, so a 10Gbps pipe will be used at full capacity then. Being able to use more networks for Live Migration will only increase the ability to evacuate a host fast or to move virtual machines for load balancing across a cluster. If you look at the RDMA, InfiniBand and 40/100Gbps evolutions becoming available in the next 12 to 36 months, 10Gbps will become a lot more mainstream, while at the same time the options for network connectivity will become more diversified. 10Gbps prices are dropping, but for the moment they remain high enough to keep people away.


KB 2636573: Guest Crashes with Win2008R2 RTM/SP1 STOP 0xD1 in storvsc!StorChannelVmbusCallback During Live Migration


The BSOD

I helped hunt down this bug and tested the private fix. Some months ago, during the summer of 2011, I was putting some new Hyper-V clusters under stress tests. You know, letting them work very hard for a longer period of time to see if anything falls off or goes “boink". It all looked pretty robust and, after some tweaking, also very fast. Just when you’re about to declare “we’re all set”, you see a BSOD on one of the guests that’s being live migrated, happily announcing: “DRIVER_IRQL_NOT_LESS_OR_EQUAL STOP: 0x000000D1 (0x0000000000000000, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000)”

image

Now that doesn’t make ME very happy. So I investigate to see if there are any more VMs dropping dead during live migration, but we don’t see any. Known issues like out of date versions of the integration components or the like are not in play, nor are any other possible suspects.

We throw the MEMORY.DMP file in the debugger and we come up with the following culprit:

DRIVER_IRQL_NOT_LESS_OR_EQUAL (d1)

The driver probably at fault is storvsc.sys

Probably caused by : storvsc.sys ( storvsc!StorChannelVmbusCallback+2b8 )

Hmmmmmmm. We start searching the internet but we don’t find much. We also throw it onto Twitter to see if the community comes up with something. Meanwhile we keep looking and find this little blog post by Microsoft support engineer Rob Scheepens:

http://blogs.technet.com/b/dip/archive/2011/10/21/win2008r2-rtm-stop-0xd1-in-storvsc-storchannelvmbuscallback-0x2b5.aspx

We pinged Rob and opened a case with MS support. That evening Hans Vredevoort (www.hyper-v.nu), who saw my tweet, mailed me the details of a fellow MVP in the USA having this same issue. We got in contact, and via both Microsoft & the Hyper-V community we started hunting down the cause of this bug. The progress on this issue can be read at the Microsoft blog above. You’ll notice that the fix is in the works now.

Hunting down the STOP error

What did we establish:

  • It only happens occasionally with a live migration and rather ad hoc; not every time, not after X amount of live migrations or X amount of uptime.
  • At first it seemed to happen only with guests running dynamically expanding VHDs attached to SCSI controllers in Hyper-V. But that’s not really the case, as I remember one being a fixed VHD attached to a SCSI controller. In our case the VMs we could reproduce the issue with in a reasonable time were all SQL Server test and development guests running SQL Server 2008, where the dynamically expanding disks are used as “poor man’s thin provisioning”.
  • I have not heard of this on Windows 2008 hosts, only on R2, but I have not tested this.

So it’s reproducible, but it takes intensive live migration activity. Meanwhile we received private instrumentation to install on both guests & hosts to collect “enriched” memory dumps when a guest experiences a BSOD. With PowerShell we have continuous live migrations running to reproduce the issue. The fact that we can live migrate over 10Gbps does help Smile. You can get lucky, but in reality it takes many hundreds of live migrations to reproduce it, on some machines many thousands. Not a joke: in total we did 8,000 live migrations to test the fix and about 12,000 to reproduce the issue on several VMs with different configurations to send memory dumps to MSFT. So yes, you really need some PowerShell, and having a 10Gbps Live Migration network also helps Winking smile.
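To give you an idea, the loop we used looked roughly like the sketch below (hypothetical names; it needs the FailoverClusters module and a clustered, running VM, for which Move-ClusterVirtualMachineRole performs a live migration):

Import-Module FailoverClusters
$vmGroup = "SQLTEST01"   # hypothetical clustered VM role name

# Bounce the VM between all cluster nodes, over and over again.
for ($pass = 1; $pass -le 1000; $pass++) {
    foreach ($node in Get-ClusterNode) {
        Move-ClusterVirtualMachineRole -Name $vmGroup -Node $node.Name
        Write-Output "Pass $pass : $vmGroup is now on $($node.Name)"
    }
}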

All the MEMORY.DMP files collected from these live migration exercises were uploaded to Microsoft for analysis. That took a while, also because they had a boatload of live migrations to do themselves, and I don’t know if their test lab has 10Gbps.

On Tuesday the 25th of October Microsoft contacted us with good news. They had root-caused the problem and a hotfix was in the works. You can download it here: http://support.microsoft.com/kb/2636573

On Thursday the 27th of October we got access to a private fix, and after installing it we’ve been running thousands of live migrations without seeing the issue.

The public release of this hotfix is currently planned (HTP11-12) under KB2636573.

The details for the curious

Root Cause?

The root cause can be summarized as follows: “StorVSP was modifying guest memory while the VM’s virtual motherboard was being powered off.” When this happens, storvsc accesses a NULL pointer in a memory buffer that has already been freed. The result is a BSOD or STOP error in the virtual machine.

Only SCSI attached VHDs

OK, but why do we only see this with SCSI-attached VHDs? The issue happens during power down of the virtual machine’s motherboard because there is a disk enumeration during the shutdown phase, and this enumeration only happens with SCSI disks.

Right! So the more VHDs a virtual machine has attached to SCSI controllers, the higher the likelihood of this happening.

Why so much more likely with dynamically expanding VHDs

But still, we saw this exponentially more with dynamically expanding disks. Why is that? Well, it’s not that dynamically expanding disks trigger disk enumeration more often than fixed disks. However, it seems that any disk expansion, which causes write delays, can lead to a timing issue that makes the disk enumeration hit the problem described above. This significantly increases the risk that the STOP error will happen, and it explains why the chance of this happening with fixed VHDs attached to SCSI controllers is significantly lower. This is in sync with what we saw: the virtual machines with a lot of dynamic disks attached to SCSI controllers that had a lot of activity (and thus potential for expanding) were the ones where we could reproduce this the fastest.

Conclusion

It can take some time to hunt down certain bugs, especially the rare ones that only happen every now and then, so occurrences are few and far between. But when you put in some effort, Microsoft helps out and works on a fix. And no, you don’t have to have the most expensive support contract for that to happen. As a matter of fact this call was logged as a free support call with the TechNet Plus subscription, and as it was a bug, they returned it as unused.



Windows 8 Hyper-V Cluster Beta Teaser


What does an MVP do after a day of traveling back home from the MVP Summit 2012 in Redmond? He goes to bed and gets up early the next morning to upgrade his Windows 2008 R2 SP1 Hyper-V cluster to Windows 8. That means that when I boot the lab nodes these days I get greeted by the “beta fish” we know from Windows 2008 R2/Windows 7, but now it’s “metro-ized”.

image

Here is a teaser screenshot of concurrent Live Migrations in action on a new Windows Server 8 Beta Hyper-V cluster in the lab. As you can see, this 2 node cluster is handling 2 concurrent Live Migrations at a time. The other guests are queued. The number of Live Migrations you can do concurrently is dictated by how much bandwidth you want to pay for. In the lab that isn’t very much, as you can see Winking smile.

ConcurrentLiveMigrations

In Hyper-V 3.0 you can choose the networks to use for Live Migration with a preference order, just like in W2K8R2. So if you want more bandwidth you’ll have to team some NIC ports together or put in more NICs and you should be fine. It does not use multichannel. Keep in mind that each live migration only utilizes a single network connection, even if multiple interfaces are provided or network teaming is enabled. If there are multiple simultaneous live migrations, different migrations will be put on different network connections.
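For reference, a minimal sketch of how this looks in PowerShell with the Windows Server 2012 Hyper-V module (subnet and values are illustrative, and cmdlet behavior may differ slightly in the beta):

# Allow more simultaneous live migrations (bandwidth permitting).
Set-VMHost -MaximumVirtualMachineMigrations 4
# Restrict live migration to the dedicated 10Gbps subnet instead of any network.
Set-VMHost -UseAnyNetworkForMigration $false
Add-VMMigrationNetwork -Subnet "10.10.180.0/24"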

If the Live Migration network becomes unavailable, the CSV network in this example will take over. The CSV & Live Migration networks serve as each other’s redundant backup network.

LiveMigNetworks

There is more to come but I have only 24 hours in a day and they are packed. Catch you later!


Enhanced (Failover) Placement of Virtual Machines on Windows 8 Hyper-V Cluster


One of the nice features in Windows 8 Hyper-V clustering is the “drain node” capability combined with virtual machine priorities. You can see this in action in a video by Aidan Finn here.

image

For more details on draining a node, see Draining Nodes for Planned Maintenance with Windows Server "8".

Now another very important feature is the fact that the cluster is intelligent enough to determine which node is best suited as a target during failover or live migration. Not only the CPU and memory load of the hosts are taken into consideration, but also the resource needs of the VMs and the priority you have given them. This entire process is NUMA aware and, with Windows 8, can be evaluated on a per virtual machine basis. This means that the cluster will always try to get the best possible placement and thus performance for your virtual machines.

image

 

Now we also have affinity and anti-affinity rules. Anti-affinity ensures that the nodes of a virtualized NLB farm will be placed on separate hosts to minimize risk. You don’t want one host to house all the nodes of your NLB farm!

On the other hand, sometimes you want virtual machines to stick together. Let’s say you have an NLB farm where the virtual machines with the front end and middle tier need to stay together. In that case you use affinity rules to achieve this. On top of that, the anti-affinity rules will ensure that the NLB farm virtual machines are on different nodes.

image
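Anti-affinity is driven by the AntiAffinityClassNames property on the cluster groups. A minimal sketch with hypothetical VM names could look like this:

# Give every guest of the NLB farm the same anti-affinity class name so the
# cluster tries to place them on different nodes.
$class = New-Object System.Collections.Specialized.StringCollection
$class.Add("NLBFarm") | Out-Null
(Get-ClusterGroup -Name "NLB-Web01").AntiAffinityClassNames = $class
(Get-ClusterGroup -Name "NLB-Web02").AntiAffinityClassNames = $class
# Verify the assignment.
Get-ClusterGroup | Select-Object Name, AntiAffinityClassNames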

Do note that when the cluster has to choose between breaking these rules and not being able to run the virtual machines, it will choose to keep them running. It knows its priorities! Now if in such a situation there are not enough resources, priority also comes into play and a low priority machine may be shut down to ensure the higher priority ones can be up and running.

As you can imagine there are potentially a lot of factors/permutations at play here, and I’m looking into doing some more tests of these features and the intelligence in the process, to see if we’d make the same decisions and how to best configure this for maximum performance & availability.


Windows Hyper-V Server 2012 Live Migration DOES support pass-through disks–KB2834898 is Wrong


See update in yellow in line (April 11th 2013)

I recently saw KB2834898 (since pulled) appear and it’s an important one. This fast publish statement matters because, until recently, it was accepted that Live Migration with pass-through disks was supported with Windows Server 2012 Hyper-V (just like with Windows Server 2008 R2 Hyper-V) as long as the live migration is managed by the Hyper-V cluster, i.e. the pass-through disk is a clustered resource => see http://social.technet.microsoft.com/wiki/contents/articles/440.hyper-v-how-to-add-a-pass-through-disk-on-a-failover-cluster.aspx

UPDATE April 11th 2013: After consulting some very knowledgeable people at Microsoft (like Jeff Woolsey and Ben Armstrong), it turns out this KB article is not factually correct and leaves much to be desired. It’s wrong: pass-through disks are still supported with Live Migration in Windows Server 2012 Hyper-V, when managed by the cluster, just like before in Windows 2008 R2. The KB article has meanwhile been pulled.

Mind you that Shared Nothing Live Migration with pass-through disks has never been supported, as there is no way to move the pass-through disk between hosts. Storage Live Migration is not really relevant in this scenario either; there are no VHDX files to copy apart from the OS VHDX. Live migrations between stand-alone hosts are equally irrelevant. Hence, for pass-through disks it’s a Hyper-V cluster game only.

I have never been a fan of pass-through disks and we have never used them in production. Not in the Windows Server 2008 R2 era, let alone in the Windows Server 2012 time frame. No really, we never used them, not even in our SQL Server virtualization efforts, as we just don’t like losing the flexibility of VHDX files and because they tend to complicate things (i.e. things like live migration fail).

I advise people to strongly reconsider if they think they need them and to only use them if they are really sure they actually have a valid use case. I know some people had various reasons to use them in the past, but I have always found them to be a bit of over-engineering. One of the better reasons might have been that you needed disks larger than 2TB, but then I would advise iSCSI and now, with Windows Server 2012, also virtual Fibre Channel (vFC); even that is no longer needed for size alone, as VHDX now supports up to 64TB. Both those options support Live Migration and are useful for in-guest clustering, but not so much for size or performance issues in Windows Server 2012 Hyper-V. On the performance side of things we might have eaten a small IO hit before in exchange for the nice benefits of using VHDs. But even a MSFT health check of our virtualized SQL Server environment didn’t show any performance issues. Sure, your needs may be different from ours, but the performance argument with Windows Server 2012 and VHDX can be laid to rest. I refer you to my blog Hyper-V Guest Storage Performance: Above & Beyond 1 Million IOPS for more information on VHDX performance improvements and to Windows Server 2012 with Hyper-V & The New VHDX Format Leads The Way for VHDX capabilities in general (size, unmap, …).

I see only one valid reason why you might have to use them today: you have > 2TB disks in the VM and your backup vendor doesn’t support the VHDX format. Still a reality today, unfortunately Annoyed But that can be fixed by changing to another one Winking smile


Still Need To Optimizing Power Settings On DELL 12th Generation Servers For Lightning Fast Hyper-V Live Migrations?


Do you remember my blog from 2011 on optimizing some system settings to get way better Live Migration performance with 10Gbps NICs? It’s over here: Optimizing Live Migrations with a 10Gbps Network in a Hyper-V Cluster. This advice still holds true, but the power optimization settings & the interaction between DELL Generation 12 servers and Windows Server 2012 have improved significantly. Where with Windows Server 2008 R2 we could hardly get above 16% bandwidth consumption out of the box with Live Migration over a 10Gbps NIC, today this just works fine.

Don’t believe me?

image

You do now? Cool Winking smile

For overall peak system performance you might want to adjust your Windows configuration settings to run the High Performance power plan, if that’s needed.

image
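A minimal sketch of doing that from an elevated prompt with the in-box powercfg tool (SCHEME_MIN is the alias for the High Performance plan):

# Switch the active power plan to High Performance.
powercfg /setactive SCHEME_MIN
# Verify which plan is active now.
powercfg /getactivescheme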

You no longer need to dive into the BIOS. Of course, if you have issues because your hardware isn’t that intelligent and/or you are still running Windows 2008 R2, you do want to go there, as when it comes to speed we want it all and we want it now Smile. But again, chances are you will not need to do this on DELL 12th Generation hardware. Test & confirm, I’d say.

Well, let’s revisit this again, as we are now no longer working with Generation 10 or 11 servers with an “aged” BIOS. We have since decommissioned the Generation 10 server, upgraded the BIOS of our Generation 11 servers and acquired Generation 12 servers. We also now use UEFI for our Hyper-V host installations. The time has come to become familiar with those and the benefits they bring. It also future proofs our host installations.

So where and how do I change the power configuration settings now? Let’s walk through it together. Reboot your server and during the boot cycle hit F2 to enter System Setup.

image

Select System BIOS.

image

Click on System Profile Settings.

image

The settings you want to adapt are:

  • Set CPU Power Management to Maximum Performance
  • Set Memory Frequency to Maximum Performance
  • Disable C1E states
  • Disable C states

image

That’s it. The configuration below has optimized the power settings on a DELL Generation 12 server like the R720.

image

When done, click “Back” and then “Finish”. A warning will pop up and you need to confirm that you want to save your changes. Click “Yes” if you indeed want to do this.

image

You’ll get a nice confirmation that your settings have been saved. Click “OK” and then click “Finish”.

image

Confirm that you want to exit and reboot by clicking “Yes” and voilà, when the server comes back up it will be running at full speed, at the cost of more power consumption, extra generated heat and more cooling.

image

Remember, if you don’t need to run at full power, don’t. And do consider using Dynamic Optimization and Power Optimization in System Center Virtual Machine Manager 2012. Save a penguin!


I’m Ready To Test Windows Server 2012 R2 Live Migration Over Multichannel & RDMA!


I’m so ready for the first Windows 2012 R2 preview bits. Yes that’s what our current setup looks like. Two RDMA capable NICs at the ready Winking smile … let the bits come. I’m pretty excited to test Live Migration over Multichannel & RDMA.

image.

We chose to get RDMA NICs for new servers and to replace non-RDMA cards in hosts where there’s a clear benefit. By the way, have you seen the news on the Mellanox ConnectX®-3 Pro Single/Dual-Port Adapters? NIC support for NVGRE is here, people!


Teamed NIC Live Migrations Between Two Hosts In Windows Server 2012 Do Use All Members


Introduction

The blog post NIC Teaming in Windows Server 2012 Brings Simple, Affordable Traffic Reliability and Load Balancing to your Cloud Workloads states: “TCP/IP can recover from missing or out-of-order packets. However, out-of-order packets seriously impact the throughput of the connection. Therefore, teaming solutions make every effort to keep all the packets associated with a single TCP stream on a single NIC so as to minimize the possibility of out-of-order packet delivery. So, if your traffic load comprises of a single TCP stream (such as a Hyper-V live migration), then having four 1Gb/s NICs in an LACP team will still only deliver 1 Gb/s of bandwidth since all the traffic from that live migration will use one NIC in the team. However, if you do several simultaneous live migrations to multiple destinations, resulting in multiple TCP streams, then the streams will be distributed amongst the teamed NICs.” Based on this and other information out there, such as support forum replies, it is assumed that when you live migrate between two nodes in a cluster only one stream is active and you will never exceed the bandwidth of a single team member. When running some simple tests with a 10Gbps NIC team this seems true. We also know that you can consume nearly all of the aggregated bandwidth of the members in a NIC team for live migration if these conditions are met:

1. The live migrations must not all be destined for the same remote machine. Live migration will only use one TCP stream between any pair of hosts. Since both Windows NIC Teaming and the adjacent switch will not spread traffic from a single stream across multiple interfaces, live migration between host A and host B, no matter how many VMs you’re migrating, will only use one NIC’s bandwidth.

2. You must use Address Hash (TCP ports) for the NIC Teaming. Hyper-V Port mode will put all the outbound traffic, in this case, on a single NIC.

When we look at these conditions and compare them to the behavior we expect from the various forms of NIC teaming in Windows 2012, this is a bit surprising, as one might expect all members to be involved. So let’s take a look at some of the different NIC teaming setups.
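As a point of reference, creating such teams with the in-box Windows Server 2012 LBFO cmdlets looks roughly like the sketch below (the member NIC names are hypothetical):

# Switch independent team with Address Hash (TCP ports): outbound live migration
# streams can be spread across the team members.
New-NetLbfoTeam -Name "LM-Team" -TeamMembers "NIC1","NIC2" -TeamingMode SwitchIndependent -LoadBalancingAlgorithm TransportPorts

# The Hyper-V Port distribution mode discussed below pins all of a port's
# outbound traffic to a single member:
# New-NetLbfoTeam -Name "LM-Team" -TeamMembers "NIC1","NIC2" -TeamingMode Lacp -LoadBalancingAlgorithm HyperVPort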

Any form of NIC teaming with Hyper-V Port Mode

This one is easy, as condition 2 above very much applies. In all my testing with any NIC team configuration using the Hyper-V Port traffic distribution mode I have not been able to exceed 10Gbps. I have seen no difference between switch dependent static or LACP mode and switch independent (active-active) for this condition. As you can see in the screenshot below, the traffic maxes out at 10Gbps.

clip_image002

clip_image004

This is also demonstrated in the following screenshots taken with Resource Monitor, where you can see that only half of the bandwidth of the team is being used.

clip_image006

clip_image008

Exceeding a single NIC team member’s bandwidth when migrating between 2 nodes

The first condition of the previous heading doesn’t seem to hold entirely. In some easy tests with a low number of virtual machines and not too much memory assigned, you never exceed the bandwidth of one 10Gbps NIC team member. So on the surface, with some quick testing, it might seem that way.

But during testing on a 2 node cluster with dual port 10Gbps cards I found the following.

Switch Dependent LACP and Static

  1. Take a sufficient number of large memory virtual machines to exceed the capacity of a single 10Gbps pipe for a longer time (that way you’ll see it in the GUI).
  2. Live migrate them all from host A to host B (“Pause” with “Drain Roles”, or “Select all” + “Move”).
  3. Note that with a 2 node cluster there is no possibility to live migrate to multiple nodes simultaneously. It’s A to B, or B to A, or both at the same time.

Basically it didn’t take long to see well over 10Gbps being used. So the information out there seems to be wrong. Yes, we can leverage the aggregated bandwidth when we migrate from host A to host B, as long as we have enough memory assigned to the VMs and we migrate a sufficient number of them. Switch dependent teaming, whether it is static or LACP, does its job as you would expect.

Let’s think about this. The number of VMs you need to live migrate to see more than 10Gbps being used is not set in stone. Could it be that there is some intelligence in the live migration algorithm where it decides to set up multiple streams when a certain number of virtual machines with sufficient memory are migrated, based on the bandwidth that can be leveraged? Perhaps VMMS.EXE kicks off more streams when needed/beneficial? Further experimenting indicates that this is not the case. All you need is more than one VM being live migrated. When looking at this in Task Manager you do need them to be of sufficient memory size and/or migrate enough of them to make it visible. I have also tried playing with the number of allowed simultaneous live migrations (i.e. 4, 6 or 12) to see if this has an effect, but I did not find one.

It looks more like one TCP/IP connection per live migration, which is indeed tied to one NIC team member. So when you live migrate VMs between two hosts you see one VM’s live migration go over one member and the other over the other member, as static/LACP switch dependent teaming does its job. When you do enough live migrations of large VMs simultaneously you see this in Task Manager, as shown below. In this case, as each VM’s live migration stream sticks to a NIC team member, you do not need to worry about out-of-order packets impacting performance.

clip_image010

But to make sure, and to prevent falling victim to the limits of the Task Manager GUI while testing this behavior, we also used Performance Monitor to see what’s going on. This confirms we are indeed using both 10Gbps NIC team members on both the target and the source host server. This is even the case when live migrating just 2 virtual machines. As long as it’s more than one VM and the memory assigned is enough to make the live migration last long enough, you can see it in Task Manager; otherwise you might miss it. Performance Monitor, however, does not.

clip_image012

clip_image002[4]

clip_image004[4]

This is interesting and frankly a bit unexpected, as the documentation on this subject does not reflect it. However, it IS in agreement with the documented NIC teaming behavior for traffic other than live migration. We took a closer look and can reproduce this over and over again. Again, we tested both switch dependent static and LACP modes and found the behavior to be the same.

Switch Independent with Address Hash

Let’s test live migration over switch independent teaming with Address Hash. Here we see that the source server sends on the two members of the NIC team, but the target server receives on only one. This is normal behavior for switch independent teaming. But from the documentation we’d expect that one member on the source server would send and one member on the target server would receive. Not so.

Basically with Windows Server 2012 this doesn’t give you any benefit for throughput. You are limited to the bandwidth of one member, i.e. 10Gbps.

clip_image018

clip_image020

Red is Total Bytes Received on the target host; it’s clear only one member is being used. Green is Bytes Sent/sec on the source server; as you can see, both team members are involved. In a switch independent scenario the receiving side limits the throughput. This is in agreement with the documented behavior of switch independent NIC teaming with Address Hash.

Helpful documentation on this is Windows Server 2012 NIC Teaming (LBFO) Deployment and Management (A Guide to Windows Server 2012 NIC Teaming for the novice and the expert).

Hope this helps sort out some of the confusion.


Configuring Performance Options for Live Migration In Windows Server 2012 R2 Preview


New Options For Optimizing Live Migrations

In Windows Server 2012 R2 we have a whole range of options to tailor Live Migration to our environment and needs. Next to the new default (Compression) we can now also leverage SMB 3.0 (Multichannel, RDMA) for all forms of Live Migration, and not just for Shared Nothing Live Migration (see Shared Nothing Live Migration Leverages SMB 3.0 Under the Hood) or Storage Live Migration when both the source and the target are SMB 3.0 storage.

TCP/IP

Here you can use one NIC or a NIC team for bandwidth aggregation for live migration (see Teamed NIC Live Migrations Between Two Hosts In Windows Server 2012 Do Use All Members). This is the process you have known since Windows Server 2012. You can select multiple NICs or even teams of NICs, but only one of those (one NIC or one team) will be used. The other(s) will only be used when the first one is not available.

Compression

This option leverages spare CPU cycles to compress the memory contents of the virtual machines being migrated. Only then is it sent over the wire via a TCP/IP connection. This speeds up the live migration process. The process is CPU load aware, so it will only use idle cycles to protect the workloads on the hosts. This is the default setting in Hyper-V running on the Windows Server 2012 R2 Preview.

SMB

This setting will leverage two SMB 3.0 features. Multichannel and, if supported by and for the NICs involved, RDMA.

  • SMB Direct (RDMA) will be used when the network adapters of both the source and destination servers have Remote Direct Memory Access (RDMA) capabilities enabled.
  • SMB Multichannel will automatically detect and use multiple connections when a proper SMB Multichannel configuration is identified.

Where to set these options?

In Hyper-V Manager go to “Hyper-V Settings” in the Actions pane.

image

Expand the Live Migrations node under Server in the left pane (click the “+”) and select “Advanced Features”.

image

Select the desired option under “Performance Options”.

image
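If you prefer PowerShell over the GUI, a hedged sketch with the Windows Server 2012 R2 Preview Hyper-V module looks like this (see also Aidan Finn’s write-up linked below):

# Valid values are TCPIP, Compression (the default) and SMB.
Set-VMHost -VirtualMachineMigrationPerformanceOption SMB
# Check the current setting.
Get-VMHost | Select-Object VirtualMachineMigrationPerformanceOption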

Happy testing!

 

EDIT: Aidan Finn posted the PowerShell commands to configure the performance options in Configuring WS2012 R2 Hyper-V Live Migration Performance Options Using PowerShell The MVP community at work & it rocks Smile



An Early Look At Live Migration Over TCP/IP & Multichannel In Windows Server 2012 R2 Preview


Introduction

With Windows Server 2012 R2 (Preview) we can do Live Migrations over TCP/IP like before, either using a single NIC or by teaming two or more NICs. We also have compression and Multichannel. In this blog post we’ll play with TCP/IP and Multichannel.

  • We have a dual port 10Gbps Mellanox RDMA card (RoCE) in each host, but for these tests we have disabled the RDMA capabilities of these NICs. As in the RDMA blog post, one pair of ports is interconnected via a direct attach cable; the other pair is connected over a Force10 S4810 switch. We’re using in-box Windows Server 2012 R2 preview drivers for everything, as we have found vendor drivers not to install properly (or not at all) on this release and to cause issues.
  • We are using one VM running Windows 2012 RTM with upgraded Integration Services components. This VM has 4 vCPUs and 55GB of fixed memory assigned. For this purpose we had no workload running in the VM. The servers are standard DELL PowerEdge R720 kit running the Windows Server 2012 R2 Preview bits.

Results

No Performance tweaking

We test a live migration over one 10Gbps NIC. It’s fast, but I don’t like the jigsaw effect and we don’t push the bandwidth to the limit yet.

image

We can move the 55GB memory VM in about 70 seconds on average. You have a bit more CPU load here, but nothing too bad. Most often the Hyper-V host has ample CPU cycles left, so this will not hinder performance. I also remember Aidan Finn’s work testing a truckload of concurrent live migrations with a host that had only one low end CPU with 4 cores, making it throttle the number of live migrations it would start to safeguard the workload.

image

So let’s do what we’ve always done: turn on jumbo frames. This helps us peak at 1.25GB/s and improves speeds (10% or more), but the jigsaw is still a bit visible. As I think we can do better, we move in the big guns and optimize our power settings as discussed in Still Need To Optimizing Power Settings On DELL 12th Generation Servers For Lightning Fast Hyper-V Live Migrations? and Optimizing Live Migrations with a 10Gbps Network in a Hyper-V Cluster. Now with C & C1E states disabled and both processor & memory optimized for performance we see this.

image

Now that’s power. We have faster live migrations (54 seconds on average) with top bandwidth use during the entire migration process, and we see 50% better blackout times. What’s not to like here? CPU usage isn’t that bad and you’ll likely have some cycles to spare, unless you’re over 60-70% CPU use by your VMs, and then you need to fix that anyway Smile as you’re out of the safe zone. So, jumbo frames & power optimization are key!

Of course we’re always looking for better and more. In live migration terms that means speed. So let’s see what Multichannel can do for us and switch to SMB. As we have disabled RDMA on the NICs, this “only” gives us Multichannel. The cool thing is, the second NIC doesn’t have jumbo frames enabled yet. I have always found jumbo frames to matter, and now with Multichannel I have a very nice way of demonstrating/visualizing this. Here’s a screenshot of moving our test VM back and forth. As you can see, we have one NIC with jumbo frames disabled and one with jumbo frames enabled. You don’t have to guess which one is which, I guess. Yup, jumbo frames do matter Smile when you push to the limits. We are getting about 31 seconds on average here with the 55GB VM.

image

Here’s the same with jumbo frames enabled on both NICs. And guess what, we just cut another 3 seconds off the live migration time Smile. 28 seconds flat.

image

In a histogram it looks like this. That’s what maximum throughput looks like.

image

Let’s see what our CPUs are up to during all this. Some cores are rather busy dealing with the interrupts. But this is just one VM.

image

If you wonder why with 2*10Gbps you only see 2*4 CPUs doing work while the default number of RSS queues is 8, so you’d expect 16: it’s because Multichannel defaults to 4 connections per interface, so we get 8. This is configurable, and testing will show what difference it could make and whether it’s wise to tweak. It all depends.
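If you want to experiment with that, here’s a minimal sketch; the value of 8 is purely illustrative and the default of 4 is usually fine:

# Show the current number of SMB Multichannel connections per RSS-capable NIC.
Get-SmbClientConfiguration | Select-Object ConnectionCountPerRssNetworkInterface
# Raise it (illustrative value, test before using it in production).
Set-SmbClientConfiguration -ConnectionCountPerRssNetworkInterface 8 -Confirm:$false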

Sure, this is only one large memory VM, but what if we do more? Like 6 VMs with 9GB of memory each. Not too bad.

image

image

What if that host is running 30 or 40 VMs? That adds up. Well, that’s what RDMA is for Smile! But that’s yet another blog post.

Do keep in mind this is all just the Preview bits … MSFT does two things now until R2 is released: they kill bugs and tweak for speed. I tune my live migration settings in production so that I get the most bang for the buck, and I try to avoid dips in bandwidth like you see above. So the work is not finished yet Smile

Conclusion

I can conclude that all the hints & tips of the past to optimize live migration still hold true. Yes, you should enable jumbo frames and yes, you should still optimize your host for performance over power savings. That said, the days when you’d only get 16% of bandwidth usage out of a 10Gbps NIC when you power optimize have long gone, ever since Windows Server 2012. But if you feel the need for (even more) speed … then by all means go for it.

vlcsnap-2013-07-06-17h18m58s175

If you want to conserve energy & be environmentally sound, make the most of the least number of nodes possible and use Dynamic Optimization / Power Optimization to shut them down when not needed and fire them up to rise to the occasion Smile

Oh yes, test people, test. Trust but verify and determine the best possible configuration for both your environment and needs.

Now we’ll have a look at compression  … but again that’s another blog post!


Preliminary Results With Live Migration Over RDMA Speed & Useful Number Of NICs


Introduction

With Windows Server 2012 R2 (Preview) we can leverage SMB to do Live Migrations. That means we can now offload the process to the NICs if they support RDMA, save on CPU cycles and potentially get VMs moved a lot faster without impacting the performance of the running VMs on the involved hosts. Perhaps it’s even faster than over TCP/IP. Sounds great, so let’s do some testing.

  • We have a dual port 10Gbps Mellanox RDMA card (RoCE) in each host. One pair of ports is interconnected via a direct attach cable; the other pair is connected over a Force10 S4810 switch. We’re using in-box Windows Server 2012 R2 preview drivers for everything, as we have found vendor drivers not to install properly (or not at all) on this release and to cause issues.
  • We are using one VM running Windows 2012 RTM with upgraded Integration Services components. This VM has 4 vCPUs and 55GB of fixed memory assigned. For this purpose we had no workload running in the VM. The servers are standard DELL PowerEdge R720 kit running the Windows Server 2012 R2 Preview bits.

Results

No Performance tweaking

Live migration over RDMA in action. Here we are using one 10Gbps RoCE RDMA NIC, moving via the NIC port that goes over the S4810 switch.

image

As you can see, the entire process took 74 seconds. RDMA did not kick in until 19 seconds after the start.

The CPU load remains low, which is where you’ll find the biggest benefit of RDMA with live migrations.

image
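A quick, hedged sketch of how we check that RDMA is really in play rather than plain TCP/IP:

# Are the NICs RDMA capable and enabled?
Get-NetAdapterRdma
# Does the SMB client see them as RDMA capable?
Get-SmbClientNetworkInterface
# Are the active SMB connections actually using RDMA?
Get-SmbMultichannelConnection
# The "RDMA Activity" counter set in Performance Monitor shows the offloaded traffic.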

Now let’s put two RDMA RoCE ports into play and see what that does for us. We now live migrate the 55GB memory VM in 52-54 seconds. Not bad. Again we saw over 20 seconds pass before RDMA kicked in.

image

Again we see that CPU usage remains low. This is just a quick screenshot; on a Hyper-V node you’ll need to dive into Performance Monitor to get some real info.

image

Let’s repeat this exercise and see what happens if we move the traffic over the NIC ports that are directly attached. That will give us an indication about the configuration of the switch. Configuring RoCE DCB features like PFC/ETS is not exactly a well documented process at the moment, and often I feel like a magician’s apprentice.

Once more we see that it takes about 20 seconds for RDMA to kick in, and the time rises to 79 seconds. Actually, it fluctuates between 74 and 79 seconds.

image

The CPU load was low again. So both paths seem to perform comparably.

Live migrations over SMB seem to run faster using two RDMA ports, but not twice as fast. These are the preview bits, so nothing definitive yet. And sorry, I cannot do 40Gbps or 56Gbps InfiniBand tests. Unless you want to donate the gear and pay for the power, time & reporting Open-mouthed smile.

Max Performance Tweaking

As my readers know very well, I tweak my nodes for best performance. The savings in energy (power, cooling) have to come from making the most out of every node and shutting them down when not needed (Dynamic Optimization/Power Optimization in System Center). I still have a standing order to take away any physical limitations possible for the business.

While Windows Server 2012 (R2) has made tremendous strides towards better use of the available bandwidth of 10Gbps pipes out of the box, I still dive into the BIOS to turn off the C/C1E states and set the CPU Power Management and Memory Frequency to Maximum Performance. Have a look at this blog post, Still Need To Optimizing Power Settings On DELL 12th Generation Servers For Lightning Fast Hyper-V Live Migrations?, on how to do this with DELL Generation 12 servers. It also contains a link to the guidance for the older generations.

As you can imagine, I was quite interested to see if these settings affect RDMA as well. So let’s have a look with these settings in place:

image

One RDMA NIC used (Mellanox, RoCE, 10Gbps)

image

54 seconds for that 55GB memory (fixed) VM. We also note that the delay of 19-20 seconds before RDMA kicks in has dropped to 3-4 seconds, which is quite interesting. Basically this makes it as fast as 2 RDMA NICs without the performance tweaking.

Two RDMA NICs used (Mellanox, RoCE, 10Gbps)

image

30 seconds flat, repeatably, for that 55GB memory (fixed) VM. Again we note that the delay before RDMA kicks in is down from 19-20 seconds to 3-4 seconds. So this is about 45% better than without the power optimization.

What is the CPU doing during all this? Well, taking care of the VM load, not spending its time on network interrupts Smile. Again, this is a quick screenshot; on a Hyper-V node you’ll need to dive into Performance Monitor to get some real info.

image

By now you must all be eager to see how this compares against live migration over TCP/IP, Multichannel and Compression. That’s material for other blogs.

Why am I doing this?

We need to get the most out of every € or $ we spend. It’s not that we don’t have any cash left or so, but why buy more servers & higher end gear to get better results when the answer lies in correct configuration & better choices when designing a solution? It’s going to be a while before this knowledge becomes mainstream and widely available. Years, probably, so why wait? It takes time to experiment, but the results & ROI are great. Why spend another 50,000 to 100,000 Euro on servers, 10Gbps cards & switch ports if you don’t need to? Count the cost to host, power & cool them and you’ll see that this time is an investment. You could also conclude to leverage the cloud, but wasting VM cycles there is also money you have better uses for, so testing will be needed there as well.


Live Migration over NIC Team in Switch Independent Mode With Dynamic Load Balancing & Compression in Windows Server 2012 R2


In a previous blog post, Live Migration over NIC Team in Switch Independent Mode With Dynamic Load Balancing & TCP/IP in Windows Server 2012 R2, we looked at what the Dynamic load balancing mode in NIC teaming can do for us, especially in a switch independent configuration, as until now there was no possibility to leverage the complete bandwidth provided by the NIC team when migrating between only 2 nodes. In that blog we used TCP/IP. Now we’ll configure Compression and see what that does for us.

So we set up a NIC team in switch independent mode with Dynamic load balancing; it’s identical to the one used for the tests with TCP/IP.

Compression basically slashes the live migration times in half, at a cost: CPU cycles. And again, with Dynamic load balancing we can now also use all members of a NIC team for live migration, even in switch independent mode. Live migrating 6 VMs with 9GB of memory each simultaneously took 12-14 seconds.

image

Take a look at the screenshot above. You see 6 VMs coming in to the host where these counters are collected, and after that you see them being live migrated away from the host. As we have plenty of idle cycles in this test lab, they get used, both when being the target and the source of the VMs being live migrated. You can also see that a lot less bandwidth is needed to achieve a faster live migration experience (compared to TCP/IP).

By the looks of it, the extra bandwidth will help out when we have less CPU and vice versa. This is the case both for a single NIC and for teamed NICs. Do note that you cannot combine compression with Multichannel. That means that the only scenario allowing for multiple NICs to be used with compression is NIC teaming. When you have a bunch of free 1Gbps NICs in surplus, this might get things moving for you!

Interesting stuff. I’m really looking forward to the moment we can run production loads on these configurations …


Adventures In RDMA – The RoCE Path Over DCB To Windows Server 2012 R2 SMB 3.0 Glory


Prologue

On a gloomy day, dark, grey and cold, we gave battle with RoCE & DCB (PFC/ETS). The fight was a long one, the battlefield uncharted, and we had only our veteran attitude towards adversity to guide us through the switch configurations. It seemed that no man had gone that far to the edges of the Windows Server 2012 empire. And when it came to RoCE & DCB meets Didier, I needed to show it that it had been conquered, and I was reminded of a quote in Gladiator:

Quintus: People RoCE/DCB configs should know when they are conquered.
Maximus: Would you, Quintus? Would I?

image

After many, many lonely & unsuccessful hours dealing with Performance Monitor, switch configurations, reloads, firmware, drivers & Windows, we got results:

… “it’s working” … “holey s* look at those numbers” …

On that dark day, in a scarcely illuminated room, in the faint glare of the monitors, even the CLI of the switches in PuTTY felt like a grim, cold place. But all that changed as the impressive results brightened up the day and made all efforts seem worthwhile. “Didier Victor” I thought as I looked away from the screen, “Once more”.

image

But it has been a hard won victory. And should you fight it? Well, let’s discuss this a bit now that we’ve got your attention. RDMA is a learning process for many of us, and neither InfiniBand, iWARP nor RoCE is the one that needs to win at this game. It’s you, via the knowledge you’ll gain working with RDMA technologies.

SMB Direct or SMB over RDMA comes in flavors

Infiniband (Mellanox)

That’s been here for a while. It has a high cost associated with it (depends on where you come from) and also a psychological barrier. Try discussing buying 10Gbps versus InfiniBand with semi-technical managerial types. You’ll know what I mean.

Deploying Windows Server 2012 with SMB Direct (SMB over RDMA) and the Mellanox ConnectX-2/ConnectX-3 using InfiniBand – Step by Step

iWARP (Chelsio / Intel)

RDMA but it’s TCP/IP offloaded to the card. It can leverage DCB but doesn’t require it.

Deploying Windows Server 2012 with SMB Direct (SMB over RDMA) and the Chelsio T4 cards using iWARP – Step by Step

RoCE (Mellanox)

“InfiniBand over Ethernet” > so you “NEED” DCB with PFC/ETS (DCBx can be handy) for it to work best. No need for Congestion Notification as that’s for TCP/IP, but it could be nice with iWARP (see above). Do note that you’ll need to configure your switches for DCB, and that’s highly dependent on the vendor & even the type of switch.

Deploying Windows Server 2012 with SMB Direct (SMB over RDMA) and the Mellanox ConnectX-3 using 10GbE/40GbE RoCE – Step by Step

Here’s an older overview of the pros & cons of the RDMA flavors:

image

Please see Jose Barreto’s excellent work on explaining SMB 3.0 over RDMA in his presentations at SNIA, TechEd and on his blog.

While I have heard of two people in my network working with InfiniBand for SMB Direct and Windows Server 2012 (R2), most of us are doing 10Gbps. Pricing for InfiniBand has a bad reputation. Not because InfiniBand is super costly compared to 10/40Gbps (I’m told most people who ask for quotes are positively surprised), but when you can’t afford a Porsche you’re not shopping for a Ferrari either. Especially not when a mid-size sedan will serve all of your needs above and beyond the call of duty. On top of that you might have bought all that nice “converged network ready” 10Gbps gear some years ago. Some of us may be working towards 40Gbps, but most are 10Gbps shops. My 40Gbps is “limited” to the interlinks & uplinks. Meaning that we either go for iWARP or RoCE.

RoCE or iWARP

Which one of those two is best? Well, the line is drawn between vendors. RoCE today equals Mellanox (yes, the InfiniBand vendor; RoCE is sometimes called “InfiniBand layer 4 over Ethernet layer 2”) and iWARP means Chelsio or Intel (their cards look a bit long in the tooth, however).

You’ll find comparisons by both vendors claiming superiority for varied reasons. Here’s the Mellanox side http://www.mellanox.com/pdf/whitepapers/WP_RoCE_vs_iWARP.pdf & here’s Chelsio’s take http://www.chelsio.com/roce/ & http://www.moderntech.com.hk/sites/default/files/whitepaper/V09_iWAR_Summary_WP_0.pdf. It’s good to look at your needs and map them. But I cannot declare a winner. I did notice that at least one vendor of SOFS/CiB uses iWARP. Is that a statement? And if so, about what? Price? Ease of use? Performance/cost?

What I do find is that Chelsio is really hacking into RoCE, as you can see here: http://www.chelsio.com/wp-content/uploads/2011/05/RoCE-The-Grand-Experiment1.pdf, http://www.chelsio.com/roce-whitepaper/, http://www.chelsio.com/wp-content/uploads/2011/05/RoCE-FAQ-1204121.pdf. So that begs the question: are they right, or are they scared of RoCE, as the InfiniBand boys are out to eat their lunch?

My take on this for now

iWARP is way easier to get started with, that’s for sure. RoCE is firmware sensitive (NIC, switches) and driver sensitive (NIC). Configuring your switches (DCB) is usually followed by a reboot of that switch, so you might not do that so easily in production, and depending on where in the stack those switches live you really need Force10 VLT or Cisco vPC, or independent redundant switches, to get away with it. RoCE loves green field. Stacking, I hear you say? I don’t like stacking at that spot of the stack, as firmware updates will get you …

Disclaimer: RoCE in itself does not DEMAND/REQUIRE DCB, but the consensus is that it will work better, especially under heavy load. Whether SMB Direct over RoCE requires DCB is another question. For all practical purposes I’m working from the prerequisite that it does for a production environment. But as you can do RoCE RDMA between two NICs with no DCB switch in between, this indicates that the hard requirement for DCB is not there. Mind you, not using DCB might not be smart in regards to QoS & error handling (no TCP/IP goodness handling this for you). But I’m no expert on this subject. Paul Grun however is, and he’s involved with RoCE at https://www.openfabrics.org/component/search/?searchword=Paul+grun&ordering=&searchphrase=all They tend to know their stuff. Read some of the comments below this article and you’ll know a lot: http://www.hpcwire.com/hpcwire/2010-04-22/roce_an_ethernet-infiniband_love_story.html

iWARP doesn’t require DCB, so you can get away with cheaper switches. Or not so cheap switches that don’t support DCB (choose wisely). So cheaper switches is probably true on the low end. But even very economically priced switches from DELL have good DCB support, while some other, more expensive vendors don’t.

DCB is uncharted terrain for SMB Direct purposes & new to many of us. So if you want to do RDMA the easy way … go iWARP. As said, the use of DCB for PFC/ETS is not mandatory in that case; you’ll get great results and it’s easy. Mind you, you’ll still be dabbling with DCB if you want to do lossless magic in the switches Smile. Why, you say? Well, that “converged network” story makes it kind of interesting to do so, and PFC, DCBx/TLV is generic and can be leveraged for other things than iSCSI or FCoE. And for all practical purposes SMB 3.0 with SMB Direct is a storage protocol since Windows Server 2012 made it so (CSV). Or do you do DCB for iSCSI/FCoE & iWARP for SMB Direct? After all, there are only 2 lossless queues to be had. But hey, how many do you need? Choices, choices, and no vast pool of experienced practitioners yet.

iWARP routes; it’s not bound to a single Ethernet broadcast domain. That could be useful info depending on your environment & needs. I’ll note that I leverage RDMA for east-west traffic, not north-south, and as such this is not an issue for me. The time when I do a “Shared Nothing Live Migration" from on premises to the cloud has not arrived yet.

The Mellanox cards in my neck of the woods were 35% cheaper than Chelsio (SFP+)

What about scalability? “iWARP doesn’t scale that well” is stated left and right, but I think that is often based on older information. Chelsio makes a strong case for iWARP scalability, especially when it comes to long distances, multiple hops & routing.

Again, your mileage may vary. But for “the smaller environments” that want to leverage RDMA with SMB 3.0, I’d say that iWARP is the easiest path to go & will do just fine. Now if you’re already into lossless Ethernet for iSCSI or working with FCoE, you might have all the hardware you need & the experience to deal with DCB. The latter might not always be true, however. Most people have lossless Ethernet for iSCSI or FCoE set up by the vendor or by consultants who use well defined step-by-step guides. These do not exist for the RoCE variant of SMB 3.0 over RDMA.

The case for RoCE can be made as well. Some claim that a high volume of connections consumes memory when using iWARP, and that TCP’s flow and reliability controls are less suited for large scale datacenters & cloud deployments due to performance issues. And where iWARP does not know multicast, RoCE does, and that could be important to you.

So why did I or still do RoCE?

So why did I walk the walk? Basically because just talking the talk isn’t enough. We considered it an investment in our education. DCB is not going away (the abstraction isn’t there yet and won’t be for a while) and we need to gain knowledge of it to both handle it and make informed decisions. By the way, once you go lossless you might leverage DCB/PFC with iWARP as well, just like you do for iSCSI (leveraging DCBx/TLV). Keep in mind that DCB is key in converged networking and as such deserves your attention. That’s why I chose not to avoid it but gave battle. DCB is all over the place when it comes to converged networking, so we need to learn the good, the bad and the ugly. Until the day that, perhaps, the hardware stack is so good, so powerful & has so much bandwidth that TCP/IP never needs its built-in protection against packet loss. Hmmmmmm, I remember people saying that about 10Gbps, but then they wanted to send everything over 2*10Gbps pipes and it became an issue again.

It’s early days yet, but you have to give Microsoft credit for getting RDMA/DCB on the radar screen of the world’s virtualization and storage admins more than ever before. It’s not a well-established segment yet and it will be interesting to see how this all turns out. I do know that now that I’ve figured out a thing or two about RoCE, I won’t be intimidated and won’t make choices out of fear. And do remember that if you have plenty of idle CPU cycles and 10Gbps you might not even need RDMA. The value for me and my employers is the knowledge gained. DCB has its role to play, and we’ll leverage iWARP or RoCE without a fixed preference. Today you have two choices: RoCE is the newer one, iWARP has been around longer, and both have avid proponents it seems.

I know one thing: if you need or want RDMA in an existing 10Gbps environment with minimal effort and no risk to the existing switch infrastructure, you’ll use iWARP.

Epilogue

You sit there staring at a truckload of VMs with 120GB of memory assigned in total being evacuated in roughly 70 seconds, doing a Shared Nothing Live Migration between the same hosts without consuming CPU, and with DCB for SMB 3.0 running on your switches … Yes!


Remember, “What we do in life echoes in eternity.” You might think by now that I’m a bit nutters, but I assure you that in my quest to find someone with hands-on experience configuring DCB on switches for SMB Direct with RoCE, I had to turn to myself. I’ll be sharing more info on our setup and configurations in the future. Once you wrap your head around the concepts, you understand why things are done and how. Therein lies the value for me.


Preventing Live Migration Over SMB Starving CSV Traffic in Windows Server 2012 R2 with Set-SmbBandwidthLimit


One of the big changes in Windows Server 2012 R2 is that all types of Live Migration can now leverage SMB 3.0 if the right conditions are met. That means Multichannel and SMB Direct (RDMA) come into play more often and simultaneously. Shared Nothing Live Migration and certain forms of Storage Live Migration are usually planned well in advance due to their nature, so one can mitigate the risk by planning. Good old standard Live Migration of virtual machines, however, is often less planned: it can be kicked off by Cluster Aware Updating, by evacuating a host for hardware maintenance, or by Dynamic Optimization, which means it’s often automated as well. As we have demonstrated many times, Live Migration can easily fill 20Gbps of bandwidth. If you are sharing 2*10Gbps NICs for multiple purposes like CSV, LM, etc., Quality of Service (QoS) comes into play. There are many ways to achieve this, but in the example here I’ll be using DCB for SMB Direct with RoCE.

# Tag SMB Direct traffic (port 445) with 802.1p priority 4
New-NetQosPolicy "CSV" -NetDirectPortMatchCondition 445 -PriorityValue8021Action 4
# Make priority 4 lossless via Priority Flow Control
Enable-NetQosFlowControl -Priority 4
# Give that priority its own ETS traffic class with a 40% bandwidth share
New-NetQosTrafficClass "CSV" -Priority 4 -Algorithm ETS -Bandwidth 40
# Enable DCB/QoS on both 10Gbps RDMA NICs
Enable-NetAdapterQos -InterfaceAlias SLOT41-CSV1+LM2
Enable-NetAdapterQos -InterfaceAlias SLOT42-LM1+CSV2
# Keep the locally defined settings instead of accepting them from the switch via DCBX
Set-NetQosDcbxSetting -Willing $False
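
To sanity-check a configuration like the one above, the read-only cmdlets from the standard NetQos/DcbQos, NetAdapter and SmbShare modules come in handy. A quick sketch of what I’d look at on each node (nothing here changes any settings):

# Inspect the QoS policy, PFC, ETS classes and DCBX willingness configured above
Get-NetQosPolicy
Get-NetQosFlowControl
Get-NetQosTrafficClass
Get-NetQosDcbxSetting
# Confirm QoS and RDMA are enabled on the NICs
Get-NetAdapterQos
Get-NetAdapterRdma
# Once traffic is flowing, verify SMB is actually using the RDMA-capable interfaces
Get-SmbMultichannelConnection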

Now, as you can see, I leverage 2*10Gbps NICs, not teamed, as I want RDMA. I still have failover/redundancy/bandwidth aggregation thanks to SMB 3.0 Multichannel. This works like a charm. But when leveraging Live Migration over SMB in Windows Server 2012 R2, we note that the LM traffic also goes over port 445 and as such is dealt with by the same QoS policy on the server and in the switches (DCB/PFC/ETS). So when both CSV and LM traffic are flowing, how does one prevent LM from starving CSV traffic? Especially in Scale-Out File Server scenarios this could be a real issue.

The Solution

To prevent LM traffic from hogging all the SMB bandwidth and ruining the CSV and SOFS party, Microsoft introduced some new capabilities in Windows Server 2012 R2. In the SMBShare module you’ll find:

  • Set-SmbBandwidthLimit
  • Get-SmbBandwidthLimit
  • Remove-SmbBandwidthLimit


To use this you’ll need to install the feature called SMB Bandwidth Limit via Server Manager or using PowerShell: Add-WindowsFeature FS-SMBBW

You can limit SMB bandwidth per category: VirtualMachine (storage IO to a SOFS), LiveMigration and Default (all the rest). In the example below we cap Live Migration at roughly 8Gbps.

Set-SmbBandwidthLimit -Category LiveMigration -BytesPerSecond 1000MB
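
If you’re wondering where such numbers come from: the parameter takes bytes per second, so divide the bit rate you’re aiming for by eight. A hedged little helper, just to show the arithmetic (the variable names are mine, not part of the cmdlet):

# An 8Gbps target works out to 1,000,000,000 bytes/s, which the 1000MB above approximates
$targetGbps = 8
[uint64]$bytesPerSecond = $targetGbps * 1000000000 / 8
Set-SmbBandwidthLimit -Category LiveMigration -BytesPerSecond $bytesPerSecond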

So there you go, we can prevent Live Migration from hogging all the bandwidth. Jose Barreto mentions this capability in his recent blog post on Windows Server 2012 R2 Storage: Step-by-step with Storage Spaces, SMB Scale-Out and Shared VHDX (Virtual). But what about Fibre Channel or iSCSI environments? It might not be as big a deal there as in the SOFS scenario, but still. As it turns out, Set-SmbBandwidthLimit also works in those scenarios. I was put on the wrong track by thinking it was only for SOFS scenarios, but my fellow MVP Carsten Rachfahl kindly reminded me of my own mantra “Trust but verify” and as a result I can confirm it even works to cap Live Migration traffic over SMB that leverages RDMA (RoCE). So don’t let the PowerShell module name (SMBShare) fool you; it applies to all SMB traffic within those categories.
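
And because the limits are easy to inspect and to roll back, here’s a short sketch using the three built-in categories:

# Show which limits are currently in place (Default, VirtualMachine, LiveMigration)
Get-SmbBandwidthLimit
# Lift the Live Migration cap again when it’s no longer needed
Remove-SmbBandwidthLimit -Category LiveMigration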

So without a limit, Live Migration can use all the bandwidth (2*10Gbps).


With Set-SmbBandwidthLimit -Category LiveMigration -BytesPerSecond 1250MB you can see we max out at 10Gbps (2*5Gbps).

Some Remarks

I’d love to see a minimum bandwidth implementation of this (one that could include a safety buffer for spikes in CSV traffic with SOFS). A hard cap might lead to some wasted bandwidth, and in other scenarios it could still get you into trouble. What if you have 2*10Gbps available, you capped Live Migration traffic at 16Gbps and one of those NICs dies on you? With one NIC gone you’re potentially in trouble until it has been replaced. OK, this is not a daily occurrence, and depending on your environment and setup it is more or less of a potential issue.
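
If that worries you, one pragmatic workaround is to tighten the cap yourself when a link goes down. A rough sketch, assuming my two RDMA NIC aliases from earlier and a fallback value of 500MB/s (roughly 4Gbps) that you’d pick to suit your own environment:

# If fewer than two of the RDMA NICs are up, lower the Live Migration cap
$rdmaNics = Get-NetAdapter -Name "SLOT41-CSV1+LM2","SLOT42-LM1+CSV2"
if (($rdmaNics | Where-Object Status -eq 'Up').Count -lt 2) {
    Set-SmbBandwidthLimit -Category LiveMigration -BytesPerSecond 500MB
}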

