VMware HCX – common misconceptions and unsupported configuration

This is a post covering some things which I regularly see in the field. I’ll endeavour to keep this updated as new releases come out. The below is currently correct (as far as I know) with HCX 4.6. Always refer to the HCX documentation for the source of truth. This is not an official VMware post, it is based on my experience and my own interpretation of the documentation.

Whilst it may be a bit doom and gloom, the post is designed to help you use the product for your migrations and have a successful project. HCX is a fantastic toolset and it has been used to migrate thousands upon thousands of workloads all over the world.

It’s a bit wordy, sorry for that! Because of that I’ll list a few things you can’t do, and I’ll explain further below if you want more information. This post is by no means comprehensive, it aims to cover the common questions I see asked.

Set expectations early:

  • You can’t live migrate a VM running on a newer CPU generation than the destination unless it is a reverse migration of a VM that had per-VM EVC enabled during its forward migration with HCX, or the VM has per-VM EVC enabled, or the source cluster has EVC enabled at the same or a lower level than the target.
  • P2V using HCX OSAM is not what the product is for; OSAM is designed to migrate from KVM or Hyper-V environments.
  • Whilst you potentially can migrate native public cloud VMs using HCX, it is currently unsupported and the behaviour may be unexpected. Tread with caution.
  • You can’t have more than 300 parallel migrations per HCX Manager. Update 21 Nov 2023 – You can upscale version 4.7 and above to support up to 600, please see KB 93605.
  • You can’t have more than 200 parallel migrations in a single Service Mesh.
  • You can’t deploy multiple IX appliances in a single Service Mesh. Well, you can, but subsequent IX appliances will not be used by the mesh. Update 21 Nov 2023 – with HCX 4.8 or later this is now possible, and you can choose which Service Mesh to use for the migration.

EVC – You cannot overcome the rules of EVC using HCX. Many seem to get confused because HCX allows you to enable per-VM EVC. To be clear, this means only one thing – if you live migrate forward to a newer CPU generation host, this flag enables you to live migrate back to the host or cluster the VM came from, enabling true mobility even if the VM is power cycled. It does not allow you to migrate a VM from a host with a newer generation CPU to an older one. Wait, what? Why would we do that? Well, whilst fairly uncommon, there are situations where the source environment has newer CPUs than the destination. I see this mostly with public cloud providers, as the hardware refresh cycle can be slower than on-premises. For example, it was still possible until recently to purchase i3 hosts in VMC on AWS on a 3 year term; these hosts use Broadwell CPUs, which are ~9 years old now. Another case is where you want to move your workloads to another public cloud provider, or even back to on-premises, where differing CPU generations can cause problems for live migrations.
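To illustrate the direction rule, here is a minimal Python sketch that simply orders Intel EVC baselines and checks whether a live migration is possible – the key names mirror vSphere’s intel-* EVC mode identifiers but are used here purely for illustration, not as an API:

```python
# Rough sketch: decide whether a live migration is possible between two Intel
# EVC baselines. The ordering reflects Intel CPU generations; the strings are
# illustrative values styled after vSphere's "intel-*" EVC mode keys.
INTEL_EVC_ORDER = [
    "intel-merom", "intel-penryn", "intel-nehalem", "intel-westmere",
    "intel-sandybridge", "intel-ivybridge", "intel-haswell",
    "intel-broadwell", "intel-skylake", "intel-cascadelake", "intel-icelake",
]

def live_migration_possible(source_level: str, destination_level: str) -> bool:
    """A VM can only move live to a destination whose EVC baseline is the same
    generation or newer than the level the VM is currently running at."""
    return INTEL_EVC_ORDER.index(source_level) <= INTEL_EVC_ORDER.index(destination_level)

# Example: a Skylake-level source cannot live migrate to Broadwell hosts,
# but the reverse direction is fine.
print(live_migration_possible("intel-skylake", "intel-broadwell"))   # False -> use Bulk Migration
print(live_migration_possible("intel-broadwell", "intel-skylake"))   # True
```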

EVC Workaround – whilst it is not possible to live migrate ‘backwards’ to an older generation, it is still possible to use Bulk Migration, and the downtime is similar to a normal VM restart. Bulk Migration is my favoured approach as it offers plenty of benefits, such as near-immediate rollback, and in most cases application owners can be brought around to accepting minor downtime.

As an alternative, you can enable EVC in the source cluster to match the level at the destination. Unfortunately it is not possible to do this on a running cluster, as all VMs have to be powered off, but depending on the environment it may be worth looking into if live migrations are critical to you.

OSAM migration – whilst it might be possible to migrate EC2, Azure, Google, Oracle or any other public cloud VMs using VMware HCX, it is currently unsupported. You are able to migrate Hyper-V and KVM VMs from an on-premises environment only. It may work if you meet the networking requirements to reach the Sentinel Gateway, but the results may be unpredictable. I have been testing this and will hopefully share a post soon, on the premise that it is for experimental lab testing only. Use vCenter Converter for native public cloud VMs where the product supports it.

Coming back to on-prem, if we have a full KVM or Hyper-V deployment and want to migrate into a public cloud such as VMC on AWS, we require HCX components to be deployed on-prem too, which also means a vCenter Server and at least one (I’d recommend two!) vSphere host. What might be possible here is to ‘borrow’ hosts from the environment we want to migrate from, or use some other older hardware, as long as the installed versions of the products are supported from an HCX point of view – see here. What you can do is ‘drain’ the source system and add more hosts to the temporary vSphere environment as fewer are required in the source.

If you are migrating from on-prem Hyper-V/KVM into on-prem vSphere, you can deploy a single site HCX topology which I have recently blogged about. The topology is currently unavailable in public cloud.

Migrating from Nutanix may also work technically, but it is currently unsupported. The same goes for any other solution that uses KVM underneath. However, in these situations where it is not plain KVM on Linux metal, I would strongly advise testing first, because different systems employ different management stacks which could interfere with the OSAM workflow, or use some derivative of the KVM/QEMU stack.

P2V using OSAM – You might be thinking that OSAM is just an agent, what’s to stop me from deploying it onto a physical server? Whilst it may technically work, it is unsupported and again there may be unpredictable behaviour. There was a gap where vCenter Converter was not available – it is now, so use that instead.

Guest customisation – I have seen some instances where guest customisation fails when the version of VMware Tools running in the guest was not installed from vSphere; for example, the majority of Linux distributions use open-vm-tools. I have seen it quite often where running any customisation fails, such as changing the IP address. I’m going to lab this as I haven’t managed to test it fully, but I do see a lot of chatter about it. I don’t think this is necessarily an issue with HCX, but rather the guest itself.

Bulk Migration/RAV issues due to vSphere Replication network design – this is quite common in VxRail deployments, where vmk0 is used for VxRail Management and not vSphere management as is usually the case in regular deployments. If for some reason the source host cannot talk to the IX appliance using the vSphere Replication VMkernel, then the migration will fail. What you need to do is tag vmkX with vSphere Replication traffic as well as NFC, ensure that the underlay network can steer the replication traffic to the IX appliance, and if necessary use HCX static routes; a rough example of checking the tagging is shown below. Thanks to Bilal for helping me with this.
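Here is a rough pyVmomi sketch for checking that tagging – it assumes you already have a connected ServiceInstance called si, and the traffic type strings are the standard vSphere ones, so verify them against your API version:

```python
# Rough sketch: list which VMkernel adapters on each host are tagged for
# vSphere Replication and vSphere Replication NFC traffic.
# Assumes 'si' is an existing pyVim.connect.SmartConnect session.
from pyVmomi import vim

def replication_vmks(host):
    """Return the vmk devices tagged for each replication traffic type."""
    nic_mgr = host.configManager.virtualNicManager
    tagged = {}
    for traffic_type in ("vSphereReplication", "vSphereReplicationNFC"):
        cfg = nic_mgr.QueryNetConfig(traffic_type)
        selected = set(cfg.selectedVnic or [])   # keys of adapters currently tagged
        tagged[traffic_type] = [v.device for v in (cfg.candidateVnic or []) if v.key in selected]
    return tagged

content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)
for host in view.view:
    print(host.name, replication_vmks(host))
```

If a source host shows no vmk tagged for vSphere Replication, that is the first thing to fix before blaming the Service Mesh.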

Single vCenter deployments – there are two supported configurations for HCX in a single site. This means there is only a single vCenter Server to register the HCX Cloud Manager and Connector to. I’ve recently blogged about both.

First is Single vCenter Bulk Migration. This is where we want to mass migrate VMs from one cluster to another using Bulk Migration, such as Intel to AMD or vice versa.

The second supported scenario is using HCX OSAM, where we have a vSphere environment we want to migrate into, and a Hyper-V or KVM environment we want to migrate out of. As long as the networking requirements (especially Sentinel Agent to Manager) are satisfied, it is supported.

Migration limits – These are updated regularly on the HCX Config Maximums page. At the time of this post, we can parallel migrate up to 200 VMs using either Bulk or RAV, if you have the bandwidth for it. If we have multiple source and/or destination clusters, we can create Compute Profiles per cluster rather than at the vSphere Datacenter level and deploy additional Service Meshes. However, the maximum per HCX Manager is 300 parallel migrations. Any more than this and you will likely see unpredictable behaviour and potentially failed migrations. Plan your migration waves accordingly around the published limits of the product; a rough wave-sizing sketch is shown below.
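As a rough planning aid, here is a minimal wave-sizing sketch in Python using the limits above as constants – the numbers are the ones discussed in this post, so check the Config Maximums page for your version before relying on them:

```python
# Minimal wave-planning sketch using the limits discussed above as constants.
MAX_PER_SERVICE_MESH = 200   # parallel Bulk/RAV migrations per Service Mesh
MAX_PER_HCX_MANAGER = 300    # parallel migrations per HCX Manager (600 with 4.7+ scaled out, per KB 93605)

def plan_waves(vm_names, service_mesh_count):
    """Split a VM list into waves that stay inside both limits."""
    wave_size = min(MAX_PER_SERVICE_MESH * service_mesh_count, MAX_PER_HCX_MANAGER)
    return [vm_names[i:i + wave_size] for i in range(0, len(vm_names), wave_size)]

vms = [f"vm-{n:04d}" for n in range(1, 751)]
for idx, wave in enumerate(plan_waves(vms, service_mesh_count=2), start=1):
    print(f"Wave {idx}: {len(wave)} VMs")
# Wave 1: 300 VMs, Wave 2: 300 VMs, Wave 3: 150 VMs
```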

If you have a single Service Mesh, deploying additional IX appliances will not increase throughput, and it is not supported. You can have one Service Mesh per Compute Profile pair only, and you cannot configure a Compute Profile on a cluster which already has one configured. I’m going to do a diagram for this soon and update the post to help explain it better.
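In the meantime, here is a toy sketch of those pairing rules as plain data-structure checks – purely illustrative, not an HCX API, and the profile and cluster names are made up:

```python
# Toy validator for the constraints above: one Compute Profile per cluster,
# and at most one Service Mesh per (source, destination) Compute Profile pair.
from collections import Counter

def validate_topology(compute_profiles, service_meshes):
    """compute_profiles: {profile_name: cluster_name}
       service_meshes:  [(source_profile, destination_profile), ...]"""
    errors = []
    for cluster, count in Counter(compute_profiles.values()).items():
        if count > 1:
            errors.append(f"Cluster '{cluster}' has {count} Compute Profiles; only one is allowed")
    for pair, count in Counter(service_meshes).items():
        if count > 1:
            errors.append(f"Compute Profile pair {pair} has {count} Service Meshes; only one is allowed")
    return errors

print(validate_topology({"cp-src": "ClusterA", "cp-src2": "ClusterA", "cp-dst": "ClusterB"},
                        [("cp-src", "cp-dst")]))
```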

Maintenance Windows during migration waves – I understand that migrations are highly complex and take an awful lot of planning and execution. I also understand that they can take a long time in larger environments. HCX has a pretty aggressive release cycle; for example, 4.6 has just been released and all minor versions of it will be supported for one year only. It is strongly recommended to have maintenance windows if you are migrating lots of VMs, to allow you to upgrade all HCX components and maintain supportability. Another good byproduct is that you won’t end up in a situation where you need a multi-step upgrade. It’s good practice to keep your software up to date, for new features, PR fixes, and patches for any security-related vulnerabilities.

Consider application-based replication for heavy hitters – VMs with a high churn rate that HCX may struggle to migrate. There are technologies such as SQL Server Always On replication or DFSR with which we can migrate workload data into an empty vessel in the target environment from within the guest. This may be favoured by the various teams as an alternative to HCX, and in some cases it can work better. Always talk to the application owners and infrastructure teams prior to planning migrations.

Touching on built-in replication, I would strongly advise that you do not migrate Microsoft Domain Controllers (DCs), the reason being the risk if there is some kind of issue or you need to roll back. It would be far more sensible to deploy a new one in the destination environment and move FSMO roles as required once replication has finished. When doing so, ensure you configure AD Sites & Services accordingly, and if necessary update the NTP configuration if you move the PDC Emulator role. Once complete, demote and decommission the old one. Think of this as a win/win situation: you are migrating to a nice new environment, and you have an opportunity to upgrade the underlying OS of your DCs or other software components.

Bulk Migration and RAV – Seed in advance! Whilst there are published figures for VM change rate and bandwidth, it is extremely difficult (borderline impossible) to calculate how long X amount of VM data will take to migrate over a network link with Y bandwidth available. It is far from as simple as using a data copy calculator to work it out. On my projects I usually try to start seeding data two weeks in advance of the agreed cutover date. This should give ample time for everything to replicate within the constraints of the Service Mesh and Manager limits. Your mileage may vary; for example, migrating 200 large and busy databases may take longer than two weeks, and I would not recommend that approach at all. In all cases it might be prudent to slow the cadence of backups, or even shut some services down, as the migration window approaches.
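For a very rough sanity check of seeding time, here is a back-of-the-envelope Python sketch – it assumes a steady effective throughput and a constant daily change rate, which real environments rarely have, so treat the output as indicative only:

```python
# Back-of-the-envelope seed-time estimate. Real-world results depend on churn
# patterns, compression, and the Service Mesh limits, so this is a sanity
# check only, not a promise.
def estimate_seed_days(total_gb, effective_mbps, daily_change_gb=0.0):
    gb_per_day = effective_mbps / 8 / 1024 * 86_400   # Mbit/s -> GB per day
    net_per_day = gb_per_day - daily_change_gb        # what actually converges each day
    if net_per_day <= 0:
        raise ValueError("Replication cannot keep up with the daily change rate")
    return total_gb / net_per_day

# Example: 50 TB of VM data over a link giving ~1 Gbit/s effective, with 500 GB/day of churn
print(f"{estimate_seed_days(50 * 1024, 1000, 500):.1f} days")
```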

HCX Network Extension – If you are stretching networks then try to focus on evacuating VLANs one at a time. HCX makes it incredibly easy to bridge L2 networks from one location to another, but bridged networks are not a good idea long term. Most network engineers I have worked with have never liked them, and I’m not a big fan of them either. It is not uncommon for stretches to be left open far longer than required, and this can cause issues with HCX upgrades, as upgrading an NE appliance means a network outage if you are not using NE HA. If you plan your waves with evacuating VLANs in mind, you are much more likely to be in a position to un-stretch the networks and cut over the gateway sooner.

MON – Since I’m covering stretched networks, a quick word on Mobility Optimized Networking (MON). It is a great tool and can significantly reduce the latency between VMs on the far side of the stretch, or for reachability into public cloud services if we have migrated to a public cloud.

With MON and VMC on AWS, you can only egress out of the Tier-1 to public endpoints; this is clearly documented. If you have workloads in your connected VPC, MON unfortunately won’t be able to help you here due to the way VMC on AWS hosts work. The network would have to be unstretched, and then you can have reachability using the vTGW/TGW.

MTU – There’s a big whitepaper that goes into this in some detail. In short, HCX adds a 150-byte header to any traffic between IX or NE appliances, which means that traffic can fragment. This is particularly important for Network Extension, where some applications are particularly sensitive to fragmentation. HCX TCP Flow Conditioning is extremely helpful, but it won’t help if the underlay network is misconfigured, and it doesn’t help UDP traffic. HCX often gets blamed for ‘network issues’ when in the majority of cases it’s the network underlay at fault; HCX wrongly gets the blame when all it is doing is highlighting existing issues. Fragmentation also plays a big part in migrations, where you can experience slowdowns. Thoroughly test the underlay network and set the HCX Uplink MTU values accordingly. I plan on doing another blog post covering MTU in the future.
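To put some numbers on it, here is a tiny Python sketch showing how the ~150-byte overhead interacts with the uplink MTU – the overhead figure is the one discussed above, so verify it against the HCX documentation for your version:

```python
# Quick check of how the HCX transport overhead interacts with the uplink MTU.
HCX_OVERHEAD_BYTES = 150   # approximate header added between IX/NE appliances

def max_guest_payload(uplink_mtu):
    """Largest inner packet that fits in one uplink frame without fragmenting."""
    return uplink_mtu - HCX_OVERHEAD_BYTES

def will_fragment(inner_packet_size, uplink_mtu):
    return inner_packet_size > max_guest_payload(uplink_mtu)

# A standard 1500-byte guest packet over a 1500-byte uplink will fragment,
# which is why you either raise the uplink MTU (e.g. 9000 where the underlay
# supports it) or rely on TCP Flow Conditioning, which helps TCP but not UDP.
print(will_fragment(1500, 1500))   # True
print(will_fragment(1500, 9000))   # False
```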

Just because you can, doesn’t mean you should – whilst it is within the published limits (or lack thereof) to, for example, stretch a network from the US to London, and then again to Singapore, you shouldn’t. The latency is going to be unworkable for anything apart from very simple testing applications. Network stretching should really only be used where the round-trip time is ideally under 10 ms, and probably 20-30 ms at most, again depending entirely on the sensitivity of the application. MON can help here, but it’s not going to change the laws of physics.
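If you want a quick-and-dirty feel for the round-trip time before committing to a stretch, something like the sketch below can help – the endpoint is a placeholder, and a proper assessment should use ping/iperf or similar from the actual application networks:

```python
# Crude RTT check: time a few TCP connects to something listening at the far
# side. The host and port here are placeholders for illustration only.
import socket
import time

def tcp_rtt_ms(host, port, samples=5):
    results = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=2):
            pass
        results.append((time.perf_counter() - start) * 1000)
    return sum(results) / len(results)

rtt = tcp_rtt_ms("app-server.remote-site.example", 443)   # placeholder endpoint
print(f"average RTT ~{rtt:.1f} ms -> {'probably workable' if rtt <= 30 else 'reconsider stretching'}")
```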

Thanks for reading.
