The HCX team have just released version 4.3 and there are some really nice new features. Go ahead and check out the release notes here. One such addition is High Availability for HCX Network Extension, and it is something which every customer I have worked with has requested or asked about. As such, it is most welcome and will be well received by our customers.
In order to benefit, all managers and service mesh appliances must be at version 4.3, and you will need a spare NE appliance. Finally, you cannot activate HA on any appliance which already has a stretched network. You will have to unstretch any networks which the appliance is stretching, activate HA, and then re-stretch the networks. This is unfortunate, but at the moment it’s the only way to do it. With that said, you can probably accomplish these tasks in under 10 minutes per appliance.
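The requirements above can be written down as a small pre-flight check. This is only a sketch of the logic, not anything HCX provides: the `Appliance` class, the `ha_blockers` function and the way versions are represented as tuples are all hypothetical names I have made up for illustration, and you would have to fill them in by hand from what you see in the HCX UI.

```python
from dataclasses import dataclass, field


@dataclass
class Appliance:
    """Hypothetical record of an NE appliance, filled in by hand."""
    name: str
    version: tuple                      # e.g. (4, 3)
    stretched_networks: list = field(default_factory=list)


def ha_blockers(manager_versions, appliance, spare, minimum=(4, 3)):
    """Return the reasons HA cannot yet be activated (empty list = ready)."""
    wanted = ".".join(str(part) for part in minimum)
    blockers = []

    # All managers must be at the minimum version.
    for name, version in manager_versions.items():
        if version < minimum:
            blockers.append(f"manager {name} is below {wanted}")

    # HA needs a second (spare) NE appliance.
    if spare is None:
        blockers.append("no spare NE appliance deployed")

    # Appliances must be current and must not be stretching any network.
    for ne in (appliance, spare):
        if ne is None:
            continue
        if ne.version < minimum:
            blockers.append(f"appliance {ne.name} is below {wanted}")
        if ne.stretched_networks:
            networks = ", ".join(ne.stretched_networks)
            blockers.append(f"appliance {ne.name} still stretches: {networks}")

    return blockers


if __name__ == "__main__":
    managers = {"source": (4, 3), "cloud": (4, 3)}
    ne1 = Appliance("NE1", (4, 3), ["VLAN85"])
    ne2 = Appliance("NE2", (4, 3))
    for reason in ha_blockers(managers, ne1, ne2):
        print(reason)
```

With the example values, the only blocker reported is the stretched VLAN on NE1, which matches the order of work described above: unstretch, activate HA, re-stretch.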
With all the requirements satisfied, under service mesh, you can find a new HA Management tab which will tell you that no HA groups have been created (I will file an internal bug around the spacing here):
Please remember that HA for NE is considered Early Access. This means that while it is supported, it’s the first time that it’s available for HCX and as such there may be minor issues or bugs which will be corrected as new versions of HCX are released. Whilst our engineering teams endeavour to test all scenarios, there is nothing quite like customer feedback.
In order to activate HA, we need to select the NE appliance(s) that we wish to activate it on:
With the appliance selected, click on Activate High Availability. Here we have a reminder that HA will require two appliances, and if you don’t have one deployed which is not in use, you can deploy one now. Clicking on Increase Appliance Count will take you to the edit service mesh screen, where you will have to deploy at least one other appliance, depending on how many you wish to activate HA for.
With the new NE appliance deployed and tunnels all green, we can now try to activate HA again. If the appliance already has a stretched network, you will receive another warning about that. If not, you will see this screen:
Part of me hoped to see some party animation GIF here as HA has been a big ask for quite some time, but all you get is a task to say it is being created and then again when done. This process really does not take long at all, which is good news if there are a lot of networks to activate it on.
Once the HA group is set up, we can stretch networks again and you’ll notice that when you get to choose the appliance, it now shows that it is an HA group.
I have a functioning HCX service mesh with a VMC on AWS environment and I have stretched VLAN 85 to the destination. The subnet is 172.16.85.0/24 and I have a VMware Photon VM with the IP 172.16.85.11 and it is at the far end of the stretch. What I am going to do is have a ping going to this VM from the near side and then power off the active HA appliance and see how long it takes for it to recover.
Reply from 172.16.85.11: bytes=32 time=12ms TTL=63 Reply from 172.16.85.11: bytes=32 time=10ms TTL=63 Reply from 172.16.85.11: bytes=32 time=10ms TTL=63 Reply from 172.16.85.11: bytes=32 time=10ms TTL=63 Reply from 172.16.85.11: bytes=32 time=10ms TTL=63 Reply from 172.16.85.11: bytes=32 time=10ms TTL=63 Reply from 172.16.85.11: bytes=32 time=10ms TTL=63
Well, that is unexpected. Based on documentation I had read prior to release, it suggested that there would be some time for recovery to happen but I guess the BU have made some pretty epic improvements, in that there is no noticeable drop in pings. This is excellent work and really, really impressive. Just checking my work and making sure I have the active appliance (handy view on HA Management):
Let’s try it again, powering off NE1:
Reply from 172.16.85.11: bytes=32 time=10ms TTL=63
Reply from 172.16.85.11: bytes=32 time=10ms TTL=63
Reply from 172.16.85.11: bytes=32 time=10ms TTL=63
Request timed out.
Reply from 172.16.85.11: bytes=32 time=10ms TTL=63
Reply from 172.16.85.11: bytes=32 time=10ms TTL=63
One drop this time, which proves that there is a failover happening, but that is lightning quick! I am very impressed and I’m looking forward to configuring this for customers.
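If you want to repeat this test over a longer run, counting the drops by eye gets tedious. Here is a minimal sketch that tallies replies and timeouts from saved Windows-style ping output like the above. The function name `summarise_ping` and the field names are my own, and it only recognises the two line formats seen in this post, so treat it as a starting point rather than a general ping parser.

```python
import re

# Matches lines such as: Reply from 172.16.85.11: bytes=32 time=10ms TTL=63
REPLY = re.compile(r"^Reply from (\S+): bytes=\d+ time[=<](\d+)ms TTL=\d+$")
TIMEOUT = "Request timed out."


def summarise_ping(lines):
    """Count replies, timeouts and the longest run of consecutive timeouts."""
    replies = timeouts = longest_gap = run = 0
    for raw in lines:
        line = raw.strip()
        if REPLY.match(line):
            replies += 1
            run = 0                      # a reply ends any run of timeouts
        elif line == TIMEOUT:
            timeouts += 1
            run += 1
            longest_gap = max(longest_gap, run)
        # anything else (headers, blank lines) is ignored
    return {"replies": replies, "timeouts": timeouts, "longest_gap": longest_gap}


if __name__ == "__main__":
    sample = """\
Reply from 172.16.85.11: bytes=32 time=10ms TTL=63
Reply from 172.16.85.11: bytes=32 time=10ms TTL=63
Reply from 172.16.85.11: bytes=32 time=10ms TTL=63
Request timed out.
Reply from 172.16.85.11: bytes=32 time=10ms TTL=63
Reply from 172.16.85.11: bytes=32 time=10ms TTL=63
"""
    print(summarise_ping(sample.splitlines()))
    # {'replies': 5, 'timeouts': 1, 'longest_gap': 1}
```

The `longest_gap` value is a count of consecutive missed pings, not a duration. It only gives a rough upper bound on the outage, since the real failover could be much shorter than the interval between two pings.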
Looking at the HA Management screen we have a few options: manual failover, deactivate, redeploy and force sync. If there is a software update, you can do that here too. There’s an option as well to recover, although at the moment I am not sure what that does.
One minor thing which I think could be improved is the HA Management location in the GUI. As I wrote above, it is under the service mesh and then appliances. For me, it would be easier to access if it were under Services on the main menu. Perhaps this is something which will come with future updates.
There are plenty of other improvements in HCX 4.3; however, this post is mainly about HA for Network Extension and a short demo of it. Although it is published, I may alter the wording and the diagrams as and when I find time.
As a final note, the above has been done in a lab environment and may not necessarily represent real world use cases with loaded networks etc.
Thanks for reading.