Automating the configuration supporting Scale Enhancements for Bulk and Replication Assisted vMotion (RAV) Migrations

At the time of writing, HCX in its default configuration supports up to 300 concurrent migrations per Manager and 200 concurrent migrations per Service Mesh/IX Appliance. Whilst these numbers are high, there are scenarios where higher numbers are required. For example, if you have three Service Meshes, you are still constrained by the HCX Manager limit of 300 concurrent migrations, which means you can’t use all three Service Meshes to their full potential. With the release of HCX 4.7 there is an option to upsize (or scale up) the HCX Manager to support 600 parallel migrations in total. The instructions are here: https://kb.vmware.com/s/article/93605
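
To make the arithmetic concrete, here is a minimal sketch using the figures quoted above. The function name effective_concurrency is mine and is not an HCX command; it simply takes the smaller of the Manager limit and the combined Service Mesh limit.

```shell
# Illustrative only: effective_concurrency is a hypothetical helper, not part of HCX.
# Usage: effective_concurrency <manager_limit> <per_mesh_limit> <mesh_count>
effective_concurrency() {
    local manager_limit="$1" per_mesh_limit="$2" mesh_count="$3"
    local mesh_total=$(( per_mesh_limit * mesh_count ))
    # The Manager limit caps whatever the Service Meshes could handle together
    if (( mesh_total < manager_limit )); then
        echo "$mesh_total"
    else
        echo "$manager_limit"
    fi
}

effective_concurrency 300 200 3   # default Manager, three Service Meshes: prints 300
effective_concurrency 600 200 3   # scaled-up Manager, three Service Meshes: prints 600
```

With three Service Meshes the default Manager leaves half of the combined 600 unused, which is the gap the scale-up closes.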

In short, the VM configuration is changed to 32 vCPU and 48GB RAM, and an additional 300GB HDD is added. After the VM changes there’s quite a bit of CLI work involved, which can easily lead to mistakes or confusion, so I figured I’d create a bash script to take care of the CLI steps.

Disclaimer: Neither these steps nor the script have been verified internally, so I would advise caution, and it is likely not supported. However, feel free to try it out in a lab environment, making sure to comment out the relevant checks if you don’t want to assign the resources. If you do run it without the increased resources you are likely to break the manager. You can always revert to the snapshot in case of unexpected results.

Important: There are actually two scripts. The first is only relevant when running this for the first time. The second must be run on any existing manager that already has this new configuration and has since been upgraded (e.g. from 4.7 to 4.9). You will find the second script at the end of this post.
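
If you want to see whether the tuned values are still in place on a manager after an upgrade, a quick check could look like the sketch below. The helper names has_setting and report_tuning are hypothetical. The file paths and values are the ones the scripts in this post edit, and I am only checking two of them here.

```shell
# Hypothetical check: are the tuned values present in the files the scripts edit?
# Usage: has_setting <file> <literal string>
has_setting() {
    # -F treats the string literally, -- stops option parsing for strings like -Xmx4096m
    grep -qF -- "$2" "$1" 2>/dev/null
}

report_tuning() {
    if has_setting /etc/systemd/app-engine-start '-Xmx4096m'; then
        echo "app-engine heap: tuned"
    else
        echo "app-engine heap: not tuned"
    fi
    if has_setting /etc/kafka/server.properties 'message.max.bytes=4194304'; then
        echo "kafka message size: tuned"
    else
        echo "kafka message size: not tuned"
    fi
}

# Run as root on the HCX Manager:
# report_tuning
```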

First, shut the HCX Manager down and change the VM configuration as per the KB (CPU, RAM, and add the 300GB HDD). Then take a snapshot. Power the VM on, put the script below in /home/admin or /tmp, switch to root, make it executable and then run it. Once it has completed and you have checked that everything is okay, don’t forget to remove the snapshot.
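
The make-executable-and-run step might look like the sketch below. The helper name run_as_executable is hypothetical, and /tmp/upscalehcx.sh assumes you saved the script under the file name given in its header. Adjust the path if you put it in /home/admin.

```shell
# Hypothetical helper: make a script executable, run it, and report its exit status.
run_as_executable() {
    local script="$1"
    if [ ! -f "$script" ]; then
        echo "Script not found: $script" >&2
        return 1
    fi
    chmod +x "$script" || return 1
    "$script"
    local rc=$?
    echo "$script exited with status $rc"
    return "$rc"
}

# Example, as root (path and file name are assumptions):
# run_as_executable /tmp/upscalehcx.sh
# tail -n 20 /home/admin/upscalehcx.log
```

The log path in the last comment is the LOG_FILE value defined in the script itself.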

If you find it useful, please let me know in the comments below, and be sure to check back in case of any updates. Feedback is of course welcome.

#!/bin/bash
# Author - Christopher Dooks, VMware by Broadcom Professional Services
# Title - upscalehcx.sh
# Version - 0.3 24/04/2024
# About - this script automates the CLI steps in https://kb.vmware.com/s/article/93605 - HCX - Bulk Migration & Replication Assisted vMotion (RAV) scalability guide (93605). There are a lot of manual CLI steps in the KB; this script automates them and reduces user error. The script can take a few minutes to complete.
# Prerequisites - HCX Manager MUST be 4.7 or later. Shut down the manager, take a snapshot. Increase vCPU and RAM as per the KB, add 300GB HDD. Power on manager. The script must be run as root. Only run this on a Manager which has not been upgraded post scale up; run postupgradetuning.sh after an appliance upgrade. Don't forget to remove the snapshot post verification.

# Define the log file
LOG_FILE="/home/admin/upscalehcx.log"

# Function to log messages
log_message() {
    echo "$(date +"%Y-%m-%d %H:%M:%S") - $1" >> "$LOG_FILE"
}

# Function to check command execution status
check_status() {
    if [ $1 -ne 0 ]; then
        log_message "Error: $2"
        exit 1
    fi
}

# Prerequisite checks (HCX version, 32 vCPU/48GB RAM, 300GB HDD added)
# Check HCX version
# I haven't validated all versions of HCX against this check. If you are facing issues and you are on a version later than 4.7, simply comment out this section (from the HCX_VERSION= line down to the closing fi).

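# Strip any whitespace around the value so the version comparison below is not thrown off
HCX_VERSION=$(awk -F: '/^CLOUDVM_VERSION:/ {gsub(/[[:space:]]/, "", $2); print $2}' /etc/vmware/buildInfo)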

REQUIRED_VERSION="4.7.0.0"

if [[ "$(printf '%s\n' "$HCX_VERSION" "$REQUIRED_VERSION" | sort -V | head -n1)" != "$REQUIRED_VERSION" ]]; then
    echo "Stopping the script, HCX version 4.7.0.0 or newer is required! Please upgrade HCX Manager and retry."
    exit 1
fi

# Check if /dev/sdb exists
if [ ! -e "/dev/sdb" ]; then
    echo "Error: /dev/sdb does not exist. Please make sure the second disk is attached to the VM."
    exit 1
fi

# Check if a partition already exists on /common_ext and if it's mounted
if grep -qs "/common_ext" /proc/mounts; then
    echo "/common_ext is already mounted. It is highly likely that this script has already been run. Please verify."
    exit 1 
fi

# Check if the VM resource changes have been applied, first one looks for 32 vCPU
if [ "$(nproc)" -ne 32 ]; then
    echo "Stopping, the VM does not have 32 CPUs."
    exit 1
fi

# Then check RAM
RAM=$(free -m | awk '/^Mem:/ {print $2}')

REQUIRED_RAM_MB=48128  # 47 GB in MB (48 GB minus 1 GB tolerance, as free reports slightly less than the assigned RAM)

if [[ "$RAM" -lt "$REQUIRED_RAM_MB" ]]; then
    echo "Stopping, the VM does not have 48GB of RAM"
    exit 1
fi

# If no partition exists, proceed to create one
log_message "Creating partition on /dev/sdb"
echo "Creating partition on the new 300GB disk"
echo -e "n\np\n1\n\n\nw" | fdisk /dev/sdb >> "$LOG_FILE" 2>&1

# Format the partition with ext3 filesystem
log_message "Formatting the new partition"
echo "Formatting partition /dev/sdb1 with ext3 filesystem"
mkfs -t ext3 /dev/sdb1 >> "$LOG_FILE" 2>&1

# Mount new drive
echo "Mounting new drive"
log_message "Mounting new drive"
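mkdir -p /common_ext >> "$LOG_FILE" 2>&1
mount /dev/sdb1 /common_ext >> "$LOG_FILE" 2>&1
# Stop here if the mount failed, otherwise the databases would be moved onto the wrong disk
check_status $? "Failed to mount /dev/sdb1 on /common_ext"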

# Add entry to /etc/fstab
echo "Updating /etc/fstab"
log_message "Updating /etc/fstab"
echo "/dev/sdb1 /common_ext ext3 rw,nosuid,nodev,exec,auto,nouser,async 1 2" >> /etc/fstab
check_status $? "Failed to update /etc/fstab"

# Function to wait for service to stop
wait_for_service_stop() {
    service="$1"
    echo "Waiting for $service to stop"
    while systemctl is-active --quiet "$service"; do
        sleep 1
    done
}

# Stop services

# List of services
services=("postgresdb" "zookeeper" "kafka" "app-engine" "web-engine" "appliance-management")

for service in "${services[@]}"; do
    echo "Stopping $service"
    log_message "Stopping $service"
    systemctl stop "$service" >> "$LOG_FILE" 2>&1
    wait_for_service_stop "$service"
done

echo "All services stopped successfully."

# Back up kafka-db
echo "Backing up old kafka-db content"
log_message "Backing up kafka-db"
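cd /common || { log_message "Error: cannot change to /common"; exit 1; }
mv kafka-db kafka-db.bak >> "$LOG_FILE" 2>&1
check_status $? "Failed to back up kafka-db"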

# Create new kafka-db directory
echo "Creating new kafka-db directory"
log_message "Creating new kafka-db directory"
cd /common_ext || { log_message "Error: cannot change to /common_ext"; exit 1; }
mkdir kafka-db >> "$LOG_FILE" 2>&1
chmod 755 kafka-db >> "$LOG_FILE" 2>&1
chown kafka:kafka kafka-db >> "$LOG_FILE" 2>&1

# Create symbolic link for kafka-db
echo "Creating symbolic link for kafka-db"
log_message "Creating symbolic link for kafka-db"
cd /common
ln -s /common_ext/kafka-db kafka-db >> "$LOG_FILE" 2>&1

# Back up postgres-db 
echo "Backing up postgres-db"
log_message "Backing up postgres-db"
mv postgres-db postgres-db.bak >> "$LOG_FILE" 2>&1

# Copy postgres-db.bak to /common_ext
echo "Copying postgres-db directory"
log_message "Copying postgres-db directory"
cp -r /common/postgres-db.bak /common_ext/postgres-db >> "$LOG_FILE" 2>&1
chown -R postgres:postgres /common_ext/postgres-db >> "$LOG_FILE" 2>&1

# Create symbolic link for postgres-db
echo "Creating symbolic link for postgres-db"
log_message "Creating symbolic link for postgres-db"
cd /common
ln -s /common_ext/postgres-db postgres-db >> "$LOG_FILE" 2>&1

# Performance tuning 

# Edit /etc/systemd/app-engine-start
echo "Updating /etc/systemd/app-engine-start"
log_message "Updating /etc/systemd/app-engine-start"
sed -i -e 's/-Xmx2048m/-Xmx4096m/g' -e 's/-Xms2048m/-Xms4096m/g' -e 's/-XX:MaxPermSize=512m/-XX:MaxPermSize=1024m/g' /etc/systemd/app-engine-start >> "$LOG_FILE" 2>&1

# Edit configuration files
echo "Updating configuration files"
log_message "Updating configuration files"
sed -i 's/"numberOfThreads": "5"/"numberOfThreads": "50"/g' /opt/vmware/deploy/zookeeper/MobilityMigrationService.zql >> "$LOG_FILE" 2>&1
sed -i 's/"numberOfThreads":25/"numberOfThreads":50/g' /opt/vmware/deploy/zookeeper/MobilityTransferService.zql >> "$LOG_FILE" 2>&1 
sed -i 's/"kafkaMaxMessageSizeBytes":2097152/"kafkaMaxMessageSizeBytes":4194304/g' /opt/vmware/deploy/zookeeper/vchsApplication.zql >> "$LOG_FILE" 2>&1
sed -i 's/message\.max\.bytes=2097152/message.max.bytes=4194304/g' /etc/kafka/server.properties >> "$LOG_FILE" 2>&1

# Start services
echo "Starting services"
log_message "Starting services"
for service in "${services[@]}"; do
    systemctl start "$service" >> "$LOG_FILE" 2>&1
    check_status $? "Failed to start $service"
done

echo "All services started successfully."
log_message "All services started successfully."

[Screenshot: the script running in my lab]
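
For the "checking everything is okay" step, a minimal verification sketch could look like this. It is not taken from the KB and the function names are mine. It only checks what the script above changes: that /common_ext is mounted, that the two directories under /common are now symbolic links into /common_ext, and that the services report active.

```shell
# Hypothetical post-run checks; not part of the KB procedure.

# Succeeds if $1 is a symbolic link whose target is exactly $2
link_points_to() {
    local link="$1" expected="$2"
    [ -L "$link" ] && [ "$(readlink "$link")" = "$expected" ]
}

# Succeeds if mount point $1 is listed in the mounts file $2 (default /proc/mounts)
is_mounted() {
    local mount_point="$1" mounts_file="${2:-/proc/mounts}"
    awk -v mp="$mount_point" '$2 == mp { found = 1 } END { exit !found }' "$mounts_file"
}

verify_scale_up() {
    local rc=0 name service
    is_mounted /common_ext || { echo "FAIL: /common_ext is not mounted"; rc=1; }
    for name in kafka-db postgres-db; do
        link_points_to "/common/$name" "/common_ext/$name" \
            || { echo "FAIL: /common/$name is not a link to /common_ext/$name"; rc=1; }
    done
    for service in postgresdb zookeeper kafka app-engine web-engine appliance-management; do
        systemctl is-active --quiet "$service" \
            || { echo "FAIL: $service is not active"; rc=1; }
    done
    if [ "$rc" -eq 0 ]; then
        echo "All checks passed"
    fi
    return "$rc"
}

# Run as root on the HCX Manager:
# verify_scale_up
```

I would still log in to the HCX UI and confirm the manager is healthy before removing the snapshot, as these checks say nothing about the application itself.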

And here is the script to run on an upgraded manager. There are no checks here yet; I will update it.

#!/bin/bash
# Author - Christopher Dooks, VMware by Broadcom Professional Services
# Title - postupgradetuning.sh
# Version - 0.1 23/04/2024
# About - this script is a follow on to upscalehcx.sh and must be run on an already scaled manager which has since had an upgrade. It automates the Performance Tuning steps in https://kb.vmware.com/s/article/93605 - HCX - Bulk Migration & Replication Assisted vMotion (RAV) scalability guide (93605).
# Do not run this script on an HCX Manager which has not been scaled up already as it will not have the correct impact.
# Prerequisites - HCX Manager MUST be 4.8 or later AND have been scaled up, and since upgraded to a newer version of HCX.
# The script must be run as root.

# Define log file
LOG_FILE="/home/admin/postupgradetuning.log"

# Function to log messages
log_message() {
    echo "$(date +"%Y-%m-%d %H:%M:%S") - $1" >> "$LOG_FILE"
}

# Function to check command execution status
check_status() {
    if [ $1 -ne 0 ]; then
        log_message "Error: $2"
        exit 1
    fi
}

# Performance tuning 

# Stop services

# Function to wait for service to stop
wait_for_service_stop() {
    service="$1"
    echo "Waiting for $service to stop"
    while systemctl is-active --quiet "$service"; do
        sleep 1
    done
}

# List of services
services=("postgresdb" "zookeeper" "kafka" "app-engine" "web-engine" "appliance-management")

for service in "${services[@]}"; do
    echo "Stopping $service"
    log_message "Stopping $service"
    systemctl stop "$service" >> "$LOG_FILE" 2>&1
    wait_for_service_stop "$service"
done

echo "All services stopped successfully."

# Edit /etc/systemd/app-engine-start
echo "Updating /etc/systemd/app-engine-start"
log_message "Updating /etc/systemd/app-engine-start"
sed -i -e 's/-Xmx2048m/-Xmx4096m/g' -e 's/-Xms2048m/-Xms4096m/g' -e 's/-XX:MaxPermSize=512m/-XX:MaxPermSize=1024m/g' /etc/systemd/app-engine-start >> "$LOG_FILE" 2>&1

# Edit configuration files
echo "Updating configuration files"
log_message "Updating configuration files"
sed -i 's/"numberOfThreads": "5"/"numberOfThreads": "50"/g' /opt/vmware/deploy/zookeeper/MobilityMigrationService.zql >> "$LOG_FILE" 2>&1
sed -i 's/"numberOfThreads":25/"numberOfThreads":50/g' /opt/vmware/deploy/zookeeper/MobilityTransferService.zql >> "$LOG_FILE" 2>&1 
sed -i 's/"kafkaMaxMessageSizeBytes":2097152/"kafkaMaxMessageSizeBytes":4194304/g' /opt/vmware/deploy/zookeeper/vchsApplication.zql >> "$LOG_FILE" 2>&1
sed -i 's/message\.max\.bytes=2097152/message.max.bytes=4194304/g' /etc/kafka/server.properties >> "$LOG_FILE" 2>&1

# Start services

echo "Starting services"
log_message "Starting services"
for service in "${services[@]}"; do
    systemctl start "$service" >> "$LOG_FILE" 2>&1
    check_status $? "Failed to start $service"
done

echo "All services started successfully."
log_message "All services started successfully."
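
Since this second script has no checks, here is a sketch of guards that could sit near the top of it. These are my assumptions about what is worth checking, not something from the KB: that the script is run as root, and that /common/kafka-db and /common/postgres-db are already symbolic links, which is what upscalehcx.sh leaves behind on a scaled-up manager. The function names are hypothetical.

```shell
# Hypothetical pre-flight guards for postupgradetuning.sh; not from the KB.

# Succeeds only if every path given is a symbolic link
all_symlinks() {
    local path
    for path in "$@"; do
        if [ ! -L "$path" ]; then
            echo "Not a symbolic link: $path" >&2
            return 1
        fi
    done
    return 0
}

preflight() {
    # The script edits system files and restarts services, so it needs root
    if [ "$(id -u)" -ne 0 ]; then
        echo "Stopping, this script must be run as root." >&2
        return 1
    fi
    # upscalehcx.sh replaces these directories with links into /common_ext
    if ! all_symlinks /common/kafka-db /common/postgres-db; then
        echo "Stopping, this manager does not look scaled up." >&2
        return 1
    fi
    return 0
}

# Near the top of the script, before any service is stopped:
# preflight || exit 1
```

Running the guard before the services are stopped means a failed check leaves the manager untouched.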
