VMware VCF – Deployment of new clusters and the impact on NSX-T Transport Zones
At one of our customers we are running a large VMware VCF environment. Due to earlier design decisions and some constraints around the setup of the NSX Edges, the VCF part of our automated deployment by the SDDC Manager ends after the configuration of the NSX-T Managers and the initial setup of the (default) Overlay Transport Zone (TZ), as shown below:
We run a separate script to deploy the edges, edge clusters, etc.
It is good to know that any new clusters that you deploy within this workload domain will be added to this default transport zone. As you can see at the bottom of the screenshot, there are two VCF tags assigned to the overlay TZ: VCF (scope: Created by) & vcf (scope: vcf-orchestration). These tags ensure that the SDDC Manager can identify and find this TZ and add any new hosts to it.
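To make the tag matching concrete, here is a minimal sketch of how you could check which transport zones carry both VCF tags. The tag values come from the screenshot described above. The dict layout (a "tags" list of scope/tag pairs) is an assumption about what the NSX-T API returns for a transport zone, and the function names and sample TZ names are made up for illustration. How SDDC Manager does its lookup internally is not something we know; this only mimics the visible behavior.

```python
# Tag values taken from the post. The object layout is an assumption.
VCF_TAGS = {("Created by", "VCF"), ("vcf-orchestration", "vcf")}


def has_vcf_tags(tz):
    """True if the transport zone dict carries both VCF tags."""
    present = {(t.get("scope"), t.get("tag")) for t in tz.get("tags", [])}
    return VCF_TAGS <= present


def find_vcf_transport_zones(transport_zones):
    """Return the transport zones a tag-based lookup would match."""
    return [tz for tz in transport_zones if has_vcf_tags(tz)]


# Hypothetical sample data, shaped like a list of transport zone objects.
sample = [
    {
        "display_name": "overlay-tz-default",
        "tags": [
            {"scope": "Created by", "tag": "VCF"},
            {"scope": "vcf-orchestration", "tag": "vcf"},
        ],
    },
    {"display_name": "overlay-tz-cluster02", "tags": []},
    {"display_name": "overlay-tz-cluster03"},  # no tags key at all
]

matches = find_vcf_transport_zones(sample)
print([tz["display_name"] for tz in matches])
```

With the sample data, only overlay-tz-default is matched, which is exactly the TZ that new hosts would be added to.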
In this case we are running multiple clusters in the workload domain each with their own overlay TZ. The “mistake” that we made though is that we left the first cluster that we deployed in this default TZ with the tags.
This first cluster is running production workloads and a lot of overlay segments are present. The problem now is that our cloud native team uses tooling to deploy Kubernetes clusters on the vSphere clusters in the workload domain. As a result of our setup, every cluster we deploy is now added to the default overlay TZ that carries the VCF tags, and so the segments that already exist there are also extended to the new cluster. Within vSphere these show up as portgroups with the same name (this is possible because we use a separate VDS per cluster). The team’s tooling then suddenly sees two networks with the same name and gets confused. The result: running containers can no longer be modified and new ones can no longer be created, causing a service interruption.
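A quick way to spot this situation is to look for portgroup names that appear on more than one VDS. The sketch below works on a plain list of (VDS name, portgroup name) pairs. How you collect that list (PowerCLI, pyVmomi, an export) is up to you, and all names in the sample are hypothetical.

```python
from collections import defaultdict


def duplicate_portgroup_names(portgroups):
    """Return {portgroup name: sorted VDS names} for names seen on more than one VDS."""
    by_name = defaultdict(set)
    for vds_name, pg_name in portgroups:
        by_name[pg_name].add(vds_name)
    return {name: sorted(vds) for name, vds in by_name.items() if len(vds) > 1}


# Hypothetical inventory: one VDS per cluster, one segment stretched to both.
inventory = [
    ("vds-cluster01", "seg-app-web"),
    ("vds-cluster01", "seg-app-db"),
    ("vds-cluster02", "seg-app-web"),
    ("vds-cluster02", "seg-k8s-mgmt"),
]

duplicates = duplicate_portgroup_names(inventory)
print(duplicates)
```

Any name in the result is a network that name-based tooling could trip over.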
So how do we resolve this? Fortunately, it is not a big deal. We created a new “default” Overlay Transport Zone in NSX-T, removed the VCF tags from the original default TZ and added them to the new TZ. Now all newly deployed clusters are added to the new default TZ. Some manual work is still required to create a new cluster-specific TZ and move things over (which of course can be automated as well 😊 ), but it no longer interferes with our production systems.
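The tag move itself can be sketched as below. This is not the script we used, only an illustration of the logic on plain dicts. It assumes the same scope/tag layout for the "tags" list, and the function name and TZ names are hypothetical. The two returned objects are what you would then write back to NSX-T via the UI or API. Try it in a lab first.

```python
import copy

# Tag values taken from the post. The object layout is an assumption.
VCF_TAGS = {("Created by", "VCF"), ("vcf-orchestration", "vcf")}


def move_vcf_tags(old_tz, new_tz):
    """Return copies of both TZs with the VCF tags moved from old to new.

    Other tags on either TZ are left untouched and the inputs are not modified.
    """
    old = copy.deepcopy(old_tz)
    new = copy.deepcopy(new_tz)

    old_tags = old.get("tags", [])
    moved = [t for t in old_tags if (t.get("scope"), t.get("tag")) in VCF_TAGS]
    old["tags"] = [t for t in old_tags if t not in moved]

    new_tags = new.get("tags", [])
    existing = {(t.get("scope"), t.get("tag")) for t in new_tags}
    new["tags"] = new_tags + [
        t for t in moved if (t["scope"], t["tag"]) not in existing
    ]
    return old, new


# Hypothetical example objects.
original_default = {
    "display_name": "overlay-tz-default",
    "tags": [
        {"scope": "Created by", "tag": "VCF"},
        {"scope": "vcf-orchestration", "tag": "vcf"},
        {"scope": "env", "tag": "prod"},
    ],
}
new_default = {"display_name": "overlay-tz-landing", "tags": []}

updated_old, updated_new = move_vcf_tags(original_default, new_default)
print(updated_old["tags"])
print(updated_new["tags"])
```

After the move, the original TZ keeps only its non-VCF tags and the new TZ is the one a tag-based lookup will find.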
Besides knowing how the SDDC Manager behaves and what it expects to be present in NSX-T when deploying new clusters, it’s also good to know that it will look for those same tags when you are removing clusters/workload domains from the environment!
In case you are trying to deploy a new cluster and SDDC manager is unable to find any TZ with the appropriate tags, you will see an error message like the one below:
All of this was tested and performed running VMware VCF 4.5.1 and the respective product versions in the Bill of Materials (BoM) for this version. Behavior might vary in other versions and when in doubt, it is always good to contact support first!!
Credits also go to my co-workers Paul (https://www.hollebollevsan.nl) and Sjaak !!