VCD

VMware Cloud Director 10.6.x - The Definitive MSP Operations Guide

Eric Black

18 Mar 2026 — 142 min read

Enterprise & Service Provider Engineering | VMware Cloud Director

VCD 10.6.x Veeam Cloud Connect Zerto MSP / Service Provider / Enterprise Gotcha Catalog

VMware Cloud Director 10.6.x is the current release and it is not forgiving of misconfiguration. Broadcom dropped VCD 10.6 GA in June 2024 and it includes architecture changes that quietly break assumptions you made in 10.4 and 10.5 - networking tenancy models that cannot be changed after creation, AMQP deprecation that takes out Zerto if you are not paying attention, and logging behavior that will fill your root disk if you upgrade and walk away. This article is the operational guide I wish existed when we started deploying it. Every gotcha in here is real. Every fix has been validated.

This covers the full stack: VCD cell configuration and sizing, NSX integration gotchas, IP Space and tenant isolation, Veeam Cloud Connect and direct VBR integration end-to-end, and Zerto Cloud Manager with the specific caveats that apply to VCD 10.6. The content applies equally to service providers running multi-tenant infrastructure and to enterprise architects deploying VCD as a private cloud management plane - the platform is the same, the tenancy model and scale differ. Read it front to back before you touch production.

Coverage: VCD 10.6.0 through 10.6.1.2 (December 2025). Every bug entry is sourced directly from the official Broadcom release notes and tagged with the version it was introduced and the version it was resolved, so you can use this as an upgrade decision matrix.

🏗️

1. VCD 10.6.x Architecture Overview

Before configuring anything, you need a clear picture of what changed in 10.6 versus previous releases. Several decisions you make during initial setup are permanent.

Release Baseline

Item	Detail
GA Release	June 27, 2024 - Build 24055813 (installer: 24055916)
10.6.0.1	September 6, 2024 - Build 24250702
10.6.1	January 31, 2025 - Build 24532678
10.6.1.1	May 30, 2025 - Build 24756346
10.6.1.2	December 5, 2025 - Build 25086345 (installer: 25088252)
Current Patch	VCD 10.6.1.2 (Dec 2025) - recommended production baseline
Supported NSX Version	NSX 4.1.x, 4.2.x (current: 4.2.3.3)
Supported vSphere	vSphere 8.0 U1, U2, U3 (current: 8.0 Update 3i)
Supported RabbitMQ	3.10.x, 3.11.x, 3.12.x
API Version	36.3 (latest in 10.6.x)

What Changed in 10.6

These are not incremental tweaks. Several of these changes alter how you architect a deployment from day one.

NSX Projects (Networking Tenancy): Introduced in 10.5.1, expanded significantly in 10.6. NSX Projects allow true multi-tenant networking isolation at the NSX level. The critical constraint: once you enable NSX Projects on a VDC, the network pool backing that VDC cannot be swapped. You also cannot change the networking tenancy model after the VDC is created. Make this decision before you provision anything.

Three-Tier Provider Model: VCD 10.6 significantly expanded the Sub-Provider role. You now have Provider, Sub-Provider, and Tenant as distinct tiers with independent resource delegation. Sub-Providers can create and manage their own tenants without requiring provider-level credentials. This is the right model for MSPs who wholesale to resellers.

AMQP Deprecated: AMQP is deprecated in 10.6. If you are upgrading from VCD 10.1 or earlier and still have AMQP configured, you must migrate to MQTT before or immediately after upgrade. AMQP broker settings cannot be deleted through the GUI - you need the API. This directly affects Zerto integrations.

DC Group Scaling: Data Center Groups now support up to 2000 VDCs per group, up from previous limits. If you are building large multi-datacenter MSP infrastructure, this opens architectural options that were not viable before.

Strict Mode IPAM: Sub-tenants can now be enforced to use only IPs within assigned static pools. This was not enforceable in prior releases and was a frequent source of IP overlap incidents in multi-tenant environments.

Mixed Tier-0/VRF Import: You can now import both a parent Tier-0 and its child VRFs simultaneously into VCD. This simplifies the routing import workflow considerably for MSPs running VRF-based tenant segmentation.

Certificate Library: CA-signed certificates are now managed through the Certificate Library (introduced in 10.5.1, expanded in 10.6). If you are restoring a VCD deployment from backup and have CA-signed certs, the restore will fail if there is a password mismatch between the cert password at backup time and restore time. There is a specific procedure for this - covered in the certificates section.

💻

2. Cell Sizing and Infrastructure Requirements

VMware publishes cell sizing recommendations but they are conservative baselines. MSP deployments that handle significant tenant load or are running Veeam + Zerto integration alongside VCD need to size up from those baselines.

Cell VM Specifications

Deployment Size	vCPU	RAM	OS Disk	Transfer Dir	Notes
Small (<100 tenants)	8	16 GB	60 GB	500 GB+	Single cell acceptable
Medium (100-500 tenants)	16	32 GB	60 GB	1 TB+	2+ cells recommended
Large (500+ tenants)	32	64 GB	60 GB	2 TB+	4+ cells, dedicated DB

Root Partition - Critical

The OS disk layout matters. The root partition filling up will cause the vmware-vcd service to stop unexpectedly with no obvious external warning. This is a known failure mode in 10.6 deployments. The transfer directory at /opt/vmware/vcloud-director/data/transfer and the log directory will grow without bounds if not managed. Mount both on separate volumes from root. This is not optional for production.

Disk Layout Recommendations

Mount Point	Minimum Size	Notes
/	60 GB	OS and VCD binaries only. Do not let logs or transfer data land here.
/opt/vmware/vcloud-director/data/transfer	500 GB+	Separate volume. This is where backup files, uploads, catalog items stage. Size to your workload.
/var/log	50 GB+	Separate volume strongly recommended. Logging volume increases significantly after upgrade to 10.6.

10.6 Logging Volume Warning

Logging volume increases noticeably after upgrading to 10.6 compared to 10.5. If you are upgrading in place, check your log volume free space before and monitor it after upgrade. Environments that were fine at 10.5 have run out of log space within days of upgrading to 10.6 without any configuration change on their part.

Networking Requirements

Connection	Port	Direction	Notes
VCD UI/API (HTTPS)	443	Inbound to cell	Load balanced across cells
VCD Console Proxy	8443	Inbound to cell	VM remote console access
Cell-to-cell	5480	Between cells	Management and heartbeat
VCD to vCenter	443	Outbound	vSphere API
VCD to NSX Manager	443	Outbound	NSX API
VCD to RabbitMQ	5671	Outbound	AMQP/MQTT (TLS)
VCD database (PostgreSQL)	5432	Outbound	External DB deployments

Database Sizing

The embedded PostgreSQL instance is fine for small deployments. For production MSP environments with 100+ tenants, run an external PostgreSQL instance. This lets you manage backups, replication, and disk growth independently of the VCD cell VMs. PostgreSQL 14 or 15 is the tested range for VCD 10.6.x. Allocate at minimum 4 vCPU and 8 GB RAM for the database VM, with fast storage - latency to the database directly impacts VCD API response times.

⚙️

3. Installation and Upgrade Gotchas

Gotcha: Fresh Install vs. In-Place Upgrade Behavior Differs

Fresh installs of 10.6 default to MQTT only. In-place upgrades from pre-10.1 environments that had AMQP configured will retain the AMQP broker settings in the database even though AMQP is deprecated. Those stale settings will not cause obvious failures immediately - they will quietly prevent MQTT from working correctly or cause external integrations (Zerto) to fail to connect. You must clean up the AMQP broker settings via the API after upgrade if you are coming from a pre-10.1 baseline.

Fix: Remove Stale AMQP Settings Post-Upgrade

The VCD UI only shows Edit and Test AMQP Config for existing AMQP settings - there is no delete option in the GUI. Use the API to remove them. For VCD 10.5.1+, the correct endpoint is the /cloudapi/1.0.0/extension/amqp API. A DELETE to this endpoint with system administrator credentials removes the AMQP broker configuration and allows the MQTT path to function cleanly.

Gotcha: Windows Server 2025 Guest Customization Fails

VMs configured with the guest OS type windows11_64Guest in vSphere will fail VCD guest customization. This affects Windows Server 2025 deployments where the vSphere VM hardware version was set to use this guest OS identifier. VCD 10.6.x does not recognize windows11_64Guest as a valid Windows Server target for guest customization. There is no UI workaround currently available.

Fix: Use windowsServer2022_64Guest

Change the guest OS type on the VM in vSphere to windowsServer2022_64Guest before importing to or managing through VCD. Windows Server 2025 runs correctly under this designation from VCD's perspective. Monitor the Broadcom KB for a resolution if Windows Server 2025 as a native VCD guest OS type is required.

Gotcha: IP Reservation Expiry Fires One Day Early

IP reservations in VCD 10.6 have a confirmed date calculation bug. If you set an IP reservation to expire on a specific date, it actually expires at midnight of the day before that date - not at the end of the day you selected. In multi-tenant environments where IP reservations have SLA implications, this will cause tenant-visible failures the day before the intended expiry.

Fix: Add One Day Buffer

Set all IP reservation expiry dates one day later than your intended expiry. The fix for this bug is tracked by Broadcom. Check your current 10.6.x patch level before implementing this workaround - it may be resolved in a later point release.

Gotcha: Failed vApp Template Captures Are Not Auto-Cleaned

When a vApp template capture fails partway through, VCD 10.6 leaves the incomplete template artifact in the catalog without cleaning it up. These orphaned entries do not resolve on their own and accumulate over time, consuming catalog namespace and potentially causing confusion for tenants browsing their catalogs.

Fix: Manual Cleanup Required

Periodically query for vApp templates in a non-READY state using the VCD API and delete them manually. Build this into your operational runbook. The query is: GET /api/query?type=adminVAppTemplate&filter=status!=READY with system admin credentials. Any template not in READY state after more than 24 hours is likely a failed capture and safe to remove.

🌐

4. NSX Integration and Networking Tenancy

NSX configuration is where most MSP deployments go sideways during initial setup. The decisions made here are largely permanent.

NSX Projects (Networking Tenancy) - What You Need to Know Before You Start

NSX Projects allow VCD to create tenant-scoped NSX constructs - Tier-1 gateways, segments, and security policies - within a namespace isolated from the provider infrastructure. This is the right model for MSP environments because it prevents tenant networking from polluting the provider NSX space.

Permanent Decision - Cannot Be Undone

Once you enable NSX Projects on a VDC and select a network pool, you cannot change the network pool assignment. You also cannot switch between NSX Projects mode and non-NSX Projects mode after the VDC is created. If you get this wrong, you are recreating the VDC. Plan your network pool architecture before you create any VDCs.

NSX Projects Configuration

Step 1 - Create NSX Projects in NSX Manager

Before configuring VCD, create your NSX Projects in NSX Manager. Each project gets its own namespace for T1 gateways and segments. Assign the Tier-0 gateway or VRF that will serve as the uplink for this project. The project name in NSX becomes the boundary identifier - name these to match your VCD Organization naming convention.

Step 2 - Import NSX Projects into VCD

In VCD, navigate to Infrastructure Resources, then NSX. Import the NSX Manager and allow VCD to discover the Projects you created. VCD will present them as available network pool sources when you configure Provider VDCs and Organization VDCs.

Step 3 - Create Provider VDC with NSX Project-backed Network Pool

When creating the Provider VDC, select the NSX Project as the network pool source. This is the permanent binding. All Organization VDCs carved from this Provider VDC will inherit NSX Project-based networking.

Mixed T0/VRF Import

VCD 10.6 now supports importing both a parent Tier-0 and its child VRFs simultaneously. In prior releases you had to choose one or the other as your uplink model. The mixed import enables MSPs to share a physical Tier-0 infrastructure across tenants while using VRFs for routing isolation - a common design in MPLS-adjacent environments.

VCD API - Import T0 with VRF children

POST /cloudapi/1.0.0/nsxTResources/importableTier0s/{tier0Id}/import { "importVrfs": true, "vrfIds": ["vrf-uuid-1", "vrf-uuid-2"] }

BFD IPv6 BGP UI Bug

Configuring BFD on IPv6 BGP neighbors through the VCD UI is broken in 10.6.x. The UI does not render the BFD toggle correctly for IPv6 peers. Use the API to configure this. The endpoint is /cloudapi/1.0.0/edgeGateways/{id}/routing/bgp. This is a confirmed UI bug - functionality works correctly through the API.

DC Group Scaling

VCD 10.6 raised the Data Center Group limit to 2000 VDCs per group. For MSPs running stretched DRaaS architectures across multiple regions, this is a meaningful change. A DC Group in VCD is the construct that enables cross-site networking - stretched segments, centralized gateway configuration, and cross-VDC routing. With the 2000-VDC limit, you can now consolidate what previously required multiple DC Groups into a single management plane.

NSX Geneve Overlay Architecture

NSX TEP VLAN Design for VCD Environments

NSX Edge OVF Certificate Expired January 3, 2026

New Edge deployments and Edge resizing fail on NSX 4.2.3.0 and 4.2.3.1 due to an expired OVF signing certificate. If you need to deploy or resize Edge nodes, upgrade NSX to 4.2.3.2 or later (current: 4.2.3.3) before attempting. See Section 11 for full details.

The Tunnel Endpoint (TEP) VLAN is the underlay network that carries all Geneve-encapsulated overlay traffic between ESXi hosts and NSX Edge nodes. Getting this wrong causes intermittent VM-to-VM connectivity failures that are extremely difficult to diagnose after the fact. Design it before you rack anything.

MTU - The One Thing You Cannot Get Wrong

Geneve encapsulation adds 50-100 bytes of overhead to every packet. Geneve frames cannot be fragmented. The physical fabric carrying TEP traffic must support an MTU of at least 1600 bytes end to end - every switch port, every uplink, every trunk. VMware recommends 1700 bytes to leave room for future header expansion. The Gateway Interface MTU must be set 200 bytes below the TEP MTU. If your physical fabric is 1700, gateway interfaces are 1500. VMs stay at 1500 because the fabric handles the encapsulation overhead transparently.

Layer	Minimum MTU	Recommended MTU	Where Configured
Physical switches (TEP VLAN ports)	1600	1700	Switch interface config
ESXi vmnic (physical uplinks)	1600	1700	vSwitch uplink MTU
NSX Global TEP MTU	1600	1700	NSX: System > Fabric > Settings > Global Fabric Settings
NSX Uplink Profile MTU	1600	1700	NSX: System > Profiles > Uplink Profiles
VDS MTU (if using VDS)	1600	1700	vCenter: Distributed Switch > Edit > Advanced
NSX Gateway Interface MTU	1400	1500	NSX: Global Networking Config - must be 200 below TEP MTU
Tenant VM NIC MTU	1500	1500	Guest OS - no change required

Verify MTU end to end - run from each ESXi host TEP vmk

# SSH to ESXi host - find TEP vmkernel interface esxcfg-vmknic -l | grep -i vxlan # Ping from TEP vmk to peer TEP with DF bit set - test at 1672 bytes # (1672 + 28 byte IP/ICMP header = 1700 byte frame) vmkping -I vmk10 -s 1672 -d {remote-tep-ip} # Same test to Edge TEP vmkping -I vmk10 -s 1672 -d {edge-tep-ip} # If ping succeeds at 1672 but fails at higher values, MTU is set correctly

When Upgrading NSX from 3.x to 4.x - MTU Auto-Increases

NSX upgrades from 3.x to 4.x automatically increase the Global TEP MTU from 1600 to 1700 during the upgrade process. If your physical fabric only supports 1600, this will break overlay networking immediately after the NSX upgrade. Confirm your physical switch MTU before upgrading NSX to 4.x.

Edge Uplink Profile Design

The NSX Edge uplink profile defines how Edge nodes connect to the physical network. In VCD environments this matters because Edge nodes carry all tenant north-south traffic. A badly designed uplink profile causes asymmetric routing, MTU mismatches, or LACP negotiation failures that make every tenant's internet connectivity unreliable.

Setting	Recommended Value	Why
MTU	1700	Matches Global TEP MTU. Must be consistent.
Teaming policy	Load Balance Source MAC (Failover Order for bare-metal edges)	Source MAC teaming works without LACP on the physical switch. Use Failover Order for bare-metal Edge appliances where LACP is not available.
Transport VLAN	Dedicated VLAN (e.g. VLAN 100) separate from VM traffic	TEP traffic must be isolated from tenant VM VLANs. Mixing them causes broadcast domain issues and complicates troubleshooting.
Named teaming override	Match your physical switch bond/LAG config	If physical switches use LACP, use LACP Active/Passive. If using static LAG, use Load Balance Source Port ID.

NSX CLI - verify uplink profile on Edge transport node

# SSH to NSX Edge CLI (not ESXi - the Edge VM itself) # List interfaces and TEP info get logical-router get interface # Verify TEP IP and MTU get interfaces | grep -A 5 TEP # Check Geneve tunnel state get tunnel-ports get forwarding-table

Geneve Tunnel Troubleshooting

When overlay connectivity fails between VMs on different hosts, the diagnostic sequence is the same every time. Work the layers from bottom up: physical MTU, TEP reachability, tunnel state, then logical forwarding.

Step 1 - Verify TEP-to-TEP Reachability

TEPs must be able to reach each other on UDP 6081 (Geneve). If this is blocked by a firewall between ESXi hosts, all east-west VM traffic fails silently. The VMs appear connected to their logical segments but cannot actually communicate.

Test TEP reachability and UDP 6081

# From ESXi host - check TEP vmk to peer TEP vmkping -I vmk10 {peer-host-tep-ip} # Check tunnel ports are UP on ESXi host esxcli network ip interface ipv4 get esxcli network ip route ipv4 list # Verify Geneve port is not blocked # Run packet capture on the TEP vmk and check for UDP 6081 traffic pktcap-uw --uplink {vmnic0} --proto 0x11 --dstport 6081 --count 10

Step 2 - Check Tunnel State in NSX Manager

NSX Manager tracks the Geneve tunnel state for every pair of transport nodes. A DOWN tunnel means TEP connectivity has failed at the UDP/IP layer.

NSX CLI - check tunnel state

# SSH to NSX Manager # List all transport nodes and tunnel status get transport-nodes get tunnel-ports # Show tunnel details for a specific host get tunnel-port {tunnel-port-id} # From NSX Manager UI: System > Fabric > Nodes > Transport Nodes # Select host > Monitor > Tunnel Status # Green = UP, Red = DOWN

Step 3 - Verify Logical Forwarding

TEPs are up but VMs still cannot communicate. Check the logical layer - is the VM on the right segment, is ARP resolving, is the MAC table populated.

NSX CLI - check logical forwarding

# From NSX Edge or a host transport node CLI # Check the MAC table for a logical switch get logical-switch {ls-uuid} mac-table # Check the ARP table for the Tier-1 logical router get logical-router {tier1-uuid} neighbor # Trace the data path - verify which host the VM's MAC is on get forwarding-table

Step 4 - NSX Traceflow for Path Verification

Traceflow is the fastest way to confirm whether a packet can traverse the logical topology end to end without needing actual VM access. Run it from NSX Manager UI or API.

NSX API - initiate a Traceflow

POST /api/v1/network-transport/traceflows { "lport_id": "{source-vm-lport-id}", "packet": { "eth_header": { "src_mac": "{source-mac}", "dst_mac": "{destination-mac}", "eth_type": 2048 }, "ip_header": { "src_ip": "{source-vm-ip}", "dst_ip": "{destination-vm-ip}", "protocol": 6, "ttl": 64 }, "transport_header": { "src_port": 12345, "dst_port": 80 } } } # Then retrieve results: GET /api/v1/network-transport/traceflows/{traceflow-id}/observations

In NSX Manager UI: Plan & Troubleshoot > Traffic Analysis > Traceflow. Select source VM port, enter destination IP, click Trace. The result shows exactly where the packet is dropped and why.

Common Geneve Tunnel Failure: MTU Mismatch Between Host and Edge

A frequent scenario: host-to-host Geneve tunnels are UP and working, but VM-to-Edge-to-internet traffic fails for large packets while small packets (ping) work. This is almost always MTU. The Edge uplink profile MTU does not match the host TEP VLAN MTU. Large packets traversing the T1 to T0 path get silently dropped at the Edge because the Edge's uplink cannot pass them.

Fix: Align Edge Uplink Profile MTU with Host TEP MTU

In NSX Manager, navigate to System, then Profiles, then Uplink Profiles. Edit the Edge uplink profile and confirm the MTU matches the host uplink profile MTU (both should be 1700 if that is your fabric setting). Apply the profile change and verify tunnel state recovers. Test with a large packet ping from a VM to confirm.

Test large packet path through Edge

# From a tenant VM - ping with large packet and DF bit set # Linux: ping -M do -s 1400 8.8.8.8 # Windows: ping -f -l 1400 8.8.8.8 # If this fails but ping -s 100 works, MTU is the problem

🔒

5. IP Spaces and Tenant Isolation

IP Spaces replaced the legacy External Networks model for managing public IP allocations in VCD. If you are still on legacy External Networks, plan your migration - IP Spaces are the only supported model going forward for new deployments.

IP Space Architecture for MSPs

The correct model for a multi-tenant MSP: one Public IP Space containing your provider IP block, then IP Space Uplinks assigned to each Org VDC Edge Gateway. Tenants request IPs from their assigned uplink quota. The provider controls how many IPs each tenant can consume and can enforce that tenants cannot use IPs outside their assigned static pools using Strict Mode IPAM.

Strict Mode IPAM

Strict Mode IPAM is the setting that prevents sub-tenants from allocating IPs outside the static pools you have assigned to them. Without Strict Mode, a tenant with API access can request IPs from the broader IP Space range, not just their allocation. Enable this on every sub-provider and tenant account unless you have a deliberate reason not to.

Enable Strict Mode IPAM via API

PUT /cloudapi/1.0.0/ipSpaces/{ipSpaceId} { "ipSpaceType": "PUBLIC", "ipSpaceInternalScope": ["192.168.0.0/16"], "ipSpaceExternalScope": "0.0.0.0/0", "strictMode": true }

Gotcha: Storage Policy Change Fails When IOPS Limits Are Active

Changing the storage policy on a VM from a VMFS-backed policy to a vSAN-backed policy will fail if the source VMFS storage policy has vSphere IOPS limits configured. VCD 10.6 does not handle the IOPS limit translation between storage policy types. The VM will show an error and the storage profile change will roll back.

Fix: Remove IOPS Limits Before Storage Policy Migration

Edit the source storage policy in vCenter and remove the IOPS limit configuration before initiating the storage policy change in VCD. Once the storage profile migration completes, you can apply a vSAN-appropriate IOPS policy to the VM. This is a VCD 10.6 limitation - expect it to be addressed in a future patch.

📊

6. Allocation Models and Resource Pools

Getting the allocation model wrong is the most common misconfiguration in VCD deployments. Each model behaves differently under load and the differences matter for MSP billing accuracy.

Model	How It Works	Best For	Gotcha
Allocation Pool	Fixed CPU/RAM reservation at VDC level. Tenants consume from the fixed pool.	Predictable dedicated tenants, SLA-backed compute. Enterprise departments with defined budgets.	Over-reservation possible. Size carefully.
Reservation Pool	All resources guaranteed and reserved immediately on creation.	High-performance workloads, latency-sensitive apps, compliance environments requiring guaranteed compute	Expensive in terms of provider resource consumption. Rarely the right default.
Pay-As-You-Go	No reservation. VMs consume what they use. Provider cluster resources are shared.	Dev/test, variable workloads, non-critical internal workloads	No hard isolation. Noisy neighbors are possible.
Flex	Hybrid. Guaranteed minimum + elastic burst. Memory overhead optional.	Best default for most use cases - MSP tenants and enterprise business units. Best balance of isolation and efficiency.	See below - specific configuration requirement in 10.6.

Gotcha: Flex VDC Breaks When Initial Guaranteed = 0

The Flex VDC model has a bug when configured with initial guaranteed resources set to zero combined with the over-provisioning multiplier. When both of these conditions are true, the over-provisioning calculation fails and the VDC cannot provision VMs as expected. The VDC appears healthy but VM deployments into it will error.

Fix: Set Non-Zero Guaranteed Resources

Always configure a non-zero guaranteed resource baseline on Flex VDCs. Even setting the guaranteed minimum to 1% of total allocation resolves the calculation bug. This is the required configuration when using include_vm_memory_overhead=true - the memory_guaranteed field must be explicitly set and non-zero.

Gotcha: Flex VDC Over-Provisioning Broken at Guaranteed = 0

A related but distinct issue: if you set the initial resource guaranteed value to 0 and also have memory overhead inclusion enabled, the over-provisioning ratio calculation produces incorrect results even for VDCs where VM deployment is working. Billing and resource reporting will be inaccurate. Monitor the resource pool utilization through vCenter directly and compare to what VCD reports to catch this.

📨

7. AMQP, MQTT, and RabbitMQ Configuration

If you are running Zerto, read this section before anything else. Zerto depends on RabbitMQ for VCD event delivery. Get this wrong and Zerto goes silent with no obvious error.

The AMQP Deprecation Timeline

MQTT replaced AMQP as the inter-cell and external messaging protocol starting with VCD 10.1. AMQP remained available as a compatibility option through 10.5. In 10.6, AMQP is officially deprecated. Fresh installs of 10.6 configure MQTT only. Upgrades from environments that had AMQP configured leave the old broker settings in the database.

The immediate impact: Zerto historically used AMQP/RabbitMQ for its VCD integration. The AMQP deprecation is a compatibility concern you need to verify with your Zerto version before upgrading VCD.

RabbitMQ Version Requirements for VCD 10.6

RabbitMQ Version	VCD 10.6 Support
3.10.x	Supported
3.11.x	Supported
3.12.x	Supported
3.9.x and below	Not supported with VCD 10.6
3.13.x+	Verify with current Broadcom interoperability matrix

VCD 10.6.1.1 Known Bug - RabbitMQ Event Flooding

VCD 10.6.x has a confirmed bug where a large number of accumulated events in the VCD event table causes VCD to stop sending events to RabbitMQ. This item appears in the known issues list of every 10.6.x release note through 10.6.1.2. Broadcom is inconsistent about removing fixed issues from known issues lists, so the bug may be resolved in 10.6.1.2 even though it still appears there - if you are running 10.6.1.2 and your event flow is healthy, that is the best evidence it is working. The symptom when it fires: recurring MQTT/AMQP disconnect reports from external services - particularly Zerto. The VCD Admin extensibility check shows AMQP healthy because the internal service is running, but events stop flowing to RabbitMQ. The primary operational defense regardless of patch level is keeping the event table clean.

Cleaning Up Stale AMQP Configuration

If you are on VCD 10.5.1+ and still have AMQP broker settings configured (because you upgraded from an older version), remove them via the API before they cause problems:

Remove AMQP Broker Settings - VCD API

# First, verify AMQP settings exist GET https://vcd.example.com/api/admin/extension/settings/amqp Authorization: Bearer {token} Accept: application/json;version=36.3 # Delete via API (no UI option exists) DELETE https://vcd.example.com/api/admin/extension/settings/amqp Authorization: Bearer {token}

systemExchange Queue for Zerto

If Zerto is configured to consume events from VCD via RabbitMQ, it requires a specific queue named systemExchange. When migrating to a new RabbitMQ cluster, this queue is not automatically re-created. Create it manually in the RabbitMQ management UI before pointing VCD at the new cluster. Queue settings: durable=true, all other options false. The Zerto Cloud Manager configuration reads AMQP address information from VCD - it does not have its own connection settings for the broker address.

🔑

8. Certificate Management

VCD 10.6 uses the Certificate Library as the central management plane for all CA-signed certificates. This changes the restore and renewal workflow.

Certificate Library Overview

The Certificate Library (available since 10.5.1) stores CA-signed certificates as named objects that can be referenced by VCD cells, Edge Gateways, and other components. This is a significant improvement over the old model where certificates were tied directly to individual component configurations. Renewals and replacements now happen in one place.

Gotcha: VCD Restore Fails on CA-Signed Cert Password Mismatch

If you are restoring a VCD deployment from backup and the CA-signed certificate was protected with a password at backup time, the restore will fail if the password presented during restore does not match. VCD 10.6 stores certificates in the Certificate Library and the password is part of the stored credential binding. This is not obvious from the restore error message.

Fix: Document Certificate Passwords in Your Runbook

The certificate password used when importing into the Certificate Library must be documented and stored securely in your operational runbook. It is not recoverable from VCD if lost. If you hit this during a restore, the path forward is to re-import the certificate with the correct password via the API after restoring the database, then rebind the certificate to the appropriate VCD components.

Certificate Renewal Process

Step 1 - Generate CSR from Certificate Library

In VCD, navigate to Administration, then Certificate Management, then Certificate Library. Use the Generate CSR function to create a new CSR. Submit to your CA.

Step 2 - Import Signed Certificate

Import the signed certificate chain back into the Certificate Library. Include the full chain (leaf + intermediates). Assign a clear name that includes the expiry date for easy identification.

Step 3 - Rebind Certificate to VCD Components

For each VCD cell, Edge Gateway, and any other component using the old certificate, update the certificate binding to point to the newly imported cert. VCD cell certificate rebinding requires a service restart on the cell.

Step 4 - Verify Zerto and Veeam Connectivity

After certificate rotation, verify that Zerto Cloud Manager and Veeam Cloud Connect can still reach VCD. Both perform certificate validation. If you are using a private CA, ensure the CA root is trusted by both Zerto and Veeam backup servers.

💾

9. Veeam Cloud Connect Integration

Veeam Cloud Connect with VCD is well-documented but there are specific configuration points that cause problems in production that the official docs do not emphasize enough.

Architecture Overview

Veeam Cloud Connect + VCD Architecture

Veeam Cloud Connect integrates with VCD at the Organization level. The service provider does not create new vSphere objects for VCD tenants. Instead, the SP pre-creates Organization VDCs in VCD with the appropriate resources, then exposes those to Veeam tenants. The tenant accesses the cloud host using their Organization Administrator credentials from VCD.

Component	Role	Notes
Veeam Backup Server (SP)	Cloud Connect gateway, job orchestration	Connects to VCD via VCD API
Cloud Gateway	Data path for backup traffic	Separate from VCD; handles Veeam protocol
VCD Organization	Tenant namespace in VCD	One org per Veeam Cloud Director tenant account
Organization VDC	Compute resources for replicas	Pre-created by SP in VCD before assigning to tenant
VCD Org Admin	Credential used by tenant to connect	Must have Org Administrator rights in VCD org

Creating a Cloud Director Tenant Account in Veeam

Step 1 - Configure VCD in Veeam Cloud Connect

In the Veeam Backup console, navigate to Backup Infrastructure, then Managed Servers, then Add Server, and select VMware Cloud Director. Enter the VCD cell FQDN (use the load balancer VIP if multi-cell) and System Administrator credentials. Veeam will discover all Organizations and VDCs.

Step 2 - Create the Cloud Director Tenant Account

In the Cloud Connect console, create a new tenant. Select Cloud Director as the tenant type (not the default standalone/hardware plan type). Select the VCD Organization to associate with this tenant account. You are mapping one Veeam tenant to one VCD Organization - this is a 1:1 relationship.

Step 3 - Allocate Backup Repository Resources

Assign a cloud repository quota to the tenant. This is independent of the VDC compute resources. The tenant can use the cloud repository for Veeam Agent backups and VBR job backups regardless of whether they also have VDC compute allocated.

Step 4 - Allocate VDC Compute Resources for Replication

Assign one or more Organization VDCs from the tenant's VCD Organization as the compute target for VM replicas. Unlike hardware plan tenants, there is no hardware plan configuration required - the VDC itself is the resource pool. The SP controls resource limits in VCD directly.

Step 5 - Configure Network Extension Appliance

If the tenant needs network extension for partial or full site failover, enable it and configure the network extension appliance settings. The appliance runs as a VM in the tenant's VDC. Hard limit: the appliance supports a maximum of 9 Org VDC networks due to the 10 NIC limit on the appliance VM itself. Plan tenant network architecture around this constraint.

Gotcha: Network Extension Appliance 9-Network Limit

The Veeam network extension appliance has a hard limit of 9 Org VDC networks per appliance. This is a physical constraint - the appliance VM has a maximum of 10 NICs and one is used for management. Tenants with more than 9 networks requiring failover capability need multiple appliances or a network architecture redesign.

Gotcha: Hardware Plans Do Not Apply to VCD Tenants

Hardware plans are not used for Cloud Director tenant accounts. If you try to configure replication resources for a VCD tenant using the hardware plan workflow, you are in the wrong workflow. VCD tenants use Organization VDC resources pre-provisioned in VCD. The hardware plan workflow is for standalone (non-VCD) tenant accounts only.

Veeam Self-Service Backup Portal for VCD

Veeam Backup Enterprise Manager provides a Self-Service Backup Portal that VCD tenants can access using their native VCD credentials. The SP does not need to manage separate Veeam user accounts - authentication passes through VCD's identity mechanism. The SP configures job templates, repository quotas, and scheduling restrictions at the Organization level in Enterprise Manager.

Key controls the SP can enforce per Organization: backup job template (forces all tenant jobs to use a specific template), repository destination, quota limits, scheduling frequency limits (prevent jobs from running too often), and full scheduling lockout (where the SP controls the schedule, not the tenant).

Veeam Plugin Management in VCD

Veeam ships a VCD UI plugin with each VBR release. This plugin adds Veeam backup controls into the VCD tenant portal. When you upgrade VBR, you must manually update the plugin in VCD - it does not auto-update. Running a mismatched plugin version against a newer VBR causes API errors and broken backup portal functionality for tenants.

Always Delete and Re-Upload After VBR Upgrades

The plugin cannot be upgraded in place. You must delete the old version and upload the new one from the VBR ISO. Tenants lose access to the backup portal during this window - plan it as a brief maintenance window. The operation takes under 5 minutes.

Finding the Plugin on the VBR ISO

Plugin location on VBR ISO and installed VBR server

REM On the VBR ISO: Plugins\VeeamBackup\VeeamBackupPlugin.zip REM On a Windows VBR server after installation: C:\Program Files\Veeam\Backup and Replication\Backup\UIPlugins\VeeamBackupPlugin.zip

Removing the Old Plugin

Step 1 - Remove via VCD Provider Portal

Log in to VCD as a System Administrator. Navigate to Administration, then Customize Portal, then UI Plugins. Find the Veeam plugin entry. Select it and click Delete. Confirm the deletion. The plugin disappears from all tenant portals immediately.

Step 1 Alternative - Remove via VCD API

API - list and delete plugins

GET https://vcd.example.com/cloudapi/extensions/ui Authorization: Bearer {token} Accept: application/json;version=36.3 DELETE https://vcd.example.com/cloudapi/extensions/ui/{plugin-id} Authorization: Bearer {token}

Uploading and Publishing the New Plugin

Step 2 - Upload New Plugin ZIP

In VCD: Administration, then Customize Portal, then UI Plugins, then Upload. Select the VeeamBackupPlugin.zip from the VBR ISO or installation directory. VCD uploads and validates the plugin. Once complete, the plugin appears in the list with its version number.

Step 3 - Publish to Organizations

After uploading, the plugin is not visible to tenants until published. Select the new plugin and click Publish. You can publish to all organizations or to specific ones. For MSP environments where all tenants use backup, publish to all.

Step 4 - Verify Tenant Access

Log in as a tenant Organization Administrator. The Veeam Backup menu item should appear in the left navigation. Click it and confirm the backup portal loads and can communicate with the VBR server. If it shows a connection error, verify the VBR server FQDN is reachable from the tenant's browser and that the VBR SSL certificate is trusted.

Gotcha: Plugin Version Mismatch Silently Breaks Backup Portal

If you upgrade VBR but forget to update the plugin, tenants see the backup portal load but operations fail with generic API errors. VCD shows no errors because it is just hosting the plugin UI - it has no awareness that the plugin is calling a newer VBR API than it was built for. Symptom: tenants report the interface loads but they cannot see jobs, create jobs, or trigger restores.

Gotcha: Browser Cache Serves Old Plugin After Update

After updating the plugin, tenants may see the old version because their browser cached the plugin assets. Ask tenants to hard refresh (Ctrl+Shift+R / Cmd+Shift+R) or clear browser cache after a plugin update. Particularly common in Chrome.

VBR Version Change	Plugin Action Required	Where to Get Plugin
12.x to 12.y (minor update)	Delete old, upload new	New VBR ISO or installation directory
12.x to 13.0.1 (major upgrade)	Delete old, upload new - mandatory	VBR 13.0.1 ISO
13.0.1 Patch update	Check KB4738 - patch notes state if plugin changed	KB4738 documents each patch's scope

CDP for VCD Replicas

Continuous Data Protection is supported for VCD replica targets. CDP requires vSphere 7.0 U1+ on the source and VCD 10.3+ with NSX on the target. For VCD 10.6 deployments, CDP is available if the underlying vSphere meets the requirements. Configure CDP replication jobs in Veeam and target the Organization VDC as you would for a standard replication job - the CDP policy is set at the job level, not the infrastructure level.

Veeam + VCD Compatibility Matrix

Veeam VBR Version	VCD 10.6 Support	Notes
Veeam VBR v13	Full support	Recommended version for VCD 10.6 deployments
Veeam VBR v12.x	Support per KB4488	VCD 10.1-10.6 listed as supported; verify patch level
Veeam VBR v11 and below	Not recommended	VCD 10.6 API changes; use v12 or v13

GFS Retention for Immutable Backup Copy Jobs

If you are configuring backup copy jobs with immutability for VCD tenant data, GFS retention must be enabled on the backup copy job. Immutability on hardened repositories requires the GFS retention policy to be active - without it, the immutability flag is not set correctly on the backup files. This applies regardless of whether the repository is on-premises or cloud-connected.

🔄

10. Zerto Cloud Manager Integration

Zerto in a VCD environment operates through the Zerto Cloud Manager (ZCM). ZCM is the management layer that sits above individual ZVM (Zerto Virtual Manager) instances at each site and provides the multi-tenant VCD-aware control plane.

Architecture

In a VCD-integrated Zerto deployment:

The Zerto Cloud Manager communicates with VCD to discover organizations and VDCs. Individual ZVMs are deployed at each protected vSphere site. The Zerto Self-Service Portal (ZSSP) integrates with CSP portals and allows tenants to manage their own VPGs within their VCD Organization scope. VPG failover targets are pre-configured to use Organization VDCs in VCD as recovery destinations.

VCD 10.6 and AMQP - The Critical Dependency

Zerto's VCD Integration Uses RabbitMQ

Zerto's VCD integration relies on the RabbitMQ-backed messaging path to receive VCD events. This was historically an AMQP connection using a dedicated queue named systemExchange in RabbitMQ. With VCD 10.6 deprecating AMQP and moving to MQTT, you must verify that your Zerto version supports the MQTT-based event path before upgrading VCD. Running VCD 10.6 with a Zerto version that only supports the legacy AMQP integration path will result in Zerto losing event awareness from VCD - VPG state changes, organization changes, and failover triggers that depend on VCD events will stop working.

VCD 10.6.1.1 RabbitMQ Event Flooding Bug - Impact on Zerto

Gotcha: VCD 10.6.1.1 Stops Sending Events to RabbitMQ

VCD 10.6.x has a confirmed open bug where event table accumulation causes VCD to stop sending events to RabbitMQ. This issue is documented through 10.6.1.2 and has no resolution as of the current release. From Zerto's perspective the symptom is recurring disconnects from the VCD event stream. The VCD Admin extensibility check shows healthy because the internal service is running, but events are not flowing. This is a Broadcom bug in VCD 10.6.x - not a Zerto or RabbitMQ configuration issue.

Fix: Upgrade to VCD 10.6.1.2 or Later, Purge Event Table as Workaround

This bug still appears in the 10.6.1.2 known issues list. Whether it was actually fixed in 10.6.1.2 is unclear - Broadcom does not always remove items from known issues lists when a fix ships. If you are on 10.6.1.2 and not seeing event disconnects from Zerto, you are likely fine. If you are seeing them, the workaround is to delete stale events from the VCD event table - this requires engagement with Broadcom support, as the operation needs to be done carefully to avoid corrupting the event log. Keep the event table pruned regardless of what version you are on. A healthy event table should not grow unbounded.

RabbitMQ Requirements for Zerto + VCD

Component	Requirement
RabbitMQ version	3.10.x, 3.11.x, or 3.12.x for VCD 10.6
Erlang/OTP	Must be compatible with selected RabbitMQ version
Zerto queue name	systemExchange - must exist; not auto-created on new cluster
Queue durability	Durable=true; all other settings false
TLS	Port 5671 (TLS-encrypted AMQP). Configure TLS on RabbitMQ listener.
vHost	Default vHost (/) unless customized in VCD AMQP configuration

Zerto Cloud Manager VCD Configuration

Step 1 - Install ZCM

Deploy the ZCM appliance. ZCM requires network access to: VCD API endpoint (TCP 443), all ZVM instances at protected sites (TCP 9080), RabbitMQ (TCP 5671). ZCM does not need direct access to vCenter or NSX.

Step 2 - Connect ZCM to VCD

In ZCM, add the VCD endpoint. Use the VCD System Administrator credentials. ZCM will discover all Organizations and VDCs. The AMQP configuration is inherited from VCD - ZCM reads the broker address from VCD and connects automatically. You provide the AMQP credentials in ZCM.

Step 3 - Add ZVM Sites

Add each ZVM instance to ZCM. The ZVM must have the VCD plugin installed and configured to use the same VCD instance. ZCM maps ZVM sites to VCD Organizations to establish the tenant topology.

Step 4 - Configure VPG Targets

For each VPG targeting a VCD recovery site, the recovery VDC must be pre-created in VCD and the tenant must have appropriate rights in that VDC's Organization. VPG replication does not automatically provision VDC resources - they must exist before failover is attempted.

Known Zerto Limitations with VCD

Limitation	Detail
vApp network replication	Zerto does not replicate vApp network settings (NAT, firewall, DHCP). Post-failover scripts required to restore these settings.
vApp network fence mode	Fence mode must not be enabled when performing failover, test failover, or move operations.
Protected machine scope	VMs are protected as a VCD vApp in the recovery VCD. Individual VMs from a vApp can still be independently protected.
Metadata replication	vCD entity metadata is replicated. Network settings within the vApp (NAT/FW/DHCP) are not.
Disk resize during replication	If a protected VM disk is resized, the recovery disk cannot be automatically resized - manual intervention required post-failover.

Zerto Self-Service Portal Integration

ZSSP is the tenant-facing interface for VCD-integrated Zerto deployments. Tenants log in using their VCD Organization credentials. ZSSP shows them only the VPGs and recovery assets within their VCD Organization scope. Configure the ZSSP-to-CSP portal integration to present Zerto DRaaS as a self-service offering within your existing MSP customer portal.

🚧

11. Complete Bug Catalog by Version

Every significant known issue across the 10.6.x release line, tagged by the version in which it was introduced and the version in which it was resolved (if fixed). Use this as your upgrade decision matrix. If a bug affects your environment and is fixed in a later release, that is your upgrade path justification.

How to Read This Section

Each entry is tagged with the version it first appeared in and the version it was resolved in. "Open" means it remains unresolved as of 10.6.1.2. Bugs marked "Operational Impact" require immediate attention regardless of patch level.

Infrastructure and Platform

Root Partition Full Stops vmware-vcd - Introduced 10.6.0 | Open (Ongoing Operational Risk)

Two distinct disk fill vectors cause this. First: the nginx access.log and vcd_ova_ui_uwsgi.log files fail to rotate and consume root partition space. Second: general log growth from increased logging verbosity in 10.6. Both result in vmware-vcd stopping without a clear external error. Cell appears down, API returns 503.

Fix: Separate Volumes + KB 411155 Log Rotation Fix

Mount transfer directory and log directories on separate volumes from root. For the nginx log rotation specifically, apply the configuration documented in Broadcom KB 411155. For general log rotation, see Section 12. Monitor root at 75% and 90% thresholds.

PostgreSQL Disk Spikes After Upgrade to 10.6 - Introduced 10.6.0 | Open

Upgrading to 10.6 increases TOAST data size in PROPERTY_MAP and PROPERTY_MAP_STATUS columns. Under high VCListener update volume from vCenter, PostgreSQL file system utilization spikes sharply and VCD can crash. High-activity environments (many concurrent VM operations) are most at risk.

Fix: Add 8 GB+ to PostgreSQL Disk Before Upgrading

Pre-provision at least 8 additional GB on the PostgreSQL data volume before upgrading to VCD 10.6. Monitor after upgrade and scale further if growth continues.

Decommissioned Cell IP Reuse Duplicates Cell Entries, Slows All Tasks - Introduced 10.6.0 | Open

If you decommission a VCD cell and then deploy a new cell reusing the same IP address and FQDN, VCD creates duplicate entries in cell-runtime.log. The result is all VCD tasks running slow. This is not immediately obvious because the service appears healthy.

Fix: Never Reuse IP/FQDN of a Decommissioned Cell Directly

If you must reuse the same IP for a replacement cell, stop the vmware-vcd service on all cells, then start each cell one at a time to force a clean registration cycle. Broadcom's workaround is the graceful sequential restart: stop all cells, then start them one by one with confirmation between each.

VM Relocate Task Timeouts - Present 10.5.1.1 through 10.6.1.1 | Mitigated in 10.6.1.2

From 10.5.1.1 onward, storage policy changes or VM relocation tasks that take longer than the configured timeout cause the VCD task to fail with a java.util.concurrent.TimeoutException in cell-runtime.log. The default timeout was 5 minutes in 10.5.1.1 through 10.6.1.1. In 10.6.1.2 Broadcom extended the default to 10 minutes, which resolves most cases. If relocate tasks still timeout on large storage migrations or slow storage, you can increase further.

Fix: Open Broadcom Support Case

In 10.6.1.2 the default timeout was extended to 10 minutes, resolving most cases. If jobs still timeout, run: /opt/vmware/vcloud-director/bin/cell-management-tool manage-config -n relocate.vm.workflow.timeout.minutes -v 20 to extend further. Adjust based on your storage migration performance. Verify with -l flag to confirm the value is set.

Upgrade from 10.6 Fails with Broadcom Repository Error - Introduced 10.6.0 | Resolved 10.6.0.1

Upgrading VCD using the Broadcom repository may fail under some configurations. Additionally, fresh upgrades to 10.6.0 may silently fail to install some RPMs while reporting success.

Fix: Skip 10.6.0 - Start at 10.6.0.1 Minimum

Do not deploy or upgrade to 10.6.0 as a target. Use 10.6.0.1 as the absolute minimum baseline for all new deployments and upgrades.

High CPU Spikes Every Minute After Upgrade from 10.5.1 - Introduced During Upgrade | Resolved 10.6.1.1

CPU usage spikes for a few seconds every minute on VCD appliance cells after upgrading from 10.5.1 or earlier. Caused by the appliance synchronization service redundantly calling import-trusted-certificates even when certificates are already present in the truststore.

Fix: Upgrade to 10.6.1.1 or Later

Resolved in 10.6.1.1. If you cannot upgrade immediately, monitor that the CPU spikes are not causing cascading performance issues. The spikes are transient (a few seconds) but on undersized cells they can affect API responsiveness.

Logging

Excessive Log Spam: "User missing ORG_TRAVERSE right" - Introduced 10.6.0 | Open (Workaround Available)

After upgrading to 10.6, the message User missing ORG_TRAVERSE right, loggedInOrgId = [uuid] appears millions of times in the logs within 24 hours. This is the primary driver of the logging volume increase in 10.6 noted in the deployment warnings. It is not a functional error - it is a logging verbosity bug - but it will fill your log disk.

Fix: Set SSDC Module Log Level to INFO

Add the following line to /opt/vmware/vcloud-director/etc/log4j.properties on all cells:

log4j.logger.com.vmware.ssdc=INFO

This suppresses the spam without losing meaningful log output. Apply to all cells and restart the VCD service on each cell sequentially.

/var/log/messages Does Not Rotate - Introduced 10.6.0 | Open (Workaround Available)

A logrotate misconfiguration in the VCD appliance causes the Linux system log at /var/log/messages to grow without bounds and never rotate. This is a separate disk fill vector from the nginx log rotation bug and the general logging volume increase. All three can hit you simultaneously on an unpatched 10.6 deployment.

Fix: Manual logrotate Configuration

Create or update /etc/logrotate.d/syslog to explicitly include /var/log/messages with daily rotation and compression. Verify the rotation is working with logrotate -d /etc/logrotate.d/syslog. Also apply KB 411155 for the nginx log rotation fix. Treat all three log issues as a combined remediation task.

Networking and Edge Gateway

DNAT Rule Fails with "External IP Should Belong to Sub-Allocated Range" - Introduced 10.6.0 | Open

When an Edge Gateway has two uplinks - one to a Provider Gateway and one to an External Network - and both have the same CIDR block allocated, creating a DNAT rule using the External Network's IP fails with the error External IP for NAT should belong to the sub-allocated IP range of the edge gateway. This happens even though the IP is validly allocated on the external network.

Fix: Use Non-Overlapping CIDRs on Dual-Uplink Edge Gateways

Design edge gateway uplinks so that Provider Gateway and External Network uplinks use non-overlapping CIDR blocks. If you are already in this state, use the VCD API to create the DNAT rule directly, bypassing the UI validation. Alternatively, remove and reconfigure one of the uplinks with a distinct CIDR.

Rate Limiting Cannot Be Set on Edge Gateways with NSX Tenancy - Introduced 10.6.0 | Resolved 10.6.1.1

When selecting rate limiting profiles on an edge gateway created in an NSX Tenancy-enabled VDC, the UI lists profiles from the default NSX space instead of from the tenant's NSX Project. Saving fails because the profile cannot be found in the Project namespace.

Fix: Upgrade to 10.6.1.1

Resolved in 10.6.1.1. On prior versions, use the VCD API to apply rate limiting directly, specifying the profile UUID from the correct NSX Project context.

Firewall Policies Created in Stateless Mode for Edge Gateways in VDC Groups - Introduced 10.6.0 | Resolved 10.6.1.1

Even with stateful mode enabled on the edge cluster, VCD creates firewall policies in stateless mode for edge gateways created within a VDC Group. Edge gateways created directly in a VDC (not a VDC Group) are not affected. Stateless firewall policies cannot track connection state, which breaks return traffic for many application patterns.

Fix: Upgrade to 10.6.1.1

Resolved in 10.6.1.1. On prior versions, configure stateful firewall settings directly in NSX for the affected Tier-1 gateways.

vApp Gateway Firewall Change Does Not Propagate to NSX - Introduced 10.6.0 | Open

Modifying the gateway firewall state on a vApp network in VCD does not change the corresponding NSX-T gateway firewall state. The NSX-T gateway firewall remains ON regardless of what VCD shows.

Fix: Apply Firewall Changes Directly in NSX

Make the firewall state change in both VCD and NSX Manager. The VCD setting is cosmetic for vApp gateways in this context - NSX Manager is the authoritative source.

Cannot Configure BFD on IPv6 BGP Neighbors via UI - Introduced 10.6.0 | Open

Documented in Section 4. The UI does not render the BFD toggle correctly for IPv6 BGP peers on Provider Gateways.

Fix: Use API

Use PUT /cloudapi/1.0.0/edgeGateways/{id}/routing/bgp with the BFD configuration in the request body. Functionality works correctly through the API.

Virtual Machine Operations

Windows Server 2025 and Windows 11 Guest Customization Fails - Introduced 10.6.0 | Open

Documented in Section 3. Guest OS types windows11_64Guest and Microsoft Windows Server 2025 (64-bit) fail VCD guest customization with windows11_64Guest os type is not one of supported types.

CentOS Stream 10 and RHEL 10 Guest Customization Fails - Introduced 10.6.0 | Open

Documented in Section 3. No workaround. VCD guest customization agent does not support the NetworkManager profile format used in RHEL 10+.

VM Power On Fails After Flex VDC Reservation Decrease - Introduced 10.6.0 | Resolved 10.6.1.1

In a Flex VDC, if you specify a CPU or memory reservation percentage on a VM and then decrease that value, the VM fails to power on after the change. The VDC's internal resource accounting gets out of sync.

Fix: Click "Make VM Compliant"

Navigate to the VM and click Make VM Compliant. This re-syncs the resource accounting and allows the VM to power on. Resolved in 10.6.1.1.

vCenter Snapshots Count Against VDC Storage Quota - Introduced 10.6.0 | Open

With the introduction of multiple snapshot support in 10.6, VCD now shows all VM snapshots regardless of where they were created. Snapshots taken directly in vCenter (for example, by a backup tool using VADP) now appear in VDC storage metrics and consume VDC storage quota. If your backup process relies on vCenter-level snapshots, your tenant storage quotas may be unexpectedly consumed.

Fix: Account for Snapshot Storage in VDC Quota Planning

Increase VDC storage quotas to accommodate snapshot storage if backup workflows create vCenter-native snapshots. Alternatively, configure VCD to hide external snapshots by setting vm.hideExternalSnapshots=true in VCD cell configuration. This was the previous default behavior - 10.6 changed it.

VM Import Retry Fails After Initial Import Error - Introduced 10.6.0 | Resolved 10.6.1

If a VM import from vCenter fails for any reason (e.g., missing compatible storage policy), retrying the import fails with Cannot import VM into vCD, since it is already managed by vCD. The failed import leaves the VM in a locked state in VCD's database.

Fix: Upgrade to 10.6.1 / Use API Import Workaround

Resolved in 10.6.1. On prior versions, use the VCD API importVm and importVmAsVApp endpoints with a force-unlock parameter, or contact Broadcom support to unlock the VM object in the database.

Static IP Not Configured Correctly After Guest Customization - Introduced 10.6.0 | Open (10.6.1.2 Known Issue)

If you perform guest customization with a static IP address pool, the VM's guest OS may be configured with a pre-existing IP instead of the newly assigned one. This causes tenant VMs to come up with incorrect network configuration after customization.

Fix: Force Re-Customization

Power off the VM, then power it on using Power On, Force Customization. This clears the cached customization state and applies the correct IP from the pool. Monitor the Broadcom release notes for a permanent fix.

VM Suspended for Lease Expiry Despite "Never Expires" Setting - Introduced 10.6.0 | Resolved 10.6.1.2

A VM is suspended for runtime lease expiration even when the runtime lease is set to Never Expires. This is a policy evaluation bug in the lease enforcement engine. Resolved in 10.6.1.2.

Snapshot Revert Fails on vSAN Express Storage Architecture - Introduced 10.6.0 | Open

Reverting a snapshot for a VM with a disk on vSAN Express Storage Architecture fails with Expected completed future, but received future which is still in progress. There is no workaround. Track this if you are deploying workloads on vSAN ESA.

Snapshot Revert Never Completes on vSAN Datastore - Introduced 10.6.0 | Resolved 10.6.1

When reverting a snapshot on a VM stored on a standard vSAN datastore (not ESA), VCD creates a new delta file with the same name as the old one. The revert task never completes. Resolved in 10.6.1. For vSAN ESA the issue remains open separately.

Flex VDC and Resource Pool

Flex VDC Over-Provisioning Broken at Initial Guaranteed = 0 - Introduced 10.6.0 | Open

Documented in Section 6. Setting initial guaranteed to zero with the over-provisioning multiplier active breaks VM deployment.

Flex VDC Elasticity Deactivation Takes Down All VDCs in the Org - Introduced 10.6.1.1 | Resolved 10.6.1.2

This is the most severe Flex VDC bug in the 10.6 line. When a Provider VDC is backed by multiple resource pools and an Organization VDC was converted from Pay-As-You-Go to Flex, deactivating the Flex VDC's elasticity makes all vrp_rp records for every VDC under that Organization inaccessible. Every other VDC in the same Organization stops functioning. The error message is VDC is not associated to primary Resource Pool. This bug was introduced in 10.6.1.1 and remains listed as a known issue in 10.6.1.2 release notes - treat it as open until confirmed fixed in a subsequent release.

Fix: Upgrade to 10.6.1.2 Before Touching Flex Elasticity

Do not toggle elasticity on any Flex VDC that was converted from PAYG on 10.6.1.1. Upgrade to 10.6.1.2 first. If you have already triggered this issue, restoring the vrp_rp records requires Broadcom support assistance - this is not self-recoverable.

PAYGo to Flex Conversion Breaks Elasticity Toggle - Introduced 10.6.1.1 | Resolved 10.6.1.2

When a Provider VDC is backed by multiple resource pools and you convert an existing Organization VDC from Pay-As-You-Go to Flex, subsequently deactivating and activating the Flex VDC elasticity fails with VDC is not associated to primary Resource Pool. This is the same root cause as the elasticity deactivation bug above but manifests specifically on the PAYG-to-Flex conversion path.

VM Power On Fails on Flex VDC with Multiple Resource Pools - Introduced 10.6.0 | Resolved 10.6.1.2

When a Flex VDC has elasticity enabled and the Organization VDC has fewer resource pools than the backing Provider VDC, powering on a VM fails with a StaleObjectStateException. This occurs when you try to add the remaining resource pools to bring the VDC to parity with the Provider VDC.

Applying VM Sizing Policy Fails Due to Zero CPU/Memory Guarantee - Introduced 10.6.0 | Resolved 10.6.1

VCD incorrectly allows VMs to power on in a Flex VDC with zero CPU or memory guarantee. When you subsequently try to apply a VM sizing policy to such a VM, the operation fails with The operation could not be performed, because there are insufficient CPU resources. Resolved in 10.6.1. The non-zero guarantee requirement in Section 6 prevents this.

Storage and Catalog

VMFS to vSAN Storage Policy Change Fails with IOPS Limits Active - Introduced 10.6.0 | Open

Documented in Section 5. When vSphere IOPS is enabled on the source VMFS policy, the storage policy change fails. The exact error is Unable to relocate VM - The operation is not supported on the object.

Fix: Clone Policy and Remove IOPS Before Migration

Clone the source storage policy, disable IOPS on the clone, migrate from source to clone, then migrate from clone to the vSAN target. This is a multi-step workaround but it works and is documented in the Broadcom release notes.

Failed vApp Template Captures Leave Unresolved Orphans - Introduced 10.6.0 | Open

Documented in Section 3. Failed template capture operations leave Unresolved template entries in catalogs that are never auto-cleaned.

Overwriting Catalog Item Fails with StaleObjectStateException - Introduced 10.6.0 | Open

When you deploy a vApp or VM with a sizing policy and then create a catalog template from it, attempting to overwrite the existing catalog template fails with a StaleObjectStateException. There is no workaround - delete the old template and create a new one.

Catalog Sync Fails with "/" in Catalog Name - Introduced 10.6.0 | Resolved 10.6.1.2

If a catalog name contains a forward slash (/) character with the "Enable early catalog export to optimize synchronization" option active, the Cached Organization System task fails repeatedly. Avoid using forward slashes in catalog names on any 10.6.x version prior to 10.6.1.2.

Disk Bus Number and Unit Number Do Not Match on Multi-Disk Add - Introduced 10.6.0 | Open

When adding multiple hard disks to a VM simultaneously, the Bus Number and Unit Number values after saving do not match what was specified. Add disks one at a time to guarantee correct addressing.

IP Management and Reservations

IP Reservation Expiry Fires 24 Hours Early - Introduced 10.6.0 | Resolved 10.6.1.1

Documented in Section 3. Resolved in 10.6.1.1. Use the +1 day buffer workaround on versions prior to 10.6.1.1.

Cannot Modify Custom IP Space Quota via UI - Introduced 10.6.1 | Resolved 10.6.1.2

Attempting to modify the custom IP space quota settings for an Organization in the Provider UI results in the Save button becoming unresponsive. Workaround on 10.6.1: toggle the Floating IPs switch before clicking Save to re-activate the button. Resolved in 10.6.1.2.

Authentication, Certificates, and Identity

CA-Signed Certificate Restore Fails on Password Mismatch - Introduced 10.5.1 | Open

Documented in Section 8. The global.properties file path for certificates changed in 10.5.1. The workaround is to edit the backup's global.properties before restoring, pointing certificate paths to /opt/vmware/vcloud-director/etc/user.http.pem and /opt/vmware/vcloud-director/etc/user.http.key.

SAML User Import Accepts Invalid Email; Edit/Deactivate Then Fails - Introduced 10.6.0 | Resolved 10.6.1.1

VCD does not validate email addresses during SAML user import. If a user is imported with an invalid email, editing or deactivating the user later triggers email validation and fails. Clean up SAML user email addresses before they become production accounts.

Static Routes Lost After Appliance Reboot - Introduced 10.6.1 | Open

Static routes configured in the VCD appliance via ovfenv are not persistent across reboots. The ovfenv population produces null values due to a service startup order race condition. If your VCD cell connectivity depends on static routes, they will not survive a reboot.

Fix: Add Routes to Network Configuration Scripts

Configure persistent static routes through Photon OS network configuration files (/etc/systemd/network/) rather than relying on ovfenv. Set up a post-boot validation that checks route presence and alerts if routes are missing after any cell restart.

TKG and Kubernetes (Enterprise Deployments)

TKG and TKGS Clusters Unsupported with vSphere 8.0 U3 - Introduced 10.6.0 | Open

VMware Tanzu Kubernetes Grid and Tanzu Kubernetes Grid Service underwent architectural changes in vSphere 8.0 Update 3 that break compatibility with VCD. TKG and TKGS cluster management through VCD is supported only on vSphere 8.0 Update 2c or earlier. This is a hard architectural incompatibility - not a configuration issue.

Fix: Stay on vSphere 8.0 U2c for Kubernetes Workloads Under VCD

If you are running Kubernetes workloads managed through VCD Container Service Extension, do not upgrade vSphere to 8.0 U3. Maintain vSphere 8.0 U2c for any clusters that VCD manages. Monitor Broadcom's interoperability matrix for resolution - this is a significant gap that impacts enterprise deployments.

TKG Cluster Creation Fails with "Unknown API Version" - Introduced Before 10.6.1.2 | Resolved 10.6.1.2

After upgrading to a VCD version below 10.6.1.2, creating a Tanzu Kubernetes Grid Cluster with Container Service Extension fails with an Unknown API version error message. Resolved in 10.6.1.2.

NSX Edge OVF Signing Certificate Expired January 3, 2026 - Critical for New Edge Deployments

Starting January 3, 2026, deploying new NSX Edge nodes fails because the OVF signing certificates used by Broadcom expired. Affected operations: new Edge installation, Edge redeployment, and Edge resizing. Error appears during OVF deployment validation. This affects any MSP trying to add Edge capacity or replace Edge nodes on NSX 4.2.3.0 and 4.2.3.1. Resolved in NSX 4.2.3.2 and later.

Fix: Upgrade NSX to 4.2.3.2 or Later

Upgrade NSX Manager and Edge nodes to 4.2.3.2 (released January 2026) or 4.2.3.3 (current as of March 2026). If you cannot upgrade immediately and need to deploy an Edge, Broadcom KB 372634 documents a workaround involving downloading the Edge OVF manually and bypassing certificate validation during deployment. Do not attempt new Edge deployments on NSX 4.2.3.0 or 4.2.3.1 without applying this fix.

RabbitMQ and External Integrations

VCD Stops Sending Events to RabbitMQ (Event Table Overflow) - Introduced 10.6.0 | Resolved 10.6.1.2

Documented in Sections 7 and 10. Event table accumulation causes VCD to stop sending events to RabbitMQ, directly breaking Zerto event awareness. Still listed in the 10.6.1.2 known issues - status uncertain, Broadcom may have fixed it without updating the known issues list. Keep the event table pruned regardless. Purge stale events with Broadcom support assistance and configure aggressive retention as the primary operational workaround.

VCD Multisite Fails Between 10.6.1 and Older Versions - Introduced 10.6.1 | Resolved 10.6.1.1

When one site in a multisite VCD configuration runs 10.6.1 and another runs an older version, the sites fail to load with IllegalArgumentException: Unknown API version. The 10.6.1 local site does not recognize the Alpha API version of the remote site running an older version. Resolved in 10.6.1.1.

Bug Resolution Summary by Version

Version	Key Bugs Fixed	Recommendation
10.6.0	GA - multiple bugs present from launch	Never deploy as target. Go to 10.6.0.1 minimum.
10.6.0.1	RPM install failure, catalog metadata sync	Acceptable minimum baseline only if 10.6.1 is not yet tested in your environment.
10.6.1	VM import retry, snapshot revert on vSAN, 100-rule firewall limit, VM sizing policy fail	Significant improvements. Upgrade from 10.6.0.1.
10.6.1.1	Rate limiting on NSX Tenancy VDCs, stateless firewall on VDC groups, IP expiry bug, CPU spikes post-upgrade	Good release but introduces the Flex elasticity / vrp_rp deactivation bug. Upgrade to 10.6.1.2.
10.6.1.2	Flex elasticity deactivation (fixed), VM lease expiry (fixed), catalog slash bug (fixed), TKG CSE fix, storage policy change. Relocate task timeout increased from 5 to 10 minutes.	Current recommended production target. RabbitMQ event flooding remains open. Verify post-upgrade event flow to RabbitMQ.

Recommended Production Target as of This Writing

VCD 10.6.1.2 is the current recommended production baseline. It resolves the critical Flex VDC elasticity deactivation bug, the VM lease expiry bug, and the TKG CSE cluster creation failure. The RabbitMQ event flooding bug remains open as of 10.6.1.2 - monitor your event table size and verify event flow to RabbitMQ after any upgrade. The relocate task timeout was extended to 10 minutes in this release, which resolves most storage policy change timeout failures.

Three-Tier Tenancy Sub-Provider Permission Gaps

Sub-Providers can create and manage tenant accounts but cannot modify the underlying VCD infrastructure elements - network pools, external networks, storage profiles - that those tenants consume. If a Sub-Provider's tenant hits an infrastructure limit, escalation to the Provider is required. Document this escalation path before giving Sub-Providers live access. This is by design, not a bug, but it surprises people in production.

📈

12. Monitoring, Logging, and Disk Management

VCD Log Locations

Log Path	Content	Growth Rate in 10.6
/opt/vmware/vcloud-director/logs/cell.log	Primary VCD service log	High - rotate aggressively
/opt/vmware/vcloud-director/logs/vcloud-container-debug.log	Debug output	Very high if debug enabled
/opt/vmware/vcloud-director/data/transfer/backups	VCD configuration backups (10.6+)	Grows with each scheduled backup
/var/log/vmware	VMware platform logs	Moderate; increased in 10.6

Log Rotation Configuration

VCD 10.6 increased the default logging verbosity compared to 10.5. If you are not running centralized log shipping (Syslog to your SIEM or log aggregator), you will fill local disk. Configure log rotation explicitly:

Add VCD log rotation - /etc/logrotate.d/vmware-vcd

/opt/vmware/vcloud-director/logs/*.log { daily rotate 7 compress missingok notifempty sharedscripts postrotate /bin/kill -HUP $(cat /opt/vmware/vcloud-director/run/vcloud-director.pid 2>/dev/null) 2>/dev/null || true endscript }

Event Table Maintenance

The VCD event table in PostgreSQL grows indefinitely without active maintenance. In VCD 10.6.1.1, a large event table actively causes the RabbitMQ integration to fail. Beyond that specific bug, an oversized event table degrades API performance across the board. Implement a scheduled maintenance task to archive or purge events older than your retention requirement.

Event Table Purge Requires Broadcom Support

Do not modify the VCD database directly without Broadcom support guidance. The event table has relationships to other tables and a poorly executed purge can cause data integrity issues. Open a support case and request guidance on safe event table maintenance for your deployment size and version.

Monitoring Checklist

The following items should be in your monitoring stack for every VCD 10.6 deployment:

Cell service health: poll https://vcd-cell/api/sessions - a response indicates the API is alive. VCD cell disk usage: root, transfer, and log volumes at 75% and 90% thresholds. RabbitMQ connectivity: VCD Admin extensibility check, and independently verify event flow by monitoring the systemExchange queue message rate. Database connection pool: VCD API latency increases significantly when the PostgreSQL connection pool is saturated. Certificate expiry: set alerts at 60 and 30 days before expiry for all certs in the Certificate Library. VCD backup file age: verify the backup file in /opt/vmware/vcloud-director/data/transfer/backups is being updated on schedule.

Monitoring Wired to Real Tools

The checklist in Section 12 tells you what to monitor. Here is how to wire it up in the monitoring tools MSPs actually use.

Zabbix

Zabbix - VCD API health check item

Type: HTTP agent Name: VCD API Health URL: https://{vcd-lb-fqdn}/api/sessions Request method: GET Request headers: Accept: application/*+xml;version=36.3 Authentication: None (unauthenticated sessions endpoint returns version info) Expected response code: 200 Update interval: 60s Trigger: {VCD API response code} <> 200 Severity: High Message: VCD API not responding on {HOST.NAME}

Zabbix - VCD cell disk space via SSH

Type: Zabbix agent (SSH check, or SNMP if configured) Name: VCD Root Disk Used Percent Key: vfs.fs.size[/,pused] Trigger - Warning: {ITEM.LASTVALUE} > 75 Trigger - High: {ITEM.LASTVALUE} > 90 Message: VCD cell {HOST.NAME} root disk at {ITEM.LASTVALUE}% # Repeat for transfer directory: Key: vfs.fs.size[/opt/vmware/vcloud-director/data/transfer,pused] Name: VCD Transfer Dir Used Percent

Zabbix - RabbitMQ event flow (requires RabbitMQ Zabbix template)

Type: HTTP agent (RabbitMQ Management API) Name: VCD systemExchange Queue Message Rate URL: https://{rabbitmq-host}:15671/api/queues/%2F/systemExchange Authentication: Basic User: {rabbitmq-admin} Password: {rabbitmq-password} JSONPath: $.message_stats.publish_details.rate Trigger: {ITEM.LASTVALUE} = 0 for 30m Severity: Warning Message: VCD is not sending events to RabbitMQ on {HOST.NAME} (Note: zero rate overnight when VCD is quiet is normal. Trigger on zero rate during business hours or following VCD activity.)

Zabbix - VCD service status via systemd check

Type: Zabbix agent Name: VCD Service Status Key: systemd.unit.info[vmware-vcd.service,ActiveState] Trigger: {ITEM.LASTVALUE} <> "active" Severity: Disaster Message: VCD service is not active on {HOST.NAME}

PRTG

PRTG - VCD API HTTP sensor

Sensor type: HTTP sensor URL: https://{vcd-lb-fqdn}/api/sessions Method: GET Expected HTTP status: 200 Timeout: 30s Sensor name: VCD API Health Warning limit: response time > 5000ms Error limit: HTTP status != 200 # Add additional channels: # - Response time (ms) - track latency trends # - SSL certificate expiry - PRTG SSL cert sensor against same URL

PRTG - VCD cell disk via SSH or WMI

Sensor type: SSH Disk Free Space Sensor (Linux) Target: {vcd-cell-ip} Credentials: SSH user with read access Monitor these paths: / (root) - warn 75%, error 90% /opt/vmware/vcloud-director/data/transfer - warn 70%, error 85% /var/log - warn 75%, error 90% # For multi-cell environments, create a device group # and apply the same sensors to all cells

PRTG - RabbitMQ via REST API sensor

Sensor type: REST Custom Sensor (JSON) URL: https://{rabbitmq}:15671/api/queues/%2F/systemExchange Authentication: Basic auth Parse: JSONPath Channel 1: $.message_stats.publish_details.rate (name: "VCD Event Rate") Channel 2: $.messages (name: "Queue Depth") Warning trigger: Queue Depth > 10000 Error trigger: Queue Depth > 50000 (events backing up, not being consumed)

Prometheus / Grafana (for MSPs running modern observability stacks)

Prometheus - VCD API blackbox exporter config

# blackbox.yml - add VCD probe modules: vcd_api: prober: http timeout: 30s http: valid_http_versions: ["HTTP/1.1", "HTTP/2"] valid_status_codes: [200] method: GET headers: Accept: "application/*+xml;version=36.3" fail_if_ssl_error: true tls_config: insecure_skip_verify: false # prometheus.yml - scrape config scrape_configs: - job_name: vcd_api metrics_path: /probe params: module: [vcd_api] static_configs: - targets: - https://{vcd-lb-fqdn}/api/sessions relabel_configs: - source_labels: [__address__] target_label: __param_target - source_labels: [__param_target] target_label: instance - target_label: __address__ replacement: {blackbox-exporter-host}:9115

Grafana - alert rule for VCD API health

Alert name: VCD API Down Query: probe_success{job="vcd_api"} == 0 Condition: IS BELOW 1 For: 2m Severity: critical Message: VCD API is not responding at {{ $labels.instance }}

RabbitMQ Prometheus exporter - VCD event flow alert

# If using rabbitmq_exporter or rabbitmq built-in prometheus plugin: # Alert on zero publish rate to systemExchange during business hours Alert name: VCD RabbitMQ Event Flow Zero Query: rate(rabbitmq_queue_messages_published_total{queue="systemExchange"}[10m]) == 0 Condition: IS EQUAL TO 0 For: 20m Severity: warning Message: VCD has not published events to RabbitMQ in 20 minutes Check VCD event table size and cell.log for errors

Certificate Expiry Monitoring (All Platforms)

Script-based cert expiry check - runs from any monitoring platform

#!/bin/bash # Check days until VCD SSL cert expiry # Run as a custom check from Zabbix, PRTG, or monitoring cron HOST="vcd.example.com" PORT=443 WARN_DAYS=60 CRIT_DAYS=30 EXPIRY=$(echo | openssl s_client -servername $HOST -connect $HOST:$PORT 2>/dev/null | openssl x509 -noout -dates 2>/dev/null | grep "notAfter" | cut -d= -f2) EXPIRY_EPOCH=$(date -d "$EXPIRY" +%s) NOW_EPOCH=$(date +%s) DAYS_LEFT=$(( ($EXPIRY_EPOCH - $NOW_EPOCH) / 86400 )) echo "VCD cert expires in $DAYS_LEFT days ($EXPIRY)" if [ $DAYS_LEFT -lt $CRIT_DAYS ]; then exit 2 # CRITICAL elif [ $DAYS_LEFT -lt $WARN_DAYS ]; then exit 1 # WARNING else exit 0 # OK fi

✅

13. Deployment Configuration Checklist

Run this checklist against every new VCD 10.6 deployment before going live - MSP and enterprise alike. It covers the permanent decisions and the most common misconfiguration points. Some items are SP-specific (multi-tenant IP isolation, sub-provider escalation paths) - mark them N/A if they do not apply to your deployment model.

Pre-Deployment Planning

Item	Verified	Notes
Network pool architecture documented before VDC creation	[ ]	Cannot change after VDC creation
NSX Projects topology designed and created in NSX Manager	[ ]	One project per tenant tier or shared per decision
RabbitMQ version confirmed 3.10.x-3.12.x	[ ]	Required for VCD 10.6 compatibility
Zerto version confirmed compatible with VCD 10.6 and MQTT	[ ]	Verify with Zerto interoperability matrix at time of deployment
Veeam version confirmed - VBR v12+ recommended	[ ]	v13 preferred for VCD 10.6
Disk layout planned: separate volumes for transfer, logs, OS	[ ]	Do not put transfer data on root
IP Space architecture documented	[ ]	Which tenants get which subnets
Tenancy model decision documented (Provider / Sub-Provider / Tenant for SP, or flat Org model for enterprise)	[ ]	Document escalation paths for sub-providers if applicable
Certificate password policy documented for Certificate Library	[ ]	Required for restore operations

Post-Deployment Verification

Item	Verified	Notes
vmware-vcd service running on all cells	[ ]	Check systemctl status vmware-vcd
VCD API responding on load balancer VIP	[ ]	GET /api/sessions returns 200
AMQP stale settings removed (if upgrade from pre-10.1)	[ ]	Via API - not available in UI
RabbitMQ event flow confirmed from VCD	[ ]	Monitor systemExchange queue activity
Strict Mode IPAM enabled on all sub-provider/tenant accounts	[ ]	Prevents IP range escape
Flex VDC guaranteed resources set to non-zero value	[ ]	Avoids over-provisioning calculation bug
Log rotation configured on all cells	[ ]	Verify with logrotate -d
Disk monitoring alerts active for root, transfer, log volumes	[ ]	75% and 90% thresholds minimum
Veeam Cloud Director tenant accounts created and tested	[ ]	Test backup and restore with a test VM
Zerto Cloud Manager connected and VPG test failover performed	[ ]	Do not accept production without a test failover
Certificate expiry alerts configured	[ ]	30 and 60 day advance notice
VCD configuration backup running and verified	[ ]	Confirm backup file age in transfer/backups
IP reservation expiry dates adjusted (+1 day buffer)	[ ]	Known date calculation bug in 10.6
Failed vApp template cleanup process documented	[ ]	Operational runbook entry

🗄️

14. PostgreSQL Database Access, Maintenance, and Stale Table Cleanup

The embedded vPostgres database in the VCD appliance is not a black box you leave alone. It requires active maintenance, and several of the bugs documented in Section 11 require direct database access to remediate. What follows covers connecting locally and remotely, safely cleaning up the tables that cause performance degradation and the RabbitMQ event flooding bug, and the autovacuum settings that prevent TOAST bloat from filling your disk.

Stop All Cells Before Any Direct DB Write Operations

Any write operation against the VCD database - truncating tables, deleting rows, vacuuming - must be performed with all VCD cells stopped. Running writes against the database while cells are active will cause data corruption. The only exception is read-only queries for diagnostics. Always take a VCD backup before any database maintenance operation.

Connecting Locally on the Primary Cell

The embedded database is named vcloud. Connect as the postgres superuser from the primary cell shell:

Local psql connection on primary cell

sudo -i -u postgres psql vcloud

Once connected, verify you are in the right database and check table sizes before doing anything else:

List databases and check sizes of the problem tables

-- Confirm connected database \conninfo -- List all databases \l -- Check sizes of the tables most likely to cause issues SELECT relname AS table_name, pg_size_pretty(pg_total_relation_size(relid)) AS total_size, pg_size_pretty(pg_relation_size(relid)) AS table_size, pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) AS index_size FROM pg_catalog.pg_statio_user_tables WHERE relname IN ( 'audit_event', 'audit_trail', 'activity', 'scheduled_activity_jobs', 'activity_pc_queue', 'activity_pc_event_queue', 'fifo_activity_queue', 'task_activity_queue', 'vc_activity_queue', 'property_map', 'property_map_status' ) ORDER BY pg_total_relation_size(relid) DESC;

Finding the vcloud User Password

The VCD database credentials are stored in /opt/vmware/vcloud-director/etc/global.properties. Look for database.username and database.password. These are needed for PgAdmin remote access and for any connection using the vcloud role rather than the postgres superuser.

Retrieve database credentials

grep -E "database\.(username|password)" /opt/vmware/vcloud-director/etc/global.properties

Connecting Remotely with PgAdmin

The embedded PostgreSQL listens only on localhost by default. Remote access requires three changes: a pg_hba.d drop-in rule, a firewall rule for port 5432, and confirming listen_addresses. The VCD appliance uses a drop-in directory model - do not edit pg_hba.conf directly, use the drop-in directory.

Step 1 - Create a pg_hba.d Drop-in File

SSH to the primary cell as root. Navigate to the drop-in directory and create a new file for your management host entries. Do not edit the main pg_hba.conf - changes there are overwritten. The drop-in directory is the supported path:

Create drop-in ACL file

cd /opt/vmware/appliance/etc/pg_hba.d/ # Create a file named for your purpose (e.g., pgadmin-access) cat > pgadmin-access << 'EOF' # TYPE DATABASE USER ADDRESS METHOD host vcloud vcloud 10.10.10.50/32 md5 host vcloud vcloud 10.10.10.51/32 md5 EOF # Set correct permissions chown vpostgres:vpostgres pgadmin-access chmod 640 pgadmin-access

New entries are appended to the dynamically managed /var/vmware/vpostgres/current/pgdata/pg_hba.conf automatically. Restrict entries to specific management host IPs only - never use 0.0.0.0/0 in production.

Step 2 - Open Port 5432 in the Appliance Firewall

Allow TCP 5432 inbound

iptables -A INPUT -p tcp -m tcp --dport 5432 -j ACCEPT # Make persistent across reboots (Photon OS) iptables-save > /etc/systemd/scripts/iptables

Step 3 - Reload PostgreSQL Configuration

No restart needed - reload picks up the new pg_hba entries:

Reload pg_hba config without restart

sudo -i -u postgres /opt/vmware/vpostgres/current/bin/psql -c "SELECT pg_reload_conf();"

Step 4 - Connect with PgAdmin

In PgAdmin, create a new server connection:

Field	Value
Host	Primary cell IP address
Port	5432
Database	vcloud
Username	vcloud (from global.properties)
Password	vcloud user password from global.properties
SSL mode	prefer

If connection fails with no pg_hba.conf entry for host, verify the drop-in file was applied with: sudo -i -u postgres psql -c "SHOW hba_file;" and confirm your IP appears in the resulting file.

Audit Trail and Event Table Cleanup

The audit_trail table is the primary driver of database bloat in active VCD deployments. The open RabbitMQ event flooding bug in VCD 10.6.x is directly linked to an oversized event table - keeping this table clean is the primary operational workaround until Broadcom ships a fix. This is the supported cleanup procedure from Broadcom KB 320412.

Take a VCD Backup First

Run /opt/vmware/appliance/bin/create-backup.sh on the primary cell before any of the following operations. Confirm the backup file exists in /opt/vmware/vcloud-director/data/transfer/backups before proceeding.

Step 1 - Stop All VCD Cells

Run on every cell, starting with secondaries and finishing with primary:

Stop VCD cell service

/opt/vmware/vcloud-director/bin/cell-management-tool cell -i $(service vmware-vcd pid cell) -s

Step 2 - Take VCD Database Backup

Create VCD backup

/opt/vmware/appliance/bin/create-backup.sh

Step 3 - Connect to the Database

Connect as postgres superuser

sudo -i -u postgres psql vcloud

Step 4 - Check Row Counts Before Truncating

Verify row counts - run these before any deletes

SELECT COUNT(*) FROM audit_trail; SELECT COUNT(*) FROM audit_event; SELECT COUNT(*) FROM activity; SELECT MIN(created_date), MAX(created_date) FROM audit_trail;

If audit_trail contains millions of rows, or if the oldest entries are years old, it is safe to truncate. If you want to preserve recent data, use a dated DELETE instead of TRUNCATE.

Step 5 - Truncate the Audit Trail

Full truncate (fastest - removes all audit history)

TRUNCATE TABLE audit_trail;

Selective delete - keep last 30 days (slower but preserves recent history)

DELETE FROM audit_trail WHERE created_date < NOW() - INTERVAL '30 days';

Step 6 - Clear Activity Tables (If ActivityLogCleanerJob Is Stuck)

If the automatic cleanup job has stopped running - which happens when the activity tables grow beyond a certain threshold - clear the scheduler state:

Clear activity and scheduler tables

DELETE FROM activity; DELETE FROM scheduled_activity_jobs; DELETE FROM activity_pc_queue; DELETE FROM activity_pc_event_queue; DELETE FROM fifo_activity_queue; DELETE FROM task_activity_queue; DELETE FROM vc_activity_queue;

Step 7 - Start VCD Cells Back Up

Start the primary cell first. Wait for it to fully come up before starting secondaries:

Start VCD cell service

systemctl start vmware-vcd # Monitor startup tail -f /opt/vmware/vcloud-director/logs/cell.log

Configure Automatic Retention to Prevent Recurrence

After cleaning up, configure VCD to automatically enforce retention limits. Run these on any one cell - the settings propagate to all cells via the database:

Set audit trail, task, and event retention via CMT

# Retain audit trail for 45 days (max is 60) /opt/vmware/vcloud-director/bin/cell-management-tool manage-config -n com.vmware.vcloud.audittrail.history.days -v 45 # Retain tasks for 30 days /opt/vmware/vcloud-director/bin/cell-management-tool manage-config -n com.vmware.vcloud.tasks.history.days -v 30 # Verify settings were applied /opt/vmware/vcloud-director/bin/cell-management-tool manage-config -n com.vmware.vcloud.audittrail.history.days -l /opt/vmware/vcloud-director/bin/cell-management-tool manage-config -n com.vmware.vcloud.tasks.history.days -l

TOAST Bloat and Autovacuum Configuration

The TOAST data size increase in PROPERTY_MAP and PROPERTY_MAP_STATUS in 10.6 requires autovacuum to be properly tuned. Default PostgreSQL autovacuum settings are too conservative for a busy VCD instance. The following settings belong in /var/vmware/vpostgres/current/pgdata/postgresql.conf:

postgresql.conf - autovacuum tuning for VCD 10.6

autovacuum = on track_counts = on autovacuum_max_workers = 3 autovacuum_naptime = 1min autovacuum_vacuum_cost_limit = 2400

After editing, reload without a full PostgreSQL restart:

Reload postgresql.conf

sudo -i -u postgres /opt/vmware/vpostgres/current/bin/psql -c "SELECT pg_reload_conf();"

For the activity_parameters table specifically, which is a frequent bloat source, run a manual vacuum if autovacuum has not caught up:

Manual vacuum on problem tables

sudo -i -u postgres psql vcloud -c "VACUUM VERBOSE ANALYSE activity_parameters;" sudo -i -u postgres psql vcloud -c "VACUUM VERBOSE ANALYSE property_map;" sudo -i -u postgres psql vcloud -c "VACUUM VERBOSE ANALYSE property_map_status;"

Stale Long-Running Transactions Block Autovacuum

If autovacuum is configured correctly but still not cleaning up, a stale long-running transaction is likely holding an XID that prevents vacuum from reclaiming space. Find and terminate it:

Find blocking transactions

SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state FROM pg_stat_activity WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes' AND state != 'idle' ORDER BY duration DESC;

If you find transactions running for hours, those are your blockers. Terminate with SELECT pg_terminate_backend(pid); - do this only after confirming what the transaction is doing.

Database Monitoring Queries

Run these periodically or set them up as scheduled monitoring tasks in your RMM tooling:

Database size and growth monitoring

-- Overall database size SELECT pg_size_pretty(pg_database_size('vcloud')) AS db_size; -- Top 10 tables by total size SELECT relname, pg_size_pretty(pg_total_relation_size(relid)) AS size FROM pg_catalog.pg_statio_user_tables ORDER BY pg_total_relation_size(relid) DESC LIMIT 10; -- Active connections and states SELECT state, count(*) FROM pg_stat_activity GROUP BY state; -- Check autovacuum last run on problem tables SELECT relname, last_autovacuum, last_autoanalyze, n_dead_tup, n_live_tup FROM pg_stat_user_tables WHERE relname IN ('audit_trail', 'activity_parameters', 'property_map', 'property_map_status') ORDER BY n_dead_tup DESC;

vPostgres Service Management

Operation	Command	Notes
Check service status	systemctl status vpostgres	Run on primary cell
Start vPostgres	systemctl start vpostgres	Start before vmware-vcd
Stop vPostgres	systemctl stop vpostgres	Stop after vmware-vcd is down
Check replication status	sudo -i -u postgres psql -c "SELECT * FROM pg_stat_replication;"	Run on primary - shows standby lag
Check standby lag	sudo -i -u postgres psql -c "SELECT now()-pg_last_xact_replay_timestamp() AS lag;"	Run on standby cell
Find data directory	sudo -i -u postgres psql -c "SHOW data_directory;"	Confirm before disk operations
Find log directory	ls /var/vmware/vpostgres/current/pgdata/log/	PostgreSQL logs here

WAL Corruption After Unclean Shutdown

If VCD and vPostgres crash together due to a storage failure or power loss, PostgreSQL may fail to start due to WAL corruption. Symptoms: vPostgres logs show invalid WAL record errors and the service exits immediately after starting. This is not resolved by a simple service restart. Recovery requires removing the stale postmaster.pid lock file and potentially clearing corrupted WAL segments - both operations that require Broadcom support if you have not done them before. A current VCD backup is your only clean recovery path if WAL recovery fails. This is the most important reason to validate that your scheduled VCD backups are completing and the files are not zero-byte.

☕

15. Java Error Diagnosis - Reading cell.log and cell-runtime.log

VCD is a Java application. When something goes wrong - a cell fails to start, the API stops responding, tasks hang - the first place you go is the logs. Most engineers skim these and give up too fast. Read on for how to read them quickly and match error patterns to known causes.

Log File Map

Log File	What It Contains	When to Use It
cell.log	Cell startup sequence. Reports subsystem init percentage and status. Rewrites on every service restart.	First stop when a cell won't start or the UI is unreachable. Watch tail -f cell.log during startup.
cell-runtime.log	Runtime Java exceptions, task failures, internal service errors, timeout errors.	Cell started but something isn't working - a task is stuck, an API call is failing, the 5-minute restart loop.
vcloud-container-debug.log	Internal inter-service communications, detailed debug output.	Deep troubleshooting only. Very high volume. Look here when cell.log and cell-runtime.log don't explain the failure.
vcloud-container-info.log	Warnings and errors from the VCD container. Less verbose than debug.	Good secondary source. Auto-import failures log here.
vmware-vcd-watchdog.log	Watchdog service attempts to restart vmware-vcd if it hangs or stops.	Cell is restarting repeatedly - confirms the watchdog is triggering and shows restart frequency.
networking.log	NSX and network provisioning operations.	Edge gateway creation failures, NSX API timeouts, network pool exhaustion.
vclistener.log	vCenter event listener. All VCListener traffic from vSphere.	PostgreSQL TOAST spike investigation. High event volume correlation.

Reading cell.log During Startup

A healthy startup in cell.log looks like this - each percentage step represents a subsystem coming up:

Healthy cell.log startup sequence

Application startup begins: Successfully bound network port: 443 on host address: 10.x.x.x Application Initialization: 9% complete. Subsystem 'com.vmware.vcloud.common.core' started Successfully connected to database: jdbc:postgresql://127.0.0.1:5432/vcloud Application Initialization: 27% complete. Subsystem 'com.vmware.vcloud.backend-core' started Application Initialization: 63% complete. Subsystem 'com.vmware.vcloud.imagetransfer-server' started Application Initialization: 72% complete. Subsystem 'com.vmware.vcloud.rest-api-handlers' started Application Initialization: 100% complete. Server is ready in 0:52 (minutes:seconds)

If it never reaches 100%, look at what percentage it stopped at:

Stuck At	Failing Subsystem	Most Likely Cause
9%	common.core	Database connection failure. Check PostgreSQL is running, check JDBC URL in global.properties, check pg_hba.conf allows the connection.
WAITING (any %)	Any	A dependency subsystem has not started. Look at vcloud-container-debug.log for what the stalled subsystem is waiting on.
Startup fails immediately	global.properties	File not found or permissions wrong. Compare ownership and chmod to a working cell. The vcloud user needs read access.
Console proxy error	consoleproxy	IP address in the console proxy config does not exist on the cell's network interface. Check the console proxy IP setting in the VCD admin portal against the cell's actual IP.

Common Java Exceptions and What They Mean

java.lang.IllegalStateException: Unable to bind to [IP]:443

The console proxy or HTTPS listener is trying to bind to an IP address that does not exist on this cell's network interfaces. Usually caused by: wrong console proxy IP configured in VCD after a cell IP change, stale configuration after a cell rebuild, or a misconfigured multi-cell deployment where the proxy IP was copied from another cell.

Fix: Update Console Proxy IP in VCD Admin

In the VCD Service Provider Admin Portal, navigate to System, then Administration, then Cloud Cells. Find the cell and update the public console proxy address to match the cell's actual IP. Alternatively, reconfigure via the VAMI at port 5480.

java.sql.SQLException / "Cannot connect to database"

VCD cannot reach PostgreSQL. Causes in order of frequency: vpostgres service is not running, the JDBC connection string in global.properties points to the wrong host or port, pg_hba.conf does not have an entry for this cell's connection, the transfer filesystem is full and vpostgres stopped (yes, disk fill can kill the database too).

Fix: Check in This Order

Database connectivity diagnostic sequence

# 1. Is vpostgres running? systemctl status vpostgres # 2. Can the cell connect to the database at all? sudo -i -u postgres psql vcloud -c "SELECT version();" # 3. Check what the cell is trying to connect to grep database /opt/vmware/vcloud-director/etc/global.properties # 4. Check disk usage - full disk kills vpostgres df -h /var/vmware /opt/vmware/vcloud-director/data/transfer /

java.util.concurrent.TimeoutException: Timed out waiting for...

A service or subsystem took too long to respond during startup or during a task. The text after "waiting for" tells you what timed out. In the VM relocate context, this exception causes task failures when storage migrations take longer than the configured timeout (documented in Section 11). In other contexts it indicates resource contention - the cell JVM is not getting enough CPU, or the database is responding slowly.

Fix: Context-Dependent

For the 10.6.1.2 restart loop: open a Broadcom support case. For general timeout exceptions: check CPU ready time on the cell VM in vCenter, check PostgreSQL query latency, and check if the PostgreSQL disk is under pressure. A common cause is the cell VM being over-committed on the host during a busy period.

org.springframework.beans.factory.BeanCreationException

A Spring bean - an internal VCD service component - failed to initialize. The full exception stack trace following this line will name the specific bean. The most common causes are: a dependent service is not running, a configuration property is missing or malformed in global.properties, or a certificate is not readable by the vcloud user.

Fix: Read the Full Stack Trace

Do not stop at the BeanCreationException line. Scroll down in the log to the "Caused by:" chain - it will eventually reach a root cause that is actionable, such as a file permission error, a missing config property, or a connection refused message.

Extract the root cause from a BeanCreationException

grep -A 50 "BeanCreationException" /opt/vmware/vcloud-director/logs/cell-runtime.log | grep "Caused by:" | tail -5

com.vmware.ssdc.library.exceptions.MultipleLMException

Multiple concurrent operations failed. Common context: trying to power off or delete a vApp whose underlying VMs are in inconsistent states between VCD and vCenter. Each VM's failure is packaged into this aggregate exception.

Fix: Resolve Individual VM States First

Unpack the exception to find which specific VMs are failing. Use the VCD API to query the vApp's VMs individually and check their deployment_status. Resolve each stuck VM using the procedures in Section 16 before retrying the vApp-level operation.

StaleObjectStateException

VCD's Hibernate ORM layer tried to update or delete a database record that was modified by another transaction between the time VCD read it and the time it tried to write it. Common scenarios: two operations on the same object running concurrently (catalog overwrite, VM reconfigure during a storage policy change), or a stale session after a database failover.

Fix: Retry the Operation

StaleObjectStateException is a transient concurrency error in most cases. Retry the operation. If it happens consistently on the same object, check if another process (Veeam job, Zerto VPG, automated script) is concurrently modifying the same object.

NullPointerException (NPE) in VCD Logs

A code path hit a null reference. NPEs in VCD are usually bugs, not operator errors. Several are documented in the release notes (the health monitor NPE on VRF setups, the edge cluster NPE on VDC group edge gateways). Collect the full stack trace and check it against the known bugs in Section 11 before raising a new support case.

Useful Log Grep Patterns

Essential log analysis commands

# Watch startup in real time tail -f /opt/vmware/vcloud-director/logs/cell.log # Find all FATAL and ERROR entries in the last 500 lines of cell-runtime.log tail -500 /opt/vmware/vcloud-director/logs/cell-runtime.log | grep -E "FATAL|ERROR" # Find all exceptions with their root causes grep -E "Exception|FATAL|ERROR" /opt/vmware/vcloud-director/logs/cell-runtime.log | tail -100 # Find the specific ORG_TRAVERSE spam and count it grep -c "ORG_TRAVERSE" /opt/vmware/vcloud-director/logs/vcloud-container-debug.log # Find VCListener event volume (contributing to PostgreSQL TOAST bloat) grep "VCListener" /opt/vmware/vcloud-director/logs/vclistener.log | wc -l # Find auto-import failures grep -i "import" /opt/vmware/vcloud-director/logs/vcloud-container-info.log | grep -i "fail\|error\|skip" # Find networking errors grep -E "ERROR|Exception" /opt/vmware/vcloud-director/logs/networking.log | tail -50 # Check watchdog restart frequency grep "restart" /opt/vmware/vcloud-director/logs/vmware-vcd-watchdog.log | tail -20

Enabling Debug Logging Temporarily

Default log level is INFO. For active troubleshooting, you can bump specific components to DEBUG without restarting the VCD service using the log4j configuration. Add or modify entries in /opt/vmware/vcloud-director/etc/log4j.properties on the cell where the problem is occurring:

Temporary debug logging for specific components

# Suppress the ORG_TRAVERSE spam (covered in Section 11) log4j.logger.com.vmware.ssdc=INFO # Debug NSX/networking operations log4j.logger.com.vmware.vcloud.networking=DEBUG # Debug vCenter inventory sync (VCListener) log4j.logger.com.vmware.vcloud.vclistener=DEBUG # Debug database operations (use briefly - very high volume) log4j.logger.com.vmware.vcloud.database=DEBUG

Changes to log4j.properties take effect within 60 seconds without a service restart. Remove debug entries after troubleshooting to avoid disk fill from verbose output.

👻

16. Stranded, Unknown, and Stuck VMs - Identification and Cleanup

This is one of the most common operational pain points in VCD environments. VMs end up in states VCD cannot reconcile - deleted in vCenter but still showing in VCD, stuck in UNKNOWN power state, stranded in the StrandedItems folder, or orphaned in the database after a failed import. Each scenario has a different resolution path.

The cloud.uuid Tracking Mechanism

VCD tracks ownership of VMs through a custom attribute called cloud.uuid written into the VM's configuration parameters (the .vmx file) in vCenter. When VCD discovers a VM in a managed resource pool, it checks for this attribute. If cloud.uuid is present and matches a record in VCD's database, VCD considers that VM managed. If a VM is deleted directly in vCenter and the cloud.uuid record persists in the VCD database, VCD shows the VM as UNKNOWN. Understanding this mechanism is the key to resolving most stranded VM situations.

Scenario 1: VM Shows as UNKNOWN - Deleted in vCenter but Still in VCD

This is the most common scenario. Someone deleted a VM directly from vCenter without going through VCD. The VCD database still has the record. The VM shows as UNKNOWN and all operations on it are disabled.

Step 1 - Identify the VM in the Database

Connect to the VCD database and run the official Broadcom diagnostic query. Replace <VM Name> with the VM name as it appears in VCD:

Query for VM deleted in vCenter (Scenario 1 - Broadcom KB 371530)

-- Run both queries and capture output before taking any action SELECT org.name AS Tenant, vdc.name AS VDC, vmc.name AS vApp, vapp_vm.name AS VM, cvm.deployment_status AS Status, dvm.date_deployed AS Deployed FROM vapp_vm LEFT JOIN vm ON vm.id = vapp_vm.svm_id LEFT JOIN vm_inv ON vm_inv.moref = vm.moref LEFT JOIN computevm cvm ON cvm.id = vapp_vm.cvm_id LEFT JOIN deployed_vm dvm ON dvm.vm_id = vm.id LEFT JOIN networked_vm nvm ON nvm.id = vapp_vm.nvm_id LEFT JOIN vm_container vmc ON vmc.sg_id = vapp_vm.vapp_id LEFT JOIN org_prov_vdc vdc ON vmc.org_vdc_id = vdc.id LEFT JOIN organization org ON org.org_id = vdc.org_id WHERE vapp_vm.name LIKE '%<VM Name>%' ORDER BY org.name;

Step 2 - Attempt Force Delete via API

Before touching the database, try the API force delete. This is the supported path:

Force delete a stuck vApp via API

DELETE https://vcd.example.com/api/vApp/{vAppUrn}?force=true Authorization: Bearer {token} Accept: application/*+xml;version=36.3

If the vApp contains multiple VMs, delete each VM first then the vApp. If force=true still fails because VCD cannot reach vCenter to confirm deletion, proceed to Step 3.

Step 3 - Database Cleanup (If Force Delete Fails)

Take a database backup first. Then use the database to locate and remove the orphaned record. The key tables are vm_container (vApps) and vapp_vm (individual VMs). Foreign key constraints mean you must delete in the right order:

Locate the stuck vApp record

-- Find the vApp by name SELECT sg_id, name, creation_status, id FROM vm_container WHERE name LIKE '%<vApp Name>%'; -- Confirm it has no active VMs first SELECT vapp_vm.name, vm.moref FROM vapp_vm JOIN vm ON vm.id = vapp_vm.svm_id WHERE vapp_vm.vapp_id = '<sg_id from above>';

Foreign Key Constraints Will Block Direct Deletes

You cannot simply DELETE FROM vm_container - other tables reference it. The correct path is the API force delete or engaging Broadcom support for complex cases. Direct database row deletion requires navigating the full foreign key chain: deployed_vm, vapp_vm, networked_vm, computevm, vm, then vm_container. If in doubt, open a support case and provide the query output from Step 1. Broadcom support has a tested deletion procedure for complex stranded cases.

Scenario 2: Mass UNKNOWN State After Crash or Storage Failure

After an unclean shutdown - storage failure, power outage, host crash - multiple or all VMs in VCD may show UNKNOWN. The VMs are fine in vCenter. VCD cannot reconcile their states because the inventory cache tables are stale or corrupted.

Fix: Truncate the Inventory Cache Tables and Force Resync

The *_inv tables (vm_inv, dv_portgroup_inv, etc.) are purely cache tables - they hold the last-known state read from vCenter. Truncating them forces VCD to do a full fresh inventory sweep from vCenter on next startup. Stop all cells first.

Truncate inventory cache tables to force full resync

-- Stop all VCD cells before running this -- Take a backup first sudo -i -u postgres psql vcloud -- Truncate all inventory cache tables TRUNCATE TABLE vm_inv CASCADE; TRUNCATE TABLE dv_portgroup_inv CASCADE; TRUNCATE TABLE property_map; TRUNCATE TABLE property_map_status; -- Exit psql \q

Then start all cells and monitor startup. VCD will perform a full vCenter inventory sweep. On large environments this can take several minutes and you will see activity in vclistener.log. VM states should resolve to their correct values within 5-10 minutes of the cells coming up.

property_map Truncation and 10.6 TOAST Bloat

Truncating property_map also addresses the TOAST data bloat issue introduced in 10.6. VCD repopulates it from vCenter on restart. This is a valid maintenance operation but should not be done routinely - it causes a temporary spike in vCenter API calls during the resync.

Scenario 3: VM in Inconsistent Power State (VCD Says Off, vCenter Says On)

This happens when a VM's power state is changed directly in vCenter outside of VCD's control - HA restart, manual power-on, Veeam instant recovery, Zerto test failover. VCD receives the state change event but the sequencing leaves VCD expecting one state while vCenter has another.

Fix: Power Operation Through VCD UI

The simplest resolution is to perform a power operation on the VM through the VCD UI. Navigate to the VM in the tenant or provider portal, click All Actions, then Power, then Power On (even if it appears on in vCenter). This forces VCD to requery vCenter for the current state and reconcile. If the UI operation hangs, it means the inconsistency is deeper - use the CMT debug tool.

Alternative: Use the CMT Inconsistency Troubleshooter (10.6+)

VCD 10.6 added a troubleshooting mechanism in the Cell Management Tool for VM inconsistency resolution. The mechanism is not externally documented by Broadcom - if the UI power operation does not resolve the inconsistency, open a support case and specifically request the CMT inconsistency resolution tool for VCD 10.6.

Scenario 4: VM Shows in VCD But Not Discovered (Invisible to VCD)

A VM exists in vCenter in a managed resource pool but VCD has not discovered or adopted it. This happens after a restore, a vMotion from an unmanaged pool, or when the auto-discovery mechanism skipped the VM for a specific reason.

Fix: Use debug-auto-import to Find the Skip Reason

The debug-auto-import CMT subcommand lists all VMs that the discovery mechanism has skipped and explains why each was skipped:

Debug VM discovery for all VDCs

/opt/vmware/vcloud-director/bin/cell-management-tool debug-auto-import

Debug VM discovery for a specific VDC

/opt/vmware/vcloud-director/bin/cell-management-tool debug-auto-import --vdc "VDC Name Here"

Common skip reasons and their resolutions:

Skip Reason	Cause	Resolution
VM not connected to Org VDC network	VM's NIC is on a port group not managed by VCD	Move VM's NIC to an Org VDC network, or connect to one
VM has cloud.uuid already set	VM was previously managed by VCD - VCD sees a duplicate	Clear cloud.uuid if the original record is gone (see below)
VM is a template	Templates are not discoverable	Convert to VM in vCenter first, then it becomes discoverable
VM is Fault Tolerant	FT VMs cannot be adopted by VCD	Cannot be imported - architectural limitation
VM has IDE controller and is powered on	IDE controller VMs must be powered off for adoption	Power off in vCenter, then wait for next discovery cycle

Auto-discovery runs every 3 minutes. After fixing the skip reason, wait one discovery cycle and check again.

Scenario 5: Restored VM Not Discovered (cloud.uuid Collision)

You restored a VM from backup (Veeam, Zerto test failover, snapshot revert) into the same resource pool. VCD does not discover it even though the VM is visible in vCenter. The cause: the restored VM has the same cloud.uuid as the original. VCD sees what it thinks are two copies of the same VM and will not import the restored copy.

Fix: Clear cloud.uuid from the Restored VM

Power off the restored VM. In vCenter, clear the cloud.uuid configuration parameter (Configuration Parameters in VM Settings). Leave the field blank - do not delete the key, just blank the value. VCD will then treat this as an unmanaged VM and auto-discover it on the next cycle.

Clear cloud.uuid via PowerCLI (can do on running VM)

$vm = Get-VM -Name "RestoredVMName" $setting = Get-AdvancedSetting -Entity $vm -Name "cloud.uuid" $setting | Set-AdvancedSetting -Value "" -Confirm:$false

After clearing, VCD auto-discovery picks up the VM within 3 minutes. The VM gets a new cloud.uuid assigned by VCD during adoption.

Scenario 6: VM Appears in VCD, vApp is Stuck in "Busy" State, No Operations Available

The vApp is in a permanent Busy state. No power operations, no delete, nothing. This typically happens after an interrupted vApp creation, a VCD crash mid-operation, or a failed network provisioning step that left the vApp in a transitional state with no way to proceed or roll back through the UI.

Step 1 - Try Force Delete via API First

Force delete the stuck vApp

DELETE https://vcd.example.com/api/vApp/{vAppUrn}?force=true Authorization: Bearer {token}

Step 2 - If Force Delete Fails, Check vApp creation_status in DB

Find stuck vApp and check its state

SELECT sg_id, name, creation_status, id FROM vm_container WHERE name LIKE '%<vApp Name>%'; -- Common creation_status values: -- 0 = RESOLVED (healthy) -- 1 = CREATED (normal) -- -1 = UNRESOLVED (stuck) -- 2 = UNKNOWN

Step 3 - Escalate to Broadcom Support for Database Removal

Provide Broadcom support with: the vApp ID (sg_id), the output of the query above, and the output of the Broadcom KB 371530 diagnostic query. They will walk through the correct deletion order given the foreign key relationships in the database.

Scenario 7: VMs Deleted from VCD, Now in StrandedItems Folder in vCenter

When you delete an auto-discovered VM from VCD (not a VCD-provisioned VM), VCD moves the VM to a StrandedItems folder in vCenter and renames it with a suffix like vcentervm-1 (vm-uuid). The VM is not deleted from vCenter - it is just moved and tagged as stranded.

Fix: Recover from StrandedItems

The VM's data is intact in vCenter. To recover: move the VM from the StrandedItems folder back to a resource pool (managed or unmanaged depending on your intent), rename it to remove the suffix, and clear the cloud.uuid parameter if you want VCD to re-adopt it. If you want to remove it entirely, delete it from vCenter once it is out of the StrandedItems folder - deleting from StrandedItems directly will not work from the VCD side.

Removing a VM from VCD Management (Deliberate Extraction)

Sometimes you legitimately need to pull a VM out of VCD management - moving to a different CMP, migrating cross-vCenter, or rebuilding the VCD environment without destroying the VM. This is the supported procedure from Broadcom KB 320484:

Step 1 - Power Off the VM

Power off through VCD if possible. If not, through vCenter.

Step 2 - Clear cloud.uuid

In vCenter VM Configuration Parameters, blank the cloud.uuid value. Do not delete the key - blank the value.

Step 3 - Move VM Out of VCD-Managed Resource Pool

Move the VM in vCenter to a resource pool not managed by VCD. This prevents VCD from re-discovering it.

Step 4 - Remove from vCenter Inventory and Re-Add

Remove the VM from vCenter inventory (not delete from disk). Re-add it from its datastore. This forces vCenter to generate a new moref. The old VCD references to the old moref become stale.

Step 5 - Delete the VCD Object

In VCD, the VM will now show as UNKNOWN because its moref no longer matches anything. Use force delete via API to remove the VCD record. The vCenter VM is unaffected.

Force delete the now-stranded VCD object

DELETE https://vcd.example.com/api/vApp/{vAppUrn}?force=true

Step 6 - Release IP from VCD Pool

If the VM was using a static IP from a VCD IP pool, release the IP manually in VCD to prevent it being marked as consumed permanently.

Auto-Discovery Configuration Reference

CMT Setting	Default	Description
managed-vapp.discovery.retry-delay-sec	3600 (60 min)	How long VCD waits before retrying a failed auto-import. Set lower during troubleshooting.
Discovery cycle frequency	Every 3 minutes	Not configurable. VCD polls vCenter every 3 minutes for new VMs in managed resource pools.

Reduce auto-import retry delay for troubleshooting

/opt/vmware/vcloud-director/bin/cell-management-tool manage-config -n managed-vapp.discovery.retry-delay-sec -v 25 # Reset to default after troubleshooting /opt/vmware/vcloud-director/bin/cell-management-tool manage-config -n managed-vapp.discovery.retry-delay-sec -v 3600

🔌

17. Integration, Ecosystem, and Infrastructure Issues

The failure modes in this section span VCD and the components around it - the console proxy, NTP, certificate trust chains, vCenter connectivity, Veeam integration specifics, Terraform, CSE Kubernetes, and media uploads. Most of these are not bugs in the 10.6 release notes but operational patterns that hit MSP and enterprise environments consistently.

Console Access and VMRC

Console Proxy Architecture Changed in 10.4 - Legacy Config Breaks Consoles

VCD 10.4 unified the console proxy onto port 443 alongside the API and UI. The legacy dedicated port 8443 and separate console proxy certificate are gone in 10.4.1+. If you upgraded from pre-10.4 and still have legacy console proxy settings configured, they will conflict with the unified implementation and console sessions will fail with a wmks.connecting loop or immediate disconnect. The load balancer must no longer have a separate pool for port 8443 - all traffic goes through 443.

Fix: Clear Legacy Console Proxy Settings

On each cell, run:

Remove legacy console proxy certificate config

/opt/vmware/vcloud-director/bin/cell-management-tool clear-console-proxy-settings

This removes user.consoleproxy.certificate.path and user.consoleproxy.key.path from global.properties. After running on all cells, update your load balancer to remove any separate 8443 pool or VIP. All console traffic now goes through the standard 443 pool alongside API and UI traffic.

VMRC and Web Console Fail After Upgrade - ESXi Certificate Not Trusted

VCD 10.4 added a security requirement: the VMCA (VMware Certificate Authority) root certificate from each registered vCenter must be trusted in VCD's trust store. ESXi hosts obtain their certificates from VMCA. If VMCA is not trusted by VCD, the console proxy cannot validate the ESXi host it is trying to proxy through - even though the vCenter connection itself works fine. The error appears in console-proxy.log as:

SSLHandshakeException: PKIX path building failed: unable to find valid certification path to requested target

This affects any upgrade path that bypassed the certificate trust step. VMRC console acquisition (MKS ticket) succeeds - the API can talk to vCenter - but the actual console session through the cell to ESXi fails.

Fix: Import Infrastructure Certificates via CMT

Run on the primary cell (or any cell - the command updates the shared database):

Import all vSphere, NSX, and VMCA certificates into VCD trust store

/opt/vmware/vcloud-director/bin/cell-management-tool trust-infra-certs --vsphere --unattended

This automatically retrieves and trusts certificates from all registered vCenter instances, NSX Managers, and VMCA. Run this after every vCenter certificate renewal, after adding a new ESXi host, and after any vCenter upgrade that regenerates VMCA-signed certificates. If a specific ESXi host is still failing, retrieve its certificate manually and import it via the VCD UI under Administration, then Trusted Certificates. Alternatively, use Renew from the vSphere Client on the ESXi host to force a fresh VMCA-signed certificate, then run trust-infra-certs again.

Instant VMRC Disconnects - NTP Drift or DNS Resolution Failure

VMRC connects and immediately drops. The MKS ticket is issued (task completes in VCD) but the console session dies within seconds. Two common causes: NTP drift between the client, VCD cell, and ESXi host (even 2+ seconds causes the ticket timestamp validation to fail), or the client cannot resolve the DNS name of the public console proxy address. VCD embeds the console proxy FQDN in the MKS ticket. If the client cannot resolve that name, the console opens and drops.

Fix: Verify NTP and DNS Resolution

Check NTP sync on all VCD cells and on the ESXi hosts. Run timedatectl status on each cell - "System clock synchronized: yes" is required. Then verify the public console proxy address configured in VCD System settings resolves from both inside and outside your network. The console proxy address must be the externally reachable name, not the internal cell IP.

ESXi Firewall Blocking Port 443 to ESXi Hosts Breaks Consoles

The VCD console proxy connects from the cell to the ESXi host on TCP 443 (for VMCA-authenticated connections in 10.4+). If a firewall between the VCD cell network and the ESXi management network blocks TCP 443 to individual ESXi hosts, all console sessions fail. This is invisible to tenants - the console proxy log shows the connection attempt and failure, but the VCD UI just shows the session failing to load. Check console-proxy.log for connection refused or timeout entries to specific ESXi IPs.

NTP - The Silent Infrastructure Killer

NTP Max Drift: 2 Seconds Across All Components

VCD requires all clocks to be synchronized within 2 seconds. This includes: all VCD cells, the PostgreSQL database server (embedded or external), all vCenter instances, all ESXi hosts, and the RabbitMQ server. Exceeding this threshold causes a cascade of failures that look unrelated to time at first glance. VMware recommends a minimum of 4 NTP servers configured on all components pointing to the same time source.

Symptom	Root Cause
SAML authentication failures / "token expired" errors	SAML assertions have short validity windows. Even 30 seconds of drift causes them to fail validation.
API token validation failures	VCD API tokens are time-bounded. Drift causes token rejection before the configured TTL.
PostgreSQL HA replication breaks	The embedded PostgreSQL replication heartbeat relies on time synchronization. Drift causes the standby to fall out of sync.
VMRC console instant disconnect	MKS ticket timestamp validation fails at ESXi host if clocks are out of sync.
Zerto VPG state inconsistency	Zerto journal timestamps drift relative to VCD event timestamps, causing VPG sync issues.
Veeam backup job failures with "timestamp" errors	Veeam's authentication to VCD uses time-sensitive tokens that fail when clocks are skewed.
SSL certificate "not yet valid" errors	A clock that is behind may see a certificate as not yet valid even though it is deployed correctly.

NTP verification across all VCD components

# On each VCD cell (Photon OS) timedatectl status # Check NTP peers and sync state chronyc tracking chronyc sources -v # Force immediate NTP sync if drift is detected chronyc makestep # Configure NTP servers on Photon OS (edit /etc/chrony.conf) # server 10.x.x.x iburst # server 10.x.x.y iburst # server 10.x.x.z iburst # server 10.x.x.w iburst # After editing: systemctl restart chronyd

VMware Tools Time Sync Overrides NTP

If VMware Tools time synchronization is enabled on the VCD cell VMs, it will periodically override the NTP-synchronized clock with the ESXi host's clock. If the ESXi host itself has NTP configured correctly this is fine. If the ESXi host's clock is drifting, VMware Tools will propagate that drift to the VCD cells. Verify ESXi host NTP is configured and syncing. The recommended configuration is: NTP on ESXi hosts pointing to your authoritative time source, VMware Tools time sync disabled on VCD cell VMs, and NTP (chrony) configured directly on the Photon OS cells pointing to the same authoritative time source.

Certificate Trust and Chain Issues

vCenter SSL Certificate Renewal Silently Breaks VCD Connection

When the vCenter SSL certificate is renewed or replaced - which happens automatically if VMCA manages the certificate lifecycle - VCD continues to work initially because it has the old certificate's thumbprint cached. At some point after the renewal, VCD's connection to vCenter fails and resources that depend on it (VM discovery, task processing, storage policy sync) stop working. There is no alerting in VCD for this condition - you discover it when tenant VMs stop reporting power state changes or when storage policies show stale data.

Fix: Re-trust vCenter Certificates After Any Renewal

Immediately after any vCenter certificate renewal (manual or VMCA-automated):

Re-import all infrastructure certificates

/opt/vmware/vcloud-director/bin/cell-management-tool trust-infra-certs --vsphere --unattended

Alternatively, from the VCD UI: navigate to Resources, then Infrastructure Resources, then vCenter Server Instances. Select the vCenter, click Edit, click Save without changing anything. VCD will prompt you to trust the new certificate. Accept it. Build this step into your vCenter certificate renewal runbook - it is frequently omitted and always causes a support call.

Incomplete Certificate Chain - Intermediate Not Bundled

VCD requires the full chain including intermediate certificates to be present in the certificate file you import. Importing only the leaf certificate - the most common mistake - causes TLS handshake failures with clients that do strict chain validation. Veeam, Zerto, and modern browsers all perform chain validation. The error is typically "unable to find valid certification path" on the client side, with no obvious indication the problem is in the certificate file itself rather than a trust store issue.

Fix: Always Bundle Leaf + Intermediates in PEM Order

When importing a CA-signed certificate into VCD's Certificate Library, the PEM file must contain the full chain in order: leaf certificate first, then each intermediate, then optionally the root. Verify the chain is complete before importing:

Verify certificate chain completeness

# Verify the chain can be built to a trusted root openssl verify -CAfile /path/to/ca-bundle.pem /path/to/leaf.pem # View the chain in a PEM bundle openssl crl2pkcs7 -nocrl -certfile /path/to/bundle.pem | openssl pkcs7 -print_certs -noout | grep "subject\|issuer" # The leaf's issuer must match the next certificate's subject # Continue until you reach a self-signed cert (subject = issuer)

CN/SAN Mismatch - Public Address Not Matching Certificate

VCD's Public Addresses (configured in System, then Administration, then Public Addresses) must match the certificate's CN or SAN exactly. If the certificate is issued for vcd.example.com but the Public Address is configured as vcd-api.example.com, clients performing strict certificate validation will reject the connection. This is common after migrations or rebranding where the DNS name changes but the certificate or the VCD public address setting is not updated to match.

Fix: Check SAN Entries and Public Address Alignment

Verify the certificate SAN entries

openssl x509 -in /path/to/cert.pem -noout -text | grep -A 5 "Subject Alternative Name"

Confirm every hostname listed in VCD Public Addresses (including the API endpoint, the console proxy address, and any tenant-facing URLs) appears in the certificate SAN. If adding additional SANs requires a new certificate, renew via the Certificate Library before updating the Public Addresses setting.

Veeam Fails After VCD Certificate Renewal - Cell Certificate Mismatch

Veeam VBR connects to VCD through the load balancer public address but also validates the individual cell certificates when it discovers cells during configuration sync. After a VCD certificate renewal, Veeam may fail with "Failed to validate the VMware Cloud Director cell certificate. This may be caused by the presence of a load balancer." This happens because Veeam tries to connect directly to each cell's IP to validate the certificate there, but the cell certificate does not match what Veeam expects (usually because the certificate was updated on the VCD side but Veeam's cached thumbprint is stale).

Fix: Re-register VCD in Veeam After Certificate Changes

Open the VCD server properties in the Veeam Backup console. Step through the wizard to the credentials page. Veeam will prompt you to accept the new certificate. Accept it and complete the wizard. If Veeam does not prompt (the bug confirmed in 10.6.1 environments) and skipping verification breaks backups, open a Veeam support case - the fix requires updating the certificate thumbprint directly in the Veeam SQL database, a procedure Veeam support will provide. Preventively: plan certificate renewals on VCD and notify Veeam admins so they can re-register before backups fail.

vCenter and Storage Integration

Storage Policy Sync Lag - New Datastores Not Visible to Tenants

After adding a new datastore or modifying storage profiles in vCenter, the changes do not immediately appear in VCD. VCD synchronizes storage policy information from vCenter on a scheduled cycle, not in real time. Tenants attempting to deploy VMs on the new storage see the old policy list and the new policy is absent.

Fix: Trigger a Manual Refresh

In the VCD Service Provider Admin Portal, navigate to Resources, then Infrastructure Resources, then vCenter Server Instances. Select the vCenter, then click Refresh. This triggers an immediate synchronization of storage policies, datastores, and resource pools from that vCenter. The refresh typically completes within 2-3 minutes depending on the size of the vSphere inventory. Add this step to your runbook for any vCenter storage infrastructure change.

Orphaned Shadow VMs After VCD vApp Delete

When deleting a vApp through VCD, the delete task in vCenter occasionally does not complete cleanly - particularly when the underlying vCenter is under load or when NSX network cleanup takes longer than expected. The vApp disappears from VCD but one or more VMs remain in vCenter in a powered-off state, still consuming datastore space. These orphans accumulate silently. At MSP scale, this can result in significant hidden storage consumption across multiple tenants.

Fix: Periodic vCenter Orphan Audit

Periodically audit vCenter for VMs that have a cloud.uuid attribute set but no corresponding VCD record. The presence of cloud.uuid indicates VCD ownership. If VCD has no record of that VM, it is a shadow orphan.

PowerCLI - find VMs with cloud.uuid that VCD no longer tracks

Get-VM | Where-Object { (Get-AdvancedSetting -Entity $_ -Name "cloud.uuid" ` -ErrorAction SilentlyContinue).Value -ne $null } | Select-Object Name, PowerState, @{N="cloud.uuid";E={(Get-AdvancedSetting -Entity $_ -Name "cloud.uuid").Value}}, @{N="Datastore";E={($_ | Get-Datastore).Name}} | Format-Table

Cross-reference the resulting cloud.uuid values against your VCD database. Any cloud.uuid that does not appear in the VCD vm table is an orphan. Delete from vCenter after confirming no VCD record exists and after snapshot/backup if data recovery is needed.

Stuck Edge Gateway in "Updating" State

Edge gateway configuration changes hang in "Updating" indefinitely. The VCD task never completes and no error is surfaced to the operator. Root causes: NSX API timeout during the T1 gateway reconfiguration, NSX edge cluster resource exhaustion (all edge nodes busy), or a lock on the NSX object that was not released after a previous failed operation.

Fix: Check NSX Edge Cluster Health and Cancel Stuck Tasks

In NSX Manager, check the edge cluster and transport nodes for pending or failed configuration tasks. Look for stuck jobs in NSX's task audit. If a prior VCD-initiated task is still pending in NSX, canceling it in NSX allows VCD to retry. On the VCD side, if the task is stuck for more than 30 minutes, cancel it via the VCD API:

Cancel a stuck task via VCD API

POST https://vcd.example.com/api/task/{taskId}/action/cancel Authorization: Bearer {token}

After canceling, verify the NSX T1 gateway state in NSX Manager before retrying the VCD operation.

DFW Rules Not Applying to Tenant VMs

Distributed Firewall rules configured at the VDC Group level are not enforced on tenant VMs. The VMs are in Org VDCs that have not been added to the VDC Group. DFW enforcement scope is the VDC Group - VMs in Org VDCs outside the group are not covered by the DFW rules configured within it.

Fix: Add All Relevant Org VDCs to the VDC Group

In the VCD provider portal, navigate to Networking, then Data Center Groups. Edit the group and add all Org VDCs whose VMs should be subject to the DFW rules. Verify each Org VDC's edge gateway is also associated with the DC Group to ensure routing consistency. DFW rule application takes effect immediately after the VDC is added to the group - no VM reconfiguration required.

Veeam Integration - Additional Specifics

Restored VMs Are Invisible to VCD (Restored Directly to vCenter)

An operator restores a Veeam backup directly to vCenter (not through the VCD Self-Service Backup Portal). The VM powers on in vCenter. VCD has no idea it exists because the restore bypassed VCD's API and no cloud.uuid was written. The VM runs without VCD management, consuming resources that VCD cannot account for. If the restored VM's IP conflicts with a VCD-managed IP pool allocation, tenants get network conflicts.

Fix: Restore Through VCD or Import Post-Restore

Always restore through the Veeam Self-Service Backup Portal or via the VCD API restore workflow. This ensures VCD's database is updated and the cloud.uuid is written correctly. If a restore was already done directly to vCenter, use the VCD auto-discovery mechanism (Section 16, Scenario 4) to import the VM into the correct Org VDC. Ensure the VM is connected to an Org VDC network before the discovery cycle runs.

SureBackup / Virtual Lab Network Conflicts with NSX-T vApps

Running SureBackup or Veeam Virtual Lab recovery verification against vApps with NSX-T routed networks can cause NIC connectivity failures in the isolated lab environment. Veeam preserves MAC addresses from the original VMs. When the lab network isolation kicks in to prevent production network conflicts, VMs may come up with disconnected vNICs if the NSX-T network isolation is not properly handled by the Veeam network extension appliance.

Fix: Use Isolated Org VDC Networks for Virtual Lab

Configure SureBackup Virtual Lab to use isolated (non-routed) Org VDC networks as the lab network. Configure the masquerade IP for the isolated network to avoid production subnet conflicts. For NSX-T backed vApps, verify the Veeam network extension appliance is deployed in the correct Org VDC and that its NIC count does not exceed the 9-network limit (Section 9). Do not use routed Org VDC networks directly in the Virtual Lab - isolation cannot be guaranteed and you risk production network pollution.

CSE Kubernetes

CSE Kubernetes Cluster Stuck - Ephemeral VM Cannot Reach Internet

When a tenant requests a Kubernetes cluster through CSE, VCD spins up a temporary ephemeral VM in the tenant's Org VDC to orchestrate the cluster build. This VM needs internet access to pull Kubernetes package repos and container images. If the Org VDC is missing a SNAT rule allowing the ephemeral VM's subnet to reach the internet, or if the Edge Gateway firewall blocks outbound HTTP/HTTPS, the ephemeral VM hangs indefinitely pulling packages and the cluster deployment never completes. The task shows running but makes no progress.

Fix: Verify SNAT and Firewall Before Enabling CSE for a Tenant

Before enabling CSE cluster creation for an Org VDC, verify: an SNAT rule exists on the Edge Gateway allowing the Org VDC routed network subnet to NAT to a public IP, the Edge Gateway firewall allows outbound TCP 80 and 443 from that subnet, and DNS resolution works from the Org VDC network (the ephemeral VM needs to resolve package repo hostnames). Test by deploying a temporary VM in the same Org VDC and confirming internet access before handing off CSE access to the tenant.

CSE UI Plugin API Version Mismatch After VCD Upgrade

After upgrading VCD, the CSE UI plugin may call deprecated API versions or endpoints that were removed. The cluster status shows as Error in the tenant portal even though the underlying Kubernetes nodes are running correctly. This is a plugin version mismatch - the CSE plugin was not updated alongside VCD.

Fix: Update CSE and the UI Plugin After Every VCD Upgrade

CSE has its own compatibility matrix against VCD versions. After every VCD upgrade, verify CSE compatibility and update both the CSE server component and the UI plugin if a new version is available. The CSE UI plugin update is performed through the VCD provider portal under Customization, then UI Plugins. Tenants will not see the cluster error resolve until the plugin is updated and the portal cache is cleared.

Terraform and Infrastructure as Code

Terraform State Drift When UI Changes Are Made

Terraform manages VCD objects (Org VDCs, Edge Gateways, networks, catalogs) through the vCD Terraform provider. If a tenant or operator makes changes directly in the VCD UI - adding a network, resizing a VM, adding a firewall rule - the next terraform plan will detect drift. Depending on how the Terraform configuration is written, terraform apply may attempt to destroy and recreate the drifted resources to bring them back into conformity with the state file. At MSP scale, this can destroy tenant networking configurations.

Fix: Enforce API-Only Changes and Use terraform import for Exceptions

Establish a strict policy: all VCD infrastructure changes go through Terraform. The VCD UI is read-only for IaC-managed environments. For legitimate exceptions (emergency fix, tenant-requested change), use terraform import to bring the manual change into the state file before the next apply:

Import a manually created VCD resource into Terraform state

# Example: import a manually created Org VDC network terraform import vcd_network_routed_v2.my_net my-org.my-vdc.my-network-name

Use terraform plan output to identify drift before applying. Never run terraform apply against a production VCD environment without reviewing the plan output for unexpected destroy operations.

Deprecated API Versions Breaking Terraform Provider After VCD Upgrade

VCD 10.x deprecates and removes older API versions with each major release. The Terraform vCD provider pins to specific API versions. If the provider version is not updated alongside VCD, API calls fail with "API version X is not supported" or "deprecated API endpoint" errors, breaking all Terraform operations against that VCD instance. This affects automated pipelines immediately on the day of VCD upgrade if provider pinning is not part of the change management plan.

Fix: Test Provider Compatibility Before VCD Upgrade

Check the vmware/vcd Terraform provider changelog against the VCD version you are upgrading to. Update the provider version in your Terraform configuration and run terraform plan in a test environment against the new VCD version before upgrading production. Pin provider versions explicitly in your configuration:

terraform - pin vCD provider version

terraform { required_providers { vcd = { source = "vmware/vcd" version = "~> 3.14" # pin to tested version } } }

The Terraform vCD provider and VCD compatibility is documented in the provider's README. Match provider version to VCD version before every upgrade.

Stuck Media Uploads

ISO/OVA Uploads Stuck at 1% or "Pending" Indefinitely

Media and OVA uploads through the VCD tenant portal hang at 1% or in Pending state. The upload task never progresses. Causes in order of frequency: asymmetric routing between the client and the VCD cell handling the upload, load balancer session timeout set too low (many LBs default to 60 seconds; large file uploads take much longer), or the transfer filesystem at /opt/vmware/vcloud-director/data/transfer is full.

Fix: Check in This Order

First, check transfer filesystem space: df -h /opt/vmware/vcloud-director/data/transfer. If it is more than 80% full, clear old files and cancel orphaned transfers. Second, check the load balancer TCP session timeout. For large file uploads it should be set to at least 600 seconds (10 minutes). Third, check for asymmetric routing: the upload HTTP PUT must follow the same path in both directions. If the response takes a different path back through the network, the session will time out. Fourth, check for NAT timeout issues - stateful firewalls with short UDP/TCP timeouts may terminate long-lived upload connections. The transfer protocol uses chunked HTTP PUT - the connection is held open for the duration of the upload.

Check for orphaned transfer files (consuming transfer space)

find /opt/vmware/vcloud-director/data/transfer -name "*.vdisk" -mtime +1 -ls # Cancel any transfers that have been stuck for more than 24 hours # via the VCD API GET https://vcd.example.com/api/query?type=media&filter=status==UNRESOLVED Authorization: Bearer {token}

⚡

18. Deep-Cut Platform Issues and Advanced Gotchas

These are the issues that experienced VCD engineers hit in production but rarely find documented in one place. Some are architectural constraints you work around, some are bugs with specific remediation steps, and some are operational patterns that only emerge at MSP scale.

Veeam + VCD Version Lockstep - The Upgrade Trap

VCD and Veeam Must Be Upgraded in Lockstep

Veeam VBR has a hard version dependency on VCD. The VCD API changes significantly between major VCD releases. Upgrading VCD before upgrading Veeam breaks the integration completely - Veeam loses visibility into vApps and Organization VDCs and all VCD-integrated backups fail.

VCD Version	Minimum Veeam VBR Required	Notes
VCD 10.5.x	VBR 12.1+	VBR 12.1.2 was the first version with VCD 10.5 support
VCD 10.6.x	VBR 12.2+ (build 12.2.0.334, August 2024)	VBR 12.1.2 does not support VCD 10.6. VBR 12.2 added the API support for VCD 10.6 methodologies.

Upgrade Sequence: Veeam First, Then VCD

Always upgrade Veeam VBR to the required minimum version before upgrading VCD. The sequence is: upgrade VBR on the SP backup server, verify VCD integration is healthy, then upgrade VCD. After upgrading VCD, delete the old Veeam plugin from VCD and upload the new plugin version from the VBR ISO. The plugin cannot be upgraded in place - it must be deleted and re-uploaded. Navigate to Customization, then UI Plugins in the VCD provider portal to manage plugins.

Verify current plugin version in VCD

GET https://vcd.example.com/cloudapi/extensions/ui Authorization: Bearer {token} Accept: application/json;version=36.3

Always check KB4488 before any VCD upgrade to confirm the current Veeam compatibility matrix.

Veeam CDP I/O Filter and ESXi Certificate Sync

CDP Storage Shows as "Incompatible" - Silent ESXi Certificate Failure

Veeam CDP relies on vSphere I/O filters installed on ESXi hosts. When tenants try to configure CDP replication jobs through VCD, Veeam refuses to present any compatible destination datastore - not because of a storage configuration problem, but because the ESXi I/O filter certificate synchronization between ESXi and vCenter has silently failed. The veecdp storage provider drops offline. From the tenant's perspective, all datastores show as incompatible with no meaningful error explaining why.

Fix: Renew ESXi Certificates and Re-sync the I/O Filter Provider

In vCenter, navigate to each affected ESXi host. Right-click the host, select Certificates, then Renew Certificate. This forces vCenter to re-validate the ESXi certificate chain and re-register the I/O filter storage provider. After renewing certificates on affected hosts, re-run the VCD infrastructure certificate import:

Re-sync infrastructure certificates after ESXi cert renewal

/opt/vmware/vcloud-director/bin/cell-management-tool trust-infra-certs --vsphere --unattended

Then verify CDP datastores appear as compatible in the Veeam replication job wizard. If the problem persists across multiple hosts, check the vCenter vpxd.log for I/O filter registration failures - a pattern of registration timeouts indicates a deeper ESXi-to-vCenter certificate trust issue that may require re-registering the host with vCenter.

Veeam Self-Service Portal - Tenant API Authentication Gap

Tenant Automation Cannot Use VCD Credentials for Veeam Enterprise Manager API

The Veeam Self-Service Backup Portal integrates seamlessly into the VCD HTML5 UI through SSO for interactive users. However, tenants who want to automate backup management through Infrastructure as Code or custom scripts cannot use their VCD credentials to authenticate against the Veeam Enterprise Manager API. The Enterprise Manager API requires separate Veeam-native credentials. This breaks the multi-tenant SSO illusion and forces tenants to manage two credential sets - VCD credentials for everything else, Veeam credentials for backup automation.

Workaround: Veeam REST API + VCD OAuth Token Bridge

The cleanest workaround is a thin automation bridge: a service account that authenticates to both VCD (using the tenant's org admin credentials) and Veeam Enterprise Manager (using tenant-specific Veeam credentials), and exposes a unified API surface to the tenant. This is architecture work rather than a product fix. Veeam is aware of the gap - check release notes for native VCD OAuth support in future VBR releases. For now, document the separate credentials requirement clearly during tenant onboarding and build it into your IaC templates as a separate credential input.

NSX-T Geneve MTU - The Silent Drop Problem

Tenant Applications Timing Out - Geneve Overlay MTU Mismatch

NSX-T uses the Geneve encapsulation protocol for overlay networking. Geneve adds approximately 50-100 bytes of header overhead to every packet. If the physical network path carries tenant traffic at a standard 1500-byte MTU, any overlay packet that would be larger than 1400-1450 bytes after decapsulation gets silently dropped at the physical switch - not at any point in VCD or NSX, and not with any error logged in either system. The overlay tunnels show healthy. The NSX UI shows green. VMs appear up. But specific applications that generate larger packets - database queries returning large result sets, file transfers, some backup protocols - silently fail or experience extreme retransmission.

Official NSX MTU Requirements

Per Broadcom documentation: minimum required physical MTU for NSX-T Geneve overlay is 1600 bytes. Recommended MTU is 1700 bytes to accommodate current and future Geneve header expansion. The Global Gateway Interface MTU must be at least 200 bytes lower than the TEP MTU. VMs can stay at 1500 bytes as long as the physical fabric is at 1700+.

Fix: Verify MTU End-to-End Across Every Hop

MTU must match at every point in the physical path: physical ToR switches, vSphere Distributed Switch, ESXi vmkernel TEP interfaces, and NSX Edge transport node uplink profiles. A single 1500-byte hop anywhere in the path causes the silent drop pattern.

MTU verification sequence

# 1. Check NSX Global Fabric MTU (should be 1600 minimum, 1700 recommended) # NSX Manager: System > Fabric > Settings > Global Fabric Settings > Tunnel Endpoint MTU # 2. Check vDS MTU on each ESXi host esxcfg-nics -l # Look for MTU column - must be >= 1600 on overlay-carrying NICs # 3. Check TEP (vmkernel) interface MTU esxcfg-vmknic -l | grep -i vxlan # MTU column should show 1600+ # 4. Test with DF bit set - this confirms actual path MTU # Run from one ESXi host's TEP interface to another vmkping -I vmk10 -S vxlan -s 1600 -d # If -s 1600 fails but -s 1400 succeeds, you have a 1500-byte hop somewhere # Find it by testing intermediate hops # 5. Check NSX Edge transport node MTU # NSX Manager: System > Fabric > Nodes > Edge Transport Nodes # Verify uplink profile MTU on each edge node

When you find the 1500-byte hop, increase the MTU on that specific device. Physical switches require MTU changes on every port in the path including inter-switch links. Some D-Link and older switch models require 1700 bytes at the physical level to reliably deliver 1600-byte frames. After changing, rerun the vmkping validation before declaring the issue resolved.

The Elastic pVDC Slingshot Bug (10.4.x)

VM Migrates to Wrong vSphere Cluster After Hardware Edit (10.4.x)

In VCD 10.4.x with Elastic Provider VDCs spanning multiple underlying vSphere clusters, editing a VM's hardware configuration - adding a vCPU or changing memory - could trigger VCD to dynamically migrate the VM to a different vSphere cluster before powering it on. This "slingshot" behavior violated DRS anti-affinity rules, broke storage locality assumptions, and caused VMs to land on clusters with different storage policies than expected. This is resolved in 10.5.x and later.

Fix: Upgrade Past 10.4.x

This bug is resolved in VCD 10.5.x and does not affect 10.6.x. If you are on 10.4.x and cannot upgrade immediately, workaround is to avoid hardware edits on running VMs in Elastic pVDC configurations. Power off before any hardware changes, make the change, then power on. The VM placement decision happens at power-on and the slingshot occurs during the power-on sequence following a reconfigure - not during the reconfigure itself.

Appliance eth1 IP Change - Why You Cannot Just Change It

Changing eth1 IP Breaks Database HA and Cell Replication

The VCD appliance uses eth0 for the primary network (API, UI, tenant traffic) and eth1 for the internal database replication network between cells. The eth1 IP is burned into multiple places during appliance deployment: the OVF environment parameters, PostgreSQL pg_hba.conf, the repmgr replication manager configuration, and the cell management tool's cluster configuration. There is no supported in-place procedure to change the eth1 IP after deployment. Attempting to change it only in the OS creates a split-brain where VCD cells can no longer replicate and the database HA breaks.

Fix: Plan eth1 Addressing Before Deployment - Recovery Requires Broadcom Support

The only supported path if eth1 must change is to redeploy the appliance from a database backup. Get the eth1 IP range right before you deploy. If you are already in the situation where eth1 needs to change, open a Broadcom support case before touching anything. Support may have an unsupported but field-tested procedure that avoids a full redeploy. The unsupported path (do not attempt without support guidance) involves: stop all cells, stop vpostgres, update pg_hba.conf and postgresql.conf on all nodes to reference the new eth1 IPs, update the repmgr configuration, update OVF environment parameters, restart vpostgres, restart cells sequentially. One misconfigured step leaves the database in a state that requires a restore.

Deployment Checklist Addition

Add eth0 and eth1 IP addresses to your pre-deployment planning documentation alongside the network pool and NSX Projects decisions. These are permanent architecture choices. See Section 13 for the full checklist.

The Quartz Scheduler / Inventory "Whack" - Forcing a Full vCenter Resync

VCD Stuck in "Busy" State / Storage Policies Not Appearing

Sometimes VCD gets completely out of sync with vCenter. vCenter shows as permanently busy in the VCD provider UI. New storage policies created in vCenter never appear in VCD regardless of how many times you trigger a refresh. Tasks queue up and never process. The root cause: VCD's internal Quartz scheduler and vCenter inventory cache tables get jammed - typically after an unclean shutdown, a database failover, or a high-volume period where the scheduler fell behind and got stuck in a retry loop it cannot escape.

Fix: Clear the Quartz Scheduler and Inventory Tables

This is a controlled database cleanup procedure. Stop all cells first, take a backup, then clear the specific tables that hold the jammed scheduler state and the stale inventory cache. VCD rebuilds these from vCenter on restart.

Stop All Cells Before Running This

Running these truncations while cells are active will corrupt the running state. All cells must be stopped and you must have a verified backup before proceeding.

Clear jammed Quartz scheduler and inventory tables

-- Stop all cells first. Take a backup first. sudo -i -u postgres psql vcloud -- Clear Quartz scheduler job state (jammed jobs blocking the queue) DELETE FROM qrtz_fired_triggers; DELETE FROM qrtz_simple_triggers; DELETE FROM qrtz_simprop_triggers; DELETE FROM qrtz_cron_triggers; DELETE FROM qrtz_blob_triggers; DELETE FROM qrtz_triggers; DELETE FROM qrtz_job_details; DELETE FROM qrtz_calendars; DELETE FROM qrtz_paused_trigger_grps; DELETE FROM qrtz_scheduler_state; DELETE FROM qrtz_locks; -- Clear vCenter inventory cache tables (forces full resync on restart) TRUNCATE TABLE vm_inv CASCADE; TRUNCATE TABLE dv_portgroup_inv CASCADE; TRUNCATE TABLE property_map; TRUNCATE TABLE property_map_status; \q

After these truncations, start the primary cell and monitor cell.log and vclistener.log. VCD will perform a full vCenter inventory sweep on startup. On large environments with many VMs and storage policies, this can take 10-20 minutes. Storage policies will appear in VCD once the sweep completes. Do not start secondary cells until the primary is fully up and showing 100% in cell.log.

IP Spaces Migration - Terraform and Legacy API Breakage

Legacy External Networks and IP Blocks Break After IP Spaces Migration

IP Spaces became the supported model for external IP management starting in VCD 10.4. If you are migrating a legacy VCD environment to IP Spaces, any automation built against the legacy External Networks and IP blocks API breaks after migration. Terraform vCD provider configurations using vcd_external_network or older IP block resources fail. Any automation that requests IPs from deprecated network pool endpoints gets 404 or unsupported operation errors. The migration is not reversible - once External Networks are migrated to IP Spaces, the legacy objects are removed.

Fix: Migrate Terraform State Before Migrating IP Objects

Before migrating any External Networks to IP Spaces in VCD:

Update the Terraform vCD provider to a version that supports IP Spaces (3.9.0+).
Replace all vcd_external_network resources with vcd_external_network_v2 in your Terraform code.
Use terraform import to bring existing external network objects into the new resource type's state.
Replace vcd_network_direct references with vcd_nsxt_network_imported where applicable.
Only after Terraform state is migrated to the new resource types, proceed with migrating the actual VCD objects to IP Spaces.

Run terraform plan after each step before applying. The plan output will show you any remaining references to legacy resource types. Address all of them before the IP Spaces migration in VCD - there is no rollback.

Zerto Custom Guest Properties Sync Failure on Failback

VCD Custom Guest Properties Not Replicated by Zerto - Failback Customization Fails

VCD allows tenants to set Custom Guest Properties on VMs - metadata key-value pairs used for post-customization scripts, IP injection, application-level configuration. Older Zerto versions do not replicate this metadata as part of the VPG replication. During failover, VMs power on at the recovery site but without the Custom Guest Properties. For VMs that rely on these properties for guest OS customization or application bootstrap, the VM comes up in an unconfigured state and requires manual intervention. The problem is worst during reverse protection and failback - the VMs that have been running at the recovery site may have accumulated property changes that never get replicated back to the source.

Fix: Post-Failover Script to Restore Custom Guest Properties

Build a post-failover script that reads the required Custom Guest Properties from a configuration repository (your CMDB, a dedicated key-value store, or a VCD catalog item that stores the metadata) and re-applies them to the recovered VMs via the VCD API.

Apply Custom Guest Properties via VCD API post-failover

PUT https://vcd.example.com/api/vApp/vm-{vmId}/productSections/ Content-Type: application/vnd.vmware.vcloud.productSections+xml;version=36.3 Authorization: Bearer {token} <?xml version="1.0" encoding="UTF-8"?> <ProductSectionList xmlns="http://www.vmware.com/vcloud/v1.5"> <ovf:ProductSection xmlns:ovf="http://schemas.dmtf.org/ovf/envelope/1"> <ovf:Property ovf:key="your.custom.key" ovf:value="your-value"/> </ovf:ProductSection> </ProductSectionList>

Verify your Zerto version's release notes for Custom Guest Properties replication support - newer Zerto versions have improved VCD metadata handling. Run a test failover that validates Custom Guest Properties are intact before any production failover. If they are absent after test failover, your post-failover script must be part of your runbook.

Kubernetes (CSE) Backup - The Full Picture

Standard Veeam VBR VM Backup Does Not Protect Kubernetes Workloads

Backing up worker node VMs through standard VBR jobs does not constitute a valid backup of Kubernetes workloads. VBR captures disk state, but it does not capture the etcd state (the cluster's configuration and secrets database), PersistentVolumeClaims in a multi-tenant consistent manner, or the namespace/label-level metadata required for a proper Kubernetes restore. Restoring a worker node VM from a VBR backup restores the node but not necessarily the workloads running on it in a consistent state.

The Correct Architecture: Velero + VCD Object Storage Extension

Protecting Kubernetes workloads deployed through VCD CSE requires a separate protection stack:

Component	Role
Velero	Runs inside the Kubernetes cluster. Captures etcd state, PVCs, namespace resources, and label-level objects. This is the Kubernetes-native backup tool.
VCD Object Storage Extension (OSE)	Bridges Velero backup output to S3-compatible object storage accessible through the VCD multi-tenant framework. Provides the storage target for Velero backups at the MSP level.
S3-Compatible Object Storage	The actual destination for Velero backups. Can be on-premises (MinIO) or cloud (AWS S3, Azure Blob via compatibility layer).

Velero is deployed into each Kubernetes cluster as a DaemonSet. It integrates with VCD OSE to drop backup data into tenant-scoped object storage buckets. Recovery of a Kubernetes cluster means: first restoring the VBR backup of the worker node VMs (to get the compute infrastructure back), then using Velero restore to bring back the workload state into the recovered cluster. These are two separate restore operations that must happen in sequence. Document this two-phase restore procedure and test it - it is not intuitive and it is different from any other recovery workflow in the VCD environment.

🗺️

19. Visual References: Compatibility Matrix, K8s Backup Architecture, and CDP Trace

Three topics from the preceding sections warrant visual treatment. The compatibility matrix is a reference you will check before every upgrade. The Kubernetes backup architecture shows why the tooling shift is necessary. The CDP trace-through is the exact diagnostic path from tenant complaint to root cause.

VCD / Veeam VBR / vCenter Upgrade Lockstep Matrix

Every cell is a verified compatibility state sourced from Veeam KB4488 and KB2443. Red cells are confirmed breakage points - upgrading in that combination breaks the integration immediately. Upgrade VBR before VCD. Upgrade vCenter within the validated range for your VBR version. Check the KBs before every upgrade.

Fully Supported

Partial / Known Issues

BREAKS - Do Not Upgrade

Works With Caveats

Part A - Veeam VBR minimum version required for each VCD version

VCD Version	VBR 12.0 Jan 2023	VBR 12.1 Nov 2023	VBR 12.1.2 Aug 2024	VBR 12.2 Aug 2024	VBR 12.3 Dec 2024	VBR 13 Feb 2025	Minimum VBR
VCD 10.4.x	OK	OK	OK	OK	OK	OK	VBR 12.0
VCD 10.5.x	BREAKS	OK	OK	OK	OK	OK	VBR 12.1+
VCD 10.6.0 - 10.6.0.1	BREAKS	BREAKS	LIMITED	OK	OK	OK	VBR 12.2+
VCD 10.6.1 - 10.6.1.2	BREAKS	BREAKS	BREAKS	OK	OK	OK	VBR 12.2.0.334+

Source: Veeam KB4488. VBR 12.1.2 on VCD 10.6.0 has partial compatibility with known issues - not recommended for production. VBR 13.0.1 (November 2025, current recommended) provides full Cloud Connect and VCD support. VBR 12.3.2 (build 12.3.2.4465, March 2026) is the current v12 build for environments not yet on v13. After upgrading VCD, delete the Veeam plugin from VCD and re-upload the new version from the VBR ISO - it cannot be upgraded in place.

Part B - Veeam VBR minimum version required for each vCenter version

vCenter Version	VBR 12.0	VBR 12.1	VBR 12.1.2	VBR 12.2	VBR 12.3	VBR 13	Minimum VBR
vCenter 7.0 / U1 / U2 / U3	OK	OK	OK	OK	OK	OK	VBR 12.0
vCenter 8.0 / 8.0 U1	OK	OK	OK	OK	OK	OK	VBR 12.0
vCenter 8.0 U2	BREAKS	OK	OK	OK	OK	OK	VBR 12.1+
vCenter 8.0 U3	BREAKS	BREAKS	vCLS issue	OK	OK	OK	VBR 12.2+

Source: Veeam KB2443. VBR 12.1.2 on vCenter 8.0 U3 had a known vCLS VM visibility bug resolved in VBR 12.2. Minor patch releases (a/b/c) within the same Update are generally supported without a VBR upgrade.

Part C - VCD vs vCenter (Broadcom interoperability) - Key breakage points

VCD Version	vCenter 7.0 U2-U3	vCenter 8.0 U1	vCenter 8.0 U2c	vCenter 8.0 U3	CSE / TKG on U3
VCD 10.5.x	OK	OK	OK	LIMITED	OK on U2c
VCD 10.6.x	OK	OK	OK	OK	BROKEN on U3

VCD 10.6 requires NSX 4.1.x or 4.2.x. TKG/TKGS cluster management through VCD Container Service Extension is broken on vSphere 8.0 U3. Stay on vCenter 8.0 U2c if running CSE-managed Kubernetes. This is an architectural incompatibility - not a configuration issue.

REQUIRED UPGRADE SEQUENCE

1. Upgrade Veeam VBR to required minimum → verify VCD integration healthy
2. Upgrade VCD appliance → delete old plugin, re-upload new plugin from VBR ISO
3. Upgrade vCenter (if required) → run trust-infra-certs after any cert changes

The Kubernetes Backup Disconnect - Traditional VBR vs. the Reality

This is the architectural mismatch that catches MSPs off guard. Standard VBR job protects compute. It does not protect Kubernetes control plane state. Here is exactly what each approach captures and misses, and the two-phase recovery that results.

Standard VBR VM Backup of Worker Nodes

What it captures:

✓ OS disk and installed binaries

✓ Container runtime (containerd / docker)

✓ kubelet configuration files

✓ Local EmptyDir storage

Does NOT capture:

✗ etcd database (cluster state, Secrets)

✗ PersistentVolumeClaims and their data

✗ Namespace resource definitions

✗ ConfigMaps and application config

✗ Helm release state

✗ Multi-tenant namespace isolation

Restoring a worker node VM from VBR backup restores the OS. It does not restore workloads. The node rejoins the cluster as a blank worker - applications do not come back without Velero.

Velero + VCD Object Storage Extension

What this stack captures:

✓ etcd snapshot (full cluster state)

✓ All namespace resources (Deployments, Services)

✓ PersistentVolumeClaims and their data

✓ Secrets and ConfigMaps

✓ Helm release state

✓ Namespace-scoped backup isolation

✓ Label and annotation selectors

✓ Tenant-scoped S3 isolation via OSE

Velero runs as a DaemonSet inside the cluster. VCD OSE routes backup output to tenant-scoped S3 buckets. Recovery is at the application level, not the node level.

Two-Phase Recovery Sequence for CSE Kubernetes Clusters

Phase 1 - Infrastructure Recovery

Restore worker node VMs from VBR backup. Brings the compute layer back - OS, container runtime, node binaries. Cluster has nodes but no workloads running.

→

Phase 2 - Workload Recovery

Run velero restore create --from-backup <name> against the recovered cluster. Restores application state, PVCs, and namespace config from the Velero backup stored in VCD OSE.

Both phases are required and sequential. Phase 1 without Phase 2 gives you empty nodes. Phase 2 without Phase 1 has nowhere to restore into. Test this sequence - it is counterintuitive for engineers familiar with standard VM recovery workflows.

The CDP Storage Ghost - Traced From Tenant Complaint to Root Cause

The symptom is obvious. The cause is three layers deep and invisible at each layer above it. Engineers who do not know this path waste hours checking the wrong things.

Tenant Complaint

Tenant opens a CDP replication job in the Veeam Self-Service Portal. At the storage selection step, no destination datastores are presented. Dropdown is empty or shows "No compatible storage available." The tenant changed nothing. It worked before. Everything appears healthy in VCD and Veeam.

First Wrong Assumption

You check VCD storage policy allocations - correct. You check the datastore in vCenter - online and healthy. You check Veeam - no alerts. You check the CDP proxy - running. You check the Veeam CDP Replication storage policy in vSphere SPBM - it exists. Nothing is obviously wrong at any of these layers.

Look Here: vCenter Storage Providers

Navigate in vSphere Client to the vCenter level. Go to Configure, then Storage Providers. This is the SPBM provider registry. Look for the Veeam entry. If it shows Offline or Disconnected, you found the layer-2 problem - the SPBM layer cannot register compatible hosts, so Veeam sees zero compatible datastores.

vSphere Client: vCenter > Configure > Storage Providers Expected state: Online Broken state: Offline / Disconnected / Last sync failed Also check vCenter events - filter for "storage provider": vCenter > Monitor > Events > search "sps" or "storage provider registration"

Root Cause: ESXi Certificate Invalidated the veecdp I/O Filter Registration

ESXi host certificates were renewed - by VMCA automatically, by a vCenter upgrade, or manually - and the vCenter SPS (Storage Management Service) lost trust in the ESXi certificate chain. The veecdp I/O filter is registered per-host through SPBM. When ESXi certificate validation fails, SPS drops the storage provider offline. Veeam CDP sees no compatible storage because SPBM cannot enumerate hosts. This is a certificate chain problem masquerading as a storage problem.

Check Veeam I/O Filter deployment log on backup server: C:\ProgramData\Backup\Utils\I_O_filter_deployment.log Look for any of: "Could not find a trusted signer" "Storage providers are offline" "Unable to create a storage policy: one or more storage providers are offline" Confirm in vCenter vpxd.log (SSH to vCSA): grep -i "sps\|storage provider\|certificate" /var/log/vmware/vpxd/vpxd.log | tail -100

Resolution - Three Options, Try in Order

Option 1 (fastest) - Reinstall Veeam I/O Filter: In Veeam Backup and Replication, go to Storage Infrastructure, find the vCenter, expand to the affected cluster. Right-click the cluster and choose Rescan Storage. If the provider remains offline, right-click and choose Uninstall I/O Filter, wait for completion, then Install I/O Filter. This rebuilds the veecdp registration with fresh certificate handshaking.

Veeam Console: Storage Infrastructure > [vCenter] > [Cluster] Right-click > Rescan Storage If still offline: Right-click > Uninstall I/O filter (wait for task) Right-click > Install I/O filter

Option 2 - Reconnect ESXi host through vCenter: In vSphere Client, right-click the affected ESXi host, select Disconnect, then Reconnect. This forces vCenter to re-validate the host certificate chain and re-register all storage providers associated with that host. Often resolves the issue without touching Veeam.

Option 3 - VMware SPS certificate rebuild (Veeam KB4242): Run the VMware unreg_vasa script that recreates the SPS certificate and re-registers the storage management service. This requires SSH access to the vCenter appliance. Reference Veeam KB4242 for the exact script and procedure. If this fails, open a Broadcom support case.

Prevention

Add to your vCenter certificate renewal runbook: after any ESXi certificate event (VMCA automated or manual), verify vCenter Storage Providers are online before closing the change window. Run trust-infra-certs --vsphere --unattended on all VCD cells after any vCenter or ESXi certificate change. Verify CDP storage selection from a tenant account before marking the change complete. The failure is invisible until a tenant attempts to configure CDP - which may be days after the certificate event that caused it.

🆙

20. Veeam v13 and HPE Zerto 10.x - What Changed for VCD Deployments

2025 was a big year for both platforms. Veeam shipped v13, Zerto shipped 10.8, and if you run VCD as your cloud delivery platform you need to know exactly what changed and what it means for your upgrade planning. Here is the straight version.

Veeam v13 (13.0.1, November 2025) - What MSPs Need to Know

Where Things Stand Now

Build	Date	Cloud Connect / VCD	Notes
VBR 13.0.1 (GA)	November 19, 2025	FULL SUPPORT	Windows + Linux (Veeam Software Appliance). Full Cloud Connect, full VCD integration. VSPC v9 required. In-place upgrade from VBR 12.3.1+ (build 12.3.1.1139 or later).
VBR 13.0.1 Patch 2 (current)	March 12, 2026	FULL SUPPORT	Build 13.0.1.2067. Current recommended build. Includes security patches. See KB4738 for known issues still open at this build.

The Cloud Connect Upgrade Sequence - SP Goes First, Every Time

Tenant Upgrades to v13 Before SP - Tenant Is Locked Out

VBR v13 is a breaking major version change for Cloud Connect. If a tenant upgrades their VBR from v12 to v13 before the Service Provider upgrades the Cloud Connect server, the tenant immediately loses access to their cloud repository and cloud host. The Veeam console shows: Your service provider uses an outdated Veeam Cloud Connect platform version that is not compatible with your product version. This is not recoverable until the SP upgrades.

Fix: SP Upgrades First, Then Notify Tenants

Upgrade the SP Cloud Connect server (VBR 13.0.1) before any tenant upgrades their VBR. VBR 13.0.1 Cloud Connect is backward compatible with tenants running VBR 12.3.2 (build 12.3.2.3617 and later). Communicate the upgrade schedule to all tenants and explicitly tell them not to upgrade their VBR to v13 until the SP confirms the Cloud Connect server is on 13.0.1. Include this in your tenant SLA communications and VSPC service notifications.

VSPC v8 Is Not Compatible with VBR v13 Cloud Connect

Veeam Service Provider Console version 8 does not work with VBR v13. If your VSPC is on v8 and you upgrade VBR to v13, you lose Cloud Connect management through VSPC. VSPC v9 is required. This means you are upgrading both VBR and VSPC as part of the same change window.

Upgrade Sequence: VSPC v9 Before or Alongside VBR 13.0.1

Upgrade VSPC to v9 before or simultaneously with upgrading VBR to 13.0.1. VSPC v9 is backward compatible with VBR 12.x environments, so upgrading VSPC first does not break anything. Confirm VSPC v9 is running and healthy before starting the VBR upgrade.

VBR v13 New Features Relevant to VCD Environments

Universal CDP to VCD: The headline v13 feature for MSPs. CDP in v12 was vSphere-only on the source side. v13 opens it to any Windows machine - physical servers, VMs on any hypervisor, cloud instances - with VCD as the replication target. You can now pitch sub-second RPO DRaaS to tenants running bare-metal or Hyper-V workloads and deliver it over your existing Cloud Connect Replication infrastructure. More target hypervisors are coming in future releases, but VCD is supported at GA.

Veeam Software Appliance (VSA): Rocky Linux under the hood, DISA STIG hardened at deployment, OS updates managed by Veeam. No Windows Server license. The actual VBR code is identical between Windows and VSA in 13.0.1 - you get the same feature set either way. If you are running VBR on Windows today and want to move to VSA, the path is: upgrade Windows VBR to 13.0.1 first, then use conversion tooling. You cannot jump directly from v12 Windows to v13 VSA - the intermediate Windows upgrade step is required.

Web UI: Finally. v13 ships a proper web-based management console that sits alongside the Windows UI. Both are supported. The web UI covers the day-to-day backup and restore operations well, though it does not yet have full parity with the Windows console for every advanced operation. Good enough for most work, and it will catch up over time.

v13 Upgrade Blockers - Check These Before Upgrading:

Blocker	Check	Remediation
Backup jobs with metadata still in v11 format	VBR console shows warning on affected jobs	Run a full active full backup to upgrade metadata to v12 format before upgrading to v13
Backup Copy jobs in legacy mode	Settings show "legacy" in job properties	Resync backup copy jobs to convert to per-machine chain format before upgrading
Veeam Agent for Windows older than v6	Check agent versions in managed agents list	Upgrade agents to v6.3.1+ before upgrading VBR to v13
Reversed incremental backup mode in use	Job settings show reversed incremental	Convert affected jobs to forward incremental. New jobs cannot use reversed incremental in v13.
VBR version older than 12.3.1	Check current build in Help > About	Upgrade to 12.3.1 (build 12.3.1.1139) minimum before upgrading to v13. In-place upgrade only supports from 12.3.1+.

v13 Known Issues Relevant to VCD

Cloud Connect License Report Shows 0 Points for Tenant Consumption (KB4738)

In VBR 13.0.1, the license report for Cloud Connect providers incorrectly displays 0 points for tenant consumption. This is a reporting bug - billing is not affected, but the license report cannot be used for tenant consumption tracking until a patch resolves it. Check Veeam KB4738 for the current fix status.

Datastore Cluster Cannot Be Selected in VMware Hardware Plan Wizard (KB4738)

In VBR 13.0.1, when configuring a VMware Hardware Plan for Cloud Connect replication, Datastore Clusters cannot be selected as the storage target. Individual datastores within the cluster can be selected as a workaround. Check KB4738 for fix availability.

HPE Zerto 10.x - Current State for VCD Environments

Current Version Landscape

Zerto Version	Status	Key Change
Zerto 9.7	Critical Support (EOSL approaching)	Last version with Windows ZVM support. ZVMA available but optional.
Zerto 10.0	General Support	Windows ZVM removed. ZVMA mandatory. Major architecture change.
Zerto 10.7	General Support	Security hardening, ICMP blocked by default
Zerto 10.8	General Support - Current	VCF 9.0 support, host attestation for vault recovery, CrowdStrike integration

Windows ZVM Is Gone - ZVMA Is Mandatory from Zerto 10.0

The Windows-based Zerto Virtual Manager is done. Zerto 10.0 removed it entirely. Every deployment running Zerto 10.x or later must be on the ZVMA - a Linux OVF appliance. If you are still on Zerto 9.7 with a Windows ZVM, your clock is ticking. Zerto shipped a migration tool at v10 to move from Windows ZVM to ZVMA. Use it before 9.7 exits support and you lose a supported upgrade path.

Zerto and VCD 10.6 - The AMQP/MQTT Reality

Zerto's VCD integration uses the RabbitMQ AMQP messaging path to receive VCD events. VCD 10.6 deprecated AMQP in favor of MQTT, so the obvious question is: when does Zerto move to MQTT? That answer is not public yet. What matters right now is that AMQP still works in VCD 10.6.x alongside the deprecated path, so Zerto continues to function as long as you keep RabbitMQ running on 3.10.x, 3.11.x, or 3.12.x.

The Broadcom KB (article 422585) documenting the VCD 10.6.1.1 MQTT/AMQP disconnect explicitly names Zerto as the affected external service. The short-term fix is the RabbitMQ connection for Zerto remains supported in VCD 10.6 alongside the deprecated AMQP path. RabbitMQ 3.10.x, 3.11.x, or 3.12.x is required. The Zerto connection to VCD continues through this AMQP/RabbitMQ path even as AMQP is deprecated for VCD's own internal use.

Check the Matrix Before Every VCD Upgrade

Zerto commits to supporting new VCD versions within 90 days of GA. That means there is a gap window after every VCD release where Zerto has not yet validated it. Always check the interoperability matrix before upgrading VCD. The Zerto matrix is hosted on Zoomin and frequently renders blank - use the MyZerto portal directly or call HPE Zerto support to get a definitive answer. Do not assume compatibility just because AMQP is still connecting.

Zerto 10.x VCD-Relevant Operational Changes

ZVMA mandatory: The Linux ZVMA appliance is now the only supported deployment. This changes your backup and DR posture for Zerto itself - ZVMA backup via snapshot (Zerto provides built-in ZVMA backup configuration), no Windows OS patching concern, Keycloak-based authentication built in.

ICMP blocked by default (10.0_U7 and later): Ping is blocked on the ZVMA as a security measure from 10.0 Update 7 forward. If your monitoring platform pings the ZVMA to check reachability, those checks will start failing false negatives after upgrade. Switch your checks to TCP port connectivity against the Zerto API ports. ICMP can be re-enabled via CLI Option 10 (ICMP Echo Management), but Zerto flags it with a security warning and it is off by default for a reason.

VCF 9.0 support (10.8): Zerto 10.8 added support for VMware Cloud Foundation 9.0 for new deployments. This is relevant for MSPs running VCF-backed VCD infrastructure who are evaluating VCF 9.0 adoption.

Offline recovery and vault integration (10.0 U6+): Zerto added offline recovery mode for use with HPE Cyber Resilience Vault scenarios. Zerto 10.8 added host attestation for vault recovery - a TPM-based host verification process before allowing interaction with sensitive vault data. Relevant for MSPs building ransomware resilience offerings alongside VCD DRaaS.

Zerto 10.x Upgrade Path for VCD Environments

From	To	Action Required
Zerto 9.7 Windows ZVM	Zerto 10.x ZVMA	Use Zerto migration tool. Deploy ZVMA first, migrate config, decommission Windows ZVM. Do not upgrade Windows ZVM in place - it is not supported in v10.
Zerto 9.7 ZVMA	Zerto 10.x ZVMA	Standard upgrade via ZVMA management UI. Backup ZVMA before upgrading. VPGs remain in protection during upgrade.
Zerto 10.x any	Zerto 10.8	Standard upgrade. Verify VCD compatibility in Zerto interoperability matrix for your specific VCD version before upgrading Zerto.

Always Check the Interoperability Matrix Before Upgrading Either Product

The Zerto interoperability matrix is at help.zerto.com (Zoomin-hosted, often rendering blank - use MyZerto portal directly if the page does not load). The Veeam KB4488 is at veeam.com/kb4488. Check both before every VCD upgrade, every Veeam upgrade, and every Zerto upgrade. The 90-day support commitment means a new VCD version may have a 1-3 month window where Zerto has not yet validated it.

📋

21. Upgrade Runbook: 10.5.x to 10.6.x Step by Step

This is the procedure I would follow myself. Not the abbreviated version - the one that actually accounts for what breaks. VCD upgrades are not click-next affairs. Each step has a verification gate. Do not proceed past a failed gate.

Never Upgrade on a Friday

VCD upgrades take 45-90 minutes per cell in a multi-cell configuration. PostgreSQL major version upgrades (required when moving to 10.6) can take longer on large databases. Plan for a 4-hour maintenance window minimum. The first time through any new major version, block a full day.

Pre-Upgrade Checklist (Complete Before Maintenance Window)

Item	Command / Action	Expected Result
Check current VCD version and build	VCD UI: Help > About, or rpm -qa \| grep vmware-vcd	Know your starting point exactly
Confirm all cells are healthy	VCD UI: System > Server Configuration > Manage Cell Cluster	All cells show Active status
Snapshot all VCD cell VMs	vCenter: right-click each cell VM > Take Snapshot	Snapshots created and visible
VCD database backup	/opt/vmware/appliance/bin/create-backup.sh on primary cell	Backup file created in transfer/backups
Verify backup file is not zero bytes	ls -lh /opt/vmware/vcloud-director/data/transfer/backups/	File exists with non-zero size and recent timestamp
Check PostgreSQL disk space	df -h /var/vmware	At least 8 GB free before upgrade
Check root disk space on all cells	df -h /	At least 10 GB free
Verify vCenter and NSX connectivity	VCD UI: Resources > Infrastructure Resources - all green	No connection errors on vCenter or NSX
Confirm RabbitMQ is healthy	VCD Admin UI: Administration > General Settings > AMQP - Test connection	Connection test passes
Check for stale AMQP settings (if upgrading from pre-10.1)	GET /api/admin/extension/settings/amqp	Know whether AMQP settings exist that need cleanup
Confirm Veeam plugin version	VCD UI: Administration > Customize Portal > UI Plugins	Note current plugin version - will need update after upgrade
Notify Zerto of upcoming maintenance	Email / ticketing system	Zerto team aware VCD upgrade is occurring
Download the target VCD ISO	Broadcom Support Portal	ISO downloaded and accessible from all cells
Verify ISO checksum	sha256sum VMware_Cloud_Director-{version}.iso	Matches Broadcom published checksum

The Upgrade Sequence

Upgrade Cells One at a Time - Never Simultaneously

VCD maintains backward compatibility between adjacent cell versions during a rolling upgrade. Only one cell at a time should be running the upgrade process. The other cells continue serving traffic (with degraded capacity) while each cell upgrades. If you upgrade multiple cells simultaneously, you may hit database schema conflicts.

Step 1 - Mount the ISO on the Primary Cell

SCP the ISO to the cell or mount it from a datastore. The VCD appliance upgrade uses an update package approach on the appliance, or an RPM-based approach for Linux installations.

Appliance upgrade approach - using update package

# SCP the VCD ISO update package to the cell scp VMware_Cloud_Director-{version}.iso root@vcd-primary:/tmp/ # Mount the ISO mkdir -p /mnt/cdrom mount -o loop /tmp/VMware_Cloud_Director-{version}.iso /mnt/cdrom # The update package is at: ls /mnt/cdrom/vmware-vcd-*.rpm

Step 2 - Run Pre-Upgrade Checks

Before the upgrade, VCD runs a system check. Do not skip this step - it catches the problems that would abort the upgrade partway through.

Run VCD pre-upgrade system check

# For appliance-based deployments - upgrade via VAMI # Navigate to: https://{cell-ip}:5480 # Go to Update > Check Updates # If your upgrade source is a local ISO, first upload it via VAMI > Update > Upload # For Linux installation approach: cd /mnt/cdrom ./vmware-vcd-*.run --check # If any check fails, resolve it before proceeding # Common failures: insufficient disk space, PostgreSQL version mismatch, clock drift

Step 3 - Run the Upgrade on the Primary Cell

The primary cell must be upgraded first. During this step, VCD will also upgrade the PostgreSQL major version if needed. This is where the 8 GB disk pre-provisioning matters.

Execute upgrade (appliance/VAMI approach)

# Via VAMI: https://{primary-cell-ip}:5480 # Update > Install Updates # Click Install Updates and confirm # Monitor progress in VAMI. The cell service will stop, upgrade, and restart. # Monitor the upgrade log: tail -f /var/log/vmware/vcd/upgrade.log # Verify the VCD service restarts cleanly: systemctl status vmware-vcd tail -f /opt/vmware/vcloud-director/logs/cell.log | grep "Application Initialization"

Wait for Application Initialization: Complete in cell.log before proceeding.

Step 4 - Verify Primary Cell Post-Upgrade

Verification checks on primary cell after upgrade

# Check version rpm -qa | grep vmware-vcd # Check cell cluster status - primary should show upgraded version # VCD UI: System > Server Configuration > Manage Cell Cluster # Test API response curl -sk https://localhost/api/sessions | grep -i version # Check that RabbitMQ connection is still working # VCD Admin UI: Administration > General Settings > AMQP > Test # Check PostgreSQL is running and replication is healthy (if HA DB) sudo -i -u postgres psql -c "SELECT * FROM pg_stat_replication;"

Step 5 - Upgrade Secondary Cells One at a Time

Repeat Steps 1-4 on each secondary cell. Between each cell, wait for: - The upgraded cell to fully rejoin the cluster (visible in Manage Cell Cluster) - At least 5 minutes of normal operation with no errors in cell.log - Verify tenant portal is accessible through the load balancer The load balancer should be configured to drain connections from a cell before the upgrade starts on that cell. If your LB supports health check-based draining, the upgrade process will naturally trigger this as VCD stops the service.

Step 6 - Post-Upgrade Mandatory Steps

Post-upgrade cleanup and verification

# 1. Remove stale AMQP settings if upgrading from pre-10.1 (see Section 7) # DELETE /api/admin/extension/settings/amqp # 2. Re-import infrastructure certificates /opt/vmware/vcloud-director/bin/cell-management-tool trust-infra-certs --vsphere --unattended # 3. Set log level fix for ORG_TRAVERSE spam (see Section 11) echo "log4j.logger.com.vmware.ssdc=INFO" >> /opt/vmware/vcloud-director/etc/log4j.properties # Restart VCD service on all cells sequentially after this # 4. Verify event table is not already large sudo -i -u postgres psql vcloud -c "SELECT COUNT(*) FROM audit_trail;" # 5. Confirm RabbitMQ event flow # Watch the systemExchange queue in RabbitMQ management UI # Queue message rate should show activity within 5 minutes # 6. Update Veeam plugin (see Section 9 - Plugin Management) # 7. Notify Zerto team to verify VCD connectivity in ZCM

Step 7 - Delete Snapshots

Once you have verified the upgraded environment is stable for at least 24 hours with no issues, delete the pre-upgrade VM snapshots. Snapshots consume significant storage and degrade VM performance the longer they exist. Do not keep them indefinitely "just in case" - if something was going to go wrong, it would have gone wrong by now.

Delete snapshots after 24-hour burn-in

# In vCenter: right-click each VCD cell VM # Snapshots > Delete All Snapshots

If the Upgrade Fails Midway

Revert to Snapshot - Do Not Try to Continue a Partially Upgraded Cell

If the upgrade fails on a cell and the VCD service does not restart, revert that cell to its pre-upgrade snapshot immediately. Do not attempt to manually fix the partially upgraded installation. A partial VCD upgrade leaves the RPM database in an inconsistent state that is very difficult to recover from without a snapshot. Revert, identify the root cause from the upgrade log, fix the root cause, and restart the upgrade.

Common upgrade failure causes and fixes

Failure: Insufficient disk space Fix: Expand root or PostgreSQL volume, retry Failure: PostgreSQL version upgrade fails Fix: Check /opt/vmware/var/log/vcd/db_diskresize.log Ensure PostgreSQL data disk has 8+ GB free Failure: FIPS mode enabled Fix: Disable FIPS mode before upgrade: /opt/vmware/vcloud-director/bin/cell-management-tool manage-config -n com.vmware.vcloud.fips.enabled -v false Then retry the upgrade Failure: PostgreSQL certificate SAN issue (upgrading to 10.6.1+) Fix: The PostgreSQL certificate must include the database IP in SAN Regenerate PostgreSQL cert with correct SAN before upgrading to 10.6.1 Failure: Cell clock drift Fix: Run: chronyc makestep Then verify: timedatectl status shows "System clock synchronized: yes" Retry upgrade

🔧

22. Cell Management Tool Quick Reference

The cell-management-tool (CMT) is used constantly throughout this guide as the fix mechanism for bugs, configuration issues, and maintenance tasks. A quick reference for every CMT command you will actually need in production. All commands are run on a VCD cell as root unless otherwise noted. Commands that modify configuration need to be run on the primary cell only - settings propagate via the database.

Base command path

/opt/vmware/vcloud-director/bin/cell-management-tool

Cell and Service Management

Purpose	Command
Graceful cell shutdown (drain tasks first)	cell-management-tool cell -i $(service vmware-vcd pid cell) -s
Show cell status	cell-management-tool cell -i $(service vmware-vcd pid cell) -S
Show all cells in cluster	cell-management-tool cell -l
Put cell in maintenance mode	cell-management-tool cell -m
Remove cell from maintenance mode	cell-management-tool cell -r
Decommission a cell (remove from cluster)	cell-management-tool deregister

Certificate Management

Purpose	Command
Import vSphere and NSX infrastructure certs	cell-management-tool trust-infra-certs --vsphere --unattended
Import HTTPS certificate	cell-management-tool certificates -j --cert /path/cert.pem --key /path/key.pem --key-password {pass}
Remove legacy console proxy settings	cell-management-tool clear-console-proxy-settings
List certificates in trust store	cell-management-tool certificates -l

Configuration Management

Purpose	Command
Set a configuration property	cell-management-tool manage-config -n {property.name} -v {value}
Read a configuration property	cell-management-tool manage-config -n {property.name} -l
Set audit trail retention (days)	cell-management-tool manage-config -n com.vmware.vcloud.audittrail.history.days -v 45
Set task history retention (days)	cell-management-tool manage-config -n com.vmware.vcloud.tasks.history.days -v 30
Set relocate task timeout (minutes)	cell-management-tool manage-config -n relocate.vm.workflow.timeout.minutes -v 20
Set ORG_TRAVERSE log spam fix	Add to log4j.properties: log4j.logger.com.vmware.ssdc=INFO
Enable/disable FIPS mode	cell-management-tool manage-config -n com.vmware.vcloud.fips.enabled -v {true\|false}

VM and VApp Discovery

Purpose	Command
Trigger VM auto-import discovery	cell-management-tool debug-auto-import -v {vm-moref}
Force full vCenter inventory resync	cell-management-tool resync-vc

Most-Used CMT Commands in This Guide

The CMT commands referenced most throughout this article

# Graceful cell stop (Section 14, 21) - always before DB writes /opt/vmware/vcloud-director/bin/cell-management-tool cell -i $(service vmware-vcd pid cell) -s # Trust infrastructure certs (Section 8, 17, 21) - after every cert change /opt/vmware/vcloud-director/bin/cell-management-tool trust-infra-certs --vsphere --unattended # Set audit trail retention (Section 14) - prevent event table bloat /opt/vmware/vcloud-director/bin/cell-management-tool manage-config -n com.vmware.vcloud.audittrail.history.days -v 45 # Extend relocate timeout (Section 11) - fix storage policy change timeouts /opt/vmware/vcloud-director/bin/cell-management-tool manage-config -n relocate.vm.workflow.timeout.minutes -v 20 # Clear legacy console proxy settings (Section 17) - after upgrade from pre-10.4 /opt/vmware/vcloud-director/bin/cell-management-tool clear-console-proxy-settings

💳

23. Usage Meter and VCSP Operational Considerations

If you are a VMware Cloud Service Provider, Usage Meter runs alongside VCD and is how Broadcom tracks your consumption for VCSP billing. It is frequently under-documented and over-ignored until something breaks. Here is what you need to know operationally.

What Usage Meter Does

Usage Meter collects consumption data from VCD (and other VMware products in your stack) and reports it to Broadcom for monthly billing under your VCSP contract. It connects to VCD via a read-only service account and polls usage at regular intervals. The data it collects feeds your monthly commerce portal report. If Usage Meter is not running or cannot reach VCD, your reported consumption will be inaccurate - either under-reported (you pay less but risk compliance issues) or missing data that creates dispute scenarios with Broadcom.

Usage Meter + VCD Integration Requirements

Requirement	Detail
Usage Meter version	Must be compatible with your VCD version. Check the VMware Product Interoperability Matrix. For VCD 10.6.x, use Usage Meter 4.7 or later.
VCD service account	Create a dedicated VCD service account for Usage Meter with read-only System Administrator rights. Do not use the vcd-system-admin account.
Network access	Usage Meter must reach VCD API on TCP 443. Usage Meter must also reach Broadcom's commerce portal endpoints outbound (HTTPS). Proxy configurations are supported.
VCD API version	Usage Meter uses specific VCD API versions. If you upgrade VCD to a version that drops an old API, verify Usage Meter compatibility before the upgrade.
SSL trust	Usage Meter must trust the VCD SSL certificate. If using a private CA, import the CA root into Usage Meter's trust store.

Setting Up the VCD Service Account for Usage Meter

Step 1 - Create Dedicated Usage Meter User in VCD

In the VCD System organization, create a local user account specifically for Usage Meter. Do not reuse any other account.

Create Usage Meter service account via VCD API

POST https://vcd.example.com/api/admin/org/{system-org-id}/users Authorization: Bearer {system-admin-token} Content-Type: application/vnd.vmware.admin.user+xml;version=36.3 <User xmlns="http://www.vmware.com/vcloud/v1.5"> <Name>svc-usagemeter</Name> <IsEnabled>true</IsEnabled> <Password>{strong-password}</Password> <Role href="https://vcd.example.com/api/admin/role/{system-admin-role-id}"/> </User>

Step 2 - Add VCD to Usage Meter

In Usage Meter, navigate to Products, then Add, then VMware Cloud Director. Enter the VCD cell FQDN (or load balancer VIP), the service account credentials, and accept the SSL certificate. Usage Meter will test connectivity and begin collecting data on the next polling cycle (typically every 4 hours).

Common Usage Meter Issues in VCD Environments

Gotcha: Usage Meter Stops Reporting After VCD Certificate Renewal

When the VCD SSL certificate is renewed, Usage Meter may reject the new certificate if it has not been updated in Usage Meter's trust store. Symptom: Usage Meter shows VCD as disconnected and collection data stops. Fix: in Usage Meter, navigate to Products, find the VCD entry, click Edit, and accept the new certificate when prompted.

Gotcha: Usage Meter Counts Stopped VMs

VCSP billing for VCD is based on allocated vRAM and vCPU in active Organization VDCs, not just powered-on VMs. A tenant with 16 vCPU and 64 GB RAM allocated in their VDC contributes to your monthly bill regardless of whether their VMs are running. Make sure tenants understand this - and make sure your pricing reflects it. Powered-off VMs in an active VDC are not free.

Gotcha: Missing Data in Commerce Portal After VCD Upgrade

If you upgrade VCD without first verifying Usage Meter compatibility, Usage Meter may silently stop collecting during the version transition. You discover this at month-end when the commerce portal shows a gap in reported usage. Broadcom will work with you to reconstruct missing data but it is a support case and a manual process. Always check Usage Meter is collecting after every VCD upgrade.

Monitoring Usage Meter

Usage Meter collection verification

# Check Usage Meter collection log for VCD # In Usage Meter appliance: tail the collection log tail -f /var/log/um/collection.log | grep -i "vcd\|cloud.director" # Verify last successful collection timestamp via Usage Meter UI: # Reports > Collection Status # Look for VCD entry - "Last Collected" should be within the last 4-6 hours # Test API connectivity from Usage Meter to VCD curl -sk -u svc-usagemeter:{password} https://vcd.example.com/api/sessions -H "Accept: application/*+xml;version=36.3" # Should return 200 with session token, not 401 or connection error

Veeam with VCSP Licensing

If you are selling Veeam-powered backup as part of your VCSP offering, note the distinction between Veeam Cloud Connect licensing (per-workload tenant billing through VCSP) and Veeam's own usage reporting. Veeam Cloud Connect tracks tenant consumption for your own billing but this is separate from what Usage Meter reports to Broadcom. You manage these two billing streams independently.

VSPC v9 (required alongside VBR 13.0.1) provides enhanced VCSP-specific reporting that helps consolidate your view of tenant consumption across both the VCD and backup dimensions. If you have not yet upgraded to VSPC v9, the reporting improvement is a genuine operational benefit alongside the required VBR v13 compatibility.

🏗️

24. Three-Tier Tenancy - Full Setup Walkthrough

The three-tier model (Provider, Sub-Provider, Tenant) is one of the most powerful things VCD 10.6 offers MSPs who wholesale to resellers. This walkthrough covers the actual setup from scratch, not just the concepts. By the end, a reseller (Sub-Provider) will be able to create and manage their own tenants without ever needing your provider credentials.

Architecture Overview

Three-Tier Tenancy Model

In the three-tier model:

Provider (you) - owns the physical infrastructure, creates Provider VDCs and resource pools, manages NSX, manages RabbitMQ, sets global limits.

Sub-Provider (your reseller) - receives a delegated quota of compute, storage, and network resources from you. Creates and manages their own tenant organizations within their quota. Cannot see or modify provider infrastructure.

Tenant (the reseller's customer) - gets an Organization VDC from the Sub-Provider. Has no awareness of the provider layer at all.

Sub-Provider Cannot Fix Infrastructure Issues

A Sub-Provider can manage everything within their delegated quota but cannot modify the underlying infrastructure. If a tenant's VDC hits a provider-level constraint (network pool exhausted, storage policy unavailable, NSX quota reached), the Sub-Provider cannot resolve it. They escalate to you. Document this escalation path and SLA before onboarding your first reseller.

Step-by-Step: Creating a Sub-Provider Organization

Step 1 - Create the Sub-Provider Organization

Log in to VCD as a System Administrator. Navigate to Administration, then Organizations, then New Organization. When creating the organization, set the Organization Type to Sub-Provider. This is the key distinction - a Sub-Provider org has elevated rights over its own tenant scope that a standard tenant organization does not.

Create Sub-Provider org via API

POST https://vcd.example.com/cloudapi/1.0.0/orgs Authorization: Bearer {system-admin-token} Content-Type: application/json { "name": "reseller-acme", "displayName": "ACME Reseller", "description": "Sub-Provider org for ACME MSP", "isEnabled": true, "orgVdcCount": 0, "canPublishCatalogs": true, "canPublishExternally": false, "canSubscribe": true, "vAppLeaseSettings": { "deploymentLeaseSeconds": 0, "storageLeaseSeconds": 0 }, "orgType": "SUBPROVIDER" }

Step 2 - Create the Sub-Provider Administrator Account

Create a local user in the Sub-Provider organization with the Organization Administrator role. This is the account your reseller uses to log in and manage their tenants.

Create Sub-Provider admin user

POST https://vcd.example.com/api/admin/org/{subprovider-org-id}/users Authorization: Bearer {system-admin-token} Content-Type: application/vnd.vmware.admin.user+xml;version=36.3 <User xmlns="http://www.vmware.com/vcloud/v1.5"> <Name>admin@acme</Name> <IsEnabled>true</IsEnabled> <Password>{strong-password}</Password> <Role href="https://vcd.example.com/api/admin/role/{org-admin-role-id}"/> </User>

Step 3 - Create a Provider VDC for the Sub-Provider

The Sub-Provider gets resources from a Provider VDC. You can create a dedicated Provider VDC for the reseller (cleaner isolation, harder to expand) or grant them access to a shared Provider VDC with resource limits (more flexible). For VCSP reseller deployments, a dedicated Provider VDC per reseller is the right design - it prevents resource contention between resellers. Create the Provider VDC normally (Infrastructure Resources > Provider VDCs > New). When complete, grant the Sub-Provider organization access to it.

Grant Sub-Provider access to Provider VDC via API

PUT https://vcd.example.com/cloudapi/1.0.0/providerVdcs/{pvdc-id}/orgs Authorization: Bearer {system-admin-token} Content-Type: application/json { "values": [ { "id": "{subprovider-org-id}" } ] }

Step 4 - Sub-Provider Creates Their Own Org VDCs

Once the Sub-Provider has access to the Provider VDC, their administrator can log in and create Organization VDCs for their tenants. The Sub-Provider admin navigates to Resources, then Cloud Resources, then Organization VDCs, then New. They see only the Provider VDCs you have granted them access to. The Sub-Provider sets: - VDC name and allocation model - CPU and memory limits from within their Provider VDC quota - Storage policy selections from the Provider VDC's storage profiles - Network pool from the Provider VDC's network pools What they cannot do: change the underlying Provider VDC, access other Provider VDCs, modify NSX, or view other organizations' resources.

Step 5 - Sub-Provider Creates Tenant Organizations

The Sub-Provider creates tenant organizations the same way a Provider does, but scoped to their own space. In the VCD UI, a Sub-Provider admin sees an Administration section that allows them to create organizations, manage users, and configure tenant policies - all within their scope. The Sub-Provider assigns Organization VDCs to their tenants, sets resource quotas, and manages tenant lifecycle without any provider involvement.

Step 6 - Configure Rights Bundles (Optional but Recommended)

Rights Bundles control what actions tenants within the Sub-Provider's scope can perform. The Sub-Provider can grant or restrict rights to their tenants within the rights the provider has granted to the Sub-Provider. This lets the reseller offer different service tiers (standard tenant vs. power tenant with additional self-service capabilities).

Publish a rights bundle to a Sub-Provider tenant via API

POST https://vcd.example.com/cloudapi/1.0.0/rightsBundles/{bundle-id}/tenants Authorization: Bearer {subprovider-admin-token} Content-Type: application/json { "values": [ { "id": "{tenant-org-id}" } ] }

What the Sub-Provider Can and Cannot Do

Action	Sub-Provider Can	Sub-Provider Cannot
Create tenant organizations	Yes - within their scope	Cannot create organizations outside their Sub-Provider org
Manage tenant users and roles	Yes - for their tenants	Cannot modify system roles or other Sub-Providers' users
Create Organization VDCs	Yes - using granted Provider VDC resources	Cannot access Provider VDCs not granted to them
Manage network pools and segments	Within their Org VDCs	Cannot modify NSX, create network pools, or modify provider gateways
Manage IP Space quotas	Can allocate IPs to tenants from their quota	Cannot modify the provider-level IP Space or expand their own quota
Publish catalogs	Can publish catalogs to their tenant organizations	Cannot publish to organizations outside their scope
View provider infrastructure	No visibility	No access to physical infrastructure, NSX, vCenter
Resolve provider-level resource exhaustion	Cannot - must escalate to Provider	Cannot add resources to Provider VDC

Billing Considerations for Three-Tier Deployments

In a three-tier model, you need a billing model that accounts for the reseller layer. Common approaches:

Wholesale commitment model: The reseller commits to a monthly resource reservation (e.g., 100 vCPU, 1 TB RAM, 20 TB storage). You invoice the reseller at the committed rate regardless of how much their tenants use. The reseller marks up and invoices their own tenants. Simplest operationally - one billing relationship per reseller.

Consumption pass-through: The reseller passes actual consumption to their tenants. You invoice the reseller based on actual Usage Meter data. More accurate but requires the reseller to have their own billing system that can ingest VCD usage data. VSPC can help with this if the reseller also uses Veeam.

For most MSP-to-reseller relationships, the wholesale commitment model is the right choice. It simplifies your operations, gives the reseller predictable costs, and gives you predictable revenue.

🚨

25. Disaster Recovery of VCD Itself

VCD Disaster Recovery Decision Tree

Every runbook covers what happens when tenant VMs fail. Almost none cover what happens when VCD itself goes down. That scenario is what this section covers - the primary cell is gone, the database is corrupted, or the entire site is lost. These are low-probability but high-consequence events, and the time to figure out your recovery procedure is not while VCD is down and tenants are calling.

What the VCD Backup Actually Contains

The VCD appliance backup (create-backup.sh) creates a ZIP file in the transfer directory containing the PostgreSQL database dump and the VCD configuration files. It does not back up the VM files themselves - those are on your datastores. It does not back up NSX or vCenter configuration. What it gives you is the VCD brain: the database of all tenant organizations, VDCs, catalogs, users, network configurations, and task history. Without this, rebuilding VCD means recreating every tenant by hand.

Validate Your Backup File Before You Need It

The backup script runs and reports success even when the resulting file is corrupted or zero bytes. Test your backup restore process in a lab at least quarterly. The worst time to discover your backup file is unusable is during an actual recovery. Run this check after every scheduled backup:

Verify backup file is valid

ls -lh /opt/vmware/vcloud-director/data/transfer/backups/ # Should show a non-zero file with a recent timestamp # Verify it is a valid ZIP (backup files are ZIP archives) file /opt/vmware/vcloud-director/data/transfer/backups/{backup-file}.zip # Should say: Zip archive data # Extract without writing to confirm integrity python3 -c "import zipfile; z=zipfile.ZipFile('{backup-file}.zip'); z.testzip(); print('OK')"

Copy Backups Off the Cell

The backup file sitting on the transfer volume of the VCD cell is not a backup - it is a copy. If you lose the cell VM and its datastore, you lose the backup too. Move backup files to an external location immediately after creation.

Automated backup copy to external storage (add to cron)

#!/bin/bash # Run this after create-backup.sh completes # Copy backup ZIP to external NFS mount, S3, or secondary datastore BACKUP_DIR="/opt/vmware/vcloud-director/data/transfer/backups" DEST="/mnt/backup-nas/vcd-backups/$(hostname)" KEEP_DAYS=14 mkdir -p "$DEST" # Copy newest backup file NEWEST=$(ls -t "$BACKUP_DIR"/*.zip 2>/dev/null | head -1) if [ -n "$NEWEST" ]; then cp "$NEWEST" "$DEST/" echo "Copied $(basename $NEWEST) to $DEST" fi # Prune old backups find "$DEST" -name "*.zip" -mtime +$KEEP_DAYS -delete

Scenario 1 - Single Cell Failure (Multi-Cell Deployment)

A cell VM crashes or becomes unresponsive. Other cells are still running. VCD is degraded but functional - the load balancer routes around the dead cell.

Step 1 - Confirm the Cell Is Actually Dead

Check cell status from another cell

cell-management-tool cell -l # Dead cell will show inactive or not appear # Check from the load balancer - is the cell's health check failing? # If using NSX LB: check pool member status in NSX Manager

Step 2 - Attempt VM Recovery First

Try to restart the dead cell VM in vCenter before treating it as a total loss. If the VM restarts and vmware-vcd service comes up cleanly, you are done. Check cluster status and verify the cell rejoins.

Step 3 - If VM Is Unrecoverable, Decommission and Replace

Decommission the dead cell from the remaining primary cell

# SSH to the primary cell (one that is still running) # Deregister the dead cell cell-management-tool deregister --ip {dead-cell-ip} # Verify it is removed from cluster cell-management-tool cell -l

Then deploy a new cell VM from the VCD OVA, configure it with the same version, and join it to the cluster. The new cell pulls its configuration from the shared PostgreSQL database automatically.

Scenario 2 - Primary Cell Database Corruption

The PostgreSQL database is corrupted or the primary cell vPostgres service will not start. Secondary cells cannot connect. VCD is completely down.

Check the Standby Cells Before Doing Anything Else

If you have a multi-cell HA deployment with a database standby node, the standby may be promotable. Promoting the standby is far faster than restoring from backup and loses less data.

Step 1 - Attempt Standby Promotion (HA Deployments Only)

Check standby node status and promote

# SSH to standby cell sudo -i -u postgres /opt/vmware/vpostgres/current/bin/repmgr cluster show # If primary shows as failed and standby is connected: sudo -i -u postgres /opt/vmware/vpostgres/current/bin/repmgr standby promote --log-level INFO # Verify promotion succeeded sudo -i -u postgres psql -c "SELECT pg_is_in_recovery();" # Should return: f (false = this is now the primary)

After promotion, update the remaining cells' global.properties to point to the new primary IP if it has changed.

Step 2 - If No Standby or Promotion Fails, Restore from Backup

This is the full restore path. It requires reverting to a VM snapshot (if one exists) or deploying a fresh appliance and restoring the database backup.

Restore VCD appliance from backup (appliance deployments)

# Option A: Revert to pre-failure snapshot in vCenter # Right-click VM > Snapshots > Revert to Snapshot # Then restart vmware-vcd service on all cells # Option B: Fresh appliance restore from backup file # 1. Deploy a new VCD appliance OVA (same version as backup) # 2. Copy backup ZIP to the new appliance transfer directory # 3. Via VAMI (https://{new-cell-ip}:5480): # Restore > Select backup file > Restore # 4. Wait for restore to complete (30-90 minutes depending on DB size) # 5. Verify cell service starts cleanly

Step 3 - Post-Restore Validation

Verify everything after restore

# Check database is healthy sudo -i -u postgres psql vcloud -c "SELECT COUNT(*) FROM organization;" sudo -i -u postgres psql vcloud -c "SELECT COUNT(*) FROM org_prov_vdc;" # Check VCD service is up systemctl status vmware-vcd tail -50 /opt/vmware/vcloud-director/logs/cell.log | grep "Initialization" # Re-import infrastructure certs (vCenter/NSX certs change if time passed) cell-management-tool trust-infra-certs --vsphere --unattended # Verify tenant portal is accessible curl -sk https://localhost/api/sessions | grep -i version # Check vCenter reconnection # VCD UI: Resources > Infrastructure Resources > vCenter - should show connected # Notify Zerto and Veeam teams to re-verify their VCD connections

Scenario 3 - Complete Site Loss

The entire VCD site is gone - data center disaster, power failure with corrupted storage, or catastrophic hardware failure. No VMs, no storage. You are rebuilding from offsite backups.

The Hard Reality of Complete Site Loss

A complete site rebuild from backup restores VCD's knowledge of what was deployed, but not the actual tenant workloads. Tenant VMs that existed on the lost site are gone unless they were replicated to a second site via Veeam or Zerto. The VCD restore gives you the management plane. The data plane recovery depends entirely on your replication and backup strategy for the actual VM data. Make sure your SLA documentation to tenants clearly distinguishes between these two layers.

Step 1 - Deploy VCD at the Recovery Site

Stand up a new VCD appliance at the recovery site using the same version as the production backup. This can be at a colocation facility, a second availability zone, or a cloud-based recovery environment.

Step 2 - Restore the Database Backup

Copy the most recent backup ZIP from your offsite storage to the new appliance's transfer directory. Restore via VAMI. This recreates the VCD database with all tenant configurations intact as of the last successful backup.

Step 3 - Reconnect to vCenter and NSX at Recovery Site

The restored VCD knows about the production vCenter and NSX Manager. In a complete site loss, those are gone. You need to point VCD at the vCenter and NSX at the recovery site.

Update vCenter connection after site change

# VCD UI: Resources > Infrastructure Resources > vCenter Server Instances # Edit existing vCenter entry and update hostname/IP to recovery vCenter # Accept new certificate when prompted # Allow VCD to rescan the vCenter inventory at recovery site

Step 4 - Recover Tenant Workloads

This step depends on your replication strategy: - If Veeam Cloud Connect replication was running: trigger failover or restore from cloud repository - If Zerto was running: trigger failover of VPGs to the recovery VCD site - If neither: restore from Veeam backup files if available at recovery site, or restore from snapshots if replication was running at vCenter level

What Veeam and Zerto Do When VCD Is Down

This is the question nobody asks until it matters. Here is the exact behavior of each platform when VCD itself is unavailable:

Platform	What Keeps Working	What Breaks	Recovery Trigger
Veeam VBR (agent backup)	Agent backup jobs continue running. Backups write to the cloud repository normally. VBR communicates directly with the Veeam agent on the VM - no VCD involvement.	Tenant Self-Service Backup Portal (requires VCD API). New Cloud Connect tenant onboarding. Any operation that needs VCD org credential validation.	Portal access restores when VCD comes back up. No manual intervention needed for running jobs.
Veeam Cloud Connect Replication	Existing replication jobs continue. The Cloud Gateway communicates directly with source Veeam servers. VCD is not in the data path for replication.	Replica management through VCD portal. New replica job creation. Failover operations that use VCD API to power on replicas in the Org VDC.	Failover operations require VCD to be available. This is the critical dependency - if you need to failover during a VCD outage, you will need to perform the failover manually in vCenter instead.
Zerto VPG Replication	Replication continues at the ZVM level. Journal checkpoints continue being written. VPG protection is maintained. The ZVMA and ZVMs communicate directly with each other.	VCD event awareness. Organization-level failover through VCD portal. ZCM operations that depend on VCD API responses. The Zerto tenant self-service portal if it depends on VCD authentication.	Failover can still be triggered directly through the ZVMA management interface or the Zerto API, bypassing VCD. You lose the VCD-integrated workflow but not the failover capability itself.

The Critical Takeaway on Veeam Failover During VCD Outage

Veeam Cloud Connect replication runs independently of VCD. But triggering a failover - powering on the replica in the Org VDC - goes through VCD. If VCD is down and you need to trigger a failover simultaneously, you need a documented manual procedure for powering on replica VMs directly in vCenter within the tenant's Org VDC cluster, outside of the Veeam portal. Document this procedure now and test it. This is the gap that causes extended recovery time during compound failures.

🌐

26. VCD Multisite Federation

VCD Multisite Federation Architecture

VCD multisite lets you tie two or more geographically separate VCD installations together so they can be managed and accessed as a unified platform. MSPs running multiple availability zones or data centers need this to offer tenants a consistent experience across sites without managing separate VCD credentials for each location.

What Multisite Is and Is Not

Multisite is a management and identity federation feature. It is not a replication or HA feature. When you associate two VCD sites, organization users can log in at Site A and see their assets at Site B - but the data stays local to each site. Catalogs can be published and subscribed across sites. Organization associations let users access multi-site resources from a single portal session.

What multisite does not do: replicate tenant VMs between sites, synchronize Org VDC configurations between sites, or provide automatic failover. That is what Zerto and Veeam replication do. Multisite handles the identity and management plane. Zerto and Veeam handle the data plane.

Multisite Prerequisites

Requirement	Detail
API version compatibility	Sites must be on the same VCD API version or one major version apart. For VCD 10.6.x sites, all sites should be on 10.6.x for full feature compatibility.
Network connectivity	Each VCD site must be able to reach the other site's API endpoint on TCP 443. The association uses certificate-based mutual authentication. No VPN is required if both sites are internet-accessible, but a private interconnect is strongly recommended for production.
Valid SSL certificates	Both sites must have valid SSL certificates. Self-signed certificates work if each site's certificate is trusted by the other, but CA-signed certificates are easier. The association will fail if certificate validation fails between sites.
Unique installation IDs	Each VCD site must have a unique installation ID. If you built your secondary site by cloning the primary appliance, the installation IDs will be identical and multisite will fail. Regenerate the installation ID on the cloned site before creating the association.
System admin access at both sites	Creating the site association requires System Administrator credentials at both sites. Organization associations require Org Admin credentials at both orgs.

Creating a Site Association

Step 1 - Download Local Site Data from Site A

Log in to Site A as System Administrator. Navigate to Administration, then System, then Multisite. Click Download Local Data. Save the XML file - this contains Site A's public key and endpoint information.

Step 2 - Download Local Site Data from Site B

Repeat the same process logged into Site B. You now have two XML files - one from each site.

Step 3 - Create the Site Association

On Site A, navigate to Administration, then System, then Multisite, then New Site Association. Upload the Site B data XML. Click Create. VCD establishes the association by exchanging public keys between the sites. Then on Site B, create an association pointing to Site A (upload the Site A data XML). The association is established from both sides.

Create site association via API (alternative to UI)

POST https://site-a.example.com/api/site/associations Authorization: Bearer {site-a-token} Content-Type: application/vnd.vmware.admin.siteAssociation+xml;version=36.3 <SiteAssociation xmlns="http://www.vmware.com/vcloud/v1.5"> <SiteData>{paste Site B XML content here}</SiteData> </SiteAssociation>

Step 4 - Verify the Site Association

Verify site association is active

GET https://site-a.example.com/api/site/associations Authorization: Bearer {site-a-token} Accept: application/vnd.vmware.admin.siteAssociations+xml;version=36.3 # Response should show the associated site with status = ACTIVE # If status is PENDING or ERROR, check SSL certificate trust between sites

Creating Organization Associations

A site association lets the sites talk to each other. An organization association lets a specific tenant org at Site A access assets at a corresponding org at Site B. These are created separately and are tenant-controlled (the org admin does this, not the system admin).

Step 1 - Org Admin Downloads Local Org Data

The org admin at Site A logs in to the VCD Tenant Portal. They navigate to Administration, then Settings, then Multisite. Click Download Local Data. This downloads the XML file for their organization. The org admin at Site B does the same.

Step 2 - Exchange and Upload Org Data Files

The Site A org admin uploads the Site B org data file (New Organization Association > Upload Site B file > Create). The Site B org admin uploads the Site A org data file the same way. Once both sides upload, the organization association is Active. Users logged in to the org at Site A can now access the associated org's assets at Site B from the same portal session.

Multisite Catalog Publishing

One of the most useful multisite features for MSPs is cross-site catalog publishing. A gold image catalog at Site A can be subscribed to at Site B. Tenants at both sites get access to the same templates without manual duplication.

Publish a catalog to a remote site subscriber

# Site A - configure catalog for external publishing # VCD UI: Libraries > Catalogs > {catalog} > Edit Settings # Enable "Publish this catalog externally" # Note the subscription URL that VCD generates # Site B - subscribe to the published catalog # VCD UI: Libraries > Catalogs > New > Subscribe to external catalog # Enter Site A's subscription URL # Enter Site A subscriber credentials (set when publishing) # Select sync options (on demand or scheduled)

Gotcha: Subscribed Catalog Sync Fails After Upgrade

After upgrading VCD (either site), subscribed catalogs stop syncing. VCD does not automatically trust the renewed certificate of the publishing site. Each org admin must manually edit their catalog subscription and accept the new certificate (trust on first use dialog appears). This is documented in the known issues for every 10.6.x release and the fix is always manual. Build this into your post-upgrade checklist.

Gotcha: Placement Policy Editing Fails in Multisite (10.6.1 Known Issue)

In multisite environments running 10.6.1, editing a placement policy in the Service Provider Admin Portal fails with TypeError: this.multisiteResponse is undefined. This is a UI bug in 10.6.1 that was not confirmed fixed in 10.6.1.2. Use the API to edit placement policies in multisite environments as a workaround.

Multisite and Veeam/Zerto

Multisite federation is a management plane feature. It does not change how Veeam or Zerto connect to VCD. Each VCD site needs its own Veeam Cloud Connect server (or the same VBR registered to both sites). Zerto Cloud Manager connects to each VCD site independently. There is no automatic federation of Veeam or Zerto configuration across VCD sites - you manage each site's protection stack separately, even if tenants see them as a unified organization.

📜

27. PostgreSQL SAN Certificate - The 10.6.1 Upgrade Blocker

This deserves its own section because it has blocked production upgrades in environments that did everything else right. VCD 10.6.1 added a security enhancement that validates the PostgreSQL certificate's Subject Alternative Name contains the database server's IP address. If you used a custom CA-signed certificate for PostgreSQL without the IP in the SAN, the upgrade fails immediately. Default self-signed certificates generated by the VCD appliance are not affected - this only hits environments that replaced the PostgreSQL certificate with a CA-signed one.

How to Know If You Are Affected Before Upgrading

Check if your PostgreSQL cert has the DB IP in SAN - run before upgrading

# Find the PostgreSQL certificate PGDATA=/var/vmware/vpostgres/current/pgdata ls $PGDATA/server.crt # Check the SAN entries openssl x509 -in $PGDATA/server.crt -noout -text | grep -A 5 "Subject Alternative Name" # Look for an IP Address entry matching your database IP # Good output includes: IP Address:192.168.x.x # Bad output has only DNS names, no IP Address entries # Also check what IP the VCD cells use to connect to PostgreSQL grep "database.jdbcUrl" /opt/vmware/vcloud-director/etc/global.properties # The IP in the jdbcUrl must appear in the PostgreSQL cert SAN

Do Not Proceed With 10.6.1 Upgrade Until This Is Confirmed

If your PostgreSQL certificate does not have the database IP in SAN, you have two options. Pick one and complete it before starting the upgrade. Do not start the upgrade hoping it will work - it will fail and you will need to roll back.

Option A - Regenerate the PostgreSQL Certificate with IP in SAN (Preferred)

This is the clean fix. It requires a brief PostgreSQL restart but no VCD downtime if done before the upgrade.

Step 1 - Generate a New Certificate with IP SAN

Generate new PostgreSQL cert with IP in SAN

# Method 1: For VCD appliance using built-in certificate generation # This regenerates the appliance's internal PostgreSQL certificate # Run on the primary cell via VAMI (https://{cell-ip}:5480): # Certificates > Generate New Certificate # The appliance-generated cert automatically includes the correct SANs # Method 2: If using a custom CA, generate a new CSR with IP SAN DB_IP="192.168.x.x" # Your PostgreSQL IP openssl req -new -newkey rsa:4096 -nodes -keyout /tmp/postgres-new.key -out /tmp/postgres-new.csr -subj "/CN=vcd-postgresql" -reqexts SAN -config <(cat /etc/ssl/openssl.cnf <(printf " [SAN] subjectAltName=IP:${DB_IP},DNS:$(hostname -f)")) # Submit CSR to your CA, get signed certificate back # The signed cert must include IP:${DB_IP} in SAN

Step 2 - Install the New Certificate

Install new PostgreSQL certificate

PGDATA=/var/vmware/vpostgres/current/pgdata # Backup existing certs cp $PGDATA/server.crt $PGDATA/server.crt.bak cp $PGDATA/server.key $PGDATA/server.key.bak # Install new certificate cp /tmp/postgres-signed.crt $PGDATA/server.crt cp /tmp/postgres-new.key $PGDATA/server.key chown vpostgres:vpostgres $PGDATA/server.crt $PGDATA/server.key chmod 600 $PGDATA/server.key chmod 644 $PGDATA/server.crt # Reload PostgreSQL (no full restart needed for cert changes) sudo -i -u postgres /opt/vmware/vpostgres/current/bin/psql -c "SELECT pg_reload_conf();" # Verify new cert is in use openssl x509 -in $PGDATA/server.crt -noout -text | grep -A 5 "Subject Alternative Name" # Confirm IP appears

Option B - Add SSL Mode to JDBC URL (No Certificate Replacement)

If you cannot replace the PostgreSQL certificate before the upgrade (for example, if your internal CA has a slow turnaround), this workaround bypasses the SAN validation by switching the connection to explicit SSL mode. It is documented in Broadcom KB 388974.

Option B - Modify global.properties on All Cells

Add sslmode=require to jdbcUrl - do this on every VCD cell before upgrading

PROPS=/opt/vmware/vcloud-director/etc/global.properties # Backup first cp $PROPS $PROPS.bak.$(date +%Y%m%d) # Find the current jdbcUrl line grep "database.jdbcUrl" $PROPS # Edit the line - append &sslmode=require&ssl=true to the URL # Before: database.jdbcUrl=jdbc:postgresql://192.168.x.x:5432/vcloud?socketTimeout=90 # After: database.jdbcUrl=jdbc:postgresql://192.168.x.x:5432/vcloud?socketTimeout=90&sslmode=require&ssl=true # Use sed to append (adjust the IP to match yours): sed -i 's|socketTimeout=90|socketTimeout=90\&sslmode=require\&ssl=true|g' $PROPS # Verify the change grep "database.jdbcUrl" $PROPS # Restart VCD to pick up the new connection string systemctl restart vmware-vcd # Confirm VCD starts cleanly with new connection string tail -f /opt/vmware/vcloud-director/logs/cell.log | grep -i "database\|Initialization"

Repeat on every cell. Then proceed with the upgrade. After the upgrade completes successfully, you can either leave this setting in place permanently or replace the PostgreSQL certificate properly and revert to the original connection string.

Roll Back to Snapshot If the Upgrade Fails Mid-Way

If you started the upgrade before applying either fix and it failed, do not try to fix the certificate mid-upgrade. Revert all cells to their pre-upgrade snapshots first, apply the fix, then retry the upgrade from a clean state. A partially upgraded VCD installation cannot have its PostgreSQL connection string changed without reverting first.

💰

28. Usage Meter Month-End Reconciliation and Billing Disputes

Month-end at a VCSP is when Usage Meter problems that went unnoticed all month become billing problems. The reconciliation workflow below covers what to do when the numbers do not match what you expect - and how to open a dispute with Broadcom when the gap is their collection failure, not yours.

The Month-End Workflow

Step 1 - Verify Usage Meter Collected All Month

Before trusting any report, confirm Usage Meter was actually collecting data throughout the month. Go to Usage Meter UI, then Reports, then Collection Status. Look for VCD in the list. Check the Last Collected timestamp - it should be within the last 4-6 hours. If Last Collected is days old, you have a gap.

Check collection log for gaps

# SSH to Usage Meter appliance # Search for collection errors or gaps in the VCD collection log grep -i "error\|fail\|vcd\|cloud.director" /var/log/um/collection.log | grep "$(date -d 'last month' +'%Y-%m')" # Check for any days with zero VCD collections grep "cloud.director" /var/log/um/collection.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -20 # Each day should have multiple entries. Days with 0 or 1 entries are gaps.

Step 2 - Generate the Monthly Report

In Usage Meter, navigate to Reports, then Monthly Usage, select the billing month, and export to CSV. The VCD section of this report lists consumption by product, by organization, and by vRAM/vCPU metric.

Step 3 - Cross-Reference with VCD Allocation Data

The Usage Meter report should match the allocated resources in your VCD organizations. Pull the current VCD allocation for comparison:

Pull VCD Org VDC allocations for cross-reference

GET https://vcd.example.com/api/query?type=adminOrgVdc Authorization: Bearer {system-admin-token} Accept: application/json;version=36.3 # Key fields to compare against Usage Meter report: # cpuAllocationMhz, memoryAllocationMB, storageAllocationMB # per Org VDC, per Organization

Step 4 - Submit to Commerce Portal

Log in to the Broadcom Commerce Portal (MyVMware). Navigate to Usage > Submit Usage Report. Upload the Usage Meter CSV or enter the totals. The Commerce Portal validates the submission and generates the monthly invoice. Submission is typically due within the first week of the following month - check your VCSP contract for the exact deadline.

When the Numbers Do Not Match

Scenario: Usage Meter Shows Less Than Expected

VCD has resources allocated in active Org VDCs but Usage Meter reports a lower number. Most common causes: the VCD service account password expired and Usage Meter has been failing to connect silently, Usage Meter collected zero data for some days due to network issues, or VCD was upgraded without updating the Usage Meter connection.

Diagnose Usage Meter collection failures

# Test Usage Meter service account against VCD API directly curl -sk -u svc-usagemeter:{password} https://vcd.example.com/api/sessions -H "Accept: application/*+xml;version=36.3" # 200 OK = credentials working, 401 = password expired or wrong

Scenario: Usage Meter Shows More Than Expected

Report shows consumption higher than you thought. Check for: Org VDCs that were provisioned for testing and left allocated, tenant VDCs with resources allocated but no VMs (allocation-based billing, not VM-based), or a tenant that self-served additional VDC resources via Sub-Provider access.

Opening a Billing Dispute with Broadcom

If your Usage Meter report has a gap due to a confirmed collection failure and you believe you are being billed for a period where Usage Meter was not collecting accurately, you can open a dispute through the VCSP support channel. The process:

Gather evidence: Usage Meter collection logs showing the gap, any error messages from the period in question, and your own VCD allocation data for the gap period. Open a VCSP support case through the Broadcom support portal, explicitly tagged as a billing dispute. Provide the evidence and specify the gap period and the estimated correct consumption. Broadcom typically takes 2-4 weeks to review and respond. Disputes are evaluated on a case-by-case basis. Having clean logs and clear evidence is the difference between a quick resolution and a long back-and-forth.

Prevention Is Far Easier Than Dispute Resolution

Monitor Usage Meter collection daily. Set up an alert when Last Collected is more than 8 hours old. An alert that fires at 8 hours gives you time to fix a collection failure before it becomes a billing gap. A billing dispute filed weeks after the fact is a long process with an uncertain outcome.