Debug an Astro Private Cloud 2.0 upgrade
Use this guide when your Astro Private Cloud (APC) control plane or data plane Pods aren’t progressing to a healthy state after upgrading to 2.0.
Helm upgrade fails with patch conflict
- Cause: Environment variable ordering in the APC Deployment changed between versions, causing Helm’s strategic merge patch to fail.
- Symptoms: The upgrade fails with an error containing
cannot patch "astronomer-houston" with kind Deploymentand references to environment variable ordering. - Solution: Delete the APC Deployment with
--cascade=orphanto preserve running Pods, then retry the Helm upgrade:
APC API crashes with “Configuration property isn’t defined”
- Cause: The APC Docker image contains an older
default.yamlconfiguration file that doesn’t include keys added in 2.0. If the Helm ConfigMap doesn’t explicitly set these new keys, the APC API fails at startup when it tries to read them. - Symptoms: APC Pods enter
CrashLoopBackOffwith errors like:
- Solution: Ensure the APC Docker image matches the chart version you are deploying. If you built a custom APC Docker image, rebuild it with the 2.0 codebase that includes the updated
config/default.yaml. Then restart the APC Pods:
JetStream Pods stuck in Pending or CrashLoopBackOff
- Cause: STAN volume claims or PVCs weren’t deleted before the upgrade (applies when upgrading from 0.37.x).
- Solution: Delete existing PVCs for STAN:
Prisma migrate deploy or database migration job times out
- Cause: Database locks or concurrent transactions, often because of Prometheus or monitoring jobs.
- Solution:
-
Check for locks:
-
Terminate long-running transactions:
-
APC Worker logs show NATS: connection timeout
- Cause: The APC API tries to connect to STAN before JetStream is ready (applies when upgrading from 0.37.x).
- Solution: Restart the APC Worker Pod to reconnect to JetStream:
Airflow 3 Pods fail with ModuleNotFoundError: airflow.providers.cncf
- Cause: Cluster configuration override missing or not applied.
- Solution:
- Reapply the override function in the Airflow 3 migration guide.
- Ensure
minimumAstroRuntimeVersionis set to3.1-2or higher in the override config.
Airflow Deployments fail with Postgres connection errors after upgrade due to missing database name
- Cause: The
astronomer-bootstrapsecretconnectionstring doesn’t include a database name suffix (for example,/mainon AWS RDS or/postgreson AKS), causing connection failures. Verify the correct database name by logging in to your database. - Solution:
- Patch the
astronomer-bootstrapsecret soconnectionends with/<database_name>, then run a Helm upgrade. - The secret is in the
astronomernamespace (or the namespace where you install APC). - In control plane/data plane (CP/DP) mode, patch the secret in both the control plane and data plane clusters.
- Patch the
APC API migration fails due to duplicate workspace labels
- Cause: The upgrade includes a migration that adds a unique constraint to the
Workspace.labelcolumn. If your database contains workspaces with duplicate labels, the migration fails with Prisma errorP3009. Once the migration is marked as failed, all subsequent migrations are blocked, preventing the APC API from starting. - Symptoms: APC database migration Pods fail with
Error: P3009and the messagemigrate found failed migrations in the target database. APC API and worker Pods enterCrashLoopBackOff. - Solution:
Back up the APC database before you run DELETE on _prisma_migrations or rely on a retry of this migration.
Check for duplicate workspace labels
Connect to your APC database and check for duplicate workspace labels:
Resolve the duplicate labels
Rename one of the duplicate workspace labels, using the workspace ID as a suffix to guarantee uniqueness:
Check whether the migration partially applied
Before clearing the failed Prisma record, check whether the unique constraint was applied:
- If the unique constraint on
labelisn’t listed, the migration didn’t complete. After fixing duplicate labels, you can clear the failed migration and retry. Back up your database first, then delete only that failed migration row:
- If a unique constraint on
labelis present, the schema change may have been applied even though Prisma recorded a failure. Don’t delete the_prisma_migrationsrow or re-run the migration without DBA or Astronomer support — you can create conflicting schema or migration state.
Values migration script reports unexpected errors
- Cause: The Python migration script requires Python 3.10+ and the
ruamel.yamlpackage. - Solution:
- Verify your Python version:
python3 --version - Install or upgrade the dependency:
pip install ruamel.yaml - Run the script in dry-run mode first to preview changes:
./bin/migrate-helm-chart-values-1x-to-2x.py --dry-run my-values.yaml
- Verify your Python version:
Airflow Deployments fail with Postgres connection errors after upgrade
- Cause: The
astronomer-bootstrapsecretconnectionstring doesn’t include a database name suffix (for example,/postgres), causing connection failures. - Solution:
- Patch the
astronomer-bootstrapsecret soconnectionends with/<database_name>, then run a Helm upgrade. - In CP/DP mode, patch the secret in both the control plane and data plane clusters.
- Patch the
Partially applied database migration
- Cause: Prisma migration interrupted during the upgrade.
- Solution:
-
Run migration manually:
- Verify the
ClusterandDeploymenttables exist.
- Verify the
-
Vector Pods not running (upgrading from 0.37.x)
- Cause: The
fluentdkey wasn’t renamed tovectorin your migrated values file, or custom Fluentd resources were incompatible. - Solution: Verify that your migrated values file uses
vector(notfluentd) as the top-level key. Re-run the migration script if needed, or manually rename the key.