Debug an Astro Private Cloud 2.0 upgrade

Use this guide when your Astro Private Cloud (APC) control plane or data plane Pods aren’t progressing to a healthy state after upgrading to 2.0.

Helm upgrade fails with patch conflict

  • Cause: Environment variable ordering in the APC Deployment changed between versions, causing Helm’s strategic merge patch to fail.
  • Symptoms: The upgrade fails with an error containing cannot patch "astronomer-houston" with kind Deployment and references to environment variable ordering.
  • Solution: Delete the APC Deployment with --cascade=orphan to preserve running Pods, then retry the Helm upgrade:
kubectl delete deployment/<release-name>-houston -n astronomer --cascade=orphan
helm upgrade -f migrated-values.yaml -n astronomer astronomer astronomer/astronomer --version 2.0.x

APC API crashes with “Configuration property isn’t defined”

  • Cause: The APC Docker image contains an older default.yaml configuration file that doesn’t include keys added in 2.0. If the Helm ConfigMap doesn’t explicitly set these new keys, the APC API fails at startup when it tries to read them.
  • Symptoms: APC Pods enter CrashLoopBackOff with errors like:
Error: Configuration property "deployments.runtimeManagement.astroRuntimeReleasesFile" is not defined
  • Solution: Ensure the APC Docker image matches the chart version you are deploying. If you built a custom APC Docker image, rebuild it with the 2.0 codebase that includes the updated config/default.yaml. Then restart the APC Pods:
kubectl rollout restart deploy/<release-name>-houston -n astronomer

JetStream Pods stuck in Pending or CrashLoopBackOff

  • Cause: STAN PersistentVolumeClaims (PVCs) weren’t deleted before the upgrade (applies when upgrading from 0.37.x).
  • Solution: Delete existing PVCs for STAN:
kubectl delete pvc -l app.kubernetes.io/name=stan

Prisma migrate deploy or database migration job times out

  • Cause: Database locks or concurrent transactions, often because of Prometheus or monitoring jobs.
  • Solution:
    • Check for locks:

      SELECT * FROM pg_locks pl JOIN pg_stat_activity psa ON pl.pid = psa.pid;
    • Terminate long-running transactions:

      SELECT pg_terminate_backend(pid)
      FROM pg_stat_activity
      WHERE state = 'active' AND now() - query_start > interval '5 minutes';
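
For intuition, the interval '5 minutes' comparison above is just an age check on each transaction. A rough shell analog with sample (made-up) timestamps, no database involved:

```shell
# Compare a query's age against a 5-minute (300-second) threshold.
now=$(date +%s)
query_start=$((now - 600))   # sample: a query that started 10 minutes ago
age=$((now - query_start))   # seconds the query has been running
[ "$age" -gt 300 ] && echo "would terminate"   # → would terminate
```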

APC Worker logs show NATS: connection timeout

  • Cause: The APC API tries to connect to STAN before JetStream is ready (applies when upgrading from 0.37.x).
  • Solution: Restart the APC Worker Pod to reconnect to JetStream:
kubectl rollout restart deploy/<release-name>-houston-worker -n astronomer

Airflow 3 Pods fail with ModuleNotFoundError: airflow.providers.cncf

  • Cause: Cluster configuration override missing or not applied.
  • Solution:
    • Reapply the override function in the Airflow 3 migration guide.
    • Ensure minimumAstroRuntimeVersion is set to 3.1-2 or higher in the override config.

Airflow Deployments fail with Postgres connection errors after upgrade due to missing database name

  • Cause: The astronomer-bootstrap secret connection string doesn’t include a database name suffix (for example, /main on AWS RDS or /postgres on AKS), causing connection failures.
  • Solution:
    • Verify the correct database name by logging in to your database.
    • Patch the astronomer-bootstrap secret so connection ends with /<database_name>, then run a Helm upgrade.
    • The secret is in the astronomer namespace (or the namespace where you install APC).
    • In control plane/data plane (CP/DP) mode, patch the secret in both the control plane and data plane clusters.
# Namespace where APC is installed
NAMESPACE=astronomer
# Database name suffix; for AWS the default database is usually "main", for AKS "postgres"
DB_NAME=main

# Read current connection string, append database name, and update the secret
CURRENT=$(kubectl -n "$NAMESPACE" get secret astronomer-bootstrap -o jsonpath='{.data.connection}' | base64 -d)
NEW="${CURRENT%/}/$DB_NAME"
kubectl -n "$NAMESPACE" patch secret astronomer-bootstrap --type=merge -p "{\"data\":{\"connection\":\"$(printf '%s' "$NEW" | base64 -w0)\"}}"

# Apply the update with a Helm upgrade
helm upgrade -f values.yaml -n "$NAMESPACE" astronomer astronomer/astronomer
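
The ${CURRENT%/}/$DB_NAME line does the real work in the script above. A standalone sketch with a sample (hypothetical) connection string shows the strip-and-append behavior and the re-encoding step:

```shell
# Sample value only; a real value comes from the astronomer-bootstrap secret.
DB_NAME=main
CURRENT="postgres://user:pass@db.example.com:5432/"   # may or may not end in "/"
NEW="${CURRENT%/}/$DB_NAME"   # strip one trailing slash if present, then append the DB name
echo "$NEW"                   # → postgres://user:pass@db.example.com:5432/main
ENCODED=$(printf '%s' "$NEW" | base64 -w0)   # re-encode for the secret patch
```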

APC API migration fails due to duplicate workspace labels

  • Cause: The upgrade includes a migration that adds a unique constraint to the Workspace.label column. If your database contains workspaces with duplicate labels, the migration fails with Prisma error P3009. Once the migration is marked as failed, all subsequent migrations are blocked, preventing the APC API from starting.
  • Symptoms: APC database migration Pods fail with Error: P3009 and the message migrate found failed migrations in the target database. APC API and worker Pods enter CrashLoopBackOff.
  • Solution:

Back up the APC database before you run DELETE on _prisma_migrations or rely on a retry of this migration.

1. Check for duplicate workspace labels

Connect to your APC database and check for duplicate workspace labels:

SET search_path TO "houston$default";
SELECT label, COUNT(*) FROM "Workspace" GROUP BY label HAVING COUNT(*) > 1;
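
The GROUP BY ... HAVING COUNT(*) > 1 pattern flags any label that appears more than once. The same idea on sample (made-up) labels, without a database:

```shell
# uniq -d prints only lines that occur more than once in sorted input.
printf '%s\n' "Data Team" "ML Team" "Data Team" | sort | uniq -d
# → Data Team
```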

2. Resolve the duplicate labels

Rename one of the duplicate workspace labels, using the workspace ID as a suffix to guarantee uniqueness:

UPDATE "Workspace" SET label = label || '-' || id
WHERE id = '<workspace-id-to-rename>';
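
The label || '-' || id expression simply appends the workspace ID, which is already unique. With sample (made-up) values:

```shell
# Workspace IDs are unique, so "<label>-<id>" is guaranteed unique too.
label="Data Team"
id="ckx123abc"   # hypothetical workspace ID
echo "${label}-${id}"   # → Data Team-ckx123abc
```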

3. Check whether the migration partially applied

Before clearing the failed Prisma record, check whether the unique constraint was applied:

SELECT constraint_name FROM information_schema.table_constraints
WHERE table_schema = 'houston$default'
  AND table_name = 'Workspace'
  AND constraint_type = 'UNIQUE';
  • If the unique constraint on label isn’t listed, the migration didn’t complete. After fixing duplicate labels, you can clear the failed migration and retry. Back up your database first, then delete only that failed migration row:
DELETE FROM "houston$default"."_prisma_migrations"
WHERE migration_name = '20250918072802_added_unique_constraint_to_ws_label';
  • If a unique constraint on label is present, the schema change may have been applied even though Prisma recorded a failure. Don’t delete the _prisma_migrations row or re-run the migration without DBA or Astronomer support — you can create conflicting schema or migration state.

4. Re-run the Helm upgrade

helm upgrade -f migrated-values.yaml -n astronomer astronomer astronomer/astronomer --version 2.0.x

Values migration script reports unexpected errors

  • Cause: The Python migration script requires Python 3.10+ and the ruamel.yaml package.
  • Solution:
    • Verify your Python version: python3 --version
    • Install or upgrade the dependency: pip install ruamel.yaml
    • Run the script in dry-run mode first to preview changes: ./bin/migrate-helm-chart-values-1x-to-2x.py --dry-run my-values.yaml
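
If you want to check the 3.10 minimum in a script rather than by eye, sort -V does a version-aware comparison. A sketch with a sample version string (replace the have value with your actual python3 --version output):

```shell
need="3.10"
have="3.11.4"   # sample; substitute the version reported by python3 --version
# sort -V orders dotted versions numerically; if the minimum sorts first, we're OK.
lowest=$(printf '%s\n%s\n' "$need" "$have" | sort -V | head -n1)
if [ "$lowest" = "$need" ]; then echo "Python version OK"; else echo "Python too old"; fi
```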

Partially applied database migration

  • Cause: Prisma migration interrupted during the upgrade.
  • Solution:
    • Run the migration manually:

      npx prisma migrate deploy
    • Verify that the Cluster and Deployment tables exist.

Vector Pods not running (upgrading from 0.37.x)

  • Cause: The fluentd key wasn’t renamed to vector in your migrated values file, or custom Fluentd resources were incompatible.
  • Solution: Verify that your migrated values file uses vector (not fluentd) as the top-level key. Re-run the migration script if needed, or manually rename the key.
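
If you prefer not to re-run the migration script, a minimal sketch of the manual rename, assuming fluentd: is a top-level key in your migrated values file (a throwaway file is created here purely for illustration):

```shell
# Create a sample values file for illustration only.
cat > /tmp/migrated-values.yaml <<'EOF'
fluentd:
  enabled: true
EOF
# Rename the top-level fluentd key to vector; nested keys are left as-is.
sed -i 's/^fluentd:/vector:/' /tmp/migrated-values.yaml
head -n1 /tmp/migrated-values.yaml   # → vector:
```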