Debug an Astro Private Cloud installation
Use this guide when your Astro Private Cloud (APC) control plane or data plane Pods are not progressing to a healthy state after installation.
Ensure platform components are reaching full availability
Work through the following checks, from controllers down to individual containers, to isolate possible causes when Pods don’t reach the READY state.
1. Verify controllers and ReplicaSets
- List Deployments, StatefulSets, and ReplicaSets in your namespace and confirm that the latest ReplicaSet or StatefulSet shows the expected number of available replicas.
- Identify the most recent ReplicaSet for the failing component by sorting the results by creation timestamp.
- Inspect that ReplicaSet for status and events that may be preventing Pods from launching.
Resolve issues such as insufficient resources, image pull errors, or missing secrets, then re-check the ReplicaSet until .status.availableReplicas matches .spec.replicas.
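The controller checks above can be sketched as a few small shell helpers. This is a minimal sketch, not an official procedure: the function names are hypothetical, and the default namespace astronomer is an assumption that you should replace with the namespace of your own installation.

```shell
# Assumption: the platform is installed in a namespace named "astronomer".
NAMESPACE="${NAMESPACE:-astronomer}"

# List controllers so you can compare desired and available replicas.
list_controllers() {
  kubectl get deployments,statefulsets,replicasets -n "$NAMESPACE"
}

# List ReplicaSets oldest first, so the most recent one is the last row.
list_replicasets_by_age() {
  kubectl get replicasets -n "$NAMESPACE" --sort-by=.metadata.creationTimestamp
}

# Show status and events for one ReplicaSet (first argument: its name).
describe_replicaset() {
  kubectl describe replicaset "$1" -n "$NAMESPACE"
}

# Print "<available>/<desired>" replicas for one ReplicaSet.
# The left side is empty when no replicas are available yet.
replicaset_availability() {
  kubectl get replicaset "$1" -n "$NAMESPACE" \
    -o "jsonpath={.status.availableReplicas}/{.spec.replicas}"
}
```

For example, run list_replicasets_by_age, copy the name from the last row, and pass it to describe_replicaset and replicaset_availability.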
2. Examine Pods and namespace events
- List Pod status.
- Use kubectl describe on a failing Pod to view events, container status, and scheduling details.
- Review recent events in the namespace for additional context.
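One possible way to run the Pod and event checks above is shown here. The helper names are hypothetical, and the astronomer namespace is an assumed default rather than something this guide specifies.

```shell
# Assumption: the platform is installed in a namespace named "astronomer".
NAMESPACE="${NAMESPACE:-astronomer}"

# List Pods with their READY, STATUS, and RESTARTS columns, plus node placement.
list_pods() {
  kubectl get pods -n "$NAMESPACE" -o wide
}

# Show events, container status, and scheduling details for one Pod.
describe_pod() {
  kubectl describe pod "$1" -n "$NAMESPACE"
}

# List namespace events sorted by last occurrence, newest at the bottom.
recent_events() {
  kubectl get events -n "$NAMESPACE" --sort-by=.lastTimestamp
}
```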
3. Inspect container logs
If a Pod continues to restart or is stuck in CrashLoopBackOff, gather logs for each container.
If the container restarts quickly, use --previous to view logs from the last attempt.
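The log collection described above might look like the following sketch. The helper names are hypothetical and the astronomer namespace is an assumed default; the kubectl flags themselves (-c and --previous) are standard.

```shell
# Assumption: the platform is installed in a namespace named "astronomer".
NAMESPACE="${NAMESPACE:-astronomer}"

# Print the container names defined in a Pod (first argument: Pod name).
list_containers() {
  kubectl get pod "$1" -n "$NAMESPACE" -o "jsonpath={.spec.containers[*].name}"
}

# Print logs from the current run of one container.
# Arguments: Pod name, container name.
container_logs() {
  kubectl logs "$1" -c "$2" -n "$NAMESPACE"
}

# Print logs from the previous, terminated run of one container.
# Useful when the container restarts before you can read its output.
previous_container_logs() {
  kubectl logs "$1" -c "$2" -n "$NAMESPACE" --previous
}
```

Running list_containers first gives you the names to pass as the second argument to the two log helpers.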
Use the collected errors to adjust your configuration, for example, by fixing database credentials or registry access. After remediation, re-run kubectl get pods to confirm all Pods report READY status. If problems persist, collect the relevant logs and events and contact Astronomer support.
Houston Pods stuck in CrashLoopBackOff
Houston (API) connects directly to the control-plane database during startup. If the Pods restart repeatedly:
- List Pods to verify their status.
- Test connectivity to the database from inside the cluster. If the connection times out, investigate networking or firewall rules between Kubernetes nodes and the Postgres host.
- Confirm that the astronomer-bootstrap secret contains the correct connection string. Decode the connection value and fix any typos. After updating the secret, delete the Houston and Grafana Pods so they pick up the change.
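The Houston database checks above could be scripted roughly as follows. Several details here are assumptions rather than facts from this guide: the astronomer namespace, the component=houston and component=grafana Pod labels, the public postgres:16 image used for the throwaway test Pod, and all helper names. Verify the labels with kubectl get pods --show-labels, and substitute an image from your own registry if the cluster cannot pull from Docker Hub.

```shell
# Assumption: the platform is installed in a namespace named "astronomer".
NAMESPACE="${NAMESPACE:-astronomer}"

# List Houston Pods. The component=houston label is an assumption.
list_houston_pods() {
  kubectl get pods -n "$NAMESPACE" -l component=houston
}

# Start a temporary Pod and probe the database with pg_isready.
# Arguments: database host, optional port (defaults to 5432).
check_database_reachable() {
  kubectl run pg-check -n "$NAMESPACE" --rm -i --restart=Never \
    --image=postgres:16 -- pg_isready -h "$1" -p "${2:-5432}"
}

# Decode the connection value stored in the astronomer-bootstrap secret.
# The output contains credentials, so avoid pasting it into tickets or chat.
decode_bootstrap_connection() {
  kubectl get secret astronomer-bootstrap -n "$NAMESPACE" \
    -o "jsonpath={.data.connection}" | base64 -d
}

# Delete Houston and Grafana Pods so their controllers recreate them
# with the updated secret. Both labels are assumptions.
recycle_houston_and_grafana() {
  kubectl delete pods -n "$NAMESPACE" -l component=houston
  kubectl delete pods -n "$NAMESPACE" -l component=grafana
}
```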
x509 “certificate signed by unknown authority” while pulling images
If image pulls fail with a certificate error, such as when syncing registry certificates, restart the Houston Pods followed by the platform registry Pod. Ensure any custom certificate authorities are configured under global.privateCaCerts and applied via helm upgrade.
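A rough sketch of that restart order and the subsequent upgrade is shown below. Treat every name as a placeholder: the astronomer namespace and release name, the astronomer/astronomer chart reference, the component=houston and component=registry labels, and the helper names are all assumptions, so check them against your own installation before running anything.

```shell
# Assumptions: namespace, Helm release name, and chart reference.
NAMESPACE="${NAMESPACE:-astronomer}"
RELEASE="${RELEASE:-astronomer}"
CHART="${CHART:-astronomer/astronomer}"

# Delete the Houston Pods first, then the registry Pod, so their
# controllers recreate them in that order. Both labels are assumptions.
restart_houston_then_registry() {
  kubectl delete pods -n "$NAMESPACE" -l component=houston
  kubectl delete pods -n "$NAMESPACE" -l component=registry
}

# Apply a values file that sets global.privateCaCerts.
# Argument: path to the values file.
apply_values() {
  helm upgrade "$RELEASE" "$CHART" -n "$NAMESPACE" -f "$1"
}
```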
Houston worker showing NATS timeout errors after installation
After installing or upgrading APC, you might encounter issues where Deployments appear in the Astro CLI and database, but their Kubernetes namespaces are not created. Houston logs might show UnhandledPromiseRejectionWarning: NatsError: TIMEOUT.
This occurs when the NATS JetStream cluster has not yet elected a metadata leader before the Houston worker Pods attempt to set up streams and consumers.
To resolve:
- Verify that the Houston worker Pods are logging NATS timeout errors.
- Restart the Houston worker Pods so they reconnect after the NATS leader election completes.
- Confirm that the Deployment namespaces are created.
After the Houston worker Pods restart, they should create the necessary Kubernetes resources for your Deployments.
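The three steps above could be carried out with helpers like these. The astronomer namespace, the component=houston-worker label, and the helper names are assumptions; the search string is the error text quoted earlier in this section.

```shell
# Assumption: the platform is installed in a namespace named "astronomer".
NAMESPACE="${NAMESPACE:-astronomer}"

# Search recent Houston worker logs for the NATS timeout error.
# The component=houston-worker label is an assumption.
find_nats_timeouts() {
  kubectl logs -n "$NAMESPACE" -l component=houston-worker --tail=500 \
    | grep "NatsError: TIMEOUT"
}

# Trigger a rolling restart of the Houston worker Deployment.
restart_houston_workers() {
  kubectl rollout restart deployment -n "$NAMESPACE" -l component=houston-worker
}

# List namespaces so you can check that one exists for each Deployment.
list_namespaces() {
  kubectl get namespaces
}
```

If find_nats_timeouts prints nothing, the workers may not be hitting this particular issue, and the earlier Pod and log checks are a better starting point.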