Manage cluster status

Astro Private Cloud (APC) tracks the operational status of every data plane cluster so that workloads only run on healthy infrastructure. This page describes the cluster status values, how Houston determines status, the GraphQL operations for querying and updating status, and how to troubleshoot unhealthy clusters.

Authenticate to Houston

Every operation on this page requires a Houston API token sent as a bearer credential. Send your token in the Authorization header on each request to the Houston GraphQL endpoint:

$ curl -X POST https://houston.<your-base-domain>/v1 \
>   -H "Authorization: <your-token>" \
>   -H "Content-Type: application/json" \
>   -d '{"query": "query { self { user { username } } }"}'

For step-by-step instructions on obtaining a user token or creating a system service account token, see Authenticate to the Houston API.

Required roles and permissions

Cluster operations are gated by RBAC permissions. The following table maps each operation to the permission Houston checks and the default role that grants it.

Operation                      Required permission       Default role that grants access
paginatedClusters              Authenticated user        Any signed-in user
cluster                        system.clusters.get       System Admin
updateCluster                  system.clusters.update    System Admin
reconcileClusterMetadataJob    system.clusters.update    System Admin

The System Admin role inherits every system.clusters.* permission.

Cluster status values

Status      Description                                              Allows new deployments    Allows configuration updates
ACTIVE      Cluster is healthy and reachable                         Yes                       Yes
INACTIVE    Cluster is unreachable or reporting an unhealthy status  No                        No

Status determination

Houston derives cluster status from the healthStatus field in Commander’s /metadata response. The mapping is binary:

Commander healthStatus    Houston cluster status
HEALTHY                   ACTIVE
Any other value           INACTIVE
Fetch error or timeout    INACTIVE

A CronJob in the control plane reconciles cluster metadata by calling Commander’s /metadata endpoint. The default schedule is 0 * * * * (hourly, at minute 0); you can change it through the houston.syncDataplaneClusters.schedule value in the Astronomer Helm chart.

You can list the reconcile CronJob and recent runs with the following command:

$ kubectl get cronjob,jobs -n astronomer | grep sync-dataplane-clusters

Query cluster status

List clusters

The paginatedClusters query returns clusters the caller has access to. Pagination uses the take argument, plus either cursor (a cluster UUID) or pageNumber. The response object contains a clusters list and a total count.

query {
  paginatedClusters(
    take: 50
    status: ACTIVE
  ) {
    clusters {
      id
      name
      status
      statusReason
      healthStatus
      k8sVersion
      cloudProvider
      region
      createdAt
      updatedAt
    }
    count
  }
}
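Cursor-based pagination can be driven client-side with a short loop. This is a sketch under assumptions: the fetch_page callable stands in for whatever helper you use to run the query, and the cursor is assumed to identify the last cluster of the previous page:

```python
def fetch_all_clusters(fetch_page, take: int = 50) -> list[dict]:
    """Collect every cluster by advancing the cursor (a cluster UUID).

    `fetch_page(take, cursor)` must return the paginatedClusters payload:
    {"clusters": [...], "count": <total>}. The loop stops once all `count`
    rows (or an empty page) have been seen.
    """
    clusters: list[dict] = []
    cursor = None
    while True:
        page = fetch_page(take, cursor)
        clusters.extend(page["clusters"])
        if not page["clusters"] or len(clusters) >= page["count"]:
            return clusters
        # Next page starts after the last UUID seen so far.
        cursor = page["clusters"][-1]["id"]
```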

Get a single cluster

query {
  cluster(id: "<cluster-id>") {
    id
    name
    status
    statusReason
    healthStatus
    k8sVersion
    cloudProvider
    region
    dpChartVersion
    commanderVersion
    config
    configOverride
  }
}

The healthStatus field returns a JSON object containing the full health payload Houston received from Commander, not a single string. The statusReason field is also a JSON object. See Update cluster status for the shape Houston writes.

Filter by cloud provider and region

query {
  paginatedClusters(
    status: INACTIVE
    cloudProvider: "aws"
    region: "us-east-1"
    take: 25
  ) {
    clusters {
      id
      name
      statusReason
    }
    count
  }
}

Other supported filter arguments include searchPhrase, k8sVersion, id, sortBy, and sortDirection.

Update cluster status

A user with permission to update clusters can change a cluster’s status manually. The statusReason argument accepts a JSON object whose shape the schema doesn’t enforce; when Houston reconciles, it writes the value Commander returned in its /metadata response. To stay consistent, use that same shape or include a descriptive message field.

mutation {
  updateCluster(
    id: "<cluster-id>"
    status: INACTIVE
    statusReason: { message: "Maintenance window — cluster offline for upgrades" }
  ) {
    id
    status
    statusReason
  }
}

For status changes, supply id (required), status, and statusReason. The updateCluster mutation also accepts name and deploymentsConfigOverride for non-status changes; see Update data plane cluster configurations for those workflows.

Houston blocks configuration updates (deploymentsConfigOverride, name) while the cluster status is INACTIVE and returns the error This operation is not allowed as the cluster is not active. Status itself can still be updated in any state.
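If you build the mutation programmatically, a small helper keeps the statusReason shape consistent. This is a hypothetical builder, not part of any Astronomer tooling; it inlines the arguments the way the mutation example does and validates the status value client-side:

```python
import json


def build_status_update(cluster_id: str, status: str, message: str) -> dict:
    """Build the GraphQL request body for a manual cluster status change.

    The statusReason is a JSON object with a single descriptive `message`
    field, matching the shape recommended for manual updates.
    """
    if status not in ("ACTIVE", "INACTIVE"):
        raise ValueError(f"unknown cluster status: {status}")
    query = (
        "mutation { updateCluster("
        f'id: "{cluster_id}" '
        f"status: {status} "
        f"statusReason: {{ message: {json.dumps(message)} }}"
        ") { id status statusReason } }"
    )
    return {"query": query}
```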

Force a metadata reconciliation

Use the reconcileClusterMetadataJob query to make Houston refetch metadata from Commander immediately, instead of waiting for the next CronJob run. The query accepts a list of cluster UUIDs; if you pass null or omit the argument, Houston reconciles every cluster the caller is authorized to update.

query {
  reconcileClusterMetadataJob(
    clusterIds: ["<cluster-id-1>", "<cluster-id-2>"]
  ) {
    successfulClusterIds
    failedClusterIds
    skippedClusterIds
  }
}

A cluster appears in skippedClusterIds when it lacks a dataplane URL or when the caller isn’t authorized to reconcile it.
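When you reconcile many clusters at once, it helps to summarize the three ID lists before deciding on follow-up. A minimal sketch, assuming the `result` dict is the parsed reconcileClusterMetadataJob payload:

```python
def summarize_reconcile(result: dict) -> str:
    """Render the reconcileClusterMetadataJob ID lists as a short report.

    Skipped clusters either lack a dataplane URL or the caller isn't
    authorized to reconcile them, so they usually need manual follow-up.
    """
    ok = result.get("successfulClusterIds", [])
    failed = result.get("failedClusterIds", [])
    skipped = result.get("skippedClusterIds", [])
    return f"reconciled {len(ok)}, failed {len(failed)}, skipped {len(skipped)}"
```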

Use this query in the following situations:

  • After resolving a network or DNS issue between the control plane and a data plane.
  • After restarting Commander.
  • To verify cluster health after a maintenance window.
  • When debugging connectivity from the control plane.

Troubleshoot unhealthy clusters

1. Check the cluster’s current status

query {
  cluster(id: "<cluster-id>") {
    status
    statusReason
    healthStatus
    updatedAt
  }
}
2. Verify Commander connectivity

From a Pod in the control plane namespace with network access to Commander, call the metadata endpoint:

$ curl -s https://<commander-url>/metadata | jq .

A healthy response includes (among other fields) the following:

{
  "kubernetesVersion": "<k8s-version>",
  "baseDomain": "<cluster-base-domain>",
  "healthStatus": "HEALTHY",
  "cloudProvider": "<provider>",
  "region": "<region>",
  "dataplaneChartVersion": "<chart-version>",
  "commander": {
    "version": "<commander-version>",
    "url": "<commander-grpc-url>",
    "status": "HEALTHY",
    "airflowChartVersion": "<airflow-chart-version>"
  }
}

The full response also includes mode, dataplaneUrl, dataplaneId, releaseName, releaseNamespace, dbType, namespacePools, and registry.
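As a quick sanity check, you can verify that a fetched /metadata payload contains the top-level fields shown in the example response. The field list below mirrors that example, not an authoritative schema:

```python
# Top-level fields from the example /metadata response (illustrative list).
EXPECTED_FIELDS = (
    "kubernetesVersion", "baseDomain", "healthStatus",
    "cloudProvider", "region", "dataplaneChartVersion", "commander",
)


def missing_metadata_fields(metadata: dict) -> list[str]:
    """Return expected top-level fields absent from a /metadata response."""
    return [field for field in EXPECTED_FIELDS if field not in metadata]
```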

3. Check Commander health and pods

$ curl -s https://<commander-url>/healthz
$ kubectl get pods -n astronomer -l app=commander

4. Force a metadata refresh

query {
  reconcileClusterMetadataJob(clusterIds: ["<cluster-id>"]) {
    successfulClusterIds
    failedClusterIds
    skippedClusterIds
  }
}
5. Review Commander logs

Replace <release-name> with your Helm release name, which is astronomer by default:

$ kubectl logs -n astronomer deployment/<release-name>-commander --tail=100

Common issues and resolutions

Cluster stuck in INACTIVE

Possible causes:

  1. The Commander Pod isn’t running.
  2. Network connectivity between Houston and Commander is broken (firewall, DNS, service mesh).
  3. TLS certificate problems on the metadata endpoint.
  4. Commander’s /metadata endpoint returns a non-2xx response or a payload without healthStatus: "HEALTHY".

Resolution steps:

$ kubectl get pods -n astronomer -l app=commander
$ kubectl describe pod <commander-pod> -n astronomer
$ kubectl logs -n astronomer deployment/<release-name>-commander

Test connectivity from Houston (replace <release-name> with your Helm release, default astronomer):

$ kubectl exec -it deployment/<release-name>-houston -n astronomer -- \
>   curl -v https://<commander-url>/metadata

After the underlying issue is resolved, force a reconciliation through the reconcileClusterMetadataJob query.

Configuration updates rejected

Houston returns this error when a configuration update is attempted on a non-ACTIVE cluster:

This operation is not allowed as the cluster is not active.

Resolution:

  1. Confirm the cluster is reachable and run reconcileClusterMetadataJob to refresh status.
  2. If the cluster reports healthy but Houston hasn’t yet reconciled, wait for the next reconcile cycle or trigger one manually.
  3. As a last resort, a System Admin can manually set the cluster status back to ACTIVE:
mutation {
  updateCluster(
    id: "<cluster-id>"
    status: ACTIVE
    statusReason: { message: "Manually verified healthy" }
  ) {
    id
    status
  }
}

Best practices

  • Monitor cluster status proactively. Configure alerts for clusters transitioning to INACTIVE and surface status on operations dashboards.
  • Always provide a meaningful statusReason when manually changing status. The reason is preserved in the cluster record and is useful when diagnosing later incidents.
  • Distribute Deployments across multiple clusters so that a single INACTIVE cluster doesn’t affect every workload.
  • Validate connectivity from the control plane Pod after firewall, DNS, or certificate changes; don’t rely on the next scheduled reconciliation to surface the problem.
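The first best practice can be sketched as a simple polling check. The fetch_clusters callable is a stand-in for however you run a paginatedClusters(status: INACTIVE) query; the alert format is illustrative:

```python
def find_inactive_clusters(fetch_clusters) -> list[str]:
    """Return 'name (reason)' alert strings for INACTIVE clusters.

    `fetch_clusters()` should return the clusters list from a
    paginatedClusters(status: INACTIVE) query, where each cluster dict
    carries a `name` and a JSON `statusReason` (possibly null).
    """
    alerts = []
    for cluster in fetch_clusters():
        reason = (cluster.get("statusReason") or {}).get("message", "unknown")
        alerts.append(f"{cluster['name']} ({reason})")
    return alerts
```

Wire the returned strings into whatever alerting channel your operations dashboard already uses.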