Manage cluster status

Astro Private Cloud (APC) tracks the operational status of every data plane cluster so that workloads only run on healthy infrastructure. This page describes the cluster status values, how the APC API determines status, the GraphQL operations for querying and updating status, and how to troubleshoot unhealthy clusters.

Authenticate to the APC API

Every operation on this page requires an APC API token. Send the token in the Authorization header of each request to the APC GraphQL endpoint:

$ curl -X POST https://houston.<your-base-domain>/v1 \
>     -H "Authorization: <your-token>" \
>     -H "Content-Type: application/json" \
>     -d '{"query": "query { self { user { username } } }"}'

For step-by-step instructions on obtaining a user token or creating a system service account token, see Authenticate to the APC API.

Required roles and permissions

Cluster operations are gated by RBAC permissions. The following table maps each operation to the permission the APC API checks and the default role that grants it.

| Operation | Required permission | Default role that grants access |
|---|---|---|
| paginatedClusters | Authenticated user | Any signed-in user |
| cluster | system.clusters.get | System Admin |
| updateCluster | system.clusters.update | System Admin |
| reconcileClusterMetadataJob | system.clusters.update | System Admin |

The System Admin role inherits every system.clusters.* permission.

Cluster status values

| Status | Description | Allows new deployments | Allows configuration updates |
|---|---|---|---|
| ACTIVE | Cluster is healthy and reachable | Yes | Yes |
| INACTIVE | Cluster is unreachable or reporting an unhealthy status | No | No |

Status determination

The APC API derives cluster status from the healthStatus field in the deployment orchestrator’s /metadata response. The mapping is binary:

| Deployment orchestrator healthStatus | APC API cluster status |
|---|---|
| HEALTHY | ACTIVE |
| Any other value | INACTIVE |
| Fetch error or timeout | INACTIVE |

A CronJob in the control plane reconciles cluster metadata by calling the deployment orchestrator’s /metadata endpoint. The default schedule is 0 * * * * (every hour at minute 0) and is configurable through the houston.syncDataplaneClusters.schedule value in the Astronomer Helm chart.
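
For example, to reconcile every 15 minutes instead, override the value at upgrade time. This is a sketch: the release name, chart reference, and flag layout are assumptions to adapt to your installation, while the value path is the one named above:

$ helm upgrade <release-name> astronomer/astronomer \
>     --namespace astronomer \
>     --reuse-values \
>     --set houston.syncDataplaneClusters.schedule="*/15 * * * *"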

You can list the reconcile CronJob and recent runs with the following command:

$ kubectl get cronjob,jobs -n astronomer | grep sync-dataplane-clusters

Query cluster status

List clusters

The paginatedClusters query returns clusters the caller has access to. Pagination uses the take argument, plus either cursor (a cluster UUID) or pageNumber. The response object contains a clusters list and a total count.

query {
  paginatedClusters(
    take: 50
    status: ACTIVE
  ) {
    clusters {
      id
      name
      status
      statusReason
      healthStatus
      k8sVersion
      cloudProvider
      region
      createdAt
      updatedAt
    }
    count
  }
}
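
Like any operation on this page, you can send the query with curl using the endpoint and token described in Authenticate to the APC API; the jq filter below is an optional convenience:

$ curl -s -X POST https://houston.<your-base-domain>/v1 \
>     -H "Authorization: <your-token>" \
>     -H "Content-Type: application/json" \
>     -d '{"query": "query { paginatedClusters(take: 50, status: ACTIVE) { clusters { id name status } count } }"}' \
>     | jq '.data.paginatedClusters'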

Get a single cluster

query {
  cluster(id: "<cluster-id>") {
    id
    name
    status
    statusReason
    healthStatus
    k8sVersion
    cloudProvider
    region
    dpChartVersion
    commanderVersion
    config
    configOverride
  }
}

The healthStatus field returns a JSON object containing the full health payload the APC API received from the deployment orchestrator, not a single string. The statusReason field is also a JSON object. See Update cluster status for the shape the APC API writes.
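
For example, to pull just the health payload and status reason for one cluster from the command line (the jq path follows the standard GraphQL response envelope):

$ curl -s -X POST https://houston.<your-base-domain>/v1 \
>     -H "Authorization: <your-token>" \
>     -H "Content-Type: application/json" \
>     -d '{"query": "query { cluster(id: \"<cluster-id>\") { healthStatus statusReason } }"}' \
>     | jq '.data.cluster'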

Filter by cloud provider and region

query {
  paginatedClusters(
    status: INACTIVE
    cloudProvider: "aws"
    region: "us-east-1"
    take: 25
  ) {
    clusters {
      id
      name
      statusReason
    }
    count
  }
}

Other supported filter arguments include searchPhrase, k8sVersion, id, sortBy, and sortDirection.

Update cluster status

A user with permission to update clusters can change a cluster’s status manually. The statusReason argument accepts a JSON object whose shape isn’t enforced by the schema, but the APC API itself writes the value the deployment orchestrator returns in its /metadata response when reconciling. To stay consistent, use the same shape the APC API uses or include a descriptive message field.

mutation {
  updateCluster(
    id: "<cluster-id>"
    status: INACTIVE
    statusReason: { message: "Maintenance window — cluster offline for upgrades" }
  ) {
    id
    status
    statusReason
  }
}

For status changes, supply id (required), status, and statusReason. The updateCluster mutation also accepts name and deploymentsConfigOverride for non-status changes; see Update data plane cluster configurations for those workflows.

The APC API blocks configuration updates (deploymentsConfigOverride, name) while the cluster status is INACTIVE and returns the error This operation is not allowed as the cluster is not active. The status field itself can still be updated in any state.

Force a metadata reconciliation

Use the reconcileClusterMetadataJob query to make the APC API refetch metadata from the deployment orchestrator immediately, instead of waiting for the next CronJob run. The query accepts a list of cluster UUIDs; if you pass null or omit the argument, the APC API reconciles every cluster the caller is authorized to update.

query {
  reconcileClusterMetadataJob(
    clusterIds: ["<cluster-id-1>", "<cluster-id-2>"]
  ) {
    successfulClusterIds
    failedClusterIds
    skippedClusterIds
  }
}

A cluster appears in skippedClusterIds when it lacks a data plane URL or when the caller isn’t authorized to reconcile it.
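
To run the same operation over HTTP, embed the query in the request body as shown in Authenticate to the APC API; the jq filter extracts the result lists:

$ curl -s -X POST https://houston.<your-base-domain>/v1 \
>     -H "Authorization: <your-token>" \
>     -H "Content-Type: application/json" \
>     -d '{"query": "query { reconcileClusterMetadataJob(clusterIds: [\"<cluster-id>\"]) { successfulClusterIds failedClusterIds skippedClusterIds } }"}' \
>     | jq '.data.reconcileClusterMetadataJob'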

Use this query in the following situations:

  • After resolving a network or DNS issue between the control plane and a data plane.
  • After restarting the deployment orchestrator.
  • To verify cluster health after a maintenance window.
  • When debugging connectivity from the control plane.

Troubleshoot unhealthy clusters

1. Check the cluster’s current status

query {
  cluster(id: "<cluster-id>") {
    status
    statusReason
    healthStatus
    updatedAt
  }
}
2. Verify connectivity to the deployment orchestrator

From a Pod in the control plane namespace with network access to the deployment orchestrator, call the metadata endpoint:

$ curl -s https://<commander-url>/metadata | jq .

A healthy response includes (among other fields) the following:

{
  "kubernetesVersion": "<k8s-version>",
  "baseDomain": "<cluster-base-domain>",
  "healthStatus": "HEALTHY",
  "cloudProvider": "<provider>",
  "region": "<region>",
  "dataplaneChartVersion": "<chart-version>",
  "commander": {
    "version": "<commander-version>",
    "url": "<commander-grpc-url>",
    "status": "HEALTHY",
    "airflowChartVersion": "<airflow-chart-version>"
  }
}

The full response also includes mode, dataplaneUrl, dataplaneId, releaseName, releaseNamespace, dbType, namespacePools, and registry.
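
To check only the top-level healthStatus field, which is what the status mapping keys on, filter the response; on a healthy cluster this prints HEALTHY:

$ curl -s https://<commander-url>/metadata | jq -r '.healthStatus'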

3. Check the deployment orchestrator’s health and Pods

$ curl -s https://<commander-url>/healthz
$ kubectl get pods -n astronomer -l app=commander

4. Force a metadata refresh

query {
  reconcileClusterMetadataJob(clusterIds: ["<cluster-id>"]) {
    successfulClusterIds
    failedClusterIds
    skippedClusterIds
  }
}
5. Review deployment orchestrator logs

Replace <release-name> with your Helm release name, which is astronomer by default:

$ kubectl logs -n astronomer deployment/<release-name>-commander --tail=100
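
When log volume is high, a filter can help surface failures. The search terms below are assumptions, not a documented log format:

$ kubectl logs -n astronomer deployment/<release-name>-commander --since=1h \
>     | grep -iE 'error|fail|timeout'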

Common issues and resolutions

Cluster stuck in INACTIVE

Possible causes:

  1. The deployment orchestrator Pod isn’t running.
  2. Network connectivity between the APC API and the deployment orchestrator is broken (firewall, DNS, service mesh).
  3. TLS certificate problems on the metadata endpoint.
  4. The deployment orchestrator’s /metadata endpoint returns a non-2xx response or a payload without healthStatus: "HEALTHY".

Resolution steps:

$ kubectl get pods -n astronomer -l app=commander
$ kubectl describe pod <commander-pod> -n astronomer
$ kubectl logs -n astronomer deployment/<release-name>-commander

Test connectivity from the APC API Pod (replace <release-name> with your Helm release name, astronomer by default):

$ kubectl exec -it deployment/<release-name>-houston -n astronomer -- \
>     curl -v https://<commander-url>/metadata

After the underlying issue is resolved, force a reconciliation through the reconcileClusterMetadataJob query.

Configuration updates rejected

The APC API returns this error when a configuration update is attempted on a non-ACTIVE cluster:

This operation is not allowed as the cluster is not active.

Resolution:

  1. Confirm the cluster is reachable and run reconcileClusterMetadataJob to refresh status.
  2. If the cluster reports healthy but the APC API hasn’t yet reconciled, wait for the next reconcile cycle or trigger one manually.
  3. As a last resort, a System Admin can manually set the cluster status back to ACTIVE:

mutation {
  updateCluster(
    id: "<cluster-id>"
    status: ACTIVE
    statusReason: { message: "Manually verified healthy" }
  ) {
    id
    status
  }
}

Best practices

  • Monitor cluster status proactively. Configure alerts for clusters transitioning to INACTIVE and surface status on operations dashboards; a minimal polling sketch follows this list.
  • Always provide a meaningful statusReason when manually changing status. The reason is preserved in the cluster record and is useful when diagnosing later incidents.
  • Distribute Deployments across multiple clusters so that a single INACTIVE cluster doesn’t affect every workload.
  • Validate connectivity from the control plane Pod after firewall, DNS, or certificate changes; don’t rely on the next scheduled reconciliation to surface the problem.
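
As a starting point for the monitoring bullet above, the following sketch polls the API for INACTIVE clusters and exits non-zero when any are found, so it can run under cron or a monitoring agent. The environment variable names are assumptions:

#!/usr/bin/env bash
# Alert when any cluster is INACTIVE.
# Assumes HOUSTON_URL (for example https://houston.<your-base-domain>)
# and API_TOKEN are set in the environment.
set -euo pipefail

query='query { paginatedClusters(take: 100, status: INACTIVE) { clusters { id name } count } }'

count=$(curl -sf -X POST "${HOUSTON_URL}/v1" \
    -H "Authorization: ${API_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "{\"query\": \"${query}\"}" \
    | jq '.data.paginatedClusters.count')

if [ "${count}" -gt 0 ]; then
    echo "ALERT: ${count} INACTIVE cluster(s) detected" >&2
    exit 1
fi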