Manage cluster status

Astro Private Cloud (APC) tracks the operational status of every data plane cluster so that workloads only run on healthy infrastructure. This page describes the cluster status values, how Houston determines status, the GraphQL operations for querying and updating status, and how to troubleshoot unhealthy clusters.

Authenticate to Houston

Every operation on this page requires a Houston API token sent as a bearer credential. Send your token in the Authorization header on each request to the Houston GraphQL endpoint:

$ curl -X POST https://houston.<your-base-domain>/v1 \
>   -H "Authorization: <your-token>" \
>   -H "Content-Type: application/json" \
>   -d '{"query": "query { self { user { username } } }"}'

For step-by-step instructions on obtaining a user token or creating a system service account token, see Authenticate to the Houston API.

Required roles and permissions

Cluster operations are gated by RBAC permissions. The following table maps each operation to the permission Houston checks and the default role that grants it.

Operation                      Required permission       Default role that grants access
paginatedClusters              Authenticated user        Any signed-in user
cluster                        system.clusters.get       System Admin
updateCluster                  system.clusters.update    System Admin
reconcileClusterMetadataJob    system.clusters.update    System Admin

The System Admin role inherits every system.clusters.* permission.

Cluster status values

Status      Description                                              Allows new deployments    Allows configuration updates
ACTIVE      Cluster is healthy and reachable                         Yes                       Yes
INACTIVE    Cluster is unreachable or reporting an unhealthy status  No                        No

Status determination

Houston derives cluster status from the healthStatus field in Commander’s /metadata response. The mapping is binary:

Commander healthStatus    Houston cluster status
HEALTHY                   ACTIVE
Any other value           INACTIVE
Fetch error or timeout    INACTIVE

A CronJob in the control plane reconciles cluster metadata by calling Commander’s /metadata endpoint. The default schedule is 0 * * * * (hourly, at minute 0); you can change it through the houston.syncDataplaneClusters.schedule value in the Astronomer Helm chart.

You can list the reconcile CronJob and recent runs with the following command:

$ kubectl get cronjob,jobs -n astronomer | grep sync-dataplane-clusters

Query cluster status

List clusters

The paginatedClusters query returns clusters the caller has access to. Pagination uses the take argument, plus either cursor (a cluster UUID) or pageNumber. The response object contains a clusters list and a total count.

query {
  paginatedClusters(
    take: 50
    status: ACTIVE
  ) {
    clusters {
      id
      name
      status
      statusReason
      healthStatus
      k8sVersion
      cloudProvider
      region
      createdAt
      updatedAt
    }
    count
  }
}
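Cursor-based pagination can be driven client-side with a short loop. This is a sketch under assumptions: the fetch_page callable stands in for whatever helper you use to run the query, and the cursor is assumed to identify the last cluster of the previous page:

```python
def fetch_all_clusters(fetch_page, take: int = 50) -> list[dict]:
    """Collect every cluster by advancing the cursor (a cluster UUID).

    `fetch_page(take, cursor)` must return the paginatedClusters payload:
    {"clusters": [...], "count": <total>}. The loop stops once all `count`
    rows (or an empty page) have been seen.
    """
    clusters: list[dict] = []
    cursor = None
    while True:
        page = fetch_page(take, cursor)
        clusters.extend(page["clusters"])
        if not page["clusters"] or len(clusters) >= page["count"]:
            return clusters
        # Next page starts after the last UUID seen so far.
        cursor = page["clusters"][-1]["id"]
```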

Get a single cluster

query {
  cluster(id: "<cluster-id>") {
    id
    name
    status
    statusReason
    healthStatus
    k8sVersion
    cloudProvider
    region
    dpChartVersion
    commanderVersion
    config
    configOverride
  }
}

The healthStatus field returns a JSON object containing the full health payload Houston received from Commander, not a single string. The statusReason field is also a JSON object. See Update cluster status for the shape Houston writes.

Filter by cloud provider and region

query {
  paginatedClusters(
    status: INACTIVE
    cloudProvider: "aws"
    region: "us-east-1"
    take: 25
  ) {
    clusters {
      id
      name
      statusReason
    }
    count
  }
}

Other supported filter arguments include searchPhrase, k8sVersion, id, sortBy, and sortDirection.

Update cluster status

A user with permission to update clusters can change a cluster’s status manually. The statusReason argument accepts a JSON object whose shape the schema doesn’t enforce; when Houston reconciles, it writes the value Commander returned in its /metadata response. To stay consistent, use that same shape or include a descriptive message field.

mutation {
  updateCluster(
    id: "<cluster-id>"
    status: INACTIVE
    statusReason: { message: "Maintenance window — cluster offline for upgrades" }
  ) {
    id
    status
    statusReason
  }
}

For status changes, supply id (required), status, and statusReason. The updateCluster mutation also accepts name and deploymentsConfigOverride for non-status changes; see Update data plane cluster configurations for those workflows.

Houston blocks configuration updates (deploymentsConfigOverride, name) while the cluster status is INACTIVE and returns the error This operation is not allowed as the cluster is not active. Status itself can still be updated in any state.
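If you build the mutation programmatically, a small helper keeps the statusReason shape consistent. This is a hypothetical builder, not part of any Astronomer tooling; it inlines the arguments the way the mutation example does and validates the status value client-side:

```python
import json


def build_status_update(cluster_id: str, status: str, message: str) -> dict:
    """Build the GraphQL request body for a manual cluster status change.

    The statusReason is a JSON object with a single descriptive `message`
    field, matching the shape recommended for manual updates.
    """
    if status not in ("ACTIVE", "INACTIVE"):
        raise ValueError(f"unknown cluster status: {status}")
    query = (
        "mutation { updateCluster("
        f'id: "{cluster_id}" '
        f"status: {status} "
        f"statusReason: {{ message: {json.dumps(message)} }}"
        ") { id status statusReason } }"
    )
    return {"query": query}
```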

Force a metadata reconciliation

Use the reconcileClusterMetadataJob query to make Houston refetch metadata from Commander immediately, instead of waiting for the next CronJob run. The query accepts a list of cluster UUIDs; if you pass null or omit the argument, Houston reconciles every cluster the caller is authorized to update.

query {
  reconcileClusterMetadataJob(
    clusterIds: ["<cluster-id-1>", "<cluster-id-2>"]
  ) {
    successfulClusterIds
    failedClusterIds
    skippedClusterIds
  }
}

A cluster appears in skippedClusterIds when it lacks a dataplane URL or when the caller isn’t authorized to reconcile it.
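When you reconcile many clusters at once, it helps to summarize the three ID lists before deciding on follow-up. A minimal sketch, assuming the `result` dict is the parsed reconcileClusterMetadataJob payload:

```python
def summarize_reconcile(result: dict) -> str:
    """Render the reconcileClusterMetadataJob ID lists as a short report.

    Skipped clusters either lack a dataplane URL or the caller isn't
    authorized to reconcile them, so they usually need manual follow-up.
    """
    ok = result.get("successfulClusterIds", [])
    failed = result.get("failedClusterIds", [])
    skipped = result.get("skippedClusterIds", [])
    return f"reconciled {len(ok)}, failed {len(failed)}, skipped {len(skipped)}"
```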

Use this query in the following situations:

  • After resolving a network or DNS issue between the control plane and a data plane.
  • After restarting Commander.
  • To verify cluster health after a maintenance window.
  • When debugging connectivity from the control plane.

Troubleshoot unhealthy clusters

1. Check the cluster’s current status

query {
  cluster(id: "<cluster-id>") {
    status
    statusReason
    healthStatus
    updatedAt
  }
}
2. Verify Commander connectivity

From a Pod in the control plane namespace with network access to Commander, call the metadata endpoint:

$ curl -s https://<commander-url>/metadata | jq .

A healthy response includes (among other fields) the following:

{
  "kubernetesVersion": "<k8s-version>",
  "baseDomain": "<cluster-base-domain>",
  "healthStatus": "HEALTHY",
  "cloudProvider": "<provider>",
  "region": "<region>",
  "dataplaneChartVersion": "<chart-version>",
  "commander": {
    "version": "<commander-version>",
    "url": "<commander-grpc-url>",
    "status": "HEALTHY",
    "airflowChartVersion": "<airflow-chart-version>"
  }
}

The full response also includes mode, dataplaneUrl, dataplaneId, releaseName, releaseNamespace, dbType, namespacePools, and registry.
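As a quick sanity check, you can verify that a fetched /metadata payload contains the top-level fields shown in the example response. The field list below mirrors that example, not an authoritative schema:

```python
# Top-level fields from the example /metadata response (illustrative list).
EXPECTED_FIELDS = (
    "kubernetesVersion", "baseDomain", "healthStatus",
    "cloudProvider", "region", "dataplaneChartVersion", "commander",
)


def missing_metadata_fields(metadata: dict) -> list[str]:
    """Return expected top-level fields absent from a /metadata response."""
    return [field for field in EXPECTED_FIELDS if field not in metadata]
```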

3. Check Commander health and pods

$ curl -s https://<commander-url>/healthz
$ kubectl get pods -n astronomer -l app=commander

4. Force a metadata refresh

query {
  reconcileClusterMetadataJob(clusterIds: ["<cluster-id>"]) {
    successfulClusterIds
    failedClusterIds
    skippedClusterIds
  }
}
5. Review Commander logs

Replace <release-name> with your Helm release name, which is astronomer by default:

$ kubectl logs -n astronomer deployment/<release-name>-commander --tail=100

Common issues and resolutions

Cluster stuck in INACTIVE

Possible causes:

  1. The Commander Pod isn’t running.
  2. Network connectivity between Houston and Commander is broken (firewall, DNS, service mesh).
  3. TLS certificate problems on the metadata endpoint.
  4. Commander’s /metadata endpoint returns a non-2xx response or a payload without healthStatus: "HEALTHY".

Resolution steps:

$ kubectl get pods -n astronomer -l app=commander
$ kubectl describe pod <commander-pod> -n astronomer
$ kubectl logs -n astronomer deployment/<release-name>-commander

Test connectivity from Houston (replace <release-name> with your Helm release, default astronomer):

$ kubectl exec -it deployment/<release-name>-houston -n astronomer -- \
>   curl -v https://<commander-url>/metadata

After the underlying issue is resolved, force a reconciliation through the reconcileClusterMetadataJob query.

Configuration updates rejected

Houston returns this error when a configuration update is attempted on a non-ACTIVE cluster:

This operation is not allowed as the cluster is not active.

Resolution:

  1. Confirm the cluster is reachable and run reconcileClusterMetadataJob to refresh status.
  2. If the cluster reports healthy but Houston hasn’t yet reconciled, wait for the next reconcile cycle or trigger one manually.
  3. As a last resort, a System Admin can manually set the cluster status back to ACTIVE:
mutation {
  updateCluster(
    id: "<cluster-id>"
    status: ACTIVE
    statusReason: { message: "Manually verified healthy" }
  ) {
    id
    status
  }
}

Best practices

  • Monitor cluster status proactively. Configure alerts for clusters transitioning to INACTIVE and surface status on operations dashboards.
  • Always provide a meaningful statusReason when manually changing status. The reason is preserved in the cluster record and is useful when diagnosing later incidents.
  • Distribute Deployments across multiple clusters so that a single INACTIVE cluster doesn’t affect every workload.
  • Validate connectivity from the control plane Pod after firewall, DNS, or certificate changes; don’t rely on the next scheduled reconciliation to surface the problem.
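The first best practice can be sketched as a simple polling check. The fetch_clusters callable is a stand-in for however you run a paginatedClusters(status: INACTIVE) query; the alert format is illustrative:

```python
def find_inactive_clusters(fetch_clusters) -> list[str]:
    """Return 'name (reason)' alert strings for INACTIVE clusters.

    `fetch_clusters()` should return the clusters list from a
    paginatedClusters(status: INACTIVE) query, where each cluster dict
    carries a `name` and a JSON `statusReason` (possibly null).
    """
    alerts = []
    for cluster in fetch_clusters():
        reason = (cluster.get("statusReason") or {}).get("message", "unknown")
        alerts.append(f"{cluster['name']} ({reason})")
    return alerts
```

Wire the returned strings into whatever alerting channel your operations dashboard already uses.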