
Troubleshoot common deployment issues

Learn how you can troubleshoot MOSTLY AI deployment issues that might occur in any Kubernetes environment.

Queued jobs make no progress when scheduled on worker nodes that require the toleration operator Exists

Problem

After you deploy and start a generator or synthetic dataset, you might see that the job is stuck in the Queued status without making any progress.

Cause

One possible cause is that your Kubernetes cluster requires mostly_coordinator.deployment.core_job.tolerationOperator to be set to Exists in the values.yaml file.
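
To check which toleration settings your deployment currently uses, you can inspect the Helm values that are in effect. This is a minimal sketch; the Helm release name mostly-ai is an assumption, so adjust it to match your installation.

# Release name "mostly-ai" is assumed; adjust to your installation.
helm -n mostly-ai get values mostly-ai --all | grep -i toleration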

Solution

Remove any Toleration values defined for your computes.

  1. As a Super admin, go to the Profile menu in the upper right and select Computes.
  2. Select the Default compute from the list, remove the Toleration value, and click Save.
  3. Repeat for any additional computes added to your MOSTLY AI deployment.
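
If a Kubernetes pod has already been created for the queued job, you can also confirm which tolerations it requests by inspecting its spec. This is a minimal sketch; POD_NAME is a placeholder for the job pod's name, and the mostly-ai namespace matches the one used elsewhere in this guide.

# Print the tolerations requested by the job pod
kubectl -n mostly-ai get pod POD_NAME -o jsonpath='{.spec.tolerations}'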

Pods stay in Pending after a cluster restart or "hot swap"

Problem

If your policies require starting the cluster on demand, moving workloads across nodes, or starting nodes as needed, you might see that pods remain in the Pending status.

You can then obtain more details about one of the pods with the kubectl describe command.

kubectl -n mostly-ai describe pod POD_NAME

You might see the following:

Warning  FailedScheduling    0/8 nodes are available: 5 Insufficient memory, 
3 node(s) didn't match node selector.

Solution

Most cloud providers provision different nodes after a restart. The same happens in large on-premises deployments with procedures like "hot swapping" or maintenance restarts. MOSTLY AI uses nodeAffinity by default to schedule workloads to nodes, and your new nodes might not have the labels that the application requires to schedule its pods.
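
To confirm whether this is the case, list the labels currently present on your nodes and compare them against the labels applied in the steps below.

# List all nodes together with their labels
kubectl get nodes --show-labels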

To solve this issue, apply the node labels required by MOSTLY AI.

  1. Apply the mostly_app=yes label to your application nodes.
    kubectl label node APP_NODE_NAME mostly_app=yes
  2. Apply the mostly_worker=yes label to your worker nodes.
    kubectl label node WORKER_NODE_NAME mostly_worker=yes
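
You can then verify that the labels are in place by listing the nodes with the two label columns shown explicitly:

# Show the mostly_app and mostly_worker labels as columns
kubectl get nodes -L mostly_app -L mostly_worker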

Keep in mind

  1. If you use Terraform, CloudFormation, Karpenter, or similar tools to deploy and scale your infrastructure, it is best to apply the labels to your nodes before you deploy MOSTLY AI.
  2. If you provision new nodes in your cluster, make sure they have enough capacity (RAM and CPU) to meet the workload requirements of MOSTLY AI. For more information, see compute and memory requirements.
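
To check whether a node has enough allocatable CPU and memory for these workloads, you can inspect its allocatable resources. NODE_NAME is a placeholder for one of your node names, and kubectl top nodes additionally requires the metrics-server add-on.

# Show the node's allocatable CPU and memory
kubectl describe node NODE_NAME | grep -A 7 Allocatable
# If metrics-server is installed, show current usage per node
kubectl top nodes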