Troubleshoot common deployment issues
Learn how you can troubleshoot MOSTLY AI deployment issues that might occur in any Kubernetes environment.
Queued jobs make no progress when scheduled on a worker node requiring the operator Exists
Problem
After you deploy and start a generator or synthetic dataset, you might see that the job is stuck in the Queued status without making any progress.
Cause
One possible cause is that your Kubernetes cluster requires mostly_coordinator.deployment.core_job.tolerationOperator to be set to Exists in the values.yaml file.
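For reference, a dotted key like the one above would typically correspond to a nested entry in values.yaml. The sketch below writes such a fragment to a separate override file so that you can compare it with your own configuration. The nesting is inferred from the key path and the default file name is hypothetical, so check both against the values.yaml of your deployment.

```shell
#!/usr/bin/env bash
# Sketch: write a values override containing the toleration operator setting.
# ASSUMPTION: the nested layout mirrors the dotted key
# mostly_coordinator.deployment.core_job.tolerationOperator.
write_toleration_override() {
  # Hypothetical default file name; pass your own path as the first argument.
  override_file="${1:-toleration-override.yaml}"
  cat > "$override_file" <<'EOF'
mostly_coordinator:
  deployment:
    core_job:
      tolerationOperator: Exists
EOF
  # Print the path so the caller can inspect or reuse the file.
  echo "$override_file"
}
```

Calling `write_toleration_override my-override.yaml` creates the file and prints its path. It does not change the cluster.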
Solution
Remove any Toleration values defined for your computes.
- As a Super admin, go to the Profile menu in the upper right and select Computes.
- Select the Default compute from the list, remove the Toleration value, and click Save.
- Repeat for any additional computes added to your MOSTLY AI deployment.
Pods stay in pending after a cluster restart or "hot swap"
Problem
If your policies require you to start the cluster on demand, move workloads across nodes, or start resources only as required, you might see that the pods remain in pending status.
You can then obtain more details about one of the pods with the kubectl describe command.
kubectl -n mostly-ai describe pod POD_NAME
You might see the following:
Warning FailedScheduling 0/8 nodes are available: 5 Insufficient memory,
3 node(s) didn't match node selector.
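To find out which pods are affected before describing them one by one, a minimal sketch like the following may help. It wraps two standard kubectl queries in helper functions. The function names are hypothetical, and the mostly-ai namespace default is taken from the command above, so adjust it if your deployment uses a different namespace.

```shell
#!/usr/bin/env bash
# Sketch: list Pending pods and scheduling failures in a namespace.
# ASSUMPTION: the namespace defaults to mostly-ai; pass another as $1.

# List only the pods whose phase is Pending.
list_pending_pods() {
  ns="${1:-mostly-ai}"
  kubectl -n "$ns" get pods --field-selector=status.phase=Pending
}

# List events whose reason is FailedScheduling, such as the warning above.
list_scheduling_failures() {
  ns="${1:-mostly-ai}"
  kubectl -n "$ns" get events --field-selector=reason=FailedScheduling
}
```

Run `list_pending_pods` first, then use `list_scheduling_failures` to see why the scheduler rejected the available nodes.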
Solution
Most cloud providers provision different nodes after a restart. The same happens in large on-prem deployments with procedures like "hot swapping" or maintenance restarts. MOSTLY AI uses nodeAffinity by default to schedule workloads onto nodes, so your new nodes might lack the labels that the application requires to schedule the pods.
To solve this issue, apply the node labels required by MOSTLY AI.
- Apply the mostly_app=yes label to your application nodes.
  kubectl label node APP_NODE_NAME mostly_app=yes
- Apply the mostly_worker=yes label to your worker nodes.
  kubectl label node WORKER_NODE_NAME mostly_worker=yes
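After labeling, it can be useful to confirm that every node carries the label you expect. The sketch below relies only on standard kubectl label options. The function names are hypothetical, and the label keys are the ones listed above.

```shell
#!/usr/bin/env bash
# Sketch: verify the MOSTLY AI node labels after applying them.

# Show all nodes with extra columns for the two label values.
show_mostly_labels() {
  kubectl get nodes -L mostly_app,mostly_worker
}

# List the nodes that do not carry the given label key at all.
# The '!' prefix is the label selector syntax for "key is absent".
nodes_missing_label() {
  kubectl get nodes -l '!'"$1"
}
```

For example, `nodes_missing_label mostly_worker` lists the nodes that have no mostly_worker label. Application nodes appear in that list as well, so compare it against the nodes you intend to use as workers.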
Keep in mind
- If you use Terraform, CloudFormation, Karpenter, or similar tools to deploy and scale your infrastructure, it is best to apply the labels to your nodes before you deploy MOSTLY AI.
- If you provision new nodes in your cluster, make sure they have enough capacity (RAM and CPU) to meet the workload requirements of MOSTLY AI. For more information, see compute and memory requirements.
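The "Insufficient memory" part of the scheduling warning points to node capacity. As a rough first check, the sketch below prints the allocatable CPU and memory that each node reports. The function name is hypothetical. Allocatable is the total amount available to pods, not what is currently free, so run kubectl describe node NODE_NAME to see the resources that are already requested.

```shell
#!/usr/bin/env bash
# Sketch: print the allocatable CPU and memory reported by each node.
# NOTE: these values are totals per node, not the remaining free capacity.
show_node_allocatable() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory'
}
```

Compare the output of `show_node_allocatable` with the compute and memory requirements of your MOSTLY AI deployment.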