Troubleshoot common deployment issues

Learn how to troubleshoot MOSTLY AI deployment issues that can occur in any Kubernetes environment.

Generator is stuck in Queued status

After you deploy MOSTLY AI, a good first sanity test is to train a new generator. However, the training might make no progress and remain in the Queued status indefinitely.

Problem

After you start the training of a new generator, its status remains Queued indefinitely and the job makes no progress.

Cause

The most likely cause is that, in your values.yaml file, all of the CPU and memory available on the worker nodes has been allocated to the Default compute. Because Kubernetes reserves part of each node's resources for system components, engine job pods that request the full node capacity can never be scheduled.

📑 To learn more about computes and how to manage them, see Compute.
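
To confirm which values your deployment currently uses, you can inspect them with Helm. The command below is a minimal check that assumes the release is named mostly-ai and is installed in the mostly-ai namespace, as in the deployment commands later in this section.

  helm get values mostly-ai --namespace mostly-ai | grep -A 8 defaultComputePool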

Solution

Lower the Default compute resources so that they remain below the total resources available on your worker nodes. For example, if your worker nodes have 14 CPUs and 24 GB of memory each, allocate 10 CPUs and 20 GB of memory to the Default compute.
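
To find out how much CPU and memory each worker node can actually offer, check the allocatable resources reported by Kubernetes. This is a sketch that assumes you have kubectl access to the cluster; node names differ per environment, and memory is typically reported in Ki.

  kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory

Set the Default compute cpu and memory values safely below these numbers so that Kubernetes can schedule the engine job pods.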

  1. In the values.yaml file, edit the defaultComputePool section for the mostly-app service.
    values.yaml
    mostly-app:
      deployment: 
        ...
        mostly:
          defaultComputePool:
            name: Default
            type: KUBERNETES
            toleration: engine-jobs
            resources:
              cpu: 10
              memory: 20
              gpu: 0
  2. Save the values.yaml file.
  3. Remove your current deployment by deleting the mostly-ai namespace.
    kubectl delete namespace mostly-ai
  4. Re-deploy MOSTLY AI.
    helm upgrade --install mostly-ai ./mostly-combined --values values.yaml --namespace mostly-ai --create-namespace
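
After the re-deployment, you can verify the fix by starting a new generator training and watching the pods in the namespace. This assumes the mostly-ai namespace used above; the exact pod names depend on your deployment.

  kubectl get pods --namespace mostly-ai --watch

The engine job pods should move from Pending to Running, and the generator should leave the Queued status.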

Queued jobs make no progress when the cluster requires the toleration operator Exists

Problem

After you start the training of a generator or the generation of a synthetic dataset, you might see that the job is stuck in the Queued status without making any progress.

Cause

One possible reason is that your Kubernetes cluster requires mostly_coordinator.deployment.core_job.tolerationOperator to be set to Exists in the values.yaml file. Because a Kubernetes toleration that uses the Exists operator must not specify a value, any Toleration value defined on your computes conflicts with this setting and can keep the jobs queued.
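
As an illustration only, such a requirement translates to a setting similar to the following in values.yaml. The exact nesting is an assumption derived from the dotted path above; check your own values.yaml for the authoritative structure.

  values.yaml
  mostly_coordinator:
    deployment:
      core_job:
        tolerationOperator: Exists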

Solution

Remove any Toleration values defined for your computes.

  1. As a Super admin, go to the Profile menu in the upper right and select Computes.
  2. For the Default compute, select it from the list, remove the Toleration value, and click Save.
  3. Repeat for any additional computes added to your MOSTLY AI deployment.
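
To verify that the change took effect, start a new generator or synthetic dataset and check that its engine job pod is scheduled and no longer carries a toleration with a value. The commands below are a sketch; they assume the mostly-ai namespace, and you need to replace the placeholder with the actual pod name from the first command's output.

  kubectl get pods --namespace mostly-ai
  kubectl describe pod <engine-job-pod-name> --namespace mostly-ai | grep -A 3 Tolerations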