Quick start for fine-tuning LLMs

MOSTLY AI integrates with well-known LLMs and gives the flexibility to use any LLM to generate privacy-safe synthetic text. When included in a tabular dataset, synthetic text data is intelligently aligned with the rest of the data, ensuring high correlation while safeguarding the privacy of real-world individuals.

Step 1: Train a generator to fine-tune an LLM

Prerequisites

Hardware. The minimum recommended requirements to fine-tune an LLM with MOSTLY AI are:
- GPU: Nvidia A10g or better (for example, A100, H100, RTX 4090)
- GPU VRAM: at least 24GB
Dataset with unstructured text in a supported format. The dataset you will use for fine-tuning must be CSV, Parquet, TSV, where one or more of the columns contains unstructured text.
Size of the training dataset. Depending on the size of the dataset and the compute resources you use for training, it is possible that the generator training can fail with out-of-memory errors. For details, see Troubleshooting.

Steps

Create a new generator and upload a table file or add a table from a data source that contains unstructured text.
On the Data configuration page, for the columns containing unstructured text, make sure that Language/Text is selected as Encoding type.
Click Configure models in the upper right.
On the Model configuration page, expand the language model and configure it.
1. For Model, select one of the available language models for training.
  
  📑
  The list of available models includes HuggingFace text generation models and the not-pre-trained MOSTLY AI LSTM model.
  If you need to use another LLM, contact MOSTLY AI Support.
2. For Compute, select an available compute. We recommend the use of GPUs for language models.
  📑
  From Compute, you select from a list of compute resources configured for MOSTLY AI. The compute is based on the resources available in the compute cluster where MOSTLY AI is running.
  - CPU-based computes offer a specific number of CPU cores and memory. They perform better for tabular data computations.
  - GPU-based computes include a specific number of GPU cores and GPU memory. For language model fine-tuning and language generation, it is typically faster to use GPUs.
Click Start training in the upper right.

Step 2: Generate text

Open the trained generator and click Generate data in the upper right.
Configure the generation.
1. For Tabular compute, select the compute to use for tabular data. It is usually best to use a CPU-based compute.
2. For Language compute, select the compute to use for language data. GPUs tend to be faster for language generation.
3. (Optional) If needed, adjust the rest of the generation options.
Click Start generation in the upper right.

Troubleshooting

Depending on your original text dataset, it is possible that the model samples of your trained generator include _INVALID_ text values, or that for particularly long texts, the generator training can fail. Learn how you can troubleshoot such issues.

`_INVALID_` values

If you encounter _INVALID_ values in your model samples or generated synthetic text data, this is likely due to the use of the less efficient CPU compute for fine-tuning, insufficient data, or insufficient training time.

Use GPUs for language model fine-tuning

GPUs are in a league of their own when it comes to LLM fine-tuning. To avoid _INVALID_ values, you can use the best practices listed below.

Use GPU computes as explained in Step 1: Train a generator to fine-tune an LLM, step 4.2.
Use CPU computes only for tabular model training.

Increase Max training time

If you already use GPUs and still see _INVALID_ values, the next step is to train a new generator with an increased Max training time.

Start by training a new generator with Max training time increased to 20 min. The default is 10 min.
If you still see _INVALID_ values, increase Max training time to 30 min.

For details, see Increase max training time.

Use a training dataset with shorter texts

Using a dataset with shorter texts or trimming its original lengths can also help to avoid _INVALID_ values.

Generator training failures

Depending on the LLM you use, the size of your text data and its dataset, and the compute resources available for fine-tuning, generator training might fail with out-of-memory errors. To troubleshoot, try the suggestions below in the order listed.

Set Batch size to 2 or 4. For details on how to set batch size, see Increase batch size.
Use a training dataset with shorter texts.