TL;DR The MOSTLY AI Platform now supports the creation of synthetic text by leveraging pretrained language models. You can choose from various LLMs on Hugging Face and conveniently fine-tune them with your proprietary text data, directly on the MOSTLY AI Platform and in the context of structured data. Realizing privacy-preserving GenAI use cases has never been easier!
The Challenge with GenAI
Today, AI training is hitting a plateau as models exhaust public data sources and yield diminishing returns. Enterprises thus want to turn to high-quality, proprietary data, which offers far greater value and potential than the residual public data currently being used. But unfortunately it’s not as simple as that. There are three major challenges for organizations today:
- Real text data often contains sensitive information, such as personally identifiable information (PII), posing a risk of unintended exposure when used in LLMs.
- The available text data may not be optimal for LLM training as it often lacks diversity, and manually creating this specialized data is labor-intensive and can yield low-quality results.
- Text data is never standalone; it comes intertwined with structured data, such as information about an organization's customer base.
Introducing MOSTLY AI Synthetic Text
Gartner predicts that by 2026, 75% of companies will use generative AI to create synthetic customer data, up from less than 5% in 2023. We are enabling this mass adoption by expanding the MOSTLY AI Platform to include synthetic text powered by pretrained language models.
By uniquely integrating structured and unstructured data, we enable enterprises to create a complete and statistically accurate picture of their proprietary data assets, which they can then use to fine-tune and deliver high-quality, bespoke generative AI solutions in a safe and compliant way.
This is how it works:
- The original text data is loaded onto the MOSTLY AI Platform.
- The user selects the language model they want to use to create the synthetic data. A variety of open-source LLMs from Hugging Face, including Mistral-7B, Viking-7B, and others, can be used.
- The selected LLM is then fine-tuned with the original text data on the MOSTLY AI Platform. Fine-tuning takes place in the context of any additional structured data that is provided (e.g. specific customer information), which increases the quality of the resulting synthetic text.
- With the fine-tuned LLM in place, the MOSTLY AI Platform creates the synthetic text, which can be downloaded or stored in a database for further processing.
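To make the idea of fine-tuning "in the context of structured data" more concrete, here is a minimal sketch of one way a record's structured columns could be serialized into a conditioning prefix, with the free text as the target. This is an illustrative assumption, not the platform's actual internal format: the `SupportRecord` schema, the `to_training_example` helper, and the `[CONTEXT]`/`[QUESTION]`/`[ANSWER]` markers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SupportRecord:
    """One row mixing structured columns with free text (hypothetical schema)."""
    customer_age: int
    purchased_product: str
    question: str
    answer: str


def to_training_example(record: SupportRecord) -> dict:
    """Turn the structured columns into a context prefix; the text becomes the target.

    The marker strings are illustrative assumptions, not a documented format.
    """
    context = (
        f"customer_age={record.customer_age}; "
        f"purchased_product={record.purchased_product}"
    )
    return {
        "prompt": f"[CONTEXT] {context}\n[QUESTION]",
        "completion": f" {record.question}\n[ANSWER] {record.answer}",
    }


record = SupportRecord(
    customer_age=25,
    purchased_product="Product A",
    question="I have forgotten my password, what can I do?",
    answer="You can request a password-reset email from our website.",
)
example = to_training_example(record)
print(example["prompt"])
print(example["completion"])
```

Because the structured values appear before the text, a model fine-tuned on examples like this can learn how the wording of a question and answer depends on attributes such as age or product, which is the intuition behind the third step above.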
This new functionality is an integral part of the MOSTLY AI Platform. One important aspect of our Platform is its ability to run in isolation within a secure enterprise environment, a capability that extends to synthetic text as well. We allow users to select and combine various generative AI models (including LLMs from Hugging Face and proprietary MOSTLY AI models) to produce synthetic data of the highest quality and with the highest privacy guarantees.
Use Cases
With this new functionality, enterprises can unlock the vast amount of proprietary text they have collected, such as customer support transcripts and chatbot conversations, without compromising privacy. They can then use it to train and fine-tune large language models (LLMs) for faster innovation and better decision-making.
The first use case we are targeting is prompt-response data. This type of data often comes in the form of question-answer pairs. In customer service, for example, this could be a customer’s question and the corresponding response. For instance:
Q: “I have forgotten my password, what can I do?”
A: “You can request a password-reset email from our website.”
This specific QA pair could have some additional structured information associated with it. For example:
| Customer Age | Purchased Product | Question | Answer |
| --- | --- | --- | --- |
| 25 | Product A | I have forgotten my password, what can I do? | You can request a password-reset email from our website. |
| 37 | Product B | I can’t find the setting to change the volume on my device. | You can change the volume by pressing the up or down buttons which are located on the back of your device. |
| … | … | … | … |
This unique combination of structured and unstructured data can now conveniently be synthesized with a few clicks or API calls. The process preserves privacy while maintaining the correlations and statistical relationships between the structured and text data.
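One rough way to check whether a relationship between a structured column and a text column has been retained is to compute the same summary statistic on the original and on the synthetic data and compare the two results. The sketch below is only an illustration under stated assumptions, not the platform's evaluation method: the `mean_answer_words_by_product` helper, the dictionary keys, and the choice of statistic (average answer length in words per product) are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_answer_words_by_product(rows):
    """Average number of whitespace-separated words in the answer, per product."""
    lengths = defaultdict(list)
    for row in rows:
        lengths[row["purchased_product"]].append(len(row["answer"].split()))
    return {product: mean(counts) for product, counts in lengths.items()}


# The two example rows from the prompt-response table above.
original_rows = [
    {
        "customer_age": 25,
        "purchased_product": "Product A",
        "question": "I have forgotten my password, what can I do?",
        "answer": "You can request a password-reset email from our website.",
    },
    {
        "customer_age": 37,
        "purchased_product": "Product B",
        "question": "I can’t find the setting to change the volume on my device.",
        "answer": (
            "You can change the volume by pressing the up or down buttons "
            "which are located on the back of your device."
        ),
    },
]

stats = mean_answer_words_by_product(original_rows)
print(stats)
```

Running the same function on a synthetic dataset with the same columns and comparing the per-product values gives a quick, if coarse, signal of whether the text still varies with the structured attribute as it did in the original.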
Synthetic data is set to become the driving force behind LLMs. Leveraging advanced tools to unveil deep insights hidden in proprietary data is paramount for strategic, informed decision-making across operations. MOSTLY AI provides companies with a synthetic representation that reflects both the text and the structured insights they hold.
What's next?
If you’re as excited about this new functionality as we are, check out these additional blog posts:
- Creating privacy-preserving synthetic text in Databricks to safely fine-tune your custom LLM
- Benchmarking Synthetic Text Generation: MOSTLY AI vs. GPT-4o-mini in Wine Review Prediction
If you prefer to explore synthetic text on your own, sign up for an account on our FREE version of the MOSTLY AI Platform here: https://app.mostly.ai
And lastly, if you just want to sit back and relax for a couple of minutes, check out this video (6 min 25 s) that demonstrates the MOSTLY AI Platform in action.