Introduction

In today’s data-driven world, high-quality, privacy-safe data is essential. The MOSTLY AI Platform provides a suite of advanced data capabilities, including synthetic data generation and AI-driven insights. To streamline the integration of these capabilities within Databricks, we’ve developed a Solution Accelerator. This blog post outlines the goals and process of the Solution Accelerator so that you can put the MOSTLY AI Platform to work within Databricks.

TL;DR

Our Solution Accelerator simplifies the implementation of MOSTLY AI’s synthetic data generation capabilities within the Databricks Data Intelligence Platform. With just a few guided steps, you can set up and use synthetic data generators. Once the model is registered in Unity Catalog (UC), users can generate synthetic data without ever touching production data, which protects privacy and speeds up data-driven projects.

Goal of the Solution Accelerator

The primary goal of our Solution Accelerator is to simplify the implementation of MOSTLY AI’s synthetic data capabilities within Databricks. Through a series of guided notebooks, users can quickly set up and utilize our Platform’s features. By abstracting the complexities, this Accelerator allows users to focus on deriving valuable insights from their data.

The Process

The Solution Accelerator comprises four key notebooks, each building upon the previous one to enable seamless data innovation. While there is significant complexity within the notebook code itself, users only need to input variables into widgets and run the notebooks. In almost all cases, this will work seamlessly without requiring code modifications. Here’s an overview of the process:

Step 0: Create Generator with Databricks Connector

The initial setup step installs the MOSTLY AI Python package in your Databricks environment and initializes the client with your API key and base URL. By supplying a configuration variable, you can then train a new generator. In MOSTLY AI, a generator is the core engine that learns the statistical patterns and relationships within an original dataset. It produces highly realistic, privacy-preserving synthetic data that mirrors the structure and statistical properties of the real data without containing any actual identifiable information. The generator ID created in this step is required by the subsequent notebooks. This step is only needed when you are creating a new generator.
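As a rough illustration of this step, the sketch below builds a single-table generator configuration and trains a generator with the MOSTLY AI Python client. It is not the Accelerator's notebook code. The import path `mostlyai.sdk`, the config keys `source_connector_id` and `location`, and both helper names are assumptions that may differ between client versions; the connector ID and table location are placeholders.

```python
def build_generator_config(name: str, connector_id: str, table_location: str) -> dict:
    """Assemble a single-table generator config (key names are assumed)."""
    table_name = table_location.split(".")[-1]
    return {
        "name": name,
        "tables": [
            {
                "name": table_name,
                "source_connector_id": connector_id,  # ID of a Databricks connector
                "location": table_location,           # placeholder location string
            }
        ],
    }


def train_generator(api_key: str, base_url: str, config: dict) -> str:
    """Train a generator and return its ID for the later notebooks."""
    # Imported lazily so the config helper works without the package;
    # in a notebook, run `%pip install mostlyai` first.
    from mostlyai.sdk import MostlyAI  # assumed import path

    mostly = MostlyAI(api_key=api_key, base_url=base_url)
    generator = mostly.train(config=config)
    return generator.id
```

A call such as `train_generator(api_key, base_url, build_generator_config("demo", "<connector-id>", "sales.customers"))` would return the generator ID that the following steps rely on.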

Step 1: Save Generator Path, API Key Path & URL to Unity Catalog

In this step, we save critical configuration information to Databricks Unity Catalog (UC). This includes the generator ID path, API key path, and the MOSTLY AI URL. These paths are essential for model configuration in later steps. All necessary variables are provided as widgets within the notebook, minimizing the need for direct code modifications.
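One way to picture this step is as writing each setting to its own small file and keeping track of the resulting paths. The sketch below does that with plain Python. The file names, the helper name, and the idea of pointing it at a Unity Catalog Volume directory are assumptions for illustration, not the notebook's actual layout.

```python
from pathlib import Path


def save_config_files(target_dir: str, generator_id: str, api_key: str, base_url: str) -> dict:
    """Write each setting to its own text file and return the file paths."""
    base = Path(target_dir)
    base.mkdir(parents=True, exist_ok=True)
    settings = {
        "generator_id": generator_id,
        "api_key": api_key,
        "base_url": base_url,
    }
    paths = {}
    for key, value in settings.items():
        file_path = base / f"{key}.txt"       # hypothetical file naming
        file_path.write_text(value.strip())   # drop stray whitespace/newlines
        paths[key] = str(file_path)
    return paths
```

On Databricks, `target_dir` could be a Volume path of the form `/Volumes/<catalog>/<schema>/<volume>/...`, so that access to the files is governed by Unity Catalog permissions.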

Step 2: Create, Load & Register the Generator as a Model to Unity Catalog

This notebook creates a model object that generates synthetic data using the configuration saved in the previous step. By registering this model in Unity Catalog, users can easily access and use it for data generation without interacting with the production data. The notebook requires a sample configuration for the model input and an output schema for the model output, both of which are needed for the model signature in Unity Catalog. Users enter the three-level name (catalog.schema.model) under which they want the model saved, and the model is then logged and registered at that location.
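The registration described above can be sketched with MLflow's pyfunc interface. This is a minimal sketch under several assumptions, not the Accelerator's implementation: the wrapper class, the one-row input with a `size` column, the `config_paths` dictionary of file paths, and the MOSTLY AI client calls (`mostlyai.sdk`, `generators.get`, `generate`, `data`) are illustrative and should be checked against the versions you have installed.

```python
from pathlib import Path


def validate_uc_model_name(full_name: str) -> str:
    """Check that a name has the catalog.schema.model shape."""
    parts = full_name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.model, got {full_name!r}")
    return full_name


def register_generator_model(full_name, config_paths, sample_input, sample_output):
    """Log a pyfunc wrapper around the generator and register it in UC."""
    # Lazy imports: mlflow and mostlyai must be installed on the cluster.
    import mlflow
    from mlflow.models import infer_signature

    class GeneratorModel(mlflow.pyfunc.PythonModel):
        def __init__(self, paths):
            self.paths = paths  # file paths only; values are read at call time

        def predict(self, context, model_input):
            from mostlyai.sdk import MostlyAI  # assumed import path

            read = lambda key: Path(self.paths[key]).read_text().strip()
            mostly = MostlyAI(api_key=read("api_key"), base_url=read("base_url"))
            generator = mostly.generators.get(read("generator_id"))
            size = int(model_input["size"].iloc[0])  # assumed input convention
            return mostly.generate(generator, size=size).data()

    mlflow.set_registry_uri("databricks-uc")  # register in Unity Catalog
    with mlflow.start_run():
        mlflow.pyfunc.log_model(
            artifact_path="model",
            python_model=GeneratorModel(config_paths),
            signature=infer_signature(sample_input, sample_output),
            registered_model_name=validate_uc_model_name(full_name),
        )
```

Here `sample_input` would be a small pandas DataFrame such as one row with a `size` column, and `sample_output` a few rows with the same columns as the synthetic table, since Unity Catalog requires a signature for registered models.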

Step 3: Generate Synthetic Data from UC Model

The final and most important step loads the registered model from Unity Catalog, supplies the necessary configuration, and generates synthetic data. This data is then written to a specified location in Unity Catalog for downstream consumption. Users only need to run the notebook with the appropriate model and configuration; they do not need to understand the underlying mechanisms. Because the generator is registered in Unity Catalog, consumers can generate synthetic data without ever touching production data, which preserves privacy and security.
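A minimal sketch of this final step follows. It assumes the registered model accepts a one-row pandas DataFrame with a `size` column and returns a pandas DataFrame; that input convention, the helper names, and the overwrite mode are illustrative choices rather than the notebook's actual behavior. The `spark` argument is the session that Databricks notebooks provide.

```python
def model_uri(full_name: str, version: int) -> str:
    """Build the MLflow URI for a specific registered model version."""
    return f"models:/{full_name}/{version}"


def generate_to_table(spark, full_name: str, version: int, size: int, target_table: str) -> None:
    """Load the UC model, generate synthetic rows, and save them as a table."""
    # Lazy imports so the URI helper can be used without these packages.
    import mlflow
    import pandas as pd

    mlflow.set_registry_uri("databricks-uc")  # resolve models from Unity Catalog
    model = mlflow.pyfunc.load_model(model_uri(full_name, version))

    # Assumed input convention: one row stating how many samples to generate.
    synthetic_pdf = model.predict(pd.DataFrame({"size": [size]}))

    # Write the result to a three-level table name, e.g. catalog.schema.table.
    spark.createDataFrame(synthetic_pdf).write.mode("overwrite").saveAsTable(target_table)
```

Nothing in this sketch reads the original tables: the consumer only needs permission to use the registered model and to write to the target table.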

Usage Tips

  • Widget-Based Input: All necessary variables are provided as widgets within the notebooks, ensuring that users can simply input their variables and run the notebooks without modifying the code. This design makes the process user-friendly and efficient.
  • Running Only When Necessary: Steps 0, 1, and 2 only need to be run when you are creating and registering a new synthetic data generator. If you are using an existing generator that is already registered in Unity Catalog, you can skip these steps. Databricks users can be given the last notebook alone and run it against a specific Unity Catalog model to generate and access synthetic data directly.
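To make the widget-based input concrete, the sketch below declares text widgets and reads them back with the standard `dbutils.widgets` API. The widget names and the two helper functions are hypothetical; the notebooks in the Accelerator define their own. Passing `dbutils` in as an argument keeps the helpers usable outside a notebook.

```python
def declare_widgets(dbutils, defaults: dict) -> None:
    """Create one text widget per setting, labelled with a readable name."""
    for name, default in defaults.items():
        label = name.replace("_", " ").title()
        dbutils.widgets.text(name, default, label)


def read_widgets(dbutils, names) -> dict:
    """Read widget values and fail early if any were left empty."""
    values = {name: dbutils.widgets.get(name).strip() for name in names}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise ValueError(f"please fill in these widgets: {', '.join(missing)}")
    return values
```

In a notebook, `declare_widgets(dbutils, {"uc_model_name": "", "sample_size": "100"})` would render two input boxes at the top of the page, and `read_widgets(dbutils, ["uc_model_name", "sample_size"])` would return their current values as strings.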

Conclusion

Our Solution Accelerator for synthetic data generation with MOSTLY AI and Databricks simplifies the process, enabling users to quickly implement and utilize synthetic data generators. By following the structured steps in the provided notebooks, users can seamlessly integrate MOSTLY AI’s platform into their Databricks workflows, enhancing data privacy, scalability, and innovation. Synthetic data enables safer and faster access to data for everyone, making it a powerful tool for data-driven projects.

We hope this Solution Accelerator helps you get the most out of MOSTLY AI’s capabilities and accelerates your data-driven projects.

To get started:

  1. Navigate to the Databricks Marketplace and search for “Mostly”.
  2. Select the “Synthetic Data Generator by MOSTLY AI for Databricks” tile.
  3. Click “Get Access” in the top right corner.
  4. Navigate to Unity Catalog in Databricks and click the new catalog listed in the “Shared” section.
  5. In the catalog, click “Other Assets” and you will see the four notebooks.
  6. Clone each of them to your Databricks workspace and you are ready to go!