Key features

  • Stores technical metadata, such as the names and number of tables and columns in a data source, data types, and relationships across tables.

  • Can read tables in Parquet and CSV file formats.

  • Doesn’t retain any static datasets in memory.

  • Determines suitable column encoding types by performing a full analysis on the data source’s contents.

  • Allows admins to manage a data source’s synthetization settings.

  • Enables automation of synthetic data delivery pipelines.

  • Simplifies manual synthetization tasks, as users don’t need to set up jobs from scratch.

Data catalogs connect to your data sources, extract information about the data inside, and set them up for repeatable synthetic data generation. They simplify the management and reuse of synthetization settings for CSV and Parquet files.

What is a data catalog?

In simple terms, a data catalog is a document that describes what’s available in a data source, together with the requirements for creating a privacy-secure version of its contents. It gives you the certainty of knowing what you’re going to get, without needing detailed insight into the actual data.

Better yet, data catalogs allow the contents to change, giving you the opportunity to share up-to-date synthetic copies of that data source.

Let’s expand on this with the analogy of a library. In a library’s catalog, you’ll find records about the books it has on offer. These records tell you whether a book is available and where you can find it, and they provide a short description of the book’s contents.

A data catalog in MOSTLY AI is very similar to these records. But to understand the capabilities of this data catalog, let’s envision having records for newspapers rather than books in the library. These records couldn’t describe the contents of a newspaper, as the contents change all the time. Instead, they would describe the themes a newspaper covers and how it is structured (front page articles, feature articles, international news, editorials, etc.), allowing you to easily find a particular point of view on the latest events.

Similarly, a data catalog describes the structure of the contents of a data source: how many tables there are, how they are related, which columns they have, and how they should be synthesized.
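To make this more concrete, the kind of structural information described above could be sketched as a small set of Python data classes. This is only an illustration and not MOSTLY AI’s actual format or API: the class names (`DataCatalog`, `Table`, `Column`, `Relationship`), the field names, and the encoding labels are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    data_type: str      # e.g. "integer", "string", "float"
    encoding_type: str  # how the column should be synthesized (hypothetical labels)


@dataclass
class Relationship:
    # Links a column in a child table to a column in a parent table.
    child_table: str
    child_column: str
    parent_table: str
    parent_column: str


@dataclass
class Table:
    name: str
    columns: list[Column] = field(default_factory=list)


@dataclass
class DataCatalog:
    # Holds only metadata about the data source, never the rows themselves.
    tables: list[Table] = field(default_factory=list)
    relationships: list[Relationship] = field(default_factory=list)

    def summary(self) -> dict[str, int]:
        """Count the tables, columns, and relationships the catalog describes."""
        return {
            "tables": len(self.tables),
            "columns": sum(len(t.columns) for t in self.tables),
            "relationships": len(self.relationships),
        }


# A hypothetical two-table data source: customers and their orders.
catalog = DataCatalog(
    tables=[
        Table("customers", [
            Column("id", "integer", "key"),
            Column("country", "string", "categorical"),
        ]),
        Table("orders", [
            Column("id", "integer", "key"),
            Column("customer_id", "integer", "key"),
            Column("amount", "float", "numeric"),
        ]),
    ],
    relationships=[Relationship("orders", "customer_id", "customers", "id")],
)

print(catalog.summary())
```

Note that nothing in this sketch stores actual records. Like the newspaper records in the analogy, it captures structure and synthetization settings only.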

In this way, a data catalog allows the original data to change as long as its structure stays the same, enabling you to share an up-to-date synthetic copy across your company and partnerships.
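The condition that the structure stays the same can be illustrated with a short check. The sketch below is not how MOSTLY AI validates a data source. It assumes, purely for illustration, that a structure is represented as a dictionary mapping table names to `{column name: data type}` dictionaries, and the function name `structure_changes` is hypothetical.

```python
def structure_changes(
    cataloged: dict[str, dict[str, str]],
    current: dict[str, dict[str, str]],
) -> list[str]:
    """List the structural differences between a cataloged and a current source.

    Both arguments map table names to {column name: data type}.
    An empty result means the structure is unchanged.
    """
    changes = []
    for table in sorted(cataloged.keys() | current.keys()):
        old_cols = cataloged.get(table)
        new_cols = current.get(table)
        if old_cols is None:
            changes.append(f"table added: {table}")
        elif new_cols is None:
            changes.append(f"table removed: {table}")
        else:
            # A column counts as changed if it was added, removed, or retyped.
            for col in sorted(old_cols.keys() | new_cols.keys()):
                if old_cols.get(col) != new_cols.get(col):
                    changes.append(f"column changed: {table}.{col}")
    return changes


cataloged = {"orders": {"id": "integer", "amount": "float"}}

# New rows arrived, but the tables and columns are the same.
refreshed = {"orders": {"id": "integer", "amount": "float"}}

# A column was added, so the structure no longer matches the catalog.
altered = {"orders": {"id": "integer", "amount": "float", "discount": "float"}}

print(structure_changes(cataloged, refreshed))
print(structure_changes(cataloged, altered))
```

In this sketch, a refreshed source with identical tables and columns produces no differences, so the existing catalog could be reused as-is, while an added column is reported as a change.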