Learn about the hardware, software, and user and data access requirements when installing and deploying MOSTLY AI onto your company’s infrastructure.

MOSTLY AI’s architecture

Below you can see MOSTLY AI’s architecture diagram. It depicts how its different components interact with each other, with services in your company’s server environment, and with clients.

[Figure: MOSTLY AI’s architecture]

Hardware requirements

Running MOSTLY AI requires a cluster of at least two virtual machines. One of them will function as the application server and the others will function as AI servers.

The application server is responsible for the web-based user interface and the distribution of synthetic data generation tasks across the AI servers. The minimum requirements for the application server are as follows:

  • two CPUs

  • 32 GB of memory

  • should always be up and running

The AI servers are responsible for processing these synthetic data generation tasks. The hardware requirements for the AI servers are classified into tiers depending on the dataset size:

  • Tier 1 refers to a dataset up to 1 million subjects and 100 columns.

  • Tier 2 refers to a dataset up to 10 million subjects and 250 columns.

  • Tier 3 refers to a dataset larger than 10 million subjects and 250 columns.
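The tier list above can be read as a simple lookup. The following is a minimal sketch, not part of MOSTLY AI itself: the function name `classify_tier` and the `TierLimit` type are hypothetical, and it assumes that a dataset exceeding either the subject limit or the column limit of a tier moves to the next tier. The limits are the ones stated in the list; adjust them if your deployment documentation gives different figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierLimit:
    name: str
    max_subjects: int
    max_columns: int


# Limits copied from the tier list above.
TIER_LIMITS = (
    TierLimit("Tier 1", 1_000_000, 100),
    TierLimit("Tier 2", 10_000_000, 250),
)


def classify_tier(subjects: int, columns: int) -> str:
    """Return the lowest tier whose subject AND column limits both hold.

    Assumption: exceeding either limit pushes the dataset to the next tier.
    """
    for limit in TIER_LIMITS:
        if subjects <= limit.max_subjects and columns <= limit.max_columns:
            return limit.name
    return "Tier 3"


print(classify_tier(500_000, 80))     # Tier 1
print(classify_tier(500_000, 200))    # Tier 2 (column count exceeds Tier 1)
print(classify_tier(20_000_000, 50))  # Tier 3
```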

For on-premises deployment, these tiers refer to the following machine types:

Specification        Tier 1                               Tier 2                              Tier 3
Data size            up to 100k subjects and 100 columns  up to 10m subjects and 250 columns  more than 10m subjects and 250 columns
CPU                  32 Cores                             64 Cores                            64 Cores
Memory               128 GB                               256 GB                              512 GB
Disk Storage (SSD)   500 GB                               1 TB                                1 TB
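To compare a candidate AI server against these figures, a small check like the one below may help. It is only a sketch: the names `AI_SERVER_MINIMUMS`, `shortfalls`, and `measure_local_machine` are hypothetical, 1 TB is treated as 1000 GB, and the disk figure is the total size of the filesystem rather than the free space. The measurement helper relies on POSIX `sysconf` names and is intended for Linux hosts.

```python
import os
import shutil

# Per-tier minimums for AI servers, copied from the on-premises table above.
AI_SERVER_MINIMUMS = {
    "Tier 1": {"cpu_cores": 32, "memory_gb": 128, "disk_gb": 500},
    "Tier 2": {"cpu_cores": 64, "memory_gb": 256, "disk_gb": 1000},
    "Tier 3": {"cpu_cores": 64, "memory_gb": 512, "disk_gb": 1000},
}


def shortfalls(tier, cpu_cores, memory_gb, disk_gb):
    """Return a list of human-readable gaps; an empty list means the machine qualifies."""
    required = AI_SERVER_MINIMUMS[tier]
    measured = {"cpu_cores": cpu_cores, "memory_gb": memory_gb, "disk_gb": disk_gb}
    return [
        f"{key}: have {measured[key]:g}, need {required[key]}"
        for key in required
        if measured[key] < required[key]
    ]


def measure_local_machine(path="/"):
    """Best-effort measurement of (cores, memory in GB, disk in GB) on a Linux host."""
    memory_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    disk_bytes = shutil.disk_usage(path).total
    return os.cpu_count() or 0, memory_bytes / 1e9, disk_bytes / 1e9
```

Calling `shortfalls("Tier 2", *measure_local_machine())` on a machine lists whatever falls short of the Tier 2 row.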

For cloud deployment, some examples of VMs are provided below; however, any VM with similar memory, CPU and storage capacities would be sufficient:

Provider  Tier 1            Tier 2           Tier 3
AWS       m5.8xlarge        m5.16xlarge      r5a.16xlarge
Google    n1-standard-32    n1-standard-64   n1-highmem-64
Azure     Standard_D32s_v3  Standard_D64_v3  Standard_E64a_v4

Machine management capabilities

The application server can suspend or stop AI servers when idle so that resources are not wasted and operational costs are minimized.

The AI servers can also have different CPU, memory, and disk configurations. However, this release of MOSTLY AI does not actively route tasks to AI servers based on their processing capabilities.

Persistent storage requirements

To successfully complete synthetic data generation jobs, both the application server and the AI servers need to have access to a shared data volume.
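One way to confirm that a volume really is shared is to write a marker file from one server and read it from another. The sketch below illustrates the idea under that assumption; the function names and the marker file naming are hypothetical, and a temporary directory stands in for the shared mount so the example runs on a single machine.

```python
import tempfile
import uuid
from pathlib import Path


def write_marker(shared_dir):
    """Run on the application server: drop a uniquely named marker file."""
    token = uuid.uuid4().hex
    (Path(shared_dir) / f".share-check-{token}").write_text(token)
    return token


def marker_visible(shared_dir, token):
    """Run on each AI server: confirm the marker written elsewhere is readable."""
    marker = Path(shared_dir) / f".share-check-{token}"
    try:
        return marker.read_text() == token
    except OSError:
        return False


# Local demonstration: a temporary directory stands in for the shared mount.
with tempfile.TemporaryDirectory() as shared:
    token = write_marker(shared)
    print(marker_visible(shared, token))  # True
```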

Software requirements

Operating System

Supported operating systems, including minimum versions:

  • Red Hat Enterprise Linux 7 or 8

  • Ubuntu 18.04 or 20.04
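The operating system of a target machine can be checked by reading its `/etc/os-release` file. The following sketch assumes the listed releases are matched exactly (major version for Red Hat, full version for Ubuntu); the names `parse_os_release`, `SUPPORTED`, and `is_supported` are hypothetical, while `rhel` and `ubuntu` are the `ID` values those distributions report.

```python
def parse_os_release(text):
    """Parse the KEY=value lines of an /etc/os-release file into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"').strip("'")
    return fields


# Releases from the list above, keyed by the os-release ID value.
SUPPORTED = {"rhel": {"7", "8"}, "ubuntu": {"18.04", "20.04"}}


def is_supported(fields):
    """Assumption: only the releases named in the list count as supported."""
    distro = fields.get("ID", "")
    version = fields.get("VERSION_ID", "")
    if distro == "rhel":
        version = version.split(".")[0]  # "8.6" -> "8"
    return version in SUPPORTED.get(distro, set())


sample = 'NAME="Ubuntu"\nVERSION_ID="20.04"\nID=ubuntu\n'
print(is_supported(parse_os_release(sample)))  # True
```

On a real host, pass the contents of `/etc/os-release` instead of the sample string.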

Container management

MOSTLY AI requires the following components:

  • Docker Engine 20.10.16

  • Docker Compose 2.5.0
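The installed versions can be compared against this list with a short script. This is a sketch that treats the listed versions as minimums, which is an assumption; if your deployment requires these exact versions, compare with `==` instead. The helper names are hypothetical, and `installed_versions` assumes that `docker` is on the PATH and that Compose is installed as the `docker compose` plugin.

```python
import re
import subprocess

# Versions from the list above, treated here as minimums (an assumption).
REQUIRED = {"engine": (20, 10, 16), "compose": (2, 5, 0)}


def parse_version(text):
    """Extract the first MAJOR.MINOR.PATCH triple from a version string."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_requirement(component, version_output):
    """Compare the parsed version against the required one for this component."""
    return parse_version(version_output) >= REQUIRED[component]


def installed_versions():
    """Return the raw version output of the local Docker Engine and Compose plugin."""
    engine = subprocess.run(
        ["docker", "--version"], capture_output=True, text=True, check=True
    ).stdout
    compose = subprocess.run(
        ["docker", "compose", "version"], capture_output=True, text=True, check=True
    ).stdout
    return engine, compose


# Sample strings in the shape the Docker CLI prints.
print(meets_requirement("engine", "Docker version 20.10.16"))         # True
print(meets_requirement("compose", "Docker Compose version v2.4.1"))  # False
```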

Data source and destination configuration requirements

Cloud storage

  • Read and write access privileges for the MOSTLY AI user account.

Databases

MOSTLY AI works with source and destination databases. These can be on the same server.

  • The databases should be accessible by the application and AI servers.

  • The source database can be read-only.

  • The destination database should be empty and writable, because synthetic data will be written to it.

  • The database user should have access privileges to create tables and write to the database and schema.
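Whether a database user can create tables and write can be probed before a job runs. The sketch below works with any Python DB-API 2.0 connection and uses the standard library's `sqlite3` only as a stand-in for the real destination database. The function name and the probe table name `mostly_probe_tmp` are hypothetical, and the exact SQL may need adjusting for your database.

```python
import sqlite3


def can_create_and_write(conn, probe_table="mostly_probe_tmp"):
    """Try to create a table, insert a row, and drop the table again.

    `conn` is any DB-API 2.0 connection. Any driver error is treated as
    "privilege missing", which is a simplification.
    """
    cur = conn.cursor()
    try:
        cur.execute(f"CREATE TABLE {probe_table} (id INTEGER)")
        cur.execute(f"INSERT INTO {probe_table} (id) VALUES (1)")
        cur.execute(f"DROP TABLE {probe_table}")
        conn.commit()
        return True
    except Exception:
        conn.rollback()
        return False
    finally:
        cur.close()


# Stand-in demonstration with an in-memory SQLite database.
print(can_create_and_write(sqlite3.connect(":memory:")))  # True
```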

System administration

MOSTLY AI requires an administrative user for installation that meets at least one of the following conditions:

  • is a member of the docker group

  • has sudo rights for docker

  • is root

  • has sudo rights
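Two of these conditions, being root and belonging to the docker group, can be checked from a script. The sketch below covers only those two; sudo rights are not checked, because verifying them means invoking `sudo` itself. The function names are hypothetical, and the `grp` module is available on Unix systems only.

```python
import grp
import os


def can_install(uid, group_names):
    """Return True if the user is root or belongs to the docker group."""
    return uid == 0 or "docker" in group_names


def current_user_groups():
    """Names of all groups the current process belongs to (Unix only)."""
    names = set()
    for gid in {os.getegid(), *os.getgroups()}:
        try:
            names.add(grp.getgrgid(gid).gr_name)
        except KeyError:
            pass  # group id without a name entry
    return names


print(can_install(os.geteuid(), current_user_groups()))
```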

User access requirements

MOSTLY AI’s web UI

Users operate MOSTLY AI through its web UI, which they access with a web browser. Admins can configure a specific port and a domain name for user access. The port must be allowed in the firewall settings, and the domain name needs a valid TLS certificate.

The web UI can also be accessed via localhost or the IP address of the MOSTLY AI server.

Identity and Access Management

MOSTLY AI uses the Keycloak Identity and Access Management service to manage users and configure their access permissions. Keycloak is part of MOSTLY AI’s installation and needs its own port, which must also be allowed in the firewall settings.

Administrators can manage users within MOSTLY AI’s web UI by synchronizing its user database with your company’s Active Directory.

They can also add users via Keycloak’s web UI, which is accessed in a browser as well. It uses the same domain name that’s configured for MOSTLY AI, with the /auth path appended, for example https://mostlyai.mycompany.com/auth.

Securing MOSTLY AI via Reverse Proxy

To help secure the connection, you can configure a reverse proxy so that MOSTLY AI and Keycloak are accessible only through the ports specified during setup (8080 and 8090 by default).
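After the firewall and proxy are configured, a quick TCP probe from a client machine shows whether the ports are reachable. This is a minimal sketch using the default ports named above; the function names are hypothetical, and the host name in the usage note is the example domain from this page. A successful TCP connection only shows that the port is open, not that the service behind it is healthy.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_ports(host, ports=(8080, 8090)):
    """Probe each port and return a {port: reachable} mapping."""
    return {port: port_open(host, port) for port in ports}
```

For example, `check_ports("mostlyai.mycompany.com")` returns a dictionary with one entry per default port.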

Data access requirements

MOSTLY AI requires access to the data sources with which it will generate synthetic data.

For database data sources, in addition to providing credentials, please ensure that the cluster’s network configuration allows access to them.

For CSV and Parquet files, you can grant access in the following ways:

  • Store them locally on the server where MOSTLY AI is running

  • Upload them using a client that accesses the administrative console of MOSTLY AI

  • Store them in a cloud bucket on AWS, Google, or Azure

    • AWS S3
      Requires Access Key and Secret Access Key of the user profile in AWS IAM.

    • Google
      Requires a service account with permission to operate within the storage bucket.

    • Azure
      Requires Access Key and Secret Access Key of the storage account.
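The per-provider credential requirements above can be captured in a small lookup to catch missing values early. This is only an illustrative sketch: the dictionary keys and field names are hypothetical labels, not MOSTLY AI setting names, and the function checks presence only, not validity.

```python
# Credential fields per provider, as listed above. Key names are hypothetical.
REQUIRED_CREDENTIALS = {
    "aws": ("access_key", "secret_access_key"),
    "google": ("service_account",),
    "azure": ("access_key", "secret_access_key"),
}


def missing_credentials(provider, supplied):
    """Return the required fields that are absent or empty for this provider."""
    try:
        required = REQUIRED_CREDENTIALS[provider.lower()]
    except KeyError:
        raise ValueError(f"unsupported provider: {provider!r}") from None
    return [field for field in required if not supplied.get(field)]


print(missing_credentials("AWS", {"access_key": "example"}))
# ['secret_access_key']
```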

License management

Activating or renewing the license doesn’t require internet access. MOSTLY AI’s synthetic data platform operates solely within your company’s intranet and has no interaction with the internet. We also don’t have any access to your usage or data.

To activate your company’s license, the administrator tasked with installing MOSTLY AI needs to contact your account manager.