Use Apache Hive for synthetic data

MOSTLY AI can use Apache Hive as a data source for original data as well as a destination for synthetic data. To do so, you need to create Apache Hive connectors.

Create an Apache Hive connector

For each Apache Hive data source or destination, you need a separate connector.

Prerequisites

Obtain the Apache Hive connection details. To use Kerberos, see Use Kerberos for authentication.

  • host
  • port
  • credentials

If you use the web application, create a new Apache Hive connector from the Connectors page.

Steps

  1. From the Connectors page, click New connector. Click Create connector button
  2. From the Create a new connector window, select Apache Hive. Select Apache Hive connector
  3. From the New connector page, configure the connector.
    1. For Name, enter a name that you can distinguish from other connectors.
    2. For Access type, select whether you want to use the connector as a source or destination.
      💡

      If you plan to use Apache Hive as a destination for synthetic data, MOSTLY AI recommends that you load the data into Apache Hive using Parquet files. For more information, see Considerations for Apache Hive as a destination.

    3. For Host, enter the Apache Hive hostname.
    4. For Port, enter the port number.

      The default port for Apache Hive is 10000.

    5. For Username and Password, enter your Apache Hive credentials. Configure Apache Hive connector
  4. Click Save to save your new Apache Hive connector.

    MOSTLY AI tests the connection. If you see an error, check the connection details, update them, and click Save again.

    You can click Save anyway to save the connector disregarding any errors.

Use Kerberos for authentication

To create an Apache Hive connector with Kerberos authentication, contact your Kerberos system administrator to obtain the information listed below.

  • Kerberos principal. A unique identity to which Kerberos can assign tickets.
  • Kerberos krb5.conf configuration file. The contents of a krb5.conf configuration file, such as the one listed below. For more information, see krb5.conf (opens in a new tab) in the MIT Kerberos Documentation.
    [libdefaults]
    default_realm = INTERNAL
    dns_lookup_realm = false
    dns_lookup_kdc = false
    forwardable = true
    rdns = false
    
    [realms]
    INTERNAL = {
    	kdc = ip-172-***-***-222
    	kdc = ip-172-***-***-222.hive-kerberized.domain.com
    	kdc = 172.***.***.222
    	kdc = 3.***.***.111
    	kdc = hive-kerberized.domain.com
    	admin_server = hive-kerberized.domain.com
    }
    
    [domain_realm]
    .hive-kerberized.domain.com = INTERNAL
    hive-kerberized.domain.com = INTERNAL
  • Kerberos keytab. Short for "key table", keytab files are used in Kerberos authentication to store keys needed to log in to Kerberos-aware services. Keytab files allow automated processes (scripts and service authentication) to authenticate using Kerberos without requiring a human to enter a password.
  • CA certificate. If you have access to your keystore file from which you want to extract the CA certificate, you can use the following command to extract the CA certificate in PEM format.
    shell
    keytool -exportcert -alias ALIAS -keystore KEYSTORE_FILE -storepass STORE_PASSWORD -file ca-certificate.pem -rfc
    For example:
    shell
    keytool -exportcert -alias hivessl-ss -keystore /usr/lib/jvm/java-8-openjdk-amd64/jre/lib/security/hivessl-ss.p12 -storepass 123456 -file /tmp/ca-cert.pem -rfc

If you use the web application, enable the Authenticate with Kerberos checkbox in your Apache Hive connector and provide the required information.

Steps

  1. To use Kerberos for authentication in your Apache Hive connector, select the Authenticate with Kerberos checkbox.
  2. Provide the Kerberos authentication details.
    1. For Kerberos principal, type or paste the Kerberos principal.
    2. For Kerberos krb5 config, paste the contents of a krb5.conf file.
    3. For Kerberos keytab, click the Upload button and select a Kerberos keytab file from your local file system. Apache Hive connector configuration - Kerberos authentication
  3. Enable SSL and provide the CA certificate. Apache Hive connector configuration - Kerberos authentication with SSL
  4. Click Save.

What's next

Depending on whether you created a source or a destination connector, you can use the connector as:

Considerations for Apache Hive as a destination

The Apache Hive connector uses a database interface to connect to your Apache Hive instance. While the database interface provides high-performance for read operations (for example, to pull original data for synthesis), using that interface to load synthetic data in batches can be too slow. It is not a recommended solution for delivering high volumes of synthetic data.

The fastest way to load data into Apache Hive is to save synthetic data as Parquet files into an S3 storage bucket and then load the data into Apache Hive.

For example:

hive
CREATE TABLE table_name;
LOAD DATA INPATH 's3://bucket-name/path/to/parquet-file' INTO TABLE table_name;