July 27, 2023
10m 24s

Integrating Databricks and MOSTLY AI's synthetic data generator

In this video, we'll explain how to pull data from Databricks for synthetic data generation and how to push the generated synthetic data back to Databricks.


[00:00:00] In this video, I'll be showing the integration between Databricks and MOSTLY AI. The purpose of this video is to show how to pull data out of Databricks into MOSTLY AI, run a synthetic data job, and push the synthetic data back into Databricks for downstream use cases.

[00:00:18] For this example, we're going to use a linked table job, where we have a subject table as well as an associated linked table. The first thing we're going to do is go into Databricks.

[00:00:31] We're operating under the assumption that we already have some data in Databricks, so I went ahead and created this catalog in Databricks, with these various schemas and the tables within each of those schemas.

[00:00:50] There are two important things to know about Databricks. First, it operates on a three-level namespace: catalog, then schema or database (the two terms are interchangeable), then the tables within those schemas.
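
To make the three-level namespace concrete, here is a minimal Python sketch that builds a fully qualified table name from its three parts. The function name and the identifiers `jb_catalog`, `baseball` and `players` are illustrative assumptions based on the names spoken in the video, not the exact identifiers used in the demo workspace.

```python
def qualified_table_name(catalog: str, schema: str, table: str) -> str:
    """Join catalog, schema and table into one backtick-quoted name.

    Hypothetical helper for illustration only.
    """
    parts = [catalog, schema, table]
    for part in parts:
        # Reject empty parts and embedded backticks so the quoting stays valid.
        if not part or "`" in part:
            raise ValueError(f"invalid identifier: {part!r}")
    return ".".join(f"`{part}`" for part in parts)


if __name__ == "__main__":
    # Assumed identifiers: catalog -> schema (database) -> table.
    print(qualified_table_name("jb_catalog", "baseball", "players"))
```

A name built this way can be used in a query such as `SELECT * FROM <name> LIMIT 10` once a SQL warehouse is attached.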

[00:01:05] On the backend, another important note is that all of this actually exists in cloud storage. Databricks is the compute layer, but it's actually built on GCP, Azure and AWS.

[00:01:19] While these tables exist in Databricks, the actual underlying file storage is in the respective cloud storage, within the customer's account. Just something to note: there's no proprietary data format in Databricks. The data sits in an open format in cloud storage. I just wanted to highlight that fact.

[00:01:42] What we're looking at here is two tables in this baseball database. We have players, which holds some information about each player: there's a unique identifying key, and each row is a separate player with height, weight, how they bat and throw, et cetera. Then we have seasons. Each player might have multiple seasons, so it's going to be a many-to-one relationship from seasons to players.

[00:02:12] We'll see that player one might have 10 seasons, player two might have three, et cetera. The linkage there is going to be that player ID, right?
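
The relationship between the two tables can be sketched in a few lines of Python. The rows and the column names `id`, `player_id` and `year` below are made up for illustration; the real tables have more columns and about 19,000 players.

```python
from collections import Counter

# Hypothetical sample rows: one row per player in the subject table.
players = [{"id": 1}, {"id": 2}, {"id": 3}]

# Hypothetical linked table: many season rows can point at one player.
seasons = [
    {"player_id": 1, "year": 2001},
    {"player_id": 1, "year": 2002},
    {"player_id": 1, "year": 2003},
    {"player_id": 2, "year": 2002},
]


def seasons_per_player(players, seasons):
    """Count the linked season rows for every player, including players with none."""
    counts = Counter(row["player_id"] for row in seasons)
    return {player["id"]: counts.get(player["id"], 0) for player in players}


if __name__ == "__main__":
    print(seasons_per_player(players, seasons))
```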

[00:02:24] Now, we're going to go over into MOSTLY AI. The first thing that we want to do is establish connectors. I went ahead and created these ahead of time, because there were some credentials that I had to input, but I'll still show how to create them. These are the two that we'll be using.

[00:02:40] These connectors are what's actually going to be that direct connection into Databricks, into those specific schemas, for the source of the data that you're pulling in and the output of the data that you're pushing out.

[00:02:53] When I go and create a connector and I choose Databricks, you'll see that I have these fields right here that I have to fill in. In order to do this, we go back into Databricks, we go into this SQL warehouse tab, and we see that this is running.

[00:03:11] Think of a warehouse in Databricks as the compute layer under the hood. It's what you need to attach to in order to run commands or do any type of interactive work on Databricks. We have this starter warehouse spun up already.

[00:03:29] When I click into this warehouse, it's the same thing as, basically, a cluster, and in the connection details, you'll see the various fields that you need to input. We have the hostname, which would go here. The connector name could be anything.

[00:03:46] We have the HTTP path, which you put in here. For the access token, depending on the organization's needs, you can use an admin's token, or whatever it is. Just to show you where that actually exists in Databricks: if I go into my own user settings, I can generate a new token for access to this.

[00:04:08] Create that token, copy it, and I would input it in the respective token field here.

[00:04:17] The last thing you would do is-- Back in this data tab here, you'll remember I have this JB catalog. This is going to be the source, so I have these two tables that I'm pulling in. What I'm going to do is, I'm going to direct it to-- This would be the catalog, so JB catalog, and this schema would actually be baseball. Once I created that, I would have this direct path into this specific schema, with this specific data.

[00:04:46] Like I said, I already went ahead and created that, because I had to use some tokens and all that, so that's what this Databricks baseball DB is. Then I also created a second connector. The purpose of this connector is that, when we run the job, we want to have a place that we're actually pushing the data to. In here, we do the same exact thing and give it a different name. For this one I've used, as you can see over here, synth output, or synth DB. It's the same hostname and the same path, because we're using that same compute. It's the same access token, because we're still accessing it the same way, using the same credentials. It's the same catalog. The only difference is that now I'm going to point it to this synth output database that I have. You'll see that there's no data in this currently, because that's where we're actually going to be pushing the data back to, within Databricks.
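
As a rough sketch of the two connectors described above, the Python below models the connector fields as a small dataclass and derives the output connector from the source one. The class name, the field names and every value are placeholders assumed for illustration. This is not the MOSTLY AI API, and the hostname, HTTP path and token are not real.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class DatabricksConnector:
    """Hypothetical container for the fields entered on the connector form."""
    name: str
    hostname: str
    http_path: str
    access_token: str
    catalog: str
    schema: str


# Source connector: points at the schema that holds the original tables.
source = DatabricksConnector(
    name="Databricks baseball DB",
    hostname="example.cloud.databricks.com",       # placeholder
    http_path="/sql/1.0/warehouses/placeholder",   # placeholder
    access_token="<personal-access-token>",        # placeholder, never hard-code
    catalog="jb_catalog",                          # assumed identifier
    schema="baseball",
)

# Output connector: same compute and credentials, different name and schema.
destination = replace(source, name="Databricks synth DB", schema="synth_output")


def differing_fields(a: DatabricksConnector, b: DatabricksConnector) -> list:
    """Return the sorted names of the fields whose values differ."""
    left, right = asdict(a), asdict(b)
    return sorted(key for key in left if left[key] != right[key])


if __name__ == "__main__":
    print(differing_fields(source, destination))
```

Deriving the second connector from the first makes it explicit that only the name and the target schema change between source and output.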

[00:05:33] I would create that, and then I would have two connectors here, so the source and the output. Now what I want to do is, I want to create a catalog. A catalog is going to be a way to run repeatable jobs within MOSTLY AI, and so, what we're going to do here is create a catalog. We're going to use this Databricks baseball DB, because that's where the data is currently residing. I'm going to go ahead and give it a different name,

[00:05:59] just to make it more related to what we're actually doing, so I'll call this Baseball Synthetic Job. Now that we've done this, we'll go to tables and views. You'll see that it actually picks up the tables that currently exist in Databricks, and we're going to select all of them.

[00:06:22] The subject table, as I mentioned here, is going to be this players table. There are about 19,000 there, just for reference. We're going to hit next, and we're going to proceed. You can see the data catalog's been created, and now, what we want to do is create some additional configurations. As I mentioned, players is going to be the subject table,

[00:06:51] so you'll recall that ID is the unique identifier for each player, so I'm going to save that as the primary key. Then I'm going to go into seasons and I'm going to actually add a foreign key. In this case, it's going to be player ID, and the reference table is going to be players.

[00:07:08] Now what you'll see is that linkage between the subject table and the linked table, based on that player ID. Data settings. We're going to just use the defaults right now for how the columns are generated in the synthetic data. You can always change this, include a column, not include it, whatever the team wants to do. Training settings.
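
The primary key and foreign key settings above rest on two assumptions about the source data: every player ID is unique, and every season row points at an existing player. Here is a minimal Python sketch of a check one might run before starting a job. The function name and the column names `id` and `player_id` are assumptions for illustration, not something the platform requires.

```python
from collections import Counter


def check_keys(players, seasons, pk="id", fk="player_id"):
    """Report duplicated primary keys and foreign keys with no matching player."""
    pk_counts = Counter(row[pk] for row in players)
    duplicates = sorted(key for key, count in pk_counts.items() if count > 1)
    orphans = sorted({row[fk] for row in seasons} - set(pk_counts))
    return {"duplicate_primary_keys": duplicates, "orphaned_foreign_keys": orphans}


if __name__ == "__main__":
    # Hypothetical rows: player 2 appears twice, and one season references player 7.
    players = [{"id": 1}, {"id": 2}, {"id": 2}]
    seasons = [{"player_id": 1}, {"player_id": 2}, {"player_id": 7}]
    print(check_keys(players, seasons))
```

Two empty lists in the result mean the linkage between the subject table and the linked table is consistent.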

[00:07:28] What I'm going to do here, just for purposes of this demo, is optimize for speed, and not accuracy, just so I can get some results and show how this data's actually pushed out. The purpose here isn't to show the best model, but more to show the end-to-end process. I'm going to optimize for speed, and for the training size, I'm just going to give it 2,000 and give it a small model. All of these, basically, are just increasing

[00:07:54] the speed and sacrificing a little bit of the accuracy, just for demo purposes, so for both tables, I'm going to do the same thing. Speed, give it a smaller model, and do that. Output settings.

[00:08:07] This is if you want a different number of generated subjects. For this demo, we're going to leave that blank. I'm going to click Start Job.

[00:08:16] It brings you up to the same type of screen, but the only difference is that this output settings section now provides a data destination. For the data destination, I'm now going to set it to that Databricks baseball synthetic data, or synthetic DB, which is basically a one-to-one mapping to this synth output schema that I've created, which currently has no data in it.

[00:08:39] Ideally, when this job runs, what you should see is it pulling in this data, running the synthetic job, and then pushing the synthetic data out to this synthetic output database.

[00:08:53] I'm going to go ahead and launch job. Don't worry about this, and there we go.

[00:09:07] Oops, I have two running here, don't I? Once I launch the job-- I'm going to cancel this one.

[00:09:16] This one is currently running, we're going to pause the video here, and I will pick it back up once the job is actually completed.

[00:09:30] The job has been completed. You can see that the QA report was produced. The synthetic data within the platform exists here, for both the players and seasons.

[00:09:44] Now, if we go back into Databricks, I'm just going to refresh this just in case, you'll notice that we have our data tab, and you'll see now, in this synthetic output, we have players and seasons.

[00:09:59] If I go into seasons, for example, I click on some sample data, you can actually see that the synthetic data has in fact been produced, pushed back out to Databricks,

[00:10:09] and can now be leveraged for any downstream use cases that the organization might have.
