Tutorial¤
This tutorial gives an example of how to use airt services to train a model and make predictions.
We can use the following classes from the airt-client library for the task at hand:
- Client for authenticating and accessing the airt service,
- DataBlob for encapsulating data from sources such as CSV files, databases, Azure Blob Storage, or AWS S3 buckets, and
- DataSource for managing datasources and training models in the airt service.
We import them from the airt.client module as follows:
from airt.client import Client, DataBlob, DataSource
Authentication¤
To access the airt service, you must first create a developer account. Please fill out the signup form to get one.
After successful verification, you will receive an email with the username and password for the developer account.
Once you have the credentials, use them to get an access token by calling the Client.get_token method. You need an access token; otherwise, you won’t be able to access all of the airt service’s APIs. You can either pass the username, password, and server address as parameters to the Client.get_token method or store them in the environment variables AIRT_SERVICE_USERNAME, AIRT_SERVICE_PASSWORD, and AIRT_SERVER_URL.
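If you prefer to pass the credentials explicitly instead of using environment variables, a minimal sketch might look like the following; the username, password, and server parameter names and the placeholder values are assumptions, so please check the Client.get_token documentation for the exact signature:
# Authenticate by passing the credentials explicitly (parameter names are assumptions)
Client.get_token(
    username="your-username", password="your-password", server="https://your-airt-server"
)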
In addition to the regular authentication with credentials, you can also enable multi-factor authentication (MFA) and single sign-on (SSO) for generating tokens.
To help protect your account, we recommend that you enable multi-factor authentication (MFA). MFA provides additional security by requiring you to provide a unique verification code (OTP) in addition to your regular sign-in credentials when performing critical operations.
Your account can be configured for MFA in just two easy steps:
- First, enable MFA for your account by calling the User.enable_mfa method, which will generate a QR code. Scan the QR code with an authenticator app, such as Google Authenticator, and follow the on-device instructions to finish the setup on your smartphone.
- Finally, activate MFA for your account by calling the User.activate_mfa method and passing the dynamically generated six-digit verification code from your smartphone’s authenticator app.
You can also disable MFA for your account at any time by calling the User.disable_mfa method.
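Putting the two steps together, a minimal sketch of the MFA setup might look like this; it assumes the User class can be imported from airt.client and that the verification code is passed as the otp argument, so please check the User class documentation for the exact signatures:
# Set up MFA for the account (a sketch; the otp argument name is an assumption)
from airt.client import User

User.enable_mfa()  # generates a QR code to scan with an authenticator app
User.activate_mfa(otp="123456")  # six-digit code from the authenticator app

# MFA can later be disabled again with User.disable_mfa()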
Single sign-on (SSO) can be enabled for your account in three simple steps:
- Enable SSO for a provider by calling the User.enable_sso method with the SSO provider name and an email address. At the moment, we only support “google” and “github” as SSO providers. We intend to support additional SSO providers in future releases.
- Before you can start generating new tokens with SSO, you must first authenticate with the SSO provider. Call Client.get_token with the same SSO provider you enabled in the step above to generate an SSO authorization URL. Copy and paste it into your preferred browser and complete the authentication process with the SSO provider.
- After successfully authenticating with the SSO provider, call the Client.set_sso_token method to generate a new token and use it automatically in all future interactions with the airt server.
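A minimal sketch of the SSO flow might look like this; the sso_provider and sso_email argument names are assumptions, so please check the User and Client class documentation for the exact signatures:
# Enable SSO and generate a token with it (a sketch; argument names are assumptions)
from airt.client import Client, User

User.enable_sso(sso_provider="google", sso_email="your_email@example.com")

# Generates an SSO authorization URL; open it in a browser and complete the sign-in
Client.get_token(sso_provider="google")

# After authenticating with the provider, generate and set the new token
Client.set_sso_token()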
Info
In the example below, the username, password, and server address are stored in the AIRT_SERVICE_USERNAME, AIRT_SERVICE_PASSWORD, and AIRT_SERVER_URL environment variables.
# Authenticate
Client.get_token()
1. Data Blob¤
DataBlob objects are used to encapsulate data access. Currently, we support:
- access to local CSV files,
- database access for MySQL and ClickHouse, and
- files stored in cloud storage such as AWS S3 buckets and Azure Blob Storage.
We intend to support additional databases and storage mediums in future releases.
To create a DataBlob object, use one of the DataBlob class’s from_* methods. Check out the DataBlob class documentation for more information.
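For illustration, connecting to the other supported sources might look roughly like the following; the method and parameter names in this sketch are assumptions, so please refer to the DataBlob class documentation for the actual from_* signatures:
# Hypothetical examples of other from_* methods (names and parameters are assumptions)
# data_blob = DataBlob.from_local(path="./events.csv")
# data_blob = DataBlob.from_mysql(host="db.example.com", database="shop", table="events")
# data_blob = DataBlob.from_azure_blob_storage(uri="https://<account>.blob.core.windows.net/<container>")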
In this example, the input data is a CSV file stored in an AWS S3 bucket. Before you can use the data to train a model, it must be uploaded to the airt server. To upload data from an AWS S3 bucket to the airt server, use the DataBlob.from_s3 method.
# Pull the data from an AWS S3 bucket to the airt server
data_blob = DataBlob.from_s3(uri="s3://test-airt-service/ecommerce_behavior_csv")
The above method will automatically pull the data into the airt server, and all calls to the library are asynchronous and return immediately. To manage completion, all the from_* methods of the DataBlob class return a status object indicating the completion status. Alternatively, you can monitor the completion status interactively in a progress bar by calling the DataBlob.progress_bar method:
# Display the completion status in a progress bar
data_blob.progress_bar()
100%|██████████| 1/1 [00:35<00:00, 35.48s/it]
# Check to ensure that the upload is complete
assert data_blob.is_ready()
The next step is to preprocess and prepare the data for training. Preprocessing entails creating the index column, sort column, and so on. Currently, CSV and Parquet files can be preprocessed; please use the DataBlob class’s to_datasource method for this. We intend to support additional file formats in future releases.
# Preprocess and prepare the data for training.
data_source = data_blob.to_datasource(
file_type="csv", index_column="user_id", sort_by="event_time"
)
# Display the data preprocessing progress
data_source.progress_bar()
100%|██████████| 1/1 [00:35<00:00, 35.46s/it]
When the preprocessing is finished, you can run the following command to display the first few rows of the data and make sure everything looks fine.
# Display the first few rows of preprocessed data.
data_source.head().style
user_id | event_time | event_type | product_id | category_id | category_code | brand | price | user_session |
---|---|---|---|---|---|---|---|---|
10300217 | 2019-11-06 06:51:52+00:00 | view | 26300219 | 2053013563424899840 | None | sokolov | 40.540000 | d1fdcbf1-bb1f-434b-8f1a-4b77f29a84a0 |
253299396 | 2019-11-05 21:25:44+00:00 | view | 2400724 | 2053013563743666944 | appliances.kitchen.hood | bosch | 246.850000 | b097b84d-cfb8-432c-9ab0-a841bb4d727f |
253299396 | 2019-11-05 21:27:43+00:00 | view | 2400724 | 2053013563743666944 | appliances.kitchen.hood | bosch | 246.850000 | b097b84d-cfb8-432c-9ab0-a841bb4d727f |
272811580 | 2019-11-05 19:38:48+00:00 | view | 3601406 | 2053013563810775808 | appliances.kitchen.washer | beko | 195.600000 | d18427ab-8f2b-44f7-860d-a26b9510a70b |
272811580 | 2019-11-05 19:40:21+00:00 | view | 3601406 | 2053013563810775808 | appliances.kitchen.washer | beko | 195.600000 | d18427ab-8f2b-44f7-860d-a26b9510a70b |
288929779 | 2019-11-06 05:39:21+00:00 | view | 15200134 | 2053013553484398848 | None | racer | 55.860000 | fc582087-72f8-428a-b65a-c2f45d74dc27 |
288929779 | 2019-11-06 05:39:34+00:00 | view | 15200134 | 2053013553484398848 | None | racer | 55.860000 | fc582087-72f8-428a-b65a-c2f45d74dc27 |
310768124 | 2019-11-05 20:25:52+00:00 | view | 1005106 | 2053013555631882752 | electronics.smartphone | apple | 1422.310000 | 79d8406f-4aa3-412c-8605-8be1031e63d6 |
315309190 | 2019-11-05 23:13:43+00:00 | view | 31501222 | 2053013558031024640 | None | dobrusskijfarforovyjzavod | 115.180000 | e3d5a1a4-f8fd-4ac3-acb7-af6ccd1e3fa9 |
339186405 | 2019-11-06 07:00:32+00:00 | view | 1005115 | 2053013555631882752 | electronics.smartphone | apple | 915.690000 | 15197c7e-aba0-43b4-9f3a-a815e31ade40 |
2. Training¤
The prediction engine is specialized in predicting which clients are most likely to trigger a given event in the future.
We assume the input data includes the following:
- a column identifying a client, client_column (person, car, business, etc.),
- a column specifying the type of event we will try to predict, target_column (buy, checkout, click on form submit, etc.), and
- a timestamp column specifying the time of the occurred event.
Each row of data may contain additional columns of the int, category, float, or datetime types, which will be used to improve prediction accuracy. For example, there could be a city associated with each user, the type of credit card used for a transaction, the smartphone model used to access a mobile app, etc.
Finally, we need to know how far ahead we want to make predictions. E.g. if we predict that a client will most likely buy a product in the next minute, there isn’t much we can do anyway. We might be more interested in clients who are likely to buy a product tomorrow so that we can send them a special offer or engage them in some other way. That lead time varies greatly depending on the application and can be as short as a few minutes for a web store or as long as several weeks for a banking product such as a loan. In any case, there is a parameter predict_after that allows you to specify the time period based on your particular needs.
To train a model, pass the configuration for your use case to the DataSource.train method. The train method is asynchronous and may take several hours to complete depending on the size of your dataset. You can check the training status by calling the Model.is_ready method or monitor the completion progress interactively by calling the Model.progress_bar method.
In the following example, we will train a model to predict which users will perform a purchase event (*purchase) 3 hours before they actually do it:
# Train a model
from datetime import timedelta
model = data_source.train(
client_column="user_id",
target_column="event_type",
target="*purchase",
predict_after=timedelta(hours=3),
)
# Display model training progress
model.progress_bar()
100%|██████████| 5/5 [00:00<00:00, 90.27it/s]
# Check to ensure that the model training is complete
assert model.is_ready()
Once the model training is complete, call the Model.evaluate method to display multiple evaluation metrics and assess the model’s performance.
# Evaluate the model
model.evaluate().style
metric | eval |
---|---|
accuracy | 0.985000 |
recall | 0.962000 |
precision | 0.934000 |
3. Predictions¤
Finally, you can use the trained model to make predictions by calling the Model.predict method. The predict method is asynchronous and may take several hours to complete depending on the size of your dataset. You can check the prediction status by calling the Prediction.is_ready method or monitor the completion progress interactively by calling the Prediction.progress_bar method.
# Run predictions
predictions = model.predict()
# Display model prediction progress
predictions.progress_bar()
100%|██████████| 3/3 [00:10<00:00, 3.40s/it]
# Check to ensure that the prediction is complete
assert predictions.is_ready()
If the dataset is small enough, you can download the prediction results as a Pandas DataFrame by calling the Prediction.to_pandas method:
# Download the prediction results as a pandas DataFrame
predictions.to_pandas().style
user_id | Score |
---|---|
520088904 | 0.979853 |
530496790 | 0.979157 |
561587266 | 0.979055 |
518085591 | 0.978915 |
558856683 | 0.977960 |
520772685 | 0.004043 |
514028527 | 0.003890 |
518574284 | 0.001346 |
532364121 | 0.001341 |
532647354 | 0.001139 |
In many cases, it is much better to push the prediction results to destinations such as AWS S3 or MySQL databases, or even to download them to the local machine.
Below is an example of pushing the prediction results to an AWS S3 bucket. For other available options, please check the documentation of the Prediction class.
# Push prediction results to an AWS S3 bucket
status = predictions.to_s3(uri=TARGET_S3_BUCKET)
# Display prediction push progress
status.progress_bar()
100%|██████████| 1/1 [00:10<00:00, 10.18s/it]