Tutorial¤
This tutorial gives an example of how to use airt services to train a model and make predictions.
We can use the following classes from the airt-client library for the task at hand:
- Client for authenticating and accessing the airt service,
- DataBlob for encapsulating data from sources such as CSV files, databases, Azure Blob Storage, or AWS S3 buckets, and
- DataSource for managing datasources and training models in the airt service.
We import them from the airt.client module as follows:
from airt.client import Client, DataBlob, DataSource
Authentication¤
To access the airt service, you must first create a developer account. Please fill out the signup form to get one.
After successful verification, you will receive an email with the username and password for the developer account.
Once you have the credentials, use them to get an access token by calling the Client.get_token method. You need an access token; otherwise, you won’t be able to access all of the airt service’s APIs. You can either pass the username, password, and server address as parameters to the Client.get_token method or store them in the environment variables AIRT_SERVICE_USERNAME, AIRT_SERVICE_PASSWORD, and AIRT_SERVER_URL.
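If you prefer to pass the credentials explicitly instead of using environment variables, a minimal sketch might look like the following; the username, password, and server parameter names and the placeholder values are assumptions, so please check the Client.get_token documentation for the exact signature:
# Authenticate by passing the credentials explicitly (parameter names are assumptions)
Client.get_token(
    username="your-username", password="your-password", server="https://your-airt-server"
)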
In addition to the regular authentication with credentials, you can also enable multi-factor authentication (MFA) and single sign-on (SSO) for generating tokens.
To help protect your account, we recommend that you enable multi-factor authentication (MFA). MFA provides additional security by requiring you to provide a unique verification code (OTP) in addition to your regular sign-in credentials when performing critical operations.
Your account can be configured for MFA in just two easy steps:
- First, enable MFA for your account by calling the User.enable_mfa method, which will generate a QR code. Scan the QR code with an authenticator app, such as Google Authenticator, and follow the on-device instructions to finish the setup on your smartphone.
- Finally, activate MFA for your account by calling the User.activate_mfa method and passing the dynamically generated six-digit verification code from your smartphone’s authenticator app.
You can also disable MFA for your account at any time by calling the User.disable_mfa method.
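Putting the two steps together, a minimal sketch of the MFA setup might look like this; it assumes the User class can be imported from airt.client and that the verification code is passed as the otp argument, so please check the User class documentation for the exact signatures:
# Set up MFA for the account (a sketch; the otp argument name is an assumption)
from airt.client import User

User.enable_mfa()  # generates a QR code to scan with an authenticator app
User.activate_mfa(otp="123456")  # six-digit code from the authenticator app

# MFA can later be disabled again with User.disable_mfa()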
Single sign-on (SSO) can be enabled for your account in three simple steps:
- Enable SSO for a provider by calling the User.enable_sso method with the SSO provider name and an email address. At the moment, we only support “google” and “github” as SSO providers. We intend to support additional SSO providers in future releases.
- Before you can start generating new tokens with SSO, you must first authenticate with the SSO provider. Call Client.get_token with the same SSO provider you enabled in the step above to generate an SSO authorization URL. Copy and paste it into your preferred browser and complete the authentication process with the SSO provider.
- After successfully authenticating with the SSO provider, call the Client.set_sso_token method to generate a new token and use it automatically in all future interactions with the airt server.
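A minimal sketch of the SSO flow might look like this; the sso_provider and sso_email argument names are assumptions, so please check the User and Client class documentation for the exact signatures:
# Enable SSO and generate a token with it (a sketch; argument names are assumptions)
from airt.client import Client, User

User.enable_sso(sso_provider="google", sso_email="your_email@example.com")

# Generates an SSO authorization URL; open it in a browser and complete the sign-in
Client.get_token(sso_provider="google")

# After authenticating with the provider, generate and set the new token
Client.set_sso_token()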
Info
In the example below, the username, password, and server address are stored in the AIRT_SERVICE_USERNAME, AIRT_SERVICE_PASSWORD, and AIRT_SERVER_URL environment variables.
# Authenticate
Client.get_token()
1. Data Blob¤
DataBlob objects are used to encapsulate data access. Currently, we support:
- access to local CSV files,
- database access for MySQL and ClickHouse, and
- files stored in cloud storage such as AWS S3 buckets and Azure Blob Storage.
We intend to support additional databases and storage mediums in future releases.
To create a DataBlob object, use one of the DataBlob class’s from_* methods. Check out the DataBlob class documentation for more information.
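For illustration, connecting to the other supported sources might look roughly like the following; the method and parameter names in this sketch are assumptions, so please refer to the DataBlob class documentation for the actual from_* signatures:
# Hypothetical examples of other from_* methods (names and parameters are assumptions)
# data_blob = DataBlob.from_local(path="./events.csv")
# data_blob = DataBlob.from_mysql(host="db.example.com", database="shop", table="events")
# data_blob = DataBlob.from_azure_blob_storage(uri="https://<account>.blob.core.windows.net/<container>")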
In this example, the input data is a CSV file stored in an AWS S3 bucket. Before you can use the data to train a model, it must be uploaded to the airt server. To upload data from an AWS S3 bucket to the airt server, use the DataBlob.from_s3 method.
# Pull the data from an AWS S3 bucket to the airt server
data_blob = DataBlob.from_s3(uri="s3://test-airt-service/ecommerce_behavior_csv")
The above method will automatically pull the data into the airt server, and all calls to the library are asynchronous and return immediately. To manage completion, all the from_* methods of the DataBlob class return a status object indicating the completion status. Alternatively, you can monitor the completion status interactively in a progress bar by calling the DataBlob.progress_bar method:
# Display the completion status in a progress bar
data_blob.progress_bar()
100%|██████████| 1/1 [00:35<00:00, 35.48s/it]
# Check to ensure that the upload is complete
assert data_blob.is_ready()
The next step is to preprocess and prepare the data for training. Preprocessing entails creating the index column, sort column, and so on. Currently, CSV and Parquet files can be preprocessed; please use the DataBlob class’s to_datasource method for this. We intend to support additional file formats in future releases.
# Preprocess and prepare the data for training.
data_source = data_blob.to_datasource(
file_type="csv", index_column="user_id", sort_by="event_time"
)
# Display the data preprocessing progress
data_source.progress_bar()
100%|██████████| 1/1 [00:35<00:00, 35.46s/it]
When the preprocessing is finished, you can run the following command to display the first few rows of the data and make sure everything looks fine.
# Display the first few rows of preprocessed data.
data_source.head().style
user_id | event_time | event_type | product_id | category_id | category_code | brand | price | user_session |
---|---|---|---|---|---|---|---|---|
10300217 | 2019-11-06 06:51:52+00:00 | view | 26300219 | 2053013563424899840 | None | sokolov | 40.540000 | d1fdcbf1-bb1f-434b-8f1a-4b77f29a84a0 |
253299396 | 2019-11-05 21:25:44+00:00 | view | 2400724 | 2053013563743666944 | appliances.kitchen.hood | bosch | 246.850000 | b097b84d-cfb8-432c-9ab0-a841bb4d727f |
253299396 | 2019-11-05 21:27:43+00:00 | view | 2400724 | 2053013563743666944 | appliances.kitchen.hood | bosch | 246.850000 | b097b84d-cfb8-432c-9ab0-a841bb4d727f |
272811580 | 2019-11-05 19:38:48+00:00 | view | 3601406 | 2053013563810775808 | appliances.kitchen.washer | beko | 195.600000 | d18427ab-8f2b-44f7-860d-a26b9510a70b |
272811580 | 2019-11-05 19:40:21+00:00 | view | 3601406 | 2053013563810775808 | appliances.kitchen.washer | beko | 195.600000 | d18427ab-8f2b-44f7-860d-a26b9510a70b |
288929779 | 2019-11-06 05:39:21+00:00 | view | 15200134 | 2053013553484398848 | None | racer | 55.860000 | fc582087-72f8-428a-b65a-c2f45d74dc27 |
288929779 | 2019-11-06 05:39:34+00:00 | view | 15200134 | 2053013553484398848 | None | racer | 55.860000 | fc582087-72f8-428a-b65a-c2f45d74dc27 |
310768124 | 2019-11-05 20:25:52+00:00 | view | 1005106 | 2053013555631882752 | electronics.smartphone | apple | 1422.310000 | 79d8406f-4aa3-412c-8605-8be1031e63d6 |
315309190 | 2019-11-05 23:13:43+00:00 | view | 31501222 | 2053013558031024640 | None | dobrusskijfarforovyjzavod | 115.180000 | e3d5a1a4-f8fd-4ac3-acb7-af6ccd1e3fa9 |
339186405 | 2019-11-06 07:00:32+00:00 | view | 1005115 | 2053013555631882752 | electronics.smartphone | apple | 915.690000 | 15197c7e-aba0-43b4-9f3a-a815e31ade40 |
2. Training¤
The prediction engine is specialized in predicting which clients are most likely to trigger a given event in the future.
We assume the input data includes the following:
- a column identifying a client, client_column (person, car, business, etc.),
- a column specifying the type of event we will try to predict, target_column (buy, checkout, click on form submit, etc.), and
- a timestamp column specifying the time of the occurred event.
Each row of data may contain additional columns of the int, category, float, or datetime types, which will be used to improve prediction accuracy. For example, there could be a city associated with each user, the type of credit card used for a transaction, the smartphone model used to access a mobile app, etc.
Finally, we need to know how far ahead we want to make predictions. E.g. if we predict that a client will most likely buy a product in the next minute, there isn’t much we can do anyway. We might be more interested in clients who are likely to buy a product tomorrow so that we can send them a special offer or engage them in some other way. That lead time varies greatly depending on the application and can be as short as a few minutes for a web store or as long as several weeks for a banking product such as a loan. In any case, there is a parameter predict_after that allows you to specify the time period based on your particular needs.
To train a model, pass the configuration for your use case to the DataSource.train method. The train method is asynchronous and may take several hours to complete depending on the size of your dataset. You can check the training status by calling the Model.is_ready method or monitor the completion progress interactively by calling the Model.progress_bar method.
In the following example, we will train a model to predict which users will perform a purchase event (*purchase) 3 hours before they actually do it:
# Train a model
from datetime import timedelta
model = data_source.train(
client_column="user_id",
target_column="event_type",
target="*purchase",
predict_after=timedelta(hours=3),
)
# Display model training progress
model.progress_bar()
100%|██████████| 5/5 [00:00<00:00, 90.27it/s]
# Check to ensure that the model training is complete
assert model.is_ready()
Once the model training is complete, call the Model.evaluate method to display multiple evaluation metrics and assess the model’s performance.
# Evaluate the model
model.evaluate().style
metric | eval |
---|---|
accuracy | 0.985000 |
recall | 0.962000 |
precision | 0.934000 |
3. Predictions¤
Finally, you can use the trained model to make predictions by calling the Model.predict method. The predict method is asynchronous and may take several hours to complete depending on the size of your dataset. You can check the prediction status by calling the Prediction.is_ready method or monitor the completion progress interactively by calling the Prediction.progress_bar method.
# Run predictions
predictions = model.predict()
# Display model prediction progress
predictions.progress_bar()
100%|██████████| 3/3 [00:10<00:00, 3.40s/it]
# Check to ensure that the prediction is complete
assert predictions.is_ready()
If the dataset is small enough, you can download the prediction results as a Pandas DataFrame by calling the Prediction.to_pandas method:
# Download the prediction results as a pandas DataFrame
predictions.to_pandas().style
user_id | Score |
---|---|
520088904 | 0.979853 |
530496790 | 0.979157 |
561587266 | 0.979055 |
518085591 | 0.978915 |
558856683 | 0.977960 |
520772685 | 0.004043 |
514028527 | 0.003890 |
518574284 | 0.001346 |
532364121 | 0.001341 |
532647354 | 0.001139 |
In many cases, it is much better to push the prediction results to destinations such as AWS S3 or MySQL databases, or even to download them to the local machine.
Below is an example of pushing the prediction results to an AWS S3 bucket. For other available options, please check the documentation of the Prediction class.
# Push prediction results to an AWS S3 bucket
status = predictions.to_s3(uri=TARGET_S3_BUCKET)
# Display prediction push progress
status.progress_bar()
100%|██████████| 1/1 [00:10<00:00, 10.18s/it]