import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
# Import the datasist library
import datasist as ds


# Read in the training and test data sets
train_data = pd.read_csv('training.csv')
test_data = pd.read_csv('test.csv')

Get a quick summary of the data set using the describe function in the structdata module.

ds.structdata.describe(train_data)

First five data points


Last five data points


Shape of  data set: (95662, 16)


Size of  data set: 1530592


Data Types
Note: All Non-numerical features are identified as objects in pandas


Column(s) {'TransactionStartTime'} should be in Datetime format. Use the [to_date] function in datasist.feature_engineering to coonvert to Pandas Datetime format


Numerical Features in Data set
['CountryCode', 'Amount', 'Value', 'PricingStrategy', 'FraudResult']


Statistical Description of Columns


Categorical Features in Data set

['TransactionId',
 'BatchId',
 'AccountId',
 'SubscriptionId',
 'CustomerId',
 'CurrencyCode',
 'ProviderId',
 'ProductId',
 'ProductCategory',
 'ChannelId',
 'TransactionStartTime']


Unique class Count of Categorical features


Missing Values in Data
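
The describe output above flags TransactionStartTime as a column that should be in datetime format. As a minimal sketch of what that conversion looks like, the block below uses plain pandas (`pd.to_datetime`) rather than the datasist helper named in the message, so it is an illustration of the idea and not the library's own routine. The `sample` frame and the derived `hour` and `dayofweek` columns are assumptions made for this example; the timestamps are copied from the first rows of the training data.

```python
import pandas as pd

# Small illustrative frame (not the full data set); the timestamps are
# taken from the first three rows shown in this notebook.
sample = pd.DataFrame({
    "TransactionStartTime": [
        "2018-11-15T02:18:49Z",
        "2018-11-15T02:19:08Z",
        "2018-11-15T02:44:21Z",
    ]
})

# Parse the ISO 8601 strings; utc=True interprets the trailing "Z" as UTC.
sample["TransactionStartTime"] = pd.to_datetime(
    sample["TransactionStartTime"], utc=True
)

# Once parsed, calendar parts are available through the .dt accessor.
sample["hour"] = sample["TransactionStartTime"].dt.hour
sample["dayofweek"] = sample["TransactionStartTime"].dt.dayofweek  # Monday = 0

print(sample.dtypes)
```

After conversion the column no longer counts as an object (categorical) feature, which also removes it from the list of columns with too many unique classes to plot.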

Remove features that contain only one unique value, as these features are redundant.

# Drop redundant features
ds.feature_engineering.drop_redundant(data=train_data)
ds.feature_engineering.drop_redundant(data=test_data)

Dropped ['CurrencyCode', 'CountryCode']
Dropped ['CurrencyCode', 'CountryCode']
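
To make the step above concrete, here is a rough pandas-only sketch of dropping columns that hold a single unique value. It is an assumption about what the operation amounts to, not a copy of the datasist implementation; the `sample` frame and the names `constant_cols` and `reduced` are made up for this example.

```python
import pandas as pd

# Illustrative frame: two constant columns and two informative ones.
sample = pd.DataFrame({
    "CurrencyCode": ["UGX", "UGX", "UGX"],
    "CountryCode": [256, 256, 256],
    "ProductCategory": ["airtime", "financial_services", "airtime"],
    "Amount": [1000.0, -20.0, 500.0],
})

# Count distinct values per column (NaN is ignored by default).
unique_counts = sample.nunique()

# Columns with at most one distinct value carry no information for a model.
constant_cols = unique_counts[unique_counts <= 1].index.tolist()

# drop() returns a new frame; `sample` itself is left unchanged.
reduced = sample.drop(columns=constant_cols)

print("Dropped", constant_cols)
print(reduced.columns.tolist())
```

Note that this sketch returns a new frame instead of modifying `sample`. It is worth confirming, for the installed datasist version, whether `drop_redundant` changes the frame in place or returns the reduced frame, for example by inspecting `train_data.columns` after the call.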

Check for missing values in the data set with the display function.
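
As a hedged illustration of this check, the sketch below builds a missing-value table with plain pandas, using the same column names as the table printed in this notebook (features, missing_counts, missing_percent). It does not call datasist, and the `sample` frame deliberately contains gaps so the result is non-trivial; the real training data reports no missing values.

```python
import numpy as np
import pandas as pd

# Illustrative frame with two deliberately missing entries.
sample = pd.DataFrame({
    "Amount": [1000.0, np.nan, 500.0, 20000.0],
    "ProductCategory": ["airtime", "financial_services", None, "utility_bill"],
    "Value": [1000, 20, 500, 21800],
})

# isna() marks NaN/None cells; summing the booleans counts them per column.
missing_counts = sample.isna().sum()

missing_table = pd.DataFrame({
    "features": missing_counts.index,
    "missing_counts": missing_counts.values,
    "missing_percent": (missing_counts.values / len(sample) * 100).round(2),
})

print(missing_table)
```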

EXPLORATION OF CATEGORICAL FEATURES

cat_feats = ds.structdata.get_cat_feats(train_data)

cat_feats

['TransactionId',
 'BatchId',
 'AccountId',
 'SubscriptionId',
 'CustomerId',
 'ProviderId',
 'ProductId',
 'ProductCategory',
 'ChannelId',
 'TransactionStartTime']
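
For readers who want to see how such a list can be derived, the following sketch splits columns by dtype with plain pandas. It assumes that "categorical" simply means "non-numeric", which matches the lists printed in this notebook but is not taken from the datasist source; the `sample` frame is illustrative.

```python
import pandas as pd

# Illustrative frame mixing string and numeric columns.
sample = pd.DataFrame({
    "ProviderId": ["ProviderId_6", "ProviderId_4", "ProviderId_6"],
    "ProductCategory": ["airtime", "financial_services", "airtime"],
    "Amount": [1000.0, -20.0, 500.0],
    "PricingStrategy": [2, 2, 2],
})

# Non-numeric columns are treated as categorical features.
cat_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Integer and float columns are treated as numerical features.
num_cols = sample.select_dtypes(include="number").columns.tolist()

print(cat_cols)
print(num_cols)
```

A dtype-based split treats integer-coded columns such as PricingStrategy as numerical, even though they may behave like categories; that is a modelling decision the split itself cannot make.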

ds.structdata.get_unique_counts(train_data)

From the unique-count output, we notice that TransactionId and BatchId contain too many classes (nearly one per row), so we can drop them.

train_data.drop(['TransactionId', 'BatchId'], axis=1, inplace=True)
test_data.drop(['TransactionId', 'BatchId'], axis=1, inplace=True)
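
The decision above was made by reading the unique-count table. One way to make it reproducible is to compare each column's unique count with the number of rows, as in the sketch below. The `threshold` of 0.95 is a hypothetical cut-off chosen for this example, and `sample` is a small illustrative frame built from the first rows of the training data.

```python
import pandas as pd

# Illustrative frame built from the first five rows shown in this notebook.
sample = pd.DataFrame({
    "TransactionId": [
        "TransactionId_76871", "TransactionId_73770", "TransactionId_26203",
        "TransactionId_380", "TransactionId_28195",
    ],
    "AccountId": [
        "AccountId_3957", "AccountId_4841", "AccountId_4229",
        "AccountId_648", "AccountId_4841",
    ],
    "ProductCategory": [
        "airtime", "financial_services", "airtime",
        "utility_bill", "financial_services",
    ],
})

# Fraction of rows that hold a distinct value; 1.0 means every row is unique.
cardinality_ratio = sample.nunique() / len(sample)

# Hypothetical cut-off: columns this close to one value per row act as identifiers.
threshold = 0.95
id_like = cardinality_ratio[cardinality_ratio >= threshold].index.tolist()

reduced = sample.drop(columns=id_like)
print(id_like)
```

On the full training data the same ratio is 95662 / 95662 = 1.0 for TransactionId and 94809 / 95662 ≈ 0.99 for BatchId. TransactionStartTime (94556 unique values) would also exceed this cut-off, but it is a timestamp rather than an identifier, so it is better converted to datetime than dropped.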

VISUALIZATION FOR CATEGORICAL FEATURES

ds.visualizations.countplot(train_data)

Unique Values in AccountId is too large to plot


Unique Values in SubscriptionId is too large to plot


Unique Values in CustomerId is too large to plot


Unique Values in TransactionStartTime is too large to plot
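
The messages above show that columns with many unique values are skipped when plotting. The sketch below imitates that behaviour with plain pandas: it collects value counts for non-numeric columns under a limit and lists the ones that were skipped. The helper `plottable_value_counts` and its `max_unique` parameter are hypothetical names for this example; the limit datasist actually applies is not stated in this notebook.

```python
import pandas as pd


def plottable_value_counts(df, max_unique=30):
    """Return (counts, skipped) for the non-numeric columns of df.

    counts maps each column with at most max_unique distinct values to its
    value counts; skipped lists the columns that exceed the limit.
    """
    counts, skipped = {}, []
    for col in df.select_dtypes(exclude="number").columns:
        if df[col].nunique() > max_unique:
            skipped.append(col)
        else:
            counts[col] = df[col].value_counts()
    return counts, skipped


# Illustrative frame built from the first five rows shown in this notebook.
sample = pd.DataFrame({
    "ChannelId": ["ChannelId_3", "ChannelId_2", "ChannelId_3",
                  "ChannelId_3", "ChannelId_2"],
    "CustomerId": ["CustomerId_4406", "CustomerId_4406", "CustomerId_4683",
                   "CustomerId_988", "CustomerId_988"],
    "Amount": [1000.0, -20.0, 500.0, 20000.0, -644.0],
})

# A tiny limit is used here only because the sample has five rows.
counts, skipped = plottable_value_counts(sample, max_unique=2)

for col in skipped:
    print(f"Unique values in {col} are too many to plot")
print(counts["ChannelId"])
```

Each Series in `counts` can then be drawn as a bar chart, for example with `counts["ChannelId"].plot(kind="bar")`, which gives the same information as a count plot for that column.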

	TransactionId	BatchId	AccountId	SubscriptionId	CustomerId	CurrencyCode	CountryCode	ProviderId	ProductId	ProductCategory	ChannelId	Amount	Value	TransactionStartTime	PricingStrategy
0	TransactionId_76871	BatchId_36123	AccountId_3957	SubscriptionId_887	CustomerId_4406	UGX	256	ProviderId_6	ProductId_10	airtime	ChannelId_3	1000.0	1000	2018-11-15T02:18:49Z	2
1	TransactionId_73770	BatchId_15642	AccountId_4841	SubscriptionId_3829	CustomerId_4406	UGX	256	ProviderId_4	ProductId_6	financial_services	ChannelId_2	-20.0	20	2018-11-15T02:19:08Z	2
2	TransactionId_26203	BatchId_53941	AccountId_4229	SubscriptionId_222	CustomerId_4683	UGX	256	ProviderId_6	ProductId_1	airtime	ChannelId_3	500.0	500	2018-11-15T02:44:21Z	2
3	TransactionId_380	BatchId_102363	AccountId_648	SubscriptionId_2185	CustomerId_988	UGX	256	ProviderId_1	ProductId_21	utility_bill	ChannelId_3	20000.0	21800	2018-11-15T03:32:55Z	2
4	TransactionId_28195	BatchId_38780	AccountId_4841	SubscriptionId_3829	CustomerId_988	UGX	256	ProviderId_4	ProductId_6	financial_services	ChannelId_2	-644.0	644	2018-11-15T03:34:21Z	2

	TransactionId	BatchId	AccountId	SubscriptionId	CustomerId	CurrencyCode	CountryCode	ProviderId	ProductId	ProductCategory	ChannelId	Amount	Value	TransactionStartTime	PricingStrategy
95657	TransactionId_89881	BatchId_96668	AccountId_4841	SubscriptionId_3829	CustomerId_3078	UGX	256	ProviderId_4	ProductId_6	financial_services	ChannelId_2	-1000.0	1000	2019-02-13T09:54:09Z	2
95658	TransactionId_91597	BatchId_3503	AccountId_3439	SubscriptionId_2643	CustomerId_3874	UGX	256	ProviderId_6	ProductId_10	airtime	ChannelId_3	1000.0	1000	2019-02-13T09:54:25Z	2
95659	TransactionId_82501	BatchId_118602	AccountId_4841	SubscriptionId_3829	CustomerId_3874	UGX	256	ProviderId_4	ProductId_6	financial_services	ChannelId_2	-20.0	20	2019-02-13T09:54:35Z	2
95660	TransactionId_136354	BatchId_70924	AccountId_1346	SubscriptionId_652	CustomerId_1709	UGX	256	ProviderId_6	ProductId_19	tv	ChannelId_3	3000.0	3000	2019-02-13T10:01:10Z	2
95661	TransactionId_35670	BatchId_29317	AccountId_4841	SubscriptionId_3829	CustomerId_1709	UGX	256	ProviderId_4	ProductId_6	financial_services	ChannelId_2	-60.0	60	2019-02-13T10:01:28Z	2

	CountryCode	Amount	Value	PricingStrategy	FraudResult
count	95662.0	9.566200e+04	9.566200e+04	95662.000000	95662.000000
mean	256.0	6.717846e+03	9.900584e+03	2.255974	0.002018
std	0.0	1.233068e+05	1.231221e+05	0.732924	0.044872
min	256.0	-1.000000e+06	2.000000e+00	0.000000	0.000000
25%	256.0	-5.000000e+01	2.750000e+02	2.000000	0.000000
50%	256.0	1.000000e+03	1.000000e+03	2.000000	0.000000
75%	256.0	2.800000e+03	5.000000e+03	2.000000	0.000000
max	256.0	9.880000e+06	9.880000e+06	4.000000	1.000000

	features	missing_counts	missing_percent
0	TransactionId	0	0.0
1	BatchId	0	0.0
2	AccountId	0	0.0
3	SubscriptionId	0	0.0
4	CustomerId	0	0.0
5	CurrencyCode	0	0.0
6	CountryCode	0	0.0
7	ProviderId	0	0.0
8	ProductId	0	0.0
9	ProductCategory	0	0.0
10	ChannelId	0	0.0
11	Amount	0	0.0
12	Value	0	0.0
13	TransactionStartTime	0	0.0
14	PricingStrategy	0	0.0
15	FraudResult	0	0.0

	Data Type
TransactionId	object
BatchId	object
AccountId	object
SubscriptionId	object
CustomerId	object
CurrencyCode	object
CountryCode	int64
ProviderId	object
ProductId	object
ProductCategory	object
ChannelId	object
Amount	float64
Value	int64
TransactionStartTime	object
PricingStrategy	int64
FraudResult	int64

	Feature	Unique Count
0	TransactionId	95662
1	BatchId	94809
2	AccountId	3633
3	SubscriptionId	3627
4	CustomerId	3742
5	CurrencyCode	1
6	ProviderId	6
7	ProductId	23
8	ProductCategory	9
9	ChannelId	4
10	TransactionStartTime	94556

Example use case of datasist in the workflow of a data scientist using a dataset from the Zindi competitive data science platform