ANZ Virtual Internship

This is a post about a virtual work experience program I completed with Forage in their collaboration with ANZ, an Australian multinational banking and financial services company.

Overview

The virtual work experiences contain a series of resources and tasks designed to simulate the real-world experience. This task is based on a synthesised transaction dataset containing 3 months’ worth of transactions for 100 hypothetical customers. It contains purchases, recurring transactions, and salary transactions.

The dataset is designed to simulate realistic transaction behaviours that are observed in ANZ’s real transaction data, so many of the insights you can gather from the tasks below will be genuine.

This task will involve exploratory data analysis, feature engineering and predicting customers’ annual salary based on their transactions.

Have a look at this interactive data dashboard too. And if you would like to deep dive into the code, here is a link to the github repository.

Exploratory Data Analysis

The dataset given has 12 043 records with 23 columns.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12043 entries, 0 to 12042
Data columns (total 23 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   status             12043 non-null  object        
 1   card_present_flag  7717 non-null   float64       
 2   bpay_biller_code   885 non-null    object        
 3   account            12043 non-null  object        
 4   currency           12043 non-null  object        
 5   long_lat           12043 non-null  object        
 6   txn_description    12043 non-null  object        
 7   merchant_id        7717 non-null   object        
 8   merchant_code      883 non-null    float64       
 9   first_name         12043 non-null  object        
 10  balance            12043 non-null  float64       
 11  date               12043 non-null  datetime64[ns]
 12  gender             12043 non-null  object        
 13  age                12043 non-null  int64         
 14  merchant_suburb    7717 non-null   object        
 15  merchant_state     7717 non-null   object        
 16  extraction         12043 non-null  object        
 17  amount             12043 non-null  float64       
 18  transaction_id     12043 non-null  object        
 19  country            12043 non-null  object        
 20  customer_id        12043 non-null  object        
 21  merchant_long_lat  7717 non-null   object        
 22  movement           12043 non-null  object        
dtypes: datetime64[ns](1), float64(4), int64(1), object(17)
memory usage: 2.1+ MB

75% of customers are between age 18 and 38, median age 28.

Difference in average salary by gender
Males: 2077.94 AUD, Females: 1679.37 AUD.

Average purchase transaction (POS transactions) amount is 39.8 AUD and every customer makes 27 of them on average every month.

Average week purchase volume and amount.

Average day purchase volume and amount.

Feature engineering

To understand our customer better, we will construct some new features that relate to their purchasing behaviours.

Create features like which state does the customer live in, how many purchases they make weekly, what is their average purchase amount, how many big purchases they make and what is their average balance.

customer_state = []
avg_weekly_purch_num = []
avg_weekly_trans_num = []
no_trans_days = []
avg_trans_amount = []
max_amount = []
num_large_trans = []
avg_trans_amount_overall = []
med_balance = []

for i in df['customer_id']:
    customer_state.append(customer_state_df[customer_state_df['id']==i]['state'].item())
    avg_weekly_purch_num.append(int(weekly_purch.loc[i]))
    avg_weekly_trans_num.append(int(weekly_trans.loc[i]))
    no_trans_days.append(pos_df[pos_df['customer_id']==i]['date'].nunique())
    avg_trans_amount.append(int(round(pos_df[pos_df['customer_id']==i]['amount'].mean(), 0)))
    max_amount.append(round(pos_df[pos_df['customer_id']==i]['amount'].max(), 1))
    num_large_trans.append(pos_df[(pos_df['customer_id']==i) & (pos_df['amount']>100)]['amount'].count())
    avg_trans_amount_overall.append(int(round(df[df['customer_id']==i]['amount'].mean(), 0)))
    med_balance.append(df[df['customer_id']==i]['balance'].median())
    
df['state'] = customer_state
df['avg_weekly_purch_num'] = avg_weekly_purch_num
df['avg_weekly_trans_num'] = avg_weekly_trans_num
df['no_trans_days'] = no_trans_days
df['avg_trans_amount'] = avg_trans_amount
df['max_amount'] = max_amount
df['num_large_trans'] = num_large_trans
df['avg_trans_amount_overall'] = avg_trans_amount_overall
df['med_balance'] = med_balance

Age bins

We will also bin the customer age to one four categories: below 20, between 20 and 40, between 40 and 60, over 60.

Annual salary for each customer

Find only salary payments.

df_salaries = df[df['txn_description']=='PAY/SALARY'].groupby('customer_id').sum()

Every customer in the dataset has a different frequency for receiving their salary, some get paid once a month, some every week. To find an average monthly salary we will sum up all their payments in the three months of data we have and divide it by three. To then get the annual salary we multiply the average monthly salary with 12.

annual_salaries = df_salaries['amount']/3*12 # annual salary for each customer

Predicting annual salary

First we will need to do some dummy encoding on categorical variables like ‘status’, ’txn_description’, ‘gender’, ‘state’ and ‘age_bin’.

Then check correlation between variabels and ‘annual_salary’.

num_feats.corr()['annual_salary'].sort_values(ascending=False)

annual_salary           1.000000
avg_trans_amount_ov     0.538656
med_balance             0.258076
balance                 0.257159
amount                  0.091111
avg_trans_amount        0.044312
num_large_trans        -0.045275
avg_weekly_trans_num   -0.079352
max_amount             -0.097739
no_trans_days          -0.172765
avg_weekly_purch_num   -0.189532
Name: annual_salary, dtype: float64

Here I realised avg_trans_amount_overall is suspiciously highly correlated. I decided to not use this as a feature for our predictive model because the calculation includes salary payments and we are trying to predict customers’ annual salary just by their purchasing behaviour, without knowing about their salary payments.

For testing, I chose to split our data to 70% train and 30% for test set.

Linear regression

Let’s build a simple linear regression model with sklearn to predict customers’ annual salary with features we defined earlier.

Our linear model valuation:

R-squared 0.5381890117545585
MAE: 13706.815968071387
MSE: 323705668.54002833
RMSE: 17991.822268464868

Plotting the predictions. Our model’s predictions are plotted in blue, perfect predictions in red.

Our model’s predictions are kind of following the trend, but still too scattered to be trusted. So our linear regression model did not perform the best and this would not be an accurate way to predict customers’ annual salaries. But let’s try using a different model.

Gradient boosting

Gradient boosting is an ensemble learner. This means it will create a final model based on a collection of individual decision tree models.

Our gradient boosting model performed noticably better than the linear regression earlier:

R-squared 0.9842118273179078
MAE: 2476.2601909654327
MSE: 11066694.217258077
RMSE: 3326.66412750943

Plotting the predictions. Our gradient boosting model’s predictions are plotted in blue, perfect predictions in red.

A gradient descent procedure is used to minimize the loss when adding trees. Our model’s loss is minimsed at around 400 trees.

Results

Our gradient boosting machine definitely outperformed the linear regression model. Should it be used to segment customers whose annual salary we do not have? Maybe not yet. We could definitely do better with more data (real data) and more parameter tuning.