Continuation of Udacity project
This time we will work through the project in Python using scikit-learn and see how the results differ from the previous iteration.
We’re tasked with predicting how much money a fictional company in the mail-order catalog business can expect to earn from sending out a catalog to new customers.
Overview
This project has previously been finished using Alteryx analytics software as project work for Udacity’s Predictive Analytics for Business nanodegree.
This task involves building a linear regression model with scikit-learn and applying the results to provide a recommendation to the fictional company's management.
We have two Excel files to work with:
1. customers.xlsx - existing customers
2. mailinglist.xlsx - new customers
The Business Problem
Last year the company sent out its first print catalog, and is preparing to send out this year’s catalog in the coming months. The company has 250 new customers from their mailing list that they want to send the catalog to. Determine how much profit the company can expect from sending a catalog to these customers. Management does not want to send the catalog out to these new customers unless the expected profit contribution exceeds $10,000.
Business and Data Understanding
- We need to predict whether this catalog campaign is a good idea. Should the company send these catalogs to new customers? How much revenue can the company expect if it sends the new catalog to these 250 customers? Is the campaign profitable considering all the costs and the possibility that a customer might not make a purchase at all?
- What data is needed to inform those decisions?
- Possible revenue from each client
- Cost of each catalog
- Average past sales
- Average gross margin for each product
- Average number of products purchased
- Probability that a customer will buy
- Customer segment
EDA
The exploratory data analysis showed that our existing customer data on responses to the last catalog is heavily imbalanced towards ‘No’.
customers['Responded_to_Last_Catalog'].value_counts()
No 2204
Yes 171
Name: Responded_to_Last_Catalog, dtype: int64
Because we are working with fake data, the new customers' probable response to the new catalog is already in our data, as a probability between 0 and 1. If we did not have this, we would have to estimate the probability with a logistic regression model, and training that model with such an imbalance (about 13:1) would be problematic.
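The imbalance ratio quoted above follows directly from the value counts. A quick check, using only the two counts printed earlier (the variable names are just for illustration):

```python
# Class counts copied from the value_counts() output above.
counts = {"No": 2204, "Yes": 171}

total = sum(counts.values())
ratio = counts["No"] / counts["Yes"]   # majority-to-minority ratio
yes_share = counts["Yes"] / total      # share of customers who responded

print(f"No:Yes ratio is about {ratio:.1f}:1")
print(f"Share of 'Yes' responses: {yes_share:.1%}")
```

With so few positive examples, a classifier can score well on accuracy by predicting ‘No’ for everyone, which is why the imbalance would matter for a logistic regression model.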
Choosing features
For our linear regression features we have kept Customer_Segment, Avg_Sale_Amount and Avg_Num_Products_Purchased. Customer_Segment is a categorical variable, so we need to convert it to dummy variables before it can be used in the model.
feats['Customer_Segment'].value_counts()
Store Mailing List 1108
Loyalty Club Only 579
Credit Card Only 494
Loyalty Club and Credit Card 194
Name: Customer_Segment, dtype: int64
Convert the four segment categories to dummy variables, dropping the first one so that it serves as the baseline:
seg_dummies = pd.get_dummies(feats['Customer_Segment'], prefix='Seg_', drop_first=True)
feats = pd.concat([feats, seg_dummies], axis=1)
feats = feats.drop('Customer_Segment', axis=1)
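To see what drop_first=True does, here is a small sketch on a toy frame. The four rows are hypothetical; only the column names and segment labels come from the data above. It uses prefix="Seg" rather than "Seg_", because pandas already inserts an underscore between the prefix and the category name.

```python
import pandas as pd

# Toy data: four hypothetical rows, one per segment label seen above.
toy = pd.DataFrame({
    "Customer_Segment": [
        "Store Mailing List",
        "Loyalty Club Only",
        "Credit Card Only",
        "Loyalty Club and Credit Card",
    ],
    "Avg_Num_Products_Purchased": [1, 3, 2, 6],
})

# drop_first=True drops the first category in sorted order ("Credit Card Only"),
# which then acts as the baseline segment (all dummy columns are 0 for it).
seg_dummies = pd.get_dummies(toy["Customer_Segment"], prefix="Seg", drop_first=True)
toy = pd.concat([toy.drop("Customer_Segment", axis=1), seg_dummies], axis=1)

print(list(seg_dummies.columns))
```

Dropping one category avoids perfectly collinear dummy columns, and the remaining coefficients are then read as differences from the baseline segment.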
Linear regression
We will use Avg_Sale_Amount from the existing customers’ data as the target variable, so that we can predict the average sale amount for new customers.
After splitting our data into training and test sets, the best linear regression equation was the following:
Y = 284.48 + 70.68 * Avg_Num_Products_Purchased - 144.33 * (If Type: Loyalty Club Only) + 267.64 * (If Type: Loyalty Club and Credit Card) - 231.33 * (If Type: Store Mailing List) + 0 * (If Type: Credit Card Only)
R-squared: 0.8485331100796489
MAE: 89.31308194968375
MSE: 17006.43868059899
RMSE: 130.40873697954055
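The fitting code is not shown here, so the following is only a minimal sketch of the split, fit and evaluate steps with scikit-learn. It runs on synthetic stand-in data, and the split ratio, random seed and made-up coefficients are all assumptions, not the values used for the results above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): one numeric feature and one dummy column.
rng = np.random.default_rng(0)
n = 200
num_products = rng.integers(1, 10, size=n)
is_store_list = rng.integers(0, 2, size=n)
X = np.column_stack([num_products, is_store_list])
# Target built from made-up coefficients plus noise, just to have something to fit.
y = 300 + 70 * num_products - 230 * is_store_list + rng.normal(0, 50, size=n)

# test_size and random_state are assumptions; the write-up does not state them.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mse = mean_squared_error(y_test, pred)
r2 = r2_score(y_test, pred)
print("R-squared:", r2)
print("MAE:", mean_absolute_error(y_test, pred))
print("MSE:", mse)
print("RMSE:", np.sqrt(mse))  # RMSE is the square root of MSE
```

The metrics are computed on the held-out test set, so they describe how the model does on customers it was not trained on.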
feats['Avg_Sale_Amount'].describe()
count 2375.000000
mean 399.774093
std 340.115808
min 1.220000
25% 168.925000
50% 281.320000
75% 572.400000
max 2963.490000
Name: Avg_Sale_Amount, dtype: float64
Average sales for new customers
Now we are ready to apply our linear regression model to the new customers’ data and then multiply each customer’s predicted revenue by that customer’s probability of making a purchase:
Probable_Revenue = (Predicted_Avg_Sale_Amount) * (Answer_Yes)
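As a worked illustration, the equation from the Linear regression section and the formula above can be combined in a few lines of plain Python. The coefficients are copied from that equation; the function names and the example customer are hypothetical.

```python
# Coefficients copied from the fitted equation in the Linear regression section.
INTERCEPT = 284.48
COEF_NUM_PRODUCTS = 70.68
SEGMENT_COEF = {
    "Credit Card Only": 0.0,  # baseline segment
    "Loyalty Club Only": -144.33,
    "Loyalty Club and Credit Card": 267.64,
    "Store Mailing List": -231.33,
}

def predict_avg_sale(num_products, segment):
    """Predicted Avg_Sale_Amount from the linear equation."""
    return INTERCEPT + COEF_NUM_PRODUCTS * num_products + SEGMENT_COEF[segment]

def probable_revenue(num_products, segment, answer_yes):
    """Predicted sale amount weighted by the probability of a purchase."""
    return predict_avg_sale(num_products, segment) * answer_yes

# Hypothetical customer: 3 products, Loyalty Club Only, 30% chance of buying.
example = probable_revenue(3, "Loyalty Club Only", 0.30)
print(round(example, 2))
```

Weighting by Answer_Yes turns each prediction into an expected value, so customers who are unlikely to buy contribute little to the total.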
Results
Does the expected profit contribution exceed $10,000?
The probable revenue is multiplied by the average gross margin of all products, which is 50%, and the cost of each catalog, which is $6.50, is then deducted:
Probable_Revenue * 0.5 - 6.50
prob_revenue['Rev_Minus_cost'] = (prob_revenue['Prob_Revenue']*0.5)-6.50
prob_revenue['Rev_Minus_cost'].sum()
22012.86333703793
The expected profit from sending the new catalog to the given 250 customers is $22,012.86, which exceeds the $10,000 threshold.
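The same arithmetic can be sketched without pandas. The margin, catalog cost and threshold come from the text above; the three sample revenues are hypothetical and only illustrate the per-customer calculation.

```python
GROSS_MARGIN = 0.5    # average gross margin stated above
CATALOG_COST = 6.50   # cost per catalog stated above
THRESHOLD = 10_000    # management's minimum expected profit

def expected_profit(probable_revenues):
    """Sum of per-customer margin on probable revenue minus the catalog cost."""
    return sum(r * GROSS_MARGIN - CATALOG_COST for r in probable_revenues)

# Hypothetical probable revenues for three customers, for illustration only.
sample = [105.66, 240.00, 12.00]
sample_profit = expected_profit(sample)
print(round(sample_profit, 2))

# Decision check using the total reported above.
reported_total = 22012.86
print(reported_total > THRESHOLD)
```

Note that a customer with low probable revenue can contribute a negative amount, because the catalog cost is paid whether or not the customer buys.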
Notes
The results are almost the same as those from the previous linear regression model built in Alteryx on the same data.
The R-squared values were slightly different: the Alteryx model gave 0.8369 and our scikit-learn model gave 0.8485.
Alteryx predicted the final profit to be $21,987.44, a difference of $25.42.