Google Analytics Customer Revenue Prediction
Predict how much GStore customers will spend
The 80/20 rule has proven true for many businesses: only a small percentage of customers produce most of the revenue. As such, marketing teams are challenged to make appropriate investments in promotional strategies.
Google has provided the Merchandise Store customer dataset, including the number of transactions per customer. The task is to build a predictive model on this GStore dataset that predicts the total revenue per customer, which helps in better use of the marketing budget, and to interpret the elements that most affect the revenue prediction using different models.
Table of Contents:
- Prerequisites
- Introduction
- Understanding Business Problem
- Real-world/Business objectives and Constraints
- About Data
- About features
- Performance Metric
- Machine Learning Problem Formulation
- Data Loading and Data Preprocessing
- Exploratory Data Analysis
- Feature Engineering
- Machine Learning Models and Hyper Parameter Tuning
- Feature Importance
- Results of Machine Learning Models
- Summary of Machine Learning Models
- Deep Learning Models
- Results of Deep Learning Models
- Ensemble Model
- Results of Ensemble Model
- References
1. Prerequisites:
This post assumes familiarity with data preprocessing, exploratory data analysis, performance metrics, machine learning, deep learning techniques such as CNNs, Python syntax, and libraries such as NumPy, Pandas, scikit-learn, Matplotlib, Seaborn, PrettyTable, TensorFlow, and Keras.
2. Introduction:
- Online marketing is growing day by day and has become a billion-dollar industry. Companies spend a lot of money targeting users who visit their website once, to encourage them to buy products. But a major part of the revenue comes from only 20% of the users. So instead of targeting everyone who visits the website, the marketing budget can be better utilized if they target only those users who are most likely to purchase a product in the future.
- The Google Merchandise Store has provided a dataset of its website traffic to help us predict which customers are most likely to make a transaction in the future, and the total revenue it will earn from each of those customers.
- This data was collected by Google Analytics, and the same model can be used by anyone who is using Google Analytics. The goal of this project is to preprocess the data and use various machine learning algorithms to predict the estimated total revenue the website will earn from a user who visits their website.
3. Understanding Business Problem:
3.1 Description:
- Source: https://www.kaggle.com/c/ga-customer-revenue-prediction
- Data: Provided by Google for the Kaggle competition
- Download ga-customer-revenue-prediction.zip from Kaggle.
3.2 Problem statement:
- The 80/20 rule has been observed in many businesses: about 80% of revenue is generated by only 20% of the potential customers.
- So our goal is to predict the revenue that will be generated by those potential customers in the near future.
- That way, marketing teams can invest an appropriate amount of money in promotional strategies to attract potential customers.
- In simple words, we are given users' past data and transactions (when they logged into the GStore).
- Using this data, we need to predict the future revenue that will be created by those customers.
- Google has provided the Merchandise Store customer dataset and the number of transactions per customer.
- We will build a predictive model on this GStore dataset to predict the total revenue per customer, which helps in better use of the marketing budget, and we will also interpret the elements that most affect the revenue prediction using different models.
4. Real-world/Business objectives and Constraints:
- No low-latency requirement.
- Only a very small percentage of customers produce most of the revenue.
- So we need to carefully analyze the revenue generated by customers.
5. About Data:
We have downloaded the data from the Kaggle competition page linked above.
- We need to download train_v2.csv and test_v2.csv.
- We will be predicting the target for all users in the posted test set: test_v2.csv, for their transactions in the future time period of December 1st 2018 through January 31st 2019.
- Both train_v2.csv and test_v2.csv contain the columns listed under Data Fields. Each row in the dataset is one visit to the store. Because we are predicting the log of the total revenue per user, be aware that not all rows in test_v2.csv will correspond to a row in the submission, but all unique fullVisitorIds will correspond to a row in the submission.
- IMPORTANT: Due to the formatting of fullVisitorId you must load the IDs as strings in order for all IDs to be properly unique!
- There are multiple columns which contain JSON blobs of varying depth. In one of those JSON columns, totals, the sub-column transactionRevenue contains the revenue information we are trying to predict. This sub-column exists only for the training data.
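The starter kernel in the references shows how to handle both points; below is a minimal loading sketch along those lines (the JSON column list and the helper name load_flat are our own choices):

import json
import pandas as pd
from pandas import json_normalize

JSON_COLS = ["device", "geoNetwork", "totals", "trafficSource"]

def load_flat(path, nrows=None):
    # fullVisitorId must be read as a string, or the IDs lose uniqueness.
    df = pd.read_csv(path, dtype={"fullVisitorId": str},
                     converters={c: json.loads for c in JSON_COLS}, nrows=nrows)
    # Flatten each JSON blob into "<column>.<subfield>" columns.
    for c in JSON_COLS:
        flat = json_normalize(df[c])
        flat.columns = [f"{c}.{sub}" for sub in flat.columns]
        df = df.drop(columns=[c]).join(flat)
    return df

train = load_flat("train_v2.csv")
test = load_flat("test_v2.csv")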
6. About features:
- fullVisitorId- A unique identifier for each user of the Google Merchandise Store.
- channelGrouping — The channel via which the user came to the Store.
- date — The date on which the user visited the Store.
- device — The specifications for the device used to access the Store.
- geoNetwork — This section contains information about the geography of the user.
- socialEngagementType — Engagement type, either “Socially Engaged” or “Not Socially Engaged”.
- totals — This section contains aggregate values across the session.
- trafficSource — This section contains information about the Traffic Source from which the session originated.
- visitId — An identifier for this session. This is part of the value usually stored as the _utmb cookie. This is only unique to the user. For a completely unique ID, you should use a combination of fullVisitorId and visitId.
- visitNumber — The session number for this user. If this is the first session, then this is set to 1.
- visitStartTime — The timestamp (expressed as POSIX time).
- hits — This row and nested fields are populated for any and all types of hits. Provides a record of all page visits.
- customDimensions — This section contains any user-level or session-level custom dimensions that are set for a session. This is a repeated field and has an entry for each dimension that is set.
- totals — This set of columns mostly includes high-level aggregate data.
6.1 External Data:
External data is permitted for this competition, per this forum post. This includes the Google Merchandise Store Demo Account. Although the Demo Account contains the predicted variable, final standings will not benefit from access to this external data, because it requires future-looking predictions.
7. Performance Metric:
Root Mean Squared Error (RMSE)
Submissions are scored on the root mean squared error.
RMSE is defined as:
$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}$
where $\hat{y}_i$ is the natural log of the predicted revenue for a customer and $y_i$ is the natural log of the actual summed revenue value plus one.
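As a quick sanity check, the metric is easy to compute ourselves. A minimal sketch, assuming targets are transformed with log1p to match the definition above:

import numpy as np

def rmse(y_true_log, y_pred_log):
    # Both inputs are natural-log values: ln(actual revenue + 1) and the
    # predicted log revenue respectively.
    y_true_log = np.asarray(y_true_log)
    y_pred_log = np.asarray(y_pred_log)
    return float(np.sqrt(np.mean((y_pred_log - y_true_log) ** 2)))

# Example: most customers spend nothing, so most targets are ln(0 + 1) = 0.
print(rmse(np.log1p([0.0, 120.5, 0.0]), [0.1, 4.5, 0.0]))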
8. Machine Learning Problem Formulation:
8.1 Type of Machine Learning Problem:
- Here we are going to predict the revenue generated by a customer (in dollars) when he visits the store.
- So we can pose this problem as a regression problem.
- Following some of the Kaggle discussions and winners' solutions, the problem is solved in the following way.
- They build a classification model that predicts whether the user will visit the store during the test period; then, if there is a chance that he visits the store, a regression model predicts the revenue that will be generated by that customer.
- Solving the problem as Classification + Regression is motivated by the Hurdle Model (https://seananderson.ca/2014/05/18/gamma-hurdle/).
8.2 Hurdle Model:
- This model is the preferred way of solving problems where the target variable contains far more zeroes than non-zero values.
- It recommends first classifying whether the value is going to be non-zero or not, and then predicting the amount.
- The solution implemented for this challenge is based on the above model.
- We will discuss this further during featurization and model building.
9. Data Loading and Data Preprocessing:
- Train dataset shape is : (1708337, 60)
- Test dataset shape is : (401589, 59)
- Here each record corresponds to one visit to the store.
10. Exploratory Data Analysis(EDA):
10.1 Percentage of Missing data:
Now we will analyze each feature with missing values to decide whether it is useful or not; if it is useful, we will then decide how to impute the missing values.
10.2 Target Feature Analysis:
As we already discussed, the 80/20 rule is confirmed by this graph: most of the transactions generated zero revenue, and only a few transactions had non-zero revenue.
10.3 Trend Analysis:
Number of visits over time: In our data each record corresponds to one visit.
In December 2017 the number of visits and the revenue rose drastically. This is a useful insight for the promotional team: they can invest more money in promotions during the month of December.
10.4 Channel grouping analysis:
- number of visits per each channel
- total revenue generated per each channel
Most revenue comes from 'Organic Search', 'Direct', and 'Referral', but the number of visits through 'Direct' and 'Referral' is comparatively small. The conclusion is that the analytics team can invest less money in the 'Direct' and 'Referral' channels (since fewer users visit through them) while still generating most of the revenue.
10.5 Web-Browser Analysis:
- number of visits per each browser
- total revenue generated per each browser
- It is very difficult to analyze all the browsers present in the train data (it contains some browsers we have never even heard of).
- So we analyze only the top 20 browsers ('top' here means by frequency of occurrence in the train data).
The number of visits through Chrome is very large compared with all other browsers. Most revenue comes from 'Chrome', 'Firefox', 'Safari', 'Internet Explorer', 'Edge', 'Opera', 'Samsung Internet', 'Android WebView', 'Amazon Silk', and 'YaBrowser'. The conclusion is that the analytics team can invest less money on users visiting the store through browsers other than Chrome (e.g. Safari, Firefox, Opera, Edge) and still generate most of the revenue.
10.6 Operating System analysis:
- number of visits per each operating system
- total revenue generated per each operating system
Most of the users visit through Windows and Macintosh, and most of the revenue is generated from those two systems. If we observe carefully, very few people (fewer than 100K) visit through Linux and Chrome OS, so the business team can spend very little money on promotions for these two OS platforms while still generating most of the revenue. Very importantly, fewer than 2,000 people visit the Merchandise site through Windows Phone, but they generate a good amount of revenue, so the analytics team can invest a small amount on Windows Phone and still earn good revenue.
10.7 Device Category Analysis:
- number of visits per each device
- total revenue generated per each device
Most of the users visit through desktop. The important observation here is that fewer than 68K people visit through tablets (significantly fewer than through other devices), yet they generate significantly higher revenue. So the analytics team can invest a small amount of money in promotions for users visiting the store through tablets and generate significantly higher revenue.
10.8 Mobile vs non-mobile analysis:
- number of visits with mobile and other than mobile
- total revenue generated with mobile and other than mobile
Most users come through non-mobile devices, and most of the revenue is generated from them. The number of mobile visitors is relatively small, but they also generate significantly good revenue compared to non-mobile users.
10.9 Continent Analysis:
- number of visits per each continent
- total revenue generated per each continent
The number of visits from the Americas is significantly higher than from other continents. Even though the number of visits from 'Oceania' and 'Africa' is low, these continents also generate a good amount of revenue, so it is better to invest in these two continents as well.
10.10 Traffic Source Analysis:
- number of visits per each traffic source
- total revenue generated per each traffic source
- It is very difficult to analyze all the traffic sources present in the train data (there are about 345 different traffic sources).
- So we analyze only the top 20 traffic sources ('top' here means by frequency of occurrence in the train data).
Most users visit through google, direct, and YouTube. Even though fewer users visit through 'reddit', 'analytics.google.com', 'yahoo', and 'Facebook', they also generate a good amount of revenue, so it is better to invest in these sources to gain maximum revenue.
11. Feature Engineering:
11.1 Impute missing values:
Here we impute zero for missing values in the target feature, since we already know that about 98% of transactions do not generate any money.
11.2 Convert Boolean Features:
11.3 Convert numerical features to float:
11.4 Label encoding for categorical features:
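Sections 11.1 to 11.4 can be summarized in one minimal sketch (the column lists below are illustrative assumptions; the real names come from the flattened dataset):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

BOOL_COLS = ["device.isMobile"]                   # illustrative
NUM_COLS = ["totals.hits", "totals.pageviews"]    # illustrative
CAT_COLS = ["channelGrouping", "device.browser"]  # illustrative

# 11.1: a missing target means no transaction, so impute zero.
train["totals.transactionRevenue"] = (
    train["totals.transactionRevenue"].astype(float).fillna(0.0))

# 11.2: booleans -> 0/1 integers.
for c in BOOL_COLS:
    train[c] = train[c].astype(bool).astype(int)

# 11.3: numeric columns arrive as strings inside the JSON blobs.
for c in NUM_COLS:
    train[c] = pd.to_numeric(train[c], errors="coerce").astype(float)

# 11.4: label-encode categoricals, fitted jointly over train and test.
for c in CAT_COLS:
    le = LabelEncoder()
    le.fit(pd.concat([train[c], test[c]], ignore_index=True).astype(str))
    train[c] = le.transform(train[c].astype(str))
    test[c] = le.transform(test[c].astype(str))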
11.5 Time-series featurization:
The most important task for this problem is time series featurization:
- Credits : “https://www.kaggle.com/c/ga-customer-revenue-prediction/discussion/82614”
- Since this is a regression problem where most target values are zero, we solve it using a hurdle model.
- Here I will discuss the entire methodology for this idea.
- Basically, Kaggle has given:
* train data time period: Aug 1st 2016 to Apr 30th 2018 => 638 days in total.
* test data time period: May 1st 2018 to Oct 15th 2018 => 168 days in total.
* prediction time period: Dec 1st 2018 to Jan 31st 2019 => 62 days in total.
- So we need to predict the revenue of users in the period of Dec 1st 2018 to Jan 31st 2019 by using the train and test data given to us.
- We have data until Oct 15th 2018, and the prediction period begins on Dec 1st 2018; the in-between period is called the "cooling period" and is 46 days long.
- So the idea is to first predict whether the user will come to the store after the 46-day cooling period (i.e. in the test period); for this we use a classification model.
- If he will come to the store, we then predict the revenue of that user using a regression model on the user's data (features).
- The next step is to build the data for the classification model in such a way that it replicates the real-world scenario.
What does replicating the real-world scenario mean?
It means the train data will consist of 168 days of data and the test data will consist of 62 days of data, and we maintain a 46-day gap between the train data end date and the test data beginning date. Using this train data, we need to predict whether each user will come to the store during the test data we prepared.
Example:
train data = Aug 1st 2016 to Jan 15th 2017 (168 days)
test data = Mar 2nd 2017 to May 3rd 2017 (62 days)
The gap between the train and test data is 46 days. Using the data we have, we can make 4 sets of train and test frames.
data set-1:
*train data = Aug 1st 2016 to Jan 15th 2017 (168 days)
*test data = Mar 2nd 2017 to May 3rd 2017 (62 days)
data set-2:
*train data = Jan 16th 2017 to Jul 2nd 2017 (168 days)
*test data = Aug 17th 2017 to Oct 18th 2017 (62 days)
data set-3:
*train data = Jul 3rd 2017 to Dec 17th 2017 (168 days)
*test data = Feb 1st 2018 to Apr 4th 2018 (62 days)
data set-4:
*train data = Dec 18th 2017 to Jun 4th 2018 (168 days)
*test data = Jul 20th 2018 to Sep 20th 2018 (62 days)
From the above data sets, for the users who are common to both the train and test frames (meaning they returned after the cooling period) we create a new feature 'is_returned' and set it to 1; for the users who did not return, we set 'is_returned' to 0.
We also create some new features for every user in the train frame, and finally we merge all these data frames.
- So now our target features are "is_returned" and "revenue".
- "is_returned" indicates whether the user will come to the store in the test period.
- "revenue" indicates the revenue generated by the user.
Note: I know this is difficult to understand on a first read, so here is a brief summary of the time-series featurization.
We decided to build a classification model and a regression model. The task of the classification model is to predict whether the user will come to the store or not; if he is not coming to the store, the revenue from that user is zero. Up to here we are clear.
But for building the classification model we do not have any labelled data, so we generate it ourselves: we divide the data on hand into a train frame and a test frame, replicating the real-world scenario (the cooling-period gap). If a user is present in both the train frame and the test frame, it means he came back to the store and we label that user '1'; if he is not present in the test frame, we label that user '0'. I hope it is clear now. A sketch of this frame-building step follows.
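A minimal sketch, assuming train_df is the flattened DataFrame with a parsed datetime 'date' column and a numeric revenue column (the window dates repeat data set-1 to data set-4 above; the aggregate choices and helper names are our own):

import pandas as pd

# 168-day train window, 46-day gap, 62-day test window per frame.
WINDOWS = [
    ("2016-08-01", "2017-01-15", "2017-03-02", "2017-05-03"),
    ("2017-01-16", "2017-07-02", "2017-08-17", "2017-10-18"),
    ("2017-07-03", "2017-12-17", "2018-02-01", "2018-04-04"),
    ("2017-12-18", "2018-06-04", "2018-07-20", "2018-09-20"),
]

def build_frame(df, tr_start, tr_end, te_start, te_end):
    tr = df[df["date"].between(tr_start, tr_end)]
    te = df[df["date"].between(te_start, te_end)]
    # Per-user aggregate features over the train window (extend as needed).
    users = tr.groupby("fullVisitorId").agg(
        visits=("visitId", "count"),
        past_revenue=("totals.transactionRevenue", "sum"))
    # Targets observed in the test window: return flag and future revenue.
    future = te.groupby("fullVisitorId")["totals.transactionRevenue"].sum()
    users["revenue"] = future.reindex(users.index, fill_value=0.0)
    users["is_returned"] = users.index.isin(future.index).astype(int)
    return users.reset_index()

frames = [build_frame(train_df, *w) for w in WINDOWS]
train_frame = pd.concat(frames, ignore_index=True)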
12 Machine Learning Models and Hyper Parameter Tuning:
12.1 LightGBM
Here we use two models to build the final prediction of revenue:
- Classification Model to predict whether customer would return during test window.
- Regression Model to predict transaction amount.
So the final value is:
predicted revenue = classification model output (probability) × regression model output (real value)
Note: If the predicted revenue is negative, we clip it to zero, since revenue cannot be negative.
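A minimal sketch of this combination (X_train, X_test, and the two target arrays are assumed to come from the time-series frames; the tuned parameter dicts come from the tuning in the next subsections):

import numpy as np
import lightgbm as lgb

clf = lgb.LGBMClassifier()  # plug in the tuned params via **best_clf_params
reg = lgb.LGBMRegressor()   # plug in the tuned params via **best_reg_params

clf.fit(X_train, y_is_returned)
returned = y_is_returned == 1
reg.fit(X_train[returned], y_log_revenue[returned])  # fit on returning users

# predicted revenue = P(return) * predicted log revenue, floored at zero
pred = clf.predict_proba(X_test)[:, 1] * reg.predict(X_test)
pred = np.clip(pred, 0, None)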
12.1.1 Hyper Parameters tuning for Classification Model:
After the above code snippet was executed, we got the following best hyper-parameters for our LightGBM classification model:
{'subsample': 0.9, 'reg_lambda': 0, 'reg_alpha': 1, 'objective': 'regression', 'num_leaves': 8, 'n_estimators': 100, 'min_child_samples': 20, 'metric': 'rmse', 'max_leaves': 128, 'learning_rate': 0.015, 'colsample_bytree': 1, 'boosting_type': 'gbdt'}
0.07100636089146617
12.1.2 Hyper Parameters tuning for Regressor Model:
After the above code snippet was executed, we got the following best hyper-parameters for our LightGBM regressor model:
{'subsample': 0.9, 'reg_lambda': 0, 'reg_alpha': 1, 'objective': 'regression', 'num_leaves': 8, 'n_estimators': 100, 'min_child_samples': 20, 'metric': 'rmse', 'max_leaves': 128, 'learning_rate': 0.015, 'colsample_bytree': 1, 'boosting_type': 'gbdt'}
0.07100636089146617
12.1.3 Final Model:
We build the classification and regression models with the best hyper-parameter values found above, run them multiple times (say 10), and take the average of the predictions from each iteration.
Now we will create the final submission.csv file with 'fullVisitorId' and 'PredictedLogRevenue' as columns.
Kaggle Score for the above CSV is,
12.2 Random Forest:
12.2.1 Hyper Parameter tuning for random forest classifier:
After the above code snippet was executed, we got the following best hyper-parameters for our Random Forest classification model:
{'n_estimators': 800, 'min_samples_split': 7, 'min_samples_leaf': 2, 'max_depth': 7}
0.9937555332169373
12.2.2 Hyper Parameter tuning for random forest regressor:
After the above code snippet was executed, we got the following best hyper-parameters for our Random Forest regressor model:
{'n_estimators': 200, 'min_samples_split': 5, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': 7, 'bootstrap': True}
0.08877597380775908
12.2.3 Final Model:
Now we will create the final submission.csv file with 'fullVisitorId' and 'PredictedLogRevenue' as columns.
Kaggle score for the above csv is,
Both Random Forest and LightGBM are giving similar results.
13. Feature Importance:
- Here we will see which features are really useful.
- By using only those features we can reduce the dimensionality of the data and the computation time.
- For that we use 'recursive feature elimination'.
13.1 Recursive feature elimination
- The idea of recursive feature elimination is similar to backward feature selection.
- First we specify a base model (the base model has to expose feature importances).
- The algorithm first trains the model on all features of the data set.
- It then takes the feature importance of every feature.
- By removing the least important features, it re-trains the model on the new feature set.
- This operation runs iteratively over different feature sets.
- Finally, the feature set giving the best accuracy is selected as our final set of features. A sketch is shown after the reference below.
https://medium.com/@aneesha/recursive-feature-elimination-with-scikit-learn-3a2cbdf23fb7
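A minimal sketch with scikit-learn's RFECV (the base-model choice and the X_train / y_train names are assumptions):

from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFECV

# The base model must expose feature_importances_ (or coef_).
base = RandomForestRegressor(n_estimators=100, random_state=42)
selector = RFECV(base, step=1, cv=3, scoring="neg_root_mean_squared_error")
selector.fit(X_train, y_train)

important = X_train.columns[selector.support_]
print(f"kept {selector.n_features_} of {X_train.shape[1]} features")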
14. Results of Machine Learning Models:
The above two models resulted in scores of 0.88257 and 0.88293 respectively on the private leaderboard, which would correspond to about rank 5 on the leaderboard.
15. Summary of Machine Learning Models:
- Reading data and dealing with json columns.
- Understanding business problem and metrics.
- Transforming the business problem into machine learning problem.
- We solve this problem for the advertisement team so that they can spend an appropriate amount in the appropriate areas.
- Analyzing the features and removing constant valued columns.
- Exploratory data analysis of each feature and writing observations.
- Data preprocessing and handling missing values.
- Feature engineering and time series featurization.
- Understanding about ‘hurdle model’ strategy.
- Building models.
- Trying out linear and non-linear models and hyper-parameter tuning.
- Feature importance using ‘recursive feature elimination’.
- Re-building the models on only important features.
16. Deep Learning Models:
16.1 MLP Models
16.1.1 MLP Classification Model
MLP Classification Model summary and total trainable parameters are
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_1 (Dense) (None, 256) 9472
_________________________________________________________________
dense_2 (Dense) (None, 128) 32896
_________________________________________________________________
dense_3 (Dense) (None, 1) 129
=================================================================
Total params: 42,497
Trainable params: 42,497
Non-trainable params: 0
_________________________________________________________________
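As a hedged reconstruction, this summary corresponds to a plain feed-forward network like the one below. The 9,472 parameters of dense_1 imply 36 input features, since 256 × (36 + 1) = 9,472; the activations, loss, and optimizer are assumptions, as the summary does not record them:

from tensorflow import keras
from tensorflow.keras import layers

clf = keras.Sequential([
    layers.Dense(256, activation="relu", input_shape=(36,)),  # 256*(36+1) = 9,472
    layers.Dense(128, activation="relu"),                     # 128*(256+1) = 32,896
    layers.Dense(1, activation="sigmoid"),                    # predicts is_returned
])
clf.compile(optimizer="adam", loss="binary_crossentropy")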
16.1.2 MLP Regressor Model
MLP Regressor Model summary and total trainable parameters are,
Model: "sequential_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_4 (Dense) (None, 256) 9472
_________________________________________________________________
dense_5 (Dense) (None, 128) 32896
_________________________________________________________________
dense_6 (Dense) (None, 1) 129
=================================================================
Total params: 42,497
Trainable params: 42,497
Non-trainable params: 0
_________________________________________________________________
Kaggle Score of MLP Model is
16.2 CNN Model
16.2.1 CNN Classification Model:
CNN Classification Model Summary and Trainable parameters are,
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv1d (Conv1D) (None, 35, 64) 192
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 17, 64) 0
_________________________________________________________________
flatten (Flatten) (None, 1088) 0
_________________________________________________________________
dense (Dense) (None, 50) 54450
_________________________________________________________________
dense_1 (Dense) (None, 1) 51
=================================================================
Total params: 54,693
Trainable params: 54,693
Non-trainable params: 0
_________________________________________________________________
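A hedged reconstruction of this architecture: the 192 Conv1D parameters with output shape (35, 64) imply kernel_size=2 over a single-channel input of length 36, since 2 × 1 × 64 + 64 = 192; activations and loss are again assumptions:

from tensorflow import keras
from tensorflow.keras import layers

cnn = keras.Sequential([
    layers.Conv1D(64, kernel_size=2, activation="relu",
                  input_shape=(36, 1)),        # -> (35, 64), 192 params
    layers.MaxPooling1D(pool_size=2),          # -> (17, 64)
    layers.Flatten(),                          # -> 17*64 = 1,088
    layers.Dense(50, activation="relu"),       # 1,088*50+50 = 54,450
    layers.Dense(1, activation="sigmoid"),     # 51 params
])
cnn.compile(optimizer="adam", loss="binary_crossentropy")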
16.2.2 CNN Regressor Model:
CNN Regressor Model Summary and Trainable parameters are,
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv1d (Conv1D) (None, 35, 64) 192
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 17, 64) 0
_________________________________________________________________
flatten (Flatten) (None, 1088) 0
_________________________________________________________________
dense (Dense) (None, 50) 54450
_________________________________________________________________
dense_1 (Dense) (None, 1) 51
=================================================================
Total params: 54,693
Trainable params: 54,693
Non-trainable params: 0
_________________________________________________________________
Kaggle Score of CNN Model
17. Results of Deep Learning Models:
- Here we are using simple MLP and CNN architectures for our data.
- The results are not significantly better, but the deep learning models give scores as good as the tree-based models.
18. Ensemble Model:
- Ensemble learning helps improve machine learning results by combining several models.
- Ensemble methods are meta-algorithms that combine several machine learning techniques into one predictive model in order to decrease variance (bagging).
- Bagging, also known as bootstrap aggregation, takes samples with replacement.
- Instead of passing the whole data to each model, we pass a subset of the data to each model in our ensemble architecture.
- The subsets are formed by sampling (with replacement) from the whole train data; this technique reduces the variance of the models.
- Here is my ensemble architecture:
Here I am not using a classification model; I use only a regression model to predict the revenue of the user.
Note: I am using only 3 base models here, but in the real world people use hundreds of base models to improve the results. A sketch of this setup follows.
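A minimal sketch of this bagged-regressor setup (the base-model choice and the X/y names are assumptions):

import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(42)
models = []
for _ in range(3):  # three base models here; hundreds are common in practice
    idx = rng.integers(0, len(X_train), size=len(X_train))  # bootstrap sample
    m = lgb.LGBMRegressor(n_estimators=100)
    m.fit(X_train.iloc[idx], y_log_revenue.iloc[idx])
    models.append(m)

# Average the base predictions and floor at zero, as before.
pred = np.clip(np.mean([m.predict(X_test) for m in models], axis=0), 0, None)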
Kaggle Score of Ensemble Model is,
19. Results of Ensemble Model:
Using the ensemble model for regression, we get a top 5% solution on the Kaggle leaderboard.
20. References:
- https://www.kaggle.com/c/ga-customer-revenue-prediction
- https://www.kaggle.com/julian3833/1-quick-start-read-csv-and-flatten-json-fields/note
- https://ieeexplore.ieee.org/document/8569801
- https://ieeexplore.ieee.org/document/8557149
- https://papers.nips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstr
- https://jessicaguo.me/google.html
- https://yihanpeng.com/project_memo.php?id=a
- https://www.researchgate.net/publication/337545834_Machine_Learning_Methods_f
- https://www.sas.com/content/dam/SAS/support/en/sas-global-forum-proceedings/2019/4050-2019.pdf
You can reach me at:-
GitHub Repository Link: https://github.com/charanhu/Google-Analytics-Customer-Revenue-Prediction
LinkedIn: https://www.linkedin.com/in/charanhu/
GitHub: https://github.com/charanhu