CUSTOMER SEGMENTATION: ARVATO PROJECT

Nermin KIBRISLIOGLU UYSAL
13 min read · Dec 20, 2020

1. PROJECT OVERVIEW

Arvato is an international company that develops and implements innovative data-driven solutions for business customers worldwide. These include supply chain solutions, financial services, and Arvato Systems. In this project, we work with Arvato's financial services data to help a mail-order sales company develop a strategy to expand its customer base.

The research question we pursue is ‘How effectively can a mail-order sales company expand its customer base?’. To answer this question, we can use two strategies. The first is a customer segmentation analysis with an unsupervised learning model. The second is a supervised learning model that predicts the probability of individuals becoming customers. In this project, I used both methods to answer our research question.

This report consists of six sections. The project overview is followed by a methods section in which the data sets and data analysis procedures are explained. In the third section, the preprocessing analysis is presented. In the fourth and fifth sections, the results of the customer segmentation and supervised learning models are discussed. In the last section, a summary is provided, and limitations and possible improvements are discussed.

2. METHOD

2.1.Data sets

There are four data files associated with this project:

· Udacity_AZDIAS_052018.csv: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).

· Udacity_CUSTOMERS_052018.csv: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).

· Udacity_MAILOUT_052018_TRAIN.csv: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 (columns).

· Udacity_MAILOUT_052018_TEST.csv: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 (columns).

Each row of the demographics files represents a single person and also includes information beyond the individual, such as their household, building, and neighborhood.

The “CUSTOMERS” file contains three extra columns (‘CUSTOMER_GROUP’, ‘ONLINE_PURCHASE’, and ‘PRODUCT_GROUP’), which provide broad information about the customers depicted in the file. The original “MAILOUT” file included one additional column, “RESPONSE”, which indicated whether each recipient became a customer of the company or not. This column enables us to run a supervised learning model.

Two supplementary files provide additional information about the features, their meanings, and their coding.

· DIAS Information Levels — Attributes 2017.xlsx: A top-level list of attributes and descriptions, organized by informational category.

· DIAS Attributes — Values 2017.xlsx: A detailed mapping of data values for each feature in alphabetical order.

2.2.Data Analysis

This project consists of two main parts: customer segmentation and supervised learning.

2.2.1. Customer segmentation

Customer segmentation divides customers into groups based on demographic variables, yielding distinct groups that enable different marketing strategies. In this project, customer segmentation was conducted in three steps. First, I examined the customer and general population distributions for selected demographics. Second, I conducted a principal component analysis (PCA) for dimension reduction. Lastly, I used K-means clustering, an unsupervised learning model, to create customer clusters and compared the percentage of individuals in each cluster between the two populations to understand which clusters are more likely to contain future customers.

2.2.2. Supervised Learning

Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. Depending on the output, one can use different methods. In our data, the output is binary: responded or did not respond. The performance of the methods also varies between data sets, so to select the best performing model, I tried several methods.

2.2.2.1.Dealing with unbalanced responses

When the class weights are not balanced, it becomes tricky to build a proper ML model that predicts the smaller class. There are two approaches to this problem: balancing the class distribution with sampling techniques and using cost-sensitive models. In this project, I compared sampling methods and cost-sensitive training methods.

Sampling methods:

Sampling methods can be classified into two groups: oversampling and under-sampling techniques. Oversampling methods duplicate cases in the minority class or synthesize new cases from examples in the minority class. Some oversampling techniques are Random Oversampling, the Synthetic Minority Oversampling Technique (SMOTE), Borderline-SMOTE, Borderline Oversampling with SVM, and Adaptive Synthetic Sampling (ADASYN). In this project, I used random oversampling, SMOTE, and ADASYN. Under-sampling methods, on the other hand, delete or select a subset of examples from the majority class. Some under-sampling techniques are Random Under-sampling, the Condensed Nearest Neighbor Rule (CNN), Near Miss Under-sampling, and Tomek Links Under-sampling. Under-sampling techniques work best for large data sets with a sufficiently large minority class. Hence, I did not use an under-sampling technique in this project, as my minority class is not big enough. You can find detailed information about sampling techniques here.
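As a minimal, self-contained sketch of the oversampling step, the snippet below applies the three techniques with the imbalanced-learn library; the synthetic data from make_classification merely stands in for the project's cleaned features.

```python
# Sketch: oversampling an imbalanced binary data set with imbalanced-learn.
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN

# Imbalanced toy data: roughly 1% positive class, loosely mimicking the RESPONSE rate.
X, y = make_classification(n_samples=20000, n_features=20,
                           weights=[0.99], random_state=42)

samplers = {
    "random_oversampling": RandomOverSampler(random_state=42),
    "smote": SMOTE(random_state=42),
    "adasyn": ADASYN(random_state=42),
}

for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X, y)  # returns a class-balanced copy
    print(name, round(y_res.mean(), 3))        # positive-class share after resampling
```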

Prediction Methods:

The most commonly used technique for predicting binary outcomes is logistic regression. In this project, I used standard logistic regression with the resampled data and class-weighted logistic regression with the imbalanced data.

Other prediction techniques work better with imbalanced data. I used a random forest classifier (RFC) and an extreme gradient boosting classifier (XGBoost). RFC is a tree-based algorithm that builds decision trees from random samples of the data and combines their predictions by majority vote. RFC is a robust and accurate technique, but it is slow. A gradient boosting classifier trains many models sequentially; it is a numerical optimization algorithm in which each new model minimizes the loss function of the ensemble. XGBoost is an advanced and more efficient implementation of gradient boosting, which is why I used it in this project.
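Below is a sketch of how these cost-sensitive estimators could be set up; the hyperparameter values are illustrative placeholders rather than the settings used in the project.

```python
# Sketch: class-weighted logistic regression, RFC, and XGBoost classifiers.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# class_weight='balanced' reweights the loss inversely to the class frequencies.
log_reg = LogisticRegression(class_weight="balanced", max_iter=1000)
rfc = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=42)

# For XGBoost, scale_pos_weight plays a similar role; it is typically set to
# roughly the negative/positive ratio. The value 80 is only a placeholder.
xgb = XGBClassifier(n_estimators=200, scale_pos_weight=80, random_state=42)
```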

2.2.2.2.Evaluation Metrics

One needs a metric to evaluate the performance of a model. The commonly used metrics are accuracy, precision, recall, F1 score, and AUC; they are explained in Table 1.

While accuracy is a handy and commonly used metric for evaluating an ML model, it can be misleading when the class weights are not balanced, so it is always good to check the other metrics. In this project, we have a very unbalanced class distribution; hence, I used AUC as the main criterion for model selection.
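A toy illustration of why accuracy misleads here while AUC does not (the numbers are made up, not taken from the project data):

```python
# Accuracy vs. ROC AUC on a heavily imbalanced toy problem.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0] * 990 + [1] * 10)        # 1% positive class
always_negative = np.zeros(1000)               # a useless "model" that never predicts a customer

print(accuracy_score(y_true, always_negative)) # 0.99, looks great
print(roc_auc_score(y_true, always_negative))  # 0.5, no better than chance
```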

3. DATA PREPROCESSING

3.1.Understanding the data sets

In this part, I summarize the basic data cleaning and preprocessing steps. I used six different data sets, which are explained in detail in the Data sets section. The main data sets are the general population data and the customer data. A small preview of each data set is shown in Figure 1.

(a) General population data preview
(b) Customer data preview

Figure 1. Main data previews

The general population data consists of 891,221 rows and 366 columns, while the customer data consists of 191,652 rows and 369 columns. Before diving into data cleaning, we need to understand each variable and the meanings of its codes. For this purpose, I examined the two supplementary files: the information levels data and the attributes data. A preview of each is shown in Figure 2. The information levels data lists the attribute column names and their descriptions, while the attributes data lists the column names and the value meanings of each variable.

As a first step in understanding the data files, I checked whether we have information for all columns in the general population and customer data. For this purpose, I checked whether there are variables in the main data files that are not included in the attributes data. I found 8 columns in the general population data for which we have no information about attribute levels or encodings. Of those 8 variables, 4 were included in the information levels data. Those variables were coded 1 to 10; however, we have no information about the meanings of these codes. Moreover, between 76% and 99% of the observations fell into a single category (category number 10). As these variables do not provide much variance, I dropped them as well. In the end, we had 272 columns for both the customer and general population data.
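A hedged sketch of this consistency check is below; the file paths, the semicolon separator, the header row, and the "Attribute" column name are assumptions about the raw files rather than details confirmed by the write-up.

```python
# Sketch: which columns in the main data have no entry in the attribute documentation?
import pandas as pd

azdias = pd.read_csv("Udacity_AZDIAS_052018.csv", sep=";")                 # assumed separator
attributes = pd.read_excel("DIAS Attributes - Values 2017.xlsx", header=1)  # assumed header row

documented = set(attributes["Attribute"].dropna().unique())
undocumented = [col for col in azdias.columns if col not in documented]
print(len(undocumented), "columns lack attribute documentation:", undocumented)
```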

(a) Information levels data preview
(b) Attributes data preview

Figure 2. Supplementary data previews

3.1.1. Understanding the Variables

Variable types are important, as each type requires a different approach. For this reason, I checked the data types to understand the variables. There are two columns with integer data types, 267 columns with float data types, and 3 columns with object data types. Binary-coded object columns were recoded as 0/1, and the remaining object columns were converted to dummy variables. The float columns, on the other hand, required further investigation. The procedure for handling the float columns is as follows (a small code sketch follows the list):

· Binary-coded values were recoded as 0 and 1.

· For variables with more than two categories, the category meanings were checked. Variables that are ordered categorical or continuous were kept as is. Those on a nominal scale were converted to dummy variables.

· Variables that duplicate other ones, i.e., the same measure with a smaller or larger number of categories, were dropped.
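A small sketch of these recoding rules, using a toy frame with columns modeled on the data's naming scheme (the actual value codes come from the DIAS attribute documentation, so the codes shown here are only illustrative):

```python
# Sketch: recode a binary object column and dummy-encode a nominal one.
import pandas as pd

df = pd.DataFrame({
    "OST_WEST_KZ": ["W", "O", "W"],           # binary object column (illustrative codes)
    "CAMEO_DEU_2015": ["1A", "2B", "1A"],     # nominal object column (illustrative codes)
    "ALTERSKATEGORIE_GROB": [1.0, 3.0, 4.0],  # ordered categorical, kept as is
})

# Binary-coded object columns: recode to 0/1.
df["OST_WEST_KZ"] = df["OST_WEST_KZ"].map({"W": 0, "O": 1})

# Nominal columns: expand into dummy variables.
df = pd.get_dummies(df, columns=["CAMEO_DEU_2015"])
print(df.head())
```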

3.2. Handling Missing Data

The attributes data defines missing/unknown/not possible category codes specific to each variable. Before working with missing data, all of these values were first converted to NumPy NaN values. The column-wise missing value distributions for the general population and customer data are shown in Figure 3.
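A hedged sketch of this conversion step; unknown_codes is a hypothetical per-column mapping built from the attributes file, not the project's actual dictionary.

```python
# Sketch: replace documented missing/unknown codes with NaN.
import numpy as np
import pandas as pd

def replace_unknown(df: pd.DataFrame, unknown_codes: dict) -> pd.DataFrame:
    """unknown_codes maps column name -> list of codes meaning missing/unknown,
    e.g. {"AGER_TYP": [-1, 0]} (hypothetical example)."""
    df = df.copy()
    for col, codes in unknown_codes.items():
        if col in df.columns:
            df[col] = df[col].replace(codes, np.nan)
    return df
```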

Figure 3. Column-wise missing value distribution

As shown in Figure 3, the median missing value percentage is 10% for the general population data and 27% for the customer data. We also observe that some columns have an extreme amount (more than 50%) of missing data. Therefore, during data cleaning, I deleted the columns with more than 30% missing values.

Figure 4 shows the row-wise missing value distributions, sorted in ascending order, for the general population and customer data. The median missing value percentage is 0.7% for the general population data and 0.06% for the customer data. However, the graphs show that some rows have extreme amounts of missing data. To protect data integrity, I dropped rows with more than 30% missing values.
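A small sketch of the 30% thresholds described above, applicable to either the general population or the customer DataFrame:

```python
# Sketch: drop columns, then rows, whose missing-value share exceeds a threshold.
import pandas as pd

def drop_sparse(df: pd.DataFrame, threshold: float = 0.30) -> pd.DataFrame:
    # Drop columns with more than `threshold` missing values...
    col_missing = df.isnull().mean()
    df = df.loc[:, col_missing <= threshold]
    # ...then drop rows with more than `threshold` missing values.
    row_missing = df.isnull().mean(axis=1)
    return df.loc[row_missing <= threshold]
```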

Figure 4. Row-wise missing value distribution

After dropping the columns and rows with extreme missing values, I imputed the median for the remaining missing values. At the end of data cleaning, imputation, and one-hot encoding, the general population data consists of 784,380 rows and 413 columns, while the customer data consists of 140,310 rows and 404 columns.
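A sketch of the median imputation, assuming all remaining columns are numeric after the earlier recoding and one-hot encoding; an equivalent pandas fillna would work just as well.

```python
# Sketch: median imputation of the remaining missing values.
import pandas as pd
from sklearn.impute import SimpleImputer

def impute_median(df: pd.DataFrame) -> pd.DataFrame:
    imputer = SimpleImputer(strategy="median")
    return pd.DataFrame(imputer.fit_transform(df),
                        columns=df.columns, index=df.index)
```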

4. CUSTOMER SEGMENTATION

4.1.Descriptive Analysis

Before clustering, I created some exploratory graphics to understand the distribution of demographics in the general population and the customer population. The customer and general population distributions by gender, age, and social status are shown in Figure 5.

Figure 5. Distribution of demographics

As shown in Figure 5, the percentage of males is higher in the customer group than in the general population. Similarly, the percentages of people older than 60, of people with high income, and of top earners are higher in the customer group than in the general population.

4.2.Principal Component Analysis

To simplify the data and reduce its dimensionality, a PCA was conducted. The number of components versus the cumulative explained variance is shown in Figure 6. I selected 250 components, as that is the point at which more than 90% of the variance is explained.
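A sketch of this step is below; it assumes the features were standardized before the PCA, a common preliminary that the write-up does not spell out.

```python
# Sketch: standardize, fit PCA, and check the cumulative explained variance.
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def fit_pca(X, n_components=250):
    X_scaled = StandardScaler().fit_transform(X)
    pca = PCA(n_components=n_components)
    X_pca = pca.fit_transform(X_scaled)
    print("cumulative explained variance:", pca.explained_variance_ratio_.sum())
    return pca, X_pca
```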

Figure 6. Cumulative explained variance vs. the number of components

4.3.Cluster Analysis

K-means clustering was used as the unsupervised learning model. To decide on an optimal number of clusters, I tried cluster sizes from 2 to 70. The resulting model scores are shown in Figure 7, from which I decided to use 10 clusters.


Figure 7. Cluster size versus model score
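A sketch of the cluster-size sweep behind Figure 7; MiniBatchKMeans is an assumption made here to keep the loop tractable, since the write-up does not name the exact K-means variant or scoring used.

```python
# Sketch: fit K-means for a range of cluster sizes and record the model score.
from sklearn.cluster import MiniBatchKMeans

def kmeans_scores(X_pca, k_values=range(2, 71)):
    scores = {}
    for k in k_values:
        model = MiniBatchKMeans(n_clusters=k, random_state=42)
        model.fit(X_pca)
        # .score() returns the negative within-cluster sum of squares,
        # so negating it gives the inertia plotted against k.
        scores[k] = -model.score(X_pca)
    return scores
```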

The percentages of each cluster in the general population and the customer population are shown in Figure 8. As shown in the figure, the cluster percentages are higher in the first, second, third, and tenth clusters in the customer population. For the remaining clusters, the percentages in the customer population are lower than in the general population.

Figure 8. Cluster distribution: general population vs. customer population

5. SUPERVISED LEARNING MODEL

For the supervised learning model, we are given two separate data sets for training and testing. For the Kaggle competition, the testing data does not have a response column. To be able to test my models, I divided the training data into train and test splits. I cleaned both the training and testing data as explained in the data preprocessing section. Initially, the training data consisted of 42,962 rows and 367 columns; after cleaning, there were 34,987 rows and 413 columns. The test data had 42,833 rows and 366 columns; after cleaning, there were 34,980 rows and 413 columns.

As we have an imbalanced class distribution, I used two different approaches: sampling and cost-sensitive training algorithms. For the sampling strategy, I used three sampling methods (random oversampling, SMOTE, and ADASYN) and compared their performance against the imbalanced data using three prediction methods: LR, RFC, and XGBoost. For each sampling method, I used two different perspectives to test model integrity. First, I resampled the whole data set and then trained and tested the model on the resampled data. The results for the resampled train and test sets are shown in Figure 9(a).

(a) Testing with sampled data
(b) Testing with imbalanced data

Figure 9. ROC AUC values for testing with sampled and imbalanced data

Figure 9(a) shows the AUC values for each sampling method under different estimators. The results show that the sampling techniques outperformed training on the imbalanced data. Among the estimation methods, XGBoost seems to perform best.

My second approach to testing the sampling methods was to resample only the training data and test model performance on the imbalanced data. I conducted this analysis because I was supposed to use imbalanced data for testing. The results of this analysis are reported in Figure 9(b), which shows that the sampling methods performed similarly to the algorithms trained on imbalanced data when tested on imbalanced data. Moreover, the AUC results were around 0.50. To increase the models' AUC scores, I performed a parameter search and cross-validation on the RFC and XGBoost models.
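A sketch of this second testing perspective is below; for brevity it uses only logistic regression as the estimator, and all variable names are placeholders rather than the project's actual names.

```python
# Sketch: resample only the training split, then score ROC AUC on the
# untouched, imbalanced test split.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN

def compare_samplers(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=42)
    results = {}
    for name, sampler in [("random", RandomOverSampler(random_state=42)),
                          ("smote", SMOTE(random_state=42)),
                          ("adasyn", ADASYN(random_state=42))]:
        X_res, y_res = sampler.fit_resample(X_tr, y_tr)     # resample training data only
        clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
        proba = clf.predict_proba(X_te)[:, 1]
        results[name] = roc_auc_score(y_te, proba)           # evaluate on imbalanced test set
    return results
```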

5.1. Parameter tuning and cross-validation

For cross-validation, I used the stratified shuffle split method with 5 folds. For the RFC method, I tuned six parameters, resulting in fitting 5 folds for each of 360 candidates, a total of 1,800 fits. For XGBoost, I tuned five parameters, resulting in fitting 5 folds for each of 240 candidates, a total of 1,200 fits. The results for each model before and after parameter tuning are shown in Table 2.
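A sketch of the tuning setup for the RFC model; the parameter grid below is illustrative and much smaller than the 360-candidate grid described above.

```python
# Sketch: grid search with 5-fold stratified shuffle split, scored by ROC AUC.
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.ensemble import RandomForestClassifier

cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
param_grid = {
    "n_estimators": [200, 500],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid, scoring="roc_auc", cv=cv, n_jobs=-1)
# search.fit(X_train, y_train); search.best_score_ then gives the tuned CV AUC.
```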

The cross-validated grid search improved RFC’s AUC score to 0.60 and XGBoost’s AUC score to 0.57. I continued with the best RFC model, as it provided the higher AUC score.

6. SUMMARY and FINAL NOTES

In this project, I tried to answer the question ‘How effectively can a mail-order sales company expand its customer base?’. To address it, I used two perspectives: customer segmentation and supervised learning.

The customer segmentation results show that there are 10 clusters. The percentage of individuals in each cluster differs between the customers and the general population, which enables us to select our targets accordingly.

For the supervised learning part, seven different models were tested under three conditions (resampled train and test data, resampled training data with an imbalanced test set, and fully imbalanced data), and two models were selected for further improvement. The results are summarized in Table 3.

There were two major challenges in this project. The first was understanding the data files, the meanings of the variables, and the preprocessing. Because the type of each variable determines how we treat it (keep it as is, convert it to dummies, recode it as binary, etc.), I needed to understand the variables' nature. To handle this, I divided the variables by type and number of categories and then dug deeper to understand their nature.

The second and most challenging part of this project was the highly imbalanced response categories. Finding an approach that works well in this situation has its own challenges. To overcome this problem, I researched possible ways to handle imbalanced data and used three different methods. Although I managed to improve the predictions with a grid search and validated them with cross-validation, there is still room for improvement, as the final AUC score was 0.60.

Note: Under the terms and conditions, I only provided small screenshots of the data instead of providing the full data set.

License, terms, and conditions:

In addition to Udacity’s Terms of Use and other policies, your downloading and use of the AZ Direct GmbH data solely for use in the Unsupervised Learning and Bertelsmann Capstone projects are governed by the following additional terms and conditions. The big takeaways:

You agree to AZ Direct GmbH’s General Terms provided below and that you only have the right to download and use the AZ Direct GmbH data solely to complete the data mining task which is part of the Unsupervised Learning and Bertelsmann Capstone projects for the Udacity Data Science Nanodegree program.

You are prohibited from using the AZ Direct GmbH data in any other context.

You are also required and hereby represent and warrant that you will delete any and all data you downloaded within 2 weeks after your completion of the Unsupervised Learning and Bertelsmann Capstone projects and the program.

If you do not agree to these additional terms, you will not be allowed to access the data for this project.


The code and the detailed terms and conditions can be found here.
