Customer Churn Prediction (Telecommunications Industry)
Introduction
The telecommunications industry is developing rapidly, as can be seen from how pervasively people use the internet to communicate. This has pushed many telecommunications companies to expand their internet services, leading to competition between providers. Customers are free to choose the provider that suits them best and can switch away from their previous provider, a phenomenon known as Customer Churn. Churn reduces revenue for telecommunications companies and is therefore important to address.
In this case, participants are provided with a training dataset containing 4250 samples. Each sample consists of 19 features and one boolean target variable “churn”, which indicates whether the customer will churn.
Workflow
The CRISP-DM (Cross Industry Standard Process for Data Mining) methodology is a robust and comprehensive data mining process model that outlines six major phases:
- Business Understanding
- Data Understanding
- Data Preparation
- Modeling
- Evaluation
- Deployment
Business Understanding
Description
Customer churn refers to the phenomenon where customers cease conducting business with a company or terminate their subscription to a service. It is a critical metric for businesses, particularly in industries such as telecommunications, subscription services, and financial services, where long-term customer relationships are essential. Understanding and managing customer churn is vital for maintaining revenue and achieving sustainable growth.
Purpose
Hence, this case focuses on predicting customer churn. It is important for companies to know this prediction so they can map out business strategies to retain customers. Accurate churn prediction is crucial for businesses to maintain customer satisfaction, retain customers, and minimize revenue loss.
Metric
The metric for this case is accuracy: the goal is an optimal model that predicts customer churn with high accuracy, defined as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Data Understanding
The dataset comes from the Kaggle competition Customer Churn Prediction 2020, with the following file descriptions:
- train.csv — the training set. Contains 4250 rows with 20 columns; 3652 samples (85.93%) belong to class churn=no and 598 samples (14.07%) belong to class churn=yes.
- test.csv — the test set. Contains 750 rows with 20 columns: the index of each sample and the 19 features (the target variable "churn" is missing).
These 20 columns are divided into two groups based on data type:
- Categorical data
- state = two-letter code of the US state of customer residence.
- area_code = three-digit area code.
- international_plan (yes, no) = whether the customer has an international plan.
- voice_mail_plan (yes, no) = whether the customer has a voice mail plan.
- churn (yes, no) = whether the customer churned; this is the target variable.
- Numerical data
- account_length = Number of months the customer has been with the current telco provider.
- number_vmail_messages = Number of voice-mail messages.
- total_day_minutes = Total minutes of day calls.
- total_day_calls = Total number of day calls.
- total_day_charge = Total charge of day calls.
- total_eve_minutes = Total minutes of evening calls.
- total_eve_calls = Total number of evening calls.
- total_eve_charge = Total charge of evening calls.
- total_night_minutes = Total minutes of night calls.
- total_night_calls = Total number of night calls.
- total_night_charge = Total charge of night calls.
- total_intl_minutes = Total minutes of international calls.
- total_intl_calls = Total number of international calls.
- total_intl_charge = Total charge of international calls.
- number_customer_service_calls = Number of calls to customer service.
Import Data and Packages
The libraries and packages used for data manipulation, cleaning, modeling, and evaluation are as follows:
# Data manipulation and cleaning
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
import seaborn as sns
import re
from sklearn.preprocessing import MinMaxScaler
# Evaluation
from sklearn import model_selection
import sklearn.metrics as sm
# Modeling
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
import pickle
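With the packages in place, the competition files can be loaded. A minimal sketch, assuming train.csv and test.csv sit in the working directory (names taken from the file description above); df and data_test are the names used in the rest of the notebook:
# Load the competition files described earlier
df = pd.read_csv('train.csv')        # 4250 samples, 19 features + churn
data_test = pd.read_csv('test.csv')  # 750 samples, id + 19 features

# Sanity-check the class balance quoted in the file description
print(df['churn'].value_counts(normalize=True))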
Exploratory Data Analysis
The following helper functions and class support the data visualizations, especially for adding labels to charts:
def pie_label(pct, x):
    # turn a wedge percentage back into an absolute count for the label
    value = int(pct / 100. * np.sum(x))
    return '{:d}\n({:.0f}%)'.format(value, pct)
def stacking_bar_label(data, ax):
    # annotate each segment of a stacked bar chart with its share of the bar
    # total (ax is the Axes the bars were drawn on, passed in explicitly
    # instead of being taken from the enclosing scope)
    for idx in data.index:
        start = 0
        for col in data.columns:
            value = data.loc[idx, col]
            total = data.loc[idx, :].sum()
            ax.text(
                x=idx,
                y=(start + value / 2) * 0.95,
                s=f'{round(100 * value / total, 1)}%',
                fontsize=15,
                ha='right',
                color='black',
                weight='bold'
            )
            start += value
def safe_division(numerator, denominator):
    # treat 0/0 as 0 so that empty histogram bins do not raise
    if numerator == 0 and denominator == 0:
        return 0
    elif denominator == 0:
        raise ValueError("Division by zero is not allowed")
    else:
        return numerator / denominator
class extract_data:
    # Pulls bar positions and heights out of a histogram Axes so that churn
    # percentages and bin totals can be written next to the bars.
    # `interval` is the number of bins: patches[:interval] are assumed to be
    # the churn bars and patches[interval:] the non-churn bars.
    def __init__(self, chart, interval):
        self.chart = chart
        self.bin_edges = self.chart.patches
        self.interval = interval
        self.x_data = [(bin_edge.get_x() + bin_edge.get_width() / 2) for bin_edge in self.bin_edges]
        self.y_data = [bin_edge.get_height() for bin_edge in self.bin_edges]
        self.x_interval = np.array(self.x_data[:interval])
        self.y_churn = np.array(self.y_data[:interval])
        self.y_total = np.array(self.y_data[:interval]) + np.array(self.y_data[interval:])
def hist_label(self):
for c, d, e in zip(self.x_interval, self.y_churn, self.y_total):
self.chart.text(
x=c
, y=d
, s=f'{round(100 * safe_division(d, e))}%'
, fontsize=12
, ha='left'
, va='center'
, color='darkred'
, weight='bold'
)
def line_label(self):
for x, y in zip(self.x_interval, self.y_total):
self.chart.text(
x=x
, y=y
, s='{:.0f}'.format(y)
, fontsize=12
, ha='right'
, va='baseline'
, color='black'
, weight='bold'
)
Customer Churn
Based on the data, 86% of customers stay with the provider’s services, while only 14% of customers churn.
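For reference, a pie chart like this can be produced with the pie_label helper; a minimal sketch (the figure styling in the original notebook may differ):
# Churn share as counts and percentages
x = df['churn'].value_counts()
fig, ax = plt.subplots()
ax.pie(x, labels=x.index, autopct=lambda pct: pie_label(pct, x))
ax.set_title('Customer Churn')
plt.show()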
Area Code: Customer Churn
area_code_415 is the area with the largest number of customers (2108), and this area has a small percentage of customer churn (14%).
International Plan: Customer Churn
Customers who have an international_plan tend to churn more, at 42%. This is very different from customers who don't have an international_plan, most of whom stay with the provider.
Voicemail Plan: Customer Churn
Among customers with a voice_mail_plan (yes), most stay (93%), while customers without a voice_mail_plan (no) churn at a comparatively high rate of 16%.
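The plan breakdowns can be reproduced with a crosstab plus the stacking_bar_label helper; a sketch (the helper labels bars by position, hence the numeric index):
plan_churn = pd.crosstab(df['voice_mail_plan'], df['churn'])
fig, ax = plt.subplots()
plan_churn.plot(kind='bar', stacked=True, ax=ax)
# bars sit at x = 0, 1, ... so label with a positional index
stacking_bar_label(plan_churn.reset_index(drop=True), ax)
plt.show()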
Distribution of Total Day Minutes: Customer Churn
The total_day_minutes column (variable) is normally distributed, with the peak of users having total daily call minutes in the range of 175 - 200. The percentage of customer churn looks quite high once the duration exceeds 250 minutes and dominates above 316 minutes, which is closely related to the total call cost charged by the provider.
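A histogram like this can be drawn and annotated with extract_data; a sketch, assuming the churn bars form the first block of patches on the Axes (which is what the class expects):
fig, ax = plt.subplots()
sns.histplot(data=df, x='total_day_minutes', hue='churn',
             multiple='stack', bins=30, ax=ax)
labels = extract_data(ax, 30)  # 30 bins per churn group
labels.hist_label()   # per-bin churn percentage
labels.line_label()   # per-bin customer total
plt.show()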
Distribution of Total Day Charge: Customer Churn
The total_day_charge values are normally distributed with an average of $30.0 per customer. For churned customers the distribution fluctuates and starts to increase once the charge exceeds $40.0; all customers churn (100%) when it exceeds $54.0, which points to the relationship that the greater the daily call charge, the more customers churn.
Distribution of Total International Charge: Customer Churn
Looking at the distribution of total_intl_charge per day, the average customer spends only $2.7 - $3.2, and the percentage of customer churn starts to increase once the total call cost rises above $3.0.
Breaking this down by the international_plan category (yes or no) shows very different results between customers who have an international_plan (yes) and those who don't (no). Having an international_plan should help customers, especially by reducing costs through affordable prices, so that they do not churn; however, the visualizations show that many customers with an international_plan (yes) still churn when the charge is above $3.0.
Distribution of Total International Calls: Customer Churn
Looking at total_intl_calls, most customer churn comes from customers who make fewer than 8 international calls per day; on average, customers make only 4 - 5 international calls per day, with the distribution skewed to the right. Many customers who make international calls churn not only because the charges are quite high, but also because they simply do not need many international calls per day. In contrast, the customers who stay with the provider genuinely need international calling, making more than 10 calls per day.
The distribution of total_intl_calls is right-skewed both for customers with an international_plan (yes) and for those without one (no), with most customers making 3 international calls per day. However, the churn percentage among customers with an international_plan (yes) is very high, especially for those making fewer than 8 total_intl_calls per day.
Distribution of Number Vmail Messages: Customer Churn
The majority of customers have no voice mail messages, as can be seen from the right-skewed distribution with its peak at 0; 16% of the customers without voice mail messages churn.
Breaking this down by voice_mail_plan (yes or no), the spike of number_vmail_messages at zero comes from customers who don't have a voice_mail_plan (no). For customers with a voice_mail_plan (yes), number_vmail_messages is normally distributed with an average of 30 voice messages per day, and their churn percentage is much lower than that of customers without a voice_mail_plan (no).
Distribution of Number Customer Service Calls: Customer Churn
For number_customer_service_calls, most users call only once, and the distribution is skewed to the right. Meanwhile, the percentage of customers who churn increases along with number_customer_service_calls. In particular, every customer with more than 8 service calls churns, which indicates that these customers experienced many problems with features or costs from the provider, or problems that could not be resolved, and so they churned.
Distribution of Numerical Data
Based on the visualizations, almost all of the numerical columns (variables) are normally distributed, with most of the data clustered around the mean, median, and mode: account_length, total_day_minutes, total_day_calls, total_day_charge, total_eve_minutes, total_eve_calls, total_eve_charge, total_night_minutes, total_night_calls, total_night_charge, total_intl_minutes, and total_intl_charge.
The remaining columns (variables), namely number_vmail_messages, total_intl_calls, and number_customer_service_calls, follow a positively skewed (right-skewed) distribution, where mean > median > mode.
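The mean > median > mode ordering can be verified directly; a quick sketch:
skewed_cols = ['number_vmail_messages', 'total_intl_calls',
               'number_customer_service_calls']
print(df[skewed_cols].skew())    # positive skew confirms the right tail
print(df[skewed_cols].mean(), df[skewed_cols].median())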
Density of Numerical Data
In the bivariate analysis, the density of churned customers is much lower than that of customers who stay across all the numerical columns (variables). However, in several columns (variables) the churn density trends higher with a fluctuating pattern: total_day_minutes, total_day_charge, and number_customer_service_calls.
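Density curves of this kind can be drawn with seaborn; a sketch for one of the fluctuating columns (common_norm=False normalizes each class separately so the smaller churn class remains visible):
fig, ax = plt.subplots()
sns.kdeplot(data=df, x='total_day_charge', hue='churn',
            common_norm=False, fill=True, ax=ax)
plt.show()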
Data Preparation
This is the stage of cleaning data and preparing it for input into the machine learning model. Data should be checked for missing values, duplicates, and outliers. Categorical data must also be converted into a format that can be utilized by machine learning models.
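A quick pass over these checks might look as follows (a sketch):
print(df.isna().sum())        # missing values per column
print(df.duplicated().sum())  # exact duplicate rows
print(df.describe())          # value ranges, to eyeball outliers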
Features Encoding
for col in ['international_plan', 'voice_mail_plan','churn']:
df[col] = df[col].map({'yes':1, 'no':0})
This changes the string values ('yes' or 'no') to integers (1 or 0) in the columns (variables) international_plan, voice_mail_plan, and churn.
Dimensionality Reduction
df['total_minutes'] = df['total_day_minutes'] + df['total_eve_minutes'] + df['total_night_minutes']
df['total_calls'] = df['total_day_calls'] + df['total_eve_calls'] + df['total_night_calls']
df['total_charge'] = df['total_day_charge'] + df['total_eve_charge'] + df['total_night_charge']
Dimensionality reduction is the process of decreasing the number of features/columns/variables without losing the information content of the data. Here it is done via aggregation: total_minutes, total_calls, and total_charge are each the sum of the corresponding day, evening, and night columns (variables) for minutes, calls, and charges. The international columns (variables) total_intl_minutes, total_intl_calls, and total_intl_charge are not included in this aggregation because they are related to, or correlated with, international_plan.
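For the reduction to actually shrink the feature set, the original day/evening/night columns are presumably dropped after aggregation; a sketch of that assumed step:
# assumed follow-up: remove the components that were just aggregated
df = df.drop(columns=[
    'total_day_minutes', 'total_eve_minutes', 'total_night_minutes',
    'total_day_calls', 'total_eve_calls', 'total_night_calls',
    'total_day_charge', 'total_eve_charge', 'total_night_charge'
])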
Standardization
# Scale the selected columns to the [0, 1] range (MinMaxScaler works per column)
scale_cols = [
    'number_vmail_messages', 'total_intl_minutes', 'total_intl_calls',
    'total_intl_charge', 'number_customer_service_calls',
    'total_minutes', 'total_calls', 'total_charge'
]
scaler = MinMaxScaler()
X[scale_cols] = scaler.fit_transform(X[scale_cols])
The values in number_vmail_messages, total_intl_minutes, total_intl_calls, total_intl_charge, number_customer_service_calls, total_minutes, total_calls, and total_charge are rescaled to lie between 0 and 1. This puts every column on a common range so that no feature dominates the models purely because of its units.
Feature Selection
The process of selecting columns (variables) that have a correlation with the target variable (churn) and separating the data to determine:
- dependent variable (y) with target variable (churn)
- independent variables (X), specifically referring to several columns (variables) that are correlated with the target variable (churn)
Based on the correlation of the candidate columns (independent variables X) with churn (dependent variable y), the columns (variables) with a unidirectional (positive) relationship, pushing customers toward churn, are: international_plan, total_charge, total_minutes, number_customer_service_calls, total_intl_minutes, and total_intl_charge.
The columns (variables) with an inverse relationship, associated with customers staying, are: total_calls, total_intl_calls, number_vmail_messages, and voice_mail_plan. These correlations and the resulting feature split can be computed as sketched below.
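A sketch of how this selection can be computed and applied; the exact threshold used in the original is not shown, so the feature list simply mirrors the columns named above (in the notebook this split evidently precedes the scaling step shown earlier):
# correlation of every numeric column with the target
print(df.corr(numeric_only=True)['churn'].sort_values())

# predictors named above and the target
feature_cols = [
    'international_plan', 'voice_mail_plan', 'number_vmail_messages',
    'total_intl_minutes', 'total_intl_calls', 'total_intl_charge',
    'number_customer_service_calls', 'total_minutes', 'total_calls',
    'total_charge'
]
X = df[feature_cols]
y = df['churn']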
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size = 0.3, random_state = 123)
Finally, the data is divided into training and testing sets, with 70% allocated for training and 30% for testing.
Modeling & Evaluation
# Model Evaluation Function
kfold = model_selection.KFold(n_splits=10, shuffle = True, random_state=123)
model_list = pd.DataFrame(columns=['model_ml', 'accuracy', 'recall', 'precision', 'f1_score', 'cross_val_score', 'std'])
# Function to filter out empty or all-NA DataFrames
def filter_empty_or_all_na(dfs):
filtered_dfs = []
for df in dfs:
if not df.empty and not df.isna().all().all():
filtered_dfs.append(df)
return filtered_dfs
# Function to evaluate models
def evaluation(y_test, y_pred, model, name_model):
global model_list
model_ml = name_model
accuracy = sm.accuracy_score(y_test,y_pred)*100.0
precision = sm.precision_score(y_test,y_pred)*100.0
recall = sm.recall_score(y_test,y_pred)*100.0
f1 = sm.f1_score(y_test, y_pred)*100.0
    result = model_selection.cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
    new_row = pd.DataFrame([{
        'model_ml': model_ml,
        'accuracy': accuracy,
        'recall': recall,
        'precision': precision,
        'f1_score': f1,
        'cross_val_score': result.mean() * 100.0,
        'std': result.std() * 100.0
    }])
    # skip empty/all-NA frames so the concat stays clean
    filtered_dfs = filter_empty_or_all_na([model_list, new_row])
    model_list = pd.concat(filtered_dfs, ignore_index=True)
print('Model {}'.format(model_ml))
print("Accuracy Score: %.3f%%" % (accuracy))
print("Precision Score: %.3f%%" % (precision))
print("Recall Score: %.3f%%" % (recall))
print("F1 Score: %.3f%%" % (f1))
print("cross_val_score: %.3f%% (%.3f%%)" % (result.mean()*100.0, result.std()*100.0))
print(sm.classification_report(y_test, y_pred))
sm.ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=['no churn', 'churn'])
To facilitate the evaluation and validation of the machine learning results, a function has been developed that simplifies this process and can be invoked repeatedly. It reports the following:
- model and name_model identify the fitted machine learning model being evaluated.
- Confusion matrix: a table that presents the performance of a classification model in terms of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
- Accuracy: the share of correct predictions, (TP + TN) / (TP + TN + FP + FN).
- Precision: the ratio of true positive predictions (TP) to all predicted positives (TP + FP).
- Recall: the ratio of true positive predictions (TP) to all actual positives (TP + FN).
- F1-score: the harmonic mean of precision and recall.
- Cross-validation: using K-fold with K = 10, the data is partitioned into folds and validation is repeated ten times, which guards against an overly optimistic single train/test split (overfitting).
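The individual models below were presumably trained and passed through this function along these lines (a sketch; hyperparameters are the scikit-learn defaults, and the random_state values are assumptions):
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'KNN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(random_state=123),
    'Random Forest': RandomForestClassifier(random_state=123),
    'SVM': SVC()
}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    evaluation(y_test, y_pred, model, name)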
Logistic Regression
The accuracy value of the Logistic Regression model, with correctly predicted data for both customer churn (True Positive) and non-churn (True Negative), is quite high at 88.08%. After conducting cross-validation 10 times, the accuracy value remains at 87.18%, with a standard deviation of 1.46%.
The precision value is notably high, with the percentage of customers who actually churn (TP) out of all customers predicted to churn (TP + FP) being 75.61%. Conversely, the recall value is significantly low, with the percentage of customers correctly predicted to churn (TP) compared to the total number of customers who actually churned (TP + FN) being 17.92%. Consequently, this discrepancy impacts the average percentage comparison of precision and recall (F1-Score), which is 28.97%.
KNN (K-Nearest Neighbors)
The accuracy value of the KNN (K-Nearest Neighbors) model, with correctly predicted data for both customer churn (TP) and no churn (TN), is notably high at 92.55%. After conducting cross-validation 10 times, the accuracy value remains at 92.61% with a standard deviation of 0.9%.
The precision value is notably high. Specifically, the percentage of customers who actually churn (True Positives) out of all customers predicted to churn (True Positives + False Positives) is 83.05%. Conversely, the recall value is relatively low. It indicates that the percentage of customers correctly predicted to churn (True Positives) compared to the total number of customers who actually churned (True Positives + False Negatives) is 56.64%. This disparity significantly impacts the average comparison of precision and recall, yielding an F1-Score of 67.35%.
Decision Tree
The accuracy value of the Decision Tree model, with correctly predicted data for both customer churn (TP) and no churn (TN), is quite high at 96.39%. After carrying out cross-validation 10 times, the accuracy value remains at 94.96% with a standard deviation of 0.81%. Based on these results, the single-split accuracy of the Decision Tree model (96.39%) falls outside the cross-validation range (94.15% - 95.77%), suggesting the single split is slightly optimistic.
The precision value is notably high, with the percentage of customers who actually churn (TP) out of all customers predicted to churn (TP + FP) standing at 87.70%. Meanwhile, the recall value, representing the percentage of customers who were correctly predicted to churn (TP) compared to all customers who actually churned (TP + FN), was 86.70%. The resulting F1-score, the harmonic mean of the two, is approximately 87.2%.
Random Forest
The accuracy value of the Random Forest model, with correctly predicted data for both customer churn (TP) and non-churn (TN), is very high at 97.80%. After conducting cross-validation 10 times, the accuracy value remains at 97.36%, with a standard deviation of 0.65%.
The precision value is very high: the percentage of customers who actually churn (TP) out of all customers predicted to churn (TP + FP) is 98.66%. The recall value, the percentage of customers who were correctly predicted to churn (TP) compared to all customers who actually churned (TP + FN), is 84.97%. The resulting F1-score, the harmonic mean of the two, is approximately 91.3%.
Support Vector Machines
The accuracy value of the Support Vector Machines model, with correctly predicted data for both customer churn (True Positives) and non-churn (True Negatives), is notably high at 92.31%. Additionally, after performing cross-validation 10 times, the accuracy value remains at 92.66% with a standard deviation of 1.07%.
The precision value is notably high, with the percentage of customers who actually churn (TP) out of all customers predicted to churn (TP + FP) being 91.21%. However, the recall value is relatively low, where the percentage of customers correctly predicted to churn (TP) compared to the total number of customers who actually churned (TP + FN) stands at 47.98%. Consequently, this affects the average percentage comparison of precision and recall (F1-Score), which is 62.88%.
Result
From these results, Random Forest is the best model, with the highest accuracy, precision, and F1-score in predicting customer churn.
# Model interpretation: refit Random Forest on the training data and persist it
RF = RandomForestClassifier()
forest = RF.fit(X_train, y_train)
pickle.dump(forest, open('forest.pkl', 'wb'))

# Reload the model and score the held-out competition test set
# (X_data_test: the test.csv features prepared with the same encoding,
#  reduction, and scaling as the training data; data_test: the raw test.csv frame)
forest_model = pickle.load(open('./forest.pkl', 'rb'))
churn = forest_model.predict(X_data_test)
data_test['churn'] = churn
data_test['churn'] = data_test['churn'].map({1: 'yes', 0: 'no'})
data_test.head()
According to the prediction on the test set, 82% of customers will continue using the provider’s services, while the remaining 18% are expected to churn.
Conclusion
Only 14% of the provider's customers churn overall, and this churn is influenced by several factors:
- Of the 396 customers who have an international_plan, 42% churn.
- Customers who don't have a voice_mail_plan (no) churn at a high rate: 16.4% of 3138 users.
- Customers with a call duration of more than 316 minutes per day incur a high charge of more than $54.0, which points to the relationship that the greater the daily call charge, the more customers churn.
- An international_plan should help customers reduce their total_intl_charge through affordable prices, but in practice many customers who make international calls still churn. The churn percentage among customers with an international_plan (yes) is very high, especially for those making fewer than 8 total_intl_calls per day. Besides the relatively high charges, these customers simply do not need many international calls per day.
- 16% of the 3138 customers with no voice mail messages churn, while for customers with a voice_mail_plan (yes), number_vmail_messages is normally distributed with an average of 30 voice messages per day, and their churn percentage is much lower.
- The churn percentage rises with number_customer_service_calls. In particular, every customer with more than 8 service calls churns, indicating problems with features or costs from the provider, or problems that could not be resolved, so the customer churns.
Recommendation
After predicting which customers will churn, several recommendations can be made to prevent or reduce customer churn:
- Classify customers who are predicted to churn by assigning labels, so retention efforts can be prioritized.
- Provide a subscription promo for daily calling plans of 3 hours (180 minutes) and 6 hours (360 minutes) with charges of $30.0 and $50.0 (to be discussed with the business team).
- Create a comprehensive plan for a voice message feature that all customers can use, and implement a 7-day trial period so customers are informed about and encouraged to engage with the voice message functionality.
- Update the international plan, particularly for customers who do not make international calls frequently (fewer than 8 per day).
- Improve the quality of customer service, covering both services and features/products as well as provider connections/signals, in order to maintain customer satisfaction.
- Distribute satisfaction surveys to customers about the services and products they receive, to obtain comprehensive feedback and identify areas for improvement.
- Monitor implementation results to observe changes in churn trends.