Customer Churn Prediction (Telecommunications Industry)
Introduction
The telecommunications industry is developing rapidly, as can be seen from how pervasively people use the internet to communicate. This has pushed many telecommunications companies to expand their internet services, leading to competition between providers. Customers are free to choose the provider that suits them best and can switch away from their previous provider, a phenomenon known as Customer Churn. Churn reduces revenue for telecommunications companies and is therefore important to address.
In this case, participants are provided with a training dataset containing 4250 samples. Each sample consists of 19 features and one boolean target variable “churn”, which indicates whether the customer will churn.
Workflow
The CRISP-DM (Cross Industry Standard Process for Data Mining) methodology is a robust and comprehensive data mining process model that outlines six major phases:
- Business Understanding
- Data Understanding
- Data Preparation
- Modeling
- Evaluation
- Deployment
Business Understanding
Description
Customer churn refers to the phenomenon where customers cease conducting business with a company or terminate their subscription to a service. It is a critical metric for businesses, particularly in industries such as telecommunications, subscription services, and financial services, where long-term customer relationships are essential. Understanding and managing customer churn is vital for maintaining revenue and achieving sustainable growth.
Purpose
Hence, this case focuses on predicting customer churn. It is important for companies to know this prediction so they can map out business strategies to retain customers. Accurate churn prediction is crucial for businesses to maintain customer satisfaction, retain customers, and minimize revenue loss.
Metric
The metric for this case is accuracy: the goal is an optimal model that predicts customer churn with high accuracy, defined as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Data Understanding
The dataset comes from the Kaggle competition Customer Churn Prediction 2020, with the following file descriptions:
- train.csv — the training set. Contains 4250 rows with 20 columns; 3652 samples (85.93%) belong to class churn=no and 598 samples (14.07%) belong to class churn=yes.
- test.csv — the test set. Contains 750 rows with 20 columns: the index of each sample and the 19 features (the target variable "churn" is missing).
These 20 columns are divided into two groups based on data type:
- Categorical data
- state = two-letter code of the US state of customer residence.
- area_code = three-digit area code.
- international_plan (yes, no) = whether the customer has an international plan.
- voice_mail_plan (yes, no) = whether the customer has a voice mail plan.
- churn (yes, no) = whether the customer churned; this is the target variable.
- Numerical data
- account_length = Number of months the customer has been with the current telco provider.
- number_vmail_messages = Number of voice-mail messages.
- total_day_minutes = Total minutes of day calls.
- total_day_calls = Total number of day calls.
- total_day_charge = Total charge of day calls.
- total_eve_minutes = Total minutes of evening calls.
- total_eve_calls = Total number of evening calls.
- total_eve_charge = Total charge of evening calls.
- total_night_minutes = Total minutes of night calls.
- total_night_calls = Total number of night calls.
- total_night_charge = Total charge of night calls.
- total_intl_minutes = Total minutes of international calls.
- total_intl_calls = Total number of international calls.
- total_intl_charge = Total charge of international calls.
- number_customer_service_calls = Number of calls to customer service.
Import Data and Packages
The libraries and packages used for data manipulation, cleaning, modeling, and evaluation are as follows:
# Data manipulation and cleaning
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
import seaborn as sns
import re
from sklearn.preprocessing import MinMaxScaler
# Evaluation
from sklearn import model_selection
import sklearn.metrics as sm
# Modeling
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
import pickle
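With the packages in place, the competition files can be loaded. A minimal sketch, assuming train.csv and test.csv sit in the working directory (names taken from the file description above); df and data_test are the names used in the rest of the notebook:
# Load the competition files described earlier
df = pd.read_csv('train.csv')        # 4250 samples, 19 features + churn
data_test = pd.read_csv('test.csv')  # 750 samples, id + 19 features

# Sanity-check the class balance quoted in the file description
print(df['churn'].value_counts(normalize=True))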
Exploratory Data Analysis
The following helper functions and class support the data visualizations, especially for adding labels to charts:
def pie_label(pct, x):
    # turn a wedge percentage back into an absolute count for the label
    value = int(pct / 100. * np.sum(x))
    return '{:d}\n({:.0f}%)'.format(value, pct)
def stacking_bar_label(data, ax):
    # annotate each segment of a stacked bar chart with its share of the bar
    # total (ax is the Axes the bars were drawn on, passed in explicitly
    # instead of being taken from the enclosing scope)
    for idx in data.index:
        start = 0
        for col in data.columns:
            value = data.loc[idx, col]
            total = data.loc[idx, :].sum()
            ax.text(
                x=idx,
                y=(start + value / 2) * 0.95,
                s=f'{round(100 * value / total, 1)}%',
                fontsize=15,
                ha='right',
                color='black',
                weight='bold'
            )
            start += value
def safe_division(numerator, denominator):
    # treat 0/0 as 0 so that empty histogram bins do not raise
    if numerator == 0 and denominator == 0:
        return 0
    elif denominator == 0:
        raise ValueError("Division by zero is not allowed")
    else:
        return numerator / denominator
class extract_data:
    # Pulls bar positions and heights out of a histogram Axes so that churn
    # percentages and bin totals can be written next to the bars.
    # `interval` is the number of bins: patches[:interval] are assumed to be
    # the churn bars and patches[interval:] the non-churn bars.
    def __init__(self, chart, interval):
        self.chart = chart
        self.bin_edges = self.chart.patches
        self.interval = interval
        self.x_data = [(bin_edge.get_x() + bin_edge.get_width() / 2) for bin_edge in self.bin_edges]
        self.y_data = [bin_edge.get_height() for bin_edge in self.bin_edges]
        self.x_interval = np.array(self.x_data[:interval])
        self.y_churn = np.array(self.y_data[:interval])
        self.y_total = np.array(self.y_data[:interval]) + np.array(self.y_data[interval:])
def hist_label(self):
for c, d, e in zip(self.x_interval, self.y_churn, self.y_total):
self.chart.text(
x=c
, y=d
, s=f'{round(100 * safe_division(d, e))}%'
, fontsize=12
, ha='left'
, va='center'
, color='darkred'
, weight='bold'
)
def line_label(self):
for x, y in zip(self.x_interval, self.y_total):
self.chart.text(
x=x
, y=y
, s='{:.0f}'.format(y)
, fontsize=12
, ha='right'
, va='baseline'
, color='black'
, weight='bold'
)
Customer Churn
Based on the data, 86% of customers stay with the provider’s services, while only 14% of customers churn.
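For reference, a pie chart like this can be produced with the pie_label helper; a minimal sketch (the figure styling in the original notebook may differ):
# Churn share as counts and percentages
x = df['churn'].value_counts()
fig, ax = plt.subplots()
ax.pie(x, labels=x.index, autopct=lambda pct: pie_label(pct, x))
ax.set_title('Customer Churn')
plt.show()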
Area Code: Customer Churn
area_code_415 is the area with the largest number of customers (2108), and this area has a small percentage of customer churn (14%).
International Plan: Customer Churn
Customers who have an international_plan tend to churn more, at 42%. This is very different from customers who don't have an international_plan, most of whom stay with the provider.
Voicemail Plan: Customer Churn
Among customers with a voice_mail_plan (yes), most stay (93%), while customers without a voice_mail_plan (no) churn at a comparatively high rate of 16%.
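The plan breakdowns can be reproduced with a crosstab plus the stacking_bar_label helper; a sketch (the helper labels bars by position, hence the numeric index):
plan_churn = pd.crosstab(df['voice_mail_plan'], df['churn'])
fig, ax = plt.subplots()
plan_churn.plot(kind='bar', stacked=True, ax=ax)
# bars sit at x = 0, 1, ... so label with a positional index
stacking_bar_label(plan_churn.reset_index(drop=True), ax)
plt.show()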
Distribution of Total Day Minutes: Customer Churn
The total_day_minutes column (variable) is normally distributed, with the peak of users having total daily call minutes in the range of 175 - 200. The percentage of customer churn looks quite high once the duration exceeds 250 minutes and dominates above 316 minutes, which is closely related to the total call cost charged by the provider.
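A histogram like this can be drawn and annotated with extract_data; a sketch, assuming the churn bars form the first block of patches on the Axes (which is what the class expects):
fig, ax = plt.subplots()
sns.histplot(data=df, x='total_day_minutes', hue='churn',
             multiple='stack', bins=30, ax=ax)
labels = extract_data(ax, 30)  # 30 bins per churn group
labels.hist_label()   # per-bin churn percentage
labels.line_label()   # per-bin customer total
plt.show()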
Distribution of Total Day Charge: Customer Churn
The total_day_charge values are normally distributed with an average of $30.0 per customer. For churned customers the distribution fluctuates and starts to increase once the charge exceeds $40.0; all customers churn (100%) when it exceeds $54.0, which points to the relationship that the greater the daily call charge, the more customers churn.
Distribution of Total International Charge: Customer Churn
Looking at the distribution of total_intl_charge per day, the average customer spends only $2.7 - $3.2, and the percentage of customer churn starts to increase once the total call cost rises above $3.0.
Breaking this down by the international_plan category (yes or no) shows very different results between customers who have an international_plan (yes) and those who don't (no). Having an international_plan should help customers, especially by reducing costs through affordable prices, so that they do not churn; however, the visualizations show that many customers with an international_plan (yes) still churn when the charge is above $3.0.
Distribution of Total International Calls: Customer Churn
Looking at total_intl_calls, most customer churn comes from customers who make fewer than 8 international calls per day; on average, customers make only 4 - 5 international calls per day, with the distribution skewed to the right. Many customers who make international calls churn not only because the charges are quite high, but also because they simply do not need many international calls per day. In contrast, the customers who stay with the provider genuinely need international calling, making more than 10 calls per day.
The distribution of total_intl_calls is right-skewed both for customers with an international_plan (yes) and for those without one (no), with most customers making 3 international calls per day. However, the churn percentage among customers with an international_plan (yes) is very high, especially for those making fewer than 8 total_intl_calls per day.
Distribution of Number Vmail Messages: Customer Churn
The majority of customers have no voice mail messages, as can be seen from the right-skewed distribution with its peak at 0; 16% of the customers without voice mail messages churn.
Breaking this down by voice_mail_plan (yes or no), the spike of number_vmail_messages at zero comes from customers who don't have a voice_mail_plan (no). For customers with a voice_mail_plan (yes), number_vmail_messages is normally distributed with an average of 30 voice messages per day, and their churn percentage is much lower than that of customers without a voice_mail_plan (no).
Distribution of Number Customer Service Calls: Customer Churn
For number_customer_service_calls, most users call only once, and the distribution is skewed to the right. Meanwhile, the percentage of customers who churn increases along with number_customer_service_calls. In particular, every customer with more than 8 service calls churns, which indicates that these customers experienced many problems with features or costs from the provider, or problems that could not be resolved, and so they churned.
Distribution of Numerical Data
Based on the visualizations, almost all of the numerical columns (variables) are normally distributed, with most of the data clustered around the mean, median, and mode: account_length, total_day_minutes, total_day_calls, total_day_charge, total_eve_minutes, total_eve_calls, total_eve_charge, total_night_minutes, total_night_calls, total_night_charge, total_intl_minutes, and total_intl_charge.
The remaining columns (variables), namely number_vmail_messages, total_intl_calls, and number_customer_service_calls, follow a positively skewed (right-skewed) distribution, where mean > median > mode.
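The mean > median > mode ordering can be verified directly; a quick sketch:
skewed_cols = ['number_vmail_messages', 'total_intl_calls',
               'number_customer_service_calls']
print(df[skewed_cols].skew())    # positive skew confirms the right tail
print(df[skewed_cols].mean(), df[skewed_cols].median())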
Density of Numerical Data
In the bivariate analysis, the density of churned customers is much lower than that of customers who stay across all the numerical columns (variables). However, in several columns (variables) the churn density trends higher with a fluctuating pattern: total_day_minutes, total_day_charge, and number_customer_service_calls.
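Density curves of this kind can be drawn with seaborn; a sketch for one of the fluctuating columns (common_norm=False normalizes each class separately so the smaller churn class remains visible):
fig, ax = plt.subplots()
sns.kdeplot(data=df, x='total_day_charge', hue='churn',
            common_norm=False, fill=True, ax=ax)
plt.show()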
Data Preparation
This is the stage of cleaning data and preparing it for input into the machine learning model. Data should be checked for missing values, duplicates, and outliers. Categorical data must also be converted into a format that can be utilized by machine learning models.
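A quick pass over these checks might look as follows (a sketch):
print(df.isna().sum())        # missing values per column
print(df.duplicated().sum())  # exact duplicate rows
print(df.describe())          # value ranges, to eyeball outliers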
Features Encoding
for col in ['international_plan', 'voice_mail_plan','churn']:
df[col] = df[col].map({'yes':1, 'no':0})
This changes the string values ('yes' or 'no') to integers (1 or 0) in the columns (variables) international_plan, voice_mail_plan, and churn.
Dimensionality Reduction
df['total_minutes'] = df['total_day_minutes'] + df['total_eve_minutes'] + df['total_night_minutes']
df['total_calls'] = df['total_day_calls'] + df['total_eve_calls'] + df['total_night_calls']
df['total_charge'] = df['total_day_charge'] + df['total_eve_charge'] + df['total_night_charge']
Dimensionality reduction is the process of decreasing the number of features/columns/variables without losing the information content of the data. Here it is done via aggregation: total_minutes, total_calls, and total_charge are each the sum of the corresponding day, evening, and night columns (variables) for minutes, calls, and charges. The international columns (variables) total_intl_minutes, total_intl_calls, and total_intl_charge are not included in this aggregation because they are related to, or correlated with, international_plan.
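For the reduction to actually shrink the feature set, the original day/evening/night columns are presumably dropped after aggregation; a sketch of that assumed step:
# assumed follow-up: remove the components that were just aggregated
df = df.drop(columns=[
    'total_day_minutes', 'total_eve_minutes', 'total_night_minutes',
    'total_day_calls', 'total_eve_calls', 'total_night_calls',
    'total_day_charge', 'total_eve_charge', 'total_night_charge'
])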
Standardization
# Scale the selected columns to the [0, 1] range (MinMaxScaler works per column)
scale_cols = [
    'number_vmail_messages', 'total_intl_minutes', 'total_intl_calls',
    'total_intl_charge', 'number_customer_service_calls',
    'total_minutes', 'total_calls', 'total_charge'
]
scaler = MinMaxScaler()
X[scale_cols] = scaler.fit_transform(X[scale_cols])
The values in number_vmail_messages, total_intl_minutes, total_intl_calls, total_intl_charge, number_customer_service_calls, total_minutes, total_calls, and total_charge are rescaled to lie between 0 and 1. This puts every column on a common range so that no feature dominates the models purely because of its units.
Feature Selection
The process of selecting columns (variables) that have a correlation with the target variable (churn) and separating the data to determine:
- dependent variable (y) with target variable (churn)
- independent variables (X), specifically referring to several columns (variables) that are correlated with the target variable (churn)
Based on the correlation of the candidate columns (independent variables X) with churn (dependent variable y), the columns (variables) with a unidirectional (positive) relationship, pushing customers toward churn, are: international_plan, total_charge, total_minutes, number_customer_service_calls, total_intl_minutes, and total_intl_charge.
The columns (variables) with an inverse relationship, associated with customers staying, are: total_calls, total_intl_calls, number_vmail_messages, and voice_mail_plan. These correlations and the resulting feature split can be computed as sketched below.
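A sketch of how this selection can be computed and applied; the exact threshold used in the original is not shown, so the feature list simply mirrors the columns named above (in the notebook this split evidently precedes the scaling step shown earlier):
# correlation of every numeric column with the target
print(df.corr(numeric_only=True)['churn'].sort_values())

# predictors named above and the target
feature_cols = [
    'international_plan', 'voice_mail_plan', 'number_vmail_messages',
    'total_intl_minutes', 'total_intl_calls', 'total_intl_charge',
    'number_customer_service_calls', 'total_minutes', 'total_calls',
    'total_charge'
]
X = df[feature_cols]
y = df['churn']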
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size = 0.3, random_state = 123)
Finally, the data is divided into training and testing sets, with 70% allocated for training and 30% for testing.
Modeling & Evaluation
# Model Evaluation Function
kfold = model_selection.KFold(n_splits=10, shuffle = True, random_state=123)
model_list = pd.DataFrame(columns=['model_ml', 'accuracy', 'recall', 'precision', 'f1_score', 'cross_val_score', 'std'])
# Function to filter out empty or all-NA DataFrames
def filter_empty_or_all_na(dfs):
filtered_dfs = []
for df in dfs:
if not df.empty and not df.isna().all().all():
filtered_dfs.append(df)
return filtered_dfs
# Function to evaluate models
def evaluation(y_test, y_pred, model, name_model):
global model_list
model_ml = name_model
accuracy = sm.accuracy_score(y_test,y_pred)*100.0
precision = sm.precision_score(y_test,y_pred)*100.0
recall = sm.recall_score(y_test,y_pred)*100.0
f1 = sm.f1_score(y_test, y_pred)*100.0
    result = model_selection.cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
    new_row = pd.DataFrame([{
        'model_ml': model_ml,
        'accuracy': accuracy,
        'recall': recall,
        'precision': precision,
        'f1_score': f1,
        'cross_val_score': result.mean() * 100.0,
        'std': result.std() * 100.0
    }])
    # skip empty/all-NA frames so the concat stays clean
    filtered_dfs = filter_empty_or_all_na([model_list, new_row])
    model_list = pd.concat(filtered_dfs, ignore_index=True)
print('Model {}'.format(model_ml))
print("Accuracy Score: %.3f%%" % (accuracy))
print("Precision Score: %.3f%%" % (precision))
print("Recall Score: %.3f%%" % (recall))
print("F1 Score: %.3f%%" % (f1))
print("cross_val_score: %.3f%% (%.3f%%)" % (result.mean()*100.0, result.std()*100.0))
print(sm.classification_report(y_test, y_pred))
sm.ConfusionMatrixDisplay.from_predictions(y_test, y_pred, display_labels=['no churn', 'churn'])
To facilitate the evaluation and validation of the machine learning results, a function has been developed that simplifies this process and can be invoked repeatedly. It reports the following:
- model and name_model identify the fitted machine learning model being evaluated.
- Confusion matrix: a table that presents the performance of a classification model in terms of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).
- Accuracy: the share of correct predictions, (TP + TN) / (TP + TN + FP + FN).
- Precision: the ratio of true positive predictions (TP) to all predicted positives (TP + FP).
- Recall: the ratio of true positive predictions (TP) to all actual positives (TP + FN).
- F1-score: the harmonic mean of precision and recall.
- Cross-validation: using K-fold with K = 10, the data is partitioned into folds and validation is repeated ten times, which guards against an overly optimistic single train/test split (overfitting).
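The individual models below were presumably trained and passed through this function along these lines (a sketch; hyperparameters are the scikit-learn defaults, and the random_state values are assumptions):
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'KNN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(random_state=123),
    'Random Forest': RandomForestClassifier(random_state=123),
    'SVM': SVC()
}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    evaluation(y_test, y_pred, model, name)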
Logistic Regression
The accuracy value of the Logistic Regression model, with correctly predicted data for both customer churn (True Positive) and non-churn (True Negative), is quite high at 88.08%. After conducting cross-validation 10 times, the accuracy value remains at 87.18%, with a standard deviation of 1.46%.
The precision value is notably high, with the percentage of customers who actually churn (TP) out of all customers predicted to churn (TP + FP) being 75.61%. Conversely, the recall value is significantly low, with the percentage of customers correctly predicted to churn (TP) compared to the total number of customers who actually churned (TP + FN) being 17.92%. Consequently, this discrepancy impacts the average percentage comparison of precision and recall (F1-Score), which is 28.97%.
KNN (K-Nearest Neighbors)
The accuracy value of the KNN (K-Nearest Neighbors) model, with correctly predicted data for both customer churn (TP) and no churn (TN), is notably high at 92.55%. After conducting cross-validation 10 times, the accuracy value remains at 92.61% with a standard deviation of 0.9%.
The precision value is notably high. Specifically, the percentage of customers who actually churn (True Positives) out of all customers predicted to churn (True Positives + False Positives) is 83.05%. Conversely, the recall value is relatively low. It indicates that the percentage of customers correctly predicted to churn (True Positives) compared to the total number of customers who actually churned (True Positives + False Negatives) is 56.64%. This disparity significantly impacts the average comparison of precision and recall, yielding an F1-Score of 67.35%.
Decision Tree
The accuracy value of the Decision Tree model, with correctly predicted data for both customer churn (TP) and no churn (TN), is quite high at 96.39%. After carrying out cross-validation 10 times, the accuracy value remains at 94.96% with a standard deviation of 0.81%. Based on these results, the single-split accuracy of the Decision Tree model (96.39%) falls outside the cross-validation range (94.15% - 95.77%), suggesting the single split is slightly optimistic.
The precision value is notably high, with the percentage of customers who actually churn (TP) out of all customers predicted to churn (TP + FP) standing at 87.70%. Meanwhile, the recall value, representing the percentage of customers who were correctly predicted to churn (TP) compared to all customers who actually churned (TP + FN), was 86.70%. The resulting F1-score, the harmonic mean of the two, is approximately 87.2%.
Random Forest
The accuracy value of the Random Forest model, with correctly predicted data for both customer churn (TP) and non-churn (TN), is very high at 97.80%. After conducting cross-validation 10 times, the accuracy value remains at 97.36%, with a standard deviation of 0.65%.
The precision value is very high: the percentage of customers who actually churn (TP) out of all customers predicted to churn (TP + FP) is 98.66%. The recall value, the percentage of customers who were correctly predicted to churn (TP) compared to all customers who actually churned (TP + FN), is 84.97%. The resulting F1-score, the harmonic mean of the two, is approximately 91.3%.
Support Vector Machines
The accuracy value of the Support Vector Machines model, with correctly predicted data for both customer churn (True Positives) and non-churn (True Negatives), is notably high at 92.31%. Additionally, after performing cross-validation 10 times, the accuracy value remains at 92.66% with a standard deviation of 1.07%.
The precision value is notably high, with the percentage of customers who actually churn (TP) out of all customers predicted to churn (TP + FP) being 91.21%. However, the recall value is relatively low, where the percentage of customers correctly predicted to churn (TP) compared to the total number of customers who actually churned (TP + FN) stands at 47.98%. Consequently, this affects the average percentage comparison of precision and recall (F1-Score), which is 62.88%.
Result
From these results, Random Forest is the best model, with the highest accuracy, precision, and F1-score in predicting customer churn.
# Model interpretation: refit Random Forest on the training data and persist it
RF = RandomForestClassifier()
forest = RF.fit(X_train, y_train)
pickle.dump(forest, open('forest.pkl', 'wb'))

# Reload the model and score the held-out competition test set
# (X_data_test: the test.csv features prepared with the same encoding,
#  reduction, and scaling as the training data; data_test: the raw test.csv frame)
forest_model = pickle.load(open('./forest.pkl', 'rb'))
churn = forest_model.predict(X_data_test)
data_test['churn'] = churn
data_test['churn'] = data_test['churn'].map({1: 'yes', 0: 'no'})
data_test.head()
According to the prediction on the test set, 82% of customers will continue using the provider’s services, while the remaining 18% are expected to churn.
Conclusion
Only 14% of the provider's customers churn overall, and this churn is influenced by several factors:
- Of the 396 customers who have an international_plan, 42% churn.
- Customers who don't have a voice_mail_plan (no) churn at a high rate: 16.4% of 3138 users.
- Customers with a call duration of more than 316 minutes per day incur a high charge of more than $54.0, which points to the relationship that the greater the daily call charge, the more customers churn.
- An international_plan should help customers reduce their total_intl_charge through affordable prices, but in practice many customers who make international calls still churn. The churn percentage among customers with an international_plan (yes) is very high, especially for those making fewer than 8 total_intl_calls per day. Besides the relatively high charges, these customers simply do not need many international calls per day.
- 16% of the 3138 customers with no voice mail messages churn, while for customers with a voice_mail_plan (yes), number_vmail_messages is normally distributed with an average of 30 voice messages per day, and their churn percentage is much lower.
- The churn percentage rises with number_customer_service_calls. In particular, every customer with more than 8 service calls churns, indicating problems with features or costs from the provider, or problems that could not be resolved, so the customer churns.
Recommendation
After predicting which customers will churn, several recommendations can be made to prevent or reduce customer churn:
- Classify customers who are predicted to churn by assigning labels, so retention efforts can be prioritized.
- Provide a subscription promo for daily calling plans of 3 hours (180 minutes) and 6 hours (360 minutes) with charges of $30.0 and $50.0 (to be discussed with the business team).
- Create a comprehensive plan for a voice message feature that all customers can use, and implement a 7-day trial period so customers are informed about and encouraged to engage with the voice message functionality.
- Update the international plan, particularly for customers who do not make international calls frequently (fewer than 8 per day).
- Improve the quality of customer service, covering both services and features/products as well as provider connections/signals, in order to maintain customer satisfaction.
- Distribute satisfaction surveys to customers about the services and products they receive, to obtain comprehensive feedback and identify areas for improvement.
- Monitor implementation results to observe changes in churn trends.