Muhammadu Ilyas
8 min read · Apr 21, 2022

Credit Card analysis by segmentation to define marketing strategy

Introduction

If you want to get your message heard, it needs to make it to the right people at the right place and time. As a business owner, getting your products or services to resonate with your target audience is at the core of marketing. But before you embark on marketing your business, it’s crucial to determine precisely who it is you’ll be targeting. Market segmentation is the first move you’ll want to make in order to define who your brand should address and appeal to. Segmenting your market will allow your business efforts — from creating a website to launching a service or product — to be perfectly aligned with what your audience is looking for.

Clustering is a popular exploratory data analysis tool for gaining an understanding of the data’s structure. It is the task of identifying subgroups in data so that data points within the same subgroup (cluster) are extremely similar while data points within different clusters are very dissimilar.

I used the K-means algorithm in this case. K-means is an iterative technique that attempts to split a dataset into K distinct, non-overlapping subgroups (clusters), with each data point belonging to exactly one cluster.
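To make the iteration concrete, here is a minimal NumPy sketch of the two alternating steps (assign each point to its nearest centroid, then move each centroid to the mean of its members). It is an illustration only, not the code from my notebook: the function names `assign` and `kmeans` and the toy points are made up for this example.

```python
import numpy as np

def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal K-means sketch (hypothetical helper): returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        # Move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return centroids, assign(points, centroids)

# Toy data (made up): two well-separated groups of three points each.
toy = np.array([[0.0, 0.0], [0.3, 0.1], [0.1, 0.4],
                [6.0, 5.0], [6.2, 5.3], [5.9, 5.4]])
centroids, labels = kmeans(toy, k=2)
print(labels)
```

Every point ends up with exactly one label, which is the "non-overlapping" property described above.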

The case requires developing a customer segmentation to define a marketing strategy. The sample dataset summarizes the usage behaviour of about 9,000 active credit card holders during the last 6 months.

Task

Explore and use the data to create a clustering machine learning model that segments the credit card holders into groups that can inform business decisions.

Data source and preparation

All data was obtained from https://www.kaggle.com/datasets/arjunbhasin2013/ccdata

I downloaded the dataset as a .csv file named “CC GENERAL.csv”, saved it in a folder, and used Jupyter Notebook to explore and prepare the data.

First, I loaded the required packages and modules for data exploration and manipulation: pandas, numpy, matplotlib, seaborn and statsmodels.

Data importation

I imported the .csv file into a Jupyter notebook using the pandas read_csv() function.

Using the head() function, I saw that the data frame consists of credit card transaction information for different customers, indexed by an identification number.
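A minimal sketch of this step is shown below. The helper name `load_credit_data` is made up, and the two sample rows are invented values standing in for the real file so that the snippet runs without the download. The underscore column names are an assumption about the CSV headers.

```python
import io

import pandas as pd

def load_credit_data(source):
    """Read the credit card CSV from a file path or a file-like object."""
    return pd.read_csv(source)

# With the downloaded file this would be: load_credit_data("CC GENERAL.csv")
# The in-memory stand-in below uses invented values and assumed column names.
sample_csv = io.StringIO(
    "CUST_ID,BALANCE,PURCHASES,CREDIT_LIMIT\n"
    "C10001,40.9,95.4,1000\n"
    "C10002,3202.5,0.0,7000\n"
)
df = load_credit_data(sample_csv)

print(df.head())   # first rows of the data frame
print(df.shape)    # (rows, columns)
```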

Data frame Exploration

The file is at a customer level with 18 behavioural variables.

The following is the data dictionary for the credit card dataset.

CUST ID: Identification of the credit card holder (categorical).

BALANCE: Balance amount left in the account to make purchases.

BALANCE FREQUENCY: How frequently the balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated).

PURCHASES: Amount of purchases made from the account.

ONEOFFPURCHASES: Maximum purchase amount done in one go.

INSTALLMENTSPURCHASES: Amount of purchases done in installments.

CASHADVANCE: Cash in advance given by the user.

PURCHASESFREQUENCY: How frequently purchases are made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased).

ONEOFFPURCHASEFREQUENCY: How frequently purchases happen in one go (1 = frequently purchased, 0 = not frequently purchased).

PURCHASEINSTALLMENTFREQUENCY: How frequently purchases in installments are made (1 = frequently done, 0 = not frequently done).

CASHADVANCEFREQUENCY: How frequently cash in advance is paid.

CASHADVANCETRX: Number of transactions made with "cash in advance".

PURCHASESTRX: Number of purchase transactions made.

CREDITLIMIT: Credit card limit for the user.

PAYMENTS: Amount of payments made by the user.

MINIMUM_PAYMENTS: Minimum amount of payments made by the user.

PERCFULLPAYMENTS: Percent of full payment paid by the user.

TENURE: Tenure of the credit card service for the user.

Statistical Summaries

To understand the shape and structure of the dataset, I used the describe() function to return the summary below.

From the above information, it can be deduced that:

· The dataset contains 8950 rows, each representing one customer's credit card transaction details, and 18 columns, each representing a behavioural variable.

· A frequency score is recorded for each type of activity (balance updates, purchases, one-off purchases, installment purchases and cash advances).

· All columns except CUST ID are numerical.

· There are many outliers (look at the max values), but I didn’t drop them because they may contain important information. I treated them as extreme values instead.
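The sketch below shows how such a summary can be produced with describe(), and one way to count extreme values without dropping them. The data is randomly generated as a stand-in for the real file, and the 1.5 × IQR rule is only an assumed convention for this example, not necessarily the check used in my notebook.

```python
import numpy as np
import pandas as pd

# Randomly generated stand-in data (not the real dataset).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "BALANCE": rng.gamma(2.0, 800.0, size=200),
    "PURCHASES": rng.gamma(1.5, 600.0, size=200),
    "TENURE": rng.integers(6, 13, size=200),
})

summary = demo.describe()   # count, mean, std, min, quartiles, max per column
print(demo.shape)
print(summary.loc[["count", "mean", "max"]])

# Flag values far above the upper quartile (assumed 1.5 * IQR rule);
# the rows stay in the data frame, they are only counted here.
q1, q3 = demo["PURCHASES"].quantile([0.25, 0.75])
extreme = demo["PURCHASES"] > q3 + 1.5 * (q3 - q1)
print(int(extreme.sum()), "extreme PURCHASES values kept in the data")
```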

Data preparation and cleaning

Before going further, I had to prepare the data to ensure that it is clean and free of redundancies. The data cleaning process includes the following:

· Checking and filling missing values. Using the isnull() function, I discovered that the MINIMUM_PAYMENTS column has 313 missing values and the CREDIT LIMIT column has one missing value. Since these make up only a small share of the 8950 rows (about 3.5%), I filled the missing entries with zero (0).

Filling the gaps instead of dropping rows keeps every customer in the dataset, so the overall trends in the data stay largely unchanged.

· Scaling the numeric features. MinMaxScaler() from sklearn.preprocessing is used to rescale every feature to a common range (0 to 1 by default), so that no feature dominates the distance calculations simply because of its units.

· Dimensionality reduction. A high-dimensional feature space cannot be plotted directly, so it must be compressed into a 2D or 3D space that can be visualized and interpreted. Hence the 18-dimensional space is reduced to 2D, where the two coordinates summarize all features. This is achieved using the Principal Component Analysis (PCA) class built into sklearn.
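The three preparation steps above can be sketched as follows. This is an illustration on randomly generated stand-in data, not the notebook itself. The column names are assumed, and dropping CUST_ID before scaling is my assumption here, since an identifier cannot be scaled.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Randomly generated stand-in data with a few gaps punched in.
rng = np.random.default_rng(42)
raw = pd.DataFrame({
    "CUST_ID": [f"C{10001 + i}" for i in range(100)],
    "BALANCE": rng.gamma(2.0, 800.0, size=100),
    "PURCHASES": rng.gamma(1.5, 600.0, size=100),
    "MINIMUM_PAYMENTS": rng.gamma(2.0, 300.0, size=100),
    "CREDIT_LIMIT": rng.choice([1000.0, 3000.0, 7000.0], size=100),
})
raw.loc[[3, 17, 58], "MINIMUM_PAYMENTS"] = np.nan
raw.loc[40, "CREDIT_LIMIT"] = np.nan

# 1. Count missing values per column, then fill them with 0.
missing_before = raw.isnull().sum()
filled = raw.fillna(0)

# 2. Rescale the numeric features to the 0-1 range.
features = filled.drop(columns=["CUST_ID"])  # assumption: the ID is not a feature
scaled = MinMaxScaler().fit_transform(features)

# 3. Compress the scaled features into two principal components.
pca = PCA(n_components=2)
featured_2D = pca.fit_transform(scaled)

print(missing_before)
print(featured_2D.shape, pca.explained_variance_ratio_)
```

The explained_variance_ratio_ output shows how much of the original variance the two components retain, which is worth checking before trusting a 2D plot.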

Exploratory Data Analysis

This visualization suggests that the data points are concentrated around a particular range of values.

There is a positive correlation between Cash Advance and Balance, Credit limit and Balance, Purchases and One-off purchases, Purchase trx and One-off purchases, Payments and One-off purchases.

Each frequency feature is positively correlated with its corresponding amount feature (for example, purchases frequency with purchases).
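Correlations like these can be read from a correlation matrix. Below is a minimal sketch on randomly generated stand-in data. The total purchases column is built as the sum of the other two, so the positive correlation is there by construction and only serves to show the mechanics.

```python
import numpy as np
import pandas as pd

# Randomly generated stand-in data (assumed column names).
rng = np.random.default_rng(1)
one_off = rng.gamma(1.5, 400.0, size=300)
installments = rng.gamma(1.5, 200.0, size=300)
demo = pd.DataFrame({
    "ONEOFF_PURCHASES": one_off,
    "INSTALLMENTS_PURCHASES": installments,
    "PURCHASES": one_off + installments,  # correlated by construction
    "TENURE": rng.integers(6, 13, size=300),
})

corr = demo.corr()  # pairwise Pearson correlation coefficients
print(corr.round(2))

# In a notebook, seaborn can draw the matrix as a heatmap:
#   import seaborn as sns
#   sns.heatmap(corr, annot=True)
```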

Determination of number of clusters

One of the major challenges in clustering is determining the number of clusters the dataset should be segmented into. In this case I used the within-cluster sum of squares (WCSS) technique to determine the most appropriate number of clusters. The steps taken are shown below.

Plotting the WCSS values on a line graph

From the elbow point of the curve above, I found that there are 3 significant clusters for these data points, so the number of clusters is 3.
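For readers who want to reproduce the idea, here is a minimal sketch of the WCSS computation. The points are three randomly generated blobs standing in for the real 2D features, and the range of 1 to 10 clusters is an assumption. scikit-learn exposes the WCSS of a fitted model as inertia_.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three randomly generated blobs standing in for the real 2D features.
rng = np.random.default_rng(7)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(60, 2)) for c in centers])

# Fit K-means for each candidate k and record the within-cluster sum of squares.
wcss = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    wcss.append(model.inertia_)

for k, value in zip(range(1, 11), wcss):
    print(k, round(value, 1))

# In a notebook, the elbow plot is then:
#   import matplotlib.pyplot as plt
#   plt.plot(range(1, 11), wcss, marker="o")
```

On this stand-in data the WCSS falls sharply up to k = 3 and only slowly afterwards, and that bend is the elbow.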

K-Means Clustering Algorithm

The model is built with 3 centroids (one per cluster) and a maximum of 1000 iterations for updating the centroids.
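A minimal sketch of that configuration with scikit-learn's KMeans follows. The featured_2D array here is randomly generated stand-in data, and n_init and random_state are assumed settings added for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans

# Randomly generated stand-in for the 2D PCA output.
rng = np.random.default_rng(3)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [2.5, 5.0]])
featured_2D = np.vstack([c + rng.normal(scale=0.4, size=(50, 2)) for c in centers])

# 3 clusters, at most 1000 centroid updates per run.
kmeans = KMeans(n_clusters=3, max_iter=1000, n_init=10, random_state=0)
labels = kmeans.fit_predict(featured_2D)

print(kmeans.cluster_centers_)   # one 2D centroid per cluster
print(np.bincount(labels))       # number of points in each cluster
```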

Visualizing featured_2D with the K-Means cluster assignments suggests that cluster 1 is wider than the other clusters. This cluster is represented by the blue region, as shown below.

For more information on these algorithms, see https://www.kaggle.com/code/sabanasimbutt/clustering-visualization-of-clusters-using-pca

Saving the models

I created a directory where the model is saved as a pickle file for future deployment and further exploration.
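Saving and reloading a fitted model with pickle can be sketched like this. The directory and file names are hypothetical, a temporary folder is used so the snippet leaves nothing behind, and the tiny model is fitted on made-up points.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.cluster import KMeans

# Tiny model fitted on made-up points, just to have something to save.
points = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0],
                   [5.1, 4.8], [0.0, 5.0], [0.1, 5.2]])
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

# Hypothetical location: a "models" folder inside a temporary directory.
model_dir = Path(tempfile.mkdtemp()) / "models"
model_dir.mkdir(parents=True, exist_ok=True)
model_path = model_dir / "kmeans_model.pkl"

with open(model_path, "wb") as f:
    pickle.dump(model, f)

with open(model_path, "rb") as f:
    restored = pickle.load(f)

print(restored.predict(points))  # same assignments as the original model
```

A pickle file should only be loaded from a source you trust, and ideally with the same scikit-learn version that created it.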

Recommendation

· For cluster 0, I recommended a silver credit card because it is the most widely owned card. In general, a new credit card holder receives a silver card and can upgrade it later. Silver cards have the lowest credit limit, around 4 million to 7 million IDR. The cardholder must have a monthly salary of at least 3 million IDR. The advantage of this card is that its limit is not too high.

· For cluster 1, I recommended a gold credit card. The cardholder must have a regular monthly income of around 5 million to 10 million IDR. The credit limit ranges from 10 million to 40 million IDR, depending on the issuing bank. The advantage of this type of card is that the limit is large enough to let you buy expensive items sooner. You can use it to pay for big-budget items such as motorbikes or smartphones. However, the higher the credit card limit, the higher the annual fee you have to pay.

· Last, for cluster 2, I recommended a platinum credit card, the highest tier. Few people own platinum cards because strict approval procedures make them hard to get. A platinum credit card has a high limit, from 40 million up to 1 billion IDR. The cardholder must have an income of at least 180 million IDR per year and a good credit history.

· A data frame should be created where each data point is labelled according to its derived class (0, 1 or 2).

· A classification machine learning model should be created and trained on these classes to understand trends in the credit card transactions of all customers and support better managerial decisions.

· The model should be used to derive insights on new customers based on their collected data.
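The last three recommendations can be sketched as follows. This is only an illustration on randomly generated stand-in data. The choice of RandomForestClassifier, the two feature columns and the split settings are my assumptions for the example, and any classifier could take its place.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Randomly generated stand-in for scaled customer features.
rng = np.random.default_rng(5)
centers = np.array([[0.2, 0.2], [0.8, 0.3], [0.5, 0.9]])
values = np.vstack([c + rng.normal(scale=0.05, size=(80, 2)) for c in centers])
customers = pd.DataFrame(values, columns=["BALANCE", "PURCHASES"])

# Label every customer with its derived class (0, 1 or 2).
customers["CLUSTER"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(values)

# Train a classifier on the labelled data (assumed model choice).
X_train, X_test, y_train, y_test = train_test_split(
    customers[["BALANCE", "PURCHASES"]], customers["CLUSTER"],
    test_size=0.25, random_state=0, stratify=customers["CLUSTER"],
)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))

# Assign a new (made-up) customer to a segment.
new_customer = pd.DataFrame({"BALANCE": [0.78], "PURCHASES": [0.31]})
predicted = clf.predict(new_customer)[0]
print("predicted segment:", predicted)
```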

Please visit my repository to view the code: https://github.com/Gbekoilias/Credit-Card-analysis

Thank you for reading! I'm happy to receive your suggestions and recommendations!