Customer Segmentation with K-Means Clustering - R

This is a brief article on how to perform customer segmentation with K-Means clustering in R. The results of this analysis could be used by the Sales and Marketing departments of any software service firm to target different groups of consumers with the appropriate advertising strategies.

Alier Ëë Reng https://www.rengdatascience.io/
05-27-2019

Purpose Statement

The purpose of this article is to apply the knowledge I have acquired through my current Data Science training at Business Science University (in both DS4B 101-R: Business Analysis With R & DS4B 201-R: Data Science for Business with R) on the K-Means clustering method and UMAP. In this article, we will use various R packages (see the Loading Libraries section below) to create a visualization that identifies subcategories of SaaS software solutions in the S&P 500 Index. The results of this analysis could be used by the Sales and Marketing departments of any software service firm to target different groups of consumers with the appropriate advertising strategies.

Loading Libraries

Below is a list of the libraries that will be used in this article.


# Loading the libraries
library(tidyverse)
library(tidyquant)
library(broom)
library(umap)
library(plotly) 

Obtaining Data

In this article, we will be using stock prices from the S&P 500 index.


# Getting all stocks in a Stock Index (SP500, in this analysis)
sp_500_index_data <- tq_index("SP500")

# Inspecting the data obtained from SP500
sp_500_index_data

# Pulling in stock prices for each stock in the Index
sp_500_prices_data <- sp_500_index_data %>%
    select(symbol) %>%
    tq_get(get = "stock.prices")

# Saving the data
# fs::dir_create("C:/Users/areng/Desktop/alierwaaireng.github.io/data_lab")
# sp_500_prices_data %>% write_rds(path = "C:/Users/areng/Desktop/alierwaaireng.github.io/data_lab/sp_500_prices_data.rds")
# sp_500_index_data %>% write_rds("C:/Users/areng/Desktop/alierwaaireng.github.io/data_lab/sp_500_index_data.rds")

In the section below, we will inspect the stock prices using the glimpse() function from the dplyr package. The data that we have just obtained from the S&P 500 Index contain 1,243,672 observations (rows) and 8 variables (columns). The most important columns for this analysis are symbol, date, and adjusted.


# Importing data
sp_500_prices_data <- read_rds("C:/Users/areng/Desktop/alierwaaireng.github.io/data_lab/sp_500_prices_data.rds")

# Inspecting the Stock prices with the glimpse()
sp_500_prices_data %>% 
  glimpse()

Observations: 1,243,672
Variables: 8
$ symbol   <chr> "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", ...
$ date     <date> 2009-01-02, 2009-01-05, 2009-01-06, 2009-01-07,...
$ open     <dbl> 19.53, 20.20, 20.75, 20.19, 19.63, 20.17, 19.71,...
$ high     <dbl> 20.40, 20.67, 21.00, 20.29, 20.19, 20.30, 19.79,...
$ low      <dbl> 19.37, 20.06, 20.61, 19.48, 19.55, 19.41, 19.30,...
$ close    <dbl> 20.33, 20.52, 20.76, 19.51, 20.12, 19.52, 19.47,...
$ volume   <dbl> 50084000, 61475200, 58083400, 72709900, 70255400...
$ adjusted <dbl> 15.74050, 15.88760, 16.07343, 15.10562, 15.57791...

The next section describes the second data frame, which contains descriptive information about the stocks. The most important columns for this analysis are symbol, company, and sector.


# Importing data
sp_500_index_data <- read_rds("C:/Users/areng/Desktop/alierwaaireng.github.io/data_lab/sp_500_index_data.rds")


# Inspecting the Index data that contains both the Sector and Company information
sp_500_index_data %>% 
  glimpse()

Observations: 505
Variables: 5
$ symbol      <chr> "MSFT", "AAPL", "AMZN", "FB", "BRK.B", "JNJ",...
$ company     <chr> "Microsoft Corporation", "Apple Inc.", "Amazo...
$ weight      <dbl> 0.042120082, 0.036083549, 0.032016292, 0.0175...
$ sector      <chr> "Information Technology", "Information Techno...
$ shares_held <dbl> 84274540, 49204490, 4532027, 26203640, 213591...

Question

Which stock prices behave similarly?

Answering this question will help us understand which companies are related, using K-Means clustering, an unsupervised machine learning algorithm.

Unsupervised learning is a branch of machine learning that learns from data that have not been labeled, classified, or categorized. Instead of responding to feedback, unsupervised learning identifies commonalities in the data and reacts based on the presence or absence of such commonalities in each new piece of data.

In this article, we will show which companies are competitors and which are in the same market sector so that we can group them together. Most importantly, this article will help the readers understand the dynamics of the market and competition, which is imperative for all types of analyses from finance to sales to marketing.

Let’s get started.

Transforming stock prices into a standardized format (daily returns)

The first step in this analysis is to get the data into a format that can be converted to a "user-item" style matrix, with one row per stock and one column per trading day. The challenge is connecting what we have (long-format daily prices) with the properly formatted wide matrix we need.

Before we can compare stocks, the data must be standardized (normalized), because we cannot meaningfully compare stock prices of completely different magnitudes. To standardize, we will convert the adjusted stock prices (dollar values) to daily returns (percent change from the previous day) using the following formula:

\[ Return_{Daily} = \frac{Price_{i}-Price_{i-1}}{Price_{i-1}} \]
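As a quick sanity check, here is a minimal worked example of this formula in R; the two prices are made-up values for illustration only.

```r
# Hypothetical example: a stock's adjusted price moves from 100 to 102
price_previous <- 100
price_current  <- 102

# Daily return: percent change from the previous day's price
daily_return <- (price_current - price_previous) / price_previous
daily_return  # 0.02, i.e. a 2% daily return
```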

For the purpose of this analysis, we will use the daily stock prices obtained from the S&P 500 Index above, covering over 500 stocks and more than 1.2M observations. After filtering, the analysis covers the period from 2018-01-03 to 2019-06-14.


# Inspecting the stock prices data, once again, using the glimpse()
sp_500_prices_data %>% 
  glimpse()

Observations: 1,243,672
Variables: 8
$ symbol   <chr> "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", ...
$ date     <date> 2009-01-02, 2009-01-05, 2009-01-06, 2009-01-07,...
$ open     <dbl> 19.53, 20.20, 20.75, 20.19, 19.63, 20.17, 19.71,...
$ high     <dbl> 20.40, 20.67, 21.00, 20.29, 20.19, 20.30, 19.79,...
$ low      <dbl> 19.37, 20.06, 20.61, 19.48, 19.55, 19.41, 19.30,...
$ close    <dbl> 20.33, 20.52, 20.76, 19.51, 20.12, 19.52, 19.47,...
$ volume   <dbl> 50084000, 61475200, 58083400, 72709900, 70255400...
$ adjusted <dbl> 15.74050, 15.88760, 16.07343, 15.10562, 15.57791...

Here, we will convert sp_500_prices_data to a tibble named sp_500_daily_returns_data as follows:


# Normalizing the data
sp_500_daily_returns_data <- sp_500_prices_data %>%
  
    # Selecting columns to be included in this analysis  
    select(symbol, date, adjusted) %>%
    
    # Filtering to select the desired time period 
    filter(date >= ymd("2018-01-01")) %>%
    
    # Grouping by stock symbol 
    group_by(symbol) %>%
  
    # Computing the lags with mutate()
    mutate(lag_1 = lag(adjusted)) %>%
    ungroup() %>%
    
    # Filtering to remove NAs from the lag column
    filter(!is.na(lag_1)) %>%
  
    # Computing the differences and percentage returns using the mutate()  
    mutate(diff = adjusted - lag_1,
           pct_return = diff / lag_1) %>%
    
    # Re-selecting the appropriate columns to be included in the analysis
    select(symbol, date, pct_return)


# Inspecting the data with glimpse()
sp_500_daily_returns_data %>% 
  glimpse()

Observations: 181,978
Variables: 3
$ symbol     <chr> "MSFT", "MSFT", "MSFT", "MSFT", "MSFT", "MSFT"...
$ date       <date> 2018-01-03, 2018-01-04, 2018-01-05, 2018-01-0...
$ pct_return <dbl> 0.0046538528, 0.0088014267, 0.0123980965, 0.00...
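To see what the lag() step in the pipeline above is doing, here is a minimal sketch on a made-up three-day price series (the symbol "XYZ" and the prices are hypothetical):

```r
library(dplyr)

# Hypothetical three-day price series for a fictional symbol
prices <- tibble(symbol = "XYZ", adjusted = c(100, 102, 99))

returns <- prices %>%
    # lag() shifts the price column down by one row (first row becomes NA)
    mutate(lag_1 = lag(adjusted)) %>%
    # The first day has no previous price, so it is dropped
    filter(!is.na(lag_1)) %>%
    # Daily return: percent change from the previous day
    mutate(pct_return = (adjusted - lag_1) / lag_1)

returns$pct_return  # 0.02 and approximately -0.0294
```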

Next, we will reshape the daily returns data into a wide "user-item" matrix, with one row per stock and one column per trading day, as follows:


# Transforming with the spread()
stock_date_matrix_data <- sp_500_daily_returns_data %>%
    spread(key = date, value = pct_return, fill = 0)


# Printing the data
stock_date_matrix_data

# A tibble: 503 x 365
   symbol `2018-01-03` `2018-01-04` `2018-01-05` `2018-01-08`
   <chr>         <dbl>        <dbl>        <dbl>        <dbl>
 1 A          0.0254       -0.00750     0.0160        0.00215
 2 AAL       -0.0123        0.00630    -0.000380     -0.00988
 3 AAP        0.00905       0.0369      0.0106       -0.00704
 4 AAPL      -0.000174      0.00464     0.0114       -0.00371
 5 ABBV       0.0156       -0.00570     0.0174       -0.0160 
 6 ABC        0.00372      -0.00222     0.0121        0.0166 
 7 ABMD       0.0173        0.0175      0.0154        0.0271 
 8 ABT        0.00221      -0.00170     0.00289      -0.00288
 9 ACN        0.00462       0.0118      0.00825       0.00799
10 ADBE       0.0188        0.0120      0.0116       -0.00162
# ... with 493 more rows, and 360 more variables: `2018-01-09` <dbl>,
#   `2018-01-10` <dbl>, `2018-01-11` <dbl>, `2018-01-12` <dbl>,
#   `2018-01-16` <dbl>, `2018-01-17` <dbl>, `2018-01-18` <dbl>,
#   `2018-01-19` <dbl>, `2018-01-22` <dbl>, `2018-01-23` <dbl>,
#   `2018-01-24` <dbl>, `2018-01-25` <dbl>, `2018-01-26` <dbl>,
#   `2018-01-29` <dbl>, `2018-01-30` <dbl>, `2018-01-31` <dbl>,
#   `2018-02-01` <dbl>, `2018-02-02` <dbl>, `2018-02-05` <dbl>,
#   `2018-02-06` <dbl>, `2018-02-07` <dbl>, `2018-02-08` <dbl>,
#   `2018-02-09` <dbl>, `2018-02-12` <dbl>, `2018-02-13` <dbl>,
#   `2018-02-14` <dbl>, `2018-02-15` <dbl>, `2018-02-16` <dbl>,
#   `2018-02-20` <dbl>, `2018-02-21` <dbl>, `2018-02-22` <dbl>,
#   `2018-02-23` <dbl>, `2018-02-26` <dbl>, `2018-02-27` <dbl>,
#   `2018-02-28` <dbl>, `2018-03-01` <dbl>, `2018-03-02` <dbl>,
#   `2018-03-05` <dbl>, `2018-03-06` <dbl>, `2018-03-07` <dbl>,
#   `2018-03-08` <dbl>, `2018-03-09` <dbl>, `2018-03-12` <dbl>,
#   `2018-03-13` <dbl>, `2018-03-14` <dbl>, `2018-03-15` <dbl>,
#   `2018-03-16` <dbl>, `2018-03-19` <dbl>, `2018-03-20` <dbl>,
#   `2018-03-21` <dbl>, `2018-03-22` <dbl>, `2018-03-23` <dbl>,
#   `2018-03-26` <dbl>, `2018-03-27` <dbl>, `2018-03-28` <dbl>,
#   `2018-03-29` <dbl>, `2018-04-02` <dbl>, `2018-04-03` <dbl>,
#   `2018-04-04` <dbl>, `2018-04-05` <dbl>, `2018-04-06` <dbl>,
#   `2018-04-09` <dbl>, `2018-04-10` <dbl>, `2018-04-11` <dbl>,
#   `2018-04-12` <dbl>, `2018-04-13` <dbl>, `2018-04-16` <dbl>,
#   `2018-04-17` <dbl>, `2018-04-18` <dbl>, `2018-04-19` <dbl>,
#   `2018-04-20` <dbl>, `2018-04-23` <dbl>, `2018-04-24` <dbl>,
#   `2018-04-25` <dbl>, `2018-04-26` <dbl>, `2018-04-27` <dbl>,
#   `2018-04-30` <dbl>, `2018-05-01` <dbl>, `2018-05-02` <dbl>,
#   `2018-05-03` <dbl>, `2018-05-04` <dbl>, `2018-05-07` <dbl>,
#   `2018-05-08` <dbl>, `2018-05-09` <dbl>, `2018-05-10` <dbl>,
#   `2018-05-11` <dbl>, `2018-05-14` <dbl>, `2018-05-15` <dbl>,
#   `2018-05-16` <dbl>, `2018-05-17` <dbl>, `2018-05-18` <dbl>,
#   `2018-05-21` <dbl>, `2018-05-22` <dbl>, `2018-05-23` <dbl>,
#   `2018-05-24` <dbl>, `2018-05-25` <dbl>, `2018-05-29` <dbl>,
#   `2018-05-30` <dbl>, `2018-05-31` <dbl>, `2018-06-01` <dbl>, ...
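To make the reshape above concrete, here is the same spread() call on a tiny made-up data set (two fictional symbols over two dates). Note that in current versions of tidyr, pivot_wider() is the recommended successor to spread(), though spread() still works:

```r
library(tidyr)
library(dplyr)

# Hypothetical long-format returns: two fictional symbols, two dates
toy_returns <- tibble(
    symbol     = c("A", "A", "B", "B"),
    date       = rep(c("2018-01-03", "2018-01-04"), times = 2),
    pct_return = c(0.01, -0.02, 0.03, 0.00)
)

# Wide "user-item" format: one row per symbol, one column per date
toy_wide <- toy_returns %>%
    spread(key = date, value = pct_return, fill = 0)

toy_wide  # 2 rows x 3 columns: symbol, `2018-01-03`, `2018-01-04`
```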

Performing K-Means Clustering

Next, we will perform the K-Means clustering as illustrated below.


# Initiating the KMeans Clustering Analysis
kmeans_data <- stock_date_matrix_data %>%
  
    # Excluding the symbol column
    select(-symbol) %>%
  
    # Computing the clusters
    kmeans(centers = 4, nstart = 25)

# Using glance() to obtain tot.withinss from the kmeans data
kmeans_data %>% 
  glance()

# A tibble: 1 x 4
  totss tot.withinss betweenss  iter
  <dbl>        <dbl>     <dbl> <int>
1  41.5         36.1      5.36     4
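The glance() output above satisfies a useful identity: totss = tot.withinss + betweenss. Here is a minimal, reproducible sketch on simulated 2D data (the toy data and seed are ours, not from the article's data set):

```r
library(broom)

# Simulated toy data: two well-separated 2D groups of 10 points each
set.seed(123)
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2))

km <- kmeans(toy, centers = 2, nstart = 25)

# glance() returns totss, tot.withinss, betweenss, and iter as a one-row tibble
glance(km)

# Sanity check: total sum of squares decomposes into within + between
all.equal(km$totss, km$tot.withinss + km$betweenss)  # TRUE
```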

Finding the optimal value of K

Next, we will use the purrr package to iterate over many values of “k” using the centers argument.

We will accomplish this using the custom function kmeans_mapper() below, from the DS4B 101-R: Business Analysis With R course:


kmeans_mapper <- function(center = 3) {
    stock_date_matrix_data %>%
        select(-symbol) %>%
        kmeans(centers = center, nstart = 25)
}

Next, we use the purrr package to iteratively apply the kmeans_mapper() and glance() functions as follows:


# Creating a tibble with tibble()
k_means_mapped_data <- tibble(centers = 1:25) %>%
    mutate(k_means = centers %>% map(kmeans_mapper)) %>%
    mutate(glance  = k_means %>% map(glance))

Next, we will visualize the “tot.withinss” from the glance output with a Scree Plot as demonstrated below.


# Initiating a Scree Plot
sp <- k_means_mapped_data %>%
    
    # Unnesting the glance column
    unnest(glance) %>%
    
    # Initializing a ggplot
    ggplot(aes(centers, tot.withinss)) +
    geom_point(color = "#2c3e50") +
    geom_line(color = "#2c3e50") +
    theme_tq() +
    labs(title = "Scree Plot") +
    theme(plot.title = element_text(hjust = 0.5))

# Printing the Scree Plot
sp

From the above graph, we can infer that the Scree Plot becomes linear (constant rate of change) between 5 and 10 centers for K.
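One way to make this "elbow" judgment more concrete is to look at the percentage drop in tot.withinss as k increases; once the drops become small and roughly constant, adding centers buys little. The numbers below are purely illustrative, not taken from the article's data:

```r
# Illustrative tot.withinss values for k = 1..8 (made-up numbers)
wss <- c(41.5, 38.9, 37.2, 36.1, 35.3, 34.8, 34.5, 34.3)

# Percentage reduction in within-cluster sum of squares at each step of k
pct_drop <- -diff(wss) / head(wss, -1)
round(pct_drop, 3)
```

When the successive drops level off, the scree curve has become approximately linear and further centers add little explanatory power.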

Applying UMAP

Next, we will plot the UMAP 2D visualization to better understand the cluster assignments.

To begin with, we will apply the umap() function to our stock_date_matrix_data, which contains our user-item matrix in tibble format, as follows.


# Applying UMAP to the user-item matrix
umap_results <- stock_date_matrix_data %>%
  
    # Excluding the symbol column
    select(-symbol) %>%
    
    # Applying the umap()
    umap()

Next, we will combine the layout from the umap_results with the symbol column from the stock_date_matrix_data.


# Combining the layout column with the symbol column
umap_results_data <- umap_results$layout %>%
  
    # Converting the data into tibble
    as_tibble() %>%
  
    # Combining the data with bind_cols()
    bind_cols(stock_date_matrix_data %>% select(symbol)) 

# Printing the results
umap_results_data

# A tibble: 503 x 3
       V1      V2 symbol
    <dbl>   <dbl> <chr> 
 1 -2.46  -0.411  A     
 2  0.906 -1.73   AAL   
 3 -0.143 -0.0916 AAP   
 4 -2.72  -1.82   AAPL  
 5 -1.40   0.679  ABBV  
 6 -1.000  0.958  ABC   
 7 -2.75  -1.04   ABMD  
 8 -2.24  -0.257  ABT   
 9 -1.56  -0.927  ACN   
10 -3.02  -1.36   ADBE  
# ... with 493 more rows

Lastly, we will perform a quick visualization of the umap_results_data as follows:


# Plotting the umap results data
umap_results_data %>%
  
    # Initiating the ggplot
    ggplot(aes(V1, V2)) +
    geom_point(alpha = 0.5, color = "#2c3e50") +
    theme_tq() +
    labs(title = "UMAP Projection") +
    theme(plot.title = element_text(hjust = 0.5))

Here, it is clear that some clusters exist. However, we still need to combine the K-Means clusters with the UMAP 2D representation.

Combining K-Means and UMAP

To combine the K-Means clusters and the UMAP 2D representation, we will first pull out the K-Means model for 6 centers; we will use 6 clusters since the Scree Plot begins to flatten between 5 and 10.


# Combining the kmeans with the umap results data
kmeans_data <- k_means_mapped_data %>%
  
    # Filtering to select only 6 centers
    filter(centers == 6) %>%
  
    # Extracting the kmeans 
    pull(k_means) %>%
  
    # Extracting only the first element
    pluck(1)

Next, we will combine the clusters from kmeans_data with umap_results_data using the following steps:


# Saving data as umap_kmeans_results_data
umap_kmeans_results_data <- kmeans_data %>% 
  
    # Augmenting the data with augment()  
    augment(stock_date_matrix_data) %>%
  
    # Selecting desired columns
    select(symbol, .cluster) %>%
  
    # Performing a left join twice in a row 
    left_join(umap_results_data, by = "symbol") %>%
    
    left_join(sp_500_index_data %>% select(symbol, company, sector),
              by = "symbol")
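To see what augment() contributes in the pipeline above, the sketch below applies it to a small simulated k-means fit; augment() appends a .cluster column to the original data (the toy data and seed are ours, not from the article's data set):

```r
library(broom)
library(dplyr)

# Simulated toy data: two tight, well-separated 2D groups
set.seed(42)
toy <- tibble(x = c(0, 0.1, 5, 5.1),
              y = c(0, 0.2, 5, 4.9))

km <- kmeans(toy, centers = 2, nstart = 5)

# augment() returns the original columns plus a .cluster factor column
aug <- augment(km, toy)
aug
```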

Plotting the K-Means and UMAP results


# Initiating ggplot
g <- umap_kmeans_results_data %>%
    ggplot(aes(V1, V2, color = .cluster)) +
    geom_point(alpha = 0.5) +
    theme_tq() +
    scale_color_tq() +
    labs(
      title = "Customer Segments Using the KMeans Clustering in R"
    ) +
    theme(legend.position = "none",
          plot.title = element_text(hjust = 0.5)) 

ggplotly(g)    

Conclusion

In this article, we have demonstrated how to obtain stock price data from the S&P 500 Index and how to perform K-Means clustering with a UMAP visualization, using the broom, plotly, tidyverse, and tidyquant packages.

Acknowledgement

The author thanks Matt Dancho of Business Science University for transforming his approach to data science.

Citation

For attribution, please cite this work as

Reng (2019, May 27). Reng Data Science Institute: Customer Segmentation with K-Means Clustering - R. Retrieved from https://www.rengdatascience.io/posts/2019-05-27-customer-segmentation-with-kmeans-clustering-r/

BibTeX citation

@misc{reng2019customer,
  author = {Reng, Alier Ëë},
  title = {Reng Data Science Institute: Customer Segmentation with K-Means Clustering - R},
  url = {https://www.rengdatascience.io/posts/2019-05-27-customer-segmentation-with-kmeans-clustering-r/},
  year = {2019}
}