Frequency Distribution: Alternative Method

Constructing Frequency and Relative Frequency Distribution Tables with gt and dplyr Packages.

Alier Ëë Reng https://www.alierwaaireng.com/
03-03-2019

The purpose of this article is to demonstrate how to construct both frequency and relative frequency distribution tables with the gt and dplyr packages. In the previous article, I estimated the number of data classes using Sturges' formula; in this article, I use an alternative method to create the data classes. I also demonstrate how to plot a histogram, a boxplot and a density curve.

Definition

A frequency distribution is a table that shows classes or intervals of data entries with counts of the number of entries in each class. The frequency f of a class is the number of data entries in the class.

A relative frequency of a class is the portion or percentage of the data that falls in that class. To find the relative frequency of a class, divide the frequency f by the sample size n.

\(\text{Rel. Frequency} = \frac{\text{Class frequency}}{\text{Sample size}} = \frac{f}{n}\)

The cumulative frequency of a class is the sum of the frequency for that class and all previous classes. The cumulative frequency of the last class is equal to the sample size n.
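
As a minimal sketch of these three definitions, the base R snippet below uses a small made-up vector of class labels. The data and the names `classes`, `f`, `rel_f` and `cum_f` are illustrative assumptions, not part of the Boston data used later.

```r
# Hypothetical class labels for n = 10 observations (illustrative only).
classes <- factor(c("A", "A", "B", "B", "B", "B", "C", "C", "C", "D"),
                  levels = c("A", "B", "C", "D"))

f     <- as.vector(table(classes)) # Class frequencies f.
n     <- sum(f)                    # Sample size n.
rel_f <- f / n                     # Relative frequencies f / n.
cum_f <- cumsum(f)                 # Cumulative frequencies.

# Collect the three measures in one table.
data.frame(Class = levels(classes), Frequency = f,
           Rel.Frequency = rel_f, Cum.Frequency = cum_f)
```

In this sketch the relative frequencies sum to 1 and the last cumulative frequency equals n, as the definitions require.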

Load the packages

The exploratory graphs that follow show that the median values of the owner-occupied homes in Boston suburbs (in $1000s) are not normally distributed: the distribution is right-skewed. Outliers are identified and displayed on a boxplot after the histograms.


library(MASS)
library(gt)
library(dplyr)
library(ggplot2)

data(Boston) # Load the Boston Housing data from the MASS package.


# Construct the five number summary.

five_num <- Boston %>% # Start from the Boston Housing dataset.
  select(medv) %>% # Choose only this column.
  summary() %>% # Calculate the five numbers and the mean.
  tibble() %>% # Convert the summary into a table with the tibble() function.
  rename(Summary = ".") %>% # Rename the column.
  gt() %>% # Initialize gt table.
  tab_header( # Add a title with the tab_header() function.
    title = "The Five Number Summary with the Mean"
  ) %>% 
  tab_options(heading.background.color = "forestgreen",
              column_labels.background.color = "grey", # Change title and columns background colors.
              table.width = "50%") %>%
  cols_align(align = "center", # Center column values.
             columns = everything()) %>%
  tab_source_note(
    source_note = "From https://www.rengdatascience.io; by Alier Ëë Reng, 03/03/2019"
    ) # Add citation information.

five_num

The Five Number Summary with the Mean
Summary
Min.   : 5.00
1st Qu.:17.02
Median :21.20
Mean   :22.53
3rd Qu.:25.00
Max.   :50.00
From https://www.rengdatascience.io; by Alier Ëë Reng, 03/03/2019

# Basic histogram of median values of owner-occupied homes.

hist_data <- Boston %>% # Start from the Boston Housing dataset.
  select(medv) %>% # Keep only the medv column.
  ggplot(aes(medv)) + # Initiate a ggplot2 plot.
  geom_histogram(binwidth = .5, position = "identity", # binwidth alone sets the classes; it overrides bins.
                 fill = "lightblue", color = "black") +
  ggtitle("Median Value of the Owner-Occupied\n Homes in Boston Suburbs (in $1000s)") +
  theme(plot.title = element_text(hjust = 0.5, color = "black")) +
  xlab("Median Value of the Owner-occupied Homes in Boston Suburbs (in $1000s)") + # Change X-Axis Label.
  ylab("Number of the Owner-Occupied Homes") # Change Y-Axis Label.

hist_data


# Histogram of median values of owner-occupied homes with a white fill.

hist_data_2 <- Boston %>% # Start from the Boston Housing dataset.
  select(medv) %>% # Keep only the medv column.
  ggplot(aes(medv)) + # Initiate a ggplot2 plot.
  geom_histogram(binwidth = .5, # binwidth alone sets the classes; it overrides bins.
                 color = "black", fill = "white") +
  ggtitle("Median Value of the Owner-Occupied\n Homes in Boston Suburbs (in $1000s)") +
  theme(plot.title = element_text(hjust = 0.5, color = "black")) +
  xlab("Median Value of the Owner-occupied Homes in Boston Suburbs (in $1000s)") + # Change X-Axis Label.
  ylab("Number of the Owner-Occupied Homes") # Change Y-Axis Label.

hist_data_2


# Density curve of median values of owner-occupied homes.

hist_data_3 <- Boston %>% # Start from the Boston Housing dataset.
  select(medv) %>% # Keep only the medv column.
  ggplot(aes(medv)) + # Initiate a ggplot2 plot.
  geom_density(color = "red") + # Draw the kernel density curve in red.
  ggtitle("Density Curve of Median Value of the Owner-Occupied\n Homes in Boston Suburbs (in $1000s)") +
  theme(plot.title = element_text(hjust = 0.5, color = "red")) +
  xlab("Median Value of the Owner-occupied Homes in Boston Suburbs (in $1000s)") + # Change X-Axis Label.
  ylab("Density") # The y-axis of a density curve shows density, not counts.


hist_data_3


# Histogram overlaid with a kernel density curve

hist_data_4 <- Boston %>% # Start from the Boston Housing dataset.
  select(medv) %>% # Keep only the medv column.
  ggplot(aes(medv)) + # Initiate a ggplot2 plot.
  geom_histogram(aes(y = after_stat(density)), # Scale the bars to density so they match the curve.
                 binwidth = .5, 
                 color = "black",
                 fill = "white") +
  geom_density(alpha = .5, fill = "blue") +
  ggtitle("Histogram and Density Curve of Median Value of the\n Owner-Occupied Homes in Boston Suburbs (in $1000s)") +
  theme(plot.title = element_text(hjust = 0.5, color = "black")) +
  geom_vline(aes(xintercept = mean(medv)),   # Dashed vertical line at the mean of medv.
               color = "black", linetype = "dashed", linewidth = 1) +
  xlab("Median Value of the Owner-occupied Homes in Boston Suburbs (in $1000s)") + # Change X-Axis Label.
  ylab("Density") # The bars are scaled to density, so the y-axis is not a count.

hist_data_4


# Confirming outliers with the boxplot.

boxp_data <- Boston %>% # Start from the Boston Housing dataset.
  ggplot(aes(y = medv)) + # Plot a boxplot with the ggplot() function.
  geom_boxplot(outlier.color = "red") + # Draw the outliers in red.
  coord_flip() + # Rotate the axes.
  ggtitle("Boxplot of Median Values of the Owner-occupied Homes\n in Boston Suburbs") + # Add title.
  theme(plot.title = element_text(hjust = 0.5, color = "black")) + # Center the title and make it black.
  ylab("Median Values of the Owner-occupied Homes in Boston Suburbs (in $1000s)")  # Change Y-Axis Label.
  

boxp_data
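
By default, the points a boxplot draws as outliers are the values that lie more than 1.5 times the interquartile range beyond the first or third quartile. As a rough sketch of that rule, the same values can be listed directly with base R. The names `q`, `iqr`, `lower_fence`, `upper_fence` and `outliers` are illustrative and do not appear in the code above.

```r
library(MASS) # Provides the Boston data set.

x <- Boston$medv

q   <- quantile(x, probs = c(0.25, 0.75)) # First and third quartiles.
iqr <- unname(q[2] - q[1])                # Interquartile range.

lower_fence <- unname(q[1]) - 1.5 * iqr   # Lower cut-off for outliers.
upper_fence <- unname(q[2]) + 1.5 * iqr   # Upper cut-off for outliers.

# Values outside the fences are the ones a boxplot flags as outliers.
outliers <- x[x < lower_fence | x > upper_fence]

length(outliers) # Number of flagged observations.
range(outliers)  # Smallest and largest flagged values.
```

Listing the flagged values this way makes it easy to check which tail of the distribution they come from.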

Frequency and Relative Frequency Distribution Table


# Build the frequency and relative frequency distribution table.
gt_data <- cut(Boston$medv, 8) %>% # Split medv into 8 equal-width classes with the cut() function.
  table() %>% # Count the entries in each class with the table() function.
  as.data.frame() %>% # Convert the counts into a data frame.
  rename(Classes = ".", Frequency = "Freq") %>% # Rename these columns.
  mutate(`Rel. Frequency` = round(Frequency / sum(Frequency), digits = 3),
         Percentages = `Rel. Frequency` * 100) %>% # Calculate relative frequencies and percentages with the mutate() function.
  add_row(Classes = "Total", Frequency = sum(.$Frequency), # Add the total row with the add_row() function.
          `Rel. Frequency` = round(sum(.$`Rel. Frequency`), digits = 2),
          Percentages = round(sum(.$Percentages))) %>% 
  gt() %>% # Initiate gt table.
  tab_header( # Add title and subtitle with the tab_header() function.
    title = "Frequency & Relative Frequency Distribution Table",
    subtitle = "Median Values of the Owner-occupied Homes in Boston Suburbs (in $1000s)"
  ) %>% 
  tab_options(heading.background.color = "forestgreen",
              column_labels.background.color = "grey", # Change title and columns background colors.
              table.width = "100%") %>%
  tab_style(style = list(cell_fill(color = "black"), # Black background.
                         cell_text(color = "white")), # White text.
            locations = cells_body(columns = everything(),
                                   rows = Classes == "Total")) %>% # Apply the style to the Total row only.
  cols_align(align = "center", # Center column values.
             columns = everything()) %>%
  tab_source_note(
    source_note = "From https://www.rengdatascience.io; by Alier Ëë Reng, 03/03/2019"
    ) # Add citation information.
  

gt_data 

Frequency & Relative Frequency Distribution Table
Median Values of the Owner-occupied Homes in Boston Suburbs (in $1000s)
Classes        Frequency   Rel. Frequency   Percentages
(4.96,10.6]           31            0.061           6.1
(10.6,16.2]           85            0.168          16.8
(16.2,21.9]          158            0.312          31.2
(21.9,27.5]          126            0.249          24.9
(27.5,33.1]           47            0.093           9.3
(33.1,38.8]           27            0.053           5.3
(38.8,44.4]            9            0.018           1.8
(44.4,50]             23            0.045           4.5
Total                506            1.000         100.0
From https://www.rengdatascience.io; by Alier Ëë Reng, 03/03/2019
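
The class boundaries in this table come from `cut()`. When `cut()` is given a single number of classes, it splits the range of the data into that many equal-width intervals and moves the two outer breaks slightly outward so that the minimum and maximum are included. The sketch below is an illustration rather than part of the original analysis: it computes the class width by hand and checks that the manual breaks give the same counts as `cut()`. The names `k`, `width`, `breaks`, `manual` and `auto` are assumptions introduced here.

```r
library(MASS) # Provides the Boston data set.

x <- Boston$medv
k <- 8 # Number of classes used in the table above.

width  <- (max(x) - min(x)) / k                    # Equal class width.
breaks <- seq(min(x), max(x), length.out = k + 1)  # k + 1 class boundaries.

# Manual classes: include.lowest = TRUE keeps the minimum in the first class.
manual <- table(cut(x, breaks = breaks, include.lowest = TRUE))
auto   <- table(cut(x, k)) # Classes chosen by cut() itself.

width                                     # Width of each class in $1000s.
all(as.vector(manual) == as.vector(auto)) # Do both approaches give the same counts?
```

Passing explicit `breaks` is a reasonable choice when rounder class limits are preferred over the ones `cut()` derives from the data range.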

Citation

For attribution, please cite this work as

Reng (2019, March 3). Reng Data Science Institute: Frequency Distribution: Alternative Method. Retrieved from https://www.rengdatascience.io/posts/2019-03-03-frequency-distribution-alternative-method/

BibTeX citation

@misc{reng2019frequency,
  author = {Reng, Alier Ëë},
  title = {Reng Data Science Institute: Frequency Distribution: Alternative Method},
  url = {https://www.rengdatascience.io/posts/2019-03-03-frequency-distribution-alternative-method/},
  year = {2019}
}