Constructing Frequency and Relative Frequency Distribution Tables with the gt and dplyr Packages

The purpose of this article is to demonstrate how to construct both frequency and relative frequency distribution tables with the gt and dplyr packages. In the previous article, I estimated the number of data classes using Sturges' formula; in this article, I use an alternative method to create the data classes. I also demonstrate how to plot a histogram, a boxplot, and a density curve.
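As a point of comparison with the 8 classes used later in this article, the following base R sketch computes the class count suggested by Sturges' formula for a sample of 506 observations (the total shown in the frequency table). It is only an illustration of the formula, not a reproduction of the previous article's code.

```r
# Sturges' formula: k = ceiling(1 + log2(n)).
n <- 506                              # sample size (the "Total" row of the frequency table)
k_sturges <- ceiling(1 + log2(n))     # class count suggested by Sturges' formula
k_sturges

# Base R ships the same rule as grDevices::nclass.Sturges(); it only uses length(x).
k_builtin <- grDevices::nclass.Sturges(seq_len(n))
k_builtin
```

Both values come out to 10, so choosing 8 classes by hand gives slightly wider classes than Sturges' formula would.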

**Definition**

A **frequency distribution** is a table that shows classes or intervals of data entries with counts of the number of entries in each class. The **frequency** **f** of a class is the number of data entries in the class.

The **relative frequency** of a class is the portion or percentage of the data that falls in that class. To find the relative frequency of a class, divide the frequency **f** by the sample size **n**:

\(\text{Rel. Frequency} = \frac{\text{Class frequency}}{\text{Sample size}} = \frac{f}{n}\)

The **cumulative frequency** of a class is the sum of the frequency for that class and all previous classes. The cumulative frequency of the last class is equal to the sample size **n**.
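These definitions can be checked with a few lines of base R. The sketch below uses the class frequencies reported in the frequency table of this article; the variable names are my own and are only for illustration.

```r
# Class frequencies copied from the frequency table in this article.
f <- c(31, 85, 158, 126, 47, 27, 9, 23)

n <- sum(f)             # sample size: 506
rel_freq <- f / n       # relative frequency of each class (f / n)
cum_freq <- cumsum(f)   # cumulative frequency: running total of f

round(rel_freq, 3)      # 0.061 0.168 0.312 0.249 0.093 0.053 0.018 0.045
cum_freq                # 31 116 274 400 447 474 483 506
```

The last cumulative frequency equals the sample size, and the unrounded relative frequencies sum to 1.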

The exploratory graphs below show that the median values of the owner-occupied homes in Boston suburbs (in $1000s) are not normally distributed: the data are right-skewed. Outliers are identified and displayed on a boxplot after the histograms.

```
library(MASS)
library(gt)
library(dplyr)
library(ggplot2)

data(Boston) # Load the Boston Housing data from the MASS package.

# Construct the five-number summary (summary() also reports the mean).
five_num <- Boston %>%
  select(medv) %>% # Choose only this column.
  summary() %>% # Calculate the five numbers and the mean.
  tibble() %>% # Convert the summary into a tibble with the tibble() function.
  rename(Summary = ".") %>% # Rename the column.
  gt() %>% # Initialize the gt table.
  tab_header( # Add a title with the tab_header() function.
    title = "The Five Number Summary with the Mean"
  ) %>%
  tab_options(
    heading.background.color = "forestgreen", # Title background color.
    column_labels.background.color = "grey", # Column-label background color.
    table.width = "50%"
  ) %>%
  cols_align(
    align = "center", # Center column values.
    columns = everything()
  ) %>%
  tab_source_note(
    source_note = "From https://www.rengdatascience.io; by Alier Ëë Reng, 03/03/2019"
  ) # Add citation information.
five_num
```

**The Five Number Summary with the Mean**

Summary |
---|
Min. : 5.00 |
1st Qu.:17.02 |
Median :21.20 |
Mean :22.53 |
3rd Qu.:25.00 |
Max. :50.00 |

From https://www.rengdatascience.io; by Alier Ëë Reng, 03/03/2019
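The summary already hints at the right skew discussed in this article. As a rough check, assuming only the rounded values printed in the summary, the sketch below compares the mean with the median and the lengths of the two tails. The names are illustrative, and this is a quick heuristic rather than a formal test of skewness.

```r
# Values copied from the printed summary of medv (in $1000s).
s <- c(min = 5.00, q1 = 17.02, median = 21.20, mean = 22.53, q3 = 25.00, max = 50.00)

mean_minus_median <- s[["mean"]] - s[["median"]]  # positive: the mean is pulled to the right
upper_tail <- s[["max"]] - s[["q3"]]              # distance from the 3rd quartile to the maximum
lower_tail <- s[["q1"]] - s[["min"]]              # distance from the minimum to the 1st quartile

c(mean_minus_median = mean_minus_median,
  upper_tail = upper_tail,
  lower_tail = lower_tail)
```

The mean exceeds the median by about 1.33, and the upper tail (25.00) is roughly twice as long as the lower tail (12.02), which is consistent with a right-skewed distribution.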

```
# Basic histogram of median values of owner-occupied homes.
hist_data <- Boston %>%
  select(medv) %>% # Choose the medv column from the Boston Housing dataset.
  ggplot(aes(medv)) + # Initiate a ggplot2 histogram.
  geom_histogram(
    binwidth = 0.5, # Each bin spans 0.5 (that is, $500); binwidth takes precedence over bins.
    position = "identity",
    fill = "lightblue", color = "black"
  ) +
  ggtitle("Median Value of the Owner-Occupied\n Homes in Boston Suburbs (in $1000s)") +
  theme(plot.title = element_text(hjust = 0.5, color = "black")) + # Center the title.
  xlab("Median Value of the Owner-occupied Homes in Boston Suburbs (in $1000s)") + # Change X-Axis Label.
  ylab("Number of the Owner-Occupied Homes") # Change Y-Axis Label.
hist_data
```

```
# Histogram of median values of owner-occupied homes with a white fill.
hist_data_2 <- Boston %>%
  select(medv) %>% # Choose the medv column from the Boston Housing dataset.
  ggplot(aes(medv)) + # Initiate a ggplot2 histogram.
  geom_histogram(
    binwidth = 0.5, # Each bin spans 0.5 (that is, $500); binwidth takes precedence over bins.
    color = "black", fill = "white"
  ) +
  ggtitle("Median Value of the Owner-Occupied\n Homes in Boston Suburbs (in $1000s)") +
  theme(plot.title = element_text(hjust = 0.5, color = "black")) + # Center the title.
  xlab("Median Value of the Owner-occupied Homes in Boston Suburbs (in $1000s)") + # Change X-Axis Label.
  ylab("Number of the Owner-Occupied Homes") # Change Y-Axis Label.
hist_data_2
```

```
# Density curve of median values of owner-occupied homes.
hist_data_3 <- Boston %>%
  select(medv) %>% # Choose the medv column from the Boston Housing dataset.
  ggplot(aes(medv)) + # Initiate a ggplot2 plot.
  geom_density(color = "red") + # Draw the kernel density curve.
  ggtitle("Density Curve of Median Value of the Owner-Occupied\n Homes in Boston Suburbs (in $1000s)") +
  theme(plot.title = element_text(hjust = 0.5, color = "red")) + # Center the title and make it red.
  xlab("Median Value of the Owner-occupied Homes in Boston Suburbs (in $1000s)") + # Change X-Axis Label.
  ylab("Density") # The y-axis of a density curve shows density, not counts.
hist_data_3
```

```
# Histogram overlaid with a kernel density curve.
hist_data_4 <- Boston %>%
  select(medv) %>% # Choose the medv column from the Boston Housing dataset.
  ggplot(aes(medv)) + # Initiate a ggplot2 histogram.
  geom_histogram(
    aes(y = after_stat(density)), # Plot densities instead of counts (ggplot2 >= 3.3.0).
    binwidth = 0.5,
    color = "black",
    fill = "white"
  ) +
  geom_density(alpha = 0.5, fill = "blue") + # Overlay the semi-transparent density curve.
  ggtitle("Histogram and Density Curve of Median Value of the\n Owner-Occupied Homes in Boston Suburbs (in $1000s)") +
  theme(plot.title = element_text(hjust = 0.5, color = "black")) +
  geom_vline(
    aes(xintercept = mean(medv, na.rm = TRUE)), # Mark the mean, ignoring NA values.
    color = "black", linetype = "dashed",
    linewidth = 1 # ggplot2 >= 3.4.0; use size = 1 in older versions.
  ) +
  xlab("Median Value of the Owner-occupied Homes in Boston Suburbs (in $1000s)") + # Change X-Axis Label.
  ylab("Density") # The y-axis now shows density, not counts.
hist_data_4
```
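To overlay a density curve on a histogram, the bars must be drawn on the density scale rather than as counts. As a minimal sketch of what that rescaling means, the base R example below uses a small hypothetical vector (not the Boston data) and `hist(..., plot = FALSE)`: each bar height becomes count / (n × bin width), so the bar areas sum to 1.

```r
# Hypothetical values, used only to illustrate the density scaling.
x <- c(5, 7, 8, 12, 13, 13, 15, 21, 22, 30)

# Compute the histogram without drawing it; bins are (5,10], (10,15], ..., (25,30],
# with the lowest value 5 included in the first bin.
h <- hist(x, breaks = seq(5, 30, by = 5), plot = FALSE)

bin_width <- diff(h$breaks)                            # width of each bin (all 5 here)
manual_density <- h$counts / (length(x) * bin_width)   # count / (n * width)

h$counts                   # 3 4 0 2 1
manual_density             # matches h$density
sum(h$density * bin_width) # total bar area is 1
```

Because the total area is 1, the bars and the kernel density curve share the same vertical scale.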

```
# Confirm the outliers with a boxplot.
boxp_data <- Boston %>%
  ggplot(aes(y = medv)) + # Plot a boxplot with the ggplot() function.
  geom_boxplot(outlier.color = "red") + # Draw outliers in red.
  coord_flip() + # Rotate the axes so the boxplot is horizontal.
  ggtitle("Boxplot of Median Values of the Owner-occupied Homes\n in Boston Suburbs") + # Add title.
  theme(plot.title = element_text(hjust = 0.5, color = "black")) + # Center the title and make it black.
  ylab("Median Values of the Owner-occupied Homes in Boston Suburbs (in $1000s)") # Change Y-Axis Label.
boxp_data
```
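By default, `geom_boxplot()` draws as outliers the points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule to the rounded quartiles printed in the five number summary. Since ggplot2 computes the quartiles from the raw data, its fences may differ slightly from these approximate values.

```r
# Quartiles copied from the printed summary of medv (in $1000s); rounded values.
q1 <- 17.02
q3 <- 25.00

iqr <- q3 - q1                    # interquartile range: 7.98
lower_fence <- q1 - 1.5 * iqr     # points below this are drawn as outliers
upper_fence <- q3 + 1.5 * iqr     # points above this are drawn as outliers

c(lower = lower_fence, upper = upper_fence)

# The maximum from the summary (50.00) lies above the upper fence.
50.00 > upper_fence
```

With these rounded inputs the upper fence is about 36.97, so the largest home values, including the maximum of 50.00, are plotted as outliers.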

```
# Construct the frequency and relative frequency distribution table.
gt_data <- cut(Boston$medv, 8) %>% # Split medv into 8 equal-width classes with the cut() function.
  table() %>% # Count the entries in each class with the table() function.
  as.data.frame() %>% # Convert the table into a data frame.
  rename(Classes = ".", Frequency = "Freq") %>% # Rename these columns.
  mutate(
    Classes = as.character(Classes), # Store labels as text so a "Total" label can be added.
    `Rel. Frequency` = round(Frequency / sum(Frequency), digits = 3),
    Percentages = `Rel. Frequency` * 100
  ) %>% # Calculate relative frequencies and percentages with the mutate() function.
  add_row(
    Classes = "Total", Frequency = sum(.$Frequency), # Add the total row with the add_row() function.
    `Rel. Frequency` = round(sum(.$`Rel. Frequency`), digits = 2),
    Percentages = round(sum(.$Percentages))
  ) %>%
  gt() %>% # Initiate the gt table.
  tab_header( # Add title and subtitle with the tab_header() function.
    title = "Frequency & Relative Frequency Distribution Table",
    subtitle = "Median Values of the Owner-occupied Homes in Boston Suburbs (in $1000s)"
  ) %>%
  tab_options(
    heading.background.color = "forestgreen", # Title background color.
    column_labels.background.color = "grey", # Column-label background color.
    table.width = "100%"
  ) %>%
  tab_style( # Highlight the total row: white text on a black background.
    style = list(cell_fill(color = "black"), cell_text(color = "white")),
    locations = cells_body(columns = everything(), rows = Classes == "Total")
  ) %>%
  cols_align(
    align = "center", # Center column values.
    columns = everything()
  ) %>%
  tab_source_note(
    source_note = "From https://www.rengdatascience.io; by Alier Ëë Reng, 03/03/2019"
  ) # Add citation information.
gt_data
```

**Frequency & Relative Frequency Distribution Table**

Median Values of the Owner-occupied Homes in Boston Suburbs (in $1000s)

Classes | Frequency | Rel. Frequency | Percentages |
---|---|---|---|
(4.96,10.6] | 31 | 0.061 | 6.1 |
(10.6,16.2] | 85 | 0.168 | 16.8 |
(16.2,21.9] | 158 | 0.312 | 31.2 |
(21.9,27.5] | 126 | 0.249 | 24.9 |
(27.5,33.1] | 47 | 0.093 | 9.3 |
(33.1,38.8] | 27 | 0.053 | 5.3 |
(38.8,44.4] | 9 | 0.018 | 1.8 |
(44.4,50] | 23 | 0.045 | 4.5 |
Total | 506 | 1.000 | 100.0 |

From https://www.rengdatascience.io; by Alier Ëë Reng, 03/03/2019
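The class limits in the table come from how `cut()` treats a single number of breaks: it divides the range of the data into that many equal-width intervals and then moves the two outer limits outward by 0.1% of the range so the minimum and maximum fall inside a class. The base R sketch below illustrates this with a two-value vector that spans the same range as medv (5 to 50, per the five number summary). The vector is a stand-in for the data, since only the range determines the breaks.

```r
# Stand-in vector with the same minimum and maximum as medv (in $1000s).
x <- c(5, 50)
k <- 8                                   # number of classes

width <- diff(range(x)) / k              # class width: 45 / 8 = 5.625
breaks <- seq(min(x), max(x), length.out = k + 1)  # interior class limits before labelling
nudge <- diff(range(x)) / 1000           # 0.1% of the range: 0.045

width
breaks
min(x) - nudge                           # 4.955, printed as 4.96 in the first label

classes <- cut(x, k)                     # same call pattern as cut(Boston$medv, 8)
levels(classes)                          # labels use 3 significant digits by default
```

The labels printed by `levels(classes)` should match the Classes column above, because the breaks depend only on the range of the data and the number of classes.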

For attribution, please cite this work as

Reng (2019, March 3). Reng Data Science Institute: Frequency Distribution: Alternative Method. Retrieved from https://www.rengdatascience.io/posts/2019-03-03-frequency-distribution-alternative-method/

BibTeX citation

@misc{reng2019frequency, author = {Reng, Alier Ëë}, title = {Reng Data Science Institute: Frequency Distribution: Alternative Method}, url = {https://www.rengdatascience.io/posts/2019-03-03-frequency-distribution-alternative-method/}, year = {2019} }