South Sudan Population in Pictures!

This is a brief analysis of South Sudan 2008 Population Census data. Our objective is to demonstrate how to perform data wrangling with R.

Alier Ëë Reng https://www.rengdatascience.io/
03-24-2019

Purpose Statement

The primary objective of this article is to demonstrate how to perform data wrangling with R. We will use South Sudan’s 2008 population census data to accomplish this objective. Further, we will investigate how South Sudan’s population is distributed across the country’s ten states, across age groups, and across gender groups.

Loading packages


library(tidyverse)
library(gt)
library(plotly)
library(tidyquant)

# Importing data
pop_data <- readxl::read_excel('C:/Users/areng/Desktop/alierLab/ObservationData.xlsx') 

# Inspecting the first 6 rows (the head() default)
head(pop_data) 

# A tibble: 6 x 10
  Region `Region Name` `Region - Regio~ Variable `Variable Name` Age  
  <chr>  <chr>         <chr>            <chr>    <chr>           <chr>
1 KN.A2  Upper Nile    SS-NU            KN.B2    Population, To~ KN.C1
2 KN.A2  Upper Nile    SS-NU            KN.B2    Population, To~ KN.C2
3 KN.A2  Upper Nile    SS-NU            KN.B2    Population, To~ KN.C3
4 KN.A2  Upper Nile    SS-NU            KN.B2    Population, To~ KN.C4
5 KN.A2  Upper Nile    SS-NU            KN.B2    Population, To~ KN.C5
6 KN.A2  Upper Nile    SS-NU            KN.B2    Population, To~ KN.C6
# ... with 4 more variables: `Age Name` <chr>, Scale <chr>,
#   Units <chr>, `2008` <dbl>

# Feature engineering
pop_data_clean = pop_data %>% 
  
  # Selecting desired columns with contains()
  select(contains("Name"), "2008") %>% 
  
  # Renaming columns with set_names()
  set_names("State", "Category",
          "Age Category", "Population") %>% 
  
  # Separating a column with separate()
  separate("Category", 
           into = c("Pop.", "Category", "Other"),
           sep = " ") %>% 
  
  # Selecting columns
   select(1, 3, 5, 6)  
  
  
# Viewing the top rows.
head(pop_data_clean) 

# A tibble: 6 x 4
  State      Category `Age Category` Population
  <chr>      <chr>    <chr>               <dbl>
1 Upper Nile Total    Total              964353
2 Upper Nile Total    0 to 4             150872
3 Upper Nile Total    5 to 9             151467
4 Upper Nile Total    10 to 14           126140
5 Upper Nile Total    15 to 19           103804
6 Upper Nile Total    20 to 24            82588
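
The separate() step above splits each label on single spaces, so the second piece carries the category. The following is a minimal sketch of that behaviour, not the article's actual data: the three labels are hypothetical stand-ins (the real values are truncated in the printed output), and the fill and extra arguments are additions that avoid the warnings tidyr raises when a label has fewer or more than three pieces.

```r
library(tibble)
library(tidyr)

# Hypothetical labels standing in for the truncated `Variable Name` values
toy_labels <- tibble(
  Category = c("Population, Total",
               "Population, Male (all ages)",
               "Population, Female")
)

# Split on single spaces into three pieces:
#   fill  = "right" pads short labels with NA on the right
#   extra = "merge" keeps any surplus words together in the last piece
toy_split <- separate(
  toy_labels,
  Category,
  into  = c("Pop.", "Category", "Other"),
  sep   = " ",
  fill  = "right",
  extra = "merge"
)

toy_split
```

With these toy labels, the Category column holds Total, Male and Female, while Other is NA except for the label that carried extra words.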

Handling Missing Values or NAs

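A quick summary using the skim() function reveals that there are three missing values in each of the Category, Age Category, and Population columns and one missing value in the State column. The State column also reports 12 unique values rather than 10, which hints at stray non-data rows.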


# Examining data for abnormalities with skim()
pop_data_clean_sk = pop_data_clean %>% 
  
  # Perform a quick summary with skim()
  skimr::skim() 


# Inspecting the skim summary
pop_data_clean_sk

Skim summary statistics
 n obs: 453 
 n variables: 4 

-- Variable type:character -------------------------------------------------------------------------------------------------------------------------------------------------------
     variable missing complete   n min max empty n_unique
 Age Category       3      450 453   3   8     0       15
     Category       3      450 453   4   6     0        3
        State       1      452 453   5  90     0       12

-- Variable type:numeric ---------------------------------------------------------------------------------------------------------------------------------------------------------
   variable missing complete   n     mean        sd   p0     p25
 Population       3      450 453 73426.58 150692.51 1909 13598.5
   p50   p75    p100     hist
 30816 62158 1358602 ▇▁▁▁▁▁▁▁

# Viewing the last 5 rows to display rows with NAs
tail(pop_data_clean, 5)

# A tibble: 5 x 4
  State                             Category `Age Category` Population
  <chr>                             <chr>    <chr>               <dbl>
1 Eastern Equatoria                 Female   60 to 64             5274
2 Eastern Equatoria                 Female   65+                  8637
3 <NA>                              <NA>     <NA>                   NA
4 National Bureau of Statistics, S~ <NA>     <NA>                   NA
5 http://southsudan.opendataforafr~ <NA>     <NA>                   NA
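
As a lighter-weight complement to skim(), the missing values can also be counted per column with dplyr alone. This is a sketch on toy rows that mimic the tail shown above: the two population figures come from that output, while the "source note" and "source URL" strings are placeholders for the truncated footer text.

```r
library(dplyr)
library(tibble)

# Toy rows: two valid records, one empty row, and two footer rows
# that only fill the State column (placeholder text, not the real footers)
toy_tail <- tibble(
  State      = c("Eastern Equatoria", "Eastern Equatoria", NA,
                 "source note", "source URL"),
  Category   = c("Female", "Female", NA, NA, NA),
  Population = c(5274, 8637, NA, NA, NA)
)

# Number of missing values in each column
na_counts <- toy_tail %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Keeping only rows that carry a category drops the empty and footer rows
toy_clean <- toy_tail %>%
  filter(!is.na(Category))

na_counts
toy_clean
```

Filtering on Category rather than State matters here, because the footer rows do have a State value and would otherwise survive.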

# Removing NAs
trans_pop_data_clean = pop_data_clean %>% 
  
  # Filtering to delete rows with NAs with is.na()
  filter(!is.na(Category))  

trans_pop_data_clean

# A tibble: 450 x 4
   State      Category `Age Category` Population
   <chr>      <chr>    <chr>               <dbl>
 1 Upper Nile Total    Total              964353
 2 Upper Nile Total    0 to 4             150872
 3 Upper Nile Total    5 to 9             151467
 4 Upper Nile Total    10 to 14           126140
 5 Upper Nile Total    15 to 19           103804
 6 Upper Nile Total    20 to 24            82588
 7 Upper Nile Total    25 to 29            76754
 8 Upper Nile Total    30 to 34            63134
 9 Upper Nile Total    35 to 39            56806
10 Upper Nile Total    40 to 44            42139
# ... with 440 more rows

# Modifying data to clean it up
trans_pop_data_clean_final <- trans_pop_data_clean %>% 
  
  # Filtering out totals
  filter(Category != "Total",
         `Age Category` != "Total") %>% 

  # Transforming the age category column with case_when()
  mutate(`Age Category` = case_when(
     `Age Category` %in%  c("0 to 4", "5 to 9", "10 to 14", "15 to 19") ~  "0 - 19",
     `Age Category` %in%  c( "20 to 24", "25 to 29", "30 to 34" )       ~ "20 - 34",
     `Age Category` %in%  c("35 to 39", "40 to 44", "45 to 49")         ~ "35 - 49",
     `Age Category` %in%  c( "50 to 54", "55 to 59", "60 to 64")        ~ "50 - 64",
     TRUE                                                               ~   ">= 65"
  )
)

# Inspecting the recoded age categories
trans_pop_data_clean_final

# A tibble: 280 x 4
   State      Category `Age Category` Population
   <chr>      <chr>    <chr>               <dbl>
 1 Upper Nile Male     0 - 19              82690
 2 Upper Nile Male     0 - 19              83744
 3 Upper Nile Male     0 - 19              71027
 4 Upper Nile Male     0 - 19              57387
 5 Upper Nile Male     20 - 34             42521
 6 Upper Nile Male     20 - 34             38795
 7 Upper Nile Male     20 - 34             32236
 8 Upper Nile Male     35 - 49             30228
 9 Upper Nile Male     35 - 49             22290
10 Upper Nile Male     35 - 49             18163
# ... with 270 more rows
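
case_when() evaluates its conditions in order and stops at the first match, so a catch-all TRUE branch also absorbs missing or misspelled labels. The sketch below is an optional variation, not the recoding used in this article: it matches "65+" explicitly and sends anything unexpected to NA so that it is easy to spot.

```r
library(dplyr)

# A few age labels from the data, plus a missing value
age_labels <- c("0 to 4", "20 to 24", "45 to 49", "60 to 64", "65+", NA)

age_groups <- case_when(
  age_labels %in% c("0 to 4", "5 to 9", "10 to 14", "15 to 19") ~ "0 - 19",
  age_labels %in% c("20 to 24", "25 to 29", "30 to 34")         ~ "20 - 34",
  age_labels %in% c("35 to 39", "40 to 44", "45 to 49")         ~ "35 - 49",
  age_labels %in% c("50 to 54", "55 to 59", "60 to 64")         ~ "50 - 64",
  age_labels == "65+"                                           ~ ">= 65",
  # Anything not listed above (including NA) stays missing
  TRUE                                                          ~ NA_character_
)

age_groups
```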

Modifying State Names


# Transforming state names to appropriate length
ss_2008_census_data_tidied = trans_pop_data_clean_final %>% 
  
  # Modifying the state column
  # filter(`Age Category` == 'Total' & Total == 'Total') %>% # Filter out the state total population.
  mutate(State = case_when(
    State == "Northern Bahr el Ghazal" ~  "N. Bhar el G",
    State == "Western Bahr el Ghazal"  ~  "W. Bhar el G.",
    State == "Western Equatoria"       ~  "W. Equatoria",
    State == "Central Equatoria"       ~  "C. Equatoria",
    State == "Eastern Equatoria"       ~  "E. Equatoria",
    TRUE                               ~  State)) 
  
# Printing the data
ss_2008_census_data_tidied

# A tibble: 280 x 4
   State      Category `Age Category` Population
   <chr>      <chr>    <chr>               <dbl>
 1 Upper Nile Male     0 - 19              82690
 2 Upper Nile Male     0 - 19              83744
 3 Upper Nile Male     0 - 19              71027
 4 Upper Nile Male     0 - 19              57387
 5 Upper Nile Male     20 - 34             42521
 6 Upper Nile Male     20 - 34             38795
 7 Upper Nile Male     20 - 34             32236
 8 Upper Nile Male     35 - 49             30228
 9 Upper Nile Male     35 - 49             22290
10 Upper Nile Male     35 - 49             18163
# ... with 270 more rows

# Transforming the data further to display it by age category
ss_2008_census_data_tidied_cat <- ss_2008_census_data_tidied %>%
  
  # Filtering out the overall state population
  filter(`Age Category` != "Overall Population",
         Category != "Total") %>% 
  
  # Grouping and then summarizing the data
  group_by(State, Category, `Age Category`) %>% 
  summarize(Population = sum(Population)) %>% 
  ungroup() %>% 
  
  # Spreading the data with spread()
  spread(key = `Age Category`, 
         value = Population,
         convert = FALSE,
         drop = TRUE) %>% 
  
  # Adding the totals row
  add_row(State = "Category Totals", Category = " ",
          `>= 65` = sum(.$`>= 65`), `0 - 19` = sum(.$`0 - 19`),
          `20 - 34` = sum(.$`20 - 34`), `35 - 49` = sum(.$`35 - 49`),
          `50 - 64` = sum(.$`50 - 64`)) %>% 
  
  # Initiate a gt table
  gt() %>% 
  
  # Adding the title and subtitle
  tab_header(
    title = "South Sudan Population by Category", 
    subtitle = "Summary of South Sudan 2008 Population Census Results"
      ) %>%
  
  # Formatting column values: removing decimal points and applying a comma as the thousands separator
   fmt_number(
    columns = 3:7,
    decimals = 0, 
    use_seps = TRUE) %>%
  
  # Modifying background color; adjusting title and subtitle font sizes and adjusting the table width
  tab_options(heading.background.color = "black",
              column_labels.background.color = "grey",
              heading.title.font.size = 30,
              heading.subtitle.font.size = 15,
              table.width = "100%"
              ) 

# Printing the data
ss_2008_census_data_tidied_cat

South Sudan Population by Category
Summary of South Sudan 2008 Population Census Results

State            Category    >= 65     0 - 19    20 - 34    35 - 49  50 - 64
C. Equatoria     Female      8,596    283,092    139,942     66,745   23,460
C. Equatoria     Male       11,409    308,935    153,332     79,238   28,808
E. Equatoria     Female      8,637    243,642    111,079     57,120   20,496
E. Equatoria     Male       12,528    274,404     99,862     55,139   23,254
Jonglei          Female     12,384    329,048    164,193     87,198   31,452
Jonglei          Male       22,658    419,182    157,319     90,925   44,243
Lakes            Female      6,396    176,918     86,832     42,932   16,772
Lakes            Male       10,100    198,581     87,219     49,536   20,444
N. Bhar el G     Female     12,585    200,375     89,179     48,861   21,608
N. Bhar el G     Male       13,523    204,291     63,709     45,635   21,132
Unity            Female      7,801    163,798     66,837     33,267   13,851
Unity            Male        8,999    179,616     62,313     34,091   15,228
Upper Nile       Female     10,144    237,435    108,924     60,058   22,362
Upper Nile       Male       15,746    294,848    113,552     70,681   30,603
W. Bhar el G.    Female      3,527     83,151     41,467     20,767    7,479
W. Bhar el G.    Male        4,171     92,265     45,326     26,307    8,971
W. Equatoria     Female      7,369    148,059     83,592     45,314   16,252
W. Equatoria     Male       11,541    162,324     77,197     47,857   19,524
Warrap           Female     10,625    273,397    127,170     66,936   24,066
Warrap           Male       12,345    275,805     94,888     63,010   24,686
Category Totals            211,084  4,549,166  1,973,932  1,091,617  434,691
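
The reshaping step can also be written with pivot_wider(), which recent tidyr releases recommend over spread(). The sketch below assumes tidyr 1.0.0 or later and dplyr 1.0.0 or later; it uses a handful of figures from the table above, and the object names are illustrative.

```r
library(dplyr)
library(tidyr)
library(tibble)

# A small long-format extract taken from the table above
toy_long <- tribble(
  ~State,  ~Category, ~`Age Category`, ~Population,
  "Lakes", "Female",  "0 - 19",        176918,
  "Lakes", "Female",  ">= 65",         6396,
  "Lakes", "Male",    "0 - 19",        198581,
  "Lakes", "Male",    ">= 65",         10100,
  "Unity", "Female",  "0 - 19",        163798,
  "Unity", "Female",  ">= 65",         7801
)

# One column per age category, one row per state and gender
toy_wide <- toy_long %>%
  pivot_wider(names_from = `Age Category`, values_from = Population)

# Column totals, labelled so that they can be appended as a final row
toy_totals <- toy_wide %>%
  summarise(across(where(is.numeric), sum)) %>%
  mutate(State = "Category Totals", Category = " ")

toy_table <- bind_rows(toy_wide, toy_totals)

toy_table
```

Computing the totals with across() avoids naming every age column by hand, which helps if the age bands change.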

# Isolating state general populations
state_gen_pop <-  trans_pop_data_clean %>% 
  
  # Filtering to obtain state populations
  filter(Category == "Total" &
         `Age Category` == "Total") %>% 
  
  # Selecting state and population columns only
  select(1,4) %>%  
  
  # Arranging data in descending order
  arrange(desc(Population)) %>% 
  
  # Adding a total row to display South Sudan total population
  add_row(State = "Total Population", Population = sum(.$Population)) %>% 
  mutate(Percent =  (Population / 8260490) %>% scales::percent()) %>% 
  
  
  # Initiate a gt table
  gt() %>% 
  
  # Adding the title and subtitle
  tab_header(
    title = "South Sudan Population", 
    subtitle = "Summary of South Sudan 2008 Population Census Results"
      ) %>%
  
  # Formatting column values: removing decimal points and applying a comma as the thousands separator
   fmt_number(
    columns = vars(Population),
    decimals = 0, 
    use_seps = TRUE) %>%
  
  # Modifying background color; adjusting title and subtitle font sizes and adjusting the table width
  tab_options(heading.background.color = "black",
              column_labels.background.color = "grey",
              heading.title.font.size = 30,
              heading.subtitle.font.size = 15,
              table.width = "100%"
              ) %>% 
  
  # Applying color-coding to the total row
  # (cell_fill(), cell_text() and cells_body() replace cells_styles() and cells_data() from early gt releases)
  tab_style(
    style = list(
      cell_fill(color = "grey"),
      cell_text(weight = "bold", color = "white")),
    locations = cells_body(
      columns = c(State, Population, Percent),
      rows = State == "Total Population")) %>% 
  cols_align(
    columns = 2:3,
    align   = "center"
  )
  

                  
# Displaying the gt table
state_gen_pop

South Sudan Population
Summary of South Sudan 2008 Population Census Results

State                    Population  Percent
Jonglei                   1,358,602    16.4%
Central Equatoria         1,103,557    13.4%
Warrap                      972,928    11.8%
Upper Nile                  964,353    11.7%
Eastern Equatoria           906,161    11.0%
Northern Bahr el Ghazal     720,898     8.7%
Lakes                       695,730     8.4%
Western Equatoria           619,029     7.5%
Unity                       585,801     7.1%
Western Bahr el Ghazal      333,431     4.0%
Total Population          8,260,490   100.0%
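
The percentages above divide each state's population by the national total of 8,260,490. A minimal sketch of the same calculation that derives the total from the data instead of typing it in is shown below; the figures are those in the table above, and the object names are illustrative.

```r
library(dplyr)
library(tibble)

# State totals as reported in the table above
state_pop <- tribble(
  ~State,                    ~Population,
  "Jonglei",                 1358602,
  "Central Equatoria",       1103557,
  "Warrap",                  972928,
  "Upper Nile",              964353,
  "Eastern Equatoria",       906161,
  "Northern Bahr el Ghazal", 720898,
  "Lakes",                   695730,
  "Western Equatoria",       619029,
  "Unity",                   585801,
  "Western Bahr el Ghazal",  333431
)

# Shares computed before any total row is appended,
# so that sum(Population) is the national total
state_share <- state_pop %>%
  mutate(Share   = Population / sum(Population),
         Percent = scales::percent(Share, accuracy = 0.1))

state_share
```

Keeping the numeric Share column next to the formatted Percent column leaves a value that can still be sorted or plotted.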

# Tabulating state population by gender
gen_pop_gender =   trans_pop_data_clean %>% 
  
  # Filtering to isolate state total population by gender
  filter(`Age Category` == 'Total' &
         Category %in% c('Male', 'Female')) %>% 
  
  # Renaming a column
  rename(Gender = "Category") %>% 
  
 # Selecting desired columns
  select(-`Age Category`) %>% 
  
  # Spreading the data with spread()
  spread(key     = Gender, 
         value   = Population,
         convert = FALSE,
         drop    = TRUE) %>% 
  
  # Arranging data in descending order using the Male column
  arrange(desc(Male)) %>% 
  

  # Initiate a gt table
  gt() %>% 
  
  # Adding the title and subtitle
  tab_header(
    title = "South Sudan Population by Gender", 
    subtitle = "Summary of South Sudan 2008 Population\n Census Results by Gender"
      ) %>%
  
  # Formatting column values: removing decimal points and applying a comma as the thousands separator
   fmt_number(
    columns = 2:3,
    decimals = 0, 
    use_seps = TRUE) %>%
  
  # Modifying background color; adjusting title and subtitle font sizes and adjusting the table width
  tab_options(heading.background.color = "black",
              column_labels.background.color = "grey",
              heading.title.font.size = 30,
              heading.subtitle.font.size = 15,
              table.width = "100%"
              ) 


gen_pop_gender

South Sudan Population by Gender
Summary of South Sudan 2008 Population Census Results by Gender

State                     Female     Male
Jonglei                  624,275  734,327
Central Equatoria        521,835  581,722
Upper Nile               438,923  525,430
Warrap                   502,194  470,734
Eastern Equatoria        440,974  465,187
Lakes                    329,850  365,880
Northern Bahr el Ghazal  372,608  348,290
Western Equatoria        300,586  318,443
Unity                    285,554  300,247
Western Bahr el Ghazal   156,391  177,040
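
A quick consistency check after reshaping is to confirm that the female and male counts add up to the state totals reported earlier. The sketch below does this for three states using the figures from the two tables; the object names are illustrative.

```r
library(dplyr)
library(tibble)

# Gender counts from the table above
by_gender <- tribble(
  ~State,    ~Female, ~Male,
  "Jonglei", 624275,  734327,
  "Warrap",  502194,  470734,
  "Unity",   285554,  300247
)

# State totals from the state population table
state_totals <- tribble(
  ~State,    ~Population,
  "Jonglei", 1358602,
  "Warrap",  972928,
  "Unity",   585801
)

# A difference of zero means the two tables agree
gender_check <- by_gender %>%
  left_join(state_totals, by = "State") %>%
  mutate(Difference = Female + Male - Population)

gender_check
```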

Graphical summary of South Sudan 2008 Population Census Data

Here, we display a bar graph of South Sudan 2008 population census results using the ggplot2 geom_col() function.


# Plotting the state population data
state_g <-  trans_pop_data_clean %>% 
  
  # Filtering to obtain state populations
  filter(Category == "Total" &
         `Age Category` == "Total") %>% 
  
  # Selecting state and population columns only
  select(1,4) %>%  
  
  # Ordering states by population, largest first, so that the bars appear in descending order
  mutate(State = fct_reorder(State, Population, .desc = TRUE)) %>% 
 
   # Initializing the ggplot
  ggplot(aes(x = State, y = Population)) +
  geom_col(aes(fill = State)) + 
  
  # Formatting the graph
  theme_tq() +
  scale_fill_tq() +
  labs(title = "South Sudan 2008 Population Census Results by State",
       x = " ") + 
  
  # Formatting the population values
  scale_y_continuous(labels = scales::number) +
  expand_limits(y = c(0, 1500000)) +
  
  # Hiding the legend 
  theme(
    plot.title = element_text(hjust = 0.5),
    legend.position = "none",
    strip.text.x = element_text(margin = margin(5, 5, 5, 5, unit = "pt"))
  )

# Printing an interactive graph
ggplotly(state_g) 
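
ggplot2 draws a discrete axis in the order of the factor levels, not in the order of the rows, so sorting a data frame does not by itself reorder the bars. A minimal sketch with three states from the tables above shows how forcats::fct_reorder() sets the level order; the object names are illustrative.

```r
library(forcats)

states     <- c("Unity", "Jonglei", "Warrap")
population <- c(585801, 1358602, 972928)

# Default factor levels are alphabetical
alphabetical <- levels(factor(states))

# Levels ordered by population, largest first
by_population <- levels(fct_reorder(states, population, .desc = TRUE))

alphabetical
by_population
```

Mapping the reordered factor to the x aesthetic would then place Jonglei first, followed by Warrap and Unity.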

# Plotting a stacked graph 
gender_g_stacked <-  trans_pop_data_clean %>% 
  
  # Filtering to isolate state total population by gender
  filter(`Age Category` == 'Total' &
         Category %in% c('Male', 'Female')) %>% 
  
  # Renaming a column
  rename(Gender = "Category") %>% 
  
  # Selecting desired columns
  select(-`Age Category`) %>% 
  
 ggplot(aes(x = State, y = Population)) +
 geom_col(aes(fill = Gender)) +

 # Formatting the graph
 theme_tq() +
 scale_fill_tq() +
 labs(title = "South Sudan 2008 Population Census Results by State and Gender",
     x = " ") +
  
 # Hiding the legend
 theme(
   plot.title = element_text(hjust = 0.5),
       legend.position = "none"
       ) +
 scale_y_continuous(labels = scales::number) +
 expand_limits(y = c(0, 1500000)) 


ggplotly(gender_g_stacked)

Re-organizing 2008 Population Census Data by Three Original Regions of South Sudan


# Grouping population into the three original regions: Greater Bahr el Ghazal, Greater Equatoria and Greater Upper Nile
state_original_regions <- trans_pop_data_clean %>% 
  
  # Filtering to obtain state populations
  filter(Category == "Total" &
         `Age Category` == "Total") %>% 
  
  # Selecting state and population columns only
  select(1,4) %>% 
  
  # Transforming states into the three original regions
  mutate(Region = case_when(
    State %in% c("Upper Nile", "Jonglei", "Unity")                              ~ "Greater Upper Nile",
    State %in% c("Western Equatoria", "Central Equatoria", "Eastern Equatoria") ~ "Greater Equatoria",
    TRUE ~ "Greater Bahr el Ghazal"
  )) %>% 
  
  # Grouping population by regions
  group_by(Region) %>% 
  
  # Summarizing the population
  summarize(Population_New = sum(Population)) %>% 
  ungroup() %>% 
  
  # Computing proportions (kept numeric so that bar heights reflect the values)
  mutate(Proportion = Population_New / sum(Population_New)) %>% 
  arrange(desc(Population_New)) %>% 
  
 ggplot(aes(x = Region, y = Proportion)) +
  geom_col(aes(fill = Region)) +
  # Formatting the proportions as percentages on the y-axis
  scale_y_continuous(labels = scales::percent) +
  theme_tq() +
  
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(
    title = "Population by Original Three Regions of South Sudan",
    subtitle = 
                "Summary of South Sudan 2008 Population Census Data by Three Legacy Regions",
    x = " "
  )


state_original_regions
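
The regional totals behind this chart can be checked by hand from the state table. The sketch below repeats the grouping on the ten state totals reported earlier and keeps the proportion numeric, adding a separate text label for display; the Label column and the object names are illustrative additions.

```r
library(dplyr)
library(tibble)

# State totals as reported in the state population table
state_pop <- tribble(
  ~State,                    ~Population,
  "Jonglei",                 1358602,
  "Central Equatoria",       1103557,
  "Warrap",                  972928,
  "Upper Nile",              964353,
  "Eastern Equatoria",       906161,
  "Northern Bahr el Ghazal", 720898,
  "Lakes",                   695730,
  "Western Equatoria",       619029,
  "Unity",                   585801,
  "Western Bahr el Ghazal",  333431
)

region_share <- state_pop %>%
  # Assigning each state to one of the three legacy regions
  mutate(Region = case_when(
    State %in% c("Upper Nile", "Jonglei", "Unity") ~ "Greater Upper Nile",
    State %in% c("Western Equatoria", "Central Equatoria",
                 "Eastern Equatoria")              ~ "Greater Equatoria",
    TRUE                                           ~ "Greater Bahr el Ghazal"
  )) %>%
  group_by(Region) %>%
  summarise(Population = sum(Population)) %>%
  # Numeric proportion for plotting, text label for display only
  mutate(Proportion = Population / sum(Population),
         Label      = scales::percent(Proportion, accuracy = 1))

region_share
```

On these figures the three regions hold roughly a third of the population each, with Greater Upper Nile the largest at about 35%.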

Graphing Population by Age Category


ss_pop_age_cat <-  trans_pop_data_clean %>%
  
    # Filtering out totals
  filter(Category != "Total",
         `Age Category` != "Total") %>% 

  # Transforming the age category column with case_when()
  mutate(`Age Category` = case_when(
     `Age Category` %in%  c("0 to 4", "5 to 9", "10 to 14", "15 to 19") ~  "0 - 19",
     `Age Category` %in%  c( "20 to 24", "25 to 29", "30 to 34" )       ~ "20 - 34",
     `Age Category` %in%  c("35 to 39", "40 to 44", "45 to 49")         ~ "35 - 49",
     `Age Category` %in%  c( "50 to 54", "55 to 59", "60 to 64")        ~ "50 - 64",
     TRUE                                                               ~   ">= 65"
  )
) %>% 
  
  # Grouping and then summarizing the data
  group_by(State, `Age Category`) %>% 
  summarize(Population = sum(Population)) %>% 
  ungroup() %>% 
  
  # Reordering data
   mutate(State = State %>% as_factor() %>% fct_rev()) %>% 
  
  # Initializing the ggplot
  ggplot(aes(x = State, y = Population)) +
  geom_col(aes(fill = State)) + 
  
  # Applying a facet_wrap()
  facet_wrap(~ `Age Category`, scales = "free_x") +
  
  coord_flip() +
  # Formatting the graph
  theme_tq() +
  scale_fill_tq() +
  labs(title = "South Sudan 2008 Population Census Results by State and Age Category",
       x = " ") + 
  
  # Formatting the population values
  scale_y_continuous(labels = scales::number) +
  expand_limits(y = 0) +
  
  # Hiding the legend 
  theme(
    legend.position = "none",
    strip.text.x = element_text(margin = margin(5, 5, 5, 5, unit = "pt")),
    axis.text.x = element_text(angle = 30, hjust = 1)
  )

ss_pop_age_cat

Conclusion

In this article we have demonstrated how to perform data wrangling with R. Specifically, we performed three tasks: (i) cleaned the data to make it ready for analysis; (ii) displayed the population data in gt tables; and (iii) produced graphical summaries of the 2008 population census data with the ggplot2 package. Overall, we used the following packages in this article: tidyverse, tidyquant, skimr, plotly, readxl, and gt.

Citation

For attribution, please cite this work as

Reng (2019, March 24). Reng Data Science Institute: South Sudan Population in Pictures!. Retrieved from https://www.rengdatascience.io/posts/2019-03-24-analysis-of-south-sudan-population/

BibTeX citation

@misc{reng2019south,
  author = {Reng, Alier Ëë},
  title = {Reng Data Science Institute: South Sudan Population in Pictures!},
  url = {https://www.rengdatascience.io/posts/2019-03-24-analysis-of-south-sudan-population/},
  year = {2019}
}