Livestock Pattern Analysis

Overview

We are going to analyze the Livestock data downloaded from Food and Agriculture Organization of the United Nations website. Below are the questions we are going to answer by using our data.

  1. What Items or Livestock has the highest density for the most recent year by:
    • Overall (World wide)
    • Region
  2. What is the trend of total Livestock density for:
    • Overall
    • Region
  3. Identify if there is a significant difference in Livestock Density for each Region.
  4. Determine the Top 10 country with the Highest Livestock Density.
  5. Create a Time Series Model for:
    • The top 3 Livestock Product of the Top 1 Country by Livestock Density for each Region

Data Import and Pre-Process

library(tidyverse)
library(data.table)
library(countrycode)
library(FSA)
library(forecast)

livestock_data = fread("Livestock_data.csv") %>% as_tibble()
#Check Dimension of Data
glimpse(livestock_data)
Rows: 8,180
Columns: 70
$ `Area Code`       <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
$ `Area Code (M49)` <chr> "'004", "'004", "'004", "'004", "'004", "'004", "'00…
$ Area              <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghan…
$ `Item Code`       <int> 1107, 1107, 1107, 1126, 1126, 1126, 866, 866, 866, 1…
$ `Item Code (CPC)` <chr> "'02132", "'02132", "'02132", "'02121.01", "'02121.0…
$ Item              <chr> "Asses", "Asses", "Asses", "Camels", "Camels", "Came…
$ `Element Code`    <int> 7213, 7211, 5118, 7213, 7211, 5118, 7213, 7211, 5118…
$ Element           <chr> "Livestock units per agricultural land area", "Share…
$ Unit              <chr> "LSU/ha", "%LSU", "LSU", "LSU/ha", "%LSU", "LSU", "L…
$ Y1961             <dbl> 0.02, 12.36, 650000.00, 0.00, 3.57, 187500.00, 0.05,…
$ Y1962             <dbl> 0.01, 8.10, 425925.00, 0.01, 3.99, 209932.50, 0.06, …
$ Y1963             <dbl> 0.01, 9.15, 500556.00, 0.01, 4.79, 262053.75, 0.06, …
$ Y1964             <dbl> 0.02, 10.27, 575000.00, 0.01, 4.35, 243750.00, 0.06,…
$ Y1965             <dbl> 2.000e-02, 1.134e+01, 6.500e+05, 1.000e-02, 3.920e+0…
$ Y1966             <dbl> 2.000e-02, 1.016e+01, 6.000e+05, 1.000e-02, 3.810e+0…
$ Y1967             <dbl> 2.000e-02, 1.014e+01, 6.000e+05, 1.000e-02, 3.800e+0…
$ Y1968             <dbl> 0.02, 10.84, 664000.00, 0.01, 3.66, 224250.00, 0.07,…
$ Y1969             <dbl> 2.000e-02, 1.033e+01, 6.250e+05, 1.000e-02, 3.720e+0…
$ Y1970             <dbl> 2.000e-02, 1.051e+01, 6.500e+05, 1.000e-02, 3.640e+0…
$ Y1971             <dbl> 2.000e-02, 1.057e+01, 6.500e+05, 1.000e-02, 3.660e+0…
$ Y1972             <dbl> 2.000e-02, 1.309e+01, 6.500e+05, 1.000e-02, 4.530e+0…
$ Y1973             <dbl> 0.02, 12.06, 625000.00, 0.01, 4.34, 225000.00, 0.06,…
$ Y1974             <dbl> 0.02, 11.26, 625000.00, 0.01, 4.05, 225000.00, 0.06,…
$ Y1975             <dbl> 0.02, 10.48, 625000.00, 0.01, 3.77, 225000.00, 0.07,…
$ Y1976             <dbl> 2.000e-02, 1.027e+01, 6.250e+05, 1.000e-02, 3.700e+0…
$ Y1977             <dbl> 0.02, 10.84, 650000.00, 0.01, 3.75, 225000.00, 0.07,…
$ Y1978             <dbl> 0.02, 10.95, 650000.00, 0.01, 3.79, 225000.00, 0.07,…
$ Y1979             <dbl> 2.000e-02, 1.117e+01, 6.500e+05, 1.000e-02, 3.480e+0…
$ Y1980             <dbl> 0.02, 11.08, 647500.00, 0.01, 3.40, 198750.00, 0.07,…
$ Y1981             <dbl> 0.02, 11.13, 657500.00, 0.01, 3.36, 198750.00, 0.07,…
$ Y1982             <dbl> 0.02, 11.14, 657500.00, 0.01, 3.37, 198750.00, 0.07,…
$ Y1983             <dbl> 0.02, 11.93, 657500.00, 0.01, 3.60, 198750.00, 0.06,…
$ Y1984             <dbl> 0.02, 13.91, 659000.00, 0.01, 4.20, 198750.00, 0.05,…
$ Y1985             <dbl> 0.02, 15.60, 660500.00, 0.01, 4.69, 198750.00, 0.04,…
$ Y1986             <dbl> 0.02, 19.31, 662500.00, 0.01, 5.79, 198750.00, 0.03,…
$ Y1987             <dbl> 2.000e-02, 1.777e+01, 6.500e+05, 1.000e-02, 6.150e+0…
$ Y1988             <dbl> 0.01, 13.37, 500000.00, 0.01, 5.32, 198750.00, 0.03,…
$ Y1989             <dbl> 0.01, 10.98, 400000.00, 0.00, 4.84, 176250.00, 0.03,…
$ Y1990             <dbl> 0.01, 8.43, 300000.00, 0.00, 4.53, 161250.00, 0.03, …
$ Y1991             <dbl> 1.000e-02, 7.080e+00, 2.500e+05, 0.000e+00, 4.250e+0…
$ Y1992             <dbl> 1.000e-02, 7.060e+00, 2.500e+05, 0.000e+00, 4.230e+0…
$ Y1993             <dbl> 1.000e-02, 7.030e+00, 2.500e+05, 0.000e+00, 4.220e+0…
$ Y1994             <dbl> 1.000e-02, 8.170e+00, 3.000e+05, 0.000e+00, 4.090e+0…
$ Y1995             <dbl> 0.01, 9.06, 352000.00, 0.00, 3.88, 150750.00, 0.04, …
$ Y1996             <dbl> 0.01, 8.59, 376500.00, 0.00, 3.78, 165750.00, 0.05, …
$ Y1997             <dbl> 0.01, 8.37, 402500.00, 0.00, 3.77, 181500.00, 0.05, …
$ Y1998             <dbl> 0.01, 8.34, 430000.00, 0.01, 3.86, 198750.00, 0.06, …
$ Y1999             <dbl> 0.01, 8.00, 459970.00, 0.01, 3.79, 217788.00, 0.06, …
$ Y2000             <dbl> 1.000e-02, 6.930e+00, 3.410e+05, 0.000e+00, 3.410e+0…
$ Y2001             <dbl> 0.01, 8.26, 341000.00, 0.00, 4.07, 168000.00, 0.04, …
$ Y2002             <dbl> 0.02, 14.91, 794000.00, 0.00, 2.47, 131250.00, 0.07,…
$ Y2003             <dbl> 0.02, 14.63, 799000.00, 0.00, 2.49, 135750.00, 0.07,…
$ Y2004             <dbl> 0.02, 14.99, 807000.00, 0.00, 2.65, 142500.00, 0.06,…
$ Y2005             <dbl> 0.02, 12.79, 695500.00, 0.00, 2.59, 141000.00, 0.07,…
$ Y2006             <dbl> 2.000e-02, 1.126e+01, 6.075e+05, 0.000e+00, 2.420e+0…
$ Y2007             <dbl> 0.02, 13.54, 736000.00, 0.00, 2.57, 139500.00, 0.08,…
$ Y2008             <dbl> 0.02, 10.14, 604500.00, 0.00, 2.30, 137250.00, 0.09,…
$ Y2009             <dbl> 0.02, 10.83, 661000.00, 0.00, 2.33, 142500.00, 0.09,…
$ Y2010             <dbl> 0.02, 9.97, 702500.00, 0.00, 2.03, 143250.00, 0.10, …
$ Y2011             <dbl> 0.02, 10.27, 733000.00, 0.00, 1.81, 129000.00, 0.10,…
$ Y2012             <dbl> 0.02, 10.40, 711500.00, 0.00, 1.91, 130500.00, 0.10,…
$ Y2013             <dbl> 0.02, 10.77, 725500.00, 0.00, 1.89, 127500.00, 0.10,…
$ Y2014             <dbl> 0.02, 10.53, 720500.00, 0.00, 1.87, 128250.00, 0.10,…
$ Y2015             <dbl> 0.02, 10.81, 740500.00, 0.00, 1.86, 127500.00, 0.10,…
$ Y2016             <dbl> 0.02, 10.82, 736050.00, 0.00, 1.88, 127875.00, 0.10,…
$ Y2017             <dbl> 0.02, 9.92, 658500.00, 0.00, 1.94, 129000.00, 0.09, …
$ Y2018             <dbl> 0.02, 10.01, 682038.97, 0.00, 1.89, 129068.99, 0.09,…
$ Y2019             <dbl> 0.02, 11.44, 781841.65, 0.00, 1.87, 127661.06, 0.09,…
$ Y2020             <dbl> 0.02, 11.37, 771299.07, 0.00, 1.88, 127474.37, 0.09,…
$ Y2021             <dbl> 0.02, 11.47, 780882.17, 0.00, 1.87, 126964.23, 0.09,…
#8180 rows and 70 Columns
Note

Data Dictionary

  1. Area Code = Code for the Country
  2. Area Code (M49) = Standard Country or Area Code for Statistical Use
  3. Area = Country Name
  4. Item Code = Code for Livestock Product/Item
  5. Item Code (CPC)= Customs Procedure Code to determine if product is for Import or Export
  6. Item = Name of livestock product
  7. Element = Unit of measurement for Livestock
  8. Element Code = Numeric code for Livestock Unit of Measurement
  9. Unit = Shorthand notation or symbol of Element Column
  10. Y<9999> = Year

We need to add a region column for us to answer some of the questions on our list above.

livestock_data = livestock_data %>% 
  mutate(
    Region = countrycode(
      sourcevar = Area,
      origin = "country.name",
      destination = "continent"
    )
  ) %>% 
  filter(
    !is.na(Region)
  )
Warning: There were 737 warnings in `mutate()`.
The first warning was:
ℹ In argument: `Region = countrycode(sourcevar = Area, origin = "country.name",
  destination = "continent")`.
Caused by warning in `grepl()`:
! input string 1 is invalid UTF-8
ℹ Run `dplyr::last_dplyr_warnings()` to see the 736 remaining warnings.
head(livestock_data)
# A tibble: 6 × 71
  `Area Code` `Area Code (M49)` Area        `Item Code` `Item Code (CPC)` Item  
        <int> <chr>             <chr>             <int> <chr>             <chr> 
1           2 '004              Afghanistan        1107 '02132            Asses 
2           2 '004              Afghanistan        1107 '02132            Asses 
3           2 '004              Afghanistan        1107 '02132            Asses 
4           2 '004              Afghanistan        1126 '02121.01         Camels
5           2 '004              Afghanistan        1126 '02121.01         Camels
6           2 '004              Afghanistan        1126 '02121.01         Camels
# ℹ 65 more variables: `Element Code` <int>, Element <chr>, Unit <chr>,
#   Y1961 <dbl>, Y1962 <dbl>, Y1963 <dbl>, Y1964 <dbl>, Y1965 <dbl>,
#   Y1966 <dbl>, Y1967 <dbl>, Y1968 <dbl>, Y1969 <dbl>, Y1970 <dbl>,
#   Y1971 <dbl>, Y1972 <dbl>, Y1973 <dbl>, Y1974 <dbl>, Y1975 <dbl>,
#   Y1976 <dbl>, Y1977 <dbl>, Y1978 <dbl>, Y1979 <dbl>, Y1980 <dbl>,
#   Y1981 <dbl>, Y1982 <dbl>, Y1983 <dbl>, Y1984 <dbl>, Y1985 <dbl>,
#   Y1986 <dbl>, Y1987 <dbl>, Y1988 <dbl>, Y1989 <dbl>, Y1990 <dbl>, …

Now that we have our Region/Continent column we can start answering our questions from the list above

Questions

Highest Livestock Density

  1. Overall
livestock_data %>% 
  filter(Element == "Livestock units per agricultural land area") %>% 
  group_by(Item) %>% 
  reframe(
    Livestock_Density = sum(Y2021, na.rm = TRUE)
  ) %>% 
  arrange(desc(Livestock_Density)) %>% 
  top_n(10) %>% 
  mutate(
    Item = reorder(Item, Livestock_Density)
  ) %>% 
  ggplot(aes(x = Item, y = Livestock_Density)) +
  geom_col()+
  coord_flip()

  1. Region
livestock_data %>% 
  filter(Element == "Livestock units per agricultural land area") %>% 
  group_by(Region, Item) %>% 
  reframe(
    Livestock_Density = sum(Y2021, na.rm = TRUE)
  ) %>% 
  arrange(desc(Livestock_Density)) %>% 
  #top_n(10) %>% 
  mutate(
    Region = reorder(Region, Livestock_Density)
  ) %>% 
  ggplot(aes(x = Region, y = Livestock_Density, fill = Item)) +
  geom_col()+
  coord_flip()

Livestock Density Trend

livestock_data %>% 
  filter(Element == "Livestock units per agricultural land area") %>% 
  pivot_longer(-c(1:9,71),names_to = "Year", values_to = "Livestock_Density") %>% 
  group_by(Year) %>% 
  reframe(
    Livestock_Density = sum(Livestock_Density,na.rm = TRUE)
  ) %>% 
  mutate(
    Year = str_remove(Year, "Y"),
    Year = paste(Year,"01","01",sep = "-"),
    Year = as.Date(Year)
  ) %>% 
  ggplot(aes(x = Year, Livestock_Density,group = 1)) +
  geom_point()+
  geom_line()

Interesting. There is a sharp dip of Total Livestock Density in Worldwide Level. Let’s investigate this further

livestock_data %>% 
  filter(Element == "Livestock units per agricultural land area") %>% 
  pivot_longer(-c(1:9,71),names_to = "Year", values_to = "Livestock_Density") %>% 
  group_by(Year) %>% 
  reframe(
    Livestock_Density = sum(Livestock_Density,na.rm = TRUE)
  ) %>% 
  mutate(
    Year = str_remove(Year, "Y"),
    Year = paste(Year,"01","01",sep = "-"),
    Year = as.Date(Year)
  ) %>% 
  slice(which.min(Livestock_Density))
# A tibble: 1 × 2
  Year       Livestock_Density
  <date>                 <dbl>
1 1961-01-01              248.

The year where we saw the sharp dip is 1961. Let’s see what happen on this year by splicing the Livestock computation by region. Maybe we can see a region dropping its Livestock, or maybe we can see that all region will have a sharp dip.

livestock_data %>% 
  filter(Element == "Livestock units per agricultural land area") %>% 
  pivot_longer(-c(1:9,71),names_to = "Year", values_to = "Livestock_Density") %>% 
  group_by(Region,Year) %>% 
  reframe(
    Livestock_Density = sum(Livestock_Density,na.rm = TRUE)
  ) %>% 
  mutate(
    Year = str_remove(Year, "Y"),
    Year = paste(Year,"01","01",sep = "-"),
    Year = as.Date(Year)
  ) %>% 
  ggplot(aes(x = Year, Livestock_Density,group = Region, color = Region)) +
  geom_point()+
  geom_line()

Asia seems to be the driver of that sharp dip we saw from the world level. Lets try to remove Asia from the graph. Looking at the visual, there seems to be a drop on other continent as well. The scale of Asia’s Livestock density is making the trend on other continents seem small.

livestock_data %>% 
  filter(Element == "Livestock units per agricultural land area") %>% 
  pivot_longer(-c(1:9,71),names_to = "Year", values_to = "Livestock_Density") %>% 
  filter(Region != "Asia") %>% 
  group_by(Region,Year) %>% 
  reframe(
    Livestock_Density = sum(Livestock_Density,na.rm = TRUE)
  ) %>% 
  mutate(
    Year = str_remove(Year, "Y"),
    Year = paste(Year,"01","01",sep = "-"),
    Year = as.Date(Year)
  ) %>% 
  ggplot(aes(x = Year, Livestock_Density,group = Region, color = Region)) +
  geom_point()+
  geom_line()

Looks like Asia is really the one causing the dip. Let’s try to look at this one level lower.

livestock_data %>% 
  filter(Element == "Livestock units per agricultural land area") %>% 
  pivot_longer(-c(1:9,71),names_to = "Year", values_to = "Livestock_Density") %>% 
  filter(Region == "Asia") %>% 
  group_by(Area,Year) %>% 
  reframe(
    Livestock_Density = sum(Livestock_Density,na.rm = TRUE)
  ) %>% 
  mutate(
    Year = str_remove(Year, "Y"),
    Year = paste(Year,"01","01",sep = "-"),
    Year = as.Date(Year)
  ) %>% 
  ggplot(aes(x = Year, Livestock_Density,group = Area, color = Area)) +
  geom_point(show.legend = FALSE)+
  geom_line(show.legend = FALSE)

It is hard to identify what country is displaying the sharp dip for 1991. Lets do it using dplyr

dip_country = livestock_data %>% 
  filter(Element == "Livestock units per agricultural land area") %>% 
  pivot_longer(-c(1:9,71),names_to = "Year", values_to = "Livestock_Density") %>% 
  filter(Region == "Asia") %>% 
  group_by(Area,Year) %>% 
  reframe(
    Livestock_Density = sum(Livestock_Density,na.rm = TRUE)
  ) %>% 
  mutate(
    Year = str_remove(Year, "Y"),
    Year = paste(Year,"01","01",sep = "-"),
    Year = as.Date(Year)
  ) 

dip_country %>% 
  mutate(
    LagDiff = c(0,diff(dip_country$Livestock_Density))
  ) %>% 
  filter(LagDiff < 0) %>% 
  filter(LagDiff < -15)
# A tibble: 5 × 4
  Area                      Year       Livestock_Density LagDiff
  <chr>                     <date>                 <dbl>   <dbl>
1 Cambodia                  1961-01-01              1.23   -31.5
2 China, Taiwan Province of 1961-01-01              2.98   -34.2
3 Singapore                 1988-01-01             69.2    -29.2
4 Singapore                 1991-01-01             34.1    -62.3
5 Sri Lanka                 1961-01-01              2.14  -117. 

We created a new column called LagDiff which is the result of calculating the difference of the current Livestock Density and its previous value. That way, we can see what year and country have the highest negative LagDiff that will indicate a sharp dip.

Looking at the result, we see that its probably Singapore. Since Cambodia, China and Sri Lanka’s value are all from 1961 which is obviously not the year of Interest for us.

livestock_data %>% 
  filter(Element == "Livestock units per agricultural land area") %>% 
  pivot_longer(-c(1:9,71),names_to = "Year", values_to = "Livestock_Density") %>% 
  filter(Region == "Asia") %>% 
  filter(Area == "Singapore") %>% 
  group_by(Area,Year) %>% 
  reframe(
    Livestock_Density = sum(Livestock_Density,na.rm = TRUE)
  ) %>% 
  mutate(
    Year = str_remove(Year, "Y"),
    Year = paste(Year,"01","01",sep = "-"),
    Year = as.Date(Year)
  ) %>% 
  ggplot(aes(x = Year, Livestock_Density,group = Area, color = Area)) +
  geom_point(show.legend = FALSE)+
  geom_line(show.legend = FALSE)

Singapore is the country influencing the dip at year 1991. Doing a bit of research about this sharp drop, it turns out that the cause of this drop is due to Mt. Pinatubo’s volcanic eruption in the Philippines. Lets add the Philippines on our chart

livestock_data %>% 
  filter(Element == "Livestock units per agricultural land area") %>% 
  pivot_longer(-c(1:9,71),names_to = "Year", values_to = "Livestock_Density") %>% 
  filter(Region == "Asia") %>% 
  filter(Area %in% c("Singapore","Philippines") ) %>% 
  group_by(Area,Year) %>% 
  reframe(
    Livestock_Density = sum(Livestock_Density,na.rm = TRUE)
  ) %>% 
  mutate(
    Year = str_remove(Year, "Y"),
    Year = paste(Year,"01","01",sep = "-"),
    Year = as.Date(Year)
  ) %>% 
  ggplot(aes(x = Year, Livestock_Density,group = Area, color = Area)) +
  geom_point()+
  geom_line()

Singapore’s Livestock Density value is too high for us to see the variance in Ph’s data.

livestock_data %>% 
  filter(Element == "Livestock units per agricultural land area") %>% 
  pivot_longer(-c(1:9,71),names_to = "Year", values_to = "Livestock_Density") %>% 
  filter(Region == "Asia") %>% 
  filter(Area %in% c("Philippines") ) %>% 
  group_by(Area,Year) %>% 
  reframe(
    Livestock_Density = sum(Livestock_Density,na.rm = TRUE)
  ) %>% 
  mutate(
    Year = str_remove(Year, "Y"),
    Year = paste(Year,"01","01",sep = "-"),
    Year = as.Date(Year)
  ) %>% 
  ggplot(aes(x = Year, Livestock_Density,group = Area, color = Area)) +
  geom_point()+
  geom_line()

Looks like even before that year, the livestock density for Ph is already low, that’s why the decrease in livestock when Mt. Pinatubo erupted does not really cause much of change in Ph’s Livestock Density.

Region Significant Difference

livestock_by_region = livestock_data %>% 
  group_by(Region) %>% 
  reframe(
    Livestock_Density = sum(Y2021, na.rm = TRUE)
  )
livestock_data %>% 
  filter(Element == "Livestock units per agricultural land area") %>% 
  ggplot(aes(x = Region, y = Y2021)) +
  geom_boxplot()

Before we can use ANOVA check if there is a significant difference on Livestock Density for each Region, we need to determine first if the data on each group is normal.

livestock_data %>% 
  filter(Element == "Livestock units per agricultural land area") %>% 
  filter(Region == "Asia") %>% 
  filter(!is.na(Y2021)) %>% 
  pull(Y2021) %>% 
  shapiro.test()

    Shapiro-Wilk normality test

data:  .
W = 0.11448, p-value < 2.2e-16

Asia’s value for Livestock Density is not normal since we have a p.value of less than 0.05

livestock_data %>% 
  filter(Element == "Livestock units per agricultural land area") %>% 
  filter(Region == "Africa") %>% 
  filter(!is.na(Y2021)) %>% 
  pull(Y2021) %>% 
  shapiro.test()

    Shapiro-Wilk normality test

data:  .
W = 0.502, p-value < 2.2e-16

Same for Africa.

livestock_data %>% 
  filter(Element == "Livestock units per agricultural land area") %>% 
  filter(Region == "Europe") %>% 
  filter(!is.na(Y2021)) %>% 
  pull(Y2021) %>% 
  shapiro.test()

    Shapiro-Wilk normality test

data:  .
W = 0.50173, p-value < 2.2e-16

Europe is also not normal. At this point, we dont need to check the other 2 Region, since as pre ANOVA’s requirement, all group should come from a normal distribution. We can still determine if there is a significant difference among the groups by using a Non-Parametric approach. This approach will not require our data to come from a normal distribution.

livestock_region_data = livestock_data %>% 
  filter(Element == "Livestock units per agricultural land area") %>% 
  filter(!is.na(Y2021)) %>% 
  select(Region, Y2021)

livestock_kruskal = kruskal.test(Y2021 ~ Region, data = livestock_region_data )
livestock_kruskal

    Kruskal-Wallis rank sum test

data:  Y2021 by Region
Kruskal-Wallis chi-squared = 35.535, df = 4, p-value = 3.606e-07

We have a p.value of less than 0.05 indicating that we have a significant difference in Livestock Density for each Region. But let us take the test further and identify which among these groups have a significant difference among each other. We will use Dunn’s test for this task.

dunn_result = dunnTest(Y2021 ~ Region, data = livestock_region_data)

#Let's filter out those comparison test with an adjusted p.value of greater than 0.05


dunn_result$res %>% 
  filter(P.adj < 0.05)
         Comparison         Z      P.unadj        P.adj
1 Africa - Americas -4.484740 7.300273e-06 6.570246e-05
2     Africa - Asia -4.106799 4.011807e-05 3.209446e-04
3   Africa - Europe -3.235075 1.216108e-03 8.512756e-03
4  Africa - Oceania -4.631136 3.636652e-06 3.636652e-05

Checking the result from Dunn’s test, it seems that Africa have a significant difference on Livestock Desnity with all other Region in our data set.

Top 10 Country with Highest Livestock Density

We can easily extract this information with a simple code using dplyr.

livestock_data %>% 
  filter(Element == "Livestock units per agricultural land area") %>% 
  select(Area, Y2021) %>% 
  filter(!is.na(Y2021)) %>% 
  group_by(Area) %>% 
  reframe(
    Livestock_Density = sum(Y2021, na.rm = TRUE)
  ) %>% 
  arrange(desc(Livestock_Density)) %>% 
  top_n(10)
# A tibble: 10 × 2
   Area                             Livestock_Density
   <chr>                                        <dbl>
 1 Singapore                                   119.  
 2 China, Hong Kong SAR                         37.2 
 3 Brunei Darussalam                            32.7 
 4 Trinidad and Tobago                          16.4 
 5 Barbados                                     13.1 
 6 Micronesia (Federated States of)             10.8 
 7 Republic of Korea                            10.7 
 8 Kuwait                                       10.2 
 9 Qatar                                        10.2 
10 Netherlands (Kingdom of the)                  9.95

We can also check the top 10 for each Region and visualize it for better view.

livestock_data %>% 
  filter(Element == "Livestock units per agricultural land area") %>% 
  filter(!is.na(Y2021)) %>% 
  group_by(Region, Area) %>% 
  summarise(
    Livestock_Density = sum(Y2021, na.rm = TRUE)
  ) %>% 
  top_n(10) %>% 
  ungroup() %>% 
  mutate(
    Area = reorder(Area, Livestock_Density)
  ) %>% 
  ggplot(aes(x = Area, y = Livestock_Density, fill = Region)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~Region, scales = "free") +
  scale_x_discrete(label = function(x) abbreviate(x, minlength = 7)) +
  coord_flip()

livestock_data %>% 
  filter(Element == "Livestock units per agricultural land area") %>% 
  filter(!is.na(Y2021)) %>% 
  group_by(Region, Area) %>% 
  summarise(
    Livestock_Density = sum(Y2021, na.rm = TRUE)
  ) %>% 
  top_n(10) %>% 
  filter(Region == "Africa") %>% 
  arrange(desc(Livestock_Density))
`summarise()` has grouped output by 'Region'. You can override using the
`.groups` argument.
Selecting by Livestock_Density
# A tibble: 10 × 3
# Groups:   Region [1]
   Region Area                     Livestock_Density
   <chr>  <chr>                                <dbl>
 1 Africa Seychelles                            4.3 
 2 Africa Mauritius                             4.13
 3 Africa Ethiopia                              3.68
 4 Africa Egypt                                 3.67
 5 Africa Burkina Faso                          2.13
 6 Africa Kenya                                 2.13
 7 Africa Guinea-Bissau                         2.1 
 8 Africa Uganda                                2.04
 9 Africa Cabo Verde                            1.98
10 Africa Central African Republic              1.97

Time Series Modelling

Let’s start by creating a time series model for Africa’s highest Livestock Density Country, Seychelles

#Let's identify first the top 3 Livestock Product for Seychelles
livestock_data %>% 
  filter(Region == "Africa") %>% 
  filter(Element == "Livestock units per agricultural land area") %>% 
  filter(Area == "Seychelles") %>% 
  group_by(Item) %>% 
  reframe(
    Livestock_Density = sum(Y2021, na.rm = TRUE)
  ) %>% 
  arrange(desc(Livestock_Density))
# A tibble: 7 × 2
  Item                  Livestock_Density
  <chr>                             <dbl>
1 Major livestock types              1.92
2 Chickens                           0.84
3 Swine / pigs                       0.62
4 Goats                              0.37
5 Sheep and Goats                    0.37
6 Cattle                             0.09
7 Cattle and Buffaloes               0.09
#Top 3 Livestock Item for Seychelles is Major Livestock Types, Chickens and Swine/Pigs
#Lets Filter the Major Livestock Types first
africa_1 = livestock_data %>% 
  filter(Region == "Africa") %>% 
  filter(Element == "Livestock units per agricultural land area") %>% 
  filter(Area == "Seychelles") %>% 
  select(Area,Item,c(10:70)) %>% 
  pivot_longer(-c(Area,Item),names_to = "Year", values_to = "Livestock_Density") %>% 
   mutate(
    Year = str_remove(Year, "Y"),
    Year = paste(Year,"01","01",sep = "-"),
    Year = as.Date(Year)
  ) %>% 
  group_by(Area,Item, Year) %>% 
  reframe(
    Livestock_Density = sum(Livestock_Density, na.rm = TRUE)
  )


africa_1 %>% 
  group_by(Item) %>% 
  reframe(
    Livestock_Density = sum(Livestock_Density,na.rm = TRUE)
  ) %>% 
  arrange(desc(Livestock_Density))
# A tibble: 7 × 2
  Item                  Livestock_Density
  <chr>                             <dbl>
1 Major livestock types             91.2 
2 Chickens                          41.4 
3 Swine / pigs                      30.5 
4 Cattle                            10.9 
5 Cattle and Buffaloes              10.9 
6 Goats                              8.52
7 Sheep and Goats                    8.52
africa_1 %>% 
  filter(Item %in% c("Major livestock types","Chickens","Swine / pigs")) %>% 
  ggplot(aes(x = Year,y = Livestock_Density, color = Item, group = Item)) +
  geom_point() +
  geom_line()

We now know what are the top 3 Livestock Item for Seychelles in Africa from 1960 to 2021. We will then create a time series model for each livestock Item and compare the ratio of these 3 throughout each time period.

#Lets create a time series object first
livestock_ts_major = africa_1 %>% 
  filter(Item == "Major livestock types") %>% 
  select(Year, Livestock_Density)

livestock_ts_obj = ts(livestock_ts_major$Livestock_Density, start = 1961, frequency = 1)

Transform our Time Series data to Log Form to remove fluctuations

log_ls_ts = log(livestock_ts_obj)
plot(log_ls_ts)

We can see that there is an upward trend for our Time Series Data.

For creating the Time Series Model, we will use the function auto.arima. This will help us to identify the best configuration for our ARIMA model.

arima_model = auto.arima(log_ls_ts)
arima_model
Series: log_ls_ts 
ARIMA(2,1,2) with drift 

Coefficients:
         ar1      ar2      ma1     ma2   drift
      0.2600  -0.7282  -0.4762  0.5582  0.0281
s.e.  0.2804   0.1492   0.3711  0.1959  0.0130

sigma^2 = 0.02032:  log likelihood = 34.13
AIC=-56.27   AICc=-54.68   BIC=-43.7

Now that we are done with the Model, let’s explore our ACF and PACF plots

acf(arima_model$residuals, main = "Correlogram")

pacf(arima_model$residuals, main = "Partial Correlogram")

Testing Significance of Residuals using Ljung-Box Test

Box.test(arima_model$residuals, lag = 20, type = "Ljung-Box")

    Box-Ljung test

data:  arima_model$residuals
X-squared = 8.9967, df = 20, p-value = 0.9829

Since our pvalue is greater than 0.05, we can conclude that there is no evidence that there is a Non-Zero Autocorrelation in our forecast error from Lags 1 to 20.

Next we will generate a Residual Plot. This will help us confirm if our model is correct or if we need to consider other modelling algorithm.

hist(arima_model$residuals,
     col = "red",
     xlab = "Error",
     main = "Histogram of Residuals",
     freq = FALSE)

As we can see, most of the values for Error are concentrated on 0.

Finally, we can now start to forecast using our ARIMA model. On our script below we will try to predict the Major Livestock Density for the next 5 years

model_forecast = forecast(arima_model, 5)
autoplot(model_forecast)

Our model captured the upward trend of our Log Transformed Time Series Data. The dark and light blue region indicates the confidence interval of our prediction or forecast. Notice that our model did not capture any fluctuation from our original data, that is because our data does not have a seasonality component. The up and down movement we are seeing may be due to noise but not enough for our model to consider it as a pattern for forecasting.

accuracy(model_forecast)
                       ME      RMSE        MAE       MPE     MAPE      MASE
Training set -0.000373242 0.1353569 0.09882457 -3.725761 33.46294 0.9112668
                   ACF1
Training set 0.02882789

Running accuracy function tells us how well our model is performing. The rmse value tells us the root mean squared error of our model. Our target for this metric is for it to be lower when compared to other model performance evaluation.