In this blogpost, I am going to take the earthquake data from April 2015 to May 2016 throughout the World. Then we will process the data in usable form and look at the data on Nepal. Later, we will also dive into the casualties from this devastating earthquake. We will look at how many people died, got injured from which district, using ggplot2, wordcloud and leaflet package.

Let’s get going.

First let’s load the libraries that we will be using.

library(tidyverse)
library(leaflet)
library(lubridate)
library(tidyr)
library(wordcloud)
library(visdat)
library(scales)
library(knitr)

We will be using the data from Asheshwor’s github repository https://github.com/asheshwor/np-quake The file we will be using is all_month_merged.csv. I have downloaded the csv file and saved into my data folder. Now let’s read the data.

# Analyzing Nepal earthquake data 
all_month <- read.csv(file = "../../data/earthquake/all_month_merged.csv", 
                      stringsAsFactors = FALSE)
str(all_month)
## 'data.frame':    81699 obs. of  15 variables:
##  $ time     : chr  "2016-05-29T08:27:33.350Z" "2016-05-29T06:45:59.910Z" "2016-05-29T06:34:13.930Z" "2016-05-29T05:58:39.120Z" ...
##  $ latitude : num  32 36.1 -20.5 10.6 36.3 ...
##  $ longitude: num  -115.21 70.81 -63.24 -40.99 3.51 ...
##  $ depth    : num  1.42 92.05 562.89 10 10 ...
##  $ mag      : num  3.61 4.5 4.3 5 4.2 4.4 2.6 3.3 4.2 2.7 ...
##  $ magType  : chr  "mlr" "mb" "mb" "mb" ...
##  $ nst      : int  22 NA NA NA NA NA NA NA NA NA ...
##  $ gap      : num  262 81 43 43 96 77 59 31 91 NA ...
##  $ dmin     : num  0.302 1.622 5.198 17.105 3.842 ...
##  $ rms      : num  0.41 0.78 0.9 0.73 1.16 1.02 0.44 0.33 0.76 0.54 ...
##  $ net      : chr  "ci" "us" "us" "us" ...
##  $ id       : chr  "ci37577944" "us20005z01" "us20005yzy" "us20005yzt" ...
##  $ updated  : chr  "2016-05-29T08:30:05.446Z" "2016-05-29T07:21:17.154Z" "2016-05-29T06:51:56.741Z" "2016-05-29T06:23:30.881Z" ...
##  $ place    : chr  "23km S of Alberto Oviedo Mota, B.C., MX" "4km NE of `Alaqahdari-ye Kiran wa Munjan, Afghanistan" "9km SSE of Boyuibe, Bolivia" "Northern Mid-Atlantic Ridge" ...
##  $ type     : chr  "earthquake" "earthquake" "earthquake" "earthquake" ...

Let’s visualize our unprocessed data using visdat package:

visdat::vis_dat(all_month, palette = "qual", sort_type = FALSE, warn_large_data = FALSE)

It is a really nice way to see our variables and observations in a graphical form. We can see that variables nst, gap, and dmin have a lot of missing data.

Let’s look at head of the dataset:

head(all_month)
##                       time latitude longitude  depth  mag magType nst gap
## 1 2016-05-29T08:27:33.350Z  32.0230 -115.2117   1.42 3.61     mlr  22 262
## 2 2016-05-29T06:45:59.910Z  36.0551   70.8097  92.05 4.50      mb  NA  81
## 3 2016-05-29T06:34:13.930Z -20.4974  -63.2433 562.89 4.30      mb  NA  43
## 4 2016-05-29T05:58:39.120Z  10.6169  -40.9928  10.00 5.00      mb  NA  43
## 5 2016-05-29T05:23:10.660Z  36.2786    3.5091  10.00 4.20      mb  NA  96
## 6 2016-05-29T05:10:48.040Z  20.4756  120.3505  10.00 4.40      mb  NA  77
##      dmin  rms net         id                  updated
## 1  0.3019 0.41  ci ci37577944 2016-05-29T08:30:05.446Z
## 2  1.6220 0.78  us us20005z01 2016-05-29T07:21:17.154Z
## 3  5.1980 0.90  us us20005yzy 2016-05-29T06:51:56.741Z
## 4 17.1050 0.73  us us20005yzt 2016-05-29T06:23:30.881Z
## 5  3.8420 1.16  us us20005yzj 2016-05-29T06:12:24.532Z
## 6  2.4270 1.02  us us20005yzh 2016-05-29T05:42:19.667Z
##                                                   place       type
## 1               23km S of Alberto Oviedo Mota, B.C., MX earthquake
## 2 4km NE of `Alaqahdari-ye Kiran wa Munjan, Afghanistan earthquake
## 3                           9km SSE of Boyuibe, Bolivia earthquake
## 4                           Northern Mid-Atlantic Ridge earthquake
## 5                         14km W of Ain Bessem, Algeria earthquake
## 6                       159km W of Sabtang, Philippines earthquake

We can see that our time column in not is data or POSIXct format. Let’s go a head and fix that date format and add couple variables year and month.

all_month <- all_month %>% 
  # Add a new column named date
  mutate(date = gsub(time, pattern = "T", replacement = " ") %>% 
           gsub(., pattern = ".[0-9]+Z", replacement = "") %>% 
           ymd_hms(.))

# Examine head with new variables
all_month %>% select(time, latitude,longitude, place, date) %>% head()
##                       time latitude longitude
## 1 2016-05-29T08:27:33.350Z  32.0230 -115.2117
## 2 2016-05-29T06:45:59.910Z  36.0551   70.8097
## 3 2016-05-29T06:34:13.930Z -20.4974  -63.2433
## 4 2016-05-29T05:58:39.120Z  10.6169  -40.9928
## 5 2016-05-29T05:23:10.660Z  36.2786    3.5091
## 6 2016-05-29T05:10:48.040Z  20.4756  120.3505
##                                                   place
## 1               23km S of Alberto Oviedo Mota, B.C., MX
## 2 4km NE of `Alaqahdari-ye Kiran wa Munjan, Afghanistan
## 3                           9km SSE of Boyuibe, Bolivia
## 4                           Northern Mid-Atlantic Ridge
## 5                         14km W of Ain Bessem, Algeria
## 6                       159km W of Sabtang, Philippines
##                  date
## 1 2016-05-29 08:27:33
## 2 2016-05-29 06:45:59
## 3 2016-05-29 06:34:13
## 4 2016-05-29 05:58:39
## 5 2016-05-29 05:23:10
## 6 2016-05-29 05:10:48

Next thing we are going to fix is the place column. In this column, the text before comma is the information about the place and text after comma is the information about country / state. Let’s us now separate this string into two columns.

all_month$country <- 
  as.character(lapply(1:nrow(all_month), 
                      FUN = function(x) 
                        tail(strsplit(all_month$place[x], 
                                      split = ",")[[1]],n=1L)))

# Remove all the white spaces from column country
all_month$country <- gsub("[[:space:]]", "", x = all_month$country)

I also remove the starting and trailing spaces on my country column.

Let’s examine our data frame again with few columns. The reason I am taking only few columns to display is to save some space.

all_month %>% select(date, latitude, longitude, country) %>% head()
##                  date latitude longitude                   country
## 1 2016-05-29 08:27:33  32.0230 -115.2117                        MX
## 2 2016-05-29 06:45:59  36.0551   70.8097               Afghanistan
## 3 2016-05-29 06:34:13 -20.4974  -63.2433                   Bolivia
## 4 2016-05-29 05:58:39  10.6169  -40.9928 NorthernMid-AtlanticRidge
## 5 2016-05-29 05:23:10  36.2786    3.5091                   Algeria
## 6 2016-05-29 05:10:48  20.4756  120.3505               Philippines
# Total number of unique countries
length(unique(all_month$country))
## [1] 283

There are data of 286 countries.

We will now add day, and month columns to all_month dataframe.

# Add month, day, and julian 
all_month <- all_month %>% 
  mutate(day = day(date), 
         month = month(date), 
         julian = julian(date, origin = ymd("2015-01-01")) %>% 
           floor() %>% as.numeric())

Create a dataframe for country Nepal by using dplyr function filter

# Now filter for the country Nepal
nepal_data <- all_month %>% dplyr::filter(country == "Nepal")
head(nepal_data)
##                       time latitude longitude depth mag magType nst gap
## 1 2016-05-28T09:39:10.980Z  28.0134   85.4281 10.00 4.7      mb  NA  74
## 2 2016-05-23T19:09:53.080Z  27.7400   85.0803 10.00 4.3      mb  NA  87
## 3 2016-05-14T22:45:55.410Z  27.8199   85.3181 22.48 4.3      mb  NA  88
## 4 2016-05-13T23:20:59.460Z  27.9580   85.5191 10.00 4.0      mb  NA 104
## 5 2016-05-11T01:19:42.350Z  27.9478   84.9271 10.54 4.5      mb  NA 104
## 6 2016-04-12T14:41:52.170Z  27.6062   86.0621 10.00 4.3      mb  NA 146
##    dmin  rms net         id                  updated
## 1 0.315 1.40  us us20005ysr 2016-05-28T10:34:15.107Z
## 2 0.210 0.91  us us10005jdi 2016-05-25T22:03:19.651Z
## 3 0.107 0.43  us us10005hgq 2016-05-14T23:42:39.975Z
## 4 0.303 0.82  us us10005ivg 2016-05-27T03:54:54.132Z
## 5 0.416 1.01  us us10005ihi 2016-05-26T00:48:30.120Z
## 6 0.670 0.93  us us20005heg 2016-05-11T14:54:32.777Z
##                         place       type                date country day
## 1 34km NNW of Nagarkot, Nepal earthquake 2016-05-28 09:39:10   Nepal  28
## 2 20km WNW of Kirtipur, Nepal earthquake 2016-05-23 19:09:53   Nepal  23
## 3  13km N of Kathmandu, Nepal earthquake 2016-05-14 22:45:55   Nepal  14
## 4   26km N of Nagarkot, Nepal earthquake 2016-05-13 23:20:59   Nepal  13
## 5  45km NW of Kirtipur, Nepal earthquake 2016-05-11 01:19:42   Nepal  11
## 6  31km N of Ramechhap, Nepal earthquake 2016-04-12 14:41:52   Nepal  12
##   month julian
## 1     5    513
## 2     5    508
## 3     5    499
## 4     5    498
## 5     5    496
## 6     4    467

Let’s quickly plot the latitude,longitude and magnitude using ggplot:

ggplot(nepal_data) + geom_point(aes(longitude, latitude, color = mag)) + 
  scale_color_continuous(low = "red", high = "blue")

The devastating earthquake occurred on 2015-04-25 in Nepal. So, let’s examine the earthquake for that particular day.

# Total number of earthquakes on April 25th 
nepal_data %>% dplyr::filter(date > as.POSIXct("2015-04-24 23:59:59", tz = "UTC"), 
                             date < as.POSIXct("2015-04-26 00:00:00", tz = "UTC")) %>% 
  select(date, mag, place) %>%
  summary() %>% knitr::kable()
date mag place
Min. :2015-04-25 06:11:26 Min. :4.200 Length:33
1st Qu.:2015-04-25 07:47:01 1st Qu.:4.500 Class :character
Median :2015-04-25 09:30:29 Median :4.800 Mode :character
Mean :2015-04-25 11:50:21 Mean :4.897 NA
3rd Qu.:2015-04-25 13:36:14 3rd Qu.:5.000 NA
Max. :2015-04-25 23:41:52 Max. :7.800 NA

Earthquake actually occurred 33 times and the magnitude of earthquake varied from 4.2 to 7.8 richter. The energy released from earthquake has a exponential relationship with the magnitude. Let’s look at the image below from the Geological Society of America, of earthquake magnitudes and the equivalent energy release.

Earthquake

Earthquake

Boxplot from April 25th to April 30th, 2015

Let’s plot the boxplot for days from April 25th to April 30th.

nepal_data %>% 
  dplyr::filter(date > as.POSIXct("2015-04-1 00:00:00", tz = "UTC"), 
                date < as.POSIXct("2015-04-30 23:59:00", tz = "UTC")) %>% 
  ggplot() + 
  geom_boxplot(aes(date, mag, group = day)) 

Visualizing the locations on the map using leaflet package

We will now look into the leaflet package and plot the exact locations using latitude, longitude and magnitude of the earthquake.

nepal_data2 <- nepal_data %>% 
  dplyr::filter(date > as.POSIXct("2015-04-24 23:59:59", tz = "UTC"), 
                date < as.POSIXct("2015-04-26 00:00:00", tz = "UTC"))

# Following two lines of code were borrowed from Asheshwor's np-quake repository
nepal_data2$size <- cut(nepal_data2$mag, 
                        c(2, 3.9, 4.9, 5.9, 6.9, 7.9), 
                        labels = c("3.3 to 3.9", ">3.9 to 4.9", ">4.9 to 5.9", ">5.9 to 6.9", ">6.9 to 7.9" ))

pallet <- colorFactor(c("gray32", "dodgerblue4",  "slateblue4", "purple", "firebrick1"),
                      domain = c("3.3 to 3.9", ">3.9 to 4.9", ">4.9 to 5.9", ">5.9 to 6.9", ">6.9 to 7.9"))

library(leaflet)
#pal <- colorQuantile(palette = "YlOrRd", domain = nepal_data2$mag)

m <- leaflet(options = leafletOptions(miniZoom =8, maxZoom = 8), 
             #width = "100%", 
             data = nepal_data2) %>% 
  addTiles() %>% # Add default OpenStreetMap map tiles
  leaflet::addCircleMarkers(lng = ~longitude, 
                            lat = ~latitude, 
                            stroke = FALSE, 
                            fillOpacity = 0.8,
                            fillColor = ~pallet(size),
                            label = paste("mag=", nepal_data2$mag,
                                          "\n", "location = ",
                                          nepal_data2$place)
                            ) %>% 
        addLegend(
        "bottomleft", pal = pallet,
        values = sort(nepal_data2$size),
        title = "Magnitude"
        # labFormat = labelFormat()
      )
m

Monthwise box plot of earthquake magnitude

give.n <- function(x){
  return(c(y = mean(x), label = length(x)))
}

# Nepal monthwise box plot
nepal_data %>% 
  dplyr::filter(date > ymd_hms("2015-01-01 00:00:00"), 
                  date < ymd_hms("2016-01-01 00:00:00")) %>% 
  ggplot(aes(cut(date, breaks="month"), mag)) + geom_boxplot() +
  stat_summary(fun.data = give.n, geom = "text", hjust = -0.5, color = "blue") + 
  xlab("Date") + ylab("Magnitude")

Note that I have used give.n function to print the total number of earthquake occurrences each month. Note that the dataset has the data only from April 25. So the acutal number of total occurrences for the month of April 2015 would be higher. This 7.8 richter earthquake caused the death of around 9000 people.

Casualies by Earthquake

Nepal government has made a spreadsheet available that has recorded the casualties by district. The data can be accessed on this page

The name of the spreadsheet is Key Government Figures - Nepal Earthquake - 2015.xlsx and we will be using 05-June-2015 sheet.

In order to bring this google sheet package into R, we will use the fantastic googlesheets package by Jenny Bryan. Let’s bring the data into the R environment. I have saved the googlesheet to my google drive and the name of the sheet on my drive is JD_Key Government Figures - Nepal Earthquake - 2015.xlsx

# Use library googlesheets to read Nepal earthquake data
library(googlesheets)

# Authenticate googlesheets
googlesheets::gs_auth()

fatal <- gs_title(x = "JD_Key Government Figures - Nepal Earthquake - 2015.xlsx")
fatalities <- googlesheets::gs_read(ss = fatal, ws = "05-June-2015")

We have to use the function gs_auth() first in order to aunthenticate the access to our google drive account. Then we can find the googlesheet with the function gs_title. After you have brought the googlesheet in R environment, use the function gs_read to read that sheet in a dataframe.

Let’s examine the data structure again using the visdat package.

# Use visdat to visualize the type of data in a dataframe
visdat::vis_dat(fatalities, palette = "qual", sort_type = FALSE)

visdat::vis_miss(fatalities)

Total number of fatalities by District

In Nepal, there are 75 districts. All the data provided by the goverment is by district. We can visualize how many people died, how many got injured, how many public and government buildings were damaged by earthquake.

Total number of deaths by district

fatalities %>% dplyr::filter(!is.na(DIST_ID)) %>% 
  dplyr::filter(Tot_Deaths > 1) %>% 
  ggplot() + geom_col(aes(x = reorder(DISTRICT, -Tot_Deaths), 
                          Tot_Deaths, fill = "red")) + 
  xlab("District") + ylab("Total number of deaths") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1), 
        axis.text = element_text(size = 12),
        legend.position = "")

It is clear that the maximum number of people died in Sindhupalchok.

Total number of injured by District

fatalities %>% dplyr::filter(!is.na(DIST_ID)) %>% 
  dplyr::filter(Total_Injured > 50) %>% 
  ggplot() + geom_col(aes(x = reorder(DISTRICT, -Total_Injured), Total_Injured)) + 
  geom_text(aes(x = DISTRICT, Total_Injured, label = Total_Injured), 
            vjust = -0.5, color = "blue") + 
  xlab("District") + ylab("Total number of Injured People") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1), 
        axis.text = element_text(size = 12), 
        axis.title = element_text(size = 14))

While the total number of people that died in Sindhupalchowk is highest, the highest total number of injured people ocurred in Kathmandu. Part of the reason that a lot of people got injured in Kathmandu is because the population in Kathmandu is very high compared to other districts. Kathmandu is the capital of Nepal and according to this dataset we are looking at the total population was 1744240.

Total number of government / Public buildings destroyed

fatalities %>% dplyr::filter(!is.na(DIST_ID)) %>%
  dplyr::filter(GovtBuild_Damage > 1, PublicBuild_Damage > 1) %>% 
  dplyr::select(DISTRICT, GovtBuild_Damage, PublicBuild_Damage) %>% 
  tidyr::gather(key = "variable", value = "value", -DISTRICT) %>% 
  ggplot() + geom_bar(aes(x = reorder(DISTRICT, -value), y = value, 
                          fill = variable), stat = "identity") + 
  facet_wrap( ~ variable, ncol = 1, scale = "free_y") + 
  xlab("DISTRICT") + ylab("Number of damaged buildings") + labs(fill = "") + 
  theme(axis.text.x = element_text(angle = 90, hjust =1), 
        legend.position = "")

The earthquake did a lot of damage on government and public buildings throughout the country. From this graph, we can see that a lot of public buildings and private buildings were damaged in Sindhupalchowk. However, the government buildings damage is not proportional to the public building damage in other districts. While lot of public buildings got damaged in Nuwakot and Kabrepalanchok the number of government buildings got damaged are pretty low. Most of the public houses that got damaged in these districts were old houses.

Using wordcloud package to visualize the casualties

Now we are going to look into wordcloud package and plot the districts with higher frequency of death, injuries and building damages.

Let’s look at the number of people who died in the earthquake.

Before looking into the wordcloud, I am going to create a function which we can use for all the data columns. The function that I have written takes the words and frequencies and then apply color settings to our words when they are displayed. In order to be seen on the word cloud I have specified the minimum frequency to be 5.

# Word cloud
wordcloud_jd <- function(words, freq){
  wordcloud(words = words,max.words = Inf, freq = freq,  
            random.order = FALSE, random.color = FALSE, scale = c(5.,.8 ),
            colors= c("indianred1","indianred2","indianred3","indianred"), 
            res = 300, width = 12, height = 10)
}

We will now use the function wordcloud_jd to plot total number of people by district.

Word cloud of total number of people died by district

wordcloud_jd(words = fatalities$DISTRICT, freq = fatalities$Tot_Deaths)

Looking at the wordcloud, we can see that the maximum number of people died in Sindhupalchowk. The second is Kathmandu, and then Nuwakot and so on. That is the total number of people died in a district. Now let’s examine if the word cloud is consistent for male and female deaths.

# Word cloud for deaths by male
wordcloud_jd(words = fatalities$DISTRICT, freq = fatalities$Death_Male)

# Word cloud for deaths by female
wordcloud_jd(words = fatalities$DISTRICT, freq = fatalities$Death_Female)

From these wordclouds, we can see that maximum number of male and female that died is in Sindhupalchok. However, the second highest death for male occurred in Kathmandu while the second higest death for female occurred in Nuwakot district.

Word cloud of total number of injured by district

wordcloud_jd(words = fatalities$DISTRICT, freq = fatalities$Total_Injured)

While the total number of people that died was higher in Sindhupalchowk, the injured number was very high in Kathmandu. Sindhupalchowk is the 4th in terms of total number of injured people.

We will look into more details and correlations between variable in future post.