The goal of this project is to build and train a model that can predict COVID-19 test results.
COVID-19, short for Coronavirus Disease 2019, is a highly contagious respiratory illness caused by the novel coronavirus SARS-CoV-2. It emerged in late 2019 and quickly spread globally, leading to a pandemic declared by the World Health Organization in March 2020. COVID-19 is primarily transmitted through respiratory droplets when an infected person coughs, sneezes, or talks, and can also spread by touching contaminated surfaces. The virus can cause a range of symptoms, from mild respiratory issues to severe pneumonia and organ failure, with older adults and those with underlying health conditions being particularly vulnerable. The pandemic profoundly affected the world on multiple fronts. It led to widespread illness and death, overwhelmed healthcare systems in many countries, caused economic disruptions with job losses and business closures, disrupted education with school closures and remote learning, and resulted in social isolation and mental health challenges due to lockdowns and restrictions on gatherings. Efforts to control the spread of the virus included measures such as mask-wearing, social distancing, testing, contact tracing, and vaccination campaigns, which became central to global public health strategies in combating the pandemic.
Predicting COVID-19 test results can be very beneficial because it can help us better prepare for future pandemics. The widespread shutdowns witnessed during the COVID-19 pandemic underscored the critical need for proactive measures to mitigate transmission risks. Many nations faced unprecedented challenges in managing the influx of symptomatic patients, revealing systemic vulnerabilities and gaps in preparedness; the main problems healthcare providers faced were a shortage of medical resources and the lack of a proper plan to distribute them efficiently. There is now a pressing demand to increase hospital readiness for future outbreaks. Consequently, this project aims to construct a model capable of predicting COVID-19 test outcomes from patients' current medical status and pre-existing health conditions. By doing so, it seeks to furnish healthcare systems with tools to strategize preemptively and allocate resources more effectively in the face of future pandemics akin to the coronavirus crisis.
library(naniar) #replace_with_na
library(finalfit) #missing_plot()
library(corrplot) #correlation plot
## corrplot 0.92 loaded
library(forcats) #lumping
library(discrim) #qda and lda
## Loading required package: parsnip
library(kknn)
library(xgboost)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ ggplot2 3.5.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::slice() masks xgboost::slice()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom 1.0.5 ✔ rsample 1.2.0
## ✔ dials 1.2.1 ✔ tune 1.2.0
## ✔ infer 1.0.6 ✔ workflows 1.1.4
## ✔ modeldata 1.3.0 ✔ workflowsets 1.1.0
## ✔ recipes 1.0.10 ✔ yardstick 1.3.1
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::slice() masks xgboost::slice()
## ✖ dials::smoothness() masks discrim::smoothness()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
library(ggplot2)
library(dplyr)
library(janitor) #clean names
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(ranger) #random forest
library(glmnet) #glmnet
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Loaded glmnet 4.1-8
library(vip) #vip plot
##
## Attaching package: 'vip'
##
## The following object is masked from 'package:utils':
##
## vi
set.seed(888)
covid_data <- read.csv("Covid Data.csv") %>%
clean_names()
head(covid_data, n=20)
## usmer medical_unit sex patient_type date_died intubed pneumonia age
## 1 2 1 1 1 03/05/2020 97 1 65
## 2 2 1 2 1 03/06/2020 97 1 72
## 3 2 1 2 2 09/06/2020 1 2 55
## 4 2 1 1 1 12/06/2020 97 2 53
## 5 2 1 2 1 21/06/2020 97 2 68
## 6 2 1 1 2 9999-99-99 2 1 40
## 7 2 1 1 1 9999-99-99 97 2 64
## 8 2 1 1 1 9999-99-99 97 1 64
## 9 2 1 1 2 9999-99-99 2 2 37
## 10 2 1 1 2 9999-99-99 2 2 25
## 11 2 1 1 1 9999-99-99 97 2 38
## 12 2 1 2 2 9999-99-99 2 2 24
## 13 2 1 2 2 9999-99-99 2 2 30
## 14 2 1 2 1 9999-99-99 97 2 55
## 15 2 1 1 1 9999-99-99 97 2 48
## 16 2 1 1 1 9999-99-99 97 2 23
## 17 2 1 1 2 9999-99-99 2 1 80
## 18 2 1 2 1 9999-99-99 97 2 61
## 19 2 1 2 1 9999-99-99 97 2 54
## 20 2 1 1 1 9999-99-99 97 2 64
## pregnant diabetes copd asthma inmsupr hipertension other_disease
## 1 2 2 2 2 2 1 2
## 2 97 2 2 2 2 1 2
## 3 97 1 2 2 2 2 2
## 4 2 2 2 2 2 2 2
## 5 97 1 2 2 2 1 2
## 6 2 2 2 2 2 2 2
## 7 2 2 2 2 2 2 2
## 8 2 1 2 2 1 1 2
## 9 2 1 2 2 2 1 2
## 10 2 2 2 2 2 2 2
## 11 2 2 2 2 2 2 2
## 12 97 2 2 2 2 2 2
## 13 97 2 2 2 2 2 2
## 14 97 2 2 2 2 2 2
## 15 2 1 2 2 2 2 2
## 16 2 2 2 2 2 2 2
## 17 2 2 2 2 2 1 2
## 18 97 2 2 2 2 2 2
## 19 97 2 2 2 2 2 2
## 20 2 2 2 2 2 2 2
## cardiovascular obesity renal_chronic tobacco clasiffication_final icu
## 1 2 2 2 2 3 97
## 2 2 1 1 2 5 97
## 3 2 2 2 2 3 2
## 4 2 2 2 2 7 97
## 5 2 2 2 2 3 97
## 6 2 2 2 2 3 2
## 7 2 2 2 2 3 97
## 8 2 2 1 2 3 97
## 9 2 1 2 2 3 2
## 10 2 2 2 2 3 2
## 11 2 2 2 2 3 97
## 12 2 2 2 2 3 2
## 13 2 2 2 2 3 2
## 14 2 2 2 2 3 97
## 15 2 2 2 2 3 97
## 16 2 2 2 2 3 97
## 17 2 2 2 2 3 1
## 18 2 2 2 2 3 97
## 19 2 2 2 2 3 97
## 20 2 2 2 2 3 97
We use the clean_names() function to convert all variable names to lowercase snake_case for consistency and easier reference.
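For example, here is a toy illustration of what clean_names() does (the column names here are made up for demonstration):
tibble(USMER = 1, `Patient Type` = 2) %>%
  clean_names() %>%
  names()  # returns "usmer" and "patient_type"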
The COVID-19 data set used in this project was found on Kaggle (Data Set Link) and was provided by the Mexican government. It contains 1,048,576 unique observations and 21 variables. In the Boolean features, 1 means "yes" and 2 means "no", and the values 97, 98, and 99 indicate missing data; any other encodings are specified per variable below.
The variables listed below come from the cleaned data set.

usmer: indicates whether the patient was treated in a medical unit of the first, second, or third level.
medical_unit: type of institution of the National Health System that provided the care.
sex: 1 for female, 2 for male.
patient_type: type of care the patient received in the unit; 1 for returned home and 2 for hospitalization.
intubed (boolean): whether the patient was connected to a ventilator.
pneumonia (boolean): whether the patient already had air sac inflammation.
age (int): age of the patient.
pregnant (boolean): whether the patient is pregnant.
diabetes (boolean): whether the patient has diabetes.
copd (boolean): whether the patient has chronic obstructive pulmonary disease.
asthma (boolean): whether the patient has asthma.
immuno_suppress (boolean): whether the patient is immunosuppressed.
hypertension (boolean): whether the patient has hypertension.
other_disease (boolean): whether the patient has another disease.
cardiovascular (boolean): whether the patient has a disease related to the heart or blood vessels.
obesity (boolean): whether the patient is obese.
renal_chronic (boolean): whether the patient has chronic renal disease.
tobacco (boolean): whether the patient is a tobacco user.
classification_final: COVID test result; values 1-3 mean the patient was diagnosed with COVID in different degrees, while 4 or higher means the patient is not a carrier of COVID or the test was inconclusive.
icu (boolean): whether the patient had been admitted to an intensive care unit.
survived (boolean): whether or not the patient survived.
We will rename the following variables for clarity and consistency:

inmsupr to immuno_suppress
hipertension to hypertension
clasiffication_final to classification_final
# FIX SPELLING ERRORS IN COLUMN NAMES (positions 13, 14, and 20)
colnames(covid_data)[13] <- "immuno_suppress"
colnames(covid_data)[14] <- "hypertension"
colnames(covid_data)[20] <- "classification_final"
We will create a new column called survived, with the Boolean encoding stated above, where 1 means the patient survived (did not die) and 2 means the patient did not survive.
#NEW COLUMN TO REPLACE DATE_DIED TO WHETHER OR NOT THEY SURVIVED.
covid_data$survived <- case_when(
covid_data$date_died == "9999-99-99" ~ 1,
.default = 2
)
covid_data <- covid_data %>%
select(-date_died)
Additionally, for the variable classification_final we will lump the values of 4, 5, 6, and 7 into the single value 4, since all four values indicate a negative or inconclusive test.
#LUMPING NO SYMPTOMS/INCONCLUSIVE
covid_data$classification_final[covid_data$classification_final==5] <- 4
covid_data$classification_final[covid_data$classification_final==6] <- 4
covid_data$classification_final[covid_data$classification_final==7] <- 4
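The three assignments above can also be collapsed into a single vectorized recode (an equivalent one-line sketch):
# lump all negative/inconclusive codes (5, 6, 7) into 4 in one step
covid_data$classification_final[covid_data$classification_final %in% 5:7] <- 4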
Notice that all the males have pregnant values of 97 (missing). This is not truly missing data, because men cannot be pregnant, so we will set the pregnant variable to 2 (no) for all male patients. This eliminates all of the "missing" data for the pregnant variable.
head(covid_data, n=50) %>%
subset(sex==2) %>%
select(sex, pregnant)
## sex pregnant
## 2 2 97
## 3 2 97
## 5 2 97
## 12 2 97
## 13 2 97
## 14 2 97
## 18 2 97
## 19 2 97
## 21 2 97
## 22 2 97
## 23 2 97
## 26 2 97
## 27 2 97
## 28 2 97
## 29 2 97
## 30 2 97
## 31 2 97
## 33 2 97
## 34 2 97
## 36 2 97
## 39 2 97
## 42 2 97
## 46 2 97
## 48 2 97
covid_data$pregnant[covid_data$sex==2] <- 2
We can see that the columns icu and intubed are connected to the patient_type variable: patients with patient_type = 1 have a value of 97 for both icu and intubed. These patients did not receive hospital treatment, so they never needed to go to the ICU or be intubated; they received at-home care instead. We will therefore set icu and intubed to 2 (no) for the patients who received at-home care.
head(covid_data, n = 50) %>%
subset(patient_type==1) %>%
select(patient_type, intubed)
## patient_type intubed
## 1 1 97
## 2 1 97
## 4 1 97
## 5 1 97
## 7 1 97
## 8 1 97
## 11 1 97
## 14 1 97
## 15 1 97
## 16 1 97
## 18 1 97
## 19 1 97
## 20 1 97
## 22 1 97
## 23 1 97
## 24 1 97
## 25 1 97
## 26 1 97
## 27 1 97
## 28 1 97
## 29 1 97
## 30 1 97
## 32 1 97
## 33 1 97
## 34 1 97
## 35 1 97
## 36 1 97
## 37 1 97
## 38 1 97
## 39 1 97
## 40 1 97
## 41 1 97
## 43 1 97
## 44 1 97
## 45 1 97
## 46 1 97
## 47 1 97
## 48 1 97
## 49 1 97
## 50 1 97
head(covid_data, n = 50) %>%
subset(patient_type==1) %>%
select(patient_type, icu)
## patient_type icu
## 1 1 97
## 2 1 97
## 4 1 97
## 5 1 97
## 7 1 97
## 8 1 97
## 11 1 97
## 14 1 97
## 15 1 97
## 16 1 97
## 18 1 97
## 19 1 97
## 20 1 97
## 22 1 97
## 23 1 97
## 24 1 97
## 25 1 97
## 26 1 97
## 27 1 97
## 28 1 97
## 29 1 97
## 30 1 97
## 32 1 97
## 33 1 97
## 34 1 97
## 35 1 97
## 36 1 97
## 37 1 97
## 38 1 97
## 39 1 97
## 40 1 97
## 41 1 97
## 43 1 97
## 44 1 97
## 45 1 97
## 46 1 97
## 47 1 97
## 48 1 97
## 49 1 97
## 50 1 97
#TURNING MISSING DATA INTO 2 (NO)
covid_data$intubed[covid_data$patient_type==1] <- 2
covid_data$icu[covid_data$patient_type==1] <- 2
As mentioned before, the data set marks missing data as 97, 98, or 99, so we will now convert these codes to actual NA values in order to handle the missing data appropriately.
# CHANGING MISSING VALUES (97/98/99) TO NA
covid_data <- replace_with_na(covid_data, replace = list(
  intubed = c(97, 99), icu = c(97, 99), pregnant = c(97, 98, 99),
  pneumonia = 99, diabetes = 98, copd = 98, asthma = 98,
  immuno_suppress = 98, hypertension = 98, other_disease = 98,
  cardiovascular = 98, obesity = 98, renal_chronic = 98,
  tobacco = 98))
Let us take a look at missing_plot() to see how many missing values we actually have.
covid_data %>%
missing_plot()
covid_data %>%
summary()
## usmer medical_unit sex patient_type
## Min. :1.000 Min. : 1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.: 4.000 1st Qu.:1.000 1st Qu.:1.000
## Median :2.000 Median :12.000 Median :1.000 Median :1.000
## Mean :1.632 Mean : 8.981 Mean :1.499 Mean :1.191
## 3rd Qu.:2.000 3rd Qu.:12.000 3rd Qu.:2.000 3rd Qu.:1.000
## Max. :2.000 Max. :13.000 Max. :2.000 Max. :2.000
##
## intubed pneumonia age pregnant
## Min. :1.000 Min. :1.000 Min. : 0.00 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 30.00 1st Qu.:2.000
## Median :2.000 Median :2.000 Median : 40.00 Median :2.000
## Mean :1.968 Mean :1.864 Mean : 41.79 Mean :1.992
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.: 53.00 3rd Qu.:2.000
## Max. :2.000 Max. :2.000 Max. :121.00 Max. :2.000
## NA's :7325 NA's :16003 NA's :3754
## diabetes copd asthma immuno_suppress hypertension
## Min. :1.00 Min. :1.000 Min. :1.00 Min. :1.000 Min. :1.000
## 1st Qu.:2.00 1st Qu.:2.000 1st Qu.:2.00 1st Qu.:2.000 1st Qu.:2.000
## Median :2.00 Median :2.000 Median :2.00 Median :2.000 Median :2.000
## Mean :1.88 Mean :1.986 Mean :1.97 Mean :1.986 Mean :1.844
## 3rd Qu.:2.00 3rd Qu.:2.000 3rd Qu.:2.00 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :2.00 Max. :2.000 Max. :2.00 Max. :2.000 Max. :2.000
## NA's :3338 NA's :3003 NA's :2979 NA's :3404 NA's :3104
## other_disease cardiovascular obesity renal_chronic tobacco
## Min. :1.000 Min. :1.00 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.00 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000
## Median :2.000 Median :2.00 Median :2.000 Median :2.000 Median :2.000
## Mean :1.973 Mean :1.98 Mean :1.847 Mean :1.982 Mean :1.919
## 3rd Qu.:2.000 3rd Qu.:2.00 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :2.000 Max. :2.00 Max. :2.000 Max. :2.000 Max. :2.000
## NA's :5045 NA's :3076 NA's :3032 NA's :3006 NA's :3220
## classification_final icu survived
## Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:1.000
## Median :4.000 Median :2.000 Median :1.000
## Mean :3.608 Mean :1.984 Mean :1.073
## 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:1.000
## Max. :4.000 Max. :2.000 Max. :2.000
## NA's :7488
Since we have very few NA values relative to our 1,048,576 observations, we will handle the missing data by dropping those rows.
#OMIT NA DATA
covid_data <- na.omit(covid_data)
We can verify that all the missing data has been removed by checking that no variable still contains the values 97, 98, or 99; since these codes exceed every valid value, checking each column's maximum suffices.
# verify categorical variables have no missing data
max(covid_data$usmer)
## [1] 2
max(covid_data$medical_unit)
## [1] 13
max(covid_data$sex)
## [1] 2
max(covid_data$patient_type)
## [1] 2
max(covid_data$intubed)
## [1] 2
max(covid_data$pneumonia)
## [1] 2
max(covid_data$pregnant)
## [1] 2
max(covid_data$diabetes)
## [1] 2
max(covid_data$copd)
## [1] 2
max(covid_data$asthma)
## [1] 2
max(covid_data$immuno_suppress)
## [1] 2
max(covid_data$hypertension)
## [1] 2
max(covid_data$other_disease)
## [1] 2
max(covid_data$cardiovascular)
## [1] 2
max(covid_data$obesity)
## [1] 2
max(covid_data$renal_chronic)
## [1] 2
max(covid_data$tobacco)
## [1] 2
max(covid_data$classification_final)
## [1] 4
max(covid_data$icu)
## [1] 2
max(covid_data$survived)
## [1] 2
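The same check can be run across every column at once (an equivalent sketch; output not shown):
# column-wise maxima; none should be 97 or above
covid_data %>%
  summarise(across(everything(), max))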
Now that our data is cleaned, we will convert all the categorical variables to factors.
# FACTOR ALL CATEGORICAL VARIABLES
covid_data <- covid_data %>%
mutate(usmer=factor(usmer)) %>%
mutate(medical_unit=factor(medical_unit)) %>%
mutate(sex=factor(sex)) %>%
mutate(patient_type=factor(patient_type)) %>%
mutate(intubed=factor(intubed)) %>%
mutate(pneumonia=factor(pneumonia)) %>%
mutate(pregnant=factor(pregnant)) %>%
mutate(diabetes=factor(diabetes)) %>%
mutate(copd=factor(copd)) %>%
mutate(asthma=factor(asthma)) %>%
mutate(immuno_suppress=factor(immuno_suppress)) %>%
mutate(hypertension=factor(hypertension)) %>%
mutate(other_disease=factor(other_disease)) %>%
mutate(cardiovascular=factor(cardiovascular)) %>%
mutate(obesity=factor(obesity)) %>%
mutate(renal_chronic=factor(renal_chronic)) %>%
mutate(tobacco=factor(tobacco)) %>%
mutate(classification_final=factor(classification_final)) %>%
mutate(icu=factor(icu)) %>%
mutate(survived=factor(survived))
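The twenty mutate() calls above can also be collapsed into a single across() call (an equivalent sketch; every column except the numeric age is categorical):
# factor every column except age in one step
covid_data <- covid_data %>%
  mutate(across(-age, factor))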
Let's explore the distribution of classification_final.
#Distribution of CLASSIFICATION_FINAL
covid_data %>%
ggplot(aes(classification_final)) +
geom_bar(fill = "blue") +
labs(title = "Distribution of COVID Test Classification")
From our ggplot, we can see that the COVID test results (classification_final) are skewed left. The majority of patients tested as either no symptoms or inconclusive. The second most frequent result is a positive result in degree 3, the third most frequent is a positive result in degree 1, and the least frequent is a positive result in degree 2. The high frequency of patients testing as no symptoms or inconclusive is partly a result of lumping degrees 4, 5, 6, and 7 into 4. It could also reflect false negatives, or patients who frequently went to the hospital to get tested in order to follow guidelines. A positive result in degree 3 being the second most frequent makes sense because many patients may not have gone to the hospital until their illness became severe.
Let's look at a percent bar chart of age and classification_final, since age was a common indicator of how someone would react to COVID-19.
ggplot(covid_data ,aes(fill=classification_final, x=age)) +
geom_bar(position = "fill") +
labs(title = "Percent Bar Chart For Age and Classification_final")
We can see that the percentage of positive results in degree 3 increases from age 0 to around 70. This makes sense: from ages 20 to 50 there is more exposure because more people in that range are working, and patients aged 50 to 70 are considered more susceptible to COVID-19, as this age group exhibits heightened vulnerability to the virus, so it makes sense that the percentage continues to rise through the 50 to 70 range. The decrease after age 70 may be a result of sparse data: Mexico's life expectancy is around 70 years, so there may not be enough observations past that age.
Let's look at a percent bar chart of classification_final and intubed. Ventilators were a high-demand resource during the COVID-19 pandemic, so this chart shows the distribution of intubated patients across the different result classifications.
ggplot(covid_data ,aes(fill=intubed, x=classification_final)) +
geom_bar(position = "fill") +
labs(title = "Percent Bar Chart For Classification_final and Intubed")
In this plot we can see that about 30 percent of patients who had a positive result in degree 2 were intubated. There was also a small percentage of patients with a negative or inconclusive result who were still intubated; this could be the result of other respiratory illnesses, or of patients receiving a false negative on their COVID test while actually requiring a ventilator for treatment. For patients who had a positive result in degree 1 or 3, around 5 percent were intubated. Many factors could be at play in this bar chart, such as a limited supply of ventilators.
Let's look at classification_final for patients who have air sac inflammation (pneumonia), another respiratory infection.
ggplot(covid_data ,aes(fill=pneumonia, x=classification_final)) +
geom_bar(position = "fill") +
labs(title = "Percent Bar Chart For Classification_final and Pneumonia")
In this plot we can see that about 60 percent of patients who had a positive result in degree 2 had air sac inflammation. For patients who had a positive result in degree 1 or 3, around 20 percent had air sac inflammation. A small percentage, about 5-8 percent, of patients who tested negative or inconclusive also had air sac inflammation. This aligns closely with how pre-existing conditions can contribute to future illnesses: patients who tested positive in any degree have a higher percentage of pre-existing air sac inflammation.
We will now begin setting up the models for our predictions.
We will first split our data into training and testing sets. The purpose is to build and train our predictive model on the training data and then evaluate it on unseen data (the testing data). Data is typically split 70/30 training/testing; for this project I will use a 10/90 split instead. The reason is that 10% of my observations is still roughly 100,000 observations, which is enough to train a decent model, and using more would make tuning the models take significantly longer (shown later in the project).
# splitting the data with 10/90 split and stratifying on classification_final
covid_split <- initial_split(covid_data, prop = .1, strata = classification_final)
covid_train <- training(covid_split)
covid_test <- testing(covid_split)
We can confirm that the covid data set was correctly split
nrow(covid_train)/(nrow(covid_train) + nrow(covid_test))
## [1] 0.09999941
nrow(covid_test)/(nrow(covid_train) + nrow(covid_test))
## [1] 0.9000006
We will now create a universal recipe that can be used by all of the models we build. The recipe includes 19 predictors from our training data (covid_train): usmer, medical_unit, sex, patient_type, intubed, pneumonia, age, pregnant, diabetes, copd, asthma, immuno_suppress, hypertension, other_disease, cardiovascular, obesity, renal_chronic, tobacco, and icu. We will dummy-encode all the categorical variables, which encompasses every predictor except age, using step_dummy(), and we will scale and center the predictors using step_normalize().
#RECIPE
covid_recipe <- recipe(classification_final ~ usmer + medical_unit + sex +
patient_type + intubed + pneumonia + age + pregnant +
diabetes + copd + asthma + immuno_suppress + hypertension +
other_disease + cardiovascular + obesity +
renal_chronic + tobacco + icu, data = covid_train) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_predictors())
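To sanity-check the preprocessing, the recipe can be prepped and baked on the training data (a quick sketch; output not shown):
# prep() estimates the preprocessing parameters;
# bake(new_data = NULL) returns the processed training set
covid_recipe %>%
  prep() %>%
  bake(new_data = NULL) %>%
  glimpse()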
We will employ a stratified k-fold cross-validation approach with 10 folds, where the data set is divided into 10 equal-sized folds. During each iteration, one fold is reserved as the testing set while the remaining nine folds (k-1) collectively form the training set; this process repeats until each fold has served as the testing set exactly once. The model is trained on each training set and evaluated on the corresponding testing set, and the average accuracy across all folds is computed to gauge performance. Metrics like ROC_AUC, accuracy, and standard error can be reviewed to determine performance. This methodology offers a more robust estimate of testing accuracy than a single train/test split, as it reduces variability by averaging results over multiple iterations.
covid_folds <- vfold_cv(covid_train, v = 10, strata = classification_final)
We will now build the models using the recipe we created earlier, tuning them on the folds from the previous step. After tuning, we will examine the ROC_AUC metric across all of our models to see which performed best on the training data. We use ROC_AUC because it provides a single scalar value that quantifies the overall performance of the model, making it easy to interpret and compare across different models or data sets. The ROC curve also provides a visual representation of the model's performance across all classification thresholds. We will create four models: K-Nearest Neighbors, Elastic Net Multinomial Regression, Gradient Boosted Trees, and Random Forest.
The K-nearest neighbor model is a supervised learning model used for classification and regression tasks. When presented with a new data point for prediction, k-NN calculates the distances between that point and all other points in the training set. It then selects the k nearest neighbors based on these distances. For classification tasks, the majority class among the k neighbors is assigned to the new data point, while for regression tasks, the average value of the target variable for the k neighbors is computed. k-NN is non-parametric and lazy, meaning it makes no assumptions about the data distribution and defers computation until prediction time. The choice of the hyperparameter k is crucial, as it affects the model’s flexibility and performance.
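To make the mechanics concrete, here is a minimal, illustrative sketch of the k-NN classification rule in base R (the project itself uses the kknn engine through parsnip; knn_predict_one and its inputs are hypothetical, with x_train a numeric matrix and y_train the class labels):
# toy k-NN classifier for a single new observation (illustration only)
knn_predict_one <- function(x_new, x_train, y_train, k = 5) {
  # Euclidean distance from the new point to every training row
  dists <- sqrt(rowSums(sweep(x_train, 2, x_new)^2))
  # indices of the k nearest neighbors
  nn <- order(dists)[1:k]
  # majority vote among the neighbors' class labels
  names(which.max(table(y_train[nn])))
}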
Elastic Net Multinomial Regression is a statistical technique that extends the traditional multinomial logistic regression model by incorporating both L1 (Lasso) and L2 (Ridge) regularization penalties. This hybrid regularization helps address issues such as multicollinearity and overfitting by penalizing the absolute size of the coefficients (L1 penalty) and their squared magnitude (L2 penalty). Elastic net regression is particularly useful when dealing with high-dimensional data sets with many correlated predictors, as it encourages sparsity while also providing some stability and robustness. It is commonly used in machine learning and predictive modeling tasks where the goal is to classify observations into multiple categories based on a set of predictor variables.
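In tidymodels terms, the two tuned parameters map directly onto glmnet's arguments; the commented call below is an illustrative sketch, where x and y are hypothetical placeholders for a numeric predictor matrix and the class labels:
# penalty corresponds to glmnet's lambda (overall regularization strength)
# mixture corresponds to glmnet's alpha (1 = pure lasso/L1, 0 = pure ridge/L2)
# fit <- glmnet(x, y, family = "multinomial", alpha = 0.5, lambda = 0.01)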
Gradient Boosted Trees is an ensemble learning method that combines multiple decision trees to create a powerful predictive model. Gradient Boosted Trees sequentially builds trees to correct errors made by previous trees, ultimately producing a strong ensemble model. By updating predictions based on residuals and incorporating regularization techniques such as shrinkage and tree depth constraints, Gradient Boosted Trees mitigates overfitting and enhances model generalization.
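For reference, parsnip's boost_tree() arguments correspond to xgboost parameters roughly as follows (a sketch; the exact mapping depends on the parsnip version):
# trees      -> nrounds (number of boosting iterations)
# learn_rate -> eta (shrinkage applied to each new tree's contribution)
# mtry       -> colsample_bynode (features sampled per split) in recent parsnip versions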
Random Forest is a versatile ensemble learning technique that utilizes the collective predictions of multiple decision trees to produce robust and accurate models. Each decision tree in the forest is trained independently on a random subset of the training data and a random subset of the features. This randomness helps to decorrelate the trees, reducing the risk of overfitting and improving generalization. During prediction, the results of individual trees are aggregated, typically through averaging for regression tasks or voting for classification tasks, to produce the final prediction. Random Forest models are known for their robustness to noisy data, ability to handle high-dimensional datasets, and automatic feature selection capabilities, making them widely applicable across various domains.
We will now create the model specifications for K-Nearest Neighbors, Elastic Net Multinomial Regression, Gradient Boosted Trees, and Random Forest. (A plain logistic regression specification is also defined below for reference, though it is binary-only and is not tuned or compared.)
#KNN MODEL
knn_mod <- nearest_neighbor(neighbors = tune()) %>%
set_engine("kknn") %>%
set_mode("classification")
#ELASTIC NET MULTINOMIAL REGRESSION MODEL
en_log_reg_mod <- multinom_reg(mixture = tune(),
penalty = tune()) %>%
set_engine("glmnet") %>%
set_mode("classification")
#RANDOM FOREST MODEL
random_forest_mod <- rand_forest(mtry=tune(),
trees = tune(),
min_n = tune()) %>%
set_engine("ranger") %>%
set_mode("classification")
#GRADIENT BOOSTED TREE MODEL
bt_mod <- boost_tree(mtry=tune(),
trees = tune(),
learn_rate = tune()) %>%
set_engine("xgboost") %>%
set_mode("classification")
#LOGISTIC REGRESSION (binary-only baseline; for reference, not tuned or compared below)
log_mod <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
We will now create the workflows for our models
#KNN WORKFLOW
knn_wkflw <- workflow() %>%
add_model(knn_mod) %>%
add_recipe(covid_recipe)
#ELASTIC NET MULTINOMIAL REGRESSION WORKFLOW
en_log_reg_wkflw <- workflow() %>%
add_model(en_log_reg_mod) %>%
add_recipe(covid_recipe)
#RANDOM FOREST WORKFLOW
random_forest_wkflw <- workflow() %>%
add_model(random_forest_mod) %>%
add_recipe(covid_recipe)
#GRADIENT BOOSTED TREE WORKFLOW
bt_wkflw <- workflow() %>%
add_model(bt_mod) %>%
add_recipe(covid_recipe)
#LOGISTIC REGRESSION WORKFLOW (for reference only)
log_wkflw <- workflow() %>%
add_model(log_mod) %>%
add_recipe(covid_recipe)
We will now create the tuning grids, specifying the ranges for the parameters each model will tune and how many levels to try.
For k-nearest neighbors we will tune neighbors: the number of neighbors k.
For the elastic net we will tune penalty: how much the model is penalized for having large coefficients, and mixture: the balance between the two types of penalties, L1 (Lasso) and L2 (Ridge).
For the random forest we will tune mtry: the number of randomly selected features considered at each split when constructing each decision tree, trees: the number of decision trees in the forest, and min_n: the minimum number of observations required to split a node further.
For the gradient boosted tree we will tune mtry and trees as above, along with learn_rate: the step size at each iteration when the model's predictions are updated based on the residuals.
# grids for cross-validation
neighbors_grid <- grid_regular(neighbors(range = c(1, 10)), levels = 10)
en_grid <- grid_regular(penalty(range = c(0, 1), trans = identity_trans()),
mixture(range = c(0, 1)), levels = 10)
rf_grid <- grid_regular(mtry(range = c(1, 6)),
trees(range = c(200, 600)),
min_n(range = c(10, 20)),
levels = 5)
bt_grid <- grid_regular(mtry(range = c(1, 6)),
trees(range = c(200, 600)),
learn_rate(range = c(-10, -1)),
levels = 5)
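These grids imply a substantial amount of computation once crossed with the 10 folds; a quick way to count the required fits (a sketch):
# candidate combinations per grid, times 10 CV folds
nrow(neighbors_grid) * 10  # 10 candidates -> 100 k-NN fits
nrow(en_grid) * 10         # 10 x 10 = 100 candidates -> 1,000 elastic net fits
nrow(rf_grid) * 10         # 5^3 = 125 candidates -> 1,250 random forest fits
nrow(bt_grid) * 10         # 5^3 = 125 candidates -> 1,250 boosted tree fits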
We will tune all four of our models and save the results. We can load them back in at any time, so we do not have to re-tune the models every time we want to see the results.
#tuning k-nearest neighbors
tune_neighbors <- tune_grid(
object = knn_wkflw, resamples = covid_folds, grid = neighbors_grid)
#tuning elastic net
tune_en_log <- tune_grid(
object = en_log_reg_wkflw, resamples = covid_folds, grid = en_grid)
#tuning random forest
tune_rf <- tune_grid(
object = random_forest_wkflw, resamples = covid_folds, grid = rf_grid)
#tuning boosted tree
tune_bt <- tune_grid(
object = bt_wkflw, resamples = covid_folds, grid = bt_grid)
save(tune_neighbors, file = "tune_neighbors.rda")
save(tune_en_log, file = "tune_en_log.rda")
save(tune_rf, file = "tune_rf.rda")
save(tune_bt, file = "tune_bt.rda")
load("tune_neighbors.rda")
load("tune_en_log.rda")
load("tune_rf.rda")
load("tune_bt.rda")
Let's collect the ROC_AUC metric of our tuned models and arrange the results in descending order of the mean, since we are looking for the model with the highest ROC_AUC. We will then head each data frame to extract each model's best configuration.
# K NEAREST NEIGHBORS
knn_roc_auc <- collect_metrics(tune_neighbors) %>%
filter(.metric=='roc_auc')%>%
arrange(desc(mean))
# ELASTIC NET
elastic_roc_auc <- collect_metrics(tune_en_log) %>%
filter(.metric=='roc_auc') %>%
arrange(desc(mean))
# RANDOM FOREST
rf_roc_auc <- collect_metrics(tune_rf) %>%
filter(.metric=='roc_auc') %>%
arrange(desc(mean))
# BOOSTED TREES
boosted_roc_auc <- collect_metrics(tune_bt) %>%
filter(.metric=='roc_auc') %>%
arrange(desc(mean))
knn_mean <- head(knn_roc_auc, n=1)
en_mean <- head(elastic_roc_auc, n=1)
rf_mean <- head(rf_roc_auc, n=1)
bt_mean <- head(boosted_roc_auc, n=1)
knn_mean
## # A tibble: 1 × 7
## neighbors .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 10 roc_auc hand_till 0.543 10 0.00587 Preprocessor1_Model10
en_mean
## # A tibble: 1 × 8
## penalty mixture .metric .estimator mean n std_err .config
## <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0 0.222 roc_auc hand_till 0.720 10 0.00263 Preprocessor1_Model021
rf_mean
## # A tibble: 1 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 4 500 10 roc_auc hand_till 0.730 10 0.00305 Preprocessor1_Model0…
bt_mean
## # A tibble: 1 × 9
## mtry trees learn_rate .metric .estimator mean n std_err .config
## <int> <int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 4 200 0.1 roc_auc hand_till 0.730 10 0.00333 Preprocessor1_M…
We will now create a bar chart to compare the highest ROC_AUC of each model.
# Creating a tibble of all the models and their ROC AUC
final_compare_tibble <- tibble(
  Model = c("K Nearest Neighbors", "Elastic Net",
            "Random Forest", "Gradient Boosted Trees"),
  ROC_AUC = c(knn_mean$mean, en_mean$mean, rf_mean$mean, bt_mean$mean))
ggplot(final_compare_tibble, aes(x=Model, y=ROC_AUC)) +
geom_bar(stat = "identity", aes(fill = Model)) +
scale_fill_manual(values = c("blue", "red", "orange", "green")) +
theme(legend.position = "none") +
labs(title = "Comparing ROC_AUC by Model")
# Arranging by highest ROC AUC
final_compare_tibble <- final_compare_tibble %>%
arrange(desc(ROC_AUC))
final_compare_tibble
## # A tibble: 4 × 2
## Model ROC_AUC
## <chr> <dbl>
## 1 Gradient Boosted Trees 0.730
## 2 Random Forest 0.730
## 3 Elastic Net 0.720
## 4 K Nearest Neighbors 0.543
When comparing the models on the cross-validated data, we can see that the gradient boosted tree performed the best with a ROC_AUC of 0.7302054, with the random forest a close second at 0.7297400. The elastic net had a ROC_AUC of 0.7196365, and k-nearest neighbors had a ROC_AUC of 0.5432730.
Let's review the autoplots of the two best models, starting with the best performing model, the gradient boosted tree.
autoplot(tune_bt, metric = "roc_auc")
We can see that among the gradient boosted tree models, those with a learning rate of 0.1 performed the best. For these models the number of trees makes a negligible difference, although it does appear to matter for the other learning rate values, none of which perform as well. These other models show some interesting characteristics: two of them sit at a ROC_AUC of 0.5 throughout, while the other two show ROC_AUC increasing with mtry (# Randomly Selected Predictors), and for one of them fewer trees produce a smaller ROC_AUC at mtry values of 1 to 3.
Let's review the autoplot of the second best performing model, the random forest.
autoplot(tune_rf, metric = "roc_auc")
We can see from the autoplot of the tuned random forest that min_n (Minimal Node Size) had minimal effect on the results. Lower mtry (# Randomly Selected Predictors) values produce worse ROC_AUC results, while values of 4 to 6 produce better ones. The number of trees does not seem to affect the results much either, since most curves follow the same trajectory, although there are some minor differences at mtry values of 4 to 6.
Which specific model and parameters did the best Gradient Boosted Tree model have?
bt_mean
## # A tibble: 1 × 9
## mtry trees learn_rate .metric .estimator mean n std_err .config
## <int> <int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 4 200 0.1 roc_auc hand_till 0.730 10 0.00333 Preprocessor1_M…
The gradient boosted tree model with mtry (# Randomly Selected Predictors) equal to 4, trees equal to 200, and learn_rate equal to 0.1 produced the best results. As explained with the gradient boosted tree autoplot, the models with a learn_rate of 0.1 performed the best.
We will now take the best model from the tuned gradient boosted trees and fit it to the training data. This trains our best model one final time on the entire training set.
best_bt <- select_best(tune_bt, metric = 'roc_auc')
bt_final_workflow <- finalize_workflow(bt_wkflw, best_bt)
bt_final_fit <- fit(bt_final_workflow, data=covid_train)
We will finally test our trained gradient boosted tree model on our unseen testing data.
covid_predict <- augment(bt_final_fit, new_data = covid_test)
covid_predict %>%
roc_auc(truth = classification_final, .pred_1:.pred_4)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc hand_till 0.728
Our trained gradient boosted tree model achieved a ROC_AUC of 0.728 on our testing data, essentially matching the 0.730 it earned on the cross-validation folds. Performing this consistently on data it has never seen suggests the model generalizes well rather than overfitting the training set. I would say the model performed reasonably well overall: ROC_AUC ranges from 0 to 1, so a value of 0.728 is decent, though certainly not the best.
bt_final_fit %>%
extract_fit_parsnip() %>%
vip() +
theme_minimal()
The VIP plot shows us which variables are the most useful when predicting classification_final. It looks like patient_type, age, and pneumonia were the most useful. Patient_type makes sense because patients who received no treatment at the hospital probably had no symptoms, whereas those who were treated at the hospital more likely had COVID. This also connects to my EDA, where I mentioned that age was frequently used to gauge how susceptible someone was to COVID-19. Pneumonia makes sense because it, like COVID, is a respiratory illness.
Let us see how well our model performed on our testing data.
roc_curve(covid_predict, truth = classification_final, .pred_1:.pred_4) %>%
autoplot()
conf_mat(covid_predict, truth = classification_final,
.pred_class) %>%
autoplot(type = "heatmap")
Based on the ROC curves, it would seem that the model predicts classification_final = 2 the best, since that curve comes closest to a right angle. The other curves are worse, lying closer to the diagonal reference line.
Based on the confusion matrix, our model correctly predicted:
1 patient with classification_final = 1 out of the 7578 actual patients
0 patients with classification_final = 2 out of the 1614 actual patients
86,731 patients with classification_final = 3 out of the 338,301 actual patients
520,754 patients with classification_final = 4 out of the 570,201 actual patients
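Summing the diagonal, that is roughly 607,000 correct predictions out of about 918,000 test observations, or about 66 percent overall accuracy, driven almost entirely by the majority class. The same figure can be computed directly with yardstick (a sketch; output not shown):
# overall test-set accuracy from the hard class predictions
covid_predict %>%
  accuracy(truth = classification_final, estimate = .pred_class)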
After fitting 4 models on 10 cross-validation folds and conducting analysis on each of them, the best model for predicting COVID-19 test results (classification_final) seems to be the gradient boosted tree with hyperparameters mtry (# Randomly Selected Predictors) equal to 4, trees equal to 200, and learn_rate equal to 0.1. This is to be expected, because gradient boosted tree models are well-suited for classification tasks due to their high accuracy, robustness, interpretability, and scalability. Even though the gradient boosted tree came out as our best model, it still did not perform exceptionally well.
One area of improvement would be to train with a larger training split; since building a machine learning model on over one million observations is computationally demanding, I chose to build my models on a smaller training proportion. Additionally, I would like to do more research on the data set to get a deeper understanding of how the data was collected and of other ways the variables could be connected. Another limitation was that some variables were not very specific, and coming from a data science background I was unfamiliar with many of them and had to do my own research.
If I were to continue this project and move forward with my analysis, I would like to explore how my model can work for data from other countries. This can dive deeper into how COVID-19 affected countries specifically and determine whether a model like mine would be country specific or universal.
Overall, building a predictive model for COVID-19 test results gave me a great opportunity to work with a large data set, build machine learning models, and improve my applied data analysis skills. Building such models does not come easy, as a lot of research and determination is required to get them working and running, but it was an enriching experience overall and I look forward to my next project!
The data set was found on Kaggle; it was provided by the Mexican government and uploaded by Meir Nizri.