• Introduction
    • What is COVID-19?
    • Why is predicting COVID test results important?
  • Load Packages
  • Exploring the Data
    • Load the Data
    • Where did the data set come from?
    • Data Set Variables
  • Tidying and Cleaning the Data
  • Missing Data
    • Dealing with the Missing Data
  • Exploratory Data Analysis
    • Distribution of Classification_final
    • Age and Classification_final
    • Classification_final and Intubed
    • Classification_final and Pneumonia
  • Setting Up the Models
    • Data Split
    • Recipe Creation
    • K-Fold Cross Validation
  • Building the Models
    • K-Nearest Neighbors
    • Elastic Net Multinomial Regression
    • Gradient Boosted Tree
    • Random Forest
    • Creating the models
    • Creating the Workflows
    • Tune Grids
    • Tuning
      • Saving The Tuned Models
      • Loading The Saved Tuned Models
  • Model Results
    • Obtaining ROC_AUC For Each Model
    • Bar Chart For Comparison
    • Autoplot Results
      • Gradient Boosted Tree Autoplot
      • Random Forest Autoplot
  • Results From Best Model
    • Fit Best Model To Our Training Data
    • Testing The Model On Testing Data
    • VIP Plot
    • Confusion Matrix
  • Conclusion
  • Sources

Introduction

The goal of this project is to build and train a model that can predict COVID-19 test results.

What is COVID-19?

COVID-19, short for Coronavirus Disease 2019, is a highly contagious respiratory illness caused by the novel coronavirus SARS-CoV-2. It emerged in late 2019 and quickly spread globally, leading to a pandemic declared by the World Health Organization in March 2020. COVID-19 is primarily transmitted through respiratory droplets when an infected person coughs, sneezes, or talks, and can also spread by touching contaminated surfaces. The virus can cause a range of symptoms, from mild respiratory issues to severe pneumonia and organ failure, with older adults and those with underlying health conditions being particularly vulnerable. The pandemic profoundly affected the world on multiple fronts. It led to widespread illness and death, overwhelmed healthcare systems in many countries, caused economic disruptions with job losses and business closures, disrupted education with school closures and remote learning, and resulted in social isolation and mental health challenges due to lockdowns and restrictions on gatherings. Efforts to control the spread of the virus included measures such as mask-wearing, social distancing, testing, contact tracing, and vaccination campaigns, which became central to global public health strategies in combating the pandemic.

Why is predicting COVID test results important?

Predicting COVID test results can be very beneficial because it can help us better prepare for future pandemics. The widespread shutdowns witnessed during the COVID-19 pandemic underscored the critical need for proactive measures to mitigate transmission risks. Many nations faced unprecedented challenges in managing the influx of symptomatic patients, revealing systemic vulnerabilities and gaps in preparedness efforts. The main problems healthcare providers faced were a shortage of medical resources and the lack of a proper plan to distribute them efficiently. There is now a pressing demand to increase hospital readiness for future outbreaks. Consequently, this project aims to construct a predictive model capable of predicting COVID-19 test outcomes from the current medical statuses and pre-existing health conditions of patients. By doing so, it seeks to furnish healthcare systems with invaluable tools to preemptively strategize and allocate resources more effectively in the face of future pandemics akin to the coronavirus crisis.

Load Packages

library(naniar) #replace_with_na
library(finalfit) #missing_plot()
library(corrplot) #correlation plot
## corrplot 0.92 loaded
library(forcats) #lumping
library(discrim) #qda and lda
## Loading required package: parsnip
library(kknn)
library(xgboost)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ ggplot2   3.5.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ dplyr::slice()  masks xgboost::slice()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──
## ✔ broom        1.0.5      ✔ rsample      1.2.0 
## ✔ dials        1.2.1      ✔ tune         1.2.0 
## ✔ infer        1.0.6      ✔ workflows    1.1.4 
## ✔ modeldata    1.3.0      ✔ workflowsets 1.1.0 
## ✔ recipes      1.0.10     ✔ yardstick    1.3.1 
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard()   masks purrr::discard()
## ✖ dplyr::filter()     masks stats::filter()
## ✖ recipes::fixed()    masks stringr::fixed()
## ✖ dplyr::lag()        masks stats::lag()
## ✖ dplyr::slice()      masks xgboost::slice()
## ✖ dials::smoothness() masks discrim::smoothness()
## ✖ yardstick::spec()   masks readr::spec()
## ✖ recipes::step()     masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
library(ggplot2)
library(dplyr)
library(janitor) #clean names
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(ranger) #random forest
library(glmnet) #glmnet
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Loaded glmnet 4.1-8
library(vip) #vip plot
## 
## Attaching package: 'vip'
## 
## The following object is masked from 'package:utils':
## 
##     vi
set.seed(888)

Exploring the Data

Load the Data

covid_data <- read.csv("Covid Data.csv") %>%
  clean_names() 

head(covid_data, n=20)
##    usmer medical_unit sex patient_type  date_died intubed pneumonia age
## 1      2            1   1            1 03/05/2020      97         1  65
## 2      2            1   2            1 03/06/2020      97         1  72
## 3      2            1   2            2 09/06/2020       1         2  55
## 4      2            1   1            1 12/06/2020      97         2  53
## 5      2            1   2            1 21/06/2020      97         2  68
## 6      2            1   1            2 9999-99-99       2         1  40
## 7      2            1   1            1 9999-99-99      97         2  64
## 8      2            1   1            1 9999-99-99      97         1  64
## 9      2            1   1            2 9999-99-99       2         2  37
## 10     2            1   1            2 9999-99-99       2         2  25
## 11     2            1   1            1 9999-99-99      97         2  38
## 12     2            1   2            2 9999-99-99       2         2  24
## 13     2            1   2            2 9999-99-99       2         2  30
## 14     2            1   2            1 9999-99-99      97         2  55
## 15     2            1   1            1 9999-99-99      97         2  48
## 16     2            1   1            1 9999-99-99      97         2  23
## 17     2            1   1            2 9999-99-99       2         1  80
## 18     2            1   2            1 9999-99-99      97         2  61
## 19     2            1   2            1 9999-99-99      97         2  54
## 20     2            1   1            1 9999-99-99      97         2  64
##    pregnant diabetes copd asthma inmsupr hipertension other_disease
## 1         2        2    2      2       2            1             2
## 2        97        2    2      2       2            1             2
## 3        97        1    2      2       2            2             2
## 4         2        2    2      2       2            2             2
## 5        97        1    2      2       2            1             2
## 6         2        2    2      2       2            2             2
## 7         2        2    2      2       2            2             2
## 8         2        1    2      2       1            1             2
## 9         2        1    2      2       2            1             2
## 10        2        2    2      2       2            2             2
## 11        2        2    2      2       2            2             2
## 12       97        2    2      2       2            2             2
## 13       97        2    2      2       2            2             2
## 14       97        2    2      2       2            2             2
## 15        2        1    2      2       2            2             2
## 16        2        2    2      2       2            2             2
## 17        2        2    2      2       2            1             2
## 18       97        2    2      2       2            2             2
## 19       97        2    2      2       2            2             2
## 20        2        2    2      2       2            2             2
##    cardiovascular obesity renal_chronic tobacco clasiffication_final icu
## 1               2       2             2       2                    3  97
## 2               2       1             1       2                    5  97
## 3               2       2             2       2                    3   2
## 4               2       2             2       2                    7  97
## 5               2       2             2       2                    3  97
## 6               2       2             2       2                    3   2
## 7               2       2             2       2                    3  97
## 8               2       2             1       2                    3  97
## 9               2       1             2       2                    3   2
## 10              2       2             2       2                    3   2
## 11              2       2             2       2                    3  97
## 12              2       2             2       2                    3   2
## 13              2       2             2       2                    3   2
## 14              2       2             2       2                    3  97
## 15              2       2             2       2                    3  97
## 16              2       2             2       2                    3  97
## 17              2       2             2       2                    3   1
## 18              2       2             2       2                    3  97
## 19              2       2             2       2                    3  97
## 20              2       2             2       2                    3  97

We use the clean_names() function to make all the variable names lowercase for easier readability.

Where did the data set come from?

The COVID-19 data set used in this project was obtained from Kaggle (Data Set Link). It contains 1,048,576 unique observations and 21 variables. In the Boolean features, 1 means “yes” and 2 means “no”; values of 97, 98, and 99 denote missing data. This COVID-19 data set was provided by the Mexican government.

Data Set Variables

In the Boolean features, 1 means “yes” and 2 means “no”. Otherwise the values are specified.

The variables that are listed come from the cleaned data set.

usmer indicates whether the patient was treated in a medical unit of the first, second, or third level.

medical_unit Type of institution of the National Health System that provided the care.

sex 1 - Female. 2 - Male

patient_type Type of care the patient received in the unit. 1 for returned home and 2 for hospitalization.

intubed (boolean) Whether the patient was connected to a ventilator.

pneumonia (boolean) Whether the patient already has air sac inflammation or not.

age (int) Age of the patient.

pregnant (boolean) Whether the patient is pregnant or not.

diabetes (boolean) Whether the patient has diabetes or not.

copd (boolean) Whether the patient has Chronic obstructive pulmonary disease or not.

asthma (boolean) Whether the patient has asthma or not.

immuno_suppress (boolean) Whether the patient is immunosuppressed or not.

hypertension (boolean) Whether the patient has hypertension or not.

other_disease (boolean) Whether the patient has another disease or not.

cardiovascular (boolean) Whether the patient has heart or blood vessels related disease.

obesity (boolean) Whether the patient is obese or not.

renal_chronic (boolean) Whether the patient has chronic renal disease or not.

tobacco (boolean) Whether the patient is a tobacco user.

classification_final Covid test results. Values 1-3 mean that the patient was diagnosed with covid in different degrees. 4 or higher means that the patient is not a carrier of covid or that the test is inconclusive.

icu (boolean): Whether the patient had been admitted to an Intensive Care Unit.

survived (boolean): Whether or not the patient survived.

Tidying and Cleaning the Data

We will rename the following variables for easier readability.

INMSUPR to immuno_suppress

HIPERTENSION to hypertension

CLASIFFICATION_FINAL to classification_final

#FIX MISSPELLED COLUMN NAMES
covid_data <- covid_data %>%
  rename(immuno_suppress = inmsupr,
         hypertension = hipertension,
         classification_final = clasiffication_final)

We will create a new column called survived, a boolean feature as described above, where 1 means the patient survived (did not die) and 2 means the patient did not survive.

#NEW COLUMN TO REPLACE DATE_DIED TO WHETHER OR NOT THEY SURVIVED.
covid_data$survived <- case_when(
  covid_data$date_died == "9999-99-99" ~  1,
  .default = 2
)

covid_data <- covid_data %>%
  select(-date_died)

Additionally, for variable classification_final we will lump the values of 4, 5, 6, and 7 into the value of 4 since all four values indicate no symptoms/inconclusive.

#LUMPING NO SYMPTOMS/INCONCLUSIVE
covid_data$classification_final[covid_data$classification_final %in% c(5, 6, 7)] <- 4

Missing Data

Notice that all the males have pregnant values of 97 (missing). This is not truly missing data, because men cannot be pregnant, so all male patients will have their pregnant variable set equal to 2. This actually eliminates all ‘missing data’ for the pregnant variable.

head(covid_data, n=50) %>%
  subset(sex==2) %>%
  select(sex, pregnant)
##    sex pregnant
## 2    2       97
## 3    2       97
## 5    2       97
## 12   2       97
## 13   2       97
## 14   2       97
## 18   2       97
## 19   2       97
## 21   2       97
## 22   2       97
## 23   2       97
## 26   2       97
## 27   2       97
## 28   2       97
## 29   2       97
## 30   2       97
## 31   2       97
## 33   2       97
## 34   2       97
## 36   2       97
## 39   2       97
## 42   2       97
## 46   2       97
## 48   2       97
covid_data$pregnant[covid_data$sex==2] <- 2

The icu and intubed columns are connected with the patient_type variable: patients who have value = 1 for patient_type have value = 97 for both icu and intubed. These patients did not receive hospital treatment, so they never needed to go to the ICU or be intubed; they received at-home care. We will therefore change icu and intubed to 2 (NO) for the patients who received at-home care.

head(covid_data, n = 50) %>%
  subset(patient_type==1) %>%
  select(patient_type, intubed)
##    patient_type intubed
## 1             1      97
## 2             1      97
## 4             1      97
## 5             1      97
## 7             1      97
## 8             1      97
## 11            1      97
## 14            1      97
## 15            1      97
## 16            1      97
## 18            1      97
## 19            1      97
## 20            1      97
## 22            1      97
## 23            1      97
## 24            1      97
## 25            1      97
## 26            1      97
## 27            1      97
## 28            1      97
## 29            1      97
## 30            1      97
## 32            1      97
## 33            1      97
## 34            1      97
## 35            1      97
## 36            1      97
## 37            1      97
## 38            1      97
## 39            1      97
## 40            1      97
## 41            1      97
## 43            1      97
## 44            1      97
## 45            1      97
## 46            1      97
## 47            1      97
## 48            1      97
## 49            1      97
## 50            1      97
head(covid_data, n = 50) %>%
  subset(patient_type==1) %>%
  select(patient_type, icu)
##    patient_type icu
## 1             1  97
## 2             1  97
## 4             1  97
## 5             1  97
## 7             1  97
## 8             1  97
## 11            1  97
## 14            1  97
## 15            1  97
## 16            1  97
## 18            1  97
## 19            1  97
## 20            1  97
## 22            1  97
## 23            1  97
## 24            1  97
## 25            1  97
## 26            1  97
## 27            1  97
## 28            1  97
## 29            1  97
## 30            1  97
## 32            1  97
## 33            1  97
## 34            1  97
## 35            1  97
## 36            1  97
## 37            1  97
## 38            1  97
## 39            1  97
## 40            1  97
## 41            1  97
## 43            1  97
## 44            1  97
## 45            1  97
## 46            1  97
## 47            1  97
## 48            1  97
## 49            1  97
## 50            1  97
#TURNING MISSING DATA INTO 2 (NO)
covid_data$intubed[covid_data$patient_type==1] <- 2
covid_data$icu[covid_data$patient_type==1] <- 2

Dealing with the Missing Data

As mentioned before, the data set marks missing data as 97, 98, or 99. We will now convert these codes to actual NA values so we can deal with the missing data appropriately.

# CHANGING MISSING VALUES TO NA
covid_data <- replace_with_na(covid_data, replace = list(
  intubed = c(97, 99), icu = c(97, 99), pregnant = c(97, 98, 99),
  pneumonia = 99, diabetes = 98, copd = 98, asthma = 98,
  immuno_suppress = 98, hypertension = 98, other_disease = 98,
  cardiovascular = 98, obesity = 98, renal_chronic = 98, tobacco = 98))

Let us take a look at missing_plot() to see how many missing values we actually have.

covid_data %>%
  missing_plot()

covid_data %>%
  summary()
##      usmer        medical_unit         sex         patient_type  
##  Min.   :1.000   Min.   : 1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.: 4.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :2.000   Median :12.000   Median :1.000   Median :1.000  
##  Mean   :1.632   Mean   : 8.981   Mean   :1.499   Mean   :1.191  
##  3rd Qu.:2.000   3rd Qu.:12.000   3rd Qu.:2.000   3rd Qu.:1.000  
##  Max.   :2.000   Max.   :13.000   Max.   :2.000   Max.   :2.000  
##                                                                  
##     intubed        pneumonia          age            pregnant    
##  Min.   :1.000   Min.   :1.000   Min.   :  0.00   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.: 30.00   1st Qu.:2.000  
##  Median :2.000   Median :2.000   Median : 40.00   Median :2.000  
##  Mean   :1.968   Mean   :1.864   Mean   : 41.79   Mean   :1.992  
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.: 53.00   3rd Qu.:2.000  
##  Max.   :2.000   Max.   :2.000   Max.   :121.00   Max.   :2.000  
##  NA's   :7325    NA's   :16003                    NA's   :3754   
##     diabetes         copd           asthma     immuno_suppress  hypertension  
##  Min.   :1.00   Min.   :1.000   Min.   :1.00   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.00   1st Qu.:2.000   1st Qu.:2.00   1st Qu.:2.000   1st Qu.:2.000  
##  Median :2.00   Median :2.000   Median :2.00   Median :2.000   Median :2.000  
##  Mean   :1.88   Mean   :1.986   Mean   :1.97   Mean   :1.986   Mean   :1.844  
##  3rd Qu.:2.00   3rd Qu.:2.000   3rd Qu.:2.00   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :2.00   Max.   :2.000   Max.   :2.00   Max.   :2.000   Max.   :2.000  
##  NA's   :3338   NA's   :3003    NA's   :2979   NA's   :3404    NA's   :3104   
##  other_disease   cardiovascular    obesity      renal_chronic      tobacco     
##  Min.   :1.000   Min.   :1.00   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:2.000   1st Qu.:2.00   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000  
##  Median :2.000   Median :2.00   Median :2.000   Median :2.000   Median :2.000  
##  Mean   :1.973   Mean   :1.98   Mean   :1.847   Mean   :1.982   Mean   :1.919  
##  3rd Qu.:2.000   3rd Qu.:2.00   3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :2.000   Max.   :2.00   Max.   :2.000   Max.   :2.000   Max.   :2.000  
##  NA's   :5045    NA's   :3076   NA's   :3032    NA's   :3006    NA's   :3220   
##  classification_final      icu           survived    
##  Min.   :1.000        Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.000        1st Qu.:2.000   1st Qu.:1.000  
##  Median :4.000        Median :2.000   Median :1.000  
##  Mean   :3.608        Mean   :1.984   Mean   :1.073  
##  3rd Qu.:4.000        3rd Qu.:2.000   3rd Qu.:1.000  
##  Max.   :4.000        Max.   :2.000   Max.   :2.000  
##                       NA's   :7488

Since we have very few NA values compared to our 1,048,576 unique observations, we will deal with the missing data by dropping those rows.

#OMIT NA DATA
covid_data <- na.omit(covid_data)

We can verify that all the missing data has been removed by checking that no variable still contains the values 97, 98, or 99.

# verify categorical variables have no missing data
max(covid_data$usmer)
## [1] 2
max(covid_data$medical_unit)
## [1] 13
max(covid_data$sex)
## [1] 2
max(covid_data$patient_type)
## [1] 2
max(covid_data$intubed)
## [1] 2
max(covid_data$pneumonia)
## [1] 2
max(covid_data$pregnant)
## [1] 2
max(covid_data$diabetes)
## [1] 2
max(covid_data$copd)
## [1] 2
max(covid_data$asthma)
## [1] 2
max(covid_data$immuno_suppress)
## [1] 2
max(covid_data$hypertension)
## [1] 2
max(covid_data$other_disease)
## [1] 2
max(covid_data$cardiovascular)
## [1] 2
max(covid_data$obesity)
## [1] 2
max(covid_data$renal_chronic)
## [1] 2
max(covid_data$tobacco)
## [1] 2
max(covid_data$classification_final)
## [1] 4
max(covid_data$icu)
## [1] 2
max(covid_data$survived)
## [1] 2

Now that our data is cleaned, we will convert all categorical variables to factors.

# FACTOR ALL CATEGORICAL VARIABLES (every column except age)
covid_data <- covid_data %>%
  mutate(across(-age, factor))

Exploratory Data Analysis

Distribution of Classification_final

Let’s explore the distribution of classification_final.

#Distribution of CLASSIFICATION_FINAL
covid_data %>% 
  ggplot(aes(classification_final)) + 
  geom_bar(fill = "blue") + 
  labs(title = "Distribution of COVID Test Classification")

From our ggplot, we can see that the distribution of COVID test results (classification_final) is skewed left. The majority of patients tested as either no symptoms or inconclusive. The second most frequent result is a positive result in degree 3, the third most frequent is a positive result in degree 1, and the least frequent is a positive result in degree 2. The high frequency of patients testing as either no symptoms or inconclusive is a result of lumping degrees 4, 5, 6, and 7 into 4. It could also reflect false negatives and/or patients who frequently went to the hospital to get tested to follow guidelines. A positive result in degree 3 being the second most frequent makes sense because many patients may not have gone to the hospital until their illness became severe.

Age and Classification_final

Let’s look at a percent bar chart of age and classification_final. A common indicator of how someone would react to COVID-19 was age.

ggplot(covid_data ,aes(fill=classification_final, x=age)) +
  geom_bar(position = "fill") +
  labs(title = "Percent Bar Chart For Age and Classification_final")

We can see that the percentage of positive results in degree 3 increases from age 0 to around 70. This makes sense: from ages 20 to 50 there is more exposure because more people are working, and patients aged 50 to 70 are considered more susceptible to COVID-19, as this age group exhibits heightened vulnerability to the virus, so it follows that the percentage would continue to rise through that range. The decrease after age 70 may be a result of few observations; Mexico’s life expectancy is around 70 years, so there may not be enough observations past that age.

Classification_final and Intubed

Let’s look at a percent bar chart of classification_final and intubed. Ventilators were a high-demand resource during the COVID-19 pandemic, so this chart shows the distribution of patients who were intubed across the different result classifications.

ggplot(covid_data ,aes(fill=intubed, x=classification_final)) +
  geom_bar(position = "fill") +
  labs(title = "Percent Bar Chart For Classification_final and Intubed")

In this plot we can see that about 30 percent of patients who had a positive result in degree 2 were intubed. A small percentage of patients with a negative or inconclusive result were still intubed; this can be the result of other respiratory illnesses, or of the patient receiving a false negative on their COVID test while actually requiring a ventilator for treatment. For patients who had a positive result in degree 1 or 3, around 5 percent were intubed. There can be many factors at play in this bar chart, such as a limited supply of ventilators.

Classification_final and Pneumonia

Let’s look at classification_final for patients who have air sac inflammation (pneumonia), another respiratory condition.

ggplot(covid_data ,aes(fill=pneumonia, x=classification_final)) +
  geom_bar(position = "fill") +
  labs(title = "Percent Bar Chart For Classification_final and Pneumonia")

In this plot we can see that about 60 percent of patients who had a positive result in degree 2 had air sac inflammation. For patients who had a positive result in degree 1 or 3, around 20 percent had air sac inflammation. Only about 5-8 percent of patients who tested negative or inconclusive had air sac inflammation. This closely aligns with how pre-existing conditions can contribute to future illnesses: patients who tested positive in any degree have a higher percentage of pre-existing air sac inflammation.

Setting Up the Models

We will now begin setting up the models for our predictions.

Data Split

We will first begin by splitting our data into training and testing sets. The purpose of doing so is to build and train our predictive model on the training data and test the model on unseen data (the testing data). Data is typically split with a 70/30 training/testing split. For this project, I will be using a 10/90 split, because 10% of the observations still amounts to around 100,000, which is enough to train a decent model, and using more observations would take significantly more time to tune the models (shown later on in the project).

# splitting the data with 10/90 split and stratifying on classification_final
covid_split <- initial_split(covid_data, prop = .1, strata = classification_final)
covid_train <- training(covid_split)
covid_test <- testing(covid_split)

We can confirm that the covid data set was correctly split:

nrow(covid_train)/(nrow(covid_train) + nrow(covid_test))
## [1] 0.09999941
nrow(covid_test)/(nrow(covid_train) + nrow(covid_test))
## [1] 0.9000006

Recipe Creation

We will now start creating a recipe. We will create a universal recipe that can be used by all of the models that we will build. The recipe will include 19 predictors from our training covid data set (covid_train). These variables are usmer, medical_unit, sex, patient_type, intubed, pneumonia, age, pregnant, diabetes, copd, asthma, immuno_suppress, hypertension, other_disease, cardiovascular, obesity, renal_chronic, tobacco, and icu. We will dummy-code all the categorical variables using step_dummy(); this encompasses every predictor except age. We will also center and scale our predictors using step_normalize().

#RECIPE
covid_recipe <- recipe(classification_final ~ usmer + medical_unit + sex + 
                         patient_type + intubed + pneumonia + age + pregnant +
                         diabetes + copd + asthma + immuno_suppress + hypertension + 
                         other_disease + cardiovascular + obesity + 
                         renal_chronic + tobacco + icu, data = covid_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_predictors()) 

K-Fold Cross Validation

We will employ a k-fold stratified cross-validation approach with 10 folds, where the data set is divided into 10 equal-sized folds. During each iteration, one fold is reserved as the testing set, while the remaining nine folds (k-1) collectively form the training set. This process is repeated until each fold has served as the testing set exactly once. Subsequently, the model is trained on each training set and evaluated on the corresponding testing set. The average accuracy across all folds is then computed to gauge performance. Metrics like ROC_AUC, accuracy, and standard error can be reviewed to determine performance. This methodology offers a more robust estimate of testing accuracy compared to training on the entire dataset, as it reduces variability by averaging results over multiple iterations.

covid_folds <- vfold_cv(covid_train, v = 10, strata = classification_final)

Building the Models

We will now start building the models with the recipe we created earlier, and we will be tuning our models with the folds we mentioned in the previous step. After tuning, we will examine the ROC_AUC metric across all of our models to see which model performed the best on the training data. We are using the ROC_AUC because it provides a single scalar value that quantifies the overall performance of the model, making it easy to interpret and compare across different models or data sets. The ROC curve will also provide a visual representation of the model’s performance across all classification thresholds. We will be creating four models: K-Nearest Neighbors, Elastic Net Multinomial Regression, Gradient Boosted Tree, and Random Forest.
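As a reminder of what this metric measures, here is a small base-R illustration on toy data (not the project models): for two classes, the ROC_AUC equals the probability that a randomly chosen positive case is ranked above a randomly chosen negative case, which can be computed from rank sums (the Mann-Whitney U relationship).

```r
# Toy sketch of binary ROC_AUC via rank sums (base R only).
auc <- function(scores, labels) {
  # labels: 1 = positive class, 0 = negative class
  r <- rank(scores)                      # rank() averages ties
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc(c(0.9, 0.8, 0.3, 0.2), c(1, 1, 0, 0))  # perfect separation -> AUC = 1
```

yardstick’s roc_auc() generalizes this idea to our four-class classification_final outcome, which is why a single number lets us compare all four models.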

K-Nearest Neighbors

The K-nearest neighbor model is a supervised learning model used for classification and regression tasks. When presented with a new data point for prediction, k-NN calculates the distances between that point and all other points in the training set. It then selects the k nearest neighbors based on these distances. For classification tasks, the majority class among the k neighbors is assigned to the new data point, while for regression tasks, the average value of the target variable for the k neighbors is computed. k-NN is non-parametric and lazy, meaning it makes no assumptions about the data distribution and defers computation until prediction time. The choice of the hyperparameter k is crucial, as it affects the model’s flexibility and performance.
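A minimal parsnip specification for such a model might look like the following sketch (the choice to tune only `neighbors` here is illustrative, not necessarily the exact specification created later in this project):

```r
library(parsnip)  # attached above via tidymodels
library(tune)     # for the tune() placeholder

# Sketch: k-NN classification spec; `neighbors` is the k discussed above,
# left as tune() so cross-validation can select it.
knn_spec <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")
```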

Elastic Net Multinomial Regression

Elastic Net Multinomial Regression is a statistical technique that extends the traditional multinomial logistic regression model by incorporating both L1 (Lasso) and L2 (Ridge) regularization penalties. This hybrid regularization helps address issues such as multicollinearity and overfitting by penalizing the absolute size of the coefficients (the L1 penalty) and their squared magnitude (the L2 penalty). Elastic net regression is particularly useful when dealing with high-dimensional data sets with many correlated predictors, as it encourages sparsity while also providing some stability and robustness. It is commonly used in machine learning and predictive modeling tasks where the goal is to classify observations into multiple categories based on a set of predictor variables.
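The blend of the two penalties can be written out directly. The sketch below is a hypothetical helper (not part of glmnet), with `lambda` playing the role of our tuned `penalty` and `alpha` our tuned `mixture`; it computes the term that regularization adds to the model's loss:

```r
# Elastic net penalty: lambda * ((1 - alpha) * ridge/2 + alpha * lasso).
# alpha = 0 is pure ridge, alpha = 1 is pure lasso, values between blend the two.
elastic_net_penalty <- function(beta, lambda, alpha) {
  lambda * ((1 - alpha) * sum(beta^2) / 2 + alpha * sum(abs(beta)))
}

beta <- c(0.5, -0.25, 0)
elastic_net_penalty(beta, lambda = 0.1, alpha = 0)    # pure ridge penalty
elastic_net_penalty(beta, lambda = 0.1, alpha = 1)    # pure lasso penalty
elastic_net_penalty(beta, lambda = 0.1, alpha = 0.5)  # an even blend
```

Larger `lambda` shrinks coefficients harder; larger `alpha` pushes more of them exactly to zero, which is what produces sparsity.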

Gradient Boosted Tree

Gradient Boosted Trees is an ensemble learning method that combines multiple decision trees to create a powerful predictive model. Gradient Boosted Trees sequentially builds trees to correct errors made by previous trees, ultimately producing a strong ensemble model. By updating predictions based on residuals and incorporating regularization techniques such as shrinkage and tree depth constraints, Gradient Boosted Trees mitigates overfitting and enhances model generalization.
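The residual-correcting loop can be sketched with shallow rpart trees standing in for the weak learners. This is a toy regression example under assumed data, not the xgboost engine used below:

```r
library(rpart)  # shallow regression trees as weak learners

# Toy gradient boosting: each shallow tree is fit to the residuals of the
# current ensemble, and its predictions are added with a small learn_rate
# (shrinkage), which is the regularization discussed above.
set.seed(1)
x <- runif(300, 0, 10)
y <- sin(x) + rnorm(300, sd = 0.2)
dat <- data.frame(x, y)

pred <- rep(mean(y), nrow(dat))  # start from a constant prediction
learn_rate <- 0.1
for (i in 1:100) {
  dat$resid <- y - pred                                # current errors
  stump <- rpart(resid ~ x, data = dat, maxdepth = 2)  # depth-limited weak learner
  pred <- pred + learn_rate * predict(stump, dat)      # shrunken update
}

mean((y - pred)^2)  # far below the error of the initial constant prediction
```

The same two knobs appear in our tuning grid: `learn_rate` controls the size of each update, and the number of trees controls how many corrective steps are taken.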

Random Forest

Random Forest is a versatile ensemble learning technique that utilizes the collective predictions of multiple decision trees to produce robust and accurate models. Each decision tree in the forest is trained independently on a random subset of the training data and a random subset of the features. This randomness helps to decorrelate the trees, reducing the risk of overfitting and improving generalization. During prediction, the results of individual trees are aggregated, typically through averaging for regression tasks or voting for classification tasks, to produce the final prediction. Random Forest models are known for their robustness to noisy data, ability to handle high-dimensional datasets, and automatic feature selection capabilities, making them widely applicable across various domains.
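The bootstrap-plus-random-features idea can be sketched with rpart as well. This toy uses the built-in iris data and hand-rolled voting, not the ranger engine used below:

```r
library(rpart)

# Random forest sketch: grow many trees, each on a bootstrap sample of the
# rows with a random subset of the features (mtry = 2), then let the trees
# vote on the class of a new observation.
set.seed(1)
n_trees <- 25
votes <- replicate(n_trees, {
  boot  <- iris[sample(nrow(iris), replace = TRUE), ]     # bootstrap the rows
  feats <- sample(names(iris)[1:4], 2)                    # random feature subset
  fit   <- rpart(reformulate(feats, "Species"), data = boot)
  as.character(predict(fit, iris[1, ], type = "class"))   # this tree's vote
})

names(which.max(table(votes)))  # majority vote across the forest
```

Because each tree sees different rows and different features, the trees' errors are decorrelated, and the majority vote is far more stable than any single tree.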

Creating the Models

We will now create the models for K-Nearest Neighbors, Elastic Net Multinomial Regression, Gradient Boosted Tree, and Random Forest, along with a plain logistic regression model that is defined here but not tuned or compared below.

#KNN MODEL
knn_mod <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")

#ELASTIC NET MULTINOMIAL REGRESSION MODEL
en_log_reg_mod <- multinom_reg(mixture = tune(),
                               penalty = tune()) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

#RANDOM FOREST MODEL
random_forest_mod <- rand_forest(mtry=tune(),
                                 trees = tune(),
                                 min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")

#GRADIENT BOOSTED TREE MODEL
bt_mod <- boost_tree(mtry=tune(),
                     trees = tune(),
                     learn_rate = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

#LOGISTIC REGRESSION (defined but not tuned or compared below)
log_mod <- logistic_reg() %>% 
  set_engine("glm") %>% 
  set_mode("classification")

Creating the Workflows

We will now create the workflows for our models by pairing each model with our recipe.

#KNN WORKFLOW
knn_wkflw <- workflow() %>%
  add_model(knn_mod) %>%
  add_recipe(covid_recipe)

#ELASTIC NET MULTINOMIAL REGRESSION WORKFLOW
en_log_reg_wkflw <- workflow() %>%
  add_model(en_log_reg_mod) %>%
  add_recipe(covid_recipe)

#RANDOM FOREST WORKFLOW
random_forest_wkflw <- workflow() %>%
  add_model(random_forest_mod) %>%
  add_recipe(covid_recipe)

#GRADIENT BOOSTED TREE WORKFLOW
bt_wkflw <- workflow() %>%
  add_model(bt_mod) %>%
  add_recipe(covid_recipe)

#LOGISTIC REGRESSION WORKFLOW
log_wkflw <- workflow() %>% 
  add_model(log_mod) %>% 
  add_recipe(covid_recipe)

Tune Grids

We will now create the tuning grids to specify the ranges for the parameters each model will tune and the number of levels for each.

For K-nearest neighbors we will be tuning neighbors: the number of nearest neighbors used to make each prediction.

For elastic net we will be tuning penalty: how much the model is penalized for having large coefficients, and mixture: the balance between the two types of penalties, L1 (Lasso) and L2 (Ridge).

For random forest we will be tuning mtry: the number of randomly selected features considered at each split when constructing each decision tree in the forest, trees: the number of decision trees to include, and min_n: the minimum number of observations required to split a node further in each decision tree.

For gradient boosted tree we will be tuning mtry: the number of randomly selected features considered at each split when constructing each decision tree in the ensemble, trees: the number of decision trees to include, and learn_rate: the step size at each iteration when updating the model’s predictions based on the residuals.

# grids for cross-validation
neighbors_grid <- grid_regular(neighbors(range = c(1, 10)), levels = 10)

en_grid <- grid_regular(penalty(range = c(0, 1), trans = identity_trans()),
                        mixture(range = c(0, 1)), levels = 10)

rf_grid <- grid_regular(mtry(range = c(1, 6)), 
                        trees(range = c(200, 600)),
                        min_n(range = c(10, 20)),
                        levels = 5)

bt_grid <- grid_regular(mtry(range = c(1, 6)), 
                        trees(range = c(200, 600)),
                        learn_rate(range = c(-10, -1)),
                        levels = 5)

Tuning

We will tune all four of our models and save the results. We can then load them back in at any time for faster load times, so we do not have to re-tune the models every time we want to see the results.

#tuning k-nearest neighbors
tune_neighbors <- tune_grid(
  object = knn_wkflw, resamples = covid_folds, grid = neighbors_grid)
#tuning elastic net
tune_en_log <- tune_grid(
  object = en_log_reg_wkflw, resamples = covid_folds, grid = en_grid)
#tuning random forest
tune_rf <- tune_grid(
  object = random_forest_wkflw, resamples = covid_folds, grid = rf_grid)
#tuning boosted tree
tune_bt <- tune_grid(
  object = bt_wkflw, resamples = covid_folds, grid = bt_grid)

Saving The Tuned Models

save(tune_neighbors, file = "tune_neighbors.rda")
save(tune_en_log, file = "tune_en_log.rda")
save(tune_rf, file = "tune_rf.rda")
save(tune_bt, file = "tune_bt.rda")

Loading The Saved Tuned Models

load("tune_neighbors.rda")
load("tune_en_log.rda")
load("tune_rf.rda")
load("tune_bt.rda")

Model Results

Obtaining ROC_AUC For Each Model

Let’s collect the ROC_AUC metric of each tuned model and arrange the results in descending order of the mean, since we are looking for the model with the highest ROC_AUC. We will then take the first row of each data frame to get each model’s best result.

# K NEAREST NEIGHBORS
knn_roc_auc <- collect_metrics(tune_neighbors) %>%
  filter(.metric=='roc_auc')%>%
  arrange(desc(mean)) 

# ELASTIC NET
elastic_roc_auc <- collect_metrics(tune_en_log) %>%
  filter(.metric=='roc_auc') %>%
  arrange(desc(mean))

# RANDOM FOREST
rf_roc_auc <- collect_metrics(tune_rf) %>%
  filter(.metric=='roc_auc') %>%
  arrange(desc(mean))

# BOOSTED TREES
boosted_roc_auc <- collect_metrics(tune_bt) %>%
  filter(.metric=='roc_auc') %>%
  arrange(desc(mean))

knn_mean <- head(knn_roc_auc, n=1)
en_mean <- head(elastic_roc_auc, n=1)
rf_mean <- head(rf_roc_auc, n=1)
bt_mean <- head(boosted_roc_auc, n=1)

knn_mean
## # A tibble: 1 × 7
##   neighbors .metric .estimator  mean     n std_err .config              
##       <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1        10 roc_auc hand_till  0.543    10 0.00587 Preprocessor1_Model10
en_mean
## # A tibble: 1 × 8
##   penalty mixture .metric .estimator  mean     n std_err .config               
##     <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                 
## 1       0   0.222 roc_auc hand_till  0.720    10 0.00263 Preprocessor1_Model021
rf_mean
## # A tibble: 1 × 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     4   500    10 roc_auc hand_till  0.730    10 0.00305 Preprocessor1_Model0…
bt_mean
## # A tibble: 1 × 9
##    mtry trees learn_rate .metric .estimator  mean     n std_err .config         
##   <int> <int>      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>           
## 1     4   200        0.1 roc_auc hand_till  0.730    10 0.00333 Preprocessor1_M…

Bar Chart For Comparison

We will now create a bar chart to compare the highest ROC_AUC of each model.

# Creating a tibble of all the models and their ROC AUC
final_compare_tibble <- tibble(Model = c("K Nearest Neighbors", "Elastic Net", "Random Forest", "Gradient Boosted Trees"), ROC_AUC = c(knn_mean$mean, en_mean$mean, rf_mean$mean, bt_mean$mean))

ggplot(final_compare_tibble, aes(x=Model, y=ROC_AUC)) +
  geom_bar(stat = "identity", aes(fill = Model)) +
  scale_fill_manual(values = c("blue", "red", "orange", "green")) +
  theme(legend.position = "none") +
  labs(title = "Comparing ROC_AUC by Model")

# Arranging by highest ROC AUC
final_compare_tibble <- final_compare_tibble %>% 
  arrange(desc(ROC_AUC))

final_compare_tibble
## # A tibble: 4 × 2
##   Model                  ROC_AUC
##   <chr>                    <dbl>
## 1 Gradient Boosted Trees   0.730
## 2 Random Forest            0.730
## 3 Elastic Net              0.720
## 4 K Nearest Neighbors      0.543

When comparing the models on the cross-validated data, we can see that the gradient boosted tree performed the best with a ROC_AUC of 0.7302054. The random forest model came in a close second with a ROC_AUC of 0.7297400.

Elastic Net had a ROC_AUC of 0.7196365 and K-Nearest Neighbors had a ROC_AUC of 0.5432730.

Autoplot Results

Let’s review the autoplots of the best performing model, the gradient boosted tree, and the second best performing model, the random forest.

Gradient Boosted Tree Autoplot

Let’s review the autoplot of the best performing model, the gradient boosted tree.

autoplot(tune_bt, metric = "roc_auc")

We can see that for the gradient boosted tree models, those with a learning rate of 0.1 performed the best. The number of trees makes very little difference at a learning rate of 0.1, but it does appear to matter for the other learning rate values. The remaining models do not perform as well and show some interesting characteristics: two of them stay at a ROC_AUC of 0.5 throughout, while the other two show increasing ROC_AUC across mtry (# Randomly Selected Predictors), and for one of those, fewer trees produce a smaller ROC_AUC for mtry values 1 to 3.

Random Forest Autoplot

Let’s review the autoplot of the second best performing model, the random forest.

autoplot(tune_rf, metric = "roc_auc")

We can see from the autoplot of the tuned random forest that min_n (Minimal Node Size) had minimal effect on the results. Lower mtry (# Randomly Selected Predictors) values produce worse ROC_AUC results, while values of 4 to 6 produce better results. The number of trees does not seem to affect the results much either, since most curves follow the same trajectory, although there are some minor differences between mtry values 4 to 6.

Results From Best Model

Which specific hyperparameters did the best gradient boosted tree model have?

bt_mean
## # A tibble: 1 × 9
##    mtry trees learn_rate .metric .estimator  mean     n std_err .config         
##   <int> <int>      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>           
## 1     4   200        0.1 roc_auc hand_till  0.730    10 0.00333 Preprocessor1_M…

The gradient boosted tree model with mtry (# Randomly Selected Predictors) equal to 4, trees equal to 200, and learn_rate equal to 0.1 produced the best results. As we explained in the gradient boosted tree autoplot, the models with a learn_rate of 0.1 performed the best.

Fit Best Model To Our Training Data

We will now take the best model from the tuned gradient boosted tree models and fit it to the training data. This trains our best model one more time on the entire training data set.

best_bt <- select_best(tune_bt, metric = 'roc_auc')
bt_final_workflow <- finalize_workflow(bt_wkflw, best_bt)
bt_final_fit <- fit(bt_final_workflow, data=covid_train)

Testing The Model On Testing Data

We will finally test our trained gradient boosted tree model on our unseen testing data.

covid_predict <- augment(bt_final_fit, new_data = covid_test)
covid_predict %>%
  roc_auc(truth = classification_final, .pred_1:.pred_4)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc hand_till      0.728

Our trained gradient boosted tree model resulted in a ROC_AUC value of roughly 0.728 on our testing data, very close to the 0.730 it achieved on the cross-validation folds. This means that our model generalizes well to data it has never seen. I would say that the model performed reasonably well: ROC_AUC ranges from 0 to 1, so a value of about 0.728 is decent, though far from the best.

VIP Plot

bt_final_fit %>% 
  extract_fit_parsnip() %>% 
  vip() +
  theme_minimal()

The VIP plot shows us which variables were the most useful when predicting classification_final. It looks like patient_type, age, and pneumonia were the most useful. Patient_type makes sense because patients who received no treatment at the hospital probably had no symptoms, while those who were treated at the hospital more likely had COVID. This also connects to my EDA, as I had mentioned that age was used frequently to determine how susceptible someone was to COVID-19. Pneumonia makes sense because, like COVID, it is a respiratory illness.

Confusion Matrix

Let us see how well our model performed on our testing data.

roc_curve(covid_predict, truth = classification_final, .pred_1:.pred_4) %>% 
  autoplot()

conf_mat(covid_predict, truth = classification_final, 
         .pred_class) %>% 
  autoplot(type = "heatmap")

Based on the ROC curves, it would seem that the model predicts classification_final = 2 the best, since its curve is closest to a right angle. The other curves are worse and closer to the diagonal line.

Based on the confusion matrix our model correctly predicted:

1 patient with classification_final = 1 out of the 7578 actual patients

0 patients with classification_final = 2 out of the 1614 actual patients

86,731 patients with classification_final = 3 out of the 338,301 actual patients

520,754 patients with classification_final = 4 out of the 570,201 actual patients

Conclusion

After fitting 4 models on 10 cross-validation folds and conducting analysis on each of them, the best model to predict COVID-19 test results (classification_final) seems to be the gradient boosted tree model with hyperparameters mtry (# Randomly Selected Predictors) equal to 4, trees equal to 200, and learn_rate equal to 0.1. This is to be expected because gradient boosted tree models are well-suited for classification tasks due to their high accuracy, robustness, and scalability. Even though the gradient boosted tree model came out to be our best model, it still did not perform exceptionally well.

One area of improvement would be to train on a larger training split. Since building a machine learning model on over one million observations can be technologically demanding, I chose to build my models on a smaller training proportion. Additionally, I would like to do more research on the data set to get a deeper understanding of how the data was collected and other ways the variables can be connected. Another limitation was that some variables were not very specific; coming from a data science background, I was unfamiliar with many of the medical variables and had to do my own research on them.

If I were to continue this project and move forward with my analysis, I would like to explore how my model can work for data from other countries. This can dive deeper into how COVID-19 affected countries specifically and determine whether a model like mine would be country specific or universal.

Overall, building a predictive model for COVID-19 test results provided me with a great opportunity to work with a large data set, build machine learning models, and improve my applied data analysis skills. Building such models does not come easy, as a lot of research and determination is required to get your models working and running. But it was overall an enriching experience and I look forward to my next project!

Sources

The data set was found on Kaggle; it was provided by the Mexican government and uploaded by MEIR NIZRI.