---
title: "Imputing missing data using rMIDAS"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Imputing missing data using rMIDAS}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
LOCAL <- identical(Sys.getenv("LOCAL"), "true")
```

This vignette provides a brief demonstration of **rMIDAS**. We show how to use the package to impute missing values in the [*Adult Census*](https://github.com/MIDASverse/MIDASpy/blob/master/Examples/adult_data.csv) dataset, which is commonly used for benchmarking machine learning tasks.

## Ensure your system is correctly configured

**rMIDAS** relies on Python to implement the MIDAS imputation algorithm, so a Python environment needs to be installed on your machine. Currently, Python versions 3.6 to 3.10 are supported. When the package is first loaded, it prompts the user to choose whether to set up a Python environment and its dependencies automatically. Users who prefer to set up the environment and dependencies manually, or who use **rMIDAS** in headless mode, can specify a Python binary using `set_python_env()`. For additional help on manually setting up a Python environment and its dependencies, please refer to our other vignettes (*Using custom Python versions* and *Running rMIDAS on a server instance*) or visit the [rMIDAS GitHub](https://github.com/MIDASverse/rMIDAS/) page.

## Loading the data

Once **rMIDAS** is initialized, we can load our data. For the purpose of this example, we'll use a subset of the Adult data:

```{r, eval = LOCAL}
library(rMIDAS)

adult <- read.csv("https://raw.githubusercontent.com/MIDASverse/MIDASpy/master/Examples/adult_data.csv",
                  row.names = 1)[1:1000,]
```

As the dataset has a very low proportion of missingness (one of the reasons it is favored for machine learning tasks), we randomly set 10% of the observed values in each column to missing using **rMIDAS**' `add_missingness()` function:

```{r, eval = LOCAL}
set.seed(89)
adult <- add_missingness(adult, prop = 0.1)
```

Next, we make a list of all categorical and binary variables, before preprocessing the data for training using the `convert()` function. Setting the `minmax_scale` argument to `TRUE` ensures that continuous variables are scaled between 0 and 1, which can substantially improve convergence in the training step. All pre-processing steps can be reversed after imputation:

```{r, eval = LOCAL}
adult_cat <- c('workclass','marital_status','relationship','race','education','occupation','native_country')
adult_bin <- c('sex','class_labels')

# Apply rMIDAS preprocessing steps
adult_conv <- convert(adult,
                      bin_cols = adult_bin,
                      cat_cols = adult_cat,
                      minmax_scale = TRUE)
```

The data are now ready to be fed into the MIDAS algorithm, which involves a single call to the `train()` function. At this stage, we specify the dimensions, input corruption proportion, and other hyperparameters of the MIDAS neural network, as well as the number of training epochs:

```{r, eval = LOCAL}
# Train the model for 20 epochs
adult_train <- train(adult_conv,
                     training_epochs = 20,
                     layer_structure = c(128,128),
                     input_drop = 0.75,
                     seed = 89)
```

Once training is complete, we can generate any number of completed datasets using the `complete()` function (below we generate 10). The completed dataframes can also be saved as '.csv' files using the `file` and `file_root` arguments, as sketched below.
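As a quick illustration of the saving arguments, the hedged sketch below writes each completed dataset to disk. It assumes, based on the argument names, that `file` takes a target directory and `file_root` a filename stem; consult `?complete` for the exact behaviour in your installed version:

```{r, eval = FALSE}
# Hedged sketch: save each completed dataset as a .csv file.
# Assumes `file` is a target directory and `file_root` a filename stem;
# see ?complete for the exact semantics in your rMIDAS version.
adult_complete <- complete(adult_train,
                           m = 10,
                           file = tempdir(),        # directory to write to (assumption)
                           file_root = "adult_imp") # stem for output filenames (assumption)
```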
By default, `complete()` unscales continuous variables and converts binary and categorical variables back to their original form. Since the MIDAS algorithm returns predicted probabilities for binary and categorical variables, imputed values for such variables can be generated in one of two ways. When `fast = FALSE` (the default), `complete()` uses the predicted probabilities for each category level to take a weighted random draw from the set of all levels. When `fast = TRUE`, the function simply selects the level with the highest predicted probability. If the completed datasets are very large or `complete()` is taking a long time to run, users may benefit from the latter option:

```{r, eval = LOCAL}
# Generate 10 imputed datasets
adult_complete <- complete(adult_train, m = 10)

# Inspect the first imputed dataset:
head(adult_complete[[1]])
```

Finally, the `combine()` function allows users to estimate regression models on the completed datasets and pool the results using Rubin's combination rules. This function wraps R's `glm()` function, whose arguments can be used to select different families of estimation method (gaussian/OLS, binomial, etc.) and to specify other aspects of the model:

```{r, eval = LOCAL}
# Estimate logit model on the 10 completed datasets (pooled with Rubin's combination rules)
adult_model <- combine("class_labels ~ hours_per_week + sex",
                       adult_complete,
                       family = stats::binomial)

adult_model
```
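For readers who want to see what the pooling step involves, the following hedged sketch reproduces Rubin's combination rules by hand for a single coefficient, fitting the same logit model to each completed dataset with `glm()`. It assumes the `adult_complete` list created above; `combine()` performs the equivalent pooling for all coefficients automatically:

```{r, eval = LOCAL}
# Hedged sketch: Rubin's combination rules by hand for one coefficient.
# combine() does this pooling for every coefficient automatically.
fits <- lapply(adult_complete, function(d)
  stats::glm(class_labels ~ hours_per_week + sex, data = d, family = stats::binomial))

est <- sapply(fits, function(f) stats::coef(f)["hours_per_week"])
se2 <- sapply(fits, function(f) summary(f)$coefficients["hours_per_week", "Std. Error"]^2)

m     <- length(fits)
q_bar <- mean(est)              # pooled point estimate (mean across imputations)
u_bar <- mean(se2)              # average within-imputation variance
b     <- stats::var(est)        # between-imputation variance
t_var <- u_bar + (1 + 1/m) * b  # total variance under Rubin's rules

c(estimate = q_bar, std.error = sqrt(t_var))
```

The pooled standard error reflects both the sampling uncertainty within each completed dataset and the variability of the estimates across imputations, which is why it exceeds the standard error from any single completed dataset.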