Missing Data Imputation: Sensitivity Analysis for MNAR and Multiple Imputation in Stata


And, until recently, no commercial software package had options for doing a sensitivity analysis. But first, some background. In virtually all implementations of these methods in commercial software, the underlying assumption is that data are missing at random (MAR). Roughly speaking, this means that the probability that data are missing on a particular variable does not depend on the value of that variable, after adjusting for observed variables.

This assumption would be violated, for example, if people with high income were less likely to report their income. MAR is not a testable assumption. That raises three issues: for any data set, there are an infinite number of possible MNAR models; nothing in the data will tell you which of those models is better than another; and results may depend heavily on which model you choose. One increasingly common response is a sensitivity analysis. The basic idea is to try out a bunch of plausible MNAR models, and then see how consistent the results are across the different models. If results are reasonably consistent, then you can feel pretty confident that, even if data are not missing at random, that would not compromise your conclusions.

On the other hand, if the results are not consistent across models, you would have to worry about whether any of the results are trustworthy. Keep in mind that this is not a test.

Inconsistency of results does not tell you that your data are MNAR. It simply gives you some idea of what would happen if the data are MNAR in particular ways. The hard part is figuring out how to come up with a reasonable set of models.

The goal is to estimate a linear regression in which the dependent variable is graduation rate, the percentage of students who graduate among those who enrolled four years earlier. It is plausible that colleges with low graduation rates are less likely to report them. If so, that would probably entail a violation of the MAR assumption. It would also imply that colleges with missing data on graduation rates would tend to have lower unobserved graduation rates than those colleges that report their graduation rates, controlling for other variables.

This program produces five data sets, with missing data imputed by linear regression. For a sensitivity analysis, the essential ingredient is the MNAR statement. To do a proper sensitivity analysis, we would redo both the imputation and the analysis for several different values of the SCALE parameter, ranging between 0 and 1. Instead of multiplying the imputed values by some constant, we could also add or subtract a constant: setting the SHIFT option to -20, for example, would subtract 20 points from any imputed graduation rates.

The SHIFT option can also be used for adjusting the imputations of categorical outcomes (binary, ordinal, or nominal), except that the changes are applied on the log-odds scale. Another option allows you to restrict the adjustments to certain subsets of the data: for example, you can subtract 20 points from the imputed values of graduation rates for private colleges only, not for public colleges.

There are also other options, which you can read about here. A similar sensitivity analysis can be done by hand in other packages: you first produce data sets under the MAR assumption and then modify the imputed values by adding or multiplying by the desired constants. But the SAS method is more elegant, because the adjustments are made at each iteration, and the adjusted imputations are used in imputing other variables with missing data in later steps of the algorithm.
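To make the by-hand approach concrete, here is a minimal Stata sketch of a shift-style sensitivity analysis: impute under MAR, subtract a constant from the imputed values of the variable suspected of being MNAR, and re-fit the analysis model for several shift values. The dataset and variable names (colleges.dta, gradrate, private, enroll), the seed, and the use of simple regression imputation are illustrative assumptions, not the setup used in the SAS example.

    foreach delta in 0 10 20 {
        use colleges, clear                       // hypothetical raw data
        gen byte miss_grad = missing(gradrate)    // flag rows that will be imputed
        mi set mlong
        mi register imputed gradrate
        mi impute regress gradrate private enroll, add(5) rseed(12345)  // assumes complete covariates
        * shift only the originally missing (imputed) values in every completed data set
        mi xeq 1/5: replace gradrate = gradrate - `delta' if miss_grad
        mi update
        display "Results with imputed graduation rates shifted down by `delta' points:"
        mi estimate: regress gradrate private enroll
    }

If the coefficients of interest stay roughly stable as the shift grows, the MAR-based conclusions look robust; if they move a lot, the MNAR assumptions matter.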

This particular way of doing a sensitivity analysis is based on what are called pattern-mixture models for MNAR. A separate practical point about imputation models: the best auxiliary variables are those that are highly correlated with both the variable that has missing data and the probability that the variable is missing.

Paul, thanks for your lucid description of these new and useful options. I used the shift method in my 1999 Stat Med paper, and observed problematic behavior when shifting is applied to two variables at the same time.

Would you advise restricting this option to just one variable? It does make some sense to me to focus on one variable at a time. Hi dear Dr., I am confused about the use of selection models and pattern-mixture models.

Are selection models and pattern-mixture models considered multiple imputation methods or likelihood-based methods? What is the link between multiple imputation and likelihood-based models?

Maximum likelihood and multiple imputation can both be used for pattern-mixture models and for selection models. Bayesian methods are likelihood-based, and the goal of multiple imputation is to make random draws from the posterior predictive distribution of the missing data given the observed data. This post is very helpful.

I usually use Stata.



This is part four of the Multiple Imputation in Stata series. For a list of topics covered by this series, see the Introduction. This section will talk you through the details of the imputation process. Be sure you've read at least the previous section, Creating Imputation Models, so you have a sense of what issues can affect the validity of your results.

To illustrate the process, we'll use a fabricated data set. Unlike those in the examples section, this data set is designed to have some resemblance to real world data. Our goal is to regress wages on sex, race, education level, and experience. To see the "right" answers, open the do file that creates the data set and examine the gen command that defines wage. The imputation process creates a lot of output. We'll put highlights on this page; a complete log file, including the associated graphs, can be found here.

Each section of this article will have links to the relevant section of the log. Click "back" in your browser to return to this page. The first step in using mi commands is to mi set your data. This is somewhat similar to svyset, tsset, or xtset. The mi set command tells Stata how it should store the additional imputations you'll create. We suggest using the wide format, as it is slightly faster.
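For instance (dataset loading omitted):

    * tell mi how to store the imputations; wide is what we suggest here,
    * while mi set mlong would choose the memory-saving alternative instead
    mi set wide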

On the other hand, mlong uses slightly less memory. The two structures are similar to what reshape produces, but they are not equivalent, and you would never use reshape to change the data structure used by mi. Instead, type mi convert wide or mi convert mlong (add the clear option if the data have not been saved since the last change).
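For example:

    * switch an already mi set dataset to the mlong style;
    * clear is needed if there are unsaved changes
    mi convert mlong, clear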

Most of the time you don't need to worry about how the imputations are stored. But if you need to manipulate the data in a way mi can't do for you, then you'll need to learn about the details of the structure you're using. You'll also need to be very, very careful. If you're interested in such things (including the rarely used flong and flongsep formats), run this do file and read the comments it contains while examining the data browser to see what the data look like in each form.

Imputed variables are variables that mi is to impute or has imputed. Regular variables are variables that mi is not to impute, either by choice or because they are not missing any values. Passive variables are variables that are completely determined by other variables. For example, log wage is determined by wage, or an indicator for obesity might be determined by a function of weight and height. Interaction terms are also passive variables, though if you use Stata's interaction syntax you won't have to declare them as such.

Passive variables are often problematic—the examples on transformations, non-linearity, and interactions show how using them inappropriately can lead to biased estimates.

If a passive variable is determined by regular variables, then it can be treated as a regular variable since no imputation is needed. Passive variables only have to be treated as such if they depend on imputed variables. Registering a variable tells Stata what kind of variable it is.

Imputed variables must always be registered; regular and passive variables may also be registered with the corresponding mi register commands. However, passive variables are more often created after imputing. Do so with mi passive and they'll be registered as passive automatically. In our example data, all the variables except female need to be imputed. The appropriate mi register commands are shown below.
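A sketch of the registration step for this example; the imputed-variable names (race, urban, edu, exp, wage) follow the variables discussed later in this article, and the passive-variable line is purely illustrative:

    * register the variables mi will impute
    mi register imputed race urban edu exp wage
    * female has no missing values, so registering it as regular is optional
    mi register regular female
    * passive variables are best created after imputing, e.g.:
    * mi passive: gen lnwage = ln(wage)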

Before proceeding to impute, we will check each of the imputation models. Always run each of your imputation models individually, outside the mi impute chained context, to see whether they converge and, insofar as it is possible, to verify that they are specified correctly. Note that when categorical variables (ordered or not) appear as covariates, they should be entered with the i. prefix so that they become sets of indicator variables.
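For example, the individual checks might look like the following; treating urban as binary, race as nominal, and edu as ordinal is an assumption based on the methods used later in this article:

    * run each imputation model on its own, outside mi impute chained
    regress exp i.female i.race i.urban i.edu wage
    regress wage i.female i.race i.urban i.edu exp
    logit urban i.female i.race i.edu exp wage
    mlogit race i.female i.urban i.edu exp wage
    ologit edu i.female i.race i.urban exp wage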

As we'll see later, the output of the mi impute chained command includes the commands for the individual models it runs. Thus a useful shortcut, especially if you have a lot of variables to impute, is to set up your mi impute chained command with the dryrun option to prevent it from doing any actual imputing, run it, and then copy the commands from the output into your do file for testing.
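A hedged sketch of that shortcut; the method/variable pairings and knn(5) are assumptions carried through the rest of this article:

    * set up the chained command, but do no actual imputing yet
    mi impute chained ///
        (pmm, knn(5)) exp wage ///
        (logit) urban ///
        (mlogit) race ///
        (ologit) edu ///
        = female, add(5) dryrun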

The first thing to note is that all of these models run successfully. Complex models like mlogit may fail to converge if you have large numbers of categorical variables, because that often leads to small cell sizes.

To pin down the cause of the problem, remove most of the variables, make sure the model works with what's left, and then add variables back one at a time or in small groups until it stops working.

With some experimentation you should be able to identify the problem variable or combination of variables. At that point you'll have to decide if you can combine categories or drop variables or make other changes in order to create a workable model. Perfect prediction is another problem to note. The imputation process cannot simply drop the perfectly predicted observations the way logit can. You could drop them before imputing, but that seems to defeat the purpose of multiple imputation.

The alternative is to add the augment or just aug option to the affected methods. This tells mi impute chained to use the "augmented regression" approach, which adds fake observations with very low weights in such a way that they have a negligible effect on the results but prevent perfect prediction. For details see the section "The issue of perfect prediction during imputation of categorical data" in the Stata MI documentation.
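In this example, that would mean adding the option inside the parentheses of the affected methods; which models actually need it is an assumption here:

    mi impute chained ///
        (pmm, knn(5)) exp wage ///
        (logit, augment) urban ///
        (mlogit, augment) race ///
        (ologit) edu ///
        = female, add(5) dryrun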

You should also try to evaluate whether the models are specified correctly. A full discussion of how to determine whether a regression model is specified correctly is well beyond the scope of this article, but use whatever tools you find appropriate. Here are some examples. For continuous variables, residual-versus-fitted plots can be a useful check, as sketched below.
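For instance, a residual-versus-fitted plot for the experience model might be produced as follows; a plain regression stands in for the pmm model here, and the covariate list is an assumption:

    regress exp i.female i.race i.urban i.edu wage
    rvfplot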

Consider the plot for experience (the graph itself is in the log file). Note how a number of points are clustered along a line in the lower left, and no points fall below it.

If the graph had the same scale on both axes, the constraint line would be a 45 degree line. If all the points were below a similar line rather than above it, this would tell you that there was an upper bound on the variable rather than a lower bound. The y-intercept of the constraint line tells you the limit in either case.

You can also have both a lower bound and an upper bound, putting all the points in a band between them. The "obvious" model, regress, is inappropriate for experience because it won't apply this constraint.

It's also inappropriate for wages, for the same reason. Alternatives include truncreg, ll(0) and pmm; we'll use pmm. Another thing an imputation model can miss is that the effect of a covariate may differ across groups, for example between men and women. Thus one way to check for misspecification is to add interaction terms to the models and see whether they turn out to be important. For example, we'll compare the obvious model for experience with one that interacts female with the other covariates, as sketched below. We'll run similar comparisons for the models of the other variables. This creates a great deal of output, so see the log file for results.
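A sketch of that comparison for the experience model; plain regressions stand in for the pmm model, and the covariates are assumptions:

    * main-effects ("obvious") model
    regress exp i.female i.race i.urban i.edu wage
    * model that lets every covariate effect differ by sex
    regress exp i.female##(i.race i.urban i.edu c.wage)
    * joint test of the female interaction terms
    testparm i.female#i.race i.female#i.urban i.female#i.edu i.female#c.wage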

Interactions between female and other variables are significant in the models for exp, wage, edu, and urban. There are a few significant interactions between race or urban and other variables, but not nearly as many, and keep in mind that with this many coefficients we'd expect some false positives at any conventional significance level. We'll thus impute men and women separately. This is an especially good option for this data set because female is never missing.

If it were, we'd have to drop the observations that are missing female, because they could not be placed in one group or the other. In the imputation command this means adding the by(female) option.

When testing models, it means starting the commands with the by female: prefix. The improved imputation command, and the general syntax of mi impute chained, are sketched below. Each (method) specifies the method to be used for imputing the varlist that follows it. The possibilities for method are regress, pmm, truncreg, intreg, logit, ologit, mlogit, poisson, and nbreg. N is the number of imputations to be added to the data set. R is the seed to be used for the random number generator—if you do not set this you'll get slightly different imputations each time the command is run.
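A hedged sketch of both the general form and the improved command for this example; the method/variable pairings, knn(5), the augment choices, and the seed value are illustrative assumptions, and savetrace is omitted because (as noted below) it cannot be combined with by:

    * general form (schematic, not meant to be run as-is):
    * mi impute chained (method1) varlist1 (method2) varlist2 ... ///
    *     = regularvars, add(N) rseed(R) savetrace(tracefile, replace)

    * improved command: run the same models separately for men and women
    mi impute chained ///
        (pmm, knn(5)) exp wage ///
        (logit, augment) urban ///
        (mlogit, augment) race ///
        (ologit) edu ///
        , add(5) rseed(4409) by(female)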

The tracefile is a dataset in which mi impute chained will store information about the imputation process. We'll use this dataset to check for convergence.

Options that are relevant to a particular method go with the method, inside the parentheses but following a comma (for example, augment). Options that are relevant to the imputation process as a whole, like by(female), go at the end, after the main comma. Note that the command above does not include a savetrace option. As of this writing, by and savetrace cannot be used at the same time, presumably because it would require one trace file for each by group.

Stata is aware of this problem and we hope it will be changed soon. For purposes of this article, we'll remove the by option when it comes time to illustrate use of the trace file. If this problem comes up in your research, talk to us about work-arounds. There is some disagreement among authorities about how many imputations are sufficient. Some say only a few are needed in almost all circumstances, the Stata documentation suggests at least 20, and White, Royston, and Wood argue that the number of imputations should be roughly equal to the percentage of cases with missing values.

However, we are not aware of any argument that increasing the number of imputations ever causes problems, just that the marginal benefit of another imputation asymptotically approaches zero. Increasing the number of imputations in your analysis takes essentially no work on your part: just change the number in the add() option to something bigger. On the other hand, it can be a lot of work for the computer—multiple imputation has introduced many researchers to the world of jobs that take hours or days to run.

You can generally assume that the amount of time required will be roughly proportional to the number of imputations used; doubling the number of imputations, for example, roughly doubles the run time. So here's our suggestion: