Binary logistic regression assumptions
In one study, researchers determined the presence or absence of 79 species of birds in New Zealand that had been artificially introduced (the dependent variable) and 14 independent variables, including number of releases, number of individuals released, migration (scored as 1 for sedentary, 2 for mixed, 3 for migratory), body length, etc. Multiple logistic regression suggested that number of releases, number of individuals released, and migration had the biggest influence on the probability of a species being successfully introduced to New Zealand, and the logistic regression equation could be used to predict the probability of success of a new introduction.
While hopefully no one will deliberately introduce more exotic bird species to new territories, this logistic regression could help understand what will determine the success of accidental introductions or the introduction of endangered species to areas of their native range where they had been eliminated.
The main null hypothesis of a multiple logistic regression is that there is no relationship between the X variables and the Y variable; in other words, the Y values you predict from your multiple logistic regression equation are no closer to the actual Y values than you would expect by chance.
As you are doing a multiple logistic regression, you'll also test a null hypothesis for each X variable, that adding that X variable to the multiple logistic regression does not improve the fit of the equation any more than expected by chance.
While you will get P values for these null hypotheses, you should use them as a guide to building a multiple logistic regression equation; you should not use the P values as a test of biological null hypotheses about whether a particular X variable causes variation in Y.
Multiple logistic regression finds the equation that best predicts the value of the Y variable for the values of the X variables.
The Y variable is the probability of obtaining a particular value of the nominal variable. For the bird example, the values of the nominal variable are "species present" and "species absent." This probability could take values from 0 to 1. The probability can also be expressed as odds: a probability of successful introduction of 0.25 is the same as odds of 0.25/(1-0.25) = 1/3, and in gambling terms this would be expressed as "3 to 1 odds against having that species in New Zealand."
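The relationship between a probability, its odds, and the log of the odds can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the probability of 0.25 is an assumed value chosen because a 1-in-4 chance corresponds to "3 to 1 odds against."

```python
import math

def odds(p):
    """Odds in favour of an event with probability p (requires 0 <= p < 1)."""
    return p / (1.0 - p)

def log_odds(p):
    """Natural log of the odds (the logit); requires 0 < p < 1."""
    return math.log(odds(p))

def prob_from_log_odds(z):
    """Inverse of log_odds: convert a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

p = 0.25  # assumed example value: a 1-in-4 chance, i.e. "3 to 1 against"
print("odds:", odds(p))
print("log odds:", log_odds(p))
print("back to probability:", prob_from_log_odds(log_odds(p)))
```

Unlike the probability, the log of the odds is not confined to the range 0 to 1, which is why it is the quantity modelled as a linear function of the X variables.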
You find the slopes (b1, b2, etc.) of the equation by maximum likelihood. Maximum likelihood is a computer-intensive technique; the basic idea is that it finds the values of the parameters under which you would be most likely to get the observed results.
You might want to have a measure of how well the equation fits the data, similar to the R² of multiple linear regression. However, statisticians do not agree on the best measure of fit for multiple logistic regression.
Some use deviance, D, for which smaller numbers represent better fit, and some use one of several pseudo-R² values, for which larger numbers represent better fit.
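To make the ideas of maximum likelihood, deviance, and pseudo-R² concrete, here is a minimal sketch in plain Python for a single X variable. It rests on several assumptions that are not from the text above: the data are invented, the fit uses Newton-Raphson iteration (one common way to maximize the likelihood, not necessarily what any particular package does), and the pseudo-R² shown is McFadden's, which is only one of the several versions in use.

```python
import math

def sigmoid(z):
    """Convert log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(a, b, xs, ys):
    """Log-likelihood of 0/1 outcomes ys under intercept a and slope b."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(a + b * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def fit_logistic(xs, ys, iterations=50, tol=1e-10):
    """Maximum-likelihood intercept and slope via Newton-Raphson."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = 0.0                # gradient of the log-likelihood
        h_aa = h_ab = h_bb = 0.0       # information matrix (negative Hessian)
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g_a += y - p
            g_b += (y - p) * x
            h_aa += w
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        step_a = (h_bb * g_a - h_ab * g_b) / det   # solve the 2x2 system
        step_b = (h_aa * g_b - h_ab * g_a) / det
        a += step_a
        b += step_b
        if abs(step_a) < tol and abs(step_b) < tol:
            break
    return a, b

# Invented data: the outcomes overlap, so a finite solution exists.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

a, b = fit_logistic(xs, ys)
ll_fit = log_likelihood(a, b, xs, ys)

# Null model: every observation gets the overall proportion of 1s.
p_bar = sum(ys) / len(ys)
ll_null = sum(y * math.log(p_bar) + (1 - y) * math.log(1.0 - p_bar) for y in ys)

deviance = -2.0 * ll_fit               # smaller means better fit
pseudo_r2 = 1.0 - ll_fit / ll_null     # McFadden's version; larger is better
print("intercept:", a, "slope:", b)
print("deviance:", deviance, "pseudo-R2:", pseudo_r2)
```

Note that if the two outcomes were perfectly separated by X, the likelihood would have no finite maximum and this simple iteration would not converge.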
You can use nominal variables as independent variables in multiple logistic regression, as Veltman et al. did, for example. See the discussion on the multiple linear regression page about how to do this.
Whether the purpose of a multiple logistic regression is prediction or understanding functional relationships, you'll usually want to decide which variables are important and which are unimportant.
In the bird example, if your purpose was prediction it would be useful to know that your prediction would be almost as good if you measured only three variables and didn't have to measure more difficult variables such as range and weight.
If your purpose was understanding possible causes, knowing that certain variables did not explain much of the variation in introduction success could suggest that they are probably not important causes of the variation in success.
The procedures for choosing variables are basically the same as for multiple linear regression. The main difference is that instead of using the change of R² to measure the difference in fit between an equation with or without a particular variable, you use the change in likelihood.
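One common way to turn a change in likelihood into a P value is a likelihood-ratio test, sketched below in Python. The assumptions here are mine, not the text's: the two models are nested and differ by exactly one variable (so the test statistic is compared to a chi-square distribution with 1 degree of freedom), and the log-likelihood values are hypothetical numbers chosen for illustration.

```python
import math

def likelihood_ratio_test_1df(loglik_without, loglik_with):
    """Compare two nested models that differ by one parameter.

    Returns the test statistic G = 2 * (difference in log-likelihood)
    and its P value from a chi-square distribution with 1 d.f.
    """
    g = 2.0 * (loglik_with - loglik_without)
    g = max(g, 0.0)  # guard against tiny negative values from rounding
    # For 1 d.f., the chi-square upper tail is erfc(sqrt(G / 2)).
    p_value = math.erfc(math.sqrt(g / 2.0))
    return g, p_value

# Hypothetical log-likelihoods for a model without and with one X variable.
g, p_value = likelihood_ratio_test_1df(loglik_without=-50.0, loglik_with=-46.0)
print("G:", g, "P:", p_value)
```

The closed-form tail used here only works for 1 degree of freedom; comparing models that differ by several variables would need a general chi-square routine.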
Otherwise, everything about choosing variables for multiple linear regression applies to multiple logistic regression as well, including the warnings about how easy it is to get misleading results.
Multiple logistic regression assumes that the observations are independent. For example, if you were studying the presence or absence of an infectious disease and had subjects who were in close contact, the observations might not be independent; if one person had the disease, people near them (who might be similar in occupation, socioeconomic status, age, etc.) would be likely to have the disease as well.
Careful sampling design can take care of this.
Multiple logistic regression also assumes that the natural log of the odds and the measurement variables have a linear relationship. It can be hard to see whether this assumption is violated, but if you have biological or statistical reasons to expect a non-linear relationship between one of the measurement variables and the log of the odds, you may want to try data transformations.
Multiple logistic regression does not assume that the measurement variables are normally distributed.
Some obese people get gastric bypass surgery to lose weight, and some of them die as a result of the surgery. They obtained records on 81, patients who had had Roux-en-Y surgery, of which died within 30 days. They did multiple logistic regression, with alive vs.
Manually choosing the variables to add to their logistic model, they identified six that contribute to risk of dying from Roux-en-Y surgery. They developed a simplified version of the model: one point for every decade of age over 40, one point for every 10 BMI units over 40, one point for male, one point for congestive heart failure, one point for liver disease, and two points for pulmonary hypertension.
Graphs aren't very useful for showing the results of multiple logistic regression; instead, people usually just show a table of the independent variables, with their P values and perhaps the regression coefficients.
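The simplified point score described above is simple enough to write as a short function. The sketch below is an illustration only: the function and parameter names are hypothetical, and it assumes that "every decade over 40" and "every 10 BMI units over 40" mean completed blocks of 10 (so age 49 scores 0 and age 50 scores 1), which the description does not spell out.

```python
def simplified_risk_score(age, bmi, male, heart_failure,
                          liver_disease, pulmonary_hypertension):
    """Point score following the simplified scheme described in the text.

    Assumption: only completed decades / completed blocks of 10 BMI units
    above 40 earn a point.
    """
    score = 0
    score += max(0, int((age - 40) // 10))   # 1 point per decade over 40
    score += max(0, int((bmi - 40) // 10))   # 1 point per 10 BMI units over 40
    score += 1 if male else 0
    score += 1 if heart_failure else 0
    score += 1 if liver_disease else 0
    score += 2 if pulmonary_hypertension else 0
    return score

# Hypothetical patient: 65-year-old man, BMI 52, pulmonary hypertension only.
print(simplified_risk_score(65, 52, True, False, False, True))
```

A score like this trades some predictive accuracy for ease of use, since it replaces the fitted regression coefficients with whole-number weights.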
If the dependent variable is a measurement variable, you should do multiple linear regression. There are numerous other techniques you can use when you have one nominal and three or more measurement variables, but I don't know enough about them to list them, much less explain them.
There's a very nice web page for multiple logistic regression. It will not do automatic selection of variables; if you want to construct a logistic model with fewer independent variables, you'll have to pick the variables yourself.
Salvatore Mangiafico's R Companion has a sample R program for multiple logistic regression.
Here is an example using the data on bird introductions to New Zealand. In the MODEL statement, the dependent variable is to the left of the equals sign, and all the independent variables are to the right.
The summary shows that "release" was added to the model first, yielding a P value less than 0. Next, "upland" was added, with a P value of 0. Next, "migr" was added, with a P value of 0. However, none of the other variables have a P value less than 0.
You need to have several times as many observations as you have independent variables; otherwise you can get "overfitting": it could look like every independent variable is important, even if they're not. A frequently seen rule of thumb is that you should have at least 10 to 20 times as many observations as you have independent variables. I don't know how to do a more detailed power analysis for multiple logistic regression.
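The rule of thumb is easy to apply as arithmetic. The short Python sketch below (the function name is hypothetical) applies it to the bird example's 14 independent variables and 79 species, both figures taken from the description earlier in this text.

```python
def observations_needed(n_independent_vars, low=10, high=20):
    """Range of sample sizes suggested by the 10-to-20-times rule of thumb."""
    return n_independent_vars * low, n_independent_vars * high

n_vars, n_obs = 14, 79  # bird example: 14 independent variables, 79 species
low, high = observations_needed(n_vars)
print("suggested observations:", low, "to", high)
print("meets lower bound:", n_obs >= low)
```

By this rule, a model that used all 14 variables at once would be at risk of overfitting, which is one reason to select a smaller set of variables.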
Risk factors associated with mortality after Roux-en-Y gastric bypass surgery.
However, you can treat some ordinal variables as continuous and some as nominal; they do not all have to be treated the same. Examples of ordinal variables include Likert items.
Fortunately, you can check assumptions 3, 4, 5 and 6 using Stata. Do not be surprised if your data fails one or more of these assumptions since this is fairly typical when working with real-world data rather than textbook examples, which often only show you how to carry out a binomial logistic regression when everything goes well.
Just remember that if you do not check that your data meets these assumptions or you test for them incorrectly, the results you get when running a binomial logistic regression might not be valid. In practice, checking for assumptions 3, 4, 5 and 6 will probably take up most of your time when carrying out a binomial logistic regression.
However, it is not a difficult task, and Stata provides all the tools you need to do this. In the section, Test Procedure in Stata, we illustrate the Stata procedure required to perform a binomial logistic regression assuming that no assumptions have been violated.
First, we set out the example we use to explain the binomial logistic regression procedure in Stata.
A teacher wanted to understand whether the number of hours students spent revising predicted success in their final year exams. The teacher also questioned whether gender would influence exam success, although they didn't expect that it would.
Therefore, the teacher recruited students who were about to undertake their final year exams. The teacher had the students estimate the number of hours they spent revising and record their gender. The teacher then obtained their final year exam marks to discover whether they passed or failed the exam. In order to understand whether the number of hours of study had an effect on passing the exam, the teacher ran a binomial logistic regression.
Therefore, in this example, the dichotomous dependent variable is pass, which has two categories (passed or failed). The number of hours of study was a continuous independent variable, hours (measured in hours), and the gender of a participant was a dichotomous independent variable, gender, with two categories.
The example and data used for this guide are fictitious. We have just created them for the purposes of this guide.
In Stata, we created three variables: pass, hours and gender. After creating these three variables, we entered the scores for each into the three columns of the Data Editor (Edit) spreadsheet.
In this section, we show you how to analyze your data using a binomial logistic regression in Stata when the six assumptions in the previous section, Assumptions, have not been violated. You can carry out binomial logistic regression using code or Stata's graphical user interface (GUI).
After you have carried out your analysis, we show you how to interpret your results. First, choose whether you want to use code or Stata's graphical user interface (GUI). If you use code, you enter it into Stata's Command window.
In our example, the dependent variable is pass and the two independent variables are hours and gender. Continuous independent variables are simply entered "as is", whilst categorical independent variables take the prefix "i".
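The reason categorical variables need special treatment is that the regression works on numbers, so each category other than a reference (base) category is represented by its own 0/1 indicator column. The Python sketch below illustrates that idea only; it is not Stata code, the function name is hypothetical, and the gender codes 0 and 1 are assumed for illustration.

```python
def indicator_columns(values):
    """Build 0/1 indicator columns for every level except the base level.

    The lowest level (after sorting) is treated as the base category and
    gets no column of its own.
    """
    levels = sorted(set(values))
    others = levels[1:]  # drop the base level
    return {level: [1 if v == level else 0 for v in values]
            for level in others}

# Assumed coding for illustration: 0 and 1 are the two gender categories.
gender = [0, 1, 1, 0, 1]
print(indicator_columns(gender))
```

With a two-category variable this produces a single column, which is why a dichotomous variable such as gender adds only one coefficient to the model.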
Therefore, enter the code, logistic pass hours i.
You will be presented with the logistic - Logistic regression, reporting odds ratios dialogue box. Select the dependent variable, pass, from the Dependent variable: box. You will be presented with a further dialogue box, and then with the Create varlist with factor or time-series variables dialogue box.
Leave Factor variable selected in the Type of variable area. Next, in the Add factor variable area, leave the existing option selected in the Specification: box. Now, select gender in the Variables dropdown box using the drop-down button.
Finally, click on the button. You will be presented with the following dialogue box where the categorical independent variable, i. You will be returned to the logistic - Logistic regression, reporting odds ratios dialogue box, but with the categorical independent variable, i. This will generate the output.
The output below is only a fraction of the options that you have in Stata to analyse your data, assuming that your data passed all the assumptions. However, the following output will present the results needed to ascertain whether the independent variables statistically significantly predict the passing of a final year exam. The results are presented under the "Logistic Regression" header, as shown below: