In recent columns we showed how linear regression can be used to predict a continuous dependent variable given other independent variables1,2. When the dependent variable is categorical, a common approach is to use logistic regression, a method that takes its name from the type of curve it uses to fit data. Categorical variables are commonly used in biomedical data to encode a set of discrete states, such as whether a drug was administered or whether a patient has survived. Categorical variables may have more than two values, which may have an implicit order, such as whether a patient never, occasionally or frequently smokes. In addition to predicting the value of a variable (e.g., a patient will survive), logistic regression can also predict the associated probability (e.g., the patient has a 75% chance of survival).

There are many reasons to assess the probability of a state of a categorical variable, and a common application is classification: predicting the class of a new data point. Many methods are available, but regression has the advantage of being relatively simple to perform and interpret. First a training set is used to develop a prediction equation, and then the predicted membership probability is thresholded to classify new observations, with each point assigned to the most probable class. If the costs of misclassification differ between the two classes, alternative thresholds may be chosen to minimize misclassification costs estimated from the training sample (Fig. 1). For example, in the diagnosis of a deadly but readily treated disease, it is less costly to falsely assign a patient to the treatment group than to the no-treatment group.
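A minimal sketch of this cost-based thresholding is shown below; the predicted probabilities, disease labels and the 5:1 cost ratio are all made up for illustration.

```python
import numpy as np

# Hypothetical predicted probabilities of disease and true status (1 = diseased).
p_hat  = np.array([0.05, 0.20, 0.30, 0.45, 0.55, 0.60, 0.80, 0.95])
y_true = np.array([0,    0,    1,    0,    1,    1,    0,    1])

# Assumed costs: missing a diseased patient (false negative) is taken to be
# five times as costly as treating a healthy one (false positive).
cost_fn, cost_fp = 5.0, 1.0

thresholds = np.linspace(0.05, 0.95, 19)
total_cost = []
for t in thresholds:
    y_pred = (p_hat >= t).astype(int)
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    total_cost.append(cost_fp * fp + cost_fn * fn)

print(f"cost-minimizing threshold: {thresholds[np.argmin(total_cost)]:.2f}")
```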

Figure 1: Classification of data requires thresholding, which defines probability intervals for each class.

Shown are observations of a categorical variable positioned using the predicted probability of being in one of two classes, encoded by open and solid circles, respectively. Top row: when class membership is perfectly separable, a threshold (e.g., 0.5) can be chosen to make classification perfectly accurate. Bottom row: when separation between classes is ambiguous, as shown here with the same predictor values as for the row above, perfect classification accuracy with a single-value threshold is not possible. The threshold is tuned to control false positives (e.g., 0.75) or false negatives (e.g., 0.25).

In our example of simple linear regression1, we saw how one continuous variable (weight) could be predicted on the basis of another continuous variable (height). To illustrate classification, here we extend that example to use height to predict the probability that an individual plays professional basketball. Let us assume that professional basketball players have a mean height of 200 cm and that those who do not play professionally have a mean height of 170 cm, with both populations being normal and having an s.d. of 15 cm. First, we create a training data set by randomly sampling the heights of 5 individuals who play professional basketball and 15 who do not (Fig. 2a). We then assign categorical classifications of 1 (plays professional basketball) and 0 (does not play professional basketball). For simplicity, our example is limited to two classes, but more are possible.
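A simulation of such a training set might look like the sketch below (Python with NumPy is assumed here and in the later sketches; the seed and the names H and y are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(1)

# Population parameters from the text: players ~ N(200, 15) cm, non-players ~ N(170, 15) cm.
height_pro = rng.normal(loc=200, scale=15, size=5)    # 5 professional players
height_non = rng.normal(loc=170, scale=15, size=15)   # 15 who do not play professionally

H = np.concatenate([height_pro, height_non])          # predictor: height (cm)
y = np.concatenate([np.ones(5), np.zeros(15)])        # 1 = plays professionally, 0 = does not
```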

Figure 2: Robustness of classification to outliers depends on the type of regression used to establish thresholds.

(a) The effect of outliers on classification based on linear regression. The plot shows classification using linear regression fit (solid black line) to the training set of those who play professional basketball (solid circles; classification of 1) and those who do not (open circles; classification of 0). When a probability cutoff of 0.5 is used (horizontal dotted line), the fit yields a threshold of 192 cm (dashed black line) as well as one false negative (FN) and one false positive (FP). Including the outlier at H = 100 cm (orange circle) in the fit (solid orange line) increases the threshold to 197 cm (dashed orange line). (b) The effect of outliers on classification based on step and logistic regression. Regression using step and logistic models yields thresholds of 185 cm (solid vertical blue line) and 194 cm (dashed blue line), respectively. The outlier from a does not substantially affect either fit.

Let us first approach this classification using linear regression, which minimizes the sum of squared residuals1, and fit a line to the data (Fig. 2a). Each data point has one of two distinct y-values (0 and 1), which correspond to the probability of playing professional basketball, and the fit represents the predicted probability as a function of height, increasing from 0 at 159 cm to 1 at 225 cm. The fit line is truncated outside the [0, 1] range because its values there cannot be interpreted as probabilities. Using a probability threshold of 0.5 for classification, we find that 192 cm should be the decision boundary for predicting whether an individual plays professional basketball. This boundary gives reasonable classification performance: only one point is misclassified as a false positive and one as a false negative (Fig. 2a).
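The calculation can be sketched as follows. The heights are a fixed hypothetical sample (used in place of the random draw above so that the numbers are reproducible), so the boundary and error counts only roughly match those quoted for Figure 2a.

```python
import numpy as np

# Hypothetical training heights (cm): the first 5 play professionally (y = 1), the rest do not (y = 0).
H = np.array([178, 193, 198, 205, 212,
              152, 158, 161, 163, 166, 168, 170, 172, 174, 176, 179, 182, 185, 188, 196], dtype=float)
y = np.concatenate([np.ones(5), np.zeros(15)])

# Ordinary least-squares fit of the 0/1 labels on height.
slope, intercept = np.polyfit(H, y, deg=1)

# Height at which the fitted line crosses the 0.5 probability cutoff.
threshold = (0.5 - intercept) / slope
print(f"decision boundary: {threshold:.0f} cm")

# Classify the training set with this boundary and count the errors.
pred = (H >= threshold).astype(int)
print(f"false positives: {np.sum((pred == 1) & (y == 0))}, "
      f"false negatives: {np.sum((pred == 0) & (y == 1))}")
```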

Unfortunately, our linear regression fit is not robust. Consider a child of height H = 100 cm who does not play professional basketball (Fig. 2a). This height is below the threshold of 192 cm and would be classified correctly. However, if this data point is part of the training set, it will greatly influence the fit3 and increase the classification threshold to 197 cm, which would result in an additional false negative.
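Refitting with such an outlier added shows the shift directly; the exact numbers depend on the sample (here the same hypothetical heights as above), so they will not reproduce the 197 cm of Figure 2a exactly.

```python
import numpy as np

H = np.array([178, 193, 198, 205, 212,
              152, 158, 161, 163, 166, 168, 170, 172, 174, 176, 179, 182, 185, 188, 196], dtype=float)
y = np.concatenate([np.ones(5), np.zeros(15)])

def linear_threshold(H, y):
    """Height at which a least-squares fit of the 0/1 labels crosses 0.5."""
    slope, intercept = np.polyfit(H, y, deg=1)
    return (0.5 - intercept) / slope

# Add a hypothetical 100 cm child who does not play professionally.
H_out = np.append(H, 100.0)
y_out = np.append(y, 0.0)

print(f"threshold without outlier: {linear_threshold(H, y):.0f} cm")
print(f"threshold with outlier:    {linear_threshold(H_out, y_out):.0f} cm")
```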

To improve the robustness and general performance of this classifier, we could fit the data to a curve other than a straight line. One very simple option is the step function (Fig. 2b), which is 1 when the predictor exceeds a certain value and 0 otherwise. An advantage of the step function is that it defines a decision boundary (185 cm) that is not affected by the outlier (H = 100 cm), but it cannot provide class probabilities other than 0 and 1. This turns out to be sufficient for the purpose of classification; many classification algorithms do not provide probabilities. However, the step function also does not differentiate between the more extreme observations, which are far from the decision boundary and more likely to be correctly assigned, and those near the decision boundary, for which membership in either group is plausible. In addition, the step function is not differentiable at the step, and regression generally requires a function that is differentiable everywhere. To mitigate this issue, smooth sigmoid curves are used. One commonly used in the natural sciences is the logistic curve (Fig. 2b), which readily relates to the odds of class membership.
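As a sketch, a step-function classifier is just a hard cutoff; the 185 cm boundary below is the value shown in Figure 2b (in practice it would itself be estimated from the training data). Its output is only ever 0 or 1, and its derivative is zero everywhere except at the step, where it is undefined.

```python
import numpy as np

def step_classifier(heights, boundary=185.0):
    """Hard 0/1 prediction: no intermediate class probabilities are possible."""
    return (np.asarray(heights, dtype=float) >= boundary).astype(int)

# A point 40 cm above the boundary gets the same output as one 1 cm above it.
print(step_classifier([160.0, 184.0, 186.0, 225.0]))   # -> [0 0 1 1]
```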

If p is the probability that a person plays professional basketball, then the odds of playing are p/(1 − p), the ratio of the probability of playing to the probability of not playing. The log odds is the logarithmic transform of this quantity, ln(p/(1 − p)). Logistic regression models the log odds as a linear combination of the independent variables. For our example, height (H) is the independent variable, the logistic fit parameters are β0 (intercept) and βH (slope), and the equation that relates them is ln(p/(1 − p)) = β0 + βHH. In general, there may be any number of predictor variables and associated regression parameters (or slopes). Modeling the log odds allows us to estimate the probability of class membership using a linear relationship, similar to linear regression. The log odds can be transformed back to a probability as p(t) = 1/(1 + exp(−t)), where t = β0 + βHH. This is an S-shaped (sigmoid) curve whose steepness is controlled by βH; it maps the linear predictor back to a probability in [0, 1].
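A sketch of the two transforms follows; the parameter values b0 = −68 and bH = 0.36 are made up to illustrate the shape of the relationship and are not the fitted values behind Figure 2b.

```python
import numpy as np

def log_odds(p):
    """ln(p / (1 - p)): maps a probability in (0, 1) to the whole real line."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))

def logistic(t):
    """1 / (1 + exp(-t)): maps any real value back to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(t, dtype=float)))

print(f"log odds of p = 0.75: {log_odds(0.75):+.2f}")

# Illustrative (not fitted) parameters for ln(p / (1 - p)) = b0 + bH * H.
b0, bH = -68.0, 0.36
for H in (170.0, 185.0, 200.0):
    t = b0 + bH * H
    print(f"H = {H:.0f} cm  log odds = {t:+.2f}  p = {logistic(t):.3f}")
```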

As in linear regression, we need to estimate the regression parameters. These estimates are denoted by b0 and bH to distinguish them from the true but unknown intercept β0 and slope βH. Unlike linear regression1, which yields an exact analytical solution for the estimated regression coefficients, logistic regression requires numerical optimization, such as the iterative approach shown in Figure 3a, to find the optimal estimates. For our example, this corresponds to finding the maximum-likelihood estimates: the pair of values b0 and bH that maximizes the likelihood of the observed data (or, equivalently, minimizes the negative log likelihood). Once these estimates are found, we can calculate the membership probability, which is a function of these estimates as well as of our predictor H.
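A minimal sketch of this fit on the fixed hypothetical sample used above; SciPy's general-purpose minimizer stands in here for the iterative scheme of Figure 3a.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical training heights (cm): the first 5 play professionally (y = 1), the rest do not (y = 0).
H = np.array([178, 193, 198, 205, 212,
              152, 158, 161, 163, 166, 168, 170, 172, 174, 176, 179, 182, 185, 188, 196], dtype=float)
y = np.concatenate([np.ones(5), np.zeros(15)])

def neg_log_likelihood(params, H, y):
    b0, bH = params
    t = np.clip(b0 + bH * H, -500.0, 500.0)        # avoid overflow in exp during wild search steps
    p = 1.0 / (1.0 + np.exp(-t))
    eps = 1e-12                                    # keep log away from exactly zero
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Numerical minimization of the negative log likelihood; (0, 0) is an arbitrary starting point.
fit = minimize(neg_log_likelihood, x0=[0.0, 0.0], args=(H, y), method="BFGS")
b0, bH = fit.x
print(f"b0 = {b0:.1f}, bH = {bH:.2f}, decision boundary = {-b0 / bH:.0f} cm")

# Predicted probability that a new 190 cm individual plays professionally.
print(f"p(190 cm) = {1.0 / (1.0 + np.exp(-(b0 + bH * 190.0))):.2f}")
```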

Figure 3: Optimal estimates in logistic regression are found iteratively via minimization of the negative log likelihood.

The slope parameter for each logistic curve (upper plot) is indicated by a correspondingly colored point in the lower plot, shown with its associated negative log likelihood. (a) A non-separable data set with different logistic curves using a single slope parameter. A minimum is found for the ideal curve (blue). (b) A perfectly separable data set for which no minimum exists. Attempts at a solution create increasingly steep curves: the negative log likelihood asymptotically decreases toward zero, and the estimated slope tends toward infinity.

In most cases, the maximum-likelihood estimates are unique and optimal. However, when the classes are perfectly separable, this iterative approach fails: there are infinitely many parameter values that predict class membership for the training set perfectly, and the likelihood can always be improved by making the curve steeper, so the estimated slope diverges (Fig. 3b). Here, we cannot estimate the regression parameters or assign a probability of class membership.
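The failure can be seen directly from the negative log likelihood. For a perfectly separable (hypothetical) sample, holding the decision boundary fixed and steepening the curve keeps lowering the negative log likelihood toward zero, so no finite optimum exists.

```python
import numpy as np

# Perfectly separable hypothetical sample: every player is taller than every non-player.
H = np.array([150.0, 160.0, 170.0, 180.0, 200.0, 210.0, 220.0])
y = np.array([  0.0,   0.0,   0.0,   0.0,   1.0,   1.0,   1.0])

def neg_log_likelihood(b0, bH, H, y):
    p = 1.0 / (1.0 + np.exp(-(b0 + bH * H)))
    eps = 1e-300                                   # keeps log away from exactly zero
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Hold the decision boundary at 190 cm (b0 = -bH * 190) and steepen the curve:
# the negative log likelihood only shrinks, so the estimated slope diverges.
for bH in (0.1, 0.5, 1.0, 2.0, 3.0):
    nll = neg_log_likelihood(-bH * 190.0, bH, H, y)
    print(f"bH = {bH:4.1f}  negative log likelihood = {nll:.2e}")
```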

The interpretation of logistic regression shares some similarities with that of linear regression; for instance, variables given the greatest importance may be reliable predictors but might not actually be causal. Logistic regression parameters can be used to understand the relative predictive power of different variables, assuming that the variables have already been normalized to have a mean of 0 and variance of 1. It is important to understand the effect that a change to an independent variable will have on the results of a regression. In linear regression the coefficients have an additive effect on the predicted value, which increases by βi when the ith independent variable increases by one unit. In logistic regression the coefficients have an additive effect on the log odds rather than on the predicted probability: a one-unit increase in the ith variable adds βi to the log odds and therefore multiplies the odds by exp(βi).
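A quick numerical illustration with the same made-up coefficients as before: each additional centimeter adds bH to the log odds and multiplies the odds by exp(bH), but the resulting change in probability depends on where on the curve you start.

```python
import numpy as np

# Hypothetical (not fitted) coefficients for ln(p / (1 - p)) = b0 + bH * H.
b0, bH = -68.0, 0.36

def prob(H):
    return 1.0 / (1.0 + np.exp(-(b0 + bH * H)))

print(f"odds multiplier per additional cm: {np.exp(bH):.2f}")

# The same +1 cm changes the probability very differently at different heights.
for H in (170.0, 188.0, 200.0):
    print(f"H = {H:.0f} -> {H + 1:.0f} cm: p goes from {prob(H):.3f} to {prob(H + 1):.3f}")
```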

As in linear regression, correlation among predictors poses a challenge when fitting a logistic regression. For instance, if we are fitting a logistic regression for professional basketball using height and weight, we must be aware that these variables are highly positively correlated: either one already gives insight into the value of the other. If two variables are perfectly correlated, there are multiple sets of coefficients that give exactly the same fit. Correlated features also make interpretation of coefficients much more difficult. Discussion of the quality of the fit of the logistic model and of classification accuracy will be left to a later column.
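The non-uniqueness under perfect correlation can be sketched directly: if a second predictor is just a rescaling of height (height in inches, say), any two coefficient sets with the same combined effect b1 + b2/2.54 give the same fitted probabilities (up to rounding). The heights and coefficient values below are again made up.

```python
import numpy as np

H_cm = np.array([165.0, 178.0, 190.0, 202.0])   # hypothetical heights
H_in = H_cm / 2.54                              # a perfectly correlated second predictor

def prob(b0, b1, b2):
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * H_cm + b2 * H_in)))

# Two different coefficient sets with the same combined effect (b1 + b2 / 2.54 = 0.36)
# produce the same fitted probabilities, so the individual coefficients are not identifiable.
print(prob(-68.0, 0.36, 0.00))
print(prob(-68.0, 0.10, 0.26 * 2.54))
```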

Logistic regression is a powerful tool for predicting class probabilities and for classification using predictor variables. For example, one can model the lethality of a new drug protocol in mice by predicting the probability of survival or, with an appropriate probability threshold, by classifying on the basis of survival outcome. Multiple factors of an experiment can be included, such as dosing information, animal weight and diet data, but care must be taken in interpretation to account for possible correlation among these predictors.