Logistic regression
Logistic regression is a type of statistical model that is used to predict the probability of a binary outcome. In other words, it is used to classify items into one of two classes based on some input features. For example, a logistic regression model could be used to predict whether or not a person has a certain disease based on their medical history and other factors.
The logistic regression model takes a number of input features, also known as independent variables, and uses them to predict a binary outcome, also known as the dependent variable. It does this with the logistic function, also called the sigmoid function: an S-shaped curve that maps any real number to a value between 0 and 1, which can therefore be interpreted as the probability that an item belongs to a certain class.
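As a minimal sketch of this mapping (the input values below are arbitrary examples), the logistic function can be written in a couple of lines of NumPy:

import numpy as np

def sigmoid(z):
    # maps any real number to a probability in (0, 1)
    return 1 / (1 + np.exp(-z))

print(sigmoid(np.array([-4.0, 0.0, 4.0])))  # roughly [0.018, 0.5, 0.982]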
The logistic regression model is trained on a dataset that includes both the input features and the binary outcome for each item. The model uses this training data to learn the relationship between the input features and the binary outcome, and it uses this information to make predictions on new items.
To make a prediction, the logistic regression model first calculates a weighted sum of the input features, that is, a linear combination of the input features and their corresponding weights. The model then applies the logistic function to this weighted sum to obtain the probability that the item belongs to the positive class. If this probability is above 0.5, the item is assigned to the positive class; otherwise it is assigned to the negative class.
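For example, here is that two-step computation with hypothetical weights (the feature values and weights below are made up purely for illustration):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([1.0, 2.5, -1.2])      # intercept term followed by two input features
theta = np.array([0.5, 1.1, -0.8])  # hypothetical learned weights (intercept first)
z = np.dot(x, theta)                # weighted sum of the input features
p = sigmoid(z)                      # predicted probability of the positive class
prediction = int(p >= 0.5)          # threshold at 0.5 to choose a class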
Overall, logistic regression is a useful tool for making predictions in a binary classification problem, and it is often used in a variety of applications such as medical diagnosis, credit scoring, and market research.
Optimization equation
In logistic regression, the goal is to find the values of the model's parameters that best fit the data. This is done by optimizing an objective function: a mathematical expression whose optimal value identifies the parameter values that minimize some measure of error.
The objective used in logistic regression is the likelihood function, and the fitting procedure is known as maximum likelihood estimation: a method of estimating the parameters of a statistical model by choosing the values under which the observed data are most probable.
In the case of logistic regression, maximum likelihood estimation finds the parameter values that make the observed classes of the training items as probable as possible under the model. This is done by calculating, for each item in the training data, the probability the model assigns to that item's observed class, and then multiplying these probabilities together to get the overall likelihood of the dataset. In practice, the logarithm of this product, the log-likelihood, is maximized instead, because a sum of logs is numerically far more stable than a product of many small probabilities. The parameters of the model are then adjusted to maximize this quantity.
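As a small sketch of this computation (the probabilities and labels below are made-up examples), the log-likelihood of a batch of predictions can be computed as:

import numpy as np

h = np.array([0.9, 0.2, 0.7])  # predicted probabilities for three hypothetical items
y = np.array([1, 0, 1])        # observed binary outcomes
# per-item likelihood: h where y == 1, (1 - h) where y == 0
likelihoods = np.where(y == 1, h, 1 - h)
# the log turns the product of likelihoods into a sum, which is far more
# numerically stable than multiplying many small probabilities
log_likelihood = np.sum(np.log(likelihoods))
print(log_likelihood)  # maximizing this quantity is the training objective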
To maximize the log-likelihood, logistic regression typically uses an iterative optimization algorithm such as gradient descent or Newton's method. These algorithms repeatedly adjust the values of the model's parameters until they converge to the values that maximize the likelihood of the observed data.
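A minimal gradient descent sketch for this problem, on a tiny made-up dataset (the learning rate and iteration count are illustrative choices), looks like this:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# toy data: four items, an intercept column plus one feature
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0, 0, 1, 1])

theta = np.zeros(X.shape[1])  # parameter initialization
lr = 0.1                      # learning rate (illustrative)
for _ in range(1000):
    h = sigmoid(X @ theta)    # current predicted probabilities
    # gradient of the average negative log-likelihood
    gradient = X.T @ (h - y) / y.size
    theta -= lr * gradient    # step against the gradient
print(theta)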
In summary, the optimization problem in logistic regression is to find the parameter values that best fit the data; this is solved by maximizing the (log-)likelihood of the training data with an iterative optimization algorithm.
Bias-variance tradeoff
The bias-variance tradeoff is a fundamental concept in machine learning that refers to the tradeoff between the model’s ability to accurately capture the underlying relationship in the data (low bias) and its ability to generalize to new data (low variance).
In logistic regression, the bias-variance tradeoff can be understood in terms of the model’s complexity. A simpler model with fewer parameters will have lower variance, but it may also have higher bias and be less able to accurately capture the underlying relationship in the data. On the other hand, a more complex model with more parameters will have lower bias, but it may also have higher variance and be less able to generalize to new data.
The goal in logistic regression is to find a model that strikes the right balance between bias and variance, and this is typically done through regularization. Regularization is a technique that is used to reduce the complexity of the model by adding a penalty term to the optimization equation. This penalty term reduces the magnitude of the model’s parameters, which in turn reduces the model’s variance and helps to prevent overfitting.
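As a sketch of how the penalty term changes the gradient step (assuming L2 regularization; the name lam and its value are illustrative), the update from the previous section becomes:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])  # toy data with intercept column
y = np.array([0, 0, 1, 1])

theta = np.zeros(X.shape[1])
lr, lam = 0.1, 0.5  # learning rate and regularization strength (illustrative)
for _ in range(1000):
    h = sigmoid(X @ theta)
    # the L2 penalty adds lam * theta to the gradient, shrinking the weights
    # toward zero; the intercept is conventionally left unpenalized
    penalty = lam * theta
    penalty[0] = 0.0
    gradient = (X.T @ (h - y) + penalty) / y.size
    theta -= lr * gradient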
Overall, the bias-variance tradeoff in logistic regression refers to the tradeoff between the model’s ability to accurately capture the underlying relationship in the data and its ability to generalize to new data. This tradeoff is managed through regularization, which is used to find a model that strikes the right balance between bias and variance.
Applications
Logistic regression is a powerful and widely used statistical model that is used for a variety of applications in fields such as medicine, finance, and social science. Some examples of the applications of logistic regression include:
- Medical diagnosis: Logistic regression can be used to predict the probability that a patient has a certain disease based on their medical history and other factors. This can help doctors to make more accurate diagnoses and to recommend the most appropriate treatment.
- Credit scoring: Logistic regression can be used to predict the likelihood that a person will default on a loan based on their credit history and other factors. This can help lenders to make more informed decisions about whether to approve a loan.
- Market research: Logistic regression can be used to predict the likelihood that a person will purchase a product or service based on their demographics and other factors. This can help companies to identify potential customers and to tailor their marketing efforts.
Overall, logistic regression is a versatile and widely used tool that can be applied to many different types of problems where the goal is to predict a binary outcome. Because it assumes that the log-odds of the outcome are a linear function of the input features, it works best when the classes are roughly linearly separable in the feature space, and it is often used as a baseline and as a building block for more complex machine learning models.
Python Implementation
import numpy as np

class LogisticRegression:
    def __init__(self, lr=0.01, num_iter=100000, fit_intercept=True, verbose=False):
        self.lr = lr                        # learning rate for gradient descent
        self.num_iter = num_iter            # number of gradient descent iterations
        self.fit_intercept = fit_intercept  # whether to add a bias (intercept) column
        self.verbose = verbose              # periodically print the training loss

    def __add_intercept(self, X):
        # prepend a column of ones so the first weight acts as the intercept
        intercept = np.ones((X.shape[0], 1))
        return np.concatenate((intercept, X), axis=1)

    def __sigmoid(self, z):
        # logistic function: maps any real number to (0, 1)
        return 1 / (1 + np.exp(-z))

    def __loss(self, h, y):
        # average binary cross-entropy (the negative log-likelihood);
        # clip h to avoid log(0) for saturated predictions
        h = np.clip(h, 1e-15, 1 - 1e-15)
        return (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()

    def fit(self, X, y):
        if self.fit_intercept:
            X = self.__add_intercept(X)

        # weights initialization
        self.theta = np.zeros(X.shape[1])

        for i in range(self.num_iter):
            z = np.dot(X, self.theta)
            h = self.__sigmoid(z)
            # gradient of the average cross-entropy loss
            gradient = np.dot(X.T, (h - y)) / y.size
            self.theta -= self.lr * gradient

            if self.verbose and i % 10000 == 0:
                z = np.dot(X, self.theta)
                h = self.__sigmoid(z)
                print(f'loss: {self.__loss(h, y)}')

    def predict_prob(self, X):
        if self.fit_intercept:
            X = self.__add_intercept(X)
        return self.__sigmoid(np.dot(X, self.theta))

    def predict(self, X):
        # threshold the predicted probability at 0.5
        return self.predict_prob(X).round()
Example:
from sklearn import datasets
from sklearn.model_selection import train_test_split
# load the iris dataset; this implementation is binary, so binarize the
# target to predict whether or not a flower is Iris virginica (class 2)
iris = datasets.load_iris()
X = iris.data
y = (iris.target == 2).astype(int)
# split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# create an instance of the LogisticRegression class
model = LogisticRegression(lr=0.1, num_iter=300000, verbose=True)
# train the model on the training data
model.fit(X_train, y_train)
# make predictions on the test data
preds = model.predict(X_test)
# evaluate the model's performance
accuracy = np.mean(preds == y_test)
print(f'accuracy: {accuracy}')
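As a quick optional sanity check (not part of the example above), the same split can be passed to scikit-learn's built-in implementation and the accuracies compared:

from sklearn.linear_model import LogisticRegression as SklearnLogisticRegression

sk_model = SklearnLogisticRegression(max_iter=1000)
sk_model.fit(X_train, y_train)
print(f'sklearn accuracy: {sk_model.score(X_test, y_test)}')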
Best and worst case scenarios
The performance of a logistic regression model can vary depending on a number of factors, including the quality and quantity of the training data, the choice of regularization parameter, and the complexity of the model. In general, a logistic regression model will perform well in the following scenarios:
- The training data is sufficiently large and representative of the population. A large and representative dataset will provide the model with enough information to learn the relationship between the input features and the binary outcome, and this will enable the model to make accurate predictions on new data.
- The regularization parameter is carefully chosen. The regularization parameter controls the complexity of the model, and a well-chosen value strikes the right balance between bias and variance, helping to prevent overfitting and improving the model's generalization ability.
- The model is not too complex. A model that is too complex may have high variance and be prone to overfitting, which will reduce its ability to make accurate predictions on new data. A simpler model with fewer parameters may be more likely to generalize well.
On the other hand, a logistic regression model may perform poorly in the following scenarios:
- The training data is small or not representative of the population. A small or unrepresentative dataset will not provide the model with enough information to learn the relationship between the input features and the binary outcome, and this will limit the model’s ability to make accurate predictions.
- The regularization parameter is not carefully chosen. Choosing the wrong value for the regularization parameter can lead to underfitting or overfitting, which will reduce the model’s performance.
- The model is too complex. A model with too many parameters may have high variance and be prone to overfitting, which will reduce its ability to generalize to new data.
Overall, a logistic regression model is likely to perform well in scenarios where the training data is large and representative, the regularization parameter is carefully chosen, and the model is not too complex. It may perform poorly in scenarios where the training data is small or unrepresentative, the regularization parameter is not carefully chosen, or the model is too complex.