Logistic Regression

Key takeaways

  1. Logistic regression is commonly used for binary classification problems, where it predicts the probability of an input belonging to a specific class.

  2. Logistic regression uses the sigmoid function (or logistic function) to map the linear combination of features into a probability value.

  3. The model is trained using maximum likelihood estimation, where the goal is to maximize the likelihood function to fit the model parameters.

  4. Logistic regression assumes a linear relationship between the features and the output and assumes independence among the features.

  5. Model parameters can be estimated using optimization algorithms like gradient descent, which minimizes a loss function (typically the log loss).

  6. Logistic regression models can output class labels (based on a probability threshold), predicted probabilities, or decision boundaries.

  7. Feature engineering plays a crucial role in logistic regression, including feature selection, feature scaling, and feature interactions.

  8. Polynomial logistic regression can handle non-linear relationships by introducing polynomial features to increase the model's expressive power.

  9. Evaluation metrics for logistic regression include accuracy, precision, recall, and F1 score, among others, depending on the specific problem and requirements.

  10. Logistic regression is a simple and efficient classification algorithm that performs well on linearly separable problems and finds applications in various domains such as healthcare, finance, natural language processing, and more.

Interview Questions

  1. What is the difference between logistic regression and linear regression?

  2. How does logistic regression handle multi-class classification problems?

  3. What are the assumptions of logistic regression?

  4. How does logistic regression handle missing values?

  5. How is multicollinearity treated in logistic regression?

  6. What is the loss function used in logistic regression? How is it minimized?

  7. How does logistic regression handle imbalanced datasets?

  8. How do you choose the regularization parameter in logistic regression?

  9. Can logistic regression handle non-linear relationships? If so, how?

  10. What are the differences between logistic regression and support vector machines (SVM)?

  11. What are the differences between logistic regression and decision trees?

  12. What are the advantages and limitations of logistic regression?

  13. What do the weights and biases represent in logistic regression?

  14. How does logistic regression handle outliers?

  15. What methods can be used for feature selection in logistic regression?

  16. How do you evaluate the performance of a logistic regression model?

  17. How does logistic regression handle high-dimensional datasets?

  18. Is logistic regression sensitive to outliers?

  19. In which domains is logistic regression commonly applied?

  20. How do you deal with multicollinearity in logistic regression?

Solutions

What is the difference between logistic regression and linear regression?

How does logistic regression handle multi-class classification problems?

What are the assumptions of logistic regression?

How does logistic regression handle missing values?

How is multicollinearity treated in logistic regression?

What is the loss function used in logistic regression? How is it minimized?

How does logistic regression handle imbalanced datasets?

How do you choose the regularization parameter in logistic regression?

Can logistic regression handle non-linear relationships? If so, how?

What are the differences between logistic regression and support vector machines (SVM)?

What are the differences between logistic regression and decision trees?

What are the advantages and limitations of logistic regression?

What do the weights and biases represent in logistic regression?

How does logistic regression handle outliers?

What methods can be used for feature selection in logistic regression?

How do you evaluate the performance of a logistic regression model?

How does logistic regression handle high-dimensional datasets?

Is logistic regression sensitive to outliers?

In which domains is logistic regression commonly applied?

How do you deal with multicollinearity in logistic regression?

Python Application

Using Sklearn

In this example, we first import the necessary modules: load_iris to load the Iris dataset, LogisticRegression for logistic regression implementation, train_test_split for splitting the data into training and testing sets, and accuracy_score for evaluating the accuracy of the model.

We then load the Iris dataset using load_iris() and assign the feature matrix to X and the target variable to y. Next, we split the data into training and testing sets using train_test_split.

After that, we create an instance of the LogisticRegression model using LogisticRegression(). We train the model on the training data using the fit method. Then, we use the trained model to make predictions on the testing data with the predict method.

Finally, we evaluate the accuracy of the model by comparing the predicted labels (y_pred) with the actual labels (y_test) using the accuracy_score function. The accuracy of the model is printed to the console.
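
Since the original snippet is not shown here, the following is a minimal sketch matching the description above; test_size, random_state, and max_iter are assumed values:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset as a feature matrix X and target vector y.
X, y = load_iris(return_X_y=True)

# Split the data into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Create and train the logistic regression model.
model = LogisticRegression(max_iter=200)  # max_iter raised to ensure convergence
model.fit(X_train, y_train)

# Predict on the test set and evaluate accuracy.
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```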

From Scratch

Notations

We are going to do binary classification, so the value of y (true/target) is going to be either 0 or 1.

For example, suppose we have a breast cancer dataset where X is the tumor size and y is whether the lump is malignant (cancerous) or benign (non-cancerous). Whenever a patient visits, your job is to tell them whether the lump is malignant (y=0) or benign (y=1) given the size of the tumor. There are only two classes in this case.

So, y is going to be either 0 or 1.

Let’s use the following randomly generated data as a motivating example to understand Logistic Regression.

[Figure: scatter plot of the randomly generated data, with two features and two classes (blue and green)]

There are 2 features, n=2. There are 2 classes, blue and green.

For a binary classification problem, we naturally want our hypothesis (y_hat) to output values between 0 and 1, that is, real numbers in the interval from 0 to 1.

So, we want to choose a function that squishes all its inputs between 0 and 1. One such function is the Sigmoid or Logistic function.

Sigmoid or Logistic function

The Sigmoid Function squishes all its inputs (values on the x-axis) between 0 and 1 as we can see on the y-axis in the graph below.

[Figure: graph of the sigmoid function]

The function accepts any real number as input, and its output always lies between 0 and 1.

sigmoid(z) = 1 / (1 + e^(-z))

We can see that as z increases towards positive infinity the output gets closer to 1, and as z decreases towards negative infinity the output gets closer to 0.
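
In code, this is a one-liner with NumPy; the from-scratch functions later in this section assume this helper:

```python
import numpy as np

def sigmoid(z):
    # Squish any real-valued input into the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))
```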

Hypothesis

For Linear Regression, we had the hypothesis y_hat = w.X + b, whose output range was the set of all real numbers.

Now, for Logistic Regression, our hypothesis is y_hat = sigmoid(w.X + b), whose output range is between 0 and 1, because applying the sigmoid always produces a number between 0 and 1.

y_hat = sigmoid(z) = 1 / (1 + e^(-z))

z = w.X + b

Now, you might wonder: there are lots of continuous functions that output values between 0 and 1, so why choose the Logistic function and not some other? In fact, there is a broader class of algorithms called Generalized Linear Models, of which this is a special case; given its set of assumptions, the sigmoid function falls out of it very naturally.

Loss/Cost function

For every parametric machine learning algorithm, we need a loss function, which we want to minimize (find the global minimum of) to determine the optimal parameters (w and b) that will help us make the best predictions.

For Linear Regression, we had the mean squared error as the loss function. But that was a regression problem.

For a binary classification problem, we need to be able to output the probability of y being 1 (the tumor is benign, for example); from that we can determine the probability of y being 0 (the tumor is malignant), or vice versa.

So, we interpret the value our hypothesis (y_hat) outputs, which lies between 0 and 1, as the probability of y being 1; the probability of y being 0 is then (1 - y_hat).

Remember that y is only 0 or 1. y_hat is a number between 0 and 1.

More formally, the probability of y=1 given X, parameterized by w and b, is y_hat (the hypothesis). Then, logically, the probability of y=0 given X, parameterized by w and b, should be 1 - y_hat. This can be written as:

P(y = 1 | X; w, b) = y_hat

P(y = 0 | X; w, b) = 1 - y_hat

Then, based on our assumptions, we can calculate the log-likelihood of the parameters using the above two equations and consequently determine the loss function we have to minimize. The following is the Binary Cross-Entropy Loss, or Log Loss, function:

J(w, b) = (1/m) * Σ L(y_hat_i, y_i)

L(y_hat, y) = -( y * log(y_hat) + (1 - y) * log(1 - y_hat) )

J(w, b) is the overall cost/loss over the training set and L is the cost for the i-th training example.

By looking at the loss function, we can see that the loss approaches 0 when we predict correctly, i.e., when y=0 and y_hat=0, or y=1 and y_hat=1, and the loss approaches infinity when we predict incorrectly, i.e., when y=0 but y_hat=1, or y=1 but y_hat=0.
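
A minimal implementation of this loss, assuming y and y_hat are NumPy arrays of the same shape; the small epsilon is an added guard against log(0), not part of the formula above:

```python
def loss(y, y_hat):
    # Binary cross-entropy (log loss), averaged over all examples.
    eps = 1e-9  # guard against log(0)
    return -np.mean(y * np.log(y_hat + eps)
                    + (1 - y) * np.log(1 - y_hat + eps))
```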

Gradient Descent

Now that we know our hypothesis function and the loss function, all we need to do is use the Gradient Descent algorithm to find the optimal values of our parameters, like this (lr is the learning rate):

w := w - lr*dw

b := b - lr*db

where dw is the partial derivative of the loss function with respect to w, and db is the partial derivative of the loss function with respect to b:

dw = (1/m) * X^T.(y_hat - y)

db = (1/m) * sum(y_hat - y)

Let’s write a function gradients to calculate dw and db .

See comments(#).
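
A sketch of that function, assuming X has shape (m, n) and y, y_hat have shape (m, 1):

```python
def gradients(X, y, y_hat):
    # X:     (m, n) matrix of inputs
    # y:     (m, 1) vector of true labels
    # y_hat: (m, 1) vector of predicted probabilities
    m = X.shape[0]
    # Partial derivative of the loss with respect to the weights.
    dw = (1 / m) * np.dot(X.T, (y_hat - y))
    # Partial derivative of the loss with respect to the bias.
    db = (1 / m) * np.sum(y_hat - y)
    return dw, db
```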

Decision boundary

Now, we want to know how our hypothesis (y_hat) is going to make predictions of whether y=1 or y=0. As we defined it, the hypothesis is the probability of y being 1 given X, parameterized by w and b.

So, we will say that it will make a prediction of —

y=1 when y_hat ≥ 0.5

y=0 when y_hat < 0.5

Looking at the graph of the sigmoid function, we see that for —

y_hat ≥ 0.5, z or w.X + b ≥ 0

y_hat < 0.5, z or w.X + b < 0

which means, we make a prediction for —

y=1 when w.X + b ≥ 0

y=0 when w.X + b < 0

So, **w.X + b = 0** is going to be our Decision boundary.

The following code for plotting the Decision Boundary only works when we have only two features in X.
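
A sketch of such a plotting function using matplotlib; the boundary w1*x1 + w2*x2 + b = 0 is rearranged to solve for x2:

```python
import matplotlib.pyplot as plt

def plot_decision_boundary(X, y, w, b):
    # Only works when X has exactly two features:
    # w1*x1 + w2*x2 + b = 0  =>  x2 = -(w1*x1 + b) / w2
    w1, w2 = w.ravel()
    x1 = np.array([X[:, 0].min() - 1, X[:, 0].max() + 1])
    x2 = -(w1 * x1 + b) / w2
    plt.scatter(X[:, 0], X[:, 1], c=y.ravel(), cmap='winter')
    plt.plot(x1, x2, 'r')  # the decision boundary
    plt.xlabel('x1')
    plt.ylabel('x2')
    plt.show()
```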

Normalize Function

Function to normalize the inputs. See comments(#).
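
A minimal sketch; standardizing each feature to zero mean and unit variance is one common choice:

```python
def normalize(X):
    # Scale each feature (column) to zero mean and unit variance so that
    # gradient descent converges without overflow (see the "Important
    # Insights" section below).
    return (X - X.mean(axis=0)) / X.std(axis=0)
```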

Train Function

The train function includes initializing the weights and bias, and the training loop with mini-batch gradient descent.

See comments(#).
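
A sketch of the training loop, reusing sigmoid, gradients, and loss from above; the default batch size, epoch count, and learning rate are assumptions:

```python
def train(X, y, batch_size=100, epochs=1000, lr=0.01):
    m, n = X.shape
    # Initialize the weights and bias to zero.
    w = np.zeros((n, 1))
    b = 0.0
    y = y.reshape(m, 1)
    losses = []
    for epoch in range(epochs):
        # Mini-batch gradient descent over the training set.
        for start in range(0, m, batch_size):
            Xb = X[start:start + batch_size]
            yb = y[start:start + batch_size]
            # Forward pass on the mini-batch.
            y_hat = sigmoid(np.dot(Xb, w) + b)
            # Compute the gradients and take one descent step.
            dw, db = gradients(Xb, yb, y_hat)
            w -= lr * dw
            b -= lr * db
        # Track the loss over the full training set.
        losses.append(loss(y, sigmoid(np.dot(X, w) + b)))
    return w, b, losses
```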

Predict Function

See comments(#).
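
A sketch, applying the 0.5 threshold derived in the decision boundary section:

```python
def predict(X, w, b):
    # Predicted probability of y = 1 for each example.
    y_hat = sigmoid(np.dot(X, w) + b)
    # Threshold at 0.5: predict y = 1 when y_hat >= 0.5, else y = 0.
    return (y_hat >= 0.5).astype(int)
```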

Training and Plotting Decision Boundary
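
Putting the pieces together might look like the following; since the original data-generation code is not shown, sklearn's make_blobs is used here as a hypothetical stand-in for the randomly generated data:

```python
from sklearn.datasets import make_blobs

# Hypothetical stand-in for the two-feature, two-class data shown earlier.
X, y = make_blobs(n_samples=1000, centers=2, random_state=42)
X = normalize(X)

w, b, losses = train(X, y, batch_size=100, epochs=1000, lr=0.01)
plot_decision_boundary(X, y, w, b)
```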

[Figure: training data with the learned linear decision boundary]

Calculating Accuracy

We count how many examples we got right and divide by the total number of examples.
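
A minimal sketch:

```python
def accuracy(y, y_pred):
    # Fraction of examples classified correctly.
    return np.mean(y.ravel() == y_pred.ravel())

print(accuracy(y, predict(X, w, b)))
```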

We get an accuracy of 100%. We can see from the above decision boundary graph that we are able to separate the green and blue classes perfectly.

Testing on Non-linearly Separable Data

Let’s test out our code for data that is not linearly separable.
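
Again, the original data-generation code is not shown; sklearn's make_moons is a plausible stand-in for the plots below:

```python
from sklearn.datasets import make_moons

# Hypothetical stand-in for the non-linearly separable data.
X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)
X = normalize(X)

w, b, losses = train(X, y, batch_size=100, epochs=1000, lr=0.01)
plot_decision_boundary(X, y, w, b)
```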

[Figure: scatter plot of the non-linearly separable data]

[Figure: linear decision boundary fitted to the non-linearly separable data]

Since Logistic Regression is a linear classifier, the best we can do is fit a straight line that separates as many blues and greens from each other as possible.

Let’s check accuracy for this —
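
Reusing the accuracy function from above:

```python
print(accuracy(y, predict(X, w, b)))  # ~0.87 in the run described below
```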

87% accuracy. Not bad.

Important Insights

When I was training the model using my code, I kept getting NaN values in my losses list.

Later I discovered that I was not normalizing my inputs, and that was the reason my losses were full of NaNs.

If you are getting NaN values or overflow during training, make sure your inputs are normalized; lowering the learning rate can also help.