A30: Logistic Regression (Part-2) >> Behind the Scene!
Probability & odds, e & natural log, logit link function, log-odds, decision boundary, baseline accuracy, logistic regression coefficients, hypothesis testing, accuracy paradox, power analysis, confusion matrix, true positive/negative, accuracy, specificity, precision, error/misclassification rate…!
This article is a part of “Data Science from Scratch — Can I to I Can”, A Lecture Notes Book Series. (click here to get your copy today!)
⚠️ This is a learning lecture, the code is illustrative and for learning purposes.
✅ A Suggestion: Open a new jupyter notebook and type the code while reading this article; doing is learning. And yes, “PLEASE read the comments, they are very useful…..!”
Logistic Regression >> Behind the Scene
1. Probability and Odds
2. e and the Natural Logarithm — A Quick Review
3. Understanding Logistic Regression
- 3.1: Introduction
- 3.2: The Logit Link Function
- 3.3: Getting Probabilities
- 3.4: Derivation — (optional)
- 3.5: Transformation From log-odds to the Probabilities
4. Logistic Regression Implementation
- 4.1: The Data and its overview
- 4.2: Linear Regression vs Logistic Regression — Visual Comparisons
- 4.3: Decision Boundary
- 4.4: Interpretation of the Coefficients
5. Model Evaluation
- 5.1: The Baseline Accuracy
- 5.2: The Confusion Matrix
- 5.3: The Classification Report
- 5.4: Changing the Threshold for Prediction/s
6. Final words
7. Extra Material — for your free time reading
- 7.1: Hypothesis Testing and the Confusion Matrix
- 7.2: Building Classification Report
- >>> 7.2.1: Accuracy and Misclassification Rate
- >>>>>>>>> The Accuracy Paradox
- >>> 7.2.2: Precision / Positive Predictive Value
- >>> 7.2.3: Recall / Sensitivity / True-Positive-Rate (TPR)
- >>> 7.2.4: False Positive Rate (FPR)
- >>> 7.2.5: Specificity / True-Negative-Rate (TNR)
- >>> 7.2.6: F1-score
- >>>>>>>>> Criticism
- 7.3: Solving for the beta Coefficients
- 7.4: Illustration of a few functions
- >>> 7.4.1: Probability vs Odds
- >>> 7.4.2: The Logit for Odds — log-odds
- >>> 7.4.3: The Logit for Probabilities
- 7.5: Additional Resources
- 7.6: Statistical Testing, Power Analysis and Sample Size
Let’s start with the required imports.
# We are already familiar with these libraries!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# scikit-learn imports
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import StandardScaler

# Retina display to see better quality images.
%config InlineBackend.figure_format = 'retina'

from scipy import stats

# Lines below are just to ignore warnings
import warnings
warnings.filterwarnings('ignore')
1. Probability and Odds
Before we move on to work with logistic regression, we must have a clear understanding of these very important statistical concepts.
🧘🏻♂️>> Probability <<🧘🏻♂️
Probability describes the likelihood of some event happening on a numerical scale between 0 (impossible) and 1 (certain). The higher the probability, the more likely the event will occur.
Think of tossing a fair coin or rolling a die and asking how often we will get a head, or a certain number on the die: the probability is simply the number of favorable outcomes divided by the total number of possibilities.
- In the case of a fair coin, the probability of getting a head or a tail is the same, 1/2 (0.5, or a 50% chance); similarly, for a die, the chance of getting a certain number is 1/6.
🧘🏻♂️>> Odds <<🧘🏻♂️
The odds of an event represent the ratio of the probability that the event occurs to the probability that it does not occur:
>>>> odds = P(event) / P(no event) = P / (1 - P)
For example, consider rolling a three with a fair die: P(three) = 1/6 and P(no three) = 5/6.
So we can write:
>>>> odds(three) = (1/6) / (5/6) = 1/5
It is helpful to think of the numeric odds as a ratio, for example: 1/5 means 1 "three-side" against 5 "no-three-sides" (a die has six sides numbered 1 to 6, and we consider three as the required outcome).
This pictorial description (source) could be useful to understand how probability and odds are related.
>> Examples — Probability and the Odds: <<
Dice roll of 1 (or any other single number):
Probability = 1/6
Odds = 1/5
Even dice roll:
Probability = 3/6
Odds = 3/3
Dice roll less than 5:
Probability = 4/6
Odds = 4/2
Odds are commonly used in gambling. For example, 3/2 means three wins against two losses, and 4/1 means four wins against one loss! In both cases, there are 5 total plays.
Probabilities & odds represent the same thing in different ways.
So, the probabilities can be alternatively expressed as odds and it would be useful to understand how they are related to each other.
Let’s create a table for the probability of a certain event and its odds.
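The table itself was produced in the original notebook; a minimal sketch to rebuild it could look like this (the column names are illustrative, only the name table_po is reused later in this lecture):
# Building a small probability/odds reference table -- a sketch;
# column names are illustrative, only the name table_po is reused later.
table_po = pd.DataFrame({'probability': [0.1, 0.25, 0.5, 0.75, 0.9]})
table_po['odds'] = table_po['probability'] / (1 - table_po['probability'])
table_po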
As in the table above, when we have odds in fractions:
- for p = 0.25: odds = 0.333.. -- it is 0.333.. times as likely to happen as not to happen.
- for p = 0.5: odds = 1 -- it is equally likely to happen as not to happen.
- for p = 0.75: odds = 3 -- it is 3 times more likely to happen than not to happen.
2. e and the Natural Logarithm — A quick review
🧘🏻♂️>> 𝑒 <<🧘🏻♂️
e ~ 2.71828.....
, also known as Euler's number, is one of the most important irrational number (like 𝜋 ; both can’t be written in fraction) in mathematics.e
is the base of the natural logln
(recall school maths - a good link).
🧘🏻♂️>> Natural log — 𝑙𝑛 or 𝑙𝑜𝑔_𝑒 <<🧘🏻♂️
- Log of a number with base 𝑒 is the natural log. <<a good link and wiki>>
𝑒 is the base rate of growth shared by all continually growing processes link and the 𝑙𝑛 gives the time needed to reach a certain level of growth link.
The natural log 𝑙𝑛 is the inverse of the exponential function with base 𝑒 ==> 𝑙𝑛(𝑒^𝑥) = 𝑥
Let’s try!
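A quick check in NumPy (a sketch; any few values will do):
# ln is the inverse of the exponential function with base e
print(np.e)             # 2.718281828...
print(np.log(np.e))     # ln(e) = 1
print(np.log(np.e**5))  # ln(e^5) = 5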
Now, it’s time to go back and add a new column in our table_po with ln(odds).
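Assuming the table_po sketch from above, one line does the job:
# Natural log of the odds -- the log-odds
table_po['ln(odds)'] = np.log(table_po['odds'])
table_po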
The log-odds transformation has a very important property: its range is (−∞, ∞). This is not true for the odds, which can never be negative.
3. Understanding Logistic Regression
Well, now it’s time to move on to understand logistic regression!
3.1: Introduction
Logistic regression is one of the most frequently used classification algorithms (classifiers). Logistic regression estimates probabilities of class membership, and this is actually done by predicting the log-odds from a kind of regression model.
Logistic regression can be generalized to multi-class classification; however, let’s start with binary outcomes, e.g.:
- Predicting the likelihood of patients getting a certain disease based on the symptoms
- Predicting whether a student will get residency based on their scores and the characteristics of the medical college
Well, you can think about tons of examples ………! >>> In reality, most of the time we are dealing with binary outcomes!
3.2: The logit “link function”
Instead of continuous outcomes, we predict class membership using logistic regression, but we can still formulate logistic regression the way we formulate linear regression. We will have an intercept and coefficients!
With the help of the predictors, we obtain values for the log-odds, or, to use another name, for the logit function (which is the inverse of the logistic function) of the probability of y belonging to class 1:
>>>> logit(P(y=1)) = log(odds) = log(P / (1 - P)) = b_0 + b_1 x
Log-odds can take any positive or negative value. The purpose of the logit link is to connect the probability of y belonging to class 1, which lives between 0 and 1, to the scale of a linear combination of the covariate (independent, or predictor) values, which ranges between −∞ and ∞.
3.3: Getting probabilities
How do we get probabilities out? By inverting the logit link function with the “logistic” function.
The inverse function of the logit is called the logistic function.
By inverting the logit, we can solve explicitly for P(y=1):
>>>> P(y=1) = e^(b_0 + b_1 x) / (1 + e^(b_0 + b_1 x))
and since the logistic function is a sigmoid (S-shaped) function, the final equation can be written as:
>>>> P(y=1) = 1 / (1 + e^-(b_0 + b_1 x))
3.4: Derivation — (Optional)
To derive how we obtain the probabilities from the log-odds, let’s set z = b_0 + b_1 x and p = P(y=1). Then:
>>>> log(p / (1 - p)) = z
>>>> p / (1 - p) = e^z
>>>> p = e^z (1 - p) = e^z - p e^z
>>>> p (1 + e^z) = e^z
>>>> p = e^z / (1 + e^z) = 1 / (1 + e^-z)
3.5: Transformation from log-odds to the probabilities
Let’s create a plot for this transformation; it would be easier if we visualize it. Copy the code below into your own jupyter notebook and create the plot.
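The original code cell did not survive the export; a sketch along these lines produces the curve:
# Log-odds values on the x-axis
log_odds = np.linspace(-8, 8, 500)
# The logistic (sigmoid) function maps them back to probabilities
p = 1 / (1 + np.exp(-log_odds))
plt.figure(figsize=(18,6))
plt.plot(log_odds, p, lw=4, color='DarkBlue', alpha=0.7)
plt.axhline(0.5, linewidth=3, color='DarkRed', ls='--', alpha=0.4)
plt.xlabel('log(odds)', fontsize=16); plt.ylabel('P', fontsize=16)
plt.title('Transformation from log-odds to probability\n', fontsize=18);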
4. Logistic Regression Implementation
4.1: The Data and its Overview
It’s time to implement what we have learned and see how logistic regression can be used in business.
We can think about a situation where a college has to shortlist applicants for admission based on their GRE and/or GPA. The college offers some specializations, and their codes are provided in the field column. We are trying to keep it simple and considering only a few columns:
- gre: GRE score of the applicant
- gpa: GPA of the applicant
- field: Field of study the student has applied to
- admit: The target column with binary 1-0 outcomes showing whether the student was admitted or not
You can either read the data directly from the provided github link, or download it, save it on your machine, and work from there.
# Let's read the data directly from the github
url='https://raw.githubusercontent.com/junaidqazi/DataSets_Practice_ScienceAcademy/master/admissions.csv'
adm = pd.read_csv(url)
# Checking the head
adm.head(2)
adm.info()
We can see some missing data; it’s a small dataset and we can calculate the numbers as well.
# How much (%) data is missing
adm.isnull().sum()/len(adm)*100 # % of missing data
Well, only a small fraction of the data is missing, we can ignore it!
adm.dropna(inplace=True) # inplace = True for the permanent change
>>Let’s compute the probabilities and the odds for admission based on the field of study<<
adm.field.value_counts() #unique()
So, we have a total of 4 fields available for the applicants, and most of the students applied to the field with code 2. Let’s compute the probabilities and the odds for admission!
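The notebook’s exact cell is not shown here; a sketch with groupby gives the same numbers:
# P(admit) and odds(admit) per field of study
p_admit = adm.groupby('field')['admit'].mean()
pd.DataFrame({'P(admit)': p_admit, 'odds(admit)': p_admit / (1 - p_admit)})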
4.2: Linear Regression vs Logistic Regression — Visual Comparisons
Similar to linear regression, we need to create an instance of the logistic regression model and train it on the dataset. Once the model is trained, we can grab the model coefficients, the predicted probabilities, and the predicted labels as well.
Let’s standardize the features first. This is also important because scikit-learn applies l2 regularization by default when fitting a LogisticRegression.
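The training cell did not survive the export; a minimal sketch, assuming the variable names X, y and logR that the rest of this lecture refers to (linR is my label for the linear model):
# Target and standardized feature
y = adm['admit']
X = adm[['gpa']].copy()
X['gpa'] = StandardScaler().fit_transform(adm[['gpa']]).ravel()

# Train both models on the same feature
linR = LinearRegression().fit(np.array(X['gpa']).reshape(-1,1), y)
logR = LogisticRegression().fit(np.array(X['gpa']).reshape(-1,1), y)

# Collect the predictions side by side
pred_df = pd.DataFrame({'gpa': X['gpa'], 'admit': y,
                        'linear_pred': linR.predict(np.array(X['gpa']).reshape(-1,1)),
                        'logistic_pred': logR.predict(np.array(X['gpa']).reshape(-1,1))})
pred_df.head()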
Our feature (gpa) in X is standardized (0 mean, 1 variance), so gpa=0 indicates an average gpa, and gpa=1 indicates a value one standard deviation larger than the mean (which is 0).
So, we have the predictions from the linear regression and the logistic regression models in the above data frame; let’s plot them to visualize.
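A sketch for the side-by-side plot (continuous predictions from linR, class predictions from logR):
# Grid of gpa values for smooth prediction lines
x_grid = np.linspace(X['gpa'].min(), X['gpa'].max(), 300).reshape(-1,1)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18,6), sharey=True)
ax1.scatter(X['gpa'], y, alpha=0.4)
ax1.plot(x_grid, linR.predict(x_grid), color='DarkRed', lw=3)
ax1.set_title('Linear Regression'); ax1.set_xlabel('gpa (standardized)')
ax1.set_ylabel('admit')
ax2.scatter(X['gpa'], y, alpha=0.4)
ax2.plot(x_grid, logR.predict(x_grid), color='DarkRed', lw=3)
ax2.set_title('Logistic Regression'); ax2.set_xlabel('gpa (standardized)');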
So, we can see in the left plot, for Linear Regression, that the predictions do not make any sense at all.
The plot on the right side resolves the problem, and the class predictions make sense using logistic regression.
4.3: Decision Boundary
(odds, probability, coefficients, intercept and the feature/s)
Let’s see for what value of gpa the log-odds are 0:
We can manually compute from the equation below
>>>> log(odds) = b_0 + b_1 x
We want value of x for which the log(odds)=0
>>>> 0 = b_0 + b_1 x
>>>> -b_0 = b_1 x
>>>> -b_0/b_1 = x
>>>> (-1)*(b_0/b_1) = x
So x is the value for which log(odds) = 0
Well, we don’t need to do this manually; we can grab b_0 and b_1 from our trained logistic regression model logR and compute the decision boundary where log(odds) = 0.
Please read the comments in the code below and see the output; you need to spend some time to get a complete understanding.
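A sketch of that cell, using only the trained model’s attributes:
# b_0 (intercept) and b_1 (slope) from the trained model
b0 = logR.intercept_[0]
b1 = logR.coef_[0][0]

# log(odds) = 0  ==>  x = -b_0/b_1 -- the decision boundary
decision_boundary = -b0/b1
print("Decision boundary at standardized gpa =", decision_boundary)

# Sanity check: at this gpa the predicted probability should be ~0.5
print(logR.predict_proba(np.array([[decision_boundary]])))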
Let’s move on and visualize the boundary and data along with predictions.
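A sketch for the base plot; the next cell adds the probability curve to this same fig/ax:
# Scatter of the data plus a vertical line at the decision boundary
fig, ax = plt.subplots(figsize=(18,6))
ax.scatter(X['gpa'], y, s=60, alpha=0.4)
ax.axvline(decision_boundary, lw=3, color='DarkRed', ls='--',
           label='decision boundary')
ax.set_xlabel('gpa (standardized)', fontsize=16)
ax.set_ylabel('admit / P(admit)', fontsize=16)
ax.legend(fontsize=14);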
We can predict probabilities for the data to get the logistic curve and add it to the above plot. Let’s get it done.
# Generating data for the curve
x_vals = np.linspace(-10., 10., 3000)
y_pp = logR.predict_proba(x_vals[:, np.newaxis])[:, 1]
# Plotting the probabilities (black curve)
ax.plot(x_vals, y_pp, color='black', alpha=0.7, lw=4)
# Adding blue line for probability cut-off (0.5)
ax.axhline(0.5, lw=3, color='blue', ls='--', label='probability = 0.5')
fig
4.4: Interpretation of the Coefficients
In case you want to see the b_0 and b_1 coefficient values from the trained logistic regression model:
print("The logistic regression beta's are:")
print("beta_1 = {} and beta_0 = {}.".format(logR.coef_[0][0], logR.intercept_[0]))
>>Meaning of the betas in log odds<<
The coefficients have a linear impact on the log-odds (recall the formula).
- If 𝛽_1 is 0, then 𝛽_0 represents the log odds of admittance for a student with an average gpa.
- 𝛽_1 is the effect of a unit increase in rescaled gpa on the log odds of admittance.
Log odds are hard to interpret. Luckily though, we can apply the logistic transform to get the probability of admittance at different 𝛽 values.
From the curve in the above plot, we can see that values of gpa within 2 to 3 standard deviations of the mean lead to a practically linear increase in the probability of admission.
The values very far to the left or the right hardly increase or decrease the probability of admission any further, as the curve (s-shaped) becomes very flat.
Logistic regression coefficients can be exponentiated to get the odds ratios, which makes the coefficients even easier to interpret. We will try this in the next lecture while working with the Titanic dataset. In the meantime, these links could be useful to explore:
- interpret coefficients — odd ratios in logistic regression
- exponentiate the logistic regression coefficients
5. Model Evaluation
5.1: The baseline accuracy
Baseline accuracy is a significantly important calculation, and it is critical to know it when we are evaluating the performance of our trained model.
Baseline Accuracy: The accuracy that can be achieved by a model by simply guessing the majority class for every observation.
baseline_accuracy = majority_class_N / total_N
A typical human guess would be 50% accuracy, an equal chance for both classes in a binary classification problem. In fact, this is only the accuracy of guessing by chance if both classes occur in the same ratio, or, in a multi-class classification problem, if the majority class makes up ~50% of the labels.
>>A rule of thumb: in binary classification, the baseline accuracy can never be below 50%<<
In real-life binary classification problems, most of the time the datasets are not really balanced and the baseline accuracy is higher than 50%. For example, out of 100 observations, if 70 belong to class 1 and 30 belong to class 0, the baseline accuracy would be 70%. Creating a model with accuracy lower than the baseline is not really what we want!
If 99% of your observations belong to class 1 (extremely unbalanced data), a model predicting class 1 for every single observation gets 99% accuracy while having no predictive power. Quality data is important, and a model with 99% accuracy could still be a worthless model!
# Well, we can easily find the baseline accuracy using value_counts()!
y.value_counts(normalize=True)
So, in our dataset, the majority class is 1 with ~54% of the observations, which is the baseline accuracy in our working example!
baseline_acc = y.value_counts(normalize=True).values[0]*100
print("Baseline accuracy is: ", baseline_acc)
5.2: The Confusion Matrix
Recall the confusion matrix, or error matrix, from the slides in the previous lecture. Let’s compute our confusion matrix manually first, and then we will use scikit-learn for the same purpose.
# To manually calculate the confusion matrix for our model, we need predictions from the model first.
predicted = logR.predict(np.array(X['gpa']).reshape(-1,1))
# Manually calculating the confusion matrix cells
tn = np.sum((y == 0) & (predicted == 0))
tp = np.sum((y == 1) & (predicted == 1))
fp = np.sum((y == 0) & (predicted == 1))
fn = np.sum((y == 1) & (predicted == 0))
print("tn:", tn)
print("tp:", tp)
print("fp:", fp)
print("fn:", fn)
print("Number of classification errors (fp+fn):", fp+fn)
Let’s use scikit-learn to get the confusion matrix….!
#Verify from sklearn’s metrics.confusion_matrix
# We need this import
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y, predicted))#,labels=[1,0])) # Try labels yourself
5.3: The Classification Report
The classification report helps diagnose the effectiveness of the classifier.
Scikit-learn's metrics.classification_report returns three very useful evaluation metrics (precision, recall and f1-score) for both of the classes (or more if you have a multi-class problem); support refers to the total number of observations in each class.
- The 0 and 1 rows report the metrics for each individual class.
- The weighted averages row, as its name suggests, gives the weighted averages across both classes.
from sklearn.metrics import classification_report
print(classification_report(y, predicted))
5.4: Changing the Threshold for Prediction/s
The prediction of the classifier defaults to guessing the class that has the highest predicted probability (a 0.5 threshold in the binary case). This tends to lead to the highest possible accuracy (and even that is only guaranteed on the training data!).
However, it could be the case that maximizing the accuracy is not, in fact, our ultimate goal. Consider the following scenario:
Cancer detection: Based on some medical measurements, we have developed a classification model (classifier) to detect, whether or not a person has a cancerous tumor. The classifier gets a 96% accuracy compared to a 60% baseline accuracy.
>>>Our classifier is performing well, but what might be wrong with just maximizing the accuracy in this case?
>>>Think back to the confusion matrix: the goal should be to treat cancer patients before it is too late, so missing a true cancer case (a false negative) is far more costly than a false alarm.
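No code survived in this section; a sketch of changing the threshold by hand, using predict_proba instead of predict (the 0.3 cut-off is purely illustrative):
# Predicted probabilities for class 1
proba = logR.predict_proba(np.array(X['gpa']).reshape(-1,1))[:,1]

# Default behaviour of predict(): threshold at 0.5
preds_default = (proba >= 0.5).astype(int)
# Lowering the threshold trades false positives for fewer false negatives
preds_lower = (proba >= 0.3).astype(int)

print("Positives predicted at threshold 0.5:", preds_default.sum())
print("Positives predicted at threshold 0.3:", preds_lower.sum())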
6. Final Words
Logistic regression is a very popular and attractive machine learning classifier for many good reasons:
- Shares similar properties to linear regression
- Very fast and efficient
- Coefficients are interpretable (although somewhat complex): they represent the change in log-odds due to the input variables
- Can also perform well on a small number of observations
Generally, though, logistic regression is considered to be at the lower end when it comes to comparisons with other competitive supervised machine learning algorithms.
7. Extra Material — for your free time reading!
In this section, we will revise some important statistical concepts. We will also get some useful plots to understand the concepts of probabilities, odds, log-odds, statistical testing, power analysis and much more……!
7.1: Hypothesis testing and the confusion matrix
In the context of hypothesis testing, false positives and false negatives are referred to as Type I and Type II errors, respectively.
- Type I error is the incorrect rejection of the null hypothesis when in fact the null hypothesis is true. This is equivalent to the false positive rate in classification: >> the rate of a model labeling an observation as “true” when in fact it is “false”.
Type I error directly corresponds to p-values: the p-value is the probability of incorrectly rejecting the null hypothesis.
- Type II error, on the other hand, directly corresponds to false negatives. A Type II error in the context of hypothesis testing would be to fail to reject the null hypothesis when in fact the alternative hypothesis is true.
Statistical significance and statistical power are two fundamental concepts, you can read more from any basics statistics book.
7.2: Building Classification Report
>>7.2.1: Accuracy and misclassification rate<<
accuracy = (tp + tn) / total_population
Just the proportion of correct guesses, regardless of class.
misclassification_rate = (fp + fn) / total_population
Simply one minus the accuracy.
from sklearn.metrics import accuracy_score

total_population = tp + fp + tn + fn
print("Manually calculating score: ",
      float(tp + tn) / total_population)  # manual
print("scikit-learn's accuracy_score function: ",
      accuracy_score(y, predicted))
print("model's score function: ",
      logR.score(np.array(X['gpa']).reshape(-1,1), y))
print("Three options returning the same results.")
>> The Accuracy Paradox << Wiki
Accuracy is a very intuitive metric; we can think of an exam score where we get total_correct/total_attempted. However, accuracy is often a poor metric in application. There are many reasons for this:
- Imbalanced data with 95% positives will show 95% accuracy even with no predictive power.
- This is the paradox: pursuing accuracy often means predicting the most common class rather than doing the most useful work.
- Sometimes ranking predictions in the correct order is more important than getting them all correct.
In many cases we need to know the exact probability of a positive or a negative, e.g.:
- To calculate an expected return.
- To triage (assign degrees of urgency to decide the order of treatment) observations that are borderline positive.
Some of the most useful metrics for addressing these problems are:
>> Classification accuracy/error <<
- Classification accuracy is the percentage of correct predictions (higher is better).
- Classification error is the percentage of incorrect predictions (lower is better).
- Easiest classification metric to understand.
>> Confusion matrix <<
- Gives us a better understanding of how our classifier is performing.
- Allows us to calculate sensitivity, specificity, and many other metrics that might match our business objective better than just the accuracy. Precision and recall are good for balancing misclassification costs.
>> ROC curves and area under a curve (AUC) <<
- Good for ranking and prioritization problems.
- Allows us to visualize the performance of our classifier across all possible classification thresholds, thus helpful to choose a threshold that appropriately balances sensitivity and specificity.
- Still useful when there is high class imbalance (unlike classification accuracy/error).
- Harder to use when there are more than two response classes (multi-class: try one vs all!).
>> Log loss <<
- Most useful when well-calibrated predicted probabilities are important to your business objective.
- Expected value calculations
- Triage
All of these can be easily computed in Python; the important thing is to know what we are looking for.
>> 7.2.2: Precision / Positive Predictive Value <<
precision = tp / (tp + fp)
The idea of the classifier being precise is subtly different from it being accurate.
Precision is a measure of correctness only for the positive class predictions, whereas accuracy is a measure of correctness for all guesses.
from sklearn.metrics import precision_score

print("Precision using scikit-learn: ",
      precision_score(y, predicted))
print("Precision computed manually: ",
      float(tp) / (tp + fp))
>> 7.2.3: Recall / Sensitivity / True Positive Rate (TPR) <<
Recall measures, out of all the times the true label was positive, how often the predicted label was also positive.
recall = tp / (tp + fn)
This is alternatively known as the sensitivity or true positive rate. The three names refer to the same quantity.
from sklearn.metrics import recall_score

print("Recall using scikit-learn: ", recall_score(y, predicted))
print("Recall manual calculations: ", float(tp) / (tp + fn))
Precision can be seen as a measure of quality, and recall as a measure of quantity.
>> 7.2.4: False Positive Rate (FPR) <<
Alternatively, the false positive rate measures, out of all the times the true label was negative, how often the predicted label was positive.
fpr = fp / (tn + fp)
# Calculate the FPR using the confusion matrix cells.
print("FPR: ", float(fp) / (tn + fp))
# Alternative way to calculate the same (FPR = 1 - TNR)
print("FPR: ", 1 - recall_score(y==0, predicted==0))
>> 7.2.5: Specificity / True Negative Rate (TNR) <<
specificity = tn / (tn + fp)
It’s a sister metric to recall, which measures the same thing but for the positives.
Specificity measures, out of all the times the true label was negative, how often the predicted label was also negative.
specificity = float(tn) / (tn + fp)
print("Manually: Specificity / True Negative Rate (TNR): ", specificity)
# Alternative way to calculate the same
print("Specificity / True Negative Rate (TNR) using recall_score: ", recall_score(y==0, predicted==0))
>> 7.2.6: F1-score <<
The F1-score is the harmonic mean of the precision and recall metrics.
The highest possible value of an F-score is 1.0, indicating perfect precision and recall, and the lowest possible value is 0, if either the precision or the recall is zero.
from sklearn.metrics import f1_score

# Manual calculation -- harmonic mean of precision and recall
precision_1 = float(tp)/(tp+fp)
recall_1 = float(tp)/(tp+fn)
f1_1 = 2/(1/recall_1 + 1/precision_1)
print("F1-Score: ", f1_1)
# Using scikit-learn
print("F1-Score: ", f1_score(y==1, predicted==1))
Blending the two is useful. By combining them, we get a measure of the classifier’s ability to find the positively labeled observations, as well as how permissive it is of identification errors on those labels.
>> Criticism <<
It is also important to consider that the f1-score has been criticized by well-known statisticians/scientists.
- In their paper “A note on using the F-measure for evaluating record linkage algorithms”, published in 2017, David Hand and Peter Christen criticize the use of the f1-score because it gives equal importance to precision and recall. Practically, different types of misclassifications incur different costs. In their study, they show that the f-measure can also be expressed as a weighted sum of precision and recall, and the relative importance assigned to precision and recall should be an aspect of the problem.
Some other readings are also useful:
- By Davide Chicco and Giuseppe Jurman in 2020: The advantages of the Matthews correlation coefficient (MCC) over F1-score and accuracy in binary classification evaluation.
- The MCC, originally published in 1975 by the biochemist Brian Matthews, measures the quality of binary classifications.
- By David Powers in 2011: Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. He argues that kappa and correlation are symmetric, whereas the f1-score ignores the true negatives, which is misleading for unbalanced classes.
7.3: Solving for the beta coefficients
Logistic regression maximizes the likelihood that the predicted probabilities give the correct class.
To do so, one considers the product of the predicted probabilities for all data points; this is called the likelihood function, the joint probability of the observed classes given the predicted probabilities:
>>>> L = product over all observations i of p_i^(y_i) * (1 - p_i)^(1 - y_i)
The 𝛽-coefficients are chosen in such a way that this function is maximized.
The optimal case would be that
- the predicted probabilities for all class one observations are actually one
- the predicted probabilities for all class zero observations are actually zero
There is not a closed-form solution to the beta coefficients like in linear regression, and the betas are found through optimization procedures.
If you are particularly interested in the math, these two resources are good:
A good blog post on the logistic regression beta coefficient derivation.
This paper is also a good reference.
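To make the idea concrete, here is a minimal sketch that maximizes the (log-)likelihood numerically with scipy.optimize; this is not the solver scikit-learn uses, but the objective is the same:
from scipy.optimize import minimize

def neg_log_likelihood(betas, x, y):
    # Negative log-likelihood of the logistic model
    p = 1/(1 + np.exp(-(betas[0] + betas[1]*x)))
    return -np.sum(y*np.log(p) + (1 - y)*np.log(1 - p))

# Minimizing the negative log-likelihood == maximizing the likelihood
res = minimize(neg_log_likelihood, x0=[0., 0.],
               args=(np.array(X['gpa']), np.array(y)))
print("Optimized (b_0, b_1):", res.x)
# Values will differ slightly from scikit-learn's,
# since LogisticRegression adds l2 regularization by default
print("scikit-learn's      :", logR.intercept_[0], logR.coef_[0][0])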
7.4: Illustration of a few functions
>> 7.4.1: Probability vs Odds <<
# Function to calculate the odds of success.
def odds(p):
    return p / (1 - p)

# Generating a range of probabilities.
probabilities = np.linspace(0.001, 0.99, 200)
# Generate the list of odds.
odds_list = [odds(proba) for proba in probabilities]
# Create figure.
plt.figure(figsize=(18,6))
# Plot blue line for odds as probability goes from 0.1% (0.001) to 99% (0.99).
plt.plot(probabilities, odds_list, lw=4, color='DarkBlue')
# Plot red dashed line to visualize odds when probability is 50%.
plt.vlines(0.5, 0.0, 100, ls="dashed", lw=3, color='DarkRed')
plt.text(0.33, 50.0, "odds(P=0.5) = 1", fontsize=18, color='DarkRed')
# Plot orange dashed line to visualize odds when probability is 66.67%.
plt.vlines(0.6667, 0.0, 100, ls="--", lw=3, color='DarkOrange')
plt.text(0.68, 50, "odds(P=2/3) = 2", fontsize=18, color='DarkOrange')
# Annotate the blue line when probability is 100%.
plt.text(0.84, 100, "odds(P=1) = $\infty$", fontsize=18, color='DarkBlue')
# Title and labels
plt.title("If the probability of success is 50%, then the odds of success are 1.\n\
If the probability of success is 100%, then the odds of success are $\infty$.",
          ha='left', position=(0,1), fontsize=18)
plt.xlabel("Probability (P)", fontsize=20)
plt.ylabel("Odds", fontsize=20);
>> 7.4.2: The logit for odds — log-odds <<
# Creating some positive x-values as suitable for odds
odds = np.linspace(start=0.001, stop=5, num=500)
# if start=0 ==> RuntimeWarning: divide by zero encountered in log below
log_odds = np.log(odds)

plt.figure(figsize=(18,6))
plt.axhline(y=0, linewidth=3, color='DarkRed', ls='--', alpha=0.4)
plt.plot(odds, log_odds, lw=4, color='DarkBlue', alpha=0.7)
plt.xlabel('odds', fontsize=16); plt.ylabel('log(odds)', fontsize=16)
plt.title('Transformation from odds to log(odds)\n', fontsize=18);
log-odds can take any value between −∞ and ∞
>> 7.4.3: The logit for probabilities <<
# Creating some x-values between 0 and 1 as suitable for probabilities
pr = np.linspace(start=0.001, stop=0.999, num=500)  # p_min=0, p_max=1
log_it = np.log(pr / (1 - pr))  # odds = P/(1-P)

plt.figure(figsize=(18,6))
plt.axhline(y=0, linewidth=3, color='DarkRed', ls='--', alpha=0.4)
plt.plot(pr, log_it, lw=4, color='DarkBlue', alpha=0.7)
plt.xlabel('P', fontsize=16); plt.ylabel('log (P/(1-P))', fontsize=16)
plt.title("Transformation from probability to log (P/(1-P))\n", fontsize=18);
7.5: Additional Resources
- Logistic Regression Video Walkthrough
- Logistic Regression Walkthrough
- Logistic Regression with Statsmodels — Well Switching in Bangladesh
- 0 and 1 are not probabilities
- Null Hypothesis
- Hypothesis Testing
- Statistical Power and Power Analysis in Python
>> ROC — Just the heads-up, we will cover next: <<
- A deeper Introduction to ROC
- Interactive playing with ROC curves
- Data School’s video and transcript on ROC/AUC
- Watch Rahul Patwari’s video on ROC curves
7.6: Statistical testing, Power Analysis and Sample Size
>> Statistical Testing <<
Logistic regression is one of the few machine learning models where we can obtain comprehensive statistics. By performing hypothesis testing, we can understand whether we have sufficient data to make strong conclusions about individual coefficients and the model as a whole. statsmodels is a very useful Python library that can provide these statistics with just a few lines of code.
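For instance, a sketch with the admissions data from above:
import statsmodels.api as sm

# Logit model with an intercept; summary() reports coefficients,
# standard errors, z-statistics, p-values and confidence intervals
logit_model = sm.Logit(y, sm.add_constant(X['gpa'])).fit()
print(logit_model.summary())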
>> Power Analysis <<
As we may suspect, many factors affect how statistically significant the results of a logistic regression are. The art of estimating the sample size needed to detect an effect of a given size with a given degree of confidence is called power analysis.
Some factors that influence the accuracy of our resulting model are:
- Desired statistical significance (p-value)
- Magnitude of the effect
- It is more difficult to distinguish a small effect from noise. So, more data would be required!
- Measurement precision
- Sampling error
- An effect is more difficult to detect in a smaller sample.
- Experimental design
So, many factors, in addition to the number of samples, contribute to the resulting statistical power. Hence, it is difficult to give an absolute number without a more comprehensive analysis.
>> Type II error and “power” <<
Type I error corresponds to the concept of statistical significance
Type II error corresponds to the concept of statistical power.
The power of a test is:
>>>> power = 1 - beta = P(reject H_0 | H_1 is true)
More intuitively, power measures our ability to detect an effect that is present. It indicates the probability of avoiding a Type II error.
The statistical power ranges from 0 to 1, and as it increases, the probability of making a Type II error (wrongly failing to reject the null hypothesis) decreases.
We can visualize the ideas of significance, power, and error types in a matrix analogous to the confusion matrix above:
- alpha is the probability of Type I error in any hypothesis test — incorrectly rejecting the null hypothesis.
- beta is the probability of Type II error in any hypothesis test — incorrectly failing to reject the null hypothesis.
>> How Many Samples Are Needed? <<
We often ask how large our data set should be to achieve a reasonable logistic regression result. Below, a few methods will be introduced for determining how accurate the resulting model will be.
Rule of Thumb
- Quick: At least 100 samples total. At least 10 samples per feature.
Both the above methods are from: >>> Regression Models for Categorical and Limited Dependent Variables by Long, J. S. (1997). Thousand Oaks, CA: SAGE Publications.
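As a sketch of power analysis in code (using statsmodels; the effect size, power and alpha below are purely illustrative numbers for a two-sample comparison):
from statsmodels.stats.power import TTestIndPower

# Samples per group needed to detect a medium effect (d = 0.5)
# with 80% power at the 5% significance level
analysis = TTestIndPower()
n_required = analysis.solve_power(effect_size=0.5, power=0.8, alpha=0.05)
print("Required sample size per group:", round(n_required))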
💐Click here to FOLLOW ME for new contents💐
🌹Keep practicing to brush-up & add new skills🌹
✅🌹💐💐💐🌹✅ Please clap and share >> you can help us to reach to someone who is struggling to learn these concepts.✅🌹💐💐💐🌹✅
Good luck!
See you in the next lecture on “A31: Logistic Regression >> Dead or Alive >> Step-by-step complete machine learning project!”.
Note: This complete course, including video lectures and jupyter notebooks, is available on the following links:
Dr. Junaid Qazi is a subject matter specialist, data science & machine learning consultant, and a team builder. He is a professional development coach, mentor, author, technical writer, and invited speaker. Dr. Qazi can be reached for consulting projects, technical writing and/or professional development trainings via LinkedIn.