---
title: "Accuracy: The Intuitive Metric"
sidebar_label: Accuracy
description: "Understanding the most common evaluation metric, its formula, and its fatal flaws in imbalanced datasets."
tags: [machine-learning, model-evaluation, metrics, classification]
---

**Accuracy** is the most basic and intuitive metric used to evaluate a classification model. In simple terms, it answers the question: *"Out of all the predictions made, how many were correct?"*

## 1. The Mathematical Formula

Accuracy is calculated by dividing the number of correct predictions by the total number of input samples.

Using the components of a [Confusion Matrix](./confusion-matrix), the formula is:

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$

Where:

* **TP (True Positives):** Correctly predicted positive samples.
* **TN (True Negatives):** Correctly predicted negative samples.
* **FP (False Positives):** Incorrectly predicted as positive.
* **FN (False Negatives):** Incorrectly predicted as negative.

**Example:**

Imagine you have a dataset of 100 emails, where 80 are spam and 20 are not spam. Your model makes the following predictions:

| Actual \ Predicted | Spam | Not Spam |
| --- | --- | --- |
| **Spam** | 70 (TP) | 10 (FN) |
| **Not Spam** | 5 (FP) | 15 (TN) |

Using the formula:

$$
\text{Accuracy} = \frac{70 + 15}{70 + 15 + 5 + 10} = \frac{85}{100} = 0.85 \text{ or } 85\%
$$

This means your model correctly identified 85% of the emails.
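
As a quick sanity check, the same arithmetic can be reproduced in a few lines of Python. This is a minimal sketch using the counts from the table above; the variable names are only illustrative.

```python
# Counts from the spam example above
tp, tn, fp, fn = 70, 15, 5, 10

# Accuracy = correct predictions / all predictions
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Accuracy: {accuracy:.2%}")  # Accuracy: 85.00%
```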

## 2. When Accuracy Works Best

Accuracy is a reliable metric **only** when your dataset is **balanced**.

* **Example:** You are building a model to classify images as either "Cats" or "Dogs," and your dataset has 500 cats and 500 dogs.
* If this model reaches 90% accuracy, that number is meaningful: neither class can dominate the score, so the model must be doing reasonably well on both.

## 3. The "Accuracy Paradox" (Imbalanced Data)

Accuracy becomes highly misleading when one class significantly outweighs the other. This is known as the **Accuracy Paradox**.

### The Scenario:

Imagine a Rare Disease test where only **1%** of the population is actually sick.

1. If a "lazy" model is programmed to simply say **"Healthy"** for every single patient...
2. It will be **99% accurate**.

```mermaid
graph LR
POP["$$\text{Population (100\%)}$$"]

POP --> H["$$99\% \ \text{Healthy}$$"]
POP --> S["$$1\% \ \text{Sick (Rare Disease)}$$"]

%% Lazy Model
H --> PH["$$\text{Predicted: Healthy}$$"]
S --> PS["$$\text{Predicted: Healthy}$$"]

PH --> ACC1["$$\text{True Negatives (99\%)}$$"]
PS --> ERR1["$$\text{False Negatives (1\%)}$$"]

ACC1 --> MET["$$\text{Accuracy} = \frac{99}{100} = 99\%$$"]

ERR1 --> FAIL["$$\text{❌ All Sick Patients Missed}$$"]

MET -.->|"$$\text{Accuracy Paradox}$$"| FAIL

```

**The problem?** Even though the accuracy is 99%, the model failed to find the 1% of people who actually need help. In high-stakes fields like medicine or fraud detection, accuracy alone is often the least informative metric.
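
To see the paradox in code, here is a minimal sketch of the scenario above: a hypothetical population of 100 patients (99 healthy, 1 sick) scored against a "lazy" model that always predicts "Healthy." `accuracy_score` and `recall_score` are the standard scikit-learn helpers.

```python
from sklearn.metrics import accuracy_score, recall_score

# 99 healthy patients (0) and 1 sick patient (1)
y_true = [0] * 99 + [1]

# A "lazy" model that predicts "Healthy" for every single patient
y_pred = [0] * 100

print(f"Accuracy: {accuracy_score(y_true, y_pred):.2%}")  # 99.00%
print(f"Recall:   {recall_score(y_true, y_pred):.2%}")    # 0.00% - every sick patient missed
```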

## 4. Implementation with Scikit-Learn

```python
from sklearn.metrics import accuracy_score

# Actual target values
y_true = [0, 1, 1, 0, 1, 1]

# Model predictions
y_pred = [0, 1, 0, 0, 1, 1]

# Calculate Accuracy
score = accuracy_score(y_true, y_pred)

print(f"Accuracy: {score * 100:.2f}%")
# Output: Accuracy: 83.33%

```

## 5. Pros and Cons

| Advantages | Disadvantages |
| --- | --- |
| **Simple to understand:** Easy to explain to non-technical stakeholders. | **Useless for Imbalance:** Can hide poor performance on minority classes. |
| **Single Number:** Provides a quick, high-level overview of model health. | **Ignores Probability:** Doesn't tell you how confident the model was in its choice. |
| **Standardized:** Used across almost every classification project. | **Cost Blind:** Treats "False Positives" and "False Negatives" as equally bad. |

## 6. Moving Beyond Accuracy

To get a true picture of your model's performance, especially when your data is skewed, look at Accuracy alongside the following metrics (a combined sketch follows the list):

* **Precision:** How many of the predicted positives were actually positive?
* **Recall:** How many of the actual positives did we successfully find?
* **F1-Score:** The harmonic mean of Precision and Recall.
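
As a rough illustration of how these metrics diverge from Accuracy, the sketch below scores a small made-up imbalanced example with all four; the exact numbers are only illustrative.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# A small imbalanced example: 8 negatives, 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")   # 0.80 - looks fine
print(f"Precision: {precision_score(y_true, y_pred):.2f}")  # 0.50
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")     # 0.50
print(f"F1-Score:  {f1_score(y_true, y_pred):.2f}")         # 0.50
```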

## References

* **Google Developers:** [Classification: Accuracy](https://developers.google.com/machine-learning/crash-course/classification/accuracy)
* **StatQuest:** [Accuracy, Precision, and Recall](https://www.youtube.com/watch?v=Kdsp6soqA7o)

---

**If Accuracy isn't enough to catch rare diseases or credit card fraud, what is?** Stay tuned for our next chapter on **Precision & Recall** to find out!
---
title: The Confusion Matrix
sidebar_label: Confusion Matrix
description: "The foundation of classification evaluation: True Positives, False Positives, True Negatives, and False Negatives."
tags: [machine-learning, model-evaluation, metrics, classification, confusion-matrix]
---

A **Confusion Matrix** is a table used to describe the performance of a classification model. While "Accuracy" tells you how often the model is correct, the Confusion Matrix tells you exactly **how** it is failing and which classes are being swapped.

## 1. The 2x2 Layout

For a binary classification (Yes/No, Spam/Ham), the matrix consists of four quadrants:

| | Predicted: **Negative** | Predicted: **Positive** |
| :--- | :--- | :--- |
| **Actual: Negative** | **True Negative (TN)** | **False Positive (FP)** |
| **Actual: Positive** | **False Negative (FN)** | **True Positive (TP)** |

### Breaking Down the Quadrants:
* **True Positive (TP):** You predicted positive, and it was true. (e.g., You predicted a patient has cancer, and they do).
* **True Negative (TN):** You predicted negative, and it was true. (e.g., You predicted a patient is healthy, and they are).
* **False Positive (FP):** You predicted positive, but it was false. (Also known as a **Type I Error** or a "False Alarm").
* **False Negative (FN):** You predicted negative, but it was positive. (Also known as a **Type II Error** or a "Miss").
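
The four quadrants map directly onto scikit-learn's `confusion_matrix` output: for binary 0/1 labels the returned array is laid out as `[[TN, FP], [FN, TP]]`, so it can be unpacked with `.ravel()`. A minimal sketch, reusing the same toy labels as the implementation section below:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = positive class, 0 = negative class
y_true = [0, 1, 0, 1, 0, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# For binary labels the matrix is laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
# TP=3, TN=3, FP=1, FN=1
```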

## 2. Type I vs. Type II Errors

The "cost" of these errors depends entirely on your specific problem.

```mermaid
graph TB
TITLE["$$\text{Type I vs. Type II Errors}$$"]

%% Ground Truth
TITLE --> TRUTH["$$\text{Actual Condition}$$"]
TRUTH --> POS["$$\text{Positive (Condition Present)}$$"]
TRUTH --> NEG["$$\text{Negative (Condition Absent)}$$"]

%% Model Decisions
POS --> TP["$$\text{True Positive}$$"]
POS --> FN["$$\text{Type II Error}$$<br/>$$\text{False Negative}$$"]

NEG --> TN["$$\text{True Negative}$$"]
NEG --> FP["$$\text{Type I Error}$$<br/>$$\text{False Positive}$$"]

%% Costs
FP --> COST1["$$\text{Cost Depends on Context}$$"]
FN --> COST2["$$\text{Cost Depends on Context}$$"]

%% Examples
COST1 --> EX1["$$\text{Example: Spam Filter}$$<br/>$$\text{Important Email Blocked}$$"]
COST2 --> EX2["$$\text{Example: Medical Test}$$<br/>$$\text{Disease Missed}$$"]

%% Emphasis
EX1 -.->|"$$\text{Type I Cost High}$$"| FP
EX2 -.->|"$$\text{Type II Cost High}$$"| FN

```

* **In Cancer Detection:** A **Type II Error (FN)** is much worse because a sick patient goes untreated.
* **In Spam Filtering:** A **Type I Error (FP)** is worse because an important work email is hidden in the trash.

## 3. Implementation with Scikit-Learn

```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Actual values and Model predictions
y_true = [0, 1, 0, 1, 0, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# 1. Generate the matrix
cm = confusion_matrix(y_true, y_pred)

# 2. Visualize it
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Negative', 'Positive'])
disp.plot(cmap=plt.cm.Blues)
plt.show()

```

## 4. Multi-Class Confusion Matrices

The matrix isn't just for binary problems. If you are classifying "Cat," "Dog," and "Bird," your matrix will be 3x3. The diagonal line from top-left to bottom-right represents correct predictions. Any numbers off that diagonal show you which animals the model is confusing.

```mermaid
graph TB
TITLE["$$\text{Multi-Class Confusion Matrix (3×3)}$$"]

%% Axes
TITLE --> ACT["$$\text{Actual Class}$$"]
TITLE --> PRED["$$\text{Predicted Class}$$"]

ACT --> CAT_A["$$\text{Cat}$$"]
ACT --> DOG_A["$$\text{Dog}$$"]
ACT --> BIRD_A["$$\text{Bird}$$"]

PRED --> CAT_P["$$\text{Cat}$$"]
PRED --> DOG_P["$$\text{Dog}$$"]
PRED --> BIRD_P["$$\text{Bird}$$"]

%% Diagonal (Correct Predictions)
CAT_A -->|"$$\text{Correct}$$"| CAT_P
DOG_A -->|"$$\text{Correct}$$"| DOG_P
BIRD_A -->|"$$\text{Correct}$$"| BIRD_P

%% Off-Diagonal (Confusions)
CAT_A -->|"$$\text{Confusion}$$"| DOG_P
CAT_A -->|"$$\text{Confusion}$$"| BIRD_P

DOG_A -->|"$$\text{Confusion}$$"| CAT_P
DOG_A -->|"$$\text{Confusion}$$"| BIRD_P

BIRD_A -->|"$$\text{Confusion}$$"| CAT_P
BIRD_A -->|"$$\text{Confusion}$$"| DOG_P

%% Emphasis
CAT_P -.->|"$$\text{Diagonal}$$"| GOOD["$$\text{Correct Predictions}$$"]
DOG_P -.->|"$$\text{Diagonal}$$"| GOOD
BIRD_P -.->|"$$\text{Diagonal}$$"| GOOD

DOG_P -.->|"$$\text{Off-Diagonal}$$"| BAD["$$\text{Model Confusion}$$"]
BIRD_P -.->|"$$\text{Off-Diagonal}$$"| BAD

```
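
A multi-class matrix is produced the same way; you simply pass the list of class labels. The sketch below uses made-up Cat/Dog/Bird labels, so the exact counts are only illustrative.

```python
from sklearn.metrics import confusion_matrix

# Made-up animal labels for illustration
y_true = ["Cat", "Cat", "Dog", "Dog", "Bird", "Bird", "Cat", "Dog"]
y_pred = ["Cat", "Dog", "Dog", "Dog", "Bird", "Cat", "Cat", "Bird"]

labels = ["Cat", "Dog", "Bird"]

# Rows = actual class, columns = predicted class, in the order of `labels`
cm = confusion_matrix(y_true, y_pred, labels=labels)

print(labels)
print(cm)
# [[2 1 0]    <- actual Cat:  2 correct, 1 confused with Dog
#  [0 2 1]    <- actual Dog:  2 correct, 1 confused with Bird
#  [1 0 1]]   <- actual Bird: 1 correct, 1 confused with Cat
```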

## 5. Summary: What can we calculate from here?

The Confusion Matrix is the "mother" of all classification metrics. From these four counts we can derive the following (a short sketch follows the list):

* **Accuracy:** (TP + TN) / (TP + TN + FP + FN), the overall share of correct predictions.
* **Precision:** TP / (TP + FP), how many of the predicted positives were actually positive.
* **Recall:** TP / (TP + FN), how many of the actual positives were found.
* **F1-Score:** The harmonic mean of Precision and Recall.
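
Here is a minimal sketch that derives all four metrics by hand from the quadrant counts of the binary example earlier on this page; the variable names are only illustrative.

```python
# Quadrant counts from the binary example above
tp, tn, fp, fn = 3, 3, 1, 1

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.2f}")   # 0.75
print(f"Precision: {precision:.2f}")  # 0.75
print(f"Recall:    {recall:.2f}")     # 0.75
print(f"F1-Score:  {f1:.2f}")         # 0.75
```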

## References

* **StatQuest:** [Confusion Matrices Explained](https://www.youtube.com/watch?v=Kdsp6soqA7o)
* **Scikit-Learn:** [Confusion Matrix API](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)

---

**Now that you can see where the model is making mistakes, let's learn how to turn those mistakes into a single score.**
---
title: "F1-Score: The Balanced Metric"
sidebar_label: F1-Score
description: "Mastering the harmonic mean of Precision and Recall to evaluate models on imbalanced datasets."
tags: [machine-learning, model-evaluation, metrics, f1-score, classification]
---

The **F1-Score** combines [Precision](./precision) and [Recall](./recall) into a single value. It is particularly useful when you have an imbalanced dataset and need to strike a balance between False Positives and False Negatives.

## 1. The Mathematical Formula

The F1-Score is the **harmonic mean** of Precision and Recall. Unlike a simple average, the harmonic mean punishes extreme values. If either Precision or Recall is very low, the F1-Score will also be low.

$$
F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$


### Why use the Harmonic Mean?

If we used a standard arithmetic average, a model with 1.0 Precision and 0.0 Recall would have a "decent" score of 0.5. However, such a model is useless. The harmonic mean ensures that if one metric is 0, the total score is 0.
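
A two-line sketch makes the difference concrete: with Precision = 1.0 and Recall = 0.0, the arithmetic mean still reports 0.5, while the harmonic mean (the F1-Score) collapses to 0.

```python
precision, recall = 1.0, 0.0

arithmetic_mean = (precision + recall) / 2

# Guard against division by zero when both metrics are 0
harmonic_mean = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

print(arithmetic_mean)  # 0.5
print(harmonic_mean)    # 0.0
```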

## 2. When to Use the F1-Score

F1-Score is the best choice when:

1. **Imbalanced Classes:** You have a large number of "Negative" samples and few "Positive" ones (e.g., Fraud detection).
2. **Equal Importance:** You care equally about minimizing False Positives (Precision) and False Negatives (Recall).

## 3. Visualizing the Balance

Think of the F1-Score as a "balance scale." If you tilt too far toward catching everyone (Recall), your Precision drops. If you tilt too far toward flagging only the surest cases (Precision), you miss actual positives and Recall drops. The F1-Score is highest when the two are in equilibrium, as the sketch after the diagram illustrates.

```mermaid
graph TB
SCALE["$$\text{F1-Score}$$<br/>$$\text{Balance Scale}$$"]

%% Precision Side
SCALE --> P["$$\text{Precision}$$"]
P --> P1["$$\text{Few False Positives}$$"]
P1 --> P2["$$\text{Strict Threshold}$$"]
P2 --> P3["$$\text{Misses True Positives}$$"]
P3 --> P4["$$\text{Low Recall}$$"]

%% Recall Side
SCALE --> R["$$\text{Recall}$$"]
R --> R1["$$\text{Few False Negatives}$$"]
R1 --> R2["$$\text{Loose Threshold}$$"]
R2 --> R3["$$\text{Many False Positives}$$"]
R3 --> R4["$$\text{Low Precision}$$"]

%% Balance Point
P4 -.->|"$$\text{Too Strict}$$"| UNBAL["$$\text{Unbalanced Model}$$"]
R4 -.->|"$$\text{Too Loose}$$"| UNBAL

P --> BAL["$$\text{Equilibrium}$$"]
R --> BAL

BAL --> F1["$$\text{F1} = 2 \cdot \frac{P \cdot R}{P + R}$$"]
F1 --> OPT["$$\text{Maximum F1-Score}$$"]

```
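
To see the trade-off numerically, the sketch below sweeps a decision threshold over some made-up probability scores and reports Precision, Recall, and F1 at each setting; the scores and thresholds are purely illustrative.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up true labels and predicted probabilities for illustration
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.6, 0.45, 0.55, 0.7, 0.8, 0.9]

for threshold in [0.3, 0.5, 0.7]:
    # Convert probabilities into hard 0/1 predictions at this threshold
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}  f1={f1:.2f}")

# Output (approximately):
# threshold=0.3  precision=0.62  recall=1.00  f1=0.77
# threshold=0.5  precision=0.80  recall=0.80  f1=0.80   <- F1 peaks where P and R balance
# threshold=0.7  precision=1.00  recall=0.60  f1=0.75
```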

## 4. Implementation with Scikit-Learn

```python
from sklearn.metrics import f1_score

# Actual target values
y_true = [0, 1, 1, 0, 1, 1, 0]

# Model predictions
y_pred = [0, 1, 0, 0, 1, 1, 1]

# Calculate F1-Score
score = f1_score(y_true, y_pred)

print(f"F1-Score: {score:.2f}")
# Output: F1-Score: 0.75

```

## 5. Summary Table: Which Metric to Trust?

| Scenario | Best Metric | Why? |
| --- | --- | --- |
| **Balanced Data** | **Accuracy** | Simple and representative. |
| **Spam Filter** | **Precision** | False Positives (real mail in spam) are very bad. |
| **Cancer Screen** | **Recall** | False Negatives (missing a sick patient) are fatal. |
| **Fraud Detection** | **F1-Score** | Need to catch thieves (Recall) without blocking everyone (Precision). |

## 6. Beyond Binary: Macro vs. Weighted F1

If you have more than two classes (multi-class classification), you'll see these averaging options (a quick sketch follows the list):

* **Macro F1:** Calculates F1 for each class and takes the unweighted average. Treats all classes as equal.
* **Weighted F1:** Calculates F1 for each class but weights them by the number of samples in that class.
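
A minimal sketch of the two averaging modes, using made-up three-class labels (the exact numbers are only illustrative):

```python
from sklearn.metrics import f1_score

# Made-up multi-class labels for illustration
y_true = ["Cat", "Cat", "Cat", "Cat", "Dog", "Dog", "Bird", "Bird"]
y_pred = ["Cat", "Cat", "Cat", "Dog", "Dog", "Cat", "Bird", "Cat"]

macro    = f1_score(y_true, y_pred, average="macro")     # every class counts equally
weighted = f1_score(y_true, y_pred, average="weighted")  # classes weighted by their sample counts

print(f"Macro F1:    {macro:.3f}")     # 0.611
print(f"Weighted F1: {weighted:.3f}")  # 0.625
```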

## References

* **Scikit-Learn:** [F1 Score Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html)
* **Towards Data Science:** [The F1 Score Paradox](https://towardsdatascience.com/the-f1-score-2236378a31)

---

**The F1-Score gives us a snapshot at a single threshold. But how do we evaluate a model's performance across ALL possible thresholds?**