---
title: "Accuracy: The Intuitive Metric"
sidebar_label: Accuracy
description: "Understanding the most common evaluation metric, its formula, and its fatal flaws in imbalanced datasets."
tags: [machine-learning, model-evaluation, metrics, classification]
---

**Accuracy** is the most basic and intuitive metric used to evaluate a classification model. In simple terms, it answers the question: *"Out of all the predictions made, how many were correct?"*

## 1. The Mathematical Formula

Accuracy is calculated by dividing the number of correct predictions by the total number of input samples.

Using the components of a [Confusion Matrix](./confusion-matrix), the formula is:

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$

Where:

* **TP (True Positives):** Correctly predicted positive samples.
* **TN (True Negatives):** Correctly predicted negative samples.
* **FP (False Positives):** Incorrectly predicted as positive.
* **FN (False Negatives):** Incorrectly predicted as negative.

**Example:**

Imagine you have a dataset of 100 emails, where 80 are spam and 20 are not spam. Your model makes the following predictions:

| Actual \ Predicted | Spam | Not Spam |
| --- | --- | --- |
| **Spam** | 70 (TP) | 10 (FN) |
| **Not Spam** | 5 (FP) | 15 (TN) |

Using the formula:

$$
\text{Accuracy} = \frac{70 + 15}{70 + 15 + 5 + 10} = \frac{85}{100} = 0.85 \text{ or } 85\%
$$

This means your model correctly identified 85% of the emails.
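
As a quick sanity check, the same arithmetic can be reproduced in a few lines of Python. This is a minimal sketch using the counts from the table above; the variable names are only illustrative.

```python
# Counts from the spam example above
tp, tn, fp, fn = 70, 15, 5, 10

# Accuracy = correct predictions / all predictions
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"Accuracy: {accuracy:.2%}")  # Accuracy: 85.00%
```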

## 2. When Accuracy Works Best

Accuracy is a reliable metric **only** when your dataset is **balanced**.

* **Example:** You are building a model to classify images as either "Cats" or "Dogs," and your dataset has 500 cats and 500 dogs.
* If this model reaches 90% accuracy, that number is meaningful: neither class can dominate the score, so the model must be doing reasonably well on both.

## 3. The "Accuracy Paradox" (Imbalanced Data)

Accuracy becomes highly misleading when one class significantly outweighs the other. This is known as the **Accuracy Paradox**.

### The Scenario:

Imagine a Rare Disease test where only **1%** of the population is actually sick.

1. If a "lazy" model is programmed to simply say **"Healthy"** for every single patient...
2. It will be **99% accurate**.

```mermaid
graph LR
POP["$$\text{Population (100\%)}$$"]

POP --> H["$$99\% \ \text{Healthy}$$"]
POP --> S["$$1\% \ \text{Sick (Rare Disease)}$$"]

%% Lazy Model
H --> PH["$$\text{Predicted: Healthy}$$"]
S --> PS["$$\text{Predicted: Healthy}$$"]

PH --> ACC1["$$\text{True Negatives (99\%)}$$"]
PS --> ERR1["$$\text{False Negatives (1\%)}$$"]

ACC1 --> MET["$$\text{Accuracy} = \frac{99}{100} = 99\%$$"]

ERR1 --> FAIL["$$\text{❌ All Sick Patients Missed}$$"]

MET -.->|"$$\text{Accuracy Paradox}$$"| FAIL

```

**The problem?** Even though the accuracy is 99%, the model failed to find the 1% of people who actually need help. In high-stakes fields like medicine or fraud detection, accuracy alone is often the least informative metric.
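
To see the paradox in code, here is a minimal sketch of the scenario above: a hypothetical population of 100 patients (99 healthy, 1 sick) scored against a "lazy" model that always predicts "Healthy." `accuracy_score` and `recall_score` are the standard scikit-learn helpers.

```python
from sklearn.metrics import accuracy_score, recall_score

# 99 healthy patients (0) and 1 sick patient (1)
y_true = [0] * 99 + [1]

# A "lazy" model that predicts "Healthy" for every single patient
y_pred = [0] * 100

print(f"Accuracy: {accuracy_score(y_true, y_pred):.2%}")  # 99.00%
print(f"Recall:   {recall_score(y_true, y_pred):.2%}")    # 0.00% - every sick patient missed
```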

## 4. Implementation with Scikit-Learn

```python
from sklearn.metrics import accuracy_score

# Actual target values
y_true = [0, 1, 1, 0, 1, 1]

# Model predictions
y_pred = [0, 1, 0, 0, 1, 1]

# Calculate Accuracy
score = accuracy_score(y_true, y_pred)

print(f"Accuracy: {score * 100:.2f}%")
# Output: Accuracy: 83.33%

```

## 5. Pros and Cons

| Advantages | Disadvantages |
| --- | --- |
| **Simple to understand:** Easy to explain to non-technical stakeholders. | **Useless for Imbalance:** Can hide poor performance on minority classes. |
| **Single Number:** Provides a quick, high-level overview of model health. | **Ignores Probability:** Doesn't tell you how confident the model was in its choice. |
| **Standardized:** Used across almost every classification project. | **Cost Blind:** Treats "False Positives" and "False Negatives" as equally bad. |

## 6. Moving Beyond Accuracy

To get a true picture of your model's performance, especially when your data is skewed, look at Accuracy alongside the following metrics (a combined sketch follows the list):

* **Precision:** How many of the predicted positives were actually positive?
* **Recall:** How many of the actual positives did we successfully find?
* **F1-Score:** The harmonic mean of Precision and Recall.
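
As a rough illustration of how these metrics diverge from Accuracy, the sketch below scores a small made-up imbalanced example with all four; the exact numbers are only illustrative.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# A small imbalanced example: 8 negatives, 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")   # 0.80 - looks fine
print(f"Precision: {precision_score(y_true, y_pred):.2f}")  # 0.50
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")     # 0.50
print(f"F1-Score:  {f1_score(y_true, y_pred):.2f}")         # 0.50
```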

## References

* **Google Developers:** [Classification: Accuracy](https://developers.google.com/machine-learning/crash-course/classification/accuracy)
* **StatQuest:** [Accuracy, Precision, and Recall](https://www.youtube.com/watch?v=Kdsp6soqA7o)

---

**If Accuracy isn't enough to catch rare diseases or credit card fraud, what is?** Stay tuned for our next chapter on **Precision & Recall** to find out!
---
title: The Confusion Matrix
sidebar_label: Confusion Matrix
description: "The foundation of classification evaluation: True Positives, False Positives, True Negatives, and False Negatives."
tags: [machine-learning, model-evaluation, metrics, classification, confusion-matrix]
---

A **Confusion Matrix** is a table used to describe the performance of a classification model. While "Accuracy" tells you how often the model is correct, the Confusion Matrix tells you exactly **how** it is failing and which classes are being swapped.

## 1. The 2x2 Layout

For a binary classification (Yes/No, Spam/Ham), the matrix consists of four quadrants:

| | Predicted: **Negative** | Predicted: **Positive** |
| :--- | :--- | :--- |
| **Actual: Negative** | **True Negative (TN)** | **False Positive (FP)** |
| **Actual: Positive** | **False Negative (FN)** | **True Positive (TP)** |

### Breaking Down the Quadrants:
* **True Positive (TP):** You predicted positive, and it was true. (e.g., You predicted a patient has cancer, and they do).
* **True Negative (TN):** You predicted negative, and it was true. (e.g., You predicted a patient is healthy, and they are).
* **False Positive (FP):** You predicted positive, but it was false. (Also known as a **Type I Error** or a "False Alarm").
* **False Negative (FN):** You predicted negative, but it was positive. (Also known as a **Type II Error** or a "Miss").
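
The four quadrants map directly onto scikit-learn's `confusion_matrix` output: for binary 0/1 labels the returned array is laid out as `[[TN, FP], [FN, TP]]`, so it can be unpacked with `.ravel()`. A minimal sketch, reusing the same toy labels as the implementation section below:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = positive class, 0 = negative class
y_true = [0, 1, 0, 1, 0, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# For binary labels the matrix is laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
# TP=3, TN=3, FP=1, FN=1
```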

## 2. Type I vs. Type II Errors

The "cost" of these errors depends entirely on your specific problem.

```mermaid
graph TB
TITLE["$$\text{Type I vs. Type II Errors}$$"]

%% Ground Truth
TITLE --> TRUTH["$$\text{Actual Condition}$$"]
TRUTH --> POS["$$\text{Positive (Condition Present)}$$"]
TRUTH --> NEG["$$\text{Negative (Condition Absent)}$$"]

%% Model Decisions
POS --> TP["$$\text{True Positive}$$"]
POS --> FN["$$\text{Type II Error}$$<br/>$$\text{False Negative}$$"]

NEG --> TN["$$\text{True Negative}$$"]
NEG --> FP["$$\text{Type I Error}$$<br/>$$\text{False Positive}$$"]

%% Costs
FP --> COST1["$$\text{Cost Depends on Context}$$"]
FN --> COST2["$$\text{Cost Depends on Context}$$"]

%% Examples
COST1 --> EX1["$$\text{Example: Spam Filter}$$<br/>$$\text{Important Email Blocked}$$"]
COST2 --> EX2["$$\text{Example: Medical Test}$$<br/>$$\text{Disease Missed}$$"]

%% Emphasis
EX1 -.->|"$$\text{Type I Cost High}$$"| FP
EX2 -.->|"$$\text{Type II Cost High}$$"| FN

```

* **In Cancer Detection:** A **Type II Error (FN)** is much worse because a sick patient goes untreated.
* **In Spam Filtering:** A **Type I Error (FP)** is worse because an important work email is hidden in the trash.

## 3. Implementation with Scikit-Learn

```python
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Actual values and Model predictions
y_true = [0, 1, 0, 1, 0, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# 1. Generate the matrix
cm = confusion_matrix(y_true, y_pred)

# 2. Visualize it
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Negative', 'Positive'])
disp.plot(cmap=plt.cm.Blues)
plt.show()

```

## 4. Multi-Class Confusion Matrices

The matrix isn't just for binary problems. If you are classifying "Cat," "Dog," and "Bird," your matrix will be 3x3. The diagonal line from top-left to bottom-right represents correct predictions. Any numbers off that diagonal show you which animals the model is confusing.

```mermaid
graph TB
TITLE["$$\text{Multi-Class Confusion Matrix (3×3)}$$"]

%% Axes
TITLE --> ACT["$$\text{Actual Class}$$"]
TITLE --> PRED["$$\text{Predicted Class}$$"]

ACT --> CAT_A["$$\text{Cat}$$"]
ACT --> DOG_A["$$\text{Dog}$$"]
ACT --> BIRD_A["$$\text{Bird}$$"]

PRED --> CAT_P["$$\text{Cat}$$"]
PRED --> DOG_P["$$\text{Dog}$$"]
PRED --> BIRD_P["$$\text{Bird}$$"]

%% Diagonal (Correct Predictions)
CAT_A -->|"$$\text{Correct}$$"| CAT_P
DOG_A -->|"$$\text{Correct}$$"| DOG_P
BIRD_A -->|"$$\text{Correct}$$"| BIRD_P

%% Off-Diagonal (Confusions)
CAT_A -->|"$$\text{Confusion}$$"| DOG_P
CAT_A -->|"$$\text{Confusion}$$"| BIRD_P

DOG_A -->|"$$\text{Confusion}$$"| CAT_P
DOG_A -->|"$$\text{Confusion}$$"| BIRD_P

BIRD_A -->|"$$\text{Confusion}$$"| CAT_P
BIRD_A -->|"$$\text{Confusion}$$"| DOG_P

%% Emphasis
CAT_P -.->|"$$\text{Diagonal}$$"| GOOD["$$\text{Correct Predictions}$$"]
DOG_P -.->|"$$\text{Diagonal}$$"| GOOD
BIRD_P -.->|"$$\text{Diagonal}$$"| GOOD

DOG_P -.->|"$$\text{Off-Diagonal}$$"| BAD["$$\text{Model Confusion}$$"]
BIRD_P -.->|"$$\text{Off-Diagonal}$$"| BAD

```
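
A multi-class matrix is produced the same way; you simply pass the list of class labels. The sketch below uses made-up Cat/Dog/Bird labels, so the exact counts are only illustrative.

```python
from sklearn.metrics import confusion_matrix

# Made-up animal labels for illustration
y_true = ["Cat", "Cat", "Dog", "Dog", "Bird", "Bird", "Cat", "Dog"]
y_pred = ["Cat", "Dog", "Dog", "Dog", "Bird", "Cat", "Cat", "Bird"]

labels = ["Cat", "Dog", "Bird"]

# Rows = actual class, columns = predicted class, in the order of `labels`
cm = confusion_matrix(y_true, y_pred, labels=labels)

print(labels)
print(cm)
# [[2 1 0]    <- actual Cat:  2 correct, 1 confused with Dog
#  [0 2 1]    <- actual Dog:  2 correct, 1 confused with Bird
#  [1 0 1]]   <- actual Bird: 1 correct, 1 confused with Cat
```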

## 5. Summary: What can we calculate from here?

The Confusion Matrix is the "mother" of all classification metrics. From these four counts we can derive the following (a short sketch follows the list):

* **Accuracy:** (TP + TN) / (TP + TN + FP + FN), the overall share of correct predictions.
* **Precision:** TP / (TP + FP), how many of the predicted positives were actually positive.
* **Recall:** TP / (TP + FN), how many of the actual positives were found.
* **F1-Score:** The harmonic mean of Precision and Recall.
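
Here is a minimal sketch that derives all four metrics by hand from the quadrant counts of the binary example earlier on this page; the variable names are only illustrative.

```python
# Quadrant counts from the binary example above
tp, tn, fp, fn = 3, 3, 1, 1

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.2f}")   # 0.75
print(f"Precision: {precision:.2f}")  # 0.75
print(f"Recall:    {recall:.2f}")     # 0.75
print(f"F1-Score:  {f1:.2f}")         # 0.75
```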

## References

* **StatQuest:** [Confusion Matrices Explained](https://www.youtube.com/watch?v=Kdsp6soqA7o)
* **Scikit-Learn:** [Confusion Matrix API](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)

---

**Now that you can see where the model is making mistakes, let's learn how to turn those mistakes into a single score.**
---
title: "F1-Score: The Balanced Metric"
sidebar_label: F1-Score
description: "Mastering the harmonic mean of Precision and Recall to evaluate models on imbalanced datasets."
tags: [machine-learning, model-evaluation, metrics, f1-score, classification]
---

The **F1-Score** combines [Precision](./precision) and [Recall](./recall) into a single value. It is particularly useful when you have an imbalanced dataset and need to strike a balance between False Positives and False Negatives.

## 1. The Mathematical Formula

The F1-Score is the **harmonic mean** of Precision and Recall. Unlike a simple average, the harmonic mean punishes extreme values. If either Precision or Recall is very low, the F1-Score will also be low.

$$
F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$


### Why use the Harmonic Mean?

If we used a standard arithmetic average, a model with 1.0 Precision and 0.0 Recall would have a "decent" score of 0.5. However, such a model is useless. The harmonic mean ensures that if one metric is 0, the total score is 0.
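
A two-line sketch makes the difference concrete: with Precision = 1.0 and Recall = 0.0, the arithmetic mean still reports 0.5, while the harmonic mean (the F1-Score) collapses to 0.

```python
precision, recall = 1.0, 0.0

arithmetic_mean = (precision + recall) / 2

# Guard against division by zero when both metrics are 0
harmonic_mean = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

print(arithmetic_mean)  # 0.5
print(harmonic_mean)    # 0.0
```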

## 2. When to Use the F1-Score

F1-Score is the best choice when:

1. **Imbalanced Classes:** You have a large number of "Negative" samples and few "Positive" ones (e.g., Fraud detection).
2. **Equal Importance:** You care equally about minimizing False Positives (Precision) and False Negatives (Recall).

## 3. Visualizing the Balance

Think of the F1-Score as a "balance scale." If you tilt too far toward catching everyone (Recall), your Precision drops. If you tilt too far toward flagging only the surest cases (Precision), you miss actual positives and Recall drops. The F1-Score is highest when the two are in equilibrium, as the sketch after the diagram illustrates.

```mermaid
graph TB
SCALE["$$\text{F1-Score}$$<br/>$$\text{Balance Scale}$$"]

%% Precision Side
SCALE --> P["$$\text{Precision}$$"]
P --> P1["$$\text{Few False Positives}$$"]
P1 --> P2["$$\text{Strict Threshold}$$"]
P2 --> P3["$$\text{Misses True Positives}$$"]
P3 --> P4["$$\text{Low Recall}$$"]

%% Recall Side
SCALE --> R["$$\text{Recall}$$"]
R --> R1["$$\text{Few False Negatives}$$"]
R1 --> R2["$$\text{Loose Threshold}$$"]
R2 --> R3["$$\text{Many False Positives}$$"]
R3 --> R4["$$\text{Low Precision}$$"]

%% Balance Point
P4 -.->|"$$\text{Too Strict}$$"| UNBAL["$$\text{Unbalanced Model}$$"]
R4 -.->|"$$\text{Too Loose}$$"| UNBAL

P --> BAL["$$\text{Equilibrium}$$"]
R --> BAL

BAL --> F1["$$\text{F1} = 2 \cdot \frac{P \cdot R}{P + R}$$"]
F1 --> OPT["$$\text{Maximum F1-Score}$$"]

```
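
To see the trade-off numerically, the sketch below sweeps a decision threshold over some made-up probability scores and reports Precision, Recall, and F1 at each setting; the scores and thresholds are purely illustrative.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up true labels and predicted probabilities for illustration
y_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.6, 0.45, 0.55, 0.7, 0.8, 0.9]

for threshold in [0.3, 0.5, 0.7]:
    # Convert probabilities into hard 0/1 predictions at this threshold
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}  f1={f1:.2f}")

# Output (approximately):
# threshold=0.3  precision=0.62  recall=1.00  f1=0.77
# threshold=0.5  precision=0.80  recall=0.80  f1=0.80   <- F1 peaks where P and R balance
# threshold=0.7  precision=1.00  recall=0.60  f1=0.75
```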

## 4. Implementation with Scikit-Learn

```python
from sklearn.metrics import f1_score

# Actual target values
y_true = [0, 1, 1, 0, 1, 1, 0]

# Model predictions
y_pred = [0, 1, 0, 0, 1, 1, 1]

# Calculate F1-Score
score = f1_score(y_true, y_pred)

print(f"F1-Score: {score:.2f}")
# Output: F1-Score: 0.75

```

## 5. Summary Table: Which Metric to Trust?

| Scenario | Best Metric | Why? |
| --- | --- | --- |
| **Balanced Data** | **Accuracy** | Simple and representative. |
| **Spam Filter** | **Precision** | False Positives (real mail in spam) are very bad. |
| **Cancer Screen** | **Recall** | False Negatives (missing a sick patient) are fatal. |
| **Fraud Detection** | **F1-Score** | Need to catch thieves (Recall) without blocking everyone (Precision). |

## 6. Beyond Binary: Macro vs. Weighted F1

If you have more than two classes (multi-class classification), you'll see these averaging options (a quick sketch follows the list):

* **Macro F1:** Calculates F1 for each class and takes the unweighted average. Treats all classes as equal.
* **Weighted F1:** Calculates F1 for each class but weights them by the number of samples in that class.
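
A minimal sketch of the two averaging modes, using made-up three-class labels (the exact numbers are only illustrative):

```python
from sklearn.metrics import f1_score

# Made-up multi-class labels for illustration
y_true = ["Cat", "Cat", "Cat", "Cat", "Dog", "Dog", "Bird", "Bird"]
y_pred = ["Cat", "Cat", "Cat", "Dog", "Dog", "Cat", "Bird", "Cat"]

macro    = f1_score(y_true, y_pred, average="macro")     # every class counts equally
weighted = f1_score(y_true, y_pred, average="weighted")  # classes weighted by their sample counts

print(f"Macro F1:    {macro:.3f}")     # 0.611
print(f"Weighted F1: {weighted:.3f}")  # 0.625
```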

## References

* **Scikit-Learn:** [F1 Score Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html)
* **Towards Data Science:** [The F1 Score Paradox](https://towardsdatascience.com/the-f1-score-2236378a31)

---

**The F1-Score gives us a snapshot at a single threshold. But how do we evaluate a model's performance across ALL possible thresholds?**