A high-accuracy LLM-based classifier built on BERT for detecting HIPAA-sensitive database column names(PHI), featuring rigorous evaluation, reproducible experiments, and comparisons with classical Machine Learning Models baselines.

barek2k2/phi_binary_classification

PHI Data Field Binary Classification to Support HIPAA Compliance

This project implements a BERT-based machine learning model to automatically classify database column names as Sensitive or Non-Sensitive.
The system is designed to support secure data sharing, privacy preservation, and AI-assisted compliance validation by identifying sensitive attributes before data is exchanged.

This work is suitable for academic research, enterprise data governance, and compliance-oriented ML pipelines.


📌 Key Features

  • Fine-tunes BERT (bert-base-uncased) using Hugging Face Transformers
  • Binary classification:
    • 0 → Non-Sensitive
    • 1 → Sensitive
  • Trains on custom labeled datasets (CSV)
  • Evaluation using accuracy, precision, recall, F1-score, and a confusion matrix
  • Saves trained model and tokenizer for reuse
  • Ready for extensions (masking, compliance scoring, SOC-2 / HIPAA)

🧠 Model Details

| Component | Description |
| --- | --- |
| Base Model | bert-base-uncased |
| Task | Sequence Classification |
| Output Labels | 2 (Sensitive / Non-Sensitive) |
| Framework | Hugging Face Transformers + PyTorch |
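This configuration can be instantiated with the Transformers `Auto*` classes. A minimal sketch; the explicit `id2label`/`label2id` mapping is an assumption based on the 0 = Non-Sensitive / 1 = Sensitive convention above, not something the repository documents:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# num_labels=2 attaches a binary sequence-classification head on top of BERT.
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2,
    id2label={0: "Non-Sensitive", 1: "Sensitive"},
    label2id={"Non-Sensitive": 0, "Sensitive": 1},
)
```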

📊 Dataset Format

The dataset must be a CSV file named sensitive_dataset.csv with the following columns:

| Column | Description |
| --- | --- |
| text | Database field or column name |
| label | 1 = Sensitive, 0 = Non-Sensitive |

Example Dataset

```csv
text,label
birthDate,1
email,1
phone_number,1
ssn,1
country,0
department,0
jwtToken,1
```
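The expected schema can be sanity-checked with a short standard-library script. This is a sketch, not repository code: `load_dataset` is a hypothetical helper name, and the example rows are inlined here in place of reading sensitive_dataset.csv from disk:

```python
import csv
import io

# Inline copy of the example dataset; in practice, open sensitive_dataset.csv.
EXAMPLE = """text,label
birthDate,1
email,1
phone_number,1
ssn,1
country,0
department,0
jwtToken,1
"""

def load_dataset(fp):
    """Parse (text, label) rows and validate the expected CSV schema."""
    rows = []
    for row in csv.DictReader(fp):
        if set(row) != {"text", "label"}:
            raise ValueError(f"unexpected columns: {sorted(row)}")
        label = int(row["label"])
        if label not in (0, 1):
            raise ValueError(f"label must be 0 or 1, got {label}")
        rows.append((row["text"], label))
    return rows

rows = load_dataset(io.StringIO(EXAMPLE))
print(len(rows), "rows,", sum(label for _, label in rows), "sensitive")
# → 7 rows, 5 sensitive
```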

Installation & Setup

Make sure you have Python 3 installed on your machine; running `python3 --version` should print something like `Python 3.14.0`.

Install dependencies

From the project root directory, run the following to install all dependencies:

```shell
python3 -m pip install -r requirements.txt
```

Running the Project

```shell
python3 main.py
```

Inference Example

Input

```python
test_texts = ["birthDate", "birth_year", "country", "DATE_BIRTH"]
```

Output

```text
birthDate: Sensitive
birth_year: Sensitive
country: Non-Sensitive
DATE_BIRTH: Sensitive
```

📊 Overall Performance and Evaluation (BERT-base)

Among all evaluated models, the fine-tuned BERT-base model achieved the highest validation performance and was selected as the final model for HIPAA-sensitive column name detection.

BERT consistently outperformed both classical machine learning baselines (Logistic Regression, Random Forest, SVM) and other transformer-based models (RoBERTa, GPT-2), demonstrating superior contextual understanding of structured field names.

The fine-tuned BERT-base model is publicly deployed on the Hugging Face Hub: https://huggingface.co/barek2k2/bert_hipaa_sensitive_db_schema
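The published checkpoint can be loaded directly with the Transformers `pipeline` API. A sketch only: the exact label strings emitted by the hosted model are not documented here, so none are assumed:

```python
from transformers import pipeline

# Downloads the publicly hosted fine-tuned checkpoint from the Hugging Face Hub.
classifier = pipeline(
    "text-classification",
    model="barek2k2/bert_hipaa_sensitive_db_schema",
)

column_names = ["birthDate", "country"]
predictions = classifier(column_names)
for text, pred in zip(column_names, predictions):
    print(f"{text}: {pred['label']} (score={pred['score']:.4f})")
```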

Evaluation (Validation Set)

| Metric | Value |
| --- | --- |
| Validation Accuracy | 99.4305% |
| Precision (Sensitive = 1) | 0.9982 |
| Recall (Sensitive = 1) | 0.9928 |
| F1-score (Sensitive = 1) | 0.9955 |
| Validation Samples | 878 |
| Non-Sensitive Support | 323 |
| Sensitive Support | 555 |

Confusion Matrix

Rows = True labels, Columns = Predicted labels

| | Pred Non-Sensitive | Pred Sensitive |
| --- | --- | --- |
| True Non-Sensitive | 322 | 1 |
| True Sensitive | 4 | 551 |
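The headline metrics follow directly from these confusion-matrix cells; a quick arithmetic check:

```python
# Confusion-matrix cells from the table above.
tn, fp = 322, 1   # true Non-Sensitive row
fn, tp = 4, 551   # true Sensitive row

total = tn + fp + fn + tp             # 878 validation samples
accuracy = (tp + tn) / total          # 873 / 878
precision = tp / (tp + fp)            # for the Sensitive class
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4%} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
# → accuracy=99.4305% precision=0.9982 recall=0.9928 f1=0.9955
```

These reproduce the reported validation accuracy, precision, recall, and F1-score exactly.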

Classification Report

```text
               precision    recall  f1-score   support

Non-Sensitive       0.99      1.00      0.99       323
Sensitive           1.00      0.99      1.00       555

     accuracy                           0.99       878
    macro avg       0.99      0.99      0.99       878
 weighted avg       0.99      0.99      0.99       878
```

All Models Evaluation Summary

| Category | Model | Validation Accuracy |
| --- | --- | --- |
| LLM-based Models | BERT-base | 99.4305% |
| | RoBERTa-base | 95.4442% |
| | RoBERTa-large | 97.7221% |
| | GPT-2 | 93.7358% |
| Classical ML Models | SVM (Linear) | 95.4442% |
| | Random Forest | 94.5330% |
| | Logistic Regression | 84.0547% |

Best Overall Model: BERT-base (99.4305%)
Best Classical Model: SVM (95.4442%)
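For context, a linear-SVM baseline of this kind can be sketched with scikit-learn. This is illustrative only: the repository's actual feature pipeline is not specified here, the character n-gram TF-IDF features are an assumption, and the tiny inline sample (reusing the README's example rows) stands in for the full labeled CSV:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny inline sample; the real experiments train on the full dataset.
texts = ["birthDate", "email", "phone_number", "ssn",
         "country", "department", "jwtToken"]
labels = [1, 1, 1, 1, 0, 0, 1]

# Character n-grams capture subword cues in column names (e.g. "ssn", "birth"),
# which helps with camelCase / snake_case naming variants.
baseline = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LinearSVC(),
)
baseline.fit(texts, labels)

predictions = baseline.predict(["ssn", "country"])
```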

License

This project is intended for academic and research use only.
