A high-accuracy LLM-based classifier built on BERT for detecting HIPAA-sensitive database column names(PHI), featuring rigorous evaluation, reproducible experiments, and comparisons with classical Machine Learning Models baselines.

barek2k2/phi_binary_classification

PHI Data Field Binary Classification to Support HIPAA Compliance

This project implements a BERT-based machine learning model to automatically classify database column names as Sensitive or Non-Sensitive.
The system is designed to support secure data sharing, privacy preservation, and AI-assisted compliance validation by identifying sensitive attributes before data is exchanged.

This work is suitable for academic research, enterprise data governance, and compliance-oriented ML pipelines.


📌 Key Features

  • Fine-tunes BERT (bert-base-uncased) using Hugging Face Transformers
  • Binary classification:
    • 0 → Non-Sensitive
    • 1 → Sensitive
  • Trains on custom labeled datasets (CSV)
  • Evaluation using accuracy, precision, recall, F1-score, and a confusion matrix
  • Saves trained model and tokenizer for reuse
  • Ready for extensions (masking, compliance scoring, SOC-2 / HIPAA)

🧠 Model Details

| Component | Description |
| --- | --- |
| Base Model | bert-base-uncased |
| Task | Sequence Classification |
| Output Labels | 2 (Sensitive / Non-Sensitive) |
| Framework | Hugging Face Transformers + PyTorch |
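This configuration can be instantiated with the Transformers `Auto*` classes. A minimal sketch; the explicit `id2label`/`label2id` mapping is an assumption based on the 0 = Non-Sensitive / 1 = Sensitive convention above, not something the repository documents:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# num_labels=2 attaches a binary sequence-classification head on top of BERT.
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2,
    id2label={0: "Non-Sensitive", 1: "Sensitive"},
    label2id={"Non-Sensitive": 0, "Sensitive": 1},
)
```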

📊 Dataset Format

The dataset must be a CSV file named sensitive_dataset.csv with the following columns:

| Column | Description |
| --- | --- |
| text | Database field or column name |
| label | 1 = Sensitive, 0 = Non-Sensitive |

Example Dataset

```csv
text,label
birthDate,1
email,1
phone_number,1
ssn,1
country,0
department,0
jwtToken,1
```
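The expected schema can be sanity-checked with a short standard-library script. This is a sketch, not repository code: `load_dataset` is a hypothetical helper name, and the example rows are inlined here in place of reading sensitive_dataset.csv from disk:

```python
import csv
import io

# Inline copy of the example dataset; in practice, open sensitive_dataset.csv.
EXAMPLE = """text,label
birthDate,1
email,1
phone_number,1
ssn,1
country,0
department,0
jwtToken,1
"""

def load_dataset(fp):
    """Parse (text, label) rows and validate the expected CSV schema."""
    rows = []
    for row in csv.DictReader(fp):
        if set(row) != {"text", "label"}:
            raise ValueError(f"unexpected columns: {sorted(row)}")
        label = int(row["label"])
        if label not in (0, 1):
            raise ValueError(f"label must be 0 or 1, got {label}")
        rows.append((row["text"], label))
    return rows

rows = load_dataset(io.StringIO(EXAMPLE))
print(len(rows), "rows,", sum(label for _, label in rows), "sensitive")
# → 7 rows, 5 sensitive
```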

Installation & Setup

Make sure you have Python 3 installed on your machine; running `python3 --version` should print something like `Python 3.14.0`.

Install dependencies

From the project root directory, run the following to install all dependencies:

```shell
python3 -m pip install -r requirements.txt
```

Running the Project

```shell
python3 main.py
```

Inference Example

Input

```python
test_texts = ["birthDate", "birth_year", "country", "DATE_BIRTH"]
```

Output

```text
birthDate: Sensitive
birth_year: Sensitive
country: Non-Sensitive
DATE_BIRTH: Sensitive
```

📊 Overall Performance and Evaluation (BERT-base)

Among all evaluated models, the fine-tuned BERT-base model achieved the highest validation performance and was selected as the final model for HIPAA-sensitive column name detection.

BERT consistently outperformed both classical machine learning baselines (Logistic Regression, Random Forest, SVM) and other transformer-based models (RoBERTa, GPT-2), demonstrating superior contextual understanding of structured field names.

The fine-tuned BERT-base model is publicly deployed on the Hugging Face Hub: https://huggingface.co/barek2k2/bert_hipaa_sensitive_db_schema
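The published checkpoint can be loaded directly with the Transformers `pipeline` API. A sketch only: the exact label strings emitted by the hosted model are not documented here, so none are assumed:

```python
from transformers import pipeline

# Downloads the publicly hosted fine-tuned checkpoint from the Hugging Face Hub.
classifier = pipeline(
    "text-classification",
    model="barek2k2/bert_hipaa_sensitive_db_schema",
)

column_names = ["birthDate", "country"]
predictions = classifier(column_names)
for text, pred in zip(column_names, predictions):
    print(f"{text}: {pred['label']} (score={pred['score']:.4f})")
```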

Evaluation (Validation Set)

| Metric | Value |
| --- | --- |
| Validation Accuracy | 99.4305% |
| Precision (Sensitive = 1) | 0.9982 |
| Recall (Sensitive = 1) | 0.9928 |
| F1-score (Sensitive = 1) | 0.9955 |
| Validation Samples | 878 |
| Non-Sensitive Support | 323 |
| Sensitive Support | 555 |

Confusion Matrix

Rows = True labels, Columns = Predicted labels

| | Pred Non-Sensitive | Pred Sensitive |
| --- | --- | --- |
| True Non-Sensitive | 322 | 1 |
| True Sensitive | 4 | 551 |
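The headline metrics follow directly from these confusion-matrix cells; a quick arithmetic check:

```python
# Confusion-matrix cells from the table above.
tn, fp = 322, 1   # true Non-Sensitive row
fn, tp = 4, 551   # true Sensitive row

total = tn + fp + fn + tp             # 878 validation samples
accuracy = (tp + tn) / total          # 873 / 878
precision = tp / (tp + fp)            # for the Sensitive class
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4%} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
# → accuracy=99.4305% precision=0.9982 recall=0.9928 f1=0.9955
```

These reproduce the reported validation accuracy, precision, recall, and F1-score exactly.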

Classification Report

```text
               precision    recall  f1-score   support

Non-Sensitive       0.99      1.00      0.99       323
Sensitive           1.00      0.99      1.00       555

     accuracy                           0.99       878
    macro avg       0.99      0.99      0.99       878
 weighted avg       0.99      0.99      0.99       878
```

All Models Evaluation Summary

| Category | Model | Validation Accuracy |
| --- | --- | --- |
| LLM-based Models | BERT-base | 99.4305% |
| | RoBERTa-base | 95.4442% |
| | RoBERTa-large | 97.7221% |
| | GPT-2 | 93.7358% |
| Classical ML Models | SVM (Linear) | 95.4442% |
| | Random Forest | 94.5330% |
| | Logistic Regression | 84.0547% |

Best Overall Model: BERT-base (99.4305%)
Best Classical Model: SVM (95.4442%)
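For context, a linear-SVM baseline of this kind can be sketched with scikit-learn. This is illustrative only: the repository's actual feature pipeline is not specified here, the character n-gram TF-IDF features are an assumption, and the tiny inline sample (reusing the README's example rows) stands in for the full labeled CSV:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny inline sample; the real experiments train on the full dataset.
texts = ["birthDate", "email", "phone_number", "ssn",
         "country", "department", "jwtToken"]
labels = [1, 1, 1, 1, 0, 0, 1]

# Character n-grams capture subword cues in column names (e.g. "ssn", "birth"),
# which helps with camelCase / snake_case naming variants.
baseline = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LinearSVC(),
)
baseline.fit(texts, labels)

predictions = baseline.predict(["ssn", "country"])
```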

License

This project is intended for academic and research use only.
