This project implements a BERT-based machine learning model to automatically classify database column names as Sensitive or Non-Sensitive.
The system is designed to support secure data sharing, privacy preservation, and AI-assisted compliance validation by identifying sensitive attributes before data is exchanged.
This work is suitable for academic research, enterprise data governance, and compliance-oriented ML pipelines.
- Fine-tunes BERT (bert-base-uncased) using Hugging Face Transformers
- Binary classification: 0 → Non-Sensitive, 1 → Sensitive
- Trains on custom labeled datasets (CSV)
- Evaluation using accuracy
- Saves trained model and tokenizer for reuse
- Ready for extensions (masking, compliance scoring, SOC-2 / HIPAA)
| Component | Description |
|---|---|
| Base Model | bert-base-uncased |
| Task | Sequence Classification |
| Output Labels | 2 (Sensitive / Non-Sensitive) |
| Framework | Hugging Face Transformers + PyTorch |
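The setup in the table maps directly onto the Transformers API. A minimal sketch of the model and tokenizer initialization (illustrative only; the project's actual code lives in main.py):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# bert-base-uncased with a fresh 2-label classification head
# (0 = Non-Sensitive, 1 = Sensitive).
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
```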
The dataset must be a CSV file named sensitive_dataset.csv with the following columns:
| Column | Description |
|---|---|
| text | Database field or column name |
| label | 1 = Sensitive, 0 = Non-Sensitive |
Example:

```csv
text,label
birthDate,1
email,1
phone_number,1
ssn,1
country,0
department,0
jwtToken,1
```
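One way to load and split this file for fine-tuning (a sketch, assuming pandas and scikit-learn are available; the 80/20 split and max_length=32 are illustrative choices, not values taken from main.py):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer

# Read the labeled column names and hold out a stratified validation split.
df = pd.read_csv("sensitive_dataset.csv")
train_texts, val_texts, train_labels, val_labels = train_test_split(
    df["text"].tolist(), df["label"].tolist(),
    test_size=0.2, stratify=df["label"], random_state=42,
)

# Column names are short strings, so a small max_length is plenty.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=32)
```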
Make sure you have Python 3 installed on your machine; running `python3 --version` should print the installed version, e.g. Python 3.14.0.
From the project root directory, run `python3 -m pip install -r requirements.txt` to install all dependencies, then start training with:
```bash
python3 main.py
```
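A condensed fine-tuning loop of the kind main.py performs, using the Trainer API, would look roughly as follows (a sketch only: the hyperparameters, the inline two-row dataset, and the saved_model output directory are all assumptions for illustration):

```python
import numpy as np
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=32)

# Tiny inline stand-in; the project trains on sensitive_dataset.csv.
data = Dataset.from_dict({"text": ["ssn", "country"], "label": [1, 0]})
data = data.map(tokenize, batched=True)

def accuracy(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, -1) == labels).mean())}

args = TrainingArguments(output_dir="out", num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args, train_dataset=data,
                  eval_dataset=data, compute_metrics=accuracy)
trainer.train()

trainer.save_model("saved_model")          # model weights + config
tokenizer.save_pretrained("saved_model")   # tokenizer files alongside
```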
Input:

```python
test_texts = ["birthDate", "birth_year", "country", "DATE_BIRTH"]
```

Output:

```
birthDate: Sensitive
birth_year: Sensitive
country: Non-Sensitive
DATE_BIRTH: Sensitive
```
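A minimal inference sketch that would produce output like the above (the saved_model directory is the assumed save path from the training sketch, not necessarily what main.py uses):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_dir = "saved_model"  # assumed location of the saved model + tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir)
model.eval()

labels = {0: "Non-Sensitive", 1: "Sensitive"}
test_texts = ["birthDate", "birth_year", "country", "DATE_BIRTH"]

with torch.no_grad():
    inputs = tokenizer(test_texts, padding=True, truncation=True,
                       return_tensors="pt")
    preds = model(**inputs).logits.argmax(dim=-1)

for text, pred in zip(test_texts, preds):
    print(f"{text}: {labels[pred.item()]}")
```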
Among all evaluated models, the fine-tuned BERT-base model achieved the highest validation performance and was selected as the final model for HIPAA-sensitive column name detection.
BERT consistently outperformed both classical machine learning baselines (Logistic Regression, Random Forest, SVM) and other transformer-based models (RoBERTa, GPT-2), demonstrating superior contextual understanding of structured field names.
The fine-tuned BERT-base model is publicly available on the Hugging Face Hub: https://huggingface.co/barek2k2/bert_hipaa_sensitive_db_schema
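Because the checkpoint is public, it can be used directly from the Hub with the pipeline API; a quick sketch (note that the returned label names depend on the repo's id-to-label config and may be the generic LABEL_0 / LABEL_1):

```python
from transformers import pipeline

# Downloads the published checkpoint from the Hugging Face Hub.
classifier = pipeline("text-classification",
                      model="barek2k2/bert_hipaa_sensitive_db_schema")

print(classifier(["ssn", "department"]))
# e.g. [{'label': 'LABEL_1', 'score': ...}, {'label': 'LABEL_0', 'score': ...}]
```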
| Metric | Value |
|---|---|
| Validation Accuracy | 99.4305% |
| Precision (Sensitive = 1) | 0.9982 |
| Recall (Sensitive = 1) | 0.9928 |
| F1-score (Sensitive = 1) | 0.9955 |
| Validation Samples | 878 |
| Non-Sensitive Support | 323 |
| Sensitive Support | 555 |
Rows = True labels, Columns = Predicted labels
| | Pred Non-Sensitive | Pred Sensitive |
|---|---|---|
| True Non-Sensitive | 322 | 1 |
| True Sensitive | 4 | 551 |
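The headline metrics can be recomputed from this matrix as a sanity check; the values match the table above:

```python
# Cells from the confusion matrix above.
tn, fp = 322, 1    # true Non-Sensitive row
fn, tp = 4, 551    # true Sensitive row

precision = tp / (tp + fp)                                 # 551/552 ≈ 0.9982
recall    = tp / (tp + fn)                                 # 551/555 ≈ 0.9928
f1        = 2 * precision * recall / (precision + recall)  # ≈ 0.9955
accuracy  = (tp + tn) / (tp + tn + fp + fn)                # 873/878 ≈ 99.4305%

print(f"precision={precision:.4f} recall={recall:.4f} "
      f"f1={f1:.4f} accuracy={accuracy:.4%}")
```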
The full per-class classification report:

```
               precision    recall  f1-score   support

Non-Sensitive       0.99      1.00      0.99       323
    Sensitive       1.00      0.99      1.00       555

     accuracy                           0.99       878
    macro avg       0.99      0.99      0.99       878
 weighted avg       0.99      0.99      0.99       878
```
| Category | Model | Validation Accuracy |
|---|---|---|
| LLM-based Models | BERT-base | 99.4305% |
| | RoBERTa-base | 95.4442% |
| | RoBERTa-large | 97.7221% |
| | GPT-2 | 93.7358% |
| Classical ML Models | SVM (Linear) | 95.4442% |
| | Random Forest | 94.5330% |
| | Logistic Regression | 84.0547% |
Best Overall Model: BERT-base (99.4305%)
Best Classical Model: SVM (95.4442%)
This project is intended for academic and research use only.