A comprehensive data mining analysis of the Foodmart dataset using K-medoids clustering and Decision Tree Regression to extract meaningful insights from retail sales data

Mariam-coder7/Data-mining-on-FoodMart

Foodmart Data Mining Project

A comprehensive data mining analysis of the Foodmart dataset using K-medoids clustering and Decision Tree Regression to extract meaningful insights from retail sales data.

Project Overview

This project applies advanced data mining techniques to analyze the Foodmart dataset, which contains retail sales transaction data. The analysis combines unsupervised learning (K-medoids clustering) with supervised learning (Decision Tree Regression) to discover customer patterns and predict sales outcomes.

Objectives

  • Customer Segmentation: Use K-medoids clustering to identify distinct customer groups based on purchasing behavior
  • Sales Prediction: Train a Decision Tree Regressor to predict sales values from product, customer, and temporal features
  • Pattern Discovery: Extract actionable insights from retail transaction data
  • Performance Evaluation: Assess model accuracy and clustering quality

Technologies Used

  • Python 3.x
  • Pandas - Data manipulation and analysis
  • NumPy - Numerical computing
  • Scikit-learn - Machine learning algorithms
  • Matplotlib/Seaborn - Data visualization
  • Jupyter Notebook - Interactive development environment

Dataset

The Foodmart dataset is a well-known retail sales dataset containing:

  • Customer transaction records
  • Product information
  • Sales figures
  • Time-based data
  • Store and location details

Methodology

1. K-medoids Clustering

  • Purpose: Customer segmentation and pattern identification
  • Algorithm: Partitioning Around Medoids (PAM)
  • Features: Customer purchasing behavior, transaction frequency, sales amounts
  • Output: 3 customer clusters with distinct characteristics
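The PAM procedure named above can be sketched in plain NumPy. This is a minimal illustration, not the project's actual script: the data is synthetic, the two feature columns are hypothetical, and the SWAP phase accepts any improving swap rather than searching for the single best one. scikit-learn itself does not ship a k-medoids estimator; the separate scikit-learn-extra package provides one (`sklearn_extra.cluster.KMedoids`), which may be what a real run would use.

```python
import numpy as np


def pairwise_distances(X):
    """Euclidean distance matrix between all rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def total_cost(D, medoids):
    """Sum of each point's distance to its nearest medoid."""
    return D[:, medoids].min(axis=1).sum()


def pam(D, k):
    """PAM-style clustering on a precomputed distance matrix D."""
    n = D.shape[0]
    # BUILD: start from the most central point, then greedily add the
    # point that lowers the total cost the most.
    medoids = [int(D.sum(axis=1).argmin())]
    while len(medoids) < k:
        candidates = [i for i in range(n) if i not in medoids]
        costs = [total_cost(D, medoids + [c]) for c in candidates]
        medoids.append(candidates[int(np.argmin(costs))])
    # SWAP: exchange a medoid for a non-medoid whenever that lowers the cost,
    # and repeat until a full pass finds no improving swap.
    best = total_cost(D, medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids.copy()
                trial[pos] = cand
                cost = total_cost(D, trial)
                if cost < best - 1e-12:
                    medoids, best, improved = trial, cost, True
    labels = D[:, medoids].argmin(axis=1)
    return np.array(medoids), labels, best


# Synthetic stand-in for customer features (hypothetical: two numeric columns).
rng = np.random.default_rng(0)
centers = [(0, 0), (10, 0), (5, 9)]
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in centers])

D = pairwise_distances(X)
medoids, labels, cost = pam(D, k=3)
print("medoid row indices:", medoids)
print("cluster sizes:", np.bincount(labels))
```

Unlike k-means centroids, each medoid is an actual row of the data, so a cluster can be described by pointing at one representative customer.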

2. Decision Tree Regressor

  • Purpose: Sales prediction and feature importance analysis
  • Target Variable: Sales amount/revenue
  • Features: Product categories, customer segments, temporal factors
  • Output: Predicted sales values and feature importance rankings
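As a rough illustration of the regression step, the sketch below fits scikit-learn's `DecisionTreeRegressor` and ranks features by importance. The column names, the integer encoding, and the synthetic target are all assumptions made for the example; the README does not list the real Foodmart columns.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
n = 500

# Hypothetical integer-encoded features (not the real Foodmart schema).
product_category = rng.integers(0, 5, size=n)
customer_segment = rng.integers(0, 3, size=n)
month = rng.integers(1, 13, size=n)
X = np.column_stack([product_category, customer_segment, month])

# Synthetic target: sales depend mostly on category, slightly on segment,
# and not at all on month.
y = 10.0 * product_category + 2.0 * customer_segment + rng.normal(0, 1.0, size=n)

model = DecisionTreeRegressor(max_depth=5, random_state=0)
model.fit(X, y)

feature_names = ["product_category", "customer_segment", "month"]
importances = dict(zip(feature_names, model.feature_importances_))
ranking = sorted(importances, key=importances.get, reverse=True)
predictions = model.predict(X[:5])

for name in ranking:
    print(f"{name}: {importances[name]:.3f}")
```

Integer codes impose an order on categories that may not exist in the data; one-hot encoding is a common alternative when that ordering would mislead the splits.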

Getting Started

Prerequisites

pip install pandas numpy scikit-learn matplotlib seaborn jupyter

Installation

  1. Clone the repository:
git clone https://github.com/yourusername/foodmart-data-mining.git
cd foodmart-data-mining
  2. Install required dependencies:
pip install -r requirements.txt
  3. Download the Foodmart dataset and place it in the data/raw/ directory

Usage

  1. Data Preprocessing:
python scripts/data_preprocessing.py
  2. K-medoids Clustering:
python scripts/kmedoids_clustering.py
  3. Decision Tree Analysis:
python scripts/decision_tree_regression.py
  4. Run Complete Analysis:
python main.py

Results

K-medoids Clustering Results

  • Optimal Clusters: X clusters identified using silhouette analysis
  • Customer Segments:
    • High-value customers
    • Frequent buyers
    • Seasonal shoppers
    • Occasional purchasers
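Silhouette analysis, mentioned above as the way the cluster count was chosen, can be sketched as follows. The example scores several candidate values of k with scikit-learn's `silhouette_score` on a precomputed distance matrix and keeps the highest-scoring one. To stay standalone it uses a simplified alternating k-medoids (assign points, then re-pick each medoid) instead of full PAM, and the data is synthetic with three planted groups, so the selected k says nothing about the real dataset.

```python
import numpy as np
from sklearn.metrics import silhouette_score


def simple_kmedoids(D, k, max_iter=100):
    """Alternating k-medoids on a precomputed distance matrix D."""
    # Farthest-first initialisation keeps the example deterministic.
    medoids = [int(D.sum(axis=1).argmin())]
    while len(medoids) < k:
        medoids.append(int(D[:, medoids].min(axis=1).argmax()))
    medoids = np.array(medoids)
    for _ in range(max_iter):
        labels = D[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            # New medoid: the member with the smallest total distance
            # to the other members of its cluster.
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return D[:, medoids].argmin(axis=1)


# Synthetic data with three well-separated groups.
rng = np.random.default_rng(0)
centers = [(0, 0), (10, 0), (5, 9)]
X = np.vstack([rng.normal(c, 0.5, size=(40, 2)) for c in centers])
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))

scores = {}
for k in range(2, 7):
    labels = simple_kmedoids(D, k)
    scores[k] = silhouette_score(D, labels, metric="precomputed")

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("selected k:", best_k)
```

Scores range from -1 to 1; values near 1 mean points sit much closer to their own cluster than to the next nearest one.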

Decision Tree Regressor Performance

  • Mean Squared Error: 1.86
  • R² Score: 0.85
  • Best Parameters: {'max_depth': 7, 'min_samples_leaf': 4, 'min_samples_split': 10}
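The README reports the metrics and the best parameters but not how they were obtained. One plausible route, sketched here as an assumption, is a cross-validated grid search over the three reported hyperparameters followed by evaluation on a held-out split. The grid values are hypothetical (chosen only so that they include the reported ones), and the data is synthetic, so the printed numbers will not match those above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the prepared Foodmart features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 3))
y = 3.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.5, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical search grid over the hyperparameters named in the results.
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [1, 4],
    "min_samples_split": [2, 10],
}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",  # higher is better, so MSE is negated
    cv=5,
)
search.fit(X_train, y_train)

# Evaluate the refitted best estimator on data it has not seen.
y_pred = search.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
best_params = search.best_params_

print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")
print("Best Parameters:", best_params)
```

MSE is expressed in squared units of the target, so it is easiest to interpret next to the scale of the sales values; R² is unitless and compares the model against always predicting the mean.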

Key Insights

  • Most important features for sales prediction
  • Customer segment characteristics
  • Seasonal patterns in sales data
  • Product category performance

Visualizations

The project generates several visualizations:

  • Cluster visualization using PCA
  • Decision tree structure
  • Feature importance plots
  • Sales prediction accuracy charts
  • Customer segment analysis
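The PCA cluster plot listed above could be produced along these lines. This is a sketch under assumptions: the feature matrix and cluster labels are synthetic placeholders, and the output filename in the final comment is hypothetical (only the results/plots/ directory appears in the project structure).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 3 groups of 50 customers, 5 numeric features each.
rng = np.random.default_rng(1)
features = np.vstack(
    [rng.normal(loc=m, scale=1.0, size=(50, 5)) for m in (0.0, 4.0, 8.0)]
)
labels = np.repeat([0, 1, 2], 50)  # placeholder cluster assignments

# Standardise first so no single feature dominates the components.
scaled = StandardScaler().fit_transform(features)
pca = PCA(n_components=2)
coords = pca.fit_transform(scaled)

fig, ax = plt.subplots(figsize=(6, 4))
scatter = ax.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="viridis", s=15)
ax.set_xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.0%} of variance)")
ax.set_ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.0%} of variance)")
ax.set_title("Customer clusters projected onto two principal components")
ax.legend(*scatter.legend_elements(), title="Cluster")
# fig.savefig("results/plots/clusters_pca.png", dpi=150)  # hypothetical filename
```

The explained-variance percentages on the axes indicate how much structure the 2-D view keeps; when they are low, clusters that overlap in the plot may still be separated in the full feature space.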

Project Structure

foodmart-data-mining/
│
├── data/
│   ├── raw/                 # Raw Foodmart dataset
│   └── processed/           # Cleaned and preprocessed data
│
├── notebooks/
│   ├── exploratory_analysis.ipynb
│   ├── clustering_analysis.ipynb
│   └── regression_analysis.ipynb
│
├── scripts/
│   ├── data_preprocessing.py
│   ├── kmedoids_clustering.py
│   ├── decision_tree_regression.py
│   └── visualization.py
│
├── results/
│   ├── models/              # Saved models
│   ├── plots/               # Generated visualizations
│   └── reports/             # Analysis reports
│
├── requirements.txt
├── main.py
└── README.md

Key Features

  • Robust Data Preprocessing: Handles missing values, outliers, and data normalization
  • Optimal Clustering: Uses silhouette analysis to determine the optimal number of clusters
  • Model Evaluation: Comprehensive evaluation metrics for both clustering and regression
  • Interactive Visualizations: Clear plots and charts for result interpretation
  • Scalable Code: Modular design for easy extension and modification
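The preprocessing steps named in the first bullet (missing values, outliers, normalization) might look like the following pandas sketch. The column names and values are hypothetical, and the specific choices (median imputation, 1.5 × IQR clipping, z-score scaling) are common defaults assumed for illustration, not confirmed details of scripts/data_preprocessing.py.

```python
import numpy as np
import pandas as pd

# Hypothetical columns and values; the real Foodmart schema is not shown here.
raw = pd.DataFrame({
    "store_sales": [4.5, 7.2, np.nan, 5.1, 250.0, 6.3],
    "unit_sales": [1.0, 2.0, 3.0, np.nan, 5.0, 4.0],
})


def preprocess(df):
    """Impute, clip outliers, and standardise every numeric column."""
    out = df.copy()
    # 1. Missing values: fill each column with its median.
    out = out.fillna(out.median())
    # 2. Outliers: clip to the 1.5 * IQR fences of each column.
    q1, q3 = out.quantile(0.25), out.quantile(0.75)
    iqr = q3 - q1
    out = out.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr, axis=1)
    # 3. Normalisation: rescale to zero mean and unit variance.
    return (out - out.mean()) / out.std(ddof=0)


clean = preprocess(raw)
print(clean.round(3))
```

Scaling matters most for the clustering step, because distance-based methods such as k-medoids would otherwise be dominated by whichever feature has the largest numeric range; decision trees are insensitive to it.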

Requirements

pandas>=1.3.0
numpy>=1.21.0
scikit-learn>=1.0.0
matplotlib>=3.4.0
seaborn>=0.11.0
jupyter>=1.0.0

Acknowledgments

  • Foodmart dataset providers
  • Scikit-learn community
  • Open source data mining community

This project demonstrates the application of unsupervised and supervised learning techniques for retail data analysis and business intelligence.
