A comprehensive data mining analysis of the Foodmart dataset using K-medoids clustering and Decision Tree Regression to extract meaningful insights from retail sales data.
This project analyzes the Foodmart dataset, which contains retail sales transaction data. It combines unsupervised learning (K-medoids clustering) with supervised learning (Decision Tree Regression) to find customer purchasing patterns and predict sales amounts.
- Customer Segmentation: Use K-medoids clustering to identify distinct customer groups based on purchasing behavior
- Sales Prediction: Train a Decision Tree Regressor to predict sales values from product, customer, and temporal features
- Pattern Discovery: Extract actionable insights from retail transaction data
- Performance Evaluation: Assess model accuracy and clustering quality
- Python 3.x
- Pandas - Data manipulation and analysis
- NumPy - Numerical computing
- Scikit-learn - Machine learning algorithms
- Matplotlib/Seaborn - Data visualization
- Jupyter Notebook - Interactive development environment
The Foodmart dataset is a well-known retail sales dataset containing:
- Customer transaction records
- Product information
- Sales figures
- Time-based data
- Store and location details
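Clustering customers needs one row per customer rather than one row per transaction. The sketch below shows one way to aggregate transaction records into per-customer features with pandas. The column names (`customer_id`, `store_sales`, `unit_sales`) and the sample rows are assumptions made for illustration, not the project's actual schema.

```python
import pandas as pd

# Hypothetical transaction table; column names and values are illustrative only.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "store_sales": [10.0, 5.0, 3.0, 4.0, 8.0, 20.0],
    "unit_sales": [2, 1, 1, 1, 3, 4],
})

# One row per customer: how often they buy and how much they spend.
customer_features = transactions.groupby("customer_id").agg(
    n_transactions=("store_sales", "size"),
    total_sales=("store_sales", "sum"),
    mean_sales=("store_sales", "mean"),
    total_units=("unit_sales", "sum"),
)
print(customer_features)
```

The resulting table can then be scaled and passed to the clustering step.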
- Purpose: Customer segmentation and pattern identification
- Algorithm: Partitioning Around Medoids (PAM)
- Features: Customer purchasing behavior, transaction frequency, sales amounts
- Output: 3 customer clusters with distinct characteristics
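To make the K-medoids step concrete, here is a minimal NumPy sketch of the PAM swap idea: repeatedly replace a medoid with a non-medoid point whenever that lowers the total distance of points to their nearest medoid. It is not the project's `kmedoids_clustering.py`. It starts from random medoids instead of PAM's BUILD phase, uses Euclidean distance, and runs on synthetic data; all of these are assumptions for illustration.

```python
import numpy as np


def pam_swap(X, k, max_iter=100, seed=0):
    """K-medoids via greedy PAM-style swaps from a random start."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = rng.choice(n, size=k, replace=False)
    cost = dist[:, medoids].min(axis=1).sum()
    for _ in range(max_iter):
        best_cost, best_swap = cost, None
        # Try replacing each medoid with each non-medoid point.
        for i in range(k):
            for h in range(n):
                if h in medoids:
                    continue
                candidate = medoids.copy()
                candidate[i] = h
                c = dist[:, candidate].min(axis=1).sum()
                if c < best_cost:
                    best_cost, best_swap = c, (i, h)
        if best_swap is None:  # no swap lowers the cost: local optimum
            break
        medoids[best_swap[0]] = best_swap[1]
        cost = best_cost
    # Assign every point to its nearest medoid.
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels, cost


# Synthetic stand-in for scaled customer features: three tight groups.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 9.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((20, 2)) for c in centers])

medoids, labels, cost = pam_swap(X, k=3)
print("medoid row indices:", medoids)
print("total cost:", round(float(cost), 2))
```

Because each medoid is an actual data row, every cluster can be described by a real, representative customer. The swap search is quadratic in the number of points, so a sketch like this only suits small samples.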
- Purpose: Sales prediction and feature importance analysis
- Target Variable: Sales amount/revenue
- Features: Product categories, customer segments, temporal factors
- Output: Predicted sales values and feature importance rankings
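A rough sketch of this step with scikit-learn's `DecisionTreeRegressor` follows, using the hyperparameters reported in the results. The feature names, the integer encoding, and the synthetic target are assumptions for illustration; the real feature set comes from the preprocessed Foodmart data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 500
# Hypothetical, already-encoded features (names are illustrative only).
feature_names = ["product_category", "customer_segment", "month"]
X = np.column_stack([
    rng.integers(0, 5, n),   # product category code
    rng.integers(0, 3, n),   # cluster label from the segmentation step
    rng.integers(1, 13, n),  # month of the transaction
])
# Synthetic target: driven mostly by category, a little by segment, not by month.
y = 4.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0, 0.5, n)

tree = DecisionTreeRegressor(
    max_depth=7, min_samples_leaf=4, min_samples_split=10, random_state=0
)
tree.fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = dict(zip(feature_names, tree.feature_importances_))
for name, score in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")
```

Impurity-based importances tend to favor features with many distinct values, so the ranking is best read as a rough guide rather than a causal statement.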
    pip install pandas numpy scikit-learn matplotlib seaborn jupyter
- Clone the repository:
    git clone https://github.com/yourusername/foodmart-data-mining.git
    cd foodmart-data-mining
- Install required dependencies:
    pip install -r requirements.txt
- Download the Foodmart dataset and place it in the data/ directory
- Data Preprocessing:
    python scripts/data_preprocessing.py
- K-medoids Clustering:
    python scripts/kmedoids_clustering.py
- Decision Tree Analysis:
    python scripts/decision_tree_regression.py
- Run Complete Analysis:
    python main.py
- Optimal Clusters: X clusters identified using silhouette analysis
- Customer Segments:
- High-value customers
- Frequent buyers
- Seasonal shoppers
- Occasional purchasers
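Silhouette analysis picks the number of clusters by fitting several values of k and keeping the one with the highest mean silhouette score. The sketch below illustrates that selection loop on synthetic data. As a simplifying assumption, `KMeans` stands in for K-medoids here to keep the example short; the selection logic is the same for any clustering routine that returns labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for scaled customer features: three tight groups.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [5.0, 9.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((30, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    # KMeans is a stand-in; swap in the K-medoids labels in practice.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("selected k:", best_k)
```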
- Mean Squared Error: 1.86
- R² Score: 0.85
- Best Parameters: {'max_depth': 7, 'min_samples_leaf': 4, 'min_samples_split': 10}
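Metrics and a "best parameters" dictionary of this shape are what a cross-validated grid search typically produces. The sketch below shows one way to obtain them with scikit-learn's `GridSearchCV`. The parameter grid, the 5-fold setting, the train/test split, and the synthetic data are all assumptions, so its numbers will not match the reported ones.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the prepared Foodmart features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 3))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.3, 400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical search grid; the project's actual grid is not documented here.
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [1, 4],
    "min_samples_split": [2, 10],
}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out test set.
y_pred = search.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Best Parameters:", search.best_params_)
print(f"Mean Squared Error: {mse:.2f}")
print(f"R² Score: {r2:.2f}")
```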
- Most important features for sales prediction
- Customer segment characteristics
- Seasonal patterns in sales data
- Product category performance
The project generates several visualizations:
- Cluster visualization using PCA
- Decision tree structure
- Feature importance plots
- Sales prediction accuracy charts
- Customer segment analysis
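As an illustration of the PCA cluster plot, the sketch below projects a feature matrix onto its first two principal components and colors the points by cluster label. The function name `plot_clusters_pca`, the synthetic data, and the styling are assumptions; the project's `visualization.py` may differ.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA


def plot_clusters_pca(X, labels, path=None):
    """Scatter the first two principal components, colored by cluster."""
    coords = PCA(n_components=2).fit_transform(X)
    fig, ax = plt.subplots(figsize=(6, 4))
    scatter = ax.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="viridis", s=15)
    ax.set_xlabel("PC 1")
    ax.set_ylabel("PC 2")
    ax.set_title("Customer clusters (PCA projection)")
    ax.legend(*scatter.legend_elements(), title="Cluster")
    if path is not None:  # only write a file when a path is given
        fig.savefig(path, dpi=150, bbox_inches="tight")
    return fig, coords


# Synthetic five-dimensional features with three shifted groups.
rng = np.random.default_rng(0)
X = rng.standard_normal((90, 5))
labels = np.repeat([0, 1, 2], 30)
X[labels == 1] += 4.0
X[labels == 2] -= 4.0

fig, coords = plot_clusters_pca(X, labels)
```

Passing a `path` such as a file under `results/plots/` would save the figure, provided that directory exists.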
foodmart-data-mining/
│
├── data/
│ ├── raw/ # Raw Foodmart dataset
│ └── processed/ # Cleaned and preprocessed data
│
├── notebooks/
│ ├── exploratory_analysis.ipynb
│ ├── clustering_analysis.ipynb
│ └── regression_analysis.ipynb
│
├── scripts/
│ ├── data_preprocessing.py
│ ├── kmedoids_clustering.py
│ ├── decision_tree_regression.py
│ └── visualization.py
│
├── results/
│ ├── models/ # Saved models
│ ├── plots/ # Generated visualizations
│ └── reports/ # Analysis reports
│
├── requirements.txt
├── main.py
└── README.md
- Robust Data Preprocessing: Handles missing values, outliers, and data normalization
- Optimal Clustering: Uses silhouette analysis to determine the optimal number of clusters
- Model Evaluation: Comprehensive evaluation metrics for both clustering and regression
- Interactive Visualizations: Clear plots and charts for result interpretation
- Scalable Code: Modular design for easy extension and modification
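The preprocessing steps named above (missing values, outliers, normalization) could be sketched as follows. The specific choices here are assumptions rather than the project's documented behavior: median imputation, clipping at 1.5 times the interquartile range, and standardization with `StandardScaler`. The column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def preprocess(df, columns):
    """Impute, clip outliers, and standardize the given numeric columns."""
    out = df[columns].copy()
    # Fill missing values with each column's median.
    out = out.fillna(out.median())
    # Clip values lying more than 1.5 * IQR beyond the quartiles.
    q1, q3 = out.quantile(0.25), out.quantile(0.75)
    iqr = q3 - q1
    out = out.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr, axis=1)
    # Standardize to zero mean and unit variance.
    scaled = StandardScaler().fit_transform(out)
    return pd.DataFrame(scaled, columns=columns, index=df.index)


# Hypothetical per-customer features with a gap and an extreme value.
raw = pd.DataFrame({
    "total_sales": [10.0, 12.0, np.nan, 11.0, 500.0, 9.0],
    "n_transactions": [3, 4, 5, np.nan, 4, 3],
})
clean = preprocess(raw, ["total_sales", "n_transactions"])
print(clean.round(2))
```

Scaling matters for the clustering step in particular, since distance-based methods such as K-medoids are otherwise dominated by the feature with the largest numeric range.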
pandas>=1.3.0
numpy>=1.21.0
scikit-learn>=1.0.0
matplotlib>=3.4.0
seaborn>=0.11.0
jupyter>=1.0.0
- Foodmart dataset providers
- Scikit-learn community
- Open source data mining community
This project demonstrates the application of unsupervised and supervised learning techniques for retail data analysis and business intelligence.