Commit a9458ed8 authored by Azmi Zahrani

Merge branch 'feature/data-processing' into 'develop'

Feature/data processing

See merge request !2
parents 05b46d52 65bba8f2
2 merge requests: !10 chore: add model version control and finally can trigger the sparking flow..., !2 Feature/data processing
Pipeline #66335 passed with stages in 11 seconds
Showing
with 30075 additions and 0 deletions
Churn
0
0
1
1
0
0
1
1
0
0
0
0
0
0
1
0
0
0
1
0
0
1
0
1
0
0
0
0
0
0
0
1
0
1
0
0
0
1
1
0
1
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
1
0
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
1
0
0
0
0
0
0
0
1
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
1
0
0
0
1
1
0
0
1
1
1
1
0
0
0
1
0
0
0
0
0
0
0
0
0
1
1
0
0
0
0
1
0
0
1
0
0
0
0
0
1
1
0
1
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
1
1
0
1
1
0
1
1
0
1
0
0
0
1
1
1
0
0
1
0
0
0
0
1
0
0
0
1
0
0
0
0
0
1
1
0
0
1
1
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
0
1
0
1
0
1
1
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
1
1
1
0
0
0
0
1
0
0
0
0
0
0
0
1
0
0
1
1
1
0
0
0
1
1
0
0
1
0
0
1
1
0
0
0
0
0
0
0
0
0
0
1
1
0
1
0
1
1
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
1
0
0
0
0
1
0
1
0
0
0
1
0
0
0
1
1
1
0
1
0
0
0
0
0
1
1
0
0
0
0
0
1
1
1
0
1
1
0
0
0
0
0
0
1
0
0
0
1
1
0
0
0
0
1
0
0
0
0
0
0
0
0
1
0
0
1
0
0
0
0
1
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
1
1
1
0
1
1
1
0
0
0
0
0
0
1
0
1
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
1
0
0
0
0
0
1
0
0
0
1
1
1
0
1
0
0
0
1
1
0
0
1
1
1
1
1
0
0
1
0
1
0
1
0
0
0
0
0
1
1
0
1
0
1
0
0
0
0
1
0
0
1
0
1
0
0
Churn
0
1
1
0
0
0
1
0
1
0
0
0
0
0
0
0
0
1
0
0
1
0
0
0
1
0
0
1
0
0
0
0
1
1
0
0
1
1
0
0
1
0
1
1
1
1
0
0
0
0
0
1
0
0
0
0
0
0
1
0
1
1
0
1
1
0
0
1
1
0
1
1
1
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
1
0
1
0
1
1
1
0
0
0
0
0
0
1
0
0
1
0
0
0
0
1
1
0
0
0
0
0
1
0
0
0
0
0
0
1
0
1
1
1
0
0
1
0
0
1
0
1
1
1
0
1
0
0
0
0
0
0
1
1
0
1
0
0
0
0
1
0
0
1
1
0
0
1
1
0
0
0
0
0
0
1
1
1
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
1
0
0
0
1
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
1
0
0
1
0
1
0
0
0
1
0
0
0
0
0
0
1
1
0
0
0
0
0
0
0
0
0
1
1
0
0
0
0
0
0
0
0
1
0
0
0
0
1
1
0
1
0
0
0
0
0
0
0
0
1
1
0
0
0
0
0
0
0
0
1
1
0
0
1
0
0
1
0
0
0
0
1
0
1
0
0
0
1
0
1
0
0
0
0
0
0
0
1
1
0
0
0
1
1
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
1
0
0
1
0
0
0
0
0
0
1
0
1
0
1
1
0
0
0
1
1
1
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
0
0
0
0
1
0
0
1
0
1
0
0
0
0
0
0
0
0
0
1
0
0
0
1
0
0
0
0
0
1
1
0
0
0
0
1
0
0
0
1
0
0
0
0
0
0
0
1
0
1
0
0
0
0
0
0
0
0
1
0
0
0
1
0
0
1
0
1
1
0
0
0
0
1
0
1
0
0
1
1
1
0
1
0
0
0
0
0
0
1
1
0
1
0
0
0
0
1
0
0
0
0
1
0
0
1
0
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
1
1
1
0
1
1
0
0
0
1
0
0
0
0
0
0
File added
File added
File added
File added
File added
File added
%% Cell type:markdown id: tags:
# Customer Churn Prediction: Machine Learning Models
This notebook loads the training, validation, and test datasets, trains several machine learning models, evaluates their accuracy, and saves the trained models for later use.
%% Cell type:code id: tags:
``` python
# Import Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler
import joblib
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
# Set display options for better readability
pd.set_option('display.max_columns', None)
```
%% Cell type:markdown id: tags:
## 1. Load the Data
Load the training, validation, and test datasets that we saved in the preprocessing step.
%% Cell type:code id: tags:
``` python
# Load the datasets
X_train = pd.read_csv('../data/X_train.csv')
y_train = pd.read_csv('../data/y_train.csv')
X_val = pd.read_csv('../data/X_val.csv')
y_val = pd.read_csv('../data/y_val.csv')
X_test = pd.read_csv('../data/X_test.csv')
y_test = pd.read_csv('../data/y_test.csv')
# Display the number of rows in each dataset
print(f"Training set size: {X_train.shape[0]}")
print(f"Validation set size: {X_val.shape[0]}")
print(f"Test set size: {X_test.shape[0]}")
```
%% Output
Training set size: 5977
Validation set size: 527
Test set size: 528
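As a quick sanity check on the split, the row counts printed above can be turned into proportions. This is a minimal sketch; the names `split_sizes` and `split_fractions` are illustrative and do not appear in the notebook. With these counts the split works out to roughly 85% / 7.5% / 7.5%.

```python
# Row counts printed by the loading cell above.
split_sizes = {"train": 5977, "validation": 527, "test": 528}

total_rows = sum(split_sizes.values())

# Fraction of all rows that landed in each split.
split_fractions = {name: size / total_rows for name, size in split_sizes.items()}

for name, fraction in split_fractions.items():
    print(f"{name}: {split_sizes[name]} rows ({fraction:.1%})")
```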
%% Cell type:markdown id: tags:
## 2. Scale the Data
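Standardize the features with StandardScaler. The scaler is fitted on the training set only and then applied unchanged to the validation and test sets.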
%% Cell type:code id: tags:
``` python
# Initialize the scaler
scaler = StandardScaler()
# Fit the scaler on the training data and transform the training, validation, and test sets
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
```
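To make the fit-on-train, transform-everywhere pattern concrete, here is a small standard-library sketch of what standardization does to a single feature column. The helper names and the toy numbers are assumptions for illustration only. Like `StandardScaler`, it uses the population standard deviation.

```python
from statistics import fmean, pstdev


def fit_standardizer(column):
    """Return (mean, std) of one feature column, using the population std."""
    return fmean(column), pstdev(column)


def standardize(column, mean, std):
    """Apply previously fitted statistics to a column."""
    return [(value - mean) / std for value in column]


# Toy values standing in for one feature; not taken from the real dataset.
train_column = [10.0, 20.0, 30.0]
val_column = [20.0, 40.0]

# Statistics come from the training column only ...
mean, std = fit_standardizer(train_column)

# ... and are reused as-is for the validation column.
train_scaled = standardize(train_column, mean, std)
val_scaled = standardize(val_column, mean, std)
```

The scaled training column has zero mean by construction, while the validation column generally does not, because it is measured against the training statistics.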
%% Cell type:markdown id: tags:
## 3. Model Training
Train six classifiers on the scaled training data and report each model's accuracy on the validation set.
%% Cell type:code id: tags:
``` python
# Initialize models
models = {
    'Logistic Regression': LogisticRegression(max_iter=2000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=127),
    'Support Vector Machine': SVC(probability=True),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=127),
    'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
    'XGBoost': XGBClassifier(eval_metric='logloss')
}

# Train models and evaluate on the validation set
for model_name, model in models.items():
    # ravel() turns the single-column label DataFrame into a 1-D array
    model.fit(X_train_scaled, y_train.values.ravel())
    y_val_pred = model.predict(X_val_scaled)
    accuracy = accuracy_score(y_val, y_val_pred)
    print(f"{model_name} Validation Accuracy: {accuracy:.4f}")
```
%% Output
Logistic Regression Validation Accuracy: 0.8027
Random Forest Validation Accuracy: 0.7742
Support Vector Machine Validation Accuracy: 0.7989
Gradient Boosting Validation Accuracy: 0.7951
K-Nearest Neighbors Validation Accuracy: 0.7666
XGBoost Validation Accuracy: 0.7837
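The validation accuracies above are close together, so it helps to see how many validation rows actually separate the models. The sketch below ranks the printed scores and converts them back to counts of correct predictions out of the 527 validation rows. The dictionary is copied from the output above; the variable names are illustrative.

```python
# Validation accuracies as printed above.
validation_accuracy = {
    "Logistic Regression": 0.8027,
    "Random Forest": 0.7742,
    "Support Vector Machine": 0.7989,
    "Gradient Boosting": 0.7951,
    "K-Nearest Neighbors": 0.7666,
    "XGBoost": 0.7837,
}
VALIDATION_ROWS = 527  # validation set size reported when the data was loaded

# Rank models from best to worst validation accuracy.
ranking = sorted(validation_accuracy.items(), key=lambda item: item[1], reverse=True)

# Recover the number of correctly classified rows behind each rounded score.
correct_counts = {
    name: round(accuracy * VALIDATION_ROWS)
    for name, accuracy in validation_accuracy.items()
}

for name, accuracy in ranking:
    print(f"{name}: {accuracy:.4f} ({correct_counts[name]} of {VALIDATION_ROWS} correct)")
```

On this split the top two models differ by only two validation rows, so the ordering should be read as a rough guide rather than a firm ranking.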
%% Cell type:markdown id: tags:
## 4. Save the Models
Save the trained models for future use.
%% Cell type:code id: tags:
``` python
# Save each model
for model_name, model in models.items():
    joblib.dump(model, f'../models/{model_name.replace(" ", "_")}.joblib')
    print(f"{model_name} saved successfully.")
```
%% Output
Logistic Regression saved successfully.
Random Forest saved successfully.
Support Vector Machine saved successfully.
Gradient Boosting saved successfully.
K-Nearest Neighbors saved successfully.
XGBoost saved successfully.
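The file names are derived from the dictionary keys by replacing spaces with underscores. A small helper makes that mapping explicit and reusable for both saving and loading; `model_path` and `MODELS_DIR` are hypothetical names that the notebook does not define.

```python
from pathlib import PurePosixPath

# Directory used by the save and load cells in this notebook.
MODELS_DIR = PurePosixPath("../models")


def model_path(model_name: str) -> PurePosixPath:
    """Map a display name to its .joblib file, replacing spaces with underscores."""
    return MODELS_DIR / f"{model_name.replace(' ', '_')}.joblib"


for name in ("Logistic Regression", "Support Vector Machine", "K-Nearest Neighbors"):
    print(f"{name} -> {model_path(name)}")
```

Only spaces are replaced, so the hyphen in `K-Nearest Neighbors` is kept in the file name. The models were trained on scaled features, and the save loop writes only the models, so code that reloads them in a fresh session would also need the fitted scaler.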
%% Cell type:markdown id: tags:
## 5. Load and Evaluate the Models
Load the saved models and evaluate them on the test set.
%% Cell type:code id: tags:
``` python
# Load and evaluate each model on the test set
for model_name in models.keys():
    loaded_model = joblib.load(f'../models/{model_name.replace(" ", "_")}.joblib')
    y_test_pred = loaded_model.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_test_pred)
    print(f"{model_name} Test Accuracy: {accuracy:.4f}")
    print(classification_report(y_test, y_test_pred))
```
%% Output
Logistic Regression Test Accuracy: 0.8030
              precision    recall  f1-score   support
           0       0.84      0.90      0.87       388
           1       0.66      0.54      0.59       140
    accuracy                           0.80       528
   macro avg       0.75      0.72      0.73       528
weighted avg       0.79      0.80      0.80       528
Random Forest Test Accuracy: 0.7841
              precision    recall  f1-score   support
           0       0.82      0.91      0.86       388
           1       0.63      0.44      0.52       140
    accuracy                           0.78       528
   macro avg       0.73      0.68      0.69       528
weighted avg       0.77      0.78      0.77       528
Support Vector Machine Test Accuracy: 0.8087
              precision    recall  f1-score   support
           0       0.84      0.92      0.88       388
           1       0.69      0.51      0.58       140
    accuracy                           0.81       528
   macro avg       0.76      0.71      0.73       528
weighted avg       0.80      0.81      0.80       528
Gradient Boosting Test Accuracy: 0.8030
              precision    recall  f1-score   support
           0       0.84      0.91      0.87       388
           1       0.67      0.51      0.58       140
    accuracy                           0.80       528
   macro avg       0.75      0.71      0.72       528
weighted avg       0.79      0.80      0.79       528
K-Nearest Neighbors Test Accuracy: 0.7746
              precision    recall  f1-score   support
           0       0.83      0.87      0.85       388
           1       0.58      0.52      0.55       140
    accuracy                           0.77       528
   macro avg       0.71      0.69      0.70       528
weighted avg       0.77      0.77      0.77       528
XGBoost Test Accuracy: 0.7765
              precision    recall  f1-score   support
           0       0.82      0.88      0.85       388
           1       0.60      0.48      0.53       140
    accuracy                           0.78       528
   macro avg       0.71      0.68      0.69       528
weighted avg       0.76      0.78      0.77       528
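The test set is imbalanced (388 non-churners against 140 churners), so the accuracies above are easier to judge against a majority-class baseline. The sketch below computes that baseline and reproduces the macro and weighted recall averages for Logistic Regression from its per-class rows. The numbers are copied from the report above; the variable names are illustrative.

```python
# Per-class support and recall taken from the Logistic Regression report above.
support = {0: 388, 1: 140}
recall = {0: 0.90, 1: 0.54}

total = sum(support.values())

# Accuracy of always predicting the majority class (0, no churn).
majority_baseline = max(support.values()) / total

# Macro average: plain mean over classes, ignoring class size.
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: mean over classes weighted by support.
weighted_recall = sum(recall[label] * support[label] for label in support) / total

print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
print(f"Macro recall: {macro_recall:.2f}")
print(f"Weighted recall: {weighted_recall:.2f}")
```

Always predicting "no churn" would already score about 0.73 on this test set, so the best models improve on the baseline by roughly seven percentage points, while recall on the churn class stays between 0.44 and 0.54.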