Project Overview
I recently worked on a Credit Risk Modeling project using a dataset provided by the Codebasics.io team. The goal was to build a model that predicts the likelihood of a loan default and to create a credit scorecard system for a fictitious NBFC (Non-Banking Financial Company) similar to a CIBIL score. In practical terms, this meant categorizing loan applicants into risk buckets – Poor, Average, Good, or Excellent – based on their predicted creditworthiness. To achieve this, I needed to develop a predictive model using historical loan data (with indicators of whether customers defaulted) and then deploy it via a user-friendly web app. The scope of work included everything from data exploration and feature engineering to model training, evaluation, and building a Streamlit app for demonstration (with model serialization using joblib for reuse).
(Dataset Note: The data – courtesy of Codebasics – contained records of borrowers with demographic info, loan details, and credit bureau metrics like credit utilization, open accounts, delinquency history, etc. There were 50,000 entries, and roughly 8% of them were labeled as defaults.)
Exploratory Data Analysis (EDA)
Before modeling, I performed an extensive EDA using Pandas, Matplotlib, and Seaborn. I merged three tables (customers, loans, and bureau data) on a customer ID to get a comprehensive dataset for analysis. Key observations:
- Class Imbalance: Only a small fraction of borrowers had defaulted (around 1 in 12), which meant our target class was highly imbalanced. This would later influence how I handled model training and evaluation (a high overall accuracy could be misleading if the model simply predicts “No Default” for most cases).
- Feature Distributions: I checked distributions of numeric features (income, loan amounts, ages, etc.) and found some outliers, which I treated appropriately (e.g., capping extremely high debt-to-income ratios). Categorical variables (gender, loan purpose, residence type) were analyzed for any obvious trends with default rates.
- Correlations: Using a heatmap, I identified some strongly correlated features. For example, principal_outstanding and bank_balance_at_application had a correlation of ~0.89, meaning they carried redundant information. To avoid multicollinearity, I later dropped one of each such pair (keeping the more predictive feature). I also noted which factors correlated with the default flag: credit_utilization_ratio (r ≈ 0.40) and loan_to_income (r ≈ 0.32) showed moderate positive correlation with defaults (i.e., higher credit card utilization and larger loans relative to income tended to coincide with higher default risk), whereas something like age had little correlation with default. These insights guided feature selection and engineering (for instance, I created a loan_to_income ratio feature explicitly).
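For reference, here is a minimal sketch of the merge and correlation steps described above. The file names, column names, and customer-ID key are illustrative assumptions, not the exact ones from the project:

```python
import pandas as pd

# Assumed file and column names for illustration
customers = pd.read_csv("customers.csv")
loans = pd.read_csv("loans.csv")
bureau = pd.read_csv("bureau_data.csv")

# Merge the three tables on the customer ID to get one analysis dataset
df = customers.merge(loans, on="cust_id").merge(bureau, on="cust_id")

# Engineered feature: loan amount relative to income
df["loan_to_income"] = df["loan_amount"] / df["income"]

# Correlation of numeric features with the default flag (assumed to be 0/1)
corr = df.select_dtypes("number").corr()
print(corr["default"].sort_values(ascending=False))

# Drop one column of each highly correlated pair to limit multicollinearity
df = df.drop(columns=["bank_balance_at_application"])
```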
Modeling Approach
For model development, I used scikit-learn (and XGBoost for one advanced model). I started with a straightforward approach and then progressively incorporated strategies to address the class imbalance and improve recall for the minority class (defaulters). Here’s a rundown of the steps and iterations:
Baseline Models (No Resampling): I first trained a basic Logistic Regression and a Random Forest on the original training data (no special handling of imbalance). These models were evaluated on a hold-out test set. As expected, they achieved high overall accuracy (~96%) because the majority class (no default) dominated. However, the recall for the default class was poor – the logistic model caught only ~72% of defaulters, and the random forest about 77%. In other words, many actual defaults were being missed, which is unacceptable in a credit risk context (we’d rather err on the side of flagging potential defaults). Precision for the default class was relatively high (~85% for logistic), meaning that when it did predict a default, it was usually correct. The table below summarizes some performance metrics at this stage:
| Model | Sampling Strategy | Accuracy | Precision (Default) | Recall (Default) | F1-Score (Default) |
|---|---|---|---|---|---|
| Logistic Regression | None (baseline) | 96% | 85% | 72% | 78% |
| Random Forest | None (baseline) | 96% | 77% | 77% | 77% |
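A condensed sketch of this baseline step, continuing from the merged dataframe above and assuming categorical columns have already been encoded (variable names are illustrative):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

X = df.drop(columns=["default"])
y = df["default"]
# Stratified split keeps the ~8% default rate consistent in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=42)):
    model.fit(X_train, y_train)
    # Per-class precision/recall exposes how poorly the minority (default) class is captured
    print(type(model).__name__)
    print(classification_report(y_test, model.predict(X_test)))
```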
Addressing Imbalance – Under-Sampling: To boost the sensitivity to defaulters, I tried under-sampling the majority class. Using imblearn’s RandomUnderSampler, I down-sampled the non-default cases in the training set to match the number of defaults (drastically reducing the training data size). A logistic regression on this balanced subset achieved a default recall ~96% – a huge jump – but at the expense of precision (around 51%). Essentially, the model was now catching almost all real defaulters, but also raising a lot of false alarms (flagging many good customers as risky). The accuracy also dropped to ~92% due to those false positives.
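The under-sampling step looked roughly like the sketch below (a minimal version, reusing the split from the baseline sketch):

```python
from imblearn.under_sampling import RandomUnderSampler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Down-sample the majority (non-default) class to a 1:1 ratio with defaults
rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)

logreg_rus = LogisticRegression(max_iter=1000)
logreg_rus.fit(X_train_rus, y_train_rus)
# Recall on the default class jumps, but precision drops (more false alarms)
print(classification_report(y_test, logreg_rus.predict(X_test)))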
Addressing Imbalance – Over-Sampling (SMOTE): Next, I explored over-sampling the minority class using SMOTE (Synthetic Minority Over-sampling Technique) combined with Tomek links to generate additional synthetic default cases and clean out ambiguous samples. This gave me a much larger balanced training set (roughly 34k default vs 34k non-default after SMOTE). Training logistic regression on this data yielded a recall ~95% and precision ~56% – a better balance than under-sampling. The model still caught most defaulters, and with slightly fewer false positives.
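A minimal sketch of the SMOTE + Tomek links step, using imblearn's combined SMOTETomek resampler on the same training split:

```python
from imblearn.combine import SMOTETomek
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Over-sample defaults synthetically with SMOTE, then remove ambiguous Tomek-link pairs
smt = SMOTETomek(random_state=42)
X_train_smt, y_train_smt = smt.fit_resample(X_train, y_train)

logreg_smt = LogisticRegression(max_iter=1000)
logreg_smt.fit(X_train_smt, y_train_smt)
print(classification_report(y_test, logreg_smt.predict(X_test)))
```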
Advanced Model – XGBoost with SMOTE: Given the complex nonlinear relationships in credit data, I trained an XGBoost classifier on the SMOTE-balanced data, using Optuna for hyperparameter tuning. After trying many parameter combinations, the best XGBoost model delivered about 87% recall and 72% precision on defaults, with overall accuracy ~96%. This was a more balanced outcome: the model misses about 13% of defaulters (versus 28% missed by the baseline logistic), while keeping precision reasonably high (72% means the majority of flagged defaults are correct). The default F1-score improved to ~0.79. I found this trade-off acceptable, since in credit risk we value catching as many bad loans as possible, up to a point of manageable false positives.
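The tuning loop was along these lines (a simplified sketch: the search space is illustrative, and for brevity it scores trials on the hold-out split, whereas a separate validation split or cross-validation would be the more careful choice):

```python
import optuna
from xgboost import XGBClassifier
from sklearn.metrics import f1_score

def objective(trial):
    # Illustrative search space; the real study may have tuned more parameters
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 500),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
    }
    model = XGBClassifier(**params, eval_metric="logloss")
    model.fit(X_train_smt, y_train_smt)  # SMOTE-balanced training data from above
    # Optimize the F1-score on the default class
    return f1_score(y_test, model.predict(X_test))

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)

# Refit the best configuration for downstream evaluation
best_xgb = XGBClassifier(**study.best_params, eval_metric="logloss")
best_xgb.fit(X_train_smt, y_train_smt)
```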
To compare, here’s how the improved model stacks up against the baseline:
| Model | Accuracy | Precision (Default) | Recall (Default) | F1-Score (Default) |
|---|---|---|---|---|
| Baseline Logistic (no resampling) | 96% | 85% | 72% | 78% |
| Tuned XGBoost (SMOTE data) | 96% | 72% | 87% | 79% |
We can see the recall jump from 72% to 87%, greatly improving our capture of risky cases, though the precision dipped as a result. Deciding on this balance was a key learning – it depends on the business context how many false positives are acceptable. In an NBFC scenario, catching more defaults early may outweigh the inconvenience of a higher false-alarm rate.
Model Performance and Key Insights
Confusion matrix of the final XGBoost model on the test set. The model correctly identifies 934 of 1,074 defaulters and 11,061 of 11,423 non-defaulters.
The confusion matrix above illustrates the final model’s performance. Out of 1,074 actual default cases in the test set, the model caught 934 (true positives) and missed 140 (false negatives). It incorrectly flagged 362 out of 11,423 good loans as defaults (false positives). This aligns with the ~87% recall and ~72% precision discussed. Visually, the dominant blue diagonal in the matrix (especially for the “No Default” class) reflects the high overall accuracy, but the focus for us was really on that bottom row (Default actuals).
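The matrix itself can be produced in a couple of lines; a minimal sketch, assuming best_xgb is the tuned model from the Optuna step:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Confusion matrix of the tuned XGBoost model on the hold-out test set
ConfusionMatrixDisplay.from_estimator(
    best_xgb, X_test, y_test, display_labels=["No Default", "Default"], cmap="Blues"
)
plt.show()
```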
To interpret what the model learned, I also examined the feature importances. In the case of logistic regression, we can look at the coefficients of the features to understand their impact on default probability (see the Feature Importance in Logistic Regression chart below):
Feature importance (coefficient values) from the logistic regression model. Positive coefficients increase default risk, negative coefficients decrease it.
From the above chart, we see intuitive patterns. Features like loan_type_Unsecured, credit_utilization_ratio, loan_to_income, and delinquency-related metrics had the highest positive coefficients – meaning, for example, an unsecured loan (vs a secured one), maxed-out credit cards, or a high loan-to-income ratio all push the model toward predicting default (higher risk). On the flip side, factors such as owning a residence (negative coefficient for residence_type_Owned) or taking a home loan (which often implies collateral, as seen by a negative weight for loan_purpose_Home) reduced the predicted risk. Interestingly, age had a slight negative coefficient – older applicants were a bit less likely to default according to this data. These insights make sense and provide confidence in the model: they align with common credit risk intuition.
(For XGBoost, feature importance can be examined by split gain, but it tended to highlight similar factors – e.g., credit utilization and loan-to-income were among the top predictors.)
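Both views can be pulled directly from the fitted models; a rough sketch, assuming logreg_smt and best_xgb from the earlier steps and a DataFrame-based X_train:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Logistic regression coefficients: positive values push the prediction toward "Default"
coef = pd.Series(logreg_smt.coef_[0], index=X_train.columns).sort_values()
coef.plot(kind="barh", figsize=(8, 10), title="Feature Importance in Logistic Regression")
plt.tight_layout()
plt.show()

# XGBoost gain-based importance tends to highlight similar drivers
xgb_gain = pd.Series(best_xgb.get_booster().get_score(importance_type="gain"))
print(xgb_gain.sort_values(ascending=False).head(10))
```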
Deployment with Streamlit
After achieving a satisfactory model, I deployed it as a web application using Streamlit. The Streamlit UI allows a loan officer (or any user) to input borrower details and immediately get a default probability, a credit score, and a rating category. For deployment, I decided to use the logistic regression model because of its interpretability and ease of scaling into a score. I saved the trained logistic model and preprocessing objects using joblib (bundling the model, feature list, and scaler into one file).
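Bundling the artifacts is a one-liner with joblib; a minimal sketch, where the output path and the scaler object are assumptions for illustration:

```python
import joblib

# Bundle the trained model with its preprocessing artifacts so the app
# can reload everything from a single file
model_bundle = {
    "model": logreg_smt,
    "features": list(X_train.columns),
    "scaler": scaler,  # assumed: the scaler fit during training
}
joblib.dump(model_bundle, "artifacts/model_data.joblib")

# Later, inside the Streamlit app:
bundle = joblib.load("artifacts/model_data.joblib")
```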
In the app (powered by a simple main.py script and a prediction_helper.py module), when the user enters fields like age, income, loan amount, tenure, number of open accounts, etc., the app first computes some derived features (like the loan-to-income ratio) and applies the same scaling used during training. Then, the logistic model’s prediction is obtained. I convert the prediction probability into an actual credit score on a 300–900 scale (similar to real credit scores) by linearly scaling the non-default probability. For instance, a very low predicted default probability might translate to a score close to 900, whereas a high-risk applicant might end up near 300. Finally, based on the score, I assign a rating: “Poor” for scores below 500, “Average” for 500–649, “Good” for 650–749, and “Excellent” for 750+. This way, the output isn’t just a raw probability – it’s in a form that stakeholders in finance are familiar with (score and grade).
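The score-and-rating conversion is simple arithmetic; a minimal sketch of the kind of helper prediction_helper.py could contain (function and parameter names are hypothetical):

```python
def probability_to_score_and_rating(default_probability, base=300, top=900):
    """Linearly map the non-default probability onto a 300-900 credit score."""
    non_default_probability = 1 - default_probability
    score = base + non_default_probability * (top - base)

    # Rating buckets as described above
    if score < 500:
        rating = "Poor"
    elif score < 650:
        rating = "Average"
    elif score < 750:
        rating = "Good"
    else:
        rating = "Excellent"
    return int(score), rating

# Example: a 6% predicted default probability maps to a score of 864 ("Excellent")
print(probability_to_score_and_rating(0.06))
```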
