SHEBEEB S

With a strong foundation in software engineering, I discovered my passion for data-driven decision making and intelligent systems. This curiosity led me to transition into Data Science, exploring the art of data and solving real-world problems through Data Science, Machine Learning, and storytelling.

Project Overview

I recently completed an insurance premium prediction project as part of a mentorship program with Codebasics.io. The goal was to predict a customer’s annual insurance premium using a dataset of 50,000 records (premiums.xlsx). The dataset included a variety of features such as age, gender, region, marital status, number of dependents, BMI category, smoking status, employment status, income, medical history, and the insurance plan type, along with the target variable – the annual premium amount. This project was extensive and unfolded over multiple phases (spread across six Jupyter notebooks) to progressively refine the model. In this blog post, I’ll walk through my first-person journey tackling the problem, from exploratory data analysis and initial modeling, through a critical decision to segment the data by age, to feature engineering a Genetic Risk (GR) factor that dramatically improved performance for younger customers, and finally to deploying the solution with a Streamlit web application.

Why this problem? Predicting insurance premiums accurately is important for insurance companies to price policies fairly and for customers to understand their costs. Premium amounts are influenced by various personal and health factors. I was motivated to see how machine learning regression techniques could capture these relationships and where they might fall short, especially since domain insights (like health risk indicators) can play a big role. The project not only tested my data science skills in cleaning and modeling a real-world dataset but also taught me how to iteratively improve a model and present it as an interactive tool.

The Dataset and Data Cleaning

Before building models, I performed extensive data cleaning and preprocessing on the dataset. The raw data had 13 columns (features) initially, and later an extra column was added when we introduced the genetic risk feature. Key features in the dataset included:

  • Age: Customer’s age (integer).

  • Gender: Male or Female.

  • Region: Geographical region (Northeast, Northwest, Southeast, Southwest).

  • Marital_status: Married or Unmarried.

  • Number Of Dependants: Number of dependents (children or others) the customer has.

  • BMI_Category: Categorical BMI range (Underweight, Normal, Overweight, Obesity).

  • Smoking_Status: No Smoking, Occasional, or Regular smoker.

  • Employment_Status: Salaried, Self-Employed, etc.

  • Income_Level: Income range category (e.g. <10L, 10L - 25L, etc., indicating income in Lakhs of rupees).

  • Income_Lakhs: Numeric income in lakhs (e.g. 6 means ₹6,00,000 annual income if we interpret 1 Lakh = 100,000 currency units).

  • Medical History: A categorical field describing any medical conditions (values like No Disease, Diabetes, Heart disease, or combinations like Diabetes & Heart disease).

  • Insurance_Plan: Type of insurance plan (Bronze, Silver, Gold – presumably indicating coverage level).

  • Annual_Premium_Amount: (Target) The annual premium (insurance cost) in currency units.

I ensured all column names were converted to a consistent lowercase, snake_case format for convenience. Next, I addressed data quality issues and prepared the features for modeling:

  • Missing values: I checked for nulls in each column and found a few. Rows containing missing values were dropped for simplicity (only a small number of records were affected, so the dataset size remained ~50k after dropping).

  • Duplicates: I looked for duplicate rows. There were none initially, but I kept this step in the workflow as a safeguard (and would drop duplicates if any appeared).

  • Inconsistent or out-of-range values: One issue discovered was negative values in the number_of_dependants field. In fact, 72 records had number_of_dependants as -1 or -3, which is not valid. I corrected these by taking the absolute value (assuming they were data entry errors where a leading minus sign was mistakenly included). After this fix, no ages or dependents were negative.

  • Redundant features: The dataset had both Income_Level (a categorical bracket) and Income_Lakhs (a precise numeric value). These two were closely related (e.g., if income_lakhs = 6, that corresponds to the <10L bracket). Using both could cause multicollinearity. After some exploration, I decided to drop Income_Level and rely on the numeric income_lakhs for modeling, since it contains more granular information. This simplified the model and avoided a highly collinear set of dummy variables from Income_Level.

  • Categorical encoding: Most features were categorical (gender, region, etc.). I prepared these for regression by one-hot encoding. For example, gender became a binary column gender_Male (with Female as the reference), region became region_Northwest, region_Northeast, etc., and similarly for the plan type and smoking status. I also encoded the BMI categories and employment status. Each category was turned into a 0/1 dummy variable. This encoding was done consistently in each modeling notebook so that our models could interpret categorical inputs as numeric features.

  • Outliers: Using describe() and box plots, I examined numeric fields for outliers. The age range was 18 to 86, which makes sense for adult insurance customers (no obvious out-of-range ages). Income_Lakhs had a wide range (from 1 to about 93 lakhs, i.e., Rs. 100k to Rs. 9.3M), which is plausible but skewed. The Annual_Premium_Amount ranged roughly from Rs. 3.5k to Rs. 43.5k, with a mean around Rs. 20k. I noticed a few extremely high premium values (above Rs. 40k) which were rare, but given the large dataset, I did not remove them; they could be legitimate (perhaps older customers with Gold plans and multiple conditions). Instead of removing outliers, I kept them and relied on robust models (like tree-based methods) if needed. However, I did note the skew in premium amounts.

Data cleaning was iterative – I would fix issues, then re-run EDA to ensure everything looked reasonable. By the end of this phase, I had a clean dataframe ready for modeling. It’s worth noting that I maintained this cleaning process separately for different segments later on as well, to ensure consistency.
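To make these steps concrete, here is a minimal pandas sketch of the cleaning pipeline described above. The file name, column names, and category list are assumptions based on this write-up rather than the exact notebook code:

```python
import pandas as pd

# Load the raw data (file and column names are assumptions based on this post)
df = pd.read_excel("premiums.xlsx")

# Standardize column names to lowercase snake_case
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Drop the few rows with missing values and any duplicate rows
df = df.dropna().drop_duplicates()

# Fix invalid negative dependant counts (e.g. -1, -3) by taking the absolute value
df["number_of_dependants"] = df["number_of_dependants"].abs()

# Drop the redundant categorical income bracket; keep the numeric income_lakhs
df = df.drop(columns=["income_level"])

# One-hot encode the categorical features (drop_first avoids redundant dummy columns)
categorical_cols = ["gender", "region", "marital_status", "bmi_category",
                    "smoking_status", "employment_status", "insurance_plan",
                    "medical_history"]
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
```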

The Full Story

Exploratory Data Analysis (EDA)

With clean data in hand, I performed Exploratory Data Analysis to understand relationships and inform my modeling strategy. Some key observations from the EDA:

  • Age vs Premium: Age showed one of the strongest relationships with the premium amount. Generally, older customers had higher premiums. This makes intuitive sense, as older individuals often require more comprehensive health coverage and may have higher health risks, leading to costlier insurance. In the combined dataset, the Pearson correlation between age and annual premium was about 0.77 – a very strong positive correlation. Plotting age against premium revealed an upward trend: premiums rose sharply for customers in higher age brackets.

  • Income and Plan: Income also influenced premiums. Higher income (in lakhs) tended to correlate with choosing higher-tier plans (Silver/Gold) and slightly higher premiums (correlation ~0.41 with premium). This could be because wealthier customers might opt for plans with better coverage (which cost more). Plan type itself had a noticeable impact: Gold plan holders paid more on average than Silver, who paid more than Bronze. This was expected by design of insurance plans.

  • Medical History: This feature was categorical and needed further examination. The categories included single diseases and combinations (e.g., Diabetes & Heart disease). Initially, I looked at the distribution: the majority of customers had No Disease in their history, and among those with conditions, the most common were Diabetes and High Blood Pressure (hypertension). I suspected that having certain diseases would lead to higher premiums (insurers often charge more if you have chronic conditions). Indeed, when I later separated this into risk scores, conditions like heart disease and diabetes proved to be significant factors.

  • Other categorical factors: Smoking status had an effect: regular smokers tended to have higher premiums than non-smokers, reflecting health risk. BMI category showed that those in Obesity category generally paid more than those who were Normal or Underweight, as obesity is a health risk factor. Marital status and region had smaller effects; unmarried individuals paid slightly more on average (perhaps due to being younger on average), and region had negligible impact after controlling for other factors (a slight trend was Northwest region customers paying a bit less, but nothing major).

To visualize these relationships, I generated various plots: histograms of continuous features, box plots for premiums across categories, and scatter plots. A correlation matrix of the numerical features was especially revealing:

Correlation heatmap of numeric features and target (after adding the genetic risk feature, discussed later). Brighter colors indicate higher positive correlation. We can see that age has the highest correlation with annual premium (0.77). The introduced genetical_risk feature also shows a strong positive correlation (~0.52) with premium, second only to age, confirming it as a valuable predictor. Other correlations: income and plan type have moderate positive correlations with premium, while being unmarried (which correlates with younger age) has a negative correlation with premium.

From these analyses, one insight stood out: the model might have trouble with younger customers. Why? Because age is such a dominant factor, a regression model might rely heavily on it to predict premiums. If a customer is, say, 20 years old, the model would predict a low premium (because most young people pay less). However, if that 20-year-old had serious medical conditions, their actual premium could be high – an outlier scenario the model might miss. I kept this hypothesis in mind moving forward.

Initial Modeling and Performance Issues

For the initial modeling attempt, I treated the entire dataset as one group. I split the data into training and test sets (70% train, 30% test) and tried a few regression algorithms:

  • Linear Regression: a basic multiple linear regression as a baseline.

  • Ridge and Lasso Regression: to see if regularization helps (in case of multicollinearity among features).

  • XGBoost Regressor: a powerful ensemble tree-based model, to capture non-linearities and interactions.

I evaluated models primarily with R² (the coefficient of determination) for interpretability (how much variance in premium is explained), and I also looked at Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) to gauge the scale of errors.
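Here is a condensed sketch of this modeling round, assuming the encoded dataframe from the cleaning sketch above; the hyperparameters shown are illustrative rather than the exact values I tuned:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import r2_score, mean_squared_error
from xgboost import XGBRegressor

X = df_encoded.drop(columns=["annual_premium_amount"])
y = df_encoded["annual_premium_amount"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.01),
    "xgboost": XGBRegressor(n_estimators=200, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    print(f"{name}: train R2={model.score(X_train, y_train):.3f}, "
          f"test R2={r2_score(y_test, pred):.3f}, test RMSE={rmse:.0f}")
```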

The Linear Regression on the full dataset performed decently overall, but with some concerning signs. It achieved an R² in the mid-80s (roughly 0.85), meaning it explained most of the variance in premiums. The training and test R² were close, indicating it wasn’t overfitting badly. However, when I inspected the residuals (errors), I found a pattern: a subset of predictions had large errors. In particular, many of the largest under-predictions (where the actual premium was far higher than predicted) were for younger customers.

Distribution of prediction residuals for the initial combined model (difference between actual and predicted premium, as a percentage of actual). The distribution is centered near zero error for most records, but notice the long right tail — there are numerous cases where the model under-predicted (positive error percentages). Many of these large errors corresponded to customers under 25, indicating the model struggled with that demographic. 

As shown in the above residual plot, while the bulk of predictions were fairly accurate, a non-negligible number had errors above +20%, and some beyond +50% (with the tail extending to roughly +80%).

Upon investigation, most of these high-error points were young individuals who had unusually high premiums (likely due to health issues or high coverage plans) that the model didn’t anticipate. The linear model, lacking a direct input for “health risk”, mostly used age and perhaps BMI or smoker status as proxies, but apparently that wasn’t enough to capture some young outliers.
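The residual check itself is straightforward. Below is a sketch of how such an error analysis can be done, assuming the fitted linear model and the test split from the previous snippet:

```python
# Residuals as a percentage of the actual premium, for the combined linear model
results = X_test.copy()
results["actual"] = y_test
results["predicted"] = models["linear"].predict(X_test)
results["error_pct"] = (results["actual"] - results["predicted"]) / results["actual"] * 100

# Flag predictions that undershoot the actual premium by more than 20%
large_errors = results[results["error_pct"] > 20]
print(f"{len(large_errors)} records under-predicted by more than 20%")

# Check whether these large errors are concentrated among young customers
young_share = (large_errors["age"] <= 25).mean()
print(f"Share of large-error records aged 25 or below: {young_share:.0%}")
```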

I also tried XGBoost on the combined data. XGBoost did manage a slightly higher overall R² (around 87-88%) and reduced some error for complex cases, but it started to show slight overfitting (training R² near 95% vs test ~88%). More importantly, even XGBoost didn’t fully solve the issue with young customers – it still under-predicted some of them, because without a specific feature indicating those young-high-premium cases, the model can only do so much. After evaluating these results and consulting with my mentor, I decided on a strategy: segmentation by age group.

Segmentation by Age: Two Distinct Models

The analysis pointed towards the under-25 age group being a special case. The idea emerged to split the dataset into two segments:

  • “Young” customers – age 25 and below, and

  • “Older” customers – above 25.

This was motivated by the observation that the relationship between factors and premium might differ for young people versus older people. Young customers are generally healthier (lower premiums) unless they have serious issues, whereas older customers almost uniformly see higher premiums as age increases. By segmenting, each model could specialize: one model would learn the pattern for young customers, and another for the older group.

In a separate notebook, I performed the segmentation:

  • df_young = all records where Age ≤ 25 (which turned out to be 20,096 records).

  • df_rest = all records where Age > 25 (the remaining ~29,904 records).

I saved these subsets as premiums_young.xlsx and premiums_rest.xlsx for further analysis. Each subset was roughly large enough to model on its own, and indeed the difference in premium distributions was stark: the average premium in the young group was much lower (with a longer tail due to a few high premiums), whereas the older group’s premiums were higher on average but also more steadily increasing with age.
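The split itself is a simple age filter; a minimal sketch of this step (the record counts quoted above came from my run):

```python
# Segment the cleaned dataset by age and save each subset for its own notebook
df_young = df[df["age"] <= 25].copy()   # 20,096 records
df_rest = df[df["age"] > 25].copy()     # ~29,904 records

df_young.to_excel("premiums_young.xlsx", index=False)
df_rest.to_excel("premiums_rest.xlsx", index=False)

# Compare the premium distributions of the two segments
print(df_young["annual_premium_amount"].describe())
print(df_rest["annual_premium_amount"].describe())
```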

From this point on, I essentially treated it as two sub-projects: one for young customers, one for the rest. I repeated the data cleaning steps on each subset (ensuring consistent encoding, handling any missing values – which were minimal after the initial cleaning, and scaling if necessary). Interestingly, within the young subset, the age range is small (18-25), so age as a feature has much less variance. This means other features would have to explain premium differences among young people. For the older subset (26-86 years), age spans a wide range and we expected it to be a dominating feature.

Modeling for Young Customers (Age ≤ 25)

For the young demographic, I performed dedicated EDA and modeling. EDA revealed that in this subset, premium amounts were generally modest (most under ₹15k), but a handful of young people had premiums in the ₹20k-35k range. Those few usually had something notable in their profile (like a serious medical condition).

I trained multiple models on the young customer data:

  • Linear Regression

  • Ridge/Lasso (to check if maybe a simpler model with regularization helps)

  • XGBoost Regressor (to capture non-linearity)

After one-hot encoding all categorical variables in this subset, we had a feature matrix of roughly 30 features (since categories like region, plan, etc., were expanded). I also dropped income_level here for consistency, using only the numeric income.

Model results (Young subset, without GR): The linear regression on young data achieved an R² of only about 0.60-0.65 on the test set. This was disappointing but not entirely surprising – it means the model could only explain ~60% of the variability in premiums for under-25 customers. The train and test scores were similar (no overfitting, just an inherently lower fit), so even more complex models would likely face the same limitation unless new information was added. XGBoost did only slightly better, reaching roughly 0.67 R², but it also started overfitting if pushed further. Clearly, something was missing in the feature set for young customers.

I analyzed feature importance and coefficients for the young model. The linear model’s coefficients indicated that being a smoker, having a higher BMI category, and having certain medical conditions did increase premiums for young people (which aligns with expectations). However, none of these factors alone was enough to account for the huge premium jump some young individuals had. For instance, one 22-year-old in the data had a premium of ₹30k (very high for that age) – this person had Heart disease in their history. The model, lacking a direct way to quantify how severe heart disease is relative to other conditions, under-predicted this premium by a lot.

This analysis led me to a crucial realization: we needed a better representation of health risk for young customers. Age was not a strong differentiator within this group (since all are young), so health factors play a bigger role. The existing “Medical History” feature was categorical and not very model-friendly in its raw form (especially for combinations of diseases). I needed to extract more signal from it. This set the stage for introducing the Genetic Risk (GR) feature.

Modeling for Older Customers (Age > 25)

In parallel, I modeled the older segment (ages 26 and up). This group was larger and the premium amounts were generally higher and more correlated with age. After encoding categories similarly, I trained the models.

Model results (Older subset): The linear regression on the 25+ data performed exceedingly well – it achieved over 95% R² on test data. In fact, the scores were about 0.953 on train and 0.953 on test, showing essentially no generalization gap. This was a stark contrast to the young group’s results. It means the features available (age, income, plan, medical history in its basic form, etc.) were almost completely sufficient to explain premium variation for older customers. Intuitively, this makes sense: for older individuals, age itself is a powerful predictor (a 60-year-old vs a 30-year-old will have a big premium difference, all else equal). Additionally, many older folks have some medical history that translates to higher premiums, but because age already implies higher risk, the model can capture it without needing an explicit extra risk feature (to some extent).

XGBoost on the older set also did well – it could even hit ~97% R², but it was clearly overfitting (training score ~99%). Given the linear model was already so good and much simpler, I decided to keep Linear Regression as the model of choice for the older segment too. It’s easier to interpret and explain, and the coefficients made sense (e.g., each additional year of age added a certain amount to predicted premium, being a smoker added a fixed surcharge, etc.). Checking coefficients: age had the largest positive coefficient, as expected. Smoking and BMI also had positive contributions. The plan type coefficients reflected the incremental cost of Silver vs Bronze, Gold vs Silver, etc. Everything aligned with domain expectations.
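Inspecting coefficients takes only a couple of lines; a sketch, using hypothetical names lr_rest and X_train_rest for the older-segment linear model and its training features:

```python
import pandas as pd

# Pair each coefficient with its feature name to sanity-check the older-segment model
coefficients = pd.Series(lr_rest.coef_, index=X_train_rest.columns)
print(coefficients.sort_values(ascending=False))
print("Intercept:", lr_rest.intercept_)
```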

At this point, we had:

  • A young-segment model that was okay but not great (test R² ≈ 0.60-0.65).

  • An older-segment model that was excellent (test R² ≈ 0.95).

This confirmed that segmentation was the right approach (imagine if we had one model for all ages – it would be dragged down by the young group’s unpredictability). But it also highlighted that we needed to improve the young model if we wanted a truly robust solution for all customers.

Incorporating a Genetic Risk Feature (GR) for Young Customers

To address the shortcomings for the young customer model, I introduced an additional feature: Genetic Risk (GR). The term “Genetic Risk” here refers to an aggregate health risk score derived from the customer’s medical history – essentially quantifying how severe the reported diseases are, in terms of impacting insurance cost. This idea came from domain insight (and mentorship guidance): different medical conditions contribute differently to risk. For example, heart disease is a very serious condition for a young person and would drive premiums up significantly, whereas a history of, say, thyroid issues might have a smaller impact.

Feature engineering the GR: I created a risk scoring system for the medical conditions:

  • Diabetes: 6 points

  • High blood pressure: 6 points

  • Heart disease: 8 points

  • Thyroid: 5 points

  • No disease/None: 0 points (if the person has no listed conditions)

These scores were based on an assumed guideline of risk severity (heart disease being highest). If a customer had a combination of conditions (the Medical History could list two, like “Diabetes & High blood pressure”), I would add the scores for both. For instance, Diabetes & High blood pressure would yield 6 + 6 = 12 points. If they had no conditions, they get 0.

I then normalized this total risk score to a 0-1 range for scaling. The highest total score any customer had was used as the 1.0 mark. In practice, young customers rarely had more than one condition, but a few did (and older customers sometimes had two). After normalization, each customer got a genetical_risk value between 0.0 and 1.0. A value of 0.0 means no risk (no conditions), and values closer to 1.0 mean very high risk (e.g., presence of a serious condition or multiple conditions).
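Here is a sketch of that risk-scoring logic, applied to each segment's dataframe. The condition strings are assumptions about how the medical_history values look after lowercasing; the exact string handling in my notebook may differ slightly:

```python
# Assumed risk weights per the scoring scheme described above
risk_scores = {
    "diabetes": 6,
    "high blood pressure": 6,
    "heart disease": 8,
    "thyroid": 5,
    "no disease": 0,
    "none": 0,
}

def total_risk(medical_history: str) -> int:
    """Sum the risk points for every condition listed in the medical history."""
    conditions = [c.strip().lower() for c in medical_history.split("&")]
    return sum(risk_scores.get(c, 0) for c in conditions)

# Compute the raw score, then normalize by the highest observed score to get a 0-1 range
df["total_risk_score"] = df["medical_history"].apply(total_risk)
df["genetical_risk"] = df["total_risk_score"] / df["total_risk_score"].max()
```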

I merged this Genetical_Risk column back into the datasets. Specifically for the young segment, this was new data that we didn’t have in the first modeling round. Now the young dataset had 14 columns: the original 13 plus this new GR feature.

With this augmented feature set, I re-trained the model for young customers. The impact was immediately clear in the results: the model’s performance skyrocketed.

Young segment model (with GR) results: The linear regression now achieved about 98-99% R² on the young customer test set – a massive leap from ~60%! In fact, the model could now explain virtually all the variance in premiums for under-25s. The train/test scores were both ~0.98+, indicating an excellent fit without overfitting. We had essentially uncovered the critical feature that was missing earlier. By including the genetic risk score, the model could differentiate a healthy 25-year-old (low risk, low premium) from a 25-year-old with heart disease (high risk, high premium), something it simply couldn’t do before.

To double-check, I also tried the XGBoost model again with the GR feature included. XGBoost also did very well – it got to ~99% R² – but at this point the linear model was already so good that there was little room for improvement. The XGBoost had a tiny edge in training score (it could fit 99.25% of variance) but the test was about 98.8%, essentially the same as linear. Moreover, the linear model coefficients now told a satisfying story: the coefficient for genetical_risk was very large and positive (reflecting that a jump from 0 to 1 in normalized risk can increase the premium by many thousands of rupees, roughly equivalent to decades of age increase). This aligns with the idea that if a young person has the highest risk (like a major genetic predisposition or serious condition), their premium would be charged similar to an older person in their 50s or 60s.

In summary, the addition of the GR feature solved the puzzle for the young segment:

  • Without GR: the model mostly looked at age (18 vs 25 years old doesn’t change premium much) and perhaps other small factors, missing the big picture for some customers.

  • With GR: the model now had a direct handle on health risk and could predict high premiums for young-high-risk individuals accurately.

As a result, I now had two strong models:

  • Young customers model (≤25) – test R² ≈ 0.98 (after adding GR).

  • Older customers model (>25) – test R² ≈ 0.95 (even without needing GR explicitly, though I did test it as well).

Impact of GR on Older Segment

For completeness, I also added the genetic risk feature to the older customer dataset. Since I had computed the risk scores for everyone, the older segment also got a genetical_risk column. I retrained the older model with this feature included. There was a slight uptick in performance – the R² moved from ~0.953 to about 0.96 on test data. So a minor improvement, which is still welcome but not game-changing (the model was already very good).

Why was the effect small for older people? Mainly because age was already accounting for a lot of the variance: most older customers in the dataset who had serious medical issues were also older in age (and thus already paying higher premiums even in the previous model). In other words, age and genetic risk are somewhat correlated in the older group – older individuals are more likely to have one of those listed conditions. The linear model already implicitly learned that older customers tend to have higher premiums partly due to those conditions. Still, adding GR made the model a tad more precise and explicitly anchored the effect of specific diseases. Now, if there was an outlier case (say a 40-year-old with heart disease vs a healthy 40-year-old), the model could distinguish them better via the GR feature rather than relying on plan type or BMI as proxies.

At this point, it’s useful to recap the model performance improvements in a table:

| Model & Dataset | Features Used | Train R² | Test R² |
|---|---|---|---|
| Initial Combined Model | Basic features (no GR) | ~0.85 | ~0.85 (est.) |
| Young (≤25) – no GR | Basic features (no GR) | 0.60 | 0.60-0.62 |
| Young (≤25) – with GR | Basic + Genetical_Risk | 0.988 | 0.989 |
| Older (>25) – no GR | Basic features (no GR) | 0.954 | 0.953 |
| Older (>25) – with GR | Basic + Genetical_Risk | 0.957 | 0.960 |

(The combined model’s scores are approximate since I focused on segment models afterward. “Basic features” refers to all original 13 features after cleaning and encoding. Train/Test are on their respective segments.)

The evolution is clear: the segmentation dramatically improved the older group’s model (since it no longer had to compromise for the young outliers), and adding the GR feature dramatically improved the young group’s model.

Satisfied with these results, I finalized the models: both would be simple linear regression models (one trained on young data with GR, one on older data with GR). Simplicity was a virtue here – it made deployment easier and interpretation clearer.

Before deployment, I also double-checked the models on a few sample individuals:

  • A 23-year-old smoker with diabetes (high GR): The young model predicted a premium that was quite high (close to what a middle-aged person might pay), which matched the actual premium.

  • A 24-year-old completely healthy: Predicted a low premium (around ₹5k) which made sense.

  • A 30-year-old with heart disease (older model): Predicted premium was significantly above an average 30-year-old, reflecting the added risk.

  • A 50-year-old healthy vs a 50-year-old with multiple conditions: the latter’s predicted premium was much higher, as expected.

Building a Streamlit Web App for Deployment

I wanted the app interface to be clean, simple, and easy for anyone to use. Using Streamlit, I crafted a form-like layout where users can enter all the relevant information without feeling overwhelmed. Instead of a long single column of fields, I organized the inputs into a grid of four rows, each containing three fields. This grid layout lets users see most inputs at a glance, making the form feel manageable rather than daunting.

The form collects a variety of information about the user’s profile and policy preferences, covering all factors that the model considers for premium calculation. I grouped the inputs into logical categories for clarity:

  • Demographics: Age, Gender, Region, Marital Status, Number of Dependents

  • Health & Lifestyle: BMI Category, Smoking Status, Medical History, Genetic Risk

  • Financial & Policy: Annual Income, Employment Status, Insurance Plan Level

Each field is paired with the appropriate Streamlit widget to ensure a smooth user experience. For example, numeric entries like Age, Income, and Number of Dependents use numeric input boxes with sensible ranges (adult ages, realistic income limits, etc.), preventing invalid entries. Categorical choices such as Gender, Region, BMI Category, and Smoking Status use dropdown selectors. These dropdowns present predefined options (e.g., Male/Female for gender, various regions, categories like Normal/Overweight/Obesity for BMI, etc.), so the user can simply pick from the list. This not only makes input easy but also guarantees only valid values are submitted to the model. Similarly, Medical History is a dropdown where the user can select a condition or a combination (like “Diabetes & High blood pressure”) from a list – this simplifies what could have been a complex multi-select input into one clear choice. I also included a Genetic Risk field as a numeric slider (on a scale of 0–5) to let users quantify any known hereditary risk factors. By using these widgets and organizing them thoughtfully, the interface feels like a friendly questionnaire rather than a spreadsheet of numbers, which is important for a positive user experience.

To make the app interactive, I added a Predict button at the end of the form. The user can adjust all their inputs and then click this button to submit the information. The moment Predict is clicked, the app springs into action – behind the scenes it prepares the data for the model and computes the prediction. I designed the output display to be immediately noticeable and clear: once the prediction is ready, the app shows it using Streamlit’s success message container, which highlights the result in an attractive green box. It reads something like: “Predicted Health Insurance Cost: 15,000” (with the actual number based on the model’s output). This way, the user instantly sees the estimated premium without having to hunt for it on the page. The use of a formatted message also allows me to include a bit of friendly text alongside the number, making it more understandable (for instance, labeling it as an annual premium cost). Overall, the UI design focuses on simplicity and clarity – the user is guided through inputting their details and gets a quick, nicely formatted result with minimal effort.
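A trimmed-down sketch of this layout is shown below: a grid of st.columns, the input widgets, the Predict button, and the success box. The widget labels and option lists mirror the dataset values described earlier, while predict_premium is a hypothetical helper (its routing logic is sketched in the next section) and the Employment Status options beyond Salaried/Self-Employed are assumptions:

```python
import streamlit as st

st.title("Health Insurance Premium Predictor")

# Four rows of three columns each, so the form reads like a compact questionnaire
row1, row2, row3, row4 = st.columns(3), st.columns(3), st.columns(3), st.columns(3)

age = row1[0].number_input("Age", min_value=18, max_value=100, value=30)
dependants = row1[1].number_input("Number of Dependants", min_value=0, max_value=20, value=0)
income = row1[2].number_input("Income in Lakhs", min_value=0, max_value=200, value=10)

genetic_risk = row2[0].slider("Genetical Risk", min_value=0, max_value=5, value=0)
plan = row2[1].selectbox("Insurance Plan", ["Bronze", "Silver", "Gold"])
employment = row2[2].selectbox("Employment Status", ["Salaried", "Self-Employed", "Freelancer"])

gender = row3[0].selectbox("Gender", ["Male", "Female"])
marital = row3[1].selectbox("Marital Status", ["Unmarried", "Married"])
bmi = row3[2].selectbox("BMI Category", ["Normal", "Overweight", "Obesity", "Underweight"])

smoking = row4[0].selectbox("Smoking Status", ["No Smoking", "Occasional", "Regular"])
region = row4[1].selectbox("Region", ["Northeast", "Northwest", "Southeast", "Southwest"])
medical = row4[2].selectbox("Medical History",
    ["No Disease", "Diabetes", "High blood pressure", "Heart disease",
     "Diabetes & High blood pressure", "Diabetes & Heart disease"])

if st.button("Predict"):
    user_input = {
        "age": age, "number_of_dependants": dependants, "income_lakhs": income,
        "genetical_risk": genetic_risk, "insurance_plan": plan,
        "employment_status": employment, "gender": gender, "marital_status": marital,
        "bmi_category": bmi, "smoking_status": smoking, "region": region,
        "medical_history": medical,
    }
    prediction = predict_premium(user_input)  # hypothetical helper from the prediction module
    st.success(f"Predicted Health Insurance Cost: {prediction:,.0f}")
```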

Behind the Scenes: Age-Specific Prediction Models

One of the key design choices in this project was to use two different regression models depending on the user’s age. In my exploratory data analysis, it became evident that younger customers (specifically those 25 years old or below) have very different insurance cost patterns compared to older customers. For example, younger individuals often have lower base premiums and fewer health complications, so the factors influencing their premiums differ from those of older individuals. To capture these differences more accurately, I trained separate models for the two age groups – let’s call them the young-adult model and the adult model.

In the deployed app, this logic is implemented seamlessly. When a user enters their age and hits Predict, the app automatically decides which model to use based on that age input. If the age is 25 or under, the app knows to route the input data to the young-adult model. If the age is above 25, it uses the adult model for the prediction. This all happens behind the scenes without any extra steps from the user. From the user’s perspective, there is just one Predict button and one result; internally, however, the app is smart enough to select the appropriate predictive model for their demographic. This age-based model selection was a conscious design decision to improve accuracy – it ensures that each user’s premium is estimated by a model tuned to people similar to them. As the developer, I found this approach very insightful, because it demonstrated how splitting a problem by a key demographic feature (age) can yield a more personalized and precise prediction. It also kept the user experience straightforward: the user doesn’t need to know about multiple models or make any choice; the app does that work and simply provides the best estimate for their case.
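The routing itself is just an age check at prediction time. A sketch, assuming the two models and their scalers were saved with joblib under hypothetical artifact names, and that preprocess_input and apply_scaling are helpers like the ones described in the feature engineering section below:

```python
import joblib

# Hypothetical artifact names; loaded once when the Streamlit app starts
model_young = joblib.load("artifacts/model_young.joblib")   # trained on age <= 25
model_rest = joblib.load("artifacts/model_rest.joblib")     # trained on age > 25
scaler_young = joblib.load("artifacts/scaler_young.joblib")
scaler_rest = joblib.load("artifacts/scaler_rest.joblib")

def predict_premium(user_input: dict) -> float:
    """Route the request to the model trained on the user's age segment."""
    features = preprocess_input(user_input)  # hypothetical helper, sketched below
    if user_input["age"] <= 25:
        model, scaler = model_young, scaler_young
    else:
        model, scaler = model_rest, scaler_rest
    features = apply_scaling(features, scaler)  # hypothetical helper: rescales numeric columns
    return float(model.predict(features)[0])
```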

Translating Medical History into a Risk Score

Incorporating medical history into the model was another interesting challenge. Users can indicate if they have certain medical conditions (like diabetes, heart disease, thyroid issues, high blood pressure, etc.), and these factors are obviously important for predicting health insurance premiums. However, feeding raw text or a list of conditions directly into a model isn’t effective – the model needs numeric features. I also wanted to avoid overly complicating the model with a dozen separate yes/no inputs for every possible condition. Instead, I designed a single composite feature called a health risk score that summarizes the user’s medical history in one number.

Here’s how it works: behind the scenes the app maps each medical condition to a risk weight – a number that represents how much that condition might drive up the premium relative to others. For example, a condition like heart disease was given a higher weight than, say, a thyroid condition, because heart disease typically has a bigger impact on health costs. If the user selects a combination of conditions (the app allows combined choices such as “Diabetes & High blood pressure”), the app will add together the weights for both. This sum produces a total risk score for the user’s medical history. To make this score more interpretable to the model (and to keep it within a consistent range regardless of how many conditions were selected), I then normalize the risk score to a 0–1 scale. Essentially, 0 means no known medical issues (lowest risk) and 1 would correspond to the highest risk scenario in our data (for instance, having two very serious conditions together). Most people’s risk score will fall somewhere between these extremes.

All of this computation happens instantly when the user hits the Predict button. The user doesn’t see the risk score directly – it’s an internal numeric representation that the model uses. From the user’s point of view, they just chose their conditions from the dropdown, but what the model receives is a nicely scaled risk value summarizing that choice. I found this approach elegant and user-friendly: instead of asking users to input a dozen binary fields or rates for each ailment, they make one simple selection and the app distills that into a meaningful feature. It’s a design that respects the user’s time and input simplicity, while still giving the model the detailed information it needs to make an accurate prediction.

In addition to medical history, the app also asks for a Genetic Risk score. This is a number the user can provide if they have an idea of their hereditary risk factors (for instance, if they’ve done a genetic test or have a family history of certain diseases). I included this as a separate input because genetic predisposition can affect insurance risk independent of one’s current medical conditions. The genetic risk is taken as-is (a number from 0 to 5 in the UI) and will be used alongside other features. It’s kept separate from the medical history risk score because it represents a different aspect of risk (one is current health status, the other is inherited risk). Internally, of course, it’s just another numeric feature for the model – but conceptually it enriches the prediction by covering another dimension of health risk.

Feature Engineering Under the Hood

Once the user submits their information, the app needs to transform these inputs into a format the machine learning models can understand. In the background, I implemented a systematic feature engineering pipeline that prepares the data for prediction. I made sure to mirror the transformations that were applied during the model training phase, so that the inputs the model sees at deployment are consistent with what it saw during training. Here are some of the key steps happening under the hood:

  • Encoding Categorical Variables: Inputs like Gender, Region, Marital Status, BMI Category, Smoking Status, Employment Type, and Insurance Plan are all categorical. The app converts these into numerical form through encoding. In some cases I used one-hot encoding – for example, the app creates binary indicator columns for regions (Northwest, Southeast, Southwest, etc.) and for BMI categories (Normal, Overweight, Obesity, Underweight). If a user selects “Southwest” as their region, the app will set the region_Southwest feature to 1 and others to 0. This way, the model gets a 0/1 signal for each possible region, indicating which one applies. Some categories were treated as having an intrinsic order; for instance, the Insurance Plan field (Bronze, Silver, Gold) is essentially an ordinal variable where Gold is a higher-tier plan than Bronze. In this case, the app encodes Bronze/Silver/Gold to numeric levels 1, 2, 3 respectively. This numeric encoding preserves the rank of the plans. All of this encoding happens behind the scenes through a mapping in the code – the user selecting “Gold” simply results in the number 3 being set for the plan feature that the model uses.

  • Scaling Numerical Features: The numeric inputs – such as Age, Number of Dependents, Income, and the risk scores – are scaled using the same normalization that was applied during model training. Scaling is important because it keeps features on comparable ranges, especially since our model was trained on scaled data for better performance. An interesting twist here is that because we have two models (young vs. adult), I maintained two separate scaling strategies. For the younger cohort, the data (like income or age itself) may have different ranges than for the older group. So, if the user is 25 or under, the app will apply the “young scaler” to their features; if over 25, it uses the “adult scaler.” Practically, this means the app has stored two sets of scaling parameters (like means and ranges for the features) and chooses the appropriate one along with the model. This ensures, for example, that a 22-year-old’s income is scaled relative to the distribution of incomes of young people that the young-adult model was trained on, whereas a 40-year-old’s income is scaled according to the distribution in the older group. By handling scaling this way, I made sure each model gets inputs in the form it expects. All of this happens instantaneously when the user presses the button – the raw inputs are fed into a preprocessing function that outputs a ready-to-predict feature vector.

What makes this setup maintainable is that I encapsulated these transformations into internal helper functions within the app’s codebase. Instead of writing all the encoding and scaling logic directly in the Streamlit interface script (which would have made it long and hard to follow), I put them into utility functions. For example, once the user input is collected into a dictionary, I call a function that preprocesses this input: it handles converting all the categories to numeric columns, calculates the medical history risk, adds the genetic risk and other numeric fields, and performs the scaling. The output of that function is a neat one-row data frame or array with all the features in the correct order and format. This modular design was a conscious choice – it keeps the main app code focused on the user interaction, while the complex data manipulation happens in the background. This way, if I ever need to update how a feature is processed (say we decide to change the risk weighting or add a new category), I can do that in the helper logic without touching the UI code. For the user, of course, this is all invisible; they experience a fast, seamless process where their inputs are magically turned into a prediction in a split second.
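A simplified sketch of such a preprocessing helper follows. The feature names, the ordinal plan encoding, and the maximum risk score of 14 (Diabetes & Heart disease = 6 + 8) are assumptions for illustration; the real pipeline must emit exactly the columns the models were trained on:

```python
import pandas as pd

# Assumed risk weights for medical-history conditions (see the risk score section above)
MEDICAL_RISK = {"diabetes": 6, "high blood pressure": 6, "heart disease": 8,
                "thyroid": 5, "no disease": 0, "none": 0}
MAX_RISK = 14  # assumed maximum: "Diabetes & Heart disease" = 6 + 8

PLAN_ENCODING = {"Bronze": 1, "Silver": 2, "Gold": 3}  # ordinal: higher tier, higher number

# Assumed feature layout for illustration only
FEATURE_COLUMNS = [
    "age", "number_of_dependants", "income_lakhs", "insurance_plan",
    "genetical_risk", "normalized_risk_score",
    "gender_Male", "region_Northwest", "region_Southeast", "region_Southwest",
    "marital_status_Unmarried",
    "bmi_category_Obesity", "bmi_category_Overweight", "bmi_category_Underweight",
    "smoking_status_Occasional", "smoking_status_Regular",
    "employment_status_Salaried", "employment_status_Self-Employed",
]

def preprocess_input(user_input: dict) -> pd.DataFrame:
    """Convert raw form inputs into a single-row frame in the models' expected format."""
    row = pd.DataFrame(0.0, index=[0], columns=FEATURE_COLUMNS)

    # Numeric, ordinal, and engineered risk fields
    row["age"] = user_input["age"]
    row["number_of_dependants"] = user_input["number_of_dependants"]
    row["income_lakhs"] = user_input["income_lakhs"]
    row["insurance_plan"] = PLAN_ENCODING[user_input["insurance_plan"]]
    row["genetical_risk"] = user_input["genetical_risk"]
    conditions = [c.strip().lower() for c in user_input["medical_history"].split("&")]
    row["normalized_risk_score"] = sum(MEDICAL_RISK.get(c, 0) for c in conditions) / MAX_RISK

    # One-hot fields: flip the matching dummy to 1; reference categories simply stay 0
    for dummy in (f"gender_{user_input['gender']}",
                  f"region_{user_input['region']}",
                  f"marital_status_{user_input['marital_status']}",
                  f"bmi_category_{user_input['bmi_category']}",
                  f"smoking_status_{user_input['smoking_status']}",
                  f"employment_status_{user_input['employment_status']}"):
        if dummy in row.columns:
            row[dummy] = 1.0
    return row
```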

Instant Predictions and User-Friendly Results

With the input data prepared and the correct model selected, the final step is to generate the prediction and present it to the user. I loaded the trained regression models into the app when it starts up (one for young customers, one for the rest), so by the time a user hits Predict, the model is already in memory and ready to crunch numbers. The app feeds the processed input vector into the appropriate model, which then outputs a predicted insurance premium amount. Because these models are regression algorithms working on just a single sample at a time, the computation is extremely fast – essentially instantaneous from the user’s point of view. Streamlit handles the function call and returns the result almost immediately, so the user never feels like they are waiting. One moment they click the button, and within a blink the answer appears.

I paid special attention to how the result is displayed to ensure it’s clear and helpful. The prediction is an annual premium cost (since the model was trained on annual premium data), and I present it as a formatted message such as: “Predicted Annual Premium: ₹15,000” (just an example output). The app uses a success notification box for this result, which highlights the text on the app page. This design choice makes the output stand out visually, and the phrasing provides context (so it’s not just a naked number – it explicitly says it’s a predicted premium). The use of formatting, including a currency symbol and comma separators for the number, makes it easily readable and gives it a polished, professional feel.

Another advantage of the interactive app format is that users can experiment with different inputs in real time. I wanted to encourage this exploratory behavior because it turns the app into not just a prediction tool but also an educational experience. For instance, a user might wonder how their premium would change if they quit smoking or if they opt for a higher-tier insurance plan. They can simply toggle the Smoking Status from “Regular” to “No Smoking”, or change the Insurance Plan from Bronze to Gold, and hit Predict again. The app will instantly update the result based on the new inputs. Seeing the premium go down after removing a smoking habit, for example, can be a powerful illustration of how lifestyle choices impact insurance costs. In this way, the app delivers insight, not just numbers. As the developer and narrator of this journey, I found this feedback loop to be one of the most rewarding aspects – the model’s complex calculations are translated into an easy-to-use tool that can inform users’ decisions or at least satisfy their curiosity about “what-if” scenarios. The professional yet accessible presentation of results was crucial to me, as this is intended for a portfolio showcase; it demonstrates that I can not only build a working model but also wrap it in an interface that communicates results clearly and effectively to end users.

Deployment and Real-World Access

After thoroughly testing the app locally, I deployed it so that others could access it as well. Streamlit makes deployment quite convenient – I hosted the app on Streamlit Cloud, which allowed me to share the application via a simple web URL. This means that anyone, from potential employers to casual users interested in their insurance estimates, can visit the link and use the app instantly. There’s no need for them to install any software or have any technical know-how; all the heavy lifting (loading models, processing inputs, computing predictions) happens on the server side in the Streamlit app. Deployment was mostly a matter of pushing my code (including the saved models and the helper logic) to the cloud service and letting it handle the rest. Within moments, the insurance premium predictor was live online, effectively serving predictions to real users just as intended.

Seeing the app deployed and running gave a satisfying sense of completion to the project. It validated that the design choices – from the intuitive interface to the behind-the-scenes logic – all work smoothly in a real-world setting. Users can now interact with a machine learning model in a way that feels straightforward and engaging. For a data science portfolio project, this deployment is a highlight because it demonstrates end-to-end ability: not only building a robust predictive model but also delivering it through a user-friendly application. In professional terms, it shows how I can bridge the gap between complex data science and usable software. On a personal note, deploying the app and observing people use it (and get value from it) was incredibly rewarding. It’s a clear illustration of why I enjoy this work – taking something as abstract as a regression model and turning it into a tangible tool that anyone can benefit from.

In summary, this Streamlit app project was about more than just predicting insurance premiums; it was an exercise in thoughtful design and effective communication of ML results. By narrating the process in this first-person account, I hope I’ve conveyed the considerations and care that went into building an application that is both technically sound and user-centric. From clever model segmentation by age to the nuanced handling of health risk factors, every component was crafted to make the predictions accurate and the app experience smooth. Deploying it brought the project full circle – it’s now out in the world, serving as an interactive exhibit of data science in action, and I’m proud of how it all came together in a professional, clear, and insightful way.

In conclusion, the multi-part insurance premium prediction project showcases how combining data science techniques with domain knowledge can yield powerful results. We went from a one-size-fits-all model that struggled with certain groups, to a tailored solution with near-perfect accuracy for each segment. And by deploying it with Streamlit, the solution is accessible and interactive. This journey will certainly be a highlight in my portfolio, and I’m excited to apply these learnings to future projects where clever feature engineering and strategic modeling can make all the difference.