Project Overview
Introduction:
I recently completed a virtual internship at Atliq Technologies where I independently worked on an end-to-end machine learning project over about 3 weeks. The goal was to build a model to predict a survey respondent’s preferred price range for a new beverage product (CodeX) based on their profile and preferences. This project followed the full data science workflow – from exploring a raw survey dataset, through data cleaning and feature engineering, to model training, experiment tracking, and deploying a Streamlit app. In this blog post, I’ll walk through my process and key learnings in a first-person narrative. The work was challenging but rewarding, and all steps were verified by experienced mentors at Atliq.
Data Exploration (Understanding the Survey Data)
The dataset provided (survey_results.csv) contained 30,010 survey responses with 17 features such as age, gender, occupation, income level, weekly consumption frequency, current brand used, reasons for brand choice, etc., and a target variable: price_range (the price range the person is comfortable paying for the product). I began by checking the structure of the data and basic summary statistics. Notably, the respondents spanned a wide age range (18 to over 70) and came from different zones (Urban, Semi-Urban, Rural, Metro) with varied income levels.
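For context, here is a minimal sketch of that first-pass inspection, assuming pandas and the filename and column names mentioned in this post; the actual notebook may have looked different:

```python
import pandas as pd

# Load the raw survey responses (filename as provided in the project)
df = pd.read_csv("survey_results.csv")

# Shape and dtypes: ~30,010 rows, 17 features plus the price_range target
print(df.shape)
df.info()

# Numeric summaries (e.g., age) and the target's class balance
print(df.describe())
print(df["price_range"].value_counts(normalize=True).round(3))
```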
Distribution of respondent age groups. A majority (almost 65%) of respondents were under 35 years old, with about 35% in the 18–25 range and 30% in 26–35. Older age brackets taper off, as shown in the pie chart.
There were some immediate insights from initial exploration. For example, nearly half of the respondents were using “Newcomer” brands (as opposed to established brands), indicating a competitive market split. The top reason for choosing a brand was “Price” (cited by roughly 47% of respondents), followed by “Availability” (~22%), with “Quality” and “Brand Reputation” trailing around 15% each. This hinted that many consumers are price-sensitive. In terms of the target variable, the price range preferences were skewed towards higher ranges: only about 12% chose the lowest range (₹50–100), whereas a significant portion (~32%) chose the highest range (₹200–250).
Price range preference distribution. Very few respondents opt for the lowest price bracket (₹50–100), while the majority are comfortable with higher price points. Over two-thirds of respondents chose above ₹150, with ₹200–250 being the most popular category.
I also explored relationships between variables. For instance, cross-tabulating age groups with price preference revealed a clear pattern: younger consumers (18–25) largely preferred lower price ranges (most of them chose ₹50–100 or ₹100–150), whereas older consumers (36+) tended to favor higher price ranges (₹150–250). This generational difference was a crucial insight that eventually informed our modeling strategy (we ended up training separate models for younger vs. older segments, more on that later). There were some surprising findings too – e.g., health-consciousness didn’t vary much with price preference, and awareness of other brands had a moderate positive correlation with willingness to pay more (those aware of many alternative brands often were in higher price brackets, perhaps indicating they were savvy and still chose the premium option).
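The age-group cross-tab itself is a one-liner in pandas. Below is a rough sketch, assuming age is binned into the brackets described later in this post (the exact bin edges are my assumption):

```python
import pandas as pd

# Bin raw age into the survey's age brackets (bin edges assumed here)
age_bins = [17, 25, 35, 45, 55, 70]
age_labels = ["18-25", "26-35", "36-45", "46-55", "56-70"]
df["age_group"] = pd.cut(df["age"], bins=age_bins, labels=age_labels)

# Row-normalized cross-tab: share of each age group choosing each price range
xtab = pd.crosstab(df["age_group"], df["price_range"], normalize="index")
print((xtab * 100).round(1))
```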
Data Cleaning (Tidying Up the Mess)
As with any real-world data, the survey responses needed cleaning before analysis. I followed a thorough set of instructions (provided in a PDF from my mentors) to address data quality issues. The key cleaning steps, consolidated in the code sketch after this list, included:
- Dropping Duplicates: We had a few duplicate entries (exact repeated responses). These were very minimal (~10 duplicates in 30k rows), so I removed them to keep only unique responses.
- Fixing Inconsistent Labels: Some categorical entries had typos or case inconsistencies. For example, "Metor" was corrected to "Metro" in the zone column, "Establishd" to "Established", and "newcomer" to "Newcomer" in current_brand. The income_levels field had "None", which we standardized to "Not Reported". These small fixes ensured that each category was represented consistently.
- Handling Missing Data: A few respondents skipped certain questions (e.g., purchase_channel and consume_frequency(weekly) had some blanks). Since the proportion of missing answers was extremely low, I decided to drop those records for simplicity. This eliminated only a handful of rows and didn't impact overall patterns.
- Outlier Treatment: I identified some impossible values, especially in the age field. One record bizarrely had an age of 604, and a few others were well over 100 years – clearly data entry errors. Using a quick box plot and common sense, I filtered out ages above 70. (The survey's design only catered up to the "56–70" group, so any age beyond 70 wasn't in the intended population.) Additionally, I removed one logical inconsistency: a respondent marked as "Student" in occupation while being in the 56–70 age bin – likely an error, so that entry was dropped.
- Trimming Unused Columns: The respondent_id was just an identifier with no predictive value, so it was dropped. I also decided to derive an age_group category from age and then remove the raw age column, since age groups were more relevant for analysis and modeling in this context.
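To make this concrete, here is a condensed pandas sketch of the cleaning steps, using the column names and replacement values described above; the actual cleaning notebook may have been organized differently:

```python
import pandas as pd

df = pd.read_csv("survey_results.csv")

# 1. Drop exact duplicate responses
df = df.drop_duplicates()

# 2. Fix typos and case inconsistencies in categorical labels
df["zone"] = df["zone"].replace({"Metor": "Metro"})
df["current_brand"] = df["current_brand"].replace({"Establishd": "Established",
                                                   "newcomer": "Newcomer"})
df["income_levels"] = df["income_levels"].replace({"None": "Not Reported"})

# 3. Drop the few rows with missing answers
df = df.dropna(subset=["purchase_channel", "consume_frequency(weekly)"])

# 4. Remove impossible ages (survey design only went up to the 56-70 bracket)
df = df[df["age"] <= 70]

# 5. Drop the logically inconsistent "Student" respondent in the 56-70 bin
df = df[~((df["occupation"] == "Student") & (df["age"] >= 56))]

# 6. Drop the identifier column (age_group replaces raw age later)
df = df.drop(columns=["respondent_id"])
```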
After cleaning, the dataset was down to about 29,950 valid responses, and much more reliable. This cleanup might seem tedious, but it was crucial. By resolving typos and removing anomalies, I ensured that subsequent analysis wouldn't be skewed or confused by dirty data. Seeing the data in a cleaner state also made patterns clearer – for instance, after standardizing current_brand, it became obvious that about 51.5% of respondents were using established brands vs 48.5% newcomers, a near even split.
Data Transformation (Feature Engineering & Preparation)
With clean data in hand, I moved on to feature engineering and transformation to get the dataset ready for modeling. This step was particularly fun as I got to create new features that capture domain insights. Here are the major transformations and features I engineered:
- Ordinal Encoding for Ordered Categories: Some survey responses had an inherent order, so I encoded them as numeric scores. For example, consume_frequency(weekly) was mapped to a score (0–2 times = 1, 3–4 times = 2, 5–7 times = 3) indicating how frequently the person consumes the beverage. Similarly, awareness_of_other_brands (how many other brands the person knows) was scored as 1 for "0–1", 2 for "2–4", and 3 for ">4". Higher scores meant more frequent consumption or greater brand awareness. We also had ordinal categories for income_levels ("<10L" < "10L–15L" < … < ">35L") and health_concerns ("Low" < "Medium" < "High" concern about health). I used either manual mapping or label encoding to convert these to numerical scales reflecting their order.
- Derived Age Group Feature: As noted, I binned age into categories: 18–25, 26–35, 36–45, 46–55, 56–70. This not only smoothed out the effect of outlier ages but also aligned with typical age brackets used in marketing analysis. The model can treat this as either an ordinal feature (by label encoding the bins) or nominal (via one-hot encoding). I initially label-encoded age_group (since there is a natural order to age brackets).
- Zone Affluence Score (ZAC): I wanted a single metric to capture a person's socioeconomic status by combining their geographic zone and income level. Intuitively, someone living in a metro city with a high income might have different purchasing power than someone with the same income in a rural area. I created a zone_score (Rural = 1, Semi-Urban = 2, Urban = 3, Metro = 4) and an income_score (from 0 for Not Reported up to 5 for >35L annual income). Multiplying these, I got zac_score = zone_score * income_score. A higher ZAC means a wealthier person in a more urbanized area. For example, a Metro dweller with the highest income gets 4 × 5 = 20, whereas a Rural individual with low income might be 1 × 1 = 1. This feature aimed to encapsulate purchasing power and lifestyle differences.
- Consumption vs. Awareness Ratio (CF-AB Score): I hypothesized that brand loyalty or engagement might be reflected in how much someone consumes relative to how many alternatives they know. So I computed cf_ab_score as the ratio consume_frequency_score / (consume_frequency_score + awareness_score), rounded to 2 decimals. This yields a number between 0 and 1. A value near 1 means the person consumes the product frequently but is aware of few other brands – possibly a loyal customer. A lower value means either they don't consume much or they know many competitors (which could indicate they are exploring or not brand-loyal). This was an experimental feature to capture the interplay between usage and market awareness.
- Brand Sensitivity Index (BSI): From the data, "Price" and "Quality" stood out as key reasons for choosing brands. I derived a binary flag called bsi to identify respondents who might be especially price/quality sensitive and not currently with the market leader. In code: bsi = 1 if current_brand is Newcomer AND (reason is Price or Quality), else 0. The logic being: if someone isn't using the established brand and their motivation is price or quality, they likely switched or avoided the main brand for those reasons – a signal of potential price sensitivity or value-seeking behavior. This feature was inspired by thinking about brand-switchers vs. loyalists.
- One-Hot Encoding for Categorical Variables: The remaining categorical features (gender, occupation, current_brand, preferable_consumption_size, flavor_preference, purchase_channel, typical_consumption_situations, and the reasons for choosing brand) were nominal with no ordinal relation. I applied one-hot encoding to convert these into binary indicator columns. For example, gender became a column gender_M (1 for Male, 0 for Female), current_brand became current_brand_Newcomer (with Established as the reference category), occupations were split into dummy variables like occupation_Student, occupation_Retired, and so on. We dropped one dummy from each set to avoid redundancy (e.g., for flavor preference, we drop "Exotic" and only keep a dummy for "Traditional" flavor). In total, after encoding and dropping unused original columns, I ended up with ~22 feature columns ready for modeling. (A consolidated code sketch of these transformations follows this list.)
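Here is a condensed pandas sketch of those transformations. The scoring maps mirror the ones described above; the exact category strings (e.g., the frequency labels, the intermediate income brackets, and the reasons_for_choosing_brands column name) are my assumptions about how the survey encoded them:

```python
import pandas as pd

# Ordinal scores for ordered categories (category strings assumed)
freq_map = {"0-2 times": 1, "3-4 times": 2, "5-7 times": 3}
aware_map = {"0 to 1": 1, "2 to 4": 2, "above 4": 3}
df["consume_frequency_score"] = df["consume_frequency(weekly)"].map(freq_map)
df["awareness_score"] = df["awareness_of_other_brands"].map(aware_map)

# Zone Affluence Score: zone_score * income_score
zone_map = {"Rural": 1, "Semi-Urban": 2, "Urban": 3, "Metro": 4}
income_map = {"Not Reported": 0, "<10L": 1, "10L-15L": 2, "16L-25L": 3,
              "26L-35L": 4, ">35L": 5}
df["zac_score"] = df["zone"].map(zone_map) * df["income_levels"].map(income_map)

# Consumption vs. awareness ratio (CF-AB score)
df["cf_ab_score"] = (df["consume_frequency_score"]
                     / (df["consume_frequency_score"] + df["awareness_score"])).round(2)

# Brand Sensitivity Index: newcomer-brand users motivated by price or quality
df["bsi"] = ((df["current_brand"] == "Newcomer")
             & (df["reasons_for_choosing_brands"].isin(["Price", "Quality"]))).astype(int)

# One-hot encode the remaining nominal features, dropping one level from each
# (drop_first drops the alphabetically first level, e.g. Established, Exotic, F)
nominal_cols = ["gender", "occupation", "current_brand", "preferable_consumption_size",
                "flavor_preference", "purchase_channel", "typical_consumption_situations",
                "reasons_for_choosing_brands"]
df = pd.get_dummies(df, columns=nominal_cols, drop_first=True)
```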
Before moving to modeling, I checked for multicollinearity among these features. A correlation heatmap and variance inflation factor (VIF) analysis showed some high inter-correlations. For instance, zac_score was unsurprisingly correlated with income_levels_encoded (since income is part of both) and with the one-hot zone variables. To mitigate this, I made a decision to exclude the original income_levels and zone dummies from the model, trusting zac_score to capture those effects. After dropping a few such collinear features, the VIF values for remaining features were all comfortably low (most below 5). This step gave me confidence that the model wouldn’t suffer from redundant features or unstable coefficients.
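For reference, the VIF check is only a few lines with statsmodels. This is a sketch; the decision to keep zac_score and drop the raw income and zone columns came from iterating on output like this:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Candidate feature matrix (target column name as used in this post)
X = df.drop(columns=["price_range"]).astype(float)

# VIF per feature; values well above ~5-10 flag redundant, collinear features
vif = pd.DataFrame({
    "feature": X.columns,
    "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
}).sort_values("VIF", ascending=False)
print(vif)
```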
Model Building & Evaluation
With a polished feature set, I split the data into training and test sets (I used a standard 80/20 split – roughly 24,000 training and 6,000 test examples). The task is a multiclass classification (4 classes for the price range). Given that the classes were somewhat imbalanced (e.g., the lowest price range had ~12% of respondents vs ~32% in the highest), I made sure to stratify the split by price_range to maintain the class distribution in both train and test.
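The stratified split is straightforward with scikit-learn (the random_state below is just an illustrative choice):

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["price_range"])
y = df["price_range"]

# 80/20 split, stratified on the target so both sets keep the class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```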
I experimented with several algorithms to find the best predictive model:
- Logistic Regression: I started with Logistic Regression as a baseline. With all features scaled (I applied standardization to the numeric features since logistic regression benefits from scaled inputs), it achieved about 80% accuracy on the test set. The model's confusion matrix (and classification report) showed it was especially good at identifying the highest price class (which made sense, as that class was largest and perhaps easiest to separate), with precision and recall around 0.90 for that class. Lower price classes had slightly lower recall (~75–77%), but overall performance was quite decent for a first try.
- Support Vector Machine (SVM): I tried an SVM with an RBF kernel. It also gave roughly 80% accuracy, but training was much slower and it was harder to interpret. Without extensive hyperparameter tuning (which would be computationally heavy on this dataset), SVM didn't significantly outperform logistic regression.
- Random Forest: Next, I trained a Random Forest classifier. It performed similarly (around 80% accuracy as well), and had the advantage of providing feature importance scores. The random forest's feature importances actually highlighted that zac_score and income_levels_encoded were among the top contributors, confirming our earlier intuition that socio-economic status drives willingness to pay. It also indicated bsi and age_group as important features. However, the forest model didn't dramatically beat logistic regression either, likely because our features allowed linear separation reasonably well.
- Naive Bayes: For completeness, I tested a Gaussian Naive Bayes model. Its overall accuracy was slightly lower (~78%) and it tended to be less calibrated (it was over-predicting the majority classes). Not too surprising given NB's conditional independence assumption is probably violated by some of our correlated features.
In the end, Logistic Regression emerged as the model of choice. It was simple, fast, and performed on par with more complex models. Moreover, it gave the most straightforward interpretation – useful for explaining the results to stakeholders at Atliq. For instance, we could extract the logistic model’s coefficients to see the direction of influence: it confirmed that older age groups, higher income, and higher consumption frequency positively drove the prediction towards higher price ranges, whereas having a high BSI (price-sensitive new brand user) pushed predictions toward lower ranges.
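As a rough sketch of that setup (the hyperparameters here are scikit-learn defaults, not necessarily the ones I logged), scaling plus logistic regression fits in a small pipeline, and the per-class coefficients are easy to pull out for interpretation:

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Scale inputs, then fit a multiclass logistic regression
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, logreg.predict(X_test)))

# Per-class coefficients: positive values push predictions toward that price range
clf = logreg.named_steps["logisticregression"]
coefs = pd.DataFrame(clf.coef_, columns=X_train.columns, index=clf.classes_)
print(coefs.round(2))
```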
Segmented Models for Age Groups: One twist we incorporated was training separate models for different age segments. Given the earlier observation that younger consumers behave very differently, I split the data into “young” (Age ≤ 25) and “rest” (Age > 25) subsets. This was a suggestion from my mentor to potentially improve accuracy by capturing segment-specific patterns. I trained a logistic regression on the young subset and another on the rest. Indeed, this slightly improved the predictions: the young demographic’s model was better at distinguishing those who would only pay ₹50–100 from those who’d pay more (since almost no young person was in ₹200–250 range, their model focused on the lower classes), while the model for older respondents could fine-tune on separating the higher price tiers. In deployment, I simply check the user’s age and use the appropriate model for prediction. This two-model approach added complexity, but it reflects a real-world insight that a one-size-fits-all model might not capture segment nuances.
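In code, the segmentation is just two fits plus a small routing function. A sketch, assuming the label-encoded age_group uses 0 for the 18–25 bracket (that encoding is my assumption):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Split the training data by age segment (code 0 assumed to mean 18-25)
young = X_train["age_group"] == 0

model_young = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model_rest = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model_young.fit(X_train[young], y_train[young])
model_rest.fit(X_train[~young], y_train[~young])

def predict_price_range(respondent_row):
    """Route a single respondent (a one-row DataFrame) to the right segment model."""
    model = model_young if respondent_row["age_group"].iloc[0] == 0 else model_rest
    return model.predict(respondent_row)[0]
```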
After selecting the final models, I evaluated them thoroughly on the test set. The overall accuracy was ~80%, with a macro-averaged F1-score around 0.79. In practical terms, this means we can predict a consumer’s preferred price range correctly 4 out of 5 times – not perfect, but a solid starting point for business use. The errors made were mostly off by one category (e.g., predicting ₹150–200 instead of ₹200–250), which is understandable. Extremely few people who liked ₹50–100 were mispredicted as ₹200–250 or vice versa. This gave us confidence that the model’s mistakes aren’t wildly off the mark.
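Checking that the errors are mostly "one bracket off" is easiest with a label-ordered confusion matrix. A sketch using the single logistic regression for brevity; the label strings below are placeholders for however the price ranges are encoded in the data:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix, f1_score

# Order the classes by price so adjacent-bracket confusions sit next to the diagonal
labels = ["50-100", "100-150", "150-200", "200-250"]  # placeholder label strings
y_pred = logreg.predict(X_test)

cm = confusion_matrix(y_test, y_pred, labels=labels)
print(pd.DataFrame(cm, index=labels, columns=labels))
print("macro F1:", round(f1_score(y_test, y_pred, average="macro"), 3))
```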
Experiment Tracking with MLflow & DagsHub
One aspect of this project that I’m particularly happy about is the use of MLflow for experiment tracking. I integrated MLflow into my workflow to log each model run’s parameters and performance metrics, and I set up a remote tracking server on DagsHub. This means every experiment (for example, “LogisticReg_v1 with all features” or “RandomForest_depth10”) was recorded, and I could easily compare metrics across runs on a dashboard.
I logged metrics such as accuracy, precision, recall for each class, and even saved the trained model artifacts. Using DagsHub as the backend gave me a convenient web UI to visualize these runs. It was incredibly useful when I was trying out different feature combinations or algorithms – instead of keeping results in spreadsheets or notes, MLflow neatly recorded everything. My mentors could also see the progress remotely via the DagsHub link I shared, which made discussions about model choices much more efficient. (DagsHub repo link).
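The logging itself is only a few lines. Here is a sketch of one run, with the DagsHub tracking URI left as a placeholder (authentication via environment variables is omitted):

```python
import mlflow
import mlflow.sklearn
from sklearn.metrics import accuracy_score, f1_score

# Point MLflow at the DagsHub-hosted tracking server (placeholder URI)
mlflow.set_tracking_uri("https://dagshub.com/<user>/<repo>.mlflow")
mlflow.set_experiment("codex-price-range")

with mlflow.start_run(run_name="LogisticReg_v1_all_features"):
    y_pred = logreg.predict(X_test)
    mlflow.log_params({"model": "LogisticRegression", "scaling": "standard"})
    mlflow.log_metrics({
        "accuracy": accuracy_score(y_test, y_pred),
        "f1_macro": f1_score(y_test, y_pred, average="macro"),
    })
    # Save the fitted pipeline as a run artifact
    mlflow.sklearn.log_model(logreg, artifact_path="model")
```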
This was my first time using MLflow with DagsHub, and it felt like a professional-grade setup. For anyone doing a lot of modeling experiments, I highly recommend using such a tracking system – it keeps you organized and makes your work reproducible. For instance, when I found that two models had similar performance, I could quickly check the logs to recall which hyperparameters I had used or what data preprocessing was applied. No more confusion like “Did I normalize that input or not?” – the logs have got you covered.
