SHEBEEB S

With a strong foundation in software engineering, I discovered my passion for data-driven decision making and intelligent systems. This curiosity led me to transition into Data Science, where I explore the art of data and passionately solve real-world problems through machine learning and storytelling.

Project Overview

Introduction:
I recently completed a virtual internship at Atliq Technologies where I independently worked on an end-to-end machine learning project over about 3 weeks. The goal was to build a model to predict a survey respondent’s preferred price range for a new beverage product (CodeX) based on their profile and preferences. This project followed the full data science workflow – from exploring a raw survey dataset, through data cleaning and feature engineering, to model training, experiment tracking, and deploying a Streamlit app. In this blog post, I’ll walk through my process and key learnings in a first-person narrative. The work was challenging but rewarding, and all steps were verified by experienced mentors at Atliq.

Data Exploration (Understanding the Survey Data)

The dataset provided (survey_results.csv) contained 30,010 survey responses with 17 features such as age, gender, occupation, income level, weekly consumption frequency, current brand used, reasons for brand choice, etc., and a target variable: price_range (the price range the person is comfortable paying for the product). I began by checking the structure of the data and basic summary statistics. Notably, the respondents spanned a wide age range (18 to over 70) and came from different zones (Urban, Semi-Urban, Rural, Metro) with varied income levels.

Distribution of respondent age groups. A majority (almost 65%) of respondents were under 35 years old, with about 35% in the 18–25 range and 30% in 26–35. Older age brackets taper off, as shown in the pie chart.

There were some immediate insights from initial exploration. For example, nearly half of the respondents were using “Newcomer” brands (as opposed to established brands), indicating a competitive market split. The top reason for choosing a brand was “Price” (cited by roughly 47% of respondents), followed by “Availability” (~22%), with “Quality” and “Brand Reputation” trailing around 15% each. This hinted that many consumers are price-sensitive. In terms of the target variable, the price range preferences were skewed towards higher ranges: only about 12% chose the lowest range (₹50–100), whereas a significant portion (~32%) chose the highest range (₹200–250).

Price range preference distribution. Very few respondents opt for the lowest price bracket (₹50–100), while the majority are comfortable with higher price points. Over two-thirds of respondents chose above ₹150, with ₹200–250 being the most popular category.

I also explored relationships between variables. For instance, cross-tabulating age groups with price preference revealed a clear pattern: younger consumers (18–25) largely preferred lower price ranges (most of them chose ₹50–100 or ₹100–150), whereas older consumers (36+) tended to favor higher price ranges (₹150–250). This generational difference was a crucial insight that eventually informed our modeling strategy (we ended up training separate models for younger vs. older segments, more on that later). There were some surprising findings too – e.g., health-consciousness didn’t vary much with price preference, and awareness of other brands had a moderate positive correlation with willingness to pay more (those aware of many alternative brands were often in higher price brackets, perhaps indicating they were savvy shoppers who still chose the premium option).

Data Cleaning (Tidying Up the Mess)

As with any real-world data, the survey responses needed cleaning before analysis. I followed a thorough set of instructions (provided in a PDF from my mentors) to address data quality issues. Key cleaning steps included:

  • Dropping Duplicates: We had a few duplicate entries (exact repeated responses). These were very minimal (~10 duplicates in 30k rows), so I removed them to keep only unique responses.

  • Fixing Inconsistent Labels: Some categorical entries had typos or case inconsistencies. For example, “Metor” was corrected to “Metro” in the zone column, “Establishd” to “Established”, and “newcomer” to “Newcomer” in current_brand. The income_levels field had "None" values, which we standardized to "Not Reported". These small fixes ensured that each category was represented consistently.

  • Handling Missing Data: A few respondents skipped certain questions (e.g., purchase_channel or consume_frequency(weekly) had some blanks). Since the proportion of missing answers was extremely low, I decided to drop those records for simplicity. This eliminated only a handful of rows and didn’t impact overall patterns.

  • Outlier Treatment: I identified some impossible values, especially in the age field. One record bizarrely had age 604, and a few others were well over 100 years – clearly data entry errors. Using a quick box plot and common sense, I filtered out ages above 70. (The survey’s design only catered up to the “56–70” group, so any age beyond 70 wasn’t in the intended population.) Additionally, I removed one logical inconsistency: a respondent marked as “Student” in occupation while being in the 56–70 age bin – likely an error, so that entry was dropped.

  • Trimming Unused Columns: The respondent_id was just an identifier with no predictive value, so it was dropped. I also decided to derive an age_group category from age and then remove the raw age column, since age groups were more relevant for analysis and modeling in this context.

After cleaning, the dataset was down to about 29,950 valid responses and was much more reliable. This cleanup might seem tedious, but it was crucial. By resolving typos and removing anomalies, I ensured that subsequent analysis wouldn’t be skewed or confused by dirty data. Seeing the data in a cleaner state also made patterns clearer – for instance, after standardizing current_brand, it became obvious that about 51.5% of respondents were using established brands vs 48.5% newcomers, a near-even split.

Data Transformation (Feature Engineering & Preparation)

With clean data in hand, I moved on to feature engineering and transformation to get the dataset ready for modeling. This step was particularly fun as I got to create new features that capture domain insights. Here are the major transformations and features I engineered:

  • Ordinal Encoding for Ordered Categories: Some survey responses had an inherent order, so I encoded them as numeric scores. For example, consume_frequency(weekly) was mapped to a score (0–2 times = 1, 3–4 times = 2, 5–7 times = 3) indicating how frequently the person consumes the beverage. Similarly, awareness_of_other_brands (how many other brands the person knows) was scored as 1 for “0–1”, 2 for “2–4”, and 3 for “>4”. Higher scores meant more frequent consumption or greater brand awareness. We also had ordinal categories for income_levels (“<10L” < “10L–15L” < … < “>35L”) and health_concerns (“Low” < “Medium” < “High” concern about health). I used either manual mapping or label encoding to convert these to numerical scales reflecting their order.

  • Derived Age Group Feature: As noted, I binned age into categories: 18–25, 26–35, 36–45, 46–55, 56–70. This not only smoothed out the effect of outlier ages but also aligned with typical age brackets used in marketing analysis. The model can treat this as either an ordinal feature (by label encoding the bins) or nominal (via one-hot encoding). I initially label-encoded age_group (since there is a natural order to age brackets).

  • Zone Affluence Score (ZAC): I wanted a single metric to capture a person’s socioeconomic status by combining their geographic zone and income level. Intuitively, someone living in a metro city with a high income might have different purchasing power than someone with the same income in a rural area. I created a zone_score (Rural = 1, Semi-Urban = 2, Urban = 3, Metro = 4) and an income_score (from 0 for Not Reported up to 5 for >35L annual income). Multiplying these, I got zac_score = zone_score * income_score. A higher ZAC means a wealthier person in a more urbanized area. For example, a Metro dweller with the highest income gets 4 × 5 = 20, whereas a Rural individual with a low income might be 1 × 1 = 1. This feature aimed to encapsulate purchasing power and lifestyle differences.

  • Consumption vs. Awareness Ratio (CF-AB Score): I hypothesized that brand loyalty or engagement might be reflected in how much someone consumes relative to how many alternatives they know. So I computed cf_ab_score as the ratio: consume_frequency_score / (consume_frequency_score + awareness_score), rounded to 2 decimals. This yields a number between 0 and 1. A value near 1 means the person consumes the product frequently but is aware of few other brands – possibly a loyal customer. A lower value means either they don’t consume much or they know many competitors (which could indicate they are exploring or not brand-loyal). This was an experimental feature to capture the interplay between usage and market awareness.

  • Brand Sensitivity Index (BSI): From the data, “Price” and “Quality” stood out as key reasons for choosing brands. I derived a binary flag called bsi to identify respondents who might be especially price/quality sensitive and not currently with the market leader. In code: bsi = 1 if current_brand is Newcomer AND (reason is Price or Quality) else 0. The logic being: if someone isn’t using the established brand and their motivation is price or quality, they likely switched or avoided the main brand for those reasons – a signal of potential price sensitivity or value-seeking behavior. This feature was inspired by thinking about brand-switchers vs. loyalists.

  • One-Hot Encoding for Categorical Variables: The remaining categorical features (gender, occupation, current_brand, preferable_consumption_size, flavor_preference, purchase_channel, typical_consumption_situations, and the reasons for choosing brand) were nominal with no ordinal relation. I applied one-hot encoding to convert these into binary indicator columns. For example, gender became a column gender_M (1 for Male, 0 for Female), current_brand became current_brand_Newcomer (with Established as the reference category), occupations were split into dummy variables like occupation_Student, occupation_Retired, etc., and so on. We dropped one dummy from each set to avoid redundancy (e.g., for flavor preference, we drop “Exotic” and only keep a dummy for “Traditional” flavor). In total, after encoding and dropping unused original columns, I ended up with ~22 feature columns ready for modeling. (A pandas sketch of these transformation steps follows this list.)
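To make these transformations concrete, here is a minimal pandas sketch of the steps above. The column names follow the survey fields mentioned earlier, but the exact category strings in the mapping dictionaries and the name of the “reasons” column are illustrative assumptions, not a copy of the project code:

```python
import pandas as pd

df = pd.read_csv("survey_results.csv")   # the cleaned version is assumed here

# Ordinal encodings for ordered categories (label strings are illustrative)
freq_map   = {"0 to 2 times": 1, "3 to 4 times": 2, "5 to 7 times": 3}
aware_map  = {"0 to 1": 1, "2 to 4": 2, "above 4": 3}
zone_map   = {"Rural": 1, "Semi-Urban": 2, "Urban": 3, "Metro": 4}
income_map = {"Not Reported": 0, "<10L": 1, "10L - 15L": 2,
              "16L - 25L": 3, "26L - 35L": 4, "> 35L": 5}

df["cf_score"]     = df["consume_frequency(weekly)"].map(freq_map)
df["ab_score"]     = df["awareness_of_other_brands"].map(aware_map)
df["zone_score"]   = df["zone"].map(zone_map)
df["income_score"] = df["income_levels"].map(income_map)
# health_concerns gets a similar Low/Medium/High -> 1/2/3 mapping

# Derived age group (the raw age column is dropped afterwards)
df["age_group"] = pd.cut(df["age"],
                         bins=[17, 25, 35, 45, 55, 70],
                         labels=["18-25", "26-35", "36-45", "46-55", "56-70"])
df["age_group_encoded"] = df["age_group"].map(
    {"18-25": 1, "26-35": 2, "36-45": 3, "46-55": 4, "56-70": 5})

# Engineered features
df["zac_score"]   = df["zone_score"] * df["income_score"]
df["cf_ab_score"] = (df["cf_score"] / (df["cf_score"] + df["ab_score"])).round(2)
df["bsi"] = ((df["current_brand"] == "Newcomer")
             & df["reasons_for_choosing_brands"].isin(["Price", "Quality"])).astype(int)

# One-hot encode the remaining nominal columns (drop one dummy per set)
nominal_cols = ["gender", "occupation", "current_brand", "preferable_consumption_size",
                "flavor_preference", "purchase_channel",
                "typical_consumption_situations", "reasons_for_choosing_brands"]
df = pd.get_dummies(df, columns=nominal_cols, drop_first=True)
```

After this, the raw age, respondent_id, and the original text columns behind the ordinal scores are dropped, leaving the ~22 model-ready features mentioned above.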

Before moving to modeling, I checked for multicollinearity among these features. A correlation heatmap and variance inflation factor (VIF) analysis showed some high inter-correlations. For instance, zac_score was unsurprisingly correlated with income_levels_encoded (since income is part of both) and with the one-hot zone variables. To mitigate this, I decided to exclude the original income_levels and zone dummies from the model, trusting zac_score to capture those effects. After dropping a few such collinear features, the VIF values for remaining features were all comfortably low (most below 5). This step gave me confidence that the model wouldn’t suffer from redundant features or unstable coefficients.
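The VIF check itself is only a few lines with statsmodels. A minimal sketch, assuming X is the fully encoded, numeric feature DataFrame:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# X is the encoded feature DataFrame (numeric columns only)
X_const = sm.add_constant(X)

vif = pd.DataFrame({
    "feature": X_const.columns,
    "VIF": [variance_inflation_factor(X_const.values, i)
            for i in range(X_const.shape[1])],
})
print(vif.sort_values("VIF", ascending=False))
```

Features that came back with very high VIF (the overlapping income and zone columns) were the ones dropped before modeling.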

Model Building & Evaluation

With a polished feature set, I split the data into training and test sets (I used a standard 80/20 split – about 23,900 training and 6,000 testing examples). The task was a multiclass classification problem (4 classes for the price range). Given that the classes were somewhat imbalanced (e.g., the lowest price range had ~12% of respondents vs ~32% in the highest), I made sure to stratify the split by price_range to maintain the class distribution in both train and test.
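The split itself is a one-liner with scikit-learn; here is a sketch (the random seed is illustrative):

```python
from sklearn.model_selection import train_test_split

# X: encoded features, y: the price_range labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # 80/20 split
    stratify=y,         # preserve the price_range class proportions
    random_state=42,    # illustrative seed for reproducibility
)
```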

I experimented with several algorithms to find the best predictive model:

  • Logistic Regression: I started with Logistic Regression as a baseline. With all features scaled (I applied standardization to the numeric features since logistic regression benefits from scaled inputs), it achieved about 80% accuracy on the test set. The model’s confusion matrix (and classification report) showed it was especially good at identifying the highest price class (which made sense, as that class was largest and perhaps easiest to separate), with precision and recall around 0.90 for that class. Lower price classes had slightly lower recall (~75–77%), but overall performance was quite decent for a first try.

  • Support Vector Machine (SVM): I tried an SVM with an RBF kernel. It also gave roughly 80% accuracy, but training was much slower and it was harder to interpret. Without extensive hyperparameter tuning (which would be computationally heavy on this dataset), SVM didn’t significantly outperform logistic regression.

  • Random Forest: Next, I trained a Random Forest classifier. It performed similarly (around 80% accuracy as well), and had the advantage of providing feature importance scores. The random forest’s feature importances actually highlighted that zac_score and income_levels_encoded were among the top contributors, confirming our earlier intuition that socio-economic status drives willingness to pay. It also indicated bsi and age_group as important features. However, the forest model didn’t dramatically beat logistic regression either, likely because our features allowed linear separation reasonably well.

  • Naive Bayes: For completeness, I tested a Gaussian Naive Bayes model. Its overall accuracy was slightly lower (~78%) and it tended to be less calibrated (it was over-predicting the majority classes). Not too surprising given NB’s conditional independence assumption is probably violated by some of our correlated features. (A quick sketch of how I compared these four models follows this list.)
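In practice the comparison boiled down to a loop like the sketch below. The hyperparameters shown are illustrative defaults rather than the exact values from my runs, and scaling lives inside a pipeline only for the models that benefit from it:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

candidates = {
    "logreg":        make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "svm_rbf":       make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "naive_bayes":   GaussianNB(),
}

for name, model in candidates.items():
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: test accuracy = {acc:.3f}")
```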

In the end, Logistic Regression emerged as the model of choice. It was simple, fast, and performed on par with more complex models. Moreover, it gave the most straightforward interpretation – useful for explaining the results to stakeholders at Atliq. For instance, we could extract the logistic model’s coefficients to see the direction of influence: it confirmed that older age groups, higher income, and higher consumption frequency positively drove the prediction towards higher price ranges, whereas having the BSI flag set (a price-sensitive newcomer-brand user) pushed predictions toward lower ranges.
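Pulling those coefficients out is straightforward. A small sketch, assuming the pipeline dictionary from the comparison above and that X_train is a DataFrame carrying the engineered column names:

```python
import pandas as pd

# Grab the fitted LogisticRegression out of its scaling pipeline
logreg = candidates["logreg"].named_steps["logisticregression"]

coef_table = pd.DataFrame(
    logreg.coef_,              # shape: (n_classes, n_features)
    index=logreg.classes_,     # one row of coefficients per price range
    columns=X_train.columns,
)
# Positive values push a prediction toward that price range (on standardized inputs)
print(coef_table[["zac_score", "cf_ab_score", "bsi"]].round(2))
```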

Segmented Models for Age Groups: One twist we incorporated was training separate models for different age segments. Given the earlier observation that younger consumers behave very differently, I split the data into “young” (Age ≤ 25) and “rest” (Age > 25) subsets. This was a suggestion from my mentor to potentially improve accuracy by capturing segment-specific patterns. I trained a logistic regression on the young subset and another on the rest. Indeed, this slightly improved the predictions: the young demographic’s model was better at distinguishing those who would only pay ₹50–100 from those who’d pay more (since almost no young person was in the ₹200–250 range, their model focused on the lower classes), while the model for older respondents could fine-tune on separating the higher price tiers. In deployment, I simply check the user’s age and use the appropriate model for prediction. This two-model approach added complexity, but it reflects a real-world insight that a one-size-fits-all model might not capture segment nuances.
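A sketch of the segmentation and the age-based routing is below. It assumes df, X, and y share the same row index; the file names are illustrative and the train/test split is omitted for brevity:

```python
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_segment(X_seg, y_seg):
    """Fit a scaled logistic regression on one age segment."""
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    return model.fit(X_seg, y_seg)

young_mask  = df["age_group"] == "18-25"          # "young" segment (age <= 25)
model_young = fit_segment(X[young_mask],  y[young_mask])
model_rest  = fit_segment(X[~young_mask], y[~young_mask])

# Persist both models for the Streamlit app (file names are illustrative)
joblib.dump(model_young, "model_young.joblib")
joblib.dump(model_rest,  "model_rest.joblib")

def predict_price_range(age, feature_row):
    """Route a single encoded feature row to the appropriate segment model."""
    model = model_young if age <= 25 else model_rest
    return model.predict(feature_row)[0]
```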

After selecting the final models, I evaluated them thoroughly on the test set. The overall accuracy was ~80%, with a macro-averaged F1-score around 0.79. In practical terms, this means we can predict a consumer’s preferred price range correctly 4 out of 5 times – not perfect, but a solid starting point for business use. The errors made were mostly off by one category (e.g., predicting ₹150–200 instead of ₹200–250), which is understandable. Extremely few people who liked ₹50–100 were mispredicted as ₹200–250 or vice versa. This gave us confidence that the model’s mistakes aren’t wildly off the mark.
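The evaluation relied on scikit-learn's standard reporting tools. A sketch, assuming y_pred collects the age-routed predictions for the test set and that the label strings shown are illustrative:

```python
from sklearn.metrics import classification_report, confusion_matrix, f1_score

price_labels = ["50-100", "100-150", "150-200", "200-250"]   # illustrative class labels

print(classification_report(y_test, y_pred))
print("Macro F1:", round(f1_score(y_test, y_pred, average="macro"), 3))

# Rows = actual, columns = predicted; most errors sit just off the diagonal
print(confusion_matrix(y_test, y_pred, labels=price_labels))
```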

Experiment Tracking with MLflow & DagsHub

One aspect of this project that I’m particularly happy about is the use of MLflow for experiment tracking. I integrated MLflow into my workflow to log each model run’s parameters and performance metrics, and I set up a remote tracking server on DagsHub. This means every experiment (for example, “LogisticReg_v1 with all features” or “RandomForest_depth10”) was recorded, and I could easily compare metrics across runs on a dashboard.

I logged metrics such as accuracy, precision, recall for each class, and even saved the trained model artifacts. Using DagsHub as the backend gave me a convenient web UI to visualize these runs. It was incredibly useful when I was trying out different feature combinations or algorithms – instead of keeping results in spreadsheets or notes, MLflow neatly recorded everything. My mentors could also see the progress remotely via the DagsHub link I shared, which made discussions about model choices much more efficient. (DagsHub repo link).
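A stripped-down version of the logging code looks like this. The tracking URI, experiment name, and run name are placeholders (DagsHub exposes an MLflow endpoint per repository, and credentials are typically supplied through environment variables rather than hard-coded):

```python
import mlflow
import mlflow.sklearn
from sklearn.metrics import accuracy_score, f1_score

# Placeholder URI -- substitute your own DagsHub repo's MLflow endpoint
mlflow.set_tracking_uri("https://dagshub.com/<user>/<repo>.mlflow")
mlflow.set_experiment("codex-price-range")

with mlflow.start_run(run_name="LogisticReg_v1_all_features"):
    # `model` is whichever estimator is being tried in this run
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    mlflow.log_param("model_type", "LogisticRegression")
    mlflow.log_param("n_features", X_train.shape[1])
    mlflow.log_metric("accuracy", accuracy_score(y_test, y_pred))
    mlflow.log_metric("macro_f1", f1_score(y_test, y_pred, average="macro"))

    # Save the fitted model as a run artifact so it can be reloaded later
    mlflow.sklearn.log_model(model, "model")
```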

This was my first time using MLflow with DagsHub, and it felt like a professional-grade setup. For anyone doing a lot of modeling experiments, I highly recommend using such a tracking system – it keeps you organized and makes your work reproducible. For instance, when I found that two models had similar performance, I could quickly check the logs to recall which hyperparameters I had used or what data preprocessing was applied. No more confusion like “Did I normalize that input or not?” – the logs have got you covered.

Deployment (Interactive Prediction App)

The final step of the project was deployment – turning our trained model into a tool that others can use. I developed a simple Streamlit web application that allows users (or stakeholders at Atliq) to input a new person’s details and get a predicted price range. The app consists of a friendly form with all the required inputs corresponding to our model features: age, gender, occupation, zone, income, consumption frequency, brand awareness, current brand, reason for brand choice, flavor preference, health concern level, typical consumption situation, packaging preference, and purchase channel.

Under the hood, the Streamlit app uses the same preprocessing logic as our data pipeline to transform the inputs. I wrote a helper (prediction_helper.py) that mirrors all the encoding steps: for example, it will calculate the person’s zac_score, cf_ab_score, BSI flag, and create the dummy variables in the exact format the model expects. We had saved the two logistic regression models (for young and rest) as joblib files during training. The app loads these artifacts at startup. When you hit the “Calculate Price Range” button, it determines which model to use based on the age input, feeds the processed features into it, and outputs the predicted price range category (e.g., “₹150–200”).
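To give a feel for it, here's a trimmed-down sketch of the app. Only a few of the inputs are shown, build_feature_row is a hypothetical stand-in for the encoding function inside prediction_helper.py, and the model file names are illustrative:

```python
# app.py -- trimmed-down sketch of the Streamlit front end
import joblib
import streamlit as st
from prediction_helper import build_feature_row   # hypothetical helper name

model_young = joblib.load("model_young.joblib")    # illustrative file names
model_rest  = joblib.load("model_rest.joblib")

st.title("CodeX Price Range Predictor")

age    = st.number_input("Age", min_value=18, max_value=70, value=25)
gender = st.selectbox("Gender", ["Male", "Female"])
zone   = st.selectbox("Zone", ["Rural", "Semi-Urban", "Urban", "Metro"])
income = st.selectbox("Income level", ["Not Reported", "<10L", "10L - 15L",
                                       "16L - 25L", "26L - 35L", "> 35L"])
# ...remaining inputs (occupation, frequency, current brand, reason, flavor, etc.)

if st.button("Calculate Price Range"):
    row = build_feature_row(age=age, gender=gender, zone=zone, income=income)
    model = model_young if age <= 25 else model_rest
    st.success(f"Predicted price range: {model.predict(row)[0]}")
```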

I paid attention to making the app intuitive – dropdowns have human-readable options (like “High (Very health-conscious)” for health concerns or “Metro” for zone). The prediction result is shown with a success message. Because it’s all local and lightweight, the prediction happens almost instantly. It was really satisfying to see the whole pipeline come together in this app: I could input a hypothetical consumer profile and see the model’s guess. For example, if I select a 22-year-old student with low income who highly values price, the model might predict “₹50–100” as their range. Changing the profile to a 40-year-old professional with a high income and concern for quality yields a higher range prediction. These behaviors matched our expectations and the insights from the data exploration, which was reassuring.

From a deployment perspective, using Streamlit was a great choice for a quick demo. It allowed me to focus on the functionality without worrying about front-end intricacies. If this were a production scenario, we would likely deploy the model as an API service, but for portfolio purposes and internal demonstration, a Streamlit app running the models is perfect.

Conclusion & Reflections

Working on this project was an incredible learning experience. Over the course of a few weeks, I got to own the entire workflow – starting from a raw dataset to delivering a usable web application. Along the way, I honed important skills: data cleaning (those pesky typos and outliers!), feature engineering (where domain knowledge meets creativity), and model development. I also learned the importance of experiment tracking and collaboration – tools like MLflow/DagsHub and guidance from mentors ensured that I didn’t get lost in the iteration loop and could justify my choices with evidence.

In terms of results, our model's ~80% accuracy in predicting price range preferences is a strong outcome. It means Atliq can potentially target customers better – for instance, identifying which segments are likely to prefer premium pricing versus which need more budget-friendly options. The company could use these insights for personalized marketing or product positioning. Moreover, the insight that age segments behave differently and the creation of separate models for them could inform how Atliq approaches different demographics in strategy.

On a personal note, this project was done remotely as a virtual internship, which taught me a lot about self-discipline and communication. I would update my mentors regularly, often sharing my DagsHub experiment dashboard and discussing next steps on video calls. Their feedback was invaluable, but they also gave me the independence to solve problems on my own – which really boosted my confidence. By the end of the project, I felt much more comfortable with the end-to-end process of building and deploying ML solutions.

Next Steps: If I had more time, I would look into tuning the models further (perhaps using grid search or a tool like Hyperopt for the logistic regression’s regularization strength or trying an XGBoost model). I’d also consider gathering more data or features – e.g., including any behavioral survey questions that might improve the model. And of course, one can always work on improving the UI/UX of the app, or deploying it on a cloud platform for broader access.

In summary, this internship project at Atliq Technologies not only resulted in a cool predictive app for survey responses but also marked a significant milestone in my growth as a data scientist. It’s now a highlight project in my portfolio, demonstrating my ability to take a project from raw data all the way to a deployed application. I hope you enjoyed this walkthrough, and feel free to reach out if you have any questions about the process or the project! Here’s to many more learning adventures in the world of data science.