This project is a complete end-to-end House Price Prediction Web Application built using Machine Learning and a real-world dataset inspired by the Gurgaon housing market. It began as a classic regression problem but quickly evolved into a multi-functional product that combines price prediction, recommendations, and analytics (each described below).
The goal was to go beyond the typical ML regression project and create a solution that mirrors a real estate tech product, helpful for both buyers and analysts.
To give this project a realistic business edge, the focus was narrowed down to Gurgaon (Gurugram), one of India's rapidly developing cities with a highly organized real estate sector. The city's sector-based structure made it ideal for analyzing price trends and regional comparisons.
To gather meaningful insights and define the data schema, I referred to property listings from 99acres.com. This helped in designing a realistic data schema grounded in actual listings.
This grounding in real-world data makes the app much more than a data science demo: it's a prototype for a practical solution.
🔮 Price Predictor
A trained regression model that estimates property prices based on location, size, furnishing, and other features.
🧠 Recommendation System
Recommends similar houses based on user preferences (area, price range, BHK, etc.).
📊 Analytics & Visual Insights
Interactive plots and heatmaps to show pricing trends across sectors, popular property types, and more.
🖥️ Web Application
Developed using Flask, HTML/CSS/JS, and integrated with ML models, offering a sleek, responsive user interface.
The project followed a structured pipeline to ensure accuracy, explainability, and usability. Below is a breakdown of each key stage:
Collected structured property data focused on Gurgaon. The dataset includes features like area, bedrooms, bathrooms, furnishing, floor number, and address.
Data inspiration and schema were designed by referencing real listings from 99acres.com.
Performed thorough preprocessing to handle missing values, inconsistent formats, and outliers.
Generated new features such as price_per_sqft, furnishing_type, and a luxury_score.
Tried multiple regression models, from Linear Regression to tree-based ensembles like Random Forest, Extra Trees, and XGBoost.
Best model selected based on R², RMSE, and cross-validation.
Tuned using GridSearchCV and RandomizedSearchCV.
Built a powerful analytics dashboard for exploring sector-wise price trends, property types, and feature distributions.
Interactive plots were made using Plotly, Seaborn, and Matplotlib.
A lightweight content-based system suggesting similar houses based on location advantages, price details, and facilities.
Built using cosine similarity and custom filtering logic.
Deployed the app using Flask, with the trained pipeline served on Render.
The first step of the project was to gather real-world data that reflects the actual property market. Since the objective was to predict house prices and recommend similar listings, I chose Gurugram (Gurgaon), a city where properties are well-structured across sectors, making it suitable for spatial and categorical analysis.
I collected data from 99acres.com, a popular Indian real estate listing platform.
I scraped flats, houses, and apartment listings, ensuring the dataset covered all types of residential properties. For each listing, I collected details such as:
```python
property_data = {
'property_name': ...,
'link': ...,
'society': ...,
'price': ...,
'area': ...,
'areaWithType': ...,
'bedRoom': ...,
'bathroom': ...,
'balcony': ...,
'additionalRoom': ...,
'address': ...,
'floorNum': ...,
'facing': ...,
'agePossession': ...,
'nearbyLocations': ...,
'description': ...,
'furnishDetails': ...,
'features': ...,
'rating': ...,
'property_id': ...
}
```
Before building any machine learning model, the foundation lies in cleaning and understanding the data. Raw real estate data is often inconsistent, messy, and incomplete, so I took this stage seriously and applied both domain knowledge and logic to build a high-quality dataset.
✅ Initial Manual Checks in Excel
✅ Scripted Cleaning in Jupyter
After initial Excel-based cleanup, both datasets were loaded into Python for a deeper, programmable cleaning process.
Here's what was handled, using both research and common sense:

- Converted textual measurements into numbers (e.g., Sq. Ft. → numeric).

After thorough filtering, renaming, parsing, and formatting, the final cleaned dataset was saved as:
📁 gurgaon_properties.csv
It contains the following clean, usable columns:
```python
['property_name', 'property_type', 'society', 'price', 'price_per_sqft',
 'area', 'areaWithType', 'bedRoom', 'bathroom', 'balcony',
 'additionalRoom', 'address', 'floorNum', 'facing', 'agePossession',
 'nearbyLocations', 'description', 'furnishDetails', 'features',
 'rating', 'noOfFloor']
```
`gurgaon_properties.csv` acts as the master dataset for all further stages: EDA, ML modeling, recommendations, and analytics.

After data cleaning, the dataset was enhanced through careful feature engineering, a crucial step to convert raw, unstructured values into meaningful and predictive features. This process combined domain knowledge with analytical techniques to enrich the data and boost model performance.
- Parsed the `areaWithType` column using regular expressions.
- Derived `price_per_sqft` by dividing total price by usable area.
- Parsed the `additionalRoom` column to identify and encode `has_servant_room` and `has_store_room`.
- Parsed the `furnishDetails` column into a consolidated `furnishing_type` feature.
- Bucketed `agePossession` into 5 interpretable categories.
To capture the lifestyle and luxury quotient of each property, a scoring and clustering system was developed:
A `luxury_score` was computed by summing the weights of the features present in each listing (a sketch of these derivations follows below).

These engineered features allowed the model to distinguish properties not just by size or price, but by lifestyle value, making the dataset ready for sophisticated modeling and pricing recommendations.
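A minimal sketch of these derivations, using the cleaned column names shown earlier (the luxury weights dictionary is illustrative, not the exact weighting used):

```python
import pandas as pd

df = pd.read_csv("gurgaon_properties.csv")

# Price per square foot from total price and usable area
df["price_per_sqft"] = df["price"] / df["area"]

# Flag extra rooms mentioned in the additionalRoom column
df["has_servant_room"] = df["additionalRoom"].str.contains("servant", case=False, na=False).astype(int)
df["has_store_room"] = df["additionalRoom"].str.contains("store", case=False, na=False).astype(int)

# Luxury score: sum the weights of amenities present in the features column
luxury_weights = {"Swimming Pool": 8, "Club House": 6, "Gated Community": 5}  # illustrative weights
df["luxury_score"] = df["features"].fillna("").apply(
    lambda feats: sum(w for amenity, w in luxury_weights.items() if amenity in feats)
)
```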
📁 The final dataset from this stage was saved as:
gurgaon_properties_featured.csv
After comprehensive feature engineering, a detailed Exploratory Data Analysis (EDA) was conducted to understand the underlying patterns, distributions, and relationships in the Gurgaon real estate dataset.
This EDA is categorized into three key notebooks:
- `univar.ipynb` – Univariate Analysis
- `multivariateEda.ipynb` – Multivariate Analysis
- `pandas_profiling.ipynb` – Automated Profile Report

Mean Price: ₹2.53 Cr | Median Price: ₹1.52 Cr
- Segmenting by furnishing status or age reveals deeper feature interactions.
- Applied `np.log1p()` on skewed variables like price.
- Reversed the transformation with `np.expm1()` when needed.

Outliers can distort data insights and negatively impact the performance of machine learning models. This phase focuses on detecting and handling outliers in key real estate features such as price, price per sqft, and area-to-room ratio to ensure data consistency, accuracy, and interpretability.
Outliers can arise from:
- Data entry errors or unrealistic values (e.g., extreme price per sqft).

✅ `price` Column (More Treatment Done Here)

- Flagged extreme values using the IQR method (IQR = Q3 - Q1).
- Exported flagged rows to `outliers.csv` for manual review.
- Manual Review & Action: corrected or dropped unjustifiable entries. A sketch of the IQR flagging is shown below.
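A sketch of the IQR flagging on the `price` column (the 1.5 multiplier is the conventional choice; the exact cutoffs used may differ):

```python
# df: the cleaned property DataFrame
Q1, Q3 = df["price"].quantile(0.25), df["price"].quantile(0.75)
IQR = Q3 - Q1

# Conventional 1.5 * IQR fences; rows outside are flagged, not silently dropped
is_outlier = (df["price"] < Q1 - 1.5 * IQR) | (df["price"] > Q3 + 1.5 * IQR)
df[is_outlier].to_csv("outliers.csv", index=False)  # export for manual review
```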
📉 Result: Cleaner and more realistic price distribution.
✅ `price_per_sqft` Column

Steps:

- Removed listings with extreme `price_per_sqft` and no justification.
- Preserved legitimate high-value listings.
📉 Result: Distribution normalized to reflect realistic per sqft prices.
✅ Area-to-Room Ratio

Used architectural logic to check whether the area is plausible for the room count.

Steps:

- `property_type`: House vs Flat.
- `floor_count`: Used to assess multi-story houses.

🏗️ Result: Architecturally consistent dataset.
Kept listings whose dimensions were architecturally plausible; removed those that were not.
| Category | Action Taken |
|---|---|
| `price` | IQR method + manual correction |
| `price_per_sqft` | IQR method + logical filtering |
| Area-to-Room Ratio | Domain rule-based validation |
| Total Rows Dropped | ~400+ (after thorough analysis) |
| Final Dataset | Clean, consistent, and ML-ready |
- `outliers.csv` – Contains the removed/flagged outlier rows for transparency.
- `outlier_analysis.ipynb` – Notebook with code, plots, and logic for detection.
- `cleaned_data.csv` – Final dataset post outlier removal.

In this stage, missing values weren't just cleaned; they were understood. I leveraged real-world relationships and logical imputation strategies to impute values with confidence and maintain the integrity of the Gurgaon Real Estate dataset.
| Feature | Missing Values |
|---|---|
| `balcony` | 0 |
| `floorNum` | 17 |
| `facing` | 1,011 |
| `super_built_up_area` | 1,680 |
| `built_up_area` | 1,968 |
| `carpet_area` | 1,715 |
| `agePossession` | 0 (but 291 "Undefined") |
✅ `built_up_area` – Based on 530 Valid Samples

From 530 rows where `carpet_area`, `built_up_area`, and `super_built_up_area` were all present, I derived realistic ratios:
| Ratio | Value | Meaning |
|---|---|---|
| carpet_area / built_up_area | ≈ 0.90 | Carpet is ~90% of built-up area |
| super_built_up_area / built_up_area | ≈ 1.105 | Super built-up is ~110.5% of built-up |
| Available Columns | Estimation Formula |
|---|---|
| Carpet + Super | Average of (carpet_area / 0.9) and (super_built_up_area / 1.105) |
| Only Carpet | built_up_area = carpet_area / 0.9 |
| Only Super | built_up_area = super_built_up_area / 1.105 |
🎯 Result: All 1,968 missing values were filled logically and confidently.
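A sketch of the ratio-based fill, directly following the estimation formulas above (`df` is the working DataFrame):

```python
import numpy as np
import pandas as pd

def estimate_built_up(row):
    # Both companion areas known: average the two ratio-based estimates
    if pd.notna(row["carpet_area"]) and pd.notna(row["super_built_up_area"]):
        return (row["carpet_area"] / 0.9 + row["super_built_up_area"] / 1.105) / 2
    if pd.notna(row["carpet_area"]):            # only carpet known
        return row["carpet_area"] / 0.9
    if pd.notna(row["super_built_up_area"]):    # only super built-up known
        return row["super_built_up_area"] / 1.105
    return np.nan

missing = df["built_up_area"].isna()
df.loc[missing, "built_up_area"] = df[missing].apply(estimate_built_up, axis=1)
```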
✅ `floorNum` – Grouped by Context

The 17 missing values in `floorNum` were mainly from properties labeled as "House".
✅ Strategy: imputed the median floor within groups defined by `sector`, `property_type`, and room count.

🎯 Result: Imputed floor levels were contextually valid and realistic.
✅ `agePossession` – Replacing "Undefined"

Although not NaN, 291 rows had "Undefined" in `agePossession`.
✅ Strategy:

- Grouped by `sector` + `property_type` and imputed with the mode.
- Fell back to the mode within `sector` only when needed.

🎯 Result: All 291 "Undefined" entries were replaced with meaningful labels. A sketch covering both grouped imputations follows.
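Both imputations follow the same grouped-transform pattern; a minimal sketch with grouping keys inferred from the strategy described above:

```python
import pandas as pd

# floorNum: fill with the median floor of comparable properties
df["floorNum"] = df.groupby(["sector", "property_type", "bedRoom"])["floorNum"] \
                   .transform(lambda s: s.fillna(s.median()))

# agePossession: replace "Undefined" with the group mode, then a sector-level fallback
def mode_fill(s):
    known = s[s != "Undefined"]
    if known.empty:
        return s  # no information in this group; leave for the fallback pass
    return s.where(s != "Undefined", known.mode().iloc[0])

df["agePossession"] = df.groupby(["sector", "property_type"])["agePossession"].transform(mode_fill)
df["agePossession"] = df.groupby("sector")["agePossession"].transform(mode_fill)
```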
| Feature | Missing Before | Missing After | Imputation Strategy |
|---|---|---|---|
| `built_up_area` | 1,968 | 0 | Ratio-based estimation using 530 rows |
| `floorNum` | 17 | 0 | Median from location-type grouping |
| `agePossession` | 291 ("Undefined") | 0 | Mode-based contextual replacement |
Feature selection is a vital part of the machine learning pipeline, helping models focus on the most informative inputs. In this stage, we took a statistically and mathematically driven approach to find and finalize the most important features influencing property prices in Gurgaon.
We dropped the following features before starting selection:
- `society`: Not useful for generalization or prediction.
- `price_per_sqft`: Could leak target information (data leakage), as it is derived from price.

To enhance interpretability and usability, we transformed numerical features into categorical representations:
`luxury_category`

A score was generated based on various property attributes and categorized as:
```python
def categorize_luxury(score):
    if 0 <= score < 50:
        return "Low"
    elif 50 <= score < 150:
        return "Medium"
    elif 150 <= score <= 175:
        return "High"
    else:
        return None  # out-of-range scores
```
`floor_category`

The floor number was converted to a floor category to enhance user understanding:
```python
def categorize_floor(floor):
    if 0 <= floor <= 2:
        return "Low Floor"
    elif 3 <= floor <= 10:
        return "Mid Floor"
    elif 11 <= floor <= 51:
        return "High Floor"
    else:
        return None  # floors outside the expected range
```
Instead of relying on a single method, we used 8 feature selection techniques, treating each as an expert. The final selection was based on the average importance across all techniques (a sketch of this aggregation follows the table below).
| Feature | Corr | RF | GBoost | Perm | Lasso | RFE | LinearReg | SHAP |
|---|---|---|---|---|---|---|---|---|
| sector | -0.21 | 0.102 | 0.103 | 0.246 | -0.07 | 0.104 | -0.079 | 0.384 |
| bedRoom | 0.59 | 0.024 | 0.038 | 0.041 | 0.014 | 0.028 | 0.017 | 0.050 |
| bathroom | 0.61 | 0.026 | 0.036 | 0.035 | 0.275 | 0.024 | 0.282 | 0.113 |
| balcony | 0.27 | 0.013 | 0.002 | 0.013 | -0.044 | 0.012 | -0.066 | 0.040 |
| agePossession | -0.13 | 0.015 | 0.004 | 0.013 | 0.000 | 0.014 | -0.002 | 0.027 |
| built_up_area | 0.75 | 0.651 | 0.678 | 0.899 | 1.510 | 0.653 | 1.513 | 1.256 |
| study room | 0.24 | 0.008 | 0.003 | 0.004 | 0.172 | 0.008 | 0.180 | 0.020 |
| servant room | 0.39 | 0.019 | 0.023 | 0.040 | 0.161 | 0.018 | 0.170 | 0.096 |
| store room | 0.31 | 0.008 | 0.010 | 0.004 | 0.200 | 0.008 | 0.204 | 0.017 |
| pooja room | 0.32 | 0.006 | 0.000 | 0.003 | 0.074 | 0.005 | 0.077 | 0.012 |
| others | -0.01 | 0.003 | 0.000 | 0.002 | -0.017 | 0.003 | -0.025 | 0.007 |
| furnishing_type | 0.23 | 0.011 | 0.003 | 0.010 | 0.164 | 0.010 | 0.173 | 0.027 |
| luxury_category | 0.01 | 0.008 | 0.001 | 0.008 | 0.055 | 0.006 | 0.066 | 0.016 |
| floor_category | 0.04 | 0.007 | 0.000 | 0.007 | -0.003 | 0.006 | -0.013 | 0.025 |
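A sketch of the aggregation step (min-max normalization of absolute scores is my assumption for making the eight scales comparable, not necessarily the exact method used):

```python
import pandas as pd

# scores: DataFrame with features as the index and the 8 techniques as columns,
# laid out exactly as in the table above.
normalized = scores.abs().apply(
    lambda col: (col - col.min()) / (col.max() - col.min())
)
ranking = normalized.mean(axis=1).sort_values(ascending=False)
print(ranking)  # features ordered by average normalized importance
```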
After averaging and evaluating all 8 techniques, the following features were selected for the final dataset:
- `property_type`
- `sector`
- `bedRoom`
- `bathroom`
- `balcony`
- `agePossession`
- `built_up_area`
- `servant room`
- `store room`
- `furnishing_type`
- `luxury_category`
- `floor_category`
- `price` (Target)

The final result was saved as:
gurgaon_properties_post_feature_selection.csv
```
property_type,sector,bedRoom,bathroom,balcony,agePossession,built_up_area,servant room,store room,furnishing_type,luxury_category,floor_category,price
0.0,36.0,3.0,2.0,2.0,1.0,850.0,0.0,0.0,0.0,1.0,1.0,0.82
```
This stage marks the beginning of our modeling journey. The goal was to establish a baseline model: a foundational benchmark to compare future, more complex models against.
Rather than aiming for perfection here, the focus was on simplicity, consistency, and a fair evaluation setup.
And we did exactly that. Let's walk through it.
To ensure fairness and consistency in model evaluation, I applied careful preprocessing:
Categorical features like `sector`, `furnishing_type`, `luxury_category`, and `floor_category` were converted into numerical format using One-Hot Encoding, allowing the linear model to understand them without introducing bias from arbitrary numerical mapping.
Since Linear Regression is sensitive to feature magnitudes, Standard Scaling was applied to all relevant numerical features such as `built_up_area`, `bedRoom`, and `bathroom`.
This ensured all features contribute equally during model training.
The target column, price, was right-skewed, meaning most properties had lower prices with a few very high outliers.
To normalize the distribution, a log transformation was applied. This helps stabilize variance and improves the linear modelโs ability to generalize.
All preprocessing steps were integrated into a single pipeline along with the Linear Regression model. This made the workflow efficient, clean, and reproducibleโensuring that transformations were consistently applied across training and validation splits.
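A minimal sketch of such a pipeline, with the column lists and fold count assumed from this write-up rather than taken from the notebook:

```python
import numpy as np
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

cat_cols = ["sector", "furnishing_type", "luxury_category", "floor_category"]
num_cols = ["built_up_area", "bedRoom", "bathroom"]

preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ("num", StandardScaler(), num_cols),
])

# Log-transform the skewed price target; predictions are inverted with expm1
model = TransformedTargetRegressor(
    regressor=Pipeline([("prep", preprocess), ("lr", LinearRegression())]),
    func=np.log1p, inverse_func=np.expm1,
)

# df: the post-feature-selection dataset
X, y = df.drop(columns=["price"]), df["price"]
scores = cross_val_score(model, X, y, cv=10, scoring="r2")
print(scores.mean(), scores.std())
```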
To validate the model fairly, I used K-Fold Cross-Validation:
| Metric | Value |
|---|---|
| R² Score (Mean) | 0.8845 ✅ Excellent |
| R² Score (Std Dev) | 0.0147 📉 Very Stable |
| Mean Absolute Error | 0.5324 👍 Reasonable Error |
This performance is impressive for a baseline model, proving that the selected features and preprocessing strategy are strong even before introducing any model tuning or complexity.
All work for this stage is saved in:
📁 baseline.ipynb

It contains the full preprocessing pipeline, model training, and cross-validation results.
"You can't improve what you don't measure."
This baseline model gave us clear measurement and direction. With a score of 0.88+ out-of-the-box, it validated that our feature selection, data cleaning, and transformation logic from earlier stages were strong.
This stage focuses on experimenting with different preprocessing techniques and regression models to identify the most accurate and robust model for price prediction. The entire process was designed to work within a unified pipeline architecture to maintain consistency and reproducibility across experiments.
To understand the impact of feature encoding on model performance, I tested three types of encoding on categorical features:
Ordinal Encoding
Mapped categories to integers based on frequency and importance. Naturally suited for tree-based models.
One-Hot Encoding
Applied selectively to columns with fewer unique values (sector, agePossession, furnishing_type). Although it increased dimensionality, it allowed testing model performance in a sparse representation scenario.
Target Encoding
Encoded categories based on their relationship with the target variable, introducing a supervised context to the transformation.
Each encoding method was integrated into a full preprocessing + regression pipeline and evaluated across multiple models (Linear Regression, Random Forest, Extra Trees, XGBoost, and others).
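A sketch of the comparison harness (I use `category_encoders.TargetEncoder` here as one common choice; note that a plain `OrdinalEncoder` orders categories alphabetically, so the frequency-based ordering described above would need explicit category lists):

```python
from category_encoders import TargetEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# X, y: features and target from the post-selection dataset
cat_cols = ["sector", "agePossession", "furnishing_type"]
encoders = {
    "ordinal": OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    "onehot": OneHotEncoder(handle_unknown="ignore"),
    "target": TargetEncoder(),
}

for name, enc in encoders.items():
    pipe = Pipeline([
        ("prep", ColumnTransformer([("cat", enc, cat_cols)], remainder="passthrough")),
        ("model", RandomForestRegressor(random_state=42)),
    ])
    r2 = cross_val_score(pipe, X, y, cv=10, scoring="r2").mean()
    print(f"{name}: R2 = {r2:.3f}")
```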
After comparing models, Random Forest Regressor was selected for exhaustive tuning due to its strong baseline performance and general robustness.
Hyperparameters tuned (illustrative tuning code follows the summary table below):

- `n_estimators`
- `max_depth`
- `max_samples`
- `max_features`

Artifacts saved:

- `pipeline.pkl`: Contains the final trained pipeline including preprocessing + Random Forest model
- `df.pkl`: Reference DataFrame to maintain column structure and categories, useful for inference and UI dropdown generation

| Aspect | Value/Model |
|---|---|
| Best Encoding | Target Encoding |
| Best Models | Extra Trees / RF / XGBoost |
| Final Rยฒ Score | ~0.90 |
| Final MAE | ~0.45 |
| Total Model Trials | 1280 (via GridSearchCV) |
| CV Strategy | 10-fold |
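A sketch of the tuning setup, reusing a pipeline whose Random Forest step is named "model" (the grid values are illustrative; only the four tuned hyperparameter names come from this write-up):

```python
import pickle
from sklearn.model_selection import GridSearchCV

param_grid = {
    "model__n_estimators": [100, 200, 300, 500],   # illustrative values
    "model__max_depth": [None, 10, 20, 30],
    "model__max_samples": [0.5, 0.75, 1.0],
    "model__max_features": ["sqrt", "log2", 0.5, 1.0],
}

search = GridSearchCV(pipe, param_grid, cv=10, scoring="r2", n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)

# Persist the winning pipeline for the web app
with open("pipeline.pkl", "wb") as f:
    pickle.dump(search.best_estimator_, f)
```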
This stage reflects a significant effort in model tuning and experimentation, ensuring that the final predictive system is robust, reliable, and production-ready.
After a long and enriching 8-stage journey, I've successfully built a robust and accurate property price predictor for Gurugram apartments! This model combines the best of feature engineering, model tuning, and smart encoding strategies to provide realistic price estimations.
➡️ Try it out now on the deployed website:
🔗 https://houseing.onrender.com/
💡 Enter basic apartment features and get predicted prices instantly!
Let's move into something even more powerful and interactive:
Yes, I built not one, but three different recommendation engines, all combined to suggest similar apartments with your preferences in control.
I used 247 Gurugram apartment listings, each packed with rich features like:
```json
{
  "AIPL Business Club Sector 62": "2.7 Km",
  "Indira Gandhi International Airport": "21.1 Km",
  "Golf Course Ext Rd": "99 Meter",
  ...
}
```
["Swimming Pool", "Salon", "Restaurant", "Spa", "Cafeteria", "24x7 Security", "Club House", "Gated Community"]
From the complete dataset (PropertyName, PropertySubName, NearbyLocations, LocationAdvantages, Link, PriceDetails, TopFacilities), I selected 3 core features to power recommendations: LocationAdvantages, PriceDetails, and TopFacilities.
To generate accurate suggestions, I designed three separate recommendation models, each focusing on one feature of similarity:
📦 These are visualized as a flow of 3 boxes:
Each model gives its own similarity score, and together they create a hybrid recommendation score.
```
+---------------------+   +---------------------+   +---------------------+
| Location Advantage  |   |    Price Details    |   |   Top Facilities    |
+---------------------+   +---------------------+   +---------------------+
           \                        |                        /
            \                       |                       /
             +----------------------+----------------------+
                                    |
                                    v
                 +--------------------------------------+
                 |     Final Recommendation System      |
                 +--------------------------------------+
```
The full recommendation workflow computes a similarity score from each of the three models and blends them into a single weighted hybrid score (see the formula below).
The modular setup allows customization of recommendation priorities by adjusting the weight given to each similarity matrix.
This flexibility puts user preference in control and enhances recommendation quality.
To clean and use the unstructured location data, distance strings were normalized into comparable numeric values; a sketch of this parsing is shown below.
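A sketch of that normalization, assuming distance strings like "2.7 Km" and "99 Meter" as in the JSON sample above:

```python
import re

def to_meters(text):
    """Convert '2.7 Km' / '99 Meter' style strings to a distance in meters."""
    match = re.search(r"([\d.]+)\s*(km|meter)", text, re.IGNORECASE)
    if not match:
        return None
    value, unit = float(match.group(1)), match.group(2).lower()
    return value * 1000 if unit == "km" else value

print(to_meters("2.7 Km"))    # 2700.0
print(to_meters("99 Meter"))  # 99.0
```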
The backend is powered by:
- `cosine_sim_location.pkl`
- `cosine_sim_price.pkl`
- `cosine_sim_facilities.pkl`
- `location_distance.pkl` (used to filter apartments by radius)

📊 The current weighted formula:
```python
cosine_sim_matrix = 0.5 * cosine_sim1 + 0.8 * cosine_sim2 + 1 * cosine_sim3
```
✅ In the future, this formula can be easily adjusted to prioritize any feature, giving users full control over recommendation behavior.
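A sketch of how the blended matrix can drive recommendations (the function and its assumptions, such as a default RangeIndex aligned row-for-row with the matrix, are mine):

```python
import pandas as pd

def recommend(property_name, df, cosine_sim_matrix, top_n=5):
    """Return the top-N apartments most similar to the given one."""
    idx = df.index[df["PropertyName"] == property_name][0]
    scores = pd.Series(cosine_sim_matrix[idx]).sort_values(ascending=False)
    top = scores.iloc[1 : top_n + 1].index  # position 0 is the property itself
    return df.loc[top, "PropertyName"].tolist()
```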
You can now browse similar-apartment recommendations alongside instant price predictions.
🔗 See it live in action:

💡 Try the Recommendation + Prediction Tool
This is not just a tool; it's your smartest assistant for exploring homes in Gurugram! 🏡🏙️
One of the most insightful and interactive modules of this project is Data Analytics & Visualization, which brings the Gurugram real estate dataset to life with clean, meaningful, and crazy-good plots! These visualizations uncover deep insights about locality trends, pricing patterns, and buyer preferences, all made easily understandable through visuals.
Below is a breakdown of the key visualizations and what they reveal:
This plot visualizes the geographical distribution of property listings across Gurugram using Folium, an interactive mapping library in Python.
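A minimal Folium sketch of such a map (the latitude/longitude column names are assumptions):

```python
import folium

m = folium.Map(location=[28.4595, 77.0266], zoom_start=11)  # centered on Gurugram

for _, row in df.iterrows():
    folium.CircleMarker(
        location=[row["latitude"], row["longitude"]],  # assumed column names
        radius=3,
        popup=row["property_name"],
    ).add_to(m)

m.save("gurgaon_listings.html")
```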
This scatter plot illustrates the relationship between built-up area (in sq ft) and property price (in ₹).
This pie chart presents the distribution of properties based on BHK configuration.
This box plot compares price ranges across different BHK categories.
This plot distinguishes between Flats and Independent Houses.
A word cloud showing the most frequently mentioned property features.
Built from the `property_features` column.

This heatmap illustrates price per sq ft intensity across Gurugram sectors.
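One way to build a sector-level heatmap with Seaborn (the sector-by-BHK pivot is an illustrative choice, not necessarily the exact layout used):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Mean price per sq ft for each sector / BHK combination
pivot = df.pivot_table(values="price_per_sqft", index="sector",
                       columns="bedRoom", aggfunc="mean")

plt.figure(figsize=(8, 14))
sns.heatmap(pivot, cmap="YlOrRd")
plt.title("Price per sq ft across Gurugram sectors")
plt.tight_layout()
plt.show()
```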
This Sankey diagram shows the flow of property types into different price bands.
This box plot compares price per sq ft across different sectors.
A bar chart highlighting the average price of listings across sectors.
This bar chart shows the importance of different features in the modelโs price prediction.
This bar chart shows the impact of furnishing type on average property price.
Based on the `furnishing` column.

These visualizations offer clear, interactive insight into locality trends, pricing patterns, and buyer preferences.
Whether you're a homebuyer, an investor, or a real estate analyst, this module is a powerful lens into Gurugram's real estate dynamics.