
What is Feature Engineering?

  • Writer: Ishan Deshpande
  • May 20
  • 4 min read

Imagine trying to build a house using broken bricks, missing measurements, and poor-quality materials.

Even with the best architect and workers, the final structure may still be weak.


Machine Learning works in a very similar way.


No matter how powerful your ML algorithm is, if the data is messy, incomplete, or poorly prepared, the model will struggle to learn properly.


This process of preparing and improving data before training a model is called Feature Engineering.

And in real-world ML projects, this step is often more important than the algorithm itself.


What is a Feature?


A feature is simply an input variable or column used by the ML model to make predictions.


Example

Suppose we want to predict house prices.

| Area       | Bedrooms | Location | Price |
|------------|----------|----------|-------|
| 1200 sq ft | 2        | Pune     | ₹50L  |

Here:

  • Area

  • Bedrooms

  • Location

are features.

And:

  • Price

is the target/output.


So, What is Feature Engineering?


Real-world data is rarely clean and perfect.

Sometimes:

  • Values are missing

  • Text cannot be understood by machines

  • Numbers are on completely different scales

  • Some columns are useless


Feature Engineering is the process of converting this messy raw data into meaningful and machine-friendly data.


Let’s discuss some important steps that are part of Feature Engineering.


1. Handling Missing Values


Let’s say you collected customer data for an ML model.

| Name  | Age  | Salary |
|-------|------|--------|
| Rahul | 25   | ₹50K   |
| Priya | NULL | ₹70K   |

Now imagine asking the ML model: “Learn patterns from this.”


The model runs into trouble right away because Priya’s age is missing.

Most ML algorithms cannot work around blank values the way humans do. Many of them simply fail or give unreliable results when a value is missing.


How do we fix it?


Option 1 — Remove the missing rows

Works well if only a few records are incomplete. For example, only 2 out of 10,000 rows are missing.


Option 2 — Fill the missing values

If a significant number of values are missing, removing rows could destroy useful information.

Instead, we can replace them using:

  • Average (Mean)

  • Median

  • Most common value (Mode)

 

But when do we use each one?

  • Mean (Average) → Used when data is fairly balanced without extreme outliers.

    Example: Average exam marks.


  • Median → Preferred when data contains very high or very low values because it is less affected by outliers.

    Example: Salary data where a few people earn extremely high salaries.


  • Mode → Used mostly for categorical/text data where we replace missing values with the most frequent category.

    Example: Replacing missing “City” values with the most commonly occurring city.
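
To make the three options concrete, here is a minimal sketch in plain Python. The `fill_missing` helper and the sample columns are made up for illustration; in a real project you would more likely use a library such as pandas or scikit-learn for this step.

```python
from statistics import mean, median, mode

def fill_missing(values, strategy):
    """Return a copy of `values` with None replaced by the mean, median, or mode."""
    known = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(known)
    elif strategy == "median":
        fill = median(known)
    elif strategy == "mode":
        fill = mode(known)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

# Made-up sample columns, for illustration only.
marks = [70, 80, None, 90]                             # fairly balanced -> mean
salaries = [30_000, 40_000, None, 50_000, 5_000_000]   # one extreme value -> median
cities = ["Pune", "Mumbai", None, "Pune"]              # categories -> mode

filled_marks = fill_missing(marks, "mean")          # missing value becomes 80
filled_salaries = fill_missing(salaries, "median")  # missing value becomes 45000.0
filled_cities = fill_missing(cities, "mode")        # missing value becomes "Pune"
```

Notice that the mean of the known salaries would be over ₹12 Lakhs because of the single ₹50 Lakh entry, while the median stays at ₹45K. That is why the median is the safer choice for this kind of column.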



2. Encoding


Humans easily understand this:

City

Pune

Mumbai

Delhi

But most ML models cannot work with raw text. They operate on numbers.


So we convert categories into numerical form. This process is called Encoding.


Simple Encoding Example


Label Encoding

Pune = 1, Mumbai = 2, Delhi = 3


One-Hot Encoding

| City   | Pune | Mumbai | Delhi |
|--------|------|--------|-------|
| Pune   | 1    | 0      | 0     |
| Mumbai | 0    | 1      | 0     |
| Delhi  | 0    | 0      | 1     |

But here’s the interesting part...

If we use:

Pune = 1, Mumbai = 2, Delhi = 3


Some ML models may incorrectly assume: Delhi > Mumbai > Pune

even though cities have no ranking.


That’s why One-Hot Encoding is often preferred.
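
Both encodings can be sketched in a few lines of plain Python. The variable names below are illustrative, and libraries such as pandas and scikit-learn offer ready-made encoders, so treat this as a way to see the idea rather than production code.

```python
# Made-up sample column, for illustration only.
cities = ["Pune", "Mumbai", "Delhi", "Pune"]

# Distinct categories in order of first appearance: ["Pune", "Mumbai", "Delhi"]
categories = list(dict.fromkeys(cities))

# Label encoding: each category gets an integer code (Pune = 1, Mumbai = 2, Delhi = 3).
label_map = {city: code for code, city in enumerate(categories, start=1)}
label_encoded = [label_map[c] for c in cities]   # [1, 2, 3, 1]

# One-hot encoding: one 0/1 column per category, so no ranking is implied.
one_hot = [[1 if c == cat else 0 for cat in categories] for c in cities]
# [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```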


When encoding becomes tricky


Imagine one-hot encoding 2,500 cities. Every city becomes its own column, and suddenly your dataset becomes huge.

This is where smarter encoding techniques are used.
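
The sketch below shows the size problem and one possible alternative. Frequency encoding is my own pick for illustration (it is only one of several options), and the city names are generated placeholders.

```python
from collections import Counter

# Made-up column: 2,500 distinct placeholder cities, with "City_0" repeated 500 extra times.
cities = [f"City_{i}" for i in range(2500)] + ["City_0"] * 500

# One-hot encoding needs one column per distinct city.
one_hot_columns = len(set(cities))   # 2500 columns

# Frequency encoding (one possible alternative): replace each city with
# the share of rows it appears in, which keeps everything in a single column.
counts = Counter(cities)
freq_encoded = [counts[c] / len(cities) for c in cities]
```

The trade-off is that two cities with the same frequency become indistinguishable, so this is a sketch of the idea rather than a recommendation.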



3. Feature Scaling


Imagine a race between:

  • A person carrying 5kg

  • A person carrying 500kg


Not exactly fair, right? Something similar happens in ML.


Consider this dataset:

| Age | Salary |
|-----|--------|
| 25  | 500000 |

Salary values are much larger than Age values.

Some algorithms may give Salary more importance simply because the numbers are bigger.

Not because it’s actually more important.

 

Feature Scaling brings values into a similar range so models can learn more fairly.
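
Two common ways to do this are min-max scaling (squeeze values into the 0 to 1 range) and standardization (shift to mean 0 and standard deviation 1). Here is a minimal sketch of both in plain Python, with made-up numbers and hypothetical helper names:

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values to the range [0, 1]. Assumes the values are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1. Assumes nonzero spread."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Made-up sample columns, for illustration only.
ages = [25, 35, 45]
salaries = [500_000, 700_000, 900_000]

scaled_ages = min_max_scale(ages)           # [0.0, 0.5, 1.0]
scaled_salaries = min_max_scale(salaries)   # [0.0, 0.5, 1.0]
standardized_ages = standardize(ages)       # roughly [-1.22, 0.0, 1.22]
```

After scaling, Age and Salary sit in the same range, so neither one dominates just because its raw numbers are bigger.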



Important thing beginners should know


Not every algorithm needs scaling.

Tree-based algorithms like:

  • Decision Trees

  • Random Forest

  • XGBoost

usually work perfectly fine without it.


But algorithms like:

  • KNN

  • SVM

  • Neural Networks

often improve significantly after scaling.



4. Feature Creation


Sometimes raw data hides useful insights. A smart ML engineer creates new features from existing ones, which helps the ML algorithm learn better.
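
As a small illustration, using the house example from earlier, we could derive an "area per bedroom" column from two existing ones. The column names and numbers here are hypothetical.

```python
# Made-up house records, for illustration only.
houses = [
    {"area_sqft": 1200, "bedrooms": 2},
    {"area_sqft": 2100, "bedrooms": 3},
]

# New feature built from two existing ones: how spacious each bedroom's share is.
for house in houses:
    house["area_per_bedroom"] = house["area_sqft"] / house["bedrooms"]

# houses[0]["area_per_bedroom"] is 600.0, houses[1]["area_per_bedroom"] is 700.0
```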



Why this matters

Better features often improve accuracy more than switching algorithms does.



5. Log Transformation


Real-world data is often skewed, with a few values far larger than the rest.

For example:

  • Most salaries may be between ₹20K–₹80K

  • But a few people may earn ₹50 Lakhs+


These very large values can dominate the learning process.

Log Transformation reduces the impact of extremely large values by compressing the gap between small and large numbers.
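
Here is a quick sketch of that compression using base-10 logarithms. The salary figures are made up, and the choice of base 10 is only for readability (natural log is also common).

```python
import math

# Made-up salaries in rupees: two typical values and one extreme value (₹50 Lakhs).
salaries = [20_000, 80_000, 5_000_000]

log_salaries = [math.log10(s) for s in salaries]   # roughly [4.30, 4.90, 6.70]

# Before: the largest salary is 250 times the smallest.
raw_ratio = max(salaries) / min(salaries)

# After: the largest value is only about 2.4 units above the smallest.
log_gap = max(log_salaries) - min(log_salaries)
```

Note that the logarithm is only defined for positive numbers, so columns containing zeros or negative values need extra handling.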




6. Feature Selection


Not every column helps the model. Some features add noise, increase confusion, and reduce performance.


Example

For predicting house prices:

Useful:

  • Area

  • Location

  • Number of bedrooms

Probably useless:

  • Owner’s age


Removing irrelevant features helps models focus on what truly matters.
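
One simple heuristic (among many) is to keep only the features that are noticeably correlated with the target. The sketch below uses made-up numbers, a hand-written `pearson` helper, and an arbitrary threshold of 0.3, all of which are assumptions for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up data, for illustration only. Price is in lakhs of rupees.
features = {
    "area_sqft": [1000, 1200, 1500, 1800, 2000],
    "owner_age": [45, 30, 60, 30, 45],
}
price = [40, 50, 62, 75, 82]

THRESHOLD = 0.3  # arbitrary cut-off chosen for this sketch

# Keep a feature only if its correlation with price is strong enough.
selected = [
    name for name, values in features.items()
    if abs(pearson(values, price)) >= THRESHOLD
]
# selected is ["area_sqft"]
```

Correlation only captures straight-line relationships, so a feature with a low score is not always useless. Treat this as a first filter, not a final verdict.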



Final Thoughts


Feature Engineering is where Machine Learning truly begins.


Before a model can learn patterns, data needs to be:

  • Cleaned

  • Organized

  • Balanced

  • Simplified

  • Transformed into something meaningful

And this is exactly what Feature Engineering does. The better the features, the smarter the model becomes.


In the upcoming blogs, we’ll go through a few more important concepts and then jump right into the algorithms.


Until next time — stay curious, keep learning!
