# Feature Engineering That Actually Improves Models
Better features beat better algorithms. These techniques consistently improve model performance across domains.
## Key Insights
- Target encoding outperforms one-hot for high-cardinality categoricals
- Time-based features (day of week, hour, recency) add predictive power to temporal data
- Feature interactions capture relationships that linear models miss
## Target Encoding

Out-of-fold target encoding replaces each category with the target mean computed on the *other* folds, so a row's own label never leaks into its feature:

```python
import pandas as pd
from sklearn.model_selection import KFold

def target_encode(df, col, target, n_splits=5):
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    encoded = pd.Series(index=df.index, dtype=float)
    for train_idx, val_idx in kf.split(df):
        # category means from the training folds only
        means = df.iloc[train_idx].groupby(col)[target].mean()
        # .to_numpy() sidesteps index alignment on positional assignment
        encoded.iloc[val_idx] = df.iloc[val_idx][col].map(means).to_numpy()
    # categories unseen in a training fold fall back to the global mean
    return encoded.fillna(df[target].mean())
```
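A minimal self-contained sketch of how the encoder might be used, with a synthetic dataframe (the `city`/`clicked` column names are illustrative, not from any real dataset):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode(df, col, target, n_splits=5):
    # same out-of-fold scheme as above, repeated here so the sketch runs standalone
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    encoded = pd.Series(index=df.index, dtype=float)
    for train_idx, val_idx in kf.split(df):
        means = df.iloc[train_idx].groupby(col)[target].mean()
        encoded.iloc[val_idx] = df.iloc[val_idx][col].map(means).to_numpy()
    return encoded.fillna(df[target].mean())

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "city": rng.choice(["NYC", "LA", "SF"], size=100),
    "clicked": rng.integers(0, 2, size=100),  # binary target
})
# each row gets the click rate of its city, estimated on the other folds
df["city_te"] = target_encode(df, "city", "clicked")
```

Because the target is binary, every encoded value is a fold-level click rate and stays within [0, 1].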
## Time Features

```python
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek  # Monday=0 ... Sunday=6
# per-user recency: days since the user's previous event;
# assumes rows are sorted by timestamp within each user
df["days_since_last"] = df.groupby("user_id")["timestamp"].diff().dt.days
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)
```
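A runnable sketch of the same features on a tiny hand-built frame (dates and user IDs are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "timestamp": pd.to_datetime([
        "2024-01-05 09:00",  # Friday
        "2024-01-07 18:00",  # Sunday
        "2024-01-06 12:00",  # Saturday
    ]),
})
# sort so that diff() within each user compares consecutive events
df = df.sort_values(["user_id", "timestamp"])
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)
# diff() yields a Timedelta per user; .dt.days truncates to whole days
df["days_since_last"] = df.groupby("user_id")["timestamp"].diff().dt.days
```

The first event for each user has no predecessor, so its `days_since_last` is NaN; downstream models usually need that filled or flagged.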
## Interaction Features

```python
# Ratio features often outperform raw values
df["price_per_sqft"] = df["price"] / df["sqft"]
df["income_to_debt"] = df["income"] / (df["debt"] + 1)  # +1 keeps the ratio finite at zero debt
```
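A self-contained sketch with invented housing-style numbers, showing the zero-debt case the `+1` guards against:

```python
import pandas as pd

df = pd.DataFrame({
    "price":  [300_000, 450_000],
    "sqft":   [1_500, 1_800],
    "income": [80_000, 120_000],
    "debt":   [20_000, 0],  # second borrower has no debt
})
df["price_per_sqft"] = df["price"] / df["sqft"]
# without the +1, the zero-debt row would divide by zero and produce inf
df["income_to_debt"] = df["income"] / (df["debt"] + 1)
```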