Why is DecisionTreeClassifier using the same feature and the same condition twice?

When trying to fit a scikit-learn DecisionTreeClassifier on my data, I am observing some weird behavior.

x[54] (a boolean feature) is used to break the 19 samples into 2 and 17 in the top-left node. Then the same feature, with the exact same condition, appears again in its True branch.

This time it again has True and False branches leading to leaf nodes.

I am using gini for deciding the split.

My question is: since we are in the True branch, how can the same boolean feature generate non-zero impurity at all? After all, the new set can only contain 0s for that feature, so there should not be any possibility of a split.

What am I missing?

[Image: plot of the fitted tree, in which x[54] <= 0.5 appears in the top-left node and again in its True branch]

You’re right to be puzzled by this — it looks like the decision tree is splitting twice on the same boolean feature x[54] <= 0.5, which seems redundant or even impossible if the feature is truly binary. But this behavior can be explained, and it’s actually expected in certain scenarios.

Let me clarify what’s likely going on.

What’s Actually Happening?

1. x[54] might not be strictly boolean in type

Even if you believe x[54] is boolean, scikit-learn converts the feature matrix to floating point (float32) when fitting a tree. A value of False becomes 0.0 and True becomes 1.0, so the condition x[54] <= 0.5 splits the data into:

  • x[54] = 0.0 → True branch
  • x[54] = 1.0 → False branch

That’s fine.
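To see this conversion in a concrete example, you can fit a tree on a boolean column and inspect what it learned. This is a minimal sketch with made-up data, not your dataset; the only point is that the boolean column is stored as 0.0/1.0 and the learned threshold comes out as 0.5:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: column 0 is boolean, column 1 is numeric (purely illustrative)
X_toy = np.array([[True, 3.2], [False, 1.1], [True, 0.4], [False, 2.8]], dtype=float)
y_toy = np.array([1, 0, 1, 0])

clf_toy = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X_toy, y_toy)

print(clf_toy.tree_.feature)    # feature index tested at each node (-2 means leaf)
print(clf_toy.tree_.threshold)  # thresholds; the boolean column gets ~0.5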

2. The same condition can appear again

Once you follow the path where x[54] <= 0.5 is True, you would expect to be working only with samples where x[54] == 0.0. But scikit-learn does not remove x[54] from consideration in further splits, so nothing stops it from evaluating x[54] again deeper in the tree. Keep in mind, too, that the thresholds shown in the plot are rounded for display, so two conditions that look identical may in fact use slightly different thresholds, and a "boolean" column can contain values other than exactly 0.0 and 1.0.
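If you want to check whether the two conditions in your tree are literally identical, read the fitted tree's internal arrays instead of the rounded numbers in the plot. The sketch below assumes your fitted estimator is called clf:

import numpy as np

# clf is assumed to be your fitted DecisionTreeClassifier
features = clf.tree_.feature      # feature index tested at each node (-2 for leaves)
thresholds = clf.tree_.threshold  # split threshold at each node (-2.0 for leaves)

for node_id, (f, t) in enumerate(zip(features, thresholds)):
    if f >= 0:  # internal (splitting) node
        print(f"node {node_id}: x[{f}] <= {float(t)!r}")  # full, unrounded value

If the two thresholds printed for x[54] differ beyond the displayed precision (3 decimal places by default), the plot was simply rounding them to the same string.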

3. So why does the second split work?

Because even though the condition looks the same, the data subset is different. The impurity (gini) in that subset can still be > 0, so the tree keeps looking for splits that reduce it (a tiny gini example follows the list below). One important detail: scikit-learn only keeps a split if each child receives at least one sample (min_samples_leaf defaults to 1). So if a second split on x[54] <= 0.5 really does separate the samples in that node, the subset must contain values on both sides of 0.5. In other words, the column is probably not strictly 0/1, or the two thresholds only look identical because the plot rounds them.

That’s the key — splitting on the same feature multiple times is not forbidden, and it often happens when:

  • The feature is numeric (even if boolean in concept).
  • The split threshold is re-evaluated based on subset-specific distributions.
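For intuition, here is a tiny gini calculation on a made-up class distribution; the counts are illustrative, not taken from your tree:

import numpy as np

counts = np.array([5, 12])      # hypothetical class counts inside a node
p = counts / counts.sum()
gini = 1.0 - np.sum(p ** 2)     # Gini impurity: 1 - sum_k p_k^2
print(gini)                     # about 0.415, i.e. the node is still impure

As long as a node's impurity is above zero and a candidate split actually separates its samples, the tree is allowed to keep splitting, whatever feature that split uses.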

Confirm This in Your Case

You can verify this by examining the unique values of x[54], either over the whole training set as below, or per node with the small decision_path helper shown further down:

import numpy as np

print(np.unique(X[:, 54]))  # Should be [0.0, 1.0] ideally

Or use:

import pandas as pd
print(pd.Series(X[:, 54]).value_counts())
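For the per-node check mentioned above, a small helper based on decision_path works. This is a sketch assuming clf is your fitted tree and X is the NumPy training matrix it was fitted on:

import numpy as np

# clf and X are assumed names for your fitted tree and training matrix
node_indicator = clf.decision_path(X)  # sparse matrix: sample i passes through node j

for node_id in range(clf.tree_.node_count):
    if clf.tree_.feature[node_id] == 54:  # only nodes that split on x[54]
        reaches_node = node_indicator[:, node_id].toarray().ravel().astype(bool)
        print(f"node {node_id}: unique x[54] values",
              np.unique(X[reaches_node, 54]))

If the second node that tests x[54] is reached only by samples with x[54] == 0.0, the split could not separate anything, which would point back at rounding in the plotted thresholds.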

Also, try checking what happens when you cast your boolean features explicitly:

X[:, 54] = X[:, 54].astype(bool).astype(int)
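Then refit and re-plot to see whether the duplicated condition is still there. This is a sketch assuming the same clf/X/y names; plot_tree is scikit-learn's built-in plotting helper:

import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Refit on the explicitly cast data (X, y are your arrays; names assumed)
clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

plot_tree(clf, filled=True)
plt.show()

If x[54] is now strictly 0/1, a second split on x[54] <= 0.5 inside the True branch cannot separate any samples, so it should no longer appear.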

TL;DR

  • Even though x[54] is boolean in concept, it’s treated as float by scikit-learn.
  • The same feature can appear multiple times in the tree at different depths.
  • Splits are evaluated numerically, and the plotted thresholds are rounded for display, so two conditions that look identical may differ slightly.
  • It is not re-splitting a set of identical boolean values; if the second split genuinely separates samples, the values of x[54] in that subset are not all exactly 0.0. Check the raw values and thresholds to confirm.