FIX: Fix KNN predictions on X=None
#2821
Conversation
/intelci: run

/azp run Nightly

Azure Pipelines successfully started running 1 pipeline(s).
sklearnex/neighbors/common.py (outdated):

```python
    data[0] is None
    and method_name in ["predict", "predict_proba", "score"]
),
"Predictions on 'None' data are handled by internal sklearn methods.",
```
Shouldn't this instead look for the _fit_X attribute if X is None? I'm not entirely sure I follow the solution of this PR (I do see the problem however).
No, that wouldn't give the same results. The sklearn docs mention:

> In this case, the query point is not considered its own neighbor

https://github.com/scikit-learn/scikit-learn/blob/4a10d0ed8d85e6ed24a647bd28a65c0c64b101ef/sklearn/neighbors/_base.py#L767C13-L767C77
And one can see that forwarding the same fit data doesn't generate the same results:
```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(seed=123)
X = rng.standard_normal(size=(5, 3))
y = rng.integers(2, size=X.shape[0])
knn = KNeighborsClassifier(n_neighbors=2).fit(X, y)

knn.predict_proba(None)
# array([[0.5, 0.5],
#        [1. , 0. ],
#        [0. , 1. ],
#        [0.5, 0.5],
#        [0.5, 0.5]])

knn.predict_proba(X)
# array([[1. , 0. ],
#        [0.5, 0.5],
#        [0.5, 0.5],
#        [0.5, 0.5],
#        [0. , 1. ]])
```
But this is why we have the following logic: https://github.com/uxlfoundation/scikit-learn-intelex/blob/main/onedal/neighbors/neighbors.py#L297-L306. Fully blocking it seems a bit heavy-handed, since it is a loss of functionality. At a minimum, the explanation text for debugging needs to be made clearer, and this case should be discussed with someone with KNN knowledge on the oneDAL side. @ethanglaser?
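For reference, the general shape of this kind of handling can be sketched as follows. This is an illustration of how sklearn emulates `X=None` internally (query `n_neighbors + 1` on the training data, then drop each point's self-match), not the actual onedal/sklearnex code:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Illustrative data; no exact duplicate points, so each row's self-match
# is unambiguous (duplicates need extra care, see the discussion below).
X = np.array([[0.0], [1.0], [3.0], [10.0]])
n_neighbors = 2
nn = NearestNeighbors(n_neighbors=n_neighbors).fit(X)

# Query one extra neighbor on the training data itself...
ind_plus = nn.kneighbors(X, n_neighbors=n_neighbors + 1, return_distance=False)

# ...then drop each point's own index from its neighbor list.
self_idx = np.arange(X.shape[0])[:, None]
mask = ind_plus != self_idx
ind = np.vstack([row[m][:n_neighbors] for row, m in zip(ind_plus, mask)])

# Matches sklearn's built-in handling of X=None:
assert np.array_equal(ind, nn.kneighbors(None, return_distance=False))
```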
If I'm understanding the logic correctly, that would still produce incorrect results in cases where there are observations with the same 'X' features but different 'y'.
Actually no, it turns out that sklearn suffers from the same problem.
Either way, even if this ends up offloading the operation to sklearn, the code from onedal.neighbors is still used: since the class overrides the kneighbors method, sklearn ends up calling the onedal implementation internally instead of its own.
I've now modified it to use the sklearnex method for predict_proba, so that the log entry reflects what actually happens. However, there appear to be no implementations of predict and score for X=None on either the onedal or the sklearnex side, so we'll still need to fall back to sklearn for those, unless they are reimplemented on the Python side by calling kneighbors internally, like sklearn does for its own. Note that the fallback still ends up calling oneDAL, since this class overrides the methods of the sklearn class that are called internally.
/intelci: run
Description

Sklearn allows passing `X=None` to all prediction-related methods of KNN estimators, but sklearnex treats that as all-NaN data, which is not the same. This PR fixes it.

Checklist:
- Completeness and readability
- Testing