Description
Describe the issue linked to the documentation
The documentation of several supervised-learning models omits details about their tolerance-based early stopping. Two examples are LogisticRegression and RidgeClassifier:
LogisticRegression:
tol : float, default=1e-4
    Tolerance for stopping criteria.
RidgeClassifier:
tol : float, default=1e-3
    Precision of the solution.
My concern is that, without a detailed explanation of what tol means, many ML practitioners tend to optimize solver as a hyperparameter along with it. But different solvers may use different stopping conditions and hence have different optimal ranges for tol. For example, I assume that liblinear, if it uses coordinate descent, checks the duality gap, while saga compares the best loss value or a gradient/coefficient norm with the current one. If so, it is at least theoretically inconsistent to search over a grid of the form
{'solver': ['s_1', 's_2', ..., 's_N'],
 'tol': [t_1, t_2, ..., t_M]}
which may be as redundant as, say, optimizing the degree of the polynomial kernel in an SVC that uses the RBF kernel. More generally, the optimization methods behind the scenes remain black boxes, e.g. the (quasi-)Newton and conjugate-gradient methods.
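For concreteness, this is the kind of search I have in mind (a minimal sketch; the solvers and tol values are arbitrary and only illustrate that a single tol grid gets shared across solvers):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

param_grid = {
    "solver": ["liblinear", "lbfgs", "saga"],
    "tol": np.logspace(-6, -2, 5),
}

# The same tol value is passed to every solver, even though each solver
# may compare a different quantity (duality gap, gradient norm, ...)
# against it.
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)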
Suggest a potential alternative/fix
If it makes sense, I would love to contribute, but as far as I can tell, the solvers other than SGD aren't described mathematically in enough detail to be certain about their specific implementations. Ideally, we could shed some light on the mathematical details of the other optimization methods in the User Guide and explicitly state which quantity the (hyper)parameter tol corresponds to in a particular solver/model.
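For illustration, a docstring along these lines would already help; note that the per-solver criteria named here are my assumptions and would need to be verified against the actual implementations:

tol : float, default=1e-4
    Tolerance for the stopping criterion. The quantity compared against
    tol depends on the solver (for example, a duality gap for
    coordinate-descent solvers, or the maximum absolute component of the
    gradient for quasi-Newton solvers), so the same value is not directly
    comparable across solvers. See the User Guide for the exact criterion
    used by each solver.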