Make task the default concept to work with #352
Conversation
Codecov Report
@@ Coverage Diff @@
## develop #352 +/- ##
==========================================
- Coverage 89.69% 89.59% -0.1%
==========================================
Files 32 32
Lines 2522 2788 +266
==========================================
+ Hits 2262 2498 +236
- Misses 260 290 +30
Continue to review full report at Codecov.
amueller
left a comment
looks good, but it's not clear to me why we have to do any casting at all. Also, shouldn't the ARFF contain the info on what the data type is? (The numpy recarray actually still contained it.)
openml/datasets/dataset.py (outdated)

    else:
        if isinstance(target, six.string_types):
            target = [target]
    legal_target_types = (int, float)
Is float float64? Why not float32? And why do we require casting at all?
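On the "is float float64?" question: in CPython the built-in `float` is an IEEE 754 double, which is what numpy calls float64. A quick stdlib check (illustrative, independent of the openml code under review):

```python
import sys

# CPython's float is a C double: 53 mantissa bits, max binary exponent 1024.
print(sys.float_info.mant_dig)   # 53
print(sys.float_info.max_exp)    # 1024

# So a check like `legal_target_types = (int, float)` accepts 64-bit doubles;
# any 32-bit value would already have been widened to a double by the time
# it exists as a Python float.
x = float("0.1")
print(isinstance(x, float))      # True
```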
doc/usage.rst (outdated)

    >>> print(datasets[0].name)
    mfeat-factors

    OpenML contains several key concepts which it needs to make machine learning
    research shareable. A machine learning experiment consists of several runs,
I would say an ML experiment could also be a single run.
doc/usage.rst (outdated)

    OpenML contains several key concepts which it needs to make machine learning
    research shareable. A machine learning experiment consists of several runs,
    which describe the performance of an algorithm (called a flow in OpenML) on a
    task. Task is the combination of a dataset, a split and an evaluation metric. In
"A task"
doc/usage.rst (outdated)

    which describe the performance of an algorithm (called a flow in OpenML) on a
    task. Task is the combination of a dataset, a split and an evaluation metric. In
    this user guide we will go through listing and exploring existing tasks to
    actually running machine learning algorithms on them. In a further user guide
Maybe say "a run is flow + setup + task and produces metrics and predictions"? Right now you don't explain "run", right? Maybe make the key concepts bold.
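The relationships the reviewer sketches ("run is flow + setup + task and produces metrics and predictions") could be pictured as a minimal data model. All names here are hypothetical, chosen only to illustrate the concepts; they are not the actual openml-python classes:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Task:
    dataset: str              # which dataset to use
    split: str                # resampling strategy, e.g. "10-fold crossvalidation"
    metric: str               # target metric, e.g. "predictive_accuracy"

@dataclass
class Run:
    flow: str                 # the algorithm being evaluated
    setup: Dict[str, object]  # hyperparameter settings of that flow
    task: Task                # what the flow was run on
    predictions: List[object] = field(default_factory=list)

run = Run(
    flow="some_flow",
    setup={"max_depth": 3},
    task=Task("iris", "10-fold crossvalidation", "predictive_accuracy"),
)
print(run.task.metric)        # predictive_accuracy
```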
doc/usage.rst (outdated)

    Tasks are containers, defining how to split the dataset into a train and test
    set, whether to use several disjoint train and test splits (cross-validation)
    and whether this should be repeated several times. Also, the task defines a
    target metric for which a flow should be optimized. You can think of a task as
I would make the "You can" sentence the first sentence. More essential, I think, is that the task defines which dataset to use, which column (if any) is the target, and whether it is a classification, regression, clustering, etc. task.
    Just like datasets, tasks are identified by IDs and can be accessed in three
    different ways:
    Tasks are identified by IDs and can be accessed in two different ways:
Can we not filter by tags? Maybe I would say "you can explore tasks on the website or via list_tasks; you can get a single task with get_task", because these two methods do semantically very different things.
doc/usage.rst (outdated)

    'NumberOfSymbolicFeatures', 'cost_matrix'],
    dtype='object')

    Now we can restrict the tasks to all tasks with the desired resampling strategy:
filtering by CV strategy seems a bit unnatural to me. Can we do it by dataset?
doc/usage.rst (outdated)

    .. code:: python

        >>> tasks = openml.tasks.list_tasks(tag='study_1')
        >>> filtered_tasks = filtered_tasks.query('NumberOfInstances > 500 and NumberOfInstances < 1000')
Or just move this up? This seems more natural than the CV type to me. Or motivate the CV type?
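The kind of filtering being discussed (by dataset or by instance count, rather than by CV strategy) can be sketched without touching the server. The task metadata below is made up, but it mirrors the shape of a listing result: a mapping from task id to a dict of properties:

```python
# Hypothetical task metadata keyed by task id (values are invented).
tasks = {
    1: {"did": 61, "name": "iris", "NumberOfInstances": 150},
    2: {"did": 2,  "name": "anneal", "NumberOfInstances": 898},
    3: {"did": 24, "name": "mushroom", "NumberOfInstances": 8124},
}

# Filter by dataset id (the reviewer's suggestion) ...
on_dataset_2 = {tid: t for tid, t in tasks.items() if t["did"] == 2}

# ... or by instance count, mirroring the usage guide's query string
# 'NumberOfInstances > 500 and NumberOfInstances < 1000'.
medium = {tid: t for tid, t in tasks.items()
          if 500 < t["NumberOfInstances"] < 1000}

print(sorted(medium))        # [2]
print(sorted(on_dataset_2))  # [2]
```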
doc/usage.rst (outdated)

    the concepts of flows and runs.

    In order to upload and share results of running a machine learning algorithm
    on a task, we need to create an :class:`~openml.OpenMLRun`. A run object can
    be created by running a :class:`~openml.OpenMLFlow` or a scikit-learn model on
scikit-learn compatible?
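"scikit-learn compatible" usually means any object exposing the fit/predict surface, not necessarily a class from scikit-learn itself. A minimal duck-typed sketch (illustrative only, not a real sklearn estimator and not part of openml-python):

```python
from collections import Counter

class MajorityClassifier:
    """Bare-bones model with the fit/predict interface that code
    expecting a scikit-learn-compatible estimator typically relies on."""

    def fit(self, X, y):
        # Remember the most common label seen during training.
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        # Predict that label for every input row.
        return [self.majority_ for _ in X]

clf = MajorityClassifier().fit([[0], [1], [2]], ["a", "b", "a"])
print(clf.predict([[9], [9]]))   # ['a', 'a']
```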
amueller
left a comment
lgtm
This PR fixes: