@mfeurer
Collaborator

@mfeurer mfeurer commented Oct 12, 2017

This PR fixes:

@codecov-io

codecov-io commented Oct 12, 2017

Codecov Report

Merging #352 into develop will decrease coverage by 0.09%.
The diff coverage is 88.23%.

@@            Coverage Diff             @@
##           develop     #352     +/-   ##
==========================================
- Coverage    89.69%   89.59%   -0.1%     
==========================================
  Files           32       32             
  Lines         2522     2788    +266     
==========================================
+ Hits          2262     2498    +236     
- Misses         260      290     +30
Impacted Files                 Coverage Δ
openml/tasks/task.py           95.45% <100%> (-0.33%) ⬇️
openml/datasets/dataset.py     80.16% <86.66%> (+1.31%) ⬆️
openml/exceptions.py           100% <0%> (ø) ⬆️
openml/datasets/functions.py   91.01% <0%> (+0.96%) ⬆️
openml/tasks/functions.py      87.55% <0%> (+1.66%) ⬆️
openml/_api_calls.py           92.7% <0%> (+2.7%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e01ef40...c6f85b6.

@mfeurer mfeurer changed the title from "WIP: Make task the default concept to work with" to "Make task the default concept to work with" on Oct 12, 2017
@mfeurer mfeurer requested a review from amueller October 12, 2017 13:45
Contributor

@amueller amueller left a comment

Looks good, but it's not clear to me why we have to do any casting at all. Also, shouldn't the ARFF contain the info on what the data type is? (Actually, the numpy recarray still contained it.)

else:
    if isinstance(target, six.string_types):
        target = [target]
    legal_target_types = (int, float)
Contributor

Is float float64? Why not float32? And why do we need to cast?
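
On the `float` question, a minimal standard-library sketch (not code from this PR): in CPython the built-in `float` is a C double, which on common platforms is a 64-bit IEEE 754 value, so a check against `(int, float)` corresponds to double precision rather than `float32`. The helper `roundtrip_float32` is a hypothetical name introduced here only to show what a value looks like after passing through single precision.

```python
import struct
import sys

# Precision and storage size of Python's built-in float (a C double).
mantissa_bits = sys.float_info.mant_dig   # 53 for IEEE 754 double precision
double_size = struct.calcsize("d")        # bytes used by a C double


def roundtrip_float32(value: float) -> float:
    """Pack a Python float into single precision (4 bytes) and unpack it again.

    Hypothetical helper for illustration; it is not part of openml-python.
    """
    return struct.unpack("f", struct.pack("f", value))[0]


if __name__ == "__main__":
    print("mantissa bits:", mantissa_bits)
    print("bytes per float:", double_size)
    # 0.5 is exactly representable in both precisions; 0.1 is not.
    print("0.5 survives float32:", roundtrip_float32(0.5) == 0.5)
    print("0.1 survives float32:", roundtrip_float32(0.1) == 0.1)
```

Values that are exactly representable in single precision survive the round trip unchanged, while most decimal fractions do not, which is one reason a cast between the two widths is not always harmless.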

doc/usage.rst Outdated
>>> print(datasets[0].name)
mfeat-factors
OpenML contains several key concepts which it needs to make machine learning
research shareable. A machine learning experiment consists of several runs,
Contributor

I would say an ML experiment could also be a single run.

doc/usage.rst Outdated
OpenML contains several key concepts which it needs to make machine learning
research shareable. A machine learning experiment consists of several runs,
which describe the performance of an algorithm (called a flow in OpenML) on a
task. Task is the combination of a dataset, a split and an evaluation metric. In
Contributor

"A task"

doc/usage.rst Outdated
which describe the performance of an algorithm (called a flow in OpenML) on a
task. Task is the combination of a dataset, a split and an evaluation metric. In
this user guide we will go through listing and exploring existing tasks to
actually running machine learning algorithms on them. In a further user guide
Contributor

Maybe say "a run is flow + setup + task and produces a metric and predictions"? Right now you don't explain "run", right? Maybe make the key concepts bold.
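
The relationship suggested above could be sketched as plain data classes. The names `TaskSketch` and `RunSketch`, their fields, and every example value except the dataset name are illustrative assumptions; they are not the classes openml-python provides.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskSketch:
    """Hypothetical stand-in for a task: dataset + split + evaluation metric."""
    dataset: str
    split: str
    evaluation_metric: str


@dataclass
class RunSketch:
    """Hypothetical stand-in for a run: flow + setup + task, yielding results."""
    flow: str                  # the algorithm, called a flow in OpenML
    setup: Dict[str, object]   # hyperparameter settings used for this run
    task: TaskSketch           # what the flow was evaluated on
    metric_value: float        # produced by the run
    predictions: List[int]     # produced by the run


# All values below are made up for illustration.
task = TaskSketch(
    dataset="mfeat-factors",
    split="10-fold cross-validation",
    evaluation_metric="predictive_accuracy",
)
run = RunSketch(
    flow="sklearn.tree.DecisionTreeClassifier",
    setup={"max_depth": 3},
    task=task,
    metric_value=0.9,
    predictions=[0, 1, 1],
)

if __name__ == "__main__":
    print(run.flow, "on", run.task.dataset, "->", run.metric_value)
```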

doc/usage.rst Outdated
Tasks are containers, defining how to split the dataset into a train and test
set, whether to use several disjoint train and test splits (cross-validation)
and whether this should be repeated several times. Also, the task defines a
target metric for which a flow should be optimized. You can think of a task as
Contributor

I would make the "You can" sentence the first sentence. I think it is more essential that the task defines which dataset to use, which column (if any) is the target, and whether it's a classification, regression, clustering, etc. task.


- Just like datasets, tasks are identified by IDs and can be accessed in three
- different ways:
+ Tasks are identified by IDs and can be accessed in two different ways:
Contributor

Can we not filter by tags? Maybe I would say "you can explore tasks on the website or via list_tasks. You can get a single task with get_task", because these two methods do semantically very different things.

doc/usage.rst Outdated
'NumberOfSymbolicFeatures', 'cost_matrix'],
dtype='object')
Now we can restrict the tasks to all tasks with the desired resampling strategy:
Contributor

Filtering by CV strategy seems a bit unnatural to me. Can we do it by dataset?

doc/usage.rst Outdated
.. code:: python

    >>> tasks = openml.tasks.list_tasks(tag='study_1')
    >>> filtered_tasks = filtered_tasks.query('NumberOfInstances > 500 and NumberOfInstances < 1000')
Contributor

Or just move this up; this seems more natural than the CV type to me? Or motivate the CV type?
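
To make the two filtering styles under discussion concrete, here is a minimal pure-Python sketch. It assumes the task listing is available as a mapping from task ID to a property dict; the listing contents, the `name` key, and both helper functions are hypothetical and are not part of openml-python. Only the `NumberOfInstances` property and the 500 to 1000 range come from the snippet under review.

```python
# Hypothetical task listing: task ID -> properties (made-up values).
tasks = {
    1: {"name": "dataset-a", "NumberOfInstances": 300},
    2: {"name": "dataset-b", "NumberOfInstances": 750},
    3: {"name": "dataset-b", "NumberOfInstances": 750},
    4: {"name": "dataset-c", "NumberOfInstances": 2000},
}


def filter_by_size(listing, low, high):
    """Keep tasks whose dataset has strictly between low and high instances."""
    return {
        task_id: props
        for task_id, props in listing.items()
        if low < props["NumberOfInstances"] < high
    }


def filter_by_dataset(listing, dataset_name):
    """Keep tasks defined on the dataset with the given name."""
    return {
        task_id: props
        for task_id, props in listing.items()
        if props["name"] == dataset_name
    }


if __name__ == "__main__":
    print(sorted(filter_by_size(tasks, 500, 1000)))
    print(sorted(filter_by_dataset(tasks, "dataset-c")))
```

Both filters depend only on properties of the underlying dataset, which may be why they read as more natural starting points than the resampling strategy.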

doc/usage.rst Outdated
the concepts of flows and runs.
In order to upload and share results of running a machine learning algorithm
on a task, we need to create an :class:`~openml.OpenMLRun`. A run object can
be created by running a :class:`~openml.OpenMLFlow` or a scikit-learn model on
Contributor

scikit-learn compatible?

Contributor

@amueller amueller left a comment

lgtm

@mfeurer mfeurer merged commit 1fff169 into develop Oct 13, 2017
@mfeurer mfeurer deleted the fix_#197 branch October 13, 2017 08:35