Doc/type def #1255
Conversation
openml/datasets/dataset.py

```python
    description: str,
    data_format: str = "arff",
    cache_format: str = "pickle",
    dataset_id: int = None,
```
This is inconsistent with other lines, in which we already have Optional[str]. I know that this syntax is legal, but I'm not sure whether I like it too much, and also, we should be consistent and adopt only one way of declaring optional arguments.
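The two spellings being compared can be sketched as follows (function and parameter names are illustrative, simplified from the diff): the implicit form annotates `int` but defaults to `None`, which mypy only accepts under the deprecated implicit-optional behavior, while the explicit `Optional[int]` form is the one recommended by PEP 484.

```python
from typing import Optional

# Implicit form: annotation says `int`, but the default is None.
# mypy only accepts this with implicit-optional behavior enabled.
def fetch_implicit(dataset_id: int = None):  # type: ignore[assignment]
    return dataset_id

# Explicit form: the Optional is spelled out, as PEP 484 recommends.
def fetch_explicit(dataset_id: Optional[int] = None) -> Optional[int]:
    return dataset_id
```

Both behave identically at runtime; the difference is only how mypy reads the signature.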
openml/datasets/dataset.py

```diff
  self.update_comment = update_comment
  self.md5_checksum = md5_checksum
- self.data_file = data_file
+ self.data_file = cast(str, data_file)
```
How do we now know that this is not None?
You are right; I moved it to after the data_file check, which removed the need for the cast.
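The fix described here, checking for None before assigning so that mypy narrows the type and the cast becomes unnecessary, can be sketched like this (class and attribute names are simplified stand-ins for the real code):

```python
from typing import Optional

class DatasetSketch:
    def __init__(self, data_file: Optional[str]):
        # Before: assign first, then cast(str, data_file) to silence mypy.
        # After: check first, so mypy narrows Optional[str] to str in the
        # branch below and no cast is needed.
        if data_file is None:
            raise ValueError("data_file must be set")
        self.data_file: str = data_file
```

The narrowing works because mypy tracks `is None` checks as type guards.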
openml/datasets/dataset.py

```python
        return data_container

    def _get_arff(self, format: str) -> Dict:  # type: ignore
```
What would be the reason to not give more details on the return type here?
Every type I tried threw an error =) I silenced it and moved on, but I will give it one more try.
fixed
.pre-commit-config.yaml

```yaml
additional_dependencies:
  - types-requests
  - types-python-dateutil
args: [ --disallow-untyped-defs, --disallow-any-generics,
```
I'm afraid that this is too strict, as it will check everything. Could you please create a new entry to which you can add the newly typed files?
I separated it and restricted it to the currently changed files only; let me know if that is not what you wanted.
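One way to scope the strict flags to newly typed files, as requested above, is a second hook entry with its own `files` pattern. A sketch, assuming the pre-commit mypy mirror; the `rev`, `name`, and file patterns here are illustrative, not the exact PR content:

```yaml
- repo: https://github.com/pre-commit/mirrors-mypy
  rev: v1.0.0            # illustrative pin
  hooks:
    - id: mypy           # lenient pass over the whole package
      files: openml/.*
    - id: mypy           # strict pass, restricted to newly typed files
      name: mypy-strict-datasets
      files: openml/datasets/.*
      args: [ --disallow-untyped-defs, --disallow-any-generics,
              --disallow-any-explicit, --implicit-optional ]
```

pre-commit allows the same hook `id` to appear twice with different `name` and `files` values, which is what makes this split possible.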
Codecov Report
Patch coverage:
Additional details and impacted files

```
@@            Coverage Diff             @@
##           develop    #1255     +/-   ##
===========================================
- Coverage    85.26%   80.47%    -4.80%
===========================================
  Files           38       38
  Lines         5104     5019      -85
===========================================
- Hits          4352     4039     -313
- Misses         752      980     +228
```

☔ View full report in Codecov by Sentry.
openml/datasets/dataset.py

```diff
      return data_container

- def _get_arff(self, format: str) -> Dict:  # type: ignore
+ def _get_arff(self, format: str) -> Dict[str, Union[arff.DENSE, arff.COO]]:
```
I think the return type of this needs to be the same as the inner decode_arff.
Yes, but then the parse_data_from_arff function started complaining; I'm looking into that.
Okay, the issue here is that since the Dict returned from _get_arff is a big Union, every dict access in the parse_data_from_arff function complains about all the type options the dict values can take. I added a typing ignore to that function for now and will go back to it later.
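The problem being described, that every lookup on a `Dict[str, Union[...]]` yields the full Union, so downstream code fails attribute and index checks, can be illustrated with simplified stand-in types (this is not the real ARFF payload, just the same shape of problem):

```python
from typing import Dict, List, Union, cast

# Simplified stand-in for the decoded ARFF payload: values of the dict can
# be very different types, so the value type is a Union.
ArffDict = Dict[str, Union[str, List[List[float]]]]

def parse_data(decoded: ArffDict) -> int:
    # As far as mypy is concerned, decoded["data"] has type
    # Union[str, List[List[float]]], so calling len()/indexing on it as a
    # list triggers union-attr errors. A cast (or an isinstance check)
    # narrows it at the use site:
    data = cast(List[List[float]], decoded["data"])
    return len(data)
```

An `isinstance` check would give the same narrowing with a runtime guarantee, at the cost of a little extra code per access.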
openml/datasets/data_feature.py

```diff
      return "[%d - %s (%s)]" % (self.index, self.name, self.data_type)

- def _repr_pretty_(self, pp, cycle):
+ def _repr_pretty_(self, pp, cycle) -> None:  # type: ignore
```
What's the reason for this ignore?
PyCharm couldn't find any use of this function in the library, so I have no idea what types the parameters should be. I can fix it if you tell me the types.
Ah, I see its use case now from the pretty library, but I'm still not sure what parameter types to declare.
This function is for displaying things nicely for Jupyter notebooks and documented here.
pp is an instance of a pretty-printer class, so annotating it directly would require the pretty-printer library to be installed, and I'm not sure it is in the requirements.
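Since `_repr_pretty_` is an IPython display hook, `pp` is IPython's pretty printer at runtime. One way to annotate it without adding IPython as a dependency is a structural Protocol; a sketch with only the single method the class actually calls (the real printer exposes more, such as `breakable`):

```python
from typing import Protocol

class PrettyPrinterLike(Protocol):
    # Minimal structural type standing in for IPython's
    # RepresentationPrinter: only the method this class uses.
    def text(self, obj: str) -> None: ...

class DataFeatureSketch:
    def __init__(self, index: int, name: str, data_type: str):
        self.index, self.name, self.data_type = index, name, data_type

    def _repr_pretty_(self, pp: PrettyPrinterLike, cycle: bool) -> None:
        pp.text("[%d - %s (%s)]" % (self.index, self.name, self.data_type))
```

Any object with a compatible `text` method satisfies the Protocol, so tests can pass a plain collector object instead of a real printer.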
openml/datasets/functions.py

```diff
      output_format: str = "dict",
-     **kwargs,
+     **kwargs: Optional[Union[str, int]],
  ) -> Union[Dict, pd.DataFrame]:
```
The docstring states that this should be a dict, and from its usage I also would assume that this is actually a dict.
It is a dict, but mypy has issues with kwargs (see python/typing#1406). I changed it to a union of str and int for now so it doesn't complain, and passing a dict still works. But please feel free to adjust if you have a cleaner idea.
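The subtlety here is what a `**kwargs` annotation means: it describes each value passed by keyword, not the kwargs dict itself. Inside the function, mypy sees kwargs as a dict keyed by str with that value type. A sketch (the function name is illustrative):

```python
from typing import Optional, Union

def f(**kwargs: Optional[Union[str, int]]) -> dict:
    # Inside the body, kwargs has type Dict[str, Optional[Union[str, int]]];
    # the annotation constrains the values, never the mapping type.
    return kwargs

result = f(tag="study_1", limit=10, offset=None)
```

So annotating `**kwargs: Dict[...]` would (counter-intuitively) require every keyword value to itself be a dict, which is why the Union of scalar types was used instead.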
```python
    dataset = _create_dataset_from_description(
        description, features_file, qualities_file, arff_file, parquet_file, cache_format
    )
    if qualities_file:
```
This appears to be quite a change in the behavior of this function, why is this necessary?
mypy gives an arg-type error otherwise, since qualities_file can be None (only on an exception path, which mypy doesn't see). At runtime it is never None, so line 472 is always true; otherwise an exception would already have been thrown.
Sorry, I can't follow. What do you mean by "in exception"?
```diff
- def attributes_arff_from_df(df):
+ def attributes_arff_from_df(
+     df: pd.DataFrame,
+ ) -> List[Union[Tuple[str, str], Tuple[str, List[str]]]]:
```
Maybe you could create a variable ARFF_ATTRIBUTE_TYPE and use that throughout the files instead of redefining this again and again?
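The suggested alias (the name ARFF_ATTRIBUTE_TYPE comes from the review comment above) would be defined once at module level and reused in every signature instead of repeating the long Union; a sketch:

```python
from typing import List, Tuple, Union

# Module-level alias for the ARFF attribute list: each attribute is either
# (name, type_string) or (name, list_of_nominal_values).
ARFF_ATTRIBUTE_TYPE = List[Union[Tuple[str, str], Tuple[str, List[str]]]]

# Example value: one numeric attribute and one nominal attribute.
attrs: ARFF_ATTRIBUTE_TYPE = [("age", "NUMERIC"), ("class", ["yes", "no"])]
```

Functions like attributes_arff_from_df could then declare `-> ARFF_ATTRIBUTE_TYPE`, keeping the signatures short and consistent.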
```diff
-     update_comment=None,
-     version_label=None,
- ):
+     name: str,
```
Could you please check this again, and maybe compare with the dataset XSD? A lot of them are not needed for the upload, and I believe also not for the creation of a dataset.
I changed the ones that are optional in OpenMLDataset to optional here too.
Hey, I just had a look at the dataset XSD, and for example ignore_attributes is optional, while it is mandatory in the type definitions below. Could you please double-check these definitions?
```python
    attributes_ = attributes

    # attributes: Union[List[Tuple[str, str]], List[Tuple[str, List[str]]], str]
    # attributes_: List[Union[Tuple[str, str], Tuple[str, List[str]]]]
```
What are these comments for?
deleted
I'm afraid they resurged with the rebase :(
```diff
- def _get_online_dataset_arff(dataset_id):
+ def _get_online_dataset_arff(dataset_id: int) -> Optional[str]:
```
Why is the return type Optional? Are there reasons why this would not return something?
```diff
    - types-python-dateutil
  args: [ --disallow-untyped-defs, --disallow-any-generics,
-         --disallow-any-explicit, --implicit-optional ]
+         --disallow-any-explicit, --implicit-optional, --allow-redefinition]
```
Hey, what's the reason for allow-redefinition? And should these more strict checks maybe also be applied to the newly added check for the datasets?
allow-redefinition lets you redefine a variable with a different type within the same block and nesting depth. There were a few cases where that happened, and I didn't want to change the code too much. Agreed with the second part; I'm adding the stricter checks to the datasets entry as well.
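A hedged illustration of what --allow-redefinition permits: re-binding a name to an incompatible type within the same block. Without the flag, mypy flags the second assignment as an incompatible type; with it, the name simply carries the new type from that point on (the function name here is illustrative):

```python
def parse(data: bytes) -> int:
    # Re-binding `data` from bytes to str: mypy rejects this unless
    # --allow-redefinition is set, even though it is fine at runtime.
    data = data.decode("utf-8")
    return int(data)
```

The alternative, without the flag, is to introduce a second name (e.g. `text = data.decode(...)`), which is why enabling it avoids touching more code.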
| """ | ||
| datasets = list_datasets(status="all", data_id=dataset_ids, output_format="dataframe") | ||
| missing = set(dataset_ids) - set(datasets.get("did", [])) | ||
| missing = set(dataset_ids) - set(datasets.get("did", [])) # type: ignore |
Hey, what's the reason for this ignore? Should this not be Set[int]?
```diff
  missing_str = ", ".join(str(did) for did in missing)
  raise ValueError(f"Could not find dataset(s) {missing_str} in OpenML dataset list.")
- return dict(datasets["status"] == "active")
+ return dict(datasets["status"] == "active")  # type: ignore
```
Hey, what's the reason for this ignore? The function defines a return type, so should this not just be the return type?
```diff
      with open(self.feather_attribute_file, "rb") as fh:
          categorical, attribute_names = pickle.load(fh)
- else:
+ elif self.data_pickle_file:
```
Could you please add an else again to make sure we're not running into some edge case?
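The defensive pattern being requested, keeping an explicit else branch so an unexpected state fails loudly instead of being silently skipped, can be sketched as follows (a free function with illustrative names mirroring the attributes in the diff, not the real method):

```python
import pickle
from typing import Optional

def load_attributes(feather_attribute_file: Optional[str],
                    data_pickle_file: Optional[str]):
    if feather_attribute_file:
        with open(feather_attribute_file, "rb") as fh:
            return pickle.load(fh)
    elif data_pickle_file:
        with open(data_pickle_file, "rb") as fh:
            return pickle.load(fh)
    else:
        # Explicit else: neither cache file exists, which should never
        # happen; raising here surfaces the edge case immediately.
        raise ValueError("No cached attribute file available.")
```

Replacing a bare `else:` with `elif` plus a raising `else` keeps mypy happy about the Optional arguments while preserving the original fail-fast behavior.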
| dataset_format: str = "dataframe", | ||
| ) -> Tuple[ | ||
| Union[np.ndarray, pd.DataFrame, scipy.sparse.csr_matrix], | ||
| Union[np.ndarray, pd.DataFrame, pd.SparseDtype], |
This looks wrong. Do we return a sparse Dtype instead of sparse scipy matrices now?
```python
    return qualities_


def _parse_qualities_xml(qualities_xml):
```
This looks like it is reverting some changes from @PGijsbers, is this on purpose or rather a merge issue?
I don't see any reason to revert the changes at least.
```python
        return data_container

    def _get_arff(
```
Is there a reason you moved this function? This makes it hard to review, and I would appreciate if you could move it back to its original place.
Heyho, after looking at the changes, conflicts, and open questions, I propose closing this PR and manually adding its changes as part of the refactor in #1298. With the changes in #1298 and the ruff linter, it will be easier to spot all the missing parts and have consistent typing everywhere. I will update this PR once we decide on our next steps related to PR #1298.

@LennartPurucker could you please make Bilge a co-author in one of the commits that migrate her changes over? That way she gets listed as a contributor when it eventually gets squash-merged into develop.

Will do!
What does this PR implement/fix? Explain your changes.
(dataset.py) Annotate functions and classes and forbid the use of "Any". The docstrings are not updated yet.
How should this PR be tested?
The tests fail in the pre-change state too.
Any other comments?
Switched back to the "cast" solution to handle optional-variable errors in mypy, since the tests failed for the function we excluded from the class (and more cases occurred beyond that function).