
Conversation

@Bilgecelik
Contributor

What does this PR implement/fix? Explain your changes.

(dataset.py) Annotate functions and classes and forbid the use of `Any`. Docstrings are not updated yet.

How should this PR be tested?

Tests fail in the pre-change state as well.

Any other comments?

Switched back to the `cast` solution to handle optional-variable errors in mypy, since tests failed for the function we excluded from the class (and more cases occurred beyond that function).

@Bilgecelik Bilgecelik requested a review from mfeurer June 13, 2023 11:50
description: str,
data_format: str = "arff",
cache_format: str = "pickle",
dataset_id: int = None,
Collaborator

This is inconsistent with other lines, in which we already have Optional[str]. I know that this syntax is legal, but I'm not sure whether I like it too much, and also, we should be consistent and adopt only one way of declaring optional arguments.
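For illustration, the explicit form the reviewer prefers could look like this (hypothetical function name, not the actual signature):

```python
from typing import Optional

# Explicit Optional: the None default is visible in the annotation,
# unlike the implicit `dataset_id: int = None` form.
def describe_dataset(dataset_id: Optional[int] = None) -> str:
    # `describe_dataset` is a made-up helper for illustration only.
    if dataset_id is None:
        return "no dataset id given"
    return f"dataset {dataset_id}"
```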

self.update_comment = update_comment
self.md5_checksum = md5_checksum
self.data_file = data_file
self.data_file = cast(str, data_file)
Collaborator

How do we now know that this is not None?

Contributor Author

You are right; moving it to after the `data_file` check removed the need for the cast.
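The narrowing described here might look like this sketch (names are hypothetical): once the None check comes first, mypy narrows `Optional[str]` to `str` on its own and the cast becomes unnecessary.

```python
from typing import Optional

def store_data_file(data_file: Optional[str]) -> str:
    # Before: `cast(str, data_file)` silenced mypy even when the value
    # could still be None. Checking first lets mypy narrow the type itself.
    if data_file is None:
        raise ValueError("data_file must be set")
    return data_file  # mypy now knows this is str; no cast needed
```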


return data_container

def _get_arff(self, format: str) -> Dict: # type: ignore
Collaborator

What would be the reason to not give more details on the return type here?

Contributor Author

Every type I tried threw an error=) I silenced it and moved on, but I will try one more time.

Contributor Author

fixed

additional_dependencies:
- types-requests
- types-python-dateutil
args: [ --disallow-untyped-defs, --disallow-any-generics,
Collaborator

I'm afraid that this is too strict, as it will test everything. Could you please create a new entry in which you can add the newly typed files?

Contributor Author

I separated it and restricted it to the currently changed file only; let me know if that is not what you wanted.

@codecov-commenter

codecov-commenter commented Jun 13, 2023

Codecov Report

Patch coverage: 82.69% and project coverage change: -4.80 ⚠️

Comparison is base (5d2128a) 85.26% compared to head (5348eed) 80.47%.

❗ Current head 5348eed differs from pull request most recent head 1096567. Consider uploading reports for the commit 1096567 to get more accurate results

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1255      +/-   ##
===========================================
- Coverage    85.26%   80.47%   -4.80%     
===========================================
  Files           38       38              
  Lines         5104     5019      -85     
===========================================
- Hits          4352     4039     -313     
- Misses         752      980     +228     
Impacted Files Coverage Δ
openml/datasets/dataset.py 75.15% <82.69%> (-12.45%) ⬇️

... and 17 files with indirect coverage changes


return data_container

def _get_arff(self, format: str) -> Dict: # type: ignore
def _get_arff(self, format: str) -> Dict[str, Union[arff.DENSE, arff.COO]]:
Collaborator

I think the return type of this needs to be the same as the inner decode_arff.

Contributor Author

Yes, but then the parse_data_from_arff function started complaining; I'm looking into that.

Contributor Author

Okay, the issue here is that since the Dict returned from _get_arff is a big Union, all dict accesses in the parse_data_from_arff function trigger complaints about every type the dict indices/values can take. I added a typing ignore to that function for now and will come back to it later.
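A minimal sketch of the problem (names and value types are illustrative, not the real ones): when a dict's values form a wide Union, every consumer has to narrow before use.

```python
from typing import Dict, List, Union

# Illustrative stand-in for the big Union returned by _get_arff.
ArffDecoded = Dict[str, Union[str, List[List[float]]]]

def get_rows(decoded: ArffDecoded) -> List[List[float]]:
    data = decoded["data"]
    # Without this isinstance narrowing, mypy complains at every use of
    # `data`, since it might also be the str member of the Union.
    if isinstance(data, str):
        raise TypeError("expected row data, got a string field")
    return data
```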

return "[%d - %s (%s)]" % (self.index, self.name, self.data_type)

def _repr_pretty_(self, pp, cycle):
def _repr_pretty_(self, pp, cycle) -> None: # type: ignore
Collaborator

What's the reason for this ignore?

Contributor Author

PyCharm couldn't find any use of this function in the library, so I have no idea what types the parameters should be. I can fix it if you tell me the types.

Contributor Author

Ah, I see its use now from the pretty library, but I'm still not sure what parameter types to declare.

Collaborator

This function is for displaying things nicely in Jupyter notebooks and is documented here.

Contributor Author

pp is a pretty-printer class instance, so this will require the pretty-printer library to be installed, but I'm not sure whether it is in the requirements.
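The hook is actually called by IPython (the `pretty` module ships with IPython), so no extra runtime requirement is needed if the import is guarded. A possible typed sketch, assuming IPython's documented `_repr_pretty_(obj, p, cycle)` protocol:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported only for type checking, so IPython need not be installed
    # to run this code.
    from IPython.lib.pretty import RepresentationPrinter

class Example:
    def __repr__(self) -> str:
        return "Example()"

    def _repr_pretty_(self, pp: "RepresentationPrinter", cycle: bool) -> None:
        # IPython calls this hook when rendering objects in notebooks;
        # `cycle` is True when the printer detects a reference cycle.
        if cycle:
            pp.text("Example(...)")
        else:
            pp.text(repr(self))
```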

output_format: str = "dict",
**kwargs,
) -> Union[Dict, pd.DataFrame]:
**kwargs: Optional[Union[str, int]],
Collaborator

The docstring states that this should be a dict, and from its usage I also would assume that this is actually a dict.

Contributor Author

It is a dict, but mypy has issues with kwargs (see python/typing#1406). I changed the annotation to a union of str and int for now so mypy doesn't complain and passing keyword arguments as a dict still works. But please feel free to adjust it if you have a cleaner idea.
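For reference, annotating `**kwargs` with a value type means each keyword value has that type, while inside the function `kwargs` itself is a dict — a small sketch with an illustrative function name:

```python
from typing import Dict, Union

def collect_filters(**kwargs: Union[str, int]) -> Dict[str, Union[str, int]]:
    # The annotation on **kwargs describes the values; mypy types the
    # `kwargs` mapping itself as Dict[str, Union[str, int]].
    return dict(kwargs)
```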

dataset = _create_dataset_from_description(
description, features_file, qualities_file, arff_file, parquet_file, cache_format
)
if qualities_file:
Collaborator

This appears to be quite a change in the behavior of this function, why is this necessary?

Contributor Author

mypy gives an arg-type error otherwise, since qualities_file can be None (only on the exception path, but mypy doesn't see that). At runtime it is never None, so the check on line 472 is always true; otherwise the exception would already have been thrown.
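The runtime invariant can be made visible to mypy with a local guard, roughly like the `if qualities_file:` check in the diff (hypothetical loader name; note that a truthiness check also excludes the empty string, which is a slight behavior difference from a plain `is not None` check):

```python
from typing import Optional

def _load(path: str) -> str:
    # Stand-in for the real file loader.
    return f"loaded:{path}"

def load_qualities(qualities_file: Optional[str]) -> str:
    if qualities_file:  # narrows Optional[str] to str (and skips "")
        return _load(qualities_file)
    # mypy cannot see cross-function invariants, so without this branch
    # it reports an arg-type error for the Optional argument.
    raise FileNotFoundError("qualities file was not cached")
```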

Collaborator

Sorry, I can't follow. What do you mean by "in exception"?

def attributes_arff_from_df(df):
def attributes_arff_from_df(
df: pd.DataFrame,
) -> List[Union[Tuple[str, str], Tuple[str, List[str]]]]:
Collaborator

Maybe you could create a variable ARFF_ATTRIBUTE_TYPE and use that throughout the files instead of redefining this again and again?
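The suggested alias might look like this (the alias name comes from the review; the function body is a simplified illustration, not the real implementation):

```python
from typing import List, Tuple, Union

# Defined once at module level and reused in every signature.
ARFF_ATTRIBUTE_TYPE = List[Union[Tuple[str, str], Tuple[str, List[str]]]]

def attributes_from_columns(
    columns: List[Tuple[str, List[object]]],
) -> ARFF_ATTRIBUTE_TYPE:
    # Simplified: numeric columns become "REAL", others a nominal value list.
    result: ARFF_ATTRIBUTE_TYPE = []
    for name, values in columns:
        if all(isinstance(v, (int, float)) for v in values):
            result.append((name, "REAL"))
        else:
            result.append((name, sorted({str(v) for v in values})))
    return result
```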

update_comment=None,
version_label=None,
):
name: str,
Collaborator

Could you please check this again, and maybe compare with the dataset XSD? A lot of them are not needed for the upload, and I believe also not for the creation of a dataset.

Contributor Author

I changed the optional ones in OpenMLDataset to optional here too.

Collaborator

Hey, I just had a look at the dataset XSD, and for example ignore_attributes is optional, while it is mandatory in the type definitions below. Could you please double-check these definitions?

attributes_ = attributes

# attributes: Union[List[Tuple[str, str]], List[Tuple[str, List[str]]], str]
# attributes_: List[Union[Tuple[str, str], Tuple[str, List[str]]]]
Collaborator

What are these comments for?

Contributor Author

deleted

Collaborator

I'm afraid they resurged with the rebase :(



def _get_online_dataset_arff(dataset_id):
def _get_online_dataset_arff(dataset_id: int) -> Optional[str]:
Collaborator

Why is the return type Optional? Are there reasons why this would not return something?

- types-python-dateutil
args: [ --disallow-untyped-defs, --disallow-any-generics,
--disallow-any-explicit, --implicit-optional ]
--disallow-any-explicit, --implicit-optional, --allow-redefinition]
Collaborator

Hey, what's the reason for allow-redefinition? And should these more strict checks maybe also be applied to the newly added check for the datasets?

Contributor Author

allow-redefinition allows rebinding a variable to a different type within the same block and nesting depth; there were a few cases where that happened and I didn't want to change the code too much. Agreed with the second part; I'm adding them to the datasets entry as well.
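A small sketch of the pattern that --allow-redefinition permits (illustrative function, not from the codebase):

```python
from typing import Union

def normalize(raw: Union[int, str]) -> str:
    # With --allow-redefinition, mypy accepts rebinding `raw` from
    # Union[int, str] to str on the next line; without the flag this
    # needs a second variable name.
    raw = str(raw)
    return raw.strip()
```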

"""
datasets = list_datasets(status="all", data_id=dataset_ids, output_format="dataframe")
missing = set(dataset_ids) - set(datasets.get("did", []))
missing = set(dataset_ids) - set(datasets.get("did", [])) # type: ignore
Collaborator

Hey, what's the reason for this ignore? Should this not be Set[int]?

missing_str = ", ".join(str(did) for did in missing)
raise ValueError(f"Could not find dataset(s) {missing_str} in OpenML dataset list.")
return dict(datasets["status"] == "active")
return dict(datasets["status"] == "active") # type: ignore
Collaborator

Hey, what's the reason for this ignore? The function defines a return type, so should this not just be the return type?


with open(self.feather_attribute_file, "rb") as fh:
categorical, attribute_names = pickle.load(fh)
else:
elif self.data_pickle_file:
Collaborator

Could you please add an else again to make sure we're not running into some edge case?

dataset_format: str = "dataframe",
) -> Tuple[
Union[np.ndarray, pd.DataFrame, scipy.sparse.csr_matrix],
Union[np.ndarray, pd.DataFrame, pd.SparseDtype],
Collaborator

This looks wrong. Do we return sparse Dtype instead of sparse scipy matrices now?

return qualities_


def _parse_qualities_xml(qualities_xml):
Collaborator

This looks like it is reverting some changes from @PGijsbers, is this on purpose or rather a merge issue?

Collaborator

I don't see any reason to revert the changes at least.


return data_container

def _get_arff(
Collaborator

Is there a reason you moved this function? This makes it hard to review, and I would appreciate if you could move it back to its original place.

@LennartPurucker
Contributor

Heyho, after looking at the changes, conflicts, and open questions, I propose closing this PR and manually adding its changes as part of the refactor in #1298. With the changes in #1298 and the ruff linter, it will be easier to spot all the missing parts and have consistent typing everywhere.

I will update this PR once we decide on our next steps related to PR #1298.

@PGijsbers
Collaborator

@LennartPurucker could you please make Bilge co-author in one of the commits that migrate her changes over? That way she gets listed as a contributor when it eventually gets squash merged into develop.

@LennartPurucker
Contributor

Will do!
