
Conversation

@Bilgecelik
Contributor

What does this PR implement/fix? Explain your changes.

(dataset.py) Annotate functions and classes and forbid the use of `Any`. Docstrings are not updated yet.

How should this PR be tested?

Tests fail in the pre-change state as well.

Any other comments?

Switched back to the `cast` solution to handle optional-variable errors in mypy, since tests failed for the function we excluded from the class (and more cases occurred beyond that function).

@Bilgecelik Bilgecelik requested a review from mfeurer June 13, 2023 11:50
description: str,
data_format: str = "arff",
cache_format: str = "pickle",
dataset_id: int = None,
Collaborator

This is inconsistent with other lines, in which we already have Optional[str]. I know that this syntax is legal, but I'm not sure whether I like it too much, and also, we should be consistent and adopt only one way of declaring optional arguments.
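For illustration, the explicit form the reviewer prefers could look like this (hypothetical function name, not the actual signature):

```python
from typing import Optional

# Explicit Optional: the None default is visible in the annotation,
# unlike the implicit `dataset_id: int = None` form.
def describe_dataset(dataset_id: Optional[int] = None) -> str:
    # `describe_dataset` is a made-up helper for illustration only.
    if dataset_id is None:
        return "no dataset id given"
    return f"dataset {dataset_id}"
```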

self.update_comment = update_comment
self.md5_checksum = md5_checksum
self.data_file = data_file
self.data_file = cast(str, data_file)
Collaborator

How do we now know that this is not None?

Contributor Author

You are right; moving it to after the `data_file` check removed the need for the cast.
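The narrowing described here might look like this sketch (names are hypothetical): once the None check comes first, mypy narrows `Optional[str]` to `str` on its own and the cast becomes unnecessary.

```python
from typing import Optional

def store_data_file(data_file: Optional[str]) -> str:
    # Before: `cast(str, data_file)` silenced mypy even when the value
    # could still be None. Checking first lets mypy narrow the type itself.
    if data_file is None:
        raise ValueError("data_file must be set")
    return data_file  # mypy now knows this is str; no cast needed
```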


return data_container

def _get_arff(self, format: str) -> Dict: # type: ignore
Collaborator

What would be the reason to not give more details on the return type here?

Contributor Author

Every type I tried threw an error=) I silenced it and moved on, but I will try one more time.

Contributor Author

fixed

additional_dependencies:
- types-requests
- types-python-dateutil
args: [ --disallow-untyped-defs, --disallow-any-generics,
Collaborator

I'm afraid that this is too strict, as it will test everything. Could you please create a new entry in which you can add the newly typed files?

Contributor Author

I separated it and restricted it to the currently changed file only; let me know if that is not what you wanted.

@codecov-commenter

codecov-commenter commented Jun 13, 2023

Codecov Report

Patch coverage: 82.69% and project coverage change: -4.80 ⚠️

Comparison is base (5d2128a) 85.26% compared to head (5348eed) 80.47%.

❗ Current head 5348eed differs from pull request most recent head 1096567. Consider uploading reports for the commit 1096567 to get more accurate results

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #1255      +/-   ##
===========================================
- Coverage    85.26%   80.47%   -4.80%     
===========================================
  Files           38       38              
  Lines         5104     5019      -85     
===========================================
- Hits          4352     4039     -313     
- Misses         752      980     +228     
Impacted Files Coverage Δ
openml/datasets/dataset.py 75.15% <82.69%> (-12.45%) ⬇️

... and 17 files with indirect coverage changes


return data_container

def _get_arff(self, format: str) -> Dict: # type: ignore
def _get_arff(self, format: str) -> Dict[str, Union[arff.DENSE, arff.COO]]:
Collaborator

I think the return type of this needs to be the same as the inner decode_arff.

Contributor Author

Yes, but then the parse_data_from_arff function started complaining; I'm looking into that.

Contributor Author

Okay, the issue here is that since the Dict returned from _get_arff is a big Union, all dict accesses in the parse_data_from_arff function trigger complaints about every type the dict indices/values can take. I added a typing ignore to that function for now and will come back to it later.
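A minimal sketch of the problem (names and value types are illustrative, not the real ones): when a dict's values form a wide Union, every consumer has to narrow before use.

```python
from typing import Dict, List, Union

# Illustrative stand-in for the big Union returned by _get_arff.
ArffDecoded = Dict[str, Union[str, List[List[float]]]]

def get_rows(decoded: ArffDecoded) -> List[List[float]]:
    data = decoded["data"]
    # Without this isinstance narrowing, mypy complains at every use of
    # `data`, since it might also be the str member of the Union.
    if isinstance(data, str):
        raise TypeError("expected row data, got a string field")
    return data
```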

return "[%d - %s (%s)]" % (self.index, self.name, self.data_type)

def _repr_pretty_(self, pp, cycle):
def _repr_pretty_(self, pp, cycle) -> None: # type: ignore
Collaborator

What's the reason for this ignore?

Contributor Author

PyCharm couldn't find any use of this function in the library, so I have no idea what types the parameters should be. I can fix it if you tell me the types.

Contributor Author

Ah, I see its use now from the pretty library, but I'm still not sure what parameter types to declare.

Collaborator

This function is for displaying things nicely in Jupyter notebooks and is documented here.

Contributor Author

pp is a pretty-printer class instance, so this will require the pretty-printer library to be installed, but I'm not sure whether it is in the requirements.
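The hook is actually called by IPython (the `pretty` module ships with IPython), so no extra runtime requirement is needed if the import is guarded. A possible typed sketch, assuming IPython's documented `_repr_pretty_(obj, p, cycle)` protocol:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported only for type checking, so IPython need not be installed
    # to run this code.
    from IPython.lib.pretty import RepresentationPrinter

class Example:
    def __repr__(self) -> str:
        return "Example()"

    def _repr_pretty_(self, pp: "RepresentationPrinter", cycle: bool) -> None:
        # IPython calls this hook when rendering objects in notebooks;
        # `cycle` is True when the printer detects a reference cycle.
        if cycle:
            pp.text("Example(...)")
        else:
            pp.text(repr(self))
```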

output_format: str = "dict",
**kwargs,
) -> Union[Dict, pd.DataFrame]:
**kwargs: Optional[Union[str, int]],
Collaborator

The docstring states that this should be a dict, and from its usage I also would assume that this is actually a dict.

Contributor Author

It is a dict, but mypy has issues with kwargs (see python/typing#1406). I changed the annotation to a union of str and int for now so mypy doesn't complain and passing keyword arguments as a dict still works. But please feel free to adjust it if you have a cleaner idea.
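For reference, annotating `**kwargs` with a value type means each keyword value has that type, while inside the function `kwargs` itself is a dict — a small sketch with an illustrative function name:

```python
from typing import Dict, Union

def collect_filters(**kwargs: Union[str, int]) -> Dict[str, Union[str, int]]:
    # The annotation on **kwargs describes the values; mypy types the
    # `kwargs` mapping itself as Dict[str, Union[str, int]].
    return dict(kwargs)
```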

dataset = _create_dataset_from_description(
description, features_file, qualities_file, arff_file, parquet_file, cache_format
)
if qualities_file:
Collaborator

This appears to be quite a change in the behavior of this function, why is this necessary?

Contributor Author

mypy gives an arg-type error otherwise, since qualities_file can be None (only on the exception path, but mypy doesn't see that). At runtime it is never None, so the check on line 472 is always true; otherwise the exception would already have been thrown.
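The runtime invariant can be made visible to mypy with a local guard, roughly like the `if qualities_file:` check in the diff (hypothetical loader name; note that a truthiness check also excludes the empty string, which is a slight behavior difference from a plain `is not None` check):

```python
from typing import Optional

def _load(path: str) -> str:
    # Stand-in for the real file loader.
    return f"loaded:{path}"

def load_qualities(qualities_file: Optional[str]) -> str:
    if qualities_file:  # narrows Optional[str] to str (and skips "")
        return _load(qualities_file)
    # mypy cannot see cross-function invariants, so without this branch
    # it reports an arg-type error for the Optional argument.
    raise FileNotFoundError("qualities file was not cached")
```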

Collaborator

Sorry, I can't follow. What do you mean by "in exception"?

def attributes_arff_from_df(df):
def attributes_arff_from_df(
df: pd.DataFrame,
) -> List[Union[Tuple[str, str], Tuple[str, List[str]]]]:
Collaborator

Maybe you could create a variable ARFF_ATTRIBUTE_TYPE and use that throughout the files instead of redefining this again and again?
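The suggested alias might look like this (the alias name comes from the review; the function body is a simplified illustration, not the real implementation):

```python
from typing import List, Tuple, Union

# Defined once at module level and reused in every signature.
ARFF_ATTRIBUTE_TYPE = List[Union[Tuple[str, str], Tuple[str, List[str]]]]

def attributes_from_columns(
    columns: List[Tuple[str, List[object]]],
) -> ARFF_ATTRIBUTE_TYPE:
    # Simplified: numeric columns become "REAL", others a nominal value list.
    result: ARFF_ATTRIBUTE_TYPE = []
    for name, values in columns:
        if all(isinstance(v, (int, float)) for v in values):
            result.append((name, "REAL"))
        else:
            result.append((name, sorted({str(v) for v in values})))
    return result
```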

update_comment=None,
version_label=None,
):
name: str,
Collaborator

Could you please check this again, and maybe compare with the dataset XSD? A lot of them are not needed for the upload, and I believe also not for the creation of a dataset.

Contributor Author

I changed the optional ones in OpenMLDataset to optional here too.

Collaborator

Hey, I just had a look at the dataset XSD, and for example ignore_attributes is optional, while it is mandatory in the type definitions below. Could you please double-check these definitions?

attributes_ = attributes

# attributes: Union[List[Tuple[str, str]], List[Tuple[str, List[str]]], str]
# attributes_: List[Union[Tuple[str, str], Tuple[str, List[str]]]]
Collaborator

What are these comments for?

Contributor Author

deleted

Collaborator

I'm afraid they resurged with the rebase :(



def _get_online_dataset_arff(dataset_id):
def _get_online_dataset_arff(dataset_id: int) -> Optional[str]:
Collaborator

Why is the return type Optional? Are there reasons why this would not return something?

- types-python-dateutil
args: [ --disallow-untyped-defs, --disallow-any-generics,
--disallow-any-explicit, --implicit-optional ]
--disallow-any-explicit, --implicit-optional, --allow-redefinition]
Collaborator

Hey, what's the reason for allow-redefinition? And should these more strict checks maybe also be applied to the newly added check for the datasets?

Contributor Author

allow-redefinition allows rebinding a variable to a different type within the same block and nesting depth; there were a few cases where that happened and I didn't want to change the code too much. Agreed with the second part; I'm adding them to the datasets entry as well.
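A small sketch of the pattern that --allow-redefinition permits (illustrative function, not from the codebase):

```python
from typing import Union

def normalize(raw: Union[int, str]) -> str:
    # With --allow-redefinition, mypy accepts rebinding `raw` from
    # Union[int, str] to str on the next line; without the flag this
    # needs a second variable name.
    raw = str(raw)
    return raw.strip()
```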

"""
datasets = list_datasets(status="all", data_id=dataset_ids, output_format="dataframe")
missing = set(dataset_ids) - set(datasets.get("did", []))
missing = set(dataset_ids) - set(datasets.get("did", [])) # type: ignore
Collaborator

Hey, what's the reason for this ignore? Should this not be Set[int]?

missing_str = ", ".join(str(did) for did in missing)
raise ValueError(f"Could not find dataset(s) {missing_str} in OpenML dataset list.")
return dict(datasets["status"] == "active")
return dict(datasets["status"] == "active") # type: ignore
Collaborator

Hey, what's the reason for this ignore? The function defines a return type, so should this not just be the return type?


with open(self.feather_attribute_file, "rb") as fh:
categorical, attribute_names = pickle.load(fh)
else:
elif self.data_pickle_file:
Collaborator

Could you please add an else again to make sure we're not running into some edge case?

dataset_format: str = "dataframe",
) -> Tuple[
Union[np.ndarray, pd.DataFrame, scipy.sparse.csr_matrix],
Union[np.ndarray, pd.DataFrame, pd.SparseDtype],
Collaborator

This looks wrong. Do we return sparse Dtype instead of sparse scipy matrices now?

return qualities_


def _parse_qualities_xml(qualities_xml):
Collaborator

This looks like it is reverting some changes from @PGijsbers, is this on purpose or rather a merge issue?

Collaborator

I don't see any reason to revert the changes at least.


return data_container

def _get_arff(
Collaborator

Is there a reason you moved this function? This makes it hard to review, and I would appreciate if you could move it back to its original place.

@LennartPurucker
Contributor

Heyho, after looking at the changes, conflicts, and open questions, I propose closing this PR and manually adding its changes as part of the refactor in #1298. With the changes in #1298 and the ruff linter, it will be easier to spot all the missing parts and have consistent typing everywhere.

I will update this PR once we decide on our next steps related to PR #1298.

@PGijsbers
Collaborator

@LennartPurucker could you please make Bilge co-author in one of the commits that migrate her changes over? That way she gets listed as a contributor when it eventually gets squash merged into develop.

@LennartPurucker
Contributor

Will do!
