Skip to content

Add Pf8#659

Merged
alimanfoo merged 16 commits intomasterfrom
gcs-pf8
Dec 3, 2024
Merged

Add Pf8#659
alimanfoo merged 16 commits intomasterfrom
gcs-pf8

Conversation

@eselimnl
Copy link
Copy Markdown
Collaborator

@eselimnl eselimnl commented Nov 21, 2024

This PR adds a new Pf8 API class for accessing the Pf8 data resource.

Pf8(): The Pf8() class is built upon the Pf7() template, and should not disrupt the functionality of Pf7() (verified via test_pf7_integration.py).

Sanger S3 Configuration: Updates have been made in malariagen_data/util.py to support data access on the Sanger S3 storage. For privacy reasons, the bucket name is not included in the PR (such as in the tests/test_pf8_integration.py).

Variant/altlen Field Handling: We observed that the variant/altlen field contains a single value rather than one per ALT, despite the number of ALT being set to 6. The cause of this behavior is unclear, but it may relate to the comment “special case (for altlen, number depends on ALT)” in scikit-allel. Since it is possible to calculate this value from REF and ALT, we are not concerned about it now. Therefore, this field is excluded from data_vars after reading the Pf8 zarr file.

This PR was previously created from a fork, so comments made there will also be followed here.

@eselimnl eselimnl requested a review from alimanfoo November 21, 2024 13:33
@ahernank
Copy link
Copy Markdown
Collaborator

Just FYI, tests are failing here as the github-actions service account doesn't have permissions for the pf8 bucket.
I've assigned the permissions now, so hopefully they should run smoothly now.

@eselimnl
Copy link
Copy Markdown
Collaborator Author

That's extremely helpful, thanks @ahernank

@eselimnl eselimnl mentioned this pull request Nov 26, 2024
Copy link
Copy Markdown
Member

@alimanfoo alimanfoo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @eselimnl, a few small suggestions.

Btw currently the way the S3 URLs are handled will mean that chained URLs like simplecache::s3://... won't work. Perhaps that is fine for now, up to you.

# Create an S3FileSystem with custom endpoint if specified.
kwargs["anon"] = kwargs.get("anon", True) # Default to anonymous access.
kwargs["endpoint_url"] = "https://cog.sanger.ac.uk"
kwargs["config_kwargs"] = kwargs.get("config_kwargs", config)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
kwargs["config_kwargs"] = kwargs.get("config_kwargs", config)
kwargs.setdefault("config_kwargs", config)

}

# Create an S3FileSystem with custom endpoint if specified.
kwargs["anon"] = kwargs.get("anon", True) # Default to anonymous access.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
kwargs["anon"] = kwargs.get("anon", True) # Default to anonymous access.
kwargs.setdefault("anon", True) # Default to anonymous access.


# Create an S3FileSystem with custom endpoint if specified.
kwargs["anon"] = kwargs.get("anon", True) # Default to anonymous access.
kwargs["endpoint_url"] = "https://cog.sanger.ac.uk"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
kwargs["endpoint_url"] = "https://cog.sanger.ac.uk"
kwargs.setdwfault("endpoint_url", "https://cog.sanger.ac.uk")

data_vars = self._add_extended_data(root, inline_array, chunks, data_vars)
data_vars.pop(
"variant_altlen", None
) # variant_altlen is under investigation in Pf8.zarr
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it would be possible to handle this by removing the variable from the Pf8 config file, rather than here? In which case no changes to tests for pf7 or pv4 would be needed?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, changed accordingly.

@alimanfoo alimanfoo changed the title Add Pf8() Add Pf8 Nov 27, 2024
@eselimnl
Copy link
Copy Markdown
Collaborator Author

Thanks @eselimnl, a few small suggestions.

Btw currently the way the S3 URLs are handled will mean that chained URLs like simplecache::s3://... won't work. Perhaps that is fine for now, up to you.

Thanks @alimanfoo, I have incorporated your suggestions. As you expected, I receive botocore.exceptions.NoCredentialsError when I tried simplecache::s3://..., but I couldn't figure out why this is the case.

@eselimnl
Copy link
Copy Markdown
Collaborator Author

Sorry @alimanfoo, debugging the chained URL issue took a while, but simplecache::s3://... works fine now.

@alimanfoo
Copy link
Copy Markdown
Member

Sorry @alimanfoo, debugging the chained URL issue took a while, but simplecache::s3://... works fine now.

My apologies, I should've explained a bit better how chained URLs are handled in fsspec, at least as I understand it. In looking at this I've also realised the way this is done for GCS could be clearer. I've taken the liberty to push a commit here which consistifies handling of storage options for both GCS and S3 including the the case where URLs are chained. Would you mind taking a look and testing to see if this works for your S3 bucket?

@alimanfoo
Copy link
Copy Markdown
Member

Hold that thought, looks like I've broken some tests, investigating...

@alimanfoo
Copy link
Copy Markdown
Member

Bug fixed, ready for testing with S3.

@eselimnl
Copy link
Copy Markdown
Collaborator Author

eselimnl commented Dec 2, 2024

Many thanks for sorting this out, works perfectly fine with S3 in my testing environment.

@alimanfoo
Copy link
Copy Markdown
Member

Many thanks for sorting this out, works perfectly fine with S3 in my testing environment.

Cool, thanks @eselimnl. Happy for this to be merged?

@eselimnl
Copy link
Copy Markdown
Collaborator Author

eselimnl commented Dec 3, 2024

Yes, I am happy for this to be merged

@alimanfoo alimanfoo merged commit 6cd964b into master Dec 3, 2024
@alimanfoo alimanfoo deleted the gcs-pf8 branch December 3, 2024 09:49
@alimanfoo
Copy link
Copy Markdown
Member

Thanks @eselimnl, merged! 🎉

Would you like us to make a release?

@eselimnl
Copy link
Copy Markdown
Collaborator Author

eselimnl commented Dec 3, 2024

Exciting! Yes please!

@ahernank
Copy link
Copy Markdown
Collaborator

ahernank commented Dec 3, 2024

Just to note here that the S3 and GCS buckets for Pf8 are not currently "publicly" accessible (i.e. not public/not part of the authentication process).

@eselimnl
Copy link
Copy Markdown
Collaborator Author

eselimnl commented Dec 3, 2024

I noticed an oversight in the testing environment: S3 access requires the s3fs library to be installed. While this works fine if done on the user side, I realize I could have included it in the package as well. Would you like me to create a new PR to address this?

@alimanfoo alimanfoo added the BMGF-068808 Work supported by BMGF grant INV-068808 (MalariaGEN 2024-2027). label Dec 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

BMGF-068808 Work supported by BMGF grant INV-068808 (MalariaGEN 2024-2027).

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants