Skip to content

Pf8() #658

Closed
eselimnl wants to merge 3 commits intomalariagen:masterfrom
eselimnl:pf8-gcs
Closed

Pf8() #658
eselimnl wants to merge 3 commits intomalariagen:masterfrom
eselimnl:pf8-gcs

Conversation

@eselimnl
Copy link
Copy Markdown
Collaborator

@eselimnl eselimnl commented Nov 20, 2024

Hi @alimanfoo, here is the draft PR for Pf8(). I couldn't add my GOOGLE_CREDENTIALS that is needed by the workflow, please advise if you would like me to do this.

Pf8(): The Pf8() function is built upon the Pf7() template, and should not disrupt the functionality of Pf7() (verified via test_pf7_integration.py).

Sanger S3 Configuration: Updates have been made in malariagen_data/util.py to support data access on the Sanger S3 storage. For privacy reasons, the bucket name is not included in the PR (such as in the tests/test_pf8_integration.py).

Variant/altlen Field Handling: We observed that the variant/altlen field contains a single value rather than one per ALT, despite the number of ALT being set to 6. The cause of this behavior is unclear, but it may relate to the comment “special case (for altlen, number depends on ALT)” in scikit-allel. Since it is possible to calculate this value from REF and ALT, we are not concerned about it now. Therefore, this field is excluded from data_vars after reading the Pf8 zarr file.

@alimanfoo
Copy link
Copy Markdown
Member

Thanks @eselimnl, I'll take a look asap :)

@alimanfoo
Copy link
Copy Markdown
Member

I couldn't add my GOOGLE_CREDENTIALS that is needed by the workflow, please advise if you would like me to do this.

Credentials are supplied automatically via a secret, however this is disabled for PRs coming from a fork. Basically, it would be easier if you recreate this PR from a branch within the malariagen/malariagen-data-python repo, rather than from a fork, then all checks will run properly. I've given you write access to this repo so you should be able to do that now.

Comment on lines +470 to +490
# S3 access: set up the filesystem.
if "s3://" in url:
try:
import s3fs
except ImportError:
raise ImportError("Amazon S3 support requires the 's3fs' package.")

# Configuration for S3 access, customize as needed.
config = {
"signature_version": "s3",
"s3": {"addressing_style": "virtual"},
}

# Create an S3FileSystem with custom endpoint if specified.
kwargs["anon"] = kwargs.get("anon", True) # Default to anonymous access.
kwargs["endpoint_url"] = "https://cog.sanger.ac.uk"
kwargs["config_kwargs"] = kwargs.get("config_kwargs", config)

# Use s3fs to create the filesystem.
fs = s3fs.S3FileSystem(**kwargs)
path = url.split("s3://")[1]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be possible to simplify this a little. This is because fsspec should have the ability to detect S3 URLs and use s3fs automatically, rather than requiring you to import and instantiate directly. I.e., url_to_fs(url, **kwargs) should work for S3 just as well as for GCS.

@alimanfoo
Copy link
Copy Markdown
Member

Variant/altlen Field Handling: We observed that the variant/altlen field contains a single value rather than one per ALT, despite the number of ALT being set to 6. The cause of this behavior is unclear, but it may relate to the comment “special case (for altlen, number depends on ALT)” in scikit-allel. Since it is possible to calculate this value from REF and ALT, we are not concerned about it now. Therefore, this field is excluded from data_vars after reading the Pf8 zarr file.

Sorry about this, don't know what's going on there. Seems like a reasonable approach for now, but happy to investigate if there's a problem upstream in scikit-allel.

@eselimnl eselimnl mentioned this pull request Nov 21, 2024
@eselimnl
Copy link
Copy Markdown
Collaborator Author

eselimnl commented Nov 26, 2024

Many thanks for the review @alimanfoo. You were right that fsspec is compatible with S3, so I have simplified this in the new PR. It would be great if you could review this again: #659.

@alimanfoo
Copy link
Copy Markdown
Member

Closing in favour of #659.

@alimanfoo alimanfoo closed this Nov 26, 2024
@alimanfoo alimanfoo added the BMGF-068808 Work supported by BMGF grant INV-068808 (MalariaGEN 2024-2027). label Dec 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

BMGF-068808 Work supported by BMGF grant INV-068808 (MalariaGEN 2024-2027).

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants