Skip to content

Preliminary RPackage class#2761

Merged
tgamblin merged 1 commit intospack:developfrom
adamjstewart:features/rpackage
Jan 8, 2017
Merged

Preliminary RPackage class#2761
tgamblin merged 1 commit intospack:developfrom
adamjstewart:features/rpackage

Conversation

@adamjstewart
Copy link
Copy Markdown
Member

@adamjstewart adamjstewart commented Jan 5, 2017

Same as #2709 but for R packages.

@glennpj @JavierCVilla You two seem to contribute the most R packages. Do these changes look safe to you?



class RBiocgenerics(Package):
class RBiocgenerics(RPackage):
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the only package that was even slightly different from the rest. I confirmed that the changes I made allow it to still install correctly.

@adamjstewart adamjstewart requested a review from alalazo January 5, 2017 22:12
@adamjstewart
Copy link
Copy Markdown
Member Author

The R build system is remarkably uniform. For example, almost every package has a url and list_url of the following format:

url      = "https://cran.r-project.org/src/contrib/Rcpp_0.12.6.tar.gz"      
list_url = "https://cran.r-project.org/src/contrib/Archive/Rcpp"

Due to the presence of capital letters, I can't simply add:

url      = "https://cran.r-project.org/src/contrib/{0}_{1}.tar.gz".format(self.name, self.something_that_gets_me_the_latest_version)      
list_url = "https://cran.r-project.org/src/contrib/Archive/{0}".format(self.name)

to the RPackage class, but I could probably trick spack create into getting the name from the URL without removing capitalization and adding a list_url that way. Thoughts?

EDIT: Actually, if I added a package_name attribute, defaulting to self.name, and figured out how to get the latest available version in Spack, I could theoretically remove the url and list_url from almost every R package. Thoughts?

@JavierCVilla
Copy link
Copy Markdown
Contributor

@glennpj @JavierCVilla You two seem to contribute the most R packages. Do these changes look safe to you?

I totally agree with the changes, almost all the R packages have the same structure except for the dependencies and links to the source files, so in my opinion there is no any risk with the change.

In fact, this change could make easier to create new R packages if we combine it properly with what you mentioned in your second comment, given that it reduces the cumbersome part of copy the same code once and again and make the scrapy part manually.

@JavierCVilla
Copy link
Copy Markdown
Contributor

JavierCVilla commented Jan 6, 2017

...
to the RPackage class, but I could probably trick spack create into getting the name from the URL without removing capitalization and adding a list_url that way. Thoughts?

Regarding to this part, to take the capitalized name from the given URL seems to me a quick and good option.

In addition, I think that this process could be even more automatic for new R packages creating a new kind of spack.util.web.find_versions_of_archive(url) but specialized in finding R packages metadata given that all packages located in the CRAN repository share a common structure.
In that way, the last version that you mention could be easily taken from the link showed in the Package source label from the package CRAN web. Also dependencies of the package could be figured out by searching the labels Imports in the same web, this label shows a list with all the dependencies.
Besides, this would allow the spack create to find the versions and checksum from the R packages with a CRAN URL and the dependencies as well as the homepage, url and list_url links.

EDIT: Actually, if I added a package_name attribute, defaulting to self.name, and figured out how to get the latest available version in Spack, I could theoretically remove the url and list_url from almost every R package. Thoughts?

In my opinion, I wouldn't remove these links from the package files. I think that it consistently maintains the R package format equal to the rest of Spack packages. In fact, some R packages (like r-irkernel) may have other different source as github.

Copy link
Copy Markdown
Member

@alalazo alalazo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This avoids a lot of duplicated code. I like it! 😄

@adamjstewart
Copy link
Copy Markdown
Member Author

@JavierCVilla I think all of those are good ideas. @citibeth and I are suggesting similar improvements to the creation of Python packages in #2709, #2718, #2749, and #2750. I currently have a lot of PRs open that affect spack create (#2707, #2709, and this one) so it might be best to merge them as is and work on better auto-creation for Python and R packages in separate PRs once spack create stabilizes again.

@adamjstewart
Copy link
Copy Markdown
Member Author

it might be best to merge them as is

I say this not because I don't think it should be done in this PR but mostly because if a single one of the 173 files I've already changed is edited in the meantime, I'll need to continue to rebase and resolve any conflicts.

@citibeth
Copy link
Copy Markdown
Member

citibeth commented Jan 6, 2017

I have a proposal brewing, will try to write it up by tonight...

@adamjstewart adamjstewart changed the title [WIP] Preliminary RPackage class Preliminary RPackage class Jan 6, 2017
@adamjstewart adamjstewart removed the WIP label Jan 6, 2017
@adamjstewart
Copy link
Copy Markdown
Member Author

I think this is safe to merge. It affects a lot of packages, so I would like to get it merged sooner rather than later. There are a lot of fancy things we can add to spack create later, but I will leave that for a future PR.

@tgamblin
Copy link
Copy Markdown
Member

tgamblin commented Jan 6, 2017

@glennpj: any comments? This looks good to me. Thanks @adamjstewart!

@glennpj
Copy link
Copy Markdown
Contributor

glennpj commented Jan 7, 2017

This looks good to me.

My only comment is on the BiocGenerics package. That is a snowflake, as @adamjstewart noted. That is a BioConductor package and versions are tied to bioconductor suite versions, which in turn are tied to R versions. The reason for the bioc- prefix was to indicate the version of bioconductor that the package worked in. It is not the version of biocgenerics, which is different, currently 0.20.0. That said, most people using bioconductor packages do not care about the versions of the individual packages, but rather the version of the bioconductor suite, so Adam's changes for the version should work just fine.

I am not as sure about the dependency specs though. Bioconductor is really picky about the R version and the versions are not always in sync, as they appear in the package.py file. For instance, the current version of bioconductor is 3.4 and that works with R-3.3.1. I think specifying R version ranges in the package.py may have spack potentially getting bioconductor packages out of sync with bioconductor "supported" versions of R. I believe biocgenerics is the only bioconductor package in spack at the moment. It is a core bioconductor package.

@adamjstewart
Copy link
Copy Markdown
Member Author

Thanks for the knowledge @glennpj! My users don't use R very heavily so I admittedly don't have much experience with it.

The reason for the bioc- prefix was to indicate the version of bioconductor that the package worked in. It is not the version of biocgenerics, which is different, currently 0.20.0.

Hmm, this is quite different from every other package in Spack. Bioconductor is a collection of R packages? Perhaps we could use real versions for each package and create a bioconductor package that specifies the appropriate versions of each.

I am not as sure about the dependency specs though. Bioconductor is really picky about the R version and the versions are not always in sync, as they appear in the package.py file. For instance, the current version of bioconductor is 3.4 and that works with R-3.3.1.

All I did was move the constraints from validate to depends_on so that the spec would crash during concretization instead of at build time. If you think we should modify the version constraints or remove them altogether I'd be happy to do that in another PR.

Copy link
Copy Markdown
Member

@citibeth citibeth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize this is cleaning up an old problem.... but is it possible to get rid of git= download methods and replace them with checksummable URLs? I found three places in this commit where that could be done.

version('bioc-3.3',
version('3.3',
git='https://github.com/Bioconductor-mirror/BiocGenerics.git',
branch='release-3.3')
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you replace these downloads with a checksummable URL? Something like this:

    url      = "https://github.com/Biocondutor-mirror/BiocGenerics/tarball/release-3.3"

version('3.3', '...')```

@@ -35,8 +35,6 @@ class RIrkernel(Package):
version('master', git='https://github.com/IRkernel/IRkernel.git',
tag='0.7')

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you replace these downloads with a checksummable URL? Something like this:

    url      = "https://github.com/IRKernel/IRKernel/tarball/0.7"

version('0.7', '...')```

@citibeth
Copy link
Copy Markdown
Member

citibeth commented Jan 7, 2017

The reason for the bioc- prefix was to indicate the version of bioconductor that the package worked in. It is not the version of biocgenerics, which is different, currently 0.20.0.
Hmm, this is quite different from every other package in Spack. Bioconductor is a collection of R packages? Perhaps we could use real versions for each package and create a bioconductor package that specifies the appropriate versions of each.

In general I agree with @adamjstewart here; and certainly, putting it in the individual packages is not the place to go. But I'm left wondering what the "right" approach is?

Apparently, Bioconductor is a repository of 2600 packages. We're back to the same problem of dealing with external repos. @JavierCVilla can you expand upon how you are intending to use the biocgenerics package in Spack? Would Bioconductor be installed on top of the Spack-installed biocgenerics? Why is just one of 2600 pacakges being included? What does the end result look like here?

https://www.bioconductor.org/news/bioc_3_4_release/

All I did was move the constraints from validate to depends_on so that the spec would crash during concretization instead of at build time.

That looks good to me.

@adamjstewart
Copy link
Copy Markdown
Member Author

I would like to leave any changes to biocgenerics for another PR. It is well outside the scope of this PR, and from what it sounds like the exact version number used is very sensitive. I definitely don't know enough about it to trust myself to make that change.

@tgamblin It sounds like @JavierCVilla and @glennpj are ok with all of the changes in this PR. Once this is merged, either one of them can continue tuning the biocgenerics package.

@tgamblin
Copy link
Copy Markdown
Member

tgamblin commented Jan 8, 2017

@citibeth: I'm tempted to include the needed package here and let the people who use R and Bioconductor work out the versioning. Including one package from it shouldn't be a huge deal, and if they want to version-lock it, they can depends_on('[email protected]', when='@3.3'). If they want to include the whole ecosystem, we can work that out when we get to it, but if the whole suite is versioned it might actually be fairly easy to include, but I guess we'll see.

@tgamblin tgamblin merged commit daff3c0 into spack:develop Jan 8, 2017
@tgamblin
Copy link
Copy Markdown
Member

tgamblin commented Jan 8, 2017

Thanks @glennpj!

@glennpj
Copy link
Copy Markdown
Contributor

glennpj commented Jan 8, 2017

I know the bioconductor issues are out of scope for this ticket but I would like to capture one more important item for future reference. The bioconductor project does not keep older versions of individual packages around, only the current version is ever available. They recently starting keeping the versions associated with a particular version of the suite in git with branches. That is the only way to make sure that an installed version of a package can be reinstalled at some later date. For example, whichever version of biocgenerics is used for bioconductor-3.3 will always be available in the 'release-3.3' branch of the bioconductor git mirror.

@tgamblin
Copy link
Copy Markdown
Member

tgamblin commented Jan 8, 2017

@glennpj: We're setting up an S3 bucket at LLNL to archive versions of Spack packages. We will probably add it as a mirror in etc/spack/defaults/mirrors.yaml so that if a fetch fails, the user still has a chance of grabbing it from Spack's mirror. I think this will help with a lot of these packages that don't keep old versions around.

@adamjstewart adamjstewart deleted the features/rpackage branch January 8, 2017 01:32
@citibeth
Copy link
Copy Markdown
Member

citibeth commented Jan 8, 2017

The bioconductor project does not keep older versions of individual packages around, only the current version is ever available. They recently starting keeping the versions associated with a particular version of the suite in git with branches. That is the only way to make sure that an installed version of a package can be reinstalled at some later date. For example, whichever version of biocgenerics is used for bioconductor-3.3 will always be available in the 'release-3.3' branch of the bioconductor git mirror.

GitHub is good at making tarballs on demand. Therefore, the practice described above is equivalent to keeping old tarballs around for re-installation later. If you have your source on GitHub, manually making tarballs is a waste of time (IMHO); you get exactly the same effect (including checksummability) by setting a tag or (non-moving) branch, and forming URLs to access that branch. On my projects, I don't even bother generating tarballs for this reason.

That's the principle behind my suggested changes above:

 url      = "https://github.com/Biocondutor-mirror/BiocGenerics/tarball/release-3.3"
version('3.3', '...')

If it were me, I'd be happy to create the Spack package to build directly from the GitHub-generated tarballs, and to ignore the "latest" release (which changes all the time anyway).

@citibeth
Copy link
Copy Markdown
Member

citibeth commented Jan 8, 2017

@glennpj: We're setting up an S3 bucket at LLNL to archive versions of Spack packages. We will probably add it as a mirror in etc/spack/defaults/mirrors.yaml so that if a fetch fails, the user still has a chance of grabbing it from Spack's mirror. I think this will help with a lot of these packages that don't keep old versions around.

I'm not sure what I think of Spack becoming the ONLY source for old tarballs that the authors have, for one reason or another, decided to make disappear.

@tgamblin
Copy link
Copy Markdown
Member

tgamblin commented Jan 8, 2017

We don't have to keep them forever, but this is not unlike what Linux distributions do.

@tgamblin
Copy link
Copy Markdown
Member

tgamblin commented Jan 8, 2017

Also we don't have to be the only source -- as the project gains ground we may find that others are willing to provide mirrors.

@citibeth
Copy link
Copy Markdown
Member

citibeth commented Jan 8, 2017 via email

@citibeth citibeth mentioned this pull request Jan 8, 2017
@adamjstewart
Copy link
Copy Markdown
Member Author

I'm not sure how easy it will be to tell the difference. If it's something like libpng where we know there is a security vulnerability in older versions, then we should remove those versions. But as we get more and more packages in Spack, it will be more difficult to track down. I think the mirror should only provide versions present in Spack package.py files. If a version is known to be vulnerable, it should be removed from the package.py file manually, and removed from the mirror automatically.

@tgamblin
Copy link
Copy Markdown
Member

tgamblin commented Jan 8, 2017

If a version is known to be vulnerable, it should be removed from the package.py file manually, and removed from the mirror automatically.

I'm actually not sure this is the way to go. I'd rather just keep the versions and warn the user about vulnerabilities. If we remove old versions, we make it hard to reproduce an old install that maybe used a vulnerable version. I think adding something to mark versions vulnerable or insecure (like preferred=True) might be better...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants