Skip to content

Add option to skip maven execution if source is older than output#1502

Merged
wing328 merged 1 commit intoOpenAPITools:masterfrom
rasmusfaber:master
Nov 28, 2018
Merged

Add option to skip maven execution if source is older than output#1502
wing328 merged 1 commit intoOpenAPITools:masterfrom
rasmusfaber:master

Conversation

@rasmusfaber
Copy link
Copy Markdown
Contributor

PR checklist

  • [X ] Read the contribution guidelines.
  • [N/A] Ran the shell script under ./bin/ to update Petstore sample so that CIs can verify the change. (For instance, only need to run ./bin/{LANG}-petstore.sh and ./bin/security/{LANG}-petstore.sh if updating the {LANG} (e.g. php, ruby, python, etc) code generator or {LANG} client's mustache templates). Windows batch files can be found in .\bin\windows\.
  • [X ] Filed the PR against the correct branch: master, 3.4.x, 4.0.x. Default: master.
  • [N/A ] Copied the technical committee to review the pull request if your PR is targeting a particular programming language.

Description of the PR

Implementation of #1501

Copy link
Copy Markdown
Member

@jimschubert jimschubert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. I'm not too familiar with the maven plugin, but I thought I would comment on the PR itself.

One issue we might have with skipping generation when there have been no new spec changes by file age of the input spec is that it's possible there have been spec changes between when the output directory was created and when the output directory was modified. An example would be this scenario:

  • generate
  • edit spec
  • create some nested directory under output directory (this could happen if you build the output directory in-place or clean a previous build)
  • edit spec
  • regenerate

Here's a mock example in terminal:

# create spec file with mtime in the past
$ touch -a -m -t 201801181205.09 spec.json

# create output directory some time (5 min) after spec modification
$ mkdir output
$ touch -a -m -t 201801181210.09 output

# check that mtimes are spec.json < output
$ ls -la --full-time
total 0
drwxr-xr-x 4 jim  wheel 128 2018-11-20 23:11:26.670343174 -0500 .
drwxrwxrwt 8 root wheel 256 2018-11-20 23:11:14.882538014 -0500 ..
drwxr-xr-x 2 jim  wheel  64 2018-01-18 12:10:09.000000000 -0500 output
-rw-r--r-- 1 jim  wheel   0 2018-01-18 12:05:09.000000000 -0500 spec.json

# pretend we've compiled code under the output directory, resulting in nested directories
$ mkdir -p output/build

# check mtimes again output is now updated to today
$ ls -la --full-time
total 0
drwxr-xr-x 4 jim  wheel 128 2018-11-20 23:11:26.670343174 -0500 .
drwxrwxrwt 8 root wheel 256 2018-11-20 23:11:14.882538014 -0500 ..
drwxr-xr-x 3 jim  wheel  96 2018-11-20 23:11:55.603842334 -0500 output
-rw-r--r-- 1 jim  wheel   0 2018-01-18 12:05:09.000000000 -0500 spec.json

# edit spec in some way
$ cat >> spec.json <<EOF
> {
> }
> EOF

# check mtimes, notice spec.json > output
# I think we can agree that most would probably expect generation here
$ ls -la --full-time
total 4
drwxr-xr-x 4 jim  wheel 128 2018-11-20 23:11:26.670343174 -0500 .
drwxrwxrwt 8 root wheel 256 2018-11-20 23:11:14.882538014 -0500 ..
drwxr-xr-x 3 jim  wheel  96 2018-11-20 23:11:55.603842334 -0500 output
-rw-r--r-- 1 jim  wheel   4 2018-11-20 23:12:37.773754953 -0500 spec.json

# simulate a "clean" build operation
# I agree this _might_ be ok to skip, but seems this logic always skips.
$ rm output/build/
$ ls -la --full-time
total 4
drwxr-xr-x 4 jim  wheel 128 2018-11-20 23:11:26.670343174 -0500 .
drwxrwxrwt 8 root wheel 256 2018-11-20 23:11:14.882538014 -0500 ..
drwxr-xr-x 2 jim  wheel  64 2018-11-20 23:21:36.119196033 -0500 output
-rw-r--r-- 1 jim  wheel   4 2018-11-20 23:12:37.773754953 -0500 spec.json

In the above example, I would personally expect regeneration in both of the last two examples since both occurred after the spec document was changed. It seems to me that in this PR, the second to last would not be skipped while the last example would be skipped.

Do you think it might make sense to store a checksum of the specification document in the output directory (e.g. output/.openapi-generator/checksum), then evaluate that checksum and skip generation if the spec's checksum is the same? I think this would achieve the same goal and would avoid any potential surprises, like if a generation occurred after the last example above.

We'd previously discussed storing some details about the original spec in the output directory and I think a checksum or other hash of the input file's contents would be a good addition.

edit: I left out a bit of the example, and cleaned up some confusing wording in the scenario.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If one were to reuse the property codegen.skip, doesn't that result in all generation being skipped (line 358 in this diff)?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inputSpecFile here would reference an OpenAPI document and output refers to the output directory.

Is the intention of skipIfSourceIsOlderThanOutput to only regenerate if someone has modified the spec document since the last regeneration? If this is the case, I would suggest renaming the property to avoid confusion… something like skipIfSpecIsUnchanged might be clearer to those reading the properties list. Even after reading the issue and scanning the PR's changes, I thought this was doing something similar to skipOvewrite.

Although… my suggested skipIfSpecIsUnchanged might suggest to users that we're storing the atime/mtime of the spec (which we could definitely persisted to {output}/.openapi-generator/SOME_FILENAME).

@rasmusfaber
Copy link
Copy Markdown
Contributor Author

Thanks for the detailed comments.

I have changed the implementation to generate a SHA-256 checksum of the spec and store it in the .openapi-generator folder under the name .sha256 and then use that to compare.

The generation of the checksum happens in the CodeGenMojo-class, so it only runs when using Maven. Another option would be to have DefaultGenerator generate it, but that would have higher impact.

@jimschubert
Copy link
Copy Markdown
Member

@rasmusfaber I've reviewed the changes, and I think it fixes any issues in workflow as I presented above.

I'll leave this for review/merge by someone more familiar with the Maven plugin. I'll be happy to do the refactor you mentioned once this is merged, so the same strategy can be used outside of the Maven plugin (like in the gradle plugin).

@wing328
Copy link
Copy Markdown
Member

wing328 commented Nov 28, 2018

LGTM (but no expert in the Maven plug-in either)

@wing328 wing328 merged commit d8dde68 into OpenAPITools:master Nov 28, 2018
@wheezil
Copy link
Copy Markdown
Contributor

wheezil commented Nov 28, 2018

Greetings! @wing328 pointed out that we have similar PRs addressing similar problems, so I'd like to have a little discussion to see if we should choose one approach or both. My PR is #1397 and issue #1081.
They are related in the sense that they are trying to solve the same problem, which is not updating the output files unless the spec has changed.
I believe that my approach is better at least for some cases (of course I would!), because

  • It is insensitive to the vagaries of file modification times, and so the example given by @jimschubert -- files are manually touched -- doesn't matter at all. It simply generates the code, but as output files are about to be written, it shunts them to a temp file first, compares the temp file to any existing file, and then replaces the existing file (if any) with the temp file only if there are changes.
  • Works on the command-line in addition to maven, so code-generation in other build environments also benefits. This is important for my use case generating C++, because touching headers can cause a rebuild-the-world event, and you really don't want to sit through that every time. Yes, ideally the build toolchain would sort that out and simply avoid codegen, but I also have to live in the Visual Studio / MSBuild world which is quite primitive regarding generated code dependencies.

The thing I don't understand about the difference between my approach and this approach is how it may or may not affect downstream build steps. If the maven plugin change avoids firing the code-generation step at all, this may result in fewer ripples down the dependency tree. Unfortunately I don't understand maven plugins and can't judge.

It is possible that the combined techniques will be useful. Because even if the maven plugin generates code due to timestamp update, we can still only touch a minimal number of output files. Assuming the maven plugin doesn't delete the output folder first. This may or may not really matter in a java environment where recompiling some source files is fairly quick.

One other thing to note: my approach will of course always generate new output files if the output is cleaned first. It basically has no relation to the clean lifecycle. Because of that, build-without-clean may result in lingering output files whose specification has since disappeared -- I don't check for removal of an output file, but perhaps I should?

@rasmusfaber your thoughts appreciated.

@jimschubert
Copy link
Copy Markdown
Member

@wheezil I don't think your contributions are doing the same thing.

This PR speeds builds by not wasting cycles if your spec is identical to the spec that previously generated your code. This means that once you've generated code, you assume it shouldn't regenerate unless your spec's hash changes. Maven builds are notoriously slow, which is why I don't personally use it and therefore have little experience with it. This PR helps there.

Your PR is only changing files that are not identical to the contents that would be generated. While that could be useful in many cases, it would cause issues for example where someone has modified templates to output template generation time in a file header/comment. If the spec hasn't changed for 20 builds, there's not really a reason to run codegen 20 times just to update that header.

I do think the work in this PR could be moved from the Maven plugin into Codegen core, so it'll work the same across all plugins/entrypoints. But I don't think that'll have an effect on your PR. The difference between "should I generate at all" and "should I replace this generated file" are significant enough to keep separate.

@wheezil
Copy link
Copy Markdown
Contributor

wheezil commented Nov 28, 2018

@jimschubert Thanks for comment. I see what you mean that they are addressing separate concerns.

@wheezil
Copy link
Copy Markdown
Contributor

wheezil commented Nov 28, 2018

Is this check based on file times or file contents?
Does it chase referenced files as well?

@wheezil
Copy link
Copy Markdown
Contributor

wheezil commented Nov 28, 2018

@jimschubert regarding performance of the build, is this time spent running the generator, or is it time spent compiling the resulting code? I have not yet used a large API so I don't know what to expect.
Is it faster to use the maven plugin or to launch the command-line?

@jimschubert
Copy link
Copy Markdown
Member

@wheezil I'm confused by your comment about file time or contents (I can't tell if you're referring to spec or generated code). The work in this PR compares the sha hash of a spec document, and if that matches the hash stored in the existing generated code, it skips executing the generator at all. It has nothing to do with the generated files, other than the metadata file storing the hash. When you ask about referenced files, it's something I'm considering to improve upon when I move into core... most likely parsing into the OpenAPI structure, then hashing against those contents. This would mean rather than file modifications, it would regenerate on changes in our abstraction across versions (which often results in fixes or changes in generated code). This would probably need an option to allow file-based or object-based hashing for those who don't want to consider parsing differences to be spec differences.

Your question about Maven is hard to answer. Maven is slow when you have many dependencies, many projects, and other factors. It doesn't have built in caching like Gradle/SBT or other JVM targeting build systems. There are options out there for speeding up builds, but they can be difficult to setup. Maven has parallel build as an option, although I think it's fairly recent. I don't know if it's better to use the exec plugin because, as I mentioned I'm not that familiar with Maven (this being the only project I touch that uses it, and I don't do much of the build management). I do know that I've had issues trying to get the exec plugin to work in the past when I was trying to integrate the gradle plugin into the build. I would personally stick with the Maven plugin, since it has cleanly into the architecture. Or, use our gradle plugin ;)

@wheezil
Copy link
Copy Markdown
Contributor

wheezil commented Nov 28, 2018

@jimschubert thanks for clarifying that it compares the contents (hash) of the spec file. I was confused and thought it was using modification times.

Regarding maven/gradle, I will stick with maven, since that's our team standard. Thanks again.

@jimschubert
Copy link
Copy Markdown
Member

@wheezil I understand the confusion. The original implementation used file modification time compared to output directory modification time.

@wing328
Copy link
Copy Markdown
Member

wing328 commented Dec 4, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants