Copy Snapshot operation arguments in a distributed task #10391

Merged
vepadulano merged 1 commit into root-project:master from vepadulano:distrdf-snapshot-fix
Apr 21, 2022

Conversation

@vepadulano
Member

The Snapshot operation file name is modified in-place to append the
range id of a certain task. This can lead to a task receiving the
input operation from a previous task with an already modified file
name. Thus, the current task would create a wrong file name with more
than one range id. Solve this by creating a deep copy of the Snapshot
operation arguments in each task, so that the filename is correctly
changed in isolation.

This PR fixes #10390
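
The failure mode described above can be reproduced in a few lines. This is a minimal sketch, not the DistRDF code: `Operation`, `add_range_id`, `make_lazy_in_place` and `make_lazy_copy` are hypothetical names, and the file name is assumed to carry an extension and to sit at index 1 of the Snapshot arguments.

```python
import copy


class Operation:
    """Hypothetical stand-in for a DistRDF operation: a name plus call arguments."""

    def __init__(self, name, args):
        self.name = name
        self.args = args


def add_range_id(filename, range_id):
    """Turn 'out.root' into 'out_<range_id>.root' (assumes an extension is present)."""
    stem, dot, ext = filename.rpartition(".")
    return f"{stem}_{range_id}{dot}{ext}"


def make_lazy_in_place(op, range_id):
    """Problematic variant: rewrites the file name on the shared operation object."""
    op.args[1] = add_range_id(op.args[1], range_id)
    return op


def make_lazy_copy(op, range_id):
    """Isolated variant: rewrites the file name on a per-task deep copy."""
    task_op = copy.deepcopy(op)
    task_op.args[1] = add_range_id(task_op.args[1], range_id)
    return task_op


# Two tasks (range ids 0 and 1) receive the same operation object.
shared = Operation("Snapshot", ["tree", "out.root"])
buggy_names = [make_lazy_in_place(shared, i).args[1] for i in (0, 1)]

shared = Operation("Snapshot", ["tree", "out.root"])
fixed_names = [make_lazy_copy(shared, i).args[1] for i in (0, 1)]

print(buggy_names)  # the second name carries both range ids
print(fixed_names)  # each name carries exactly one range id
```

With the in-place variant the second task starts from the name the first task already changed, so the range ids accumulate. With the deep copy the shared operation keeps its original file name and each task derives its own.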

@vepadulano vepadulano requested a review from eguiraud April 13, 2022 08:34
@vepadulano vepadulano requested a review from etejedor as a code owner April 13, 2022 08:34
@vepadulano vepadulano self-assigned this Apr 13, 2022
@phsft-bot

Starting build on ROOT-debian10-i386/cxx14, ROOT-performance-centos8-multicore/default, ROOT-ubuntu16/nortcxxmod, ROOT-ubuntu2004/soversion, mac1015/python3, mac11/cxx17, windows10/cxx14
How to customize builds

@phsft-bot

Build failed on ROOT-ubuntu2004/soversion.
Running on root-ubuntu-2004-1.cern.ch:/home/sftnight/build/workspace/root-pullrequests-build
See console output.

Failing tests:

@phsft-bot

Build failed on mac11/cxx17.
Running on macphsft20.dyndns.cern.ch:/Users/sftnight/build/workspace/root-pullrequests-build
See console output.

Failing tests:

@phsft-bot

Build failed on mac1015/python3.
Running on macitois21.dyndns.cern.ch:/Users/sftnight/build/workspace/root-pullrequests-build
See console output.

Failing tests:

@vepadulano vepadulano force-pushed the distrdf-snapshot-fix branch from 9fb282e to fc76271 Compare April 13, 2022 11:53
@phsft-bot

Starting build on ROOT-debian10-i386/cxx14, ROOT-performance-centos8-multicore/default, ROOT-ubuntu16/nortcxxmod, ROOT-ubuntu2004/soversion, mac1015/python3, mac11/cxx17, windows10/cxx14
How to customize builds

@phsft-bot

Build failed on mac1015/python3.
Running on macitois21.dyndns.cern.ch:/Users/sftnight/build/workspace/root-pullrequests-build
See console output.

Failing tests:

@vepadulano
Member Author

@phsft-bot build also on mac12/default

@phsft-bot

Starting build on mac12/default, ROOT-debian10-i386/cxx14, ROOT-performance-centos8-multicore/default, ROOT-ubuntu16/nortcxxmod, ROOT-ubuntu2004/soversion, mac1015/python3, mac11/cxx17, windows10/cxx14
How to customize builds

@phsft-bot

Build failed on windows10/cxx14.
Running on null:C:\build\workspace\root-pullrequests-build
See console output.

Errors:

  • [2022-04-13T15:33:08.456Z] C:\build\workspace\root-pullrequests-build\root\interpreter\llvm\src\include\llvm/ADT/DenseMap.h(555,25): fatal error C1060: compiler is out of heap space (compiling source file C:\build\workspace\root-pullrequests-build\root\interpreter\llvm\src\tools\clang\lib\Lex\PPMacroExpansion.cpp) [C:\build\workspace\root-pullrequests-build\build\interpreter\llvm\src\tools\clang\lib\Lex\obj.clangLex.vcxproj]
  • [2022-04-13T15:33:08.456Z] C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\MSBuild\Microsoft\VC\v160\Microsoft.CppCommon.targets(687,5): error MSB8071: Cannot parse tool output 'C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\include\memory(3155): fatal error C1060: compiler is out of heap space (compiling source file C:\build\workspace\root-pullrequests-build\root\interpreter\llvm\src\tools\clang\lib\Driver\ToolChains\Arch\X86.cpp)' with regex '^In file included from .*$': Exception of type 'System.OutOfMemoryException' was thrown. [C:\build\workspace\root-pullrequests-build\build\interpreter\llvm\src\tools\clang\lib\Driver\obj.clangDriver.vcxproj]
  • [2022-04-13T15:33:08.456Z] C:\build\workspace\root-pullrequests-build\root\interpreter\llvm\src\tools\clang\include\clang/AST/CanonicalType.h(281,3): fatal error C1060: compiler is out of heap space (compiling source file C:\build\workspace\root-pullrequests-build\root\interpreter\llvm\src\tools\clang\lib\AST\ASTContext.cpp) [C:\build\workspace\root-pullrequests-build\build\interpreter\llvm\src\tools\clang\lib\AST\obj.clangAST.vcxproj]
  • [2022-04-13T15:33:08.456Z] C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\include\tuple(392,34): fatal error C1060: compiler is out of heap space (compiling source file C:\build\workspace\root-pullrequests-build\root\interpreter\llvm\src\tools\clang\lib\Lex\PPCaching.cpp) [C:\build\workspace\root-pullrequests-build\build\interpreter\llvm\src\tools\clang\lib\Lex\obj.clangLex.vcxproj]
  • [2022-04-13T15:33:08.456Z] C:\Program Files (x86)\Windows Kits\10\Include\10.0.19041.0\ucrt\ctype.h(46,76): fatal error C1060: compiler is out of heap space (compiling source file C:\build\workspace\root-pullrequests-build\root\interpreter\llvm\src\tools\clang\lib\AST\ExternalASTMerger.cpp) [C:\build\workspace\root-pullrequests-build\build\interpreter\llvm\src\tools\clang\lib\AST\obj.clangAST.vcxproj]
  • [2022-04-13T15:33:08.456Z] C:\build\workspace\root-pullrequests-build\root\interpreter\llvm\src\include\llvm/ADT/SmallVector.h(86,1): fatal error C1060: compiler is out of heap space (compiling source file C:\build\workspace\root-pullrequests-build\root\interpreter\llvm\src\tools\clang\lib\Lex\ModuleMap.cpp) [C:\build\workspace\root-pullrequests-build\build\interpreter\llvm\src\tools\clang\lib\Lex\obj.clangLex.vcxproj]
  • [2022-04-13T15:33:08.456Z] C:\build\workspace\root-pullrequests-build\root\interpreter\llvm\src\tools\clang\include\clang/AST/ASTVector.h(48,1): fatal error C1060: compiler is out of heap space (compiling source file C:\build\workspace\root-pullrequests-build\root\interpreter\llvm\src\tools\clang\lib\AST\ASTTypeTraits.cpp) [C:\build\workspace\root-pullrequests-build\build\interpreter\llvm\src\tools\clang\lib\AST\obj.clangAST.vcxproj]
  • [2022-04-13T15:33:09.244Z] C:\build\workspace\root-pullrequests-build\build\interpreter\llvm\src\tools\clang\include\clang/AST/Attrs.inc(9117,1): fatal error C1060: compiler is out of heap space (compiling source file C:\build\workspace\root-pullrequests-build\root\interpreter\llvm\src\tools\clang\lib\AST\ASTImporter.cpp) [C:\build\workspace\root-pullrequests-build\build\interpreter\llvm\src\tools\clang\lib\AST\obj.clangAST.vcxproj]
  • [2022-04-13T15:33:09.244Z] C:\build\workspace\root-pullrequests-build\root\interpreter\llvm\src\tools\clang\include\clang/AST/Type.h(1440,22): fatal error C1060: compiler is out of heap space (compiling source file C:\build\workspace\root-pullrequests-build\root\interpreter\llvm\src\tools\clang\lib\AST\ExprCXX.cpp) [C:\build\workspace\root-pullrequests-build\build\interpreter\llvm\src\tools\clang\lib\AST\obj.clangAST.vcxproj]
  • [2022-04-13T15:33:09.244Z] C:\build\workspace\root-pullrequests-build\root\interpreter\llvm\src\tools\clang\include\clang/AST/OpenMPClause.h(1381,1): fatal error C1060: compiler is out of heap space (compiling source file C:\build\workspace\root-pullrequests-build\root\interpreter\llvm\src\tools\clang\lib\AST\Comment.cpp) [C:\build\workspace\root-pullrequests-build\build\interpreter\llvm\src\tools\clang\lib\AST\obj.clangAST.vcxproj]

And 309 more

@phsft-bot

Build failed on mac12/default.
Running on macphsft18.dyndns.cern.ch:/Users/sftnight/build/jenkins/workspace/root-pullrequests-build
See console output.

Warnings:

  • [2022-04-13T15:08:06.840Z] /Users/sftnight/build/jenkins/workspace/root-pullrequests-build/root/builtins/zstd/compress/zstd_compress_superblock.c:412:12: warning: variable 'litLengthSum' set but not used [-Wunused-but-set-variable]
  • [2022-04-13T15:24:14.944Z] /Users/sftnight/build/jenkins/workspace/root-pullrequests-build/root/core/unix/src/TUnixSystem.cxx:4926:17: warning: variable 'vsize' set but not used [-Wunused-but-set-variable]
  • [2022-04-13T15:25:43.103Z] /Users/sftnight/build/jenkins/workspace/root-pullrequests-build/build/include/Vc/common/memory.h:299:25: warning: performing pointer subtraction with a null pointer may have undefined behavior [-Wnull-pointer-subtraction]
  • [2022-04-13T15:27:58.229Z] /Users/sftnight/build/jenkins/workspace/root-pullrequests-build/build/include/Vc/common/memory.h:299:25: warning: performing pointer subtraction with a null pointer may have undefined behavior [-Wnull-pointer-subtraction]
  • [2022-04-13T15:27:58.229Z] /Users/sftnight/build/jenkins/workspace/root-pullrequests-build/build/include/Vc/common/memory.h:299:25: warning: performing pointer subtraction with a null pointer may have undefined behavior [-Wnull-pointer-subtraction]
  • [2022-04-13T15:27:59.620Z] /Users/sftnight/build/jenkins/workspace/root-pullrequests-build/build/include/Vc/common/memory.h:299:25: warning: performing pointer subtraction with a null pointer may have undefined behavior [-Wnull-pointer-subtraction]
  • [2022-04-13T15:28:07.141Z] /Users/sftnight/build/jenkins/workspace/root-pullrequests-build/build/externals/Users/sftnight/build/jenkins/workspace/root-pullrequests-build/install/include/Vc/common/memory.h:299:25: warning: performing pointer subtraction with a null pointer may have undefined behavior [-Wnull-pointer-subtraction]
  • [2022-04-13T15:28:07.141Z] /Users/sftnight/build/jenkins/workspace/root-pullrequests-build/build/externals/Users/sftnight/build/jenkins/workspace/root-pullrequests-build/install/include/Vc/common/memory.h:299:25: warning: performing pointer subtraction with a null pointer may have undefined behavior [-Wnull-pointer-subtraction]
  • [2022-04-13T15:28:07.702Z] /Users/sftnight/build/jenkins/workspace/root-pullrequests-build/build/externals/Users/sftnight/build/jenkins/workspace/root-pullrequests-build/install/include/Vc/common/memory.h:299:25: warning: performing pointer subtraction with a null pointer may have undefined behavior [-Wnull-pointer-subtraction]
  • [2022-04-13T15:28:07.968Z] /Users/sftnight/build/jenkins/workspace/root-pullrequests-build/build/externals/Users/sftnight/build/jenkins/workspace/root-pullrequests-build/install/include/Vc/common/memory.h:299:25: warning: performing pointer subtraction with a null pointer may have undefined behavior [-Wnull-pointer-subtraction]

And 349 more

Failing tests:

@vepadulano
Member Author

@phsft-bot build

@phsft-bot

Starting build on ROOT-debian10-i386/cxx14, ROOT-performance-centos8-multicore/default, ROOT-ubuntu16/nortcxxmod, ROOT-ubuntu2004/soversion, mac1015/python3, mac11/cxx17, windows10/cxx14
How to customize builds

@vepadulano
Member Author

@phsft-bot build

@phsft-bot

Starting build on ROOT-debian10-i386/cxx14, ROOT-performance-centos8-multicore/default, ROOT-ubuntu16/nortcxxmod, ROOT-ubuntu2004/soversion, mac1015/python3, mac11/cxx17, windows10/cxx14
How to customize builds

@phsft-bot

Build failed on ROOT-performance-centos8-multicore/default.
Running on olbdw-01.cern.ch:/data/sftnight/workspace/root-pullrequests-build
See console output.

Failing tests:

  rdf_operation = getattr(previous_rdf_node, distrdf_node.operation.name)
- _make_op_lazy_if_needed(distrdf_node.operation, range_id)
- pyroot_node = rdf_operation(*distrdf_node.operation.args, **distrdf_node.operation.kwargs)
+ in_task_op = _create_lazy_op_if_needed(distrdf_node.operation, range_id)
Contributor

So two tasks can invoke _call_rdf_operation on the same distrdf_node object? How does this happen? I thought every task generates its own graph.

Member Author

Every task generates its own RDataFrame C++ graph, but the DistRDF Python graph is a single object that gets serialized and deserialized. On a single machine, a single Python process can receive two (or more) tasks. When the first task starts, it deserializes the DistRDF graph, modifies the operation object in place, creates the RDF C++ calls, and sends everything to the mapper function that executes them. When the second task starts, it gets the same DistRDF graph objects, but by then their operation attributes have already been modified by the previous task, which leads to the errors described in the PR.
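
The scenario in this reply can be imitated with the standard library alone. The sketch below is illustrative and makes two assumptions that are not taken from the DistRDF sources: the graph is reduced to a plain dictionary, and the worker process deserializes the payload once and hands the same object to every task it runs. `run_task` and its `isolate` flag are hypothetical.

```python
import copy
import pickle

# The client serializes the Python-side graph once and ships it to the workers.
# A plain dict stands in for the graph node here.
payload = pickle.dumps({"name": "Snapshot", "args": ["tree", "out.root"]})

# Assumption: one worker process deserializes the payload once and reuses the
# resulting object for all tasks scheduled on it.
graph_on_worker = pickle.loads(payload)


def run_task(node, range_id, isolate):
    """Return the output file name that this task would write to."""
    # With isolate=True the task edits a private deep copy of the node;
    # otherwise it edits the object shared by all tasks in the process.
    target = copy.deepcopy(node) if isolate else node
    target["args"][1] = target["args"][1].replace(".root", f"_{range_id}.root")
    return target["args"][1]


# Three tasks run one after another in the same process, without isolation.
without_copy = [run_task(graph_on_worker, i, isolate=False) for i in range(3)]

# Start again from a pristine graph, this time copying per task.
graph_on_worker = pickle.loads(payload)
with_copy = [run_task(graph_on_worker, i, isolate=True) for i in range(3)]

print(without_copy)
print(with_copy)
```

Tasks that land in different processes never see each other's changes, which may explain why the wrong names only show up when several tasks share one Python process.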

Contributor

I see, thanks!

@vepadulano vepadulano merged commit fcd2cd0 into root-project:master Apr 21, 2022
@vepadulano vepadulano deleted the distrdf-snapshot-fix branch May 20, 2022 21:29

Development

Successfully merging this pull request may close these issues.

Wrong file names created in distributed Snapshot

3 participants