🐛 Describe the bug
While researching issue #72368, I found that we run the sharding algorithm for each shard separately. Each shard retrieves the same data and runs against the same commit, and then, based on its shard number and the total shard count, assigns the worker its set of test jobs to execute.
The algorithm retrieves a commit from a day ago here:
get_previous_reports_for_branch
The issue with this approach: since there are multiple workers, it is possible for them to retrieve different commits (if commits land within seconds of each other), so the sharding breakdown will differ between workers. We then end up in a situation where some tests are not executed at all while others are executed multiple times.
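A minimal sketch of the failure mode, using a hypothetical `calculate_shards` greedy packer and made-up timing numbers: if two workers fetch timing reports at different base commits, they compute different partitions, and each worker then keeps only "its" shard of its own partition.

```python
NUM_SHARDS = 2

def calculate_shards(num_shards, times):
    # Greedy packing: assign each test (longest first) to the shard
    # with the least total runtime so far.
    shards = [[0.0, []] for _ in range(num_shards)]
    for test, t in sorted(times.items(), key=lambda kv: -kv[1]):
        lightest = min(range(num_shards), key=lambda i: shards[i][0])
        shards[lightest][0] += t
        shards[lightest][1].append(test)
    return [jobs for _, jobs in shards]

# Worker 0 happened to fetch timings at commit A, worker 1 at commit B
# (a commit landed in between and shifted test_jit's measured runtime):
times_at_a = {"test_ops": 100.0, "test_nn": 60.0, "test_jit": 50.0}
times_at_b = {"test_ops": 100.0, "test_nn": 60.0, "test_jit": 120.0}

worker0_jobs = calculate_shards(NUM_SHARDS, times_at_a)[0]  # keeps shard 0
worker1_jobs = calculate_shards(NUM_SHARDS, times_at_b)[1]  # keeps shard 1

executed = set(worker0_jobs) | set(worker1_jobs)
# executed == {"test_ops", "test_nn"}: test_jit is never run,
# and test_ops is run by both workers.
```

With a single consistent timing report, the two partitions would be identical and the shards would cover every test exactly once.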
Proposed solution 1:
Run the full sharding algorithm once, after the build has completed, and pass the resulting list of jobs to each worker.
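A sketch of what solution 1 could look like (names and the artifact mechanism are assumptions): a single post-build coordinator step computes every shard from one timing snapshot and publishes one job list per worker, so no worker runs the sharding algorithm itself.

```python
import json

def publish_shard_plans(all_shards):
    # all_shards: list of job lists, one per shard, computed once from a
    # single timing snapshot. Each plan would be published as a build
    # artifact that the matching worker downloads instead of sharding
    # on its own.
    return {f"shard_{i}.json": json.dumps(jobs)
            for i, jobs in enumerate(all_shards)}

plans = publish_shard_plans([["test_ops"], ["test_nn", "test_jit"]])

# A worker with shard index 1 simply loads its pre-computed plan:
worker1_jobs = json.loads(plans["shard_1.json"])
```

Since the partition is computed exactly once, there is no way for workers to disagree about shard contents.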
Proposed solution 2:
Reuse the same logic, but retrieve the day-ago commit SHA once after the build has completed, then pass that SHA to each worker and let the workers run the sharding algorithm and assign shards to themselves.
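A sketch of solution 2 under the same assumptions (`calculate_shards` and `fetch_timings` are hypothetical stand-ins): the day-ago SHA is resolved exactly once after the build, every worker receives that pinned SHA, fetches the same timing report, and runs the same deterministic sharding locally.

```python
def calculate_shards(num_shards, times):
    # Same greedy packing on every worker; identical input produces an
    # identical partition (ties broken by test name for determinism).
    shards = [[0.0, []] for _ in range(num_shards)]
    for test, t in sorted(times.items(), key=lambda kv: (-kv[1], kv[0])):
        lightest = min(range(num_shards), key=lambda i: shards[i][0])
        shards[lightest][0] += t
        shards[lightest][1].append(test)
    return [jobs for _, jobs in shards]

def fetch_timings(sha):
    # Stand-in for downloading the timing report at the pinned SHA; in
    # reality every worker would fetch the same stored report for `sha`.
    return {"test_ops": 100.0, "test_nn": 60.0, "test_jit": 50.0}

PINNED_SHA = "abc123"  # resolved once after the build, passed to all workers

worker0 = calculate_shards(2, fetch_timings(PINNED_SHA))[0]
worker1 = calculate_shards(2, fetch_timings(PINNED_SHA))[1]

# Every test runs exactly once across the two workers:
# union covers all tests, intersection is empty.
```

Because every worker shards the same input, the per-worker shard assignments are disjoint and together cover the full test list.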
Here is an example of the current breakdown:
```
2022-03-22T13:52:06.4823300Z Calculating Shard 0 (7796.247999999994, ['test_ops_jit'])
2022-03-22T13:52:06.4825155Z Calculating Shard 1 (5916.576000000704, ['test_ops_gradients', 'test_jit', 'test_unary_ufuncs', 'test_linalg', 'test_fx', 'test_foreach', 'test_ops', 'test_reductions', 'test_tensorboard', 'distributions/test_distributions', 'test_tensor_creation_ops', 'test_dispatch', 'test_torch', 'test_type_promotion', 'test_utils', 'test_type_hints', 'test_package', 'test_sort_and_select', 'test_multiprocessing', 'test_module_init', 'test_import_stats', 'test_shape_ops', 'test_bundled_inputs', 'test_scatter_gather_ops', 'test_namedtuple_return_api', 'test_datapipe', 'benchmark_utils/test_benchmark_utils', 'test_autocast', 'test_function_schema', 'test_monitor', 'test_jit_disabled', 'test_python_dispatch', 'test_overrides', 'test_pytree', 'test_set_default_mobile_cpu_allocator', 'test_per_overload_api', 'test_license', 'test_cpp_extensions_aot_ninja', 'test_cpp_extensions_aot_no_ninja', 'test_jit_autocast', 'test_numba_integration', 'test_vulkan', 'test_openmp'])
2022-03-22T13:52:06.4827506Z Calculating Shard 2 (5916.575000000985, ['test_jit_fuser_te', 'test_quantization', 'test_modules', 'test_nn', 'test_cpp_extensions_jit', 'test_functional_autograd_benchmark', 'test_optim', 'test_binary_ufuncs', 'test_tensorexpr', 'test_serialization', 'test_sparse', 'test_fx_experimental', 'test_autograd', 'test_view_ops', 'test_spectral_ops', 'test_sparse_csr', 'test_cpp_api_parity', 'test_ao_sparsity', 'test_testing', 'test_vmap', 'test_multiprocessing_spawn', 'test_indexing', 'test_logging', 'test_namedtensor', 'test_expanded_weights', 'test_profiler', 'test_futures', 'test_dataloader', 'test_model_dump', 'test_native_functions', 'test_functional_optim', 'test_mobile_optimizer', 'test_stateless', 'test_show_pickle', 'test_type_info', 'test_public_bindings', 'test_numpy_interop', 'test_tensorexpr_pybind', 'test_mkldnn', 'test_xnnpack_integration', 'test_complex'])
```
Versions
1.11.1
cc @seemethere @malfet @pytorch/pytorch-dev-infra