
Improve sharding algorithm for ASAN (and maybe other jobs as well) #74620

@atalman

Description


🐛 Describe the bug

While researching issue #72368, I found that we run the sharding algorithm for each shard separately. Each worker retrieves the same data for the same commit, then, based on the current shard number and the total shard count, assigns itself the test jobs to execute.

The algorithm retrieves a commit from a day ago here:
get_previous_reports_for_branch

The issue with this approach: since there are multiple workers, it is possible that they retrieve different commits (if commits land within seconds of each other), and hence the sharding breakdown will differ between workers. We can end up in a situation where some tests are not executed at all while others are executed multiple times.
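A minimal sketch of the failure mode (all function names, test names, and timing numbers below are hypothetical, not the actual run_test.py code): each worker fetches its own timing report, and a greedy partition over slightly different inputs yields different shard plans.

```python
def calculate_shards(num_shards, tests, timings):
    """Greedy longest-job-first partition: put each test on the
    currently lightest shard."""
    shards = [(0.0, []) for _ in range(num_shards)]
    for test in sorted(tests, key=lambda t: timings.get(t, 0.0), reverse=True):
        idx = min(range(num_shards), key=lambda i: shards[i][0])
        total, members = shards[idx]
        shards[idx] = (total + timings.get(test, 0.0), members + [test])
    return shards

tests = ["test_ops_jit", "test_nn", "test_jit", "test_torch"]

# The worker running shard 0 resolved one "day ago" commit...
timings_shard0 = {"test_ops_jit": 7796.0, "test_nn": 900.0,
                  "test_jit": 850.0, "test_torch": 400.0}
# ...while the worker running shard 1 resolved a commit a few seconds
# newer, whose report shows test_nn regressing badly.
timings_shard1 = {"test_ops_jit": 7796.0, "test_nn": 8000.0,
                  "test_jit": 850.0, "test_torch": 400.0}

plan0 = calculate_shards(2, tests, timings_shard0)
plan1 = calculate_shards(2, tests, timings_shard1)

# Each worker executes only its own slice of its own plan, so here
# test_ops_jit runs on both workers while test_nn and test_torch
# never run at all.
executed = set(plan0[0][1]) | set(plan1[1][1])
print(sorted(executed))
```

The inconsistency only needs the two reports to disagree slightly; any deterministic partition over non-identical inputs can diverge this way.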

Proposed solution 1:
Run the sharding algorithm once after the build has completed, and pass the resulting list of jobs to each worker.

Proposed solution 2:
Reuse the same logic, but retrieve the day-ago commit SHA once after the build has completed, then pass that SHA to each worker and let the workers run the sharding algorithm and assign the shards to themselves.
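Proposed solution 2 can be sketched as follows (the report store, SHA, and test names are hypothetical): the build job resolves the day-ago SHA once, every worker shards against that single pinned report, and the resulting shards are disjoint and cover every test exactly once.

```python
REPORTS = {  # hypothetical: timing reports keyed by the commit they came from
    "abc123": {"test_a": 10.0, "test_b": 7.0, "test_c": 3.0, "test_d": 2.0},
}

def shard_for(worker_idx, num_shards, sha):
    """Round-robin over tests ordered by the pinned report's timings.
    Fully deterministic given (worker_idx, num_shards, sha)."""
    timings = REPORTS[sha]
    ordered = sorted(timings, key=lambda t: (-timings[t], t))
    return ordered[worker_idx::num_shards]

# The build job resolves the SHA once and hands the same value to
# every test worker:
pinned_sha = "abc123"
shard0 = shard_for(0, 2, pinned_sha)
shard1 = shard_for(1, 2, pinned_sha)

# Because both workers used the same SHA, the shards partition the
# test set: no gaps, no duplicates.
print(shard0, shard1)
```

The key property is that the only input that could differ between workers (the resolved commit) is now fixed before any worker starts, so every worker computes the identical global plan and simply picks its own slice.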

Here is an example of the current breakdown:

2022-03-22T13:52:06.4823300Z Calculating Shard 0 (7796.247999999994, ['test_ops_jit'])
2022-03-22T13:52:06.4825155Z Calculating Shard 1 (5916.576000000704, ['test_ops_gradients', 'test_jit', 'test_unary_ufuncs', 'test_linalg', 'test_fx', 'test_foreach', 'test_ops', 'test_reductions', 'test_tensorboard', 'distributions/test_distributions', 'test_tensor_creation_ops', 'test_dispatch', 'test_torch', 'test_type_promotion', 'test_utils', 'test_type_hints', 'test_package', 'test_sort_and_select', 'test_multiprocessing', 'test_module_init', 'test_import_stats', 'test_shape_ops', 'test_bundled_inputs', 'test_scatter_gather_ops', 'test_namedtuple_return_api', 'test_datapipe', 'benchmark_utils/test_benchmark_utils', 'test_autocast', 'test_function_schema', 'test_monitor', 'test_jit_disabled', 'test_python_dispatch', 'test_overrides', 'test_pytree', 'test_set_default_mobile_cpu_allocator', 'test_per_overload_api', 'test_license', 'test_cpp_extensions_aot_ninja', 'test_cpp_extensions_aot_no_ninja', 'test_jit_autocast', 'test_numba_integration', 'test_vulkan', 'test_openmp'])
2022-03-22T13:52:06.4827506Z Calculating Shard 2 (5916.575000000985, ['test_jit_fuser_te', 'test_quantization', 'test_modules', 'test_nn', 'test_cpp_extensions_jit', 'test_functional_autograd_benchmark', 'test_optim', 'test_binary_ufuncs', 'test_tensorexpr', 'test_serialization', 'test_sparse', 'test_fx_experimental', 'test_autograd', 'test_view_ops', 'test_spectral_ops', 'test_sparse_csr', 'test_cpp_api_parity', 'test_ao_sparsity', 'test_testing', 'test_vmap', 'test_multiprocessing_spawn', 'test_indexing', 'test_logging', 'test_namedtensor', 'test_expanded_weights', 'test_profiler', 'test_futures', 'test_dataloader', 'test_model_dump', 'test_native_functions', 'test_functional_optim', 'test_mobile_optimizer', 'test_stateless', 'test_show_pickle', 'test_type_info', 'test_public_bindings', 'test_numpy_interop', 'test_tensorexpr_pybind', 'test_mkldnn', 'test_xnnpack_integration', 'test_complex'])

Versions

1.11.1

cc @seemethere @malfet @pytorch/pytorch-dev-infra

Metadata

Labels

enhancement (Not as big of a feature, but technically not a bug. Should be easy to fix)
module: ci (Related to continuous integration)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Projects

Status: Done
