Batch Sampler Speedup #149441
Conversation
This is the code comparing speed between versions of `BatchSampler`:

```python
import itertools
import timeit
from itertools import product

import numpy as np
import pandas as pd
from tqdm import tqdm

# Imports from a local copy of torch/utils/data/sampler.py containing the new code.
from sampler import Sampler, Iterator, Iterable, Union
from sampler import SequentialSampler, RandomSampler, BatchSampler


def iter_sampler(sampler):
    for _ in sampler:
        pass


def time_sampler(sampler):
    timer = timeit.Timer(lambda: iter_sampler(sampler))
    times = timer.repeat(AVG_TIMES, 1)
    return np.mean(times), np.std(times)


class PrevBatchSampler(Sampler[list[int]]):
    """
    A copy of the previous BatchSampler
    """

    def __init__(
        self,
        sampler: Union[Sampler[int], Iterable[int]],
        batch_size: int,
        drop_last: bool,
    ) -> None:
        # Since collections.abc.Iterable does not check for `__getitem__`, which
        # is one way for an object to be an iterable, we don't do an `isinstance`
        # check here.
        if (
            not isinstance(batch_size, int)
            or isinstance(batch_size, bool)
            or batch_size <= 0
        ):
            raise ValueError(
                f"batch_size should be a positive integer value, but got batch_size={batch_size}"
            )
        if not isinstance(drop_last, bool):
            raise ValueError(
                f"drop_last should be a boolean value, but got drop_last={drop_last}"
            )
        self.sampler = sampler
        self.batch_size = batch_size
        self.drop_last = drop_last

    def __iter__(self) -> Iterator[list[int]]:
        # Implemented based on the benchmarking in https://github.com/pytorch/pytorch/pull/76951
        sampler_iter = iter(self.sampler)
        if self.drop_last:
            # Create multiple references to the same iterator
            args = [sampler_iter] * self.batch_size
            for batch_droplast in zip(*args):
                yield [*batch_droplast]
        else:
            batch = [*itertools.islice(sampler_iter, self.batch_size)]
            while batch:
                yield batch
                batch = [*itertools.islice(sampler_iter, self.batch_size)]

    def __len__(self) -> int:
        # Can only be called if self.sampler has __len__ implemented.
        # We cannot enforce this condition, so we turn off typechecking for the
        # implementation below.
        # Somewhat related: see NOTE [ Lack of Default `__len__` in Python Abstract Base Classes ]
        if self.drop_last:
            return len(self.sampler) // self.batch_size  # type: ignore[arg-type]
        else:
            return (len(self.sampler) + self.batch_size - 1) // self.batch_size  # type: ignore[arg-type]


if __name__ == '__main__':
    DATA_SIZE = 1_000_000
    AVG_TIMES = 10
    data = np.zeros(DATA_SIZE)
    batch_sizes = [4, 8, 16, 32, 64, 256, 1024, 4096, 8192, 16384]
    replacements = [True, False]
    drop_lasts = [True, False]
    results_sequential = []
    results_random = []
    for batch_size, replacement, drop_last in tqdm(product(batch_sizes, replacements, drop_lasts)):
        # Sequential (replacement is meaningless here, so run only once per combination)
        if not replacement:
            sequential_sampler = SequentialSampler(data)
            prev_batch_sampler = PrevBatchSampler(sequential_sampler, batch_size, drop_last)
            new_batch_sampler = BatchSampler(sequential_sampler, batch_size, drop_last)
            avg_prev, std_prev = time_sampler(prev_batch_sampler)
            avg_new, std_new = time_sampler(new_batch_sampler)
            # (1/avg_new - 1/avg_prev) * avg_prev == avg_prev/avg_new - 1,
            # i.e. the relative speedup, reported as a percentage.
            speedup = "%.2f" % ((1 / avg_new - 1 / avg_prev) * avg_prev * 100) + "%"
            row = [batch_size, drop_last, avg_prev, std_prev, avg_new, std_new, speedup]
            results_sequential.append(row)
        # Random
        random_sampler = RandomSampler(data, replacement)
        prev_batch_sampler = PrevBatchSampler(random_sampler, batch_size, drop_last)
        new_batch_sampler = BatchSampler(random_sampler, batch_size, drop_last)
        avg_prev, std_prev = time_sampler(prev_batch_sampler)
        avg_new, std_new = time_sampler(new_batch_sampler)
        speedup = "%.2f" % ((1 / avg_new - 1 / avg_prev) * avg_prev * 100) + "%"
        row = [batch_size, drop_last, replacement, avg_prev, std_prev, avg_new, std_new, speedup]
        results_random.append(row)

    sequential_columns = ["batch_size", "drop_last",
                          "original(avg)", "original(std)", "new(avg)", "new(std)", "speedup"]
    seq_df = pd.DataFrame(results_sequential, columns=sequential_columns)
    seq_df = seq_df.set_index(["batch_size", "drop_last"])
    random_columns = ["batch_size", "drop_last", "replacement",
                      "original(avg)", "original(std)", "new(avg)", "new(std)", "speedup"]
    random_df = pd.DataFrame(results_random, columns=random_columns)
    random_df = random_df.set_index(["batch_size", "drop_last", "replacement"])
    pd.set_option('display.max_columns', None)
    pd.set_option('display.width', 1000)
    print('Random Sampler')
    print(random_df.to_string(justify='left', col_space=12))
    print('\nSequential Sampler')
    print(seq_df.to_string(justify='left', col_space=12))
```
Here is the speedup achieved by the alternative implementations:

[speedup results table]
@albanD Do you know what the guidance is on using numpy in torch.utils?
albanD left a comment:
Oh, we should not use numpy here. Numpy is not a dependency of PyTorch, so you cannot rely on it being available! `torch.arange` and similar functions should give you the same behavior for what you need here.
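For illustration, a minimal sketch of what a torch-only index path along these lines could look like. The helper names here are hypothetical, not the PR's actual code:

```python
import torch

def sequential_indices(n: int) -> torch.Tensor:
    # torch.arange plays the role numpy.arange would play for sequential indices.
    return torch.arange(n)

def random_indices(n: int, generator: torch.Generator | None = None) -> torch.Tensor:
    # torch.randperm replaces a numpy permutation for random indices.
    return torch.randperm(n, generator=generator)

def batches(indices: torch.Tensor, batch_size: int, drop_last: bool):
    # Once the epoch's indices are materialized, batching is a single split.
    if drop_last:
        indices = indices[: (len(indices) // batch_size) * batch_size]
    yield from torch.split(indices, batch_size)
```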
The tensor-based implementation produces a negative speedup for the sequential sampler when the batch size is <= 64. I have a few ideas to address this:

```python
# Idea 1: use the numpy-based implementation only when numpy is available.
try:
    import numpy  # noqa: F401
    ...  # use new code (numpy based)
except ImportError:
    ...  # use original code

# Idea 2: dispatch on batch size, since the tensor-based path only pays off
# for large batches.
if batch_size > 64:
    ...  # use new code (tensor based)
else:
    ...  # use original code
```

Let me know what you think!
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as Stale.
Motivation
#147706 attempts to accelerate `BatchSampler` over `RandomSampler` by exploiting the fact that `RandomSampler` can construct all of the epoch's indices before yielding them. This PR generalizes that approach to all samplers that share this property (e.g. `SequentialSampler`).
SequentialSampler).Content
This PR introduces a new sampler base class, `ArrayableSampler` (a poor name perhaps, happy for suggestions!), which has a method `to_array` that returns the entire sequence of indices instead of yielding them one at a time. `BatchSampler` is modified to call `to_array` when it is available and then partition the indices into batches more efficiently. `RandomSampler` and `SequentialSampler` are changed to inherit from `ArrayableSampler` instead of `Sampler` and to implement `to_array`; a sketch of the idea is shown below. I've also added unit tests for `BatchSampler`.
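For concreteness, a minimal sketch of what such a base class could look like; the exact signatures may differ from the PR, and `to_array` is assumed to return a numpy array per the note below. `MySequentialSampler` is a hypothetical illustration, not the PR's modified `SequentialSampler`:

```python
from collections.abc import Iterator

import numpy as np
from torch.utils.data import Sampler

class ArrayableSampler(Sampler[int]):
    """Sampler that can materialize its entire epoch of indices at once."""

    def to_array(self) -> np.ndarray:
        # Subclasses return the full index sequence in one call, so that
        # BatchSampler can partition it without per-index Python iteration.
        raise NotImplementedError

class MySequentialSampler(ArrayableSampler):
    def __init__(self, data_source) -> None:
        self.data_source = data_source

    def to_array(self) -> np.ndarray:
        return np.arange(len(self.data_source))

    def __iter__(self) -> Iterator[int]:
        yield from range(len(self.data_source))

    def __len__(self) -> int:
        return len(self.data_source)
```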
Results
These are the speedup results over `RandomSampler` and `SequentialSampler`:

Note
While `BatchSampler` previously yielded `List[int]`, it now yields `numpy` arrays instead. Furthermore, `RandomSampler.to_array` uses a `numpy` generator instead of a `torch` generator.

I'll provide speed comparisons using the following alternative implementations (a sketch of the shared fast path follows the list):
1. `numpy` generator and yielding `List[int]`.
2. `torch` generator and yielding `Tensor`.
3. `torch` generator and yielding `List[int]`.
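As a rough sketch of what the numpy-generator variant and the shared batching fast path could look like; the helper names are assumptions for illustration, and the PR's actual code may differ:

```python
import numpy as np

rng = np.random.default_rng()

def random_to_array(n: int, replacement: bool) -> np.ndarray:
    # numpy-generator analogue of RandomSampler's index generation.
    if replacement:
        return rng.integers(0, n, size=n)
    return rng.permutation(n)

def batches_from_array(indices: np.ndarray, batch_size: int, drop_last: bool):
    # Partition a materialized index array into batches.
    n = len(indices)
    if drop_last:
        n = (n // batch_size) * batch_size
        # reshape gives a view with one row per full batch, so no copy is made.
        yield from indices[:n].reshape(-1, batch_size)
    else:
        # Keep the final partial batch.
        for start in range(0, n, batch_size):
            yield indices[start:start + batch_size]
```

For example, `list(batches_from_array(np.arange(10), 4, drop_last=False))` yields index arrays `[0, 1, 2, 3]`, `[4, 5, 6, 7]`, and `[8, 9]`.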