Comprehensive-ish instrumentation for CUDA memory allocator #27361
Conversation
facebook-github-bot
left a comment
@jma127 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
There aren't any tests. I accept that this sort of instrumentation is hard to test. How did you verify it was working?

The diff looks fine; @colesbury do you want to take a quick look?

Yeah, this needs a review from @colesbury.
colesbury
left a comment
The data race is the most significant issue, but I also think that if we're going to add a bunch of tracked stats we should think more carefully about what we want to track and how.
torch/csrc/cuda/Module.cpp
Outdated
The returns from PyDict_SetItemString and PyLong_FromUnsignedLongLong should both be checked for errors.
Will do, thanks
Actually, none of the existing PyObject creation calls in this file check for a null return value – is there a special reason for that?
Some are probably buggy; others may be OK. For example, a call whose result is immediately returned is fine, because the return statement propagates the NULL error return.
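The check-and-propagate pattern being asked for can be sketched as follows. This uses hypothetical stand-in types (`Obj`, `make_long`, `dict_set`, `add_stat`) rather than the real CPython API, purely to illustrate how each return value gets checked and the failure propagated to the caller:

```cpp
#include <cstdint>

// Hypothetical stand-ins for the CPython calls discussed above
// (PyLong_FromUnsignedLongLong returns NULL on failure;
// PyDict_SetItemString returns -1). They illustrate the pattern only.
struct Obj {
  uint64_t v;
};

Obj* make_long(uint64_t v, bool fail) {
  return fail ? nullptr : new Obj{v};  // NULL signals failure, as in CPython
}

int dict_set(Obj* /*dict*/, const char* /*key*/, Obj* item) {
  return item != nullptr ? 0 : -1;  // -1 signals failure, as in CPython
}

// Check every return value and propagate the failure outward, mirroring
// how `return NULL;` propagates a CPython error to the caller.
bool add_stat(Obj* dict, const char* key, uint64_t value, bool fail) {
  Obj* obj = make_long(value, fail);
  if (obj == nullptr) {
    return false;  // propagate instead of silently continuing
  }
  int rc = dict_set(dict, key, obj);
  delete obj;  // this sketch's dict_set does not take ownership
  return rc == 0;
}
```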
c10/cuda/CUDACachingAllocator.cpp
Outdated
This is not thread-safe.
The stats API is unwieldy. There should be just one accessor function that returns a copy of the current DeviceStats. (The copy part is important)
This is again taken from existing functions (e.g. `currentMemoryAllocated`) – if this isn't thread-safe, then the existing functionality must also be thread-unsafe, correct?
Regardless, I'm happy to change this – does a lock around `THCCachingAllocator::mutex` seem reasonable?
Yeah, a number of these issues already exist in the code (I didn't review that change)
A lock around the allocator's mutex should be fine, but it shouldn't be in get_stats_for_device since that's used internally when the mutex is already acquired. I'd suggest another top-level function (i.e. like malloc/free) that takes out the lock and copies the stats (so that they're consistent)
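That suggestion can be sketched roughly like this (names are illustrative, not the actual allocator code): internal paths update stats with the mutex already held, and one top-level accessor takes the lock and returns a copy so callers see a consistent snapshot:

```cpp
#include <algorithm>
#include <cstdint>
#include <mutex>

struct DeviceStats {
  int64_t current_allocated = 0;
  int64_t peak_allocated = 0;
};

class CachingAllocator {
 public:
  void malloc_bytes(int64_t n) {
    std::lock_guard<std::mutex> lock(mutex_);
    stats_.current_allocated += n;
    stats_.peak_allocated =
        std::max(stats_.peak_allocated, stats_.current_allocated);
  }

  void free_bytes(int64_t n) {
    std::lock_guard<std::mutex> lock(mutex_);
    stats_.current_allocated -= n;
  }

  // Top-level accessor: lock, copy, return. Internal helpers never call
  // this, since they already hold mutex_.
  DeviceStats getStats() {
    std::lock_guard<std::mutex> lock(mutex_);
    return stats_;  // copied while the lock is held
  }

 private:
  std::mutex mutex_;
  DeviceStats stats_;
};
```

Returning by value is the important part: handing out a pointer or reference to the internal struct would let callers read it after the lock is released.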
torch/csrc/cuda/Module.cpp
Outdated
There doesn't seem to be a good enough reason to use a macro here (instead of say a function).
Fair point – will change
c10/cuda/CUDACachingAllocator.cpp
Outdated
This mixes size_t with uint64_t. Use one or the other.
I tend to think these functions don't add much over just writing `stats.total_num_alloc_requests += 1;`, and they make it harder to trace which stat variables correspond to which functions (i.e. as a reader, I have to go from `total_num_blocks_allocated` to `logBlockAlloc` to the uses of that function).
I used existing convention here (see e.g. `decreaseAllocated`); I'm happy to change everything to straight `uint64_t`. Re. the value of these functions – that's also just convention from `decreaseAllocated` and friends.
My thoughts are:
- Change `size_t` to `uint64_t` across the board.
- Leave the helper methods as is (some of the helper methods do add value, e.g. `increaseAllocated`, and it would be unwieldy to have a mix of plain `+=` statements and calls to helper methods).
c10/cuda/CUDACachingAllocator.cpp
Outdated
I think the tracked stats may need a bit of rework. Some reasonable questions might be "how many allocated blocks are there now?" "What was the peak number of allocated blocks?". (The first would be answerable by subtracting these two stats, but the second isn't computable from this).
I've been looking at how other allocators track these sorts of things. I'll add some suggestions in a bit.
Thanks – happy to incorporate any suggestions.
I've been looking at mimalloc and for nearly every stat they track "current", "peak", "allocated", and "freed", which seems useful here too.
Something like:
```
struct Stat {
  int64_t current = 0;
  int64_t peak = 0;
  int64_t allocated = 0;
  int64_t freed = 0;
};

struct DeviceStats {
  Stat allocations;
  Stat allocated_bytes;
  ...
};

void update(Stat& stat, int64_t amount) {
  stat.current += amount;
  stat.peak = std::max(stat.current, stat.peak);
  if (amount > 0) stat.allocated += amount;
  if (amount < 0) stat.freed += -amount;
}

void example() {
  update(stats.allocations, 1);  // alternatively stats.allocations.update(1); if you prefer
  update(stats.allocated_bytes, nbytes);
}
```
Stat and DeviceStats should be "public".
I think the things we might care about are:
allocations: (current/peak/allocated/freed) - how many malloc/free requests have we received?
segments: (current/peak/allocated/freed) - how many underlying allocations have we made to the cuda API.
allocated_bytes: (current/peak/allocated/freed) - how many bytes has the user requested (after rounding block sizes)?
reserved_bytes: (current/peak/allocated/freed) - how many bytes have we reserved from CUDA
Possibly also:
active: (current/peak/allocated/freed) -- how many blocks are either allocated or have positive event_count (i.e. re-use is delayed due to a stream use)
active_bytes: (current/peak/allocated/freed) -- total size in bytes of in-use blocks
i.e. roughly what's tracked by "total_num_blocks_released". The condition active >= allocated should always hold, and after an empty cache operation (such as on retry) active == allocated.
split: (current/peak/allocated/freed): number of split blocks
i.e. roughly "total_num_blocks_split" but should also be decreased when blocks are merged. I'm skeptical of the usefulness of this and would lean towards not including it. Most of the time it's completely irrelevant. When you're out-of-memory (and think you shouldn't be), the number of split blocks doesn't tell you much, but the actual heap structure does (see https://gist.github.com/colesbury/0ab4eb58c1d542c5a626040e6714388e)
Simple Counters:
cuda_malloc_retries: how many times has cudaMalloc failed with OOM and we've had to empty the cache and retry. This is total_num_cache_flushes. Since it already excludes calls to torch.cuda.empty_cache() (which I think was the correct choice), I think we should make that clear in the name.
In general, I think the choice of names is pretty important here. I'm not entirely confident about these particular choices, but I think it makes sense to try to re-use terminology common to other memory allocators when appropriate. I'd also like to avoid the term "cached" because it's been ambiguous if this has included blocks in-use by the application. (i.e. torch.cuda.memory_cached includes bytes in-use by the application). I think "reserved" is a better term here.
I used both the toy script (from the earlier AI Platform discussion thread) and a run on examples/imagenet. I'll rerun both these workflows once @colesbury and I converge on a set of desired stats.
Addressed CR comments – will hold for @colesbury's findings re. other things worth tracking.
Added a slight superset of the above. Marking this as WIP – will un-mark and comment here when code is stable.
Summary: PyTorch now has more comprehensive memory instrumentation, added in pytorch/pytorch#27361 . This PR makes fairseq print a summary table of the memory state when an OOM occurs. Pull Request resolved: fairinternal/fairseq-py#885 Differential Revision: D17820445 Pulled By: jma127 fbshipit-source-id: 1887417c7648d703f78e1cff9f2a5b89901f49d0
wow this is awesome. thanks!
…27361)

Summary: Adds comprehensive memory instrumentation to the CUDA caching memory allocator.

# Counters
Added comprehensive instrumentation for the following stats:
- Allocation requests (`allocation`)
- Allocated memory (`allocated_bytes`)
- Reserved segments from cudaMalloc (`segment`)
- Reserved memory (`reserved_bytes`)
- Active memory blocks (`active`)
- Active memory (`active_bytes`)
- Inactive, non-releasable blocks (`inactive_split`)
- Inactive, non-releasable memory (`inactive_split_bytes`)
- Number of failed cudaMalloc calls that result in a cache flush and retry (`cuda_malloc_retries`)
- Number of OOMs (`num_ooms`)

Except for the last two, these stats are segmented between all memory, large blocks, and small blocks. Along with the current value of each stat, historical counts of allocs/frees as well as peak usage are tracked by the allocator.

# Snapshots
Added the capability to get a "memory snapshot" – that is, to generate a complete dump of the allocator block/segment state.

# Implementation: major changes
- Added `torch.cuda.memory_stats()` (and associated C++ changes), which returns all instrumented stats as a dictionary.
- Added `torch.cuda.snapshot()` (and associated C++ changes), which returns a complete dump of the allocator block/segment state as a list of segments.
- Added a memory summary generator, `torch.cuda.memory_summary()`, for ease of client access to the instrumentation stats. Potentially useful to dump when catching OOMs. Sample output here: https://pastebin.com/uKZjtupq

# Implementation: minor changes
- Add error-checking helper functions for Python dicts and lists in `torch/csrc/utils/`.
- Existing memory management functions in `torch.cuda` moved from `__init__.py` to `memory.py` and star-imported into the main CUDA module.
- Add various helper functions to `torch.cuda` to return individual items from `torch.cuda.memory_stats()`.
- `torch.cuda.reset_max_memory_cached()` and `torch.cuda.reset_max_memory_allocated()` are deprecated in favor of `reset_peak_stats`. It's a bit difficult to think of a case where only one of those stats should be reset, and IMO this makes the peak stats collectively more consistent.
- `torch.cuda.memory_cached()` and `torch.cuda.max_memory_cached()` are deprecated in favor of `*memory_reserved()`.
- Style (add access modifiers in the allocator class, random nit fixes, etc.)

# Testing
- Added a consistency check for stats in `test_cuda.py`. This verifies that the data from `memory_stats()` is faithful to the data from `snapshot()`.
- Ran on various basic workflows (toy example, CIFAR).

# Performance
Running the following speed benchmark: https://pastebin.com/UNndQg50
- Before this PR: 45.98 microseconds per tensor creation
- After this PR: 46.65 microseconds per tensor creation

Pull Request resolved: pytorch#27361
Differential Revision: D17758747
Pulled By: jma127
fbshipit-source-id: 5a84e82d696c40c505646b9a1b4e0c3bba38aeb6
TPU specific changes [here](https://gist.github.com/taylanbil/150abd31b1fbf5c91ca90ef5a4d79f08)
The rest is rebasing on a more current fairseq upstream commit.
---
* v0.7.1 -> v0.7.2 (#891)
Summary:
No major API changes since the last release. Cutting a new release since we'll be merging significant (possibly breaking) changes to logging, data loading and the masked LM implementation soon.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/891
Differential Revision: D16377132
Pulled By: myleott
fbshipit-source-id: f1cb88e671ccd510e53334d0f449fe18585268c7
* Switch to torch.nn.functional.gelu when available
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/735
Differential Revision: D16377046
Pulled By: myleott
fbshipit-source-id: 9725d4a3ce6b2fc8cee0b1d1cb8921f9d59c551a
* Improve interactive generation (support --tokenizer and --bpe)
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/734
Differential Revision: D16377044
Pulled By: myleott
fbshipit-source-id: 37d5553d76aa7c653113fec089f59710281c31d7
* Store task in the criterion base class
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/737
Differential Revision: D16377805
Pulled By: myleott
fbshipit-source-id: 1e090a02ff4fbba8695173f57d3cc5b88ae98bbf
* Create standalone label_smoothed_nll_loss
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/739
Differential Revision: D16377798
Pulled By: myleott
fbshipit-source-id: 20047c80de2e6f108269ace4ae3eec906a5920dd
* Allow not specifying --warmup-init-lr
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/736
Differential Revision: D16378001
Pulled By: myleott
fbshipit-source-id: 2907f63bcbf7068ceaa48b00096040fa2639e569
* Rename _load_model_ensemble -> load_model_ensemble_and_task
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/738
Differential Revision: D16377803
Pulled By: myleott
fbshipit-source-id: 6beb2f78e7464b70ff65a965d2b747cdca0ca951
* Rename data.transforms -> data.encoders
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/747
Differential Revision: D16403464
Pulled By: myleott
fbshipit-source-id: ee3b4184f129a02be833c7bdc00685978b4de883
* Fix topp sampling issues (#882)
Summary:
Two issues here:
1. `last_included` should be the last included index `cumsum_mask[:, :, -1:]` instead of `cumsum_mask[:, :, :1]` (which is either 0 or 1);
2. If `--no-repeat-ngram-size` is set, the sum of `probs` may be less than 1; we need to re-normalize to make it a valid probability distribution.
The following code reproduces the issue:
```
import torch
import numpy as np


def _sample_topp(probs):
    # ===== Code from fairseq/search.py _sample_topp ======
    # sort the last dimension (vocab dimension) in descending order
    sorted_probs, sorted_indices = probs.sort(descending=True)

    # compute a mask to indicate the words to be included in the top-P set.
    cumsum_probs = sorted_probs.cumsum(dim=2)
    mask = cumsum_probs.lt(sampling_topp)

    # note that mask was computed by 'lt'. One more word needs to be included
    # so that the cumulative probability mass can exceed p.
    cumsum_mask = mask.cumsum(dim=2)
    last_included = cumsum_mask[:, :, :1]
    mask = mask.scatter_(2, last_included, 1)

    # truncate unnecessary dims.
    max_dim = last_included.max()
    truncated_mask = mask[:, :, :max_dim + 1]
    truncated_probs = sorted_probs[:, :, :max_dim + 1]
    truncated_indices = sorted_indices[:, :, :max_dim + 1]

    # trim the words that are not in top-P by setting their probabilities
    # to 0, so that they would not be sampled later.
    trim_mask = 1 - truncated_mask
    trimed_probs = truncated_probs.masked_fill_(trim_mask, 0)
    return trimed_probs, truncated_indices
# ========================================================


if __name__ == '__main__':
    np.random.seed(1234)
    torch.manual_seed(1234)
    sampling_topp = 0.9
    probs = torch.softmax(torch.randn(1, 1, 10), dim=-1)
    # probs = tensor([0.0545, 0.0779, 0.0189, 0.0647, 0.0282, 0.0862, 0.0656, 0.1041, 0.0399, 0.4600])
    print('probs =', probs[0][0])

    trimed_probs, truncated_indices = _sample_topp(probs)
    cum_probs = trimed_probs.cumsum(dim=-1)[0][0]
    # cumsum = tensor([0.4600, 0.5641])
    print('cumsum =', cum_probs)

    # Will throw AssertionError
    assert float(cum_probs[-1]) >= sampling_topp
```
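For contrast, the two fixes can be sketched in plain NumPy on a 1-D probability vector. This is an illustrative sketch, not the fairseq patch itself (the function name and signature are made up): it keeps words up to the *last* included index rather than the first, and it renormalizes the trimmed probabilities so they sum to 1 again.

```python
import numpy as np

def sample_topp_fixed(probs, p):
    """Illustrative 1-D top-p: keep the smallest prefix of the sorted
    probabilities whose cumulative mass reaches p, then renormalize."""
    order = np.argsort(probs)[::-1]       # indices, descending probability
    sorted_probs = probs[order]
    cumsum = np.cumsum(sorted_probs)
    # Fix 1: find the index of the *last* included word, i.e. the first
    # position where the cumulative mass reaches p.
    last_included = int(np.searchsorted(cumsum, p))
    kept = sorted_probs[: last_included + 1]
    # Fix 2: renormalize so the result is a valid distribution even when
    # probs itself summed to less than 1 (e.g. after n-gram blocking).
    kept = kept / kept.sum()
    return kept, order[: last_included + 1]
```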
Pull Request resolved: https://github.com/pytorch/fairseq/pull/882
Differential Revision: D16409269
Pulled By: xingz9
fbshipit-source-id: 94b1122eed50c656057b64e22af6f4a6ea7a68af
* Default to mmap and infer dataset implementations automatically
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/751
Differential Revision: D16410989
Pulled By: myleott
fbshipit-source-id: ddbbee49756f9ff6c4487977a3f5d2259b7abafe
* Update GPT-2 BPE
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/749
Differential Revision: D16410984
Pulled By: myleott
fbshipit-source-id: 7698df46b8a179afccb287990f9705358690454a
* Misc improvements to torch hub interface
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/750
Differential Revision: D16410986
Pulled By: myleott
fbshipit-source-id: 8ee6b4371d6ae5b041b00a54a6039a422345795e
* Move Masked LM components to legacy/ -- new ones are coming
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/740
Differential Revision: D16377797
Pulled By: myleott
fbshipit-source-id: f7d6c8b00a77e279ea94376b1f0fcd15087eaf5f
* Add fallback for SLURM config
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/752
Differential Revision: D16417582
Pulled By: myleott
fbshipit-source-id: 6b4289febcf9290452bb91f1f2181a02c09c82a7
* Fix --reset-meters
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/756
Differential Revision: D16418302
Pulled By: myleott
fbshipit-source-id: 62495a0bff41d1741e2b09807a3b43ff2c66c8fb
* Simplify hubconf
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/758
Differential Revision: D16418932
Pulled By: myleott
fbshipit-source-id: 59f005164b61b9fa712922eeb23525f7eec38f38
* Add new Datasets
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/757
Differential Revision: D16418305
Pulled By: myleott
fbshipit-source-id: 25f293a2792509f7a75c688e4bf8cff02e6bba2e
* Add new Masked LM task + criterion
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/761
Differential Revision: D16421335
Pulled By: myleott
fbshipit-source-id: 257d92c2b90361147642e2baa38486b4d18f6297
* Implement sparse transformer fixed attention pattern (#804)
Summary:
Pull Request resolved: https://github.com/facebookresearch/pytext/pull/804
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/746
Pull Request resolved: https://github.com/pytorch/fairseq/pull/894
Adding an implementation of the sparse transformer to multi-head attention using the fixed attention pattern specified https://arxiv.org/pdf/1904.10509.pdf. The sparse_mask masks out words using -inf; after softmax, -inf becomes 0. Thus, a mask does not need to be re-calculated and re-applied when multiplying attn_weights and values.
Four inputs are added to the config: sparse, is_bidirectional, stride, expressivity. If we are using the sparse transformer, is_bidirectional, stride, and expressivity must be specified (there are defaults). If is_bidirectional is False, the mask is computed using the fixed attention pattern described in the paper. If is_bidirectional is True, subset one includes all values in the current stride window and a summary from every stride window; all other values are masked. Stride (L in the paper) controls the window size and expressivity (c in the paper) controls the size of the summary.
Reviewed By: borguz
Differential Revision: D16042988
fbshipit-source-id: c59166dc7cfe89187a256e4076000c2458842fd5
* Fix read_binarized.py script
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/762
Differential Revision: D16427266
Pulled By: myleott
fbshipit-source-id: 9bd9b8c6b4994ae98a62a37b34d03265bd365453
* Initializing mask as a tensor of ints (not long) (#875)
Summary:
Since mask really is a tensor of ints, this change should be mathematically
equivalent to the base.
On the other hand, this has performance implications for xla, hence the
pull request.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/875
Differential Revision: D16232877
Pulled By: myleott
fbshipit-source-id: e63175ee0016dcf0dfe10e2fd22570b8bbfbde84
* Update README.md
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/899
Differential Revision: D16448602
Pulled By: myleott
fbshipit-source-id: afd1a1b713274b6328150cd85d7f8a81833597aa
* check save_dir before beginning training
Summary: I sadly discovered that my checkpoint directory wasn't globally readable after 8 hours of training. Adding this check at the beginning of the train loop to keep that from happening again!
Reviewed By: myleott
Differential Revision: D16455394
fbshipit-source-id: 35959aa058150b2afb63710c468d01ebc8a12b0c
* Update torch.hub usage
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/770
Differential Revision: D16491911
Pulled By: myleott
fbshipit-source-id: 8dd2b76f8fa24183640ae9d1129ea47ded77d43d
* Standardize on 'teacher forcing' rather than 'input feeding' which is… (#769)
Summary:
Input feeding generally refers to a slightly different concept
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/769
Differential Revision: D16491898
Pulled By: myleott
fbshipit-source-id: 68573584e820f11f199db4e7e37e9ee7a69a3287
* Add RoBERTa README
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/778
Differential Revision: D16525447
Pulled By: myleott
fbshipit-source-id: e721e3a10e243a2408a04f89f06b5adbbe2fdff2
* Add return_all_hiddens flag to hub interface
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/909
Differential Revision: D16532919
Pulled By: myleott
fbshipit-source-id: 16ce884cf3d84579026e4406a75ba3c01a128dbd
* Fix compatibility with PyTorch 1.0.x (Fixes #906)
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/910
Differential Revision: D16536532
Pulled By: myleott
fbshipit-source-id: 56bb5570e70b5670ad87c64d9dd20c64c1fa9f5c
* Make hub_utils.generator inherit from nn.Module
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/913
Differential Revision: D16536562
Pulled By: myleott
fbshipit-source-id: ce28642da6868ec884e3e416388a652977a062df
* Misc dataset improvements
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/911
Differential Revision: D16536559
Pulled By: myleott
fbshipit-source-id: 7fe495054ce5b7658b1d3a43eca38c5858360236
* Correctly zero padding index in TransformerSentenceEncoder
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/912
Differential Revision: D16536561
Pulled By: myleott
fbshipit-source-id: 54c5c20a826a14f4e690770e027bcb282acdf911
* Add Adamax optimizer
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/914
Differential Revision: D16536670
Pulled By: myleott
fbshipit-source-id: 8a41c98f0fb87af6c384cdade756e3eae2978a88
* Change default --num-workers to 1
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/779
Differential Revision: D16536673
Pulled By: myleott
fbshipit-source-id: bf56e9a81d3086f3d95a3273391dc5e04ed2dbc4
* Update BPE library code
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/780
Differential Revision: D16537567
Pulled By: myleott
fbshipit-source-id: 4e18c529959935e82ea122c3a2ee477308ffcbe3
* Add RoBERTa
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/916
Differential Revision: D16537774
Pulled By: myleott
fbshipit-source-id: 86bb7b1913a428ee4a21674cc3fc7b39264067ec
* Add instructions to load RoBERTa models on PyTorch 1.0
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/921
Differential Revision: D16541025
Pulled By: myleott
fbshipit-source-id: bb78d30fe285da2adfc7c4e5897ee01fa413b2e4
* Fix RoBERTa model import (fixes #918)
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/920
Differential Revision: D16540932
Pulled By: myleott
fbshipit-source-id: b64438ad8651ecc8fe8904c5f69fa6111b4bed64
* Add missing files for RoBERTa hub interface
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/923
Differential Revision: D16541289
Pulled By: myleott
fbshipit-source-id: b3563a9d61507d4864ac6ecf0648672eaa40b5f3
* Update README.md to add top-p sampling (#783)
Summary:
Update README.md to include the recently implemented top-p/nucleus sampling.
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/783
Differential Revision: D16543974
Pulled By: myleott
fbshipit-source-id: 27c502af10ee390d29607038118a99ff0067aec4
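For readers unfamiliar with top-p/nucleus sampling, the filtering step can be sketched in a few lines. This is a simplified, list-based sketch; fairseq's actual implementation operates on tensors inside the generator:

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p, then renormalize. `probs` is a list of (token, prob)."""
    ranked = sorted(probs, key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for tok, pr in ranked:
        kept.append((tok, pr))
        cum += pr
        if cum >= p:  # nucleus found: stop here
            break
    total = sum(pr for _, pr in kept)
    return [(tok, pr / total) for tok, pr in kept]
```

Sampling then draws from the renormalized nucleus instead of the full (often long-tailed) vocabulary distribution.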
* Support different --max-positions and --tokens-per-sample
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/924
Differential Revision: D16548165
Pulled By: myleott
fbshipit-source-id: 49569ece3e54fad7b4f0dfb201ac99123bfdd4f2
* adding glue data preprocessing scripts (#771)
Summary:
1) Added GLUE data pre-processing script.
2) Updated README with usage.
TODO:
1) Release the fairseq dictionary and remove the hardcoded path.
2) Remove the hard-coded path for BPE encoding.
myleott, what do you recommend for the above TODOs?
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/771
Reviewed By: myleott
Differential Revision: D16547679
Pulled By: myleott
fbshipit-source-id: 6a6562d9b6215523d048fdf3daee63ffac21e231
* Fix tokenization (fixes #926) (#929)
Summary:
Fixes https://github.com/pytorch/fairseq/issues/926
Pull Request resolved: https://github.com/pytorch/fairseq/pull/929
Differential Revision: D16560281
Pulled By: myleott
fbshipit-source-id: 751051bcdbf25207315bb05f5bee0235d21be627
* Relicense fairseq under MIT license (#786)
Summary:
The previous BSD+PATENTS license was controversial. We have been
approved to relicense fairseq under the MIT license.
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/786
Differential Revision: D16560654
Pulled By: myleott
fbshipit-source-id: f78b1beb4f2895dd7b9bfc79f5f952a2bfb94034
* 1) replaced fstring 2) fixed error from max-positions arg
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/787
Differential Revision: D16562052
fbshipit-source-id: 640e30b2378ec917d60092558d3088a77f9741cb
* Add roberta.decode to hub interface to decode BPE (#931)
Summary:
Fixes https://github.com/pytorch/fairseq/issues/930.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/931
Differential Revision: D16562511
Pulled By: myleott
fbshipit-source-id: c4c07e2f067326b79daa547dcb3db84aeddbd555
* Wmt19 models (#767)
Summary:
Release of the WMT 19 pretrained models
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/767
Reviewed By: edunov
Differential Revision: D16472717
Pulled By: nng555
fbshipit-source-id: acf0fa3548c33f2bf2b5f71e551c782ad8c31a42
* Use commandline interface in preprocess_GLUE_tasks.sh (#937)
Summary:
Just a small fix for issue https://github.com/pytorch/fairseq/issues/936 .
Pull Request resolved: https://github.com/pytorch/fairseq/pull/937
Differential Revision: D16580263
Pulled By: myleott
fbshipit-source-id: 1777e782491c63697726e95bd555892da3fed4ec
* Update language_model README.md (#941)
Summary:
Adding a backslash in the convolutional language model training usage.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/941
Differential Revision: D16581388
Pulled By: myleott
fbshipit-source-id: 7e2e05ecf13e86cb844dc5200d49f560c63b12ff
* Roberta add classification finetuning example readme (#790)
Summary:
Added a README for IMDB classification as a tutorial on custom finetuning of RoBERTa
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/790
Reviewed By: myleott
Differential Revision: D16587877
Pulled By: myleott
fbshipit-source-id: ed265b7254e6fa2fc8a899ba04c0d2bb45a7f5c4
* Fix citation errors (#791)
Summary:
Fixing booktitle in wmt19 citation
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/791
Reviewed By: myleott
Differential Revision: D16589372
Pulled By: nng555
fbshipit-source-id: 28402784bb6ef0615e46b8d8383bfa52d79e46de
* Fix small syntax error in hub_utils.py (fixes #942)
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/944
Differential Revision: D16593568
Pulled By: myleott
fbshipit-source-id: 611bccae2ad0b8dc704c47a8a3343161010c2356
* Update PyTorch Hub interface
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/782
Differential Revision: D16542256
Pulled By: myleott
fbshipit-source-id: ea3279e7a1ce4687a5914f32b76787c419be1ffa
* Fix sampling with beam>1
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/792
Differential Revision: D16591987
Pulled By: myleott
fbshipit-source-id: d27c490ae75f80ded19226b8384f4776485dd694
* Changed tensor comparison return type from uint8 to bool (#21113)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/21113
ghimport-source-id: 9c4ba63457a72bfc41894387e0b01be3fd9a9baf
Test Plan: Imported from OSS
Differential Revision: D15552204
Pulled By: izdeby
fbshipit-source-id: a608213668649d058e22b510d7755cb99e7d0037
* Add more details for bulk BPE encoding
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/793
Differential Revision: D16603930
Pulled By: myleott
fbshipit-source-id: b302db3743db4f36c14fb0dc7f3456fe8a0079dd
* Use ==/!= to compare str, bytes, and int literals (#948)
Summary:
Identity is not the same thing as equality in Python.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/948
Differential Revision: D16608269
Pulled By: myleott
fbshipit-source-id: be203d62e7824c96c59400d1b342196adb89a839
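The distinction this commit fixes can be demonstrated directly. The identity result below is a CPython implementation detail, which is exactly why `is` against str/bytes/int literals is unsafe:

```python
a = "".join(["hello", " world"])  # built at runtime: a fresh object
b = "hello world"                 # a literal constant

print(a == b)  # True  -- `==` compares values, which is what we want
print(a is b)  # False -- `is` compares object identity, a CPython detail
```

Because interning can make `is` *appear* to work for some literals, bugs of this kind tend to surface only on particular inputs or interpreter versions.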
* Fix wmt19 links (#796)
Summary:
fix links to .tar.gz vs .tar.bz2
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/796
Reviewed By: myleott
Differential Revision: D16611740
Pulled By: nng555
fbshipit-source-id: 76210484225ed917ff14ef626845680d918948f5
* Update beam search code to support torch.bool change
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/797
Differential Revision: D16617067
Pulled By: myleott
fbshipit-source-id: 52e3aeb98d6e3b55ff9154b784028bf13eabfe38
* Update READMEs for torch.hub
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/795
Differential Revision: D16620488
Pulled By: myleott
fbshipit-source-id: 1998a9ccd8816fc7f590861fb4898f910a36bc1e
* Add single-models for WMT'19 for hub tutorial
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/800
Differential Revision: D16621509
Pulled By: myleott
fbshipit-source-id: d3e8e97d30bcafbc35c3f67cd8bbc657b6fa5fe7
* Fewer torch.hub requirements (#959)
Summary:
We will raise exceptions if these are needed and aren't available; keep only the minimum set of requirements.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/959
Differential Revision: D16623304
Pulled By: myleott
fbshipit-source-id: 8e65253742e393b527e8396a9433e64ebec9bb55
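The lazy-import pattern the summary describes can be sketched like this. The function name `spm_processor` is illustrative, not fairseq's actual API; sentencepiece is one of fairseq's optional dependencies:

```python
def spm_processor():
    """Import the optional dependency only when it's actually needed,
    and fail with an actionable message otherwise."""
    try:
        import sentencepiece as spm  # optional: not in install_requires
    except ImportError:
        raise ImportError(
            "Please install sentencepiece with: pip install sentencepiece"
        )
    return spm.SentencePieceProcessor()
```

Users who never touch the sentencepiece BPE path pay no installation cost; those who do get a clear instruction instead of a bare ImportError at module load time.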
* Avoid cast in PositionalEmbeddings to fix BLEU drop in pytorch native export
Summary:
Tracing mode doesn't generalize correctly in the positional embedding calculation, which caused a -5 BLEU drop at transformer export when using PyTorch native.
Details: The original issue was that in ensemble_export, _to_tensor(x) in scripting mode turns an integer x into the 1-d tensor torch.tensor([x]), not the 0-d tensor (scalar x) expected by the embedding. So the return value of the embedding's forward() actually had the wrong shape: when self.weights is of size [x,y], the return value should be (bsz, y, 1) but it was (bsz, 1, y), which caused problems in downstream computation. Tracing only became an issue once I used pos = timestep.view(-1)[0] to fix the shape: casting that scalar to a primitive int for use as an index is not generalizable under tracing. Thus I converted everything to tensors and replaced the advanced indexing with the index_select operator.
In summary, poorly understood behavior on both the scripting and tracing sides caused the BLEU drop. :)
Reviewed By: myleott
Differential Revision: D16623025
fbshipit-source-id: 0c7a2c3eafbd774760a5c880c6034009ee084abb
* Fix generating with a fixed prefix
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/801
Differential Revision: D16628318
Pulled By: myleott
fbshipit-source-id: 50e93bb9108afd2ba90f1edd4f34306a7c9964a4
* remove default params from args so architecture works properly
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/798
Reviewed By: myleott
Differential Revision: D16619502
Pulled By: alexeib
fbshipit-source-id: af20c90c4522458850d8f42cab001259ef4293cc
* Add doc string for Roberta.encode function
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/969
Differential Revision: D16642388
Pulled By: myleott
fbshipit-source-id: c5b1655dbddb697822feefa433f33f6bb08253ab
* fixed roberta finetuning with --find-unused-parameters on multiGPU
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/806
Differential Revision: D16649933
fbshipit-source-id: 6eeda6e2caf8019228e3efc0c27ddfcc3c4d8674
* Add back set_epoch functionality lost in RoBERTa merge
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/982
Differential Revision: D16668353
Pulled By: myleott
fbshipit-source-id: 699243d6c028c47cd0e3f801d89051b3f919b17e
* Add code to realign RoBERTa features to word-level tokenizers
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/805
Differential Revision: D16670825
Pulled By: myleott
fbshipit-source-id: 872a1a0274681a34d54bda00bfcfcda2e94144c6
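One common realignment strategy is to pool subword features over each word's span of BPE indices. A minimal sketch follows; the function name and the choice of mean pooling are illustrative, and fairseq's actual alignment code may differ:

```python
def align_features_to_words(features, word_spans):
    """Pool per-BPE-token feature vectors into per-word vectors by
    averaging over each word's (start, end) span of BPE indices."""
    aligned = []
    for start, end in word_spans:
        span = features[start:end]
        dim = len(span[0])
        aligned.append(
            [sum(vec[i] for vec in span) / len(span) for i in range(dim)]
        )
    return aligned
```

This lets downstream tools that operate on word-level tokenizations (e.g. spaCy-style pipelines) consume one vector per word.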
* Fix tests and GLUE finetuning (fixes #989)
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/991
Differential Revision: D16687970
Pulled By: myleott
fbshipit-source-id: d877fc16891a8ab97aec47a8d440baa56c2b5f46
* Added mask_fill api and some examples in README (#807)
Summary:
1) This currently works only for a single `<mask>` token; for multiple masks we may have to look more into the order of factorization.
2) It currently handles only a single BPE token.
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/807
Differential Revision: D16674509
fbshipit-source-id: 0a020030ee5df6a5115e5f85d5a9ef52b1ad9e1c
* fixed reloading from checkpoint (#811)
Summary:
Tested by starting training from (a) `roberta.large`, (b) `roberta.large.mnli`, (c) `checkpoints/checkpoint_last.pt`
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/811
Reviewed By: myleott
Differential Revision: D16689528
Pulled By: myleott
fbshipit-source-id: 849d72ede9d526c34b4753c1bffd689554d1f837
* Asr initial push (#810)
Summary:
Initial code for the speech recognition task.
Right now only one ASR model is added - https://arxiv.org/abs/1904.11660
Unit test command:
python -m unittest discover tests
Also ran model training with this code and obtained
5.0 test_clean | 13.4 test_other
on LibriSpeech with pytorch/audio features
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/810
Reviewed By: cpuhrsch
Differential Revision: D16706659
Pulled By: okhonko
fbshipit-source-id: 89a5f9883e50bc0e548234287aa0ea73f7402514
* Integrate with Apache Arrow/Plasma in-memory store for large datasets (#995)
Summary:
Datasets with many examples can generate very large indexes in TokenBlockDataset (and possibly elsewhere). When using `--num-workers>0` these indexes are pickled and transferred via a multiprocessing pipe, which is slow and can fail if the index grows beyond 4GB (~0.5B examples). Apache Arrow has an in-memory store called Plasma that will offload these arrays to shared memory, which both reduces duplication of the data and avoids needing to pickle.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/995
Differential Revision: D16697219
Pulled By: myleott
fbshipit-source-id: 1b679ee5b3d2726af54ff418f6159a3671173fb8
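The diff uses Arrow's Plasma store, but the core zero-copy idea — place the big array in shared memory once and pass workers only a small handle — can be sketched with the standard library's multiprocessing.shared_memory (Python 3.8+). Shown here within a single process for brevity:

```python
from multiprocessing import shared_memory

# Producer: put a large index into shared memory once...
index = bytes(range(256)) * 4096  # stand-in for a big block index (~1 MB)
shm = shared_memory.SharedMemory(create=True, size=len(index))
shm.buf[:len(index)] = index

# ...and hand workers only the tiny, picklable *name*, not the data.
handle = shm.name

# Worker: attach by name -- no copy, nothing large crosses the pipe.
view = shared_memory.SharedMemory(name=handle)
roundtrip = bytes(view.buf[:len(index)])

view.close()
shm.close()
shm.unlink()
```

Since only the handle is pickled, the 4GB pickle limit mentioned in the summary never comes into play, and workers share one copy of the index.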
* replace 'mkdir' with 'mkdir -p' (#997)
Summary:
Allow the shell script to create subdirectories with the -p flag. Amends the README too.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/997
Differential Revision: D16710813
Pulled By: myleott
fbshipit-source-id: 89abefa27e8fac99d212fc9b7b0dbc3690043ba0
* added superglue dev set results to readme
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/815
Differential Revision: D16733633
fbshipit-source-id: 0a5029e41b6dbb9fb28e9703ad057d939d489d90
* MacOS requires c++ flag (#1000)
Summary:
To install on MacOS, `-stdlib=libc++` needs to be specified.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/1000
Differential Revision: D16733819
Pulled By: myleott
fbshipit-source-id: 7a1ed11e2b4e1071e61c64c379c84f72e02ad2b5
* added sentence ranking task and loss (#809)
Summary:
This task and loss are used for sentence ranking and multiple choice tasks such as RACE
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/809
Reviewed By: myleott
Differential Revision: D16715745
Pulled By: jingfeidu
fbshipit-source-id: cb4d1c7b26ebb3e2382449ba51af5745ef56f30f
* Fix Python 3.5 compat
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1005
Differential Revision: D16751489
Pulled By: myleott
fbshipit-source-id: 6e372ac23643e32a3791044c13f4466bdc28f049
* Add WSC task and criterion
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1004
Differential Revision: D16751443
Pulled By: myleott
fbshipit-source-id: f70acd6c7be6d69da45b5b32fe4c4eff021539ab
* Fix torch.hub for MNLI
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1006
Differential Revision: D16753078
Pulled By: myleott
fbshipit-source-id: 970055632edffcce4e75931ed93b42a249120a4a
* Update --restore-file logic (partially fixes #999)
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1007
Differential Revision: D16762490
Pulled By: myleott
fbshipit-source-id: d67137bcf581887850323d188bb4ea643a35ac9e
* Remove LAMB optimizer (at least until we can test it more)
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1008
Differential Revision: D16763315
Pulled By: myleott
fbshipit-source-id: d4bad8384eec273f2d5de4ed29fb8d158ab9187c
* Lint
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/817
Differential Revision: D16762905
Pulled By: myleott
fbshipit-source-id: d920595bec44ed26b72dfc6fbc15c0aa107b4e56
* Minor fixes for RACE finetuning (#818)
Summary:
- remove unnecessary extra spaces in RACE data in preprocessing
- fix finetuning instructions (add `--truncate-sequence` and add `--dropout` params)
- close file handle in SentenceRankingTask
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/818
Differential Revision: D16770055
Pulled By: myleott
fbshipit-source-id: 2c80084e92cdf8692f2ea7e43f7c344c402b9e61
* ignore files starting with . e.g. .ipynb_checkpoints (#819)
Summary:
.ipynb_checkpoints folder in models folders crashed the importlib
now there is a check for this
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/819
Differential Revision: D16772192
Pulled By: myleott
fbshipit-source-id: 01c956aef4ed312bc7645c31c83dbf98af89d931
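The check can be as simple as filtering the directory listing before handing names to importlib (a sketch; the real registry code differs):

```python
def importable_modules(filenames):
    """Filter a directory listing (e.g. os.listdir('models')) down to
    importable module names, skipping hidden entries such as
    .ipynb_checkpoints and private files like __init__.py."""
    return [
        f[: -len(".py")]
        for f in sorted(filenames)
        if f.endswith(".py") and not f.startswith((".", "_"))
    ]
```

Anything Jupyter, editors, or the OS drop into the folder (checkpoint dirs, swap files, .DS_Store) is then ignored instead of crashing module discovery.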
* fix cosine scheduler docstring
Summary: as title
Reviewed By: myleott
Differential Revision: D16773845
fbshipit-source-id: 2d10e197c31f94d894430559327289a4d03e33f7
* added readme code for inference with GLUE finetuned model
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/820
Differential Revision: D16783469
fbshipit-source-id: d5af8ba6a6685608d67b72d584952b8e43eabf9f
* Add Commonsense QA task
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1014
Differential Revision: D16784120
Pulled By: myleott
fbshipit-source-id: 946c0e33b594f8378e4ab6482ce49efcb36e1743
* Add fairseq-validate
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/765
Differential Revision: D16763357
Pulled By: myleott
fbshipit-source-id: 758b03158e486ee82786e2d5bf4e46073b50c503
* Updates for PyTorch 1.2 masking/bool behavior
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/821
Differential Revision: D16790120
Pulled By: myleott
fbshipit-source-id: 2fb5070172636561d08596a29f08c93df07548bf
* Fix tests
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/822
Differential Revision: D16800078
Pulled By: myleott
fbshipit-source-id: b86e08e01f2fe13c64b77f1d23a5f6800f252bf7
* v0.7.2 -> v0.8.0 (#1017)
Summary:
Changelog:
- Relicensed under MIT license
- Add RoBERTa
- Add wav2vec
- Add WMT'19 models
- Add initial ASR code
- Changed torch.hub interface (`generate` renamed to `translate`)
- Add `--tokenizer` and `--bpe`
- f812e52: Renamed data.transforms -> data.encoders
- 654affc: New Dataset API (optional)
- `47fd985`: Deprecate old Masked LM components
- `5f78106`: Set mmap as default dataset format and infer format automatically
- Misc fixes for sampling
- Misc fixes to support PyTorch 1.2
Pull Request resolved: https://github.com/pytorch/fairseq/pull/1017
Differential Revision: D16799880
Pulled By: myleott
fbshipit-source-id: 45ad8bc531724a53063cbc24ca1c93f715cdc5a7
* Update READMEs
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/823
Differential Revision: D16804995
Pulled By: myleott
fbshipit-source-id: abac5dc0ed6b7bfe2309ba273456e54b37340b2c
* initial light and dynamic convolution kernels (#547)
Summary:
CUDA code for light/dynamicconv kernels, including pytorch modules. Modules can be built by running setup.py in each respective folder, and can then be imported and used like any other module.
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/547
Reviewed By: myleott, shubho
Differential Revision: D15703660
Pulled By: nng555
fbshipit-source-id: e9c913753be3a1cd571965f7200df6678b644520
* added effcient wsc task/criterion for winogrande (#825)
Summary:
1) So far getting `78%` on the WinoGrande validation set, compared to `63.5%` in the paper.
2) Will update the README once everything is finalized.
Questions:
1) Should I call this `binary_wsc_task` instead of `winogrande`, to be less dataset-specific and more generic?
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/825
Differential Revision: D16810159
fbshipit-source-id: cfde73561fa4caaaa63a4773c0aecd12ce1fa518
* Update README
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/826
Differential Revision: D16830402
Pulled By: myleott
fbshipit-source-id: 25afaa6d9de7b51cc884e3f417c8e6b349f5a7bc
* Backward reranking public (#667)
Summary:
Implementation of noisy channel model reranking for release with paper
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/667
Reviewed By: michaelauli
Differential Revision: D15901665
Pulled By: nng555
fbshipit-source-id: 2de2c518be8e5828ffad72db3e741b0940623373
* Update README
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/827
Differential Revision: D16833252
Pulled By: myleott
fbshipit-source-id: 8eded8cc651002dfd60869fc2383d305ed335d3a
* BMUF Resetting local state param
Summary:
BMUF
1) Resetting BMUF parameters after warmup.
2) Resetting local param state after warmup.
3) Allowing user to pass block momentum value instead of gpu derived Block Momentum.
Reviewed By: skritika, mrshenli
Differential Revision: D16692026
fbshipit-source-id: d02eaf29d0e4b37007418166ec937d4bf5fe6aca
* added hf bert bpe
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/829
Differential Revision: D16856693
fbshipit-source-id: 545bbf4815f5c40e72a6ed241312a51dc90e34a1
* added check in token block dataset for multiple consecutive blank lines
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/830
Differential Revision: D16861799
fbshipit-source-id: d85deaf78ec5b9c23eafd4145a96252e3901fa22
* implement tri-stage lr_scheduler (#1028)
Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/1028
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/831
The tri-stage LR scheduler consists of 3 stages: 1. warmup; 2. hold; 3.
(exponential) decay; used in https://arxiv.org/pdf/1904.08779.pdf
Reviewed By: myleott
Differential Revision: D16806206
fbshipit-source-id: 40e472ec382449a0fb711f8ee980f14d27d2114a
* Fix bug (the returned value has a dimension mismatch) in label-smoothed-cross-entropy for MoE (#1037)
Summary:
MoE encounters a dimension mismatch bug when using label-smoothed cross entropy as the criterion, which occurs at https://github.com/pytorch/fairseq/blob/master/fairseq/tasks/translation_moe.py#L125. This fixes the bug.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/1037
Differential Revision: D16892674
Pulled By: myleott
fbshipit-source-id: a73bc03d2280356667d02422d22ad11d968d0c65
* remove shlex.quote in scripts/spm_train.py (#972)
Summary:
to resolve the issue https://github.com/pytorch/fairseq/issues/971
Pull Request resolved: https://github.com/pytorch/fairseq/pull/972
Differential Revision: D16892827
Pulled By: myleott
fbshipit-source-id: baf277961f1e292f4593eefe31e3541aa9d0d8c4
* add constrains when checking multiple consecutive blank lines (#1031)
Summary:
The current check causes a runtime error on some standard datasets (e.g. wikitext-103).
Details:
After preprocessing the wikitext-103 folder with the current master branch, running fairseq-train gives the following error:
```bash
Traceback (most recent call last):
File "/home/trinkle/.local/bin/fairseq-train", line 11, in <module>
load_entry_point('fairseq', 'console_scripts', 'fairseq-train')()
File "/data/git/Transformer/fairseq/fairseq_cli/train.py", line 321, in cli_main
main(args)
File "/data/git/Transformer/fairseq/fairseq_cli/train.py", line 46, in main
task.load_dataset(valid_sub_split, combine=False, epoch=0)
File "/data/git/Transformer/fairseq/fairseq/tasks/language_modeling.py", line 167, in load_dataset
break_mode=self.args.sample_break_mode, include_targets=True,
File "/data/git/Transformer/fairseq/fairseq/data/token_block_dataset.py", line 54, in __init__
"Found multiple blank lines in the dataset, please remove them"
AssertionError: Found multiple blank lines in the dataset, please remove them (eg. cat -s raw.txt) and preprocess the data again.
```
It's because these datasets have multiple blank lines. The assertion was added in https://github.com/pytorch/fairseq/commit/851c022610b27da3beaa4e40a6834b5fb3b44f44; however, a hard assertion is not a good way to handle this.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/1031
Differential Revision: D16892942
Pulled By: myleott
fbshipit-source-id: 90c41b7d98a7b78f506bb57320f9f6b901e05d5b
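Until the assertion is relaxed, the `cat -s` cleanup the error message suggests can also be done in Python (a small sketch):

```python
def squeeze_blank_lines(lines):
    """Collapse runs of blank lines to a single blank line -- the same
    cleanup that `cat -s raw.txt` performs."""
    out, prev_blank = [], False
    for line in lines:
        blank = line.strip() == ""
        if blank and prev_blank:
            continue
        out.append(line)
        prev_blank = blank
    return out
```

Run this over the raw text before `fairseq-preprocess` and the TokenBlockDataset assertion no longer fires.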
* Add instructions to resume training from released RoBERTa models (fixes #1034)
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1041
Differential Revision: D16904073
Pulled By: myleott
fbshipit-source-id: 22e5e25a15f7a0b6f2d827d98c953a6cec07610e
* Small fixes
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/835
Differential Revision: D16904038
Pulled By: myleott
fbshipit-source-id: 2c9d0b913f8d688297ac80fcabd905bd1397f66a
* Back out "[fairseq][PR] Fix bug (the returned value has a dimension mismatch) in label-smoothed-cross-entropy for MoE" (#837)
Summary:
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/837
Original commit changeset: a73bc03d2280
Differential Revision: D16904372
fbshipit-source-id: b4c4047b2686ba47258cdf0783059726134c920a
* Fix method has same name as property
Summary:
Training sometimes fails because `self.collater` is both a method and a property on AsrDataset
https://github.com/pytorch/fairseq/issues/1036
Reviewed By: jcai1
Differential Revision: D16919945
fbshipit-source-id: b34ba54e4dae315b7c723996610a348a8e3031af
* Give path when checkpoint can't be found (#1040)
Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/1040
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/836
Reviewed By: myleott, liezl200
Differential Revision: D16889252
fbshipit-source-id: 45a1b6c1217fb099f0350096e38e1c7d83ea0a64
* vggblock support without pooling and pooling_kernel_size missing self (#839)
Summary:
1) VggBlock was not supported if the pooling kernel size was None.
2) Since we modify the pooling kernel size using _pair, we should use self.pooling_kernel_size. But I agree it doesn't matter in practice, as PyTorch is robust to this.
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/839
Differential Revision: D16934112
Pulled By: okhonko
fbshipit-source-id: b6b95163b0e7f7203d76d535f01a41912382bdc3
* Multiset (#838)
Summary:
Adds ability to tag individual examples with the names of their datasets, along with some minor miscellaneous fixes and improvements
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/838
Differential Revision: D16919175
Pulled By: alexeib
fbshipit-source-id: 4bf493299645bae63f3ee6382e15f18a9f73666c
* Parameterized criterions (#808)
Summary:
Support criterion with parameters, such as AutoSegmentationCriterion (ASG) used in wav2letter which has a transition matrix parameter. This is needed to integrate wav2letter's ASG into PySpeech.
With this diff, parameters in criterions will be:
(1) updated by optimizers, with a configurable learning rate
(2) saved and loaded from checkpoints, preserving backward compatibility for criterions without parameters
(3) synchronized across nodes in distributed training.
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/808
Reviewed By: jcai1
Differential Revision: D16934097
Pulled By: okhonko
fbshipit-source-id: 121ec9382459385c6f9cbef3a8274bec1a434038
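Point (1) — letting the optimizer update criterion parameters alongside model parameters — can be sketched framework-free. Class and function names here are illustrative; in fairseq the parameters would be nn.Parameter tensors:

```python
class ASGLikeCriterion:
    """Toy stand-in for a criterion with learnable state, e.g. ASG's
    label-transition matrix."""
    def __init__(self, num_labels):
        # one learnable scalar per label pair (a tensor in practice)
        self.transitions = [[0.0] * num_labels for _ in range(num_labels)]

    def parameters(self):
        # flat view the trainer can hand to the optimizer
        for row in self.transitions:
            yield from row

def trainable_parameters(model_params, criterion):
    """What the diff enables: the optimizer sees model AND criterion
    parameters, so both get updated, checkpointed, and synced."""
    return list(model_params) + list(criterion.parameters())
```

Criterions without parameters simply contribute an empty list, which is how backward compatibility with existing checkpoints is preserved.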
* fix string format to work in python 3.5 (#1050)
Summary:
change string format in fairseq/data/subsample_dataset.py#20
Pull Request resolved: https://github.com/pytorch/fairseq/pull/1050
Differential Revision: D16946060
Pulled By: okhonko
fbshipit-source-id: 0eabf22e7ffd4f658b6d18c87dc6e59c81a355c7
* Misc changes
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/840
Differential Revision: D16947645
Pulled By: myleott
fbshipit-source-id: e869789bc22bbf5cb08d9adfa44f9fc09b3805af
* Add links to cuda models (#828)
Summary:
Add links to pre-trained cuda models in pay less attention
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/828
Reviewed By: michaelauli
Differential Revision: D16833577
Pulled By: nng555
fbshipit-source-id: 1556aa77fd87ea259812de8ef65963257c370f9b
* Fix year in noisy channel citation (#842)
Summary:
2018->2019
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/842
Differential Revision: D16973530
Pulled By: nng555
fbshipit-source-id: 00207b79821ac0257a53a0581a84582130e1bff5
* wav2vec everstore support
Summary: changes for internal support
Differential Revision: D16646887
fbshipit-source-id: ac5bf6c32901819726249422324eae32a0a6e148
* Cythonize token block dataset (#834)
Summary:
Cythonized token block dataset code, it's `> 100x` faster. Token block for entire `bookwiki+CC+stories+openweb` is just ~`39.9` seconds.
TODO:
1) I think I can make it 2x faster.
2) cleanup.
EDIT History:
~~First pass at parallelizing `token_block_dataset`. The code feels somewhat complicated and cluttered.
This is 2-3x faster though on my tests on `bookwiki` dataset with both `complete` and `complete_doc` modes.
myleott Can you take a look for correctness as I am still not 100% sure that I am not missing corner cases.~~
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/834
Test Plan:
Imported from GitHub, without a `Test Plan:` line.
Test workflow: f133816198
Reviewed By: myleott
Differential Revision: D16970257
Pulled By: myleott
fbshipit-source-id: ec45a308193c9e9f3e7075336c15df4723228d6f
* Suppress leaked semaphore warnings
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/844
Differential Revision: D16985131
Pulled By: myleott
fbshipit-source-id: 66ba3b9aa0cdf329a1e38fc09786f34906afdb43
* fix cython dependency in the setup (#847)
Summary:
Fixes broken build for `pytext` https://github.com/pytorch/fairseq/commit/4fc39538aec5141aa41f5d6d7dc0097e7c0f7b48
Earlier versions of setuptools required `cython` to be installed before even starting setup.py. This one fixes it.
More details: https://github.com/pypa/setuptools/blob/master/CHANGES.rst#180
and https://stackoverflow.com/questions/37471313/setup-requires-with-cython
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/847
Differential Revision: D16997450
fbshipit-source-id: 5f65026c228a1b94280ca73937078ee3e21ce4f8
* wav2vec everstore support fix
Summary: fixes some merge issues that prevented wav2vec from training properly
Reviewed By: myleott
Differential Revision: D16981120
fbshipit-source-id: cad39aaf2f44daabcbafe7b4e8735d055b3842a7
* installing numpy headers for cython
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/848
Differential Revision: D17060283
fbshipit-source-id: c7e61cae76a0566cc3e2ddc3ab4d48f8dec9d777
* Minor update of README.md of language model example (#1063)
Summary:
With this white space, the command might fail.
```
fairseq-preprocess: error: unrecognized arguments:
zsh: command not found: --destdir
```
Pull Request resolved: https://github.com/pytorch/fairseq/pull/1063
Differential Revision: D17072516
Pulled By: myleott
fbshipit-source-id: 68bb9d05b40b215b18aceac2bff3f5ec1ef2f537
* Minor cleanup for setup.py
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1078
Differential Revision: D17072514
Pulled By: myleott
fbshipit-source-id: 69a8c8c9cc7caa7e04c414329a5d79e6e1a6621c
* use numpy function for filter by size when possible (#845)
Summary:
For the general masked language modeling use case, this is much faster (`3 minutes vs 1 sec`).
Let me know what you think about it myleott, if you don't like all the special case checking, we can think of reorganizing the dataset APIs to always have `sizes` as property calculated in `__init__`.
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/845
Reviewed By: myleott
Differential Revision: D16993769
Pulled By: myleott
fbshipit-source-id: 161bba62af2965190c07c47e838ee967cb886e88
* Fix multi-gpu training (fixes #1088)
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1089
Differential Revision: D17108918
Pulled By: myleott
fbshipit-source-id: 818c77a5bbf3b146028991aca64d79b93f144b28
* Adopt Contributor Covenant
Summary:
In order to foster healthy open source communities, we're adopting the
[Contributor Covenant](https://www.contributor-covenant.org/). It has been
built by open source community members and represents a shared understanding of
what is expected from a healthy community.
Reviewed By: josephsavona, danobi, rdzhabarov
Differential Revision: D17104640
fbshipit-source-id: d210000de686c5f0d97d602b50472d5869bc6a49
* set numpy seed explicitly + other minor fixes (#850)
Summary:
Not setting the numpy seed explicitly at the beginning was an extremely annoying bug to find. It caused different GPUs to have a different view of the data if some randomization was used in the dataset (e.g. subsample dataset).
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/850
Differential Revision: D17085006
Pulled By: alexeib
fbshipit-source-id: 62bb2116369fb703df878e6bc24c06f1ea4e75a0
* add missing colorize dataset
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/851
Differential Revision: D17145769
Pulled By: alexeib
fbshipit-source-id: 9dd26799d044ae5386e8204a129b5e3fc66d6e85
* Improve support for `python setup.py build_ext --inplace`
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/852
Differential Revision: D17147452
Pulled By: myleott
fbshipit-source-id: 5fd9c7da3cc019c7beec98d41db1aef1329ee57a
* Cleaner handling of numpy-based extensions in setup.py
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/853
Differential Revision: D17147879
Pulled By: myleott
fbshipit-source-id: b1f5e838533de62ade52fa82112ea5308734c70f
* fixed numpy based size filtering (#854)
Summary:
This bug got introduced in my [commit](https://github.com/fairinternal/fairseq-py/commit/9624f9651478bcb88022decf7e1b0685b410133b) for fast numpy based size filtering.
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/854
Differential Revision: D17150350
fbshipit-source-id: cb564119543e116d6a17784d1c22e9bce7059a0c
* Fix an error in the command about Hierarchical Neural Story Generation (#1099)
Summary:
When I try to reproduce the experiment in _Hierarchical Neural Story Generation_, I found the command about generation cannot be executed.
It said that **fairseq-generate: error: unrecognized arguments: --sampling-temperature 0.8**
In the document, I find:
```
--temperature temperature for generation
Default: 1.0
```
And I don't find a parameter named `--sampling-temperature`, so I think the parameter `--sampling-temperature` should be changed to `--temperature`
Pull Request resolved: https://github.com/pytorch/fairseq/pull/1099
Differential Revision: D17163065
Pulled By: myleott
fbshipit-source-id: 25c430eeee4703f8ec30353825ffec4bb973da0d
* added cython to install_requires
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/856
Reviewed By: myleott
Differential Revision: D17162411
Pulled By: myleott
fbshipit-source-id: e70ecc802398bbba2b5326e9700f2121c422fd18
* Fix multilingual translation bug for to-many case
Summary:
The logic for adding the decoder-side language token was implemented incorrectly.
The way we inject the language token is by replacing the eos symbol with language token symbol. However, the parameter for source / target eos symbol was not set correctly.
Reviewed By: tangyuq
Differential Revision: D17129108
fbshipit-source-id: 6fae385b787370656fd7ca7ab74e6bb91fe5463b
* Return predicted token for RoBERTa filling mask
Summary:
Added the `predicted_token` to each `topk` filled output item
Updated RoBERTa filling mask example in README.md
Reviewed By: myleott
Differential Revision: D17188810
fbshipit-source-id: 5fdc57ff2c13239dabf13a8dad43ae9a55e8931c
* Average local optimizer param after warmup and during bmuf sync
Summary: We have seen that averaging the local param instead of doing reset or broadcast after warmup improves the WER.
Reviewed By: skritika
Differential Revision: D16739278
fbshipit-source-id: 75033d2d25f9a88fd6dd325d0d9d4c856d22d947
* added fast stats sync option (#858)
Summary:
Added `--fast-stat-sync` option.
This avoids pickle and achieves `~7%` more `wps` on 16 nodes.
It is less flexible as it aggregates only basic stats and ignores the aggregation function defined by the criterion.
Let me know what you think myleott
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/858
Differential Revision: D17398770
fbshipit-source-id: 36261a1d970e67deeda8211af8f009ef9b4f9c14
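The idea behind `--fast-stat-sync` can be sketched without any distributed machinery: pack an agreed-upon, fixed set of basic stats into a flat numeric vector so a single all-reduce can sum it across workers, instead of pickling arbitrary per-criterion dictionaries. A minimal sketch; the key names and helper functions below are illustrative, not fairseq's actual API, and the all-reduce is simulated by a plain sum:

```python
# Illustrative fixed stat schema; every worker must agree on the key order.
STAT_KEYS = ["loss", "ntokens", "nsentences", "sample_size"]

def pack_stats(stats):
    # Dict -> flat vector in the fixed key order (missing keys become 0).
    return [float(stats.get(k, 0)) for k in STAT_KEYS]

def allreduce_sum(vectors):
    # Stand-in for a single torch.distributed.all_reduce over the vectors.
    return [sum(col) for col in zip(*vectors)]

def unpack_stats(vec):
    # Flat vector -> dict, reversing pack_stats.
    return dict(zip(STAT_KEYS, vec))

# Two workers' local stats, aggregated without pickle:
worker_a = {"loss": 2.0, "ntokens": 100, "nsentences": 4, "sample_size": 100}
worker_b = {"loss": 3.0, "ntokens": 150, "nsentences": 6, "sample_size": 150}
agg = unpack_stats(allreduce_sum([pack_stats(worker_a), pack_stats(worker_b)]))
```

The flexibility loss mentioned above is visible here: any stat outside the fixed schema, and any custom aggregation other than summation, is simply dropped.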
* Update README.md
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1140
Differential Revision: D17431506
Pulled By: myleott
fbshipit-source-id: b47dae303d7e76daa5b49795476b5e48d7b090ad
* Fix link to RACE fine-tuning instructions.
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1125
Differential Revision: D17431557
Pulled By: myleott
fbshipit-source-id: f712e5355d8dbb0a8f1170674d62e2b6880295b4
* don't project masked tokens for mlm loss (#859)
Summary:
This saves ~4-5gb gpu memory while training roberta large with `seq_len=512`.
I am able to fit `--max-sentences=16` on `volta32gb` for `roberta-large`
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/859
Differential Revision: D17435814
fbshipit-source-id: 2663909768fac0ef0102107613770ee01b1f8c00
* Minor fix to make adafactor work for >2d conv kernels (#1122)
Summary:
Missing `.unsqueeze(-1)` in line 124.
Without this change we'll encounter a runtime error for >2d convolutional kernels; with this fix, we're applying adafactor's 2d logic to the two final dimensions.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/1122
Differential Revision: D17431662
Pulled By: myleott
fbshipit-source-id: e7435e77270a9252f75f01b2457ef0048f5bcf36
* Add autogenerated cython files to gitignore (#860)
Summary:
`python setup.py build_ext --inplace` generates C++ source files directly in the Python source tree. They should most likely be ignored by git.
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/860
Differential Revision: D17460597
Pulled By: jma127
fbshipit-source-id: 72a29d438ebb57627b68ec7e9a2a77c8a36f1c21
* Add cython language_level hints
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1147
Differential Revision: D17468447
Pulled By: myleott
fbshipit-source-id: 0dbac04b92c8df74ad991d5e92cd02036d662369
* Add dataset class for weighted sampling with replacement. (#861)
Summary:
As discussed with Naman earlier today. Weighted sampling with
replacement can be done on a per-epoch basis using `set_epoch()`
functionality, which generates the samples as a function of random seed
and epoch.
Additionally, `FairseqTask` needs to set the starting epoch for the
dataset at the very beginning of iterator construction.
Not yet implemented is the per-epoch iterator construction, which
is necessary to actually regenerate the batches for each epoch.
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/861
Differential Revision: D17460687
Pulled By: jma127
fbshipit-source-id: 1c2a54f04ac96b3561c100a6fd66a9fccbe3c658
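The `set_epoch()` mechanism described above can be sketched as follows: samples are a deterministic function of (seed, epoch), so every worker regenerates the same resampled view for a given epoch without any communication. A hedged sketch, not fairseq's actual `ResamplingDataset` implementation; the class and seeding scheme here are illustrative:

```python
import random

class ResamplingSketch:
    """Weighted sampling with replacement, regenerated per epoch."""

    def __init__(self, items, weights, seed=0):
        self.items, self.weights, self.seed = items, weights, seed
        self.set_epoch(0)

    def set_epoch(self, epoch):
        # Deterministic function of (seed, epoch): every worker computes
        # the same resampled index list for the same epoch.
        rng = random.Random(self.seed * 100003 + epoch)
        self.indices = rng.choices(range(len(self.items)),
                                   weights=self.weights,
                                   k=len(self.items))

    def __getitem__(self, i):
        return self.items[self.indices[i]]

    def __len__(self):
        return len(self.indices)

ds1 = ResamplingSketch(["a", "b", "c"], weights=[1, 1, 8], seed=7)
ds2 = ResamplingSketch(["a", "b", "c"], weights=[1, 1, 8], seed=7)
ds1.set_epoch(3)
ds2.set_epoch(3)
```

As noted in the summary, the task still needs to set the starting epoch at iterator construction time, and to rebuild batches each epoch for the resampling to actually take effect.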
* added multilingual masked LM training (#849)
Summary:
The multilingual-RoBERTa training is working with aconneau XLM data.
Two pieces remaining:
1) `XLM` limits each batch to examples from the same language. I am not 100% sure about the reason for that, but it should be easy to implement: basically we can add a `batch_by_size_and_language` function instead of the default `batch_by_size`. If it's not critical, I would want to leave it out, as that keeps the code very clean and simple.
2) `sample_ratio` in `ConcatDataset` works with `int` by tiling the datasets based on the ratio. Currently I am handling it by rounding off the ratio to the `first decimal` and then multiplying by `10`. We can see if such simple heuristics are good enough; there are other options (we can talk about them offline).
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/849
Differential Revision: D17162460
fbshipit-source-id: d967f3d872f7a1f0aa4ea418bd362b68af9e432f
* Update README.race.md
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1155
Differential Revision: D17509762
Pulled By: myleott
fbshipit-source-id: 4de535289c1f35abff0d8142d8580f3ede039f47
* Remove extraneous call to RNG in multi-GPU code path
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/865
Differential Revision: D17510276
Pulled By: myleott
fbshipit-source-id: 24119402ad5fe95a1312fadb77bafe49a9197c6b
* fixed train valid epoch iter
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/866
Differential Revision: D17517115
fbshipit-source-id: fd6921e642c99e37fce6ad58b24c93e70a5364e5
* Miscellaneous documentation improvements: (#868)
Summary:
- More clearly document the correspondence between FairseqAdam and torch.optim.AdamW
- Add ResamplingDataset to Sphinx docs
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/868
Differential Revision: D17523244
Pulled By: jma127
fbshipit-source-id: 8e7b34b24889b2c8f70b09a52a625d2af135734b
* fixed corner case in mlm criterion when all tokens get masked
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/869
Reviewed By: myleott
Differential Revision: D17531776
Pulled By: myleott
fbshipit-source-id: 349c9449a0a7db5d3bb8449561302d4220cfa60c
* Issue 1146: Minor fix to roberta pre-training readme (#1165)
Summary:
This is to make this instructions a little more generalizable, since in some systems, bash will parse the spaces within quotes
Addressing https://github.com/pytorch/fairseq/issues/1146
Pull Request resolved: https://github.com/pytorch/fairseq/pull/1165
Differential Revision: D17547810
Pulled By: myleott
fbshipit-source-id: 5a026d42f678126b5ca8bc4477ba8f26ea549dcd
* PR for Issue #1154: Two comments in lstm.py seem to be incorrect
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1185
Differential Revision: D17602249
Pulled By: lematt1991
fbshipit-source-id: bd515b7d2ebce8181a80684f45223a8db7c7e3cd
* Update getting_started.rst (#1188)
Summary:
Hi,
I think there is a minor mistake in the doc. `--distributed-no-spawn` argument is needed for distributed training on multiple machines without `slurm`. Otherwise, the program will start 8 jobs on each GPU, when `nproc_per_node=8`.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/1188
Differential Revision: D17627778
Pulled By: myleott
fbshipit-source-id: 35ab6b650dc1132d7cb2d150e80d2ebf0caf3e69
* Explain the language modelling format in RoBERTa pretraining readme
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1174
Differential Revision: D17627767
Pulled By: myleott
fbshipit-source-id: 7b5f77146b8776a5967699e430136039c066c851
* Fixing BMUF warmup and sync strategy
Summary:
BMUF sync started happening even before warmup was done.
This diff fixes the behavior and does BMUF sync once warmup is done, or immediately if warmup is zero.
TODO: write a unit test case so that these problems can be figured out faster.
Reviewed By: jay-mahadeokar
Differential Revision: D17356277
fbshipit-source-id: 21500e6ed1225b97794e4ee203e5d7d04a2840f8
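The fixed warmup/sync interaction boils down to one predicate evaluated per update. A minimal sketch of that condition, with illustrative parameter names (not the actual fairseq BMUF code):

```python
def should_bmuf_sync(num_updates, warmup_iters, sync_iter):
    """Return True when a BMUF model-averaging sync should run.

    No sync during warmup; once warmup is done (or immediately, if
    warmup_iters is zero), sync every sync_iter updates.
    """
    if num_updates < warmup_iters:
        return False  # still in warmup: skip model averaging
    return (num_updates - warmup_iters) % sync_iter == 0

# With warmup_iters=100 and sync_iter=50, syncs fire at 100, 150, 200, ...
```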
* Levenshtein Transformer paper code
Summary:
Code for our NeurIPS paper [Levenshtein Transformer](https://arxiv.org/abs/1905.11006)
* Added Levenshtein Transformer model, task and criterion class
* Added iterative NAT Transformer, insertion Transformer and CMLM Transformer model class for baselines
* Add an option for prepending BOS to dictionary class and translation task class
Reviewed By: myleott
Differential Revision: D17297372
fbshipit-source-id: 54eca60831ae95dc721c2c34e882e1810ee575c7
* Fixing example of batched predictions for Roberta (#1195)
Summary:
For batched predictions in Roberta, the README was giving an example that was pretty unclear. After a thorough discussion with ngoyal2707 in issue https://github.com/pytorch/fairseq/issues/1167 he gave a clear example of how batched predictions were supposed to be done. Since I spent a lot of time on this inconsistency, I thought that it might benefit the community if his solution was in the official README 😄 !
For for details, see issue https://github.com/pytorch/fairseq/issues/1167
Pull Request resolved: https://github.com/pytorch/fairseq/pull/1195
Differential Revision: D17639354
Pulled By: myleott
fbshipit-source-id: 3eb60c5804a6481f533b19073da7880dfd0d522d
* RoBERTa now supported on TPU and TensorFlow via transformers library
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1197
Differential Revision: D17651374
Pulled By: myleott
fbshipit-source-id: 5feb986de1e682eb83c4479f419ad51325718572
* Implementation of the WeCNLP abstract "Cross+Self-Attention for Transformer Models" (#1097)
Summary:
This PR implements a new attention module which combines cross-attention (encoder-decoder attention) and the decoder self-attention. This work was accepted as an abstract at WeCNLP 2019 (https://www.wecnlp.ai/wecnlp-2019).
Cross+Self-Attention reduces the amount of parameter and increases the inference speed without any degradation in translation quality.
More details can be found in the attached [abstract](https://github.com/pytorch/fairseq/files/3561282/paper.pdf)
Pull Request resolved: https://github.com/pytorch/fairseq/pull/1097
Differential Revision: D17653168
Pulled By: myleott
fbshipit-source-id: deb834c2c78a229d7418ffbfea20ba3ce252991c
* fix typo in README of examples/translation
Summary: Pull Request resolved: https://github.com/pytorch/fairseq/pull/1200
Differential Revision: D17659658
Pulled By: myleott
fbshipit-source-id: 1863e6d60a439dbb7e71e5da68817c9d53649737
* Fix torch.hub to not depend on libnat
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/878
Differential Revision: D17661768
Pulled By: myleott
fbshipit-source-id: 1e4c5f09eb14c40d491ca2459fd2adb8382fb6d2
* Implementation of the paper "Jointly Learning to Align and Translate with Transformer Models" (#877)
Summary:
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/877
This PR implements guided alignment training described in "Jointly Learning to Align and Translate with Transformer Models (https://arxiv.org/abs/1909.02074)".
In summary, it allows for training selected heads of the Transformer Model with external alignments computed by Statistical Alignment Toolkits. During inference, attention probabilities from the trained heads can be used to extract reliable alignments. In our work, we did not see any regressions in the translation performance because of guided alignment training.
Pull Request resolved: https://github.com/pytorch/fairseq/pull/1095
Differential Revision: D17170337
Pulled By: myleott
fbshipit-source-id: daa418bef70324d7088dbb30aa2adf9f95774859
* extract FP16OptimizerMixin for share the same logic in PyText (#1180)
Summary:
Pull Request resolved: https://github.com/pytorch/fairseq/pull/1180
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/874
extract FP16OptimizerMixin for share the same logic in PyText
Reviewed By: hudeven
Differential Revision: D17594102
fbshipit-source-id: 8625a4e4f3e09cbaba6ae92599c1121b86ed4e78
* Native Torchscript Wordpiece Tokenizer Op for BERTSquadQA, Torchscriptify BertSQUADQAModel (#879)
Summary:
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/879
Pull Request resolved: https://github.com/facebookresearch/pytext/pull/1023
Pull Request resolved: https://github.com/pytorch/fairseq/pull/1211
Added a new native op that does wordpiece tokenization while additionally returning token start and end indices in the raw text as required by BertSquadQA. Includes Unit Tests for the native op and also to check its parity with the PyText Wordpiece Tokenizer.
Also combined is a torchscript implementation of the Bert SQUAD QA Model.
There are scripts for evaluation and testing of the torchscript code as well.
Reviewed By: borguz, hikushalhere
Differential Revision: D17455985
fbshipit-source-id: c2617c7ecbce0f733b31d04558da965d0b62637b
* Add periodic CUDA cache cleanup (#882)
Summary:
This adds a periodic call to `torch.cuda.empty_cache()` in order to
mitigate memory fragmentation in the PyTorch CUDA cached allocator
that can cause OOMs on models approaching GPU memory limit.
By default, this will occur every 64 updates.
Performance considerations:
- I've benchmarked this on a reasonably large model with memory
footprint 16 GB, and the overhead with the default setting is <0.2%.
With `update-freq > 1`, the cost is mitigated even further.
- This behavior can be disabled with a value of zero.
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/882
Differential Revision: D17742386
Pulled By: jma127
fbshipit-source-id: 68d8f93f798d6818b5efc3d67d43b52dfb8b2865
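The periodic cleanup described above reduces to a counter test inside the training loop. A minimal sketch of that logic; the real code calls `torch.cuda.empty_cache()`, which is injected here as a callback so the sketch runs without a GPU:

```python
def maybe_empty_cache(num_updates, interval, empty_cache_fn):
    """Call empty_cache_fn every `interval` updates.

    An interval of zero disables the behavior entirely, matching the
    PR description. In fairseq, empty_cache_fn is
    torch.cuda.empty_cache().
    """
    if interval > 0 and num_updates % interval == 0:
        empty_cache_fn()

# Simulate 256 updates with the default interval of 64 and record
# which steps would flush the CUDA cache.
calls = []
for step in range(1, 257):
    maybe_empty_cache(step, 64, lambda: calls.append(step))
```

Because `empty_cache()` releases cached blocks back to the driver, calling it too often would hurt throughput; the benchmark above suggests every 64 updates keeps the overhead under 0.2%.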
* add pre-trained wav2vec model
Summary: Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/884
Differential Revision: D17774515
Pulled By: alexeib
fbshipit-source-id: d1ffe8ab723fa284c69b067bbd43d699eaa2f02f
* Setting Global sync to 50 in BMUF
Summary:
In all our final settings, we are using global_sync = 50 and we get comparable results with DDP and caffe2.
Setting the default global-sync-iter = 50
and users can just define --use-bmuf to enable it for training.
Reviewed By: skritika
Differential Revision: D17765094
fbshipit-source-id: 369591eeff266d757f89e1fc8dda01711146fdbc
* fix max lengths in Levenshtein Transformer
Summary: Fix the max length calculation in Levenshtein Transformer
Reviewed By: jhcross
Differential Revision: D17672946
fbshipit-source-id: e5efbe7e56cf879d3e822864e4398f99f45b04d4
* ensemble levts
Summary:
Add ensemble wrappers to the levenshtein NAT.
Levenshtein
Final softmax ensemble over the pipeline of three steps: deletion, placeholder insertion, and word selection.
1. Deletion
2. Placeholder Insertion
3. Word Selection
Each step involves scoring, averaging the scores over the ensemble, and then making hard decisions with argmax. Then the next step follows. We cannot do the three steps in parallel by design.
Reviewed By: kahne
Differential Revision: D17723202
fbshipit-source-id: 05f7a4fcd922a972cc4796ca397e8220f0b4d53e
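The per-step ensemble decision described above can be sketched as averaging each model's scores and then taking a hard argmax. A simplified, list-based sketch (the real implementation works on batched tensors of log-probabilities):

```python
def ensemble_decision(scores_per_model):
    """Average per-model scores for one pipeline step (deletion,
    placeholder insertion, or word selection), then make the hard
    decision with argmax."""
    n = len(scores_per_model)
    avg = [sum(col) / n for col in zip(*scores_per_model)]
    return max(range(len(avg)), key=avg.__getitem__)

# Two models disagree on the argmax; the averaged scores decide.
choice = ensemble_decision([[0.1, 0.7, 0.2],
                            [0.5, 0.3, 0.2]])
```

The sequential constraint follows directly: the insertion step can only be scored once the ensemble has committed (via argmax) to a deletion decision, so the three stages cannot be fused.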
* Add printing of PyTorch memory summary on OOM (#885)
Summary:
PyTorch now has more comprehensive memory instrumentation, added in https://github.com/pytorch/pytorch/pull/27361 . This PR makes fairseq print a summary table of the memory state when an OOM occurs.
Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/885
Differential Revision: D17820445
Pulled By: jma127
fbshipit-source-id: 1887417c7648d703f78e1cff9f2a5b89901f49d0
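The fairseq change amounts to catching the allocator's OOM error, dumping the new summary table, and re-raising. A hedged sketch of the pattern; detecting CUDA OOMs by matching the error message mirrors how training code typically does it, but the helper names and exact wording here are illustrative:

```python
def run_with_oom_report(step_fn, report_fn):
    """Run one training step; on a CUDA OOM, print the memory summary
    before re-raising so the failure is debuggable."""
    try:
        return step_fn()
    except RuntimeError as e:
        if "out of memory" in str(e):
            # In fairseq this would print torch.cuda.memory_summary().
            print(report_fn())
        raise

def failing_step():
    raise RuntimeError("CUDA out of memory. Tried to allocate 2.00 GiB")

oom_seen = False
try:
    run_with_oom_report(failing_step, lambda: "<memory summary table>")
except RuntimeError:
    oom_seen = True
```

Re-raising after printing matters: the caller (e.g. fairseq's trainer) may still want to skip the batch or abort, and swallowing the exception would hide the failure.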
* Fix data loading memory issue in pyspeech
Summary:
We currently shard data when creating the batch iterator. This means we first load all indicese/frame lengths/handles into memory, and then do the sharding. This makes it impossible to train on large datasets with a high amount of workers because each worker will need to load the entire dataset into memory. For training on a million hours of data (i.e. semi-supervised or unsupervised approaches) this data loading just mak…
|
I added "topic: deprecation" but it looks like this doesn't actually warn to tell the user to use the new APIs (and the old APIs are removed from the docs). We should add that so we can eventually remove the old stuff. |
|
ah, my mistake: this does add DeprecationWarnings, but those are filtered by default in many versions of Python. We should turn these into UserWarnings. |
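The distinction matters because CPython's default warning filters silence `DeprecationWarning` raised outside `__main__`, while `UserWarning` is always displayed. A small demonstration; the `ignore` filter added below mimics that default behavior so the effect is reproducible:

```python
import warnings

def deprecated_api():
    # What the old code emitted: filtered out under default filters
    # when raised from library code.
    warnings.warn("torch.cuda.memory_cached is deprecated; "
                  "use torch.cuda.memory_reserved",
                  DeprecationWarning, stacklevel=2)

def deprecated_api_visible():
    # The proposed fix: UserWarning is shown under default filters.
    warnings.warn("torch.cuda.memory_cached is deprecated; "
                  "use torch.cuda.memory_reserved",
                  UserWarning, stacklevel=2)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Mimic CPython's default filter, which silences DeprecationWarning
    # raised anywhere other than __main__.
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    deprecated_api()          # silently dropped
    deprecated_api_visible()  # recorded

categories = [w.category for w in caught]
```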
|
@gchanan I was just about to comment that I recall adding DeprecationWarnings to the old |
|
sorry, by changing the docs I meant "updated the docs to refer to the new function names". |
|
Ah gotcha. In any case, PR for rubber-stamping & Phabricator import: #32142 |
Follow-up of pytorch#27361. Addresses pytorch#32141.
…27361)

Summary:
Adds comprehensive memory instrumentation to the CUDA caching memory allocator.

# Counters

Added comprehensive instrumentation for the following stats:
- Allocation requests (`allocation`)
- Allocated memory (`allocated_bytes`)
- Reserved segments from cudaMalloc (`segment`)
- Reserved memory (`reserved_bytes`)
- Active memory blocks (`active`)
- Active memory (`active_bytes`)
- Inactive, non-releasable blocks (`inactive_split`)
- Inactive, non-releasable memory (`inactive_split_bytes`)
- Number of failed cudaMalloc calls that result in a cache flush and retry (`cuda_malloc_retries`)
- Number of OOMs (`num_ooms`)

Except for the last two, these stats are segmented between all memory, large blocks, and small blocks. Along with the current value of each stat, historical counts of allocs/frees as well as peak usage are tracked by the allocator.

# Snapshots

Added the capability to get a "memory snapshot" – that is, to generate a complete dump of the allocator block/segment state.

# Implementation: major changes

- Added `torch.cuda.memory_stats()` (and associated C++ changes) which returns all instrumented stats as a dictionary.
- Added `torch.cuda.snapshot()` (and associated C++ changes) which returns a complete dump of the allocator block/segment state as a list of segments.
- Added memory summary generator in `torch.cuda.memory_summary()` for ease of client access to the instrumentation stats. Potentially useful to dump when catching OOMs. Sample output here: https://pastebin.com/uKZjtupq

# Implementation: minor changes

- Add error-checking helper functions for Python dicts and lists in `torch/csrc/utils/`.
- Existing memory management functions in `torch.cuda` moved from `__init__.py` to `memory.py` and star-imported to the main CUDA module.
- Add various helper functions to `torch.cuda` to return individual items from `torch.cuda.memory_stats()`.


- `torch.cuda.reset_max_memory_cached()` and `torch.cuda.reset_max_memory_allocated()` are deprecated in favor of `reset_peak_stats`. It's a bit difficult to think of a case where only one of those stats should be reset, and IMO this makes the peak stats collectively more consistent.
- `torch.cuda.memory_cached()` and `torch.cuda.max_memory_cached()` are deprecated in favor of `*memory_reserved()`.
- Style (add access modifiers in the allocator class, random nit fixes, etc.)

# Testing

- Added consistency check for stats in `test_cuda.py`. This verifies that the data from `memory_stats()` is faithful to the data from `snapshot()`.
- Ran on various basic workflows (toy example, CIFAR)

# Performance

Running the following speed benchmark: https://pastebin.com/UNndQg50
- Before this PR: 45.98 microseconds per tensor creation
- After this PR: 46.65 microseconds per tensor creation

Pull Request resolved: pytorch#27361
Differential Revision: D17758747
Pulled By: jma127
fbshipit-source-id: 5a84e82d696c40c505646b9a1b4e0c3bba38aeb6
Summary: PyTorch now has more comprehensive memory instrumentation, added in pytorch/pytorch#27361 . This PR makes fairseq print a summary table of the memory state when an OOM occurs. Pull Request resolved: fairinternal/fairseq-py#885 Differential Revision: D17820445 Pulled By: jma127 fbshipit-source-id: 1887417c7648d703f78e1cff9f2a5b89901f49d0