Slower training speed?

Thanks to the authors for the great work and for open-sourcing it.
I made a minimal trainable Mamba on TinyStories [here](https://github.com/xyzhang626/Mamba-tinystories) by changing a few lines of llama2.c. However, I found that training is ~13% slower on my V100s (800 ms vs. 650 ms per iteration, seq_len 2048) than a `torch.compile`d Transformer.
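
For reference, per-iteration times like the ones above could be measured with a small harness along the following lines. This is only a sketch under stated assumptions: `ms_per_iter`, `step`, and `sync` are hypothetical names and not code from either repo. For GPU runs one would pass something like `torch.cuda.synchronize` as `sync` so that asynchronously launched kernels are included in the measurement.

```python
import time
from statistics import median


def ms_per_iter(step, iters=20, warmup=5, sync=None):
    """Return the median wall-clock time of `step()` in milliseconds.

    `step` is a hypothetical zero-argument callable that runs one training
    iteration (forward, backward, optimizer step).
    `sync` is an optional zero-argument callable that blocks until pending
    device work has finished (assumption: e.g. torch.cuda.synchronize).
    """
    # Warm-up iterations are not timed; they absorb one-off costs such as
    # compilation, autotuning, and allocator growth.
    for _ in range(warmup):
        step()

    samples = []
    for _ in range(iters):
        if sync is not None:
            sync()  # make sure earlier work is not charged to this sample
        start = time.perf_counter()
        step()
        if sync is not None:
            sync()  # wait for this iteration's work before stopping the clock
        samples.append((time.perf_counter() - start) * 1000.0)

    # The median is less sensitive to occasional stalls than the mean.
    return median(samples)
```

Running the same harness with the same batch size and sequence length for both models keeps the comparison like for like.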

Is this expected? I also noticed that `torch.compile` does not work directly with the current Mamba model in this repo. Could that be a factor, given that Mamba is already equipped with several fused ops?

Slower training speed? #156