My work focuses on data and speed for language models and machine translation.
Data:
- Technical Lead Manager of text pre-training data at Meta for two years.
- Created HPLT, which released clean text and models from 7.2 petabytes of web crawl.
- Ran ParaCrawl, which built the largest parallel corpora for many languages; also worked on patents.
- Worked on No Language Left Behind, published in Nature.
Speed:
- Founded Efficient Translation Limited, which sold low-latency machine translation, and took it to an industry exit.
- Created and ran Bergamot, which launched client-side machine translation that is now installed by default in Firefox.
- Wrote the KenLM toolkit for efficient n-gram language models.
According to the New York Times, I am a native speaker of C++ "on semipermanent loan from the Internet" and my t-shirt collection is "threadbare."
People have trouble spelling my last name, even under oath.
Brief CV
| Where | What |
|---|---|
| Meta | Technical Lead Manager, Llama text data |
| Efficient Translation | Founder with industry exit |
| Edinburgh | Reader ≅ Associate Professor |
| Edinburgh | Lecturer ≅ Assistant Professor |
| Bloomberg | Senior Research Scientist |
| Stanford | Postdoc |
| Edinburgh | Research Associate |
| Carnegie Mellon | PhD advised by Alon Lavie |
| Google | Software Engineer |
| Caltech | BSc, Mathematics and Computer Science |