Skip to content

Commit 910692d

Browse files
committed
Auto merge of #89582 - jkugelman:optimize-file-read-to-end, r=joshtriplett
Optimize File::read_to_end and read_to_string Reading a file into an empty vector or string buffer can incur unnecessary `read` syscalls and memory re-allocations as the buffer "warms up" and grows to its final size. This is perhaps a necessary evil with generic readers, but files can be read in smarter by checking the file size and reserving that much capacity. `std::fs::read` and `std::fs::read_to_string` already perform this optimization: they open the file, reads its metadata, and call `with_capacity` with the file size. This ensures that the buffer does not need to be resized and an initial string of small `read` syscalls. However, if a user opens the `File` themselves and calls `file.read_to_end` or `file.read_to_string` they do not get this optimization. ```rust let mut buf = Vec::new(); file.read_to_end(&mut buf)?; ``` I searched through this project's codebase and even here are a *lot* of examples of this. They're found all over in unit tests, which isn't a big deal, but there are also several real instances in the compiler and in Cargo. I've documented the ones I found in a comment here: #89516 (comment) Most telling, the documentation for both the `Read` trait and the `Read::read_to_end` method both show this exact pattern as examples of how to use readers. What this says to me is that this shouldn't be solved by simply fixing the instances of it in this codebase. If it's here it's certain to be prevalent in the wider Rust ecosystem. To that end, this commit adds specializations of `read_to_end` and `read_to_string` directly on `File`. This way it's no longer a minor footgun to start with an empty buffer when reading a file in. A nice side effect of this change is that code that accesses a `File` as `impl Read` or `dyn Read` will benefit. For example, this code from `compiler/rustc_serialize/src/json.rs`: ```rust pub fn from_reader(rdr: &mut dyn Read) -> Result<Json, BuilderError> { let mut contents = Vec::new(); match rdr.read_to_end(&mut contents) { ``` Related changes: - I also added specializations to `BufReader` to delegate to `self.inner`'s methods. That way it can call `File`'s optimized implementations if the inner reader is a file. - The private `std::io::append_to_string` function is now marked `unsafe`. - `File::read_to_string` being more efficient means that the performance note for `io::read_to_string` can be softened. I've added `@camelid's` suggested wording from #80218 (comment). r? `@joshtriplett`
2 parents f875143 + a990c76 commit 910692d

File tree

5 files changed

+146
-46
lines changed

5 files changed

+146
-46
lines changed

library/std/src/fs.rs

+37-15
Original file line numberDiff line numberDiff line change
@@ -198,19 +198,10 @@ pub struct DirBuilder {
198198
recursive: bool,
199199
}
200200

201-
/// Indicates how large a buffer to pre-allocate before reading the entire file.
202-
fn initial_buffer_size(file: &File) -> usize {
203-
// Don't worry about `usize` overflow because reading will fail regardless
204-
// in that case.
205-
file.metadata().map(|m| m.len() as usize).unwrap_or(0)
206-
}
207-
208201
/// Read the entire contents of a file into a bytes vector.
209202
///
210203
/// This is a convenience function for using [`File::open`] and [`read_to_end`]
211-
/// with fewer imports and without an intermediate variable. It pre-allocates a
212-
/// buffer based on the file size when available, so it is generally faster than
213-
/// reading into a vector created with [`Vec::new()`].
204+
/// with fewer imports and without an intermediate variable.
214205
///
215206
/// [`read_to_end`]: Read::read_to_end
216207
///
@@ -237,7 +228,7 @@ fn initial_buffer_size(file: &File) -> usize {
237228
pub fn read<P: AsRef<Path>>(path: P) -> io::Result<Vec<u8>> {
238229
fn inner(path: &Path) -> io::Result<Vec<u8>> {
239230
let mut file = File::open(path)?;
240-
let mut bytes = Vec::with_capacity(initial_buffer_size(&file));
231+
let mut bytes = Vec::new();
241232
file.read_to_end(&mut bytes)?;
242233
Ok(bytes)
243234
}
@@ -247,9 +238,7 @@ pub fn read<P: AsRef<Path>>(path: P) -> io::Result<Vec<u8>> {
247238
/// Read the entire contents of a file into a string.
248239
///
249240
/// This is a convenience function for using [`File::open`] and [`read_to_string`]
250-
/// with fewer imports and without an intermediate variable. It pre-allocates a
251-
/// buffer based on the file size when available, so it is generally faster than
252-
/// reading into a string created with [`String::new()`].
241+
/// with fewer imports and without an intermediate variable.
253242
///
254243
/// [`read_to_string`]: Read::read_to_string
255244
///
@@ -278,7 +267,7 @@ pub fn read<P: AsRef<Path>>(path: P) -> io::Result<Vec<u8>> {
278267
pub fn read_to_string<P: AsRef<Path>>(path: P) -> io::Result<String> {
279268
fn inner(path: &Path) -> io::Result<String> {
280269
let mut file = File::open(path)?;
281-
let mut string = String::with_capacity(initial_buffer_size(&file));
270+
let mut string = String::new();
282271
file.read_to_string(&mut string)?;
283272
Ok(string)
284273
}
@@ -615,6 +604,15 @@ impl fmt::Debug for File {
615604
}
616605
}
617606

607+
/// Indicates how much extra capacity is needed to read the rest of the file.
608+
fn buffer_capacity_required(mut file: &File) -> usize {
609+
let size = file.metadata().map(|m| m.len()).unwrap_or(0);
610+
let pos = file.stream_position().unwrap_or(0);
611+
// Don't worry about `usize` overflow because reading will fail regardless
612+
// in that case.
613+
size.saturating_sub(pos) as usize
614+
}
615+
618616
#[stable(feature = "rust1", since = "1.0.0")]
619617
impl Read for File {
620618
fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
@@ -635,6 +633,18 @@ impl Read for File {
635633
// SAFETY: Read is guaranteed to work on uninitialized memory
636634
unsafe { Initializer::nop() }
637635
}
636+
637+
// Reserves space in the buffer based on the file size when available.
638+
fn read_to_end(&mut self, buf: &mut Vec<u8>) -> io::Result<usize> {
639+
buf.reserve(buffer_capacity_required(self));
640+
io::default_read_to_end(self, buf)
641+
}
642+
643+
// Reserves space in the buffer based on the file size when available.
644+
fn read_to_string(&mut self, buf: &mut String) -> io::Result<usize> {
645+
buf.reserve(buffer_capacity_required(self));
646+
io::default_read_to_string(self, buf)
647+
}
638648
}
639649
#[stable(feature = "rust1", since = "1.0.0")]
640650
impl Write for File {
@@ -681,6 +691,18 @@ impl Read for &File {
681691
// SAFETY: Read is guaranteed to work on uninitialized memory
682692
unsafe { Initializer::nop() }
683693
}
694+
695+
// Reserves space in the buffer based on the file size when available.
696+
fn read_to_end(&mut self, buf: &mut Vec<u8>) -> io::Result<usize> {
697+
buf.reserve(buffer_capacity_required(self));
698+
io::default_read_to_end(self, buf)
699+
}
700+
701+
// Reserves space in the buffer based on the file size when available.
702+
fn read_to_string(&mut self, buf: &mut String) -> io::Result<usize> {
703+
buf.reserve(buffer_capacity_required(self));
704+
io::default_read_to_string(self, buf)
705+
}
684706
}
685707
#[stable(feature = "rust1", since = "1.0.0")]
686708
impl Write for &File {

library/std/src/io/buffered/bufreader.rs

+45
Original file line numberDiff line numberDiff line change
@@ -308,6 +308,51 @@ impl<R: Read> Read for BufReader<R> {
308308
unsafe fn initializer(&self) -> Initializer {
309309
self.inner.initializer()
310310
}
311+
312+
// The inner reader might have an optimized `read_to_end`. Drain our buffer and then
313+
// delegate to the inner implementation.
314+
fn read_to_end(&mut self, buf: &mut Vec<u8>) -> io::Result<usize> {
315+
let nread = self.cap - self.pos;
316+
buf.extend_from_slice(&self.buf[self.pos..self.cap]);
317+
self.discard_buffer();
318+
Ok(nread + self.inner.read_to_end(buf)?)
319+
}
320+
321+
// The inner reader might have an optimized `read_to_end`. Drain our buffer and then
322+
// delegate to the inner implementation.
323+
fn read_to_string(&mut self, buf: &mut String) -> io::Result<usize> {
324+
// In the general `else` case below we must read bytes into a side buffer, check
325+
// that they are valid UTF-8, and then append them to `buf`. This requires a
326+
// potentially large memcpy.
327+
//
328+
// If `buf` is empty--the most common case--we can leverage `append_to_string`
329+
// to read directly into `buf`'s internal byte buffer, saving an allocation and
330+
// a memcpy.
331+
if buf.is_empty() {
332+
// `append_to_string`'s safety relies on the buffer only being appended to since
333+
// it only checks the UTF-8 validity of new data. If there were existing content in
334+
// `buf` then an untrustworthy reader (i.e. `self.inner`) could not only append
335+
// bytes but also modify existing bytes and render them invalid. On the other hand,
336+
// if `buf` is empty then by definition any writes must be appends and
337+
// `append_to_string` will validate all of the new bytes.
338+
unsafe { crate::io::append_to_string(buf, |b| self.read_to_end(b)) }
339+
} else {
340+
// We cannot append our byte buffer directly onto the `buf` String as there could
341+
// be an incomplete UTF-8 sequence that has only been partially read. We must read
342+
// everything into a side buffer first and then call `from_utf8` on the complete
343+
// buffer.
344+
let mut bytes = Vec::new();
345+
self.read_to_end(&mut bytes)?;
346+
let string = crate::str::from_utf8(&bytes).map_err(|_| {
347+
io::Error::new_const(
348+
io::ErrorKind::InvalidData,
349+
&"stream did not contain valid UTF-8",
350+
)
351+
})?;
352+
*buf += string;
353+
Ok(string.len())
354+
}
355+
}
311356
}
312357

313358
#[stable(feature = "rust1", since = "1.0.0")]

library/std/src/io/buffered/tests.rs

+22
Original file line numberDiff line numberDiff line change
@@ -243,6 +243,28 @@ fn test_buffered_reader_seek_underflow_discard_buffer_between_seeks() {
243243
assert_eq!(reader.buffer().len(), 0);
244244
}
245245

246+
#[test]
247+
fn test_buffered_reader_read_to_end_consumes_buffer() {
248+
let data: &[u8] = &[0, 1, 2, 3, 4, 5, 6, 7];
249+
let mut reader = BufReader::with_capacity(3, data);
250+
let mut buf = Vec::new();
251+
assert_eq!(reader.fill_buf().ok(), Some(&[0, 1, 2][..]));
252+
assert_eq!(reader.read_to_end(&mut buf).ok(), Some(8));
253+
assert_eq!(&buf, &[0, 1, 2, 3, 4, 5, 6, 7]);
254+
assert!(reader.buffer().is_empty());
255+
}
256+
257+
#[test]
258+
fn test_buffered_reader_read_to_string_consumes_buffer() {
259+
let data: &[u8] = "deadbeef".as_bytes();
260+
let mut reader = BufReader::with_capacity(3, data);
261+
let mut buf = String::new();
262+
assert_eq!(reader.fill_buf().ok(), Some("dea".as_bytes()));
263+
assert_eq!(reader.read_to_string(&mut buf).ok(), Some(8));
264+
assert_eq!(&buf, "deadbeef");
265+
assert!(reader.buffer().is_empty());
266+
}
267+
246268
#[test]
247269
fn test_buffered_writer() {
248270
let inner = Vec::new();

library/std/src/io/mod.rs

+41-30
Original file line numberDiff line numberDiff line change
@@ -316,11 +316,12 @@ impl Drop for Guard<'_> {
316316
}
317317
}
318318

319-
// A few methods below (read_to_string, read_line) will append data into a
320-
// `String` buffer, but we need to be pretty careful when doing this. The
321-
// implementation will just call `.as_mut_vec()` and then delegate to a
322-
// byte-oriented reading method, but we must ensure that when returning we never
323-
// leave `buf` in a state such that it contains invalid UTF-8 in its bounds.
319+
// Several `read_to_string` and `read_line` methods in the standard library will
320+
// append data into a `String` buffer, but we need to be pretty careful when
321+
// doing this. The implementation will just call `.as_mut_vec()` and then
322+
// delegate to a byte-oriented reading method, but we must ensure that when
323+
// returning we never leave `buf` in a state such that it contains invalid UTF-8
324+
// in its bounds.
324325
//
325326
// To this end, we use an RAII guard (to protect against panics) which updates
326327
// the length of the string when it is dropped. This guard initially truncates
@@ -334,21 +335,19 @@ impl Drop for Guard<'_> {
334335
// 2. We're passing a raw buffer to the function `f`, and it is expected that
335336
// the function only *appends* bytes to the buffer. We'll get undefined
336337
// behavior if existing bytes are overwritten to have non-UTF-8 data.
337-
fn append_to_string<F>(buf: &mut String, f: F) -> Result<usize>
338+
pub(crate) unsafe fn append_to_string<F>(buf: &mut String, f: F) -> Result<usize>
338339
where
339340
F: FnOnce(&mut Vec<u8>) -> Result<usize>,
340341
{
341-
unsafe {
342-
let mut g = Guard { len: buf.len(), buf: buf.as_mut_vec() };
343-
let ret = f(g.buf);
344-
if str::from_utf8(&g.buf[g.len..]).is_err() {
345-
ret.and_then(|_| {
346-
Err(Error::new_const(ErrorKind::InvalidData, &"stream did not contain valid UTF-8"))
347-
})
348-
} else {
349-
g.len = g.buf.len();
350-
ret
351-
}
342+
let mut g = Guard { len: buf.len(), buf: buf.as_mut_vec() };
343+
let ret = f(g.buf);
344+
if str::from_utf8(&g.buf[g.len..]).is_err() {
345+
ret.and_then(|_| {
346+
Err(Error::new_const(ErrorKind::InvalidData, &"stream did not contain valid UTF-8"))
347+
})
348+
} else {
349+
g.len = g.buf.len();
350+
ret
352351
}
353352
}
354353

@@ -361,7 +360,7 @@ where
361360
//
362361
// Because we're extending the buffer with uninitialized data for trusted
363362
// readers, we need to make sure to truncate that if any of this panics.
364-
fn read_to_end<R: Read + ?Sized>(r: &mut R, buf: &mut Vec<u8>) -> Result<usize> {
363+
pub(crate) fn default_read_to_end<R: Read + ?Sized>(r: &mut R, buf: &mut Vec<u8>) -> Result<usize> {
365364
let start_len = buf.len();
366365
let start_cap = buf.capacity();
367366
let mut g = Guard { len: buf.len(), buf };
@@ -426,6 +425,22 @@ fn read_to_end<R: Read + ?Sized>(r: &mut R, buf: &mut Vec<u8>) -> Result<usize>
426425
}
427426
}
428427

428+
pub(crate) fn default_read_to_string<R: Read + ?Sized>(
429+
r: &mut R,
430+
buf: &mut String,
431+
) -> Result<usize> {
432+
// Note that we do *not* call `r.read_to_end()` here. We are passing
433+
// `&mut Vec<u8>` (the raw contents of `buf`) into the `read_to_end`
434+
// method to fill it up. An arbitrary implementation could overwrite the
435+
// entire contents of the vector, not just append to it (which is what
436+
// we are expecting).
437+
//
438+
// To prevent extraneously checking the UTF-8-ness of the entire buffer
439+
// we pass it to our hardcoded `default_read_to_end` implementation which
440+
// we know is guaranteed to only read data into the end of the buffer.
441+
unsafe { append_to_string(buf, |b| default_read_to_end(r, b)) }
442+
}
443+
429444
pub(crate) fn default_read_vectored<F>(read: F, bufs: &mut [IoSliceMut<'_>]) -> Result<usize>
430445
where
431446
F: FnOnce(&mut [u8]) -> Result<usize>,
@@ -716,7 +731,7 @@ pub trait Read {
716731
/// [`std::fs::read`]: crate::fs::read
717732
#[stable(feature = "rust1", since = "1.0.0")]
718733
fn read_to_end(&mut self, buf: &mut Vec<u8>) -> Result<usize> {
719-
read_to_end(self, buf)
734+
default_read_to_end(self, buf)
720735
}
721736

722737
/// Read all bytes until EOF in this source, appending them to `buf`.
@@ -759,16 +774,7 @@ pub trait Read {
759774
/// [`std::fs::read_to_string`]: crate::fs::read_to_string
760775
#[stable(feature = "rust1", since = "1.0.0")]
761776
fn read_to_string(&mut self, buf: &mut String) -> Result<usize> {
762-
// Note that we do *not* call `.read_to_end()` here. We are passing
763-
// `&mut Vec<u8>` (the raw contents of `buf`) into the `read_to_end`
764-
// method to fill it up. An arbitrary implementation could overwrite the
765-
// entire contents of the vector, not just append to it (which is what
766-
// we are expecting).
767-
//
768-
// To prevent extraneously checking the UTF-8-ness of the entire buffer
769-
// we pass it to our hardcoded `read_to_end` implementation which we
770-
// know is guaranteed to only read data into the end of the buffer.
771-
append_to_string(buf, |b| read_to_end(self, b))
777+
default_read_to_string(self, buf)
772778
}
773779

774780
/// Read the exact number of bytes required to fill `buf`.
@@ -1005,6 +1011,11 @@ pub trait Read {
10051011
/// need more control over performance, and in those cases you should definitely use
10061012
/// [`Read::read_to_string`] directly.
10071013
///
1014+
/// Note that in some special cases, such as when reading files, this function will
1015+
/// pre-allocate memory based on the size of the input it is reading. In those
1016+
/// cases, the performance should be as good as if you had used
1017+
/// [`Read::read_to_string`] with a manually pre-allocated buffer.
1018+
///
10081019
/// # Errors
10091020
///
10101021
/// This function forces you to handle errors because the output (the `String`)
@@ -2201,7 +2212,7 @@ pub trait BufRead: Read {
22012212
// Note that we are not calling the `.read_until` method here, but
22022213
// rather our hardcoded implementation. For more details as to why, see
22032214
// the comments in `read_to_end`.
2204-
append_to_string(buf, |b| read_until(self, b'\n', b))
2215+
unsafe { append_to_string(buf, |b| read_until(self, b'\n', b)) }
22052216
}
22062217

22072218
/// Returns an iterator over the contents of this reader split on the byte

library/std/src/io/tests.rs

+1-1
Original file line numberDiff line numberDiff line change
@@ -290,7 +290,7 @@ fn bench_read_to_end(b: &mut test::Bencher) {
290290
b.iter(|| {
291291
let mut lr = repeat(1).take(10000000);
292292
let mut vec = Vec::with_capacity(1024);
293-
super::read_to_end(&mut lr, &mut vec)
293+
super::default_read_to_end(&mut lr, &mut vec)
294294
});
295295
}
296296

0 commit comments

Comments
 (0)