Skip to content

Commit 90e44e7

Browse files
authored
Unrolled build for rust-lang#122614
Rollup merge of rust-lang#122614 - notriddle:notriddle/search-desc, r=GuillaumeGomez rustdoc-search: shard the search result descriptions ## Preview This makes no visual changes to rustdoc search. It's a pure perf improvement. <details><summary>old</summary> Preview: <http://notriddle.com/rustdoc-html-demo-10/doc/std/index.html?search=vec> WebPageTest Comparison with before branch on a sort of worst case (searching `vec`, winds up downloading most of the shards anyway): <https://www.webpagetest.org/video/compare.php?tests=240317_AiDc61_2EM,240317_AiDcM0_2EN> Waterfall diagram: ![image](https://github.com/rust-lang/rust/assets/1593513/39548f0c-7ad6-411b-abf8-f6668ff4da18) </details> Preview: <http://notriddle.com/rustdoc-html-demo-10/doc2/std/index.html?search=vec> WebPageTest Comparison with before branch on a sort of worst case (searching `vec`, winds up downloading most of the shards anyway): <https://www.webpagetest.org/video/compare.php?tests=240322_BiDcCH_13R,240322_AiDcJY_104> ![image](https://github.com/rust-lang/rust/assets/1593513/4be1f9ff-c3ff-4b96-8f5b-b264df2e662d) ## Description r? `@GuillaumeGomez` The descriptions are, on almost all crates[^1], the majority of the size of the search index, even though they aren't really used for searching. This makes it relatively easy to separate them into their own files. Additionally, this PR pulls out information about whether there's a description into a bitmap. This allows us to sort, truncate, *then* download. This PR also bumps us to ES8. Out of the browsers we support, all of them support async functions according to caniuse. https://caniuse.com/async-functions [^1]: <https://microsoft.github.io/windows-docs-rs/>, a crate with 44MiB of pure names and no descriptions for them, is an outlier and should not be counted. But this PR should improve it, by replacing a long line of empty strings with a compressed bitmap with a single Run section. Just not very much. ## Detailed sizes ```console $ cat test.sh set -ex cp ../search-index*.js search-index.js awk 'FNR==NR {a++;next} FNR<a-3' search-index.js{,} | awk 'NR>1 {gsub(/\],\\$/,""); gsub(/^\["[^"]+",/,""); print} {next}' | sed -E "s:\\\\':':g" > search-index.json jq -c '.t' search-index.json > t.json jq -c '.n' search-index.json > n.json jq -c '.q' search-index.json > q.json jq -c '.D' search-index.json > D.json jq -c '.e' search-index.json > e.json jq -c '.i' search-index.json > i.json jq -c '.f' search-index.json > f.json jq -c '.c' search-index.json > c.json jq -c '.p' search-index.json > p.json jq -c '.a' search-index.json > a.json du -hs t.json n.json q.json D.json e.json i.json f.json c.json p.json a.json $ bash test.sh + cp ../search-index1.78.0.js search-index.js + awk 'FNR==NR {a++;next} FNR<a-3' search-index.js search-index.js + awk 'NR>1 {gsub(/\],\\$/,""); gsub(/^\["[^"]+",/,""); print} {next}' + sed -E 's:\\'\'':'\'':g' + jq -c .t search-index.json + jq -c .n search-index.json + jq -c .q search-index.json + jq -c .D search-index.json + jq -c .e search-index.json + jq -c .i search-index.json + jq -c .f search-index.json + jq -c .c search-index.json + jq -c .p search-index.json + jq -c .a search-index.json + du -hs t.json n.json q.json D.json e.json i.json f.json c.json p.json a.json 64K t.json 800K n.json 8.0K q.json 4.0K D.json 16K e.json 192K i.json 544K f.json 4.0K c.json 36K p.json 20K a.json ``` These are, roughly, the size of each section in the standard library (this tool actually excludes libtest, for parsing-json-with-awk reasons, but libtest is tiny so it's probably not important). t = item type, like "struct", "free fn", or "type alias". Since one byte is used for every item, this implies that there are approximately 64 thousand items in the standard library. n = name, and that's now the largest section of the search index with the descriptions removed from it q = parent *module* path, stored parallel to the items within D = the size of each description shard, stored as vlq hex numbers e = empty description bit flags, stored as a roaring bitmap i = parent *type* index as a link into `p`, stored as decimal json numbers; used only for associated types; might want to switch to vlq hex, since that's shorter, but that would be a separate pr f = function signature, stored as lists of lists that index into `p` c = deprecation flag, stored as a roaring bitmap p = parent *type*, stored separately and linked into from `i` and `f` a = alias, as [[key, value]] pairs ## Search performance http://notriddle.com/rustdoc-html-demo-11/perf-shard/index.html For example, in stm32f4: <table><thead><tr><th>before<th>after</tr></thead> <tbody><tr><td> ``` Testing T -> U ... in_args = 0, returned = 0, others = 200 wall time = 617 Testing T, U ... in_args = 0, returned = 0, others = 200 wall time = 198 Testing T -> T ... in_args = 0, returned = 0, others = 200 wall time = 282 Testing crc32 ... in_args = 0, returned = 0, others = 0 wall time = 426 Testing spi::pac ... in_args = 0, returned = 0, others = 0 wall time = 673 ``` </td><td> ``` Testing T -> U ... in_args = 0, returned = 0, others = 200 wall time = 716 Testing T, U ... in_args = 0, returned = 0, others = 200 wall time = 207 Testing T -> T ... in_args = 0, returned = 0, others = 200 wall time = 289 Testing crc32 ... in_args = 0, returned = 0, others = 0 wall time = 418 Testing spi::pac ... in_args = 0, returned = 0, others = 0 wall time = 687 ``` </td></tr><tr><td> ``` user: 005.345 s sys: 002.955 s wall: 006.899 s child_RSS_high: 583664 KiB group_mem_high: 557876 KiB ``` </td><td> ``` user: 004.652 s sys: 000.565 s wall: 003.865 s child_RSS_high: 538696 KiB group_mem_high: 511724 KiB ``` </td></tr> </table> This perf tester is janky and unscientific enough that the apparent differences might just be noise. If it's not an order of magnitude, it's probably not real. ## Future possibilities * Currently, results are not shown until the descriptions are downloaded. Theoretically, the description-less results could be shown. But actually doing that, and making sure it works properly, would require extra work (we have to be careful to avoid layout jumps). * More than just descriptions can be sharded this way. But we have to be careful to make sure the size wins are worth the round trips. Ideally, data that’s needed only for display should be sharded while data needed for search isn’t. * [Full text search](https://internals.rust-lang.org/t/full-text-search-for-rustdoc-and-doc-rs/20427) also needs this kind of infrastructure. A good implementation might store a compressed bloom filter in the search index, then download the full keyword in shards. But, we have to be careful not just of the amount readers have to download, but also of the amount that [publishers](https://gist.github.com/notriddle/c289e77f3ed469d1c0238d1d135d49e1) have to store.
2 parents 36b6f9b + a272007 commit 90e44e7

File tree

14 files changed

+831
-239
lines changed

14 files changed

+831
-239
lines changed

Cargo.lock

+2
Original file line numberDiff line numberDiff line change
@@ -4783,6 +4783,8 @@ version = "0.0.0"
47834783
dependencies = [
47844784
"arrayvec",
47854785
"askama",
4786+
"base64",
4787+
"byteorder",
47864788
"expect-test",
47874789
"indexmap",
47884790
"itertools 0.12.1",

src/ci/docker/host-x86_64/mingw-check/Dockerfile

+1-1
Original file line numberDiff line numberDiff line change
@@ -60,7 +60,7 @@ ENV SCRIPT python3 ../x.py --stage 2 test src/tools/expand-yaml-anchors && \
6060
/scripts/validate-error-codes.sh && \
6161
reuse --include-submodules lint && \
6262
# Runs checks to ensure that there are no ES5 issues in our JS code.
63-
es-check es6 ../src/librustdoc/html/static/js/*.js && \
63+
es-check es8 ../src/librustdoc/html/static/js/*.js && \
6464
eslint -c ../src/librustdoc/html/static/.eslintrc.js ../src/librustdoc/html/static/js/*.js && \
6565
eslint -c ../src/tools/rustdoc-js/.eslintrc.js ../src/tools/rustdoc-js/tester.js && \
6666
eslint -c ../src/tools/rustdoc-gui/.eslintrc.js ../src/tools/rustdoc-gui/tester.js

src/librustdoc/Cargo.toml

+2
Original file line numberDiff line numberDiff line change
@@ -9,6 +9,8 @@ path = "lib.rs"
99
[dependencies]
1010
arrayvec = { version = "0.7", default-features = false }
1111
askama = { version = "0.12", default-features = false, features = ["config"] }
12+
base64 = "0.21.7"
13+
byteorder = "1.5"
1214
itertools = "0.12"
1315
indexmap = "2"
1416
minifier = "0.3.0"

src/librustdoc/html/render/mod.rs

+4-29
Original file line numberDiff line numberDiff line change
@@ -184,40 +184,15 @@ pub(crate) enum RenderTypeId {
184184

185185
impl RenderTypeId {
186186
pub fn write_to_string(&self, string: &mut String) {
187-
// (sign, value)
188-
let (sign, id): (bool, u32) = match &self {
187+
let id: i32 = match &self {
189188
// 0 is a sentinel, everything else is one-indexed
190189
// concrete type
191-
RenderTypeId::Index(idx) if *idx >= 0 => (false, (idx + 1isize).try_into().unwrap()),
190+
RenderTypeId::Index(idx) if *idx >= 0 => (idx + 1isize).try_into().unwrap(),
192191
// generic type parameter
193-
RenderTypeId::Index(idx) => (true, (-*idx).try_into().unwrap()),
192+
RenderTypeId::Index(idx) => (*idx).try_into().unwrap(),
194193
_ => panic!("must convert render types to indexes before serializing"),
195194
};
196-
// zig-zag encoding
197-
let value: u32 = (id << 1) | (if sign { 1 } else { 0 });
198-
// Self-terminating hex use capital letters for everything but the
199-
// least significant digit, which is lowercase. For example, decimal 17
200-
// would be `` Aa `` if zig-zag encoding weren't used.
201-
//
202-
// Zig-zag encoding, however, stores the sign bit as the last bit.
203-
// This means, in the last hexit, 1 is actually `c`, -1 is `b`
204-
// (`a` is the imaginary -0), and, because all the bits are shifted
205-
// by one, `` A` `` is actually 8 and `` Aa `` is -8.
206-
//
207-
// https://rust-lang.github.io/rustc-dev-guide/rustdoc-internals/search.html
208-
// describes the encoding in more detail.
209-
let mut shift: u32 = 28;
210-
let mut mask: u32 = 0xF0_00_00_00;
211-
while shift < 32 {
212-
let hexit = (value & mask) >> shift;
213-
if hexit != 0 || shift == 0 {
214-
let hex =
215-
char::try_from(if shift == 0 { '`' } else { '@' } as u32 + hexit).unwrap();
216-
string.push(hex);
217-
}
218-
shift = shift.wrapping_sub(4);
219-
mask = mask >> 4;
220-
}
195+
search_index::encode::write_vlqhex_to_string(id, string);
221196
}
222197
}
223198

src/librustdoc/html/render/search_index.rs

+94-13
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,5 @@
1+
pub(crate) mod encode;
2+
13
use std::collections::hash_map::Entry;
24
use std::collections::{BTreeMap, VecDeque};
35

@@ -17,12 +19,46 @@ use crate::html::format::join_with_double_colon;
1719
use crate::html::markdown::short_markdown_summary;
1820
use crate::html::render::{self, IndexItem, IndexItemFunctionType, RenderType, RenderTypeId};
1921

22+
use encode::{bitmap_to_string, write_vlqhex_to_string};
23+
24+
/// The serialized search description sharded version
25+
///
26+
/// The `index` is a JSON-encoded list of names and other information.
27+
///
28+
/// The desc has newlined descriptions, split up by size into 128KiB shards.
29+
/// For example, `(4, "foo\nbar\nbaz\nquux")`.
30+
///
31+
/// There is no single, optimal size for these shards, because it depends on
32+
/// configuration values that we can't predict or control, such as the version
33+
/// of HTTP used (HTTP/1.1 would work better with larger files, while HTTP/2
34+
/// and 3 are more agnostic), transport compression (gzip, zstd, etc), whether
35+
/// the search query is going to produce a large number of results or a small
36+
/// number, the bandwidth delay product of the network...
37+
///
38+
/// Gzipping some standard library descriptions to guess what transport
39+
/// compression will do, the compressed file sizes can be as small as 4.9KiB
40+
/// or as large as 18KiB (ignoring the final 1.9KiB shard of leftovers).
41+
/// A "reasonable" range for files is for them to be bigger than 1KiB,
42+
/// since that's about the amount of data that can be transferred in a
43+
/// single TCP packet, and 64KiB, the maximum amount of data that
44+
/// TCP can transfer in a single round trip without extensions.
45+
///
46+
/// [1]: https://en.wikipedia.org/wiki/Maximum_transmission_unit#MTUs_for_common_media
47+
/// [2]: https://en.wikipedia.org/wiki/Sliding_window_protocol#Basic_concept
48+
/// [3]: https://learn.microsoft.com/en-us/troubleshoot/windows-server/networking/description-tcp-features
49+
pub(crate) struct SerializedSearchIndex {
50+
pub(crate) index: String,
51+
pub(crate) desc: Vec<(usize, String)>,
52+
}
53+
54+
const DESC_INDEX_SHARD_LEN: usize = 128 * 1024;
55+
2056
/// Builds the search index from the collected metadata
2157
pub(crate) fn build_index<'tcx>(
2258
krate: &clean::Crate,
2359
cache: &mut Cache,
2460
tcx: TyCtxt<'tcx>,
25-
) -> String {
61+
) -> SerializedSearchIndex {
2662
let mut itemid_to_pathid = FxHashMap::default();
2763
let mut primitives = FxHashMap::default();
2864
let mut associated_types = FxHashMap::default();
@@ -319,7 +355,6 @@ pub(crate) fn build_index<'tcx>(
319355
.collect::<Vec<_>>();
320356

321357
struct CrateData<'a> {
322-
doc: String,
323358
items: Vec<&'a IndexItem>,
324359
paths: Vec<(ItemType, Vec<Symbol>)>,
325360
// The String is alias name and the vec is the list of the elements with this alias.
@@ -328,6 +363,11 @@ pub(crate) fn build_index<'tcx>(
328363
aliases: &'a BTreeMap<String, Vec<usize>>,
329364
// Used when a type has more than one impl with an associated item with the same name.
330365
associated_item_disambiguators: &'a Vec<(usize, String)>,
366+
// A list of shard lengths encoded as vlqhex. See the comment in write_vlqhex_to_string
367+
// for information on the format.
368+
desc_index: String,
369+
// A list of items with no description. This is eventually turned into a bitmap.
370+
empty_desc: Vec<u32>,
331371
}
332372

333373
struct Paths {
@@ -409,7 +449,6 @@ pub(crate) fn build_index<'tcx>(
409449
let mut names = Vec::with_capacity(self.items.len());
410450
let mut types = String::with_capacity(self.items.len());
411451
let mut full_paths = Vec::with_capacity(self.items.len());
412-
let mut descriptions = Vec::with_capacity(self.items.len());
413452
let mut parents = Vec::with_capacity(self.items.len());
414453
let mut functions = String::with_capacity(self.items.len());
415454
let mut deprecated = Vec::with_capacity(self.items.len());
@@ -432,7 +471,6 @@ pub(crate) fn build_index<'tcx>(
432471
parents.push(item.parent_idx.map(|x| x + 1).unwrap_or(0));
433472

434473
names.push(item.name.as_str());
435-
descriptions.push(&item.desc);
436474

437475
if !item.path.is_empty() {
438476
full_paths.push((index, &item.path));
@@ -444,7 +482,8 @@ pub(crate) fn build_index<'tcx>(
444482
}
445483

446484
if item.deprecation.is_some() {
447-
deprecated.push(index);
485+
// bitmasks always use 1-indexing for items, with 0 as the crate itself
486+
deprecated.push(u32::try_from(index + 1).unwrap());
448487
}
449488
}
450489

@@ -455,42 +494,84 @@ pub(crate) fn build_index<'tcx>(
455494
let has_aliases = !self.aliases.is_empty();
456495
let mut crate_data =
457496
serializer.serialize_struct("CrateData", if has_aliases { 9 } else { 8 })?;
458-
crate_data.serialize_field("doc", &self.doc)?;
459497
crate_data.serialize_field("t", &types)?;
460498
crate_data.serialize_field("n", &names)?;
461-
// Serialize as an array of item indices and full paths
462499
crate_data.serialize_field("q", &full_paths)?;
463-
crate_data.serialize_field("d", &descriptions)?;
464500
crate_data.serialize_field("i", &parents)?;
465501
crate_data.serialize_field("f", &functions)?;
466-
crate_data.serialize_field("c", &deprecated)?;
502+
crate_data.serialize_field("D", &self.desc_index)?;
467503
crate_data.serialize_field("p", &paths)?;
468504
crate_data.serialize_field("b", &self.associated_item_disambiguators)?;
505+
crate_data.serialize_field("c", &bitmap_to_string(&deprecated))?;
506+
crate_data.serialize_field("e", &bitmap_to_string(&self.empty_desc))?;
469507
if has_aliases {
470508
crate_data.serialize_field("a", &self.aliases)?;
471509
}
472510
crate_data.end()
473511
}
474512
}
475513

476-
// Collect the index into a string
477-
format!(
514+
let (empty_desc, desc) = {
515+
let mut empty_desc = Vec::new();
516+
let mut result = Vec::new();
517+
let mut set = String::new();
518+
let mut len: usize = 0;
519+
let mut item_index: u32 = 0;
520+
for desc in std::iter::once(&crate_doc).chain(crate_items.iter().map(|item| &item.desc)) {
521+
if desc == "" {
522+
empty_desc.push(item_index);
523+
item_index += 1;
524+
continue;
525+
}
526+
if set.len() >= DESC_INDEX_SHARD_LEN {
527+
result.push((len, std::mem::replace(&mut set, String::new())));
528+
len = 0;
529+
} else if len != 0 {
530+
set.push('\n');
531+
}
532+
set.push_str(&desc);
533+
len += 1;
534+
item_index += 1;
535+
}
536+
result.push((len, std::mem::replace(&mut set, String::new())));
537+
(empty_desc, result)
538+
};
539+
540+
let desc_index = {
541+
let mut desc_index = String::with_capacity(desc.len() * 4);
542+
for &(len, _) in desc.iter() {
543+
write_vlqhex_to_string(len.try_into().unwrap(), &mut desc_index);
544+
}
545+
desc_index
546+
};
547+
548+
assert_eq!(
549+
crate_items.len() + 1,
550+
desc.iter().map(|(len, _)| *len).sum::<usize>() + empty_desc.len()
551+
);
552+
553+
// The index, which is actually used to search, is JSON
554+
// It uses `JSON.parse(..)` to actually load, since JSON
555+
// parses faster than the full JavaScript syntax.
556+
let index = format!(
478557
r#"["{}",{}]"#,
479558
krate.name(tcx),
480559
serde_json::to_string(&CrateData {
481-
doc: crate_doc,
482560
items: crate_items,
483561
paths: crate_paths,
484562
aliases: &aliases,
485563
associated_item_disambiguators: &associated_item_disambiguators,
564+
desc_index,
565+
empty_desc,
486566
})
487567
.expect("failed serde conversion")
488568
// All these `replace` calls are because we have to go through JS string for JSON content.
489569
.replace('\\', r"\\")
490570
.replace('\'', r"\'")
491571
// We need to escape double quotes for the JSON.
492572
.replace("\\\"", "\\\\\"")
493-
)
573+
);
574+
SerializedSearchIndex { index, desc }
494575
}
495576

496577
pub(crate) fn get_function_type_for_search<'tcx>(

0 commit comments

Comments
 (0)