Skip to content

Commit f4104f1

Browse files
authored
Unrolled build for rust-lang#119468
Rollup merge of rust-lang#119468 - notriddle:notriddle/compression, r=GuillaumeGomez rustdoc-search: tighter encoding for f index Depends on rust-lang#119457 Two optimizations for the function signature search: * Instead of using JSON arrays, like `[1,20]`, it uses VLQ hex with no commas, like `[aAd]`. * This also adds backrefs: if you have more than one function with exactly the same signature, it'll not only store it once, it'll *decode* it once, and store in the typeIdMap only once. Based partially on discussions on zulip: https://rust-lang.zulipchat.com/#narrow/stream/266220-t-rustdoc/topic/search.20index.20size Performance ----------- https://notriddle.com/rustdoc-html-demo-8/compression-perf-v2/index.html ### memory/time profiler output (for more details, consult the above link) <table> <thead><tr><th>benchmark<th>before<th>after</tr></thead> <tbody> <tr><th>arti<td> ``` user: 002.789 s sys: 000.390 s wall: 002.096 s child_RSS_high: 440796 KiB group_mem_high: 414924 KiB ``` </td><td> ``` user: 002.295 s sys: 000.278 s wall: 001.738 s child_RSS_high: 314588 KiB group_mem_high: 285220 KiB ``` </td></tr><tr><th>cortex-m<td> ``` user: 000.127 s sys: 000.030 s wall: 000.134 s child_RSS_high: 60264 KiB group_mem_high: 23824 KiB ``` </td><td> ``` user: 000.136 s sys: 000.038 s wall: 000.137 s child_RSS_high: 59204 KiB group_mem_high: 22712 KiB ``` </td></tr><tr><th>sqlx<td> ``` user: 000.887 s sys: 000.118 s wall: 000.592 s child_RSS_high: 190408 KiB group_mem_high: 157804 KiB ``` </td><td> ``` user: 000.798 s sys: 000.101 s wall: 000.525 s child_RSS_high: 159292 KiB group_mem_high: 126292 KiB ``` </td></tr><tr><th>stm32f4<td> ``` user: 013.884 s sys: 005.399 s wall: 013.149 s child_RSS_high: 1942244 KiB group_mem_high: 1954916 KiB ``` </td><td> ``` user: 006.128 s sys: 003.297 s wall: 007.994 s child_RSS_high: 1038108 KiB group_mem_high: 1023900 KiB ``` </td></tr><tr><th>ripgrep<td> ``` user: 000.441 s sys: 000.063 s wall: 000.264 s child_RSS_high: 109180 KiB group_mem_high: 74272 KiB ``` </td><td> ``` user: 000.408 s sys: 000.044 s wall: 000.238 s child_RSS_high: 101488 KiB group_mem_high: 66000 KiB ``` </td></tr></tbody></table> Size change ----------- standard library without gzip: ```console $ du -bs search-index-old.js search-index-new.js 4976370 search-index-old.js 4404391 search-index-new.js ``` ((4976370-4404391)/4404391)*100% = 12.9% with gzip: ```console $ du -hs search-index-old.js.gz search-index-new.js.gz 520K search-index-old.js.gz 504K search-index-new.js.gz $ du -bs search-index-old.js.gz search-index-new.js.gz 522092 search-index-old.js.gz 507654 search-index-new.js.gz ``` ((522092-507654)/507654)*100% = 2.8% Benchmarks are similarly shrunk. Without gzip: ```console $ du -hs tmp/{arti,cortex-m,sqlx,stm32f4,ripgrep}/toolchain_{old,new}/doc/search-index.js 10555067 tmp/arti/toolchain_old/doc/search-index.js 8921236 tmp/arti/toolchain_new/doc/search-index.js 77018 tmp/cortex-m/toolchain_old/doc/search-index.js 66676 tmp/cortex-m/toolchain_new/doc/search-index.js 2876330 tmp/sqlx/toolchain_old/doc/search-index.js 2436812 tmp/sqlx/toolchain_new/doc/search-index.js 63632890 tmp/stm32f4/toolchain_old/doc/search-index.js 52337438 tmp/stm32f4/toolchain_new/doc/search-index.js 631150 tmp/ripgrep/toolchain_old/doc/search-index.js 541646 tmp/ripgrep/toolchain_new/doc/search-index.js ``` With gzip: ```console $ du -bs tmp/{arti,cortex-m,sqlx,stm32f4,ripgrep}/toolchain_{old,new}/doc/search-index.js.gz 1618852 tmp/arti/toolchain_old/doc/search-index.js.gz 1582007 tmp/arti/toolchain_new/doc/search-index.js.gz 16109 tmp/cortex-m/toolchain_old/doc/search-index.js.gz 15831 tmp/cortex-m/toolchain_new/doc/search-index.js.gz 422257 tmp/sqlx/toolchain_old/doc/search-index.js.gz 411507 tmp/sqlx/toolchain_new/doc/search-index.js.gz 4454761 tmp/stm32f4/toolchain_old/doc/search-index.js.gz 4334924 tmp/stm32f4/toolchain_new/doc/search-index.js.gz 98312 tmp/ripgrep/toolchain_old/doc/search-index.js.gz 96864 tmp/ripgrep/toolchain_new/doc/search-index.js.gz $ du -hs tmp/{arti,cortex-m,sqlx,stm32f4,ripgrep}/toolchain_{old,new}/doc/search-index.j s.gz 1.6M tmp/arti/toolchain_old/doc/search-index.js.gz 1.6M tmp/arti/toolchain_new/doc/search-index.js.gz 24K tmp/cortex-m/toolchain_old/doc/search-index.js.gz 24K tmp/cortex-m/toolchain_new/doc/search-index.js.gz 424K tmp/sqlx/toolchain_old/doc/search-index.js.gz 412K tmp/sqlx/toolchain_new/doc/search-index.js.gz 4.3M tmp/stm32f4/toolchain_old/doc/search-index.js.gz 4.2M tmp/stm32f4/toolchain_new/doc/search-index.js.gz 108K tmp/ripgrep/toolchain_old/doc/search-index.js.gz 104K tmp/ripgrep/toolchain_new/doc/search-index.js.gz ```
2 parents d62f05b + 004bfc5 commit f4104f1

File tree

4 files changed

+250
-139
lines changed

4 files changed

+250
-139
lines changed

src/librustdoc/html/render/mod.rs

+115-49
Original file line numberDiff line numberDiff line change
@@ -58,7 +58,7 @@ use rustc_span::{
5858
symbol::{sym, Symbol},
5959
BytePos, FileName, RealFileName,
6060
};
61-
use serde::ser::{SerializeMap, SerializeSeq};
61+
use serde::ser::SerializeMap;
6262
use serde::{Serialize, Serializer};
6363

6464
use crate::clean::{self, ItemId, RenderedLink, SelfTy};
@@ -123,115 +123,181 @@ pub(crate) struct IndexItem {
123123
}
124124

125125
/// A type used for the search index.
126-
#[derive(Debug)]
126+
#[derive(Debug, Eq, PartialEq)]
127127
pub(crate) struct RenderType {
128128
id: Option<RenderTypeId>,
129129
generics: Option<Vec<RenderType>>,
130130
bindings: Option<Vec<(RenderTypeId, Vec<RenderType>)>>,
131131
}
132132

133-
impl Serialize for RenderType {
134-
fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
135-
where
136-
S: Serializer,
137-
{
138-
let id = match &self.id {
133+
impl RenderType {
134+
// Types are rendered as lists of lists, because that's pretty compact.
135+
// The contents of the lists are always integers in self-terminating hex
136+
// form, handled by `RenderTypeId::write_to_string`, so no commas are
137+
// needed to separate the items.
138+
pub fn write_to_string(&self, string: &mut String) {
139+
fn write_optional_id(id: Option<RenderTypeId>, string: &mut String) {
139140
// 0 is a sentinel, everything else is one-indexed
140-
None => 0,
141-
// concrete type
142-
Some(RenderTypeId::Index(idx)) if *idx >= 0 => idx + 1,
143-
// generic type parameter
144-
Some(RenderTypeId::Index(idx)) => *idx,
145-
_ => panic!("must convert render types to indexes before serializing"),
146-
};
141+
match id {
142+
Some(id) => id.write_to_string(string),
143+
None => string.push('`'),
144+
}
145+
}
146+
// Either just the type id, or `{type, generics, bindings?}`
147+
// where generics is a list of types,
148+
// and bindings is a list of `{id, typelist}` pairs.
147149
if self.generics.is_some() || self.bindings.is_some() {
148-
let mut seq = serializer.serialize_seq(None)?;
149-
seq.serialize_element(&id)?;
150-
seq.serialize_element(self.generics.as_ref().map(Vec::as_slice).unwrap_or_default())?;
150+
string.push('{');
151+
write_optional_id(self.id, string);
152+
string.push('{');
153+
for generic in &self.generics.as_ref().map(Vec::as_slice).unwrap_or_default()[..] {
154+
generic.write_to_string(string);
155+
}
156+
string.push('}');
151157
if self.bindings.is_some() {
152-
seq.serialize_element(
153-
self.bindings.as_ref().map(Vec::as_slice).unwrap_or_default(),
154-
)?;
158+
string.push('{');
159+
for binding in &self.bindings.as_ref().map(Vec::as_slice).unwrap_or_default()[..] {
160+
string.push('{');
161+
binding.0.write_to_string(string);
162+
string.push('{');
163+
for constraint in &binding.1[..] {
164+
constraint.write_to_string(string);
165+
}
166+
string.push_str("}}");
167+
}
168+
string.push('}');
155169
}
156-
seq.end()
170+
string.push('}');
157171
} else {
158-
id.serialize(serializer)
172+
write_optional_id(self.id, string);
159173
}
160174
}
161175
}
162176

163-
#[derive(Clone, Copy, Debug)]
177+
#[derive(Clone, Copy, Debug, Eq, PartialEq)]
164178
pub(crate) enum RenderTypeId {
165179
DefId(DefId),
166180
Primitive(clean::PrimitiveType),
167181
AssociatedType(Symbol),
168182
Index(isize),
169183
}
170184

171-
impl Serialize for RenderTypeId {
172-
fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
173-
where
174-
S: Serializer,
175-
{
176-
let id = match &self {
185+
impl RenderTypeId {
186+
pub fn write_to_string(&self, string: &mut String) {
187+
// (sign, value)
188+
let (sign, id): (bool, u32) = match &self {
177189
// 0 is a sentinel, everything else is one-indexed
178190
// concrete type
179-
RenderTypeId::Index(idx) if *idx >= 0 => idx + 1,
191+
RenderTypeId::Index(idx) if *idx >= 0 => (false, (idx + 1isize).try_into().unwrap()),
180192
// generic type parameter
181-
RenderTypeId::Index(idx) => *idx,
193+
RenderTypeId::Index(idx) => (true, (-*idx).try_into().unwrap()),
182194
_ => panic!("must convert render types to indexes before serializing"),
183195
};
184-
id.serialize(serializer)
196+
// zig-zag encoding
197+
let value: u32 = (id << 1) | (if sign { 1 } else { 0 });
198+
// Self-terminating hex use capital letters for everything but the
199+
// least significant digit, which is lowercase. For example, decimal 17
200+
// would be `` Aa `` if zig-zag encoding weren't used.
201+
//
202+
// Zig-zag encoding, however, stores the sign bit as the last bit.
203+
// This means, in the last hexit, 1 is actually `c`, -1 is `b`
204+
// (`a` is the imaginary -0), and, because all the bits are shifted
205+
// by one, `` A` `` is actually 8 and `` Aa `` is -8.
206+
//
207+
// https://rust-lang.github.io/rustc-dev-guide/rustdoc-internals/search.html
208+
// describes the encoding in more detail.
209+
let mut shift: u32 = 28;
210+
let mut mask: u32 = 0xF0_00_00_00;
211+
while shift < 32 {
212+
let hexit = (value & mask) >> shift;
213+
if hexit != 0 || shift == 0 {
214+
let hex =
215+
char::try_from(if shift == 0 { '`' } else { '@' } as u32 + hexit).unwrap();
216+
string.push(hex);
217+
}
218+
shift = shift.wrapping_sub(4);
219+
mask = mask >> 4;
220+
}
185221
}
186222
}
187223

188224
/// Full type of functions/methods in the search index.
189-
#[derive(Debug)]
225+
#[derive(Debug, Eq, PartialEq)]
190226
pub(crate) struct IndexItemFunctionType {
191227
inputs: Vec<RenderType>,
192228
output: Vec<RenderType>,
193229
where_clause: Vec<Vec<RenderType>>,
194230
}
195231

196-
impl Serialize for IndexItemFunctionType {
197-
fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
198-
where
199-
S: Serializer,
200-
{
201-
// If we couldn't figure out a type, just write `0`.
232+
impl IndexItemFunctionType {
233+
pub fn write_to_string<'a>(
234+
&'a self,
235+
string: &mut String,
236+
backref_queue: &mut VecDeque<&'a IndexItemFunctionType>,
237+
) {
238+
assert!(backref_queue.len() <= 16);
239+
// If we couldn't figure out a type, just write 0,
240+
// which is encoded as `` ` `` (see RenderTypeId::write_to_string).
202241
let has_missing = self
203242
.inputs
204243
.iter()
205244
.chain(self.output.iter())
206245
.any(|i| i.id.is_none() && i.generics.is_none());
207246
if has_missing {
208-
0.serialize(serializer)
247+
string.push('`');
248+
} else if let Some(idx) = backref_queue.iter().position(|other| *other == self) {
249+
// The backref queue has 16 items, so backrefs use
250+
// a single hexit, disjoint from the ones used for numbers.
251+
string.push(
252+
char::try_from('0' as u32 + u32::try_from(idx).unwrap())
253+
.expect("last possible value is '?'"),
254+
);
209255
} else {
210-
let mut seq = serializer.serialize_seq(None)?;
256+
backref_queue.push_front(self);
257+
if backref_queue.len() > 16 {
258+
backref_queue.pop_back();
259+
}
260+
string.push('{');
211261
match &self.inputs[..] {
212262
[one] if one.generics.is_none() && one.bindings.is_none() => {
213-
seq.serialize_element(one)?
263+
one.write_to_string(string);
264+
}
265+
_ => {
266+
string.push('{');
267+
for item in &self.inputs[..] {
268+
item.write_to_string(string);
269+
}
270+
string.push('}');
214271
}
215-
_ => seq.serialize_element(&self.inputs)?,
216272
}
217273
match &self.output[..] {
218274
[] if self.where_clause.is_empty() => {}
219275
[one] if one.generics.is_none() && one.bindings.is_none() => {
220-
seq.serialize_element(one)?
276+
one.write_to_string(string);
277+
}
278+
_ => {
279+
string.push('{');
280+
for item in &self.output[..] {
281+
item.write_to_string(string);
282+
}
283+
string.push('}');
221284
}
222-
_ => seq.serialize_element(&self.output)?,
223285
}
224286
for constraint in &self.where_clause {
225287
if let [one] = &constraint[..]
226288
&& one.generics.is_none()
227289
&& one.bindings.is_none()
228290
{
229-
seq.serialize_element(one)?;
291+
one.write_to_string(string);
230292
} else {
231-
seq.serialize_element(constraint)?;
293+
string.push('{');
294+
for item in &constraint[..] {
295+
item.write_to_string(string);
296+
}
297+
string.push('}');
232298
}
233299
}
234-
seq.end()
300+
string.push('}');
235301
}
236302
}
237303
}

src/librustdoc/html/render/search_index.rs

+7-22
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
use std::collections::hash_map::Entry;
2-
use std::collections::BTreeMap;
2+
use std::collections::{BTreeMap, VecDeque};
33

44
use rustc_data_structures::fx::{FxHashMap, FxIndexMap};
55
use rustc_middle::ty::TyCtxt;
@@ -409,9 +409,11 @@ pub(crate) fn build_index<'tcx>(
409409
let mut full_paths = Vec::with_capacity(self.items.len());
410410
let mut descriptions = Vec::with_capacity(self.items.len());
411411
let mut parents = Vec::with_capacity(self.items.len());
412-
let mut functions = Vec::with_capacity(self.items.len());
412+
let mut functions = String::with_capacity(self.items.len());
413413
let mut deprecated = Vec::with_capacity(self.items.len());
414414

415+
let mut backref_queue = VecDeque::new();
416+
415417
for (index, item) in self.items.iter().enumerate() {
416418
let n = item.ty as u8;
417419
let c = char::try_from(n + b'A').expect("item types must fit in ASCII");
@@ -434,27 +436,10 @@ pub(crate) fn build_index<'tcx>(
434436
full_paths.push((index, &item.path));
435437
}
436438

437-
// Fake option to get `0` out as a sentinel instead of `null`.
438-
// We want to use `0` because it's three less bytes.
439-
enum FunctionOption<'a> {
440-
Function(&'a IndexItemFunctionType),
441-
None,
442-
}
443-
impl<'a> Serialize for FunctionOption<'a> {
444-
fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
445-
where
446-
S: Serializer,
447-
{
448-
match self {
449-
FunctionOption::None => 0.serialize(serializer),
450-
FunctionOption::Function(ty) => ty.serialize(serializer),
451-
}
452-
}
439+
match &item.search_type {
440+
Some(ty) => ty.write_to_string(&mut functions, &mut backref_queue),
441+
None => functions.push('`'),
453442
}
454-
functions.push(match &item.search_type {
455-
Some(ty) => FunctionOption::Function(ty),
456-
None => FunctionOption::None,
457-
});
458443

459444
if item.deprecation.is_some() {
460445
deprecated.push(index);

src/librustdoc/html/static/js/externs.js

+56
Original file line numberDiff line numberDiff line change
@@ -200,3 +200,59 @@ let FunctionSearchType;
200200
* }}
201201
*/
202202
let FunctionType;
203+
204+
/**
205+
* The raw search data for a given crate. `n`, `t`, `d`, `i`, and `f`
206+
* are arrays with the same length. `q`, `a`, and `c` use a sparse
207+
* representation for compactness.
208+
*
209+
* `n[i]` contains the name of an item.
210+
*
211+
* `t[i]` contains the type of that item
212+
* (as a string of characters that represent an offset in `itemTypes`).
213+
*
214+
* `d[i]` contains the description of that item.
215+
*
216+
* `q` contains the full paths of the items. For compactness, it is a set of
217+
* (index, path) pairs used to create a map. If a given index `i` is
218+
* not present, this indicates "same as the last index present".
219+
*
220+
* `i[i]` contains an item's parent, usually a module. For compactness,
221+
* it is a set of indexes into the `p` array.
222+
*
223+
* `f` contains function signatures, or `0` if the item isn't a function.
224+
* More information on how they're encoded can be found in rustc-dev-guide
225+
*
226+
* Functions are themselves encoded as arrays. The first item is a list of
227+
* types representing the function's inputs, and the second list item is a list
228+
* of types representing the function's output. Tuples are flattened.
229+
* Types are also represented as arrays; the first item is an index into the `p`
230+
* array, while the second is a list of types representing any generic parameters.
231+
*
232+
* b[i] contains an item's impl disambiguator. This is only present if an item
233+
* is defined in an impl block and, the impl block's type has more than one associated
234+
* item with the same name.
235+
*
236+
* `a` defines aliases with an Array of pairs: [name, offset], where `offset`
237+
* points into the n/t/d/q/i/f arrays.
238+
*
239+
* `doc` contains the description of the crate.
240+
*
241+
* `p` is a list of path/type pairs. It is used for parents and function parameters.
242+
*
243+
* `c` is an array of item indices that are deprecated.
244+
* @typedef {{
245+
* doc: string,
246+
* a: Object,
247+
* n: Array<string>,
248+
* t: String,
249+
* d: Array<string>,
250+
* q: Array<[Number, string]>,
251+
* i: Array<Number>,
252+
* f: string,
253+
* p: Array<Object>,
254+
* b: Array<[Number, String]>,
255+
* c: Array<Number>
256+
* }}
257+
*/
258+
let RawSearchIndexCrate;

0 commit comments

Comments
 (0)