Skip to content

Make marshalled Custom_tag objects store their length.#1683

Merged
damiendoligez merged 2 commits intoocaml:trunkfrom
stedolan:marshal-custom-length
Sep 14, 2018
Merged

Make marshalled Custom_tag objects store their length.#1683
damiendoligez merged 2 commits intoocaml:trunkfrom
stedolan:marshal-custom-length

Conversation

@stedolan
Copy link
Contributor

The format used by the Marshal module does not currently store the length of Custom_tag objects, making correct unmarshalling essentially impossible on the multicore branch. This patch adds the length to the marshal format.

The change is backwards- but not forwards-compatible: old marshalled files continue to parse, but new marshalled files containing custom objects will raise an exception when unmarshalled with old OCaml.

Background

In the struct custom_operations, custom objects provide the following two callbacks:

  void (*serialize)(value v,
                    /*out*/ uintnat * bsize_32 /*size in bytes*/,
                    /*out*/ uintnat * bsize_64 /*size in bytes*/);
  uintnat (*deserialize)(void * dst);

To marshal a custom object, serialize writes some bytes using e.g. caml_serialize_int_4 (and the other functions from intext.h), and returns two values in the bsize_32 and bsize_64 pointers: the number of bytes of memory that will be needed to store the unmarshalled value on 32- and 64-bit platforms respectively.

To unmarshal a custom object, deserialize takes a pointer to a piece of memory that is large enough to contain the unmarshalled object (that is, at least bsize_32 or bsize_64 bytes long, depending on word size), and fills it with the unmarshalled value using e.g. caml_deserialize_int_4 and friends. deserialize returns the number of bytes of dst that it has filled in, which should be equal to the value of bsize_32 or bsize_64 (again, depending on word size) that was returned by serialize.

Current marshal format

However, the marshal format does not store bsize_32 and bsize_64, so how does the unmarshaller know how large a buffer to pass to deserialize?

The marshal format's header stores the total number of bytes necessary to unmarshal the entire file. The unmarshaller allocates space for the entire data structure in a single contiguous block, and fills it in sequentially. So, the buffer passed to deserialize is the entire rest of this block, large enough not only for the custom object but for all subsequent values as well. The unmarshaller has no way of knowing in advance how many bytes of this block deserialize will consume, which is why deserialize must return this count so that the unmarshaller can advance its pointer.

This is heavily tied to the current heap layout and memory allocator: it relies on being able to store arbitrarily many objects in a single contiguous block. Currently, the multicore branch does not use this object layout, and needs to know how many bytes to allocate for each custom block encountered during unmarshalling, making this trick unworkable.

Change to the format

This patch adds a new code CODE_CUSTOM_LEN to the marshal format, which differs from the existing CODE_CUSTOM by including the 32-bit and 64-bit size fields. The marshaller (extern.c) now only produces CODE_CUSTOM_LEN, but the unmarshaller still understands CODE_CUSTOM. This means that old marshalled files continue to work, but new ones will not work with old programs.

This makes marshalled custom objects 12 bytes longer (4 byte bsize_32 + 8 byte bsize_64). More complicated and compact schemes are possible, but this is simplest.

This means that the unmarshaller now knows how many bytes of memory a custom deserialiser requires before calling the deserialiser, which is a requirement for the multicore branch. That branch will not support unmarshalling CODE_CUSTOM, so I'd like to get patch merged soon to make transition easier.

Currently, the only use that the unmarshaller makes of the new length fields is error-checking: with this patch, length inconsistencies between custom serialize and deserialize functions will be detected. There is now a test in tests/lib-marshal testing this.

@stedolan
Copy link
Contributor Author

I think the build failure on MSVC64 is an old, minor bug in nativeint_deserialize being uncovered by the new error checking.

I've created #1684 to fix that bug, which I think should make this build pass. (I don't have a Windows machine handy to test on, so I'll just wait for the CI to run again).

@stedolan stedolan force-pushed the marshal-custom-length branch from 4d25bec to 9e2e626 Compare April 3, 2018 08:17
@stedolan
Copy link
Contributor Author

stedolan commented Apr 3, 2018

The build was indeed fixed by #1684, so I think this is ready to merge.

@gasche
Copy link
Member

gasche commented Apr 3, 2018

cc @xavierleroy: this looks like something you would be interested in looking at

@xavierleroy
Copy link
Contributor

I haven't looked at the code yet. I understand the need but the increase in size (of marshaled data) makes me a bit nervous. Perhaps we could use a variable-length encoding (think UTF-8) for the two sizes.

Take bignums and bigarrays as examples: these are variable-length custom blocks, but the info needed to determine the size come first in the marshaled byte stream, followed by the actual data. So, this gets me thinking: what about a different demarshaling API with one function that determines the size (reading as much of the input data as needed to do so), so that the custom block can be allocated, then another function fills it (reading the rest of the input data).

Symmetrically, if the size of the custom block is always written, it could be passed to the demarshaler function, and may not need to be stored again in the encoding of the custom block. This could work for bignums and one-dimension bigarrays, but not multiple-dimension bigarrays.

@xavierleroy
Copy link
Contributor

PS. Variable-length encodings need not be complicated. Here is an encoding that is often used in smart card protocols and is even simpler than UTF-8:

1xxxxxxx 1yyyyyyy ... 0zzzzzzz

i.e. 7 bits per byte, top bit 1 means "there is more to come", top bit 0 means "end of number".

@stedolan
Copy link
Contributor Author

stedolan commented Apr 3, 2018

I did consider using variable-length sizes. Coincidentally, I was even planning to use the format you suggest! (which I know under the name "varint") The major annoyance is the serialise callback, which begins writing bytes and only tells the marshaller the length afterwards. This means that if a variable-length format is used, the serialised data must be shifted into the correct place afterwards, since the offset at which the data should be written is not known until the length is determined. I don't think this extra copy would be unworkably slow, but it was sufficiently annoying that I went for a fixed-size format instead.

On the other hand, I don't think the size change matters much. Here are some size numbers for boot/ocamlc, in the new and old marshal formats, with and without compression:

boot/ocamlc raw gzip lzma
old 2231.2 K 475.9 K 310.1 K
new 2231.8 K 476.0 K 310.2 K

The most pathological case I could think of was a long int32 array, so I also tested with Array.init 1000 (fun x -> Int32.of_int x):

int32 array raw gzip lzma
old 7.8 K 1.85 K 0.66 K
new 19.6 K 2.34 K 0.72 K

So, I don't think the size difference will usually be significant, and in any case using a standard compressor makes far more difference than fiddling with the marshal format.

I agree that the problems could be solved by changing the custom marshalling API instead of changing the marshal format, but I think changing the API would cause more pain than changing the format. Apart from the bootstrap itself, I haven't come across much long-lived data using the Marshal format, but there are quite a few custom serialisers / deserialisers hanging around.

@dra27 dra27 added this to the consider-for-4.07 milestone Apr 5, 2018
@mshinwell mshinwell removed this from the consider-for-4.07 milestone Apr 9, 2018
@stedolan
Copy link
Contributor Author

I'm afraid I've lost track of who's waiting for who. Is this ready-to-merge, or waiting for me to implement variable-length-sizes + shift data afterwards? (I'm happy with either option)

@xavierleroy
Copy link
Contributor

I'm still pondering the options and I don't have a strong opinion. Other opinions most welcome. At any rate this won't make it int 4.07 so we have some more time to converge on a design.

The boot/ocamlc measurements are not really conclusive because few custom blocks (e.g. int32, int64, bigarrays) are in the static data of this application.

@stedolan
Copy link
Contributor Author

OK. I also don't have a strong opinion. I picked one option for the sake of making a PR, but I'm not particularly attached to it. I don't think any of the options are very much work to implement.

I agree that boot/ocamlc is not a good example of using custom blocks. The point I was trying to make with the measurements is that (a) standard compressors are very good at compressing Marshal files, so anyone size-sensitive should be using them, and (b) after applying a standard compressor, this patch makes the pathological case of int32 array only 10% bigger (25% for gzip).

Adds a new marshal code CODE_CUSTOM_LEN, which stores the lengths
(for both 32- and 64-bit) of each marshalled custom block. The
marshaller only uses CODE_CUSTOM_LEN, but the unmarshaller still
understands CODE_CUSTOM (So, old files still unmarshal, but new ones
don't work on old versions).

This allows the unmarshaller to know exactly how much memory is
needed before calling a custom `deserialize` function.
@stedolan stedolan force-pushed the marshal-custom-length branch from 9e2e626 to 0d3c193 Compare July 3, 2018 11:58
@stedolan
Copy link
Contributor Author

stedolan commented Jul 3, 2018

Following @damiendoligez's suggestion, I added CODE_CUSTOM_FIXED as an optimisation for fixed-length custom blocks. Now int32, int64, and nativeint consume the same number of bytes as before, but lengths are still available to the deserialiser.

I added some error checking in serialisation: I chose to trigger a fatal error during serialization if fixed lengths are specified but serialize produces a different length, and to raise an exception during deserialization if fixed lengths are expected but unavailable. (The idea is that bad input can be handled with an exception, but finding bad C code in the current process is not something that should be recovered from).

I also fixed a typo (wsize -> bsize for a size in bytes) in the manual while adding docs for the new field.

Copy link
Contributor

@xavierleroy xavierleroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me overall. A few comments below, but nothing that requires a change.

uintnat (*deserialize)(void * dst);
int (*compare_ext)(value v1, value v2);
const struct custom_fixed_length* fixed_length;
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't expecting a pointer to struct custom_fixed_length, just a flat sub-structure.

Copy link
Contributor Author

@stedolan stedolan Jul 4, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was the original plan, but it got lost in a maze of incompatible compilers.

The issue is supporting custom_fixed_length_default, since the custom_foo_default macros are used both in initialiser lists and as rvalues in assignments. As a flat sub-structure, custom_fixed_length_default could be:

  • {0, 0} - an initialiser list, not valid as an rvalue
  • (struct custom_fixed_length){0, 0} - a structure literal, valid C99 but rejected by MSVC
  • empty_fixed_length (some predefined static const object) - when used as an initialiser, deemed "non-constant" and rejected by C99 compilers (although valid C++, so MSVC might be OK with this)

Making it be a pointer and defining custom_fixed_length_default as NULL seemed the easiest way out. If you can think of a portable alternative, I'm happy to change this!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, there is no nice solution in C90. A not very nice solution:

# define custom_fixed_length_default {0, 0}

for use in initializers (the most common use). For assigning to a lvalue e, you could do

  { struct custom_fixed_length cfl = custom_fixed_length_default; e = cfl; }

Is it the case that caml_finalize is the only place where we need an assignment to an expression of type struct custom_fixed_length? If yes there might be other ways to write this function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think caml_finalize is indeed the only occurrence of custom_fixed_length_default in an assignment, and changing it isn't hard. I was more worried about external users of the API, who might be used to using custom_foo_default in both assignments and initialisers.

size_header = extern_ptr;
extern_ptr += 12;
Custom_ops_val(v)->serialize(v, &sz_32, &sz_64);
/* Store length before serialized block */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use the serialize variable rather than fetch from Custom_ops_val(v) again. CSE in the C compiler might not be clever enough to reuse it for you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

write(CODE_CUSTOM_FIXED);
writeblock(ident, strlen(ident) + 1);
Custom_ops_val(v)->serialize(v, &sz_32, &sz_64);
if (sz_32 != fixed_length->bsize_32 ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment re: reuse of serialize.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

runtime/extern.c Outdated
Custom_ops_val(v)->serialize(v, &sz_32, &sz_64);
if (sz_32 != fixed_length->bsize_32 ||
sz_64 != fixed_length->bsize_64)
caml_fatal_error("output_value: incorrect fixed sizes specified by %s",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm tempted to put the check in a CAMLassert, so that it is performed by the debug runtime but not by the normal runtime.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It uses caml_fatal_error so that it can specify the name of the offending serialiser (an assertion failure is harder to debug, and looks like a bug in the runtime rather than the user's custom serialiser)

if (size != expected_size) {
intern_cleanup();
caml_failwith("input_value: incorrect length of serialized custom block");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But here I agree an exception is better than an assertion.

@stedolan stedolan force-pushed the marshal-custom-length branch from 0d3c193 to 1f5c010 Compare July 4, 2018 13:41
Copy link
Member

@damiendoligez damiendoligez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I don't have a strong opinion on the "substructure vs pointer" debate.

runtime/extern.c Outdated
write(CODE_CUSTOM);
writeblock(ident, strlen(ident) + 1);
Custom_ops_val(v)->serialize(v, &sz_32, &sz_64);
if (!fixed_length) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please write if (fixed_length == NULL), if only to be consistent with the previous code (2 lines up).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@stedolan stedolan force-pushed the marshal-custom-length branch from 1f5c010 to 614bb14 Compare July 6, 2018 14:40
@stedolan
Copy link
Contributor Author

Fixed line length to keep check-typo happy.

@stedolan stedolan force-pushed the marshal-custom-length branch from 614bb14 to 0888b63 Compare July 10, 2018 11:52
@damiendoligez damiendoligez self-assigned this Jul 11, 2018
@stedolan stedolan force-pushed the marshal-custom-length branch from 0888b63 to ff0b18f Compare August 21, 2018 14:43
@stedolan
Copy link
Contributor Author

I didn't notice that a bootstrap had made this unmergeable! Rebased now.

As a space optimisation, custom serializers that always consume
the same number of bytes need not send the lengths explicitly.
@stedolan stedolan force-pushed the marshal-custom-length branch from ff0b18f to e9b04bb Compare August 22, 2018 09:33
@stedolan
Copy link
Contributor Author

I accidentally committed a bad bootstrap in yesterday's rebase. Should be fixed now.

@stedolan
Copy link
Contributor Author

I think this is ready to merge!

@damiendoligez
Copy link
Member

Appveyor's failure is an unrelated timeout. Merging.

@damiendoligez damiendoligez merged commit 2fd0186 into ocaml:trunk Sep 14, 2018
@stedolan
Copy link
Contributor Author

Thanks!

gasche added a commit to gasche/ocaml that referenced this pull request Aug 11, 2020
rajgodse added a commit to rajgodse/ocaml that referenced this pull request Aug 18, 2023
Multiway match typechecking and translation for pattern guards (ocaml#2)

* multiway typechecking and translation

* update jane test output

* self-review: format and style in translcore

* more translcore/typedtree cleanup

* expose value `is_guarded_rhs`

* fix typedtree printer

* make discussed changes to ocamldoc

* format: remove unnecessary parens in pattern

* improve parmatch variable naming

* explain [exp_attributes] and [exp_extra] weirdness

* improve translcore [event_function*] naming

* inlined transl_body in transl_rhs

* rename pats_exp... to use "rhs" naming

* added test for guarded value/exception or-patterns

* address ocamldoc CRs

---------

Co-authored-by: Nick Roberts <[email protected]>
EmileTrotignon pushed a commit to EmileTrotignon/ocaml that referenced this pull request Jan 12, 2024
Co-authored-by: Etienne Millon <[email protected]>
Co-authored-by: Christine Rose <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants