Skip to content

Conversation

@kimgr
Copy link
Contributor

@kimgr kimgr commented Mar 25, 2025

Avro describes a very simple Single Object Encoding in
https://avro.apache.org/docs/1.12.0/specification/#single-object-encoding-specification

Encoders put a schema fingerprint in the header, which decoders can use to look
up a schema definition for deserialization.

There are two codecs:

  • soe.Codec: follows avro.Marshal/Unmarshal but adds/parses SOE header
  • soe.TypedCodec: a type-safe generic version for avrogen-generated types

Both of them can encode/decode SOE, using a single schema.

Also add soe.DynamicDecoder, which has a more liberal view of types and schemas:
for each record, try to look up the associated schema from a configured
resolver, and decode anything for which a schema is found. This can be used for
full dynamic encoding of streams of records from different encoders, as long as
their schemas are available for decoding. This is similar to how the Avro Java
implementation does it.

Include a minimal in-memory schema store.

@kimgr
Copy link
Contributor Author

kimgr commented Mar 25, 2025

The current state is mostly wrappers around API Marshal/Unmarshal. I was hoping to be able to decorate/wrap avro.Encoder and avro.Decoder too, but I don't think that's easily done since they're composed and cached by API. Let me know if you have ideas.

@kimgr
Copy link
Contributor Author

kimgr commented Mar 27, 2025

I thought of another thing I wanted some feedback on: the basic and generic encoder/decoder pairs could benefit from being one entity, i.e Codec and GenericCodec that each can do bother encoding and decoding. Any preference on bundling there?

I've noticed when writing tests for application code only doing decoding it's very handy to have an encoder for the same schema available, and vice versa.

@nrwiersma nrwiersma requested a review from Copilot April 1, 2025 05:52
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces native support for Avro Single Object Encoding (SOE) by implementing core encoding/decoding functionality, generic type-safe wrappers, and a dynamic decoder leveraging a schema resolver.

  • Implements core SOE header creation, parsing, and fingerprint computation in soe/soe.go and soe/basic.go.
  • Introduces generic and dynamic encoder/decoder types in soe/generic.go and soe/dynamic.go.
  • Adds tests in soe/basic_test.go covering header format, error cases for short headers, bad magic, and fingerprint mismatches.

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

Show a summary per file
File Description
soe/soe.go Defines SOE header construction, parsing, schema fingerprinting, and associated utilities.
soe/resolvers.go Introduces an interface for schema resolution used by the dynamic decoder.
soe/generic.go Implements generic encoder and decoder wrappers for avrogen-generated types.
soe/dynamic.go Provides a dynamic decoder that resolves schemas using a provided resolver.
soe/basic.go Offers the basic encoder and decoder implementation for SOE framing.
soe/basic_test.go Contains tests validating header format, magic checks, and fingerprint handling.
Comments suppressed due to low confidence (2)

soe/soe.go:32

  • [nitpick] Consider defining a constant for the header length (10) to improve code readability and maintainability.
if len(data) < 10 {

soe/soe.go:42

  • [nitpick] Consider defining a constant for the expected fingerprint length (8) to enhance clarity and avoid magic numbers.
if len(fingerprint) != 8 {

@nrwiersma
Copy link
Member

I thought of another thing I wanted some feedback on: the basic and generic encoder/decoder pairs could benefit from being one entity, i.e Codec and GenericCodec that each can do bother encoding and decoding. Any preference on bundling there?

I've noticed when writing tests for application code only doing decoding it's very handy to have an encoder for the same schema available, and vice versa.

Bundling them seems like a good idea.

@kimgr
Copy link
Contributor Author

kimgr commented Apr 1, 2025

Would you prefer I finish up everything in this PR, or post piecemeal PRs split into;

  • basic codec + soe.go utilities
  • generic codec
  • dynamic decoder with resolver, etc.

@kimgr
Copy link
Contributor Author

kimgr commented Apr 1, 2025

Actually it's probably easier to do everything together. Just worried it will turn into a megapatch, but let's worry about that if it becomes a problem.

@kimgr
Copy link
Contributor Author

kimgr commented Apr 1, 2025

Bundling them seems like a good idea.

Cool, done.

Also added tests for generic codec, mostly repeated from basic_test.go.

Still haven't finished up the DynamicDecoder + Resolver.

@kimgr
Copy link
Contributor Author

kimgr commented Apr 5, 2025

I think this has everything I wanted to do:

  • Rebased
  • Rename GenericCodec -> TypedCodec
  • Add MemorySchemaRegistry to use as a testing SchemaResolver
  • Finish up DynamicDecoder API
  • Add full test suite for DynamicDecoder

Might be worthwhile to add some tests for the minor utility functions in soe.go too, but let's see what you think about this first.

@kimgr
Copy link
Contributor Author

kimgr commented Apr 14, 2025

Gentle Easter week ping!

@nrwiersma
Copy link
Member

Gentle Easter week ping!

Sorry, I have not forgotten. Will get to it as soon as I have time.

@nrwiersma
Copy link
Member

I left a couple comments, but in general I think this is looking good. Will properly review once it is out of Draft.

@kimgr kimgr changed the title draft proposal: SOE coders feat: add SOE codecs Apr 17, 2025
@kimgr
Copy link
Contributor Author

kimgr commented Apr 17, 2025

Alright, squashed the history and wrote up a more proper commit message. When you're happy with this, I'd like to try it out with some of our internal code so let's not merge until I've had a chance to run it for real.

@kimgr kimgr marked this pull request as ready for review April 17, 2025 16:00
Copy link
Member

@nrwiersma nrwiersma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Will wait for your go ahead before merging.

@kimgr
Copy link
Contributor Author

kimgr commented Oct 11, 2025

Sorry for going AWOL on this PR, I ran into 6 months of unrelated chaos and still haven't had a chance to integrate this on our end. That said, it would be easier to do so if it was merged and released, so I think we can move ahead if you're comfortable with it. I'll rebase the branch and push again.

Avro describes a very simple Single Object Encoding in
https://avro.apache.org/docs/1.12.0/specification/#single-object-encoding-specification

Encoders put a schema fingerprint in the header, which decoders can use to look
up a schema definition for deserialization.

There are two codecs:

* soe.Codec: follows avro.Marshal/Unmarshal but adds/parses SOE header
* soe.TypedCodec: a type-safe generic version for avrogen-generated types

Both of them can encode/decode SOE, using a single schema.

Also add soe.DynamicDecoder, which has a more liberal view of types and schemas:
for each record, try to look up the associated schema from a configured
resolver, and decode anything for which a schema is found. This can be used for
full dynamic encoding of streams of records from different encoders, as long as
their schemas are available for decoding. This is similar to how the Avro Java
implementation does it.

Include a minimal in-memory schema store.
@nrwiersma nrwiersma merged commit 07e49cb into hamba:main Oct 13, 2025
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants