Skip to content

feat: pull Python types into TypeScript#126

Merged
sergioramos merged 17 commits into
mainfrom
api-types
Jun 9, 2023
Merged

feat: pull Python types into TypeScript#126
sergioramos merged 17 commits into
mainfrom
api-types

Conversation

@danvk

@danvk danvk commented Jun 8, 2023

Copy link
Copy Markdown
Collaborator

This goes a long way towards #19

The Python API is defined using dataclasses in typing.py. The top-level entry point is CliCommands, which defines the return type for each subcommand. These get serialized to JSON Schema using dataclasses-jsonschema. This JSON Schema file gets converted to TypeScript using json-schema-to-typescript.

So Python is the source of truth. To update the TypeScript, run yarn codegen. There's a check for diffs on CI so this should be hard to get wrong (here's what it looks like when this fails). There's also a check that properties in CliCommands and the subcommands stay in sync.

On the TypeScript side, we have a new callSidecar helper that gets us types based on the subcommand:

image

This should work well for now. A few warts:

  • I made no attempt to encode the arguments to each subcommand. We'll probably to tackle this as part of a move towards a long-running Python service, whatever form that takes (JSON on stdin, websockets, etc.)
  • Default properties on a dataclass are a bit awkward to model. For example authors is optional when you create a Reference but because it has a default value, that property will always be present when you receive it as a response. But it still comes out of codegen as optional.
  • dataclasses-jsonschema is marked as being "in maintenance mode" as of last fall.

Comment thread python/cli.schema.json

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use this to validate responses or generate OpenAPI / Swagger docs if we like.

Comment thread python/generate_schema.py
commands = [*parser._get_positional_actions()[0].choices.keys()]
cli_schema = CliCommands.json_schema()

# There needs to be a 1-1 match between subcommands and CliCommands properties.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the check that the subcommands and the CliCommands dataclass stay in sync.

Comment thread python/main.py
Comment thread python/sidecar/ingest.py
"""
prepared_references = []
for ref in references:
# no need to include reference abstract, contents, or chunks in response

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JSON serialization can be done using the dataclass's .to_json() or .to_dict() method, no need to roll it by hand. If we want to drop these fields, we can define a new dataclass that doesn't have them and put it in the API schema.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danvk it seems this doesn't work for nested dataclasses. for example:

    references = [
        Reference(
            source_filename="some_file.pdf",
            filename_md5="abc123",
            title="Some title",
            authors=[
                Author(full_name="John Doe"),
                Author(full_name="Jane Doe"),
            ],
            chunks=[
                Chunk(text="This is some text."),
                Chunk(text="This is some more text."),
            ]
        ),
        Reference(
            source_filename="another_file.pdf",
            filename_md5="abcdefg123",
            title="Another title",
            authors=[
                Author(full_name="John Doe"),
                Author(full_name="Jane Doe"),
            ],
            chunks=[
                Chunk(text="Some text about something interesting."),
                Chunk(text="More text about something."),
            ]
        )
    ]
In [26]: ref = references[0]

In [27]: ref.to_json()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[27], line 1
----> 1 ref.to_json()

File ~/Library/Caches/pypoetry/virtualenvs/refstudio-iTeERiwC-py3.10/lib/python3.10/site-packages/dataclasses_jsonschema/__init__.py:1017, in JsonSchemaMixin.to_json(self, omit_none, validate, **json_kwargs)
   1016 def to_json(self, omit_none: bool = True, validate: bool = False, **json_kwargs) -> str:
-> 1017     return json.dumps(self.to_dict(omit_none, validate), **json_kwargs)

File ~/.pyenv/versions/3.10.11/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/json/__init__.py:231, in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    226 # cached encoder
    227 if (not skipkeys and ensure_ascii and
    228     check_circular and allow_nan and
    229     cls is None and indent is None and separators is None and
    230     default is None and not sort_keys and not kw):
--> 231     return _default_encoder.encode(obj)
    232 if cls is None:
    233     cls = JSONEncoder

File ~/.pyenv/versions/3.10.11/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/json/encoder.py:199, in JSONEncoder.encode(self, o)
    195         return encode_basestring(o)
    196 # This doesn't pass the iterator directly to ''.join() because the
    197 # exceptions aren't as detailed.  The list call should be roughly
    198 # equivalent to the PySequence_Fast that ''.join() would do.
--> 199 chunks = self.iterencode(o, _one_shot=True)
    200 if not isinstance(chunks, (list, tuple)):
    201     chunks = list(chunks)

File ~/.pyenv/versions/3.10.11/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/json/encoder.py:257, in JSONEncoder.iterencode(self, o, _one_shot)
    252 else:
    253     _iterencode = _make_iterencode(
    254         markers, self.default, _encoder, self.indent, floatstr,
    255         self.key_separator, self.item_separator, self.sort_keys,
    256         self.skipkeys, _one_shot)
--> 257 return _iterencode(o, 0)

File ~/.pyenv/versions/3.10.11/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/json/encoder.py:179, in JSONEncoder.default(self, o)
    160 def default(self, o):
    161     """Implement this method in a subclass such that it returns
    162     a serializable object for ``o``, or calls the base implementation
    163     (to raise a ``TypeError``).
   (...)
    177 
    178     """
--> 179     raise TypeError(f'Object of type {o.__class__.__name__} '
    180                     f'is not JSON serializable')

TypeError: Object of type Author is not JSON serializable

Calling to_dict() will run, but Author will not be serialized and will remain as a dataclass. maybe i'm doing something wrong?

i know that asdict will serialize everything.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this was ultimately related to using str | None vs Optional[str]

chat = Chat(prompt, n_options)
response = chat.get_response()
sys.stdout.write(json.dumps(response))
sys.stdout.write(json.dumps([r.to_dict() for r in response]))

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't call response.to_dict() or response.to_json() here since it's a list. (Generally it's better to have the top-level response type of APIs be a dict so that you can more easily add new fields.)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that we should try to return dict at top-level to be able to evolve the response payload. We should keep it for how, but create a follow-up ticket.

@danvk danvk marked this pull request as ready for review June 8, 2023 22:01
@danvk danvk assigned gjreda and unassigned gjreda Jun 8, 2023
@danvk danvk requested review from gjreda and sehyod June 8, 2023 22:01
@cguedes cguedes self-requested a review June 9, 2023 09:42

@sehyod sehyod left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's nice to have these automatically generated types! 🎉
I don't believe the optional fields are that much of an issue, because we are converting the received type to our internal type anyway (in parsePdfIngestionResponse).
Maybe this function could throw an error (or ignore the entry and log the problem somewhere) if authors is empty. This way we don't have to specify the default value both in python and in typescript

@gjreda gjreda left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - thanks for this one @danvk!

@sergioramos sergioramos changed the title Pull Python types into TypeScript feat: pull Python types into TypeScript Jun 9, 2023
@sergioramos sergioramos merged commit 6c02266 into main Jun 9, 2023
@sergioramos sergioramos deleted the api-types branch June 9, 2023 16:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants