[SPARK-3713][SQL] Uses JSON to serialize DataType objects #2563
Conversation
QA tests have started for PR 2563 at commit
QA tests have finished for PR 2563 at commit
Test FAILed.
QA tests have started for PR 2563 at commit
Tests timed out after a configured wait of
Test FAILed.
QA tests have started for PR 2563 at commit
QA tests have finished for PR 2563 at commit
I think this comment is in the wrong place. We should probably note that this parser is deprecated and is only here for backwards compatibility. We might even print a warning when it is used so we can get rid of it eventually.
Ah, this comment is a mistake. Instead of printing a warning, I made fromCaseClassString() private. It's only referenced by CaseClassStringParser, which has already been marked as deprecated.
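For illustration, the warning approach discussed above could look something like the following. This is a hypothetical Python sketch, not Spark's actual code; the function name and message are made up for the example.

```python
import warnings

def parse_case_class_string(s):
    """Hypothetical sketch of the deprecation-warning idea: flag the
    legacy string-based schema parser whenever it is called, so the
    old format can eventually be removed."""
    warnings.warn(
        "Case-class string schemas are deprecated; use the JSON format.",
        DeprecationWarning,
        stacklevel=2,
    )
    # Legacy parsing logic would go here; the sketch just echoes the input.
    return s
```

In the PR itself, the method was made private instead, which achieves the same goal of discouraging new callers without emitting runtime noise.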
Minor comment, otherwise this LGTM.
python/pyspark/sql.py
You can have a default implementation as:
self.__class__.__name__[:-4].lower()
Thanks for this, it saved lots of boilerplate code! Removed all the simpleString() methods in subclasses.
QA tests have started for PR 2563 at commit
QA tests have finished for PR 2563 at commit
Test FAILed.
QA tests have started for PR 2563 at commit
QA tests have finished for PR 2563 at commit
Test PASSed.
python/pyspark/sql.py
Why not just put _get_simple_string here? (It's not needed as a separate function, and it will be harder to understand without this context.) To make it available on the class, it could be a classmethod:
@classmethod
def simpleString(cls):
    return cls.__name__[:-4].lower()
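The suggestion above can be sketched as a self-contained example. This is a minimal illustration, not PySpark's actual class hierarchy: the classmethod derives the "simple string" by dropping the trailing "Type" (4 characters) from the class name and lowercasing the rest, so subclasses need no override.

```python
class DataType:
    """Hypothetical minimal base class illustrating the suggestion."""

    @classmethod
    def simpleString(cls):
        # e.g. "IntegerType" -> "Integer" -> "integer"
        return cls.__name__[:-4].lower()

# Subclasses inherit the default and need no per-class boilerplate.
class IntegerType(DataType):
    pass

class StringType(DataType):
    pass
```

Because it is a classmethod, it works both on the class itself and on instances, which is what lets the per-subclass methods be deleted.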
@davies Thanks for all the suggestions, really makes things a lot cleaner!
QA tests have started for PR 2563 at commit
Force-pushed from 54c46ce to 785b683
QA tests have started for PR 2563 at commit
QA tests have finished for PR 2563 at commit
Test PASSed.
Tests timed out after a configured wait of
python/pyspark/sql.py
If you'd like to use a single string for primitive types, that's still doable; just use a one-layer dict for the others. Either one is fine with me.
This looks good to me; you just forgot to roll back the changes in run-tests after debugging.
@davies Sorry for my carelessness... And thanks again for all the great advice!
QA tests have started for PR 2563 at commit
QA tests have finished for PR 2563 at commit
Test PASSed.
LGTM now, thanks!
Could you rebase this to master?
Force-pushed from de18dea to fc92eb3
Finished rebasing.
QA tests have started for PR 2563 at commit
QA tests have started for PR 2563 at commit
QA tests have finished for PR 2563 at commit
Test FAILed.
QA tests have finished for PR 2563 at commit
@marmbrus I think this is ready to go.
Thanks! I've merged this.
This PR uses JSON instead of toString to serialize DataTypes. The latter is not only hard to parse but also flaky in many cases. Since we already write schema information to Parquet metadata in the old style, we have to preserve the old DataType parser to ensure backward compatibility. The old parser is now renamed to CaseClassStringParser and moved into object DataType.
@JoshRosen @davies Please help review the PySpark related changes, thanks!
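The core idea of the PR can be illustrated with plain Python: represent a nested schema as dicts and strings and round-trip it through JSON, rather than parsing a toString()-style case-class dump. The key names below only mimic the general shape of such a format and are not guaranteed to match Spark's exact JSON layout.

```python
import json

# Hypothetical nested schema, expressed as plain dicts/strings.
schema = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "integer", "nullable": False},
        {
            "name": "tags",
            "type": {"type": "array", "elementType": "string",
                     "containsNull": True},
            "nullable": True,
        },
    ],
}

serialized = json.dumps(schema)   # stable, machine-readable wire format
restored = json.loads(serialized)  # lossless round trip
```

Unlike a toString() dump, the JSON form can be parsed by any standard JSON library, which is what makes the old hand-written string parser unnecessary for new data.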