Skip to content

Conversation

@eregon
Copy link
Member

@eregon eregon commented Nov 5, 2025

…s and not a Hash

* rb_struct_initialize() does not accept a Hash, and it's very brittle
  to pass `[{...}]` and to rely on that C function using rb_keyword_given_p().
  It basically worked accidentally, by having **members in the caller of the caller.
  Such logic when Struct#initialize is defined in Ruby (as in TruffleRuby) is basically impossible to implement,
  because it's incorrectly treating positional arguments as keyword arguments.
* rb_struct_initialize() is used in CRuby to set members of Data instances in marshal.c (there is no rb_data_initialize() yet).
  There, the code passes an Array of members values for Data (and for Struct which are not `keyword_init: true`):
  https://github.com/ruby/ruby/blob/48c7f349f68846e10d60ae77ad299a38ee014479/marshal.c#L2150-L2176
  So we should do the same in psych.
* Rename to init_data since it's only used for Data.
* See ruby#692 (comment).
@eregon eregon requested a review from tenderlove November 5, 2025 10:31
@eregon eregon mentioned this pull request Nov 5, 2025
@eregon eregon changed the title Fix usage of rb_struct_initialize() to pass an Array of members values and not a Hash Fix usage of rb_struct_initialize() to pass an Array of members values and not a Hash and add TruffleRuby in CI Nov 5, 2025
@eregon
Copy link
Member Author

eregon commented Nov 5, 2025

I added TruffleRuby in CI in this PR since it just needs a tiny extra fix to be fully green.
This will help to prevent issues like truffleruby/truffleruby#3997 (or make them much faster to notice & fix)

end
data ||= allocate_anon_data(o, members)
init_struct(data, **members)
values = data.members.map { |m| members[m] }
Copy link
Member Author

@eregon eregon Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also use members.values but then we should check if the data_class.members == members.keys, and raise an exception if not. If so, which exception class should I use then, do we have one for "incompatible change in loaded classes"?
Otherwise if e.g. one changes the order of fields from Foo = Data.define(:a, :b) to Data.define(:b, :a) then it would load the instance incorrectly in a rather surprising manner.

If there are extra members in Data.define, the current approach sets them to nil, like before:

irb(main):002> Psych::VERSION
=> "5.2.6"
irb(main):011> Foo = Data.define(:a,:b)
=> Foo
irb(main):012> d=Psych.dump Foo.new(1,2)
=> "--- !ruby/data:Foo\na: 1\nb: 2\n"
irb(main):013> puts d
--- !ruby/data:Foo
a: 1
b: 2
=> nil
irb(main):014> Foo = Data.define(:a,:b, :c)
(irb):14: warning: already initialized constant Foo
(irb):11: warning: previous definition of Foo was here
=> Foo
irb(main):015> Psych.unsafe_load d
=> #<data Foo a=1, b=2, c=nil>

This is rb_struct_initialize()-specific behavior BTW, Data#initialize doesn't allow that:

> Data.define(:a,:b).new(a: 1)
(irb):3:in 'Data#initialize': missing keyword: :b (ArgumentError)

Checking members would raise, unless we get fancy and only compare the first N members where N is members.size.

Basically this seems a good approach and it's compatible with what was done before.

@tenderlove tenderlove merged commit d1211a2 into ruby:master Dec 2, 2025
111 of 113 checks passed
@nevans
Copy link
Contributor

nevans commented Dec 10, 2025

While I agree that init_data makes more sense than init_struct and CRuby ought to have a public API specifically designed for this, IMO we shouldn't convert the keyword args into an array: we should use keyword args (or a hash) for Data objects. If the hash keys (or keyword args) are not an exact match for the member names, it should use the default Data#initialize behavior: don't substitute nils, raise an exception.

So, as I understand it, this PR deviates from the current behavior of CRuby with psych 5.4.6. I think it should be updated to have the same behavior. Probably some tests need to be added for the following failure cases (I've included Marshal for comparison).

With CRuby 3.4.5 and psych 5.2.6:

$ irb -ryaml
irb(main):001> D = Data.define(:foo, :bar)
=> D
irb(main):002> d = D["foo", "bar"]
=> #<data D foo="foo", bar="bar">
irb(main):003> d.to_yaml
=> "--- !ruby/data:D\nfoo: foo\nbar: bar\n"
irb(main):004> Marshal.dump d
=> "\x04\bS:\x06D\a:\bfooI\"\bfoo\x06:\x06ET:\bbarI\"\bbar\x06;\aT"
irb(main):005> 

$ irb -ryaml
irb(main):001> D = Data.define(:a, :b)
=> D
irb(main):002> Marshal.load "\x04\bS:\x06D\a:\bfooI\"\bfoo\x06:\x06ET:\bbarI\"\bbar\x06;\aT"
<internal:marshal>:34:in 'Marshal.load': struct D not compatible (:foo for :a) (TypeError)
	from (irb):2:in '<main>'
	from <internal:kernel>:168:in 'Kernel#loop'
	from /home/nick/.local/share/rubies/3.4.5/lib/ruby/gems/3.4.0/gems/irb-1.15.3/exe/irb:9:in '<top (required)>'
	from /home/nick/.local/share/rbenv/versions/3.4.5/bin/irb:25:in 'Kernel#load'
	from /home/nick/.local/share/rbenv/versions/3.4.5/bin/irb:25:in '<main>'
irb(main):003> YAML.unsafe_load "--- !ruby/data:D\nfoo: foo\nbar: bar\n"
/home/nick/.local/share/rubies/3.4.5/lib/ruby/gems/3.4.0/gems/psych-5.2.6/lib/psych/visitors/to_ruby.rb:222:in 'Psych::Visitors::ToRuby#init_struct': unknown keywords: foo, bar (ArgumentError)

          init_struct(data, **members)
                      ^^^^^^^^^^^^^^^
	from /home/nick/.local/share/rubies/3.4.5/lib/ruby/gems/3.4.0/gems/psych-5.2.6/lib/psych/visitors/to_ruby.rb:222:in 'Psych::Visitors::ToRuby#visit_Psych_Nodes_Mapping'
	from /home/nick/.local/share/rubies/3.4.5/lib/ruby/gems/3.4.0/gems/psych-5.2.6/lib/psych/visitors/visitor.rb:30:in 'Psych::Visitors::Visitor#visit'
	from /home/nick/.local/share/rubies/3.4.5/lib/ruby/gems/3.4.0/gems/psych-5.2.6/lib/psych/visitors/visitor.rb:6:in 'Psych::Visitors::Visitor#accept'
	from /home/nick/.local/share/rubies/3.4.5/lib/ruby/gems/3.4.0/gems/psych-5.2.6/lib/psych/visitors/to_ruby.rb:35:in 'Psych::Visitors::ToRuby#accept'
	from /home/nick/.local/share/rubies/3.4.5/lib/ruby/gems/3.4.0/gems/psych-5.2.6/lib/psych/visitors/to_ruby.rb:345:in 'Psych::Visitors::ToRuby#visit_Psych_Nodes_Document'
	from /home/nick/.local/share/rubies/3.4.5/lib/ruby/gems/3.4.0/gems/psych-5.2.6/lib/psych/visitors/visitor.rb:30:in 'Psych::Visitors::Visitor#visit'
	from /home/nick/.local/share/rubies/3.4.5/lib/ruby/gems/3.4.0/gems/psych-5.2.6/lib/psych/visitors/visitor.rb:6:in 'Psych::Visitors::Visitor#accept'
	from /home/nick/.local/share/rubies/3.4.5/lib/ruby/gems/3.4.0/gems/psych-5.2.6/lib/psych/visitors/to_ruby.rb:35:in 'Psych::Visitors::ToRuby#accept'
	from /home/nick/.local/share/rubies/3.4.5/lib/ruby/gems/3.4.0/gems/psych-5.2.6/lib/psych/nodes/node.rb:49:in 'Psych::Nodes::Node#to_ruby'
	from /home/nick/.local/share/rubies/3.4.5/lib/ruby/gems/3.4.0/gems/psych-5.2.6/lib/psych.rb:275:in 'Psych.unsafe_load'
	from (irb):3:in '<main>'
	from <internal:kernel>:168:in 'Kernel#loop'
	from /home/nick/.local/share/rubies/3.4.5/lib/ruby/gems/3.4.0/gems/irb-1.15.3/exe/irb:9:in '<top (required)>'
	from /home/nick/.local/share/rbenv/versions/3.4.5/bin/irb:25:in 'Kernel#load'
	from /home/nick/.local/share/rbenv/versions/3.4.5/bin/irb:25:in '<main>'
irb(main):004>

And with TruffleRuby 33.0.0-dev-d6dd6557 and psych 5.2.6:

$ rbenv shell truffleruby-dev
$ irb -ryaml
irb(main):001> D = Data.define(:foo, :bar)
=> D
irb(main):002> d = D["foo", "bar"]
=> #<data D foo="foo", bar="bar">
irb(main):003> d.to_yaml
=> "--- !ruby/data:D\nfoo: foo\nbar: bar\n"
irb(main):004> Marshal.dump d
=> "\x04\bS:\x06D\a:\bfooI\"\bfoo\x06:\x06ET:\bbarI\"\bbar\x06;\aT"
irb(main):005> 

$ irb -ryaml
irb(main):001> D = Data.define(:a, :b)
=> D
irb(main):002> Marshal.load "\x04\bS:\x06D\a:\bfooI\"\bfoo\x06:\x06ET:\bbarI\"\bbar\x06;\aT"
<internal:core> core/marshal.rb:1013:in `block in construct_struct': struct D is not compatible (:foo for :a) (TypeError)
	from <internal:core> core/integer.rb:157:in `times'
	from <internal:core> core/marshal.rb:1010:in `construct_struct'
	from <internal:core> core/marshal.rb:719:in `construct'
	from <internal:core> core/marshal.rb:1532:in `load'
	from (irb):2:in `<main>'
	from <internal:core> core/kernel.rb:417:in `loop'
	from <internal:core> core/throw_catch.rb:36:in `catch'
	from /home/nick/.local/share/rubies/truffleruby-dev/lib/gems/gems/irb-1.15.3/exe/irb:9:in `<top (required)>'
	from <internal:core> core/kernel.rb:386:in `load'
	from /home/nick/.local/share/rbenv/versions/truffleruby-dev/bin/irb:25:in `<main>'
irb(main):003> YAML.unsafe_load "--- !ruby/data:D\nfoo: foo\nbar: bar\n"
=> #<data D a={:foo=>"foo", :bar=>"bar"}, b=nil>
irb(main):004> 

$ ruby --version
truffleruby 33.0.0-dev-d6dd6557, like ruby 3.3.7, Oracle GraalVM Native [x86_64-linux]

As a user, if my serialized data doesn't match the current definition, I'd much rather see a crash than have it continue with no exception but corrupted values.

@eregon You did bring up the scenario where the parameter order is changed. Currently, Marshal crashes but Psych allows it:

$ ruby --version
ruby 3.4.5 (2025-07-16 revision 20cda200d3) +PRISM [x86_64-linux]
$ irb -ryaml
irb(main):001> D = Data.define(:bar, :foo)
=> D
irb(main):002> Marshal.load "\x04\bS:\x06D\a:\bfooI\"\bfoo\x06:\x06ET:\bbarI\"\bbar\x06;\aT"
<internal:marshal>:34:in 'Marshal.load': struct D not compatible (:foo for :bar) (TypeError)
	from (irb):2:in '<main>'
	from <internal:kernel>:168:in 'Kernel#loop'
	from /home/nick/.local/share/rubies/3.4.5/lib/ruby/gems/3.4.0/gems/irb-1.15.3/exe/irb:9:in '<top (required)>'
	from /home/nick/.local/share/rbenv/versions/3.4.5/bin/irb:25:in 'Kernel#load'
	from /home/nick/.local/share/rbenv/versions/3.4.5/bin/irb:25:in '<main>'
irb(main):003> YAML.unsafe_load "--- !ruby/data:D\nfoo: foo\nbar: bar\n"
=> #<data D bar="bar", foo="foo">
irb(main):004> 

Although Data.define does provide an explicit ordering, which can be used with ::new or ::[], Data#initialize always uses keyword args, and they never care about order. I'm personally inclined to say that Psych 5.2.6 handles this correctly. Perhaps Marshal should be updated to not care about Data member order.

@nevans
Copy link
Contributor

nevans commented Dec 10, 2025

And sorry to nitpick, without celebrating first!

On the plus side, with this PR merged and psych 5.3.0 released, CRuby and TruffleRuby both behave the same now!

$ irb -ryaml
irb(main):001> D = Data.define(:foo, :bar)
=> D
irb(main):002> YAML.unsafe_load "--- !ruby/data:D\nfoo: foo\nbar: bar\n"
=> #<data D foo="foo", bar="bar">
irb(main):003> 

$ irb -ryaml
irb(main):001> D = Data.define(:a, :b)
=> D
irb(main):002> YAML.unsafe_load "--- !ruby/data:D\nfoo: foo\nbar: bar\n"
=> #<data D a=nil, b=nil>
irb(main):003> 

I believe that second scenario is a bug (I'll open an issue for it), but at least it's the same bug in both implementations! 🎉 😉

@eregon
Copy link
Member Author

eregon commented Dec 10, 2025

If the hash keys (or keyword args) are not an exact match for the member names, it should use the default Data#initialize behavior: don't substitute nils, raise an exception.

I agree this is safer, maybe you can make a PR for that? (EDIT: I made #761)

So, as I understand it, this PR deviates from the current behavior of CRuby with psych 5.4.6.
I'm personally inclined to say that Psych 5.2.6 handles this correctly.

Have you seen #749 (comment)?
It shows the previous behavior (the one you implemented) does allow nil for missing members.
Your output shows there are more edge cases though, which I agree are worth addressing.

I made that comment explicitly to discuss what should happen if members don't match, but it didn't receive any reply in over a month, so I assumed that everyone is OK with the current solution.
Why do you say the previous code is correct when I have shown it isn't? Maybe you misunderstood that comment? (it shows behavior before my change, i.e., behavior from your PR)
It makes me feel unhappy because from my POV this PR is fixing 1 out of 2 hacks added by #692, and your comment feels like a partially-unfair complaint to me.

I think I hadn't considered completely different members as in your example, I just assumed it would behave like in my comment with the extra member, but it seems it doesn't. That's a good point and I think we should validate members strictly, potentially allowing a different order.

While I agree that init_data makes more sense than init_struct and CRuby ought to have a public API specifically designed for this

Please file a ticket at https://bugs.ruby-lang.org/ for this, as I mentioned in #692 (comment).
Since you started using this API outside of Ruby core I think it would make a lot of sense for you to file it.
Using rb_struct_initialize to init a Data is a hack, which might be acceptable within Marshal which is core but I think is not outside of core. Also it has now clearly proven to be a poor fit, at least validation-wise.

IMO we shouldn't convert the keyword args into an array: we should use keyword args (or a hash) for Data objects.

I think that is orthogonal, regardless of how we pass the data we can/need to validate in Psych (relying on rb_struct_initialize() to validate is brittle and doesn't work, as shown in my comment above).

If there a C API to init a Data with a Hash or keyword arguments, that's fine, but there isn't one currently (though performance-wise a Hash is slower than Array). It's also relatively awkward to pass keyword arguments to a C function (needs to be wrapped in another C function, passed to rb_define_method, and depend on internals for checking "has this method (caller function) received kwargs") so I think the C API function should either take an Array, or a Hash but not kwargs.

@eregon
Copy link
Member Author

eregon commented Dec 10, 2025

maybe you can make a PR for that?

Actually I'll make one: #761

@headius
Copy link
Contributor

headius commented Dec 11, 2025

This PR introduced a new method only for the C extension, breaking Psych on JRuby. The JRuby jobs appear to not be reporting the error properly in most cases, but did show up as a failure in the most recent PR merge.

@headius
Copy link
Contributor

headius commented Dec 11, 2025

This PR introduced a new method only for the C extension, breaking Psych on JRuby. The JRuby jobs appear to not be reporting the error properly in most cases, but did show up as a failure in the most recent PR merge.

Nevermind, the new methods were originally added in #692 but never added for JRuby. That PR was merged in May 1 and the JRuby side has been broken ever since, but the CI jobs were not properly reflecting those failures.

@nevans
Copy link
Contributor

nevans commented Dec 13, 2025

@eregon Earlier, I skipped past part of your earlier comment, but I wanted to quickly circle back and address a couple of points:

I made that comment explicitly to discuss what should happen if members don't match, but it didn't receive any reply in over a month, so I assumed that everyone is OK with the current solution. Why do you say the previous code is correct when I have shown it isn't? Maybe you misunderstood that comment? (it shows behavior before my change, i.e., behavior from your PR) It makes me feel unhappy because from my POV this PR is fixing 1 out of 2 hacks added by #692, and your comment feels like a partially-unfair complaint to me.

Sorry about that, sincerely. I do owe you an apology.

When I first saw the ticket, it seemed good enough to me and I was busy with other unrelated things, so I was fine with just a thumbs up. Only after some time did I feel my objection strongly enough to comment. When I returned to it later, 1) I'd forgotten some of the details of your earlier comments, and 2) I was speaking with disproportionate urgency because I realized I'd waited too long and it "might" get merged as-is... Only while writing the comment did I discover it had already been merged and released!

I wrote my follow-up comment to try and convey that, despite my objections, I do appreciate the work you've put into this! And honestly, releasing it as-is was better than delaying any longer for my objections: the 5.3.0 release was hugely useful for the very issue that brought me to this in the first place!

But again, I'm sorry for the unfairness in both timing and tone of my objections.

@nevans
Copy link
Contributor

nevans commented Dec 13, 2025

Please file a ticket at https://bugs.ruby-lang.org/ for this, as I mentioned in #692 (comment). Since you started using this API outside of Ruby core I think it would make a lot of sense for you to file it. Using rb_struct_initialize to init a Data is a hack, which might be acceptable within Marshal which is core but I think is not outside of core. Also it has now clearly proven to be a poor fit, at least validation-wise.

...

If there a C API to init a Data with a Hash or keyword arguments, that's fine, but there isn't one currently (though performance-wise a Hash is slower than Array). It's also relatively awkward to pass keyword arguments to a C function (needs to be wrapped in another C function, passed to rb_define_method, and depend on internals for checking "has this method (caller function) received kwargs") so I think the C API function should either take an Array, or a Hash but not kwargs.

After #765, I think we should file a slightly different ticket:

  • Two ruby methods:
    • Data#initialize_members(**members)
      This would do exactly the same thing as calling super(**members) inside a defined Data subclass's #initialize method: I would implement it as an alias for #initialize. In other words, this would be a perfect replacement for Data.instance_method(:initialize).bind_call(data, **members).
    • Data#initialize_members_values(member_names, member_values)
      This would do the exact same thing, but avoid allocating the hash. The member_names array would be required, for verification. While this could simply be an overload of the other proposed method, the primary rationale for a second method is performance. So it's better to give it a separate name.
  • While I don't personally need this anymore, for symmetry there should be a matching C API:
    • rb_data_initialize_members(VALUE self, VALUE keyword_args)
    • rb_data_initialize_members_values(VALUE self, VALUE member_names, VALUE member_values)
  • Marshal should be updated to use the C API for Data#initialize_members_values.

And, of course, TruffleRuby could add identical code for these on both Data and the specialized instance methods module.

What do you think? Ideas for better method names?

It'll be a few days (or weeks) before I get around to filing such a ticket. So if you feel the urgency more keenly, I don't mind if you file this proposal (or your own) instead.

@eregon
Copy link
Member Author

eregon commented Dec 13, 2025

Sorry about that, sincerely. I do owe you an apology.

Thank you, I appreciate it.
I saw your second comment which helped me guessed you missed/forgot my earlier comment and just wanted to make it safe & reliable, which I 100% agree with.
I was also worried that you did not seem so much concerned about using rb_struct_initialize(), as that feels like a rather big hack to me (both the way to pass kwargs to it and the fact it's meant for Struct and uses Struct#initialize logic and not Data#initialize at all), but it seems clear you understand that concern too and you are fixing it with #765, great (I'm not sure why I didn't think of Data#initialize, maybe because TruffleRuby didn't have it).
I'll also take this occasion to thank you for your work on ruby/net-imap#528 and related, I didn't expect you would do so much there and it's amazing. I'm also sorry for the incompatibilities you have seen with TruffleRuby, at least I think I/others have addressed (most of) them pretty quickly :)

@eregon
Copy link
Member Author

eregon commented Dec 13, 2025

After #765, I think we should file a slightly ticket:

  • Data#initialize_members(**members)

I think this a great idea, because then it calls the subclass-specific copy of initialize which e.g. can unroll the loop to read kwargs, assign data members, have proper inline caches for the fields, etc.

  • Data#initialize_members_values(member_names, member_values)

I suspect this would be less likely to be accepted, or rather that only one of them would be accepted.
Because initialize_members can be properly optimized, there is less concern about passing a Hash of keyword arguments, and conceptually it matches closer to normal initialize.
I would suggest to only propose initialize_members as I think that has more chances to be accepted and it's simpler.

While I don't personally need this anymore, for symmetry there should be a matching C API:

It'd be nice to clean up the usage in Marshal for sure, there are probably subtle bugs lurking there as we have seen here. I wouldn't make it public API though, Marshal is in core so doesn't need it to be public.
In general I'm of the opinion of if it's enough to rb_funcall then just let it be rb_funcall because every function in the public C API has a cost (need to keep/maintain it, need to test it more thoroughly, might need to be implemented on other Ruby implementations or have some kind of alternative, etc. BTW in practice the tests for the public C API functions are usually pretty lacking and safety is not great either, i.e. more reasons to just rb_funcall).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants