Add explicit RUN_ENDED signal for terminal data handling in DataRecorder#3424
Hi @quaquel Do you prefer this approach or should we wait for …?

Performance benchmarks:
quaquel left a comment:
This looks fine to me for now. We might want to revisit this later when we revisit the run_while/stopping condition issue and the run_config idea.
Can you check the code coverage and, if needed, add tests? I'll merge later today or more likely tomorrow.
While writing tests, I found an interesting edge case in the parquet backend. The old logic only pops duplicates from the in-memory buffer:

```python
if is_overwrite:
    while buffer and buffer[-1]["time"] == time:
        buffer.pop()
```

If a file flush happens right before the overwrite (e.g., calling …), the rows for that timestep are already on disk, so popping from the buffer cannot remove them. This might be propagated to, or reflected in, users' own custom backends as well. I am convinced that this is a delicate approach. Do you think these types of edge cases can be handled by run_configs easily?
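To make the edge case concrete, here is a minimal simulation (hypothetical names: `disk` stands in for the flushed parquet file, `buffer` for the recorder's in-memory rows):

```python
# A row exists in the buffer for timestep 5.
time = 5
disk, buffer = [], [{"time": 5, "wealth": 10}]

# A flush moves everything from the buffer onto disk.
disk.extend(buffer)
buffer.clear()

# Old overwrite logic: pop matching rows from the buffer only.
# The buffer is empty, so this loop does nothing.
while buffer and buffer[-1]["time"] == time:
    buffer.pop()
buffer.append({"time": 5, "wealth": 12})

# The stale row survived the flush: timestep 5 is now duplicated.
rows = disk + buffer
```

After this sequence, `rows` contains two entries for `time == 5`, which is exactly the duplication the overwrite was meant to prevent.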
That's an interesting edge case. In my current conception of … On the edge case for the current approach, what about never flushing the last item in the buffer?
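One way to read this suggestion: a flush that always retains the newest timestep in memory, so a same-time overwrite only ever has to touch the buffer. A sketch under that assumption (`flush` and `write_rows` are hypothetical names, not the actual backend API):

```python
def flush(buffer, write_rows):
    """Persist everything except the most recent timestep, which
    stays in memory so it can still be overwritten there."""
    if not buffer:
        return buffer
    last_time = buffer[-1]["time"]
    spill = [row for row in buffer if row["time"] != last_time]
    keep = [row for row in buffer if row["time"] == last_time]
    write_rows(spill)  # older rows go to disk
    return keep        # newest timestep remains buffered
```

The trade-off, as the next comment notes, is that the raw file on disk never contains the final timestep until a terminal flush happens.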
Agree. However, the trade-off for this approach is that users would be strictly required to fetch their data through `get_table_dataframe()` to see the terminal state. If they bypass the recorder API and load the raw `.parquet` file directly from disk, that final timestep will be missing.
Fair point. I am unfortunately not familiar with parquet, but it seems the real fix will require checking the file to see if something needs to be overwritten. The other option is to at least document this limitation for parquet.
I have already implemented the following approach: if the in-memory buffer is empty, it falls back to opening the physical `.parquet` file, filters out the old timestep (`df = df[df["time"] != time]`), and rewrites the file before appending the new data.
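The fallback described above can be sketched as follows. This is a simplified stand-in using plain Python lists instead of actual parquet I/O (the real backend would apply the pandas filter `df = df[df["time"] != time]` to the file's contents); `overwrite_timestep`, `read_rows`, and `write_rows` are hypothetical names:

```python
def overwrite_timestep(read_rows, write_rows, new_rows, time):
    """Fallback for an empty buffer: reload the persisted rows,
    drop the stale timestep, and rewrite everything with the
    fresh rows appended."""
    kept = [row for row in read_rows() if row["time"] != time]
    write_rows(kept + new_rows)

# `store` stands in for the on-disk file.
store = {"rows": [{"time": 4, "wealth": 7}, {"time": 5, "wealth": 10}]}
overwrite_timestep(
    read_rows=lambda: store["rows"],
    write_rows=lambda rows: store.__setitem__("rows", rows),
    new_rows=[{"time": 5, "wealth": 12}],
    time=5,
)
# store["rows"] now holds one row for time 4 and the fresh row for time 5.
```

The rewrite is more expensive than a buffer pop, but it only triggers in the rare case where a flush landed immediately before an overwrite.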
As indicated, I have little knowledge regarding parquet, so I'll trust your judgment.
This PR fixes the data collection/recorder architecture by ensuring the final state of a model is always captured at the end of a run, and introduces an "overwrite" mechanism to prevent duplicate data rows when the model is paused, modified, and resumed at the same timestep.
Previously, users had to manually call `finalise()` or `collect()` to get the last timestep if it didn't perfectly align with the interval.

Key Changes:

- Model (`model.py`, `signal_types.py`): Added a `ModelSignals.RUN_ENDED` signal and updated `run_for` and `run_until` to automatically emit it when they finish executing.
- Recorder (`basedatarecorder.py`): Subscribes to the `RUN_ENDED` signal to automatically capture the final snapshot (`_on_run_ended`), removing the need for the manual `finalise()` method. Added `_last_collection` tracking to detect if the current timestep is identical to the previous one, passing an `is_overwrite` flag downstream.
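The signal flow described in the key changes can be sketched roughly as below. This is a simplified illustration, not the project's actual API: the `subscribe`/callback mechanics, class names, and snapshot shape are all placeholders, but the `RUN_ENDED` emission and the `_last_collection` overwrite check mirror the described design:

```python
from enum import Enum, auto

class ModelSignals(Enum):
    RUN_ENDED = auto()

class Model:
    """Toy model: tracks time and emits RUN_ENDED when a run finishes."""
    def __init__(self):
        self._subscribers = {ModelSignals.RUN_ENDED: []}
        self.time = 0

    def subscribe(self, signal, callback):
        self._subscribers[signal].append(callback)

    def run_for(self, steps):
        self.time += steps
        # Emitted automatically on completion, so the recorder
        # never misses the terminal state.
        for callback in self._subscribers[ModelSignals.RUN_ENDED]:
            callback(self)

class Recorder:
    """Captures a final snapshot on RUN_ENDED, overwriting rather than
    duplicating when the run ends at an already-recorded timestep."""
    def __init__(self, model):
        self._last_collection = None
        self.rows = []
        model.subscribe(ModelSignals.RUN_ENDED, self._on_run_ended)

    def _on_run_ended(self, model):
        is_overwrite = self._last_collection == model.time
        if is_overwrite:
            # Pause -> modify -> resume at the same timestep:
            # replace the stale snapshot instead of appending.
            self.rows = [r for r in self.rows if r["time"] != model.time]
        self.rows.append({"time": model.time})
        self._last_collection = model.time
```

Running `run_for` twice ending at the same timestep leaves a single row for that time, which is the duplicate-prevention behavior the PR introduces.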