[action] [PR:15298] fix: handle EOFError for cache reading #15318

Merged
mssonicbld merged 1 commit into sonic-net:202405 from mssonicbld:cherry/202405/15298
Nov 1, 2024

@mssonicbld
Collaborator

Description of PR

When parallel run is enabled, multiple processes may read and write the same cache file. There is a small chance that process 1 is writing the file while process 2 is reading it, which causes an EOFError in process 2. In this case, process 2 retries the read. If it still gets an EOFError after several retry attempts, it returns NOTEXIST so that the file is overwritten.
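
The retry behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: it assumes the cache file is a pickle, and the function name `read_cache`, the `retries` and `delay` parameters, and the `NOTEXIST` sentinel object are hypothetical stand-ins.

```python
import pickle
import time

# Hypothetical sentinel standing in for the cache's "not found" marker.
NOTEXIST = object()


def read_cache(path, retries=3, delay=0.5):
    """Read a pickled cache file, retrying when it appears to be mid-write.

    Assumption: a concurrent writer has just opened the file for writing,
    so the reader sees an empty file and pickle.load raises EOFError.
    """
    for attempt in range(retries):
        try:
            with open(path, "rb") as f:
                return pickle.load(f)
        except FileNotFoundError:
            # No cache file at all: report a miss immediately.
            return NOTEXIST
        except EOFError:
            # Another process is probably writing; wait briefly, then retry.
            if attempt < retries - 1:
                time.sleep(delay)
    # Still unreadable after all attempts: report a miss so the caller
    # regenerates the data and overwrites the file.
    return NOTEXIST
```

Returning the sentinel instead of raising keeps the caller's logic simple: a persistently unreadable file is treated the same way as a missing one.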

In the meantime, we also optimize how the DUT hosts are initialized when parallel run is enabled, to reduce the chance of hitting this cache read issue.

Summary:
Fixes # (issue) Microsoft ADO 30031372

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 202012
  • 202205
  • 202305
  • 202311
  • 202405

Approach

What is the motivation for this PR?

To prevent EOFError when reading a cache file while parallel run is enabled.

How did you do it?

Add a retry mechanism and optimize how DUT hosts are initialized when parallel run is enabled.

How did you verify/test it?

I ran the updated code and can confirm parallel run is still working as expected.

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

co-authorized by: [email protected]
@mssonicbld
Collaborator Author

Original PR: #15298

@mssonicbld mssonicbld merged commit b1fbcb9 into sonic-net:202405 Nov 1, 2024