Skip to content

Conversation

@gromit1811
Copy link
Contributor

@gromit1811 gromit1811 commented Mar 7, 2025

Finally, a first attempt to fix #16197

The main fixes are from the following 2 commits (other commits are mostly cleanup/optimization):

commit 9eb0492

ospf6d: Fix old summary route lookup

When orignating summaries, we perform a lookup to find possible previous
summaries. Don't just compare the prefix in that case, but also route and
path properties: We might have several summary entries for the same
destination (e.g. with a intra- and inter-area path) at the same time (at
least temporarily) and we need to deal with the right one.

commit 6f8df0c

ospf6d: Fix summary LSA removal by using ospf6_abr_task()

Previously, we checked each individual route to see whether a summary LSA
needs to be originated or purged. The problem is that this decision can't be
made reliably by looking at a single route, because there might be several
routes for a destination (e.g. one with a intra- and one with an inter-area
path) and one might cause origination of a summary LSA while the other
causes it to be purged. This causes unstable and/or missing LSAs as
described in https://github.com/FRRouting/frr/issues/16197

Instead, ospf6_abr_originate_summary_to_area() now only originates new LSAs
or approves existing ones. Removal is handled in ospf6_abr_task() after the
whole routing table has been iterated. LSAs which have not been approved by
any call to ospf6_abr_originate_summary_to_area() are then removed. This
matches the behaviour of OSPFv2.

Note that this isn't finished/complete yet and is intended mainly for feedback or to stop me in case I'm moving in a completely wrong direction 😉

Open issues:

  • Check topotest failure: ospf6_gr_topo1
  • Check topotest failure: ospf6_topo2
  • Routing table hooks no longer make sense the way they're implemented (directly calling ospf6_abr_originate_summary_to_area()), just schedule ospf6_abr_task() instead?
  • Right now we actually check routes twice, once via hooks and once via ospf6_abr_task(). Doesn't seem to hurt, but isn't really efficient.
  • Check whether there are other cases where we might have to schedule ospf6_abr_task() to immediately remove a summary LSA.
  • ospf6_abr_task() is hidden in ospf6_nssa.c even though it is not at all related to NSSA. Guess it was put there when adding NSSA support and that needed a small part of the ospf6_abr_task() functionality. Move non-NSSA functions from ospf6_nssa.c to ospf6_abr.c.
  • Rebase to master

@frrbot frrbot bot added the ospfv3 label Mar 7, 2025
@donaldsharp
Copy link
Member

Hey can we submit the PR against master not 10.2? All code must go into master first.

@gromit1811
Copy link
Contributor Author

Sure. I'm currently testing on 10.2.1 to have a stable base, but of course I'll rebase to master when the PR is complete and before removing the draft status

@gromit1811 gromit1811 force-pushed the bugfix/inter_area_lsa_16197 branch from 0b13876 to 721409e Compare March 26, 2025 09:46
@gromit1811
Copy link
Contributor Author

Latest update fixes the memory corruption triggered by ospf6_gr_topo1. Still a few TODOs left.

@gromit1811 gromit1811 force-pushed the bugfix/inter_area_lsa_16197 branch 2 times, most recently from f0bb512 to 5569bc9 Compare March 26, 2025 14:40
@gromit1811
Copy link
Contributor Author

Yay, all ospf6 topotests work now

@Jafaral
Copy link
Member

Jafaral commented Mar 26, 2025

Thank you @gromit1811

@gromit1811 gromit1811 marked this pull request as ready for review April 15, 2025 13:15
No point in doing this for route types we're not interested in at all.

Signed-off-by: Martin Buck <[email protected]>
When originating summaries, we perform a lookup to find possible previous
summaries. Don't just compare the prefix in that case, but also route and
path properties: We might have several summary entries for the same
destination (e.g. with a intra- and inter-area path) at the same time (at
least temporarily) and we need to deal with the right one.

Signed-off-by: Martin Buck <[email protected]>
Previously, we checked each individual route to see whether a summary LSA
needs to be originated or purged. The problem is that this decision can't be
made reliably by looking at a single route, because there might be several
routes for a destination (e.g. one with a intra- and one with an inter-area
path) and one might cause origination of a summary LSA while the other
causes it to be purged. This causes unstable and/or missing LSAs as
described in FRRouting#16197

Instead, ospf6_abr_originate_summary_to_area() now only originates new LSAs
or approves existing ones. Removal is handled in ospf6_abr_task() after the
whole routing table has been iterated. LSAs which have not been approved by
any call to ospf6_abr_originate_summary_to_area() are then removed. This
matches the behaviour of OSPFv2.

Signed-off-by: Martin Buck <[email protected]>
We slightly abuse summary routes to store information about LSAs that have
been originated for them. Most other route parameters are not used for
anything. So don't update them to avoid giving the impression that we've
got real routes here and to reduce confusion.

Signed-off-by: Martin Buck <[email protected]>
Calling ospf6_abr_task()'s callees directly no longer works as expected
because they don't know about the "LSA unapprove/approve/remove unapproved"
game implemented by ospf6_abr_task(). So just schedule ospf6_abr_task()
instead of calling its callees directly.

Signed-off-by: Martin Buck <[email protected]>
When reconfiguring an area between stub and normal, explicitly schedule the
ABR task to originate/expire summaries. So far, this happend implicitly as
a side effect of (re)originating the router LSA, but let's make this more
obvious by explictly scheduling the ABR task. This also matches what we do
when reconfiguring between NSSA and normal and also what ospfd does.

Signed-off-by: Martin Buck <[email protected]>
Move general ABR infrastructure code not specific to NSSA support
(ospf6_abr_task() and friends) out of ospf6_nssa.c. It proably got copied in
there (from ospfd) when adding NSSA support since it was a prerequisite for
that. Now that general intera-area summary orgination uses the ABR task as
well, it no longer makes sense to hide it in ospf6_nssa.c, so move it to
ospf6_abr.c.

While we're at it, also clean up some other unused cruft from
ospf6_nssa.{c,h} and move things which should not be private (like LSA
flags) to their proper place.

Signed-off-by: Martin Buck <[email protected]>
@gromit1811 gromit1811 force-pushed the bugfix/inter_area_lsa_16197 branch from 69ffa58 to 5c3f487 Compare April 17, 2025 11:23
@github-actions github-actions bot added size/XL and removed size/L labels Apr 17, 2025
@gromit1811
Copy link
Contributor Author

Rebased again to current master. Now CI looks much better and my local topotests were also successful (except for some BGP/SRv6/NHRP tests I really don't feel responsible for 😉)
Shall I fix the frrbot and "Verify Source" style issues? Except for one change that's code I've not modified at all but only moved around. And I don't fully agree with the "WARNING: Unnecessary ftrace-like logging - prefer using ftrace" - ftrace might be nice, but it's much more effort to set up than a simple "debug ospf6 ...".

@gromit1811
Copy link
Contributor Author

Fixed style issues found by frrbot. I did this in a separate commit so that the "move code around" commit doesn't move and modify code at the same time which would be more difficult to review.

I intentionally left the ospf6_abr_task() entry/exit log messages in which "[CI] Verify Source" complained about - ABR task may run asynchronously and it's important to understand when that happens when reading logs. And no, ftrace is not a suitable replacement as it's important to have those log messages together the other log messages and in sequence.

I removed some other entry/exit messages from functions only called by ospf6_abr_task(), though - they're only called from there and ABR task already logs each individual step.

@gromit1811
Copy link
Contributor Author

As expected, "Verify Source" still fails, but now with only one warning left triggered by the ospf6_abr_task() entry/exit logs I left in intentionally. Is there some kind of magic comment to override the warning or is it OK to just ignore it?

@riw777 Anything else I can do simplify review? I know this is a big PR (but not as big as the labels suggest 😉), but I tried to split it up into reasonable commits which should be easier to review individually than the whole diff. And the largest commit only moves code, so that should be relatively easy as well...

Copy link
Member

@riw777 riw777 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

@riw777
Copy link
Member

riw777 commented May 20, 2025

Not certain if we should merge past this or not:

===============================================
< WARNING: Unnecessary ftrace-like logging - prefer using ftrace
< #1499: FILE: /tmp/f1-2062112/ospf6_abr.c:1499:
Details at [https://ci1.netdef.org/browse/FRR-PULLREQ3-CHECKOUT-8861](https://ci1.netdef.org/browse/FRR-PULLREQ3-CHECKOUT-8861)

@gromit1811
Copy link
Contributor Author

Not certain if we should merge past this or not:

===============================================
< WARNING: Unnecessary ftrace-like logging - prefer using ftrace
< #1499: FILE: /tmp/f1-2062112/ospf6_abr.c:1499:
Details at [https://ci1.netdef.org/browse/FRR-PULLREQ3-CHECKOUT-8861](https://ci1.netdef.org/browse/FRR-PULLREQ3-CHECKOUT-8861)

Me neither. I fixed the other issues checkpatch was complaining about, but this debug message here was pretty important for me when analyzing logs, so I'd strongly prefer to leave it in. Of course, I can try to reformulate it to not trigger the warning. But IIRC, @donaldsharp also commented somewhere (Slack?) and questioned the usefulness of this warning as the ftrace recommendation doesn't really apply in this context.

@gromit1811 gromit1811 force-pushed the bugfix/inter_area_lsa_16197 branch from 04d69da to 3e23f74 Compare May 20, 2025 20:27
Fix style issues found by frrbot in unmodified but moved around code. Also
remove some excessive log messages found by "[CI] Verify Source".

Signed-off-by: Martin Buck <[email protected]>
@gromit1811 gromit1811 force-pushed the bugfix/inter_area_lsa_16197 branch from 3e23f74 to 3643e6d Compare May 20, 2025 20:32
@gromit1811
Copy link
Contributor Author

Rephrased debug message to make checkpatch happy. Hopefully the final fix that allows to merge this 🤞

@gromit1811
Copy link
Contributor Author

ci:rerun

1 similar comment
@gromit1811
Copy link
Contributor Author

ci:rerun

@riw777
Copy link
Member

riw777 commented Jun 3, 2025

ci has been running for two weeks? hmmm...

ci:rerun

@gromit1811
Copy link
Contributor Author

ci has been running for two weeks? hmmm...

ci:rerun

@riw777 Is it just this PR with repeated CI failures totally unrelated to the actual changes in the PR, or is CI generally unstable ATM?

@gromit1811
Copy link
Contributor Author

The remaining 2 "Basic Tests" failures look unrelated, BGP/ISIS SRv6 problems. Shall I rebase to get a possible fix from current master or just ignore them?

@riw777
Copy link
Member

riw777 commented Jun 10, 2025

The remaining 2 "Basic Tests" failures look unrelated, BGP/ISIS SRv6 problems. Shall I rebase to get a possible fix from current master or just ignore them?

I'm going to try to rerun just the failed tests to see what we can do

@riw777 riw777 merged commit 8b1620a into FRRouting:master Jun 17, 2025
19 of 21 checks passed
@gromit1811
Copy link
Contributor Author

@riw777 Thanks for merging, just after I celebrated the 1st birthday of the original bug report 😄 This set at least 3 records: The most complicated bug I analyzed, the most complicated bug fix I did and the longest time to get the fix through CI/CD. Now we just need to get #18549 in as well and then I can finally get this off my radar...

@gromit1811 gromit1811 deleted the bugfix/inter_area_lsa_16197 branch August 20, 2025 08:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Missing ECMP nexthops for OSPFv3 inter-area routes

4 participants