Address spurious LB+RB log flood on APC BXnnnnMI devices#2565

Merged
jimklimov merged 18 commits into networkupstools:master from jimklimov:issue-2347
Aug 12, 2024

Conversation

@jimklimov
Member

@jimklimov jimklimov commented Jul 29, 2024

Closes: #2347
Also note for #2533 question

It also adds some visibility around calibration status setting and extends the "dstate" API with a status_get() method, which helps avoid setting duplicate states (roughly like "OB LB OB") seen in some drivers earlier.
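
To illustrate the duplicate-state point, here is a minimal self-contained sketch of how a "get" check can guard a "set" call. It is not the actual dstate code: the `sketch_status_*` names, the fixed-size buffer, and the return values are all hypothetical stand-ins chosen for the example.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_STATUS_MAX 64

/* Hypothetical stand-in for a driver's pending status string. */
static char sketch_status[SKETCH_STATUS_MAX];

/* Reset the pending status string. */
void sketch_status_init(void)
{
	sketch_status[0] = '\0';
}

/* Return 1 if `token` is already present as a whole word, else 0. */
int sketch_status_get(const char *token)
{
	size_t len;
	const char *p = sketch_status;

	assert(token != NULL);
	len = strlen(token);
	if (len == 0)
		return 0;

	while ((p = strstr(p, token)) != NULL) {
		int starts_word = (p == sketch_status) || (p[-1] == ' ');
		int ends_word = (p[len] == '\0') || (p[len] == ' ');

		if (starts_word && ends_word)
			return 1;
		p++;	/* keep scanning past a partial match such as "B" in "OB" */
	}
	return 0;
}

/* Append `token` unless it is already set; return 1 if it was appended. */
int sketch_status_set(const char *token)
{
	size_t used = strlen(sketch_status);
	size_t need = strlen(token) + (used ? 1 : 0);

	if (sketch_status_get(token))
		return 0;	/* already there: avoid "OB LB OB" */
	if (used + need >= sizeof(sketch_status))
		return 0;	/* no room: drop rather than overflow */
	if (used)
		strcat(sketch_status, " ");
	strcat(sketch_status, token);
	return 1;
}
```

With this guard, setting "OB", then "LB", then "OB" again leaves the buffer at "OB LB" instead of repeating the first token.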

I hope this toggle makes it possible to fix the problem in the field by optionally delaying propagation of the spurious status from the driver by at most lbrb_log_delay_sec, provided the device is otherwise "online" and is calibrating (the calibration requirement is waived if the lbrb_log_delay_without_calibrating flag is also set).

The fix goes to some lengths to detect the device model during init, so that the setting defaults to 3 sec for this line-up and to 0 (immediate status propagation) otherwise.
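
For readers skimming the thread, the decision described in the two paragraphs above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: the struct, both function names, and the "BX + digits + MI" model match are hypothetical, and the sketch assumes LB and RB are only held back when they appear together.

```c
#include <assert.h>
#include <string.h>
#include <time.h>

/* Hypothetical per-device state; the fields only echo the option names. */
struct lbrb_sketch {
	int delay_sec;            /* cf. lbrb_log_delay_sec; 0 means no delay */
	int without_calibrating;  /* cf. lbrb_log_delay_without_calibrating */
	int pending;              /* 1 while an LB+RB report is being held back */
	time_t first_seen;        /* when the held-back report first appeared */
};

/* Assumed default: 3 sec if the model looks like "BXnnnnMI", else 0. */
int lbrb_sketch_default_delay(const char *model)
{
	const char *p;

	if (model == NULL)
		return 0;
	for (p = strstr(model, "BX"); p != NULL; p = strstr(p + 1, "BX")) {
		const char *q = p + 2;
		size_t digits = strspn(q, "0123456789");

		if (digits > 0 && strncmp(q + digits, "MI", 2) == 0)
			return 3;
	}
	return 0;
}

/* Return 1 if LB+RB should be withheld on this poll, 0 to publish as-is. */
int lbrb_sketch_suppress(struct lbrb_sketch *s, int lb, int rb,
	int online, int calibrating, time_t now)
{
	assert(s != NULL);

	/* Nothing to hold back unless both flags are raised, a delay is
	 * configured, and the device is otherwise on line power. */
	if (!(lb && rb) || s->delay_sec <= 0 || !online) {
		s->pending = 0;
		return 0;
	}

	/* Require calibration unless the flag waives that condition. */
	if (!calibrating && !s->without_calibrating) {
		s->pending = 0;
		return 0;
	}

	/* Start the clock on the first poll that sees the suspicious pair. */
	if (!s->pending) {
		s->pending = 1;
		s->first_seen = now;
	}

	/* Withhold only while the report is younger than the delay. */
	return difftime(now, s->first_seen) < (double)s->delay_sec;
}
```

The point of tracking `first_seen` is that the delay is an upper bound: a report that persists past the configured number of seconds is published, so a genuine low-battery condition is postponed by a few seconds at most rather than hidden.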

@desertwitch @grifferz @ShiroDN @PilaScat @bitmario @marcgarciamarti @KillianMelsen @gerben838665 @mauro-dasilva @tsopokis @statte @s7uben @Sanderluc5 @ivanjx @gabrieleancora @JoshNansoz1 @rioachim @owenperkins111 : Better late than never: would you be able to try a custom build of NUT following https://github.com/networkupstools/nut/wiki/Building-NUT-for-in%E2%80%90place-upgrades-or-non%E2%80%90disruptive-tests to see if it handles the devices better?

For the git checkout, use this PR's source branch:

:; git clone https://github.com/jimklimov/nut -b issue-2347 nut
:; cd nut
...

If you run the built driver with debug verbosity of 2 or greater, it should log that it saw these calibration-like, LB and RB states, and chose to suppress them for a while according to settings. Checking that the numbers from CLI/ups.conf settings are propagated and considered correctly would also be helpful :)

Maybe these messages should be sunk to a less visible debug verbosity, eventually.

Also of interest: do the impacted devices report frequent calibration messages by default (without debug)? If so, should that be addressed additionally, or do onlinedischarge_calibration and/or onlinedischarge_log_throttle_sec and related existing settings already cover it, so that the logs can be made peaceful and quiet?

@jimklimov jimklimov added the bug, need testing, APC, USB, Incorrect or missing readings, and Connection stability issues labels Jul 29, 2024
@jimklimov jimklimov added this to the 2.8.3 milestone Jul 29, 2024
@jimklimov jimklimov requested review from aquette and clepple July 29, 2024 09:40
@jimklimov jimklimov marked this pull request as draft July 30, 2024 07:23
@jimklimov
Member Author

Converting to draft while this is being tested, so NUT CI won't rebuild it in vain against newer target branch as it evolves.

@jimklimov
Member Author

Gentle bump. So many people complained about the issue, is anyone still interested in testing a prospective fix? :)

@jimklimov jimklimov marked this pull request as ready for review August 5, 2024 08:17
@ivanjx

ivanjx commented Aug 5, 2024

im new to this so please bear with me

so to test just need to clone this branch, compile, and sudo make install on the drivers folder?

the way i currently install nut is installing via apt first then overwrite it with manual compile and sudo make install

@desertwitch
Contributor

Gentle bump. So many people complained about the issue, is anyone still interested in testing a prospective fix? :)

Hi Jim, thanks a lot for the effort put into making this work for everyone.
Unfortunately not able to test due to lack of affected hardware, but will do a code review / sanity-check tonight.

@jimklimov
Member Author

im new to this so please bear with me

so to test just need to clone this branch, compile, and sudo make install on the drivers folder?

the way i currently install nut is installing via apt first then overwrite it with manual compile and sudo make install

Generally, yes. A finer approach is presented at https://github.com/networkupstools/nut/wiki/Building-NUT-for-in%E2%80%90place-upgrades-or-non%E2%80%90disruptive-tests: it refers to the list of dependencies per platform, explains how to configure the new build similarly to what your packages (or older custom builds) delivered, and describes how to test a new driver from the build workspace before installing it over your older build for "production" use (or not, if the test is unsuccessful). It is surely not the only way to skin a cat, but it is the one best streamlined for exploratory custom builds.

@jimklimov jimklimov marked this pull request as draft August 5, 2024 17:34
@desertwitch
Contributor

Code looking good as usual, Jim. One thing we might want to consider is whether we should also default to lbrb_log_delay_without_calibrating = 1 for the affected APC series. I'm thinking this because these spurious and seemingly random events might not always be preceded by an assumed or actual calibration. A 3-second delay in registering an actual LB status will probably not make a difference in real life, as opposed to the very real annoyance of false statuses being reported and users then having to search the manuals for the lbrb_log_delay_without_calibrating toggle. Thanks for your efforts!

@jimklimov
Member Author

Well, given that lbrb_log_delay_without_calibrating is a flag, users of those models would not be able to disable it if it were enabled by default. We can at least suggest it in the autodetection message, though.

@jimklimov
Member Author

The CI faults are due to a change on a build agent after an upgrade (it now lacks 32-bit libs for some dependencies).

@jimklimov
Member Author

Tested the monster message printer, works well but relies on math a bit (that the testvar findings are exactly 0 or 1) so will poke that a bit later.

Not sure why CI builds that code path where it failed due to missing libs, by config it should not have.
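
As a generic illustration of the "relies on math" remark above (not the message printer from this PR, whose code is not shown in the thread), the sketch below shows why flag values should be normalized to exactly 0 or 1 before they take part in arithmetic. Both function names and the message texts are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: collapse any nonzero int to exactly 1. */
int sketch_as_bit(int value)
{
	return !!value;
}

/* Choose a message by summing two normalized flags into an index. */
const char *sketch_pick_hint(int delay_is_set, int flag_is_set)
{
	static const char *const hints[] = {
		"neither option is set",
		"one of the two options is set",
		"both options are set"
	};
	int idx = sketch_as_bit(delay_is_set) + sketch_as_bit(flag_is_set);

	/* Without normalization, a raw value like 5 would index out of range. */
	assert(idx >= 0 && idx <= 2);
	return hints[idx];
}
```

A lookup that merely promises "nonzero if found" may return any truthy value, so `!!` keeps the sum within the bounds of the table.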

…actly 0/1 [networkupstools#2347]

When building a complex text expression, we rely on maths in some spots.

Signed-off-by: Jim Klimov <[email protected]>
…ctually build a graphical program

Namely, that further third-party libs are available
for the chosen architecture, not only the headers.
Had a problem with 32/64-bit build agent that only
had a binary lib*.so set for 64-bit after an update.

Signed-off-by: Jim Klimov <[email protected]>
@desertwitch
Contributor

desertwitch commented Aug 12, 2024

Tested the monster message printer, works well but relies on math a bit (that the testvar findings are exactly 0 or 1) so will poke that a bit later.

Not sure why CI builds that code path where it failed due to missing libs, by config it should not have.

Thanks for the additions, is there a process for testing drivers and such hardware-specific code without actually having the affected hardware (which I assume you don't either)? Would love to help out with testing such code, but not sure how to go about that with the drivers (dummy-ups wouldn't work for a specific driver, right?). That message printer, as an example.

@jimklimov
Member Author

jimklimov commented Aug 12, 2024 via email

@jimklimov jimklimov merged commit 443ba6a into networkupstools:master Aug 12, 2024
@jimklimov jimklimov deleted the issue-2347 branch August 12, 2024 21:21
jimklimov added a commit to jimklimov/nut that referenced this pull request Feb 11, 2025
jimklimov added a commit to jimklimov/nut that referenced this pull request Feb 11, 2025
@jimklimov jimklimov restored the issue-2347 branch April 15, 2025 09:55
@jimklimov jimklimov deleted the issue-2347 branch April 15, 2025 10:41
@invario
Contributor

invario commented May 6, 2025

Hi, just wanted to note that the LB/RB problem also occurs on my new APC BVK750M2. Going to try this fix and see how it goes. Thanks for your work.


Labels

APC
bug
Connection stability issues: Issues about driver<->device and/or networked connections (upsd<->upsmon...) going AWOL over time
Incorrect or missing readings: On some devices driver-reported values are systemically off (e.g. x10, x0.1, const+Value, etc.)
need testing: Code looks reasonable, but the feature would better be tested against hardware or OSes
USB

Projects

None yet

Development

Successfully merging this pull request may close these issues.

APC Back-UPS BX1600MI spurious LOWBATT/REPLACEBATT events

6 participants