Addition of weighted selection to custom providers#1414
Addition of weighted selection to custom providers#1414kingthorin merged 33 commits intodatafaker-net:mainfrom
Conversation
PR Summary
|
kingthorin
left a comment
There was a problem hiding this comment.
What happens in the case of an empty or non-double weight?
(I haven't pulled the code and tried).
I'd also suggest creating a test YAML file and testing as you see users actually using it. (Which can likely be re-used for documentation later)
|
@kingthorin , great catch on the edge cases for weights! I’ve implemented a solution that should address the necessary scenarios. Additionally, in this file, I demonstrated the weighted selection with a custom hardcoded provider. Currently, I’m working on implementing weighted selection using a custom provider that loads data from a YAML file. I’ve created the file, but I’m facing challenges loading the data to apply the weighted selection. Following the logic from the ant() method, it looks like the key changes might be needed in AbstractProvider and FakeValueService classes, but I'm unsure how to proceed. Could you please provide any recommendations? |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1414 +/- ##
============================================
- Coverage 92.22% 92.20% -0.03%
- Complexity 3165 3182 +17
============================================
Files 320 320
Lines 6173 6207 +34
Branches 592 596 +4
============================================
+ Hits 5693 5723 +30
- Misses 335 336 +1
- Partials 145 148 +3 ☔ View full report in Codecov by Sentry. |
| private double calculateTotalWeight(List<Map<String, Object>> items) { | ||
| return items.stream() | ||
| .mapToDouble(item -> (Double) item.get(WEIGHT_KEY)) | ||
| .sum(); | ||
| } |
There was a problem hiding this comment.
How will it work if
- there are N weights each of them is Double.MAX_VALUE - 1 or something like that, are we protected against overflow?
- what if there are custom fields with same name however with different meaning in same provider and even number type? Will there at least any warning appear or how should user be warned about this (without hours of debugging)
There was a problem hiding this comment.
- I appreciate your insight regarding potential overflow. While weights are typically not expected to be extremely large, I've implemented an iterative approach to ensure that adding each weight is checked against Double.MAX_VALUE. This helps prevent overflow in cases where unusually large weights might occur.
- I believe that values should be unique when performing weighted selection. Therefore, while weights can be the same, duplicates in the value parameter trigger an exception that informs the user of the specific duplicate value. This approach seems appropriate to ensure the integrity of the selection process.
There was a problem hiding this comment.
While weights are typically not expected to be extremely large,
if we are talking about project in open source which is used then any input should be expected if it passes all existing validation
There was a problem hiding this comment.
Sure, I got your point. I provided the guard for overflow.
| for (Map<String, Object> item : items) { | ||
| currentWeight += (Double) item.get(WEIGHT_KEY); | ||
| if (randomValue < currentWeight) { | ||
| return (T) item.get(VALUE_KEY); | ||
| } | ||
| } |
There was a problem hiding this comment.
is there a reason to do this on each iteration?
IIUC data in file is persistent then we could sort them once including all unboxing things (not sure if need them...) only once and have something like array of doubles
then just use something like adopted binary search
or did I miss anything here?
There was a problem hiding this comment.
Thank you for the suggestion. I implemented the proposed optimization of sorting and preprocessing the cumulative weights once, which improved the efficiency.
| private double[] cumulativeWeights; | ||
| private Object[] values; | ||
|
|
There was a problem hiding this comment.
Definitively it should be outside of this class
Otherwise it will slow down all other not weight related cases
There should be a weight provider dedicated interface/class instead this very generic
i guess same for methods otherwise we are in situation that we have functionality in generic class which doesn't support 100% of existing providers
at the same time for each instance we add more memory footprint
There was a problem hiding this comment.
I've moved all the logic for the weight selector to the WeightedRandomSelector class. Currently, I’ve placed it in the service package, as it seems to fit best there, though I believe you initially suggested putting it in the providers package?
Additionally, based on the issue description, WeightedRandomSelector appears intended mainly for custom providers. At this point, no provider has data in the format required for this functionality. Should I register it only to be used with custom providers, or do you have a different solution in mind?
|
|
||
| private void assertUniqueValues(Map<String, Object> item, Set<Object> values) { | ||
| Object value = item.get(VALUE_KEY); | ||
| if (!values.add(value)) { |
There was a problem hiding this comment.
Another problem Set for objects
this is a very error prone to assume that for very object there are suitable equals/hashcode
There was a problem hiding this comment.
Do you have some suggestions on how to refactor it?
| if (item == null || item.isEmpty()) { | ||
| throw new IllegalArgumentException("Item cannot be null or empty"); | ||
| } | ||
| if (!item.containsKey(WEIGHT_KEY) || !item.containsKey(VALUE_KEY)) { |
There was a problem hiding this comment.
same here about equals/hashcode
There was a problem hiding this comment.
Sorry, but I might not get this. I am checking if each item contains weight and value keys defined as strings. If for some reason they do not, I am directly throwing an exception.
|
@kingthorin @snuyanzin Additionally, I have updated the documentation and examples to emphasize that weighted random selection is still in the Proof of Concept (POC) stage. I believe these changes address your concerns, and the PR is now ready for merging. Any future suggestions for improvements will be considered and addressed in subsequent Issues, ... |
|
@snuyanzin , would you be so kind and have a look at the PR please. I am awaiting your approve so it can be merged. |
| List<Map<String, Object>> insects = List.of( | ||
| Map.of("value", "Driver ant", "weight", 6.0), | ||
| Map.of("value", "Fire ant", "weight", 3.0), | ||
| Map.of("value", "Harvester ant", "weight", 1.0) | ||
| ); |
There was a problem hiding this comment.
It would be better to follow same way as existing example and place all the example data in constant like above rather than generating it for each method invocation
| Map.of("value", "Harvester ant", "weight", 1.0) | ||
| ); | ||
|
|
||
| WeightedRandomSelector selector = new WeightedRandomSelector(new Random()); |
There was a problem hiding this comment.
Also I guess we don't need to create a new selector per getter invocation and can reuse it
| if (!(weightObj instanceof Double weight)) { | ||
| throw new IllegalArgumentException("Weight must be a non-null Double"); | ||
| } | ||
| if (weight <= 0 || Double.isNaN(weight) || Double.isInfinite(weight)) { |
There was a problem hiding this comment.
out of curiosity: why do we consider weight = 0 as an error?
IMHO it might be a valid way to turn off this value while the total must be positive
There was a problem hiding this comment.
My thought was that a weight of 0 implies that the corresponding item cannot be selected. Since WeightedRandomSelector is designed to assign probabilities proportional to weights, including a weight of 0 would lead to unnecessary processing for an element that cannot be selected. But I refactored the code to enable also 0 weight, allowing weights of 0 to be a legitimate way to "turn off" certain elements while still keeping them in the configuration
| @ParameterizedTest | ||
| @MethodSource("weightedRandomSelectorProvider") | ||
| void testWeightedArrayElement_withZeroWeight(WeightedRandomSelector weightedRandomSelector) { | ||
| List<Map<String, Object>> items = List.of( | ||
| Map.of(VALUE_KEY, ELEMENT_1, WEIGHT_KEY, ZERO_WEIGHT), | ||
| Map.of(VALUE_KEY, ELEMENT_2, WEIGHT_KEY, ELEMENT_2_WEIGHT) | ||
| ); | ||
|
|
||
| assertThatThrownBy(() -> weightedRandomSelector.select(items)) | ||
| .isInstanceOf(IllegalArgumentException.class) | ||
| .hasMessage("Weight must be a positive number and cannot be NaN or infinite"); | ||
| } | ||
|
|
||
| @ParameterizedTest | ||
| @MethodSource("weightedRandomSelectorProvider") | ||
| void testWeightedArrayElement_withNegativeWeight(WeightedRandomSelector weightedRandomSelector) { | ||
| List<Map<String, Object>> items = List.of( | ||
| Map.of(VALUE_KEY, ELEMENT_1, WEIGHT_KEY, NEGATIVE_WEIGHT), | ||
| Map.of(VALUE_KEY, ELEMENT_2, WEIGHT_KEY, ELEMENT_2_WEIGHT) | ||
| ); | ||
|
|
||
| assertThatThrownBy(() -> weightedRandomSelector.select(items)) | ||
| .isInstanceOf(IllegalArgumentException.class) | ||
| .hasMessage("Weight must be a positive number and cannot be NaN or infinite"); | ||
| } | ||
|
|
||
| @ParameterizedTest | ||
| @MethodSource("weightedRandomSelectorProvider") | ||
| void testWeightedArrayElement_withInfiniteWeight(WeightedRandomSelector weightedRandomSelector) { | ||
| List<Map<String, Object>> items = List.of( | ||
| Map.of(VALUE_KEY, ELEMENT_1, WEIGHT_KEY, Double.POSITIVE_INFINITY), | ||
| Map.of(VALUE_KEY, ELEMENT_2, WEIGHT_KEY, ELEMENT_2_WEIGHT) | ||
| ); | ||
|
|
||
| assertThatThrownBy(() -> weightedRandomSelector.select(items)) | ||
| .isInstanceOf(IllegalArgumentException.class) | ||
| .hasMessage("Weight must be a positive number and cannot be NaN or infinite"); | ||
| } |
There was a problem hiding this comment.
It looks like all these tests (my be some more) could be replaced with one parameterized test
since we have input of key/value sequence, exception and error message
There was a problem hiding this comment.
Good point, I refactored a whole test class to use a parameterized tests.
|
I think this version is much closer to be merged, I left some comments |
|
Thanks for the comments @snuyanzin , I have already addressed them. |
| */ | ||
| @ParameterizedTest | ||
| @MethodSource("exceptionCasesProvider") | ||
| void testWeightedArrayElement_exceptions(List<Map<String, Object>> items, String expectedMessage) { |
There was a problem hiding this comment.
Can you rearrange a bit methods within this class
in a way: methods with different test annotations first
all other methods after that
There was a problem hiding this comment.
tests were rearranged
| */ | ||
| @ParameterizedTest | ||
| @MethodSource("exceptionCasesProvider") | ||
| void testWeightedArrayElement_exceptions(List<Map<String, Object>> items, String expectedMessage) { |
There was a problem hiding this comment.
| void testWeightedArrayElement_exceptions(List<Map<String, Object>> items, String expectedMessage) { | |
| void testWeightedArrayElementExceptions(List<Map<String, Object>> items, String expectedMessage) { |
or another way
| void testWeightedArrayElement_exceptions(List<Map<String, Object>> items, String expectedMessage) { | |
| void testWeightedArrayElementFailureCase(List<Map<String, Object>> items, String expectedMessage) { |
lets's do not mix different notations
| * - any values in the list are not unique or null, | ||
| * - the sum of weights exceeds Double.MAX_VALUE. | ||
| */ | ||
| public <T> T select(List<Map<String, Object>> items) { |
There was a problem hiding this comment.
| public <T> T select(List<Map<String, Object>> items) { | |
| public static <T> T select(List<Map<String, Object>> items) { |
this seems not addressed yet
not sure if we need it public
There was a problem hiding this comment.
By the current design, the method public T select(List<Map<String, Object>> items) cannot be static because it relies on the random field of the WeightedRandomSelector record, which is an instance field.
If the goal is to make the method static, I will need to explicitly pass the Random object to the method.
From my point of view, the current implementation is correct.
As for the Issue, I have implemented the capability for weighted selection with custom providers.
#1314