Skip to content

Addition of weighted selection to custom providers#1414

Merged
kingthorin merged 33 commits intodatafaker-net:mainfrom
SamuelNeceda:weighted_selection
Nov 24, 2024
Merged

Addition of weighted selection to custom providers#1414
kingthorin merged 33 commits intodatafaker-net:mainfrom
SamuelNeceda:weighted_selection

Conversation

@SamuelNeceda
Copy link
Copy Markdown
Contributor

@SamuelNeceda SamuelNeceda commented Oct 30, 2024

As for the Issue, I have implemented the capability for weighted selection with custom providers.

#1314

@what-the-diff
Copy link
Copy Markdown

what-the-diff bot commented Oct 30, 2024

PR Summary

  • Implementation of a New Function
    The code now includes a new method called weightedArrayElement that is placed inside RandomService.java. This added function allows for selecting an item from a list of maps, but the odds of any specific item being selected can be uneven or "weighted".

  • Introduction of Input Checks
    The newly implemented weightedArrayElement function has also been hardened against faults by adding validation that checks the input data. If the input is not provided, is left empty, or has weights that are zero or less, it will now throw exceptions and prevent further operations.

  • Addition of Test Cases
    To ensure that the new weightedArrayElement function performs correctly, the RandomServiceTest.java class has been updated with multiple tests. These tests confirm the function's performance with different scenarios, including the use of a single or multiple elements, inputs that are null or empty, and situations with zero or negative weights.

  • Introduction of New Utility Class for Testing
    Developers now have a TestConstants.java utility class at their disposal, facilitating easier testing. This class houses various constants, which not only enhance the readability of the tests but also makes their maintenance much easier.

  • Defining Constants for Testing
    The newly created TestConstants.java contains a set of newly defined constants including WEIGHT_KEY, VALUE_KEY, and various element weights. These constants are to ensure consistency across tests and improve the ease-of-use and readability of the testing process.

Copy link
Copy Markdown
Collaborator

@kingthorin kingthorin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens in the case of an empty or non-double weight?

(I haven't pulled the code and tried).

I'd also suggest creating a test YAML file and testing as you see users actually using it. (Which can likely be re-used for documentation later)

@SamuelNeceda
Copy link
Copy Markdown
Contributor Author

@kingthorin , great catch on the edge cases for weights! I’ve implemented a solution that should address the necessary scenarios.

Additionally, in this file, I demonstrated the weighted selection with a custom hardcoded provider.

Currently, I’m working on implementing weighted selection using a custom provider that loads data from a YAML file. I’ve created the file, but I’m facing challenges loading the data to apply the weighted selection. Following the logic from the ant() method, it looks like the key changes might be needed in AbstractProvider and FakeValueService classes, but I'm unsure how to proceed. Could you please provide any recommendations?

kingthorin
kingthorin previously approved these changes Oct 31, 2024
@codecov
Copy link
Copy Markdown

codecov bot commented Oct 31, 2024

Codecov Report

Attention: Patch coverage is 82.35294% with 6 lines in your changes missing coverage. Please review.

Project coverage is 92.20%. Comparing base (0f50bbe) to head (8d0de4a).

Files with missing lines Patch % Lines
...main/java/net/datafaker/service/RandomService.java 82.35% 3 Missing and 3 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1414      +/-   ##
============================================
- Coverage     92.22%   92.20%   -0.03%     
- Complexity     3165     3182      +17     
============================================
  Files           320      320              
  Lines          6173     6207      +34     
  Branches        592      596       +4     
============================================
+ Hits           5693     5723      +30     
- Misses          335      336       +1     
- Partials        145      148       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@kingthorin kingthorin dismissed their stale review October 31, 2024 22:31

Oops still draft

Comment on lines +206 to +210
private double calculateTotalWeight(List<Map<String, Object>> items) {
return items.stream()
.mapToDouble(item -> (Double) item.get(WEIGHT_KEY))
.sum();
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How will it work if

  1. there are N weights each of them is Double.MAX_VALUE - 1 or something like that, are we protected against overflow?
  2. what if there are custom fields with same name however with different meaning in same provider and even number type? Will there at least any warning appear or how should user be warned about this (without hours of debugging)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I appreciate your insight regarding potential overflow. While weights are typically not expected to be extremely large, I've implemented an iterative approach to ensure that adding each weight is checked against Double.MAX_VALUE. This helps prevent overflow in cases where unusually large weights might occur.
  2. I believe that values should be unique when performing weighted selection. Therefore, while weights can be the same, duplicates in the value parameter trigger an exception that informs the user of the specific duplicate value. This approach seems appropriate to ensure the integrity of the selection process.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While weights are typically not expected to be extremely large,

if we are talking about project in open source which is used then any input should be expected if it passes all existing validation

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I got your point. I provided the guard for overflow.

Comment on lines +215 to +220
for (Map<String, Object> item : items) {
currentWeight += (Double) item.get(WEIGHT_KEY);
if (randomValue < currentWeight) {
return (T) item.get(VALUE_KEY);
}
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason to do this on each iteration?
IIUC data in file is persistent then we could sort them once including all unboxing things (not sure if need them...) only once and have something like array of doubles
then just use something like adopted binary search

or did I miss anything here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the suggestion. I implemented the proposed optimization of sorting and preprocessing the cumulative weights once, which improved the efficiency.

Comment on lines +19 to +21
private double[] cumulativeWeights;
private Object[] values;

Copy link
Copy Markdown
Collaborator

@snuyanzin snuyanzin Nov 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitively it should be outside of this class
Otherwise it will slow down all other not weight related cases

There should be a weight provider dedicated interface/class instead this very generic

i guess same for methods otherwise we are in situation that we have functionality in generic class which doesn't support 100% of existing providers
at the same time for each instance we add more memory footprint

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved all the logic for the weight selector to the WeightedRandomSelector class. Currently, I’ve placed it in the service package, as it seems to fit best there, though I believe you initially suggested putting it in the providers package?

Additionally, based on the issue description, WeightedRandomSelector appears intended mainly for custom providers. At this point, no provider has data in the format required for this functionality. Should I register it only to be used with custom providers, or do you have a different solution in mind?


private void assertUniqueValues(Map<String, Object> item, Set<Object> values) {
Object value = item.get(VALUE_KEY);
if (!values.add(value)) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another problem Set for objects
this is a very error prone to assume that for very object there are suitable equals/hashcode

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have some suggestions on how to refactor it?

if (item == null || item.isEmpty()) {
throw new IllegalArgumentException("Item cannot be null or empty");
}
if (!item.containsKey(WEIGHT_KEY) || !item.containsKey(VALUE_KEY)) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here about equals/hashcode

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, but I might not get this. I am checking if each item contains weight and value keys defined as strings. If for some reason they do not, I am directly throwing an exception.

@SamuelNeceda SamuelNeceda changed the title [Draft] Addition of weighted selection to custom providers. Addition of weighted selection to custom providers Nov 3, 2024
@SamuelNeceda
Copy link
Copy Markdown
Contributor Author

@kingthorin @snuyanzin
I have removed all changes from the mentioned classes, ensuring that the WeightedRandomSelector can now be instantiated independently.

Additionally, I have updated the documentation and examples to emphasize that weighted random selection is still in the Proof of Concept (POC) stage.

I believe these changes address your concerns, and the PR is now ready for merging. Any future suggestions for improvements will be considered and addressed in subsequent Issues, ...

@asolntsev asolntsev added this to the 2.4.3 milestone Nov 19, 2024
@asolntsev asolntsev linked an issue Nov 19, 2024 that may be closed by this pull request
@asolntsev asolntsev added the enhancement New feature or request label Nov 19, 2024
@SamuelNeceda
Copy link
Copy Markdown
Contributor Author

SamuelNeceda commented Nov 19, 2024

@snuyanzin , would you be so kind and have a look at the PR please. I am awaiting your approve so it can be merged.

Comment on lines +36 to +40
List<Map<String, Object>> insects = List.of(
Map.of("value", "Driver ant", "weight", 6.0),
Map.of("value", "Fire ant", "weight", 3.0),
Map.of("value", "Harvester ant", "weight", 1.0)
);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to follow same way as existing example and place all the example data in constant like above rather than generating it for each method invocation

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Map.of("value", "Harvester ant", "weight", 1.0)
);

WeightedRandomSelector selector = new WeightedRandomSelector(new Random());
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I guess we don't need to create a new selector per getter invocation and can reuse it

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

if (!(weightObj instanceof Double weight)) {
throw new IllegalArgumentException("Weight must be a non-null Double");
}
if (weight <= 0 || Double.isNaN(weight) || Double.isInfinite(weight)) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

out of curiosity: why do we consider weight = 0 as an error?
IMHO it might be a valid way to turn off this value while the total must be positive

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thought was that a weight of 0 implies that the corresponding item cannot be selected. Since WeightedRandomSelector is designed to assign probabilities proportional to weights, including a weight of 0 would lead to unnecessary processing for an element that cannot be selected. But I refactored the code to enable also 0 weight, allowing weights of 0 to be a legitimate way to "turn off" certain elements while still keeping them in the configuration

Comment on lines +131 to +168
@ParameterizedTest
@MethodSource("weightedRandomSelectorProvider")
void testWeightedArrayElement_withZeroWeight(WeightedRandomSelector weightedRandomSelector) {
List<Map<String, Object>> items = List.of(
Map.of(VALUE_KEY, ELEMENT_1, WEIGHT_KEY, ZERO_WEIGHT),
Map.of(VALUE_KEY, ELEMENT_2, WEIGHT_KEY, ELEMENT_2_WEIGHT)
);

assertThatThrownBy(() -> weightedRandomSelector.select(items))
.isInstanceOf(IllegalArgumentException.class)
.hasMessage("Weight must be a positive number and cannot be NaN or infinite");
}

@ParameterizedTest
@MethodSource("weightedRandomSelectorProvider")
void testWeightedArrayElement_withNegativeWeight(WeightedRandomSelector weightedRandomSelector) {
List<Map<String, Object>> items = List.of(
Map.of(VALUE_KEY, ELEMENT_1, WEIGHT_KEY, NEGATIVE_WEIGHT),
Map.of(VALUE_KEY, ELEMENT_2, WEIGHT_KEY, ELEMENT_2_WEIGHT)
);

assertThatThrownBy(() -> weightedRandomSelector.select(items))
.isInstanceOf(IllegalArgumentException.class)
.hasMessage("Weight must be a positive number and cannot be NaN or infinite");
}

@ParameterizedTest
@MethodSource("weightedRandomSelectorProvider")
void testWeightedArrayElement_withInfiniteWeight(WeightedRandomSelector weightedRandomSelector) {
List<Map<String, Object>> items = List.of(
Map.of(VALUE_KEY, ELEMENT_1, WEIGHT_KEY, Double.POSITIVE_INFINITY),
Map.of(VALUE_KEY, ELEMENT_2, WEIGHT_KEY, ELEMENT_2_WEIGHT)
);

assertThatThrownBy(() -> weightedRandomSelector.select(items))
.isInstanceOf(IllegalArgumentException.class)
.hasMessage("Weight must be a positive number and cannot be NaN or infinite");
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like all these tests (my be some more) could be replaced with one parameterized test
since we have input of key/value sequence, exception and error message

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I refactored a whole test class to use a parameterized tests.

@snuyanzin
Copy link
Copy Markdown
Collaborator

I think this version is much closer to be merged, I left some comments

@SamuelNeceda
Copy link
Copy Markdown
Contributor Author

Thanks for the comments @snuyanzin , I have already addressed them.

*/
@ParameterizedTest
@MethodSource("exceptionCasesProvider")
void testWeightedArrayElement_exceptions(List<Map<String, Object>> items, String expectedMessage) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you rearrange a bit methods within this class
in a way: methods with different test annotations first
all other methods after that

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests were rearranged

*/
@ParameterizedTest
@MethodSource("exceptionCasesProvider")
void testWeightedArrayElement_exceptions(List<Map<String, Object>> items, String expectedMessage) {
Copy link
Copy Markdown
Collaborator

@snuyanzin snuyanzin Nov 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void testWeightedArrayElement_exceptions(List<Map<String, Object>> items, String expectedMessage) {
void testWeightedArrayElementExceptions(List<Map<String, Object>> items, String expectedMessage) {

or another way

Suggested change
void testWeightedArrayElement_exceptions(List<Map<String, Object>> items, String expectedMessage) {
void testWeightedArrayElementFailureCase(List<Map<String, Object>> items, String expectedMessage) {

lets's do not mix different notations

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactored

* - any values in the list are not unique or null,
* - the sum of weights exceeds Double.MAX_VALUE.
*/
public <T> T select(List<Map<String, Object>> items) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public <T> T select(List<Map<String, Object>> items) {
public static <T> T select(List<Map<String, Object>> items) {

this seems not addressed yet
not sure if we need it public

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the current design, the method public T select(List<Map<String, Object>> items) cannot be static because it relies on the random field of the WeightedRandomSelector record, which is an instance field.
If the goal is to make the method static, I will need to explicitly pass the Random object to the method.
From my point of view, the current implementation is correct.

@kingthorin kingthorin merged commit 59eebb9 into datafaker-net:main Nov 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add weighted selection

5 participants