
[BEAM-3446] Fixes RedisIO non-prefix read operations #5841

Merged
iemejia merged 8 commits into apache:master from vvarma:i3446rebase
Sep 19, 2018

Conversation

@vvarma (Contributor) commented Jun 30, 2018

Rebase of #4656

BaseReadFn abstracts the general Jedis operations.
Moved the key fetch for a given prefix into the ReadKeysWithPattern DoFn.
ReadFn is a pure fetch from Redis given a key.
URL: https://issues.apache.org/jira/browse/BEAM-3446
@iemejia @jbonofre

It will help us expedite review of your Pull Request if you tag someone (e.g. @username) to look at it.

Post-Commit Tests Status (on master branch): [build status badge table]

@iemejia (Member) commented Jul 2, 2018

Run Java Precommit

public void processElement(ProcessContext processContext) throws Exception {
  String key = processContext.element();
  String value = jedis.get(key);
Member:
As mentioned in the previous PR I am a bit concerned about losing the multiple data request capability, any chance you can work on this with the approach you mentioned based on MGET for ReadFn.
The simplest approach probably is to do like other IOs and have a default size that can be parametrized via a withBatchSize method. WDYT ?

Contributor Author:

Sure, I think it should be straightforward, will update the pr over the week.

Member:

👍

Contributor Author:

@iemejia right now there are two operators exposed for read operations

  1. ReadKeysWithPattern - which is two-stepped: (a) for each pattern prefix in the PCollection, fetch all matching keys; (b) fetch the values for each key.
  2. ReadFn - which gets the value for each key in the PCollection.

I can change ReadKeysWithPattern to use a parameterized batch to fetch the data in step (b), but for ReadFn I am not sure how to use the batch parameter.
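The two read shapes described above can be sketched in plain Java against a stubbed in-memory store (all names and the store itself are illustrative stand-ins, not the actual RedisIO code):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the two read shapes: (1) expand a key prefix to all
// matching keys, then fetch each value; (2) a pure fetch for a given key.
// The in-memory map stands in for Redis; this is not the RedisIO code.
class ReadShapes {
    private final Map<String, String> store = new LinkedHashMap<>();

    public void put(String key, String value) {
        store.put(key, value);
    }

    // Shape 1, step (a): all keys matching a prefix (like KEYS prefix*).
    public List<String> keysWithPrefix(String prefix) {
        List<String> keys = new ArrayList<>();
        for (String k : store.keySet()) {
            if (k.startsWith(prefix)) {
                keys.add(k);
            }
        }
        return keys;
    }

    // Shape 2 (and shape 1, step (b)): fetch the value for a key (like GET).
    public String read(String key) {
        return store.get(key);
    }
}
```

Shape 2 has no natural per-element batching point, which is where the batch-size question below comes from.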

Member:

Hi, sorry I have missed your message. The idea is that we should add the DoFn startBundle and finishBundle methods and create a method in the Read to define the maximum number of elements that we will request. You build the collection of keys to be requested in processElement, but you don't do the request there: you do it in finishBundle, with a single MGET request over the batch. We should choose a sensible default size, e.g. 1000. It is similar to what other IOs do in the Write (see withBatchSize in ElasticsearchIO or SolrIO for reference):

@StartBundle
public void startBundle(StartBundleContext context) {
  batch = new ArrayList<>();
}

@ProcessElement
public void processElement(ProcessContext context) throws Exception {
  SolrInputDocument document = context.element();
  batch.add(document);
  if (batch.size() >= spec.getMaxBatchSize()) {
    flushBatch();
  }
}

@FinishBundle
public void finishBundle(FinishBundleContext context) throws Exception {
  flushBatch();
}

// Flushes the batch, implementing the retry mechanism as configured in the spec.
private void flushBatch() throws IOException, InterruptedException {
  if (batch.isEmpty()) {
    return;
  }
  try {
@iemejia iemejia self-requested a review July 2, 2018 13:12
@vvarma vvarma requested a review from jbonofre as a code owner July 14, 2018 17:44
public void finishBundle(FinishBundleContext context) throws Exception {
  List<KV<String, String>> kvs = fetchAndFlush();
  for (KV<String, String> kv : kvs) {
    context.output(kv, lastMsg, window);
Contributor Author:

@iemejia Not sure about this, since I am using the Instant and window from the last processed message to produce output in the finishBundle method.

Member:

Oh, so silly of me, I misread the motivation for keeping the window; you are right, it makes total sense. In that case it is probably better to store the elements in a Map with the window as key and the list of elements as value, use window.maxTimestamp() (you don't need lastMsg), and flush when there are enough elements. Similar to what is done here (but with the count logic):

private static class GatherBundlesPerWindowFn<T> extends DoFn<T, List<T>> {
  @Nullable private transient Multimap<BoundedWindow, T> bundles = null;

  @StartBundle
  public void startBundle() {
    bundles = ArrayListMultimap.create();
  }

  @ProcessElement
  public void process(ProcessContext c, BoundedWindow w) {
    bundles.put(w, c.element());
  }

  @FinishBundle
  public void finishBundle(FinishBundleContext c) throws Exception {
    for (BoundedWindow w : bundles.keySet()) {
      c.output(Lists.newArrayList(bundles.get(w)), w.maxTimestamp(), w);
    }
  }
}

Contributor Author:

Though here, messages with the same window are bundled together, stored, and finally processed in finishBundle. In the case of ReadFn, the idea was to buffer requests until the batch size is reached and process them at that point, so output is pushed both in the processElement method and in finishBundle. I am not sure I understand how to use a map with the window as the key.
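The buffering being discussed (flush when the batch is full, and flush again at finishBundle) can be sketched in plain Java, ignoring windows for the moment. All names (`BatchedReader`, the fake value fetch) are illustrative, not the RedisIO code; the loop over keys stands in for what would be a single MGET:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the batching pattern: buffer keys until the batch size is
// reached, then fetch all values at once (as one MGET would), and flush
// any remainder when the bundle finishes. Not the actual RedisIO code.
class BatchedReader {
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();
    private final List<String> emitted = new ArrayList<>();

    public BatchedReader(int batchSize) {
        this.batchSize = batchSize;
    }

    // Per element (like @ProcessElement): buffer, flush when full.
    public void processElement(String key) {
        buffer.add(key);
        if (buffer.size() >= batchSize) {
            fetchAndFlush();
        }
    }

    // At bundle end (like @FinishBundle): flush whatever is left.
    public void finishBundle() {
        fetchAndFlush();
    }

    private void fetchAndFlush() {
        if (buffer.isEmpty()) {
            return;
        }
        // Stand-in for jedis.mget(keys...): one request for the whole batch.
        for (String key : buffer) {
            emitted.add(key + "=value-of-" + key);
        }
        buffer.clear();
    }

    public List<String> emitted() {
        return emitted;
    }
}
```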

Contributor Author:

@iemejia could you please advise on the above?

@iemejia iemejia changed the title Fixes https://issues.apache.org/jira/browse/BEAM-3446. [BEAM-3446] Fixes RedisIO non-prefix read operations Jul 19, 2018
@iemejia (Member) left a comment:

Almost ready. Please excuse me for being so slow with this review; many things at the same time are taking my bandwidth. Please take a look at my comment; if you can address it, this is almost LGTM. Also, can you please squash your commits and rename the commit properly (as the PR title that I just changed)? Thanks.


@vvarma (Contributor Author) commented Jul 26, 2018

Sure will make the change. Sorry about the delay.

@iemejia (Member) commented Jul 26, 2018

No problem, thanks a lot for taking care of this, we are really close.
I just wanted to bring awareness of another PR on Redis, #6045. Conceptually there seems to be no conflict, but it is good to think about any impact.

@jbonofre (Member): I'm going to do the review as well.

@iemejia (Member) commented Aug 20, 2018

Hi @vvarma it seems the changes on the other PR produced a conflict. Can you please rebase so we can merge this one (+ add the minor fixes of the review). Thanks!

@iemejia (Member) commented Sep 3, 2018

Just pinging about the status of this one, @vvarma; we are quite close, so hopefully you can fix the last bits so we can merge it. Sorry if this has taken too long.

@vvarma (Contributor Author) commented Sep 3, 2018

Hi @iemejia , I have made the change requested. I used the window as you suggested. Please let me know if they are as expected. Apologies for the delay.

@huygaa11 (Contributor):
Friendly ping for review!

@jbonofre (Member):
@huygaa11 sorry, I forgot. Resuming my review.

@iemejia (Member) left a comment:

Hi, I took a quick look; it looks almost done, thanks. Just two questions from a quick look (nothing major, just things that I didn't understand immediately).

String key = processContext.element();
bundles.put(window, key);
if (batchCount.incrementAndGet() > getBatchSize()) {
  Multimap<BoundedWindow, KV<String, String>> kvs = fetchAndFlush();
Member:

Why do you need to deal with windows here? (Note: I only looked quickly, so I didn't get the intuition.) If we can avoid this it is probably better, no?

Contributor Author:

The window stored with the key here is used in FinishBundle to output the keys, since the context in FinishBundle takes the window as a parameter: context.output(kv, w.maxTimestamp(), w);


@FinishBundle
public void finishBundle(FinishBundleContext context) throws Exception {
  Multimap<BoundedWindow, KV<String, String>> kvs = fetchAndFlush();
Member:

Is this extra flush needed? Without an equivalent startBundle, I don't see why it would be.

Contributor Author:

The reason for this extra flush is that we have a batch size: once the number of messages reaches that value, we invoke flush. When finishBundle is invoked at the end of the bundle, there may be a few messages left in the buffer (fewer than the batch size), so we invoke flush from here as well. For the same reason we also need to store the window of each message.
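The two flush paths, with the per-window grouping discussed earlier, can be sketched in plain Java. A `String` stands in for `BoundedWindow`, the Redis fetch is stubbed, and all names are illustrative, not the RedisIO code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of per-window batching: keys are grouped by window so the final
// flush in finishBundle can emit each value against its own window (in the
// real DoFn, with window.maxTimestamp()). Not the actual RedisIO code.
class WindowedBatcher {
    private final int batchSize;
    private int count = 0;
    private final Map<String, List<String>> bundles = new HashMap<>();
    private final List<String> output = new ArrayList<>();

    public WindowedBatcher(int batchSize) {
        this.batchSize = batchSize;
    }

    public void processElement(String key, String window) {
        bundles.computeIfAbsent(window, w -> new ArrayList<>()).add(key);
        if (++count >= batchSize) {
            fetchAndFlush(); // batch full: fetch and emit now
        }
    }

    public void finishBundle() {
        fetchAndFlush(); // remainder smaller than batchSize still flushed
    }

    private void fetchAndFlush() {
        for (Map.Entry<String, List<String>> e : bundles.entrySet()) {
            for (String key : e.getValue()) {
                // Stand-in for one MGET over the window's keys; each result
                // is emitted tagged with its window.
                output.add(e.getKey() + ":" + key);
            }
        }
        bundles.clear();
        count = 0;
    }

    public List<String> output() {
        return output;
    }
}
```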

Member:

Thanks for answering, I had somehow misread startBundle as a setup-only method. I see how everything fits now.

Contributor Author:

Thanks @iemejia. Do suggest if there are any other changes needed.

@iemejia (Member) left a comment:

LGTM, sorry for taking so long. I still think we can simplify the iteration a bit, and maybe extract a method for the repeated flush part, but we can address this in the future (not a blocker for merge). Thanks a lot @vvarma and sorry for the delay.

@iemejia iemejia merged commit b7c2975 into apache:master Sep 19, 2018
@vvarma (Contributor Author) commented Sep 19, 2018

@iemejia Thank you!

@vvarma vvarma deleted the i3446rebase branch September 19, 2018 15:43