Approximate Continuous Query
Approximate Continuous Query
Archive
University of Zurich
University Library
Strickhofstrasse 39
CH-8057 Zurich
www.zora.uzh.ch
Year: 2015
Dehghanzadeh, Soheila ; Dell’Aglio, Daniele ; Gao, Shen ; Della Valle, Emanuele ; Mileo, Alessandra ; Bernstein,
Abraham
DOI: https://doi.org/10.1007/978-3-319-19890-3_20
1 Introduction
RDF Stream Processing (RSP) applications are becoming increasingly popular.
For example, real-time city monitoring applications process public transporta-
tion and weather streams [13]; recommendation applications exploit micro-post
streams and user profiles from social networks [7]; supply chain applications use
commercial RFID data streams and product master data. RSP techniques, at
the basis of those applications, proved to be valid solutions to cope with the
high variety and velocity that characterize those data. However, more complex
analyses can be performed by combining data streams with static or quasi-static
background data. In a Semantic Web setting, background data is usually stored
remotely and exposed through SPARQL endpoints [1].
Current RSP languages, like C-SPARQL [4], SPARQLstream [6], and CQELS-
QL [12] support queries involving streaming and background data. Those lan-
guages are built as extensions of SPARQL 1.1 and consequently support the fed-
erated SPARQL extension [1] and the SERVICE clause that enables the remote
2 S. Dehghanzadeh et al.
worth noting that a relevant number of queries in the Stream Processing context
are in this class [5,14]. Our solution is a query-driven maintenance process based
on the following consideration: the accuracy of the current response is not af-
fected by refreshing elements that are fresh or not involved in the current query.
Thus, an efficient maintenance process should refresh local view entries that are
both stale and involved in current query evaluation. We investigate the research
question through two hypotheses.
The first hypothesis claims that the accuracy of the answer can increase
by maintaining part of the local view involved in the current query evaluation
(HP1). Having materialized the intermediate results in a local view, the contin-
uous queries join the local view with the stream. In fact, local view elements
that are involved in current evaluation depend on the content of the stream in
current evaluation which varies over different evaluations. We propose Window
Service Join (WSJ), a join method to filter out local view elements that are
not involved in current evaluation (i.e., their maintenance does not affect the
response accuracy). In this way, the maintenance focuses on the elements that
affect the accuracy of the current response.
The second hypothesis claims that the accuracy of the answer increases by
refreshing the (possibly) stale local view entries that would remain fresh in a
higher number of evaluations (HP2). We propose Window Based Maintenance
(WBM), a policy that assigns a score to the local view elements based on the
estimated best before time, i.e., the time on which a fresh element estimated
to become stale, and the number of next evaluations that the item is going
to be involved. The former is possible by exploiting the change frequency of
elements, while the second exploits the streaming part of the query and the
window operator to (partially) foresee part of the future answers.
The paper is structured as follows. Section 2 introduces the main concepts at
the basis of this work; Section 3 analyzes the problem, by providing a motivat-
ing example, the problem formalization and by identifying the requirements to
design solutions. Section 4 presents the query-driven maintenance process, and
an experimental evaluation is provided in Section 5. Finally, the paper closes
with a brief review of relevant existing works in Section 6, and conclusions and
future works in Section 7.
2 Background
An RDF stream S is a potentially unbounded sequence of time stamped in-
formative units ordered by the temporal dimension:
Two mappings are compatible if they assign the same values to the common
variables3 , i.e., ∀v ∈ dom(µ1 ) ∩ dom(µ2 ), µ1 (v) = µ2 (v). We name joining vari-
ables the variables in dom(µ1 ) ∩ dom(µ2 ).
SERVICE is the clause at the basis of the federated extension [1], introduced
in SPARQL 1.1. This clause indicates that a graph pattern expression has to be
forwarded to and evaluated by a remote SPARQL endpoint. In this way, it is
possible to retrieve only the relevant part of information to compute the query
answer, instead of pulling the whole remote data and processing it locally.
Among operators in RSP languages, the sliding window is one of the most
important ones. Due to the fact that streams are infinite, the query accesses the
streaming data through windows, views over the stream that include subsets
of the stream elements. The content of the window is a set of RDF statements
and it can be processed through SPARQL expressions. In this work, we focus
on time-based windows, that are defined through a time interval representing
the portion of the stream they capture: a window (o, c] contains all the elements
(d, t) ∈ S s.t. o < t ≤ c. Time-based windows are generated by a time-based
sliding window operator W, defined through two parameters ω and β. The first,
named width, defines the size of the windows (every window has an opening
time o and a closing time c such that c−o = ω); the second, named slide, defines
the time step between windows (given two consecutive windows generated by W
with opening time instants o1 and o2 , o2 − o1 = β). When the width and the
slide values are the same, the sliding window is named tumbling window: in
this case, each element of the stream is in one and only one window, i.e., the
stream is partitioned.
To close this section, we introduce the notion of accuracy and latency, that
are used for the assessment of Quality of Service constraints. An RSP engine E is
a system that evaluates continuous query over streams. Given a query q, Ans(q)
3
In this work, we do not treat the empty mappings.
Approximate Continuous Query Answering 5
– the expected answer, and AnsE (q) – the answer provided by an engine E, the
accuracy acc(E, q) is the ratio between the number of elements of AnsE (q) that
are also in Ans(q) and the total number of the elements in AnsE (q) (without
repetitions) [15]. In a database system, the query latency is the time required
to process the query answer. This definition has to be adapted for RSP engines,
where queries are evaluated multiple times. In this case, the query latency is a
set of values (one for each evaluation), and with lat(E, q) we indicate the latency
of the current evaluation of q in E.
In this section, we discuss the problem of maintaining local views in a Web stream
processing. In Section 3.1, we discuss an example to highlight the critical aspects
of the problem. In Section 3.2, we formalize the problem. Finally, in Section 3.3,
we elicit the requirements of a solution for this problem. Those requirements take
into account not only the aspects already studied in the database literature, but
also new ones introduced by the Web setting and the presence of data streams.
where it is not possible to achieve the goal: when it happens, the maintenance
process should raise an alert to the query processor (R4).
Moreover, it is possible to gather requirements while processing queries. First,
the join between stream and quasi-static data exposes important information to
improve the maintenance process: it may consider the join selectivity of map-
ping in the local view to identify those that have greater effect on accuracy (R5).
Second, the streaming part of the query can be exploited by the maintenance
process. In particular, the sliding window can enable the optimization the main-
tenance process: at each evaluation, the window slides, and part of the data is
not removed from the window. Given the window definition, it is possible to
compute how long a data item will remain in the system and use it in the main-
tenance process (R6), e.g., if two mappings have the same changing rate, we can
update the one for which the compatible mapping from the stream has longer
lifetime, as it has higher probability to save more future updates.
4 Solution
Our proposed solution is a query-driven maintenance process for the local view R,
in the context of the evaluation of continuous query q under QoS constraints on
responsiveness and accuracy. The maintenance process is query-driven in the
sense that it refreshes the mappings involved in the current query evaluation.
Join method (WSJ) and the Window Based Maintenance policy (WBM). The
former, presented in Section 4.1, performs the join and starts the maintenance
process (as proposer); the latter, presented in Section 4.2, completes the mainte-
nance process by ranking the candidate set and maintaining the local view. The
intuition behind WBM is to prioritize the refresh of the mappings that are going
to be used in the upcoming evaluations and that allows saving future refreshes.
As explained in Section 1, in the following we study the class of queries where
there is a unique join between the WINDOW and the SERVICE graph pattern
expressions. Moreover, to be compliant with SPARQL 1.0 endpoints, we assume
that the queries sent to refresh the local view cannot make use of the VALUE
clause. In other words, every query refreshes one replicated mapping.
WSJ performs the join and starts the maintenance process (as proposer). As
explained above, the query answering process should take into account the QoS
constraints (requirement R3) including latency and accuracy as defined in Equa-
tion 1. While the former can be tracked – the RSP engine can measure the query
latency –, the latter can only be estimated, – the engine cannot determine if a
mapping is fresh or stale, and consequently cannot compute the accuracy. This
consideration leads the design of WSJ: it fixes the latency based on the respon-
siveness constraint ρ and maximize the accuracy of the answer accordingly.
To cope with the responsiveness requirement, we introduce the notion of
refresh budget γ as the number of elements in R that can be maintained at each
evaluation. As explained in Equation 1 the latency value should be lower or equal
to the response time constraint ρ. Given the time rq to evaluate P the query 5 ,
γ
and the time to perform the maintenance process of γ elements ( i=1 ri ), the
latency of the engine E to execute the query q is:
X
γ
lat(E,q) = rq + ri ≤ ρ (2)
i=1
Algorithm 1 shows WSJ. First invocation of the next() method retrieves the
results of Ωjoin (i.e., the block in Lines 1–12 is executed). That is, the WINDOW
expression is evaluated and the bag of solution mappings Ωwindow is retrieved
from the WinOp operator (Lines 2–4). WSJ computes the candidate set C as
the set of mappings in R compatible with the ones in Ωwindow (Line 5). In fact,
the mappings in R that are not compatible with the ones in Ωwindow do not
affect the accuracy of the current query evaluation, so they are discarded. C and
the refresh budget γ are the inputs of the maintenance policy M (Line 6), that
refreshes the local view. Then, an iterator is initiated over Ωwindow (Line 7).
Finally, the join is performed (Lines 9–13) between each mapping in Ωwindow
and the compatible mapping from R and returned at each next() invocation.
5
rq includes the time to transform the query plan, optimize and evaluate it, and
appending the output to the answer stream.
10 S. Dehghanzadeh et al.
Figure 2 shows the running example of this section. The join is performed at
time 8. The local view R contains the result of the SERVICE clause evaluation(on
the right): µR R R
a , µb , . . . , µf . As described in Algorithm 1, WSJ first computes
Ωwindow (on the left): at time 8, it contains µW W W W
a , µb , µc and µd . Next, WSJ
starts the maintenance process. First, it filters R in order to build the candidate
set C with the compatible mappings of the ones in Ωwindow . C contains µR R
a , µb ,
R R R R
µc and µd . The other two mappings in R, µe and µf , are not compatible with
the mappings in Ωwindow , so they are not considered for the refresh.
The Window Based Maintenance (WBM) policy elects the mappings to be re-
freshed and maintains the local view accordingly. Its goal is to maximize the
accuracy of the query answer, given that it can refresh at most γ mappings at
each evaluation. WBM aims at identifying the stale mappings in the candidate
set C and choose them for maintenance.
To determine if a mapping in C is fresh or stale, an access to the remote
SPARQL endpoint BKG is required, and it is not possible (as explained above).
Approximate Continuous Query Answering 11
To overcome this limitation, WBM computes the best before time of the map-
pings in C: as the name suggests, it is an estimation of the time on which a fresh
mapping becomes stale. Being only estimation, it is not certain that after the
best before time the mapping becomes stale, but only possibly stale.
{(µR R R
1 , τ1 ), (µ2 , τ2 ), . . . , (µn , τn )}
Where µR i is the solution mapping in R, and τi represents the current best before
time. In Figure 2, the best before time values are shown on the right side of the
picture (the black and white mappings in the local view), e.g., the best before
time of µR R R
a is 7, the one of µb is 9 and the one of µc is 6.
The set of possibly stale mappings PS is a subset of mappings in C such that
their best before time is lower or equal to the current evaluation time. Continuing
the example, given the candidate set C = {µR R R R
a , µb , µc , µd }, WBM selects the
possibly stale mappings by comparing their best before time values with the the
current time (8). The possibly stale mappings (the black mappings in the local
12 S. Dehghanzadeh et al.
ti + ω − t
Li (t) = ⌈ ⌉ (3)
β
Where ti is the time that the compatible mapping µW i enters the window.
Continuing the example in Figure 2, the remaining life time of µR c at the
current time instant is Lc (8) = 3: the compatible mapping µW c is in W1 , W2 and
W3 , so µR R
c is involved in three successive evaluations. Similarly, the values of µa
and µRd are 1 and 3 respectively.
Computation of the renewed best before time (Line 4). The second
scoring value of WBM identifies the number of successive evaluations on which
the element will remain (possibly) fresh, if refreshed now. In other words, first,
WBS computes the renewed best before time τinext of the mapping. The renewed
best before time of the mapping µRi at time t is computed as:
5 Experiments
In this section, we experimentally study the performance of WSJ and WBM
to verify the validity of the hypotheses presented in Section 1. We set up two
experiments: first (Section 5.1) investigates if WSJ improves the accuracy of the
answer (HP1); second (Section 5.2) studies if WBM contributes to improve the
accuracy of the answers (HP2). In the following, we describe the experimental
setting to perform the experiments, inspired by the example in Section 3.1.
Data set preparation. An experimental data set is composed by streaming
and background data. We built two data sets: one with real streaming data and
synthetic background data; and one with real streaming and background data.
The real streaming data has been collected from Twitter. We identified four
hundred Twitter verified users as a user set, and we collected three hours of
tweets related to them. In the meanwhile, we also built the real background
data, as the number of followers of the user set elements. We collected snapshots
of the users’ follower count every minute to keep track of the changes and to
replay the evolution of the background data6 . Additionally, we built the synthetic
6
It is worth to note that in this way we do not hit the Twitter API limits, see
https://dev.twitter.com/rest/public/rate-limiting
14 S. Dehghanzadeh et al.
background data assigning a different change rate at each user (that is stable over
time), and changing the follower count accordingly.
Query preparation. The test query performs the join in Listing 1.1 between
collected data. The query uses a window that slides every 60 seconds. Slides
should be greater than or equal to intervals among consecutive snapshots to
make sure that the current snapshot is different than the previous one.
5.1 Experiment 1
The first experiment aims at investigating the hypothesis HP1: the accuracy
of the answer can increase by maintaining part of the local view involved in
the current query evaluation. To verify this hypothesis, we follow a comparative
approach: we evaluate the join using WSJ as join method, and we compare it with
a set of baselines. As lower bound proposer, we consider the worst maintenance
process (WST), that does no refresh local view throughout evaluations, i.e., it
represents a proposer with an empty candidate set. As upper bound, we use
BST (best): its candidate set consists of γ certainly stale elements (where γ
is the refresh budget). This proposer cannot be applied in reality (as it is not
possible to know if a local view element is stale or fresh), and we use it as
upper bound. Finally, we use the proposer GNR: it uses the whole local view as
candidate set, i.e., it maintains the local view without considering the query. To
complete the maintenance process, a policy is required. We use two maintenance
policies inspired by the random (RND) and Least-Recently Used (LRU) page
replacement algorithms. RND picks γ mappings from the candidate set, while
LRU chooses the γ least recently refreshed mappings in the candidate set.
Figure 3 shows the results of the experiment; the charts show the cumulative
error over the multiple evaluations (the lower, the better). WST and BST are
the lower and upper bounds so all the other results are between those two lines.
It is possible to observe that GNR performs slightly better than the lower bound
WST. Comparing GNR and WSJ, WSJ performs significantly better than GNR
with both maintenance policies.
To study if the result generalizes, we repeated the experiment with different
refresh budgets. To set the refresh budget, we first computed the average dimen-
¯ = 33, and we set the refresh budget as 8%, 15% and
sion of the candidate sets |C|
Approximate Continuous Query Answering 15
Synthetic Real
γ WST GNR WSJ BST WST GNR WSJ BST
RND LRU RND LRU RND LRU RND LRU
8% 0.23 0.26 0.27 0.40 0.38 0.49 0.30 0.34 0.33 0.46 0.47 0.56
15% 0.23 0.26 0.28 0.48 0.51 0.66 0.30 0.36 0.35 0.57 0.58 0.74
30% 0.23 0.32 0.33 0.64 0.76 0.94 0.30 0.41 0.41 0.68 0.80 0.98
30% of |C|¯ (respectively 3, 5 and 10). Table 1 reports on the average accuracy
for both the synthetic and the real data set. It is worth noting that WSJ shows
better improvements than GNR when the refresh budget increases: moving γ
from 8% to 30%, in the synthetic (real) data set GNR improves from 0.26 (0.27)
to 0.32 (0.33), while WSJ improves the accuracy from 0.40 (0.38) to 0.64 (0.76).
It happens because WSJ chooses the mappings from the ones currently involved
in the evaluation, while GNR chooses from the whole local view. A similar trend
is visible also when the real data set is considered.
5.2 Experiment 2
The second experiment aims at investigating the hypothesis HP2: the accuracy
of the answer increases by refreshing local view entries that estimated to be stale
and would remain fresh in a higher number of evaluations. This requires studying
the performance of WBM. Like in the first experiment, we follow a comparative
approach, and we compare WBM with other maintenance policies. As lower
bound, we use WST (in this case represents a policy that does not refresh any
mapping); as upper bound we use WBM*, i.e., the WBM policy that can access
the real change time instants of the mappings from the remote service. Like BST,
WBM* cannot be used in reality, due to the fact that change time instants are not
available ahead of time. Finally, we use RND and LRU (presented in the previous
section) as policies to make the comparison. Due to the good performance of
WSJ, we used it as proposer for all policies.
Results of the experiments are shown in Figure 4. In both the synthetic and
real data set cases, the WBM maintenance policy outperforms RND and LRU by
16 S. Dehghanzadeh et al.
having a lower cumulative error. This difference is more visible in the synthetic
data set due to the fixed change rate assumption. Similarly, WST and WSJ-
WBM* are lower and upper bounds respectively. Figure 4a and 4b shows that
WSJ-WBM clearly outperforms baselines (WSJ-RND, WSJ-LRU).
We repeated the experiment with different time constraints (i.e., refresh bud-
gets), in order to study the behavior of the policies under different situations.
Results are shown in Table 2. In general WBM shows better performance than
the two baseline policies we considered. However, WBM is more efficient on lower
refresh budgets. Comparing WBM and WBM*, it is possible to notice that the
accuracy difference increases as the refresh budget increases: WBM* accuracy
move from 0.52 (0.59) for the synthetic (real) data set to 0.94 (0.98), while
WBM moves from 0.46 (0.52) to 0.81 (0.80). In the experiment with the real
data set, the estimation error is higher when the refresh budget is high; there
WBM performance is equal to the LRU one.
Synthetic Real
γ WST WSJ WSJ WSJ WSJ WST WSJ WSJ WSJ WSJ
RND LRU WBM WBM* RND LRU WBM WBM*
8% 0.23 0.39 0.38 0.46 0.52 0.30 0.45 0.47 0.52 0.59
15% 0.23 0.49 0.50 0.60 0.71 0.30 0.57 0.58 0.61 0.77
30% 0.23 0.64 0.76 0.81 0.94 0.30 0.68 0.80 0.80 0.98
Table 2: Accuracy comparison of LRU, RND & WBM in synthetic/real data sets
6 Related Work
Local views, such as replicas and caches, materialize the content of remote
sources in the query processor to improve availability, scalability and perfor-
mance [9]. Any materialization methodology will lead to a trade-off among
space/time. More materialization requires more space but will decrease the re-
sponse time and vice versa. However, maintenance processes have to be intro-
duced in order to update the view and reduce inconsistencies. View maintenance
has been studied extensively in database community [11,3,19,9]. Any mainte-
nance methodology will lead to a trade-off among response quality and time.
That is, the shorter maintenance intervals will lead to a higher response quality
but will increase the response time due to the consumption of computational
resources and vice versa. A common assumption among all maintenance meth-
ods is the existence of update streams, i.e, streams carrying the changes of the
relations. An adaptive materialization strategy with an eager view maintenance
(i.e., all the updates are processed on arrival) is proposed in [3]. It manages the
trade-off among space and query response time and adaptively refines data for
materialization by monitoring their cost/benefit ratio under different circum-
stances. In [11] a lazy maintenance (i.e., update processing can be postponed)
solution is proposed. It works in cases where the cost of updating the views
is high. In this work, authors propose a query-driven maintenance approach to
apply a subset of update-stream so that user-defined constraints on the quality
of the answer is not impaired. Providing approximate results according to the
Approximate Continuous Query Answering 17
quality constraints is a well known problem [9]. In a similar attempt, [10] pro-
pose a technique to optimize the view maintenance process in order to target
the trade-off between time and quality of the response. However, in a seman-
tic web setting, update streams are not available because most of the SPARQL
endpoints are not providing the update stream of their underlying data.
In [18] the time/quality trade-off has been addressed in a Semantic Web
scenario: each query is split between the local query processor and a live query
processor to achieve faster response than a live query processor and more fresh re-
sponse than the local query processor. However, parameters to adjust the trade-
off among freshness and fastness are fixed and therefore it is not possible to
adjust them based on user-defined trade-off on a query basis.
under grant No.603095 and the IBM Ph.D. Fellowship Award 2014 granted to
D. Dell’Aglio.
References
1. C. B. Aranda, M. Arenas, Ó. Corcho, and A. Polleres. Federating queries in
SPARQL 1.1: Syntax, semantics and evaluation. J. Web Sem., 18(1):1–17, 2013.
2. C. B. Aranda, A. Polleres, and J. Umbrich. Strategies for executing federated
queries in SPARQL1.1. In ISWC 2014, Proceedings Part II, pages 390–405, 2014.
3. S. Babu, K. Munagala, J. Widom, and R. Motwani. Adaptive caching for contin-
uous queries. In ICDE 2005, pages 118–129. IEEE, 2005.
4. D. F. Barbieri, D. Braga, S. Ceri, E. Della Valle, and M. Grossniklaus. Querying
RDF streams with C-SPARQL. SIGMOD Record, 39(1):20–26, 2010.
5. S. Blanas, J. M. Patel, V. Ercegovac, J. Rao, E. J. Shekita, and Y. Tian. A
comparison of join algorithms for log processing in MapReduce. In SIGMOD 2010,
pages 975–986, 2010.
6. J. Calbimonte, H. Jeung, Ó. Corcho, and K. Aberer. Enabling query technologies
for the semantic sensor web. Int. J. Semantic Web Inf. Syst., 8(1):43–63, 2012.
7. I. Celino, D. Dell’Aglio, E. Della Valle, Y. Huang, T. K. Lee, S. Kim, and V. Tresp.
Towards BOTTARI: using stream reasoning to make sense of location-based micro-
posts. In ESWC 2011 Workshops, Revised Selected Papers, pages 80–87, 2011.
8. F. Goasdoué, K. Karanasos, J. Leblay, and I. Manolescu. View selection in semantic
web databases. Proceedings of the VLDB Endowment, 5(2):97–108, 2011.
9. H. Guo, P.-Å. Larson, and R. Ramakrishnan. Caching with good enough currency,
consistency, and completeness. In VLDB, pages 457–468. VLDB Endowment, 2005.
10. H. Guo, P.-Å. Larson, R. Ramakrishnan, and J. Goldstein. Relaxed Currency and
Consistency: How to Say Good Enough in SQL. In SIGMOD, pages 815–826, 2004.
11. A. Labrinidis and N. Roussopoulos. Exploring the tradeoff between performance
and data freshness in database-driven web servers. VLDB J., 13(3):240–255, 2004.
12. D. Le Phuoc, M. Dao-Tran, J. X. Parreira, and M. Hauswirth. A native and
adaptive approach for unified processing of linked streams and linked data. In
ISWC 2011, Proceedings, Part I, pages 370–388, 2011.
13. F. Lécué, S. Tallevi-Diotallevi, J. Hayes, R. Tucker, V. Bicer, M. L. Sbodio, and
P. Tommasi. Smart traffic analytics in the semantic web with STAR-CITY: sce-
narios, system and lessons learned in dublin city. J. Web Sem., 27:26–33, 2014.
14. A. Natsev, Y.-C. Chang, J. R. Smith, C.-S. Li, and J. S. Vitter. Supporting
incremental join queries on ranked inputs. In VLDB, pages 281–290, 2001.
15. A. Parssian, S. Sarkar, and V. S. Jacob. Assessing information quality for the
composite relational operation join. In IQ, pages 225–237, 2002.
16. M. Schmidt, M. Meier, and G. Lausen. Foundations of sparql query optimization.
In ICDT, pages 4–33. ACM, 2010.
17. X. Sean and Z. Xiaoquan. Impact of wikipedia on market information environ-
ment: Evidence on management disclosure and investor reaction. MIS Quarterly,
37(4):1043–1068, 2013.
18. J. Umbrich, M. Karnstedt, A. Hogan, and J. X. Parreira. Freshening up while
staying fast: Towards hybrid sparql queries. In EKAW, pages 164–174. 2012.
19. S. D. Viglas, J. F. Naughton, and J. Burger. Maximizing the output rate of multi-
way join queries over streaming information sources. In VLDB, pages 285–296,
2003.