Science DB
This paper reports scientists' requirements for databases and sketches some of the SciDB design.
Despite the aspects of the paper I dislike, it offers some insight into how the SciDB design was
derived from actual requirements, and it is interesting to see how the group designed data
structures and operations from scratch based on user requirements.
Review 2
Current commercial DBMSs cannot fulfill the demands of some scientific database users.
Scientific database users in fields including astronomy, particle physics, fusion, remote sensing,
oceanography, and biology have proposed several requirements. This paper aims to report the
requirements that were collected and identified, and to present the design of SciDB, a database
designed to meet a common set of requirements from scientific users.
Review 3
A collection of scientific database users have long complained about the inadequacy of current
commercial DBMS offerings, and though many researchers have worked on science databases
for years, there was still no common set of requirements across science disciplines. This paper
therefore defines those requirements, presents a detailed design exercise, and sketches some of
the SciDB design.
First, the data model in most scientific scenarios is an array data model, chosen primarily
because it makes a considerable subset of the community happy and is easier to build than a
mesh model. Specifically, this is a multi-dimensional, nested array model with array cells
containing records, which in turn can contain components that are multi-dimensional arrays. The
operators that accompany the proposed data model fall into Structural Operators and Content-
Dependent Operators. Structural operators create new arrays based purely on the structure of the
inputs; in other words, these operators are data-agnostic. Content-dependent operators are those
whose result depends on the data stored in the input array. To support the disparate languages
that scientists use, SciDB will have a parse-tree representation for commands, with multiple
language bindings that map from each language-specific representation to this parse-tree
format.
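As a rough illustration of this parse-tree idea, here is a minimal Python sketch; all class and function names here are my own invention, not SciDB's actual API. The point is that every language binding builds the same language-neutral tree of operator nodes, which the engine then executes.

```python
# Minimal sketch of a language-neutral parse tree for commands.
# All names are illustrative assumptions, not SciDB's real API.

class Node:
    def __init__(self, op, children=(), args=None):
        self.op = op                    # operator name, e.g. "filter", "scan"
        self.children = list(children)  # input sub-expressions
        self.args = args or {}          # operator parameters

# One hypothetical binding builds the tree with plain function calls...
def scan(array_name):
    return Node("scan", args={"array": array_name})

def filter_(node, predicate):
    return Node("filter", [node], {"predicate": predicate})

# ...while another binding might parse a MATLAB- or SQL-like string into
# the very same tree. The engine only ever sees Node objects.
tree = filter_(scan("temperatures"), "value > 30")
print(tree.op)              # filter
print(tree.children[0].op)  # scan
```

A design like this keeps the engine independent of any one host language, which is exactly why the reviewers highlight it.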
The main contribution of this paper is that it is the first to define the basic requirements for
scientific databases across several science disciplines, and it can thus serve as guidance for
researchers. In addition, the authors decided to build SciDB, recruited an initial programming
team, and started a non-profit foundation to manage the project, which should genuinely help
scientific data users, since it is not feasible for commercial vendors to pay attention to this
non-profit corner of the database market.
One thing to note is that this paper targets scientific databases specifically, so we cannot judge it
from the perspective of the most common databases; in particular, these requirements do not
address OLTP demands. In other words, the proposed database does not support transactions; it
supports only analysis, resembling OLAP workloads, which can be seen as a "drawback" of the
model.
Review 4
Scientific applications often have to use databases and other data management technologies to
handle the immense amounts of data that they produce in their research. Often, however, they
are forced to make do with systems that are developed with industry concerns in mind, rather
than the requirements specific to scientists, due to various historical reasons. For example,
scientific data management is a “zero billion dollar” industry, which makes it nearly impossible
for large commercial vendors to justify the time and expense to make a scientific research-
focused system. This paper attempts to bridge the gap by specifying the various requirements for
scientific databases, to make it easier for vendors and other organizations to develop products
specifically targeting the scientific community. In particular, the product that the authors hope to
develop out of this is termed SciDB. Potential users of this SciDB system span various scientific
disciplines from particle physics to biology and remote sensing to astronomy to oceanography,
and are mostly represented by universities and large research institutions.
The first major difference between scientific users and industry is that most of their data does
not fit particularly well into the table format that is the cornerstone of relational DBMSs. In fact,
arrays tend to be the natural data model for a lot of these disciplines, rather than tables. Other
domains like genomics would be happy with neither, and instead prefer graphs and sequence
representations instead. This means that a “one size fits all” approach does not work, and
instead, DBMSs will have to be specialized to individual needs. This paper focuses on the array
data model, due to practical considerations, where each array can have an arbitrary number of
dimensions and each combination of dimension values defines a cell. Data values are in the
form of scalar values, and each cell has the same data type across all of its associated values.
Arrays can be defined, and then multiple instances can be created. SciDB will also have user
defined functions (in C++). One concept that SciDB introduces is the idea of enhancing arrays,
i.e. performing various transformations such as transposition, scaling, translation, etc, which are
done by user defined functions. SciDB includes a wide range of operators, which fall under two
general categories: structural and content-dependent operators. Structural operators create new
arrays purely based on the structure of the input, i.e. they are data-agnostic. This allows for a
greater degree of optimization since the values of the data do not have to be taken into account.
Examples of structural operators include subsample and reshape. On the other hand, content-
dependent operators depend on the data stored in the array. Examples of this class of operators
include filter and aggregate. As before, the set of operators and data types is extendable, and
users can add in ones more suited to their specific disciplines. As for language, SciDB has a
parse-tree representation for commands, followed by multiple language bindings. The reason for
this is to support the disparate languages, from C++ to Python to MATLAB that different
scientists prefer. Other important features of SciDB include a no-overwrite design, where old
data is usually not deleted (for lineage purposes), the open source nature of the project, grid
orientation, storage within a node, and “in situ” data where the overhead of loading data is
minimized.
The main strength of this paper is that it is probably the first of its type to gather scientific
professionals together in order to figure out their specific requirements for a database
management system. The insights gathered are very interesting and shed light on how different
scientific research workloads are from more conventional business-oriented transaction or
analytical processing. By taking this initial step, the hope is that SciDB or other follow-on
systems will be developed to better serve the scientific community. Overall, the paper is very
comprehensive, and since it is well written, it is straightforward to comprehend.
The biggest weakness, in my opinion, is that the scientific community is so fragmented in terms
of requirements and preferred programming languages, to name a few. As previously
mentioned, science as a whole is a “zero billion dollar” industry, and further splitting that up
decreases the business incentive to develop systems specific to them. Also, SciDB has to make a
lot of compromises in order to make it flexible enough to satisfy enough users, as was shown in
how it was designed to handle multiple types of programming languages. While that is the
practical reality, it is questionable whether the overall product can perform as well as initially
envisioned, given that it is being pulled in so many directions at once.
Review 5
This paper is a requirements paper for Science Databases as well as their proposed Science
Database architecture, called SciDB. The author starts with a history in which Michael
Stonebraker and his colleagues at a conference defended current DBMS systems, which were
under attack by advocates of scientific databases. The author also draws attention to the fact that
the space of scientific databases does not have much market potential, so it does not get the
support it needs from the corporate community.
The author then gets into the data model of a science database. This section introduces the
possible models: an array data model, a table data model, and a mesh data model. These data
models each serve a different subset of the scientific community, but the array data model
serves the largest portion of the community and is easier to develop than a mesh model. The
paper describes how arrays can be defined once and then instantiated multiple times, as with
tables in SQL. The paper also describes Postgres-style UDFs that allow you to enhance your
array by adding pseudo-coordinates to it. The author mentions that SciDB will also come with a
few shape functions implemented that allow the user to digitize shapes like circles using arrays.
Next the paper looks into structural operators which create arrays based on the structure of the
inputs. The first one we look into is Subsample, which takes an array and a predicate over the
array and returns the subset of the elements that satisfy the predicate. The next example we look
at is Reshape, which converts the input array to a new array with a different shape; it can modify
the dimensions of the array, but not the number of cells. Next the paper looks into
content-dependent operators, whose result depends on the data stored in the input array. The
author then proposes that SciDB will be user-extendible like Postgres.
requirement is that a science database should not overwrite old data but should instead keep a
lineage of the data, so scientists can see what the value was at different points in time. The paper
also suggests that the DBMS should be open source so that the community can work together to
manage and support the codebase. They also require named versions so that scientists can access
data from a large set at exactly the time or range that is useful to them. We also want
provenance so we can recover data based on the lineage of steps used to get the data to its state.
They also require that the DBMS can be used in non-scientific applications and that a
benchmark is developed, which they say is in progress for SciDB.
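The Subsample and Reshape operators described in this review can be sketched minimally in Python; these are toy stand-ins over a flat list, not SciDB's actual operators.

```python
# Toy versions of two operators from the review, over a flat Python list
# standing in for a 1-D array. Illustrative only, not SciDB's API.

def subsample(cells, predicate):
    """Return the subset of cells that satisfy the predicate."""
    return [c for c in cells if predicate(c)]

def reshape(cells, rows, cols):
    """Rearrange cells into a rows x cols array; the cell count may not change."""
    if rows * cols != len(cells):
        raise ValueError("reshape cannot change the number of cells")
    return [cells[r * cols:(r + 1) * cols] for r in range(rows)]

data = [3, 1, 4, 1, 5, 9]
print(subsample(data, lambda c: c > 2))  # [3, 4, 5, 9]
print(reshape(data, 2, 3))               # [[3, 1, 4], [1, 5, 9]]
```

The guard in reshape captures the review's point: the shape and dimensions may change, but the number of cells cannot.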
I thought that in the earlier stages, the paper told an interesting story to lead the reader to the
main content of the paper and provide context. However, I felt that when the technical
information was presented, it was a bit hard to distinguish the history from the author's
contribution, at least in the earlier sections of the Requirements part of the paper. This was
understandable since the paper was more high level as a requirements paper.
Review 6
This paper presents several requirements from scientific researchers and database users in the
scientific area, and provides some design ideas for a science database. Overall, the requirements
of scientific users are quite different from those of other people (everyday database users), and
those requirements are mixed, which means a single typical design of SciDB will not satisfy
everyone. Therefore, the design of SciDB needs to consider all the requirements and needs.
In this paper, more than ten design ideas are listed. For the data model, scientific users want
different types of data storage and a significant ability to manipulate arrays, so SciDB designs
an array-of-cells mechanism and provides user-defined functions so that users can design their
own data types. SciDB also provides multiple data/array manipulations, and its user-defined
function mechanism makes SciDB extendable. For language bindings, different users have
different views: some prefer C++ while others prefer Python, so SciDB has a parse-tree
representation for commands, and then there will be multiple language bindings.
Scientific users do not want to overwrite data, for fear of losing history. To deal with this,
SciDB adds another dimension (history) to record the data at previous times. Like MySQL and
Postgres, SciDB is open source for the community. Also, some users have very large datasets
such as LSST, and their datasets change over time, so SciDB not only deals with this via a grid,
but also supports dynamic partitioning (changing over time). With regard to storage within a
node, there are some optimizations designed specifically for that.
Another complaint from scientific users is that loading data feels slow; SciDB can handle
"in situ" data. For the cooking process, there are typically two methods: you can process the
data before it enters the database, or you can first record the raw data in the database and then
process it. The second method is better because the "cooking" process loses information, and
you want to keep the most accurate information about your data. At the same time, another
problem arises: different people, at different times, may use different algorithms on the same
chunk of data, so SciDB adopts named versions to handle different usages of the same data.
Next, if something goes wrong, users want to know what procedure led to the wrong answer
and which data were influenced. Furthermore, SciDB can be used not only in scientific areas,
but also in non-scientific areas, such as at eBay.
The contribution of this paper comes from the group of people who designed and developed
SciDB, keeping in mind that this is a non-commercial project; what they did will benefit all
scientific users and others, which should move the science field forward faster.
The advantages of SciDB are: (1) it resolves the concerns and requirements of scientists; (2) it is
non-profit and open source, which benefits many people; and (3) the design is excellent, which
makes the computing faster.
One drawback may be that, as an introductory paper, it lacks detail on some implementations,
such as the algorithm for dynamic partitioning and how "in situ" data is handled in detail.
Review 7
This paper is written to specify a common set of requirements for a new science database system
and briefly sketch the design of SciDB.
The requirements come from particle physics, remote sensing, astronomy, and oceanography.
While most scientific users can use relational tables, tables are not a natural data model that
closely matches their data. The Sequoia 2000 project realized in the mid 1990s that its users
wanted an array data model, and that simulating arrays on top of tables was difficult and resulted
in poor performance. Thus, some users will be happy with neither a table nor an array data
model. This project is exploring an array data model, primarily because it makes a considerable
subset of the community happy and is easier to build than a mesh model. It will support a multi-
dimensional, nested array model with array cells containing records.
Then the paper introduces some of the operators used with the data model. There are mainly
two kinds: structural operators and content-dependent operators. The structural operator
category creates new arrays based purely on the structure of the inputs, while content-dependent
operators are those whose result depends on the data stored in the input array.
Then the paper points out several essential requirements for SciDB. It should be extendible; for
example, users can add their own data types to SciDB. Secondly, SciDB should not be bound to
a specific language, which can be avoided entirely with a language-embedding approach. What's
more, it should be open source and grid oriented. To ease the burden of loading large-scale data
into a database, SciDB defines its own data format and writes adapters for commonly used
external number formats. As long as there is an adapter corresponding to the user data, you can
use SciDB directly without loading the data. SciDB loads raw data using custom functions
(UDFs) and data manipulation processes; the user makes specific changes to a portion of the
array while leaving the rest unchanged. Data in the scientific field is generally inexact, and
SciDB supports data together with its error bounds.
The main contribution of this paper is that it introduced SciDB and described the user
requirements for it. One weak point is that the paper did not introduce SciDB in a systematic
way and did not say anything about the performance of SciDB.
Review 8
This paper describes the requirements for a database designed for users in the scientific
community. There was a discussion about this at XLDB-1 in October 2007, and Stonebraker and
DeWitt agreed to build it if a common set of requirements could be defined - these are those
requirements. Before this database was proposed, most scientific use cases had built one-off
databases for individual products.
There are many requirements that differ from traditional databases, but due to the length of this
review I will focus on a select few. One that is quite interesting is the data model. The agreed
upon data model is arrays - these look somewhat like a numpy array, which can have multiple
dimensions. Additionally, UDFs can be defined on arrays. Because of the data model and other
requirements of the project, the traditional SQL language is not desired. Therefore, new
operations are described, which include two different subsets: operations that manipulate the
structure of the data and operations that manipulate the data itself. This was an interesting
separation that I had not thought of before in a database context. Some possible operations are
reorganizing the dimensions of the data or performing a join. Additional important requirements
include but are not limited to flexible language bindings, open source, and running on a shared-
nothing cloud.
I think that the authors did a fairly good job laying out all of the requirements. Additionally, I
noted that they mention that scientific DBMSs are a “‘zero billion dollar’ industry.” However,
they draw parallels with use cases at eBay near the end of the paper, for complex business
analytics at web-scale. Therefore, in my mind, since the requirements also cover the needs of
use cases at a place like eBay, this certainly *isn't* a zero billion dollar problem. I'm curious
about what this space looks like in 2018, with more NoSQL options and specific time series
databases.
I felt as though there were some areas that were too closely explored in a paper for
requirements, and some that should have been included when their absence was noted. On page
2, there is a lot of description of syntax, which feels unnecessary for a paper on high-level
requirements. Meanwhile, on page 6, issues like which compression schemes to use are pushed
to a later point. In my opinion, this decision about detail should have been flipped. I also noticed
that the data model chosen basically ignores the desires of biology, genomics, and chemistry
(who want graphs and sequences as their data model), which to me seem to represent many
scientific disciplines. Although they say that this is to satisfy a large subset of the use cases, I
wonder how these disciplines' use cases will be addressed in the future, because it seems
unrealistic to call this a database for all of science without them.
Review 9
The contribution of this paper is a set of requirements that are crucial when designing a new
science database system, SciDB for short. Users doing scientific computing, who typically have
extreme database requirements, complained about the inadequacy of current commercial DBMS
offerings. Building custom software for each new project clearly will not work in the future,
since the software stack will become too big and complex. To meet these users' needs, we need
to know their requirements for the new database system.
The first and most important requirement is a more flexible data model. Traditional DBMSs
only provide tables with a primary key, which are merely one-dimensional arrays, while biology
and genomics users want graphs and sequences. To better fit scientific needs, the authors
propose an array data model, in which an array can have any number of (named) dimensions.
Each combination of dimension values defines a cell, and a cell can store one or more scalar
values or arrays. Enhanced arrays which allow basic arrays to be scaled, translated, have
irregular boundaries and non-integer dimensions can be created by user-defined functions.
Updates on the values of an array should not overwrite previous versions, so a history dimension
must be added to every updatable array. Users should also be allowed to create named versions
of their data as a user may want to do only a few modifications to a dataset and get a new
version.
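The no-overwrite requirement described here can be illustrated with a short Python sketch, in which each updatable cell carries a history of (time, value) pairs; the class and method names are assumptions for illustration, not SciDB's implementation.

```python
# Sketch of the "no overwrite" idea: an update appends a (time, value)
# pair to a cell's history dimension instead of replacing the old value.
# Illustrative only; not SciDB's actual implementation.

class HistoryCell:
    def __init__(self):
        self._versions = []  # list of (time, value), oldest first

    def update(self, time, value):
        self._versions.append((time, value))

    def value_at(self, time):
        """Return the value visible at a given time (latest update <= time)."""
        current = None
        for t, v in self._versions:
            if t <= time:
                current = v
        return current

cell = HistoryCell()
cell.update(1, 10.5)
cell.update(5, 99.9)     # a correction; the old value is retained
print(cell.value_at(3))  # 10.5
print(cell.value_at(7))  # 99.9
```

Because old values are retained, a query "as of" any past time still sees the data as it was then, which is the property scientists want for provenance.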
Operators that can be applied on arrays can be divided into two categories: structural operators
which based purely on the structure of the inputs and content-dependent operators whose result
depends on the data that is stored in the input array. One important property of array operations
is that they must be user-extendable, as users typically want to perform sophisticated
computations and analytics.
Other requirements include open source, multiple language bindings, a simple model of
uncertainty, a history that remembers how an array is derived, etc.
This paper does a good job of presenting these requirements for SciDB. However, I think it
would be better to also provide background on existing solutions to this problem and show how
those methods failed to meet the customers' needs.
Review 10
In the paper "Requirements for Science Data Bases and SciDB", Michael Stonebraker and Co.
identify a common set of requirements between disciplines in the field of "Big Science" to
develop a new science-oriented database system. Recently (in 2007), there were many users
with extreme database requirements that complained about the inadequacy of current
commercial DBMS. These current solutions did not support the workloads and grand scheme of
"Big Science". Thus, Stonebraker and his companions decided to tackle this "zero-billion dollar"
industry since getting the attention of large commercial vendors didn't seem likely. Since the
field of academia is responsible for the "progress" of humans, designing a system that can assist
them is not only interesting, but an important problem.
Much like other Stonebraker papers, this one had some drawbacks. One criticism I have is the
approach Stonebraker takes when organizing the structure of his paper. He tries very hard to
generalize a problem that, in reality, is usually handled on a case-by-case basis. I believe that
most of his views are valid, but some people may value one attribute over another; a given
attribute may not be the most compelling factor. Another drawback is the lack of an actual
configuration/development of a system with these requirements. I would have liked to see the
performance benefits that this type of system would have in comparison to mainstream database
vendors. This would have strengthened the earlier claims made by users as well as give rise to
support and contribution for scientific databases.
Review 11
This paper details the requirements and creation of SciDB, a DBMS built specifically for
scientific usage. Standard DBMSs using SQL don’t work well for many scientific applications,
but they still need a standardized system to work with.
Different scientists need different data models, but an array data model worked well for most of
them. In this data model, the equivalent of a table is an array of arbitrary dimensions, each of
arbitrary size. An index into all of the dimensions of an array yields an array cell, which can be
an object with arbitrary attributes. As such, a 1-dimensional array is equivalent to an ordinary
table, where the array index is the primary key, and the cell attributes are the other columns.
Because of the many different uses for arrays, SciDB supports user-defined functions, which can
also change how users index into an array. A user can always access an array with the standard
array indexes, but they can use UDFs to define pseudo coordinates, which can be of any type,
not just integers. Then, the user can perform standard array operations such as aggregation,
sampling, and filtering.
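The pseudo-coordinate idea can be sketched in a few lines of Python; the data and the mapping function here are hypothetical examples, not SciDB's actual API.

```python
# Sketch of pseudo-coordinates: a user-defined mapping lets callers address
# cells by non-integer coordinates, while the array itself stays integer
# indexed. All names here are illustrative assumptions.

def make_pseudo_index(mapping):
    """Build a lookup from a pseudo-coordinate (any hashable type)
    to an integer array index."""
    return lambda coord: mapping[coord]

# A 1-D array of sea-surface temperatures, indexed 0..2.
temps = [14.2, 15.1, 13.8]

# A UDF-style mapping from station name (a string, not an integer) to index.
by_station = make_pseudo_index({"buoy-A": 0, "buoy-B": 1, "buoy-C": 2})

print(temps[by_station("buoy-B")])  # 15.1
```

The underlying array never changes; the UDF only translates the user's preferred coordinate system into standard integer indexes.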
One of the benefits of SciDB is that it can run in multiple programming languages. The various
scientists don’t all use the same language, so SciDB produces a common parse tree
representation that any language can use. Another benefit is that users can recreate the set of
steps that built any data item, so they can debug data items that were created incorrectly. As
well, users can store uncertain data with error bars. All of these allow SciDB to be widely used.
One of the largest downsides of the paper is the lack of experimental results. SciDB is a system
to be used in practice by many people, and it should justify itself with its performance. Without
this, it’s much harder to verify how effective SciDB really is.
Review 12
The paper presents a common set of requirements for a new science database system, SciDB.
The requirements grew out of a meeting at Asilomar in March 2008 between a collection of
science users and a collection of DBMS researchers, followed by a more detailed design
exercise over the summer. Additional use cases were solicited, and parallel fund raising was
carried out. These requirements come from particle physics, biology, remote sensing,
astronomy, oceanography, and eBay. The requirements involve data models, SciDB operators,
extendibility, language bindings, open source, grid orientation, storage within a node, "in situ"
data, integration of the cooking process, named versions, provenance, uncertainty, non-science
usage, and a science benchmark.
Review 13
This paper presents requirements for science data base and SciDB. These requirements are
assembled from a collection of scientific data base users from astronomy, particle physics,
fusion, remote sensing, oceanography, and biology. The science community realizes that the
software stack is getting too big, too hard to build and too hard to maintain. Hence, the
community seems willing to get behind a single project in the DBMS area.
The paper first presents the data model in scientific databases. The requirements here are arrays
can have any number of dimensions, which may be named; an array can be defined and multiple
instances can be created; an array can be created by specifying high water marks in each
dimension; unbounded arrays can grow without restriction in those dimensions; and enhanced
arrays should allow basic arrays to be scaled. SciDB should also support user-defined functions,
coded in C++. A UDF can be defined by specifying the function name, its input and output, and
the code to execute the function. Additionally, a shape function must be able to return low-water
and high-water marks when one dimension is left unspecified. There should also be an exists
function to find out whether or not a given cell is present in an array.
Following the data model, the paper presents requirements for operations. The first operator
category creates new arrays based purely on the structure of the inputs. These operators are
data-agnostic and do not have to read the data values to produce results. Therefore, there is
opportunity to optimize these operators. Example operators are subsample, reshape, and
Structured-Join. Structured-Join restricts its join predicate to be over dimension values only. The
second type of operator category is content-dependent operators. This type of operator's result
depends on the data that is stored in the input array. Example operators are filter, aggregate,
sum, and content-based join. Content-based join restricts its join predicate to be over data values
only. The fundamental array operations in SciDB should be user-extendable.
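The contrast between the two join flavors can be sketched in Python over 1-D arrays represented as dicts from index to value; the functions and representation are illustrative assumptions, not SciDB's implementation.

```python
# Toy contrast between the two join flavors: Structured-Join pairs cells by
# dimension (index) values, content-based join pairs cells by data values.
# Illustrative only; not SciDB's actual operators.

def structured_join(a, b):
    """Join on dimension values only: pair cells that share an index."""
    return {i: (a[i], b[i]) for i in a if i in b}

def content_join(a, b):
    """Join on data values only: pair the indexes of cells whose stored
    values are equal."""
    return [(i, j) for i, av in a.items()
                   for j, bv in b.items() if av == bv]

a = {0: "x", 1: "y", 2: "z"}
b = {1: "z", 3: "y"}
print(structured_join(a, b))  # {1: ('y', 'z')}
print(content_join(a, b))     # [(1, 3), (2, 1)]
```

The structured variant never inspects the stored values, which is what makes it data-agnostic and easier to optimize.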
Most scientists are adamant about not discarding any data. Therefore, there should not be
overwrite, which may cause data loss. If a data item is shown to be wrong, they want to add the
replacement value and the time of the replacement, retaining the old value for provenance
purposes. Furthermore, DBMS should be open source to get traction in the science community.
For the storage aspect, the storage manager must decompose a partition into disk blocks. Most
data will come into SciDB through a streaming bulk loader. Additionally, a universal
requirement from scientists was repeatability of data derivation. Hence, they wish to be able to
recreate any array A, by remembering how it was derived. The basic requirements are: for a
given data element D, find the collection of processing steps that created it from input data; for a
given data element D, find all the downstream data elements whose value is impacted by the
value of D.
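These two provenance queries can be sketched over a hypothetical derivation log of (step, inputs, output) triples; the log format and function names are assumptions for illustration only.

```python
# Sketch of the two provenance queries in the requirements, over a
# derivation log recorded as (step, inputs, output) triples.
# The log format and names are illustrative assumptions.

log = [
    ("calibrate", ["raw1"], "cal1"),
    ("calibrate", ["raw2"], "cal2"),
    ("average",   ["cal1", "cal2"], "avg"),
]

def steps_that_created(element):
    """All processing steps on the path from input data to the element."""
    steps = []
    frontier = {element}
    for step, inputs, output in reversed(log):
        if output in frontier:
            steps.append(step)
            frontier.update(inputs)
    return list(reversed(steps))

def downstream_of(element):
    """All data elements whose value is impacted by the element."""
    impacted = {element}
    for step, inputs, output in log:
        if impacted & set(inputs):
            impacted.add(output)
    return impacted - {element}

print(steps_that_created("avg"))          # ['calibrate', 'calibrate', 'average']
print(sorted(downstream_of("raw1")))      # ['avg', 'cal1']
```

The first function answers "how was D derived?"; the second answers "what does a change to D invalidate?", matching the two basic requirements above.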
The advantage of this paper is that it covers many requirements of a DBMS for the science
community and explains why the requirements are needed. However, the paper as a whole is
very disorganized; each section should be named more clearly. Also, I feel that a great many
requirements are mentioned, and it would be better if they were summarized and categorized.
Review 14
“Requirements for Science Data Bases and SciDB” by Stonebraker et al. discusses database use
cases and requirements of scientists, a new database system SciDB to support these
requirements, and a history of the SciDB designers’ interactions with users to uncover these
requirements. The requirements include:
1) a data model and corresponding language that better fits scientific data use cases, namely an
array data model
2) structural operators (based on the structure of the input) and content-dependent operators
(based on the content of the input)
3) user-extendable operations on arrays
4) bindings for a variety of programming languages, since different scientists use different
languages
5) no overwrite of data
6) open source DBMS
7) support changing data partitionings over time
8) how data is stored within a node
9) operate on “in situ” data (don’t require a full load of data)
10) provide support for “cooking” or pre-processing of raw data within the DBMS
11) support named versions of data
12) provide provenance of how particular data was computed and what other computations
down the line it affects
13) support indication of uncertainty (i.e., normal distribution of data)
Review 15
This paper talks about the requirements for science databases and SciDB. The paper first notes
that the demand for science databases is large. With different groups not working together on
the topic, the software becomes too large and too hard to maintain, so there is demand for a
single project in the DBMS area for scientific usage.
The main contribution of the paper is its discussion of the many requirements for a science
database and SciDB. The first is the array data model, which is easier to build than a mesh
model; SciDB also needs to support user-defined functions. The second set of requirements
concerns operations. The paper proposes several operations such as subsample, reshape, and
structured-join as data-agnostic operators. Several content-dependent operators are also
introduced, such as filter, aggregate, and content-based join. The third and fourth requirements
are extendibility and language bindings; I think they are both requirements about user usage.
Besides these, the paper considers some special requirements, including no overwriting, open
source, storage within a node, "in situ" data, and provenance.
The strong part of the paper is that it considers the design requirements from a scientific
researcher's perspective; the data and workloads of researchers differ from those of traditional
DBMS users. The paper also points out non-science uses of the proposed requirements, which
means the designs from the science domain can be applied to other applications.
The drawback of the paper is that I saw no evidence that the opinions it reports represent a
considerable subset of researchers' opinions. The authors did not conduct a survey or present
data to support their claims, and I think assertions of this kind can only be backed by a well-
designed survey.
Review 16
In this paper, the authors, drawn from several different organizations, explore a DBMS designed
specifically for scientific research, called SciDB. Designing and implementing a DBMS for
scientific research is definitely a meaningful task because, before SciDB, no commercial DBMS
served this need; even for existing products like Sequoia and MonetDB, a common set of
requirements across several science disciplines had not been defined. Yet the demand for a
DBMS for scientific use is great: as the authors mention, it spans many fields such as astronomy,
particle physics, fusion, and remote sensing. The goal of this paper is to specify a common set of
requirements for a new science DBMS called SciDB; these requirements also fit very complex
business analytics. Next, I will summarize the crux of this paper with my understanding.
In this paper, they discuss several requirements for building SciDB. First of all, an appropriate
data model is needed. The relational model of tables is not suitable for scientists because of its
rigidity: tables do not feel natural for scientific needs, and SQL is not a good way for scientists
to retrieve data. A better fit is an array-like data model, which may have multiple dimensions,
where each combination of dimension indices identifies a single array cell that holds the data.
Secondly, a set of well-defined operators is required. They define structural operators, which are
data-agnostic and perform only structural manipulations on arrays, possibly with various
dimensions. The other type is the content-dependent operators, which take a logical predicate on
data values to decide what to do. Third, for analytical and computational needs, SciDB should
be extensible with user-defined operations and data types, in a POSTGRES-style format. Fourth,
SciDB should support multiple language bindings, since there is no single language that
everyone uses behind the scenes. Fifth, to guarantee that data are trackable and never discarded,
SciDB should prevent users from overwriting old data; this is done by keeping history, which
tracks changes while preserving the original values. Next, to ensure scalability and keep a long-
lived scientific project maintainable and rebuildable at will, SciDB should be open source, and
to give it strength comparable to commercial DBMSs, a non-profit foundation is to be
established to support the user community.
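The no-overwrite requirement can be sketched in a few lines: an update appends a new versioned value rather than replacing the old one, so any prior state can still be read. This is a hypothetical illustration of the idea, not SciDB's actual storage design; all names are made up.

```python
# Sketch of the no-overwrite rule: updates append, never replace,
# so every historical value of a cell remains readable.
class AppendOnlyCell:
    def __init__(self, value):
        self.history = [(0, value)]  # list of (version, value) pairs

    def update(self, value):
        version = self.history[-1][0] + 1
        self.history.append((version, value))  # old value is retained

    def value_at(self, version):
        # Return the latest value whose version <= the requested version.
        for v, val in reversed(self.history):
            if v <= version:
                return val
        raise KeyError(version)

cell = AppendOnlyCell(1.5)
cell.update(2.5)  # becomes version 1; version 0 still readable
cell.update(3.5)  # becomes version 2
print(cell.value_at(0), cell.value_at(2))
```

The same append-plus-version scheme is what makes the later requirements (named versions, provenance) feasible, since nothing is ever destroyed.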
Partitioning of data storage should fit users' demand for large datasets. To do this efficiently,
SciDB has a default fixed partitioning mechanism, while a user-defined dynamic partitioning
scheme is also available. SciDB also needs to break data into disk blocks; an R-tree is used for
tracking them, background threads handle optimization, and further optimization is expected to
take place as well. SciDB is supposed to support ad-hoc analysis without much time spent
loading huge amounts of data. Next, it should offer powerful internal data manipulation: SciDB
should integrate the cooking process within the DBMS, without external processing, when users
need it. SciDB should also support different scientific manipulations of the same set of data; to
achieve this the authors propose a tree-like structure using named versions. Data derivation
should be clear from the outset, and SciDB should support backward tracking of data changes
when necessary. Besides, uncertainty is common in scientific fields, and SciDB should capture
it: it will no longer store just single data values, but also normal distributions of them.
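The uncertainty requirement amounts to storing a distribution per cell instead of a point value. A minimal sketch, assuming cells hold independent normal variables (the class and its error-propagation rule are illustrative, not SciDB's design):

```python
import math

# Each "cell" stores a normal distribution (mean, standard deviation)
# rather than a single value; arithmetic propagates the uncertainty.
class Uncertain:
    def __init__(self, mean, std):
        self.mean, self.std = mean, std

    def __add__(self, other):
        # For independent normal variables, variances add.
        return Uncertain(self.mean + other.mean,
                         math.sqrt(self.std ** 2 + other.std ** 2))

a = Uncertain(10.0, 3.0)   # e.g., a measurement with instrument error 3.0
b = Uncertain(20.0, 4.0)
c = a + b
print(c.mean, c.std)  # 30.0 5.0
```

A DBMS that understands this representation can carry the error bars through every query, which is what the paper asks for.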
I think this is an interesting paper, quite different from the previous papers we read; it looks like
a blueprint for designing a scientific DBMS and provides several useful pieces of advice.
Although it is a survey-like paper, I think it still makes a good technical contribution: SciDB is
the first project that considers all these factors when designing a DBMS specifically for science,
the authors put good effort into the project, and the result is promising. There are several
advantages to this paper. First of all, they propose a well-defined database management system
with an array data model and accompanying operations and features that capture the
requirements of the scientist communities. I think they contribute by providing good guidance
that can help developers avoid potential pitfalls when building such a system. In fact, I think
NumPy later did a good job of providing an array representation that is widely used for
scientific computation.
There are some drawbacks to this paper. First of all, there is no existing implementation of
SciDB; the authors only describe their plan for building such a DBMS. For the same reason, no
experiments were done with SciDB, so whether its design decisions are correct is unknown.
There are no solid results proving that SciDB is efficient, robust, and flexible for scientific
workloads.
Review 17
This paper is a collection of “requirements” for a good scientific DBMS, proposed by Michael
Stonebraker & others after collecting information from a group of users. These requirements
were gathered for the purpose of developing a new scientific database system called SciDB.
Here is a summary of the proposed rules/requirements:
1. Data models are specific to the use case; for example, biology users prefer graphs &
sequences, but other users like array models better. SciDB decided to pursue a
multidimensional-array-based model, though it is impossible to make everyone happy with this
choice. Several operators are defined for this model, including subsample, reshape, dimension
adding/removing, concatenate, cross product, filter, aggregate, join, and project.
2. Language binding is desired by users and persistence (such as that offered in the object-
oriented era models) is a good thing for scientific DB users. Because of this, SciDB uses a
parse-tree that lets it use language embedding for this purpose.
3. Don’t allow overwriting of data — instead just append. This makes sense as analysis over old
data is important in scientific settings.
4. Make it open source — this is fairly intuitive and makes sense in the scientific community.
5. The “cooking” process is important to build around; this is when input (raw) data is
converted into a better standard (calibration, correction, etc.). SciDB makes the choice to load
raw data into the DBMS and then cook it INSIDE the database.
6. Determinism when deriving data is important, as it allows scientists the ability to re-create
any situation (array in SciDB). This is good for fixing errors.
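The determinism point in item 6 is essentially provenance: if each derived array records the operation and inputs that produced it, any result can be re-derived after an upstream fix. A hypothetical sketch (the `TrackedArray` class and its methods are invented for illustration):

```python
# Provenance sketch: every derived array remembers the function and
# inputs that produced it, so results can be recomputed deterministically
# after an error in the raw data is corrected.
class TrackedArray:
    def __init__(self, data, op=None, inputs=()):
        self.data = list(data)
        self.op = op          # function that produced this array, if any
        self.inputs = inputs  # source TrackedArrays it was derived from

    def derive(self, op):
        return TrackedArray(op(self.data), op=op, inputs=(self,))

    def recompute(self):
        # Re-run the recorded lineage from the (possibly corrected) inputs.
        if self.op is None:
            return self.data
        return self.op(self.inputs[0].recompute())

raw = TrackedArray([1, 2, 3])
cooked = raw.derive(lambda xs: [x * 10 for x in xs])
raw.data[0] = 5            # fix an error discovered in the raw data
print(cooked.recompute())  # [50, 20, 30]
```

Walking the lineage in the other direction would answer the forward question from requirement 12 of the earlier list: which downstream computations a given datum affects.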
7. Support for fuzzy/uncertain data. This isn't as important for business databases, but SciDB
(or any good scientific database) should support uncertainty factors attached to data, reflecting,
for example, how the data was measured.
The contributions of the paper are mainly these observations, which were taken from real
industry users of scientific databases and people who want better scientific databases. The paper
also presents a high-level picture of what SciDB would look like. The main weakness is that no
results are given because SciDB was still in the idea phase at the time of this paper, which
seems strange to me: why not implement the project and then write the paper afterwards, rather
than write it before any actual development has taken place?