Science DB
This paper reports scientists' requirements for databases and sketches some of the SciDB design.
Despite the aspects of the paper I dislike, it offers some insight into how the SciDB design was
derived from actual requirements, and it is interesting to see how the group designed data
structures and operations from scratch based on user requirements.
Review 2
Current commercial DBMSs cannot fulfill the demands of some scientific database users.
Scientific database users in fields including astronomy, particle physics, fusion, remote sensing,
oceanography, and biology have proposed several requirements. This paper aims to report the
requirements that were collected and identified, and to present the design of SciDB, a database
designed to meet a common set of requirements from scientific users.
Review 3
A collection of scientific database users have long complained about the inadequacy of current
commercial DBMS offerings, and though many researchers have worked on science databases
for years, there was still no common set of requirements across science disciplines. This paper
therefore defines those requirements, presents a detailed design exercise, and sketches some of
the SciDB design.
First, the data model in most scientific scenarios is an array data model, chosen primarily
because it makes a considerable subset of the community happy and is easier to build than a
mesh model. Specifically, this is a multi-dimensional, nested array model with array cells
containing records, which in turn can contain components that are multi-dimensional arrays. The
operators that accompany the proposed data model fall into Structural Operators and Content-
Dependent Operators. Structural operators create new arrays based purely on the structure of the
inputs; in other words, these operators are data-agnostic. Content-dependent operators are those
whose result depends on the data stored in the input array. To support the disparate languages
that scientists use, SciDB will have a parse-tree representation for commands, with multiple
language bindings that map from each language-specific representation to this parse-tree
format.
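As a rough illustration of this parse-tree idea, here is a minimal Python sketch; all class and function names here are my own invention, not SciDB's actual API. The point is that every language binding builds the same language-neutral tree of operator nodes, which the engine then executes.

```python
# Minimal sketch of a language-neutral parse tree for commands.
# All names are illustrative assumptions, not SciDB's real API.

class Node:
    def __init__(self, op, children=(), args=None):
        self.op = op                    # operator name, e.g. "filter", "scan"
        self.children = list(children)  # input sub-expressions
        self.args = args or {}          # operator parameters

# One hypothetical binding builds the tree with plain function calls...
def scan(array_name):
    return Node("scan", args={"array": array_name})

def filter_(node, predicate):
    return Node("filter", [node], {"predicate": predicate})

# ...while another binding might parse a MATLAB- or SQL-like string into
# the very same tree. The engine only ever sees Node objects.
tree = filter_(scan("temperatures"), "value > 30")
print(tree.op)              # filter
print(tree.children[0].op)  # scan
```

A design like this keeps the engine independent of any one host language, which is exactly why the reviewers highlight it.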
The main contribution of this paper is that it is the first to define the basic requirements for
scientific databases across several science disciplines, and it can thus serve as guidance for
researchers. In addition, the authors decided to build SciDB, recruited an initial programming
team, and started a non-profit foundation to manage the project, which should genuinely help
scientific data users, since it is not feasible for commercial vendors to pay attention to this
non-profit corner of the database market.
One thing to note is that this paper targets scientific databases specifically, so we cannot judge it
from the perspective of the most common databases; in particular, these requirements do not
address OLTP demands. In other words, the proposed database does not support transactions; it
supports only analysis, resembling OLAP workloads, which can be seen as a "drawback" of the
model.
Review 4
Scientific applications often have to use databases and other data management technologies to
handle the immense amounts of data that they produce in their research. Often, however, they
are forced to make do with systems that are developed with industry concerns in mind, rather
than the requirements specific to scientists, due to various historical reasons. For example,
scientific data management is a “zero billion dollar” industry, which makes it nearly impossible
for large commercial vendors to justify the time and expense to make a scientific research-
focused system. This paper attempts to bridge the gap by specifying the various requirements for
scientific databases, to make it easier for vendors and other organizations to develop products
specifically targeting the scientific community. In particular, the product that the authors hope to
develop out of this is termed SciDB. Potential users of this SciDB system span various scientific
disciplines from particle physics to biology and remote sensing to astronomy to oceanography,
and are mostly represented by universities and large research institutions.
The first major difference between scientific users and industry is that most of their data does
not fit particularly well into the table format that is the cornerstone of relational DBMSs. In fact,
arrays tend to be the natural data model for a lot of these disciplines, rather than tables. Other
domains like genomics would be happy with neither, and instead prefer graphs and sequence
representations instead. This means that a “one size fits all” approach does not work, and
instead, DBMSs will have to be specialized to individual needs. This paper focuses on the array
data model, due to practical considerations, where each array can have an arbitrary number of
dimensions and each combination of dimension values defines a cell. Data values are in the
form of scalar values, and each cell has the same data type across all of its associated values.
Arrays can be defined, and then multiple instances can be created. SciDB will also have user
defined functions (in C++). One concept that SciDB introduces is the idea of enhancing arrays,
i.e. performing various transformations such as transposition, scaling, translation, etc, which are
done by user defined functions. SciDB includes a wide range of operators, which fall under two
general categories: structural and content-dependent operators. Structural operators create new
arrays purely based on the structure of the input, i.e. they are data-agnostic. This allows for a
greater degree of optimization since the values of the data do not have to be taken into account.
Examples of structural operators include subsample and reshape. On the other hand, content-
dependent operators depend on the data stored in the array. Examples of this class of operators
include filter and aggregate. As before, the set of operators and data types is extendable, and
users can add in ones more suited to their specific disciplines. As for language, SciDB has a
parse-tree representation for commands, followed by multiple language bindings. The reason for
this is to support the disparate languages, from C++ to Python to MATLAB that different
scientists prefer. Other important features of SciDB include a no-overwrite design, where old
data is usually not deleted (for lineage purposes), the open source nature of the project, grid
orientation, storage within a node, and “in situ” data where the overhead of loading data is
minimized.
The main strength of this paper is that it is probably the first of its type to gather scientific
professionals together in order to figure out their specific requirements for a database
management system. The insights gathered are very interesting and shed light on how different
scientific research workloads are from more conventional business-oriented transaction or
analytical processing. By taking this initial step, the hope is that SciDB or other follow-on
systems will be developed to better serve the scientific community. Overall, the paper is very
comprehensive, and since it is well written, it is straightforward to comprehend.
The biggest weakness, in my opinion, is that the scientific community is so fragmented in terms
of requirements and preferred programming languages, to name a few. As previously
mentioned, science as a whole is a “zero billion dollar” industry, and further splitting that up
decreases the business incentive to develop systems specific to them. Also, SciDB has to make a
lot of compromises in order to make it flexible enough to satisfy enough users, as was shown in
how it was designed to handle multiple types of programming languages. While that is the
practical reality, it is questionable whether the overall product can perform as well as initially
envisioned, given that it is being pulled in so many directions at once.
Review 5
This paper is a requirements paper for Science Databases as well as their proposed Science
Database architecture, called SciDB. The author starts with a history in which Michael
Stonebraker and his colleagues at a conference defended current DBMS systems, which were
under attack by advocates of scientific databases. The author also draws attention to the fact that
the space of scientific databases does not have much market potential, so it does not get the
support it needs from the corporate community.
The author then gets into the data model of a science database. This section introduces the
possible models: an array data model, a table data model, and a mesh data model. These data
models each serve a different subset of the scientific community, but the array data model
serves the largest portion of the community and is easier to develop than a mesh model. The
paper describes how arrays can be defined once and then instantiated multiple times, as with
tables in SQL. The paper also describes Postgres-style UDFs that allow you to enhance your
array by adding pseudo-coordinates to it. The author mentions that SciDB will also come with a
few shape functions implemented that allow the user to digitize shapes like circles using arrays.
Next the paper looks into structural operators which create arrays based on the structure of the
inputs. The first one we look into is Subsample, which takes an array and a predicate over the
array and returns the subset of the elements that satisfy the predicate. The next example we look
at is Reshape, which converts the input array to a new array with a different shape; it can modify
the dimensions of the array, but not the number of cells. Next the paper looks into
content-dependent operators, whose result depends on the data stored in the input array. The
author then proposes that SciDB will be user-extendible like Postgres.
requirement is that a science database should not overwrite old data but should instead keep a
lineage of the data, so scientists can see what the value was at different points in time. The paper
also suggests that the DBMS should be open source so that the community can work together to
manage and support the codebase. They also require named versions so that scientists can access
data from a large set at exactly the time or range that is useful to them. We also want
provenance so we can recover data based on the lineage of steps used to get the data to its state.
They also require that the DBMS can be used in non-scientific applications and that a
benchmark is developed, which they say is in progress for SciDB.
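The Subsample and Reshape operators described in this review can be sketched minimally in Python; these are toy stand-ins over a flat list, not SciDB's actual operators.

```python
# Toy versions of two operators from the review, over a flat Python list
# standing in for a 1-D array. Illustrative only, not SciDB's API.

def subsample(cells, predicate):
    """Return the subset of cells that satisfy the predicate."""
    return [c for c in cells if predicate(c)]

def reshape(cells, rows, cols):
    """Rearrange cells into a rows x cols array; the cell count may not change."""
    if rows * cols != len(cells):
        raise ValueError("reshape cannot change the number of cells")
    return [cells[r * cols:(r + 1) * cols] for r in range(rows)]

data = [3, 1, 4, 1, 5, 9]
print(subsample(data, lambda c: c > 2))  # [3, 4, 5, 9]
print(reshape(data, 2, 3))               # [[3, 1, 4], [1, 5, 9]]
```

The guard in reshape captures the review's point: the shape and dimensions may change, but the number of cells cannot.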
I thought that in the earlier stages, the paper told an interesting story to lead the reader to the
main content of the paper and provide context. However, I felt that when the technical
information was presented, it was a bit hard to distinguish the history from the author's
contribution, at least in the earlier sections of the Requirements part of the paper. This was
understandable since the paper was more high level as a requirements paper.
Review 6
This paper presents several requirements from scientific researchers and database users in the
scientific area, and provides some design ideas for a science database. Overall, the requirements
of scientific users are quite different from those of other people (everyday database users), and
those requirements are mixed, which means a single typical design of SciDB will not satisfy
everyone. Therefore, the design of SciDB needs to consider all the requirements and needs.
In this paper, more than ten design ideas are listed. For the data model, scientific users want
different types of data storage and a significant ability to manipulate arrays, so SciDB designs
an array-of-cells mechanism and provides user-defined functions so that users can design their
own data types. SciDB also provides multiple data/array manipulations, and its user-defined
function mechanism makes SciDB extendable. For language bindings, different users have
different views: some prefer C++ while others prefer Python, so SciDB has a parse-tree
representation for commands, and then there will be multiple language bindings.
Scientific users do not want to overwrite data, for fear of losing history. To deal with this,
SciDB adds another dimension (history) to record the data at previous times. Like MySQL and
Postgres, SciDB is open source for the community. Also, some users have very large datasets
such as LSST, and their datasets change over time, so SciDB not only deals with this via a grid,
but also supports dynamic partitioning (changing over time). With regard to storage within a
node, there are some optimizations designed specifically for that.
Another complaint from scientific users is that loading data feels slow; SciDB can handle
"in situ" data. For the cooking process, there are typically two methods: you can process the
data before it enters the database, or you can first record the raw data in the database and then
process it. The second method is better because the "cooking" process loses information, and
you want to keep the most accurate information about your data. At the same time, another
problem arises: different people, at different times, may use different algorithms on the same
chunk of data, so SciDB adopts named versions to handle different usages of the same data.
Next, if something goes wrong, users want to know what procedure led to the wrong answer
and which data were influenced. Furthermore, SciDB can be used not only in scientific areas,
but also in non-scientific areas, such as at eBay.
The contribution of this paper comes from the group of people who designed and developed
SciDB, keeping in mind that this is a non-commercial project; what they did will benefit all
scientific users and others, which should move the science field forward faster.
The advantages of SciDB are: (1) it resolves the concerns and requirements of scientists; (2) it is
non-profit and open source, which benefits many people; and (3) the design is excellent, which
makes the computing faster.
One drawback may be that, as an introductory paper, it lacks detail on some implementations,
such as the algorithm for dynamic partitioning and how "in situ" data is handled in detail.
Review 7
This paper is written to specify a common set of requirements for a new science database system
and briefly sketch the design of SciDB.
The requirements come from particle physics, remote sensing, astronomy, and oceanography.
While most scientific users can use relational tables, tables are not a natural data model that
closely matches their data. The Sequoia 2000 project realized in the mid 1990s that its users
wanted an array data model, and that simulating arrays on top of tables was difficult and resulted
in poor performance. Thus, some users will be happy with neither a table nor an array data
model. This project is exploring an array data model, primarily because it makes a considerable
subset of the community happy and is easier to build than a mesh model. It will support a multi-
dimensional, nested array model with array cells containing records.
Then the paper introduces some of the operators used with the data model. There are mainly
two kinds: structural operators and content-dependent operators. The structural operator
category creates new arrays based purely on the structure of the inputs, while content-dependent
operators are those whose result depends on the data stored in the input array.
Then the paper points out several essential requirements for SciDB. It should be extendible; for
example, users can add their own data types to SciDB. Secondly, SciDB should not be bound to
a specific language, which can be avoided entirely with a language-embedding approach. What's
more, it should be open source and grid oriented. To ease the burden of loading large-scale data
into a database, SciDB defines its own data format and writes adapters for commonly used
external number formats. As long as there is an adapter corresponding to the user data, you can
use SciDB directly without loading the data. SciDB loads raw data using custom functions
(UDFs) and data manipulation processes; the user makes specific changes to a portion of the
array while leaving the rest unchanged. Data in the scientific field is generally inexact, and
SciDB supports data together with its error bounds.
The main contribution of this paper is that it introduced SciDB and described the user
requirements for it. One weak point is that the paper did not introduce SciDB in a systematic
way and did not say anything about the performance of SciDB.
Review 8
This paper describes the requirements for a database designed for users in the scientific
community. There was a discussion about this at XLDB-1 in October 2007, and Stonebraker and
DeWitt agreed to build it if a common set of requirements could be defined - these are those
requirements. Before this database was proposed, most scientific use cases had built one-off
databases for individual products.
There are many requirements that differ from traditional databases, but due to the length of this
review I will focus on a select few. One that is quite interesting is the data model. The agreed
upon data model is arrays - these look somewhat like a numpy array, which can have multiple
dimensions. Additionally, UDFs can be defined on arrays. Because of the data model and other
requirements of the project, the traditional SQL language is not desired. Therefore, new
operations are described, which include two different subsets: operations that manipulate the
structure of the data and operations that manipulate the data itself. This was an interesting
separation that I had not thought of before in a database context. Some possible operations are
reorganizing the dimensions of the data or performing a join. Additional important requirements
include but are not limited to flexible language bindings, open source, and running on a shared-
nothing cloud.
I think that the authors did a fairly good job laying out all of the requirements. Additionally, I
noted that they mention that scientific DBMSs are a “‘zero billion dollar’ industry.” However,
they draw parallels with use cases at eBay near the end of the paper, for complex business
analytics at web-scale. Therefore, in my mind, since the requirements also cover the needs of
use cases at a place like eBay, this certainly *isn't* a zero billion dollar problem. I'm curious
about what this space looks like in 2018, with more NoSQL options and specific time series
databases.
I felt as though there were some areas that were too closely explored in a paper for
requirements, and some that should have been included when their absence was noted. On page
2, there is a lot of description of syntax, which feels unnecessary for a paper on high-level
requirements. Meanwhile, on page 6, issues like which compression schemes to use are pushed
to a later point. In my opinion, this decision about detail should have been flipped. I also noticed
that the data model chosen basically ignores the desires of biology, genomics, and chemistry
(who want graphs and sequences as their data model), which to me seem to represent many
scientific disciplines. Although they say that this is to satisfy a large subset of the use cases, I
wonder how these disciplines' use cases will be addressed in the future, because it seems
unrealistic to call this a database for all of science without them.
Review 9
The contribution of this paper is a set of requirements that are crucial when designing a new
science database system, SciDB for short. Users doing scientific computing, who typically have
extreme database requirements, complained about the inadequacy of current commercial DBMS
offerings. Building custom software for each new project clearly will not work in the future,
since the software stack will become too big and complex. To meet these users' needs, we need
to know their requirements for the new database system.
The first and most important requirement is a more flexible data model. Traditional DBMSs
only provide tables with a primary key, which are merely one-dimensional arrays, while biology
and genomics users want graphs and sequences. To better fit scientific needs, the authors
propose an array data model, in which an array can have any number of (named) dimensions.
Each combination of dimension values defines a cell, and a cell can store one or more scalar
values or arrays. Enhanced arrays which allow basic arrays to be scaled, translated, have
irregular boundaries and non-integer dimensions can be created by user-defined functions.
Updates on the values of an array should not overwrite previous versions, so a history dimension
must be added to every updatable array. Users should also be allowed to create named versions
of their data as a user may want to do only a few modifications to a dataset and get a new
version.
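The no-overwrite requirement described here can be illustrated with a short Python sketch, in which each updatable cell carries a history of (time, value) pairs; the class and method names are assumptions for illustration, not SciDB's implementation.

```python
# Sketch of the "no overwrite" idea: an update appends a (time, value)
# pair to a cell's history dimension instead of replacing the old value.
# Illustrative only; not SciDB's actual implementation.

class HistoryCell:
    def __init__(self):
        self._versions = []  # list of (time, value), oldest first

    def update(self, time, value):
        self._versions.append((time, value))

    def value_at(self, time):
        """Return the value visible at a given time (latest update <= time)."""
        current = None
        for t, v in self._versions:
            if t <= time:
                current = v
        return current

cell = HistoryCell()
cell.update(1, 10.5)
cell.update(5, 99.9)     # a correction; the old value is retained
print(cell.value_at(3))  # 10.5
print(cell.value_at(7))  # 99.9
```

Because old values are retained, a query "as of" any past time still sees the data as it was then, which is the property scientists want for provenance.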
Operators that can be applied on arrays can be divided into two categories: structural operators
which based purely on the structure of the inputs and content-dependent operators whose result
depends on the data that is stored in the input array. One important property of array operations
is that they must be user-extendable, as users typically want to perform sophisticated
computations and analytics.
Other requirements include open source, multiple language bindings, a simple model of
uncertainty, a history that remembers how an array is derived, etc.
This paper does a good job of presenting these requirements for SciDB. However, I think it
would be better to also provide background on existing solutions to this problem and show how
those methods failed to meet the customers' needs.
Review 10
In the paper "Requirements for Science Data Bases and SciDB", Michael Stonebraker and Co.
identify a common set of requirements between disciplines in the field of "Big Science" to
develop a new science-oriented database system. Recently (in 2007), there were many users
with extreme database requirements that complained about the inadequacy of current
commercial DBMS. These current solutions did not support the workloads and grand scheme of
"Big Science". Thus, Stonebraker and his companions decided to tackle this "zero-billion dollar"
industry since getting the attention of large commercial vendors didn't seem likely. Since the
field of academia is responsible for the "progress" of humans, designing a system that can assist
them is not only interesting, but an important problem.
Much like other Stonebraker papers, this one had some drawbacks. One criticism I have is the
approach Stonebraker takes when organizing the structure of his paper. He tries very hard to
generalize a problem that, in reality, is usually handled on a case-by-case basis. I believe that
most of his views are valid, but some people may value one attribute over another; a given
attribute may not be the most compelling factor. Another drawback is the lack of an actual
configuration/development of a system with these requirements. I would have liked to see the
performance benefits that this type of system would have in comparison to mainstream database
vendors. This would have strengthened the earlier claims made by users as well as give rise to
support and contribution for scientific databases.
Review 11
This paper details the requirements and creation of SciDB, a DBMS built specifically for
scientific usage. Standard DBMSs using SQL don’t work well for many scientific applications,
but they still need a standardized system to work with.
Different scientists need different data models, but an array data model worked well for most of
them. In this data model, the equivalent of a table is an array of arbitrary dimensions, each of
arbitrary size. An index into all of the dimensions of an array yields an array cell, which can be
an object with arbitrary attributes. As such, a 1-dimensional array is equivalent to an ordinary
table, where the array index is the primary key, and the cell attributes are the other columns.
Because of the many different uses for arrays, SciDB supports user-defined functions, which can
also change how users index into an array. A user can always access an array with the standard
array indexes, but they can use UDFs to define pseudo coordinates, which can be of any type,
not just integers. Then, the user can perform standard array operations such as aggregation,
sampling, and filtering.
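The pseudo-coordinate idea can be sketched in a few lines of Python; the data and the mapping function here are hypothetical examples, not SciDB's actual API.

```python
# Sketch of pseudo-coordinates: a user-defined mapping lets callers address
# cells by non-integer coordinates, while the array itself stays integer
# indexed. All names here are illustrative assumptions.

def make_pseudo_index(mapping):
    """Build a lookup from a pseudo-coordinate (any hashable type)
    to an integer array index."""
    return lambda coord: mapping[coord]

# A 1-D array of sea-surface temperatures, indexed 0..2.
temps = [14.2, 15.1, 13.8]

# A UDF-style mapping from station name (a string, not an integer) to index.
by_station = make_pseudo_index({"buoy-A": 0, "buoy-B": 1, "buoy-C": 2})

print(temps[by_station("buoy-B")])  # 15.1
```

The underlying array never changes; the UDF only translates the user's preferred coordinate system into standard integer indexes.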
One of the benefits of SciDB is that it can run in multiple programming languages. The various
scientists don’t all use the same language, so SciDB produces a common parse tree
representation that any language can use. Another benefit is that users can recreate the set of
steps that built any data item, so they can debug data items that were created incorrectly. As
well, users can store uncertain data with error bars. All of these allow SciDB to be widely used.
One of the largest downsides of the paper is the lack of experimental results. SciDB is a system
to be used in practice by many people, and it should justify itself with its performance. Without
this, it’s much harder to verify how effective SciDB really is.
Review 12
The paper presents a common set of requirements for a new science database system, SciDB.
The requirements grew out of a meeting at Asilomar in March 2008 between a collection of
science users and a collection of DBMS researchers, followed by a more detailed design
exercise over the summer. Additional use cases were solicited, and parallel fund raising was
carried out. These requirements come from particle physics, biology, remote sensing,
astronomy, oceanography, and eBay. The requirements involve data models, SciDB operators,
extendibility, language bindings, open source, grid orientation, storage within a node, "in situ"
data, integration of the cooking process, named versions, provenance, uncertainty, non-science
usage, and a science benchmark.
Review 13
This paper presents requirements for science data base and SciDB. These requirements are
assembled from a collection of scientific data base users from astronomy, particle physics,
fusion, remote sensing, oceanography, and biology. The science community realizes that the
software stack is getting too big, too hard to build and too hard to maintain. Hence, the
community seems willing to get behind a single project in the DBMS area.
The paper first presents the data model in scientific databases. The requirements here are arrays
can have any number of dimensions, which may be named; an array can be defined and multiple
instances can be created; an array can be created by specifying high water marks in each
dimension; unbounded arrays can grow without restriction in those dimensions; and enhanced
arrays should allow basic arrays to be scaled. SciDB should also support user-defined functions,
coded in C++. A UDF can be defined by specifying the function name, its input and output, and
the code to execute the function. Additionally, a shape function must be able to return low-water
and high-water marks when one dimension is left unspecified. There should also be an exists
function to find out whether or not a given cell is present in an array.
Following the data model, the paper presents requirements for operations. The first operator
category creates new arrays based purely on the structure of the inputs. These operators are
data-agnostic and do not have to read the data values to produce results. Therefore, there is
opportunity to optimize these operators. Example operators are subsample, reshape, and
Structured-Join. Structured-Join restricts its join predicate to be over dimension values only. The
second type of operator category is content-dependent operators. This type of operator's result
depends on the data that is stored in the input array. Example operators are filter, aggregate,
sum, and content-based join. Content-based join restricts its join predicate to be over data values
only. The fundamental array operations in SciDB should be user-extendable.
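The contrast between the two join flavors can be sketched in Python over 1-D arrays represented as dicts from index to value; the functions and representation are illustrative assumptions, not SciDB's implementation.

```python
# Toy contrast between the two join flavors: Structured-Join pairs cells by
# dimension (index) values, content-based join pairs cells by data values.
# Illustrative only; not SciDB's actual operators.

def structured_join(a, b):
    """Join on dimension values only: pair cells that share an index."""
    return {i: (a[i], b[i]) for i in a if i in b}

def content_join(a, b):
    """Join on data values only: pair the indexes of cells whose stored
    values are equal."""
    return [(i, j) for i, av in a.items()
                   for j, bv in b.items() if av == bv]

a = {0: "x", 1: "y", 2: "z"}
b = {1: "z", 3: "y"}
print(structured_join(a, b))  # {1: ('y', 'z')}
print(content_join(a, b))     # [(1, 3), (2, 1)]
```

The structured variant never inspects the stored values, which is what makes it data-agnostic and easier to optimize.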
Most scientists are adamant about not discarding any data. Therefore, there should not be
overwrite, which may cause data loss. If a data item is shown to be wrong, they want to add the
replacement value and the time of the replacement, retaining the old value for provenance
purposes. Furthermore, DBMS should be open source to get traction in the science community.
For the storage aspect, the storage manager must decompose a partition into disk blocks. Most
data will come into SciDB through a streaming bulk loader. Additionally, a universal
requirement from scientists was repeatability of data derivation. Hence, they wish to be able to
recreate any array A, by remembering how it was derived. The basic requirements are: for a
given data element D, find the collection of processing steps that created it from input data; for a
given data element D, find all the downstream data elements whose value is impacted by the
value of D.
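These two provenance queries can be sketched over a hypothetical derivation log of (step, inputs, output) triples; the log format and function names are assumptions for illustration only.

```python
# Sketch of the two provenance queries in the requirements, over a
# derivation log recorded as (step, inputs, output) triples.
# The log format and names are illustrative assumptions.

log = [
    ("calibrate", ["raw1"], "cal1"),
    ("calibrate", ["raw2"], "cal2"),
    ("average",   ["cal1", "cal2"], "avg"),
]

def steps_that_created(element):
    """All processing steps on the path from input data to the element."""
    steps = []
    frontier = {element}
    for step, inputs, output in reversed(log):
        if output in frontier:
            steps.append(step)
            frontier.update(inputs)
    return list(reversed(steps))

def downstream_of(element):
    """All data elements whose value is impacted by the element."""
    impacted = {element}
    for step, inputs, output in log:
        if impacted & set(inputs):
            impacted.add(output)
    return impacted - {element}

print(steps_that_created("avg"))          # ['calibrate', 'calibrate', 'average']
print(sorted(downstream_of("raw1")))      # ['avg', 'cal1']
```

The first function answers "how was D derived?"; the second answers "what does a change to D invalidate?", matching the two basic requirements above.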
The advantage of this paper is that it covers many requirements of a DBMS for the science
community and explains why the requirements are needed. However, the paper as a whole is
very disorganized; each section should be named more clearly. Also, I feel that a great many
requirements are mentioned, and it would be better if they were summarized and categorized.
Review 14
“Requirements for Science Data Bases and SciDB” by Stonebraker et al. discusses database use
cases and requirements of scientists, a new database system SciDB to support these
requirements, and a history of the SciDB designers’ interactions with users to uncover these
requirements. The requirements include:
1) a data model and corresponding language that better fits scientific data use cases, namely an
array data model
2) structural operators (based on the structure of the input) and content-dependent operators
(based on the content of the input)
3) user-extendable operations on arrays
4) bindings for a variety of programming languages, since different scientists use different
languages
5) no overwrite of data
6) open source DBMS
7) support changing data partitionings over time
8) how data is stored within a node
9) operate on “in situ” data (don’t require a full load of data)
10) provide support for “cooking” or pre-processing of raw data within the DBMS
11) support named versions of data
12) provide provenance of how particular data was computed and what other computations
down the line it affects
13) support indication of uncertainty (i.e., normal distribution of data)
Review 15
This paper talks about the requirements for science databases and SciDB. The paper first notes
that the demand for science databases is large. With different groups not working together on
the topic, the software becomes too large and too hard to maintain, so there is demand for a
single project in the DBMS area for scientific usage.
The main contribution of the paper is its discussion of the many requirements for a science
database and SciDB. The first is the array data model, which is easier to build than a mesh
model; SciDB also needs to support user-defined functions. The second set of requirements
concerns operations. The paper proposes several operations such as subsample, reshape, and
structured-join as data-agnostic operators. Several content-dependent operators are also
introduced, such as filter, aggregate, and content-based join. The third and fourth requirements
are extendibility and language bindings; I think they are both requirements about user usage.
Besides these, the paper considers some special requirements, including no overwriting, open
source, storage within a node, "in situ" data, and provenance.
The strong part of the paper is that it considers the design requirements from a scientific
researcher's perspective; the data and workloads of researchers differ from those of traditional
DBMS users. The paper also points out non-science uses of the proposed requirements, which
means the designs from the science domain can be applied to other applications.
The drawback of the paper is that I saw no evidence that the opinions it reports represent a
considerable subset of researchers' opinions. The authors did not conduct a survey or present
data to support their claims, and I think assertions of this kind can only be backed by a well-
designed survey.
Review 16
In this paper, the authors, drawn from several different organizations, explore a DBMS designed
specifically for scientific research, called SciDB. Designing and implementing a DBMS for
scientific research is definitely a meaningful task because, before SciDB, no commercial DBMS
served this need; even for existing products like Sequoia and MonetDB, a common set of
requirements across several science disciplines had not been defined. Yet the demand for a
DBMS for scientific use is great: as the authors mention, it spans many fields such as astronomy,
particle physics, fusion, and remote sensing. The goal of this paper is to specify a common set of
requirements for a new science DBMS called SciDB; these requirements also fit very complex
business analytics. Next, I will summarize the crux of this paper with my understanding.
In this paper, they discuss several requirements for building SciDB. First of all, an appropriate
data model is needed. The relational model of tables is not suitable for scientists because of its
rigidity: tables do not feel natural for scientific needs, and SQL is not a good way for scientists
to retrieve data. A better fit is an array-like data model, which may have multiple dimensions,
where each combination of dimension indices identifies a single array cell that holds the data.
Secondly, a set of well-defined operators is required. They define structural operators, which are
data-agnostic and perform only structural manipulations on arrays, possibly with various
dimensions. The other type is the content-dependent operators, which take a logical predicate on
data values to decide what to do. Third, for analytical and computational needs, SciDB should
be extensible with user-defined operations and data types, in a POSTGRES-style format. Fourth,
SciDB should support multiple language bindings, since there is no single language that
everyone uses behind the scenes. Fifth, to guarantee that data are trackable and never discarded,
SciDB should prevent users from overwriting old data; this is done by keeping history, which
tracks changes while preserving the original values. Next, to ensure scalability and keep a long-
lived scientific project maintainable and rebuildable at will, SciDB should be open source, and
to give it strength comparable to commercial DBMSs, a non-profit foundation is to be
established to support the user community.
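The no-overwrite requirement can be sketched in a few lines: an update appends a new versioned value rather than replacing the old one, so any prior state can still be read. This is a hypothetical illustration of the idea, not SciDB's actual storage design; all names are made up.

```python
# Sketch of the no-overwrite rule: updates append, never replace,
# so every historical value of a cell remains readable.
class AppendOnlyCell:
    def __init__(self, value):
        self.history = [(0, value)]  # list of (version, value) pairs

    def update(self, value):
        version = self.history[-1][0] + 1
        self.history.append((version, value))  # old value is retained

    def value_at(self, version):
        # Return the latest value whose version <= the requested version.
        for v, val in reversed(self.history):
            if v <= version:
                return val
        raise KeyError(version)

cell = AppendOnlyCell(1.5)
cell.update(2.5)  # becomes version 1; version 0 still readable
cell.update(3.5)  # becomes version 2
print(cell.value_at(0), cell.value_at(2))
```

The same append-plus-version scheme is what makes the later requirements (named versions, provenance) feasible, since nothing is ever destroyed.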
Partitioning of data storage should fit users' demand for large datasets. To do this efficiently,
SciDB has a default fixed partitioning mechanism, while a user-defined dynamic partitioning
scheme is also available. SciDB also needs to break data into disk blocks; an R-tree is used for
tracking them, background threads handle optimization, and further optimization is expected to
take place as well. SciDB is supposed to support ad-hoc analysis without much time spent
loading huge amounts of data. Next, it should offer powerful internal data manipulation: SciDB
should integrate the cooking process within the DBMS, without external processing, when users
need it. SciDB should also support different scientific manipulations of the same set of data; to
achieve this the authors propose a tree-like structure using named versions. Data derivation
should be clear from the outset, and SciDB should support backward tracking of data changes
when necessary. Besides, uncertainty is common in scientific fields, and SciDB should capture
it: it will no longer store just single data values, but also normal distributions of them.
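The uncertainty requirement amounts to storing a distribution per cell instead of a point value. A minimal sketch, assuming cells hold independent normal variables (the class and its error-propagation rule are illustrative, not SciDB's design):

```python
import math

# Each "cell" stores a normal distribution (mean, standard deviation)
# rather than a single value; arithmetic propagates the uncertainty.
class Uncertain:
    def __init__(self, mean, std):
        self.mean, self.std = mean, std

    def __add__(self, other):
        # For independent normal variables, variances add.
        return Uncertain(self.mean + other.mean,
                         math.sqrt(self.std ** 2 + other.std ** 2))

a = Uncertain(10.0, 3.0)   # e.g., a measurement with instrument error 3.0
b = Uncertain(20.0, 4.0)
c = a + b
print(c.mean, c.std)  # 30.0 5.0
```

A DBMS that understands this representation can carry the error bars through every query, which is what the paper asks for.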
I think this is an interesting paper, quite different from the previous papers we read; it looks like
a blueprint for designing a scientific DBMS and provides several useful pieces of advice.
Although it is a survey-like paper, I think it still makes a good technical contribution: SciDB is
the first project that considers all these factors when designing a DBMS specifically for science,
the authors put good effort into the project, and the result is promising. There are several
advantages to this paper. First of all, they propose a well-defined database management system
with an array data model and accompanying operations and features that capture the
requirements of the scientist communities. I think they contribute by providing good guidance
that can help developers avoid potential pitfalls when building such a system. In fact, I think
NumPy later did a good job of providing an array representation that is widely used for
scientific computation.
There are some drawbacks to this paper. First of all, there is no existing implementation of
SciDB; the authors only describe their plan for building such a DBMS. For the same reason, no
experiments were done with SciDB, so whether its design decisions are correct is unknown.
There are no solid results proving that SciDB is efficient, robust, and flexible for scientific
workloads.
Review 17
This paper is a collection of “requirements” for a good scientific DBMS, proposed by Michael
Stonebraker & others after collecting information from a group of users. These requirements
were gathered for the purpose of developing a new scientific database system called SciDB.
Here is a summary of the proposed rules/requirements:
1. Data models are specific to the use case; for example, biology users prefer graphs &
sequences, but other users like array models better. SciDB decided to pursue a
multidimensional-array-based model, though it is impossible to make everyone happy with this
choice. Several operators are defined for this model, including subsample, reshape, dimension
adding/removing, concatenate, cross product, filter, aggregate, join, and project.
2. Language binding is desired by users and persistence (such as that offered in the object-
oriented era models) is a good thing for scientific DB users. Because of this, SciDB uses a
parse-tree that lets it use language embedding for this purpose.
3. Don’t allow overwriting of data — instead just append. This makes sense as analysis over old
data is important in scientific settings.
4. Make it open source — this is fairly intuitive and makes sense in the scientific community.
5. The “cooking” process is important to build around; this is when input (raw) data is
converted into a better standard (calibration, correction, etc.). SciDB makes the choice to load
raw data into the DBMS and then cook it INSIDE the database.
6. Determinism when deriving data is important, as it allows scientists the ability to re-create
any situation (array in SciDB). This is good for fixing errors.
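The determinism point in item 6 is essentially provenance: if each derived array records the operation and inputs that produced it, any result can be re-derived after an upstream fix. A hypothetical sketch (the `TrackedArray` class and its methods are invented for illustration):

```python
# Provenance sketch: every derived array remembers the function and
# inputs that produced it, so results can be recomputed deterministically
# after an error in the raw data is corrected.
class TrackedArray:
    def __init__(self, data, op=None, inputs=()):
        self.data = list(data)
        self.op = op          # function that produced this array, if any
        self.inputs = inputs  # source TrackedArrays it was derived from

    def derive(self, op):
        return TrackedArray(op(self.data), op=op, inputs=(self,))

    def recompute(self):
        # Re-run the recorded lineage from the (possibly corrected) inputs.
        if self.op is None:
            return self.data
        return self.op(self.inputs[0].recompute())

raw = TrackedArray([1, 2, 3])
cooked = raw.derive(lambda xs: [x * 10 for x in xs])
raw.data[0] = 5            # fix an error discovered in the raw data
print(cooked.recompute())  # [50, 20, 30]
```

Walking the lineage in the other direction would answer the forward question from requirement 12 of the earlier list: which downstream computations a given datum affects.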
7. Support for fuzzy/uncertain data. This isn't as important for business databases, but SciDB
(or any good scientific database) should support uncertainty factors attached to data, reflecting,
for example, how the data was measured.
The contributions of the paper are mainly these observations, which were taken from real
industry users of scientific databases and people who want better scientific databases. The paper
also presents a high-level picture of what SciDB would look like. The main weakness is that no
results are given because SciDB was still in the idea phase at the time of this paper, which
seems strange to me: why not implement the project and then write the paper afterwards, rather
than write it before any actual development has taken place?