2021, Proceedings of the 2021 International Conference on Management of Data
In-situ processing has received a great deal of attention in recent years. In in-situ scenarios, big raw data files that do not fit in main memory must be handled efficiently on the fly using commodity hardware, without the overhead of a preprocessing phase or of loading the data into a database system. This paper presents RawVis, an open-source data visualization system for in-situ visual exploration and analytics over big raw data. RawVis implements novel indexing schemes and adaptive processing techniques that allow users to perform efficient visual and analytics operations directly over the data files. RawVis provides real-time interaction, reporting low response times over large data files on commodity hardware.
In-situ Visual Exploration over Big Raw Data, Information Systems, 2020
Data exploration and visual analytics systems are of great importance in Open Science scenarios, where less tech-savvy researchers wish to access and visually explore big raw data files (e.g., JSON, CSV) generated by scientific experiments, using commodity hardware and without being overwhelmed by the tedious processes of data loading, indexing and query optimization. In this paper, we present our work on enabling efficient query processing on large raw data files for interactive visual exploration scenarios and analytics. We introduce a framework, named RawVis, built on top of a lightweight in-memory tile-based index, VALINOR, that is constructed on the fly given the first user query over a raw file and is progressively adapted based on the user interaction. We evaluate the performance of a prototype implementation against three alternatives and show that our method outperforms them in terms of response time, disk accesses and memory consumption. In particular, during an exploration scenario the proposed method is in most cases about 5-10× faster than existing solutions and requires significantly less memory. Keywords: Visual Analytics, Progressive & Adaptive Indexes, User-driven Incremental Processing, Interactive Indexing, RawVis, In-situ Query Processing, Big Data Visualization
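The idea of a tile-based index that is built from an initial pass over the raw file and then refined where the user explores can be illustrated with a minimal sketch. This is not the VALINOR implementation: the names `Tile`, `build_grid` and `window_query`, the split threshold and the depth limit are all assumptions made for illustration, and each indexed point is assumed to carry the byte offset of its record in the raw file.

```python
from dataclasses import dataclass, field

@dataclass
class Tile:
    # Tile covers the half-open rectangle [x0, x1) x [y0, y1).
    x0: float
    y0: float
    x1: float
    y1: float
    depth: int = 0
    entries: list = field(default_factory=list)   # (x, y, file_offset) tuples
    children: list = field(default_factory=list)  # four sub-tiles after a split

    def overlaps(self, qx0, qy0, qx1, qy1):
        return self.x0 < qx1 and qx0 < self.x1 and self.y0 < qy1 and qy0 < self.y1

    def inside(self, qx0, qy0, qx1, qy1):
        return qx0 <= self.x0 and self.x1 <= qx1 and qy0 <= self.y0 and self.y1 <= qy1

    def split(self):
        # Split into four equal sub-tiles and push the entries down.
        mx, my = (self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2
        d = self.depth + 1
        self.children = [
            Tile(self.x0, self.y0, mx, my, d), Tile(mx, self.y0, self.x1, my, d),
            Tile(self.x0, my, mx, self.y1, d), Tile(mx, my, self.x1, self.y1, d),
        ]
        for x, y, off in self.entries:
            self.children[(x >= mx) + 2 * (y >= my)].entries.append((x, y, off))
        self.entries = []

def build_grid(points, bounds, n=2):
    """Initial coarse n x n grid; points are assumed to lie inside bounds."""
    bx0, by0, bx1, by1 = bounds
    w, h = (bx1 - bx0) / n, (by1 - by0) / n
    grid = [[Tile(bx0 + i * w, by0 + j * h, bx0 + (i + 1) * w, by0 + (j + 1) * h)
             for j in range(n)] for i in range(n)]
    for x, y, off in points:
        i = min(int((x - bx0) / w), n - 1)
        j = min(int((y - by0) / h), n - 1)
        grid[i][j].entries.append((x, y, off))
    return [tile for column in grid for tile in column]

def window_query(tiles, query, split_threshold=4, max_depth=8):
    """Return file offsets of points in the half-open query window.

    Tiles that are only partially covered and hold many entries are split,
    so the index becomes finer in the regions the user actually explores.
    """
    qx0, qy0, qx1, qy1 = query
    offsets, stack = [], list(tiles)
    while stack:
        t = stack.pop()
        if not t.overlaps(*query):
            continue
        if not t.children and not t.inside(*query) \
                and len(t.entries) > split_threshold and t.depth < max_depth:
            t.split()
        if t.children:
            stack.extend(t.children)
            continue
        offsets.extend(off for x, y, off in t.entries
                       if qx0 <= x < qx1 and qy0 <= y < qy1)
    return sorted(offsets)
```

In this sketch the returned offsets would be used to seek into the raw file and read only the matching records; repeated queries over the same area touch progressively smaller tiles.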
Data exploration and visual analytics systems are of great importance in Open Science scenarios, where less tech-savvy researchers wish to access and visually explore big raw data files (e.g., JSON, CSV) generated by scientific experiments, using commodity hardware and without being overwhelmed by the tedious processes of data loading, indexing and query optimization. In this work, we present our approach for enabling efficient in-situ query processing on big raw data files for interactive visual exploration scenarios. We introduce a framework, named RawVis, built on top of a lightweight in-memory tile-based index, VALINOR, that is constructed on the fly given the first user query over a raw file and is adapted incrementally based on the user interaction. We evaluate the performance of a prototype implementation against three alternatives and show that our method outperforms them in terms of response time, disk accesses and memory consumption.
Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP 2021), 2021
In-situ processing has received a great deal of attention in recent years. In in-situ scenarios, big raw data files that do not fit in main memory must be handled efficiently using commodity hardware, without the overhead of a preprocessing phase or of loading the data into a database. In this work, we present an adaptive indexing scheme that enables efficient visual exploration and analytics over big raw data files. Beyond visual exploration and statistics, the scheme enables categorical-based analytics using group-by and filter operations. The proposed scheme combines a tile-based structure, which offers efficient exploratory operations over the 2D space, with a tree-based structure that organizes a tile's objects based on their categorical values, enabling efficient visual analytics and support for advanced visualization methods. The index resides in main memory and is built progressively as the user explores parts of the raw file, while its structure and level of granularity are adjusted to the user's exploration areas and type of analysis. We conduct experiments using real and synthetic datasets and demonstrate that the proposed approach is in most cases more than 40× faster than existing solutions and performs around 3 orders of magnitude fewer I/O operations.
VLDB Journal, 2022
In in-situ data management scenarios, large data files that do not fit in main memory must be handled efficiently using commodity hardware, without the overhead of a preprocessing phase or of loading the data into a database. In this work, we study the challenges posed by visual analysis tasks in in-situ scenarios in the presence of memory constraints. We present an indexing scheme and adaptive query evaluation techniques that enable efficient categorical-based group-by and filter operations, combined with 2D visual interactions such as exploration of data points on maps or scatter plots. The indexing scheme combines a tile-based structure, which offers efficient visual exploration over the 2D plane, with a tree-based structure that organizes a tile's objects based on their categorical values. The index is constructed on the fly, resides in main memory and is built progressively as the user explores parts of the raw file, while its structure and level of granularity are adjusted to the user's exploration areas and type of analysis. To handle cases where limited resources are available, we introduce a resource-aware index initialization mechanism, formulate it as an NP-hard optimization problem, and propose two efficient approximation algorithms to solve it. We conduct extensive experiments using real and synthetic datasets and demonstrate that our approach reports interactive query response times (less than 0.04 sec), and in most cases is more than 100× faster and performs up to 2 orders of magnitude fewer I/O operations than existing solutions. The proposed methods are implemented as part of an open-source system for in-situ visual exploration and analytics.
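The combination of a tile with a tree over categorical values can be pictured with a small sketch. It is only an illustration under stated assumptions, not the authors' data structure: `CatNode` and `group_by_first` are hypothetical names, each tile is assumed to own one tree root, and every node keeps a running count and sum so that group-by and filter queries over fully covered tiles can be answered from memory.

```python
class CatNode:
    """Node of a per-tile tree keyed by categorical attribute values.

    Level k of the tree corresponds to the k-th categorical attribute.
    Each node aggregates the numeric values of all objects below it.
    """

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.children = {}

    def insert(self, categories, value):
        # Update aggregates along the path root -> leaf.
        node = self
        node.count += 1
        node.total += value
        for cat in categories:
            node = node.children.setdefault(cat, CatNode())
            node.count += 1
            node.total += value

def group_by_first(tile_roots):
    """Mean of the numeric value grouped by the first categorical attribute,
    merged across all tiles that the query window fully covers."""
    merged = {}
    for root in tile_roots:
        for cat, node in root.children.items():
            count, total = merged.get(cat, (0, 0.0))
            merged[cat] = (count + node.count, total + node.total)
    return {cat: total / count for cat, (count, total) in merged.items()}
```

For example, with two tiles whose objects carry the categorical attributes (type, status) and a numeric rating, `group_by_first` returns the mean rating per type, and a filter such as type = "hotel" and status = "open" reduces to following one path in each tile's tree.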
2011 IEEE Symposium on Large Data Analysis and Visualization, 2011
As computing power increases exponentially, vast amounts of data are created by many scientific research activities. However, the bandwidth for storing data to disk and reading it back has been improving at a much slower pace. These two trends produce an ever-widening data access gap. Our work brings together two distinct technologies to address this data access issue: indexing and in situ processing. From decades of database research literature, we know that indexing is an effective way to address the data access issue, particularly when accessing a relatively small fraction of the data records. As data sets increase in size, more and more analysts need to use selective data access, which makes indexing even more important for improving data access. The challenge is that most implementations of indexing technology are embedded in large database management systems (DBMS), but most scientific datasets are not managed by any DBMS. In this work, we choose to include indexes with the scientific data instead of requiring the data to be loaded into a DBMS. We use compressed bitmap indexes from the FastBit software, which are known to be highly effective for the query-intensive workloads common in scientific data analysis. To use the indexes, we need to build them first. The index building procedure needs to access the whole data set and may also require a significant amount of compute time. In this work, we adapt in situ processing technology to generate the indexes, thus removing the need to read data from disk and allowing the indexes to be built in parallel. The in situ data processing system used is ADIOS, a middleware for high-performance I/O. Our experimental results show that the indexes can improve data access time by up to 200 times, depending on the fraction of data selected, and that the in situ data processing system can reduce the time needed to create the indexes by up to 10 times when using identical parallel settings.
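To make the bitmap-index idea concrete, here is a minimal sketch that uses Python integers as uncompressed bitmaps. It does not use FastBit or its API, and it omits the compression that FastBit applies; `build_bitmap_index` and `matching_rows` are hypothetical helper names chosen for illustration.

```python
def build_bitmap_index(column):
    """Map each distinct value to a bitmap; bit i is set if row i has that value."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index

def matching_rows(bitmap):
    """Decode a bitmap into the sorted list of row numbers whose bit is set."""
    rows, row = [], 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows

# Two columns of a small, made-up record set.
species = ["e", "p", "e", "e", "p"]
region = ["core", "core", "edge", "core", "edge"]

species_idx = build_bitmap_index(species)
region_idx = build_bitmap_index(region)

# species == "e" AND region == "core" is a single bitwise AND of two bitmaps.
selected = matching_rows(species_idx["e"] & region_idx["core"])
```

Selective queries are answered by combining bitmaps with bitwise operations, so only the rows that survive the combination need to be read from the data file.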
Communications of the ACM, 1999
VisIt is a popular open source tool for visualizing and analyzing data. It owes its success to its foci of increasing data understanding, large data support, and providing a robust and usable product, as well as its underlying design that fits today's supercomputing landscape. In this short paper, we describe the VisIt project and its accomplishments.
IEEE Access, 2018
When exploring large amounts of data without a clear target, providing an interactive experience becomes really difficult, since this tentative inspection usually defeats any early decision on data structures or indexing strategies. This is also true in the physics domain, specifically in high-energy physics, where the huge volume of data generated by the detectors is normally explored via C++ code using batch processing, which introduces considerable latency. An interactive tool, when integrated into the existing data management systems, can add great value to the usability of these platforms. Here, we review the current state of the art of interactive data exploration, aiming at satisfying three requirements: access to raw data files, stored in a distributed environment, and with reasonably low latency. This paper follows the guidelines for systematic mapping studies, which are well suited for gathering and classifying available studies. We summarize the results after classifying the 242 papers that passed our inclusion criteria. While there are many proposed solutions that tackle the problem in different manners, there is little evidence available about their implementation in practice. Almost all of the solutions found by this paper cover a subset of our requirements, with only one partially satisfying the three. Solutions for data exploration abound. It is an active research area and, considering the continuous growth of data volume and variety, the problem will only become harder. There is a niche for research on a solution that covers our requirements, and the required building blocks are there. Index Terms: Big data applications, data analysis, data engineering, data exploration, database systems, interactive systems, systematic mapping study.
ArXiv, 2021
As the rate of data collection continues to grow rapidly, developing visualization tools that scale to immense data sets is a serious and ever-increasing challenge. Existing approaches generally seek to decouple storage and visualization systems, performing just-in-time data reduction to transparently avoid overloading the visualizer. We present a new architecture in which the visualizer and data store are tightly coupled. Unlike systems that read raw data from storage, the performance of our system scales linearly with the size of the final visualization, essentially independent of the size of the data. Thus, it scales to massive data sets while supporting interactive performance (sub-100 ms query latency). This enables a new class of visualization clients that automatically manage data, quickly and transparently requesting data from the underlying database without requiring the user to explicitly initiate queries. It lays a groundwork for supporting truly interactive exploration o...
2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, 2008
One of the central challenges in modern science is the need to quickly derive knowledge and understanding from large, complex collections of data. We present a new approach that deals with this challenge by combining and extending techniques from high performance visual data analysis and scientific data management. This approach is demonstrated within the context of gaining insight from complex, time-varying datasets produced by a laser wakefield accelerator simulation. Our approach leverages histogram-based parallel coordinates both for visual information display and as a vehicle for guiding a data mining operation. Data extraction and subsetting are implemented with state-of-the-art index/query technology. This approach, while applied here to accelerator science, is generally applicable to a broad set of science applications, and is implemented in a production-quality visual data analysis infrastructure. We conduct a detailed performance analysis and demonstrate good scalability on a distributed memory Cray XT4 system.