2015 IEEE 8th International Conference on Cloud Computing
Euclidean embedding algorithms transform data defined in an arbitrary metric space into Euclidean space, which is critical to many visualization techniques. At big-data scale, these algorithms need to be scalable to massive data-parallel infrastructures. Designing such scalable algorithms and understanding the factors affecting them are important research problems for visually analyzing big data. We propose a framework that extends existing Euclidean embedding algorithms into scalable ones. Specifically, it decomposes an existing algorithm into naturally parallel components and non-parallelizable components. Then, data-parallel implementations such as MapReduce and data reduction techniques are applied to the two categories of components, respectively. We show that this can be done for a collection of embedding algorithms. Extensive experiments are conducted to understand the important factors in these scalable algorithms: scalability, time cost, and the effect of data reduction on result quality. Results on two sample algorithms, FastMap-MR and LMDS-MR, show that with the proposed approach the derived algorithms preserve result quality well while achieving desirable scalability.
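The parallel/sequential split described above can be illustrated with the classic FastMap projection step: projecting each point onto a pivot axis is naturally parallel (one map task per chunk), while pivot selection is sequential. The sketch below is our own minimal illustration of that decomposition, not the paper's FastMap-MR implementation; all function names and the toy data are ours.

```python
from math import dist  # plain Euclidean distance stands in for an arbitrary metric

def fastmap_coord(x, a, b):
    """Classic FastMap projection of x onto the axis through pivots a and b."""
    dab = dist(a, b)
    return (dist(a, x) ** 2 + dab ** 2 - dist(b, x) ** 2) / (2 * dab)

def map_phase(chunk, a, b):
    """Naturally parallel part: each chunk of points is projected
    independently, e.g. as one MapReduce map task."""
    return [fastmap_coord(x, a, b) for x in chunk]

# Toy data split into two "map tasks"; pivot selection is the sequential part.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
a, b = points[0], points[3]
chunks = [points[:2], points[2:]]
coords = [c for ch in chunks for c in map_phase(ch, a, b)]
```

The map tasks share only the two pivots, so the per-chunk work needs no communication until the results are concatenated.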
2019
With the advent of the big data era, interactive visualization of large data sets consisting of M ~ 10^5+ high-dimensional feature vectors of length N (N ~ 10^3+) is an indispensable tool for exploratory data analysis. The state-of-the-art data embedding (DE) methods for mapping N-D data into a 2-D (3-D) visually perceptible space (e.g., those based on the t-SNE concept) are too computationally demanding to be efficiently employed for interactive data analytics of large and high-dimensional datasets. Herein we present a simple method, ivhd (interactive visualization of high-dimensional data tool), which radically outperforms modern data-embedding algorithms in both computational and memory loads, while retaining high quality of N-D data embedding in 2-D (3-D). We show that the DE problem is equivalent to nearest-neighbor nn-graph visualization, where only the indices of a few nearest neighbors of each data sample have to be known, and the binary distance between data samples -- 0 to the nearest and 1 to the other s...
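The binary-distance idea can be sketched as a stress function over the stored kNN indices: target distance 0 to each kept nearest neighbour and 1 to a few randomly chosen non-neighbours. This is our illustrative reconstruction of the general idea, not the ivhd code; `ivhd_loss`, the weight `w`, and the data layout are assumptions.

```python
import math

def d2(p, q):
    """Euclidean distance between two 2-D embedding positions."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def ivhd_loss(Y, knn, far, w=0.01):
    """Binary-distance stress over 2-D positions Y: target distance 0 to
    each stored nearest neighbour, target distance 1 to a few random
    non-neighbours (indices in `far`)."""
    loss = 0.0
    for i in range(len(Y)):
        loss += sum(d2(Y[i], Y[j]) ** 2 for j in knn[i])              # attract
        loss += w * sum((d2(Y[i], Y[j]) - 1.0) ** 2 for j in far[i])  # repel
    return loss
```

Because only a few neighbour indices per point are stored and distances are binary, both the memory footprint and the per-iteration cost stay linear in the number of samples.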
International Journal of Decision Sciences & Applications (2528-956X)
Dimension reduction strives to represent higher-dimensional data by a lower-dimensional structure. A famous approach by Carroll, called Parametric Mapping or PARAMAP (Shepard & Carroll, 1966), works by iterative minimization of a loss function measuring the smoothness or continuity of the mapping from the lower-dimensional representation to the original data. The algorithm was revitalized with essential modifications (Akkucuk & Carroll, 2006). Even with these modifications, it still needed a large number of randomly generated starts. In this paper we discuss the use of a variant of the Isomap method (Tenenbaum et al., 2000) to obtain a starting framework that replaces the random starts. A core set of landmark points is selected by a special procedure akin to the selection of seeds for the k-means algorithm. This core set of landmark points is used to create a rational start, so that the PARAMAP algorithm need be run only once yet effectively reaches a global minimum. Since Isomap ...
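A landmark-selection procedure "akin to the selection of seeds for the k-means algorithm" can be sketched with k-means++-style sampling: each new landmark is drawn with probability proportional to its squared distance from the nearest landmark chosen so far. This is our illustration of the general idea under that assumption, not the paper's exact procedure.

```python
import math
import random

def dist(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeanspp_landmarks(X, k, rng=random.Random(0)):
    """Choose k landmark points k-means++-style: each new landmark is
    sampled with probability proportional to its squared distance from
    the nearest landmark already chosen."""
    landmarks = [rng.choice(X)]
    while len(landmarks) < k:
        d2 = [min(dist(x, c) for c in landmarks) ** 2 for x in X]
        landmarks.append(rng.choices(X, weights=d2, k=1)[0])
    return landmarks
```

Points already chosen get weight zero, so the landmarks spread out over the data, which is what makes them useful as a rational starting configuration.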
2011
Large-scale visualization systems are typically designed to efficiently “push” datasets through the graphics hardware. However, exploratory visualization systems are increasingly expected to support scalable data manipulation, restructuring, and querying capabilities in addition to core visualization algorithms. We posit that new emerging abstractions for parallel data processing, in particular computing clouds, can be leveraged to support large-scale data exploration through visualization. In this paper, we take a first step in evaluating the suitability of the MapReduce framework to implement large-scale visualization techniques. MapReduce is a lightweight, scalable, general-purpose parallel data processing framework increasingly popular in the context of cloud computing. Specifically, we implement and evaluate a representative suite of visualization tasks (mesh rendering, isosurface extraction, and mesh simplification) as MapReduce programs, and report quantitative performance results applying these algorithms to realistic datasets. For example, we perform isosurface extraction of up to 16 isovalues for volumes composed of 27 billion voxels, simplification of meshes with 30 GB of data and subsequent rendering with image resolutions up to 80000^2 pixels. Our results indicate that the parallel scalability, ease of use, ease of access to computing resources, and fault-tolerance of MapReduce offer a promising foundation for a combined data manipulation and data visualization system deployed in a public cloud or a local commodity cluster.
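The map/shuffle/reduce pattern behind such a pipeline can be sketched for isosurface extraction: mappers scan volume chunks and emit the cells an isovalue crosses, and reducers group the crossed cells per isovalue. This single-machine sketch only mimics the dataflow; the cell test, names, and toy data are our assumptions, not the paper's implementation.

```python
from collections import defaultdict

def mapper(chunk, isovalues):
    """Emit (isovalue, cell_id) for every cell in a volume chunk whose
    value range straddles the isovalue (i.e. the isosurface crosses it)."""
    for iso in isovalues:
        for cell_id, (lo, hi) in chunk:
            if lo <= iso <= hi:
                yield iso, cell_id

def reduce_all(mapped):
    """Shuffle + reduce: group crossed cells per isovalue."""
    groups = defaultdict(list)
    for iso, cell_id in mapped:
        groups[iso].append(cell_id)
    return dict(groups)

# Two "map tasks" over chunks of (cell id, (min value, max value)) pairs.
chunk1 = [(0, (0.0, 0.5)), (1, (0.4, 1.0))]
chunk2 = [(2, (0.8, 1.2))]
mapped = list(mapper(chunk1, [0.45, 1.1])) + list(mapper(chunk2, [0.45, 1.1]))
surface_cells = reduce_all(mapped)
```

Each chunk is processed with no knowledge of the others, which is exactly what lets the framework scale the map phase across a cluster.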
Brazilian Conference on Intelligent Systems, 2019
Big Data has attracted extensive attention from industry, academia and governments around the world, employing various approaches from many fields such as machine learning, pattern recognition and data visualization. Data visualization is quite useful for human perception of relevant information, providing understanding and insight into data of high dimensionality. This paper presents a novel approach to dimensionality reduction called the Polygonal Coordinate System (PCS), which is able to represent multi-dimensional data in two dimensions. For this purpose, data are projected through a regular polygon serving as an interface between the high-dimensional space and the 2D plane. PCS can deal with massive data sets by adopting an incremental and efficient dimensionality reduction. A statistical comparison using Spearman's rho correlation highlights the utility of PCS, which outperforms the state-of-the-art t-Distributed Stochastic Neighbor Embedding (t-SNE) technique.
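The abstract does not give PCS's formulas, but the general idea of projecting high-dimensional points through the vertices of a regular polygon can be conveyed with a RadViz-style sketch: one polygon vertex per dimension, each point placed at the weighted mean of the vertices. This is an assumed stand-in for illustration only, not the PCS definition.

```python
import math

def polygon_anchors(n):
    """Vertices of a regular n-gon on the unit circle, one per dimension."""
    return [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
            for i in range(n)]

def to_2d(x, anchors):
    """Place a non-negative feature vector in 2-D as the weighted mean of
    the polygon vertices (a RadViz-style projection)."""
    s = sum(x) or 1.0
    px = sum(xi * ax for xi, (ax, _) in zip(x, anchors)) / s
    py = sum(xi * ay for xi, (_, ay) in zip(x, anchors)) / s
    return (px, py)
```

A point dominated by one feature lands near that feature's vertex, while a balanced point lands near the polygon's centre, so such projections can be computed incrementally, one record at a time.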
Lecture Notes in Computer Science, 2020
Interactive visual exploration of large and multidimensional data still needs more efficient ND → 2D data embedding (DE) algorithms. We claim that the visualization of very high-dimensional data is equivalent to the problem of 2D embedding of undirected kNN-graphs. We demonstrate that high quality embeddings can be produced with minimal time&memory complexity. A very efficient GPU version of IVHD (interactive visualization of high-dimensional data) algorithm is presented, and we compare it to the state-of-the-art GPU-implemented DE methods: BH-SNE-CUDA and AtSNE-CUDA. We show that memory and time requirements for IVHD-CUDA are radically lower than those for the baseline codes. For example, IVHD-CUDA is almost 30 times faster in embedding (without the procedure of kNN graph generation, which is the same for all the methods) of the largest (M = 1.4 • 10 6) YAHOO dataset than AtSNE-CUDA. We conclude that in the expense of minor deterioration of embedding quality, compared to the baseline algorithms, IVHD well preserves the main structural properties of ND data in 2D for radically lower computational budget. Thus, our method can be a good candidate for a truly big data (M = 10 8+) interactive visualization.
International Journal of Science and Business, 2021
One of the main characteristics of data at scale is complexity. Heterogeneous data complicates data integration and compounds big data problems. Large-scale databases are both essential and difficult to visualize and interpret, since they require considerable data processing and storage capacity. In the data age, where data grows exponentially, presenting data in a manner the human mind can grasp is a significant struggle. This paper reviews data visualization and heterogeneous distributed storage, together with their challenges, across the different methods of previous research. In addition, the results of the reviewed research works are compared, and the fundamental shift that virtual reality brings to large-scale data visualization is discussed.
Clustering is a process of grouping objects that are similar among themselves but dissimilar to objects in other groups. Clustering a large dataset is a challenging, data-intensive task. The key to scalability and performance benefits is to use parallel algorithms. Moreover, the use of Big Data has become crucial in almost all sectors nowadays; however, analyzing Big Data is a very challenging task. Google's MapReduce has attracted a lot of attention for such applications, which motivates us to convert a sequential algorithm into a MapReduce algorithm. This paper presents p-PIC with MapReduce, one of the newly developed clustering algorithms. p-PIC originated from PIC, which, though scalable and effective, works well only on low-end commodity computers. It performs clustering by embedding the data points in a low-dimensional space derived from the similarity matrix. The experimental results show that p-PIC performs well in the MapReduce framework for handling big data: it is fast and scalable, and the accuracy of the produced clusters is almost the same as without the MapReduce framework. Hence the results produced by p-PIC in MapReduce are fast, scalable and accurate.
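The embedding step of PIC can be sketched as power iteration over a row-normalised similarity matrix, started from the normalised degree vector; each matrix-vector product is a natural map/reduce step (map over rows, reduce by summation). This is a minimal single-machine illustration of the idea, not the p-PIC code, and the stopping rule here (a fixed iteration count) is a simplification.

```python
def power_iteration_embedding(W, iters=10):
    """One-dimensional PIC-style embedding: repeatedly multiply the
    row-normalised similarity matrix W with a vector. Early, un-converged
    iterates separate the clusters; full convergence would be uniform."""
    n = len(W)
    rowsum = [sum(row) for row in W]
    total = sum(rowsum)
    A = [[W[i][j] / rowsum[i] for j in range(n)] for i in range(n)]
    v = [r / total for r in rowsum]           # degree-based start vector
    for _ in range(iters):
        v = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        s = sum(abs(x) for x in v)
        v = [x / s for x in v]                # keep the vector normalised
    return v
```

Running ordinary k-means on the resulting one-dimensional values then recovers the clusters.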
Concurrency and Computation: Practice and Experience, 2014
Visual and interactive data exploration requires fast and reliable tools for embedding an original data space in 3-D (2-D) Euclidean space. Multidimensional scaling (MDS) is a good candidate. However, owing to its at least O(M^2) memory and time complexity, MDS is computationally demanding for interactive visualization of data sets of the order of 10^4 objects on computer systems ranging from PCs with multi-core CPUs and GPU boards to midrange multiprocessor MPI clusters. To explore data sets of that size interactively, we have developed novel efficient parallel algorithms for MDS mapping based on virtual particle dynamics. We demonstrate that the performance of our MDS algorithms implemented in the CUDA environment on a PC equipped with a modern GPU board (Tesla M2090, GeForce GTX 480) is considerably faster than their OpenMP/MPI parallel implementation on a modern midrange professional cluster (10 nodes, each equipped with 2x Intel Xeon X5670 CPUs). We also show that a hybridized two-level MPI/CUDA implementation, run on a cluster of GPU nodes, can provide an additional linear speed-up.
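The virtual-particle idea can be sketched as a spring-like force step: each pair of 2-D particles is pulled or pushed toward its original-space distance. The sketch below is a generic serial illustration of such particle dynamics, not the authors' CUDA/MPI implementation; the step size `dt` and the toy data are ours.

```python
import math

def force_step(Y, D, dt=0.1):
    """One step of particle-dynamics MDS: every pair of 2-D particles is
    pulled/pushed toward its target distance D[i][j]."""
    n = len(Y)
    F = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dx = Y[j][0] - Y[i][0]
            dy = Y[j][1] - Y[i][1]
            d = math.hypot(dx, dy) or 1e-12
            k = (d - D[i][j]) / d          # spring force toward target distance
            F[i][0] += k * dx; F[i][1] += k * dy
            F[j][0] -= k * dx; F[j][1] -= k * dy
    return [(y[0] + dt * f[0], y[1] + dt * f[1]) for y, f in zip(Y, F)]
```

The pairwise force loop is the O(M^2) part; it is also embarrassingly parallel, which is why a GPU implementation pays off.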
Lecture Notes in Computer Science, 2015
With the rise of Big Data, the challenge for modern multidimensional data analysis and visualization is that data grow very quickly in size and complexity. In this paper, we first present a classification method called the 5Ws Dimensions, which classifies multidimensional data according to the 5Ws definitions. The 5Ws Dimensions can be applied to multiple kinds of datasets, such as text, audio and video datasets. Second, we establish a Pair-Density model to analyze data patterns and compare multidimensional data on the 5Ws patterns. Third, we create two additional parallel axes using pair-density for visualization. The attributes are shrunk to reduce data overcrowding in pair-density parallel coordinates, achieving more than 80% clutter reduction without loss of information. The experiments show that our model can be efficiently used for Big Data analysis and visualization.
This paper proposes a modification of the Sammon map algorithm for data visualisation. The modification, known as the Sparse Approximated Sammon Stress (SASS), allows mappings to be produced for very large data sets of the order of 10^6 points. While the technique may be useful in a variety of applications, the results presented here demonstrate its usefulness for visualising patient deterioration in vital-sign data collected from step-down-unit hospital patients. A final result demonstrates an application of the SASS visualisation to drug safety analysis.
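The scaling idea of restricting Sammon stress to a sparse subset of pairs can be sketched as below. This is our illustration of the general principle under that assumption; the exact approximation and pair-sampling scheme used by SASS may differ.

```python
import math

def sparse_sammon_stress(X, Y, pairs):
    """Sammon stress evaluated only on a sampled subset of pairs, so the
    cost grows with |pairs| rather than with all O(M^2) pairs.
    X: original high-dimensional points, Y: their low-dimensional images."""
    num = den = 0.0
    for i, j in pairs:
        dij = math.dist(X[i], X[j])    # original-space distance
        eij = math.dist(Y[i], Y[j])    # embedded-space distance
        if dij > 0:
            num += (dij - eij) ** 2 / dij
            den += dij
    return num / den
```

With a pair budget linear in the number of points, each gradient evaluation over this stress also stays linear, which is what makes 10^6-point mappings feasible.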
Procedia Computer Science, 2015
The embedding of high-dimensional data into 2D (or 3D) space is the most popular way of visualizing data. Despite recent advances in the development of very accurate dimensionality reduction algorithms, such as BH-SNE, Q-SNE and LoCH, their relatively high computational complexity remains an obstacle to interactive visualization of truly large sets of high-dimensional data. We show that a new clone of the multidimensional scaling (MDS) method, nr-MDS, can be up to two orders of magnitude faster than modern dimensionality reduction algorithms. We postulate its linear O(M) computational and memory complexity. Simultaneously, our method preserves in 2D and 3D target spaces the high separability of data, similar to that obtained by state-of-the-art dimensionality reduction algorithms. We present the effects of applying nr-MDS to the visualization of data repositories such as 20 Newsgroups (M=18000), MNIST (M=70000) and REUTERS (M=267000).
Proceedings of the 2014 SIAM International Conference on Data Mining, 2014
Kernel k-means is an effective method for data clustering which extends the commonly used k-means algorithm to work on a similarity matrix over complex data structures. The kernel k-means algorithm is, however, computationally very complex, as it requires the complete kernel matrix to be calculated and stored. Further, the kernelized nature of the algorithm hinders the parallelization of its computations on modern infrastructures for distributed computing. In this paper, we define a family of kernel-based low-dimensional embeddings that allows for scaling kernel k-means on MapReduce via an efficient and unified parallelization strategy. We then propose two methods for low-dimensional embedding that adhere to our definition of the embedding family. Exploiting the proposed parallelization strategy, we present two scalable MapReduce algorithms for kernel k-means. We demonstrate the effectiveness and efficiency of the proposed algorithms through an empirical evaluation on benchmark data sets.
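One well-known embedding of this flavour is random Fourier features for the RBF kernel: explicit vectors whose dot products approximate kernel values, so plain (and easily data-parallel) k-means on the vectors approximates kernel k-means without materialising the kernel matrix. Whether this matches the paper's two proposed embeddings is not stated in the abstract, so the sketch is only an example of what the embedding family is for.

```python
import math
import random

def rff_embedding(X, gamma, dim, rng=random.Random(0)):
    """Random Fourier features: z(x) . z(y) ~= exp(-gamma * ||x - y||^2),
    so ordinary k-means on z(x) approximates RBF-kernel k-means."""
    d = len(X[0])
    # Frequencies drawn from N(0, 2*gamma), one row per output feature.
    W = [[rng.gauss(0, math.sqrt(2 * gamma)) for _ in range(d)]
         for _ in range(dim)]
    b = [rng.uniform(0, 2 * math.pi) for _ in range(dim)]
    scale = math.sqrt(2.0 / dim)
    return [[scale * math.cos(sum(wk * xk for wk, xk in zip(w, x)) + bi)
             for w, bi in zip(W, b)] for x in X]
```

Once the data are embedded, each k-means assignment step is a stateless map over points, which is exactly what MapReduce parallelizes well.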
Data visualization is a very useful method of data processing and data mining. This paper deals with an approach to visualizing large multivariate data sets using parallel coordinates. The following text gives a detailed description of the process of creating a parallel coordinates graph, which can be used for exploring large multivariate data sets.
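The core computation behind a parallel coordinates graph is simple: each record becomes a polyline with one vertex per attribute axis, the value rescaled to the axis range. A minimal sketch of that mapping (names and scaling convention are ours):

```python
def parallel_coords_polyline(row, mins, maxs):
    """Vertices of one record's polyline: axis k sits at x = k, and the
    attribute value is min-max scaled to [0, 1] on the vertical axis."""
    return [(k, (v - mins[k]) / (maxs[k] - mins[k]))
            for k, v in enumerate(row)]
```

Drawing one such polyline per record, on shared vertical axes, yields the parallel coordinates plot; for large data sets the polylines are usually blended or binned to manage overplotting.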
Big Data in Bioeconomy
In this chapter, we introduce the topic of big data visualization with a focus on the challenges related to geospatial data. We present several efficient techniques to address these challenges. We then provide examples from the DataBio project of visualisation solutions. These examples show that there are many technologies and software components available for big data visualisation, but they also point to limitations and the need for further research and development.
Abstract—Technical advancements produce huge amounts of scientific data, usually in high-dimensional formats, and it is becoming more important to analyze such large-scale high-dimensional data. Dimension reduction is a well-known approach for high-dimensional data visualization, but it can be very time- and memory-demanding for large problems.
2002
SUMMARY. The analysis of high-dimensional data offers a great challenge to the analyst because human intuition about the geometry of high dimensions fails. We have found that a combination of three basic techniques proves to be extraordinarily effective for visualizing large, high-dimensional data sets. Two important methods for visualizing high-dimensional data involve the parallel coordinate system and the grand tour.
Distributed and Parallel Databases, 2015
In this paper, we study how to visualize large amounts of multidimensional data with a radial visualization. For such a visualization, we study a multithreaded implementation on the CPU and the GPU. We start by reviewing the approaches that have visualized the largest multidimensional datasets, focusing on those that have used CPU or GPU parallelization. We consider radial visualizations and describe our approach (called POIViz), which uses points of interest to determine the layout of a large dataset. We detail its parallelization on the CPU and the GPU. We study the efficiency of this approach with different configurations and for large datasets. We show that it can visualize, in less than one second, millions of data points with tens of dimensions, and that it supports "real-time" interaction even for large datasets. We conclude with the advantages and limits of the proposed visualization.
The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 2015
Processing massive datasets that do not fit in the main memory of a computer is challenging. This is especially true in the case of map generalization, where the relationships between (nearby) features in the map must be considered. In our case, an automated map generalization process runs offline to produce a dataset suitable for visualization at arbitrary map scale (vario-scale), efficiently enabling smooth zoom interactions over the web. Our solution for generalizing such large vector datasets is based on the idea of subdividing the workload according to the Fieldtree organization: a multi-level structure of space. It subdivides space regularly into fields (grid cells), at every level with a shifted origin. Only features completely fitting within a field are processed. Due to the Fieldtree organization, features on a boundary at a given level will be contained completely in one of the fields at a higher level. Every field that resides at the same level of the Fieldtree can be processed in parallel, which is advantageous for processing on multicore computer systems. We have tested our method on datasets with up to 880 thousand objects on a machine with 16 cores, resulting in a decrease of runtime by a factor of 27 compared to a single sequential process run. This more-than-linear speed-up also indicates an interesting algorithmic side-effect of our approach.
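The field-assignment rule can be sketched under a simplified reading of the Fieldtree: cell size doubles and the grid origin shifts by half a cell at each level, and a feature is assigned to the lowest level whose grid contains its bounding box in a single cell. The actual Fieldtree variant, growth factor, and shift scheme used in the paper may differ; this sketch only illustrates why boundary-crossing features find a home at a higher level.

```python
def field_for(bbox, cell0=1.0, max_level=10):
    """Return (level, cell) of the lowest-level field that fully contains
    the bounding box (xmin, ymin, xmax, ymax). Each level doubles the
    cell size and shifts the grid origin by half the previous cell."""
    xmin, ymin, xmax, ymax = bbox
    size, shift = cell0, 0.0
    for level in range(max_level):
        cx0 = int((xmin - shift) // size)
        cy0 = int((ymin - shift) // size)
        cx1 = int((xmax - shift) // size)
        cy1 = int((ymax - shift) // size)
        if cx0 == cx1 and cy0 == cy1:      # fits entirely in one cell
            return level, (cx0, cy0)
        shift += size / 2.0
        size *= 2.0
    return max_level, None
```

All features that land at the same level can then be generalized in parallel, one worker per field.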
2014
This thesis studies the scalability of the similarity search problem in large-scale multidimensional data. Similarity search, which translates into the neighbour search problem, finds many applications in information retrieval, visualization, machine learning and data mining. The current exponential growth of data motivates the need for approximate and scalable algorithms. In most existing algorithms and data structures, there is a trade-off to be found between efficiency, complexity, scalability and memory efficiency. To address these issues, we explore recent techniques for similarity search. One remarkable recent technique is Permutation-Based Indexing: data objects are represented by a list of pivots, ordered with respect to their distances from the object. The similarity between two objects is then estimated from these lists, following the idea that neighbouring objects have similar neighbourhoods. In this thesis, we introduce a formal representation of the permutation-based indexing model. In particular, we introduce different strategies for selecting the pivots, pursuing the goals of high efficiency and low query response time. We propose a scalable, fast and memory-efficient data structure for nearest-neighbour search. We provide models for permutation-based indexing on shared-memory architectures, including CPU and GPU. In addition, we propose several distributed models for permutation-based indexing using MPI and MapReduce. In doing so, we provide an enhanced programming model using MPI to further speed up the MapReduce strategy. We analyse the proposed techniques and algorithms using standard public large-scale datasets containing millions of objects, and give comparisons to the most recent data structures and algorithms for similarity search.
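The permutation-based representation can be sketched directly: each object is encoded as the ordering of pivots by distance, and two objects are compared by how much their pivot orderings disagree (here via the Spearman footrule, one common choice; the thesis may use a different list comparison).

```python
import math

def pivot_permutation(x, pivots, dist):
    """Represent x as the list of pivot indices ordered by distance from x."""
    return sorted(range(len(pivots)), key=lambda p: dist(x, pivots[p]))

def footrule(perm_a, perm_b):
    """Spearman footrule between two pivot permutations: the smaller the
    value, the more similar the two objects' views of the pivot set."""
    pos_b = {p: i for i, p in enumerate(perm_b)}
    return sum(abs(i - pos_b[p]) for i, p in enumerate(perm_a))
```

Because only short integer lists are stored and compared, the index is compact and each comparison is cheap, which is what makes the scheme attractive for distributed (MPI/MapReduce) deployments.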
2019
Exploring large and complex data sets is a crucial capability in a digital library framework. To find a specific data set within a large repository, visualisation can help validate the content beyond the textual description. However, even with existing visual tools, the difficulty of handling large-scale data, in terms of both size and heterogeneity, impedes building visualisation into the digital library framework, hindering the effectiveness of large-scale data exploration. The scope of this research is managing Big Data and ultimately visualising the core information of the data itself. Specifically, I study three large-scale experiments that feature two Big Data challenges: large data size (Volume) and heterogeneous data (Variety), and provide the final visualisation through the web browser, in which the size of the input data has to be reduced while preserving the vital information. Despite the intimidating size, i.e., approximately 30 GB, and the complexity of th...