2019
We propose algorithms for performing multiway joins on a new type of coarse-grained reconfigurable hardware accelerator, "Plasticine", that, compared with other accelerators, emphasizes high compute capability and high on-chip communication bandwidth. Joining three or more relations in a single step, i.e., a multiway join, is efficient when the join of any two relations would yield too large an intermediate relation. We show at least a 200X speedup over a CPU for a sequence of binary hash joins executed on Plasticine. We further show that in some realistic cases, a Plasticine-like accelerator can make 3-way joins more efficient than a cascade of binary hash joins on the same hardware, by a factor of up to 45X.
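To make the comparison concrete, here is a minimal sketch in plain Python (not the Plasticine design; the relations R(a,b), S(b,c), T(c,a) and their contents are illustrative) contrasting a cascade of two binary hash joins, which materializes the R-S intermediate, with a single 3-way pass that probes hash indexes on S and T directly:

    from collections import defaultdict

    R = [(1, 2), (1, 3)]   # tuples (a, b)
    S = [(2, 4), (3, 4)]   # tuples (b, c)
    T = [(4, 1)]           # tuples (c, a)

    def binary_cascade(R, S, T):
        # First join R.b = S.b, materializing the intermediate result.
        s_index = defaultdict(list)
        for b, c in S:
            s_index[b].append(c)
        intermediate = [(a, b, c) for a, b in R for c in s_index[b]]
        # Then filter the intermediate against T.
        t_set = {(c, a) for c, a in T}
        return [(a, b, c) for a, b, c in intermediate if (c, a) in t_set]

    def three_way(R, S, T):
        # One pass over R, probing hash indexes on S and T directly,
        # so the R-S intermediate is never materialized.
        s_index = defaultdict(list)
        for b, c in S:
            s_index[b].append(c)
        t_set = {(c, a) for c, a in T}
        return [(a, b, c) for a, b in R
                for c in s_index[b] if (c, a) in t_set]

    assert binary_cascade(R, S, T) == three_way(R, S, T)

When the R-S intermediate is large relative to the final result, skipping its materialization is exactly where the multiway approach wins.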
Conference on Innovative Data Systems Research, 2015
Large relational databases often rely on fast join implementations for good performance. Recent paradigm shifts in processor architectures have reinvigorated research into how the join operation can be implemented. The FPGA community has also been developing new architectures with the potential to push performance even further. Hashing is a common method used to implement joins, but its poor spatial locality can hinder performance on processor architectures. Multithreaded architectures can better cope with poor spatial locality by masking memory/cache latencies with many outstanding requests. In this paper we present the first end-to-end in-memory FPGA hash join implementation. The FPGA uses massive multithreading during the build and probe phases to mask long memory delays, while it concurrently manages hundreds of thread states locally. Throughput results show a speedup between 2x and 3.4x over the best multi-core approaches with comparable memory bandwidths on uniform and skewed datasets; however, this advantage diminishes for extremely skewed datasets. Results using the FPGA's full 76.8 GB/s of bandwidth show throughput up to 1.6 billion tuples per second for uniform and skewed datasets.
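As a reference for the two phases named above, a minimal sequential build/probe hash join in Python follows; the FPGA design keeps hundreds of such probes in flight concurrently to mask memory latency, which this single-threaded sketch does not attempt to model:

    from collections import defaultdict

    def hash_join(build_side, probe_side, build_key, probe_key):
        # Build phase: hash the (usually smaller) build relation.
        table = defaultdict(list)
        for row in build_side:
            table[row[build_key]].append(row)
        # Probe phase: stream the probe relation and emit matches.
        for row in probe_side:
            for match in table[row[probe_key]]:
                yield match + row

    orders = [(1, 'a'), (2, 'b')]          # (order_id, item)
    users  = [(10, 1), (11, 2), (12, 2)]   # (user_id, order_id)
    print(list(hash_join(orders, users, build_key=0, probe_key=1)))
    # [(1, 'a', 10, 1), (2, 'b', 11, 2), (2, 'b', 12, 2)]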
The architectural changes introduced with multi-core CPUs have triggered a redesign of main-memory join algorithms. In the last few years, two diverging views have appeared. One approach advocates careful tailoring of the algorithm to the parameters of the architecture (cache sizes, TLB, and memory bandwidth). The other approach argues that modern hardware is good enough at hiding cache and TLB miss latencies and, consequently, the careful tailoring can be omitted without sacrificing performance.
2020
Benefiting from fine-grained parallelism and energy efficiency, heterogeneous computing platforms featuring FPGAs are becoming increasingly common in data centers. The hash join is one of the most costly operators in database systems, and accelerating the hash join as a whole task on discrete FPGA platforms has been explored for a long time. Recently, emerging coupled CPU-FPGA architectures enable flexible task placement between the CPU and the FPGA by eliminating the high synchronization overhead of CPU-to-device data copies and the high latency of the on-board PCIe bus. However, the opportunities they bring to hash joins are still under-explored. In this paper, we explore hash join acceleration on such a platform with the OpenCL high-level synthesis design methodology. We quantitatively analyze the performance of different workload placements between CPU and FPGA with a roofline model and propose the best design on current hardware. We also point out that...
2022
The relational equality-join is one of the most performance-critical operators in database management systems. For this reason, there have been many attempts to implement this operator on FPGAs in various sort-merge and hash join variants. However, most achieve suboptimal performance because they ineffectively use the limited memory bandwidths of current FPGA platforms. In this paper, we present an FPGA-based implementation of the partitioned hash join (PHJ), where both PHJ phases are executed on the FPGA. Contrary to prior work, we consider a commonly used PCIe-attached FPGA card with dedicated on-board memory. We discuss how to utilize this on-board memory effectively and propose a solution that uses this memory to store partitioned tuples, minimizing data transfers to system memory and thus optimally using the available bandwidth. In our experimental evaluation we demonstrate up to 2x faster end-to-end join times than state-of-the-art 32-threaded hash join implementations on the CPU.
Proceedings of the …, 2009
Join is an important database operation. As computer architectures evolve, the best join algorithm may change hands. This paper reexamines two popular join algorithms, hash join and sort-merge join, to determine whether the latest computer architecture trends shift the tide that has favored hash join for many years. For a fair comparison, we implemented the most optimized parallel versions of both algorithms on the latest Intel Core i7 platform. Both implementations scale well with the number of cores in the system and take advantage of the latest processor features. Our hash-based implementation achieves more than 100M tuples per second, which is 17X faster than the best reported performance on CPUs and 8X faster than that reported for GPUs. Moreover, the performance of our hash join implementation is consistent over a wide range of input data sizes, from 64K to 128M tuples, and is not affected by data skew. We compare this implementation to our highly optimized sort-based implementation, which achieves 47M to 80M tuples per second. We developed analytical models to study how both algorithms would scale with upcoming processor architecture trends. Our analysis projects that the current architectural trends of wider SIMD, more cores, and smaller memory bandwidth per core imply better scalability potential for sort-merge join. Consequently, sort-merge join is likely to outperform hash join on upcoming chip multiprocessors. In summary, we offer multicore implementations of hash join and sort-merge join that consistently outperform all previously reported results. We further conclude that the tide favoring the hash join algorithm has not changed yet, but the change is just around the corner.
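For contrast with the hash-based variant sketched earlier, here is a minimal single-threaded sort-merge join on single-column keys (illustrative only; the paper's SIMD- and cache-optimized implementations are far more involved):

    def sort_merge_join(left, right):
        # Sort both inputs, then advance two cursors in lockstep.
        left, right = sorted(left), sorted(right)
        i = j = 0
        out = []
        while i < len(left) and j < len(right):
            if left[i][0] < right[j][0]:
                i += 1
            elif left[i][0] > right[j][0]:
                j += 1
            else:
                # Emit the cross product of the two equal-key runs.
                key, i0, j0 = left[i][0], i, j
                while i < len(left) and left[i][0] == key:
                    i += 1
                while j < len(right) and right[j][0] == key:
                    j += 1
                out.extend((l, r) for l in left[i0:i] for r in right[j0:j])
        return out

    print(sort_merge_join([(1, 'a'), (2, 'b'), (2, 'c')], [(2, 'x'), (3, 'y')]))
    # [((2, 'b'), (2, 'x')), ((2, 'c'), (2, 'x'))]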
IEEE Transactions on Knowledge and Data Engineering, 2002
In the past decade, the exponential growth in commodity CPU speed has far outpaced advances in memory latency. A second trend is that CPU performance advances come not only from increased clock rates, but also from increasing parallelism inside the CPU. Current database systems have not yet adapted to these trends and show poor utilization of both CPU and memory resources on current hardware. In this paper, we show how these resources can be optimized for large joins and translate these insights into guidelines for future database architectures, encompassing data structures, algorithms, cost modeling, and implementation. In particular, we discuss how vertically fragmented data structures optimize cache performance on sequential data access. On the algorithmic side, we refine the partitioned hash-join with a new partitioning algorithm called radix-cluster, which is specifically designed to optimize memory access. The performance of this algorithm is quantified using a detailed analytical model that incorporates memory access costs in terms of a limited number of parameters, such as cache sizes and miss penalties. We also present a calibration tool that extracts such parameters automatically from any computer hardware. The accuracy of our models is confirmed by exhaustive experiments conducted with the Monet database system on three different hardware platforms. Finally, we investigate the effect of implementation techniques that optimize CPU resource usage. Our experiments show that large joins can be accelerated almost an order of magnitude on modern RISC hardware when both memory and CPU resources are optimized.
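The core idea of radix-cluster can be sketched in a few lines: partition on a few low-order hash bits per pass, so each pass scatters into at most 2^B partitions and the number of concurrently touched pages stays within TLB and cache limits. The helpers below are hypothetical illustrations in Python; the pass structure and bit widths are assumptions, not the paper's tuned parameters.

    def radix_partition(tuples, key, bits, shift):
        # Scatter tuples into 2^bits partitions on hash bits [shift, shift+bits).
        fanout = 1 << bits
        parts = [[] for _ in range(fanout)]
        for t in tuples:
            parts[(hash(key(t)) >> shift) & (fanout - 1)].append(t)
        return parts

    def radix_cluster(tuples, key, bits_per_pass=(4, 4)):
        # Several passes with small fanout instead of one pass with huge fanout.
        clusters, shift = [tuples], 0
        for bits in bits_per_pass:
            clusters = [p for c in clusters
                        for p in radix_partition(c, key, bits, shift)]
            shift += bits
        return clusters

    rows = [(k, 'payload%d' % k) for k in range(1000)]
    clusters = radix_cluster(rows, key=lambda t: t[0])
    assert sum(len(c) for c in clusters) == len(rows)

Clustering both join inputs with the same passes lets each pair of matching clusters be joined entirely in cache.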
Sigmod Record, 1993
In this paper, we develop algorithms to achieve optimal processor allocation for pipelined hash joins in a multiprocessor-based database system. A pipeline of hash joins is composed of several stages, each of which is associated with one join operation.
Proceedings. 20th International Conference on Data Engineering, 2004
This paper introduces the hash-merge join algorithm (HMJ, for short); a new non-blocking join algorithm that deals with data items from remote sources via unpredictable, slow, or bursty network traffic. The HMJ algorithm is designed with two goals in mind: (1) Minimize the time to produce the first few results, and (2) Produce join results even if the two sources of the join operator occasionally get blocked. The HMJ algorithm has two phases: The hashing phase and the merging phase. The hashing phase employs an in-memory hash-based join algorithm that produces join results as quickly as data arrives. The merging phase is responsible for producing join results if the two sources are blocked. Both phases of the HMJ algorithm are connected via a flushing policy that flushes in-memory parts into disk storage once the memory is exhausted. Experimental results show that HMJ combines the advantages of two state-of-the-art non-blocking join algorithms (XJoin and Progressive Merge Join) while avoiding their shortcomings.
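The hashing phase described above works like a symmetric hash join: each arriving tuple is inserted into its own side's table and immediately probes the other side's table, so results appear as soon as data arrives. A minimal in-memory Python sketch of that phase (the merging phase and the flushing policy are omitted; the event encoding is an assumption):

    from collections import defaultdict

    def symmetric_hash_join(stream):
        # stream yields ('L' or 'R', key, payload) items in arrival order.
        tables = {'L': defaultdict(list), 'R': defaultdict(list)}
        for side, key, payload in stream:
            other = 'R' if side == 'L' else 'L'
            tables[side][key].append(payload)          # insert into own table
            for match in list(tables[other][key]):     # probe the other table
                yield (key, payload, match) if side == 'L' else (key, match, payload)

    events = [('L', 1, 'x'), ('R', 1, 'y'), ('R', 2, 'z'), ('L', 2, 'w')]
    print(list(symmetric_hash_join(events)))
    # [(1, 'x', 'y'), (2, 'w', 'z')]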
The hash table is a fundamental data structure that provides efficient data storage and access. It is a key component in AI applications which rely on building a model of the environment using observations and performing lookups on the model for newer observations. In this work, we develop FASTHash, a "truly" high throughput parallel hash table implementation using FPGA on-chip SRAM. Contrary to state-of-the-art hash table implementations on CPU, GPU, and FPGA, the parallelism in our design is data independent, allowing us to support p parallel queries (p > 1) per clock cycle via p processing engines (PEs) in the worst case. Our novel data organization and query flow techniques allow full utilization of abundant low latency on-chip SRAM and enable conflict free concurrent insertions. Our hash table ensures relaxed eventual consistency: inserts from a PE are visible to all PEs with some latency. We provide a theoretical worst-case bound on the number of erroneous queries (true negative search, duplicate inserts) due to relaxed eventual consistency. We customize our design to implement both static and dynamic hash tables on state-of-the-art FPGA devices. Our implementations are scalable to 16 PEs and support throughput as high as 5360 million operations per second with PEs running at 335 MHz for static hashing and 4480 million operations per second with PEs running at 280 MHz for dynamic hashing. They outperform state-of-the-art implementations by 5.7x and 8.7x respectively.
2011
Indexed foreign-key joins expose a very asymmetric access pattern: the foreign-key index is scanned sequentially, while the primary-key table is the target of many quasi-random lookups, which are the dominant cost factor. To reduce the cost of the random lookups, the fact table can be (re-)partitioned at runtime to increase access locality on the dimension table and thus confine the random memory accesses to the CPU's cache. However, this is very hard to optimize, and the performance impact on recent architectures is limited because the partitioning costs consume most of the achievable join improvement.
Proceedings of the 2009 ACM symposium on Applied Computing, 2009
Hash joins combine massive relations in data warehouses, decision support systems, and scientific data stores. Faster hash join performance significantly improves query throughput, response time, and overall system performance. In this work, we demonstrate how using join cardinality improves hash join performance. The key contribution is the development of an algorithm to determine join cardinality in an arbitrary query plan. We implemented early hash join and the join cardinality algorithm in PostgreSQL. Experimental results demonstrate that early hash join has an immediate response time that is an order of magnitude faster than the existing hybrid hash join implementation. One-to-one joins execute up to 50% faster and perform significantly fewer I/Os, and one-to-many joins have similar or better performance over all memory sizes.
Proceedings of the 14th International Workshop on Data Management on New Hardware
High-Bandwidth Memory (HBM) offers an additional hardware-side opportunity for performance gains. Its large bandwidth compared to regular DRAM allows high numbers of threads to execute in parallel, masking the penalties of concurrent memory accesses. This is especially interesting for database join algorithms optimized for multicore CPUs, even more so when running on a manycore processor like the Xeon Phi Knights Landing (KNL). The drawbacks of HBM, however, are its small size and its penalties on random memory access patterns. In this paper, we analyze the impact of HBM on join processing, using the KNL manycore architecture as an example. We run main-memory hash join and sort-merge join algorithms of relational DBMSs as well as data stream joins, comparing execution times in different HBM configurations. In addition, we consider data skew and output materialization in our measurements. Our results show performance gains of up to 3x for joins when HBM is used. However, there is still much room for improvement to fully utilize this kind of memory; we therefore give additional advice regarding HBM at the end of this paper.
The Vldb Journal, 2021
The join and group-by aggregation are two memory-intensive operators that affect the performance of relational databases. Hashing is a common approach used to implement both operators. Recent paradigm shifts in multi-core processor architectures have reinvigorated research into how the join and group-by aggregation operators can leverage these advances. However, the poor spatial locality of the hashing approach has hindered performance on multi-core processor architectures, which rely on large cache hierarchies for latency mitigation. Multithreaded architectures can better cope with poor spatial locality by masking memory latency with many outstanding requests. Nevertheless, the number of parallel threads, even in the most advanced multithreaded processors, such as UltraSPARC, is not enough to fully cover the main memory access latency. In this paper, we explore the hardware reconfigurability of FPGAs to enable deeper execution pipelines that maintain hundreds (instead of tens) of outstanding memory requests across four FPGAs, drastically increasing concurrency and throughput. We present two end-to-end in-memory accelerators for the join and group-by aggregation operators using FPGAs. Both accelerators use massive multithreading to mask the long memory delays of traversing linked-list data structures, while concurrently managing hundreds of thread states across four FPGAs locally. We explore how content-addressable memories (CAMs) can be intermixed within our multithreaded designs to act as a synchronizing cache, which enforces locks and merges jobs together before they are written to memory. Throughput results for our hash-join operator accelerator show a speedup between 2x and 3.4x over the best multi-core approaches with comparable memory bandwidths on uniform and skewed datasets. The accelerator for the hash-based group-by aggregation operator demonstrates that leveraging CAMs achieves an average speedup of 3.3x, with a best case of 9.4x, in terms of throughput over CPU implementations across five types of data distributions.
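For orientation, on a CPU the hash-based group-by aggregation being accelerated reduces to one hash lookup and one update per input row, as in this minimal Python sketch (the pointer-chasing on collision chains is what the FPGA multithreading hides):

    from collections import defaultdict

    def hash_group_by_sum(rows, key, value):
        # One hash lookup and in-place update per input row.
        acc = defaultdict(int)
        for row in rows:
            acc[row[key]] += row[value]
        return dict(acc)

    print(hash_group_by_sum([(1, 10), (2, 5), (1, 7)], key=0, value=1))
    # {1: 17, 2: 5}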
Proceedings of the Twelfth International Conference on Data Engineering, 1996
The widening performance gap between CPU and disk is significant for hash join performance. Most current hash join methods try to reduce the volume of data transferred between memory and disk. In this paper, we try to reduce hash-join times by reducing random I/O. We study how current algorithms incur random I/O, and propose a new hash join method, Seq+, that converts much of the random I/O to sequential I/O. Seq+ uses a new organization for hash buckets on disk, and larger input and output buffer sizes. We introduce the technique of batch writes to reduce the bucket-write cost, and the concepts of write- and read-groups of hash buckets to reduce the bucket-read cost. We derive a cost model for our method, and present formulas for choosing various algorithm parameters, including input and output buffer sizes. Our performance study shows that the new hash join method performs many times better than current algorithms under various environments. Since our cost functions underestimate the cost of current algorithms and overestimate the cost of Seq+, the actual performance gain of Seq+ is likely to be even greater.
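A rough sketch of the batch-write idea follows; the in-memory stand-in for disk, the group size, and the flush threshold are illustrative guesses, not the paper's exact organization. Buckets in the same write-group are buffered together and appended to one sequential region, trading random writes for a filtering step at read time:

    from collections import defaultdict

    class BatchWriter:
        def __init__(self, group_size, flush_threshold=4):
            self.group_of = lambda bucket: bucket // group_size
            self.pending = defaultdict(list)   # group -> buffered tuples
            self.log = defaultdict(list)       # group -> sequential "disk" region
            self.flush_threshold = flush_threshold

        def write(self, bucket, tup):
            g = self.group_of(bucket)
            self.pending[g].append((bucket, tup))
            if len(self.pending[g]) >= self.flush_threshold:
                # One sequential write covers the whole write-group.
                self.log[g].extend(self.pending.pop(g))

        def read_bucket(self, bucket):
            # A bucket read scans its group's region and filters.
            g = self.group_of(bucket)
            region = self.log.get(g, []) + self.pending.get(g, [])
            return [t for b, t in region if b == bucket]

    bw = BatchWriter(group_size=4)
    for i in range(10):
        bw.write(i % 16, 'tuple%d' % i)
    print(bw.read_bucket(1))   # ['tuple1']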
Proceedings of the Seventh International Conference on Data Engineering, 1991
Parallel processing of relational queries has received considerable attention of late. However, in the presence of data skew, the speedup from conventional parallel join algorithms can be very limited, due to load imbalances among the various processors. Even a single large skew element can cause a processor to become overloaded.
IEEE Micro, 2000
…comparisons of hashing buckets having the same index. Therefore, the desirable number of processors is limited to the number of tuples in the hashing bucket of the smaller relation, multiplied by the number of processing elements used for each tuple (usually one PE per tuple).
Lecture Notes in Computer Science, 2008
Hash functions play an important role in various cryptographic applications. Modern cryptography relies on a few, but supposedly well-analyzed, hash functions, most of which are members of the so-called MD4 family. This work investigates whether it is possible, using special-purpose hardware, to significantly speed up collision search for MD4-family hash functions. A thorough analysis of the computational requirements of MD4-family hash functions and the corresponding collision attacks reveals that a microprocessor-based architecture is best suited for implementing collision search algorithms. Consequently, we designed and implemented a general-purpose (with respect to MD4-family hash functions) microprocessor with minimal area requirements and, based on this, a full collision search unit. Comparing the performance characteristics of both ASICs with standard PC processors and clusters, it turns out that our design, massively parallelized, is nearly four times more cost-efficient than parallelized standard PCs. With further optimizations, we believe that this factor can be improved even further.
ACM Transactions on Reconfigurable Technology and Systems, 2015
FPGA-based data processing is becoming increasingly relevant in data centers, as the transformation of existing applications into dataflow architectures can bring significant throughput and power benefits. Furthermore, a tighter integration of computing and network is appealing, as it overcomes traditional bottlenecks between CPUs and network interfaces, and dramatically reduces latency. In this article, we present the design of a novel hash table, a fundamental building block used in many applications, to enable data processing on FPGAs close to the network. We present a fully pipelined design capable of sustaining consistent 10Gbps line-rate processing by deploying a concurrent mechanism to handle hash collisions. We address additional design challenges such as support for a broad range of key sizes without stalling the pipeline through careful matching of lookup time with packet reception time. Finally, the design is based on a scalable architecture that can be easily parameterized to work with different memory types operating at different access speeds and latencies. We have tested the proposed hash table in an FPGA-based memcached appliance implementing a main-memory key-value store in hardware. The hash table is used to index 2 million entries in 24GB of external DDR3 DRAM while sustaining 13 million requests per second, the maximum packet rate that can be achieved with UDP packets on a 10Gbps link for this application.
2002
We investigate the effect that caches, particularly caches for remote accesses, have on the performance of hash join algorithms. The join is a computationally intensive operation in relational databases and is used in many important applications; thus, there is a considerable number of studies on the parallel hash join. However, most of the previous research does not show how caches affect the performance of these algorithms. In this paper, we show the impact and benefits of remote caches (netcaches) on the overall performance of parallel hash join algorithms running on SMP clusters. Furthermore, we show the effects of these caches on the speedup and scalability of these algorithms. Our simulation results lead us to conclude that the execution time of hash join algorithms on modern multiprocessors, with large local and remote caches, can be reduced by up to 70%. Finally, we show results verifying the large effect of netcaches on the scalability of these algorithms.