Multicore machines are becoming common. Many languages, language extensions and libraries are devoted to improving the programmability and performance of these machines. In this paper we compare two libraries that approach the problem of programming multicores from two different perspectives: task parallelism and data parallelism. The Intel Threading Building Blocks (TBB) library separates logical task patterns, which are easy to understand, from physical threads, and delegates the scheduling of the tasks to the system. On the other hand, Hierarchically Tiled Arrays (HTAs) are data structures that facilitate the locality and parallelism of array-intensive computations with a block-recursive nature, following a data-parallel paradigm. Our comparison considers both ease of programming and the performance obtained with each approach. In our experience, HTA programs tend to be shorter than or as long as TBB programs, while the performance of both approaches is very similar.
ACM Transactions on Architecture and Code Optimization, 2007
The performance of memory hierarchies, in which caches play an essential role, is critical in today's general-purpose and embedded computing systems because of the growing memory bottleneck problem. Unfortunately, cache behavior is very unstable and difficult to predict. This is particularly true in the presence of irregular access patterns, which exhibit little locality. Such patterns are very common, for example, in applications in which pointers or compressed sparse matrices give rise to indirections. Nevertheless, cache behavior in the presence of irregular access patterns has not been widely studied. In this paper we present an extension of a systematic analytical modeling technique based on PMEs (probabilistic miss equations), previously developed by the authors, that allows the automated analysis of cache behavior for codes with irregular access patterns resulting from indirections. The model generates very accurate predictions despite the irregularities and has very low computing requirements; it is the first model that combines these desirable characteristics and can automatically analyze this kind of code. These properties enable the model to help drive compiler optimizations, as we show with an example.
Several analytical models that predict the memory hierarchy behavior of codes with regular access patterns have been developed. These models help understand this behavior and can successfully guide compilers in the application of locality-related optimizations while requiring small computing times. Still, these models suffer from many limitations. The most important is their restricted scope of applicability, since real codes exhibit many access patterns they cannot model. The most common source of such accesses is irregular access patterns caused by either data-dependent conditionals or indirections in the code. This paper extends the Probabilistic Miss Equations (PME) model to cope with codes that include data-dependent conditional structures. The approach is systematic enough to enable the automatic implementation of the extended model in a compiler framework. Validations show a good degree of accuracy in the predictions despite the irregularity of the access patterns. This opens the possibility of using our model to guide compiler optimizations for this kind of code.
Real-time systems are subject to temporal constraints and require a schedulability analysis to ensure that task execution finishes within lower and upper specified bounds. Worst-case memory performance (WCMP) plays a key role in the calculation of the upper bound of the execution time. Data caches complicate the calculation of the WCMP, since their behavior is highly dependent on the sequence of memory addresses accessed, which is often not available. For example, the address of a data structure may not be available at compile time, and it may change between different executions of the program. We present an analytical model that provides fast, safe and tight estimations of the WCMP component of the worst-case execution time using no information about the base addresses of the data. The address-independent absolute WCMP for codes with references that follow the same access pattern can be very high with respect to the average behavior, because those references may be aligned with respect to the cache, generating systematic interferences among them. Our model can also provide a tighter, safe estimation of the WCMP for these codes when the user avoids such alignments.
The increasing gap between the speed of the processor and that of the memory makes the role played by the memory hierarchy essential to system performance. There are several methods for studying this behavior. Trace-driven simulation has been the most widely used to date. Nevertheless, analytical modeling requires shorter computing times and provides more information. In recent years a series of fast and reliable strategies for modeling set-associative caches with an LRU replacement policy has been presented. However, none of them has considered the modeling of codes with data-dependent conditionals. In this article we present an extension of one of them in this direction.
Understanding and improving memory hierarchy behavior is one of the most important challenges in current architectures. Analytical models are a good approach for this, but they have traditionally been limited by either their restricted scope of application or their lack of accuracy. Most models can only predict the cache behavior of codes that generate regular access patterns. The Probabilistic Miss Equations (PME) model is nevertheless able to accurately model the cache behavior of codes with irregular access patterns due to data-dependent conditionals or indirections. Its main limitation is that it only considers irregular access patterns that exhibit a uniform distribution of the accesses. In this work, we extend the PME model to enable the analysis of more realistic and complex irregular accesses. Namely, we consider the indirections that arise from the compressed storage of most real banded matrices.
As the memory bottleneck problem continues to grow, so does the relevance of techniques that help improve program locality. A well-known technique in this category is tiling, which decomposes the data sets used several times in a computation into a series of tiles that are reused before proceeding to the next tile. This way, capacity misses are avoided. Finding the optimal tile size is a complex task. In this paper we present and compare a series of strategies to search for the optimal tile size, guided by an analytical model of the whole memory hierarchy and the CPU behavior. Our experiments show that our strategies find better tile sizes than the traditional heuristic approaches proposed in the literature while requiring a small compile-time overhead. Iterative compilation can yield better results, but at the expense of very large overheads.
Concurrency and Computation: Practice and Experience, 2007
The memory hierarchy plays an essential role in the performance of current computers, so good analysis tools that help predict and understand its behavior are required. Analytical modeling is the ideal basis for such tools if its traditional limitations in accuracy and scope of application are overcome. While there has been extensive research on the modeling of codes with regular access patterns, less attention has been paid to codes with irregular patterns due to the increased difficulty of analyzing them. Nevertheless, many important applications exhibit this kind of pattern, and their lack of locality makes them more cache-demanding, which makes their study more relevant. The focus of this paper is the automation of the Probabilistic Miss Equations (PME) model, an analytical model of cache behavior that provides fast and accurate predictions for codes with irregular access patterns. The paper defines the information requirements of the PME model and describes its integration in the XARK compiler, a research compiler oriented to automatic kernel recognition in scientific codes. We show how to exploit the powerful information-gathering capabilities provided by this compiler to allow automated modeling of loop-oriented scientific codes. Experimental results that validate the correctness of the automated PME model are also presented. The CRS (Compressed Row Storage) format stores sparse matrices by rows in a compressed way using three vectors: one vector stores the nonzeros of the sparse matrix ordered by rows, another stores the column indices of the corresponding nonzeros, and a third stores the position in the other two vectors where the nonzeros of each row begin. In the example code these vectors are called A, C and R, respectively.
Analytical modeling is one of the most interesting approaches to evaluating memory hierarchy behavior. Unfortunately, models have many limitations regarding the structure of the code they can be applied to, particularly when the path of execution depends on conditions calculated at run time from the input or intermediate data. In this paper we extend in this direction a modular analytical modeling technique that provides very accurate estimations of the number of misses produced by codes with regular access patterns and structures while having a low computing cost. Namely, we have extended this model to analyze codes with data-dependent conditionals. In a previous work we studied how to analyze codes with a single, simple conditional statement. In this work we introduce and validate a general and completely systematic strategy that enables the analysis of codes with any number of conditionals, possibly nested in any arbitrary way, while allowing the conditionals to depend on any number of items and atomic conditions.
While caches are essential to reduce execution time and power consumption, they complicate the estimation of the Worst-Case Execution Time (WCET), which is crucial for many Real-Time Systems (RTS). Most research on static worst-case cache behavior prediction has focused on hard RTS, which need complete information on the access patterns and addresses of the data to guarantee that the predicted WCET is a safe upper bound of any execution time. Access patterns are available in codes that reach a steady state of access patterns after the first iteration of a loop (in the following, regular codes); however, the addresses of the data are not always known at compile time for many reasons: stack variables, dynamically allocated memory, modules compiled separately, etc. Even when available, their usefulness for predicting cache behavior in systems with virtual memory decreases in the presence of physically-indexed caches. In this paper we present a model that predicts a reasonable bound of the worst-case behavior of data caches during the execution of regular codes without information on the base addresses of the data structures. In 99.7% of our tests the number of misses remained below the bound predicted by the model. This makes the model a valuable tool, particularly for non-RTS and soft RTS, which tolerate a percentage of runs exceeding their deadlines.
Multicore systems are becoming common, while programmers can no longer rely on growing clock rates to speed up their applications. Thus, software developers are increasingly exposed to the complexity associated with programming parallel shared-memory environments. Intel Threading Building Blocks (TBB) is a library that facilitates the programming of this kind of system. Its key notion is to separate logical task patterns, which are easy to understand, from physical threads, and to delegate the scheduling of the tasks to the system. On the other hand, Hierarchically Tiled Arrays (HTAs) are data structures that facilitate the locality and parallelism of array-intensive computations with a block-recursive nature. The model underlying HTAs provides programmers with a single-threaded view of the execution. The HTA implementation in C++ has recently been extended to support multicore machines. In this work we implement several algorithms using both libraries in order to compare the ease of programming and the relative performance of both approaches.
Papers by Diego Andrade