Field programmable gate arrays (FPGA) have transformed digital design by enabling versatile and customizable solutions that balance performance and power efficiency, making them essential for today’s diverse computing challenges. Research in the Netherlands, in both academia and industry, plays a major role in developing innovative FPGA solutions. This survey presents the current landscape of FPGA innovation research in the Netherlands by delving into ongoing projects, advancements, and breakthroughs in the field. Focusing on recent research outcomes (within the past five years), we have identified five key research areas: (a) FPGA architecture, (b) FPGA robustness, (c) data center infrastructure and high-performance computing, (d) programming models and tools, and (e) applications. This survey provides in-depth insights beyond a mere snapshot of the current innovation research landscape by highlighting future research directions within each key area; these insights can serve as a foundational resource to inform potential national-level investments in FPGA technology.
@article{10.3389/fhpcp.2025.1572844,author={Alachiotis, Nikolaos and van den Belt, Sjoerd and van der Vlugt, Steven and van der Walle, Reinier and Safari, Mohsen and Endres Forlin, Bruno and De Matteis, Tiziano and Al-Ars, Zaid and Jordans, Roel and Sousa de Almeida, António J. and Corradi, Federico and Baaij, Christiaan and Varbanescu, Ana-Lucia},title={FPGA innovation research in the Netherlands: present landscape and future outlook},journal={Frontiers in High Performance Computing},volume={3},year={2025},url={https://www.frontiersin.org/journals/high-performance-computing/articles/10.3389/fhpcp.2025.1572844},doi={10.3389/fhpcp.2025.1572844},issn={2813-7337},}
ICSA25
How Does Microservice Granularity Impact Energy Consumption and Performance? A Controlled Experiment
Yiming Zhao, Tiziano De Matteis, and Justus Bogner
In 2025 IEEE 22nd International Conference on Software Architecture (ICSA) , 2025
Context: Microservice architectures are a widely used software deployment approach, with benefits regarding flexibility and scalability. However, their impact on energy consumption is poorly understood, and often overlooked in favor of performance and other quality attributes (QAs). One understudied concept in this area is microservice granularity, i.e., over how many services the system functionality is distributed. Objective: We therefore aim to analyze the relationship between microservice granularity and two critical QAs in microservice-based systems: energy consumption and performance. Method: We conducted a controlled experiment using two open-source microservice-based systems of different scales: the small Pet Clinic system and the large Train Ticket system. For each system, we created three levels of granularity by merging or splitting services (coarse, medium, and fine) and then exposed them to five levels of request frequency. Results: Our findings revealed that: i) granularity significantly affected both energy consumption and response time, e.g., in the large system, fine granularity consumed on average 461 J more energy (13%) and added 5.2 ms to response time (14%) compared to coarse granularity; ii) higher request loads significantly increased both energy consumption and response times, with moving from 40 to 400 requests/s resulting in 651 J higher energy consumption (23%) and 41.2 ms longer response times (98%); iii) there is a complex relationship between granularity, system scale, energy consumption, and performance that warrants careful consideration in microservice design. We derive generalizable takeaways from our results. Conclusion: Microservices practitioners should take our findings into account when making granularity-related decisions, especially for large-scale systems.
@inproceedings{10978921,author={Zhao, Yiming and De Matteis, Tiziano and Bogner, Justus},booktitle={2025 IEEE 22nd International Conference on Software Architecture (ICSA)},title={How Does Microservice Granularity Impact Energy Consumption and Performance? A Controlled Experiment},year={2025},pages={84-95},keywords={Energy consumption;Time-frequency analysis;Software architecture;Scalability;Merging;Microservice architectures;Computer architecture;Software;Large-scale systems;Time factors},doi={10.1109/ICSA65012.2025.00018},url={https://doi.org/10.1109/ICSA65012.2025.00018},publisher={IEEE Computer Society},address={Los Alamitos, CA, USA},}
@article{10960277,author={Ziogas, Alexandros Nikolaos and Schneider, Timo and Ben-Nun, Tal and Calotoiu, Alexandru and De Matteis, Tiziano and de Fine Licht, Johannes and Lavarini, Luca and Hoefler, Torsten},journal={IEEE Transactions on Parallel and Distributed Systems},title={Productivity, Portability, Performance, and Reproducibility: Data-Centric Python},year={2025},volume={36},number={5},pages={804-820},keywords={Productivity;Codes;High performance computing;Semantics;Computer architecture;Supercomputers;Software;Field programmable gate arrays;Optimization;Python;Computer languages;Python;high-performance computing;dataflow computing;parallel programming;distributed computing},doi={10.1109/TPDS.2025.3549310},}
CHEOPS25
An I/O Characterizing Study of Offloading LLM Models and KV Caches to NVMe SSD
Zebin Ren, Krijn Doekemeijer, Tiziano De Matteis, and 3 more authors
In Proceedings of the 5th Workshop on Challenges and Opportunities of Efficient and Performant Storage Systems, Rotterdam, Netherlands, 2025
With the popularity of generative AI, LLM inference has become one of the most popular cloud workloads. Modern popular LLMs have hundreds of billions of parameters and support very large input/output prompt token sizes (100K-1M). As a result, their computational state during LLM inference can exceed the memory available on GPUs. One solution to this GPU memory problem is to offload the model weights and KV cache to the host memory. As the size of the models and prompts continues to increase, researchers have started to explore the use of secondary storage, such as SSDs, to store the model weights and KV cache. However, there is a lack of studies on the I/O characteristics and performance requirements of these offloading operations. To better understand the performance characteristics of these offloading operations, in this work, we collect, study, and characterize the block layer I/O traces from two LLM inference frameworks, DeepSpeed and FlexGen, that support model and KV cache offloading to SSDs. Through our analysis of these I/O traces, we report that: (i) libaio-based tensor offloading delivers higher I/O bandwidth for both writing and reading tensors to/from the SSDs than POSIX; (ii) the I/O workload of model offloading is dominated by 128 KiB reads for both DeepSpeed and FlexGen in the block layer; (iii) model offloading does not saturate NVMe SSDs; and (iv) the I/O workload of KV cache offloading contains both read and write workloads dominated by 128 KiB requests, but the average bandwidth of read is much higher than write (2.0 GiB/s vs. 11.0 MiB/s). We open-source the scripts and the I/O traces of this work at https://github.com/stonet-research/cheops25-IO-characterization-of-LLM-model-kv-cache-offloading-nvme
@inproceedings{10.1145/3719330.3721230,author={Ren, Zebin and Doekemeijer, Krijn and De Matteis, Tiziano and Pinto, Christian and Stoica, Radu and Trivedi, Animesh},title={An I/O Characterizing Study of Offloading LLM Models and KV Caches to NVMe SSD},year={2025},isbn={9798400715297},publisher={Association for Computing Machinery},address={New York, NY, USA},url={https://doi.org/10.1145/3719330.3721230},doi={10.1145/3719330.3721230},booktitle={Proceedings of the 5th Workshop on Challenges and Opportunities of Efficient and Performant Storage Systems},pages={23–33},numpages={11},keywords={KV cache offloading, Large language model, Model offloading, SSDs},location={Rotterdam, Netherlands},series={CHEOPS '25},}
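A minimal sketch of the kind of block-layer trace analysis the paper performs (the authors' own scripts live in the linked repository; the one-I/O-per-line CSV format "timestamp_s,op,size_bytes" assumed here is hypothetical):

from collections import Counter

def characterize(path):
    sizes_r, sizes_w = Counter(), Counter()
    bytes_r = bytes_w = 0
    t0, t1 = float("inf"), 0.0
    with open(path) as f:
        for line in f:
            ts, op, size = line.strip().split(",")
            ts, size = float(ts), int(size)
            t0, t1 = min(t0, ts), max(t1, ts)
            if op == "R":
                sizes_r[size] += 1
                bytes_r += size
            else:
                sizes_w[size] += 1
                bytes_w += size
    span = max(t1 - t0, 1e-9)
    print("dominant read size (bytes):", sizes_r.most_common(1))
    print("read bandwidth (GiB/s):", bytes_r / span / 2**30)
    print("write bandwidth (MiB/s):", bytes_w / span / 2**20)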
Datacenter service providers face engineering and operational challenges involving numerous risk aspects. Bad decisions can result in financial penalties, competitive disadvantage, and unsustainable environmental impact. Risk management is an integral aspect of the design and operation of modern datacenters, but frameworks that allow users to conveniently consider various risk trade-offs are missing. We propose RADiCe, an open-source framework that enables data-driven analysis of IT-related operational risks in sustainable datacenters. RADiCe uses monitoring and environmental data and, via discrete event simulation, assists datacenter experts through systematic evaluation of risk scenarios, visualization, and optimization of risks. Our analyses highlight the increasing risk datacenter operators face due to electricity price surges and sustainability concerns, and demonstrate how RADiCe can evaluate and control such risks by optimizing the topology and operational settings of the datacenter. Eventually, RADiCe can evaluate risk scenarios 70x–330x faster than other solutions, opening possibilities for interactive risk exploration.
@article{MASTENBROEK2025107702,title={RADiCe: A Risk Analysis Framework for Data Centers},journal={Future Generation Computer Systems},volume={166},pages={107702},year={2025},issn={0167-739X},doi={10.1016/j.future.2024.107702},url={https://www.sciencedirect.com/science/article/pii/S0167739X24006666},author={Mastenbroek, Fabian and {De Matteis}, Tiziano and {van Beek}, Vincent and Iosup, Alexandru},keywords={Datacenter, Risk assessment, Sustainability, Simulation},}
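To illustrate the idea of simulation-driven risk evaluation, here is a toy Monte Carlo sketch over electricity-price scenarios (this is not RADiCe's API; all constants are made up):

import random

def scenario_cost(hours, it_load_kw, base_price, surge_prob, surge_factor):
    # Energy cost of one simulated scenario with random price surges.
    cost = 0.0
    for _ in range(hours):
        price = base_price * (surge_factor if random.random() < surge_prob else 1.0)
        cost += it_load_kw * price          # kWh consumed in one hour * EUR/kWh
    return cost

costs = sorted(scenario_cost(24 * 365, 500, 0.25, 0.05, 4.0) for _ in range(1000))
print("median annual energy cost (EUR):", round(costs[len(costs) // 2]))
print("95th-percentile scenario (EUR):", round(costs[int(len(costs) * 0.95)]))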
@misc{vanbalen2025comparingparallelfunctionalarray,title={Comparing Parallel Functional Array Languages: Programming and Performance},author={van Balen, David and De Matteis, Tiziano and Grelck, Clemens and Henriksen, Troels and Hsu, Aaron W. and Keller, Gabriele K. and Koopman, Thomas and McDonell, Trevor L. and Oancea, Cosmin and Scholz, Sven-Bodo and Sinkarovs, Artjoms and Smeding, Tom and Trinder, Phil and de Wolff, Ivo Gabe and Ziogas, Alexandros Nikolaos},year={2025},eprint={2505.08906},archiveprefix={arXiv},primaryclass={cs.PL},url={https://arxiv.org/abs/2505.08906},note={(Authors are in alphabetical order)},}
Spatial (dataflow) computer architectures can mitigate the control and performance overheads of classical von Neumann architectures such as traditional CPUs. Driven by the popularity of Machine Learning (ML) workloads, spatial devices are being marketed as ML inference accelerators. Despite providing a rich software ecosystem for ML practitioners, their adoption in other scientific domains is hindered by the steep learning curve and lack of reusable software, which makes them inaccessible to non-experts. We present our ongoing project AIEBLAS, an open-source, expandable implementation of the Basic Linear Algebra Subprograms (BLAS) for the AMD AI Engine. Numerical routines are designed to be easily reused, customized, and composed in dataflow programs, leveraging the characteristics of the targeted device without requiring the user to deeply understand the underlying hardware and programming model.
@misc{aie-blas,title={Developing a BLAS library for the AMD AI Engine},author={Laan, Tristan and De Matteis, Tiziano},year={2024},eprint={2410.00825},archiveprefix={arXiv},primaryclass={cs.DC},url={https://arxiv.org/abs/2410.00825},}
SC24
Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects
Daniele De Sensi, Lorenzo Pichetti, Flavio Vella, and 11 more authors
In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’24), Nov 2024
Multi-GPU nodes are increasingly common in the rapidly evolving landscape of exascale supercomputers. On these systems, GPUs on the same node are connected through dedicated networks, with bandwidths up to a few terabits per second. However, gauging performance expectations and maximizing system efficiency is challenging due to different technologies, design options, and software layers. This paper comprehensively characterizes three supercomputers — Alps, Leonardo, and LUMI — each with a unique architecture and design. We focus on performance evaluation of intra-node and inter-node interconnects on up to 4,096 GPUs, using a mix of intra-node and inter-node benchmarks. By analyzing their limitations and opportunities, we aim to offer practical guidance to researchers, system architects, and software developers dealing with multi-GPU supercomputing. Our results show that there is untapped bandwidth, and there are still many opportunities for optimization, ranging from network to software optimization.
@inproceedings{gpugpuinterconnect,author={De Sensi, Daniele and Pichetti, Lorenzo and Vella, Flavio and De Matteis, Tiziano and Ren, Zebin and Fusco, Luigi and Turisini, Matteo and Cesarini, Daniele and Lust, Kurt and Trivedi, Animesh and Roweth, Duncan and Spiga, Filippo and Di Girolamo, Salvatore and Hoefler, Torsten},title={Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects},year={2024},month=nov,booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'24)},doi={10.1109/SC41406.2024.00039},}
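A small-scale version of such measurements typically starts from a ping-pong benchmark; below is an illustrative sketch, not the paper's benchmark suite (assumes mpi4py, CuPy, and a CUDA-aware MPI; launch with two ranks, e.g. mpirun -np 2):

from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
buf = cp.zeros(1 << 24, dtype=cp.uint8)   # 16 MiB message resident on the GPU
reps = 50

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(reps):
    if rank == 0:
        comm.Send(buf, dest=1)
        comm.Recv(buf, source=1)
    else:
        comm.Recv(buf, source=0)
        comm.Send(buf, dest=0)
comm.Barrier()
if rank == 0:
    t = (MPI.Wtime() - t0) / (2 * reps)   # average one-way time
    print(f"~{buf.nbytes / t / 2**30:.2f} GiB/s one-way")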
HotCloudPerf24
FootPrinter: Quantifying Data Center Carbon Footprint
Dante Niewenhuis, Sacheendra Talluri, Alexandru Iosup, and 1 more author
In Companion of the 15th ACM/SPEC International Conference on Performance Engineering, London, United Kingdom, May 2024
Data centers have become an increasingly significant contributor to the global carbon footprint. In 2021, the global data center industry was responsible for around 1% of the worldwide greenhouse gas emissions. With more resource-intensive workloads, such as Large Language Models, gaining popularity, this percentage is expected to increase further. Therefore, it is crucial for data center service providers to become aware of and accountable for the sustainability impact of their design and operational choices. However, reducing the carbon footprint of data centers has been a challenging process due to the lack of comprehensive metrics, carbon-aware design tools, and guidelines for carbon-aware optimization. In this work, we propose FootPrinter, a first-of-its-kind tool that supports data center designers and operators in assessing the environmental impact of their data center. FootPrinter uses coarse-grained operational data, grid energy mix information, and discrete event simulation to determine the data center’s operational carbon footprint and evaluate the impact of infrastructural or operational changes. FootPrinter can simulate days of operations of a regional data center on a commodity laptop in a few seconds, returning the estimated footprint with marginal error. By making this project open source, we hope to engage the community in the development of methodologies and tools for systematically assessing and exploring the sustainability of data centers.
@inproceedings{dniewenhuis_hotcloud_footprinter,author={Niewenhuis, Dante and Talluri, Sacheendra and Iosup, Alexandru and De Matteis, Tiziano},title={FootPrinter: Quantifying Data Center Carbon Footprint},year={2024},isbn={9798400704451},publisher={Association for Computing Machinery},address={New York, NY, USA},url={https://doi.org/10.1145/3629527.3651419},doi={10.1145/3629527.3651419},booktitle={Companion of the 15th ACM/SPEC International Conference on Performance Engineering},pages={189–195},numpages={7},keywords={carbon emission, carbon footprint, data center, simulation},location={London, United Kingdom},series={ICPE '24 Companion},}
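The core accounting idea is compact enough to sketch (illustrative only, not FootPrinter's code): operational carbon is the sum over sampling intervals of drawn power multiplied by the grid carbon intensity.

def operational_carbon_g(power_kw, intensity_g_per_kwh, dt_h=1.0):
    # Per interval: energy (kW * h = kWh) times grid carbon intensity (gCO2/kWh).
    return sum(p * dt_h * ci for p, ci in zip(power_kw, intensity_g_per_kwh))

power = [480, 510, 620, 590]        # kW drawn, hourly samples
intensity = [220, 180, 350, 300]    # gCO2/kWh from the grid energy mix
print(operational_carbon_g(power, intensity) / 1000, "kg CO2")   # 591.4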
ICPE24
The Cost of Simplicity: Understanding Datacenter Scheduler Programming Abstractions
Aratz Manterola Lasa, Sacheendra Talluri, Tiziano De Matteis, and 1 more author
In 15th ACM/SPEC International Conference on Performance Engineering (ICPE’24), May 2024
Schedulers are a crucial component in datacenter resource management. Each scheduler offers different capabilities, and users use them through their APIs. However, there is no clear understanding of what programming abstractions they offer, nor why they offer some and not others. Consequently, it is difficult to understand their differences and the performance costs imposed by their APIs. In this work, we study the programming abstractions offered by industrial schedulers, their shortcomings, and their related performance costs. We propose a general reference architecture for scheduler programming abstractions. Specifically, we analyze the programming abstractions of five popular industrial schedulers, understand the differences in their APIs, and identify the missing abstractions. Finally, we carry out exemplary experiments using trace-driven simulation demonstrating that an API extension, such as container migration, can improve total execution time per task by 81%, highlighting how schedulers sacrifice performance by implementing simpler programming abstractions. All the relevant software and data artifacts are publicly available at https://github.com/atlarge-research/quantifying-api-design.
@inproceedings{2024-icpe-datacenter-scheduler,author={Lasa, Aratz Manterola and Talluri, Sacheendra and De Matteis, Tiziano and Iosup, Alexandru},title={The Cost of Simplicity: Understanding Datacenter Scheduler Programming Abstractions},booktitle={15th ACM/SPEC International Conference on Performance Engineering (ICPE'24)},publisher={{ACM}},year={2024},url={https://atlarge-research.com/pdfs/2024-icpe-datacenter-scheduler.pdf},}
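To make the abstraction gap concrete, here is a toy scheduler interface with the migration extension discussed above (a hypothetical API, not taken from any of the five schedulers studied):

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    load: float = 0.0

class Scheduler:
    """Baseline abstraction: submit-only, as in most industrial scheduler APIs."""
    def __init__(self, nodes):
        self.nodes = nodes
        self.placement = {}

    def submit(self, task_id, demand):
        node = min(self.nodes, key=lambda n: n.load)   # least-loaded placement
        node.load += demand
        self.placement[task_id] = (node, demand)
        return node.name

class MigratingScheduler(Scheduler):
    """API extension of the kind the experiments show to be valuable."""
    def migrate(self, task_id, dst):
        node, demand = self.placement[task_id]
        node.load -= demand
        dst.load += demand
        self.placement[task_id] = (dst, demand)

sched = MigratingScheduler([Node("n0"), Node("n1")])
sched.submit("t1", 0.5)
sched.migrate("t1", sched.nodes[1])    # rebalance without kill-and-restart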
Serverless computing is increasingly used for data-processing applications in both science and business domains. At the core of serverless data-processing systems is the scheduler, which ensures dynamic decisions about task and data placement. Due to the variety of user, cluster, and workload properties, the design space for high-performance and cost-effective scheduling architectures and mechanisms is vast. The large design space is difficult to explore and characterize. To help the system designer disentangle this complexity, we present ExDe, a framework to systematically explore the design space of scheduling architectures and mechanisms. The framework includes a conceptual model and a simulator to assist in design space exploration. We use the framework, and real-world workloads, to characterize the performance of three scheduling architectures and two mechanisms. Our framework is open-source software available on Zenodo.
@article{stalluri_exde,title={ExDe: Design space exploration of scheduler architectures and mechanisms for serverless data-processing},journal={Future Generation Computer Systems},volume={153},pages={84-96},year={2024},issn={0167-739X},doi={10.1016/j.future.2023.11.013},url={https://www.sciencedirect.com/science/article/pii/S0167739X23004211},author={Talluri, Sacheendra and Herbst, Nikolas and Abad, Cristina and De Matteis, Tiziano and Iosup, Alexandru},keywords={Serverless, Scheduler, Design, Mechanism, Architecture, Performance},}
2023
HPDC23
Streaming Task Graph Scheduling for Dataflow Architectures
Tiziano De Matteis, Lukas Gianinazzi, Johannes de Fine Licht, and 1 more author
In Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing (HPDC’23), Orlando, FL, USA, Jun 2023
Dataflow devices represent an avenue towards saving the control and data movement overhead of Load-Store Architectures. Various dataflow accelerators have been proposed, but how to efficiently schedule applications on such devices remains an open problem. The programmer can explicitly implement both temporal and spatial parallelism, and pipelining across multiple processing elements can be crucial to take advantage of the fast on-chip interconnect, enabling the concurrent execution of different program components. This paper introduces canonical task graphs, a model that enables streaming scheduling of task graphs over dataflow architectures. We show how a task graph can be statically analyzed to understand its steady-state behavior, and we use this information to partition it into temporally multiplexed components of spatially executed tasks. Results on synthetic and realistic workloads show how streaming scheduling can increase speedup and device utilization over a traditional scheduling approach.
@inproceedings{streaming_scheduling,author={De Matteis, Tiziano and Gianinazzi, Lukas and de Fine Licht, Johannes and Hoefler, Torsten},title={{Streaming Task Graph Scheduling for Dataflow Architectures}},year={2023},month=jun,pages={225–237},numpages={13},booktitle={Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing (HPDC'23)},location={Orlando, FL, USA},publisher={ACM},isbn={9798400701559},doi={10.1145/3588195.3592999},url={https://dl.acm.org/doi/10.1145/3588195.3592999},}
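A back-of-the-envelope model shows why streaming execution pays off: pipelining a task chain bounds throughput by the slowest task instead of summing phase times (a toy model, not the paper's canonical-task-graph analysis):

def sequential_time(n, rates):
    # Temporal multiplexing only: run each task over the whole dataset in turn.
    return sum(n / r for r in rates)

def streaming_time(n, rates):
    # Pipelined (streaming) execution: after a fill phase, throughput is
    # bounded by the slowest task in the chain.
    fill = sum(1 / r for r in rates)
    return fill + (n - 1) / min(rates)

rates = [4.0, 2.0, 8.0]                  # elements/s sustained by each task
print(sequential_time(10_000, rates))    # 8750.0
print(streaming_time(10_000, rates))     # ~5000.4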
2022
SIGMETRICS23
Noise in the Clouds: Influence of Network Performance Variability on Application Scalability
Daniele De Sensi, Tiziano De Matteis, Konstantin Taranov, and 3 more authors
Proc. ACM Meas. Anal. Comput. Syst., New York, NY, USA, Dec 2022
Cloud computing represents an appealing opportunity for cost-effective deployment of HPC workloads on the best-fitting hardware. However, although cloud and on-premise HPC systems offer similar computational resources, their network architecture and performance may differ significantly. For example, these systems use fundamentally different network transport and routing protocols, which may introduce network noise that can eventually limit the application scaling. This work analyzes network performance, scalability, and cost of running HPC workloads on cloud systems. First, we consider latency, bandwidth, and collective communication patterns in detailed small-scale measurements, and then we simulate network performance at a larger scale. We validate our approach on four popular cloud providers and three on-premise HPC systems, showing that network (and also OS) noise can significantly impact performance and cost both at small and large scale.
@article{noise_cloud,author={De Sensi, Daniele and De Matteis, Tiziano and Taranov, Konstantin and Di Girolamo, Salvatore and Rahn, Tobias and Hoefler, Torsten},title={{Noise in the Clouds: Influence of Network Performance Variability on Application Scalability}},journal={Proc. ACM Meas. Anal. Comput. Syst.},year={2022},month=dec,volume={6},number={3},location={New York, NY, USA},publisher={Association for Computing Machinery},}
Although high-level synthesis (HLS) tools have significantly improved programmer productivity over hardware description languages, developing for FPGAs remains tedious and error prone. Programmers must learn and implement a large set of vendor-specific syntax, patterns, and tricks to optimize (or even successfully compile) their applications, while dealing with ever-changing toolflows from the FPGA vendors. We propose a new way to develop, optimize, and compile FPGA programs. The Data-Centric parallel programming (DaCe) framework allows applications to be defined by their dataflow and control flow through the Stateful DataFlow multiGraph (SDFG) representation, capturing the abstract program characteristics, and exposing a plethora of optimization opportunities. In this work, we show how extending SDFGs with multi-level Library Nodes incorporates both domain-specific and platform-specific optimizations into the design flow, enabling knowledge transfer across application domains and FPGA vendors. We present the HLS-based FPGA code generation backend of DaCe, and show how SDFGs are code generated for either FPGA vendor, emitting efficient HLS code that is structured and annotated to implement the desired architecture.
@misc{licht2022python,title={Python FPGA Programming with Data-Centric Multi-Level Design},author={de Fine Licht, Johannes and De Matteis, Tiziano and Ben-Nun, Tal and Kuster, Andreas and Rausch, Oliver and Burger, Manuel and Johnsen, Carl-Johannes and Hoefler, Torsten},year={2022},eprint={2212.13768},archiveprefix={arXiv},primaryclass={cs.DC},}
ICCAD22
Temporal Vectorization: A Compiler Approach to Automatic Multi-Pumping
Carl-Johannes Johnsen, Tiziano De Matteis, Tal Ben-Nun, and 2 more authors
In Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design, San Diego, California, Oct 2022
The multi-pumping resource sharing technique can overcome the limitations commonly found in single-clocked FPGA designs by allowing hardware components to operate at a higher clock frequency than the surrounding system. However, this optimization cannot be expressed at high levels of abstraction, such as in HLS, requiring the use of hand-optimized RTL. In this paper we show how to leverage multiple clock domains for computational subdomains on reconfigurable devices through data movement analysis on high-level programs. We offer a novel view on multi-pumping as a compiler optimization — a superclass of traditional vectorization. As multiple data elements are fed and consumed, the computations are packed temporally rather than spatially. The optimization is applied automatically using an intermediate representation that maps high-level code to HLS. Internally, the optimization injects modules into the generated designs, incorporating RTL for fine-grained control over the clock domains. We obtain a reduction of resource consumption by up to 50% on critical components and 23% on average. For scalable designs, this can enable further parallelism, increasing overall performance.
@inproceedings{double_pumping,author={Johnsen, Carl-Johannes and De Matteis, Tiziano and Ben-Nun, Tal and de Fine Licht, Johannes and Hoefler, Torsten},title={Temporal Vectorization: A Compiler Approach to Automatic Multi-Pumping},month=oct,year={2022},booktitle={Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design},url={https://doi.org/10.1145/3508352.3549374},doi={10.1145/3508352.3549374},articleno={85},numpages={9},location={San Diego, California},series={ICCAD '22},}
SIGHPC Certificate of Appreciation for reproducible methods at the ACM/IEEE Supercomputing Conference (SC22) ACM student cluster competition
Python has become the de facto language for scientific computing. Programming in Python is highly productive, mainly due to its rich science-oriented software ecosystem built around the NumPy module. As a result, the demand for Python support in High Performance Computing (HPC) has skyrocketed. However, the Python language itself does not necessarily offer high performance. In this work, we present a workflow that retains Python’s high productivity while achieving portable performance across different architectures. The workflow’s key features are HPC-oriented language extensions and a set of automatic optimizations powered by a data-centric intermediate representation. We show performance results and scaling across CPU, GPU, FPGA, and the Piz Daint supercomputer (up to 23,328 cores), with 2.47x and 3.75x speedups over previous-best solutions, first-ever Xilinx and Intel FPGA results of annotated Python, and up to 93.16% scaling efficiency on 512 nodes.
@inproceedings{data_centric_python,author={Ziogas, Alexandros Nikolaos and Schneider, Timo and Ben-Nun, Tal and Calotoiu, Alexandru and De Matteis, Tiziano and de Fine Licht, Johannes and Lavarini, Luca and Hoefler, Torsten},title={Productivity, portability, performance: data-centric Python},year={2021},isbn={9781450384421},publisher={Association for Computing Machinery},address={New York, NY, USA},url={https://doi.org/10.1145/3458817.3476176},doi={10.1145/3458817.3476176},booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},articleno={95},numpages={13},keywords={NumPy, data-centric, high performance computing, python},location={St. Louis, Missouri},series={SC '21},}
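As a flavor of the data-centric Python workflow described above, a minimal DaCe program looks as follows (the @dace.program decorator and symbolic sizes are part of DaCe's public API; the optimization passes and backend selection are elided):

import numpy as np
import dace

N = dace.symbol("N")

@dace.program
def axpy(a: dace.float64, x: dace.float64[N], y: dace.float64[N]):
    y[:] = a * x + y

x = np.random.rand(1024)
y = np.random.rand(1024)
expected = 2.0 * x + y
axpy(2.0, x, y)                 # compiled through the data-centric IR (SDFG)
assert np.allclose(y, expected)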
CGO21
StencilFlow: mapping large stencil programs to distributed spatial computing systems
Johannes de Fine Licht, Andreas Kuster, Tiziano De Matteis, and 3 more authors
In Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization, Virtual Event, Republic of Korea, Feb 2021
Spatial computing devices have been shown to significantly accelerate stencil computations, but have so far relied on unrolling the iterative dimension of a single stencil operation to increase temporal locality. This work considers the general case of mapping directed acyclic graphs of heterogeneous stencil computations to spatial computing systems, assuming large input programs without an iterative component. StencilFlow maximizes temporal locality and ensures deadlock freedom in this setting, providing end-to-end analysis and mapping from a high-level program description to distributed hardware. We evaluate our generated architectures on a Stratix 10 FPGA testbed, yielding 1.31 TOp/s and 4.18 TOp/s in single-device and multi-device settings, respectively, demonstrating the highest performance recorded for stencil programs on FPGAs to date. We then leverage the framework to study a complex stencil program from a production weather simulation application. Our work enables productively targeting distributed spatial computing systems with large stencil programs, and offers insight into architecture characteristics required for their efficient execution in practice.
@inproceedings{stencilflow,author={de Fine Licht, Johannes and Kuster, Andreas and De Matteis, Tiziano and Ben-Nun, Tal and Hofer, Dominic and Hoefler, Torsten},title={StencilFlow: mapping large stencil programs to distributed spatial computing systems},year={2021},isbn={9781728186139},publisher={IEEE Press},url={https://doi.org/10.1109/CGO51591.2021.9370315},doi={10.1109/CGO51591.2021.9370315},booktitle={Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization},pages={315–326},numpages={12},location={Virtual Event, Republic of Korea},series={CGO '21},}
2020
SC20
fBLAS: streaming linear algebra on FPGA
Tiziano De Matteis, Johannes de Fine Licht, and Torsten Hoefler
In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Atlanta, Georgia, Nov 2020
Spatial computing architectures pose an attractive alternative to mitigate control and data movement overheads typical of load-store architectures. In practice, these devices are rarely considered in the HPC community due to the steep learning curve, low productivity, and the lack of available libraries for fundamental operations. High-level synthesis (HLS) tools are facilitating hardware programming, but optimizing for these architectures requires factoring in new transformations and resources/performance trade-offs. We present fBLAS, an open-source HLS implementation of BLAS for FPGAs, that enables reusability, portability and easy integration with existing software and hardware codes. fBLAS’ implementation allows scaling hardware modules to exploit on-chip resources, and module interfaces are designed to natively support streaming on-chip communications, allowing them to be composed to reduce off-chip communication. With fBLAS, we set a precedent for FPGA library design, and contribute to the toolbox of customizable hardware components necessary for HPC codes to start productively targeting reconfigurable platforms.
@inproceedings{fblas,author={De Matteis, Tiziano and de Fine Licht, Johannes and Hoefler, Torsten},title={fBLAS: streaming linear algebra on FPGA},year={2020},isbn={9781728199986},publisher={IEEE Press},booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},articleno={59},numpages={13},keywords={hardware library, high level synthesis, spatial architectures},location={Atlanta, Georgia},series={SC '20},}
2019
SC19
Streaming message interface: high-performance distributed memory programming on reconfigurable hardware
Tiziano De Matteis, Johannes de Fine Licht, Jakub Beránek, and 1 more author
In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Denver, Colorado, Nov 2019
Distributed memory programming is the established paradigm used in high-performance computing (HPC) systems, requiring explicit communication between nodes and devices. When FPGAs are deployed in distributed settings, communication is typically handled either by going through the host machine, sacrificing performance, or by streaming across fixed device-to-device connections, sacrificing flexibility. We present Streaming Message Interface (SMI), a communication model and API that unifies explicit message passing with a hardware-oriented programming model, facilitating minimal-overhead, flexible, and productive inter-FPGA communication. Instead of bulk transmission, messages are streamed across the network during computation, allowing communication to be seamlessly integrated into pipelined designs. We present a high-level synthesis implementation of SMI targeting a dedicated FPGA interconnect, exposing runtime-configurable routing with support for arbitrary network topologies, and implement a set of distributed memory benchmarks. Using SMI, programmers can implement distributed, scalable HPC programs on reconfigurable hardware, without deviating from best practices for hardware design.
@inproceedings{smi,author={De Matteis, Tiziano and de Fine Licht, Johannes and Ber\'{a}nek, Jakub and Hoefler, Torsten},title={Streaming message interface: high-performance distributed memory programming on reconfigurable hardware},year={2019},isbn={9781450362290},publisher={Association for Computing Machinery},address={New York, NY, USA},url={https://doi.org/10.1145/3295500.3356201},doi={10.1145/3295500.3356201},booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},articleno={82},numpages={33},keywords={reconfigurable computing, high-level synthesis tools, distributed memory programming},location={Denver, Colorado},series={SC '19},}
Today’s stream processing systems handle high-volume data streams in an efficient manner. To achieve this goal, they are designed to scale out on large clusters of commodity machines. However, despite the efficient use of distributed architectures, they lack support for co-processors like graphical processing units (GPUs) ready to accelerate data-parallel tasks. The main reason for this lack of integration is that GPU processing and the streaming paradigm have different processing models, with GPUs needing a bulk of data present at once while the streaming paradigm advocates a tuple-at-a-time processing model. This paper contributes to filling this gap by proposing Gasser, a system for offloading the execution of sliding-window operators on GPUs. The system focuses on completely general functions by targeting the parallel processing of non-incremental queries that are not supported by the few existing GPU-based streaming prototypes. Furthermore, Gasser provides an auto-tuning approach able to automatically find the optimal value of the configuration parameters (i.e., batch length and the degree of parallelism) needed to optimize throughput and latency with the given query and data stream. The experimental part assesses the performance efficiency of Gasser by comparing its peak throughput and latency against Apache Flink, a popular and scalable streaming system. Furthermore, we evaluate the penalty induced by supporting completely general queries against the performance achieved by the state-of-the-art solution specifically optimized for incremental queries. Finally, we show the speed and accuracy of the auto-tuning approach adopted by Gasser, which is able to self-configure the system by finding the right configuration parameters without manual tuning by the users.
@article{gasser,author={De Matteis, Tiziano and Mencagli, Gabriele and De Sensi, Daniele and Torquati, Massimo and Danelutto, Marco},journal={IEEE Access},title={GASSER: An Auto-Tunable System for General Sliding-Window Streaming Operators on GPUs},volume={7},number={},pages={48753-48769},doi={10.1109/ACCESS.2019.2910312},issn={2169-3536},year={2019},month=apr,openaccess={https://ieeexplore.ieee.org/document/8688411},}
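The batching idea at the heart of Gasser can be sketched in a few lines (an illustrative NumPy stand-in for the GPU path, not Gasser's implementation):

import numpy as np

def window_batch(stream, size, slide, batch):
    # Gather several consecutive count-based windows as rows of a 2-D array.
    return np.stack([stream[i * slide : i * slide + size] for i in range(batch)])

stream = np.arange(100, dtype=np.float64)
wins = window_batch(stream, size=8, slide=4, batch=4)
# A non-incremental query (the median cannot be maintained tuple-by-tuple),
# evaluated over all buffered windows at once, as the GPU offload would.
print(np.median(wins, axis=1))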
2018
KDD18
D2K: Scalable Community Detection in Massive Networks via Small-Diameter k-Plexes
Alessio Conte, Tiziano De Matteis, Daniele De Sensi, and 3 more authors
In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, United Kingdom, Aug 2018
This paper studies k-plexes, a well-known pseudo-clique model for network communities. In a k-plex, each node can miss at most k-1 links. Our goal is to detect large communities in today’s real-world graphs, which can have hundreds of millions of edges. While many have tried, this task has been elusive so far due to its computationally challenging nature: k-plexes and other pseudo-cliques are harder to find and more numerous than cliques, a well-known hard problem. We present D2K, which is the first algorithm able to find large k-plexes of very large graphs in just a few minutes. The good performance of our algorithm follows from a combination of graph-theoretical concepts, careful algorithm engineering and a high-performance implementation. In particular, we exploit the low degeneracy of real-world graphs, and the fact that large enough k-plexes have diameter 2. We validate a sequential and a parallel/distributed implementation of D2K on real graphs with up to half a billion edges.
@inproceedings{kdd:18,author={Conte, Alessio and De Matteis, Tiziano and De Sensi, Daniele and Grossi, Roberto and Marino, Andrea and Versari, Luca},title={D2K: Scalable Community Detection in Massive Networks via Small-Diameter k-Plexes},booktitle={Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining},series={KDD '18},year={2018},isbn={978-1-4503-5552-0},location={London, United Kingdom},pages={1272--1281},numpages={10},url={http://doi.acm.org/10.1145/3219819.3220093},doi={10.1145/3219819.3220093},acmid={3220093},publisher={ACM},address={New York, NY, USA},keywords={community discovery, graph enumeration, k-plexes, parallel programming},openaccess={https://dl.acm.org/authorize?N666390},videopitch={https://www.youtube.com/watch?v=zF2Hz1wq9eM},}
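The defining property D2K searches for is easy to state in code (a direct membership check, not the paper's enumeration algorithm):

def is_kplex(adj, S, k):
    # adj maps each node to its set of neighbours; S is a candidate community.
    # In a k-plex, every member is adjacent to at least |S| - k other members.
    return all(len(adj[v] & (S - {v})) >= len(S) - k for v in S)

adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}
print(is_kplex(adj, {1, 2, 3, 4}, k=2))   # True: nodes 1 and 4 miss one link each
print(is_kplex(adj, {1, 2, 3, 4}, k=1))   # False: a 1-plex is a clique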
Self-adaptation is an emerging requirement in parallel computing. It enables the dynamic selection of resources to allocate to the application in order to meet performance and power consumption requirements. This is particularly relevant in Fog Applications, where data is generated by a number of devices at a varying rate, according to users’ activity. By dynamically selecting the appropriate number of resources it is possible, for example, to use at each time step the minimum amount of resources needed to process the incoming data. Implementing such algorithms may be a complex task, due to low-level interactions with the underlying hardware and the need for non-intrusive and low-overhead monitoring of the applications. For these reasons, in this paper we propose Nornir, a C++-based framework, which can be used to enforce performance and power consumption constraints on parallel applications running on shared memory multicores. The framework can be easily customized by algorithm designers to implement new self-adaptive policies. By instrumenting the applications in the PARSEC benchmark, we provide strategy designers with a wide set of applications already interfaced to Nornir. In addition to this, to prove its flexibility, we implemented and compared several state-of-the-art existing policies, showing that Nornir can also be used to easily analyze different algorithms and to provide useful insights on them.
@article{nornir:fgcs18,title={Simplifying self-adaptive and power-aware computing with Nornir},journal={Future Generation Computer Systems},year={2018},issn={0167-739X},doi={10.1016/j.future.2018.05.012},url={https://www.sciencedirect.com/science/article/pii/S0167739X17326699},author={De Sensi, Daniele and De Matteis, Tiziano and Danelutto, Marco},keywords={Self-adaptive, Power-aware, Quality of service, Data stream processing, Fog computing, Parallel computing},}
PDP18
Reducing Message Latency and CPU Utilization in the CAF Actor Framework
Massimo Torquati, Tullio Menga, Tiziano De Matteis, and 2 more authors
In Proceedings of the 26th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2018, Cambridge, United Kingdom, Mar 2018
In this work, we consider the C++ Actor Framework (CAF), a recent proposal that revamped the interest in building concurrent and distributed applications using the actor programming model in C++. CAF has been optimized for high-throughput computing, whereas message latency between actors is greatly influenced by the message data rate: at low and moderate rates the latency is higher than at high data rates. To this end, we propose a modification of the polling strategies in the work-stealing CAF scheduler, which can reduce message latency at low and moderate data rates by up to two orders of magnitude without compromising the overall throughput and message latency at maximum pressure. The proposed technique uses a lightweight event notification protocol that is general enough to be used to optimize the runtime of other frameworks experiencing similar issues.
@inproceedings{cafpdp18,author={Torquati, Massimo and Menga, Tullio and De Matteis, Tiziano and De Sensi, Daniele and Mencagli, Gabriele},booktitle={Proceedings of the 26th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, {PDP} 2018},location={Cambridge, United Kingdom},title={Reducing Message Latency and CPU Utilization in the CAF Actor Framework},year={2018},}
2017
SAC17
P³ARSEC: Towards Parallel Patterns Benchmarking
Marco Danelutto, Tiziano De Matteis, Daniele De Sensi, and 2 more authors
In Proceedings of the 32nd Annual ACM Symposium on Applied Computing, Marrakesh, Morocco, Apr 2017
High-level parallel programming is a de-facto standard approach to develop parallel software with reduced time to development. High-level abstractions are provided by existing frameworks as pragma-based annotations in the source code, or through pre-built parallel patterns that recur frequently in parallel algorithms, and that can be easily instantiated by the programmer to add a structure to the development of parallel software. In this paper we focus on this second approach and we propose P³ARSEC, a benchmark suite for parallel pattern-based frameworks consisting of a representative subset of PARSEC applications. We analyse the programmability advantages and the potential performance penalty of using such a high-level methodology with respect to hand-made parallelisations using low-level mechanisms. The results are obtained on the new Intel Knights Landing multicore, and show a significantly reduced code complexity with comparable performance.
@inproceedings{sac17,author={Danelutto, Marco and De Matteis, Tiziano and De Sensi, Daniele and Mencagli, Gabriele and Torquati, Massimo},title={P$^{3}$ARSEC: Towards Parallel Patterns Benchmarking},booktitle={Proceedings of the 32nd Annual ACM Symposium on Applied Computing},year={2017},series={SAC '17},pages={1582--1589},address={New York, NY, USA},publisher={ACM},acmid={3019745},doi={10.1145/3019612.3019745},isbn={978-1-4503-4486-9},keywords={Parallel Patterns, PARSEC Benchmarks, Intel KNL},location={Marrakesh, Morocco},numpages={8},url={http://dl.acm.org/authorize?N34889},}
PDP17
Elastic Scaling for Distributed Latency-sensitive Data Stream Operators
Tiziano De Matteis and Gabriele Mencagli
In Proceedings of the 25th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2017, St. Petersburg, Russia, Mar 2017
High-volume data streams are straining the limits of stream processing frameworks, which need advanced parallel processing capabilities to withstand the incoming bandwidth. Parallel processing must be synergically integrated with elastic features in order to dynamically scale the amount of utilized resources while accomplishing Quality of Service goals in a cost-effective manner. This paper proposes a control-theoretic strategy to drive the elastic behavior of latency-sensitive streaming operators in distributed environments. The strategy takes scaling decisions in advance by relying on a predictive model-based approach. Our ideas have been experimentally evaluated on a cluster using a real-world streaming application fed by synthetic and real datasets. The results show that our approach takes only the strictly necessary reconfigurations while providing reduced resource consumption. Furthermore, it allows the operator to meet desired average latency requirements with a significant reduction in the experienced latency jitter.
@inproceedings{dasp:pdp17,author={De Matteis, Tiziano and Mencagli, Gabriele},title={Elastic Scaling for Distributed Latency-sensitive Data Stream Operators},booktitle={Proceedings of the 25th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, {PDP} 2017},year={2017},location={St. Petersburg, Russia},}
Data stream processing applications have a long running nature (24hr/7d) with workload conditions that may exhibit wide variations at run-time. Elasticity is the term coined to describe the capability of applications to change dynamically their resource usage in response to workload fluctuations. This paper focuses on strategies for elastic data stream processing targeting multicore systems. The key idea is to exploit Model Predictive Control, a control-theoretic method that takes into account the system behavior over a future time horizon in order to decide the best reconfiguration to execute. We design a set of energy-aware proactive strategies, optimized for throughput and latency QoS requirements, which regulate the number of used cores and the CPU frequency through the Dynamic Voltage and Frequency Scaling (DVFS) support offered by modern multicore CPUs. We evaluate our strategies in a high-frequency trading application fed by synthetic and real-world workload traces. We introduce specific properties to effectively compare different elastic approaches, and the results show that our strategies are able to achieve the best outcome.
@article{jss17,author={De Matteis, Tiziano and Mencagli, Gabriele},title={Proactive elasticity and energy awareness in data stream processing},journal={Journal of Systems and Software},year={2017},volume={127},pages={302 - 319},doi={10.1016/j.jss.2016.08.037},issn={0164-1212},keywords={Data stream processing},url={http://www.sciencedirect.com/science/article/pii/S0164121216301467},}
The topic of Data Stream Processing is a recent and highly active research area dealing with the in-memory, tuple-by-tuple analysis of streaming data. Continuous queries typically consume huge volumes of data received at a great velocity. Solutions that persistently store all the input tuples and then perform off-line computation are impractical. Rather, queries must be executed continuously as data cross the streams. The goal of this paper is to present parallel patterns for window-based stateful operators, which are the most representative class of stateful data stream operators. Parallel patterns are presented “à la” Algorithmic Skeleton, by explaining the rationale of each pattern, the preconditions to safely apply it, and the outcome in terms of throughput, latency and memory consumption. The patterns have been implemented in the FastFlow framework targeting off-the-shelf multicores. To the best of our knowledge this is the first time that a similar effort to merge the Data Stream Processing domain and the field of Structured Parallelism has been made.
@article{ijpp17,author={De Matteis, Tiziano and Mencagli, Gabriele},title={Parallel Patterns for Window-Based Stateful Operators on Data Streams: An Algorithmic Skeleton Approach},journal={International Journal of Parallel Programming},year={2017},volume={45},number={2},pages={382--401},month=apr,day={01},doi={10.1007/s10766-016-0413-x},issn={1573-7640},url={https://doi.org/10.1007/s10766-016-0413-x},}
We discuss the extended parallel pattern set identified within the EU-funded project RePhrase as a candidate pattern set to support data intensive applications targeting heterogeneous architectures. The set has been designed to include three classes of pattern, namely (1) core patterns, modelling common, not necessarily data intensive parallelism exploitation patterns, usually to be used in composition; (2) high level patterns, modelling common, complex and complete parallelism exploitation patterns; and (3) building block patterns, modelling the single components of data intensive applications, suitable for use—in composition—to implement patterns not covered by the core and high level patterns. We discuss the expressive power of the RePhrase extended pattern set and results illustrating the performances that may be achieved with the FastFlow implementation of the high level patterns.
@article{rephrase:ijpp17,author={Danelutto, Marco and De Matteis, Tiziano and De Sensi, Daniele and Mencagli, Gabriele and Torquati, Massimo and Aldinucci, Marco and Kilpatrick, Peter},title={The RePhrase Extended Pattern Set for Data Intensive Parallel Computing},journal={International Journal of Parallel Programming},year={2017},month=nov,day={28},issn={1573-7640},doi={10.1007/s10766-017-0540-z},openaccess={http://rdcu.be/zN6c},url={https://doi.org/10.1007/s10766-017-0540-z},}
High-level parallel programming is an active research topic aimed at promoting parallel programming methodologies that provide the programmer with high-level abstractions to develop complex parallel software with reduced time to solution. Pattern-based parallel programming is based on a set of composable and customizable parallel patterns used as basic building blocks in parallel applications. In recent years, a considerable effort has been made in empowering this programming model with features able to overcome shortcomings of early approaches concerning flexibility and performance. In this article, we demonstrate that the approach is flexible and efficient enough by applying it on 12 out of 13 PARSEC applications. Our analysis, conducted on three different multicore architectures, demonstrates that pattern-based parallel programming has reached a good level of maturity, providing comparable results in terms of performance with respect to both other parallel programming methodologies based on pragma-based annotations (i.e., OpenMP and OmpSs) and native implementations (i.e., Pthreads). Regarding the programming effort, we also demonstrate a considerable reduction in lines of code and code churn compared to Pthreads and comparable results with respect to other existing implementations.
@article{p3arsec:taco17,author={De Sensi, Daniele and De Matteis, Tiziano and Torquati, Massimo and Mencagli, Gabriele and Danelutto, Marco},title={Bringing Parallel Patterns Out of the Corner: The P$^{3}$ARSEC Benchmark Suite},journal={ACM Trans. Archit. Code Optim.},issue_date={October 2017},volume={14},number={4},month=oct,year={2017},issn={1544-3566},pages={33:1--33:26},articleno={33},numpages={26},url={http://doi.acm.org/10.1145/3132710},openaccess={http://dl.acm.org/authorize?N49996},doi={10.1145/3132710},acmid={3132710},publisher={ACM},address={New York, NY, USA},keywords={Parallel patterns, algorithmic skeletons, benchmarking, multicore programming, parsec},}
AUTODASP17
Nornir: A Customisable Framework for Autonomic and Power-Aware Applications
Daniele De Sensi, Tiziano De Matteis, and Marco Danelutto
In Euro-Par 2017 Workshops, Proc. of the Auto-DaSP Workshop, Aug 2017
A desirable characteristic of modern parallel applications is the ability to dynamically select the amount of resources to be used to meet requirements on performance or power consumption. In many cases, providing explicit guarantees on performance is of paramount importance. In streaming applications, this is related to the concept of elasticity, i.e. being able to allocate the proper amount of resources to match the current demand as closely as possible. Similarly, in other scenarios, it may be useful to limit the maximum power consumption of an application so that it does not exceed the available power budget. In this paper we propose Nornir, a customizable C++ framework for autonomic and power-aware parallel applications on shared memory multicore machines. Nornir can be used by autonomic strategy designers to implement new algorithms and by application users to enforce requirements on their applications.
@inproceedings{nornir:autodasp17,author={De Sensi, Daniele and De Matteis, Tiziano and Danelutto, Marco},title={Nornir: A Customisable Framework for Autonomic and Power-Aware Applications},booktitle={Euro-Par 2017 Workshops, Proc. of the Auto-DaSP Workshop},year={2017},}
This work studies the issues related to dynamic memory management in Data Stream Processing, an emerging paradigm enabling the real-time processing of live data streams. In this paper, we consider two streaming parallel patterns and we discuss different implementation variants related to how dynamic memory is managed. The results show that the standard mechanisms provided by modern C++ are not entirely adequate for maximizing performance. Instead, the combined use of an efficient general-purpose memory allocator, a custom allocator optimized for the pattern considered, and a custom variant of the C++ shared pointer mechanism provides a performance improvement of up to 16% in the best case.
@article{jsc17,author={Torquati, Massimo and Mencagli, Gabriele and Drocco, Maurizio and Aldinucci, Marco and De Matteis, Tiziano and Danelutto, Marco},title={On dynamic memory allocation in sliding-window parallel patterns for streaming analytics},journal={The Journal of Supercomputing},year={2017},month=sep,day={27},doi={10.1007/s11227-017-2152-1},issn={1573-0484},url={https://doi.org/10.1007/s11227-017-2152-1},}
Techniques to handle traffic bursts and out-of-order arrivals are of paramount importance to provide real-time sensor data analytics in domains like traffic surveillance, transportation management, healthcare and security applications. In these systems the amount of raw data coming from sensors must be analyzed by continuous queries that extract value-added information used to make informed decisions in real-time. To perform this task with timing constraints, parallelism must be exploited in the query execution in order to enable the real-time processing on parallel architectures. In this paper we focus on continuous preference queries, a representative class of continuous queries for decision making, and we propose a parallel query model targeting the efficient processing over out-of-order and bursty data streams. We study how to integrate punctuation mechanisms in order to enable out-of-order processing. Then, we present advanced scheduling strategies targeting scenarios with different burstiness levels, parameterized using the index of dispersion quantity. Extensive experiments have been performed using synthetic datasets and real-world data streams obtained from an existing real-time locating system. The experimental evaluation demonstrates the efficiency of our parallel solution and its effectiveness in handling the out-of-orderness degrees and burstiness levels of real-world applications.
@article{tpds17,author={Mencagli, Gabriele and Torquati, Massimo and Danelutto, Marco and De Matteis, Tiziano},title={Parallel Continuous Preference Queries over Out-of-Order and Bursty Data Streams},journal={IEEE Transactions on Parallel and Distributed Systems},year={2017},volume={28},number={9},pages={2608-2624},month=sep,doi={10.1109/TPDS.2017.2679197},issn={1045-9219},}
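A minimal punctuation-based reordering buffer conveys the mechanism that enables out-of-order processing (an illustrative sketch, not the paper's parallel implementation):

import heapq

class ReorderBuffer:
    def __init__(self):
        self.heap = []                    # min-heap ordered by timestamp

    def insert(self, ts, value):
        heapq.heappush(self.heap, (ts, value))

    def punctuation(self, ts):
        # A punctuation guarantees no tuple with timestamp <= ts is still
        # in flight, so buffered tuples up to ts can be emitted in order.
        out = []
        while self.heap and self.heap[0][0] <= ts:
            out.append(heapq.heappop(self.heap))
        return out

buf = ReorderBuffer()
for ts, v in [(3, "c"), (1, "a"), (2, "b"), (5, "e")]:
    buf.insert(ts, v)
print(buf.punctuation(3))   # [(1, 'a'), (2, 'b'), (3, 'c')]; tuple 5 keeps waiting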
2016
SEPS16
A Divide-and-conquer Parallel Pattern Implementation for Multicores
Marco Danelutto, Tiziano De Matteis, Gabriele Mencagli, and 1 more author
In Proceedings of the 3rd International Workshop on Software Engineering for Parallel Systems, Amsterdam, Netherlands, Nov 2016
@inproceedings{seps16,author={Danelutto, Marco and De Matteis, Tiziano and Mencagli, Gabriele and Torquati, Massimo},title={A Divide-and-conquer Parallel Pattern Implementation for Multicores},booktitle={Proceedings of the 3rd International Workshop on Software Engineering for Parallel Systems},series={SEPS 2016},year={2016},isbn={978-1-4503-4641-2},location={Amsterdam, Netherlands},pages={10--19},numpages={10},url={http://doi.acm.org/10.1145/3002125.3002128},doi={10.1145/3002125.3002128},acmid={3002128},publisher={ACM},address={New York, NY, USA},keywords={Divide and Conquer, High-level parallel patterns},}
Time-to-solution is an important metric when parallelizing existing code. The REPARA approach provides a systematic way to instantiate stream and data parallel patterns by annotating the sequential source code with C++11 attributes. Annotations are automatically transformed into target parallel code that uses existing libraries for parallel programming (e.g., FastFlow). In this paper, we apply this approach to the parallelization of a data stream processing application. The description shows the effectiveness of the approach in easily and quickly prototyping several parallel variants of the sequential code, obtaining good overall performance in terms of both throughput and latency.
@article{js2016,author={Danelutto, Marco and De Matteis, Tiziano and Mencagli, Gabriele and Torquati, Massimo},title={Data stream processing via code annotations},journal={The Journal of Supercomputing},year={2016},pages={1--15},issn={1573-0484},doi={10.1007/s11227-016-1793-9},url={http://dx.doi.org/10.1007/s11227-016-1793-9},}
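For flavor, here is a hedged sketch of what such an annotated loop can look like. The attribute spellings (rpr::pipeline, rpr::kernel, rpr::farm, rpr::in, rpr::out) follow the REPARA project's papers and may differ from the exact toolchain syntax; the helper functions are stand-ins for real application code. A normal compiler simply ignores the unknown attributes, while the REPARA source-to-source tool rewrites the loop into, e.g., a FastFlow pipeline whose second stage is a farm of replicated workers:

    #include <string>
    #include <vector>

    // Hypothetical stand-ins for the real application code.
    std::vector<std::string> read_batch() { return {"a", "b", "c"}; }
    std::string parse(const std::string& raw) { return raw + "!"; }
    void process(const std::string&) { /* consume the record */ }

    void run() {
        [[rpr::pipeline]]
        for (const std::string& raw : read_batch()) {
            [[rpr::kernel, rpr::in(raw), rpr::out(rec)]]
            std::string rec = parse(raw);   // pipeline stage 1
            [[rpr::kernel, rpr::farm, rpr::in(rec)]]
            process(rec);                   // stage 2, replicated workers
        }
    }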
The emergence of real-time decision-making applications in domains like high-frequency trading, emergency management, and service-level analysis in communication networks has led to the definition of new classes of queries. Skyline queries are a notable example: their results consist of all the tuples whose attribute vector is not dominated (in the Pareto sense) by that of any other tuple. Because of their popularity, skyline queries have been studied in terms of both sequential algorithms and parallel implementations for multiprocessors and clusters. Within the Data Stream Processing paradigm, traditional database queries on static relations have been revised to operate on continuous data streams. Most past papers propose sequential algorithms for continuous skyline queries, whereas very few works target implementations on parallel machines. This paper contributes to filling this gap by proposing a parallel implementation for multicore architectures. We propose: i) a parallelization of the eager algorithm based on the notion of Skyline Influence Time, ii) optimizations of the reduce phase and load-balancing strategies to achieve near-optimal speedup, and iii) a set of experiments with both synthetic benchmarks and a real dataset that show the effectiveness of our implementation.
@article{ccpe2016,author={De Matteis, Tiziano and Di Girolamo, Salvatore and Mencagli, Gabriele},title={Continuous Skyline Queries on Multicore Architectures},journal={Concurrency and Computation: Practice and Experience},year={2016},volume={28},number={12},pages={3503--3522},doi={10.1002/cpe.3866},issn={1532-0634},url={http://dx.doi.org/10.1002/cpe.3866},}
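The Pareto-dominance test at the heart of any skyline computation is compact enough to state directly. This is the textbook definition (assuming smaller attribute values are preferred), not code from the paper:

    #include <vector>

    // a dominates b iff a is no worse on every attribute and strictly
    // better on at least one; here "better" means smaller. The skyline is
    // the set of tuples not dominated by any other tuple.
    bool dominates(const std::vector<double>& a, const std::vector<double>& b) {
        bool strictly_better = false;
        for (std::size_t i = 0; i < a.size(); ++i) {
            if (a[i] > b[i]) return false;        // worse on one attribute
            if (a[i] < b[i]) strictly_better = true;
        }
        return strictly_better;
    }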
PPoPP16
Keep Calm and React with Foresight: Strategies for Low-Latency and Energy-Efficient Elastic Data Stream Processing
Tiziano De Matteis and Gabriele Mencagli
In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), Barcelona, Spain, Mar 2016
This paper addresses the problem of designing control strategies for elastic stream processing applications. Elasticity allows applications to rapidly change their configuration (e.g., the number of used resources) on the fly, in response to fluctuations of their workload. In this work we tackle the problem by adopting Model Predictive Control, a control-theoretic method that finds the optimal application configuration along a limited prediction horizon by solving an online optimization problem. Our control strategies are designed to address latency constraints, using Queueing Theory models, and energy consumption, by changing the number of cores in use and the CPU frequency through the Dynamic Voltage and Frequency Scaling (DVFS) support of modern multicore CPUs. The proactive capabilities, in addition to the latency- and energy-awareness, represent the novel features of our approach. Experiments performed using a high-frequency trading application show the effectiveness of our approach compared with state-of-the-art techniques.
@inproceedings{ppopp2016,author={De Matteis, Tiziano and Mencagli, Gabriele},title={Keep Calm and React with Foresight: Strategies for Low-Latency and Energy-Efficient Elastic Data Stream Processing},booktitle={Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)},year={2016},pages={13:1--13:12},articleno={13},doi={10.1145/2851141.2851148},isbn={978-1-4503-4092-2},location={Barcelona, Spain},numpages={12},url={http://doi.acm.org/10.1145/2851141.2851148},}
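To give an idea of the kind of decision such a controller makes, here is a deliberately simplified single-step sketch: it scans feasible (cores, frequency) configurations, estimates latency with an M/M/1-style formula, and picks the cheapest configuration that meets the latency bound. Every model, name, and constant below is a placeholder; the paper's controller optimizes over a multi-step horizon with calibrated models:

    #include <limits>
    #include <vector>

    struct Config { int cores; double ghz; };

    // Single-step sketch of an MPC-like choice: minimize a crude dynamic
    // power proxy (~ cores * f^3) subject to an M/M/1 latency estimate.
    Config choose(double predicted_rate,            // tuples/s forecast
                  double service_rate_per_core_ghz, // tuples/s per core per GHz
                  double latency_bound_s,
                  const std::vector<Config>& feasible) {
        Config best{1, 1.0};   // fallback if nothing satisfies the bound
        double best_power = std::numeric_limits<double>::infinity();
        for (const Config& c : feasible) {
            double mu = c.cores * c.ghz * service_rate_per_core_ghz;
            if (mu <= predicted_rate) continue;             // unstable queue
            double latency = 1.0 / (mu - predicted_rate);   // M/M/1 response time
            if (latency > latency_bound_s) continue;        // violates the SLO
            double power = c.cores * c.ghz * c.ghz * c.ghz; // power proxy
            if (power < best_power) { best_power = power; best = c; }
        }
        return best;
    }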
PhD Thesis
Parallel Patterns for Adaptive Data Stream Processing
Tiziano De Matteis
2015
EuroPar15
A Multicore Parallelization of Continuous Skyline Queries on Data Streams
Tiziano De Matteis, Salvatore Di Girolamo, and Gabriele Mencagli
In Proceedings of the 2015 International Conference on Parallel Processing (Euro-Par), Vienna, Austria, 2015
@inproceedings{europar2015,author={De Matteis, Tiziano and Di Girolamo, Salvatore and Mencagli, Gabriele},booktitle={Proceedings of the 2015 International Conference on Parallel Processing (Euro-Par)},title={A Multicore Parallelization of Continuous Skyline Queries on Data Streams},year={2015},pages={402--413},doi={},address={Vienna, Austria},}
2014
PDP14
Optimizing Message-Passing on Multicore Architectures Using Hardware Multi-threading
Daniele Buono, Tiziano De Matteis, Gabriele Mencagli, and 1 more author
In Proceedings of the 22nd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), Torino, Italy, 2014
@inproceedings{pdp2014,author={Buono, Daniele and De Matteis, Tiziano and Mencagli, Gabriele and Vanneschi, Marco},booktitle={Parallel, Distributed and Network-Based Processing (PDP), 2014 22nd Euromicro International Conference on},title={Optimizing Message-Passing on Multicore Architectures Using Hardware Multi-threading},year={2014},address={Torino, Italy},pages={262-270},doi={10.1109/PDP.2014.63},issn={1066-6192},}
ISPA14
A High-Throughput and Low-Latency Parallelization of Window-based Stream Joins on Multicores
Daniele Buono, Tiziano De Matteis, and Gabriele Mencagli
In Proceedings of the 12th IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA), Milano, Italy, 2014
@inproceedings{ispa2014,author={Buono, Daniele and De Matteis, Tiziano and Mencagli, Gabriele},booktitle={12th IEEE International Symposium on Parallel and Distributed Processing with Applications},title={A High-Throughput and Low-Latency Parallelization of Window-based Stream Joins on Multicores},year={2014},isbn={978-1-4799-4293-0},pages={117--126},numpages={10},url={http://dx.doi.org/10.1109/ISPA.2014.24},doi={10.1109/ISPA.2014.24},acmid={2681942},publisher={IEEE Computer Society},address={Milano, Italy},}
HPCS14
Autonomic Parallel Data Stream Processing
Tiziano De Matteis
In 2014 International Conference on High Performance Computing & Simulation (HPCS), Jul 2014
PDCN14
A Lightweight Run-Time Support for Fast Dense Linear Algebra on Multi-Core
Daniele Buono, Marco Danelutto, Tiziano De Matteis, and 2 more authors
In Proceedings of the 12th IASTED International Conference on Parallel and Distributed Computing and Networks, Innsbruck, Austria, 2014
This work proposes ffMDF, a lightweight dynamic run-time support able to achieve high performance in the execution of dense linear algebra kernels on shared-cache multicores. ffMDF implements a dynamic macro-dataflow interpreter that processes DAGs generated on the fly from standard numeric kernel code. The experimental results demonstrate that the performance obtained using ffMDF on both fine-grain and coarse-grain problems is comparable with, or even better than, that achieved by de facto standard solutions (notably the PLASMA library), which use separate run-time supports specifically optimized for different computational grains on modern multicores.
@inproceedings{pdcn2014,author={Buono, Daniele and Danelutto, Marco and De Matteis, Tiziano and Mencagli, Gabriele and Torquati, Massimo},title={A Lightweight Run-Time Support for Fast Dense Linear Algebra on Multi-Core},booktitle={Proceedings of 12th IASTED International Conference on Parallel and Distributed Computing and Networks},year={2014},publisher={Iasted},address={Innsbruck, Austria},}
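The dependency-counting core of a macro-dataflow interpreter like the one described can be sketched in a few lines; this is an illustration under our own naming, not ffMDF's actual code:

    #include <atomic>
    #include <functional>
    #include <vector>

    // Each DAG node carries a count of unmet input dependencies; finishing
    // a task decrements its successors' counts and fires those reaching
    // zero, so the schedule unfolds dynamically as the DAG is generated.
    struct Task {
        std::function<void()> kernel;     // e.g. a BLAS tile operation
        std::vector<Task*> successors;
        std::atomic<int> pending{0};      // unmet input dependencies
    };

    // ReadyQueue is any queue of Task* shared with the worker threads,
    // e.g. a concurrent MPMC queue with a push(Task*) method.
    template <typename ReadyQueue>
    void complete(Task& t, ReadyQueue& ready) {
        t.kernel();
        for (Task* s : t.successors)
            if (s->pending.fetch_sub(1) == 1)  // last dependency satisfied
                ready.push(s);                 // now executable by any worker
    }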
2013
PDCN13
Evaluation of Architectural Supports for Fine-Grained Synchronization Mechanisms
Tiziano De Matteis, Fabio Luporini, Gabriele Mencagli, and 1 more author
In Proceedings of the 11th IASTED International Conference on Parallel and Distributed Computing and Networks, Innsbruck, Austria, 2013
The advent of multi-/many-core architectures demands efficient run-time supports to sustain the scalability of parallel applications. Synchronization mechanisms should be optimized to account for different scenarios, such as the interaction between threads executed on different cores as well as intra-core synchronization, i.e., involving threads executed on hardware contexts of the same core. From this perspective, we describe the design issues of two notable mechanisms for shared-memory parallel computations. We point out how specific architectural supports, like hardware cache coherence and core-to-core interconnection networks, make it possible to design optimized implementations of such mechanisms. In this paper we discuss experimental results on three representative architectures: a flagship Intel multicore and two interesting network processors. The final result helps to untangle the complex implementation space of synchronization mechanisms.
@inproceedings{pdcn2013,author={{De Matteis}, Tiziano and Luporini, Fabio and Mencagli, Gabriele and Vanneschi, Marco},title={Evaluation of Architectural Supports for Fine-Grained Synchronization Mechanisms},booktitle={Proceedings of the 11th IASTED International Conference on Parallel and Distributed Computing and Networks},year={2013},address={Innsbruck, Austria},publisher={Iasted},isbn={978-088986943-1},}
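As a concrete point in the implementation space the paper explores, here is a standard test-and-test-and-set spin lock. It is a generic textbook sketch, not one of the paper's measured mechanisms, but it shows where the architectural supports matter: the read-only inner loop spins in the local cache thanks to hardware coherence, and the pause hint yields pipeline resources to a sibling hardware context during intra-core contention:

    #include <atomic>
    #if defined(__x86_64__)
    #include <immintrin.h>   // _mm_pause
    #endif

    class SpinLock {
        std::atomic<bool> locked_{false};
    public:
        void lock() {
            for (;;) {
                // Attempt the atomic exchange only when the lock looks free.
                if (!locked_.exchange(true, std::memory_order_acquire)) return;
                // Read-only spin: stays in the local cache line until the
                // coherence protocol invalidates it on unlock.
                while (locked_.load(std::memory_order_relaxed)) {
    #if defined(__x86_64__)
                    _mm_pause();   // be polite to the sibling SMT context
    #endif
                }
            }
        }
        void unlock() { locked_.store(false, std::memory_order_release); }
    };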