Field programmable gate arrays (FPGA) have transformed digital design by enabling versatile and customizable solutions that balance performance and power efficiency, making them essential for today’s diverse computing challenges. Research in the Netherlands, in both academia and industry, plays a major role in developing innovative FPGA solutions. This survey presents the current landscape of FPGA innovation research in the Netherlands by delving into ongoing projects, advancements, and breakthroughs in the field. Focusing on recent research outcomes (within the past five years), we have identified five key research areas: (a) FPGA architecture, (b) FPGA robustness, (c) data center infrastructure and high-performance computing, (d) programming models and tools, and (e) applications. This survey provides in-depth insights beyond a mere snapshot of the current innovation research landscape by highlighting future research directions within each key area; these insights can serve as a foundational resource to inform potential national-level investments in FPGA technology.
@article{10.3389/fhpcp.2025.1572844,author={Alachiotis, Nikolaos and van den Belt, Sjoerd and van der Vlugt, Steven and van der Walle, Reinier and Safari, Mohsen and Endres Forlin, Bruno and De Matteis, Tiziano and Al-Ars, Zaid and Jordans, Roel and Sousa de Almeida, António J. and Corradi, Federico and Baaij, Christiaan and Varbanescu, Ana-Lucia},title={FPGA innovation research in the Netherlands: present landscape and future outlook},journal={Frontiers in High Performance Computing},volume={3},year={2025},url={https://www.frontiersin.org/journals/high-performance-computing/articles/10.3389/fhpcp.2025.1572844},doi={10.3389/fhpcp.2025.1572844},issn={2813-7337},}
ICSA25
How Does Microservice Granularity Impact Energy Consumption and Performance? A Controlled Experiment
Yiming Zhao, Tiziano De Matteis, and Justus Bogner
In 2025 IEEE 22nd International Conference on Software Architecture (ICSA) , 2025
Context: Microservice architectures are a widely used software deployment approach, with benefits regarding flexibility and scalability. However, their impact on energy consumption is poorly understood, and often overlooked in favor of performance and other quality attributes (QAs). One understudied concept in this area is microservice granularity, i.e., over how many services the system functionality is distributed. Objective: We therefore aim to analyze the relationship between microservice granularity and two critical QAs in microservice-based systems: energy consumption and performance. Method: We conducted a controlled experiment using two open-source microservice-based systems of different scales: the small Pet Clinic system and the large Train Ticket system. For each system, we created three levels of granularity by merging or splitting services (coarse, medium, and fine) and then exposed them to five levels of request frequency. Results: Our findings revealed that: i) granularity significantly affected both energy consumption and response time, e.g., in the large system, fine granularity consumed on average 461 J more energy (13%) and added 5.2 ms to response time (14%) compared to coarse granularity; ii) higher request loads significantly increased both energy consumption and response times, with moving from 40 to 400 requests/s resulting in 651 J higher energy consumption (23%) and 41.2 ms longer response times (98%); iii) there is a complex relationship between granularity, system scale, energy consumption, and performance that warrants careful consideration in microservice design. We derive generalizable takeaways from our results. Conclusion: Microservices practitioners should take our findings into account when making granularity-related decisions, especially for large-scale systems.
@inproceedings{10978921,author={Zhao, Yiming and De Matteis, Tiziano and Bogner, Justus},booktitle={2025 IEEE 22nd International Conference on Software Architecture (ICSA)},title={How Does Microservice Granularity Impact Energy Consumption and Performance? A Controlled Experiment},year={2025},pages={84-95},keywords={Energy consumption;Time-frequency analysis;Software architecture;Scalability;Merging;Microservice architectures;Computer architecture;Software;Large-scale systems;Time factors},doi={10.1109/ICSA65012.2025.00018},url={https://doi.org/10.1109/ICSA65012.2025.00018},publisher={IEEE Computer Society},address={Los Alamitos, CA, USA},}
@article{10960277,author={Ziogas, Alexandros Nikolaos and Schneider, Timo and Ben-Nun, Tal and Calotoiu, Alexandru and De Matteis, Tiziano and de Fine Licht, Johannes and Lavarini, Luca and Hoefler, Torsten},journal={IEEE Transactions on Parallel and Distributed Systems},title={Productivity, Portability, Performance, and Reproducibility: Data-Centric Python},year={2025},volume={36},number={5},pages={804-820},keywords={Productivity;Codes;High performance computing;Semantics;Computer architecture;Supercomputers;Software;Field programmable gate arrays;Optimization;Python;Computer languages;Python;high-performance computing;dataflow computing;parallel programming;distributed computing},doi={10.1109/TPDS.2025.3549310},}
CHEOPS25
An I/O Characterizing Study of Offloading LLM Models and KV Caches to NVMe SSD
Zebin Ren, Krijn Doekemeijer, Tiziano De Matteis, and 3 more authors
In Proceedings of the 5th Workshop on Challenges and Opportunities of Efficient and Performant Storage Systems, Rotterdam, Netherlands, 2025
With the popularity of generative AI, LLM inference has become one of the most popular cloud workloads. Modern popular LLMs have hundreds of billions of parameters and support very large input/output prompt token sizes (100K-1M). As a result, their computational state during LLM inference can exceed the memory available on GPUs. One solution to this GPU memory problem is to offload the model weights and KV cache to the host memory. As the size of the models and prompts continues to increase, researchers have started to explore the use of secondary storage, such as SSDs, to store the model weights and KV cache. However, there is a lack of studies on the I/O characteristics and performance requirements of these offloading operations. To better understand the performance characteristics of these offloading operations, in this work, we collect, study, and characterize the block layer I/O traces from two LLM inference frameworks, DeepSpeed and FlexGen, that support model and KV cache offloading to SSDs. Through our analysis of these I/O traces, we report that: (i) libaio-based tensor offloading delivers higher I/O bandwidth for both writing and reading tensors to/from the SSDs than POSIX; (ii) the I/O workload of model offloading is dominated by 128 KiB reads for both DeepSpeed and FlexGen in the block layer; (iii) model offloading does not saturate NVMe SSDs; and (iv) the I/O workload of KV cache offloading contains both read and write workloads dominated by 128 KiB requests, but the average bandwidth of read is much higher than write (2.0 GiB/s vs. 11.0 MiB/s). We open-source the scripts and the I/O traces of this work at https://github.com/stonet-research/cheops25-IO-characterization-of-LLM-model-kv-cache-offloading-nvme
@inproceedings{10.1145/3719330.3721230,author={Ren, Zebin and Doekemeijer, Krijn and De Matteis, Tiziano and Pinto, Christian and Stoica, Radu and Trivedi, Animesh},title={An I/O Characterizing Study of Offloading LLM Models and KV Caches to NVMe SSD},year={2025},isbn={9798400715297},publisher={Association for Computing Machinery},address={New York, NY, USA},url={https://doi.org/10.1145/3719330.3721230},doi={10.1145/3719330.3721230},booktitle={Proceedings of the 5th Workshop on Challenges and Opportunities of Efficient and Performant Storage Systems},pages={23–33},numpages={11},keywords={KV cache offloading, Large language model, Model offloading, SSDs},location={Rotterdam, Netherlands},series={CHEOPS '25},}
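A minimal sketch of the kind of block-layer trace analysis the paper performs (the authors' own scripts live in the linked repository; the one-I/O-per-line CSV format "timestamp_s,op,size_bytes" assumed here is hypothetical):

from collections import Counter

def characterize(path):
    sizes_r, sizes_w = Counter(), Counter()
    bytes_r = bytes_w = 0
    t0, t1 = float("inf"), 0.0
    with open(path) as f:
        for line in f:
            ts, op, size = line.strip().split(",")
            ts, size = float(ts), int(size)
            t0, t1 = min(t0, ts), max(t1, ts)
            if op == "R":
                sizes_r[size] += 1
                bytes_r += size
            else:
                sizes_w[size] += 1
                bytes_w += size
    span = max(t1 - t0, 1e-9)
    print("dominant read size (bytes):", sizes_r.most_common(1))
    print("read bandwidth (GiB/s):", bytes_r / span / 2**30)
    print("write bandwidth (MiB/s):", bytes_w / span / 2**20)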
Datacenter service providers face engineering and operational challenges involving numerous risk aspects. Bad decisions can result in financial penalties, competitive disadvantage, and unsustainable environmental impact. Risk management is an integral aspect of the design and operation of modern datacenters, but frameworks that allow users to conveniently consider various risk trade-offs are missing. We propose RADiCe, an open-source framework that enables data-driven analysis of IT-related operational risks in sustainable datacenters. RADiCe uses monitoring and environmental data and, via discrete event simulation, assists datacenter experts through systematic evaluation of risk scenarios, visualization, and optimization of risks. Our analyses highlight the increasing risk datacenter operators face due to electricity price surges and sustainability concerns, and demonstrate how RADiCe can evaluate and control such risks by optimizing the topology and operational settings of the datacenter. Eventually, RADiCe can evaluate risk scenarios 70x–330x faster than other solutions, opening possibilities for interactive risk exploration.
@article{MASTENBROEK2025107702,title={RADiCe: A Risk Analysis Framework for Data Centers},journal={Future Generation Computer Systems},volume={166},pages={107702},year={2025},issn={0167-739X},doi={10.1016/j.future.2024.107702},url={https://www.sciencedirect.com/science/article/pii/S0167739X24006666},author={Mastenbroek, Fabian and {De Matteis}, Tiziano and {van Beek}, Vincent and Iosup, Alexandru},keywords={Datacenter, Risk assessment, Sustainability, Simulation},}
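To illustrate the idea of simulation-driven risk evaluation, here is a toy Monte Carlo sketch over electricity-price scenarios (this is not RADiCe's API; all constants are made up):

import random

def scenario_cost(hours, it_load_kw, base_price, surge_prob, surge_factor):
    # Energy cost of one simulated scenario with random price surges.
    cost = 0.0
    for _ in range(hours):
        price = base_price * (surge_factor if random.random() < surge_prob else 1.0)
        cost += it_load_kw * price          # kWh consumed in one hour * EUR/kWh
    return cost

costs = sorted(scenario_cost(24 * 365, 500, 0.25, 0.05, 4.0) for _ in range(1000))
print("median annual energy cost (EUR):", round(costs[len(costs) // 2]))
print("95th-percentile scenario (EUR):", round(costs[int(len(costs) * 0.95)]))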
@misc{vanbalen2025comparingparallelfunctionalarray,title={Comparing Parallel Functional Array Languages: Programming and Performance},author={van Balen, David and De Matteis, Tiziano and Grelck, Clemens and Henriksen, Troels and Hsu, Aaron W. and Keller, Gabriele K. and Koopman, Thomas and McDonell, Trevor L. and Oancea, Cosmin and Scholz, Sven-Bodo and Sinkarovs, Artjoms and Smeding, Tom and Trinder, Phil and de Wolff, Ivo Gabe and Ziogas, Alexandros Nikolaos},year={2025},eprint={2505.08906},archiveprefix={arXiv},primaryclass={cs.PL},url={https://arxiv.org/abs/2505.08906},note={(Authors are in alphabetical order)},}
Spatial (dataflow) computer architectures can mitigate the control and performance overheads of classical von Neumann architectures such as traditional CPUs. Driven by the popularity of Machine Learning (ML) workloads, spatial devices are being marketed as ML inference accelerators. Despite providing a rich software ecosystem for ML practitioners, their adoption in other scientific domains is hindered by the steep learning curve and lack of reusable software, which makes them inaccessible to non-experts. We present our ongoing project AIEBLAS, an open-source, expandable implementation of the Basic Linear Algebra Subprograms (BLAS) for the AMD AI Engine. Numerical routines are designed to be easily reused, customized, and composed in dataflow programs, leveraging the characteristics of the targeted device without requiring the user to deeply understand the underlying hardware and programming model.
@misc{aie-blas,title={Developing a BLAS library for the AMD AI Engine},author={Laan, Tristan and De Matteis, Tiziano},year={2024},eprint={2410.00825},archiveprefix={arXiv},primaryclass={cs.DC},url={https://arxiv.org/abs/2410.00825},}
SC24
Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects
Daniele De Sensi, Lorenzo Pichetti, Flavio Vella, and 11 more authors
In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’24), Nov 2024
Multi-GPU nodes are increasingly common in the rapidly evolving landscape of exascale supercomputers. On these systems, GPUs on the same node are connected through dedicated networks, with bandwidths up to a few terabits per second. However, gauging performance expectations and maximizing system efficiency is challenging due to different technologies, design options, and software layers. This paper comprehensively characterizes three supercomputers — Alps, Leonardo, and LUMI — each with a unique architecture and design. We focus on performance evaluation of intra-node and inter-node interconnects on up to 4,096 GPUs, using a mix of intra-node and inter-node benchmarks. By analyzing their limitations and opportunities, we aim to offer practical guidance to researchers, system architects, and software developers dealing with multi-GPU supercomputing. Our results show that there is untapped bandwidth, and there are still many opportunities for optimization, ranging from network to software optimization.
@inproceedings{gpugpuinterconnect,author={De Sensi, Daniele and Pichetti, Lorenzo and Vella, Flavio and De Matteis, Tiziano and Ren, Zebin and Fusco, Luigi and Turisini, Matteo and Cesarini, Daniele and Lust, Kurt and Trivedi, Animesh and Roweth, Duncan and Spiga, Filippo and Di Girolamo, Salvatore and Hoefler, Torsten},title={Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects},year={2024},month=nov,booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'24)},doi={10.1109/SC41406.2024.00039},}
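A small-scale version of such measurements typically starts from a ping-pong benchmark; below is an illustrative sketch, not the paper's benchmark suite (assumes mpi4py, CuPy, and a CUDA-aware MPI; launch with two ranks, e.g. mpirun -np 2):

from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
buf = cp.zeros(1 << 24, dtype=cp.uint8)   # 16 MiB message resident on the GPU
reps = 50

comm.Barrier()
t0 = MPI.Wtime()
for _ in range(reps):
    if rank == 0:
        comm.Send(buf, dest=1)
        comm.Recv(buf, source=1)
    else:
        comm.Recv(buf, source=0)
        comm.Send(buf, dest=0)
comm.Barrier()
if rank == 0:
    t = (MPI.Wtime() - t0) / (2 * reps)   # average one-way time
    print(f"~{buf.nbytes / t / 2**30:.2f} GiB/s one-way")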
HotCloudPerf24
FootPrinter: Quantifying Data Center Carbon Footprint
Dante Niewenhuis, Sacheendra Talluri, Alexandru Iosup, and 1 more author
In Companion of the 15th ACM/SPEC International Conference on Performance Engineering, London, United Kingdom, May 2024
Data centers have become an increasingly significant contributor to the global carbon footprint. In 2021, the global data center industry was responsible for around 1% of the worldwide greenhouse gas emissions. With more resource-intensive workloads, such as Large Language Models, gaining popularity, this percentage is expected to increase further. Therefore, it is crucial for data center service providers to become aware of and accountable for the sustainability impact of their design and operational choices. However, reducing the carbon footprint of data centers has been a challenging process due to the lack of comprehensive metrics, carbon-aware design tools, and guidelines for carbon-aware optimization. In this work, we propose FootPrinter, a first-of-its-kind tool that supports data center designers and operators in assessing the environmental impact of their data center. FootPrinter uses coarse-grained operational data, grid energy mix information, and discrete event simulation to determine the data center’s operational carbon footprint and evaluate the impact of infrastructural or operational changes. FootPrinter can simulate days of operations of a regional data center on a commodity laptop in a few seconds, returning the estimated footprint with marginal error. By making this project open source, we hope to engage the community in the development of methodologies and tools for systematically assessing and exploring the sustainability of data centers.
@inproceedings{dniewenhuis_hotcloud_footprinter,author={Niewenhuis, Dante and Talluri, Sacheendra and Iosup, Alexandru and De Matteis, Tiziano},title={FootPrinter: Quantifying Data Center Carbon Footprint},year={2024},isbn={9798400704451},publisher={Association for Computing Machinery},address={New York, NY, USA},url={https://doi.org/10.1145/3629527.3651419},doi={10.1145/3629527.3651419},booktitle={Companion of the 15th ACM/SPEC International Conference on Performance Engineering},pages={189–195},numpages={7},keywords={carbon emission, carbon footprint, data center, simulation},location={London, United Kingdom},series={ICPE '24 Companion},}
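The core accounting idea is compact enough to sketch (illustrative only, not FootPrinter's code): operational carbon is the sum over sampling intervals of drawn power multiplied by the grid carbon intensity.

def operational_carbon_g(power_kw, intensity_g_per_kwh, dt_h=1.0):
    # Per interval: energy (kW * h = kWh) times grid carbon intensity (gCO2/kWh).
    return sum(p * dt_h * ci for p, ci in zip(power_kw, intensity_g_per_kwh))

power = [480, 510, 620, 590]        # kW drawn, hourly samples
intensity = [220, 180, 350, 300]    # gCO2/kWh from the grid energy mix
print(operational_carbon_g(power, intensity) / 1000, "kg CO2")   # 591.4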
ICPE24
The Cost of Simplicity: Understanding Datacenter Scheduler Programming Abstractions
Aratz Manterola Lasa, Sacheendra Talluri, Tiziano De Matteis, and 1 more author
In 15th ACM/SPEC International Conference on Performance Engineering (ICPE’24), May 2024
Schedulers are a crucial component in datacenter resource management. Each scheduler offers different capabilities, and users use them through their APIs. However, there is no clear understanding of what programming abstractions they offer, nor why they offer some and not others. Consequently, it is difficult to understand their differences and the performance costs imposed by their APIs. In this work, we study the programming abstractions offered by industrial schedulers, their shortcomings, and their related performance costs. We propose a general reference architecture for scheduler programming abstractions. Specifically, we analyze the programming abstractions of five popular industrial schedulers, understand the differences in their APIs, and identify the missing abstractions. Finally, we carry out exemplary experiments using trace-driven simulation demonstrating that an API extension, such as container migration, can improve total execution time per task by 81%, highlighting how schedulers sacrifice performance by implementing simpler programming abstractions. All the relevant software and data artifacts are publicly available at https://github.com/atlarge-research/quantifying-api-design.
@inproceedings{2024-icpe-datacenter-scheduler,author={Lasa, Aratz Manterola and Talluri, Sacheendra and De Matteis, Tiziano and Iosup, Alexandru},title={The Cost of Simplicity: Understanding Datacenter Scheduler Programming Abstractions},booktitle={15th ACM/SPEC International Conference on Performance Engineering (ICPE'24)},publisher={{ACM}},year={2024},url={https://atlarge-research.com/pdfs/2024-icpe-datacenter-scheduler.pdf},}
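To make the abstraction gap concrete, here is a toy scheduler interface with the migration extension discussed above (a hypothetical API, not taken from any of the five schedulers studied):

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    load: float = 0.0

class Scheduler:
    """Baseline abstraction: submit-only, as in most industrial scheduler APIs."""
    def __init__(self, nodes):
        self.nodes = nodes
        self.placement = {}

    def submit(self, task_id, demand):
        node = min(self.nodes, key=lambda n: n.load)   # least-loaded placement
        node.load += demand
        self.placement[task_id] = (node, demand)
        return node.name

class MigratingScheduler(Scheduler):
    """API extension of the kind the experiments show to be valuable."""
    def migrate(self, task_id, dst):
        node, demand = self.placement[task_id]
        node.load -= demand
        dst.load += demand
        self.placement[task_id] = (dst, demand)

sched = MigratingScheduler([Node("n0"), Node("n1")])
sched.submit("t1", 0.5)
sched.migrate("t1", sched.nodes[1])    # rebalance without kill-and-restart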
Serverless computing is increasingly used for data-processing applications in both science and business domains. At the core of serverless data-processing systems is the scheduler, which ensures dynamic decisions about task and data placement. Due to the variety of user, cluster, and workload properties, the design space for high-performance and cost-effective scheduling architectures and mechanisms is vast. The large design space is difficult to explore and characterize. To help the system designer disentangle this complexity, we present ExDe, a framework to systematically explore the design space of scheduling architectures and mechanisms. The framework includes a conceptual model and a simulator to assist in design space exploration. We use the framework, and real-world workloads, to characterize the performance of three scheduling architectures and two mechanisms. Our framework is open-source software available on Zenodo.
@article{stalluri_exde,title={ExDe: Design space exploration of scheduler architectures and mechanisms for serverless data-processing},journal={Future Generation Computer Systems},volume={153},pages={84-96},year={2024},issn={0167-739X},doi={10.1016/j.future.2023.11.013},url={https://www.sciencedirect.com/science/article/pii/S0167739X23004211},author={Talluri, Sacheendra and Herbst, Nikolas and Abad, Cristina and De Matteis, Tiziano and Iosup, Alexandru},keywords={Serverless, Scheduler, Design, Mechanism, Architecture, Performance},}
2023
HPDC23
Streaming Task Graph Scheduling for Dataflow Architectures
Tiziano De Matteis, Lukas Gianinazzi, Johannes de Fine Licht, and 1 more author
In Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing (HPDC’23), Orlando, FL, USA, Jun 2023
Dataflow devices represent an avenue towards saving the control and data movement overhead of Load-Store Architectures. Various dataflow accelerators have been proposed, but how to efficiently schedule applications on such devices remains an open problem. The programmer can explicitly implement both temporal and spatial parallelism, and pipelining across multiple processing elements can be crucial to take advantage of the fast on-chip interconnect, enabling the concurrent execution of different program components. This paper introduces canonical task graphs, a model that enables streaming scheduling of task graphs over dataflow architectures. We show how a task graph can be statically analyzed to understand its steady-state behavior, and we use this information to partition it into temporally multiplexed components of spatially executed tasks. Results on synthetic and realistic workloads show how streaming scheduling can increase speedup and device utilization over a traditional scheduling approach.
@inproceedings{streaming_scheduling,author={De Matteis, Tiziano and Gianinazzi, Lukas and de Fine Licht, Johannes and Hoefler, Torsten},title={{Streaming Task Graph Scheduling for Dataflow Architectures}},year={2023},month=jun,pages={225–237},numpages={13},booktitle={Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing (HPDC'23)},location={Orlando, FL, USA},publisher={ACM},isbn={9798400701559},doi={10.1145/3588195.3592999},url={https://dl.acm.org/doi/10.1145/3588195.3592999},}
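A back-of-the-envelope model shows why streaming execution pays off: pipelining a task chain bounds throughput by the slowest task instead of summing phase times (a toy model, not the paper's canonical-task-graph analysis):

def sequential_time(n, rates):
    # Temporal multiplexing only: run each task over the whole dataset in turn.
    return sum(n / r for r in rates)

def streaming_time(n, rates):
    # Pipelined (streaming) execution: after a fill phase, throughput is
    # bounded by the slowest task in the chain.
    fill = sum(1 / r for r in rates)
    return fill + (n - 1) / min(rates)

rates = [4.0, 2.0, 8.0]                  # elements/s sustained by each task
print(sequential_time(10_000, rates))    # 8750.0
print(streaming_time(10_000, rates))     # ~5000.4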
2022
SIGMETRICS23
Noise in the Clouds: Influence of Network Performance Variability on Application Scalability
Daniele De Sensi, Tiziano De Matteis, Konstantin Taranov, and 3 more authors
Proc. ACM Meas. Anal. Comput. Syst., New York, NY, USA, Dec 2022
Cloud computing represents an appealing opportunity for cost-effective deployment of HPC workloads on the best-fitting hardware. However, although cloud and on-premise HPC systems offer similar computational resources, their network architecture and performance may differ significantly. For example, these systems use fundamentally different network transport and routing protocols, which may introduce network noise that can eventually limit the application scaling. This work analyzes network performance, scalability, and cost of running HPC workloads on cloud systems. First, we consider latency, bandwidth, and collective communication patterns in detailed small-scale measurements, and then we simulate network performance at a larger scale. We validate our approach on four popular cloud providers and three on-premise HPC systems, showing that network (and also OS) noise can significantly impact performance and cost both at small and large scale.
@article{noise_cloud,author={De Sensi, Daniele and De Matteis, Tiziano and Taranov, Konstantin and Di Girolamo, Salvatore and Rahn, Tobias and Hoefler, Torsten},title={{Noise in the Clouds: Influence of Network Performance Variability on Application Scalability}},journal={Proc. ACM Meas. Anal. Comput. Syst.},year={2022},month=dec,volume={6},number={3},location={New York, NY, USA},publisher={Association for Computing Machinery},}
Although high-level synthesis (HLS) tools have significantly improved programmer productivity over hardware description languages, developing for FPGAs remains tedious and error prone. Programmers must learn and implement a large set of vendor-specific syntax, patterns, and tricks to optimize (or even successfully compile) their applications, while dealing with ever-changing toolflows from the FPGA vendors. We propose a new way to develop, optimize, and compile FPGA programs. The Data-Centric parallel programming (DaCe) framework allows applications to be defined by their dataflow and control flow through the Stateful DataFlow multiGraph (SDFG) representation, capturing the abstract program characteristics, and exposing a plethora of optimization opportunities. In this work, we show how extending SDFGs with multi-level Library Nodes incorporates both domain-specific and platform-specific optimizations into the design flow, enabling knowledge transfer across application domains and FPGA vendors. We present the HLS-based FPGA code generation backend of DaCe, and show how SDFGs are code generated for either FPGA vendor, emitting efficient HLS code that is structured and annotated to implement the desired architecture.
@misc{licht2022python,title={Python FPGA Programming with Data-Centric Multi-Level Design},author={de Fine Licht, Johannes and De Matteis, Tiziano and Ben-Nun, Tal and Kuster, Andreas and Rausch, Oliver and Burger, Manuel and Johnsen, Carl-Johannes and Hoefler, Torsten},year={2022},eprint={2212.13768},archiveprefix={arXiv},primaryclass={cs.DC},}
ICCAD22
Temporal Vectorization: A Compiler Approach to Automatic Multi-Pumping
Carl-Johannes Johnsen, Tiziano De Matteis, Tal Ben-Nun, and 2 more authors
In Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design, San Diego, California, Oct 2022
The multi-pumping resource sharing technique can overcome the limitations commonly found in single-clocked FPGA designs by allowing hardware components to operate at a higher clock frequency than the surrounding system. However, this optimization cannot be expressed at high levels of abstraction, such as in HLS, requiring the use of hand-optimized RTL. In this paper we show how to leverage multiple clock domains for computational subdomains on reconfigurable devices through data movement analysis on high-level programs. We offer a novel view on multi-pumping as a compiler optimization — a superclass of traditional vectorization. As multiple data elements are fed and consumed, the computations are packed temporally rather than spatially. The optimization is applied automatically using an intermediate representation that maps high-level code to HLS. Internally, the optimization injects modules into the generated designs, incorporating RTL for fine-grained control over the clock domains. We obtain a reduction of resource consumption by up to 50% on critical components and 23% on average. For scalable designs, this can enable further parallelism, increasing overall performance.
@inproceedings{double_pumping,author={Johnsen, Carl-Johannes and De Matteis, Tiziano and Ben-Nun, Tal and de Fine Licht, Johannes and Hoefler, Torsten},title={Temporal Vectorization: A Compiler Approach to Automatic Multi-Pumping},month=oct,year={2022},booktitle={Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design},url={https://doi.org/10.1145/3508352.3549374},doi={10.1145/3508352.3549374},articleno={85},numpages={9},location={San Diego, California},series={ICCAD '22},}
SIGHPC Certificate of Appreciation for reproducible methods at the ACM/IEEE Supercomputing Conference (SC22) ACM student cluster competition
Python has become the de facto language for scientific computing. Programming in Python is highly productive, mainly due to its rich science-oriented software ecosystem built around the NumPy module. As a result, the demand for Python support in High Performance Computing (HPC) has skyrocketed. However, the Python language itself does not necessarily offer high performance. In this work, we present a workflow that retains Python’s high productivity while achieving portable performance across different architectures. The workflow’s key features are HPC-oriented language extensions and a set of automatic optimizations powered by a data-centric intermediate representation. We show performance results and scaling across CPU, GPU, FPGA, and the Piz Daint supercomputer (up to 23,328 cores), with 2.47x and 3.75x speedups over previous-best solutions, first-ever Xilinx and Intel FPGA results of annotated Python, and up to 93.16% scaling efficiency on 512 nodes.
@inproceedings{data_centric_python,author={Ziogas, Alexandros Nikolaos and Schneider, Timo and Ben-Nun, Tal and Calotoiu, Alexandru and De Matteis, Tiziano and de Fine Licht, Johannes and Lavarini, Luca and Hoefler, Torsten},title={Productivity, portability, performance: data-centric Python},year={2021},isbn={9781450384421},publisher={Association for Computing Machinery},address={New York, NY, USA},url={https://doi.org/10.1145/3458817.3476176},doi={10.1145/3458817.3476176},booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},articleno={95},numpages={13},keywords={NumPy, data-centric, high performance computing, python},location={St. Louis, Missouri},series={SC '21},}
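As a flavor of the data-centric Python workflow described above, a minimal DaCe program looks as follows (the @dace.program decorator and symbolic sizes are part of DaCe's public API; the optimization passes and backend selection are elided):

import numpy as np
import dace

N = dace.symbol("N")

@dace.program
def axpy(a: dace.float64, x: dace.float64[N], y: dace.float64[N]):
    y[:] = a * x + y

x = np.random.rand(1024)
y = np.random.rand(1024)
expected = 2.0 * x + y
axpy(2.0, x, y)                 # compiled through the data-centric IR (SDFG)
assert np.allclose(y, expected)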
CGO21
StencilFlow: mapping large stencil programs to distributed spatial computing systems
Johannes de Fine Licht, Andreas Kuster, Tiziano De Matteis, and 3 more authors
In Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization, Virtual Event, Republic of Korea, Feb 2021
Spatial computing devices have been shown to significantly accelerate stencil computations, but have so far relied on unrolling the iterative dimension of a single stencil operation to increase temporal locality. This work considers the general case of mapping directed acyclic graphs of heterogeneous stencil computations to spatial computing systems, assuming large input programs without an iterative component. StencilFlow maximizes temporal locality and ensures deadlock freedom in this setting, providing end-to-end analysis and mapping from a high-level program description to distributed hardware. We evaluate our generated architectures on a Stratix 10 FPGA testbed, yielding 1.31 TOp/s and 4.18 TOp/s in single-device and multi-device settings, respectively, demonstrating the highest performance recorded for stencil programs on FPGAs to date. We then leverage the framework to study a complex stencil program from a production weather simulation application. Our work enables productively targeting distributed spatial computing systems with large stencil programs, and offers insight into architecture characteristics required for their efficient execution in practice.
@inproceedings{stencilflow,author={de Fine Licht, Johannes and Kuster, Andreas and De Matteis, Tiziano and Ben-Nun, Tal and Hofer, Dominic and Hoefler, Torsten},title={StencilFlow: mapping large stencil programs to distributed spatial computing systems},year={2021},isbn={9781728186139},publisher={IEEE Press},url={https://doi.org/10.1109/CGO51591.2021.9370315},doi={10.1109/CGO51591.2021.9370315},booktitle={Proceedings of the 2021 IEEE/ACM International Symposium on Code Generation and Optimization},pages={315–326},numpages={12},location={Virtual Event, Republic of Korea},series={CGO '21},}
2020
SC20
fBLAS: streaming linear algebra on FPGA
Tiziano De Matteis, Johannes de Fine Licht, and Torsten Hoefler
In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Atlanta, Georgia, Nov 2020
Spatial computing architectures pose an attractive alternative to mitigate control and data movement overheads typical of load-store architectures. In practice, these devices are rarely considered in the HPC community due to the steep learning curve, low productivity, and the lack of available libraries for fundamental operations. High-level synthesis (HLS) tools are facilitating hardware programming, but optimizing for these architectures requires factoring in new transformations and resources/performance trade-offs. We present fBLAS, an open-source HLS implementation of BLAS for FPGAs, that enables reusability, portability and easy integration with existing software and hardware codes. fBLAS’ implementation allows scaling hardware modules to exploit on-chip resources, and module interfaces are designed to natively support streaming on-chip communications, allowing them to be composed to reduce off-chip communication. With fBLAS, we set a precedent for FPGA library design, and contribute to the toolbox of customizable hardware components necessary for HPC codes to start productively targeting reconfigurable platforms.
@inproceedings{fblas,author={De Matteis, Tiziano and de Fine Licht, Johannes and Hoefler, Torsten},title={fBLAS: streaming linear algebra on FPGA},year={2020},isbn={9781728199986},publisher={IEEE Press},booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},articleno={59},numpages={13},keywords={hardware library, high level synthesis, spatial architectures},location={Atlanta, Georgia},series={SC '20},}
2019
SC19
Streaming message interface: high-performance distributed memory programming on reconfigurable hardware
Tiziano De Matteis, Johannes de Fine Licht, Jakub Beránek, and 1 more author
In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Denver, Colorado, Nov 2019
Distributed memory programming is the established paradigm used in high-performance computing (HPC) systems, requiring explicit communication between nodes and devices. When FPGAs are deployed in distributed settings, communication is typically handled either by going through the host machine, sacrificing performance, or by streaming across fixed device-to-device connections, sacrificing flexibility. We present Streaming Message Interface (SMI), a communication model and API that unifies explicit message passing with a hardware-oriented programming model, facilitating minimal-overhead, flexible, and productive inter-FPGA communication. Instead of bulk transmission, messages are streamed across the network during computation, allowing communication to be seamlessly integrated into pipelined designs. We present a high-level synthesis implementation of SMI targeting a dedicated FPGA interconnect, exposing runtime-configurable routing with support for arbitrary network topologies, and implement a set of distributed memory benchmarks. Using SMI, programmers can implement distributed, scalable HPC programs on reconfigurable hardware, without deviating from best practices for hardware design.
@inproceedings{smi,author={De Matteis, Tiziano and de Fine Licht, Johannes and Ber\'{a}nek, Jakub and Hoefler, Torsten},title={Streaming message interface: high-performance distributed memory programming on reconfigurable hardware},year={2019},isbn={9781450362290},publisher={Association for Computing Machinery},address={New York, NY, USA},url={https://doi.org/10.1145/3295500.3356201},doi={10.1145/3295500.3356201},booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},articleno={82},numpages={33},keywords={reconfigurable computing, high-level synthesis tools, distributed memory programming},location={Denver, Colorado},series={SC '19},}
Today’s stream processing systems handle high-volume data streams in an efficient manner. To achieve this goal, they are designed to scale out on large clusters of commodity machines. However, despite the efficient use of distributed architectures, they lack support for co-processors like graphical processing units (GPUs) ready to accelerate data-parallel tasks. The main reason for this lack of integration is that GPU processing and the streaming paradigm have different processing models, with GPUs needing a bulk of data present at once while the streaming paradigm advocates a tuple-at-a-time processing model. This paper contributes to filling this gap by proposing Gasser, a system for offloading the execution of sliding-window operators on GPUs. The system focuses on completely general functions by targeting the parallel processing of non-incremental queries that are not supported by the few existing GPU-based streaming prototypes. Furthermore, Gasser provides an auto-tuning approach able to automatically find the optimal value of the configuration parameters (i.e., batch length and the degree of parallelism) needed to optimize throughput and latency with the given query and data stream. The experimental part assesses the performance efficiency of Gasser by comparing its peak throughput and latency against Apache Flink, a popular and scalable streaming system. Furthermore, we evaluate the penalty induced by supporting completely general queries against the performance achieved by the state-of-the-art solution specifically optimized for incremental queries. Finally, we show the speed and accuracy of the auto-tuning approach adopted by Gasser, which is able to self-configure the system by finding the right configuration parameters without manual tuning by the users.
@article{gasser,author={De Matteis, Tiziano and Mencagli, Gabriele and De Sensi, Daniele and Torquati, Massimo and Danelutto, Marco},journal={IEEE Access},title={GASSER: An Auto-Tunable System for General Sliding-Window Streaming Operators on GPUs},volume={7},number={},pages={48753-48769},doi={10.1109/ACCESS.2019.2910312},issn={2169-3536},year={2019},month=apr,openaccess={https://ieeexplore.ieee.org/document/8688411},}
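The batching idea at the heart of Gasser can be sketched in a few lines (an illustrative NumPy stand-in for the GPU path, not Gasser's implementation):

import numpy as np

def window_batch(stream, size, slide, batch):
    # Gather several consecutive count-based windows as rows of a 2-D array.
    return np.stack([stream[i * slide : i * slide + size] for i in range(batch)])

stream = np.arange(100, dtype=np.float64)
wins = window_batch(stream, size=8, slide=4, batch=4)
# A non-incremental query (the median cannot be maintained tuple-by-tuple),
# evaluated over all buffered windows at once, as the GPU offload would.
print(np.median(wins, axis=1))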
2018
KDD18
D2K: Scalable Community Detection in Massive Networks via Small-Diameter k-Plexes
Alessio Conte, Tiziano De Matteis, Daniele De Sensi, and 3 more authors
In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, United Kingdom, Aug 2018
This paper studies k-plexes, a well-known pseudo-clique model for network communities. In a k-plex, each node can miss at most k-1 links. Our goal is to detect large communities in today’s real-world graphs, which can have hundreds of millions of edges. While many have tried, this task has been elusive so far due to its computationally challenging nature: k-plexes and other pseudo-cliques are harder to find and more numerous than cliques, a well-known hard problem. We present D2K, which is the first algorithm able to find large k-plexes of very large graphs in just a few minutes. The good performance of our algorithm follows from a combination of graph-theoretical concepts, careful algorithm engineering and a high-performance implementation. In particular, we exploit the low degeneracy of real-world graphs, and the fact that large enough k-plexes have diameter 2. We validate a sequential and a parallel/distributed implementation of D2K on real graphs with up to half a billion edges.
@inproceedings{kdd:18,author={Conte, Alessio and De Matteis, Tiziano and De Sensi, Daniele and Grossi, Roberto and Marino, Andrea and Versari, Luca},title={D2K: Scalable Community Detection in Massive Networks via Small-Diameter k-Plexes},booktitle={Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining},series={KDD '18},year={2018},isbn={978-1-4503-5552-0},location={London, United Kingdom},pages={1272--1281},numpages={10},url={http://doi.acm.org/10.1145/3219819.3220093},doi={10.1145/3219819.3220093},acmid={3220093},publisher={ACM},address={New York, NY, USA},keywords={community discovery, graph enumeration, k-plexes, parallel programming},openaccess={https://dl.acm.org/authorize?N666390},videopitch={https://www.youtube.com/watch?v=zF2Hz1wq9eM},}
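The defining property D2K searches for is easy to state in code (a direct membership check, not the paper's enumeration algorithm):

def is_kplex(adj, S, k):
    # adj maps each node to its set of neighbours; S is a candidate community.
    # In a k-plex, every member is adjacent to at least |S| - k other members.
    return all(len(adj[v] & (S - {v})) >= len(S) - k for v in S)

adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}
print(is_kplex(adj, {1, 2, 3, 4}, k=2))   # True: nodes 1 and 4 miss one link each
print(is_kplex(adj, {1, 2, 3, 4}, k=1))   # False: a 1-plex is a clique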
Self-adaptation is an emerging requirement in parallel computing. It enables the dynamic selection of resources to allocate to the application in order to meet performance and power consumption requirements. This is particularly relevant in Fog Applications, where data is generated by a number of devices at a varying rate, according to users’ activity. By dynamically selecting the appropriate number of resources it is possible, for example, to use at each time step the minimum amount of resources needed to process the incoming data. Implementing such algorithms may be a complex task, due to low-level interactions with the underlying hardware and the need for non-intrusive and low-overhead monitoring of the applications. For these reasons, in this paper we propose Nornir, a C++-based framework, which can be used to enforce performance and power consumption constraints on parallel applications running on shared memory multicores. The framework can be easily customized by algorithm designers to implement new self-adaptive policies. By instrumenting the applications in the PARSEC benchmark, we provide strategy designers with a wide set of applications already interfaced to Nornir. In addition to this, to prove its flexibility, we implemented and compared several state-of-the-art existing policies, showing that Nornir can also be used to easily analyze different algorithms and to provide useful insights on them.
@article{nornir:fgcs18,title={Simplifying self-adaptive and power-aware computing with Nornir},journal={Future Generation Computer Systems},year={2018},issn={0167-739X},doi={10.1016/j.future.2018.05.012},url={https://www.sciencedirect.com/science/article/pii/S0167739X17326699},author={De Sensi, Daniele and De Matteis, Tiziano and Danelutto, Marco},keywords={Self-adaptive, Power-aware, Quality of service, Data stream processing, Fog computing, Parallel computing},}
PDP18
Reducing Message Latency and CPU Utilization in the CAF Actor Framework
Massimo Torquati, Tullio Menga, Tiziano De Matteis, and 2 more authors
In Proceedings of the 26th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2018, Cambridge, United Kingdom, Mar 2018
In this work, we consider the C++ Actor Framework (CAF), a recent proposal that revamped the interest in building concurrent and distributed applications using the actor programming model in C++. CAF has been optimized for high-throughput computing, whereas message latency between actors is greatly influenced by the message data rate: at low and moderate rates the latency is higher than at high data rates. To this end, we propose a modification of the polling strategies in the work-stealing CAF scheduler, which can reduce message latency at low and moderate data rates by up to two orders of magnitude without compromising the overall throughput and message latency at maximum pressure. The proposed technique uses a lightweight event notification protocol that is general enough to be used to optimize the runtime of other frameworks experiencing similar issues.
@inproceedings{cafpdp18,author={Torquati, Massimo and Menga, Tullio and De Matteis, Tiziano and De Sensi, Daniele and Mencagli, Gabriele},booktitle={Proceedings of the 26th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, {PDP} 2018},location={Cambridge, United Kingdom},title={Reducing Message Latency and CPU Utilization in the CAF Actor Framework},year={2018},}
2017
SAC17
P³ARSEC: Towards Parallel Patterns Benchmarking
Marco Danelutto, Tiziano De Matteis, Daniele De Sensi, and 2 more authors
In Proceedings of the 32nd Annual ACM Symposium on Applied Computing, Marrakesh, Morocco, Apr 2017
High-level parallel programming is a de-facto standard approach to develop parallel software with reduced time to development. High-level abstractions are provided by existing frameworks as pragma-based annotations in the source code, or through pre-built parallel patterns that recur frequently in parallel algorithms, and that can be easily instantiated by the programmer to add a structure to the development of parallel software. In this paper we focus on this second approach and we propose P³ARSEC, a benchmark suite for parallel pattern-based frameworks consisting of a representative subset of PARSEC applications. We analyse the programmability advantages and the potential performance penalty of using such a high-level methodology with respect to hand-made parallelisations using low-level mechanisms. The results are obtained on the new Intel Knights Landing multicore, and show a significantly reduced code complexity with comparable performance.
@inproceedings{sac17,author={Danelutto, Marco and De Matteis, Tiziano and De Sensi, Daniele and Mencagli, Gabriele and Torquati, Massimo},title={P$^{3}$ARSEC: Towards Parallel Patterns Benchmarking},booktitle={Proceedings of the 32nd Annual ACM Symposium on Applied Computing},year={2017},series={SAC '17},pages={1582--1589},address={New York, NY, USA},publisher={ACM},acmid={3019745},doi={10.1145/3019612.3019745},isbn={978-1-4503-4486-9},keywords={Parallel Patterns, PARSEC Benchmarks, Intel KNL},location={Marrakesh, Morocco},numpages={8},url={http://dl.acm.org/authorize?N34889},}
PDP17
Elastic Scaling for Distributed Latency-sensitive Data Stream Operators
Tiziano De Matteis and Gabriele Mencagli
In Proceedings of the 25th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2017, St. Petersburg, Russia, Mar 2017
High-volume data streams are straining the limits of stream processing frameworks, which need advanced parallel processing capabilities to withstand the incoming bandwidth. Parallel processing must be synergically integrated with elastic features in order to dynamically scale the amount of utilized resources while accomplishing Quality of Service goals in a cost-effective manner. This paper proposes a control-theoretic strategy to drive the elastic behavior of latency-sensitive streaming operators in distributed environments. The strategy takes scaling decisions in advance by relying on a predictive model-based approach. Our ideas have been experimentally evaluated on a cluster using a real-world streaming application fed by synthetic and real datasets. The results show that our approach takes only the strictly necessary reconfigurations while providing reduced resource consumption. Furthermore, it allows the operator to meet desired average latency requirements with a significant reduction in the experienced latency jitter.
@inproceedings{dasp:pdp17,author={De Matteis, Tiziano and Mencagli, Gabriele},title={Elastic Scaling for Distributed Latency-sensitive Data Stream Operators},booktitle={Proceedings of the 25th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, {PDP} 2017},year={2017},location={St. Petersburg, Russia},}
Data stream processing applications have a long running nature (24hr/7d) with workload conditions that may exhibit wide variations at run-time. Elasticity is the term coined to describe the capability of applications to change dynamically their resource usage in response to workload fluctuations. This paper focuses on strategies for elastic data stream processing targeting multicore systems. The key idea is to exploit Model Predictive Control, a control-theoretic method that takes into account the system behavior over a future time horizon in order to decide the best reconfiguration to execute. We design a set of energy-aware proactive strategies, optimized for throughput and latency QoS requirements, which regulate the number of used cores and the CPU frequency through the Dynamic Voltage and Frequency Scaling (DVFS) support offered by modern multicore CPUs. We evaluate our strategies in a high-frequency trading application fed by synthetic and real-world workload traces. We introduce specific properties to effectively compare different elastic approaches, and the results show that our strategies are able to achieve the best outcome.
@article{jss17,author={De Matteis, Tiziano and Mencagli, Gabriele},title={Proactive elasticity and energy awareness in data stream processing},journal={Journal of Systems and Software},year={2017},volume={127},pages={302 - 319},doi={10.1016/j.jss.2016.08.037},issn={0164-1212},keywords={Data stream processing},url={http://www.sciencedirect.com/science/article/pii/S0164121216301467},}
The topic of Data Stream Processing is a recent and highly active research area dealing with the in-memory, tuple-by-tuple analysis of streaming data. Continuous queries typically consume huge volumes of data received at a great velocity. Solutions that persistently store all the input tuples and then perform off-line computation are impractical. Rather, queries must be executed continuously as data cross the streams. The goal of this paper is to present parallel patterns for window-based stateful operators, which are the most representative class of stateful data stream operators. Parallel patterns are presented “à la” Algorithmic Skeleton, by explaining the rationale of each pattern, the preconditions to safely apply it, and the outcome in terms of throughput, latency and memory consumption. The patterns have been implemented in the FastFlow framework targeting off-the-shelf multicores. To the best of our knowledge this is the first time that a similar effort to merge the Data Stream Processing domain and the field of Structured Parallelism has been made.
@article{ijpp17,author={De Matteis, Tiziano and Mencagli, Gabriele},title={Parallel Patterns for Window-Based Stateful Operators on Data Streams: An Algorithmic Skeleton Approach},journal={International Journal of Parallel Programming},year={2017},volume={45},number={2},pages={382--401},month=apr,day={01},doi={10.1007/s10766-016-0413-x},issn={1573-7640},url={https://doi.org/10.1007/s10766-016-0413-x},}
We discuss the extended parallel pattern set identified within the EU-funded project RePhrase as a candidate pattern set to support data intensive applications targeting heterogeneous architectures. The set has been designed to include three classes of pattern, namely (1) core patterns, modelling common, not necessarily data intensive parallelism exploitation patterns, usually to be used in composition; (2) high level patterns, modelling common, complex and complete parallelism exploitation patterns; and (3) building block patterns, modelling the single components of data intensive applications, suitable for use—in composition—to implement patterns not covered by the core and high level patterns. We discuss the expressive power of the RePhrase extended pattern set and results illustrating the performances that may be achieved with the FastFlow implementation of the high level patterns.
@article{rephrase:ijpp17,author={Danelutto, Marco and De Matteis, Tiziano and De Sensi, Daniele and Mencagli, Gabriele and Torquati, Massimo and Aldinucci, Marco and Kilpatrick, Peter},title={The RePhrase Extended Pattern Set for Data Intensive Parallel Computing},journal={International Journal of Parallel Programming},year={2017},month=nov,day={28},issn={1573-7640},doi={10.1007/s10766-017-0540-z},openaccess={http://rdcu.be/zN6c},url={https://doi.org/10.1007/s10766-017-0540-z},}
High-level parallel programming is an active research topic aimed at promoting parallel programming methodologies that provide the programmer with high-level abstractions to develop complex parallel software with reduced time to solution. Pattern-based parallel programming is based on a set of composable and customizable parallel patterns used as basic building blocks in parallel applications. In recent years, a considerable effort has been made in empowering this programming model with features able to overcome shortcomings of early approaches concerning flexibility and performance. In this article, we demonstrate that the approach is flexible and efficient enough by applying it on 12 out of 13 PARSEC applications. Our analysis, conducted on three different multicore architectures, demonstrates that pattern-based parallel programming has reached a good level of maturity, providing comparable results in terms of performance with respect to both other parallel programming methodologies based on pragma-based annotations (i.e., OpenMP and OmpSs) and native implementations (i.e., Pthreads). Regarding the programming effort, we also demonstrate a considerable reduction in lines of code and code churn compared to Pthreads and comparable results with respect to other existing implementations.
@article{p3arsec:taco17,author={De Sensi, Daniele and De Matteis, Tiziano and Torquati, Massimo and Mencagli, Gabriele and Danelutto, Marco},title={Bringing Parallel Patterns Out of the Corner: The P$^{3}$ARSEC Benchmark Suite},journal={ACM Trans. Archit. Code Optim.},issue_date={October 2017},volume={14},number={4},month=oct,year={2017},issn={1544-3566},pages={33:1--33:26},articleno={33},numpages={26},url={http://doi.acm.org/10.1145/3132710},openaccess={http://dl.acm.org/authorize?N49996},doi={10.1145/3132710},acmid={3132710},publisher={ACM},address={New York, NY, USA},keywords={Parallel patterns, algorithmic skeletons, benchmarking, multicore programming, parsec},}
AUTODASP17
Nornir: A Customisable Framework for Autonomic and Power-Aware Applications
Daniele De Sensi, Tiziano De Matteis, and Marco Danelutto
In Euro-Par 2017 Workshops, Proc. of the Auto-DaSP Workshop, Aug 2017
A desirable characteristic of modern parallel applications is the ability to dynamically select the amount of resources to be used to meet requirements on performance or power consumption. In many cases, providing explicit guarantees on performance is of paramount importance. In streaming applications, this is related to the concept of elasticity, i.e. being able to allocate the proper amount of resources to match the current demand as closely as possible. Similarly, in other scenarios, it may be useful to limit the maximum power consumption of an application so that it does not exceed the available power budget. In this paper we propose Nornir, a customizable C++ framework for autonomic and power-aware parallel applications on shared memory multicore machines. Nornir can be used by autonomic strategy designers to implement new algorithms and by application users to enforce requirements on their applications.
@inproceedings{nornir:autodasp17,author={De Sensi, Daniele and De Matteis, Tiziano and Danelutto, Marco},title={Nornir: A Customisable Framework for Autonomic and Power-Aware Applications},booktitle={Euro-Par 2017 Workshops, Proc. of the Auto-DaSP Workshop},year={2017},}
This work studies the issues related to dynamic memory management in Data Stream Processing, an emerging paradigm enabling the real-time processing of live data streams. In this paper, we consider two streaming parallel patterns and we discuss different implementation variants related to how dynamic memory is managed. The results show that the standard mechanisms provided by modern C++ are not entirely adequate for maximizing performance. Instead, the combined use of an efficient general-purpose memory allocator, a custom allocator optimized for the pattern considered, and a custom variant of the C++ shared pointer mechanism provides a performance improvement of up to 16% in the best case.
@article{jsc17,author={Torquati, Massimo and Mencagli, Gabriele and Drocco, Maurizio and Aldinucci, Marco and De Matteis, Tiziano and Danelutto, Marco},title={On dynamic memory allocation in sliding-window parallel patterns for streaming analytics},journal={The Journal of Supercomputing},year={2017},month=sep,day={27},doi={10.1007/s11227-017-2152-1},issn={1573-0484},url={https://doi.org/10.1007/s11227-017-2152-1},}
Techniques to handle traffic bursts and out-of-order arrivals are of paramount importance to provide real-time sensor data analytics in domains like traffic surveillance, transportation management, healthcare and security applications. In these systems the amount of raw data coming from sensors must be analyzed by continuous queries that extract value-added information used to make informed decisions in real-time. To perform this task with timing constraints, parallelism must be exploited in the query execution in order to enable the real-time processing on parallel architectures. In this paper we focus on continuous preference queries, a representative class of continuous queries for decision making, and we propose a parallel query model targeting the efficient processing over out-of-order and bursty data streams. We study how to integrate punctuation mechanisms in order to enable out-of-order processing. Then, we present advanced scheduling strategies targeting scenarios with different burstiness levels, parameterized using the index of dispersion quantity. Extensive experiments have been performed using synthetic datasets and real-world data streams obtained from an existing real-time locating system. The experimental evaluation demonstrates the efficiency of our parallel solution and its effectiveness in handling the out-of-orderness degrees and burstiness levels of real-world applications.
@article{tpds17,author={Mencagli, Gabriele and Torquati, Massimo and Danelutto, Marco and De Matteis, Tiziano},title={Parallel Continuous Preference Queries over Out-of-Order and Bursty Data Streams},journal={IEEE Transactions on Parallel and Distributed Systems},year={2017},volume={28},number={9},pages={2608-2624},month=sep,doi={10.1109/TPDS.2017.2679197},issn={1045-9219},}
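A minimal punctuation-based reordering buffer conveys the mechanism that enables out-of-order processing (an illustrative sketch, not the paper's parallel implementation):

import heapq

class ReorderBuffer:
    def __init__(self):
        self.heap = []                    # min-heap ordered by timestamp

    def insert(self, ts, value):
        heapq.heappush(self.heap, (ts, value))

    def punctuation(self, ts):
        # A punctuation guarantees no tuple with timestamp <= ts is still
        # in flight, so buffered tuples up to ts can be emitted in order.
        out = []
        while self.heap and self.heap[0][0] <= ts:
            out.append(heapq.heappop(self.heap))
        return out

buf = ReorderBuffer()
for ts, v in [(3, "c"), (1, "a"), (2, "b"), (5, "e")]:
    buf.insert(ts, v)
print(buf.punctuation(3))   # [(1, 'a'), (2, 'b'), (3, 'c')]; tuple 5 keeps waiting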
2016
SEPS16
A Divide-and-conquer Parallel Pattern Implementation for Multicores
Marco Danelutto, Tiziano De Matteis, Gabriele Mencagli, and 1 more author
In Proceedings of the 3rd International Workshop on Software Engineering for Parallel Systems, Amsterdam, Netherlands, Nov 2016
@inproceedings{seps16,author={Danelutto, Marco and De Matteis, Tiziano and Mencagli, Gabriele and Torquati, Massimo},title={A Divide-and-conquer Parallel Pattern Implementation for Multicores},booktitle={Proceedings of the 3rd International Workshop on Software Engineering for Parallel Systems},series={SEPS 2016},year={2016},isbn={978-1-4503-4641-2},location={Amsterdam, Netherlands},pages={10--19},numpages={10},url={http://doi.acm.org/10.1145/3002125.3002128},doi={10.1145/3002125.3002128},acmid={3002128},publisher={ACM},address={New York, NY, USA},keywords={Divide and Conquer, High-level parallel patterns},}
Time-to-solution is an important metric when parallelizing existing code. The REPARA approach provides a systematic way to instantiate stream and data parallel patterns by annotating the sequential source code with C++11 attributes. Annotations are automatically transformed into target parallel code that uses existing libraries for parallel programming (e.g., FastFlow). In this paper, we apply this approach to the parallelization of a data stream processing application. The description shows the effectiveness of the approach in easily and quickly prototyping several parallel variants of the sequential code, obtaining good overall performance in terms of both throughput and latency.
@article{js2016,author={Danelutto, Marco and De Matteis, Tiziano and Mencagli, Gabriele and Torquati, Massimo},title={Data stream processing via code annotations},journal={The Journal of Supercomputing},year={2016},pages={1--15},issn={1573-0484},doi={10.1007/s11227-016-1793-9},url={http://dx.doi.org/10.1007/s11227-016-1793-9},}
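For flavor, here is a hedged sketch of what such an annotated loop can look like. The attribute spellings (rpr::pipeline, rpr::kernel, rpr::farm, rpr::in, rpr::out) follow the REPARA project's papers and may differ from the exact toolchain syntax; the helper functions are stand-ins for real application code. A normal compiler simply ignores the unknown attributes, while the REPARA source-to-source tool rewrites the loop into, e.g., a FastFlow pipeline whose second stage is a farm of replicated workers:

    #include <string>
    #include <vector>

    // Hypothetical stand-ins for the real application code.
    std::vector<std::string> read_batch() { return {"a", "b", "c"}; }
    std::string parse(const std::string& raw) { return raw + "!"; }
    void process(const std::string&) { /* consume the record */ }

    void run() {
        [[rpr::pipeline]]
        for (const std::string& raw : read_batch()) {
            [[rpr::kernel, rpr::in(raw), rpr::out(rec)]]
            std::string rec = parse(raw);   // pipeline stage 1
            [[rpr::kernel, rpr::farm, rpr::in(rec)]]
            process(rec);                   // stage 2, replicated workers
        }
    }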
The emergence of real-time decision-making applications in domains like high-frequency trading, emergency management, and service-level analysis in communication networks has led to the definition of new classes of queries. Skyline queries are a notable example: their results consist of all the tuples whose attribute vector is not dominated (in the Pareto sense) by that of any other tuple. Because of their popularity, skyline queries have been studied in terms of both sequential algorithms and parallel implementations for multiprocessors and clusters. Within the Data Stream Processing paradigm, traditional database queries on static relations have been revised to operate on continuous data streams. Most past papers propose sequential algorithms for continuous skyline queries, whereas very few works target implementations on parallel machines. This paper contributes to filling this gap by proposing a parallel implementation for multicore architectures. We propose: i) a parallelization of the eager algorithm based on the notion of Skyline Influence Time, ii) optimizations of the reduce phase and load-balancing strategies to achieve near-optimal speedup, and iii) a set of experiments with both synthetic benchmarks and a real dataset that show the effectiveness of our implementation.
@article{ccpe2016,author={De Matteis, Tiziano and Di Girolamo, Salvatore and Mencagli, Gabriele},title={Continuous Skyline Queries on Multicore Architectures},journal={Concurrency and Computation: Practice and Experience},year={2016},volume={28},number={12},pages={3503--3522},doi={10.1002/cpe.3866},issn={1532-0634},url={http://dx.doi.org/10.1002/cpe.3866},}
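The Pareto-dominance test at the heart of any skyline computation is compact enough to state directly. This is the textbook definition (assuming smaller attribute values are preferred), not code from the paper:

    #include <vector>

    // a dominates b iff a is no worse on every attribute and strictly
    // better on at least one; here "better" means smaller. The skyline is
    // the set of tuples not dominated by any other tuple.
    bool dominates(const std::vector<double>& a, const std::vector<double>& b) {
        bool strictly_better = false;
        for (std::size_t i = 0; i < a.size(); ++i) {
            if (a[i] > b[i]) return false;        // worse on one attribute
            if (a[i] < b[i]) strictly_better = true;
        }
        return strictly_better;
    }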
PPoPP16
Keep Calm and React with Foresight: Strategies for Low-Latency and Energy-Efficient Elastic Data Stream Processing
Tiziano De Matteis and Gabriele Mencagli
In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), Barcelona, Spain, Mar 2016
This paper addresses the problem of designing control strategies for elastic stream processing applications. Elasticity allows applications to rapidly change their configuration (e.g., the number of used resources) on the fly, in response to fluctuations of their workload. In this work we tackle the problem by adopting Model Predictive Control, a control-theoretic method that finds the optimal application configuration along a limited prediction horizon by solving an online optimization problem. Our control strategies are designed to address latency constraints, using Queueing Theory models, and energy consumption, by changing the number of cores in use and the CPU frequency through the Dynamic Voltage and Frequency Scaling (DVFS) support of modern multicore CPUs. The proactive capabilities, in addition to the latency- and energy-awareness, represent the novel features of our approach. Experiments performed using a high-frequency trading application show the effectiveness of our approach compared with state-of-the-art techniques.
@inproceedings{ppopp2016,author={De Matteis, Tiziano and Mencagli, Gabriele},title={Keep Calm and React with Foresight: Strategies for Low-Latency and Energy-Efficient Elastic Data Stream Processing},booktitle={Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)},year={2016},pages={13:1--13:12},articleno={13},doi={10.1145/2851141.2851148},isbn={978-1-4503-4092-2},location={Barcelona, Spain},numpages={12},url={http://doi.acm.org/10.1145/2851141.2851148},}
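To give an idea of the kind of decision such a controller makes, here is a deliberately simplified single-step sketch: it scans feasible (cores, frequency) configurations, estimates latency with an M/M/1-style formula, and picks the cheapest configuration that meets the latency bound. Every model, name, and constant below is a placeholder; the paper's controller optimizes over a multi-step horizon with calibrated models:

    #include <limits>
    #include <vector>

    struct Config { int cores; double ghz; };

    // Single-step sketch of an MPC-like choice: minimize a crude dynamic
    // power proxy (~ cores * f^3) subject to an M/M/1 latency estimate.
    Config choose(double predicted_rate,            // tuples/s forecast
                  double service_rate_per_core_ghz, // tuples/s per core per GHz
                  double latency_bound_s,
                  const std::vector<Config>& feasible) {
        Config best{1, 1.0};   // fallback if nothing satisfies the bound
        double best_power = std::numeric_limits<double>::infinity();
        for (const Config& c : feasible) {
            double mu = c.cores * c.ghz * service_rate_per_core_ghz;
            if (mu <= predicted_rate) continue;             // unstable queue
            double latency = 1.0 / (mu - predicted_rate);   // M/M/1 response time
            if (latency > latency_bound_s) continue;        // violates the SLO
            double power = c.cores * c.ghz * c.ghz * c.ghz; // power proxy
            if (power < best_power) { best_power = power; best = c; }
        }
        return best;
    }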
PhD Thesis
Parallel Patterns for Adaptive Data Stream Processing
Tiziano De Matteis
2015
EuroPar15
A Multicore Parallelization of Continuous Skyline Queries on Data Streams
Tiziano De Matteis, Salvatore Di Girolamo, and Gabriele Mencagli
In Proceedings of the 2015 International Conference on Parallel Processing (Euro-Par), Vienna, Austria, 2015
@inproceedings{europar2015,author={De Matteis, Tiziano and Di Girolamo, Salvatore and Mencagli, Gabriele},booktitle={Proceedings of the 2015 International Conference on Parallel Processing (Euro-Par)},title={A Multicore Parallelization of Continuous Skyline Queries on Data Streams},year={2015},pages={402--413},doi={},address={Vienna, Austria},}
2014
PDP14
Optimizing Message-Passing on Multicore Architectures Using Hardware Multi-threading
Daniele Buono, Tiziano De Matteis, Gabriele Mencagli, and 1 more author
In Proceedings of the 22nd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), Torino, Italy, 2014
@inproceedings{pdp2014,author={Buono, Daniele and De Matteis, Tiziano and Mencagli, Gabriele and Vanneschi, Marco},booktitle={Parallel, Distributed and Network-Based Processing (PDP), 2014 22nd Euromicro International Conference on},title={Optimizing Message-Passing on Multicore Architectures Using Hardware Multi-threading},year={2014},address={Torino, Italy},pages={262-270},doi={10.1109/PDP.2014.63},issn={1066-6192},}
ISPA14
A High-Throughput and Low-Latency Parallelization of Window-based Stream Joins on Multicores
Daniele Buono, Tiziano De Matteis, and Gabriele Mencagli
In Proceedings of the 12th IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA), Milano, Italy, 2014
@inproceedings{ispa2014,author={Buono, Daniele and De Matteis, Tiziano and Mencagli, Gabriele},booktitle={12th IEEE International Symposium on Parallel and Distributed Processing with Applications},title={A High-Throughput and Low-Latency Parallelization of Window-based Stream Joins on Multicores},year={2014},isbn={978-1-4799-4293-0},pages={117--126},numpages={10},url={http://dx.doi.org/10.1109/ISPA.2014.24},doi={10.1109/ISPA.2014.24},acmid={2681942},publisher={IEEE Computer Society},address={Milano, Italy},}
HPCS14
Autonomic Parallel Data Stream Processing
Tiziano De Matteis
In 2014 International Conference on High Performance Computing & Simulation (HPCS), Jul 2014
PDCN14
A Lightweight Run-Time Support for Fast Dense Linear Algebra on Multi-Core
Daniele Buono, Marco Danelutto, Tiziano De Matteis, and 2 more authors
In Proceedings of the 12th IASTED International Conference on Parallel and Distributed Computing and Networks, Innsbruck, Austria, 2014
This work proposes ffMDF, a lightweight dynamic run-time support able to achieve high performance in the execution of dense linear algebra kernels on shared-cache multicores. ffMDF implements a dynamic macro-dataflow interpreter that processes DAGs generated on the fly from standard numeric kernel code. The experimental results demonstrate that the performance obtained using ffMDF on both fine-grain and coarse-grain problems is comparable with, or even better than, that achieved by de facto standard solutions (notably the PLASMA library), which use separate run-time supports specifically optimized for different computational grains on modern multicores.
@inproceedings{pdcn2014,author={Buono, Daniele and Danelutto, Marco and De Matteis, Tiziano and Mencagli, Gabriele and Torquati, Massimo},title={A Lightweight Run-Time Support for Fast Dense Linear Algebra on Multi-Core},booktitle={Proceedings of 12th IASTED International Conference on Parallel and Distributed Computing and Networks},year={2014},publisher={Iasted},address={Innsbruck, Austria},}
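The dependency-counting core of a macro-dataflow interpreter like the one described can be sketched in a few lines; this is an illustration under our own naming, not ffMDF's actual code:

    #include <atomic>
    #include <functional>
    #include <vector>

    // Each DAG node carries a count of unmet input dependencies; finishing
    // a task decrements its successors' counts and fires those reaching
    // zero, so the schedule unfolds dynamically as the DAG is generated.
    struct Task {
        std::function<void()> kernel;     // e.g. a BLAS tile operation
        std::vector<Task*> successors;
        std::atomic<int> pending{0};      // unmet input dependencies
    };

    // ReadyQueue is any queue of Task* shared with the worker threads,
    // e.g. a concurrent MPMC queue with a push(Task*) method.
    template <typename ReadyQueue>
    void complete(Task& t, ReadyQueue& ready) {
        t.kernel();
        for (Task* s : t.successors)
            if (s->pending.fetch_sub(1) == 1)  // last dependency satisfied
                ready.push(s);                 // now executable by any worker
    }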
2013
PDCN13
Evaluation of Architectural Supports for Fine-Grained Synchronization Mechanisms
Tiziano De Matteis, Fabio Luporini, Gabriele Mencagli, and 1 more author
In Proceedings of the 11th IASTED International Conference on Parallel and Distributed Computing and Networks, Innsbruck, Austria, 2013
The advent of multi-/many-core architectures demands efficient run-time supports to sustain the scalability of parallel applications. Synchronization mechanisms should be optimized to account for different scenarios, such as the interaction between threads executed on different cores as well as intra-core synchronization, i.e., involving threads executed on hardware contexts of the same core. From this perspective, we describe the design issues of two notable mechanisms for shared-memory parallel computations. We point out how specific architectural supports, like hardware cache coherence and core-to-core interconnection networks, make it possible to design optimized implementations of such mechanisms. In this paper we discuss experimental results on three representative architectures: a flagship Intel multicore and two interesting network processors. The final result helps to untangle the complex implementation space of synchronization mechanisms.
@inproceedings{pdcn2013,author={{De Matteis}, Tiziano and Luporini, Fabio and Mencagli, Gabriele and Vanneschi, Marco},title={Evaluation of Architectural Supports for Fine-Grained Synchronization Mechanisms},booktitle={Proceedings of the 11th IASTED International Conference on Parallel and Distributed Computing and Networks},year={2013},address={Innsbruck, Austria},publisher={Iasted},isbn={978-088986943-1},}
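As a concrete point in the implementation space the paper explores, here is a standard test-and-test-and-set spin lock. It is a generic textbook sketch, not one of the paper's measured mechanisms, but it shows where the architectural supports matter: the read-only inner loop spins in the local cache thanks to hardware coherence, and the pause hint yields pipeline resources to a sibling hardware context during intra-core contention:

    #include <atomic>
    #if defined(__x86_64__)
    #include <immintrin.h>   // _mm_pause
    #endif

    class SpinLock {
        std::atomic<bool> locked_{false};
    public:
        void lock() {
            for (;;) {
                // Attempt the atomic exchange only when the lock looks free.
                if (!locked_.exchange(true, std::memory_order_acquire)) return;
                // Read-only spin: stays in the local cache line until the
                // coherence protocol invalidates it on unlock.
                while (locked_.load(std::memory_order_relaxed)) {
    #if defined(__x86_64__)
                    _mm_pause();   // be polite to the sibling SMT context
    #endif
                }
            }
        }
        void unlock() { locked_.store(false, std::memory_order_release); }
    };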