I am an Assistant Professor in the @Large Research group at the Vrije Universiteit Amsterdam.
Before that, I was a postdoctoral researcher in the SPCL group at ETH Zurich, working on reconfigurable
hardware for High-Performance Computing. I received my Ph.D. from the University of Pisa.
My principal research interests are related to High-Performance Computing (HPC), with a particular focus on Dataflow Accelerators (e.g., FPGAs, ML Accelerators) for HPC, Parallel Programming, and Sustainability. In approaching these topics, my main objective is to provide the application programmer with high-level abstractions and tools to develop complex parallel software with reduced time-to-market. To learn more about my research, please check my publications page.
If you want to get in touch, feel free to drop me an email.
Dataflow devices represent an avenue towards saving the control and data movement overhead of Load-Store Architectures. Various dataflow accelerators have been proposed, but how to efficiently schedule applications on such devices remains an open problem. The programmer can explicitly implement both temporal and spatial parallelism, and pipelining across multiple processing elements can be crucial to take advantage of the fast on-chip interconnect, enabling the concurrent execution of different program components. This paper introduces canonical task graphs, a model that enables streaming scheduling of task graphs over dataflow architectures. We show how a task graph can be statically analyzed to understand its steady-state behavior, and we use this information to partition it into temporally multiplexed components of spatially executed tasks. Results on synthetic and realistic workloads show how streaming scheduling can increase speedup and device utilization over a traditional scheduling approach.
@inproceedings{streaming_scheduling,author={De Matteis, Tiziano and Gianinazzi, Lukas and de Fine Licht, Johannes and Hoefler, Torsten},title={{Streaming Task Graph Scheduling for Dataflow Architectures}},year={2023},month=jun,pages={225--237},numpages={13},booktitle={Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing (HPDC'23)},location={Orlando, FL, USA},publisher={ACM},isbn={9798400701559},doi={10.1145/3588195.3592999},url={https://dl.acm.org/doi/10.1145/3588195.3592999},}
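A toy illustration of the steady-state intuition behind streaming scheduling (this is a sketch of the general idea, not the paper's actual model or algorithm): tasks pipelined spatially within one component run at the rate of their slowest member, while components that cannot coexist on the device are multiplexed in time and paid for in sequence.

```python
# Toy sketch (not the paper's algorithm): steady-state throughput of a
# spatially pipelined component, and total time when components are
# temporally multiplexed on the device.

def component_throughput(task_rates):
    """A chain of pipelined tasks runs at the rate of its slowest stage."""
    return min(task_rates)

def schedule_time(components, n_items):
    """Run each spatial component to completion in turn (time multiplexing)."""
    return sum(n_items / component_throughput(c) for c in components)

# Fusing all four tasks into one pipeline vs. splitting into two components:
fused = schedule_time([[4.0, 2.0, 8.0, 2.0]], n_items=100)    # 50.0 time units
split = schedule_time([[4.0, 2.0], [8.0, 2.0]], n_items=100)  # 100.0 time units
```

The sketch shows why spatial pipelining pays off when it fits: fusing avoids serializing the components, but the partitioning decision must respect device resources, which is what the paper's canonical task graph analysis addresses.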
SIGMETRICS23
Noise in the Clouds: Influence of Network Performance Variability on Application Scalability
Daniele De Sensi, Tiziano De Matteis, Konstantin Taranov, and 3 more authors
Proc. ACM Meas. Anal. Comput. Syst., New York, NY, USA, Dec 2022
Cloud computing represents an appealing opportunity for cost-effective deployment of HPC workloads on the best-fitting hardware. However, although cloud and on-premise HPC systems offer similar computational resources, their network architecture and performance may differ significantly. For example, these systems use fundamentally different network transport and routing protocols, which may introduce network noise that can eventually limit the application scaling. This work analyzes network performance, scalability, and cost of running HPC workloads on cloud systems. First, we consider latency, bandwidth, and collective communication patterns in detailed small-scale measurements, and then we simulate network performance at a larger scale. We validate our approach on four popular cloud providers and three on-premise HPC systems, showing that network (and also OS) noise can significantly impact performance and cost both at small and large scale.
@article{noise_cloud,author={De Sensi, Daniele and De Matteis, Tiziano and Taranov, Konstantin and Di Girolamo, Salvatore and Rahn, Tobias and Hoefler, Torsten},title={{Noise in the Clouds: Influence of Network Performance Variability on Application Scalability}},journal={Proc. ACM Meas. Anal. Comput. Syst.},year={2022},month=dec,volume={6},number={3},location={New York, NY, USA},publisher={Association for Computing Machinery},}
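One way to see why network noise limits scaling, sketched with an assumed toy model (not the paper's measurement methodology or simulator): in a bulk-synchronous step, every rank must finish before the next step starts, so the step takes the maximum of the per-rank latencies, and rare latency spikes hurt more as the rank count grows.

```python
import random

# Assumed toy model: each rank's step latency is a fast baseline, with a
# small probability of a noise spike. A synchronized step waits for the
# slowest rank, so its duration is the max over all ranks.

def step_time(n_ranks, rng, base=1.0, spike=50.0, p_spike=0.01):
    return max(spike if rng.random() < p_spike else base
               for _ in range(n_ranks))

def mean_step(n_ranks, steps=2000):
    rng = random.Random(0)  # seeded for reproducibility
    return sum(step_time(n_ranks, rng) for _ in range(steps)) / steps
```

With these (made-up) parameters, `mean_step(256)` is far larger than `mean_step(4)`: at scale, almost every step contains at least one spiked rank, even though each rank is individually fast almost all the time.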
SC20
fBLAS: streaming linear algebra on FPGA
Tiziano De Matteis, Johannes de Fine Licht, and Torsten Hoefler
In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Atlanta, Georgia, Nov 2020
Spatial computing architectures pose an attractive alternative to mitigate control and data movement overheads typical of load-store architectures. In practice, these devices are rarely considered in the HPC community due to the steep learning curve, low productivity, and the lack of available libraries for fundamental operations. High-level synthesis (HLS) tools are facilitating hardware programming, but optimizing for these architectures requires factoring in new transformations and resources/performance trade-offs. We present fBLAS, an open-source HLS implementation of BLAS for FPGAs, that enables reusability, portability and easy integration with existing software and hardware codes. fBLAS’ implementation allows scaling hardware modules to exploit on-chip resources, and module interfaces are designed to natively support streaming on-chip communications, allowing them to be composed to reduce off-chip communication. With fBLAS, we set a precedent for FPGA library design, and contribute to the toolbox of customizable hardware components necessary for HPC codes to start productively targeting reconfigurable platforms.
@inproceedings{fblas,author={De Matteis, Tiziano and de Fine Licht, Johannes and Hoefler, Torsten},title={fBLAS: streaming linear algebra on FPGA},year={2020},isbn={9781728199986},publisher={IEEE Press},booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},articleno={59},numpages={13},keywords={hardware library, high level synthesis, spatial architectures},location={Atlanta, Georgia},series={SC '20},}
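A rough software analogy for fBLAS's streaming composition, with Python generators standing in for hardware streams (the names below mirror BLAS routine names; the actual library is HLS C++, not this): composed modules exchange elements on-chip as they are produced, so intermediates never round-trip through off-chip memory.

```python
# Generators as stand-ins for on-chip streams between composed modules.

def scal(alpha, xs):        # x <- alpha * x, element by element
    for x in xs:
        yield alpha * x

def axpy(alpha, xs, ys):    # alpha * x + y, element by element
    for x, y in zip(xs, ys):
        yield alpha * x + y

def dot(xs, ys):            # reduction at the end of the pipeline
    return sum(x * y for x, y in zip(xs, ys))

# Compose r = dot(2*x, x + y) without materializing any intermediate vector:
x, y = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
r = dot(scal(2.0, iter(x)), axpy(1.0, iter(x), iter(y)))  # r == 92.0
```

Each generator consumes one element and emits one element per "cycle", which is the software analogue of pipelined modules connected by FIFOs.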
SC19
Streaming message interface: high-performance distributed memory programming on reconfigurable hardware
Tiziano De Matteis, Johannes de Fine Licht, Jakub Beránek, and 1 more author
In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Denver, Colorado, Nov 2019
Distributed memory programming is the established paradigm used in high-performance computing (HPC) systems, requiring explicit communication between nodes and devices. When FPGAs are deployed in distributed settings, communication is typically handled either by going through the host machine, sacrificing performance, or by streaming across fixed device-to-device connections, sacrificing flexibility. We present Streaming Message Interface (SMI), a communication model and API that unifies explicit message passing with a hardware-oriented programming model, facilitating minimal-overhead, flexible, and productive inter-FPGA communication. Instead of bulk transmission, messages are streamed across the network during computation, allowing communication to be seamlessly integrated into pipelined designs. We present a high-level synthesis implementation of SMI targeting a dedicated FPGA interconnect, exposing runtime-configurable routing with support for arbitrary network topologies, and implement a set of distributed memory benchmarks. Using SMI, programmers can implement distributed, scalable HPC programs on reconfigurable hardware, without deviating from best practices for hardware design.
@inproceedings{smi,author={De Matteis, Tiziano and de Fine Licht, Johannes and Ber\'{a}nek, Jakub and Hoefler, Torsten},title={Streaming message interface: high-performance distributed memory programming on reconfigurable hardware},year={2019},isbn={9781450362290},publisher={Association for Computing Machinery},address={New York, NY, USA},url={https://doi.org/10.1145/3295500.3356201},doi={10.1145/3295500.3356201},booktitle={Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},articleno={82},numpages={33},keywords={reconfigurable computing, high-level synthesis tools, distributed memory programming},location={Denver, Colorado},series={SC '19},}
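A hedged sketch of the streaming-messages idea, using Python threads and a bounded queue as a stand-in for SMI's channels (this is not SMI's actual C++ API): instead of buffering a whole message and shipping it in bulk, the producer pushes each element into the channel as soon as it is computed, so communication overlaps with computation.

```python
from queue import Queue
from threading import Thread

# A bounded queue plays the role of an on-chip FIFO / network channel.

def producer(channel, n):
    for i in range(n):
        channel.put(i * i)   # push each result as soon as it is produced
    channel.put(None)        # end-of-stream marker

def consumer(channel, out):
    while (v := channel.get()) is not None:
        out.append(v)        # consume elements while the producer still runs

channel, received = Queue(maxsize=4), []  # small buffer, like a hardware FIFO
t1 = Thread(target=producer, args=(channel, 8))
t2 = Thread(target=consumer, args=(channel, received))
t1.start(); t2.start(); t1.join(); t2.join()
# received == [0, 1, 4, 9, 16, 25, 36, 49]
```

The bounded buffer matters: with `maxsize=4` the producer can run at most four elements ahead, exactly the back-pressure behavior a pipelined hardware design relies on.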
PPoPP16
Keep Calm and React with Foresight: Strategies for Low-Latency and Energy-Efficient Elastic Data Stream Processing
Tiziano De Matteis and Gabriele Mencagli
In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), Barcelona, Spain, Mar 2016
This paper addresses the problem of designing control strategies for elastic stream processing applications. Elasticity allows applications to rapidly change their configuration (e.g., the number of used resources) on the fly, in response to fluctuations in their workload. In this work we address this problem by adopting Model Predictive Control, a control-theoretic technique aimed at finding the optimal application configuration over a limited prediction horizon by solving an online optimization problem. Our control strategies address latency constraints, by using Queueing Theory models, and energy consumption, by changing the number of used cores and the CPU frequency through the Dynamic Voltage and Frequency Scaling (DVFS) capability of modern multi-core CPUs. The proactive capabilities, in addition to latency- and energy-awareness, are the novel features of our approach. Experiments with a high-frequency trading application show the effectiveness of our strategies compared with state-of-the-art techniques.
@inproceedings{ppopp2016,author={De Matteis, Tiziano and Mencagli, Gabriele},title={Keep Calm and React with Foresight: Strategies for Low-Latency and Energy-Efficient Elastic Data Stream Processing},booktitle={Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)},year={2016},pages={13:1--13:12},articleno={13},doi={10.1145/2851141.2851148},isbn={978-1-4503-4092-2},location={Barcelona, Spain},numpages={12},url={http://doi.acm.org/10.1145/2851141.2851148},}
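A minimal horizon-1 sketch of the control idea, under assumed toy models (the latency and power formulas below are illustrative stand-ins, not the paper's queueing or DVFS models): for a predicted arrival rate, pick the cheapest (cores, frequency) configuration whose model-predicted latency stays under the constraint.

```python
import itertools
import math

# Assumed toy models: latency from a simple single-queue formula, power
# from a cubic frequency cost. The paper uses richer queueing/DVFS models
# and a multi-step prediction horizon.

def latency(arrival, cores, freq, mu=100.0):
    service = cores * freq * mu                  # aggregate service rate
    return math.inf if service <= arrival else 1.0 / (service - arrival)

def power(cores, freq):
    return cores * freq ** 3                     # cubic DVFS cost model

def choose_config(predicted_rate, max_latency=0.01,
                  cores_opts=range(1, 9), freq_opts=(1.0, 1.5, 2.0)):
    """Cheapest feasible configuration for the predicted workload."""
    feasible = [(power(c, f), c, f)
                for c, f in itertools.product(cores_opts, freq_opts)
                if latency(predicted_rate, c, f) <= max_latency]
    _, c, f = min(feasible)
    return c, f
```

For example, `choose_config(350.0)` selects five cores at the lowest frequency, while a lighter predicted load lets the controller shed cores; reconfiguring proactively from the prediction, rather than reacting after the latency constraint is already violated, is the proactive behavior the paper argues for.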