Software Engineer and AI Researcher
Hello! I'm Reza, an AI engineer and embedded systems developer with a deep passion for building efficient machine learning systems from the ground up. My work bridges the gap between high-level model optimization (such as pruning, quantization, and transformer compression) and low-level system design on real-world hardware.
I thrive at the intersection of software and hardware, where performance, scalability, and resource constraints drive innovation. Whether optimizing deep learning models for deployment or developing firmware and drivers on embedded platforms, I bring a systems-level mindset and hands-on experience that spans the entire ML stack.
Performing matrix-vector multiplication using row-wise decomposition on the GPU and measuring the speedup.
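As a rough illustration of the row-wise idea (not the project's actual code), a sketch using Numba's CUDA support is shown below: each GPU thread owns one matrix row and accumulates its dot product with the vector. The matrix size and launch configuration here are placeholders.

```python
# Illustrative sketch: row-wise matrix-vector multiplication on the GPU with
# Numba CUDA. One thread computes the dot product of one matrix row.
import numpy as np
from numba import cuda

@cuda.jit
def mv_rowwise(A, x, y):
    row = cuda.grid(1)              # global thread index = row index
    if row < A.shape[0]:
        acc = 0.0
        for j in range(A.shape[1]):
            acc += A[row, j] * x[j]
        y[row] = acc

A = np.random.rand(1024, 1024).astype(np.float32)
x = np.random.rand(1024).astype(np.float32)
y = np.zeros(1024, dtype=np.float32)

threads = 256
blocks = (A.shape[0] + threads - 1) // threads
mv_rowwise[blocks, threads](A, x, y)        # Numba copies the arrays to the device

np.testing.assert_allclose(y, A @ x, rtol=1e-3)   # check against the CPU result
```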
The primary focus of this project is on data visualization, training Support Vector Machine (SVM) models, and building a neural network for classification tasks.
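A minimal scikit-learn sketch of the SVM part is shown below; the dataset and hyperparameters are illustrative stand-ins, not the project's own data or settings.

```python
# Illustrative sketch: training and evaluating an SVM classifier with scikit-learn
# on a toy dataset (stand-in for the project's data).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))   # scale, then fit an RBF SVM
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```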
Deploying a CNN model on the Kria KV260 edge device and measuring its inference performance.
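A small sketch of how per-image latency and throughput can be measured on the board is shown below. The `infer_fn` argument is a hypothetical placeholder for the board-specific DPU runner call, which is not shown here.

```python
# Illustrative benchmarking helper for per-image inference latency; infer_fn is
# a hypothetical stand-in for the board-specific DPU runner call.
import time
import statistics

def benchmark(infer_fn, images, warmup=10, runs=100):
    for img in images[:warmup]:                 # untimed warm-up iterations
        infer_fn(img)
    latencies = []
    for img in images[:runs]:
        t0 = time.perf_counter()
        infer_fn(img)
        latencies.append(time.perf_counter() - t0)
    mean_s = statistics.mean(latencies)
    return mean_s * 1e3, 1.0 / mean_s           # mean latency (ms), throughput (img/s)
```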
Implemented MPI-based CNN inference with pipelining on CPU clusters for speedup.
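A minimal mpi4py sketch of the pipelining idea is shown below (not the project's code): each rank runs one stage of the network and forwards its activations to the next rank, so different images can be in different stages at the same time. The `stage` function is a placeholder for a chunk of CNN layers.

```python
# Illustrative sketch: pipelined inference with mpi4py. Rank r applies stage r
# of the model to each incoming activation and forwards the result to rank r+1.
# Run with e.g.: mpirun -np 3 python pipeline.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

def stage(x, r):
    # Placeholder for the chunk of CNN layers assigned to rank r.
    return np.tanh(x + r)

NUM_IMAGES = 8
for i in range(NUM_IMAGES):
    if rank == 0:
        x = np.random.rand(3, 32, 32).astype(np.float32)    # "load" an image
    else:
        x = comm.recv(source=rank - 1, tag=i)                # activations from the previous stage
    y = stage(x, rank)
    if rank < size - 1:
        comm.send(y, dest=rank + 1, tag=i)                   # hand off to the next stage
    else:
        print(f"rank {rank}: finished image {i}")
```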
Complete pipeline for training, quantizing, and testing an AlexNet model on the CIFAR-10 dataset using PyTorch and Vitis AI.
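The actual project targets the DPU through the Vitis AI quantizer; as a generic illustration of the float-train, calibrate, int8-evaluate flow, a sketch using PyTorch's own FX post-training quantization is shown below. The input sizes and calibration data are dummy stand-ins.

```python
# Illustrative post-training quantization sketch using PyTorch's FX workflow
# (the project itself uses the Vitis AI quantizer to produce a DPU-ready model).
import torch
import torchvision
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

model = torchvision.models.alexnet(num_classes=10).eval()   # CIFAR-10 has 10 classes
example = (torch.randn(1, 3, 224, 224),)                    # stand-in for a resized CIFAR-10 image

qconfig_mapping = get_default_qconfig_mapping("fbgemm")
prepared = prepare_fx(model, qconfig_mapping, example)      # insert observers

with torch.no_grad():
    prepared(torch.randn(8, 3, 224, 224))                   # calibration (real flow uses CIFAR-10 batches)

quantized = convert_fx(prepared)                            # int8 model for evaluation
print(quantized(torch.randn(1, 3, 224, 224)).shape)
```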
Designed and developed a C++ engine for performing arithmetic operations on extremely large integers (up to 200 digits), bypassing the limitations of built-in data types.
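Python has arbitrary-precision integers built in, but the digit-array idea behind such an engine looks roughly like the addition sketch below (the real engine is written in C++ and also implements the other arithmetic operations).

```python
# Illustrative sketch of the digit-array approach: numbers are stored as lists
# of decimal digits (least-significant first) and added with an explicit carry,
# instead of relying on built-in fixed-width integer types.
def to_digits(s):
    return [int(c) for c in reversed(s)]

def add(a, b):
    digits, carry = [], 0
    for i in range(max(len(a), len(b))):
        s = carry + (a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0)
        digits.append(s % 10)
        carry = s // 10
    if carry:
        digits.append(carry)
    return digits

def to_string(d):
    return "".join(str(x) for x in reversed(d))

x = to_digits("9" * 200)           # a 200-digit number
y = to_digits("1")
print(to_string(add(x, y)))        # prints 1 followed by 200 zeros
```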
A desktop application that detects and verifies passenger identities using face recognition and passport data, matching live webcam images with stored passport photos.
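A minimal sketch of the matching step is shown below, assuming the face_recognition library and OpenCV for the webcam; the file path, camera index, and tolerance are placeholders, and the real application also handles passport-data parsing and the surrounding GUI.

```python
# Illustrative sketch: compare a live webcam frame against a stored passport
# photo using the face_recognition library. Paths and thresholds are placeholders.
import cv2
import face_recognition

passport_img = face_recognition.load_image_file("passport_photo.jpg")
passport_enc = face_recognition.face_encodings(passport_img)[0]

cap = cv2.VideoCapture(0)                      # grab one live webcam frame
ok, frame = cap.read()
cap.release()
if not ok:
    raise RuntimeError("could not read from webcam")

rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)   # face_recognition expects RGB images
live_encs = face_recognition.face_encodings(rgb)

if live_encs:
    match = face_recognition.compare_faces([passport_enc], live_encs[0], tolerance=0.6)[0]
    print("identity verified" if match else "identity mismatch")
else:
    print("no face detected in webcam frame")
```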
Authors: Reza Jahadi, Ehsan Atoofian
Developed a GPU hardware–software co-design to accelerate attention layers by reducing memory footprint and offloading non-MMA operations to Tensor Cores. Achieved a 13.4% performance improvement and an 18.3% energy-delay reduction over software-only optimizations.
Authors: Reza Jahadi, Ehsan Atoofian
We propose a value-aware register file design for Tensor Cores that leverages the bit-level sparsity in CNNs to reduce leakage power. By introducing LPS, LPS+, and PLPS+ SRAM cells, we achieve up to 77.3% power savings with negligible accuracy loss.
Authors: Mohammad Hafezan, Reza Jahadi, Ehsan Atoofian
We present PCTC, a co-designed hardware-software approach that enables efficient execution of Capsule Networks on NVIDIA Tensor Cores. By rearchitecting matrix-vector operations and introducing structured pruning tailored to capsule layers, PCTC achieves up to 31% energy savings.
I'm a big fan of movies, especially thrillers and mysteries. I enjoy stories that keep me guessing, with clever plot twists, psychological depth, and suspenseful narratives.
I love traveling and exploring new places. Being in nature, whether it's hiking through forests, visiting lakes and waterfalls, or simply enjoying a scenic view, gives me a deep sense of peace and energy. Sometimes you just need to get away from all the noise and unwind in nature to clear your head.