Hey there! I’m Daniyal. If you’re here, you might have used my code
and it may have broken. If you have any questions about my open-source
projects or want to pick my brain, feel free to reach out via email,
Twitter, or LinkedIn.
I work at Samsung Semiconductor as a Senior Engineer on AI/ML
compilers, building the torch.compile + vLLM inference engine for custom
accelerators. I’m also pursuing my M.S. in CS at Georgia Tech. If you
work on torch.dynamo, Inductor, or torch-mlir, please reach out as well.
Sparse Steps to Reasoning (SSR): Pending ICML 2026
Proposed SSR, a step-wise RL framework that eliminates cross-step
gradient interference in multi-step reasoning via mostly-disjoint
parameter allocation, achieving +6.0pp on AIME and GPQA Diamond over the
best GSPO baseline.
Built on a custom fork of NVIDIA Megatron-LM to scale distributed RL
training.
Developed superweight discovery, a pre-RL analysis pipeline that
identifies and freezes the language-critical parameters most responsible
for preserving core linguistic behavior, enabling aggressive
reasoning-focused updates without destabilizing the model.
Showed that disjoint parameter allocation is the key mechanism
behind SSR’s gains: allocating separate update regions to different
reasoning steps improves multi-step RL beyond step-level credit
assignment alone.
Built KronosOpt, a custom optimizer that sequentially claims
non-overlapping top-k parameter subsets per reasoning step, giving each
step unambiguous credit signals with a step-dependent budget schedule to
prevent late-step gradient starvation.
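The sequential claiming scheme can be sketched in a few lines. This is a simplified illustration over a flat gradient vector, assuming per-step gradient magnitudes are already available; the function names and budget representation are mine, not the actual KronosOpt code:

```python
import numpy as np

def claim_disjoint_masks(step_grads, budgets):
    """Sequentially claim non-overlapping top-k parameter subsets.

    step_grads: one 1-D gradient-magnitude array per reasoning step.
    budgets: per-step parameter budgets (e.g. larger for later steps,
             so late steps aren't starved after earlier claims).
    Returns one boolean mask per step; masks are pairwise disjoint.
    """
    n = step_grads[0].size
    claimed = np.zeros(n, dtype=bool)
    masks = []
    for g, k in zip(step_grads, budgets):
        avail = np.where(~claimed)[0]                 # parameters still unclaimed
        order = avail[np.argsort(-np.abs(g[avail]))]  # rank free params by |grad|
        mask = np.zeros(n, dtype=bool)
        mask[order[:k]] = True                        # claim this step's top-k
        claimed |= mask                               # remove from future pools
        masks.append(mask)
    return masks
```

Because each step draws only from the still-unclaimed pool, every parameter receives updates attributable to exactly one reasoning step.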
Designed a KV-cache compression system for Vision-Language Models
applying FFT, DCT, and Haar Wavelet transforms to key/value caches with
multi-bin quantization, reducing memory footprint by up to 60% with
minimal accuracy degradation.
Implemented modality-aware compression with separate text and vision
configurations, three-tier buffering (exact sink, exact recent,
spectral-compressed history), and query-aware coefficient selection via
energy x query-similarity ranking.
Evaluated across 30+ datasets from the MLBench suite on LLaVA and
Yi-VL models; benchmarked 70+ configuration variants across top-K,
prompt budget, and compression mode axes.
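A minimal sketch of the spectral-compression idea, using an FFT along the sequence axis and energy-only coefficient ranking (the query-similarity term, multi-bin quantization, and three-tier buffering are omitted; all names here are illustrative):

```python
import numpy as np

def compress_kv(kv, keep_ratio=0.4):
    """Transform a (seq_len, dim) KV history to the frequency domain and
    keep only the highest-energy coefficients (a stand-in for the
    energy x query-similarity ranking described above)."""
    coeffs = np.fft.rfft(kv, axis=0)           # spectral transform per channel
    energy = np.abs(coeffs).sum(axis=1)        # total energy per frequency bin
    k = max(1, int(keep_ratio * coeffs.shape[0]))
    keep = np.argsort(-energy)[:k]             # top-k bins by energy
    return keep, coeffs[keep]

def decompress_kv(keep, kept, seq_len, dim):
    """Scatter the kept coefficients back and invert the transform."""
    coeffs = np.zeros((seq_len // 2 + 1, dim), dtype=complex)
    coeffs[keep] = kept
    return np.fft.irfft(coeffs, n=seq_len, axis=0)
```

Smooth histories concentrate energy in few bins, which is where the memory savings come from; high-frequency content is what gets discarded.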
RocketKV - KV Cache Compression for vLLM: December 2025 - Present
Optimizing RocketKV as part of my master’s thesis — a training-free
KV cache compression method for long-context LLM inference, published at
ICML 2025.
The original paper demonstrated up to 3.7x end-to-end speedup and
32.6% peak memory reduction on A100, but the reference implementation
was not practical for production use.
Adapted the two-stage compression pipeline (hybrid sparse attention
with dimension reduction + chunk-max scoring, followed by top-k exact
attention) into a vLLM plugin, achieving better decode performance than
baseline vLLM.
Wrote fused CUDA kernels using CUTLASS CuTe for Tensor Core MMA
operations, online softmax, and paged KV extraction with vectorized
128-bit loads.
The plugin registers as a vLLM platform plugin and transparently
swaps in the custom Starship attention backend — no model changes
required.
Evaluated on DeepSeek-R1-Distill models across GSM-8K, Math500,
AIME25, and LongBench with negligible accuracy loss at 512 token
budgets.
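The first stage of that pipeline can be sketched roughly as follows, for a single query vector and single head (a simplified illustration; the dimension-reduction step and the exact-attention second stage are omitted, and these names are mine, not RocketKV's API):

```python
import numpy as np

def select_tokens(q, K, chunk=4, budget=8):
    """Stage 1: score each chunk of the KV cache by the maximum q.k
    logit inside it (chunk-max), then keep whole chunks in score order
    until the token budget is filled. Stage 2 would run exact attention
    over only the kept tokens."""
    scores = K @ q                                     # per-token attention logits
    n_chunks = len(scores) // chunk
    chunk_scores = scores[: n_chunks * chunk].reshape(n_chunks, chunk).max(axis=1)
    kept = []
    for c in np.argsort(-chunk_scores):                # best-scoring chunks first
        if len(kept) + chunk > budget:
            break
        kept.extend(range(c * chunk, (c + 1) * chunk))
    return sorted(kept)
```

Scoring chunks instead of individual tokens is what makes the first stage cheap: one max per chunk stands in for its tokens, and only survivors reach the exact pass.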
CARForge - CAR-T Cell Design with Evo2: September 2025 - Present
Building a deep learning pipeline for designing optimized Chimeric
Antigen Receptor (CAR) T-cell constructs using the 7B Evo2 DNA
foundation model.
Fine-tuning Evo2 with LoRA (8.4M trainable params / 6.5B total),
structure cross-attention via OpenFold 3, and antigen conditioning
across 4,190+ target antigens from SAbDab.
Generates domain-annotated CAR sequences (scFv, VH/VL, CDRs, hinge,
transmembrane, costimulatory, CD3ζ) with continuous outcome conditioning
(expression, cytotoxicity, persistence).
Trained on 86K+ examples from Yoshida et al., Bits-to-Binders,
CARMSeD, and synthetic augmentation (codon variants, orthologs, negative
examples).
Achieved 76-77% nucleotide prediction accuracy with ~2.6-2.8
perplexity after 3 epochs of training.
LongMemory - Memory Vector Injection for Frozen LLMs
Designed a memory vector injection system for frozen LLMs;
compresses sentences into 8 orthogonal latent vectors via contrastive
attention pooling, then injects them as cross-attention at multiple
decoder layers.
Achieved 97% token accuracy on retrieval tasks.
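The pooling step might look roughly like this (a toy sketch: the contrastive training objective and the multi-layer cross-attention injection are not shown, and orthogonality is enforced here with a simple QR step rather than a learned constraint):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pool_memory(tokens, queries):
    """Attention-pool a sentence's (n_tokens, dim) embeddings into a few
    memory slots using learned slot queries, then orthonormalize the
    slots so each latent vector carries non-redundant content."""
    attn = softmax(queries @ tokens.T, axis=1)  # (slots, tokens) attention weights
    pooled = attn @ tokens                      # (slots, dim) weighted sums
    q, _ = np.linalg.qr(pooled.T)               # orthonormalize slot vectors
    return q.T
```

The fixed slot count (8 in LongMemory) caps the memory cost per sentence regardless of its length.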
Tinkerbell - Fine-Tuning Framework
Built a fine-tuning framework supporting LoRA-based PEFT, custom
loss functions (DPO, contrastive), and multi-threaded concurrent adapter
training across HuggingFace, Megatron-LM, and vLLM backends.
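As one example of the custom losses such a framework supports, here is a minimal NumPy sketch of the DPO objective (the standard formulation, not Tinkerbell's actual code; inputs are per-example sequence log-probabilities):

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss over a batch of preference
    pairs: -log sigmoid(beta * ((logp_w - ref_w) - (logp_l - ref_l))).
    Uses logaddexp for numerical stability (-log sigmoid(x) = log(1+e^-x))."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return float(np.mean(np.logaddexp(0.0, -margin)))
```

The loss falls below log 2 when the policy prefers the chosen response more strongly than the frozen reference does, and rises above it otherwise.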
GPU Programming and Architecture Lecture: July 2024
Education
Georgia Institute of Technology Master of Science in Computer Science
August 2025 - Present (Remote)
Research affiliate at GT SAIL Lab, focusing on
reinforcement learning and deep learning
Relevant Coursework: Reinforcement Learning, Deep Learning
Georgia Institute of Technology Bachelor of Science in Computer Science
August 2020 - December 2023
Concentrations: Intelligence/AI and Systems and Architecture
High Honors
Relevant Coursework: Operating Systems, Artificial Intelligence,
Advanced Algorithms and Data Structures, Robotics and Perception,
Computer Architecture, Circuit Design Lab
Work History
Samsung Semiconductor Senior Engineer, AI/ML Software Compiler December 2024 - Present — Bay Area, CA
Built the torch.compile + vLLM inference engine from scratch as a
custom extension, enabling distributed model inference across custom
accelerators.
Implemented paged attention and chunked prefill within the vLLM
extension, improving decode throughput and enabling overlap of prefill
and decode phases for reduced time-to-first-token.
Engineered expert parallelism for MoE routing, distributing expert
computation across devices to scale inference for mixture-of-experts
models.
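The routing side of expert parallelism can be sketched as follows (a toy illustration with round-robin expert sharding, not Samsung's implementation; real systems add capacity limits and all-to-all communication):

```python
import numpy as np

def route_tokens(logits, n_devices, top_k=2):
    """Top-k expert routing with experts sharded round-robin across
    devices: expert e lives on device e % n_devices. Returns, per
    device, the (token, expert) pairs that device must compute."""
    n_tokens, n_experts = logits.shape
    topk = np.argsort(-logits, axis=1)[:, :top_k]    # chosen experts per token
    plan = {d: [] for d in range(n_devices)}
    for t in range(n_tokens):
        for e in topk[t]:
            plan[int(e) % n_devices].append((t, int(e)))
    return plan
```

Each device then runs only its local experts over the tokens dispatched to it, which is how expert computation scales out instead of being replicated.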
Manhattan Associates Software Engineer January 2024 - December 2024 — Atlanta, GA
Developed Java/Spring Boot microservices for transportation
logistics; improved resource allocation by 12%.
NCR Corporation Software Engineering Intern May 2022 - August 2022 — Atlanta, GA
Built a real-time MQTT monitoring tool with React + Redux and SQL
backend; reduced debugging time by 40%.
Arm - Scalable Matrix Extension for Triton-shared: March 2024 - July 2024
Implemented an optimization pass in the Triton compiler to utilize
Arm’s Scalable Matrix Extension (SME) and Scalable Vector Extension
version 2 (SVE2) instruction sets, enabling more efficient matrix
multiplication on Arm CPUs.
Added support for bfloat16 and float16 data types in addition to
float32, taking advantage of Arm’s SVE-BF16 and FP16 instructions when
the target hardware supports them.
Worked on integrating the SME optimization pass into Triton’s
compilation pipeline, collaborating with the Triton compiler team via
GitHub pull requests.
Debugged issues with lowering the SME-optimized IR to LLVM IR and
machine code, gaining expertise in MLIR, LLVM, and the Arm SME/SVE
instruction set architectures.
Leveraged MLIR’s transform dialect to apply patterns like tiling,
vectorization, and lowering of vector operations to Arm SME
instructions.
Worked on emulating the SME optimization pass using Arm’s
instruction emulators prior to having access to real Arm hardware with
SME support.
```python
import guidance

# set the default language model used to execute guidance programs
guidance.llm = guidance.llms.TWGUI("http://127.0.0.1:5000")

# define a guidance program that adapts a proverb
program = guidance("""Tweak this proverb to apply to model instructions instead.

{{proverb}}
- {{book}} {{chapter}}:{{verse}}

UPDATED
Where there is no guidance{{gen 'rewrite' stop="\\n-"}}
- GPT {{#select 'chapter'}}9{{or}}10{{or}}11{{/select}}:{{gen 'verse'}}""")

# execute the program on a specific proverb
executed_program = program(
    proverb="Where there is no guidance, a people falls,\nbut in an abundance of counselors there is safety.",
    book="Proverbs",
    chapter=11,
    verse=14,
)
```
AutoGPT-Alpaca-Trader: June 2023
Plugin Development: Designed and implemented an AutoGPT plugin that
integrates the GPT-4-powered AutoGPT application with the Alpaca Trading
API, bringing algorithmic trading capabilities to the agent.
API Integration and Security: Established secure connections to
Alpaca’s Trading API for trade execution, account management, and
real-time data retrieval, with attention to data integrity and API best
practices.
Trade Management: Built tools for automated placement, modification,
and cancellation of stock and ETF orders, including market, limit, and
stop orders.
Account and Portfolio Management: Added real-time monitoring of user
account details, portfolio positions, and transaction history.
Market Data and Risk Management: Exposed real-time and historical
market data (stock quotes, bar data, and corporate action insights),
plus a paper trading environment for strategy testing and risk
mitigation.
AutoGPT Messages: May 2023
Developed the AutoGPT plugin for iMessages, enabling seamless
integration with AI-powered messaging across multiple platforms,
ensuring user data privacy and security.
Implemented a Python server backend, allowing the plugin to operate
universally while maintaining a dedicated Mac server for core
functionalities.
Streamlined the installation process with cross-platform support,
providing detailed instructions for Linux, Mac, Windows, and WSL
environments.
Enhanced user experience by integrating with the iMessage API and
providing options for public accessibility using tools like tunnelto and
ngrok.
Designed a user-friendly interface with real-time notifications,
customizable settings, and integration capabilities with other
communication tools for comprehensive messaging solutions.
Developed the Auto-GPT-Text-Gen-Plugin to enable users to fully
customize prompts for integration with locally installed large language
models (LLMs), facilitating a shift away from dependency on GPT-4 and
GPT 3.5.
Implemented a robust connection to Text Generation WebUI, serving as
an API gateway for various models, which streamlines the process of
managing complex configurations and environment settings.
Provided comprehensive documentation and a step-by-step installation
guide, ensuring users can effortlessly download, configure, and utilize
the plugin with their specific text generation setup.
Integrated flexibility for model selection and the ability to tweak
generation parameters such as top_p, top_k, and repetition_penalty
through environmental variables, enhancing user control over text
generation outcomes.
Encapsulated API interactions and prompt management within the
TextGenPluginController class, laying the groundwork for potential
future expansions to support multiple APIs, thereby ensuring long-term
maintainability and scalability of the plugin.
Developed a Flask-based API to interact with iMessage, enabling
users to send and retrieve messages as well as fetch recent contacts,
enhancing communication automation.
Implemented secure access to the API by creating a custom decorator
function that validates API keys, ensuring secure and authenticated
interactions.
Orchestrated background data synchronization using threading,
allowing for real-time updates of messages while maintaining a
responsive API service.
Integrated iMessage reader and AppleScript for seamless message
sending and retrieval, showcasing strong cross-technology integration
skills.
Designed a user-friendly setup process, including environment
variable configuration and easy-to-follow instructions, improving the
accessibility of the API for end users.
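The API-key validation described above can be sketched framework-agnostically (the header name, the `IMESSAGE_API_KEY` environment variable, and the handler shape are all hypothetical, not the project's actual names):

```python
import functools
import hmac
import os

# hypothetical env var holding the configured key
VALID_KEY = os.environ.get("IMESSAGE_API_KEY", "change-me")

def require_api_key(handler):
    """Reject any request whose X-API-Key header doesn't match the
    configured key, using a constant-time comparison to avoid leaking
    key contents through timing."""
    @functools.wraps(handler)
    def wrapper(headers, *args, **kwargs):
        supplied = headers.get("X-API-Key", "")
        if not hmac.compare_digest(supplied, VALID_KEY):
            return {"status": 401, "error": "invalid API key"}
        return handler(headers, *args, **kwargs)
    return wrapper

@require_api_key
def get_messages(headers):
    """Example protected endpoint."""
    return {"status": 200, "messages": []}
```

In Flask the same decorator would read the header from `flask.request` instead of an explicit argument; the validation logic is unchanged.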
BuzzOS is an Operating System built for the Intel/AMD x86_64
architecture using assembly and Rust. The operating system includes a
Graphical User Interface (GUI) and is designed to provide a complete
user experience.
The operating system includes user space and a mechanism for
user-level processes to perform system calls to the kernel. This allows
users to run applications and perform various tasks on the system.
BuzzOS also includes drivers for various hardware components,
including the keyboard, mouse, timer, disk, and Intel PIC 8259. These
drivers enable a robust input experience and ensure that the operating
system can communicate effectively with various hardware components.
In addition to the core operating system functionality, BuzzOS also
includes a fully functional desktop interface with games and system
apps. This interface provides users with a familiar and intuitive
environment for interacting with the operating system.
Overall, BuzzOS demonstrates what assembly and Rust can deliver in
systems programming: a complete operating system with a GUI, a range of
hardware drivers, and user applications. GitHub Page
Path-finding Robot: October 2022
Developed proficiency in Robotics and Computer Vision through
implementing the Rapidly-exploring Random Tree (RRT) algorithm,
enhancing path planning efficiency in autonomous robotic
navigation.
Leveraged Computer Vision techniques to enable real-time object
detection and environment mapping, optimizing robot’s perception and
decision-making capabilities.
Designed and executed algorithms for image processing and feature
extraction, significantly improving the accuracy of object recognition
in varied lighting and environmental conditions.
Employed state-of-the-art machine learning models for image
captioning, translating visual data into descriptive language, and
enhancing human-robot interaction.
Handled runtime exceptions such as VectorTimeoutException, ensuring
seamless operation and reliability of the robotic system.
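The core RRT loop looks roughly like this (a 2-D toy version with a pluggable collision check; the workspace bounds, step size, and function names are illustrative, not the project's code):

```python
import numpy as np

def rrt(start, goal, is_free, n_iters=3000, step=0.5, goal_tol=0.5, seed=0):
    """Rapidly-exploring Random Tree in a 2-D workspace: repeatedly
    sample a random point, extend the nearest tree node one step toward
    it, and stop once a node lands within goal_tol of the goal.
    Collision checking is delegated to is_free(point)."""
    rng = np.random.default_rng(seed)
    goal = np.asarray(goal, float)
    nodes = [np.asarray(start, float)]
    parents = [-1]
    for _ in range(n_iters):
        sample = rng.uniform(0.0, 10.0, size=2)   # random point in a 10x10 world
        near = min(range(len(nodes)),
                   key=lambda i: np.linalg.norm(nodes[i] - sample))
        direction = sample - nodes[near]
        new = nodes[near] + step * direction / (np.linalg.norm(direction) + 1e-9)
        if not is_free(new):
            continue                              # discard colliding extensions
        nodes.append(new)
        parents.append(near)
        if np.linalg.norm(new - goal) < goal_tol: # goal reached: walk back up
            path, i = [], len(nodes) - 1
            while i != -1:
                path.append(nodes[i])
                i = parents[i]
            return path[::-1]
    return None
```

The bias toward unexplored regions comes for free: large empty areas own more of the sampling space, so the tree expands into them first.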
The COVID Vaccine Tracker is a tool for predicting the progress of
COVID-19 vaccinations across US states. It uses data from vaccine
databases and factors in state population to estimate when each state
will reach an 80% vaccination rate. The project was created in March of
2021 but could potentially be modified for use with the Delta variant of
COVID-19.
The model used in the project is based on a logarithmic curve. Its
predictions tracked the data fairly accurately up to the 50% vaccination
mark, but it did not anticipate how sharply uptake slowed beyond that
point.
Despite this limitation, the tool still provides valuable insights into
the progress of vaccinations across different US states.
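The projection step can be sketched as a least-squares fit of a logarithmic curve, solved for the 80% crossing (a simplified stand-in for the project's model; the function name and inputs are illustrative):

```python
import numpy as np

def project_80pct(days, pct_vaccinated):
    """Fit pct = a*ln(day) + b by least squares, then solve for the day
    at which the fitted curve reaches 80% vaccinated."""
    A = np.column_stack([np.log(days), np.ones_like(days, dtype=float)])
    (a, b), *_ = np.linalg.lstsq(A, pct_vaccinated, rcond=None)
    return float(np.exp((80.0 - b) / a))  # invert a*ln(day) + b = 80
```

The logarithmic form bakes in the assumption of steadily slowing uptake, which is exactly why the fit degrades once real-world uptake flattens faster than the curve allows.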
Create-Cpp-App is a Command Line Interface (CLI) tool that provides
an npm-like experience for building C++ applications. The tool is
designed to streamline the process of building C++ apps by automating
many of the repetitive and time-consuming tasks that developers
typically face.
The tool is built to be intuitive and user-friendly, and it generates
makefiles and automatically updates CMake files for a fast and efficient
development experience. This allows developers to focus on writing code
instead of worrying about the build process.
Create-Cpp-App also includes a range of built-in testing, address
sanitization, benchmarking, and other tools for building
production-ready C++ applications. These tools are designed to help
developers ensure that their code is of high quality and
performance.
Overall, Create-Cpp-App simplifies the process of building C++
applications. Its npm-like experience makes it easy for developers to
get started and reduces the time and effort required to ship
high-quality, production-ready applications.
Clean Up Crew is a web application that serves as a platform for
connecting small communities with local businesses. The application was
built using Next.js, MongoDB, AWS S3, Google Maps API, and ReactJS.
The platform allows users to create and interact with posts in a
given area. Users can post about community events, local businesses, and
other topics related to their community. The application includes a
sorting algorithm based on various factors such as location, user
interaction, and other metrics to ensure that the most relevant content
is displayed to users.
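A sorting algorithm of that shape might be sketched like this (the weights, score formula, and post fields are invented for illustration, not the production ranking):

```python
import math

def rank_posts(posts, user_loc, w_dist=1.0, w_engage=0.5):
    """Score each post by proximity to the user (penalty) and engagement
    (log-damped bonus), then sort descending by score."""
    def score(p):
        d = math.dist(user_loc, p["loc"])                 # distance penalty
        return -w_dist * d + w_engage * math.log1p(p["interactions"])
    return sorted(posts, key=score, reverse=True)
```

The log damping keeps a single viral post from drowning out everything nearby, which matters for a community-scale feed.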
The project was developed by a team of programmers during a 36-hour
programming competition, where they built the application and
implemented its core features. The team placed 13th out of 191
teams.
Overall, this project represents a valuable contribution to small
communities looking to improve their localities and small businesses
seeking new opportunities. The platform provides a means for these
groups to connect and collaborate, and the sorting algorithm ensures
that the most relevant content is displayed to users. By utilizing
modern web technologies and APIs, the platform is able to provide a
seamless and user-friendly experience for its users.
Self-Driving-Car: January
2021
The Self-Driving Car project is a machine learning project that aims
to simulate the behavior of a self-driving car using a Convolutional
Neural Network (CNN) and computer vision techniques. The project
involves constructing a virtual environment where a car can be driven
autonomously using machine learning algorithms.
The CNN is used to determine the speed and angle of rotation of the
simulated vehicle based on data obtained from a virtual camera. The
camera captures images of the environment and feeds them into the CNN,
which processes the data and outputs a prediction for the vehicle’s next
move. The CNN is trained using a dataset of labeled images and their
corresponding speed and steering angles.
To implement the CNN, the project utilizes a number of machine
learning libraries, including Tensorflow, Keras, and NumPy. These
libraries provide a range of tools for developing, training, and testing
machine learning models, as well as tools for processing and analyzing
large datasets.
The project also includes a testing environment where the performance
of the self-driving car can be evaluated. This environment allows the
user to adjust parameters such as the speed and complexity of the
environment, and to observe how the car responds to different
scenarios.
Overall, the Self-Driving Car project represents an exciting
application of machine learning and computer vision techniques to the
field of autonomous vehicles. By simulating the behavior of a
self-driving car in a virtual environment, the project provides a safe
and scalable platform for testing and developing new algorithms and
techniques for autonomous driving.
The Amazon Shopping Clone is a web application built using the MERN
stack (MongoDB, Express, React, and Node.js) and Stripe API. It mimics
the design and user interface of the Amazon.com website, allowing users
to browse and purchase products in a familiar environment.
One of the key features of the application is its login system, which
allows users to create accounts and securely store their personal and
payment information. This information is stored using MongoDB, a NoSQL
database that provides a flexible and scalable data storage
solution.
In addition to the login system, the application also utilizes the
Stripe API to handle transactions in a secure and scalable manner.
Stripe is a popular payment processing platform that provides a wide
range of features for online businesses, including secure payment
processing, subscription management, and fraud detection.
To ensure a smooth and intuitive user experience, the application
implements a design language that closely mimics that of the Amazon.com
website. This includes a consistent color scheme, typography, and
layout, as well as familiar user interface elements such as navigation
menus, search bars, and product listings.
Overall, the Amazon Shopping Clone provides a robust and scalable
platform for online shopping that combines the familiarity and
convenience of Amazon.com with the security and scalability of modern
web technologies. GitHub Page
You can access the live demo of the FakeBlock
Shopping project here