Hey there! I’m Daniyal. If you’re here, you might have used my code
and it may have broken. If you have any questions about my open-source
projects or want to pick my brain, feel free to reach out via email,
Twitter, or LinkedIn.
I work at Samsung Semiconductor as a Senior Engineer on AI/ML
compilers, building the torch.compile + vLLM inference engine for custom
accelerators. I’m also pursuing my M.S. in CS at Georgia Tech. If you
work on torch.dynamo, Inductor, or torch-mlir, please reach out as well.
Sparse Steps to Reasoning (SSR): Pending ICML 2026
Proposed SSR, a step-wise RL framework that eliminates cross-step
gradient interference in multi-step reasoning via mostly-disjoint
parameter allocation, achieving +6.0pp on AIME and GPQA Diamond over the
best GSPO baseline.
Built on a custom fork of NVIDIA Megatron-LM to scale distributed RL
training.
Developed superweight discovery, a pre-RL analysis pipeline that
identifies and freezes the language-critical parameters most responsible
for preserving core linguistic behavior, enabling aggressive
reasoning-focused updates without destabilizing the model.
Showed that disjoint parameter allocation is the key mechanism
behind SSR’s gains: allocating separate update regions to different
reasoning steps improves multi-step RL beyond step-level credit
assignment alone.
Built KronosOpt, a custom optimizer that sequentially claims
non-overlapping top-k parameter subsets per reasoning step, giving each
step unambiguous credit signals with a step-dependent budget schedule to
prevent late-step gradient starvation.
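The sequential claiming scheme can be sketched in a few lines. This is a simplified illustration over a flat gradient vector, assuming per-step gradient magnitudes are already available; the function names and budget representation are mine, not the actual KronosOpt code:

```python
import numpy as np

def claim_disjoint_masks(step_grads, budgets):
    """Sequentially claim non-overlapping top-k parameter subsets.

    step_grads: one 1-D gradient-magnitude array per reasoning step.
    budgets: per-step parameter budgets (e.g. larger for later steps,
             so late steps aren't starved after earlier claims).
    Returns one boolean mask per step; masks are pairwise disjoint.
    """
    n = step_grads[0].size
    claimed = np.zeros(n, dtype=bool)
    masks = []
    for g, k in zip(step_grads, budgets):
        avail = np.where(~claimed)[0]                 # parameters still unclaimed
        order = avail[np.argsort(-np.abs(g[avail]))]  # rank free params by |grad|
        mask = np.zeros(n, dtype=bool)
        mask[order[:k]] = True                        # claim this step's top-k
        claimed |= mask                               # remove from future pools
        masks.append(mask)
    return masks
```

Because each step draws only from the still-unclaimed pool, every parameter receives updates attributable to exactly one reasoning step.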
Designed a KV-cache compression system for Vision-Language Models
applying FFT, DCT, and Haar Wavelet transforms to key/value caches with
multi-bin quantization, reducing memory footprint by up to 60% with
minimal accuracy degradation.
Implemented modality-aware compression with separate text and vision
configurations, three-tier buffering (exact sink, exact recent,
spectral-compressed history), and query-aware coefficient selection via
energy x query-similarity ranking.
Evaluated across 30+ datasets from the MLBench suite on LLaVA and
Yi-VL models; benchmarked 70+ configuration variants across top-K,
prompt budget, and compression mode axes.
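A minimal sketch of the spectral-compression idea, using an FFT along the sequence axis and energy-only coefficient ranking (the query-similarity term, multi-bin quantization, and three-tier buffering are omitted; all names here are illustrative):

```python
import numpy as np

def compress_kv(kv, keep_ratio=0.4):
    """Transform a (seq_len, dim) KV history to the frequency domain and
    keep only the highest-energy coefficients (a stand-in for the
    energy x query-similarity ranking described above)."""
    coeffs = np.fft.rfft(kv, axis=0)           # spectral transform per channel
    energy = np.abs(coeffs).sum(axis=1)        # total energy per frequency bin
    k = max(1, int(keep_ratio * coeffs.shape[0]))
    keep = np.argsort(-energy)[:k]             # top-k bins by energy
    return keep, coeffs[keep]

def decompress_kv(keep, kept, seq_len, dim):
    """Scatter the kept coefficients back and invert the transform."""
    coeffs = np.zeros((seq_len // 2 + 1, dim), dtype=complex)
    coeffs[keep] = kept
    return np.fft.irfft(coeffs, n=seq_len, axis=0)
```

Smooth histories concentrate energy in few bins, which is where the memory savings come from; high-frequency content is what gets discarded.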
RocketKV - KV Cache Compression for vLLM: December 2025 - Present
Optimizing RocketKV as part of my master’s thesis — a training-free
KV cache compression method for long-context LLM inference, published at
ICML 2025.
The original paper demonstrated up to 3.7x end-to-end speedup and
32.6% peak memory reduction on A100, but the reference implementation
was not practical for production use.
Adapted the two-stage compression pipeline (hybrid sparse attention
with dimension reduction + chunk-max scoring, followed by top-k exact
attention) into a vLLM plugin, achieving better decode performance than
baseline vLLM.
Wrote fused CUDA kernels using CUTLASS CuTe for Tensor Core MMA
operations, online softmax, and paged KV extraction with vectorized
128-bit loads.
The plugin registers as a vLLM platform plugin and transparently
swaps in the custom Starship attention backend — no model changes
required.
Evaluated on DeepSeek-R1-Distill models across GSM-8K, Math500,
AIME25, and LongBench with negligible accuracy loss at 512 token
budgets.
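The first stage of that pipeline can be sketched roughly as follows, for a single query vector and single head (a simplified illustration; the dimension-reduction step and the exact-attention second stage are omitted, and these names are mine, not RocketKV's API):

```python
import numpy as np

def select_tokens(q, K, chunk=4, budget=8):
    """Stage 1: score each chunk of the KV cache by the maximum q.k
    logit inside it (chunk-max), then keep whole chunks in score order
    until the token budget is filled. Stage 2 would run exact attention
    over only the kept tokens."""
    scores = K @ q                                     # per-token attention logits
    n_chunks = len(scores) // chunk
    chunk_scores = scores[: n_chunks * chunk].reshape(n_chunks, chunk).max(axis=1)
    kept = []
    for c in np.argsort(-chunk_scores):                # best-scoring chunks first
        if len(kept) + chunk > budget:
            break
        kept.extend(range(c * chunk, (c + 1) * chunk))
    return sorted(kept)
```

Scoring chunks instead of individual tokens is what makes the first stage cheap: one max per chunk stands in for its tokens, and only survivors reach the exact pass.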
CARForge - CAR-T Cell Design with Evo2: September 2025 - Present
Building a deep learning pipeline for designing optimized Chimeric
Antigen Receptor (CAR) T-cell constructs using the 7B Evo2 DNA
foundation model.
Fine-tuning Evo2 with LoRA (8.4M trainable params / 6.5B total),
structure cross-attention via OpenFold 3, and antigen conditioning
across 4,190+ target antigens from SAbDab.
Generates domain-annotated CAR sequences (scFv, VH/VL, CDRs, hinge,
transmembrane, costimulatory, CD3ζ) with continuous outcome conditioning
(expression, cytotoxicity, persistence).
Trained on 86K+ examples from Yoshida et al., Bits-to-Binders,
CARMSeD, and synthetic augmentation (codon variants, orthologs, negative
examples).
Achieved 76-77% nucleotide prediction accuracy with ~2.6-2.8
perplexity after 3 epochs of training.
LongMemory - Memory Vector Injection for Frozen LLMs
Designed a memory vector injection system for frozen LLMs;
compresses sentences into 8 orthogonal latent vectors via contrastive
attention pooling, then injects them as cross-attention at multiple
decoder layers.
Achieved 97% token accuracy on retrieval tasks.
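The pooling step might look roughly like this (a toy sketch: the contrastive training objective and the multi-layer cross-attention injection are not shown, and orthogonality is enforced here with a simple QR step rather than a learned constraint):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def pool_memory(tokens, queries):
    """Attention-pool a sentence's (n_tokens, dim) embeddings into a few
    memory slots using learned slot queries, then orthonormalize the
    slots so each latent vector carries non-redundant content."""
    attn = softmax(queries @ tokens.T, axis=1)  # (slots, tokens) attention weights
    pooled = attn @ tokens                      # (slots, dim) weighted sums
    q, _ = np.linalg.qr(pooled.T)               # orthonormalize slot vectors
    return q.T
```

The fixed slot count (8 in LongMemory) caps the memory cost per sentence regardless of its length.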
Tinkerbell - Fine-Tuning Framework
Built a fine-tuning framework supporting LoRA-based PEFT, custom
loss functions (DPO, contrastive), and multi-threaded concurrent adapter
training across HuggingFace, Megatron-LM, and vLLM backends.
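As one example of the custom losses such a framework supports, here is a minimal NumPy sketch of the DPO objective (the standard formulation, not Tinkerbell's actual code; inputs are per-example sequence log-probabilities):

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss over a batch of preference
    pairs: -log sigmoid(beta * ((logp_w - ref_w) - (logp_l - ref_l))).
    Uses logaddexp for numerical stability (-log sigmoid(x) = log(1+e^-x))."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return float(np.mean(np.logaddexp(0.0, -margin)))
```

The loss falls below log 2 when the policy prefers the chosen response more strongly than the frozen reference does, and rises above it otherwise.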
GPU Programming and Architecture Lecture: July 2024
Education
Georgia Institute of Technology Master of Science in Computer Science
August 2025 - Present (Remote)
Research affiliate at GT SAIL Lab, focusing on
reinforcement learning and deep learning
Relevant Coursework: Reinforcement Learning, Deep Learning
Georgia Institute of Technology Bachelor of Science in Computer Science
August 2020 - December 2023
Concentrations: Intelligence/AI and Systems and Architecture
High Honors
Relevant Coursework: Operating Systems, Artificial Intelligence,
Advanced Algorithms and Data Structures, Robotics and Perception,
Computer Architecture, Circuit Design Lab
Work History
Samsung Semiconductor Senior Engineer, AI/ML Software Compiler December 2024 - Present — Bay Area, CA
Built the torch.compile + vLLM inference engine from scratch as a
custom extension, enabling distributed model inference across custom
accelerators.
Implemented paged attention and chunked prefill within the vLLM
extension, improving decode throughput and enabling overlap of prefill
and decode phases for reduced time-to-first-token.
Engineered expert parallelism for MoE routing, distributing expert
computation across devices to scale inference for mixture-of-experts
models.
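The routing side of expert parallelism can be sketched as follows (a toy illustration with round-robin expert sharding, not Samsung's implementation; real systems add capacity limits and all-to-all communication):

```python
import numpy as np

def route_tokens(logits, n_devices, top_k=2):
    """Top-k expert routing with experts sharded round-robin across
    devices: expert e lives on device e % n_devices. Returns, per
    device, the (token, expert) pairs that device must compute."""
    n_tokens, n_experts = logits.shape
    topk = np.argsort(-logits, axis=1)[:, :top_k]    # chosen experts per token
    plan = {d: [] for d in range(n_devices)}
    for t in range(n_tokens):
        for e in topk[t]:
            plan[int(e) % n_devices].append((t, int(e)))
    return plan
```

Each device then runs only its local experts over the tokens dispatched to it, which is how expert computation scales out instead of being replicated.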
Manhattan Associates Software Engineer January 2024 - December 2024 — Atlanta, GA
Developed Java/Spring Boot microservices for transportation
logistics; improved resource allocation by 12%.
NCR Corporation Software Engineering Intern May 2022 - August 2022 — Atlanta, GA
Built a real-time MQTT monitoring tool with React + Redux and SQL
backend; reduced debugging time by 40%.
Arm - Scalable Matrix Extension for Triton-shared: March 2024 - July 2024
Implemented an optimization pass in the Triton compiler to utilize
Arm’s Scalable Matrix Extension (SME) and Scalable Vector Extension
version 2 (SVE2) instruction sets, enabling more efficient matrix
multiplication on Arm CPUs.
Added support for bfloat16 and float16 data types in addition to
float32, taking advantage of Arm’s SVE-BF16 and FP16 instructions when
the target hardware supports them.
Worked on integrating the SME optimization pass into Triton’s
compilation pipeline, collaborating with the Triton compiler team via
GitHub pull requests.
Debugged issues with lowering the SME-optimized IR to LLVM IR and
machine code, gaining expertise in MLIR, LLVM, and the Arm SME/SVE
instruction set architectures.
Leveraged MLIR’s transform dialect to apply patterns like tiling,
vectorization, and lowering of vector operations to Arm SME
instructions.
Worked on emulating the SME optimization pass using Arm’s
instruction emulators prior to having access to real Arm hardware with
SME support.
```python
import guidance

# set the default language model used to execute guidance programs
guidance.llm = guidance.llms.TWGUI("http://127.0.0.1:5000")

# define a guidance program that adapts a proverb
program = guidance("""Tweak this proverb to apply to model instructions instead.

{{proverb}}
- {{book}} {{chapter}}:{{verse}}

UPDATED
Where there is no guidance{{gen 'rewrite' stop="\\n-"}}
- GPT {{#select 'chapter'}}9{{or}}10{{or}}11{{/select}}:{{gen 'verse'}}""")

# execute the program on a specific proverb
executed_program = program(
    proverb="Where there is no guidance, a people falls,\nbut in an abundance of counselors there is safety.",
    book="Proverbs",
    chapter=11,
    verse=14,
)
```
AutoGPT-Alpaca-Trader: June 2023
Plugin Development: Designed and implemented an AutoGPT plugin that
integrates the GPT-4-powered AutoGPT application with the Alpaca Trading
API, bringing algorithmic trading capabilities to the agent.
API Integration and Security: Established secure connections to
Alpaca’s Trading API for trade execution, account management, and
real-time data retrieval, with attention to data integrity and API best
practices.
Trade Management: Built tools for automated placement, modification,
and cancellation of stock and ETF orders, including market, limit, and
stop orders.
Account and Portfolio Management: Added real-time monitoring of user
account details, portfolio positions, and transaction history.
Market Data and Risk Management: Exposed real-time and historical
market data (stock quotes, bar data, and corporate action insights),
plus a paper trading environment for strategy testing and risk
mitigation.
AutoGPT Messages: May 2023
Developed the AutoGPT plugin for iMessages, enabling seamless
integration with AI-powered messaging across multiple platforms,
ensuring user data privacy and security.
Implemented a Python server backend, allowing the plugin to operate
universally while maintaining a dedicated Mac server for core
functionalities.
Streamlined the installation process with cross-platform support,
providing detailed instructions for Linux, Mac, Windows, and WSL
environments.
Enhanced user experience by integrating with the iMessage API and
providing options for public accessibility using tools like tunnelto and
ngrok.
Designed a user-friendly interface with real-time notifications,
customizable settings, and integration capabilities with other
communication tools for comprehensive messaging solutions.
Developed the Auto-GPT-Text-Gen-Plugin to enable users to fully
customize prompts for integration with locally installed large language
models (LLMs), facilitating a shift away from dependency on GPT-4 and
GPT 3.5.
Implemented a robust connection to Text Generation WebUI, serving as
an API gateway for various models, which streamlines the process of
managing complex configurations and environment settings.
Provided comprehensive documentation and a step-by-step installation
guide, ensuring users can effortlessly download, configure, and utilize
the plugin with their specific text generation setup.
Integrated flexibility for model selection and the ability to tweak
generation parameters such as top_p, top_k, and repetition_penalty
through environmental variables, enhancing user control over text
generation outcomes.
Encapsulated API interactions and prompt management within the
TextGenPluginController class, laying the groundwork for potential
future expansions to support multiple APIs, thereby ensuring long-term
maintainability and scalability of the plugin.
Developed a Flask-based API to interact with iMessage, enabling
users to send and retrieve messages as well as fetch recent contacts,
enhancing communication automation.
Implemented secure access to the API by creating a custom decorator
function that validates API keys, ensuring secure and authenticated
interactions.
Orchestrated background data synchronization using threading,
allowing for real-time updates of messages while maintaining a
responsive API service.
Integrated iMessage reader and AppleScript for seamless message
sending and retrieval, showcasing strong cross-technology integration
skills.
Designed a user-friendly setup process, including environment
variable configuration and easy-to-follow instructions, improving the
accessibility of the API for end users.
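The API-key validation described above can be sketched framework-agnostically (the header name, the `IMESSAGE_API_KEY` environment variable, and the handler shape are all hypothetical, not the project's actual names):

```python
import functools
import hmac
import os

# hypothetical env var holding the configured key
VALID_KEY = os.environ.get("IMESSAGE_API_KEY", "change-me")

def require_api_key(handler):
    """Reject any request whose X-API-Key header doesn't match the
    configured key, using a constant-time comparison to avoid leaking
    key contents through timing."""
    @functools.wraps(handler)
    def wrapper(headers, *args, **kwargs):
        supplied = headers.get("X-API-Key", "")
        if not hmac.compare_digest(supplied, VALID_KEY):
            return {"status": 401, "error": "invalid API key"}
        return handler(headers, *args, **kwargs)
    return wrapper

@require_api_key
def get_messages(headers):
    """Example protected endpoint."""
    return {"status": 200, "messages": []}
```

In Flask the same decorator would read the header from `flask.request` instead of an explicit argument; the validation logic is unchanged.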
BuzzOS is an Operating System built for the Intel/AMD x86_64
architecture using assembly and Rust. The operating system includes a
Graphical User Interface (GUI) and is designed to provide a complete
user experience.
The operating system includes user space and a mechanism for
user-level processes to perform system calls to the kernel. This allows
users to run applications and perform various tasks on the system.
BuzzOS also includes drivers for various hardware components,
including the keyboard, mouse, timer, disk, and Intel PIC 8259. These
drivers enable a robust input experience and ensure that the operating
system can communicate effectively with various hardware components.
In addition to the core operating system functionality, BuzzOS also
includes a fully functional desktop interface with games and system
apps. This interface provides users with a familiar and intuitive
environment for interacting with the operating system.
Overall, BuzzOS demonstrates what assembly and Rust can deliver in
systems programming: a complete operating system with a GUI, a range of
hardware drivers, and user applications. GitHub Page
Path-finding Robot: October 2022
Developed proficiency in Robotics and Computer Vision through
implementing the Rapidly-exploring Random Tree (RRT) algorithm,
enhancing path planning efficiency in autonomous robotic
navigation.
Leveraged Computer Vision techniques to enable real-time object
detection and environment mapping, optimizing robot’s perception and
decision-making capabilities.
Designed and executed algorithms for image processing and feature
extraction, significantly improving the accuracy of object recognition
in varied lighting and environmental conditions.
Employed state-of-the-art machine learning models for image
captioning, translating visual data into descriptive language, and
enhancing human-robot interaction.
Handled runtime exceptions such as VectorTimeoutException, ensuring
seamless operation and reliability of the robotic system.
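The core RRT loop looks roughly like this (a 2-D toy version with a pluggable collision check; the workspace bounds, step size, and function names are illustrative, not the project's code):

```python
import numpy as np

def rrt(start, goal, is_free, n_iters=3000, step=0.5, goal_tol=0.5, seed=0):
    """Rapidly-exploring Random Tree in a 2-D workspace: repeatedly
    sample a random point, extend the nearest tree node one step toward
    it, and stop once a node lands within goal_tol of the goal.
    Collision checking is delegated to is_free(point)."""
    rng = np.random.default_rng(seed)
    goal = np.asarray(goal, float)
    nodes = [np.asarray(start, float)]
    parents = [-1]
    for _ in range(n_iters):
        sample = rng.uniform(0.0, 10.0, size=2)   # random point in a 10x10 world
        near = min(range(len(nodes)),
                   key=lambda i: np.linalg.norm(nodes[i] - sample))
        direction = sample - nodes[near]
        new = nodes[near] + step * direction / (np.linalg.norm(direction) + 1e-9)
        if not is_free(new):
            continue                              # discard colliding extensions
        nodes.append(new)
        parents.append(near)
        if np.linalg.norm(new - goal) < goal_tol: # goal reached: walk back up
            path, i = [], len(nodes) - 1
            while i != -1:
                path.append(nodes[i])
                i = parents[i]
            return path[::-1]
    return None
```

The bias toward unexplored regions comes for free: large empty areas own more of the sampling space, so the tree expands into them first.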
The COVID Vaccine Tracker is a tool for predicting the progress of
COVID-19 vaccinations across US states. It uses data from vaccine
databases and factors in state population to estimate when each state
will reach an 80% vaccination rate. The project was created in March of
2021 but could potentially be modified for use with the Delta variant of
COVID-19.
The model used in the project is based on a logarithmic curve. Its
predictions tracked the data fairly accurately up to the 50% vaccination
mark, but it did not anticipate how sharply uptake slowed beyond that
point.
Despite this limitation, the tool still provides valuable insights into
the progress of vaccinations across different US states.
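The projection step can be sketched as a least-squares fit of a logarithmic curve, solved for the 80% crossing (a simplified stand-in for the project's model; the function name and inputs are illustrative):

```python
import numpy as np

def project_80pct(days, pct_vaccinated):
    """Fit pct = a*ln(day) + b by least squares, then solve for the day
    at which the fitted curve reaches 80% vaccinated."""
    A = np.column_stack([np.log(days), np.ones_like(days, dtype=float)])
    (a, b), *_ = np.linalg.lstsq(A, pct_vaccinated, rcond=None)
    return float(np.exp((80.0 - b) / a))  # invert a*ln(day) + b = 80
```

The logarithmic form bakes in the assumption of steadily slowing uptake, which is exactly why the fit degrades once real-world uptake flattens faster than the curve allows.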
Create-Cpp-App is a Command Line Interface (CLI) tool that provides
an npm-like experience for building C++ applications. The tool is
designed to streamline the process of building C++ apps by automating
many of the repetitive and time-consuming tasks that developers
typically face.
The tool is built to be intuitive and user-friendly, and it generates
makefiles and automatically updates CMake files for a fast and efficient
development experience. This allows developers to focus on writing code
instead of worrying about the build process.
Create-Cpp-App also includes a range of built-in testing, address
sanitization, benchmarking, and other tools for building
production-ready C++ applications. These tools are designed to help
developers ensure that their code is of high quality and
performance.
Overall, Create-Cpp-App simplifies the process of building C++
applications. Its npm-like experience makes it easy for developers to
get started and reduces the time and effort required to ship
high-quality, production-ready applications.
Clean Up Crew is a web application that serves as a platform for
connecting small communities with local businesses. The application was
built using Next.js, MongoDB, AWS S3, Google Maps API, and ReactJS.
The platform allows users to create and interact with posts in a
given area. Users can post about community events, local businesses, and
other topics related to their community. The application includes a
sorting algorithm based on various factors such as location, user
interaction, and other metrics to ensure that the most relevant content
is displayed to users.
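A sorting algorithm of that shape might be sketched like this (the weights, score formula, and post fields are invented for illustration, not the production ranking):

```python
import math

def rank_posts(posts, user_loc, w_dist=1.0, w_engage=0.5):
    """Score each post by proximity to the user (penalty) and engagement
    (log-damped bonus), then sort descending by score."""
    def score(p):
        d = math.dist(user_loc, p["loc"])                 # distance penalty
        return -w_dist * d + w_engage * math.log1p(p["interactions"])
    return sorted(posts, key=score, reverse=True)
```

The log damping keeps a single viral post from drowning out everything nearby, which matters for a community-scale feed.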
The project was developed by a team of programmers during a 36-hour
programming competition, where they built the application and
implemented its core features. The team placed 13th out of 191
teams.
Overall, this project represents a valuable contribution to small
communities looking to improve their localities and small businesses
seeking new opportunities. The platform provides a means for these
groups to connect and collaborate, and the sorting algorithm ensures
that the most relevant content is displayed to users. By utilizing
modern web technologies and APIs, the platform is able to provide a
seamless and user-friendly experience for its users.
Self-Driving-Car: January
2021
The Self-Driving Car project is a machine learning project that aims
to simulate the behavior of a self-driving car using a Convolutional
Neural Network (CNN) and computer vision techniques. The project
involves constructing a virtual environment where a car can be driven
autonomously using machine learning algorithms.
The CNN is used to determine the speed and angle of rotation of the
simulated vehicle based on data obtained from a virtual camera. The
camera captures images of the environment and feeds them into the CNN,
which processes the data and outputs a prediction for the vehicle’s next
move. The CNN is trained using a dataset of labeled images and their
corresponding speed and steering angles.
To implement the CNN, the project utilizes a number of machine
learning libraries, including Tensorflow, Keras, and NumPy. These
libraries provide a range of tools for developing, training, and testing
machine learning models, as well as tools for processing and analyzing
large datasets.
The project also includes a testing environment where the performance
of the self-driving car can be evaluated. This environment allows the
user to adjust parameters such as the speed and complexity of the
environment, and to observe how the car responds to different
scenarios.
Overall, the Self-Driving Car project represents an exciting
application of machine learning and computer vision techniques to the
field of autonomous vehicles. By simulating the behavior of a
self-driving car in a virtual environment, the project provides a safe
and scalable platform for testing and developing new algorithms and
techniques for autonomous driving.
The Amazon Shopping Clone is a web application built using the MERN
stack (MongoDB, Express, React, and Node.js) and Stripe API. It mimics
the design and user interface of the Amazon.com website, allowing users
to browse and purchase products in a familiar environment.
One of the key features of the application is its login system, which
allows users to create accounts and securely store their personal and
payment information. This information is stored using MongoDB, a NoSQL
database that provides a flexible and scalable data storage
solution.
In addition to the login system, the application also utilizes the
Stripe API to handle transactions in a secure and scalable manner.
Stripe is a popular payment processing platform that provides a wide
range of features for online businesses, including secure payment
processing, subscription management, and fraud detection.
To ensure a smooth and intuitive user experience, the application
implements a design language that closely mimics that of the Amazon.com
website. This includes a consistent color scheme, typography, and
layout, as well as familiar user interface elements such as navigation
menus, search bars, and product listings.
Overall, the Amazon Shopping Clone provides a robust and scalable
platform for online shopping that combines the familiarity and
convenience of Amazon.com with the security and scalability of modern
web technologies. GitHub Page
You can access the live demo of the FakeBlock
Shopping project here