Jiahao Yu's Page

I am a final-year computer science Ph.D. candidate at Northwestern University, working with Prof. Xinyu Xing. My research interests lie in Large Language Models and cybersecurity. I hold a B.S. degree from Shanghai Jiao Tong University (2021).

I will be joining the Department of Computer Engineering at New York University Abu Dhabi (NYUAD) as a Tenure-Track Assistant Professor (TTAP). I am actively looking for self-motivated Ph.D. students and postdocs to work with me — if you are passionate about Large Language Models and cybersecurity, feel free to reach out!

If you have any research questions, feel free to contact me! Enjoy research and life :)

news

Jun 4, 2026	I will be joining the Department of Computer Engineering at New York University Abu Dhabi (NYUAD) as a Tenure-Track Assistant Professor (TTAP). I am actively looking for self-motivated Ph.D. students and postdocs — feel free to reach out!
Oct 6, 2025	Our work PATCHAGENT was selected as a CSAW 2025 Finalist. We will be presenting the work in New York City!
Oct 5, 2025	The official SWEBench-Verified and SWEBench-Lite open-weight leaderboard has been updated. Our EntroPO ranks 1st on SWEBench-Lite and 5th on SWEBench-Verified (surpassed only by models 10x larger than ours).
Oct 5, 2025	Our work GPO: Learning from Critical Steps to Improve LLM Reasoning was covered by MIT Technology Review China.
Mar 18, 2025	Our work Soft-Label Integration for Robust Toxicity Classification was covered by MIT Technology Review China.

selected publications

arXiv

Building Coding Agents via Entropy-Enhanced Multi-Turn Preference Optimization

1st on SWEBench-Lite (Open-weight)
5th on SWEBench-Verified (Open-weight)

Jiahao Yu*, Zelei Cheng*, Xian Wu, and 1 more author

arXiv preprint arXiv:2509.12434 2026

TIFS

PROMPTFUZZ: Harnessing Fuzzing Techniques for Robust Testing of Prompt Injection in LLMs

Jiahao Yu*, Yangguang Shao*, Hanwen Miao, and 1 more author

IEEE Transactions on Information Forensics and Security 2026

NIPS

GPO: Learning from Critical Steps to Improve LLM Reasoning

Featured in MIT Technology Review China

Jiahao Yu*, Zelei Cheng, Xian Wu, and 1 more author

In Proceedings of the 39th Conference on Neural Information Processing Systems 2025

NIPS

BlockScan: Detecting Anomalies in Blockchain Transactions

Jiahao Yu*, Xian Wu*, Hao Liu, and 2 more authors

In Proceedings of the 39th Conference on Neural Information Processing Systems 2025

USENIX

Mind the Inconspicuous: Revealing the Hidden Weakness in Aligned LLMs’ Ethical Boundaries

Long Talk

Jiahao Yu*, Haozheng Luo*, Jerry Yao-Chieh Hu, and 3 more authors

In Proceedings of the USENIX Security Symposium 2025

USENIX

PATCHAGENT: A Practical Program Repair Agent Mimicking Human Expertise

Long Talk
Patched over 10 real-world bugs
CSAW 2025 Finalist

Zheng Yu, Ziyi Guo, Yuhang Wu, and 5 more authors

In Proceedings of the USENIX Security Symposium 2025

ICML

The Illusion of Role Separation: Hidden Shortcuts in LLM Role Learning (and How to Fix Them)

Zihao Wang, Yibo Jiang, Jiahao Yu, and 1 more author

In Proceedings of the 42nd International Conference on Machine Learning 2025

USENIX

LLM-Fuzzer: Scaling Assessment of Large Language Model Jailbreaks

Jiahao Yu, Xingwei Lin, Zheng Yu, and 1 more author

In Proceedings of the USENIX Security Symposium 2024

NIPS

Soft-Label Integration for Robust Toxicity Classification

Featured in MIT Technology Review China

Zelei Cheng, Xian Wu, Jiahao Yu, and 3 more authors

In Proceedings of the 38th Conference on Neural Information Processing Systems 2024

ICML

RICE: Breaking Through the Training Bottlenecks of Reinforcement Learning with Explanation

Spotlight Top-3.5%

Zelei Cheng, Xian Wu, Jiahao Yu, and 3 more authors

In Proceedings of the 41st International Conference on Machine Learning 2024

ICLR@SET-LLM

Assessing Prompt Injection Risks in 200+ Custom GPTs

Featured in WIRED

Jiahao Yu, Yuhang Wu, Dong Shu, and 3 more authors

In ICLR 2024 Workshop on Secure and Trustworthy Large Language Models 2024

arXiv

GPTFuzzer: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

Geekcon 2023 Annual Themed Debate Breakthrough Awards

Jiahao Yu, Xingwei Lin, Zheng Yu, and 1 more author

arXiv preprint 2023

NIPS

StateMask: Explaining Deep Reinforcement Learning through State Mask

Zelei Cheng*, Xian Wu*, Jiahao Yu*, and 3 more authors

In Proceedings of the 37th Conference on Neural Information Processing Systems 2023

@inproceedings{statemask,
  title = {StateMask: Explaining Deep Reinforcement Learning through State Mask},
  author = {Cheng*, Zelei and Wu*, Xian and Yu*, Jiahao and Sun, Wenhai and Guo, Wenbo and Xing, Xinyu},
  booktitle = {Proceedings of the 37th Conference on Neural Information Processing Systems},
  year = {2023},
}

USENIX

AIRS: Explanation for Deep Reinforcement Learning based Security Applications

Jiahao Yu, Wenbo Guo, Qi Qin, and 3 more authors

In Proceedings of the USENIX Security Symposium 2023

Abstract

Recently, we have witnessed the success of deep reinforcement learning (DRL) in many security applications, ranging from malware mutation to selfish blockchain mining. Like all other machine learning methods, the lack of explainability has been limiting its broad adoption as users have difficulty establishing trust in DRL models’ decisions. Over the past years, different methods have been proposed to explain DRL models but unfortunately, they are often not suitable for security applications, in which explanation fidelity, efficiency, and the capability of model debugging are largely lacking. In this work, we propose AIRS, a general framework to explain deep reinforcement learning-based security applications. Unlike previous works that pinpoint important features to the agent’s current action, our explanation is at the step level. It models the relationship between the final reward and the key steps that a DRL agent takes, and thus outputs the steps that are most critical towards the final reward the agent has gathered. Using four representative security-critical applications, we evaluate AIRS from the perspectives of explainability, fidelity, stability, and efficiency. We show that AIRS could outperform alternative explainable DRL methods. We also showcase AIRS’s utility, demonstrating that our explanation could facilitate the DRL model’s failure offset, help users establish trust in a model decision, and even assist the identification of inappropriate reward designs.