Hi! I'm Caden. I'm currently thinking about interfaces for interpretability and (occasionally) chasing some research threads at a nonprofit AI lab.
Email: kh4dien [at] gmail [dot] com
Find me on GitHub | Twitter
Current Work
- Steering Fine-Tuning with Targeted Concept Ablation
Helena Casademunt*, Caden Juang*, Samuel Marks, Senthooran Rajamanoharan, Neel Nanda
ICLR SLLM and BuildingTrust workshops
paper | code
- Sparsifying a Model's Computations: Preliminary Findings
Jacob Drori, Jannik Brinkmann, Logan Riggs, Caden Juang
LessWrong
post
- Robustly Identifying Concepts Introduced During Chat Fine-Tuning Using Crosscoders
Julian Minder*, Clement Dumas*, Caden Juang, Bilal Chughtai, Neel Nanda
arXiv
paper | code
- Automatically Interpreting Millions of Features in Large Language Models
Gonçalo Paulo, Alex Mallen, Caden Juang, Nora Belrose
ICML 2025
paper | code
- Open Source Automated Interpretability for Sparse Autoencoder Features
Caden Juang*, Gonçalo Paulo*, Jacob Drori, Nora Belrose
EleutherAI blog
post | code | demo
- NNsight and NDIF: Democratizing Access to Foundation Model Internals
Jaden Fiotto-Kaufman*, Alexander R Loftus*, Eric Todd, Jannik Brinkmann, Caden Juang, Koyena Pal, Can Rager, Aaron Mueller, Samuel Marks, Arnab Sen Sharma, Francesca Lucchetti, Michael Ripa, Adam Belfki, Nikhil Prakash, Sumeet Multani, Carla Brodley, Arjun Guha, Jonathan Bell, Byron Wallace, David Bau
ICLR 2025
paper | code
- Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals
Joshua Clymer*, Caden Juang*, Severin Field*
arXiv
paper
Misc
Last updated: 6/2025 | source