I'm a research scientist at the UK AI Security Institute (AISI), working on white-box control, sandbagging, low-incrimination control, training-based mitigations, and model organisms.
Previously: working on lie-detector probes and black-box monitors, and training sandbagging model organisms to stress-test those detection methods.
Before this I interned at the Center for Human-Compatible Artificial Intelligence under Erik Jenner, developing mechanistic anomaly detection techniques to automatically flag jailbreaks and backdoors at runtime by detecting unusual patterns of activations. We also fine-tuned backdoored LLMs that shed their harmlessness training under various trigger conditions, in order to test these anomaly detection methods.
See my post on graphical tensor notation for interpretability. I also attended MATS 5.0 under Lee Sharkey and Dan Braun (see our paper: Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning), attended the Machine Learning for Alignment Bootcamp in Berkeley in 2022, did a machine learning / neuroscience internship in 2020/2021, and wrote a post exploring the potential counterfactual impact of AI safety work.
I also recently finished my PhD at the University of Queensland, Australia, under Ian McCulloch, where I worked on new "tensor network" algorithms, which can be used to simulate entangled quantum materials and quantum computers, or to perform machine learning. I've also proposed a new definition of wavefunction branches based on quantum circuit complexity.
My website: https://sites.google.com/view/jordantensor/
Contact me: jordantensor [at] gmail [dot] com. Also see my CV, LinkedIn, or Twitter.