About
Roughly speaking, at the moment I am mainly interested in working on the following three areas. If you have a solid take on why you disagree with my views here, please do point this out to me! As the saying goes, ‘An error does not become a mistake until you refuse to correct it.’
Applying systems thinking and safety engineering to AI
Other safety-critical industries like aviation, nuclear, submarine or finance have much more mature approaches to risk management. Moreover, they often think of risk at the ‘system level’, rather than at the level of individual components – studying how failures emerge due to complex interactions between components, all of which might be working fine on their own. This sort of thinking is largely absent in AI, where people are only now beginning to work out standards and best practices for risk management.
At SaferAI, we are currently trying to ‘push the needle’ on threat modelling for AI-driven Loss of Control scenarios, in particular trying to gain insights that are actionable and based on more than just ‘best guesses’. This is extremely hard due to a vast amount of uncertainty (including Vingean uncertainty). But we think that we don’t have to reinvent the wheel and start from scratch – we hope that by considering AI as an engineering system with a control/feedback cycle and by borrowing concepts from other sectors, we can make a useful marginal contribution whose value proposition is distinct from existing work on LoC (see e.g., Anthropic’s Sabotage Risk Report, METR’s misalignment audit or DeepMind’s safety case).
At this stage, I don’t have much more to say, as it is a very fresh research direction and there are tons of open questions in my head. I hope to be able to share something more concrete by the end of summer 2026. We have taken a first stab at this work in a paper called ‘Exploring Systems-Thinking Approaches to Loss of Control Risk’ that was accepted to TAIGR ICML and will be uploaded to arXiv soon.
I am interested in talking to people who have experience with safety engineering in other sectors and who are keen to apply their knowledge to AI. If that’s you – please get in touch!
Keywords: risk modelling, risk management, safety engineering, systems thinking, complex systems, emergent failures, STPA
Foundational research into AI Safety
Alignment research has been largely abandoned in favour of other areas that are implicitly an admission that we are incapable of solving the alignment problem. Things like AI Control or evaluations are all useful, but at the end of the day they are not solving THE problem. To be clear, I’m not criticising these fields – I do not claim to offer anything better – but I’m just making an observation. It seems that we currently do not have a theoretically-grounded plan for tackling this problem, apart from alternatives like ‘ship product and learn’ or ‘just stop and never build the thing’.
For this reason, I still think that it is not unreasonable for a significant chunk of AI Safety people to work on ambitious – even moonshot – approaches to alignment, or at least to use ambitious techniques to help us understand AI systems, steer their behaviour and build relevant safeguards. I do concede that in most cases this probably doesn’t make sense, because AI Safety as a whole is understaffed, so even the ‘obviously useful and important’ areas need to be scaled up before we start allocating serious amounts of talent to the more niche bets. You probably also need to have a rather specific mindset and be comfortable with the possibility that your work will not be useful in the short term (or perhaps ever, in which case you really should have picked something else, as all the critics had been telling you).
Still, I think that we should be open to efforts that have a chance of succeeding beyond very short timelines, especially if they can be expected to have an outsized impact. I find the concept of ‘broad timelines’ by Toby Ord a very good mental lever and a good representation of my own views on this topic (you can skip to the ‘Implications’ section if you’re short on time - pun intended). Arguments I’ve heard in favour of committing to short timelines exclusively:
- Interventions that work on short timelines will also be useful in the long run. I think this depends on the particular agenda we’re looking at. For example, I think red-teaming, data filtering or unlearning are likely to remain valuable for a long time, because they will continue to minimise x-risk due to misuse (and probably also due to autonomous rogue AI, to some extent). When it comes to things like scalable oversight or AI Control, the consensus seems to be that these methods will not scale and will eventually become moot (see for example here and here). To be clear, I am not saying we should stop pursuing them - it’s better to be in 2032 with a mature state of AI Control than not, and perhaps we will learn some useful things along the way too. However, I would like to point out that these alignment-relevant agendas will quite plausibly not scale to superintelligence and, as such, the problem of alignment remains unsolved.
- Yudkowsky and MIRI tried for years and concluded that alignment is a ridiculously hard problem – not necessarily unsolvable indefinitely, but almost certainly unsolvable by the time we need to have it solved. I concede that some very smart people worked at MIRI and I respect their views (and those of other convinced doomers). Nonetheless, I would also argue that we – humanity as a whole – haven’t really tried to get this problem sorted out. The number of people globally working on the hard problem of alignment pales in comparison with the number of people who developed things like Quantum Field Theory or the nuclear bomb. I do not want to put too much weight on a small group of people concluding that we should all abandon basic research on AI Alignment and switch to governance.
For these reasons, I still feel a large affinity towards foundational, theoretical research. I think there’s a lot of value in such research, most of which cannot be predicted in advance, just like Einstein didn’t realise General Relativity would be used to calibrate the timing of GPS satellites, and the founders of CERN didn’t think their child would lead to the invention of touchscreens. Of course, I am both biased and nerd-sniped due to my background, but even after correcting for this, I think the argument is valid enough that I do not feel ashamed of putting it out there in public.
I am enthusiastic about trying to apply frameworks from other fields to AI, the classic example being Timaeus using Singular Learning Theory to understand training dynamics, or Simplex using computational mechanics and Bayesian updating to understand internal transformer representations. And I am pleased to see programs like the Iliad and AFFINE being created.
Is any of this going to help AI Safety? No idea, but it would be silly not to try. If you think that this is a waste of time, because the talent involved in such research could be redirected to work on other things – things you deeply believe are more useful and tractable – I think that this is a bit of a non sequitur. A lot of people outside of AI Safety will not be interested in working on ‘typical’ AI Safety topics, simply because they do not buy into the idea, have stable jobs elsewhere, etc. But if you can formulate an interesting problem for them, and abstract it away from the AI Safety framing, they likely will want to work on it. In such cases, you aren’t redirecting talent away from the areas you are most keen on – you are attracting talent that counterfactually wouldn’t be working on anything relevant to AI safety.
AI Verification
Trust, or more precisely ‘belief under uncertainty’, is the ultimate boss of humanity. Soldiers can be made to obey orders, even the controversial or pointless ones, because they believe they will be punished if they do not do so. Their superiors relaying such mindless orders do so because they believe they will be punished by their higher-ranking commanders. And these commanders are kept in check by each other, believing that their peers, but also existing norms and institutions, will hold them accountable. In theory, however, if they all agreed to defect together, they could get away with it, because all these rules, laws and norms are in practice upheld by them. It’s a collective action problem based on trust issues. This implicit social contract sometimes does break down and results in failing states where nothing can be enforced.
What is my point? My point is that I see this as analogous to the problem of upholding international treaties. Multilateral alliances or weapons conventions are not upheld because they have been enshrined in law - there is no single global institution that can reliably enforce punishments for violations, especially on the most powerful actors. They are upheld by mutual trust among all parties that each of them will respect the deal. This is of course a simplified picture and there are other factors at play, e.g. a party might defect from an agreement if the expected cost of punishment is below the expected benefit of defection.
In any case, increasing trust and observability into other parties’ actions seems to be a fundamental building block of a verifiable and enforceable treaty. The same holds for an international agreement on AI (of whatever nature, e.g. a moratorium, ‘memorandum of understanding’, a designation of ‘red lines’). Currently, there is no political will for such an agreement, but should it arise in the future, it would be a shame if it fell apart simply because interested parties cannot verify mutual compliance.
For this reason, I think that the agenda of AI Verification is one of the most important ones for AI Safety, especially for people with short timelines. It is also one where ‘generalists’ can still have an outsized marginal impact - so little progress has been made that you can make important contributions after a short period of upskilling. There are probably <30 people working on this globally, so the field is critically understaffed.
I am mainly interested in work on the conceptual and software side, as I do not have any hardware experience (but would be keen to obtain it). This includes things like cryptography, confidential computing, zero-knowledge proofs and how to design setups that do not require mutually-trusted hardware, as well as the human and political aspects that need to be sorted out in order for verification to work.
If you’d like to read more about this, any of these three reports is a great starting point:
- Verifying International Agreements on AI: Six Layers of Verification for Rules on Large-Scale AI Development and Deployment
- Verification for International AI Governance
- Mechanisms to Verify International Agreements About AI Development
And if you’re interested in contributing, do feel free to get in touch - I can connect you with people who are leading this research agenda.
Other interests
Other topics I enjoy reading about, but will probably not work on in the foreseeable future: mechanistic interpretability, Redwood-style AI control, AI welfare, animal welfare, jailbreaks, unlearning, cybersecurity.