ML Research to Prevent Catastrophic Data Poisoning

This is a rough research note – we’re sharing it for feedback and to spark discussion. We’re less confident in its methods and conclusions.

Previously, Forethought and others have highlighted the risk that a malicious actor could use data poisoning to instill secret loyalties into advanced AI systems, which then help the malicious actor to seize power.

This piece gives my view on what ML research could prevent this from happening.

The basic threat model is that some attacker carries out a catastrophic data poisoning attack. The attacker could be an external actor (e.g. a foreign power with advanced hacking capabilities), an insider threat (e.g. an AI company employee), a misaligned AI system, or a combination of these.

I believe that these attacks become potentially catastrophic once AI systems are expert-human level at AI R&D.

I’m most concerned about three attacks:

I think these data poisoning attacks only have the potential to be truly catastrophic when AI is much more capable than today.

So why work on this now? For three reasons:

(The first two reasons – technique iteration and forecasting risk – are analogous to reasons to research methods for detecting misaligned scheming AI, despite this risk only emerging at much higher capability levels than we see today.)

I listed these reasons in order of importance. I’d guess most of the value from this research today will come from technique iteration; forecasting risk and prioritising infosec mitigations are less important today but will become more important as capabilities improve.

These experiments have a “red team” trying to implement a data poisoning attack, and a “blue team” trying to detect the attack. The red team wins if they train a model that avoids detection and executes undesirable behaviour¹ during deployment² (see footnotes for clarifications).

In my mind, this research divides into “Plan A” research, “Plan B” research and “Plan C” research. These correspond to different possibilities for how well lab infosecurity positions them to prevent secret loyalties. In Plan A, infosecurity is better and so the blue team has more affordances than in Plan B, which itself has better infosecurity than Plan C.

Plan A is that AI companies will take certain infosec measures to prevent secret loyalties. In particular, they:

Plan A ML research aims to be helpful in the worlds where companies actually do these three things. In particular:

A possible iteration loop for research on this might look like:

The ideal output is a set of blue team techniques that generalises to novel red team attacks.

Plan B is the same as plan A, but without tracking the data that goes into training. In particular, AI companies:

Plan B ML research aims to be helpful in the worlds where companies actually do these two things. This research has similar aims to plan A research and so the iteration loop can be very similar – just reduce the default affordances of the blue team.

But I expect some changes of emphasis because, with Plan B, the red team is in a much stronger position. Here are the main ways I expect Plan B research to differ from Plan A research:

For completeness, this appendix has a possible iteration loop for Plan B research.

Plan C is the same as Plan B, but the AI company doesn’t reliably prevent widely deployed models from being swapped out. In particular, AI companies:

I suspect this is the current situation for frontier AI companies. Hopefully it changes!

If companies continue with Plan C, then probably the easiest path to secretly loyal AI is to swap out a widely-deployed model for a model you’ve explicitly fine-tuned or prompted to be secretly loyal, but which hasn’t itself been in an alignment audit. (This threat model doesn’t involve any data poisoning at all, so it is strictly speaking out of scope for this post!)

Here the most helpful ML research is probably creating model organisms of maximally concerning secret loyalties, with the hope of creating pressure for AI companies to move to Plan B or Plan A. E.g. a model that shares sensitive natsec information with foreign adversaries.⁶

So, how important is research on Plan A vs Plan B vs Plan C?

Let’s take them one by one.

Plan A:

Plan B:

Plan C:

Overall, none of these seems to dominate. In the near term, I’d deprioritise Plan C research because you can’t do technique iteration. I’d prioritise Plan B somewhat more highly than Plan A.

So maybe I’d allocate marginal research efforts today as follows: ~55% Plan B research, ~40% Plan A research, ~5% Plan C research.

The above is a very high-level account of the research methodology I’d be excited about.

Here are some more concrete research directions that could fit within this methodology:

Thanks very much to comments from Lukas Finnveden, Fabien Roger, Mia Taylor, Mary Phuong, Andrew Draganov, Owain Evans, Joseph Bloom, Haydn Belfield, Ryan Greenblatt, Alfie Lamerton, and Rohin Shah.

The iteration loop for Plan B research is similar to the Plan A version in the main text. Edits from the Plan A version are shown in bold:

ML research directions for preventing catastrophic data poisoning

ML research directions for preventing catastrophic data poisoning

Introduction

Threat model

Why work on this now?

ML experiments to help companies prevent catastrophic data poisoning

Plan A ML research

Plan B ML research

Plan C research?

Prioritisation

More-specific research directions

Appendix: iteration loop for Plan B research

Footnotes

Listen to our podcast

Subscribe

ML research directions for preventing catastrophic data poisoning

Citations

Citations

ML research directions for preventing catastrophic data poisoning

Introduction

Threat model

Why work on this now?

ML experiments to help companies prevent catastrophic data poisoning

Plan A ML research

Plan B ML research

Plan C research?

Prioritisation

More-specific research directions

Appendix: iteration loop for Plan B research

Footnotes

Citations

Listen to our podcast

Subscribe

Search