visualizing weights : https://distill.pub/2020/circuits/visualizing-weights/
Notes
- Two types of safety failures:
    - known safety problems
    - unknown safety problems
- In the extreme case, a model might make a ‘treacherous turn’: it may use its understanding of the training setup to deliberately behave well only during training, then pursue different goals once it knows it is outside the training distribution.
- Neural network parameters can be seen as the assembly instructions of a complex computer program.
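A minimal illustration of the analogy above (a hypothetical example, not from the source): the weights of a convolution kernel can be read directly as the "instructions" for what the layer computes.

```python
import numpy as np

# A hand-written 1D kernel. Reading the weights tells you the "program":
# w = [-1, 0, 1] computes a finite difference, i.e. an edge detector.
w = np.array([-1.0, 0.0, 1.0])

signal = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0])

# np.convolve flips its kernel, so pre-flip it to get cross-correlation,
# which is what neural-network "convolution" layers actually compute.
response = np.convolve(signal, w[::-1], mode="valid")
print(response)  # [ 1.  1.  0. -1.]: positive at the rising edge, negative at the falling one
```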
Research Desiderata
- Map neural network parameters to human-understandable algorithms.
- Prefer rigorous understanding of a narrow aspect of a model to less understanding of entire models.
- Methods should be able to discover unknown and unanticipated algorithms.
- It should be possible to apply the methods to standard, widely used neural networks.
Questions
- What controls whether a language model generates true or false statements?
- To what extent does the model represent social interactions or mental states?
- Is the model deliberately deceiving users? And what would deception even mean for a model?
- What is going on in meta-learning, particularly mesa-optimization?
- How do large language models store factual knowledge?
Aspirational Goal
Fully Understand a Neural Network
- One has a theory of what every neuron does, and can give a “proof by induction”: each layer behaves as the theory predicts, assuming the previous layers do.
- One has a theory that can explain every parameter in the model.
- One can reproduce the network with handwritten weights, without consulting the original, simply by understanding the theory of how it works.
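The last criterion invites a toy illustration (a hypothetical sketch, not from the source): if you truly understand the algorithm, you can write the weights by hand. Here is a two-layer ReLU network computing XOR from hand-written weights alone.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Hand-written weights: hidden unit 0 fires on "x1 AND NOT x2",
# hidden unit 1 fires on "x2 AND NOT x1"; the output unit ORs them.
W1 = np.array([[ 1.0, -1.0],
               [-1.0,  1.0]])
W2 = np.array([1.0, 1.0])

def xor_net(x):
    return W2 @ relu(W1 @ x)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_net(np.array(x, dtype=float)))  # 0.0, 1.0, 1.0, 0.0
```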
Research Direction
1. Discovering Features and Circuits
In the Circuits approach, neural networks are understood in terms of features and circuits. Because features and circuits are often universal, recurring across many models, every circuit and feature we find gives us a foothold into understanding future models, making them easier and faster to study. A few examples: curve detectors, high-low frequency detectors, and the car-detector circuit in InceptionV1.
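One concrete way to study a circuit, in the spirit of the “visualizing weights” article linked above, is to read the weights connecting features directly. A minimal sketch with hypothetical dense-layer weight matrices (real conv circuits need the full spatial kernels, but the idea is the same):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights of two adjacent dense layers (random stand-ins).
W1 = rng.normal(size=(32, 64))  # layer 1: 64 inputs  -> 32 features
W2 = rng.normal(size=(16, 32))  # layer 2: 32 features -> 16 features

# For adjacent layers, how strongly feature j reads from upstream feature i
# is just the weight W2[j, i]; a circuit is the subgraph of large-magnitude
# connections around features of interest.
j = 5
strongest = np.argsort(-np.abs(W2[j]))[:5]
print("feature", j, "reads mainly from features", strongest)

# "Expanded weights": ignoring the nonlinearity, W2 @ W1 linearizes how
# feature j reads the raw input two layers down (an approximation only).
expanded = W2 @ W1
print(expanded[j].shape)  # (64,): a linearized receptive field for feature j
```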
2. Scaling Circuits to Larger Models
A weakness of the circuits approach is that, by focusing on small-scale structure, it may not scale to understanding large models. One promising direction is to find additional, larger-scale structure. Two such types of structure (a toy check of the first appears after the list):
- motifs (recurring patterns in circuits, such as equivariance)
- structural phenomena (large scale patterns in how neural networks are organized, such as branch specialization)
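What checking an equivariance motif might look like (hypothetical kernels; the real examples are families like InceptionV1’s rotated curve detectors): rotationally equivariant features should have weights that are rotated copies of each other.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 5x5 kernels for two units suspected to be rotated copies.
kernel_a = rng.normal(size=(5, 5))
kernel_b = np.rot90(kernel_a)  # by construction, a 90-degree rotated copy

def cosine(a, b):
    """Cosine similarity between flattened kernels."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare every rotation of kernel_a against kernel_b; a sharp peak at one
# rotation is evidence for an equivariance motif.
for k in range(4):
    print(f"rot {90 * k:3d}: {cosine(np.rot90(kernel_a, k), kernel_b):+.3f}")
# Expect ~ +1.000 at rot 90 and small values elsewhere.
```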
3. Resolving Polysemanticity
Some neurons respond to multiple unrelated features. One theory is that polysemanticity occurs because models would ideally represent more features than they have neurons, and so exploit the fact that high-dimensional spaces have many “almost orthogonal” directions to store these features in “superposition” across neurons. Polysemanticity seems related to questions about disentangling representations.
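A quick numeric check of the “almost orthogonal” claim (a sketch, not an experiment from the source): random unit vectors in a high-dimensional space are nearly orthogonal, so many more feature directions than dimensions can coexist with low interference.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dims, n_features = 512, 2048  # 4x more features than dimensions

# Random unit vectors as candidate feature directions.
dirs = rng.normal(size=(n_features, n_dims))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

# Interference between features = off-diagonal pairwise dot products.
gram = dirs @ dirs.T
off_diag = np.abs(gram[~np.eye(n_features, dtype=bool)])
print(f"max interference:  {off_diag.max():.3f}")   # typically ~0.25
print(f"mean interference: {off_diag.mean():.3f}")  # typically ~0.035
# Thousands of directions fit in 512 dimensions while staying nearly
# orthogonal -- room for features in "superposition".
```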