Surfacing Control: Targeted Interventions in DiT Models

Authors: Daniel Hearn, Creative Computing Institute, UAL (d.hearn@arts.ac.uk), Mick Grierson, Creative Computing Institute, UAL (m.grierson@arts.ac.uk)

Abstract: While state-of-the-art generative models like Diffusion Transformers (DiT) are capable of producing high-fidelity media, their black-box nature reduces creative expression to high-level control, limiting human agency. Extending prior work on network bending, we demonstrate that fundamental operations, such as scaling and noise injection in selected transformer block layers, can achieve expressive and fine-grained control of the generative process. Our approach is validated across multiple domains, including image and audio generation, showing that these interventions provide a generalizable method to enhance human agency. By enabling direct intervention, our work positions these models as instruments for artistic expression, enabling novel aesthetic directions not possible through prompt-only generation.

Official Code

The official code for the experiments can be found in the GitHub repository.

Parameter Settings for Generated Images

Combined Scaling Grids

k_hook_combined_scaling_grid.png
Target block: 12, Prompt: "A portrait photograph of a person smiling", Guidance scale: 4.0, Steps: 20, Parameter: k, Parameter scales (rows): [1.0, 0.5, 1.5, 2.5], Noise scales (columns): [0.0, 0.5, 1.0, 1.5]

v_hook_combined_scaling_grid.png
Target block: 12, Prompt: "A portrait photograph of a person smiling", Guidance scale: 4.0, Steps: 20, Parameter: v, Parameter scales (rows): [1.0, 0.5, 1.5, 2.5], Noise scales (columns): [0.0, 0.5, 1.0, 1.5]

q_hook_combined_scaling_grid.png
Target block: 12, Prompt: "A portrait photograph of a person smiling", Guidance scale: 4.0, Steps: 20, Parameter: q, Parameter scales (rows): [1.0, 0.5, 1.5, 2.5], Noise scales (columns): [0.0, 0.5, 1.0, 1.5]

ff_hook_combined_scaling_grid.png
Target block: 12, Prompt: "A portrait photograph of a person smiling", Guidance scale: 4.0, Steps: 20, Parameter: ff, Parameter scales (rows): [1.0, 0.5, 1.5, 2.5], Noise scales (columns): [0.0, 0.5, 1.0, 1.5]

all_hooks_combined_scaling_grid.png
Target block: 12, Prompt: "A portrait photograph of a person smiling", Guidance scale: 4.0, Steps: 20, Parameters: all hooks, Parameter scales (rows): [1.0, 0.5, 1.5, 2.5], Noise scales (columns): [0.0, 0.5, 1.0, 1.5]
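The combined grids above pair each parameter scale with each noise scale. As a minimal sketch of what one cell could compute, the snippet below scales a list of activations and then adds Gaussian noise. The function names `bend` and `grid_settings`, the order of operations (scale first, then add noise), and the use of unit-variance Gaussian noise are assumptions made for illustration; they are not taken from the official code.

```python
import random

# Values copied from the grid settings listed above.
PARAM_SCALES = [1.0, 0.5, 1.5, 2.5]  # grid rows
NOISE_SCALES = [0.0, 0.5, 1.0, 1.5]  # grid columns


def bend(activations, param_scale, noise_scale, rng):
    """Hypothetical intervention: scale each value, then add Gaussian noise.

    Assumed form: out = param_scale * x + noise_scale * N(0, 1).
    """
    return [param_scale * a + noise_scale * rng.gauss(0.0, 1.0)
            for a in activations]


def grid_settings():
    """Return the (param_scale, noise_scale) pair for every grid cell."""
    return [[(p, n) for n in NOISE_SCALES] for p in PARAM_SCALES]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs match
    toy_activations = [0.2, -1.0, 0.7]  # stand-in for a q, k, v or ff output
    for row in grid_settings():
        for param_scale, noise_scale in row:
            print(param_scale, noise_scale,
                  bend(toy_activations, param_scale, noise_scale, rng))
```

Under these assumptions, a parameter scale of 1.0 with a noise scale of 0.0 leaves the activations unchanged, which is consistent with the first grid cell acting as the unmodified baseline.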

Hook Comparison

hook_comparison.png
Target block: 12, Prompt: "A portrait photograph of a person smiling", Guidance scale: 4.0, Steps: 20, Parameter scale: 1.0, Noise scale: 1.5 (except first image: base), Hooks: base, k, v, q, ff, all (left to right)

Block Comparison Rows

k_hook_block_comparison_row.png
Target blocks: 0-23 (left to right), Prompt: "A portrait photograph of a person smiling", Guidance scale: 4.0, Steps: 20, Parameter: k, Parameter scale: 2.0, Noise scale: 0.0

v_hook_block_comparison_row.png
Target blocks: 0-23 (left to right), Prompt: "A portrait photograph of a person smiling", Guidance scale: 4.0, Steps: 20, Parameter: v, Parameter scale: 2.0, Noise scale: 0.0

q_hook_block_comparison_row.png
Target blocks: 0-23 (left to right), Prompt: "A portrait photograph of a person smiling", Guidance scale: 4.0, Steps: 20, Parameter: q, Parameter scale: 2.0, Noise scale: 0.0

ff_hook_block_comparison_row.png
Target blocks: 0-23 (left to right), Prompt: "A portrait photograph of a person smiling", Guidance scale: 4.0, Steps: 20, Parameter: ff, Parameter scale: 2.0, Noise scale: 0.0
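The block comparison rows apply the same intervention to one block at a time, from block 0 to block 23. A simplified sketch of that sweep follows, using a toy stack of callables in place of real transformer blocks. The helpers `run_with_intervention` and `block_sweep` are hypothetical, and the intervention is applied here to the whole block output, whereas the experiments above target the k, v, q or ff component inside a block.

```python
from typing import Callable, List

Vector = List[float]
Block = Callable[[Vector], Vector]


def run_with_intervention(blocks: List[Block], x: Vector, target_block: int,
                          intervene: Callable[[Vector], Vector]) -> Vector:
    """Run all blocks in order, modifying only the target block's output."""
    for index, block in enumerate(blocks):
        x = block(x)
        if index == target_block:
            x = intervene(x)  # assumed hook point: directly after the block
    return x


def block_sweep(blocks: List[Block], x: Vector,
                intervene: Callable[[Vector], Vector]) -> List[Vector]:
    """Return one output per target block, mirroring a comparison row."""
    return [run_with_intervention(blocks, x, target, intervene)
            for target in range(len(blocks))]


if __name__ == "__main__":
    # Toy model: 24 blocks that each add 1.0 to every value.
    toy_blocks = [lambda v: [a + 1.0 for a in v]] * 24
    # Parameter scale 2.0 and noise scale 0.0, as in the rows above.
    outputs = block_sweep(toy_blocks, [0.0], lambda v: [2.0 * a for a in v])
    for target, out in enumerate(outputs):
        print(target, out)
```

Even in this toy setting the result depends on which block is targeted, because every later block operates on the modified values.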

Audio Examples

Below are audio grids for each parameter sweep (by_param) and for the all-hooks combinations, with the intervention applied at block 8 of the transformer model. All audio files were generated with the same prompt, "Techno beat 120bpm", a guidance scale of 2.0, and 8 generation steps.

Parameter Sweeps (by_param)

self_attn

Parameter scales (rows): [0.5, 1.0, 2.0, 10.0], Noise scales (columns): [0.0, 0.5, 1.0, 1.5]

cross_attn

Parameter scales (rows): [0.5, 1.0, 2.0, 10.0], Noise scales (columns): [0.0, 0.5, 1.0, 1.5]

cross_q

Parameter scales (rows): [0.5, 1.0, 2.0, 10.0], Noise scales (columns): [0.0, 0.5, 1.0, 1.5]

cross_kv

Parameter scales (rows): [0.5, 1.0, 2.0, 4.0], Noise scales (columns): [0.0, 0.5, 1.0, 1.5]

ff

Parameter scales (rows): [0.5, 1.0, 2.0, 3.0], Noise scales (columns): [0.0, 0.5, 1.0, 1.5]

Block Sweeps (by_layer)

The following audio examples demonstrate the effect of applying hooks to different transformer blocks. All examples use a fixed parameter scale of 2.0 and a fixed noise scale of 0.5.