Surfacing Control: Targeted Interventions in DiT Models for Creative Expression in Multiple Domains

This page contains supplementary audio and images.

Authors: Daniel Hearn, Creative Computing Institute, UAL (d.hearn@arts.ac.uk), Mick Grierson, Creative Computing Institute, UAL (m.grierson@arts.ac.uk)

Abstract: While state-of-the-art generative models like Diffusion Transformers (DiT) are capable of producing high-fidelity media, their black-box nature reduces creative expression to high-level control, limiting human agency. Extending prior work on network bending, we demonstrate that fundamental operations, such as scaling and noise injection in selected transformer block layers, can achieve expressive and fine-grained control of the generative process. Our approach is validated across multiple domains, including image and audio generation, showing that these interventions provide a generalizable method to enhance human agency. By enabling direct intervention, our work positions these models as instruments for artistic expression, enabling novel aesthetic directions not possible through prompt-only generation.
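The abstract names two basic operations, scaling and noise injection, applied at selected layers. As a rough illustration only, the sketch below applies both to a plain list of activation values. The function name `bend`, the additive zero-mean Gaussian form of the noise, and the order of operations (scale first, then add noise) are assumptions made for this example and are not confirmed by this page; the official code remains the reference.

```python
import random

def bend(activations, param_scale=1.0, noise_scale=0.0, seed=None):
    """Hypothetical intervention: scale activations, then add Gaussian noise.

    Assumption: the noise is additive, zero-mean and unit-variance, weighted
    by noise_scale. The page does not state the exact formulation.
    """
    rng = random.Random(seed)
    return [param_scale * a + noise_scale * rng.gauss(0.0, 1.0) for a in activations]

# Under these assumptions, param_scale=1.0 with noise_scale=0.0 leaves the
# activations unchanged, i.e. no intervention.
base = bend([0.5, -1.0, 2.0])
scaled = bend([0.5, -1.0, 2.0], param_scale=2.0)
noisy = bend([0.5, -1.0, 2.0], noise_scale=1.5, seed=0)
```

In a real model the same operation would be attached to the output of a chosen projection (for example k, v, q, or the feed-forward layer) inside one transformer block, rather than applied to a Python list.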

Official Code

The official code for the experiments can be found in the GitHub repository.

Parameter Settings for Generated Images

Combined Scaling Grids

k_hook_combined_scaling_grid.png
Target block: 12, Prompt: "A portrait photograph of a person smiling", Guidance scale: 4.0, Steps: 20, Parameter: k, Parameter scales (rows): [1.0, 0.5, 1.5, 2.5], Noise scales (columns): [0.0, 0.5, 1.0, 1.5]
v_hook_combined_scaling_grid.png
Target block: 12, Prompt: "A portrait photograph of a person smiling", Guidance scale: 4.0, Steps: 20, Parameter: v, Parameter scales (rows): [1.0, 0.5, 1.5, 2.5], Noise scales (columns): [0.0, 0.5, 1.0, 1.5]
q_hook_combined_scaling_grid.png
Target block: 12, Prompt: "A portrait photograph of a person smiling", Guidance scale: 4.0, Steps: 20, Parameter: q, Parameter scales (rows): [1.0, 0.5, 1.5, 2.5], Noise scales (columns): [0.0, 0.5, 1.0, 1.5]
ff_hook_combined_scaling_grid.png
Target block: 12, Prompt: "A portrait photograph of a person smiling", Guidance scale: 4.0, Steps: 20, Parameter: ff, Parameter scales (rows): [1.0, 0.5, 1.5, 2.5], Noise scales (columns): [0.0, 0.5, 1.0, 1.5]
all_hooks_combined_scaling_grid.png
Target block: 12, Prompt: "A portrait photograph of a person smiling", Guidance scale: 4.0, Steps: 20, Parameters: all hooks, Parameter scales (rows): [1.0, 0.5, 1.5, 2.5], Noise scales (columns): [0.0, 0.5, 1.0, 1.5]
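Each combined scaling grid pairs every parameter scale (rows) with every noise scale (columns). A minimal sketch of how the 16 cells of one grid could be enumerated is shown here; the dictionary keys and the helper name `grid_settings` are hypothetical, and the row-major ordering is an assumption about how the grid is read.

```python
from itertools import product

# Values copied from the grid captions above.
param_scales = [1.0, 0.5, 1.5, 2.5]  # rows
noise_scales = [0.0, 0.5, 1.0, 1.5]  # columns

def grid_settings(hook, block=12):
    """Return one settings dict per grid cell, in row-major order."""
    return [
        {"block": block, "hook": hook, "param_scale": p, "noise_scale": n}
        for p, n in product(param_scales, noise_scales)
    ]

cells = grid_settings("k")
# cells[0] is the top-left cell; cells[4] starts the second row.
```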

Hook Comparison

hook_comparison.png
Target block: 12, Prompt: "A portrait photograph of a person smiling", Guidance scale: 4.0, Steps: 20, Parameter scale: 1.0, Noise scale: 1.5 (except first image: base), Hooks: base, k, v, q, ff, all (left to right)

Block Comparison Rows

k_hook_block_comparison_row.png
Target blocks: 0-23 (left to right), Prompt: "A portrait photograph of a person smiling", Guidance scale: 4.0, Steps: 20, Parameter: k, Parameter scale: 2.0, Noise scale: 0.0
v_hook_block_comparison_row.png
Target blocks: 0-23 (left to right), Prompt: "A portrait photograph of a person smiling", Guidance scale: 4.0, Steps: 20, Parameter: v, Parameter scale: 2.0, Noise scale: 0.0
q_hook_block_comparison_row.png
Target blocks: 0-23 (left to right), Prompt: "A portrait photograph of a person smiling", Guidance scale: 4.0, Steps: 20, Parameter: q, Parameter scale: 2.0, Noise scale: 0.0
ff_hook_block_comparison_row.png
Target blocks: 0-23 (left to right), Prompt: "A portrait photograph of a person smiling", Guidance scale: 4.0, Steps: 20, Parameter: ff, Parameter scale: 2.0, Noise scale: 0.0

Audio Examples

Below are audio grids for each parameter sweep (by_param) and all-hooks combos, applied at block 8 in the transformer model. All audio files are generated with the same prompt "Techno beat 120bpm", a guidance scale of 2.0, and 8 generation steps.

Parameter Sweeps (by_param)

self_attn

Parameter scales (rows): [0.5, 1.0, 2.0, 10.0], Noise scales (columns): [0.0, 0.5, 1.0, 1.5]

cross_attn

Parameter scales (rows): [0.5, 1.0, 2.0, 10.0], Noise scales (columns): [0.0, 0.5, 1.0, 1.5]

cross_q

Parameter scales (rows): [0.5, 1.0, 2.0, 10.0], Noise scales (columns): [0.0, 0.5, 1.0, 1.5]

cross_kv

Parameter scales (rows): [0.5, 1.0, 2.0, 4.0], Noise scales (columns): [0.0, 0.5, 1.0, 1.5]

ff

Parameter scales (rows): [0.5, 1.0, 2.0, 3.0], Noise scales (columns): [0.0, 0.5, 1.0, 1.5]

Block Sweeps (by_layer)

The following audio examples demonstrate the effect of applying hooks to different transformer blocks. All examples use a fixed parameter scale of 2.0 and a noise scale of 0.5.
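The by_layer grids in this section cover blocks 0 to 15, four per row. As a small sketch, the mapping from a block index to its grid cell, together with the shared settings stated above, could look like the following; the helper name `block_position` and the dictionary keys are hypothetical.

```python
def block_position(block, columns=4):
    """Map a block index to its (row, column) in a grid filled left to right."""
    return divmod(block, columns)

# Settings shared by every clip in the by_layer sweeps, per the text above.
sweep = [
    {"block": b, "param_scale": 2.0, "noise_scale": 0.5, "cell": block_position(b)}
    for b in range(16)
]
```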

self_attn (by layer)

Target blocks: 0-15 (left to right, top to bottom; four per row)

cross_attn (by layer)

Target blocks: 0-15 (left to right, top to bottom; four per row)

cross_q (by layer)

Target blocks: 0-15 (left to right, top to bottom; four per row)

cross_kv (by layer)

Target blocks: 0-15 (left to right, top to bottom; four per row)

ff (by layer)

Target blocks: 0-15 (left to right, top to bottom; four per row)

All-Hooks Combo

Parameter scales (rows): [1.0, 2.0], Noise scales (columns): [0.0, 1.0]