Surfacing Control: Targeted Interventions in DiT Models for Creative Expression in Multiple Domains
This page contains supplementary audio and images.
Authors: Daniel Hearn, Creative Computing Institute, UAL (d.hearn@arts.ac.uk), Mick Grierson, Creative Computing Institute, UAL (m.grierson@arts.ac.uk)
Abstract: While state-of-the-art generative models like Diffusion Transformers (DiT) are capable of producing high-fidelity media, their black-box nature reduces creative expression to high-level control, limiting human agency. Extending prior work on network bending, we demonstrate that fundamental operations, such as scaling and noise injection in selected transformer block layers, can achieve expressive and fine-grained control of the generative process. Our approach is validated across multiple domains, including image and audio generation, showing that these interventions provide a generalizable method to enhance human agency. By allowing direct intervention, our work positions these models as instruments for artistic expression and opens novel aesthetic directions not possible through prompt-only generation.
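The two operations named in the abstract, scaling and noise injection, can be illustrated with a small framework-agnostic sketch. The function name `bend`, the exact form `scale * x + noise_scale * eps` with standard Gaussian `eps`, and the example activations are assumptions made for illustration; the official code may combine the two operations differently and would apply them to tensors inside the selected transformer block rather than to Python lists.

```python
import random
from typing import List, Optional


def bend(
    activations: List[float],
    scale: float = 1.0,
    noise_scale: float = 0.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Scale each activation, then add zero-mean Gaussian noise.

    Assumed form: scale * x + noise_scale * eps, with eps ~ N(0, 1).
    """
    rng = rng or random.Random()
    return [scale * a + noise_scale * rng.gauss(0.0, 1.0) for a in activations]


# Hypothetical activations at one hooked projection (e.g. q, k, v or ff).
x = [0.2, -1.0, 0.5, 3.0]

identity = bend(x, scale=1.0, noise_scale=0.0)  # leaves x unchanged
scaled = bend(x, scale=2.0, noise_scale=0.0)    # doubles every value
noisy = bend(x, scale=1.0, noise_scale=1.5, rng=random.Random(0))  # seeded noise
```

Under this assumed form, a parameter scale of 1.0 with a noise scale of 0.0 is the identity, which is why that setting can serve as the unmodified reference in a sweep.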
Official Code
The official code for the experiments can be found in the GitHub repository.
Parameter Settings for Generated Images
Combined Scaling Grids
k_hook_combined_scaling_grid.png
Target block: 12, Prompt: "A portrait photograph of a person smiling", Guidance scale: 4.0, Steps: 20, Parameter: k, Parameter scales (rows): [1.0, 0.5, 1.5, 2.5], Noise scales (columns): [0.0, 0.5, 1.0, 1.5]
v_hook_combined_scaling_grid.png
Target block: 12, Prompt: "A portrait photograph of a person smiling", Guidance scale: 4.0, Steps: 20, Parameter: v, Parameter scales (rows): [1.0, 0.5, 1.5, 2.5], Noise scales (columns): [0.0, 0.5, 1.0, 1.5]
q_hook_combined_scaling_grid.png
Target block: 12, Prompt: "A portrait photograph of a person smiling", Guidance scale: 4.0, Steps: 20, Parameter: q, Parameter scales (rows): [1.0, 0.5, 1.5, 2.5], Noise scales (columns): [0.0, 0.5, 1.0, 1.5]
ff_hook_combined_scaling_grid.png
Target block: 12, Prompt: "A portrait photograph of a person smiling", Guidance scale: 4.0, Steps: 20, Parameter: ff, Parameter scales (rows): [1.0, 0.5, 1.5, 2.5], Noise scales (columns): [0.0, 0.5, 1.0, 1.5]
all_hooks_combined_scaling_grid.png
Target block: 12, Prompt: "A portrait photograph of a person smiling", Guidance scale: 4.0, Steps: 20, Parameters: all hooks, Parameter scales (rows): [1.0, 0.5, 1.5, 2.5], Noise scales (columns): [0.0, 0.5, 1.0, 1.5]
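The grids above pair every parameter scale (rows) with every noise scale (columns), giving 16 images per grid. A minimal sketch of that enumeration follows; the helper names `grid_settings` and `cell`, the dictionary keys, and the assumption that the listed values run top to bottom and left to right are illustrative and not taken from the official code.

```python
from itertools import product

PARAM_SCALES = [1.0, 0.5, 1.5, 2.5]  # rows, assumed top to bottom
NOISE_SCALES = [0.0, 0.5, 1.0, 1.5]  # columns, assumed left to right


def grid_settings(hook, block=12):
    """Return a row-major list of settings, one dict per grid image."""
    return [
        {"hook": hook, "block": block, "param_scale": p, "noise_scale": n}
        for p, n in product(PARAM_SCALES, NOISE_SCALES)
    ]


def cell(settings, row, col):
    """Look up the settings for the image at (row, col) in the grid."""
    return settings[row * len(NOISE_SCALES) + col]


k_grid = grid_settings("k")
top_left = cell(k_grid, 0, 0)      # param_scale 1.0, noise_scale 0.0
bottom_right = cell(k_grid, 3, 3)  # param_scale 2.5, noise_scale 1.5
```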
Hook Comparison
hook_comparison.png
Target block: 12, Prompt: "A portrait photograph of a person smiling", Guidance scale: 4.0, Steps: 20, Parameter scale: 1.0, Noise scale: 1.5 (except first image: base), Hooks: base, k, v, q, ff, all (left to right)
Block Comparison Rows
k_hook_block_comparison_row.png
Target blocks: 0-23 (left to right), Prompt: "A portrait photograph of a person smiling", Guidance scale: 4.0, Steps: 20, Parameter: k, Parameter scale: 2.0, Noise scale: 0.0
v_hook_block_comparison_row.png
Target blocks: 0-23 (left to right), Prompt: "A portrait photograph of a person smiling", Guidance scale: 4.0, Steps: 20, Parameter: v, Parameter scale: 2.0, Noise scale: 0.0
q_hook_block_comparison_row.png
Target blocks: 0-23 (left to right), Prompt: "A portrait photograph of a person smiling", Guidance scale: 4.0, Steps: 20, Parameter: q, Parameter scale: 2.0, Noise scale: 0.0
ff_hook_block_comparison_row.png
Target blocks: 0-23 (left to right), Prompt: "A portrait photograph of a person smiling", Guidance scale: 4.0, Steps: 20, Parameter: ff, Parameter scale: 2.0, Noise scale: 0.0
Audio Examples
Below are audio grids for each parameter sweep (by_param) and for the all-hooks combinations, applied at block 8 of the transformer model. All audio files were generated with the same prompt, "Techno beat 120bpm", a guidance scale of 2.0, and 8 generation steps.
Parameter Sweeps (by_param)
self_attn
Parameter scales (rows): [0.5, 1.0, 2.0, 10.0], Noise scales (columns): [0.0, 0.5, 1.0, 1.5]
cross_attn
Parameter scales (rows): [0.5, 1.0, 2.0, 10.0], Noise scales (columns): [0.0, 0.5, 1.0, 1.5]
cross_q
Parameter scales (rows): [0.5, 1.0, 2.0, 10.0], Noise scales (columns): [0.0, 0.5, 1.0, 1.5]
cross_kv
Parameter scales (rows): [0.5, 1.0, 2.0, 4.0], Noise scales (columns): [0.0, 0.5, 1.0, 1.5]
ff
Parameter scales (rows): [0.5, 1.0, 2.0, 3.0], Noise scales (columns): [0.0, 0.5, 1.0, 1.5]
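Unlike the image grids, the by_param audio grids use a different range of parameter scales per hook (cross_kv stops at 4.0 and ff at 3.0, while the others reach 10.0). One way to record this is a per-hook table, sketched below. The values are copied from this page; the dictionary layout, key names, and the `sweep` helper are hypothetical and not taken from the official code.

```python
from itertools import product

# Settings shared by every audio clip on this page.
SHARED = {"prompt": "Techno beat 120bpm", "guidance_scale": 2.0, "steps": 8, "block": 8}

NOISE_SCALES = [0.0, 0.5, 1.0, 1.5]  # grid columns, same for every hook

# Parameter scales (grid rows) differ per hook.
PARAM_SCALES = {
    "self_attn": [0.5, 1.0, 2.0, 10.0],
    "cross_attn": [0.5, 1.0, 2.0, 10.0],
    "cross_q": [0.5, 1.0, 2.0, 10.0],
    "cross_kv": [0.5, 1.0, 2.0, 4.0],
    "ff": [0.5, 1.0, 2.0, 3.0],
}


def sweep(hook):
    """Return one settings dict per clip in the grid for the given hook."""
    return [
        dict(SHARED, hook=hook, param_scale=p, noise_scale=n)
        for p, n in product(PARAM_SCALES[hook], NOISE_SCALES)
    ]


# 5 hooks x 4 parameter scales x 4 noise scales = 80 clips in total.
all_clips = [cfg for hook in PARAM_SCALES for cfg in sweep(hook)]
```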
Block Sweeps (by_layer)
The following audio examples demonstrate the effect of applying hooks to different transformer blocks. All examples use a fixed parameter scale of 2.0 and a fixed noise scale of 0.5.