Neural Net: Number of Attention Heads

What attention heads are and how they affect model performance.

Published May 8, 2024 ET

You'll often hear, "X model has Y attention heads."

Attention heads let the model attend to different parts of the input sequence simultaneously. Each head applies its own learned projections to the queries, keys, and values, giving it a distinct view of the data; the per-head outputs are then concatenated and projected to produce the final output.
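
To make this concrete, here is a minimal sketch of multi-head self-attention written with PyTorch. The class name `MiniMultiHeadAttention` and the dimensions are illustrative, not taken from any particular model; the point is that the embedding is split across the heads, each head attends over the sequence independently, and the results are concatenated back together.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniMultiHeadAttention(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must divide evenly across heads"
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads       # each head sees a smaller slice
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)  # query/key/value projections
        self.out = nn.Linear(embed_dim, embed_dim)      # recombines the heads

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, embed_dim = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Reshape so each head attends over the sequence independently:
        # (batch, seq_len, embed_dim) -> (batch, num_heads, seq_len, head_dim)
        def split(t):
            return t.view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)

        # Scaled dot-product attention, computed per head in parallel
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        weights = F.softmax(scores, dim=-1)
        context = weights @ v

        # Concatenate the heads back together and project to the output
        context = context.transpose(1, 2).reshape(batch, seq_len, embed_dim)
        return self.out(context)

# Example: 8 heads, each working with a 64-dimensional slice of a 512-dim embedding
attn = MiniMultiHeadAttention(embed_dim=512, num_heads=8)
out = attn(torch.randn(2, 16, 512))  # (batch=2, seq_len=16, embed_dim=512)
print(out.shape)                     # torch.Size([2, 16, 512])
```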

More attention heads: Can improve the model's ability to capture various aspects of the input data, leading to better performance on tasks requiring nuanced understanding. Also increases computational load.

Fewer attention heads: Might reduce the model's ability to discern different features and relationships within the data, potentially lowering performance on complex tasks, but reduces computational requirements.
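
As a rough illustration of that trade-off, the snippet below (assuming PyTorch's nn.MultiheadAttention) varies the head count while keeping the embedding width fixed. The head count must divide the embedding dimension, so each head works with a smaller slice as heads are added.

```python
import torch
import torch.nn as nn

embed_dim = 512
for num_heads in (4, 8, 16):
    attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
    params = sum(p.numel() for p in attn.parameters())
    print(f"{num_heads} heads -> per-head dim {embed_dim // num_heads}, "
          f"{params} parameters")

# In this standard layout the parameter count does not change with head count:
# the extra cost of more heads shows up mainly as more attention maps to
# compute and store, and the benefit as more independent views of the input.
```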

Source: ChatGPT 5/27/24