Kernel Deformed Exponential Families for Sparse Continuous Attention

Alexander Moreno, Supriya Nagesh, Zhenke Wu, Walter Dempsey, James Rehg (2022+). Submitted


Attention mechanisms take an expectation of a data representation with respect to probability weights. Recently, Martins et al. (2020, 2021) proposed continuous attention mechanisms, focusing on unimodal attention densities from the exponential and deformed exponential families: the latter has sparse support. Farinhas et al. (2021) extended this to multimodality via Gaussian mixture attention densities. In this paper, we extend this to two general flexible classes: kernel exponential families (Canu 2006) and our new sparse counterpart, kernel deformed exponential families. Theoretically, we prove new existence results for both kernel exponential and deformed exponential families, and show that the deformed case has approximation capabilities similar to those of kernel exponential families. Lacking closed-form expressions for the context vector, we use numerical integration: we show exponential convergence for both kernel exponential families and a smooth approximation to kernel deformed exponential families. Experiments show that kernel deformed exponential families can attend to multiple compact regions of the data domain.
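The core computation described above, a context vector formed as the expectation of a value function under an attention density, can be sketched numerically. This is a hypothetical illustration, not the paper's implementation: it uses a simple unimodal Gaussian density and a toy two-dimensional value function, and approximates the integral with a Riemann sum rather than the quadrature schemes analyzed in the paper.

```python
import numpy as np

def context_vector(density, value_fn, grid):
    """Approximate c = integral of density(t) * value_fn(t) dt over `grid`
    with a simple Riemann sum (a stand-in for proper numerical quadrature)."""
    dt = grid[1] - grid[0]
    p = density(grid)                      # attention density p(t), shape (T,)
    V = value_fn(grid)                     # value function V(t), shape (T, d)
    return (p[:, None] * V).sum(axis=0) * dt   # context vector, shape (d,)

# Toy example: unimodal Gaussian attention density concentrated inside [0, 1].
mu, sigma = 0.5, 0.1
density = lambda t: np.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
value_fn = lambda t: np.stack([t, t ** 2], axis=-1)   # toy 2-d value function

grid = np.linspace(0.0, 1.0, 2001)
c = context_vector(density, value_fn, grid)
# c approximates [E[t], E[t^2]] = [mu, mu^2 + sigma^2] = [0.5, 0.26]
```

Swapping the Gaussian for a (kernel) deformed exponential family density changes only `density`; the sparse support means `p` is exactly zero outside compact regions, so the same integral attends only to those regions.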


kernel methods, attention mechanism, theory, exponential families, deformed exponential families