2025-09-11
Lipschitzness of the Softmax Function Definition
Let e i ∈ R d e_i \in \mathbb{R}^d e i ∈ R d be the i i i -th standard basis vector.
Let Δ d \Delta_d Δ d be the d d d -dimensional probability simplex, i.e.,
Δ d = { s ∈ R d | ∑ i = 1 d s i = 1 , and s i ≥ 0 for all i } . \Delta_{d} = \left\{ s \in\mathbb{R}^d \middle| \sum_{i=1}^d s_{i} = 1,\: \text{and } s_{i}\ge 0 \text{ for all } i \right\}. Δ d = { s ∈ R d i = 1 ∑ d s i = 1 , and s i ≥ 0 for all i } .
Define the softmax with inverse temperature λ > 0 \lambda>0 λ > 0 , σ λ : R d → Δ d \sigma_{\lambda}:\mathbb{R}^d\to\Delta_d σ λ : R d → Δ d , by
σ λ ( x ) = 1 ∑ i = 1 d exp ( λ x i ) [ exp ( λ x 1 ) … exp ( λ x d ) ] . \sigma_{\lambda}(\mathbf x)
=
\frac{1}{\sum\limits_{i=1}^d\exp(\lambda x_i)}
\begin{bmatrix}
\exp(\lambda x_1) \\
\dots \\
\exp(\lambda x_d)
\end{bmatrix}. σ λ ( x ) = i = 1 ∑ d exp ( λ x i ) 1 exp ( λ x 1 ) … exp ( λ x d ) .
Proposition
For any 1 ≤ p , q ≤ ∞ 1\le p,q\le\infty 1 ≤ p , q ≤ ∞ and σ λ \sigma_{\lambda} σ λ , the following holds:
∥ σ λ ( x ) − σ λ ( y ) ∥ p ≤ L p , q ∥ x − y ∥ q , \|\sigma_{\lambda}(x)-\sigma_{\lambda}(y)\|_{p} \le L_{p,q}\|x-y\|_{q}, ∥ σ λ ( x ) − σ λ ( y ) ∥ p ≤ L p , q ∥ x − y ∥ q ,
where
L p , q = λ 2 − 1 + 1 p − 1 q . L_{p,q} = \lambda 2^{-1 + \frac{1}{p} - \frac{1}{q}}. L p , q = λ 2 − 1 + p 1 − q 1 .
Proof
Let s = σ λ ( z ) s=\sigma_{\lambda}(z) s = σ λ ( z ) . Then the Jacobian matrix can be written as
J ( z ) = ∇ σ λ ( z ) = λ ( d i a g ( s ) − s s T ) . J(z) = \nabla \sigma_{\lambda}(z) = \lambda(\mathrm{diag}(s)- ss^T). J ( z ) = ∇ σ λ ( z ) = λ ( diag ( s ) − s s T ) .
By the mean value inequality,
∥ σ λ ( x ) − σ λ ( y ) ∥ p ≤ ( sup z sup ∥ u ∥ q = 1 ∥ J ( z ) u ∥ p ) ∥ x − y ∥ q . \|\sigma_{\lambda}(x)-\sigma_{\lambda}(y)\|_{p} \le \left( \sup_{z} \sup_{\|u\|_{q}= 1} \|J(z) u\|_{p} \right) \|x-y\|_{q}. ∥ σ λ ( x ) − σ λ ( y ) ∥ p ≤ ( z sup ∥ u ∥ q = 1 sup ∥ J ( z ) u ∥ p ) ∥ x − y ∥ q .
Hence it suffices to bound sup z ∥ J ( z ) ∥ q → p \sup_{z}\|J(z)\|_{q\to p} sup z ∥ J ( z ) ∥ q → p .
First, for the standard basis { e i } i = 1 d \{e_i\}_{i=1}^d { e i } i = 1 d , the identity
d i a g ( s ) − s s T = ∑ i < j s i s j ( e i − e j ) ( e i − e j ) T \mathrm{diag}(s)-ss^T = \sum_{i<j}s_{i}s_{j}(e_{i}-e_{j})(e_{i}-e_{j})^T diag ( s ) − s s T = ∑ i < j s i s j ( e i − e j ) ( e i − e j ) T
holds, so for any u ∈ R d u\in\mathbb{R}^d u ∈ R d ,
J ( z ) u = λ ∑ i < j s i s j ( u i − u j ) ( e i − e j ) . J(z)u = \lambda \sum_{i<j} s_{i}s_{j}(u_{i}-u_{j})(e_{i}-e_{j}). J ( z ) u = λ i < j ∑ s i s j ( u i − u j ) ( e i − e j ) .
By the triangle inequality and ∥ e i − e j ∥ p = 2 1 / p \|e_i-e_j\|_{p}=2^{1/p} ∥ e i − e j ∥ p = 2 1/ p (for i ≠ j i\neq j i = j ),
∥ J ( z ) u ∥ p ≤ λ ∑ i < j s i s j ∣ u i − u j ∣ ∥ e i − e j ∥ p = λ 2 1 / p ∑ i < j s i s j ∣ u i − u j ∣ \begin{align*}
\|J(z)u\|_{p} &\le \lambda \sum_{i<j} s_{i}s_{j} |u_{i}-u_{j}| \|e_{i}-e_{j}\|_{p} \\
&= \lambda 2^{1/p} \sum_{i<j} s_{i}s_{j} |u_{i}-u_{j}|
\end{align*} ∥ J ( z ) u ∥ p ≤ λ i < j ∑ s i s j ∣ u i − u j ∣∥ e i − e j ∥ p = λ 2 1/ p i < j ∑ s i s j ∣ u i − u j ∣
Here s = σ λ ( z ) s=\sigma_\lambda(z) s = σ λ ( z ) ranges over the interior of the simplex. Since the function
s ↦ ∑ i < j s i s j ∣ u i − u j ∣ s\mapsto\sum_{i<j}s_is_j|u_i-u_j| s ↦ ∑ i < j s i s j ∣ u i − u j ∣
is continuous, its supremum over the interior coincides with its supremum over the closed simplex Δ d \Delta_d Δ d , which includes the boundary.
Therefore,
sup z sup ∥ u ∥ q = 1 ∥ J ( z ) u ∥ p = sup ∥ u ∥ q = 1 sup z ∥ J ( z ) u ∥ p = λ 2 1 / p sup ∥ u ∥ q = 1 sup s ∈ Δ d ∑ i < j s i s j ∣ u i − u j ∣ . \begin{align*}
\sup_{z} \sup_{\|u\|_{q}= 1} \|J(z) u\|_{p}
&= \sup_{\|u\|_{q}=1} \sup_{z} \|J(z)u\|_{p} \\
&= \lambda 2^{1/p} \sup_{\|u\|_{q}=1} \sup_{s \in\Delta_{d}} \sum_{i<j} s_{i}s_{j}|u_{i}-u_{j}|.
\end{align*} z sup ∥ u ∥ q = 1 sup ∥ J ( z ) u ∥ p = ∥ u ∥ q = 1 sup z sup ∥ J ( z ) u ∥ p = λ 2 1/ p ∥ u ∥ q = 1 sup s ∈ Δ d sup i < j ∑ s i s j ∣ u i − u j ∣.
Let i m i n = arg min i u i i_{min}=\argmin_{i}u_{i} i min = arg min i u i and i m a x = arg max i u i i_{max}=\argmax_{i}u_{i} i ma x = arg max i u i . The right-hand side is maximized when s s s concentrates on i m i n i_{min} i min and i m a x i_{max} i ma x , hence
sup s ∈ Δ d ∑ i < j s i s j ∣ u i − u j ∣ = ∣ u i m a x − u i m i n ∣ max S ∈ [ 0 , 1 ] S ( 1 − S ) = ∣ u i m a x − u i m i n ∣ / 4. \sup_{s \in \Delta_{d}} \sum_{i<j} s_{i}s_{j}|u_{i}-u_{j}| = |u_{i_{max}}-u_{i_{min}} | \max_{S \in[0,1]} S(1-S) = |u_{i_{max}}-u_{i_{min}} | /4. s ∈ Δ d sup i < j ∑ s i s j ∣ u i − u j ∣ = ∣ u i ma x − u i min ∣ S ∈ [ 0 , 1 ] max S ( 1 − S ) = ∣ u i ma x − u i min ∣/4.
The equality condition is s i m i n = s i m a x = 1 2 s_{i_{min}}=s_{i_{max}}=\frac{1}{2} s i min = s i ma x = 2 1 .
Next, since for any real numbers a , b a,b a , b we have ∣ a − b ∣ q ≤ 2 q − 1 ( ∣ a ∣ q + ∣ b ∣ q ) |a-b|^q \le 2^{q-1} (|a|^q + |b|^q) ∣ a − b ∣ q ≤ 2 q − 1 ( ∣ a ∣ q + ∣ b ∣ q ) , applying this to a = u i m a x , b = u j m i n a=u_{i_{max}},b=u_{j_{min}} a = u i ma x , b = u j min yields:
∣ u i m a x − u i m i n ∣ ≤ 2 1 − 1 / q ( ∣ u i m a x ∣ q + ∣ u i m i n ∣ q ) 1 / q ≤ 2 1 − 1 / q . |u_{i_{max}}-u_{i_{min}}| \le 2^{1-1/q} (|u_{i_{max}}|^q + |u_{i_{min}}|^q)^{1/q} \le 2^{1-1/q}. ∣ u i ma x − u i min ∣ ≤ 2 1 − 1/ q ( ∣ u i ma x ∣ q + ∣ u i min ∣ q ) 1/ q ≤ 2 1 − 1/ q .
The equality condition is u i m a x = 2 − 1 / q , u i m i n = − 2 − 1 / q u_{i_{max}}=2^{-1/q}, u_{i_{min}}=-2^{-1/q} u i ma x = 2 − 1/ q , u i min = − 2 − 1/ q .
Putting everything together, we obtain
sup z sup ∥ u ∥ q = 1 ∥ J ( z ) u ∥ p ≤ λ 2 1 / p ⋅ 1 4 ⋅ 2 1 − 1 / q = λ 2 − 1 + 1 / p − 1 / q = : L p , q . \sup_{z} \sup_{\|u\|_{q}= 1} \|J(z) u\|_{p} \le \lambda 2^{1/p} \cdot \frac{1}{4} \cdot 2^{1-1/q} = \lambda 2^{-1+1/p-1/q} =: L_{p,q}. z sup ∥ u ∥ q = 1 sup ∥ J ( z ) u ∥ p ≤ λ 2 1/ p ⋅ 4 1 ⋅ 2 1 − 1/ q = λ 2 − 1 + 1/ p − 1/ q =: L p , q .
□ \square □
Examples
For ( 2 , 2 ) (2,2) ( 2 , 2 ) , L 2 , 2 = λ / 2 L_{2,2} = \lambda/2 L 2 , 2 = λ /2
For ( 1 , 1 ) (1,1) ( 1 , 1 ) , L 1 , 1 = λ / 2 L_{1,1} = \lambda/2 L 1 , 1 = λ /2
For ( 1 , ∞ ) (1,\infty) ( 1 , ∞ ) , L 1 , ∞ = λ L_{1,\infty} = \lambda L 1 , ∞ = λ
References
softmax 関数のリプシッツ連続性