Curvature Tuning: Provable Training-free Model Steering From a Single Parameter

Leyang Hu*
Brown University
Matteo Gamba*
KTH
Randall Balestriero
Brown University

*Indicates Equal Contribution
[Teaser figure]

Illustration of Curvature Tuning (CT) on classification (top) and regression (bottom) tasks. CT steers a pretrained model by replacing ReLUs with a β-parameterized activation function and tuning β from 1 to 0, effectively modulating the model’s decision boundary curvature.

* The pretrained model for classification is a 3-layer MLP with hidden width 20 trained for 2000 steps; for regression, it is a 9-layer MLP with hidden width 64 trained for 20000 steps.

Abstract

The scaling of model and data sizes has reshaped the AI landscape, establishing finetuning pretrained models as the standard paradigm for solving downstream tasks. However, dominant finetuning methods typically rely on weight adaptation, often lack interpretability, and depend on heuristically chosen hyperparameters. In this paper, we take a different perspective and shift the focus from weights to activation functions, viewing them through the lens of spline operators. We propose Curvature Tuning (CT), an interpretable and principled steering method that modulates a model's decision boundary by injecting a single hyperparameter into its activation functions. We show that CT provably adjusts model decision boundary curvature and, more fundamentally, projects a model onto a space of smooth functions—thereby complementing current finetuning methods, whose effect lies primarily in feature adaptation. Making this hyperparameter trainable gives rise to a novel and highly parameter-efficient finetuning method. Empirically, CT improves both generalization and robustness. For example, it boosts downstream accuracy of ResNet-50/152 by 7.14%/8.46% over linear probing and 4.64%/1.70% over LoRA across 12 datasets, and improves robust accuracy on the $\ell_{\infty}$ benchmark from RobustBench by 1032.64%/1494.46%. Our code is available here.

Why CT?

Modulating the nonlinearities of a model's activation functions can control the curvature of its decision boundary, whereas changing the model's weights alone cannot; this makes CT complementary to current finetuning approaches such as LoRA. The toy binary classification task below illustrates the advantage of modulating activation functions:

* 2-layer MLP with hidden width 7. (a) Baseline trained for 4000 steps, then fine-tuned for another 4000 steps using (b) LoRA (r = 1, α = 1) and (c) Trainable CT.

Methods such as LoRA implicitly tune the slopes and breakpoints of the piecewise affine decision boundary, whereas CT changes the boundary's geometry.

Implementation of CT

We begin by presenting the core activation that gives CT its expressive power—referred to as CT Unit (CTU):

\[ \varphi_{\beta,c}(\mathbf{x}) = c \cdot \sigma\left(\frac{\beta \mathbf{x}}{1 - \beta}\right) \cdot \mathbf{x} + (1 - c) \cdot \ln\left[1 + \exp\left(\frac{\mathbf{x}}{1 - \beta}\right)\right] \cdot (1 - \beta) \]

* In practice, for numerical stability we use \(\eta = \frac{\beta}{1 - \beta + \varepsilon}\) and \(\gamma = \frac{1}{1 - \beta + \varepsilon}\), where \(\varepsilon = 10^{-6}\) allows the method to remain well-defined at \(\beta = 1\).

where $\beta \in \left[0, 1\right]$ modulates the curvature, $c \in \left[0, 1\right]$ is the mixing coefficient, and $\sigma(\cdot)$ denotes the sigmoid function. This is essentially a convex combination of reparameterized SiLU and SoftPlus:

\[ \text{SiLU}(\mathbf{x}) = \sigma(\eta \mathbf{x}) \cdot \mathbf{x},\quad \eta = \frac{\beta}{1 - \beta};\qquad \text{SoftPlus}(\mathbf{x}) = \frac{1}{\gamma} \cdot \ln\left[1 + \exp\left(\gamma \mathbf{x}\right)\right],\quad \gamma = \frac{1}{1 - \beta} \]

These two activations are chosen because, based on the connection between deep networks and max-affine spline operators, each independently smooths the mapping of a ReLU-based network—transforming it from piecewise affine to fully nonlinear (details in the paper).
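For intuition, the two extreme values of β can be read off directly from the definitions above (a quick check of ours, not a statement from the paper):

\[ \lim_{\beta \to 1^{-}} \varphi_{\beta,c}(\mathbf{x}) = \max(0, \mathbf{x}) = \mathrm{ReLU}(\mathbf{x}), \qquad \varphi_{0,c}(\mathbf{x}) = \frac{c}{2}\,\mathbf{x} + (1 - c) \ln\left[1 + \exp(\mathbf{x})\right] \]

so the β → 1 limit recovers the original ReLU network, while decreasing β toward 0 progressively smooths the activation and, with it, the model's mapping.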

However, each activation alone shifts the unit’s output mean—negatively for SiLU and positively for SoftPlus. When propagated through deep networks, this can alter decision boundaries or regression outputs, requiring retraining to correct. By combining them, we cancel out these shifts while preserving curvature control, as shown below:

[Figure: activation function output biases of SiLU, SoftPlus, and their combination]

* The combined version sets $c = 0.5$.
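For concreteness, here is a minimal PyTorch sketch of a CTU layer implementing the formula above with the ε-stabilized η and γ. This is our own illustration rather than the authors' released code, and it keeps a single scalar (β, c) per module instead of the per-neuron parameters used by Trainable CT.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CTU(nn.Module):
    """Curvature Tuning Unit: c * SiLU_eta(x) + (1 - c) * SoftPlus_gamma(x).

    Illustrative sketch only. Keeps one scalar (beta, c) per module; the paper's
    Trainable CT assigns a separate (beta, c) pair to every output neuron.
    """

    def __init__(self, beta: float = 1.0, c: float = 0.5,
                 eps: float = 1e-6, trainable: bool = False):
        super().__init__()
        beta_t, c_t = torch.tensor(float(beta)), torch.tensor(float(c))
        if trainable:  # Trainable CT: optimize (beta, c) by backpropagation
            self.beta, self.c = nn.Parameter(beta_t), nn.Parameter(c_t)
        else:          # CT as steering: (beta, c) stay fixed
            self.register_buffer("beta", beta_t)
            self.register_buffer("c", c_t)
        self.eps = eps  # keeps eta and gamma finite at beta = 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        beta = self.beta.clamp(0.0, 1.0)
        c = self.c.clamp(0.0, 1.0)
        eta = beta / (1.0 - beta + self.eps)      # reparameterized SiLU slope
        gamma = 1.0 / (1.0 - beta + self.eps)     # reparameterized SoftPlus sharpness
        silu = torch.sigmoid(eta * x) * x
        softplus = F.softplus(gamma * x) / gamma  # (1/gamma) * log(1 + exp(gamma * x))
        return c * silu + (1.0 - c) * softplus
```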

In practice, we provide two implementations of CT differing in how CTU is applied:

  • CT (for model steering): Replaces all ReLUs in the network with CTUs using a fixed \( c = 0.5 \) and a shared \( \beta \in [0, 1] \). This version is highly parameter-efficient—introducing only a single hyperparameter—and does not require backpropagation, making it suitable as a lightweight steering method (a workflow sketch follows this list).
  • Trainable CT (for model finetuning): Also replaces all ReLUs with CTUs, but assigns each output neuron its own trainable pair \( (\beta, c) \), optimized via backpropagation. While it introduces additional parameters, the increase is modest compared to methods like LoRA and yields improved performance.
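The following sketch of the steering workflow assumes torchvision ≥ 0.13, the CTU module sketched above, and a hypothetical `evaluate(model, val_loader)` accuracy helper; it illustrates the procedure rather than reproducing the authors' released implementation. The grid near β = 1 mirrors the β values selected in the robustness results reported below.

```python
import copy

import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights


def replace_relu_with_ctu(module: nn.Module, beta: float, c: float = 0.5) -> None:
    """Recursively swap every nn.ReLU submodule for a CTU with the given (beta, c)."""
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, CTU(beta=beta, c=c))
        else:
            replace_relu_with_ctu(child, beta, c)


# CT as training-free steering: no backpropagation, just a sweep over the single
# beta shared by all CTUs, keeping the best value on a held-out validation set.
base = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1).eval()
best_beta, best_acc = 1.0, float("-inf")
for beta in [1.0 - 0.01 * k for k in range(11)]:  # coarse grid from 1.00 down to 0.90
    model = copy.deepcopy(base)
    replace_relu_with_ctu(model, beta=beta)
    acc = evaluate(model, val_loader)  # hypothetical helper: accuracy on validation data
    if acc > best_acc:
        best_beta, best_acc = beta, acc
```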

Results

CT improves generalization

Mean accuracy (%) over three runs of ImageNet-pretrained ResNet-18/50/152 when transferred to 12 downstream datasets.

* The number in parentheses after each method is its count of trainable parameters (excluding the linear classifier).

| Dataset | ResNet-18 Frozen (0) | ResNet-18 CT (1) | ResNet-18 LoRA (35923) | ResNet-18 Train CT (3968) | ResNet-50 Frozen (0) | ResNet-50 CT (1) | ResNet-50 LoRA (79443) | ResNet-50 Train CT (45440) | ResNet-152 Frozen (0) | ResNet-152 CT (1) | ResNet-152 LoRA (243283) | ResNet-152 Train CT (143744) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Arabic Characters | 81.91 | 87.65 | 93.37 | 93.76 | 80.65 | 83.66 | 94.21 | 95.67 | 79.86 | 79.21 | 95.96 | 96.47 |
| Arabic Digits | 97.93 | 98.77 | 99.08 | 99.03 | 98.33 | 98.37 | 99.08 | 99.16 | 98.07 | 98.15 | 99.15 | 99.10 |
| Beans | 87.76 | 90.36 | 93.23 | 94.01 | 89.58 | 91.93 | 94.79 | 95.57 | 87.50 | 87.50 | 93.75 | 96.35 |
| CUB-200 | 62.84 | 63.18 | 54.83 | 64.30 | 65.23 | 64.62 | 66.17 | 71.03 | 67.68 | 68.15 | 70.59 | 73.04 |
| DTD | 62.80 | 62.66 | 54.36 | 63.62 | 67.34 | 66.91 | 64.70 | 65.07 | 66.97 | 66.99 | 66.63 | 63.39 |
| FashionMNIST | 88.63 | 88.70 | 91.65 | 91.07 | 90.05 | 90.34 | 92.19 | 92.78 | 90.44 | 90.51 | 92.77 | 93.39 |
| FGVC-Aircraft | 36.80 | 38.68 | 29.19 | 46.44 | 38.03 | 41.16 | 41.99 | 55.70 | 38.74 | 38.51 | 48.84 | 58.16 |
| Flowers102 | 80.86 | 81.97 | 67.53 | 86.55 | 84.00 | 83.84 | 82.58 | 87.62 | 82.98 | 83.28 | 84.40 | 83.43 |
| Food101 | 61.41 | 62.27 | 64.40 | 66.04 | 68.06 | 68.02 | 71.42 | 73.60 | 71.11 | 71.13 | 74.66 | 76.08 |
| DermaMNIST | 74.83 | 75.05 | 74.21 | 77.66 | 75.94 | 75.89 | 75.73 | 78.02 | 75.68 | 76.23 | 76.91 | 77.94 |
| OCTMNIST | 65.03 | 67.27 | 74.27 | 69.53 | 67.53 | 68.00 | 75.90 | 74.13 | 69.27 | 69.10 | 76.43 | 75.17 |
| PathMNIST | 86.77 | 87.51 | 87.62 | 87.17 | 90.08 | 90.26 | 85.43 | 87.33 | 89.91 | 89.82 | 84.94 | 83.60 |
| Average | 73.96 | 75.34 | 73.64 | 78.26 | 76.24 | 76.92 | 78.68 | 81.31 | 76.52 | 76.55 | 80.42 | 81.34 |

Measured as the average per-dataset relative improvement, CT outperforms linear probing on the frozen backbone by 1.97%, 1.16%, and 0.02% for ResNet-18/50/152 respectively, and Trainable CT surpasses LoRA (rank 1) by 10.20%, 4.64%, and 1.70%.

CT improves robustness

Mean robust accuracy (%) over three runs of ImageNet-pretrained ResNet-18/50/152 under $\ell_2$/$\ell_\infty$ attacks and common corruptions on CIFAR-10/100 and ImageNet.

| Model | Dataset | $\ell_2$ Base | $\ell_2$ CT | $\ell_2$ $\beta$ | $\ell_\infty$ Base | $\ell_\infty$ CT | $\ell_\infty$ $\beta$ | Corruption Base | Corruption CT | Corruption $\beta$ |
|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-18 | CIFAR-10 | 53.67 | 53.67 | 1.00 | 11.17 | 14.93 | 0.90 | 77.73 | 77.73 | 1.00 |
| ResNet-18 | CIFAR-100 | 24.30 | 25.50 | 0.92 | 4.47 | 6.90 | 0.92 | 51.81 | 51.95 | 0.94 |
| ResNet-18 | ImageNet | 23.37 | 23.37 | 1.00 | 0.00 | 7.00 | 0.89 | 33.11 | 33.32 | 0.92 |
| ResNet-18 | Average | 33.78 | 34.18 | 0.97 | 5.21 | 9.61 | 0.90 | 54.22 | 54.33 | 0.95 |
| ResNet-50 | CIFAR-10 | 55.10 | 56.53 | 0.97 | 10.10 | 12.08 | 0.90 | 77.26 | 77.26 | 1.00 |
| ResNet-50 | CIFAR-100 | 23.83 | 25.80 | 0.96 | 4.43 | 7.90 | 0.93 | 53.91 | 53.93 | 0.98 |
| ResNet-50 | ImageNet | 31.90 | 31.90 | 1.00 | 0.30 | 9.30 | 0.93 | 39.64 | 39.64 | 1.00 |
| ResNet-50 | Average | 36.94 | 38.08 | 0.98 | 4.94 | 10.68 | 0.94 | 56.94 | 56.94 | 0.99 |
| ResNet-152 | CIFAR-10 | 56.27 | 56.27 | 1.00 | 11.47 | 15.00 | 0.99 | 78.82 | 78.83 | 0.99 |
| ResNet-152 | CIFAR-100 | 27.90 | 28.23 | 0.98 | 5.40 | 7.70 | 0.99 | 56.12 | 56.12 | 1.00 |
| ResNet-152 | ImageNet | 42.50 | 42.50 | 1.00 | 0.30 | 13.53 | 0.97 | 45.47 | 45.47 | 0.99 |
| ResNet-152 | Average | 42.22 | 42.33 | 0.99 | 5.72 | 12.08 | 0.98 | 60.14 | 60.14 | 0.99 |

CT yields substantial improvements under $\ell_\infty$ attacks (average per-dataset relative gains of 44.01%/1032.64%/1494.46% for ResNet-18/50/152 respectively), moderate gains under $\ell_2$ attacks (1.65%/3.62%/0.39%), and marginal improvements under corruptions (0.30%/0.01%/0.00%), with the selected $\beta$ values generally close to 1.

CT also works on Transformers

Mean accuracy (%) over three runs of Imagenette-pretrained Swin-T/S when transferred to 12 downstream datasets.

* The number in parentheses after each method is its count of trainable parameters (excluding the linear classifier).

| Dataset | Swin-T Frozen (0) | Swin-T CT (1) | Swin-T LoRA (74832) | Swin-T Train CT (532) | Swin-S Frozen (0) | Swin-S CT (1) | Swin-S LoRA (148560) | Swin-S Train CT (868) |
|---|---|---|---|---|---|---|---|---|
| Arabic Characters | 30.67 | 31.08 | 56.32 | 41.95 | 31.81 | 31.16 | 62.16 | 40.88 |
| Arabic Digits | 83.71 | 85.24 | 97.54 | 90.82 | 80.74 | 81.11 | 97.91 | 91.44 |
| Beans | 60.68 | 61.46 | 75.52 | 68.49 | 55.99 | 54.43 | 73.96 | 67.71 |
| CUB-200 | 4.82 | 4.87 | 7.42 | 6.09 | 4.46 | 4.02 | 9.19 | 6.71 |
| DTD | 15.92 | 15.90 | 16.99 | 17.04 | 16.03 | 15.78 | 18.67 | 17.66 |
| FashionMNIST | 73.81 | 74.01 | 83.90 | 77.07 | 73.28 | 73.29 | 86.15 | 75.76 |
| FGVC-Aircraft | 4.57 | 4.47 | 5.59 | 6.14 | 4.61 | 4.74 | 6.55 | 6.16 |
| Flowers102 | 14.09 | 14.01 | 16.66 | 16.53 | 12.93 | 13.12 | 17.28 | 17.99 |
| Food101 | 14.85 | 14.79 | 18.17 | 15.20 | 14.22 | 14.28 | 19.41 | 14.50 |
| DermaMNIST | 70.24 | 70.99 | 74.08 | 71.37 | 69.23 | 70.34 | 73.93 | 70.59 |
| OCTMNIST | 49.60 | 51.37 | 63.53 | 53.23 | 48.07 | 47.93 | 63.90 | 51.23 |
| PathMNIST | 76.73 | 77.78 | 81.31 | 77.35 | 74.82 | 76.54 | 76.62 | 78.59 |
| Average | 41.64 | 42.16 | 49.75 | 45.11 | 40.52 | 40.56 | 50.48 | 44.94 |

In terms of average per-dataset relative improvement, CT outperforms linear probing on Swin-T by 0.68% but underperforms it on Swin-S by 0.58%. Meanwhile, Trainable CT improves substantially over linear probing on both models—by 13.32% and 17.93%, respectively—though it still trails LoRA by 8.29% and 11.89%.

What CT is doing behind the scenes

Theoretically, CT projects a ReLU-based model onto a space of smooth functions.

Theorem:

For a ReLU network \( f: \mathbb{R}^d \to \mathbb{R} \) with parameter \( \mathbf{W} \) (collecting all weights and biases), for fixed \( c \in [0, 1] \) and \( \beta \in [0,1) \), replacing every instance of ReLU with a CTU with hyperparameters \( \beta, c \) is equivalent to projecting \( f \) to a smooth function \( f_{\beta,c} \) with bounded gradients and curvature, while keeping \( \mathbf{W} \) fixed. Importantly, for \( 0 < \beta < 1 \), \( f_{\beta,c} \) enjoys higher local expressivity than \( f \) for the same parameter \( \mathbf{W} \), due to non-vanishing local curvature.
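To make the last claim concrete, a quick calculation of ours (not part of the paper's proof) gives the CTU's second derivative at the origin:

\[ \varphi_{\beta,c}''(0) = c\,\frac{\eta}{2} + (1 - c)\,\frac{\gamma}{4} = \frac{2c\beta + (1 - c)}{4(1 - \beta)}, \]

which for the default \( c = 0.5 \) equals \( \frac{2\beta + 1}{8(1 - \beta)} \): strictly positive for every \( \beta \in [0, 1) \), and diverging as \( \beta \to 1 \), consistent with ReLU's curvature concentrating entirely at its kink.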

BibTeX

@misc{hu2025curvaturetuningprovabletrainingfree,
      title={Curvature Tuning: Provable Training-free Model Steering From a Single Parameter}, 
      author={Leyang Hu and Matteo Gamba and Randall Balestriero},
      year={2025},
      eprint={2502.07783},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.07783}, 
}