Curvature Tuning: Provable Training-free Model Steering From a Single Parameter

Leyang Hu*
Brown University
Matteo Gamba*
KTH
Randall Balestriero
Brown University

*Indicates Equal Contribution
[Teaser figure]

Illustration of Curvature Tuning (CT) on classification (top) and regression (bottom) tasks. CT steers a pretrained model by replacing ReLUs with a β-parameterized activation function and tuning β from 1 to 0, effectively modulating the model’s decision boundary curvature.

* The pretrained model for classification is a 3-layer MLP with hidden width 20 trained for 2000 steps; for regression, it is a 9-layer MLP with hidden width 64 trained for 20000 steps.

Abstract

The scaling of model and data sizes has reshaped the AI landscape, establishing finetuning pretrained models as the standard paradigm for solving downstream tasks. However, dominant finetuning methods typically rely on weight adaptation, often lack interpretability, and depend on heuristically chosen hyperparameters. In this paper, we take a different perspective and shift the focus from weights to activation functions, viewing them through the lens of spline operators. We propose Curvature Tuning (CT), an interpretable and principled steering method that modulates a model's decision boundary by injecting a single hyperparameter into its activation functions. We show that CT provably adjusts model decision boundary curvature and, more fundamentally, projects a model onto a space of smooth functions—thereby complementing current finetuning methods, whose effect lies primarily in feature adaptation. Making this hyperparameter trainable gives rise to a novel and highly parameter-efficient finetuning method. Empirically, CT improves both generalization and robustness. For example, it boosts downstream accuracy of ResNet-50/152 by 7.14%/8.46% over linear probing and 4.64%/1.70% over LoRA across 12 datasets, and improves robust accuracy on the $\ell_{\infty}$ benchmark from RobustBench by 1032.64%/1494.46%. Our code is available here.

Why CT?

Modulating the nonlinearities of a model's activation functions can control the curvature of its decision boundary, whereas changing the model's weights alone cannot; this makes CT complementary to current finetuning approaches such as LoRA. The toy binary classification task below illustrates the advantage of modulating activation functions:

* 2-layer MLP with hidden width 7. (a) Baseline trained for 4000 steps, then fine-tuned for another 4000 steps using (b) LoRA (r = 1, α = 1) and (c) Trainable CT.

Methods such as LoRA implicitly tune the slopes and breakpoints of the piecewise affine decision boundary, whereas CT changes the boundary's geometry.

Implementation of CT

We begin by presenting the core activation that gives CT its expressive power—referred to as CT Unit (CTU):

\[ \varphi_{\beta,c}(\mathbf{x}) = c \cdot \sigma\left(\frac{\beta \mathbf{x}}{1 - \beta}\right) \cdot \mathbf{x} + (1 - c) \cdot \ln\left[1 + \exp\left(\frac{\mathbf{x}}{1 - \beta}\right)\right] \cdot (1 - \beta) \]

* In practice, for numerical stability we use \(\eta = \frac{\beta}{1 - \beta + \varepsilon}\) and \(\gamma = \frac{1}{1 - \beta + \varepsilon}\), where \(\varepsilon = 10^{-6}\) allows the method to remain well-defined at \(\beta = 1\).

where $\beta \in \left[0, 1\right]$ modulates the curvature, $c \in \left[0, 1\right]$ is the mixing coefficient, and $\sigma(\cdot)$ denotes the sigmoid function. This is essentially a convex combination of reparameterized SiLU and SoftPlus:

\[ \text{SiLU}(\mathbf{x}) = \sigma(\eta \mathbf{x}) \cdot \mathbf{x},\quad \eta = \frac{\beta}{1 - \beta};\qquad \text{SoftPlus}(\mathbf{x}) = \frac{1}{\gamma} \cdot \ln\left[1 + \exp\left(\gamma \mathbf{x}\right)\right],\quad \gamma = \frac{1}{1 - \beta} \]

These two activations are chosen because, based on the connection between deep networks and max-affine spline operators, each independently smooths the mapping of a ReLU-based network—transforming it from piecewise affine to fully nonlinear (details in the paper).
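For intuition, the two extreme values of β can be read off directly from the definitions above (a quick check of ours, not a statement from the paper):

\[ \lim_{\beta \to 1^{-}} \varphi_{\beta,c}(\mathbf{x}) = \max(0, \mathbf{x}) = \mathrm{ReLU}(\mathbf{x}), \qquad \varphi_{0,c}(\mathbf{x}) = \frac{c}{2}\,\mathbf{x} + (1 - c) \ln\left[1 + \exp(\mathbf{x})\right] \]

so the β → 1 limit recovers the original ReLU network, while decreasing β toward 0 progressively smooths the activation and, with it, the model's mapping.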

However, each activation alone shifts the unit’s output mean—negatively for SiLU and positively for SoftPlus. When propagated through deep networks, this can alter decision boundaries or regression outputs, requiring retraining to correct. By combining them, we cancel out these shifts while preserving curvature control, as shown below:

[Figure: activation function output biases of SiLU, SoftPlus, and their combination]

* The combined version sets $c = 0.5$.
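For concreteness, here is a minimal PyTorch sketch of a CTU layer implementing the formula above with the ε-stabilized η and γ. This is our own illustration rather than the authors' released code, and it keeps a single scalar (β, c) per module instead of the per-neuron parameters used by Trainable CT.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CTU(nn.Module):
    """Curvature Tuning Unit: c * SiLU_eta(x) + (1 - c) * SoftPlus_gamma(x).

    Illustrative sketch only. Keeps one scalar (beta, c) per module; the paper's
    Trainable CT assigns a separate (beta, c) pair to every output neuron.
    """

    def __init__(self, beta: float = 1.0, c: float = 0.5,
                 eps: float = 1e-6, trainable: bool = False):
        super().__init__()
        beta_t, c_t = torch.tensor(float(beta)), torch.tensor(float(c))
        if trainable:  # Trainable CT: optimize (beta, c) by backpropagation
            self.beta, self.c = nn.Parameter(beta_t), nn.Parameter(c_t)
        else:          # CT as steering: (beta, c) stay fixed
            self.register_buffer("beta", beta_t)
            self.register_buffer("c", c_t)
        self.eps = eps  # keeps eta and gamma finite at beta = 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        beta = self.beta.clamp(0.0, 1.0)
        c = self.c.clamp(0.0, 1.0)
        eta = beta / (1.0 - beta + self.eps)      # reparameterized SiLU slope
        gamma = 1.0 / (1.0 - beta + self.eps)     # reparameterized SoftPlus sharpness
        silu = torch.sigmoid(eta * x) * x
        softplus = F.softplus(gamma * x) / gamma  # (1/gamma) * log(1 + exp(gamma * x))
        return c * silu + (1.0 - c) * softplus
```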

In practice, we provide two implementations of CT differing in how CTU is applied:

  • CT (for model steering): Replaces all ReLUs in the network with CTUs using a fixed \( c = 0.5 \) and a shared \( \beta \in [0, 1] \). This version is highly parameter-efficient—introducing only a single hyperparameter—and does not require backpropagation, making it suitable as a lightweight steering method (a workflow sketch follows this list).
  • Trainable CT (for model finetuning): Also replaces all ReLUs with CTUs, but assigns each output neuron its own trainable pair \( (\beta, c) \), optimized via backpropagation. While it introduces additional parameters, the increase is modest compared to methods like LoRA and yields improved performance.
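The following sketch of the steering workflow assumes torchvision ≥ 0.13, the CTU module sketched above, and a hypothetical `evaluate(model, val_loader)` accuracy helper; it illustrates the procedure rather than reproducing the authors' released implementation. The grid near β = 1 mirrors the β values selected in the robustness results reported below.

```python
import copy

import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights


def replace_relu_with_ctu(module: nn.Module, beta: float, c: float = 0.5) -> None:
    """Recursively swap every nn.ReLU submodule for a CTU with the given (beta, c)."""
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, CTU(beta=beta, c=c))
        else:
            replace_relu_with_ctu(child, beta, c)


# CT as training-free steering: no backpropagation, just a sweep over the single
# beta shared by all CTUs, keeping the best value on a held-out validation set.
base = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1).eval()
best_beta, best_acc = 1.0, float("-inf")
for beta in [1.0 - 0.01 * k for k in range(11)]:  # coarse grid from 1.00 down to 0.90
    model = copy.deepcopy(base)
    replace_relu_with_ctu(model, beta=beta)
    acc = evaluate(model, val_loader)  # hypothetical helper: accuracy on validation data
    if acc > best_acc:
        best_beta, best_acc = beta, acc
```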

Results

CT improves generalization

Mean accuracy (%) over three runs of ImageNet-pretrained ResNet-18/50/152 when transferred to 12 downstream datasets.

* The number in parentheses after each method is its count of trainable parameters (excluding the linear classifier).

| Dataset | ResNet-18 Frozen (0) | ResNet-18 CT (1) | ResNet-18 LoRA (35923) | ResNet-18 Train CT (3968) | ResNet-50 Frozen (0) | ResNet-50 CT (1) | ResNet-50 LoRA (79443) | ResNet-50 Train CT (45440) | ResNet-152 Frozen (0) | ResNet-152 CT (1) | ResNet-152 LoRA (243283) | ResNet-152 Train CT (143744) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Arabic Characters | 81.91 | 87.65 | 93.37 | 93.76 | 80.65 | 83.66 | 94.21 | 95.67 | 79.86 | 79.21 | 95.96 | 96.47 |
| Arabic Digits | 97.93 | 98.77 | 99.08 | 99.03 | 98.33 | 98.37 | 99.08 | 99.16 | 98.07 | 98.15 | 99.15 | 99.10 |
| Beans | 87.76 | 90.36 | 93.23 | 94.01 | 89.58 | 91.93 | 94.79 | 95.57 | 87.50 | 87.50 | 93.75 | 96.35 |
| CUB-200 | 62.84 | 63.18 | 54.83 | 64.30 | 65.23 | 64.62 | 66.17 | 71.03 | 67.68 | 68.15 | 70.59 | 73.04 |
| DTD | 62.80 | 62.66 | 54.36 | 63.62 | 67.34 | 66.91 | 64.70 | 65.07 | 66.97 | 66.99 | 66.63 | 63.39 |
| FashionMNIST | 88.63 | 88.70 | 91.65 | 91.07 | 90.05 | 90.34 | 92.19 | 92.78 | 90.44 | 90.51 | 92.77 | 93.39 |
| FGVC-Aircraft | 36.80 | 38.68 | 29.19 | 46.44 | 38.03 | 41.16 | 41.99 | 55.70 | 38.74 | 38.51 | 48.84 | 58.16 |
| Flowers102 | 80.86 | 81.97 | 67.53 | 86.55 | 84.00 | 83.84 | 82.58 | 87.62 | 82.98 | 83.28 | 84.40 | 83.43 |
| Food101 | 61.41 | 62.27 | 64.40 | 66.04 | 68.06 | 68.02 | 71.42 | 73.60 | 71.11 | 71.13 | 74.66 | 76.08 |
| DermaMNIST | 74.83 | 75.05 | 74.21 | 77.66 | 75.94 | 75.89 | 75.73 | 78.02 | 75.68 | 76.23 | 76.91 | 77.94 |
| OCTMNIST | 65.03 | 67.27 | 74.27 | 69.53 | 67.53 | 68.00 | 75.90 | 74.13 | 69.27 | 69.10 | 76.43 | 75.17 |
| PathMNIST | 86.77 | 87.51 | 87.62 | 87.17 | 90.08 | 90.26 | 85.43 | 87.33 | 89.91 | 89.82 | 84.94 | 83.60 |
| Average | 73.96 | 75.34 | 73.64 | 78.26 | 76.24 | 76.92 | 78.68 | 81.31 | 76.52 | 76.55 | 80.42 | 81.34 |

Measured as the average per-dataset relative improvement, CT outperforms linear probing on the frozen backbone by 1.97%, 1.16%, and 0.02% for ResNet-18/50/152 respectively, and Trainable CT surpasses LoRA (rank 1) by 10.20%, 4.64%, and 1.70%.

CT improves robustness

Mean robust accuracy (%) over three runs of ImageNet-pretrained ResNet-18/50/152 under $\ell_2$/$\ell_\infty$ attacks and common corruptions on CIFAR-10/100 and ImageNet.

| Model | Dataset | $\ell_2$ Base | $\ell_2$ CT | $\ell_2$ $\beta$ | $\ell_\infty$ Base | $\ell_\infty$ CT | $\ell_\infty$ $\beta$ | Corruption Base | Corruption CT | Corruption $\beta$ |
|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-18 | CIFAR-10 | 53.67 | 53.67 | 1.00 | 11.17 | 14.93 | 0.90 | 77.73 | 77.73 | 1.00 |
| ResNet-18 | CIFAR-100 | 24.30 | 25.50 | 0.92 | 4.47 | 6.90 | 0.92 | 51.81 | 51.95 | 0.94 |
| ResNet-18 | ImageNet | 23.37 | 23.37 | 1.00 | 0.00 | 7.00 | 0.89 | 33.11 | 33.32 | 0.92 |
| ResNet-18 | Average | 33.78 | 34.18 | 0.97 | 5.21 | 9.61 | 0.90 | 54.22 | 54.33 | 0.95 |
| ResNet-50 | CIFAR-10 | 55.10 | 56.53 | 0.97 | 10.10 | 12.08 | 0.90 | 77.26 | 77.26 | 1.00 |
| ResNet-50 | CIFAR-100 | 23.83 | 25.80 | 0.96 | 4.43 | 7.90 | 0.93 | 53.91 | 53.93 | 0.98 |
| ResNet-50 | ImageNet | 31.90 | 31.90 | 1.00 | 0.30 | 9.30 | 0.93 | 39.64 | 39.64 | 1.00 |
| ResNet-50 | Average | 36.94 | 38.08 | 0.98 | 4.94 | 10.68 | 0.94 | 56.94 | 56.94 | 0.99 |
| ResNet-152 | CIFAR-10 | 56.27 | 56.27 | 1.00 | 11.47 | 15.00 | 0.99 | 78.82 | 78.83 | 0.99 |
| ResNet-152 | CIFAR-100 | 27.90 | 28.23 | 0.98 | 5.40 | 7.70 | 0.99 | 56.12 | 56.12 | 1.00 |
| ResNet-152 | ImageNet | 42.50 | 42.50 | 1.00 | 0.30 | 13.53 | 0.97 | 45.47 | 45.47 | 0.99 |
| ResNet-152 | Average | 42.22 | 42.33 | 0.99 | 5.72 | 12.08 | 0.98 | 60.14 | 60.14 | 0.99 |

CT yields substantial improvements under $\ell_\infty$ attacks (average per-dataset relative gains of 44.01%/1032.64%/1494.46% for ResNet-18/50/152 respectively), moderate gains under $\ell_2$ attacks (1.65%/3.62%/0.39%), and marginal improvements under corruptions (0.30%/0.01%/0.00%), with the selected $\beta$ values generally close to 1.

CT also works on Transformers

Mean accuracy (%) over three runs of Imagenette-pretrained Swin-T/S when transferred to 12 downstream datasets.

* The number in parentheses after each method is its count of trainable parameters (excluding the linear classifier).

| Dataset | Swin-T Frozen (0) | Swin-T CT (1) | Swin-T LoRA (74832) | Swin-T Train CT (532) | Swin-S Frozen (0) | Swin-S CT (1) | Swin-S LoRA (148560) | Swin-S Train CT (868) |
|---|---|---|---|---|---|---|---|---|
| Arabic Characters | 30.67 | 31.08 | 56.32 | 41.95 | 31.81 | 31.16 | 62.16 | 40.88 |
| Arabic Digits | 83.71 | 85.24 | 97.54 | 90.82 | 80.74 | 81.11 | 97.91 | 91.44 |
| Beans | 60.68 | 61.46 | 75.52 | 68.49 | 55.99 | 54.43 | 73.96 | 67.71 |
| CUB-200 | 4.82 | 4.87 | 7.42 | 6.09 | 4.46 | 4.02 | 9.19 | 6.71 |
| DTD | 15.92 | 15.90 | 16.99 | 17.04 | 16.03 | 15.78 | 18.67 | 17.66 |
| FashionMNIST | 73.81 | 74.01 | 83.90 | 77.07 | 73.28 | 73.29 | 86.15 | 75.76 |
| FGVC-Aircraft | 4.57 | 4.47 | 5.59 | 6.14 | 4.61 | 4.74 | 6.55 | 6.16 |
| Flowers102 | 14.09 | 14.01 | 16.66 | 16.53 | 12.93 | 13.12 | 17.28 | 17.99 |
| Food101 | 14.85 | 14.79 | 18.17 | 15.20 | 14.22 | 14.28 | 19.41 | 14.50 |
| DermaMNIST | 70.24 | 70.99 | 74.08 | 71.37 | 69.23 | 70.34 | 73.93 | 70.59 |
| OCTMNIST | 49.60 | 51.37 | 63.53 | 53.23 | 48.07 | 47.93 | 63.90 | 51.23 |
| PathMNIST | 76.73 | 77.78 | 81.31 | 77.35 | 74.82 | 76.54 | 76.62 | 78.59 |
| Average | 41.64 | 42.16 | 49.75 | 45.11 | 40.52 | 40.56 | 50.48 | 44.94 |

In terms of average per-dataset relative improvement, CT outperforms linear probing on Swin-T by 0.68% but underperforms it on Swin-S by 0.58%. Meanwhile, Trainable CT improves substantially over linear probing on both models—by 13.32% and 17.93%, respectively—though it still trails LoRA by 8.29% and 11.89%.

What CT is doing behind the scenes

Theoretically, CT projects a ReLU-based model onto a space of smooth functions.

Theorem:

For a ReLU network \( f: \mathbb{R}^d \to \mathbb{R} \) with parameter \( \mathbf{W} \) (collecting all weights and biases), for fixed \( c \in [0, 1] \) and \( \beta \in [0,1) \), replacing every instance of ReLU with a CTU with hyperparameters \( \beta, c \) is equivalent to projecting \( f \) to a smooth function \( f_{\beta,c} \) with bounded gradients and curvature, while keeping \( \mathbf{W} \) fixed. Importantly, for \( 0 < \beta < 1 \), \( f_{\beta,c} \) enjoys higher local expressivity than \( f \) for the same parameter \( \mathbf{W} \), due to non-vanishing local curvature.
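To make the last claim concrete, a quick calculation of ours (not part of the paper's proof) gives the CTU's second derivative at the origin:

\[ \varphi_{\beta,c}''(0) = c\,\frac{\eta}{2} + (1 - c)\,\frac{\gamma}{4} = \frac{2c\beta + (1 - c)}{4(1 - \beta)}, \]

which for the default \( c = 0.5 \) equals \( \frac{2\beta + 1}{8(1 - \beta)} \): strictly positive for every \( \beta \in [0, 1) \), and diverging as \( \beta \to 1 \), consistent with ReLU's curvature concentrating entirely at its kink.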

BibTeX

@misc{hu2025curvaturetuningprovabletrainingfree,
      title={Curvature Tuning: Provable Training-free Model Steering From a Single Parameter}, 
      author={Leyang Hu and Matteo Gamba and Randall Balestriero},
      year={2025},
      eprint={2502.07783},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.07783}, 
}