| dc.contributor.advisor |
von Luxburg, Ulrike (Prof. Dr.) |
|
| dc.contributor.author |
Haas, Moritz |
|
| dc.date.accessioned |
2025-12-01T11:45:14Z |
|
| dc.date.available |
2025-12-01T11:45:14Z |
|
| dc.date.issued |
2025-12-01 |
|
| dc.identifier.uri |
http://hdl.handle.net/10900/172733 |
|
| dc.identifier.uri |
http://nbn-resolving.org/urn:nbn:de:bsz:21-dspace-1727339 |
de_DE |
| dc.identifier.uri |
http://dx.doi.org/10.15496/publikation-114058 |
|
| dc.description.abstract |
In recent years, deep artificial neural networks have shown increasingly impressive learning capabilities. With growing computational resources, the size of these networks keeps increasing and their generalization performance continues to improve. In this thesis, we aim to bridge the gap between the theory and practice of deep learning by developing theoretical understanding of large-scale artificial neural network training that is empirically predictive at moderate scale, proves fundamental limitations, or yields practical benefits. In particular, we contribute to the understanding of the width-scaling properties of neural networks through a combination of infinite-width theory and empirical evaluations of theoretical predictions.
First, we study when and how overfitting of wide neural networks in the neural tangent parameterization (NTP) can generalize well. For this purpose, we significantly generalize previous inconsistency results for kernel regression in fixed input dimension to overfitting with many common neural estimators beyond the minimum norm interpolant. However, we also show that suitable spiky-smooth estimators can overfit benignly, with minimax-optimal convergence rates, in arbitrary covariate dimension. For wide neural networks in NTP, such a spiky-smooth inductive bias can be induced by adding a single shifted high-frequency, low-amplitude sine curve to the activation function. We thereby demonstrate that, with the right choice of estimator, overfitting is neither intrinsically beneficial nor intrinsically harmful for generalization, irrespective of the input dimension.
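To make the activation modification concrete, the following is a minimal sketch of what adding a single shifted high-frequency, low-amplitude sine term to a standard activation could look like; the function name and the amplitude, frequency, and shift values are illustrative placeholders, not the constants analyzed in the thesis.

import numpy as np

def spiky_smooth_relu(x, amplitude=1e-2, frequency=1e3, shift=0.5):
    # ReLU plus one shifted high-frequency, low-amplitude sine term.
    # The specific constants here are hypothetical, chosen only for illustration.
    return np.maximum(x, 0.0) + amplitude * np.sin(frequency * (x - shift))

x = np.linspace(-1.0, 1.0, 5)
print(spiky_smooth_relu(x))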
Second, we study width-dependent parameterizations for Sharpness Aware Minimization (SAM). While for stochastic gradient descent and Adam the Maximal Update Parameterization (μP) has been shown to induce hyperparameter transfer and improved generalization at large width, we prove that training with SAM in μP with a global perturbation radius only effectively perturbs the last layer. This observation motivates classifying all possible width-dependent choices of layerwise initialization variances, learning rates, and perturbation radii into unstable, vanishing, non-trivial, and effective perturbation parameterizations. We find that there exists a unique stable parameterization, which we call the Maximal Update and Perturbation Parameterization (μP^2), that achieves width-independent feature learning and width-independent perturbations of the trainable weights in all layers. In experiments training multilayer perceptrons and ResNets on CIFAR-10 as well as Vision Transformers on ImageNet-1K with SAM, we observe that μP^2 improves generalization and jointly transfers both the optimal learning rate and the optimal perturbation radius from small to large width, in contrast to μP with global perturbation scaling. This confirms that width-independent training dynamics induce empirically favorable and predictable scaling properties, but it also shows that non-standard optimization algorithms can require scaling considerations beyond μP.
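As a rough illustration of the structural difference between global and layerwise perturbations, the sketch below contrasts a single global perturbation radius, normalized by the joint gradient norm across all layers, with per-layer radii normalized per layer. The width-dependent factors that define μP^2 are deliberately omitted, and the function names and radii are hypothetical.

import numpy as np

def sam_perturbation_global(grads, rho):
    # One global radius rho; the perturbation is normalized by the joint
    # gradient norm over all layers. Under muP, such a global perturbation
    # effectively only perturbs the last layer at large width.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads.values()))
    return {name: rho * g / (total_norm + 1e-12) for name, g in grads.items()}

def sam_perturbation_layerwise(grads, rhos):
    # One radius rho_l per layer, normalized by that layer's own gradient norm.
    # In muP^2 the rho_l additionally carry width-dependent scaling factors,
    # which are not reproduced here.
    return {name: rhos[name] * g / (np.linalg.norm(g) + 1e-12)
            for name, g in grads.items()}

grads = {"hidden": np.ones((4, 4)), "output": np.ones((4, 1))}
eps_global = sam_perturbation_global(grads, rho=0.05)
eps_layer = sam_perturbation_layerwise(grads, rhos={"hidden": 0.05, "output": 0.05})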
Third, we study the dominant width-scaling practice (standard parameterization, SP): networks are initialized with He initialization and trained using a single learning rate for all trainable parameters, tuned at each model scale. Existing infinite-width theory would predict instability under large learning rates and vanishing feature learning under stable learning rates. However, empirically optimal learning rates consistently decay much more slowly than theoretically predicted. We identify the cross-entropy loss as the key component that enables stable training under large learning rates, allowing stable hidden-layer feature learning despite logit divergence, even in the infinite-width limit. We empirically validate that the infinite-width predictions hold at moderate width, and we therefore provide the first infinite-width proxy for SP that remains predictive of practical neural networks in this controlled divergence regime.
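One elementary property behind this mechanism can be checked in the toy example below: the gradient of softmax cross-entropy with respect to the logits is softmax(z) - y, whose entries stay bounded in [-1, 1] no matter how large the logits grow. This is only a sketch of the basic boundedness property; the controlled-divergence analysis in the thesis goes well beyond it.

import numpy as np

def ce_logit_gradient(logits, label):
    # Gradient of softmax cross-entropy w.r.t. the logits: softmax(z) - y.
    z = logits - logits.max()              # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum()
    y = np.zeros_like(p)
    y[label] = 1.0
    return p - y

rng = np.random.default_rng(0)
z = rng.normal(size=10)
for scale in (1.0, 1e2, 1e4):              # let the logits "diverge"
    g = ce_logit_gradient(scale * z, label=3)
    print(f"scale={scale:g}, max |grad| = {np.abs(g).max():.3f}")  # bounded by 1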
Overall, our results suggest that the width dependence of practical neural network training is surprisingly predictable with Tensor Program-based analyses, even over the course of long training. This enables understanding and correcting real-world neural network scaling. Our findings reinforce that corrected width scaling has a significant downstream impact on hyperparameter transfer, feature learning, and consequently generalization and predictability at large model scale. However, all relevant architecture and training components, such as the activation function, the loss function, and intermediate perturbation steps, need to be taken into account to arrive at the correct qualitative conclusions and to find the correct width-scaling rules that achieve width-independent training dynamics. |
en |
| dc.language.iso |
en |
de_DE |
| dc.publisher |
Universität Tübingen |
de_DE |
| dc.rights |
cc_by |
de_DE |
| dc.rights |
ubt-podok |
de_DE |
| dc.rights.uri |
https://creativecommons.org/licenses/by/4.0/legalcode.de |
de_DE |
| dc.rights.uri |
https://creativecommons.org/licenses/by/4.0/legalcode.en |
en |
| dc.rights.uri |
http://tobias-lib.uni-tuebingen.de/doku/lic_mit_pod.php?la=de |
de_DE |
| dc.rights.uri |
http://tobias-lib.uni-tuebingen.de/doku/lic_mit_pod.php?la=en |
en |
| dc.subject.classification |
Deep Learning, Lerntheorie, Statistik |
de_DE |
| dc.subject.ddc |
004 |
de_DE |
| dc.subject.other |
Deep Learning Theory |
en |
| dc.subject.other |
Infinite-Width Theory |
en |
| dc.subject.other |
Benign Overfitting |
en |
| dc.subject.other |
Hyperparameter Transfer |
en |
| dc.subject.other |
Tensor Programs |
en |
| dc.title |
How Width Scaling Affects Neural Networks: Generalization, Optimal Hyperparameters, Feature Learning and Beyond |
en |
| dc.type |
PhDThesis |
de_DE |
| dcterms.dateAccepted |
2025-07-28 |
|
| utue.publikation.fachbereich |
Informatik |
de_DE |
| utue.publikation.fakultaet |
7 Mathematisch-Naturwissenschaftliche Fakultät |
de_DE |
| utue.publikation.noppn |
yes |
de_DE |