How Width Scaling Affects Neural Networks: Generalization, Optimal Hyperparameters, Feature Learning and Beyond

dc.contributor.advisor von Luxburg, Ulrike (Prof. Dr.)
dc.contributor.author Haas, Moritz
dc.date.accessioned 2025-12-01T11:45:14Z
dc.date.available 2025-12-01T11:45:14Z
dc.date.issued 2025-12-01
dc.identifier.uri http://hdl.handle.net/10900/172733
dc.identifier.uri http://nbn-resolving.org/urn:nbn:de:bsz:21-dspace-1727339 de_DE
dc.identifier.uri http://dx.doi.org/10.15496/publikation-114058
dc.description.abstract In recent years, deep artificial neural networks have shown increasingly impressive learning capabilities. With growing computational resources, the size of these networks keeps growing and their generalization performance continues to improve. In this thesis, we aim to bridge the gap between the theory and practice of deep learning by developing a theoretical understanding of large-scale artificial neural network training that is empirically predictive at moderate scale, proves fundamental limitations, or results in practical benefits. In particular, we contribute to the understanding of the width-scaling properties of neural networks through a combination of infinite-width theory and empirical evaluations of theoretical predictions.

First, we study when and how overfitting of wide neural networks in the neural tangent parameterization (NTP) can generalize well. For this purpose, we significantly generalize previous inconsistency results for kernel regression in fixed input dimension to overfitting with many common neural estimators beyond the minimum-norm interpolant. However, we also show that suitable spiky-smooth estimators can overfit benignly with minimax-optimal convergence rates in arbitrary covariate dimension. For wide neural networks in NTP, such a spiky-smooth inductive bias can be induced by adding a single shifted high-frequency, low-amplitude sine curve to the activation function. We thereby demonstrate that, with the right choice of estimator, overfitting is neither intrinsically beneficial nor harmful for generalization, irrespective of the input dimension.

Second, we study width-dependent parameterizations for Sharpness-Aware Minimization (SAM). While the Maximal Update Parameterization (μP) has been shown to induce hyperparameter transfer and improved generalization at large width for stochastic gradient descent and Adam, we prove that training with SAM in μP with a global perturbation radius only effectively perturbs the last layer. This observation motivates classifying all possible width-dependent choices of layerwise initialization variances, learning rates and perturbation radii into unstable, vanishing, non-trivial and effective perturbation parameterizations. We find that there exists a unique stable parameterization, which we call the Maximal Update and Perturbation Parameterization (μP^2), that achieves width-independent feature learning and width-independent perturbations of the trainable weights in all layers. In experiments training multilayer perceptrons and ResNets on CIFAR-10 as well as Vision Transformers on ImageNet-1K with SAM, we observe that μP^2 improves generalization and jointly transfers both the optimal learning rate and the optimal perturbation radius from small to large width, as opposed to μP with global perturbation scaling. This confirms that width-independent training dynamics induce empirically favorable and predictable scaling properties, but it also shows that non-standard optimization algorithms can require scaling considerations that go beyond μP.

Third, we study the dominant width-scaling practice (the standard parameterization, SP): networks are initialized with He initialization and trained using a single learning rate for all trainable parameters, tuned at each model scale. Existing infinite-width theory would predict instability under large learning rates and vanishing feature learning under stable learning rates. However, empirically optimal learning rates consistently decay much more slowly than theoretically predicted.
We identify the cross-entropy loss as the key component that enables stable training under large learning rates, allowing stable hidden-layer feature learning despite logit divergence, even in the infinite-width limit. We empirically validate that the infinite-width predictions hold at moderate width, and thereby provide the first infinite-width proxy for SP that remains predictive of practical neural networks in this controlled divergence regime.

Overall, our results suggest that the width dependence of practical neural network training is surprisingly predictable with Tensor Program-based analyses, even over the course of long training. This enables understanding and correcting real-world neural network scaling. Our findings reinforce that corrected width scaling has significant downstream impact on hyperparameter transfer, feature learning and, consequently, generalization and predictability at large model scale. However, all relevant architecture and training components, such as the activation function, the loss function and intermediate perturbation steps, need to be taken into account to arrive at the correct qualitative conclusions and to find the correct width-scaling rules that achieve width-independent training dynamics. en
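The spiky-smooth inductive bias described in the abstract amounts to a small modification of the activation function. As a rough illustration (not the exact construction from the thesis), the following PyTorch sketch adds a shifted high-frequency, low-amplitude sine term to a ReLU activation; the amplitude `eps`, frequency `omega` and shift `phi` are arbitrary placeholder values.

```python
import torch
import torch.nn as nn


class SpikySmoothReLU(nn.Module):
    """ReLU plus a shifted high-frequency, low-amplitude sine term.

    Illustrative sketch of the 'spiky-smooth' activation idea:
    sigma(x) = relu(x) + eps * sin(omega * x + phi).
    The constants below are placeholders, not the thesis' derived values.
    """

    def __init__(self, eps: float = 1e-2, omega: float = 100.0, phi: float = 0.5):
        super().__init__()
        self.eps = eps      # low amplitude -> small "spiky" component
        self.omega = omega  # high frequency
        self.phi = phi      # phase shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x) + self.eps * torch.sin(self.omega * x + self.phi)


# Usage: drop the activation into a wide two-layer network.
if __name__ == "__main__":
    width = 4096
    net = nn.Sequential(nn.Linear(10, width), SpikySmoothReLU(), nn.Linear(width, 1))
    print(net(torch.randn(8, 10)).shape)  # torch.Size([8, 1])
```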
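The SAM analysis hinges on replacing a single global perturbation radius with layerwise radii whose width dependence is chosen appropriately. The sketch below shows a generic SAM update with one perturbation radius per parameter tensor; the function name `layerwise_sam_step` and the contents of the `rhos` dictionary are illustrative assumptions, and the sketch does not encode the specific μP^2 scaling rules derived in the thesis.

```python
import torch


def layerwise_sam_step(model, loss_fn, x, y, opt, rhos):
    """One SAM update with a layerwise perturbation radius per parameter.

    Sketch only: `rhos` maps parameter name -> perturbation radius. How these
    radii should scale with width is precisely what the mu-P^2 analysis
    determines; the values passed in here are placeholders.
    """
    opt.zero_grad()

    # 1) Ascent step: perturb each layer by rho_l * grad_l / ||grad_l||.
    loss_fn(model(x), y).backward()
    perturbations = {}
    with torch.no_grad():
        for name, p in model.named_parameters():
            if p.grad is None:
                continue
            e = rhos[name] * p.grad / (p.grad.norm() + 1e-12)
            p.add_(e)
            perturbations[name] = e
    opt.zero_grad()

    # 2) Gradient at the perturbed weights.
    loss_fn(model(x), y).backward()

    # 3) Undo the perturbation, then take the optimizer step with those gradients.
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in perturbations:
                p.sub_(perturbations[name])
    opt.step()
    opt.zero_grad()
```

A caller would fill `rhos` with one radius per parameter tensor; choosing how those radii (together with layerwise initialization variances and learning rates) must depend on width so that all layers are effectively perturbed is exactly the question the abstract's μP^2 parameterization answers.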
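The standard-parameterization setup analyzed in the third contribution is the common recipe of He initialization, a single globally tuned learning rate, and the cross-entropy loss. A minimal sketch of that recipe, with placeholder width and learning rate, might look as follows.

```python
import torch
import torch.nn as nn


def make_sp_mlp(width: int, depth: int = 3, d_in: int = 32, n_classes: int = 10):
    """MLP in the standard parameterization described in the abstract:
    He (Kaiming) initialization for every layer. Width, depth and the
    learning rate below are placeholders for scaling experiments."""
    layers, d = [], d_in
    for _ in range(depth - 1):
        lin = nn.Linear(d, width)
        nn.init.kaiming_normal_(lin.weight, nonlinearity="relu")
        nn.init.zeros_(lin.bias)
        layers += [lin, nn.ReLU()]
        d = width
    head = nn.Linear(d, n_classes)
    nn.init.kaiming_normal_(head.weight, nonlinearity="relu")
    nn.init.zeros_(head.bias)
    layers.append(head)
    return nn.Sequential(*layers)


model = make_sp_mlp(width=1024)
opt = torch.optim.SGD(model.parameters(), lr=0.1)  # one learning rate for all parameters
loss_fn = nn.CrossEntropyLoss()                    # the loss the abstract identifies as key
```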
dc.language.iso en de_DE
dc.publisher Universität Tübingen de_DE
dc.rights cc_by de_DE
dc.rights ubt-podok de_DE
dc.rights.uri https://creativecommons.org/licenses/by/4.0/legalcode.de de_DE
dc.rights.uri https://creativecommons.org/licenses/by/4.0/legalcode.en en
dc.rights.uri http://tobias-lib.uni-tuebingen.de/doku/lic_mit_pod.php?la=de de_DE
dc.rights.uri http://tobias-lib.uni-tuebingen.de/doku/lic_mit_pod.php?la=en en
dc.subject.classification Deep Learning, Learning Theory, Statistics de_DE
dc.subject.ddc 004 de_DE
dc.subject.other Deep Learning Theory en
dc.subject.other Infinite-Width Theory en
dc.subject.other Benign Overfitting en
dc.subject.other Hyperparameter Transfer en
dc.subject.other Tensor Programs en
dc.title How Width Scaling Affects Neural Networks: Generalization, Optimal Hyperparameters, Feature Learning and Beyond en
dc.type PhDThesis de_DE
dcterms.dateAccepted 2025-07-28
utue.publikation.fachbereich Informatik de_DE
utue.publikation.fakultaet 7 Mathematisch-Naturwissenschaftliche Fakultät de_DE
utue.publikation.noppn yes de_DE
