Mechanistic Design and Scaling of Hybrid Architectures
arXiv (2024)
Abstract
The development of deep learning architectures is a resource-demanding
process, due to a vast design space, long prototyping times, and high compute
costs associated with at-scale model training and evaluation. We set out to
simplify this process by grounding it in an end-to-end mechanistic architecture
design (MAD) pipeline, encompassing small-scale capability unit tests
predictive of scaling laws. Through a suite of synthetic token manipulation
tasks such as compression and recall, designed to probe capabilities, we
identify and test new hybrid architectures constructed from a variety of
computational primitives. We experimentally validate the resulting
architectures via an extensive compute-optimal and a new state-optimal scaling
law analysis, training over 500 language models between 70M and 7B parameters.
Surprisingly, we find MAD synthetics to correlate with compute-optimal
perplexity, enabling accurate evaluation of new architectures via isolated
proxy tasks. The new architectures found via MAD, based on simple ideas such as
hybridization and sparsity, outperform state-of-the-art Transformer,
convolutional, and recurrent architectures (Transformer++, Hyena, Mamba) in
scaling, both at compute-optimal budgets and in overtrained regimes. Overall,
these results provide evidence that performance on curated synthetic tasks can
be predictive of scaling laws, and that an optimal architecture should leverage
specialized layers via a hybrid topology.
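
To make the idea of a synthetic recall task concrete, here is a minimal sketch of a toy in-context recall example generator. The abstract does not specify the task formats, so everything here is an illustrative assumption rather than the authors' setup: the function name `make_recall_example`, the parameters `num_pairs` and `vocab_size`, and the token layout (key-value pairs followed by a query key).

```python
import random

def make_recall_example(num_pairs=4, vocab_size=16, seed=0):
    """Build one toy in-context recall sequence (hypothetical format).

    Keys are drawn from [0, vocab_size) and values from
    [vocab_size, 2 * vocab_size), so keys and values never collide.
    """
    rng = random.Random(seed)
    keys = rng.sample(range(vocab_size), num_pairs)  # distinct keys
    values = [rng.randrange(vocab_size, 2 * vocab_size) for _ in keys]
    query = rng.choice(keys)  # the key whose value must be recalled

    # Interleave key, value, key, value, ... then append the query key.
    tokens = []
    for k, v in zip(keys, values):
        tokens.extend([k, v])
    tokens.append(query)

    # The correct answer is the value that followed the queried key.
    target = values[keys.index(query)]
    return tokens, target

tokens, target = make_recall_example()
```

A model solves this toy task if, given `tokens`, it predicts `target` as the next token. Accuracy on many such sequences is the kind of small-scale capability measurement the abstract refers to.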