Data Pruning: The Unexpected Trick to Boosting AI Accuracy

Scaling up deep learning models through more data, larger models, and increased computing has yielded impressive gains in recent years. However, the improvements we've seen in accuracy come at an immense cost — requiring massively larger datasets, models with billions of parameters, and weeks of training on specialized hardware. 

But what if throwing more data and computing at AI models isn't the only way forward? In a new paper titled "Beyond neural scaling laws: beating power law scaling via data pruning," researchers demonstrate a technique called data pruning that achieves better accuracy with substantially less data.

The Core Idea Behind Data Pruning

The core idea behind data pruning is simple: not all training examples provide equal value for learning. Many examples are redundant: easy or repetitive data points that teach the model little it has not already learned from similar examples. Data pruning seeks to identify and remove these redundant examples from the training set, allowing models to focus their capacity on only the most valuable data points.

The paper shows, both theoretically and empirically, that carefully pruning away large portions of the training data can maintain, and sometimes even improve, accuracy. This challenges the standard practice in deep learning of collecting ever-larger datasets without asking whether every example is equally helpful.
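The basic recipe can be sketched in a few lines: score every training example by some measure of difficulty, then keep only the highest-scoring fraction. This is a minimal sketch, not the paper's implementation; the scores here are toy values, and note the paper also finds that when data is *scarce*, keeping the easiest examples works better than keeping the hardest.

```python
import numpy as np

def prune_dataset(scores, keep_fraction):
    """Keep the highest-scoring (hardest) examples.

    scores: 1-D array of per-example difficulty scores
            (higher = harder / more informative).
    keep_fraction: fraction of the dataset to retain.
    Returns the indices of the retained examples.
    """
    n_keep = int(len(scores) * keep_fraction)
    # argsort ascending, then take the tail = highest scores
    order = np.argsort(scores)
    return order[-n_keep:]

# Toy example: 10 examples with made-up difficulty scores, keep the hardest 70%
scores = np.array([0.9, 0.1, 0.5, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4, 0.05])
kept = prune_dataset(scores, keep_fraction=0.7)
print(sorted(kept.tolist()))  # [0, 2, 3, 5, 6, 7, 8]
```

Everything interesting lives in how `scores` is computed; the paper benchmarks many candidate scoring metrics, discussed below.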

Intuitively, data pruning works because neural networks exhibit a power law relationship between accuracy and the amount of training data. Doubling your dataset improves accuracy, but only slightly. For example, the authors show that in language modeling, a 10x increase in training data improves performance by only 0.6 nats on a test set. This means each additional example provides diminishing returns. Data pruning counteracts this by removing redundant examples that offer little new information.
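The diminishing returns are easy to see numerically. Under a power law, test error falls as n^(-alpha) in the dataset size n; the exponent below is an arbitrary illustrative value, not one fitted from the paper:

```python
# Illustrative power-law scaling: relative test error falls as n^(-alpha).
# alpha here is a made-up constant for illustration only.
alpha = 0.1

def power_law_error(n, a=alpha):
    """Relative test error under power-law scaling."""
    return n ** -a

for n in [10_000, 20_000, 40_000, 80_000]:
    print(f"n={n:>6}: relative error {power_law_error(n):.3f}")
```

With alpha = 0.1, each doubling of the dataset multiplies the error by 2^(-0.1), roughly a 7% relative improvement: you pay twice the data-collection and training cost for a small gain.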

The key to making data pruning work is a good metric that separates easy, redundant examples (to remove) from hard, informative ones (to keep). The authors benchmark several proposed metrics on ImageNet and find that most do not effectively identify the helpful examples. However, a metric measuring how much networks "memorize" each example works quite well, allowing 30% of ImageNet images to be pruned with no loss in accuracy.

Remarkably, the authors show that with data pruning, test error can decay exponentially with dataset size, rather than following the slower power-law decay seen without pruning. This surprising result means carefully selected small datasets can outperform massive randomly collected ones, a promising finding for reducing the cost of training powerful AI models.
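To see why exponential decay eventually dominates any power law, compare two illustrative error curves. The constants below are arbitrary, chosen only to make the crossover visible; they are not fitted to the paper's results:

```python
import math

# Illustrative comparison: power-law error decay (random data collection)
# vs exponential error decay (idealized data pruning). Constants are arbitrary.
def power_law(n, alpha=0.5):
    return n ** -alpha

def exponential(n, c=5e-4):
    return math.exp(-c * n)

for n in [1_000, 5_000, 10_000, 20_000]:
    print(f"n={n:>6}: power-law {power_law(n):.5f}  exponential {exponential(n):.5f}")
```

For small n the power-law curve can sit lower, but the exponential curve shrinks multiplicatively with every added example, so it always wins for large enough datasets.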

Beating Power Laws with Data Pruning: Key Facts

Here are some of the critical facts demonstrating how data pruning can beat power law scaling:

  • Data pruning boosted CIFAR-10 test accuracy from 92% to 94% after removing 50% of training data. Surprisingly, carefully chosen data subsets can outperform the entire dataset.
  • On ImageNet, pruning the "easiest" 30% of examples matched the accuracy of training without pruning. This shows large portions of ImageNet are redundant for current models.
  • With data pruning, test error on CIFAR-10 decayed exponentially with dataset size instead of following a power law. Careful data selection can beat unthinkingly collecting more data.
  • Data pruning reduced the computational cost of training by 59% with no loss in accuracy on CIFAR-10. So, data pruning can cut the energy consumption of training.
  • A simple self-supervised data pruning metric matched the performance of the best supervised metrics on ImageNet. This could enable the pruning of massive unlabeled datasets.
  • These results demonstrate data pruning is a promising technique to improve the accuracy and efficiency of deep learning. While simple data pruning strategies were effective, developing improved pruning metrics is an exciting direction for future work.
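The paper's self-supervised metric clusters embeddings from a pretrained model and scores each example by its distance to the nearest cluster centroid: prototypical points near a centroid count as "easy," outliers as "hard." The sketch below follows that idea with a minimal NumPy k-means on random vectors standing in for real embeddings; cluster counts and sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20):
    """Minimal k-means; returns the final centroids."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Distance of every point to every centroid, shape (n, k)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids

def prototype_scores(X, k):
    """Distance to the nearest centroid: small = prototypical ("easy"),
    large = atypical ("hard")."""
    centroids = kmeans(X, k)
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.min(axis=1)

# Toy "embeddings": 200 points in 8-D; drop the 30% most prototypical.
X = rng.normal(size=(200, 8))
scores = prototype_scores(X, k=5)
keep = np.argsort(scores)[int(0.3 * len(X)):]
print(len(keep))  # 140
```

Because the metric needs no labels, the same procedure could in principle rank examples in large unlabeled corpora before annotation, which is what makes the self-supervised result notable.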

Turn Your AI Ideas into Apps with CPROMPT

The data pruning technique discussed in this paper has the potential to make deep learning more accessible by reducing data and computing requirements. At CPROMPT, we aim to make AI more accessible by allowing anyone to turn text prompts into web apps within minutes. With CPROMPT, you don't need any coding or technical expertise. Our no-code platform lets you generate a customized web app powered by state-of-the-art AI simply by describing what you want in plain English. CPROMPT makes it easy to turn your AI ideas into real applications to share or sell, whether you're a researcher, student, artist, or entrepreneur.

CPROMPT also has many capabilities that could be useful for experimenting with data pruning techniques like those discussed in this paper. You can connect and prune datasets, train AI models, and deploy pruned models into apps accessible through a simple web interface.

To learn more about how CPROMPT can help you create AI-powered apps and share your ideas with the world, visit our website at https://cprompt.ai. With innovative techniques like data pruning and no-code tools like CPROMPT, the future of AI looks more accessible than ever.

Glossary

Fine-tuning: The process of training a pre-trained machine learning model further on a downstream task by adjusting the model parameters to specialize it to the new task.

Foundation model: A model trained on an extensive and general dataset that can then be adapted or fine-tuned to many downstream tasks. Foundation models like GPT-3 have enabled new AI applications.

Out-of-distribution (OOD): Describes test examples from a different data distribution than the examples the model was trained on. Assessing performance on OOD data is essential for evaluating model robustness.

Overfitting: When a machine learning model performs worse on new test data than on the training data it was fit to. Overly complex models can overfit by memorizing the peculiarities of the training set.

Power law: A relationship where one quantity varies as a power of another, for example test error falling in proportion to a power of the dataset size. Many metrics in machine learning scale according to a power law.

Pretraining: Initial training phase where a model is trained on a massive dataset before fine-tuning on a downstream task. Pretraining can enable knowledge transfer and improve sample efficiency.

Pruning: Removing parts of a machine learning model or training dataset according to some criterion to increase sample efficiency. The paper discusses data pruning specifically.