The kinds of data typically used to train language models may be used up in the near future, according to a paper by researchers from Epoch. As researchers build more powerful models, they have to find ever more text to train them on, and large language model researchers are worried that they are going to run out of it, says Teven Le Scao, a researcher at Hugging Face who was not involved in Epoch's work.
The issue stems from the fact that researchers sort the data used to train models into two categories: high quality and low quality. Text in the high-quality category is seen as better written and is often produced by professional writers, according to the paper's lead author.
The low-quality category consists mostly of text from sources such as social media posts or comments on websites like 4chan. Researchers typically train models on text in the high-quality category, because that is the type of language they want the models to reproduce, and this approach has produced some impressive results.
Swabha Swayamdipta, a University of Southern California machine learning professor who specializes in dataset quality, said that one way to overcome the data constraints would be to reexamine how data quality is defined. Using more diverse datasets in training, she said, would be a net positive for language models.
Training data can also be stretched further. Because of performance and cost constraints, large language models are currently trained on the same data only once, but it may be possible to train a model several times on the same data.
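As a rough illustration of what reusing a fixed corpus looks like in practice, here is a minimal sketch of multi-epoch training, assuming a PyTorch-style setup; the toy dataset, tiny model, and hyperparameters are hypothetical stand-ins for a real corpus and language model, not anything described in the Epoch paper.

```python
# Minimal sketch: training a model several times (epochs) on the same fixed dataset.
# All names, sizes, and hyperparameters here are illustrative assumptions.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy "corpus": random token ids standing in for a finite, fixed dataset.
vocab_size, seq_len, n_examples = 100, 16, 512
tokens = torch.randint(0, vocab_size, (n_examples, seq_len))
dataset = TensorDataset(tokens[:, :-1], tokens[:, 1:])  # inputs, next-token targets
loader = DataLoader(dataset, batch_size=32, shuffle=True)

# Tiny language model: embedding -> GRU -> vocabulary logits.
class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 64)
        self.rnn = nn.GRU(64, 64, batch_first=True)
        self.head = nn.Linear(64, vocab_size)

    def forward(self, x):
        hidden, _ = self.rnn(self.embed(x))
        return self.head(hidden)

model = TinyLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# The key point is the outer loop: each epoch revisits exactly the same examples,
# rather than requiring freshly collected data.
for epoch in range(4):
    for inputs, targets in loader:
        logits = model(inputs)
        loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: last-batch loss {loss.item():.3f}")
```

The only part of the sketch that matters for the argument is the outer loop: every pass over `loader` reuses the identical data, which is the kind of data reuse the researchers suggest could extend the life of existing corpora.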
Some researchers believe that, when it comes to language models, bigger may not be better anyway. Percy Liang, a computer science professor at Stanford University, says there is evidence that making models more efficient, rather than simply larger, may improve their ability.
“We've seen how smaller models that are trained on higher-quality data can outperform larger models trained on lower-quality data,” he explains.