Researchers from top universities are warning about AI model collapse, which occurs when generative AI models are trained on data created by other AI models. The process leads to degraded performance, more frequent errors, and repetitive responses. Training models on machine-generated rather than human-created data causes them to "forget" parts of the original distribution, which could become a serious problem as LLMs contribute an ever larger share of the text found online. Model collapse arises from the accumulation of errors during training, which distorts the model's understanding of reality.
- Recursive training exacerbates this problem, causing models to drift further and further from the original data distribution.
- In their study, the researchers simulated training generative models on their own output and observed that the data distribution had changed completely within 50 generations.
- To maintain the quality of future generative models, the researchers emphasize the importance of training with human-generated content and ensuring fair representation of minority groups in the datasets.
- They suggest preserving the original human-produced dataset, periodically incorporating it into model training, and introducing new, clean, human-generated datasets.
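The dynamic described above can be illustrated with a toy, deterministic sketch (the corpus, token names, and cutoff below are invented for illustration and are not from the study): a unigram "model" is fitted to a corpus, its low-probability tail is truncated (standing in for top-k or nucleus sampling), and each new generation is trained only on the previous model's output. Rare tokens vanish within a few generations and never return, which is one reason the researchers stress fair representation of minority data.

```python
from collections import Counter

def next_generation(counts, corpus_size, cutoff=0.02):
    """Fit a unigram 'model' to counts, truncate its tail, regenerate a corpus."""
    n = sum(counts.values())
    probs = {tok: c / n for tok, c in counts.items()}
    # Tail truncation, standing in for top-k / nucleus sampling:
    # tokens below the cutoff are never emitted by the model.
    kept = {tok: p for tok, p in probs.items() if p >= cutoff}
    norm = sum(kept.values())
    # Expected counts rather than random draws, so the demo is deterministic.
    return Counter({tok: round(corpus_size * p / norm) for tok, p in kept.items()})

# Generation 0: a "human" corpus with a long tail of rare words.
corpus = Counter({"the": 500, "cat": 200, "sat": 150, "on": 100,
                  "mat": 30, "quokka": 12, "obelisk": 8})
gen = corpus
for _ in range(10):
    gen = next_generation(gen, corpus_size=sum(corpus.values()))

print(sorted(gen))   # rare words are gone: ['cat', 'mat', 'on', 'sat', 'the']
```

Real collapse compounds this tail-loss effect with estimation and optimization errors, but it shows why periodically mixing the preserved human-generated corpus back into training helps: it is the only source that still contains the rare tokens.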