Test suites are not reflection complete! (https://sdrinf.com/reflection-completeness) Essentially, the moment a set of testing data gets significant traction, it becomes a target to optimize for.
Instead, I strongly recommend putting together a list of "control questions" of your own that covers the general and specific use cases you're interested in. Specifically, I'd recommend adding questions on topics you have a high degree of expertise in, and topics where you can figure out what an "expert" answer actually looks like; then run them against the available models yourself.
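To make this concrete, here's a rough sketch of what I mean, assuming you have API access to one of the models (the OpenAI Python client and the model name below are placeholders; swap in whatever you're actually comparing):

    # Run your own "control questions" against a model and dump the answers
    # for manual review. Model name and client are placeholders.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    control_questions = [
        "Explain the difference between a mutex and a semaphore.",
        "What does the Riesz representation theorem say?",
        # ...questions from your own areas of expertise...
    ]

    for question in control_questions:
        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder; use whichever models you're comparing
            messages=[{"role": "user", "content": question}],
        )
        print(f"Q: {question}\nA: {response.choices[0].message.content}\n")

The tooling isn't the point; the point is that you grade the answers yourself on topics where you can tell a genuinely expert answer from a plausible-sounding one.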
This is true of all the existing NLP benchmarks, but I don't see why it should be true in general. In machine vision, for example, benchmarks like ImageNet were still useful even when people were trying to optimize directly for them. (ImageNet shows its age now, but that's because it's too easy.)
I hope we can come up with something similarly robust for language. It can't just be a list of 1000 questions; otherwise it will end up in the training data and everyone will overfit to it.
For example, would it be possible to generate billions of trivia questions from WikiData? Good luck overfitting on that.
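It's not hard to sketch, at least: Wikidata exposes a public SPARQL endpoint, so you can pull (subject, property, object) triples and drop them into question templates. The property (P36, "capital") and the template below are one arbitrary example; a real benchmark would sample many properties and templates:

    # Sketch: turn Wikidata triples into trivia questions via the public
    # SPARQL endpoint. P31 = "instance of", Q6256 = "country", P36 = "capital".
    import requests

    SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

    QUERY = """
    SELECT ?countryLabel ?capitalLabel WHERE {
      ?country wdt:P31 wd:Q6256 .
      ?country wdt:P36 ?capital .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 50
    """

    def capital_questions():
        """Yield (question, answer) pairs built from Wikidata triples."""
        resp = requests.get(
            SPARQL_ENDPOINT,
            params={"query": QUERY, "format": "json"},
            headers={"User-Agent": "trivia-benchmark-sketch/0.1"},
        )
        resp.raise_for_status()
        for row in resp.json()["results"]["bindings"]:
            yield (f"What is the capital of {row['countryLabel']['value']}?",
                   row["capitalLabel"]["value"])

    for q, a in capital_questions():
        print(q, "->", a)

Scale the same idea across thousands of properties and templates and you get question sets too large, and too cheap to regenerate, for anyone to bother memorizing.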