
Test suites are not reflection complete! https://sdrinf.com/reflection-completeness - essentially, the moment a set of testing data gets significant traction, it becomes a target to optimize for.

Instead, I strongly recommend putting together a list of "control questions" of your own that covers the general and specific use cases you're interested in. Specifically, I'd recommend adding questions on topics you have a high degree of expertise in, and topics where you can figure out what an "expert" answer actually looks like; then run the list against the available models yourself.
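A minimal sketch of what running such a list against models could look like, assuming an OpenAI-compatible chat API via the openai Python client; the questions and model names below are placeholders, not part of the original suggestion:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Placeholder control questions; write your own around your expertise.
    control_questions = [
        "A question from a domain where you already know the expert answer",
        "A question covering a use case you actually care about",
    ]

    models_to_compare = ["gpt-4o", "gpt-4o-mini"]  # swap in whatever you have access to

    for model in models_to_compare:
        print(f"=== {model} ===")
        for question in control_questions:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": question}],
            )
            print(f"Q: {question}")
            print(f"A: {response.choices[0].message.content}")
            print()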



>Test suites are not reflection complete!

This is true of all the existing NLP benchmarks, but I don't see why it should be true in general. In machine vision, for example, benchmarks like ImageNet were still useful even when people were optimizing directly for them. (ImageNet shows its age now, but that's because it's too easy.)

I hope we can come up with something similarly robust for language. It can't just be a list of 1000 questions; otherwise it will end up in the training data and everyone will overfit to it.

For example, would it be possible to generate billions of trivia questions from WikiData? Good luck overfitting on that.
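As a rough illustration (my sketch, not the commenter's), the public Wikidata SPARQL endpoint can already be templated into question/answer pairs; country capitals are used here, but any property works the same way:

    import requests

    SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

    # Toy query: every (country, capital) pair, with English labels.
    QUERY = """
    SELECT ?countryLabel ?capitalLabel WHERE {
      ?country wdt:P31 wd:Q6256 .   # instance of: country
      ?country wdt:P36 ?capital .   # capital
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 100
    """

    resp = requests.get(
        SPARQL_ENDPOINT,
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "trivia-sketch/0.1 (example)"},
    )
    resp.raise_for_status()

    for row in resp.json()["results"]["bindings"]:
        country = row["countryLabel"]["value"]
        capital = row["capitalLabel"]["value"]
        print(f"Q: What is the capital of {country}?  A: {capital}")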


You can try holding a tournament between it and other models if you can think of a game for them to play.
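One way to score such a tournament is a round-robin with Elo ratings. In this sketch, play_game is a hypothetical stand-in for whatever game you choose; the pairing and Elo math are standard:

    import itertools

    def expected_score(r_a, r_b):
        # Standard Elo expectation of player A beating player B.
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

    def update_elo(r_a, r_b, score_a, k=32):
        # score_a is 1 for an A win, 0 for a loss, 0.5 for a draw.
        e_a = expected_score(r_a, r_b)
        return r_a + k * (score_a - e_a), r_b + k * ((1 - score_a) - (1 - e_a))

    def play_game(model_a, model_b):
        # Hypothetical: run one match of whatever game you picked
        # (20 questions, chess over an API, negotiation, ...) and
        # return 1, 0, or 0.5 from model_a's point of view.
        raise NotImplementedError

    models = ["model-a", "model-b", "model-c"]  # placeholder names
    ratings = {m: 1500.0 for m in models}

    for a, b in itertools.combinations(models, 2):  # round-robin pairings
        score_a = play_game(a, b)
        ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], score_a)

    for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
        print(f"{model}: {rating:.0f}")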


Idea: Create a set of tests that AI experts vet but that are kept secret. New models are run against them, and only the scores are published.
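A minimal sketch of that protocol: the test file and grading stay private, and only an aggregate score is emitted. ask_model, grade_answer, and the file format here are hypothetical placeholders:

    import json

    def ask_model(model, question):
        # Hypothetical: call whatever API serves `model` and return its answer.
        raise NotImplementedError

    def grade_answer(model_answer, reference):
        # Hypothetical grader; in practice this step is expert- or rubric-based.
        return model_answer.strip().lower() == reference.strip().lower()

    # The test file never leaves the evaluators' machines.
    with open("private_tests.json") as f:
        tests = json.load(f)  # e.g. [{"question": ..., "reference": ...}, ...]

    model = "new-model-under-review"  # placeholder name
    correct = sum(
        grade_answer(ask_model(model, t["question"]), t["reference"]) for t in tests
    )

    # Only this aggregate number is published; the test items themselves are not.
    print(f"{model}: {correct}/{len(tests)} correct")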



