Test suites are not reflection complete! (https://sdrinf.com/reflection-completeness) Essentially, the moment a set of testing data gets significant traction, it becomes a target to optimize for.
Instead, I strongly recommend putting together a list of "control questions" of your own that covers the general and specific use cases you're interested in. Specifically, I'd recommend adding questions on topics you have a high degree of expertise in, and topics where you can figure out what an "expert" answer actually looks like; then run them against the available models yourself.
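To make this concrete, here's a rough sketch of what I mean, assuming you have API access to one of the models (the OpenAI Python client and the model name below are placeholders; swap in whatever you're actually comparing):

    # Run your own "control questions" against a model and dump the answers
    # for manual review. Model name and client are placeholders.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    control_questions = [
        "Explain the difference between a mutex and a semaphore.",
        "What does the Riesz representation theorem say?",
        # ...questions from your own areas of expertise...
    ]

    for question in control_questions:
        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder; use whichever models you're comparing
            messages=[{"role": "user", "content": question}],
        )
        print(f"Q: {question}\nA: {response.choices[0].message.content}\n")

The tooling isn't the point; the point is that you grade the answers yourself on topics where you can tell a genuinely expert answer from a plausible-sounding one.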
This is true of all the existing NLP benchmarks, but I don't see why it should be true in general. In machine vision, for example, benchmarks like ImageNet were still useful even when people were trying to optimize directly for them. (ImageNet shows its age now, but that's because it's too easy.)
I hope we can come up with something similarly robust for language. It can't just be a list of 1000 questions; otherwise it will end up in the training data and everyone will overfit to it.
For example, would it be possible to generate billions of trivia questions from WikiData? Good luck overfitting on that.
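It's not hard to sketch, at least: Wikidata exposes a public SPARQL endpoint, so you can pull (subject, property, object) triples and drop them into question templates. The property (P36, "capital") and the template below are one arbitrary example; a real benchmark would sample many properties and templates:

    # Sketch: turn Wikidata triples into trivia questions via the public
    # SPARQL endpoint. P31 = "instance of", Q6256 = "country", P36 = "capital".
    import requests

    SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

    QUERY = """
    SELECT ?countryLabel ?capitalLabel WHERE {
      ?country wdt:P31 wd:Q6256 .
      ?country wdt:P36 ?capital .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 50
    """

    def capital_questions():
        """Yield (question, answer) pairs built from Wikidata triples."""
        resp = requests.get(
            SPARQL_ENDPOINT,
            params={"query": QUERY, "format": "json"},
            headers={"User-Agent": "trivia-benchmark-sketch/0.1"},
        )
        resp.raise_for_status()
        for row in resp.json()["results"]["bindings"]:
            yield (f"What is the capital of {row['countryLabel']['value']}?",
                   row["capitalLabel"]["value"])

    for q, a in capital_questions():
        print(q, "->", a)

Scale the same idea across thousands of properties and templates and you get question sets too large, and too cheap to regenerate, for anyone to bother memorizing.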