Based on the statistical ensemble of observed uses, which is about as well as can be done for intersubjective matters like language, given adequate representational power (which transformer architectures have had since GPT-3).
Y'want to come up with a set of test questions and take some LLMs for a spin to assess the typical quality? This isn't unknowable. We can get a (statistical) answer with a bit of effort.
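A minimal sketch of the kind of harness that would give such a statistical answer: a fixed set of test questions, each paired with a keyword the answer should contain, and a hit rate per model. `ask_llm` here is a hypothetical stub with canned answers; a real run would swap in an actual API client for whichever LLMs you want to take for a spin.

```python
def ask_llm(model: str, question: str) -> str:
    # Hypothetical stub: a real version would call the model's API here.
    canned = {
        "What is the capital of France?": "The capital of France is Paris.",
        "What does 'bank' mean by a river?": "By a river, 'bank' means the sloping ground at the water's edge.",
    }
    return canned.get(question, "")

def score_model(model: str, tests: list[tuple[str, str]]) -> float:
    """Fraction of questions whose answer contains the expected keyword."""
    hits = sum(
        1 for question, keyword in tests
        if keyword.lower() in ask_llm(model, question).lower()
    )
    return hits / len(tests)

tests = [
    ("What is the capital of France?", "Paris"),
    ("What does 'bank' mean by a river?", "water"),
]

print(score_model("some-llm", tests))  # hit rate over the test set
```

Keyword matching is deliberately crude; for anything beyond a first pass you'd want graded rubrics or a human (or second model) judging the answers, but even this gives a defensible statistical estimate of typical quality.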