My first instinct was creativity. I had models generate poems, short stories, and metaphors: the kind of rich, open-ended output that feels like it should reveal deep differences in cognitive ability. I used an LLM-as-judge to score the outputs, but the results were pretty bad. With some engineering I managed to fix the LLM-as-judge setup, and the scoring system turned out to be useful later for other things, so here it is: