How to test rationality? The case of argument evaluation

Critical thinking is often lamented as a skill sorely missing today. Popular discourse can sometimes feel ‘irrational’, in the sense that people do not seem to really engage with the other side’s arguments, especially on political topics, because they are biased by their own views.

I stumbled upon the CART (Comprehensive Assessment of Rational Thinking), also known as the “rationality quotient test”. I found it to be a great endeavor (much better than whatever IQ is), but it is still a prototype, and there doesn’t seem to be any official way to take the test.

There is a vibe-coded website that lets you take a test inspired by the CART. It has flaws, but if you want to get the gist of the CART in a quick and interactive way, it is not bad. But it’s definitely NOT the CART itself.

But this is all just context; what I am interested in here is the “argument evaluation subset” test, which uses 23 questions to examine how well you can assess the quality of arguments without being biased by your own beliefs. In each question, someone named Dale holds a belief and gives a quick justification for it. A critic then offers a counter-argument, and we have to evaluate Dale’s rebuttal to the critic.

The correct answers are based on the median ratings of 8 “experts”. And this got me to ask: is that sound? Does such a test have to be based on experts’ ratings? Can’t we have objectively correct answers to such a test, and an objectively correct way to find them?

You can take a look at the test and experts’ ratings yourself here[1]. Do you agree with the ratings? Do you think some of the ratings are ‘wrong’? What do you think about the test or methodology overall?

Tell me if I should just copy everything into the thread, since you have to download files to see the questions and ratings. I figured I could use pastebin, so the questions and ratings are here.


  1. Download the first file, unzip it, and it’s the file named ‘Argument_Evaluation_Subset.doc’ ↩︎


Jesus, Dale’s response to the critic regarding capital punishment for murderers: “They get what they deserve.” I’d rather not assume that to be factually correct, but I guess they had to scrape the bottom of the barrel at some point. Not too important to the topic at hand, though.

I think that if we were dealing with more fleshed-out scenarios and arguments, in which we could apply decision theory toward some specific end other than just the general strength of a given argument (rebuttal), then yes, we could potentially come to an objectively correct answer. That being said, rating arguments in such a nebulous way on such a test is probably necessary because of the way humans think, and I don’t think many test-takers would be inclined to start drawing decision trees or the like, even if they knew what one was and it could apply to evaluating Dale’s rebuttals.

That is, whatever means the experts would have to use to reach an answer about the strength of Dale’s rebuttals would need to be reasonably expected of the test-takers. So I think it would be best to keep it within the range of what can be reasonably expected of the most rational “experts”, absent any specific, arbitrary knowledge like game theory or something.

I’m not sure. It seems like it might be worthwhile to develop.

Can’t we have a middle ground between the current nebulous approach and full-blown decision theory?
For example, maybe instead of asking for a nebulous “argument quality” rating, we could divide it into more precise parts. One could be something like “goalpost moving”: I feel it’s already less subjective if we try to evaluate whether Dale is defending a different statement than the original belief.
Also, something like having a (good) description of the difference between “Very Weak” and “Weak”. Do you think those kinds of changes wouldn’t improve matters enough, and that we’d still need experts’ ratings unless we go to something like decision theory?

Take the last question for example:

1. Dale's belief: The government's lean burn project, which is developing a fuel efficient jet engine, should be continued.

* Dale’s premise or justification for belief: The lean burn project should be continued because it could eventually result in much lower consumption of jet fuel, and could save $1.5 billion in a ten year period.

* Critic’s counter-argument: While it is true that the lean burn project could save $1.5 billion in a ten year period, the cost of completing the project is going to exceed $2 billion (assume statement factually correct).

* Dale’s rebuttal to Critic’s counter-argument: Because $4 billion has already been invested in the project (assume statement factually correct), it would be foolish to halt the project and waste the money already invested.

* Indicate the strength of Dale’s rebuttal to the Critic’s counter-argument:

* 23) A=Very Weak B=Weak C=Strong D=Very Strong

2. Experts' rating: 1.5

This is a classic case of the sunk cost fallacy. Dale says it would be foolish to waste the money, but the money is already gone. Worse, it doesn’t even address the critic’s damning point: if the project continues, they’ll end up losing more money overall. Yet this got a 1.5 and not a 1. To me, some of the experts were clearly wrong and could be convinced, upon further reflection, that 1 is the answer. I don’t see how this response is anything but irrelevant and fallacious, and I don’t think I need decision theory.
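To make this concrete, here’s the arithmetic as a minimal Python sketch (the dollar figures are from the question; the variable names and framing are mine):

```python
SUNK = 4.0        # $ billions already invested (spent whether or not we continue)
COMPLETION = 2.0  # $ billions still needed to finish the project
SAVINGS = 1.5     # $ billions in jet fuel saved over ten years if finished

# Looking only at money that hasn't been spent yet:
halt_now = 0.0                     # stop: no further cost, no savings
keep_going = SAVINGS - COMPLETION  # finish: 1.5 - 2.0 = -0.5

# Including the sunk cost shifts both outcomes by the same amount,
# so it can't change which option is better:
print(halt_now - SUNK)    # -4.0
print(keep_going - SUNK)  # -4.5
```

Either way you frame it, halting comes out $0.5 billion ahead, and the $4 billion Dale appeals to shifts both outcomes equally, so it can’t change which option is better. That’s why the rebuttal doesn’t touch the critic’s point.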

Another example is cases 2 and 11. To me, those are practically the same. Dale’s response is saying, “They’ll disobey the law anyway,” yet one got a 1 rating and the other a 2 rating. There is some principle I can’t quite articulate that tells me those two ought to have the same rating for it to be consistent.

I’m not sure. It seems like it might be worthwhile to develop.

If you mean “develop the test into something better”, I find it crazy that I haven’t found anything more recent on this kind of argument evaluation (I haven’t looked very long, though), even though this is from 1997. I don’t know if the field (psychology) just isn’t interested in it, or if this test is seen as already being very good.

The question of the quality of enlightenment an individual or group carries in their reasoning structures is answered, to a great extent, by their engagement in any interaction inviting discourse.

I would encourage anyone not overly familiar with the notions of “Eristic” and “Dialectic” to look at a recent presentation published on the YouTube channel “Philosophy Coded”, in which the narrator marks out Schopenhauer’s thought on whether another’s form of expression can sustain inquiry and reflective argument.

Those notions definitely seem relevant to the topic here.

I’ll watch the video, but it seems tangential and also not necessarily trustworthy.


This (RQ) is an impressive development on the ancient idea of IQ (Binet/Goddard/others). I skimmed through the website and the general thrust is comprehensiveness, to expand intelligence assessment to cover skills/abilities not included in standard IQ tests. This is progress and might, inter alia, explain anomalies such as failed “geniuses” and obscenely successful “morons” (absit iniuria).

Yes, it is sound to use experts to evaluate an argument. I find the use of “objective” here a wee bit of an elephant in the room, because both Dale and the experts are trying to be as objective as they can be, and notice that the arguments require some already established facts to support each side. Gun control, as Dale has argued, works because other countries with strict gun regulations have fewer fatalities in gun-related incidents. That argument can only work with established records.

I am not sure I’d go that far, but yes, I also find the idea of a rationality quotient interesting. What do you think about the argument evaluation test? Is it a useful subtest? Is it done well?

Well, the facts are generally assumed to be true, so we aren’t really doing any empirical work here or evaluating any empirical claim.

I find the use of “objective” here a wee bit of an elephant in the room, because both Dale and the experts are trying to be as objective as they can be

The experts could sincerely try to be objective and correct, but this does not mean they can successfully do so. What do you think of the case (23) I wrote about earlier, for example? The current rating of 2 means that at the very least some experts put a rating of 2, and it’s possible some even put a rating of 3 (‘Strong’) for what’s basically a sunk cost fallacy. To me, those ratings simply degrade the final rating, so I am appealing to a higher ‘principle’ than the board of experts in order to judge those experts’ ratings.

Do you think I am right in this particular case? And even if I am right, is it still okay overall to rely on experts, or could a better way be found, or is relying on experts simply necessary?

Another case: for some questions the rating is 2.5, which guarantees that some experts rated 2 or below (‘Weak’) and some rated 3 or above (‘Strong’). It seems that in this case, one of the two groups of experts is simply wrong, and it would be weird to let the wrong answers dilute the right one.
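To spell that out, here’s a quick brute-force check in Python, assuming the reported score is simply the median of 8 integer ratings on the 1–4 scale (my reading of the methodology; the CART may aggregate differently):

```python
from itertools import combinations_with_replacement
from statistics import median

# Every possible panel of 8 integer ratings on the 1-4 scale (order doesn't
# matter for the median), keeping only panels whose median is exactly 2.5.
for panel in combinations_with_replacement(range(1, 5), 8):
    if median(panel) == 2.5:
        # Such a panel always straddles the Weak/Strong boundary:
        assert min(panel) <= 2 and max(panel) >= 3
print("every panel with median 2.5 has at least one rating <= 2 and one >= 3")
```

So a 2.5 can only come from a panel that is split across the ‘Weak’/‘Strong’ line.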

The rating uses beta weights to ensure that outliers do not figure at all in the final score. So, rest assured that there was not a score of 3 in it. Besides, Dale got a score of 1.5, not 2.
And what higher principle could we turn to? The examples are like micro-theses which could be peer-reviewed.
The sunk cost fallacy could only be used in certain cases; certainly not in building a railway, for example. One could very well argue to continue building it what with the amount of money already put into it. With its completion, it would be a fully functional railway. It went over budget twice, but now it is a complete infrastructure.

Oh yeah, research and development is a classic example of an endeavor abundantly showered with sunk cost fallacy attacks.

The rating uses beta weights to ensure that outliers do not figure at all in the final score. So, rest assured that there was not a score of 3 in it.

What do you mean by that? I don’t really know what they do, so an explanation would be useful. But how come you get a rating of 2.5, for example, if it doesn’t mean that some gave 2 or less and some gave 3 or more?

Besides, Dale got a score of 1.5, not 2.

Oh yeah, I forgot and mixed things up; it’s 1.5 instead of 2. What I wanted to say is that some put 2 while others put 1, and the right answer seems to just be 1.

And what higher principle could we turn to? The examples are like micro-theses which could be peer-reviewed.

Well, the principle that says the sunk cost fallacy is a fallacy, for example, independently of what experts might think. I guess an “irrelevance” principle.

The sunk cost fallacy could only be used in certain cases; certainly not in building a railway, for example. One could very well argue to continue building it what with the amount of money already put into it. With its completion, it would be a fully functional railway. It went over budget twice, but now it is a complete infrastructure.

That’s not a sunk cost fallacy. If you say “sure, we used money, maybe we went over budget, but once it’s done, it’s gonna be nice”, that’s a reasonable argument. The sunk cost fallacy is the opposite in a way: someone points out that finishing the project will be bad, and you respond that you must finish it nonetheless because you already invested effort, time, or money into it. But it clearly doesn’t matter that you already invested stuff in the project; those investments won’t be lost because you give up the project, they are already lost. If finishing will be negative, then the best thing to do is to give up. The sunk cost fallacy is this idea that it would be a waste to give up, even if finishing will be bad. You can look at the case; it really isn’t a good argument. At least, I hope that’s clear, otherwise I lose my point that we can ‘feel’ some higher principle.

CART’s argument evaluation uses a technique different from standard tests for judging an argument’s worth. In philosophical logic we check for validity/soundness if the argument is deductive and strength/cogency if the argument is inductive. CART uses a panel of experts, and arguments seem to be graded along a spectrum. I understand that this could make logicians and some folks using the old system uncomfortable, and the million-dollar question is, “what were CART’s developers thinking?” I guess I should dial down my optimism about CART’s abilities; not much by way of data is available for a proper analysis.

Because calculating scores using beta weights does not involve dividing one piece of raw data by another. I don’t know how to calculate using beta weights. All I know is that they use fractions to assign the scores you see in Dale’s argument.

It doesn’t matter. Do you disagree that by the looks of it, there are probably questions where some experts said 2 or less and some experts said 3 or more?

If you agree but don’t think it’s an issue, imagine someone making a math test where all the questions have numerical answers. To grade it, they ask 10 mathematicians what they think the answer is. On some questions, the mathematicians’ answers are not all the same, so the test-maker takes the mean/median. At the end, you get a score depending on how close you were to the mathematicians’ mean/median answer. Would you say this is a good test?
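Just to make the analogy concrete, here’s a minimal sketch of that grading scheme; the closeness measure (absolute distance to the panel’s median) and the numbers are made up for illustration:

```python
from statistics import median

def score_against_panel(test_taker_answers, panel_answers):
    """Penalize each answer by its distance to the panel's median answer.

    panel_answers[i] holds the mathematicians' answers to question i.
    A smaller total penalty means a 'better' score under this scheme,
    regardless of what the true answers actually are.
    """
    total_penalty = 0.0
    for answer, panel in zip(test_taker_answers, panel_answers):
        total_penalty += abs(answer - median(panel))
    return total_penalty

# Hypothetical question whose true answer is 12, but where the panel splits:
panel = [[12, 12, 12, 12, 12, 7, 7, 7, 7, 7]]   # panel median is 9.5
print(score_against_panel([12], panel))    # 2.5 penalty for the correct answer
print(score_against_panel([9.5], panel))   # 0.0 penalty for a wrong answer
```

Under a scheme like this, matching the panel beats matching the truth, which is exactly what bothers me about grading against experts’ medians.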

It does appear to be an example of the sunk cost fallacy. The (slight) defense I can see for the argument (audience dependent) is that the personal consequences of walking away from that sunk cost would be greater than finishing a project that doesn’t pencil out.

I think that would be reading too much into the argument.

Maybe, but it’s not much of a leap. Announcing the cancellation of a project that is 67% complete, after having sunk $4 billion into it, would be inherently traumatic.

The loss of $4 billion is traumatic, not the cancellation of the project. Continuing the project would be even more traumatic, as one would lose $4.5 billion instead: the $4 billion already spent, plus the $2 billion to finish, minus the $1.5 billion in savings.

In a perfectly rational world, the loss of $4.5 billion would be more traumatic than the loss of $4 billion. But we don’t live in a perfectly rational world. In a lot of companies, Dale could “fall” for the sunk cost fallacy, lose his employers an extra $500 million, and keep his job, or make the “smart” choice and lose his job.

In other words, in an irrational world a rational actor will sometimes need to act irrationally in pursuit of a rational end.