The Importance of Analyzing Essay-Type Test Items

Assessment: Why Item Analyses Are So Important

by Grant Wiggins, Authentic Education

As I have often written, the Common Core Standards are just common sense – but the devil is in the details of implementation. And in light of the excessive secrecy surrounding the test items and their months-later analysis, educators are in the unfortunate and absurd position of having to guess what the opaque results mean for instruction. It might be amusing if there weren’t high personal stakes of teacher accountability attached to the results.

So, using the sample of released items in the NY tests, I spent some time this weekend looking over the 8th grade math results and items to see what was to be learned – and I came away appalled at what I found.

Readers will recall that the whole point of the Standards is that they be embedded in complex problems that require both content and practice standards. But what were the hardest questions on the 8th grade test? Picayune, isolated, and needlessly complex calculations of numbers using scientific notation. And in one case, an item is patently invalid in its convoluted use of the English language to set up the prompt, as we shall see.

As I have long written, there is a sorry record in mass testing of sacrificing validity for reliability. This test seems like a prime example. Score what is easy to score, regardless of the intent of the Standards. There are 28 8th grade math standards. Why do such arguably less important standards have at least 5 items related to them? (Who decided which standards were most important? Who decided to test the standards in complete isolation from one another simply because that is psychometrically cleaner?)

Here are the released items related to scientific notation:

It is this last item that put me over the edge.

The item analysis. Here are the results from the BOCES report to one school on the item analysis for questions related to scientific notation. The first number, cast as a decimal, reflects the % of correct answers statewide in NY. So, for the first item, question #8, only 26% of students in NY got this one right. The following decimals reflect regional and local percentages for a specific district. Thus, in this district 37% got the right answer, and in this school, 36% got it right. The two remaining numbers reflect the differences between the district and school scores and the state score (.11 and .10, respectively).

Notice that, on average, only 36% of New York State 8th graders got these 5 questions right, pulling down their overall scores considerably.
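The arithmetic behind those difference columns is simple to reproduce. A minimal sketch, using only the figures quoted above for question #8 (the variable names are illustrative, not from the BOCES report itself):

```python
# Proportions correct on question #8, as quoted above from the BOCES report.
state, district, school = 0.26, 0.37, 0.36

# The report's last two columns: how far the district and the school
# outperformed the statewide proportion correct.
district_diff = round(district - state, 2)
school_diff = round(school - state, 2)

print(f"district vs. state: +{district_diff:.2f}")  # +0.11
print(f"school vs. state:   +{school_diff:.2f}")    # +0.10
```

Note that even these "better than the state" differences still leave roughly two-thirds of students getting the item wrong.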

Now ask yourself: given the poor results on all 5 questions – questions that involve isolated and annoying computations, hardly central to the import of the Standards – would you be willing to consider this as a valid measure of the Content and Process Standards in action? And would you be happy if your accountability scores went down as a teacher of 8th grade math, based on these results? Neither would I.

There are 28 Standards in 8th grade math. Scientific Notation consists of 4 of the Standards. Surely from an intellectual point of view the many standards on linear relationships and the Pythagorean theorem are of greater importance than scientific notation. But the released items and the math suggest each standard was assessed 3-4 times in isolation prior to the few constructed response items. Why 5 items for this Standard?

It gets worse. In the introduction to the released tests, the following reassuring comments are made about how items will be analyzed and discussed:

Fair enough: you cannot read the student’s mind. At least you DO promise me helpful commentary on each item. But note the third sentence: The rationales describe why the wrong answer choices are plausible but incorrect and are based on common errors in computation. (Why only computation? Is this an editorial oversight?) Let’s look at an example for arguably the least valid of the five questions:

Oh. It is a valid test of understanding because you say it is valid. Your proof of validity comes from simply reciting the standard and saying this item assesses that.

Wait, it gets even worse. Here is the “rationale” for the scoring, with commentary:

Note the difference in the rationales provided for wrong answers B and C: “may have limited understanding” vs. “may have some understanding… but may have made an error when obtaining the final result.”

This raises a key question unanswered in the item analysis and in the test specs. Does computational error = lack of understanding? Should Answers B and C be scored equal? (I think not, given the intent of the Standards). The student “may have some understanding” of the Standard or may not. Were Answers B and C treated equally? We do not know; we can’t know given the test security.

So, all you are really saying is: wrong answer.

Answers A, B, C are plausible but incorrect. They represent common student errors made when subtracting numbers expressed in scientific notation. Huh? Are we measuring subtraction here or understanding of scientific notation? (Look back at the Standard.)

Not once does the report suggest an equally plausible analysis: students were unable to figure out what this question was asking!!! The English is so convoluted, it took me a few minutes to check and double-check whether I parsed the language properly:

Plausible but incorrect… The wrong answers are “plausible but incorrect.” Hey, wait a minute: that language sounds familiar. That’s what it says under every other item! For example:

All they are doing is copying and pasting the same sentence, item after item, and then substituting in the standard being assessed!!  Aren’t you then merely saying: we like all our distractors equally because they are all “plausible” but wrong?

Understanding vs. computation. Let’s look more closely at another set of rationales for a similar problem, to see if we see the same jumbling together of conceptual misunderstanding and minor computational error. Indeed, we do:

Look at the rationale for B, the correct answer: it makes no sense. Yes, the answer is 4 squared, which is an equivalent expression to the prompt. But then they say: “The student may have correctly added the exponents.” That very insecure conclusion is then followed, inexplicably, by great confidence: “A student who selects this response understands the properties of integer exponents…” – which is, of course, just the Standard restated. Was this blind recall of a rule, or is it evidence of real understanding? We’ll never know from this item and this analysis.

In other words, all the rationales are doing, really, is claiming that the item design is valid – without evidence. We are in fact learning nothing about student understanding, the focus of the Standard.

Hardly the item analysis trumpeted at the outset.

Not what we were promised. More fundamentally, these are not the kinds of questions the Common Core promised us. Merely making the computations trickier is cheap psychometrics, not an insight into student understanding. They are testing what is easy to test, not necessarily what is most important.

By contrast, here is an item from the test that assesses for genuine understanding:

This is a challenging item – perfectly suited to the Standard and the spirit of the Standards. It requires understanding the hallmarks of linear and nonlinear relations and doing the needed calculations based on that understanding to determine the answer. But this is a rare question on the test.

Why should the point value of this question be the same as the scientific notation ones?

In sum: questionable. This patchwork of released items, bogus “analysis,” and copy-and-paste “commentary” gives us little insight into the key questions: where are my kids in terms of the Standards? What must we do to improve performance against these Standards?

My weekend analysis, albeit informal, gives me little faith in the operational understanding of the Standards in this design – without further data on how item validity was established, whether any attempt was made to carefully distinguish computational from conceptual errors in the design and scoring, and whether the test makers even understand the difference between computation and understanding.

It is thus inexcusable for such tests to remain secure, with item analysis and released items dribbled out at the whim of the DOE and the vendor. We need a robust discussion as to whether this kind of test measures what the Standards call for, a discussion that can only occur if the first few years of testing lead to a release of the whole test after it is taken.

New York State teachers deserve better.

This article first appeared on Grant’s personal blog.

Constructing tests

Designing tests is an important part of assessing students’ understanding of course content and their level of competency in applying what they are learning. Whether you use low-stakes, frequent evaluations (quizzes) or high-stakes, infrequent evaluations (midterms and finals), careful design will help provide more calibrated results.

Here are a few general guidelines to help you get started:

  • Consider your reasons for testing.
    • Will this quiz monitor the students’ progress so that you can adjust the pace of the course?
    • Will ongoing quizzes serve to motivate students?
    • Will this final provide data for a grade at the end of the quarter?
    • Will this mid-term challenge students to apply concepts learned so far?

The reason(s) for giving a test will help you determine features such as length, format, level of detail required in answers, and the time frame for returning results to the students.

  • Maintain consistency between goals for the course, methods of teaching, and the tests used to measure achievement of goals. If, for example, class time emphasizes review and recall of information, then so can the test; if class time emphasizes analysis and synthesis, then the test can also be designed to demonstrate how well students have learned these things.
  • Use testing methods that are appropriate to learning goals. For example, a multiple choice test might be useful for demonstrating memory and recall, but an essay or open-ended problem may be required for students to demonstrate more independent analysis or synthesis.
  • Help students prepare. Most students will assume that the test is designed to measure what is most important for them to learn in the course. You can help students prepare for the test by clarifying course goals as well as reviewing material. This will allow the test to reinforce what you most want students to learn and retain.
  • Use consistent language (in stating goals, in talking in class, and in writing test questions) to describe expected outcomes. If you want to use words like explain or discuss, be sure that you use them consistently and that students know what you mean when you use them.
  • Design test items that allow students to show a range of learning. That is, students who have not fully mastered everything in the course should still be able to demonstrate how much they have learned.

Multiple choice exams

Multiple choice questions can be difficult to write, especially if you want students to go beyond recall of information, but the exams are easier to grade than essay or short-answer exams. On the other hand, multiple choice exams provide less opportunity than essay or short-answer exams for you to determine how well the students can think about the course content or use the language of the discipline in responding to questions.

If you decide you want to test mostly recall of information or facts and you need to do so in the most efficient way, then you should consider using multiple choice tests.

The following ideas may be helpful as you begin to plan for a multiple choice exam:

  • Since question wording can mislead or be misinterpreted, try to have a colleague answer your test questions before the students do.
  • Be sure that the question is clear within the stem so that students do not have to read the various options to know what the question is asking.
  • Avoid writing items that lead students to choose the right answer for the wrong reasons. For instance, avoid making the correct alternative the longest or most qualified one, or the only one that is grammatically appropriate to the stem.
  • Try to design items that tap students’ overall understanding of the subject. Although you may want to include some items that only require recognition, avoid the temptation to write items that are difficult because they are taken from obscure passages (footnotes, for instance).
  • Consider a formal assessment of your multiple-choice questions with what is known as an “item analysis” of the test.
    For example:
    • Which questions proved to be the most difficult?
    • Were there questions which most of the students with high grades missed?

This information can help you identify areas in which students need further work, and can also help you assess the test itself: Were the questions worded clearly? Was the level of difficulty appropriate? If scores are uniformly high, for example, you may be doing everything right, or have an unusually good class. On the other hand, your test may not have measured what you intended it to.
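The two diagnostic questions above correspond to the classical difficulty index (the proportion of students answering correctly) and the discrimination index (how much better the top-scoring students did on an item than the bottom-scoring students). A minimal sketch of both calculations, using an invented 0/1 response matrix purely for illustration:

```python
# Classical item analysis: difficulty (p) and upper-lower discrimination (D).
# The response matrix is invented sample data: rows are students,
# columns are items, 1 = correct answer.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
]

n_students = len(responses)
n_items = len(responses[0])

# Rank students by total score, then split into upper and lower halves.
ranked = sorted(responses, key=sum, reverse=True)
half = n_students // 2
upper, lower = ranked[:half], ranked[half:]

for item in range(n_items):
    p = sum(row[item] for row in responses) / n_students      # difficulty
    d = (sum(row[item] for row in upper) / half
         - sum(row[item] for row in lower) / len(lower))      # discrimination
    flag = "  <- high scorers missed this" if d < 0 else ""
    print(f"item {item + 1}: difficulty={p:.2f} discrimination={d:+.2f}{flag}")
```

In this made-up data, item 4 comes out with a negative discrimination: the students with the highest total scores missed it more often than the weakest students did, which is exactly the red flag described above that an item may be poorly worded or mis-keyed.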

Essay questions


“Essay tests let students display their overall understanding of a topic and demonstrate their ability to think critically, organize their thoughts, and be creative and original. While essay and short-answer questions are easier to design than multiple-choice tests, they are more difficult and time-consuming to score. Moreover, essay tests can suffer from unreliable grading; that is, grades on the same response may vary from reader to reader or from time to time by the same reader. For this reason, some faculty prefer short-answer items to essay tests. On the other hand, essay tests are the best measure of students’ skills in higher-order thinking and written expression.”
(Barbara Gross Davis, Tools for Teaching, 1993, 272)

When are essay exams appropriate?

  • When you are measuring students’ ability to analyze, synthesize, or evaluate
  • When you have been teaching at these levels (e.g., writing-intensive courses, upper-division undergraduate seminars, graduate courses) or the content lends itself to critical analysis rather than recall of information

How do you design essay exams?

  • Be specific
  • Use words and phrases that alert students to the kind of thinking you expect; for example, identify, compare, or critique
  • Indicate with points (or time limits) the approximate amount of time students should spend on each question and the level of detail expected in their responses
  • Be aware of time; practice taking the exam yourself or ask a colleague to look at the questions

How do you grade essay exams?

  • Develop criteria for appropriate responses to each essay question
  • Develop a scoring guide that tells what you are looking for in each response and how much credit you intend to give for each part of the response
  • Read all of the responses to question 1, then all of the responses to question 2, and on through the exam. This will provide a more holistic view of how the class answered the individual questions

How do you help students succeed on essay exams?

  • Use study questions that ask for the same kind of thinking you expect on exams
  • During lecture or discussion emphasize examples of thinking that would be appropriate on essay exams
  • Provide practice exams or sample test questions
  • Show examples of successful exam answers

Assessing your test

Regardless of the kind of exams you use, you can assess their effectiveness by asking yourself some basic questions:

  • Did I test for what I thought I was testing for?
    If you wanted to know whether students could apply a concept to a new situation, but mostly asked questions determining whether they could label parts or define terms, then you tested for recall rather than application.
  • Did I test what I taught?
    For example, your questions may have tested the students’ understanding of surface features or procedures, while you had been lecturing on causation or relation–not so much what the names of the bones of the foot are, but how they work together when we walk.
  • Did I test for what I emphasized in class?
    Make sure that you have asked most of the questions about the material you feel is the most important, especially if you have emphasized it in class. Avoid questions on obscure material that are weighted the same as questions on crucial material.
  • Is the material I tested for really what I wanted students to learn?
    For example, if you wanted students to use analytical skills such as the ability to recognize patterns or draw inferences, but only used true-false questions requiring non-inferential recall, you might try writing more complex true-false or multiple-choice questions.
