The exam and test development cycle is the process of building a high-quality assessment in accordance with international standards such as NCCA or AERA/APA/NCME, so that the resulting scores have validity. If there are stakes associated with the test, it is critical that those standards are met and the test is well built. Developing and maintaining a good test is not easy, and requires sophisticated work such as standard setting, job analysis, and item response theory. It is often a cycle rather than a single process, because organizations are continually developing the next form(s) of their test.
Major Phases in the Test Development Cycle

So what are the steps of the test development cycle? Well, there is a well-defined and widely recognized process, though it is rarely exactly the same for every organization. For example, the step of designing and blueprinting the test is quite different for certification vs. university admissions vs. K-12 benchmark testing, but you do need that step in every case.
Even within the realm of certification, there is variation. Accreditation guidelines say that you need to complete these steps, but leave the specific approach up to you. For example, you are required to conduct a cutscore study, but you are allowed to choose the Bookmark method, the Angoff method, or another defensible approach.
This article provides a brief introduction to the major steps of the test development cycle. If you prefer a video walkthrough, one is available on our YouTube channel.
Define the Construct and Target Market
The first step is to define what exactly the test is for, and how you want the scores to be interpreted. This is because validity, the most important concept in assessment, is the body of evidence supporting the intended interpretations of test scores. So, obviously, we need to start by stating what those interpretations are.
It’s also important to define the target market. It’s not just “mastery of fractions in math” but “mastery of fractions in the 4th grade math curriculum for Minnesota.” In certification, it’s not just knowledge of ophthalmology; it is the level of knowledge needed to be minimally competent as an ophthalmic assistant.
Test Design: Job Analysis or Curriculum
A job analysis study provides the vehicle for defining the important job knowledge, skills, and abilities (KSA) that will later be translated into content on a certification or pre-employment exam. During a job analysis, important job KSAs are obtained by directly analyzing job performance of highly competent job incumbents or surveying subject-matter experts regarding important aspects of successful job performance. The job analysis generally serves as a fundamental source of evidence supporting the validity of scores for certification exams.
If the assessment is in K-12 education, there is typically a curriculum that the test is tied to, and therefore the design should be aligned. Admissions and placement testing require a plan based on the goal. Are you trying to predict success in technical undergraduate programs? If so, what are the abilities that maximize this?
Test Specifications and Blueprints
The results of the job analysis study are quantitatively converted into a blueprint for the certification test. Basically, it comes down to this: if the experts say that a certain topic or skill is performed quite often or is very critical, then it deserves more weight on the exam, right? There are different ways to do this; my favorite article on the topic is Raymond and Neustel (2006), and here's a free tool to help. The image below shows an example of how to take the job task analysis survey ratings and turn them into blueprints.
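As a rough illustration of the idea, here is a minimal sketch of one common weighting approach: combine each domain's mean frequency and criticality ratings, then normalize so the weights sum to 100%. The domain names and rating values below are invented for illustration; a real study would use the actual SME survey results and the weighting scheme chosen by the program.

```python
# Hypothetical job task analysis results: mean frequency and criticality
# ratings (1-5 scale) per content domain, invented for illustration.
ratings = {
    "Patient history":   {"frequency": 4.6, "criticality": 4.2},
    "Visual assessment": {"frequency": 4.1, "criticality": 4.8},
    "Instrument care":   {"frequency": 2.9, "criticality": 3.1},
}

# One common approach: combine the two ratings multiplicatively,
# then normalize so the blueprint weights sum to 100%.
importance = {d: r["frequency"] * r["criticality"] for d, r in ratings.items()}
total = sum(importance.values())
blueprint = {d: round(100 * v / total, 1) for d, v in importance.items()}

# Translate percentage weights into item counts for a 100-item form.
test_length = 100
items_per_domain = {d: round(test_length * w / 100) for d, w in blueprint.items()}
print(blueprint)
print(items_per_domain)
```

Other weighting schemes (e.g., summing the ratings, or averaging separate frequency-based and criticality-based weights) are also defensible; the choice should be documented as part of the validity evidence.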

For educational assessment, the process is not as straightforward, but still necessary. You might have a curriculum that clearly outlines the topics for an 8th grade math test. But how many items should come from each topic, and of what item type, and at what level of Bloom’s taxonomy? You’ll likely need a focus group of experts.
Item Development
After important job KSAs and learning objectives are established, subject-matter experts write test items to assess them. The end result is an item bank from which exam forms can be constructed. The quality of the item bank also supports test validity. A key operational step is developing an Item Writing Guide and holding an item writing workshop for the SMEs.
Pilot Testing
There should be evidence that each item in the bank actually measures the content that it is supposed to measure. In order to assess this, data must be gathered from samples of test-takers. After items are written, they are generally pilot tested by administering them to a sample of examinees in a low-stakes context. Once pilot data are obtained, a psychometric analysis of the test and test items can be performed. This analysis yields statistics that indicate the degree to which the items measure the intended test content. Items that appear to be weak indicators of the test content are generally removed from the item bank, or flagged for review by subject-matter experts for correctness and clarity.
Note that this is not always possible, and is one of the ways that different organizations diverge in how they approach exam development.
Standard Setting
Standard setting is also a critical source of evidence supporting the validity of pass/fail decisions made on the basis of test scores. Standard setting is the process by which a passing score (or cutscore) is established: the point on the score scale that differentiates between examinees who are and are not deemed competent to perform the job.
In order to be valid, the cutscore cannot be arbitrarily defined. Two examples of arbitrary methods are the quota (setting the cut score to produce a certain percentage of passing scores) and the flat cutscore (such as 70% on all tests). Both of these approaches ignore the content and difficulty of the test. Avoid these!
Instead, the cutscore must be based on one of several well-researched criterion-referenced methods from the psychometric literature. There are two types of criterion-referenced standard-setting procedures (Cizek, 2006): examinee-centered and test-centered.
The most frequently used test-centered method is the Modified Angoff Method (Angoff, 1971), which requires a committee of subject-matter experts (SMEs). Another commonly used approach is the Bookmark Method.
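The arithmetic at the heart of a Modified Angoff study is simple: each SME estimates, for each item, the probability that a minimally competent candidate would answer it correctly; the ratings are averaged per item and summed to a recommended raw cutscore. The sketch below uses invented ratings for a 5-item test purely to show the calculation; a real study also involves SME training, discussion rounds, and often impact data.

```python
# Hypothetical Modified Angoff ratings: rows = SMEs, columns = items.
# Each value is the estimated probability that a minimally competent
# candidate answers that item correctly.
angoff_ratings = [
    [0.70, 0.60, 0.85, 0.50, 0.90],  # SME 1
    [0.65, 0.55, 0.80, 0.60, 0.85],  # SME 2
    [0.75, 0.60, 0.90, 0.55, 0.95],  # SME 3
]

# Average across SMEs for each item, then sum across items to obtain
# the recommended raw-score cutscore.
n_smes = len(angoff_ratings)
item_means = [sum(col) / n_smes for col in zip(*angoff_ratings)]
cutscore = sum(item_means)
print(round(cutscore, 2))  # recommended raw cutscore on this 5-item test
```

In practice the panel would review items where SMEs disagree sharply, and the final cutscore recommendation goes to the governing board for approval.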
Equating
If the test has more than one form – which is required by NCCA Standards and other guidelines – they must be statistically equated. If you use classical test theory, there are methods like Tucker or Levine. If you use item response theory, you can either bake the equating into the item calibration process with software like Xcalibre, or use conversion methods like Stocking & Lord.
What does this process do? Well, if this year's certification exam had an average 3 points higher than last year's, how do you know whether this year's form was 3 points easier, this year's cohort was 3 points smarter, or a mixture of both? Equating determines this. Learn more here.
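To make the idea concrete, here is a minimal sketch of linear equating using the mean-sigma approach, with invented score distributions. Note this is a simplification: the Tucker and Levine methods mentioned above additionally adjust for differences in group ability using common (anchor) items, rather than assuming the two groups are equivalent.

```python
# Simplified mean-sigma linear equating sketch. Scores are invented;
# real equating designs (Tucker, Levine, IRT-based) account for group
# differences via common items.
import statistics

form_x = [62, 70, 75, 68, 80, 74]  # raw scores on the new form (X)
form_y = [65, 73, 79, 71, 84, 77]  # raw scores on the old/base form (Y)

mx, sx = statistics.mean(form_x), statistics.stdev(form_x)
my, sy = statistics.mean(form_y), statistics.stdev(form_y)

def equate(score_on_x):
    """Map a Form X raw score onto the Form Y scale: y = my + (sy/sx)*(x - mx)."""
    return my + (sy / sx) * (score_on_x - mx)

# A score of 70 on the new form is expressed on the old form's scale.
print(round(equate(70), 2))
```

The linear transformation matches the means and standard deviations of the two forms, so a score at the mean of Form X maps to the mean of Form Y.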
Psychometric Analysis & Reporting
This is an absolutely critical step in the test development cycle. You need to statistically analyze the results to flag any items that are not performing well, so you can replace or modify them. This analysis looks at statistics like item p-value (difficulty), item point-biserial (discrimination), option/distractor analysis, and differential item functioning. You should also look at overall test reliability/precision and other psychometric indices. If you are accredited, you need to prepare year-end reports and submit them to the governing body. Learn more about item and test analysis.
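Two of the statistics mentioned above are straightforward to compute by hand. The sketch below uses an invented set of 0/1 item responses and total scores to show the p-value (proportion correct) and the point-biserial (the Pearson correlation between the 0/1 item score and the total score); production analyses would use dedicated psychometric software across the full response matrix.

```python
# Hypothetical classical item analysis for one item. Response data are
# invented: 0/1 scored responses for 8 examinees, plus their total scores.
import statistics

item = [1, 0, 1, 1, 0, 1, 0, 1]
total = [28, 15, 30, 24, 12, 27, 18, 25]

# p-value: proportion of examinees answering the item correctly (difficulty).
p_value = sum(item) / len(item)

# Point-biserial: Pearson correlation between the 0/1 item score and the
# total score (discrimination). High values mean strong examinees tend to
# get the item right.
mi, si = statistics.mean(item), statistics.stdev(item)
mt, st = statistics.mean(total), statistics.stdev(total)
cov = sum((i - mi) * (t - mt) for i, t in zip(item, total)) / (len(item) - 1)
point_biserial = cov / (si * st)

print(p_value, round(point_biserial, 2))
```

An item with a point-biserial near zero (or negative) is a typical flag for SME review, since it suggests the item does not separate strong from weak examinees.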
Exam & Test Development: It’s a Vicious Cycle
Now, consider the big picture: in most cases, an exam is not a one-and-done thing. It is re-used, perhaps continually. Often there are new versions released, perhaps based on updated blueprints or simply to swap out questions so that they don’t get overexposed. That’s why this is better conceptualized as a test development cycle, like the circle shown above. Often, major steps like Job Analysis are only done once every 5 years. Other steps like the rotation of item development, piloting, equating, and psychometric reporting might happen with each exam window (perhaps you do exams in December and May each year, or even far more frequently).
Getting Started in Test Development
ASC has extensive expertise in managing this cycle for professional credentialing exams, as well as many other types of assessments. Get in touch with us to talk to one of our psychometricians.
Want to delve deeper yourself? I recommend the aptly named Handbook of Test Development.

Nathan Thompson earned his PhD in Psychometrics from the University of Minnesota, with a focus on computerized adaptive testing. His undergraduate degree was from Luther College with a triple major of Mathematics, Psychology, and Latin. He is primarily interested in the use of AI and software automation to augment and replace the work done by psychometricians, work that has given him extensive experience in software design and programming. Dr. Thompson has published over 100 journal articles and conference presentations, but his favorite remains https://scholarworks.umass.edu/pare/vol16/iss1/1/ .

