Skip to main content

Nathan Thompson

Followers

3

Following

11

Co-author

1

Public Views

CEO of assess.com

less

University of Zagreb

Uppsala University

University of East London

Armando Marques-Guedes

UNL - New University of Lisbon

University of Leicester

Gwen Robbins Schug

University of North Carolina at Greensboro

Gabriel Gutierrez-Alonso

University of Salamanca

Macquarie University

Universidade Federal do Rio Grande do Sul

Swansea University

Uploads

Papers by Nathan Thompson

Development and Validation of a Comprehensive Genomics Knowledge Scale

Public Health Genomics, 2021

Background: Genomic testing is increasingly employed in clinical, research, educational, and comm... more Background: Genomic testing is increasingly employed in clinical, research, educational, and commercial contexts. Genomic literacy is a prerequisite for the effective application of genomic testing, creating a corresponding need for validated tools to assess genomics knowledge. We sought to develop a reliable measure of genomics knowledge that incorporates modern genomic technologies and is informative for individuals with diverse backgrounds, including those with clinical/life sciences training. Methods: We developed the GKnowM Genomics Knowledge Scale to assess the knowledge needed to make an informed decision for genomic testing, appropriately apply genomic technologies and participate in civic decision-making. We administered the 30-item draft measure to a calibration cohort (n = 1,234) and subsequent participants to create a combined valida-tion cohort (n = 2,405). We performed a multistage psychometric calibration and validation using classical test theory and item response theory (IRT) and conducted a post-hoc simulation study to evaluate the suitability of a computerized adaptive testing (CAT) implementation. Results: Based on exploratory factor analysis, we removed 4 of the 30 draft items. The resulting 26-item GKnowM measure has a single dominant factor. The scale internal consistency is α = 0.85, and the IRT 3-PL model demonstrated good overall and item fit. Validity is demonstrated with significant correlation (r = 0.61) with an existing genomics knowledge measure and significantly higher scores for individuals with adequate health literacy and healthcare providers (HCPs), including HCPs who work with genomic testing. The item bank is well suited to CAT, achieving high accuracy (r = 0.97 with the full measure) while administering a mean of 13.5 items. Conclusion: GKnowM is an updated, broadly relevant, rigorously validated 26-item measure for assessing genomics knowledge that we anticipate will be useful for assessing population genomic literacy and evaluating the effectiveness of genomics educational interventions.

The MEDPRO Project: An SBIR Project for a Comprehensive IRT and CAT Software System — CAT Software

Development of computerized adaptive tests (CAT) requires a number of appropriate software tools.... more Development of computerized adaptive tests (CAT) requires a number of appropriate software tools. This paper describes the development of two new CAT software programs. CATSIM has been designed specifically to conduct several different kinds of simulation studies, which are necessary for planning purposes as well as properly designing live CATs. FastCAT is a software system for banking items and publishing CAT tests as standalone files, to be administered anywhere. Both are available for public use.

Computerized Classification Testing with Composite Hypotheses

proposed a widely used method of applying the sequential probability ratio test (SPRT; to compute... more proposed a widely used method of applying the sequential probability ratio test (SPRT; to computerized classification testing with item response theory. This method formulates the classification problem as a point hypothesis that an examinee's ability, θ, is equal to a point, θ 1, below the cutscore or a point, θ 2, above the cutscore. The current paper argues that the actual goal of classification testing is a composite hypothesis that an examinee's ability θ is in the region of θ either above or below the cutscore, rather than equal to an arbitrarily defined point. A formulation of the SPRT to reflect this testing paradigm is proposed.

A Framework for the Development of Computerized Adaptive Tests

Practical Assessment, Research and Evaluation, Jan 10, 2011

A substantial amount of research has been conducted over the past 40 years on technical aspects o... more A substantial amount of research has been conducted over the past 40 years on technical aspects of computerized adaptive testing (CAT), such as item selection algorithms, item exposure controls, and termination criteria. However, there is little literature providing practical guidance on the development of a CAT. This paper seeks to collate some of the available research methodologies into a general framework for the development of any CAT assessment. Publish live CAT Publishing and distribution; software development This paper proceeds to discuss some of the issues relevant to each stage. This discussion, however, is by no means comprehensive. To the extent that each assessment program's situation is different and unique, it raises its own issues. Moreover, extensive attention has been given to individual aspects in other sources, such as the technical discussion of item exposure in . Therefore, an assessment program should utilize this framework as simply that, rather than as a comprehensive recipe, to identify issues relevant to the situation at hand and the type of research, business, or psychometric work necessary to present guidance for each decision. Thompson & Weiss, CAT Framework This is important not only from a practical viewpoint, but because this is the foundation for validity. A CAT developed without adequate research and documentation in each of these stages runs the danger of being inefficient at the least and legally indefensible at the worst. For example, arbitrarily setting specifications for a live CAT (termination criterion, maximum items, etc.) without empirical evidence for the choices could result in examinee scores that are simply not as accurate as claimed, providing some subtraction from the validity of their interpretations.

Advanced Methods of Designing Tests for Pass/Fail Decisions

Most examinations in the realm of professional regulatory testing are designed with the purpose o... more Most examinations in the realm of professional regulatory testing are designed with the purpose of pass/fail decisions for each examinee. Historically, these decisions have been made by administering a fixed number of items to each examinee, and assigning a "pass" decision if the examinee observed score is equal to or above a cutscore. However, this method is inefficient, in that it requires a large number of items to be administered before the decision is made. It is for this reason that computerized adaptive testing (CAT) methods were developed. Yet even CAT methods are not optimal for regulatory testing, as they are designed to obtain precise scores, and precise scores are not neededonly a pass/fail decision. A related methodology, computerized classification testing (CCT), specifically designs tests to provide this decision with as few items as possible, but retaining the decision accuracy of both fixed-form and CAT methods. This paper will explain differences between these approaches and provide a comparison of them to demonstrate the efficiency of the CCT approach for pass/fail testing. Simulations will be conducted for tests under each approach, and results compared in terms of decision accuracy (percentage correctly classified) and efficiency (average test length). The applicability of each method in the licensure/certification context will be discussed. Participants will be able to recognize the advantages and disadvantages to help determine if CCT or CAT methods might be useful for their testing program.

Utilizing the Generalized Likelihood Ratio as a Termination Criterion

Computer-based testing can be used to classify examinees into mutually exclusive groups. Currentl... more Computer-based testing can be used to classify examinees into mutually exclusive groups. Currently, the predominant psychometric algorithm for designing computerized classification tests (CCTs) is the sequential probability ratio test (SPRT; based on item response theory (IRT). The SPRT has been shown to be more efficient than confidence intervals around θ estimates as a method for CCT delivery . More recently, it was demonstrated that the SPRT, which only uses fixed values, is less efficient than a generalized form which tests whether a given examinee's θ is below θ 1 or above θ 2 . This formulation allows the indifference region to vary based on observed data. Moreover, this composite hypothesis formulation better represents the conceptual purpose of the test, which is to test whether θ is above or below the cutscore. The purpose of this study was to explore the specifications of the new generalized likelihood ratio (GLR; . As with the SPRT, the efficiency of the procedure depends on the nominal error rates and the distance between θ 1 and θ 2 . This study utilized a monte-carlo approach, with 10,000 examinees simulated under each condition, to evaluate differences in efficiency and accuracy due to hypothesis structure, nominal error rate, and indifference region size. The GLR was always at least as efficient as the fixed-point SPRT while maintaining equivalent levels of accuracy.

RUNNING HEAD: Classification Accuracy in CCT Cutscore Location and Classification Accuracy in Computerized Classification Testing

A common dependent variable in studies of computerized classification testing (CCT; Parshall, Spr... more A common dependent variable in studies of computerized classification testing (CCT; Parshall, Spray, Kalhon, & Davey, 2006) is the amount of classification error or accuracy. This is controlled in part by nominal values that are chosen as parameters for the termination criterion of the CCT. Past research, however, has focused on a comparison of observed classification error to assess the efficiency of various CCT algorithms, and has not evaluated the relationship between observed and nominal error rates. This study explored classification accuracy with CCT as a function of cutscore location with relation to the examinee distribution, as a cutscore near the center will lead to much more error than a very high or low cutscore. In addition to classification accuracy, the average number of items required to make a classification decision is also considered. A cutscore that is very high or low will fail or pass, respectively, most examinees with very few items. The purpose of this monte ...

Theory of Correspondences

Much attention has been given to professional certification examinations from a psychometric or p... more Much attention has been given to professional certification examinations from a psychometric or professional industry point of view. This paper argues that the most basic reason for the existence of professional certifications is economic, and that the credential is at its very core a product that is being sold by an organization. Within this paper a model has been developed that considers all of the critical stakeholders of the credentialing process, outlines the benefits of each of the stakeholders, and identifies the necessary contributions that each makes to the value of the credential, and finally, how that value might be increased or decreased. This model has important implications for credentialing organizations seeking to increase the volume or value of the certifications they offer.

A Framework for the Development of Computerized Adaptive Tests

Practical Assessment, Research and Evaluation, 2011

A substantial amount of research has been conducted over the past 40 years on technical aspects o... more A substantial amount of research has been conducted over the past 40 years on technical aspects of computerized adaptive testing (CAT), such as item selection algorithms, item exposure controls, and termination criteria. However, there is little literature providing practical guidance on the development of a CAT. This paper seeks to collate some of the available research methodologies into a general framework for the development of any CAT assessment.

Digital Module 19: Foundations of IRT Estimation

Educational Measurement: Issues and Practice, 2020

In this digital ITEMS module, Dr. Zhuoran Wang and Dr. Nathan Thompson introduce the basic item r... more In this digital ITEMS module, Dr. Zhuoran Wang and Dr. Nathan Thompson introduce the basic item response theory (IRT) item calibration and examinee scoring procedures as well as strategies to improve estimation accuracy. They begin the module with a conceptual review of IRT that includes core advantages of the IRT framework, commonly used IRT models, and essential components, such as information and likelihood functions. In the second part of the module, they illustrate the structure and inner workings of calibration and scoring algorithms, such as the MMLE/EM algorithm for item parameter calibration and the MLE, EAP, and MAP algorithms for examinee scoring. In part three, they demonstrate the influence of multiple factors on estimation accuracy and provide strategies for maximizing accuracy. In addition to audio-narrated slides, the digital module contains sample R code, quiz questions with diagnostic feedback, curated resources, and a glossary.

Stochastic Curtailment in Adaptive Mastery Testing

Applied Psychological Measurement, 2014

A well-known stopping rule in adaptive mastery testing is to terminate the assessment once the ex... more A well-known stopping rule in adaptive mastery testing is to terminate the assessment once the examinee’s ability confidence interval lies entirely above or below the cut-off score. This article proposes new procedures that seek to improve such a variable-length stopping rule by coupling it with curtailment and stochastic curtailment. Under the new procedures, test termination can occur earlier if the probability is high enough that the current classification decision remains the same should the test continue. Computation of this probability utilizes normality of an asymptotically equivalent version of the maximum likelihood ability estimate. In two simulation sets, the new procedures showed a substantial reduction in average test length while maintaining similar classification accuracy to the original method.

Computerized Classification Testing with Composite Hypotheses

Reckase (1983) proposed a widely used method,ofapplying the sequential probability ratio test (SP... more Reckase (1983) proposed a widely used method,ofapplying the sequential probability ratio test (SPRT; Wald, 1947) to computerized classification testing withitem response theory. This method formulates the classification problem as a point hypothesis that an examinee’s ability, θ, is equal to a point, θ1,below the cutscore or a point, θ2,above the cutscore. The current paper argues that the actual goal

Computerized Adaptive Testing: A Primer on Benefits, Design, and Implementation

PsycEXTRA Dataset

IRTEQ: Software for Linking and Equating with Item Response Theory

International Journal of Testing, 2010

An essential requirement for testing programs where every examinee does not see the same set of i... more An essential requirement for testing programs where every examinee does not see the same set of items is the capability to link different sets of items together. Such a situation can occur when multiple fixed forms are used, either at the same time or across years, but it is also necessary for bank-based testing methods such as computerized adaptive testing (CAT, Wainer, 2000) and linear-on-the-fly testing (Folk & Smith, 2002). To ensure the comparability of item parameters and ability (θ ) estimates, a linking and equating analysis must be performed. IRTEQ is a state-of-the-art software program that performs much of this task for testing programs developed with item response theory (IRT: Embretson & Reise, 2000). IRTEQ, developed by Kyung T. Han at the University of MassachusettsAmherst, is a Windows R © application that provides graphical output in addition to the standard text output. These two characteristics are a notable advantage over its predecessor, EQUATE (Baker, 1993). The Windows interface greatly increases the user-friendliness of the program, while the graphical output facilitates interpretation of the results. Moreover, EQUATE is limited to characteristic curve methods. Equating refers to the process of setting scores from two different scales onto the same scale. Linking refers to the same process, but with item parameters. In IRT, the same scale is used for both the scores and the item parameters, and therefore the process is equivalent. There are several linking/equating designs available to researchers and practitioners; IRTEQ is designed for the most common, the non-equivalent groups anchor test (NEAT) design.

A Comparative Study of the Impact of Certified and Noncertified Ophthalmic Medical Personnel on Practice Quality and Productivity

Eye & Contact Lens: Science & Clinical Practice, 2008

To compare ophthalmic practice productivity and performance attributes, as rated by employing oph... more To compare ophthalmic practice productivity and performance attributes, as rated by employing ophthalmologists, of noncertified and three levels of certified ophthalmic medical personnel. Methods. Three hundred eighty-five American and Canadian ophthalmologists in a clinic-based, stratified, random sample were surveyed regarding productivity performance and attributes of the ophthalmic medical personnel they employ. Instrument scales assessed 14 desirable professional attributes and 10 practice productivity measures. The attributes were credibility, reliability, competence, quality assurance, quality of patient care, knowledge base to make adjustments, increased skills (expertise), ability to work independently, broader knowledge base, ability to detect errors, ability to be trained to perform multiple roles in the practice, professional image, good judgment, and initiative and drive. The productivity measures were patient satisfaction, doctor productivity, trouble-shooting rapport, triage screening, effective patient flow, reduced patient complaints, increased referrals, number of patients per hour, revenue per patient, and patient follow-up. Participants indicated whether certified personnel more often showed these attributes and contributed to practice productivity measures as compared to noncertified personnel or whether there was no difference. Results were analyzed with a chi-square goodness-of-fit test. Survey reliability and validity were evaluated. Results. Significantly more ophthalmologists responded that the three levels of certified personnel contributed more to 5 of the 10 practice productivity measures (i.e., doctor productivity, troubleshooting rapport, triage screening, effective patient flow, and number of patients per hour). A statistically significant number of ophthalmologists also believed that certified personnel showed more of all 14 of the personal attributes considered desirable compared to noncertified ophthalmic medical personnel. Conclusions. Compared to noncertified personnel, the employment of certified ophthalmic personnel enhances the quality and productivity of an ophthalmic practice. Overall practice productivity is increased with certified ophthalmic medical personnel.

Item Selection in Computerized Classification Testing

Educational and Psychological Measurement, 2008

Several alternatives for item selection algorithms based on item response theory in computerized ... more Several alternatives for item selection algorithms based on item response theory in computerized classification testing (CCT) have been suggested, with no conclusive evidence on the substantial superiority of a single method. It is argued that the lack of sizable effect is because some of the methods actually assess items very similarly through different calculations and will usually select the same item. Consideration of methods that assess information across a wider range is often unnecessary under realistic conditions, although it might be advantageous to utilize them only early in a test. In addition, the efficiency of item selection approaches depend on the termination criteria that are used, which is demonstrated through didactic example and Monte Carlo simulation. Item selection at the cut score, which seems conceptually appropriate for CCT, is not always the most efficient option. A broad framework for item selection in CCT is presented that incorporates these points.

Advanced Methods of Designing Tests for Pass/Fail Decisions

Most examinations in the realm of professional regulatory testing are designed with the purpose o... more Most examinations in the realm of professional regulatory testing are designed with the purpose of pass/fail decisions for each examinee. Historically, these decisions have been made by administering a fixed number of items to each examinee, and assigning a "pass" decision if the examinee observed score is equal to or above a cutscore. However, this method is inefficient, in that it requires a large number of items to be administered before the decision is made. It is for this reason that computerized adaptive testing (CAT) methods were developed. Yet even CAT methods are not optimal for regulatory testing, as they are designed to obtain precise scores, and precise scores are not neededonly a pass/fail decision. A related methodology, computerized classification testing (CCT), specifically designs tests to provide this decision with as few items as possible, but retaining the decision accuracy of both fixed-form and CAT methods. This paper will explain differences between these approaches and provide a comparison of them to demonstrate the efficiency of the CCT approach for pass/fail testing. Simulations will be conducted for tests under each approach, and results compared in terms of decision accuracy (percentage correctly classified) and efficiency (average test length). The applicability of each method in the licensure/certification context will be discussed. Participants will be able to recognize the advantages and disadvantages to help determine if CCT or CAT methods might be useful for their testing program.

A Proposed Framework of Test Administration Methods

Journal of Applied Testing Technology, 2008

Utilizing the Generalized Likelihood Ratio As a Termination Criterion

Proceedings of the 2009 GMAC Conference on …, 2009

Opportunities and Challenges of AI in Educational Assessment

by Alper Sahin and Nathan Thompson

Journal of Measurement and Evaluation in Education and Psychology, 2024

In the past few years, as artificial intelligence (AI) and large language models (LLM) have rapid... more In the past few years, as artificial intelligence (AI) and large language models (LLM) have rapidly entered our
lives, we have witnessed groundbreaking innovations across numerous fields. The rapid pace of these changes has
been met with excitement by some and apprehension by others. However, we all agree that they have made
tremendous contributions so far and their contributions in the future will reshape our existence. The field of
educational assessment is no exception. With this in mind, we issued a call for a special issue themed
“Opportunities and Challenges of AI in Educational Assessment.” which finally included seven distinguished
articles on subthemes of fair and responsible use of AI in educational assessment, learning analytics, automated
scoring, and real-life examples of AI and LLM.

Development and Validation of a Comprehensive Genomics Knowledge Scale

Public Health Genomics, 2021

Background: Genomic testing is increasingly employed in clinical, research, educational, and comm... more Background: Genomic testing is increasingly employed in clinical, research, educational, and commercial contexts. Genomic literacy is a prerequisite for the effective application of genomic testing, creating a corresponding need for validated tools to assess genomics knowledge. We sought to develop a reliable measure of genomics knowledge that incorporates modern genomic technologies and is informative for individuals with diverse backgrounds, including those with clinical/life sciences training. Methods: We developed the GKnowM Genomics Knowledge Scale to assess the knowledge needed to make an informed decision for genomic testing, appropriately apply genomic technologies and participate in civic decision-making. We administered the 30-item draft measure to a calibration cohort (n = 1,234) and subsequent participants to create a combined valida-tion cohort (n = 2,405). We performed a multistage psychometric calibration and validation using classical test theory and item response theory (IRT) and conducted a post-hoc simulation study to evaluate the suitability of a computerized adaptive testing (CAT) implementation. Results: Based on exploratory factor analysis, we removed 4 of the 30 draft items. The resulting 26-item GKnowM measure has a single dominant factor. The scale internal consistency is α = 0.85, and the IRT 3-PL model demonstrated good overall and item fit. Validity is demonstrated with significant correlation (r = 0.61) with an existing genomics knowledge measure and significantly higher scores for individuals with adequate health literacy and healthcare providers (HCPs), including HCPs who work with genomic testing. The item bank is well suited to CAT, achieving high accuracy (r = 0.97 with the full measure) while administering a mean of 13.5 items. Conclusion: GKnowM is an updated, broadly relevant, rigorously validated 26-item measure for assessing genomics knowledge that we anticipate will be useful for assessing population genomic literacy and evaluating the effectiveness of genomics educational interventions.

The MEDPRO Project: An SBIR Project for a Comprehensive IRT and CAT Software System — CAT Software

Development of computerized adaptive tests (CAT) requires a number of appropriate software tools.... more Development of computerized adaptive tests (CAT) requires a number of appropriate software tools. This paper describes the development of two new CAT software programs. CATSIM has been designed specifically to conduct several different kinds of simulation studies, which are necessary for planning purposes as well as properly designing live CATs. FastCAT is a software system for banking items and publishing CAT tests as standalone files, to be administered anywhere. Both are available for public use.

Computerized Classification Testing with Composite Hypotheses

proposed a widely used method of applying the sequential probability ratio test (SPRT; to compute... more proposed a widely used method of applying the sequential probability ratio test (SPRT; to computerized classification testing with item response theory. This method formulates the classification problem as a point hypothesis that an examinee's ability, θ, is equal to a point, θ 1, below the cutscore or a point, θ 2, above the cutscore. The current paper argues that the actual goal of classification testing is a composite hypothesis that an examinee's ability θ is in the region of θ either above or below the cutscore, rather than equal to an arbitrarily defined point. A formulation of the SPRT to reflect this testing paradigm is proposed.

A Framework for the Development of Computerized Adaptive Tests

Practical Assessment, Research and Evaluation, Jan 10, 2011

A substantial amount of research has been conducted over the past 40 years on technical aspects o... more A substantial amount of research has been conducted over the past 40 years on technical aspects of computerized adaptive testing (CAT), such as item selection algorithms, item exposure controls, and termination criteria. However, there is little literature providing practical guidance on the development of a CAT. This paper seeks to collate some of the available research methodologies into a general framework for the development of any CAT assessment. Publish live CAT Publishing and distribution; software development This paper proceeds to discuss some of the issues relevant to each stage. This discussion, however, is by no means comprehensive. To the extent that each assessment program's situation is different and unique, it raises its own issues. Moreover, extensive attention has been given to individual aspects in other sources, such as the technical discussion of item exposure in . Therefore, an assessment program should utilize this framework as simply that, rather than as a comprehensive recipe, to identify issues relevant to the situation at hand and the type of research, business, or psychometric work necessary to present guidance for each decision. Thompson & Weiss, CAT Framework This is important not only from a practical viewpoint, but because this is the foundation for validity. A CAT developed without adequate research and documentation in each of these stages runs the danger of being inefficient at the least and legally indefensible at the worst. For example, arbitrarily setting specifications for a live CAT (termination criterion, maximum items, etc.) without empirical evidence for the choices could result in examinee scores that are simply not as accurate as claimed, providing some subtraction from the validity of their interpretations.

Advanced Methods of Designing Tests for Pass/Fail Decisions

Most examinations in the realm of professional regulatory testing are designed with the purpose o... more Most examinations in the realm of professional regulatory testing are designed with the purpose of pass/fail decisions for each examinee. Historically, these decisions have been made by administering a fixed number of items to each examinee, and assigning a "pass" decision if the examinee observed score is equal to or above a cutscore. However, this method is inefficient, in that it requires a large number of items to be administered before the decision is made. It is for this reason that computerized adaptive testing (CAT) methods were developed. Yet even CAT methods are not optimal for regulatory testing, as they are designed to obtain precise scores, and precise scores are not neededonly a pass/fail decision. A related methodology, computerized classification testing (CCT), specifically designs tests to provide this decision with as few items as possible, but retaining the decision accuracy of both fixed-form and CAT methods. This paper will explain differences between these approaches and provide a comparison of them to demonstrate the efficiency of the CCT approach for pass/fail testing. Simulations will be conducted for tests under each approach, and results compared in terms of decision accuracy (percentage correctly classified) and efficiency (average test length). The applicability of each method in the licensure/certification context will be discussed. Participants will be able to recognize the advantages and disadvantages to help determine if CCT or CAT methods might be useful for their testing program.

Utilizing the Generalized Likelihood Ratio as a Termination Criterion

Computer-based testing can be used to classify examinees into mutually exclusive groups. Currentl... more Computer-based testing can be used to classify examinees into mutually exclusive groups. Currently, the predominant psychometric algorithm for designing computerized classification tests (CCTs) is the sequential probability ratio test (SPRT; based on item response theory (IRT). The SPRT has been shown to be more efficient than confidence intervals around θ estimates as a method for CCT delivery . More recently, it was demonstrated that the SPRT, which only uses fixed values, is less efficient than a generalized form which tests whether a given examinee's θ is below θ 1 or above θ 2 . This formulation allows the indifference region to vary based on observed data. Moreover, this composite hypothesis formulation better represents the conceptual purpose of the test, which is to test whether θ is above or below the cutscore. The purpose of this study was to explore the specifications of the new generalized likelihood ratio (GLR; . As with the SPRT, the efficiency of the procedure depends on the nominal error rates and the distance between θ 1 and θ 2 . This study utilized a monte-carlo approach, with 10,000 examinees simulated under each condition, to evaluate differences in efficiency and accuracy due to hypothesis structure, nominal error rate, and indifference region size. The GLR was always at least as efficient as the fixed-point SPRT while maintaining equivalent levels of accuracy.

RUNNING HEAD: Classification Accuracy in CCT Cutscore Location and Classification Accuracy in Computerized Classification Testing

A common dependent variable in studies of computerized classification testing (CCT; Parshall, Spr... more A common dependent variable in studies of computerized classification testing (CCT; Parshall, Spray, Kalhon, & Davey, 2006) is the amount of classification error or accuracy. This is controlled in part by nominal values that are chosen as parameters for the termination criterion of the CCT. Past research, however, has focused on a comparison of observed classification error to assess the efficiency of various CCT algorithms, and has not evaluated the relationship between observed and nominal error rates. This study explored classification accuracy with CCT as a function of cutscore location with relation to the examinee distribution, as a cutscore near the center will lead to much more error than a very high or low cutscore. In addition to classification accuracy, the average number of items required to make a classification decision is also considered. A cutscore that is very high or low will fail or pass, respectively, most examinees with very few items. The purpose of this monte ...

Theory of Correspondences

Much attention has been given to professional certification examinations from a psychometric or p... more Much attention has been given to professional certification examinations from a psychometric or professional industry point of view. This paper argues that the most basic reason for the existence of professional certifications is economic, and that the credential is at its very core a product that is being sold by an organization. Within this paper a model has been developed that considers all of the critical stakeholders of the credentialing process, outlines the benefits of each of the stakeholders, and identifies the necessary contributions that each makes to the value of the credential, and finally, how that value might be increased or decreased. This model has important implications for credentialing organizations seeking to increase the volume or value of the certifications they offer.

A Framework for the Development of Computerized Adaptive Tests

Practical Assessment, Research and Evaluation, 2011

A substantial amount of research has been conducted over the past 40 years on technical aspects o... more A substantial amount of research has been conducted over the past 40 years on technical aspects of computerized adaptive testing (CAT), such as item selection algorithms, item exposure controls, and termination criteria. However, there is little literature providing practical guidance on the development of a CAT. This paper seeks to collate some of the available research methodologies into a general framework for the development of any CAT assessment.

Digital Module 19: Foundations of IRT Estimation

Educational Measurement: Issues and Practice, 2020

In this digital ITEMS module, Dr. Zhuoran Wang and Dr. Nathan Thompson introduce the basic item r... more In this digital ITEMS module, Dr. Zhuoran Wang and Dr. Nathan Thompson introduce the basic item response theory (IRT) item calibration and examinee scoring procedures as well as strategies to improve estimation accuracy. They begin the module with a conceptual review of IRT that includes core advantages of the IRT framework, commonly used IRT models, and essential components, such as information and likelihood functions. In the second part of the module, they illustrate the structure and inner workings of calibration and scoring algorithms, such as the MMLE/EM algorithm for item parameter calibration and the MLE, EAP, and MAP algorithms for examinee scoring. In part three, they demonstrate the influence of multiple factors on estimation accuracy and provide strategies for maximizing accuracy. In addition to audio-narrated slides, the digital module contains sample R code, quiz questions with diagnostic feedback, curated resources, and a glossary.

Stochastic Curtailment in Adaptive Mastery Testing

Applied Psychological Measurement, 2014

A well-known stopping rule in adaptive mastery testing is to terminate the assessment once the ex... more A well-known stopping rule in adaptive mastery testing is to terminate the assessment once the examinee’s ability confidence interval lies entirely above or below the cut-off score. This article proposes new procedures that seek to improve such a variable-length stopping rule by coupling it with curtailment and stochastic curtailment. Under the new procedures, test termination can occur earlier if the probability is high enough that the current classification decision remains the same should the test continue. Computation of this probability utilizes normality of an asymptotically equivalent version of the maximum likelihood ability estimate. In two simulation sets, the new procedures showed a substantial reduction in average test length while maintaining similar classification accuracy to the original method.

Computerized Classification Testing with Composite Hypotheses

Reckase (1983) proposed a widely used method,ofapplying the sequential probability ratio test (SP... more Reckase (1983) proposed a widely used method,ofapplying the sequential probability ratio test (SPRT; Wald, 1947) to computerized classification testing withitem response theory. This method formulates the classification problem as a point hypothesis that an examinee’s ability, θ, is equal to a point, θ1,below the cutscore or a point, θ2,above the cutscore. The current paper argues that the actual goal

Computerized Adaptive Testing: A Primer on Benefits, Design, and Implementation

PsycEXTRA Dataset

IRTEQ: Software for Linking and Equating with Item Response Theory

International Journal of Testing, 2010

An essential requirement for testing programs where every examinee does not see the same set of i... more An essential requirement for testing programs where every examinee does not see the same set of items is the capability to link different sets of items together. Such a situation can occur when multiple fixed forms are used, either at the same time or across years, but it is also necessary for bank-based testing methods such as computerized adaptive testing (CAT, Wainer, 2000) and linear-on-the-fly testing (Folk & Smith, 2002). To ensure the comparability of item parameters and ability (θ ) estimates, a linking and equating analysis must be performed. IRTEQ is a state-of-the-art software program that performs much of this task for testing programs developed with item response theory (IRT: Embretson & Reise, 2000). IRTEQ, developed by Kyung T. Han at the University of MassachusettsAmherst, is a Windows R © application that provides graphical output in addition to the standard text output. These two characteristics are a notable advantage over its predecessor, EQUATE (Baker, 1993). The Windows interface greatly increases the user-friendliness of the program, while the graphical output facilitates interpretation of the results. Moreover, EQUATE is limited to characteristic curve methods. Equating refers to the process of setting scores from two different scales onto the same scale. Linking refers to the same process, but with item parameters. In IRT, the same scale is used for both the scores and the item parameters, and therefore the process is equivalent. There are several linking/equating designs available to researchers and practitioners; IRTEQ is designed for the most common, the non-equivalent groups anchor test (NEAT) design.

A Comparative Study of the Impact of Certified and Noncertified Ophthalmic Medical Personnel on Practice Quality and Productivity

Eye & Contact Lens: Science & Clinical Practice, 2008

To compare ophthalmic practice productivity and performance attributes, as rated by employing oph... more To compare ophthalmic practice productivity and performance attributes, as rated by employing ophthalmologists, of noncertified and three levels of certified ophthalmic medical personnel. Methods. Three hundred eighty-five American and Canadian ophthalmologists in a clinic-based, stratified, random sample were surveyed regarding productivity performance and attributes of the ophthalmic medical personnel they employ. Instrument scales assessed 14 desirable professional attributes and 10 practice productivity measures. The attributes were credibility, reliability, competence, quality assurance, quality of patient care, knowledge base to make adjustments, increased skills (expertise), ability to work independently, broader knowledge base, ability to detect errors, ability to be trained to perform multiple roles in the practice, professional image, good judgment, and initiative and drive. The productivity measures were patient satisfaction, doctor productivity, trouble-shooting rapport, triage screening, effective patient flow, reduced patient complaints, increased referrals, number of patients per hour, revenue per patient, and patient follow-up. Participants indicated whether certified personnel more often showed these attributes and contributed to practice productivity measures as compared to noncertified personnel or whether there was no difference. Results were analyzed with a chi-square goodness-of-fit test. Survey reliability and validity were evaluated. Results. Significantly more ophthalmologists responded that the three levels of certified personnel contributed more to 5 of the 10 practice productivity measures (i.e., doctor productivity, troubleshooting rapport, triage screening, effective patient flow, and number of patients per hour). A statistically significant number of ophthalmologists also believed that certified personnel showed more of all 14 of the personal attributes considered desirable compared to noncertified ophthalmic medical personnel. Conclusions. Compared to noncertified personnel, the employment of certified ophthalmic personnel enhances the quality and productivity of an ophthalmic practice. Overall practice productivity is increased with certified ophthalmic medical personnel.

Item Selection in Computerized Classification Testing

Educational and Psychological Measurement, 2008

Several alternatives for item selection algorithms based on item response theory in computerized ... more Several alternatives for item selection algorithms based on item response theory in computerized classification testing (CCT) have been suggested, with no conclusive evidence on the substantial superiority of a single method. It is argued that the lack of sizable effect is because some of the methods actually assess items very similarly through different calculations and will usually select the same item. Consideration of methods that assess information across a wider range is often unnecessary under realistic conditions, although it might be advantageous to utilize them only early in a test. In addition, the efficiency of item selection approaches depend on the termination criteria that are used, which is demonstrated through didactic example and Monte Carlo simulation. Item selection at the cut score, which seems conceptually appropriate for CCT, is not always the most efficient option. A broad framework for item selection in CCT is presented that incorporates these points.

Advanced Methods of Designing Tests for Pass/Fail Decisions

Most examinations in the realm of professional regulatory testing are designed with the purpose o... more Most examinations in the realm of professional regulatory testing are designed with the purpose of pass/fail decisions for each examinee. Historically, these decisions have been made by administering a fixed number of items to each examinee, and assigning a "pass" decision if the examinee observed score is equal to or above a cutscore. However, this method is inefficient, in that it requires a large number of items to be administered before the decision is made. It is for this reason that computerized adaptive testing (CAT) methods were developed. Yet even CAT methods are not optimal for regulatory testing, as they are designed to obtain precise scores, and precise scores are not neededonly a pass/fail decision. A related methodology, computerized classification testing (CCT), specifically designs tests to provide this decision with as few items as possible, but retaining the decision accuracy of both fixed-form and CAT methods. This paper will explain differences between these approaches and provide a comparison of them to demonstrate the efficiency of the CCT approach for pass/fail testing. Simulations will be conducted for tests under each approach, and results compared in terms of decision accuracy (percentage correctly classified) and efficiency (average test length). The applicability of each method in the licensure/certification context will be discussed. Participants will be able to recognize the advantages and disadvantages to help determine if CCT or CAT methods might be useful for their testing program.

A Proposed Framework of Test Administration Methods

Journal of Applied Testing Technology, 2008

Utilizing the Generalized Likelihood Ratio As a Termination Criterion

Proceedings of the 2009 GMAC Conference on …, 2009

Opportunities and Challenges of AI in Educational Assessment

by Alper Sahin and Nathan Thompson

Journal of Measurement and Evaluation in Education and Psychology, 2024

In the past few years, as artificial intelligence (AI) and large language models (LLM) have rapid... more In the past few years, as artificial intelligence (AI) and large language models (LLM) have rapidly entered our
lives, we have witnessed groundbreaking innovations across numerous fields. The rapid pace of these changes has
been met with excitement by some and apprehension by others. However, we all agree that they have made
tremendous contributions so far and their contributions in the future will reshape our existence. The field of
educational assessment is no exception. With this in mind, we issued a call for a special issue themed
“Opportunities and Challenges of AI in Educational Assessment.” which finally included seven distinguished
articles on subthemes of fair and responsible use of AI in educational assessment, learning analytics, automated
scoring, and real-life examples of AI and LLM.