UCAT 2023 Technical Report
Testing Interval: 10 July 2023 to 28 September 2023
Prepared by:
Pearson VUE
19 March 2024
Non-disclosure and Confidentiality Notice
Copyright © 2024 NCS Pearson, Inc. All rights reserved. The PEARSON logo is a
trademark in the U.S. and/or other countries.
Table of Contents
1. Executive Summary ........................................................................................... 1
2. Introduction......................................................................................................... 3
3. Exam Design 2023 .............................................................................................. 4
4. Examination Results .......................................................................................... 6
4.1 Overall Exam Results ....................................................................................... 6
4.2 Special Educational Needs ............................................................................... 9
4.3 Medicine and Dentistry ................................................................................... 12
4.4 Mode of Delivery ............................................................................................. 13
4.5 Examination Results by Demographic Variables ............................................ 13
4.5.1 Variation by Demographic Group ................................................................ 13
4.5.2 Gender ........................................................................................................ 14
4.5.3 Ethnicity ....................................................................................................... 16
4.5.4 Socio-Economic Classification (SEC) .......................................................... 19
4.5.5 Age .............................................................................................................. 21
4.5.6 Education .................................................................................................... 24
4.5.7 Country of Residence .................................................................................. 25
4.5.8 First Language ............................................................................................ 26
4.5.9 Demographic Interactions and SEN ............................................................ 28
5. Exam Timing Analysis ..................................................................................... 30
6. Test Form Analysis .......................................................................................... 37
7. Item Analysis .................................................................................................... 41
7.1 Cognitive Item Analysis ................................................................................... 41
7.1.1 Item Analysis for SEN.................................................................................. 47
7.1.2 Comparison of UCAT Item Bank Statistics with UCAT ANZ ........................ 48
7.2 SJT Item Analysis ........................................................................................... 49
7.3 Differential Item Functioning (DIF) .................................................................. 53
7.3.1 Introduction .................................................................................................. 53
7.3.2 Method of DIF Detection.............................................................................. 54
7.3.3 Sample Size Requirements ......................................................................... 54
7.3.4 DIF Results .................................................................................................. 55
8. Summary ........................................................................................................... 61
8.1 Recommendations .......................................................................................... 61
Table of Tables
Table 1. UCAT Exam Design .......................................................................................... 4
Table 2. SJT Band Scaled Score Range and Description ............................................... 5
Table 3. Cognitive Subtest and Total Scaled Score Summary Statistics......................... 6
Table 4. Historic Cognitive Subtests Mean Scaled Scores (2017–2023) ........................ 7
Table 5. The Scaled Score Zero-Order Correlation of the Subtests ................................ 8
Table 6. SJT Band Distribution in 2023 ........................................................................... 8
Table 7. Exam Version Time Allowed.............................................................................. 9
Table 8. Exam Version Candidate Volumes .................................................................. 10
Table 9. SEN and Non-SEN Cognitive Subtests ........................................................... 10
Table 10. SJT Band by Exam Version........................................................................... 11
Table 11. Stratified Sample of 2020 UCAT ................................................................... 11
Table 12. Medicine/Dentistry Candidates: Cognitive and Total Scaled Scores ............. 12
Table 13. Medicine/Dentistry Candidates: SJT Bands .................................................. 12
Table 14. Gender Counts .............................................................................................. 14
Table 15. Gender Scaled Scores .................................................................................. 14
Table 16. Gender t-Test ................................................................................................ 15
Table 17. Ethnic Group Counts ..................................................................................... 16
Table 18. Ethnic Group Mean Scaled Score ................................................................. 17
Table 19. Ethnic Group F-Test ...................................................................................... 18
Table 20. SEC Counts................................................................................................... 20
Table 21. SEC Scaled Scores ....................................................................................... 21
Table 22. SEC F-Test.................................................................................................... 21
Table 23. Age Counts.................................................................................................... 22
Table 24. Age F-Test..................................................................................................... 23
Table 25. Correlation of Scaled Score with Age (ungrouped) ....................................... 23
Table 26. Education Scaled Scores .............................................................................. 24
Table 27. Education t-Test ............................................................................................ 25
Table 28. Candidate Count by Residence ..................................................................... 25
Table 29. Candidate Scaled Scores by Residence ....................................................... 26
Table 30. Residence F-Test .......................................................................................... 26
Table 31. Scaled Scores by Language and Country of Residence ............................... 27
Table 32. Language t-Test ............................................................................................ 28
Table 33. Subtest Performance Differences: UCAT and UCATSEN (controlling for
demographic variables) ...................................................................................... 29
Table 34. Mean Subtest Section Timing: Non-SEN and SEN ....................................... 30
Table 35. Subtest Section Timing: Non-SEN and SEN UCAT Incomplete Tests .......... 31
Table 36. Proportion of Test Reached After Guessing Responses Excluded ............... 36
Table 37. Candidates by Form ...................................................................................... 37
Table 38. Cognitive Raw Score Test Statistics .............................................................. 37
Table 39. SJT Raw Score Test Statistics (252 score points)......................................... 38
Table 40. Cognitive Scaled Score Test Statistics .......................................................... 40
Table 41. Cognitive Items Passing the Quality Criteria ................................................. 42
Table 42. Discrimination Summary Statistics ................................................................ 43
Table 43. p Value Summary Statistics........................................................................... 44
Table 44. VR Type Point biserial and p Value ............................................................... 45
Table 45. DM Response Type Point biserial and p Value ............................................. 45
Table 46. DM Response and Item Type Point biserial and p Value .............................. 46
Table 47. QR Type Point biserial and p Value .............................................................. 46
Table 48. AR Type Point biserial and p Value ............................................................... 47
Table 49. Item Analysis of UCAT and UCATSEN ......................................................... 47
Table 50. Comparison of Operational Item Statistics: UCAT & UCAT ANZ 2023 ......... 48
Table 51. Number of Operational Items Showing Drift in UCAT vs UCAT ANZ ............ 49
Table 52. Candidate Removal Summary for SJT Item Analysis .................................... 49
Table 53. SJT Item Quality Criteria ............................................................................... 51
Table 54. Operational SJT Item Analysis Summary ...................................................... 52
Table 55. SJT Pretest Item Summary Statistics ............................................................ 53
Table 56. Gender DIF.................................................................................................... 55
Table 57. Age DIF ......................................................................................................... 55
Table 58. Ethnicity DIF .................................................................................................. 57
Table 59. SEC DIF ........................................................................................................ 58
Table 60. Honours Degree DIF ..................................................................................... 59
Table 61. English as First Language DIF ...................................................................... 59
Table 62. Residency DIF ............................................................................................... 59
Table of Figures
Figure 1. Candidate Volumes since 2017........................................................................ 6
Figure 2. Scaled Scores by Year since 2017 .................................................................. 7
Figure 3. SJT Band Proportions 2017–2023 ................................................................... 9
Figure 4. Distribution of Candidates by Gender 2017–2023.......................................... 14
Figure 5. Scaled Score Distribution of Candidates by Gender 2017–2023 ................... 16
Figure 6. Distribution of Candidates by Ethnic Group 2017–2023................................. 17
Figure 7. Ethnic Group Mean Scaled Score for Total Scaled Score 2017–2023 ........... 18
Figure 8. Ethnic Group Mean Scaled Score for SJT 2017–2023................................... 19
Figure 9. Candidates by SEC 2017–2023 ..................................................................... 20
Figure 10. Mean Scaled Scores by Age ........................................................................ 22
Figure 11. Mean Total Scaled Scores of Cognitive Subtests by Age ............................ 23
Figure 12. Country of Residence 2017–2023 ................................................................ 25
Figure 13. Count of Language 2017–2023 .................................................................... 27
Figure 14. Mean and Maximum Time for UCAT and UCATSEN ................................... 31
Figure 15. Candidates Reaching All Items 2017–2023 ................................................. 32
Figure 16. VR Item Time Distribution ............................................................................ 33
Figure 17. DM Item Time Distribution ............................................................................ 34
Figure 18. QR Item Time Distribution ............................................................................ 34
Figure 19. AR Item Time Distribution ............................................................................ 35
Figure 20. SJT Item Time Distribution ........................................................................... 35
Figure 21. Raw Score Reliability 2017–2023 ................................................................ 38
Figure 22. Proportion of Operational Items Failing Analysis 2017–2023 ....................... 42
Figure 23. Proportion of Pretest Items Failing Analysis 2017–2023 .............................. 43
Figure 24. Point biserial 2017–2023 .............................................................................. 44
Figure 25. p Value 2017–2023 ...................................................................................... 45
Figure 26. Proportion of SJT Items Failing Analysis 2017–2023 ................................... 51
Figure 27. Average Item Facility of Operational SJT Items 2017–2023 ........................ 52
Figure 28. Average Item Partial Correlation of Operational SJT Items 2017–2023 ....... 53
1. Executive Summary
The University Clinical Aptitude Test (UCAT) was administered in 2023 from 10 July 2023
to 28 September 2023. This report covers the 35,625 exams that were delivered during
that period, which is a small decrease (2%) from 2022. The exam was delivered in two
modes: online and test centre. Online test delivery accounted for only 0.1% of candidates,
so it is not possible to reliably compare results between these two groups.
This report covers four of the five versions of the UCAT made available for candidates
with special educational needs (SEN). One version, taken during the contingency period,
is not included in this report. Six percent of candidates who took the UCAT opted for a
SEN version and, as in previous years, candidates who took SEN versions of the exam
outperformed those who took the non-SEN version.
Each exam consists of five subtests. The scaling of the subtests in 2023 was adjusted to
even out the distribution of scaled scores among these subtests. This adjustment resulted
in a higher mean scaled score for Verbal Reasoning (VR) and lower scores for
Quantitative Reasoning (QR) and Abstract Reasoning (AR), bringing the averages of all
subtests closer together. After accounting for the rescaling effort, the mean scaled scores
for VR, QR, Decision Making (DM), and AR remained stable and were comparable to the
mean scaled scores in 2022. The Situational Judgement Test (SJT) bands showed a
deviation within 4% of the target proportions. Notably, the percentage of candidates in the
lowest SJT band fell from 14% in 2022 to 9% in 2023, aligning more closely with the target
of 10%.
The 2023 UCAT consisted of five test forms. Reliabilities for the forms were good across
the board and corresponding standard errors of measurement (SEMs) were satisfactorily
low and consistent with previous years.
The cognitive subtests were speeded to a certain extent. Most candidates used all the
available time and the average time used was very close to the available time. In 2022,
changes were made to lessen the time pressure in these subtests, and these changes
continued into 2023. As a result, the level of time pressure in 2023 was similar to 2022
but lower than in previous years. Analysis excluding guesses suggests that candidates
were generally able to attempt most questions in each subtest. Speededness was lower
in the SEN exams, where candidates have more time available. The SJT remains the
least speeded subtest.
In 2023, demographic trends largely mirrored those of past years, with the notable
exceptions of a continuous decrease in UK - White candidates and a rise in UK - Asian
and non-UK candidates. Higher scores were associated with candidates from a higher
socio-economic classification (SEC), of white ethnicity, with English as a first language,
and resident in the UK. In the cognitive subtests, male candidates generally
outperformed female candidates, while in the SJT, female candidates performed better
than their male counterparts.
Individual item analysis showed satisfactory quality for the majority of operational items.
Pretesting is intended to identify poor-quality items before they enter the operational
scored test, and therefore the pretest items ranged more broadly in quality and on the
whole performed less well. Four operational items and 17 pretest items from the cognitive
subtests did not meet quality standards and were removed from the item bank. In the SJT
subtest, 35 operational items and 195 pretest items failed to meet all of the relevant
criteria. Additionally, 8 operational items and 16 pretest items were
removed due to potentially exhibiting bias.
2. Introduction
This report covers the 2023 UCAT that was delivered from 10 July 2023 to 28 September
2023. As outlined in Section 3, the exam consisted of five subtests ranging from 29 to 69
items each. The design of the exam remained the same as in the previous year, with a
small change to the scaling of three of the subtests. The VR subtest was scaled up by 20
scaled score points while the QR subtest and AR subtest were scaled down by 10 scaled
score points each.
Section 4 describes the exam results in terms of candidate volumes, scaled scores, and
SJT bands. It also reports exam results in reference to candidates who qualified for a
SEN version of the exam, whether candidates applied for medicine or dentistry, the mode
of delivery, and candidate demographic characteristics.
3. Exam Design 2023
Candidates were given 120 minutes to answer a total of 228 items from the five subtests.
There were five groups of candidates who took a SEN version of the exam, and thus had
extra time allowances in 2023. The timing and scoring of the SEN exams are explored in
detail in Section 4.2.
There have been changes to the scaling of the subtests in 2023. For the past 5 years, the
mean scaled scores for QR and AR were comparatively higher than the other subtests,
while for VR, the mean scaled score was relatively lower. Therefore, UCAT decided to
scale down both QR and AR by 10 points and scale up VR by 20 points to narrow the gap
between the cognitive subtests while maintaining similar total cognitive subtest scores.
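The shift described above can be sketched as follows. The +20/-10/-10 point adjustments come from the text; clipping to the 300-900 scale is an illustrative assumption, since in practice the adjustment is built into the raw-to-scaled transformation rather than applied after the fact.

```python
# Sketch of the 2023 scaling adjustment. Shift values are from the text;
# clipping to the 300-900 scaled score range is an assumption for
# illustration only.
SHIFTS = {"VR": +20, "DM": 0, "QR": -10, "AR": -10}

def rescale(subtest: str, scaled_score: int) -> int:
    """Apply the 2023 shift and keep the score within the 300-900 scale."""
    shifted = scaled_score + SHIFTS[subtest]
    return max(300, min(900, shifted))

# The shifts across the four cognitive subtests sum to zero, so the
# total cognitive score scale (1,200-3,600) is preserved.
assert sum(SHIFTS.values()) == 0
```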
The raw scores in each cognitive subtest were transformed to a scaled score ranging
from 300 to 900. SJT scaled scores ranged from 300 to 790. Universities received the
cognitive subtest scaled scores plus a total score: a simple sum of the four cognitive
subtest scores ranging from 1,200 to 3,600. SJT scaled scores are further categorised
into four bands. The bands are determined by scaled score ranges as defined in Table 2.
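The scoring pipeline described above can be sketched as follows. The band boundaries used here are hypothetical placeholders, not the Table 2 values, which are reset each year.

```python
# Hypothetical SJT band boundaries for illustration only -- the real
# scaled score ranges are defined in Table 2 and change annually.
BAND_LOWER_BOUNDS = [
    (1, 670),  # Band 1: 670 and above (hypothetical)
    (2, 600),  # Band 2: 600-669 (hypothetical)
    (3, 530),  # Band 3: 530-599 (hypothetical)
    (4, 300),  # Band 4: 300-529 (hypothetical)
]

def sjt_band(scaled_score: int) -> int:
    """Map an SJT scaled score (300-790) to a band from 1 to 4."""
    for band, lower in BAND_LOWER_BOUNDS:
        if scaled_score >= lower:
            return band
    raise ValueError("score below the 300-790 SJT scale")

def total_cognitive(vr: int, dm: int, qr: int, ar: int) -> int:
    """Total cognitive score: a simple sum of the four subtest scores."""
    return vr + dm + qr + ar
```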
The 2023 UCAT was delivered in two modes: the OnVUE mode, where a candidate can
take the test remotely with an online proctor, or the test centre mode, where candidates
take the test in a specially designed test centre. Only 31 candidates took the online
version of the test (see Section 4.4).
4. Examination Results
4.1 Overall Exam Results
Table 3 presents summary statistics for each of the cognitive subtests plus the total scaled
score for the cognitive subtests. VR scores were lowest with a mean score of 591, and
the highest average score was achieved on AR with a mean of 652.
Figure 2 shows the change in scaled scores since 2017. The year 2017 was chosen as
a starting point for comparison because prior to 2017 there was no operational DM
section.
Considering the rescaling efforts detailed earlier, average performance across subtests
has been stable since 2018, with the exception of a small gradual increase in
performance in AR. Cohort-to-cohort deviations remain within a few scaled score points
after accounting for the rescaling and timing adjustments implemented. These deviations
are well below one SEM for these subtests, as detailed in Section 6, and are not
substantial enough to raise concerns. This stability indicates consistent performance
across cohorts, aligning with expectations given the absence of major test alterations
and a stable candidate composition.
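As context for the SEM comparison above, the classical test theory relationship between score spread, reliability, and SEM can be sketched as follows; the numbers used are illustrative, not the report's figures (those appear in Section 6).

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Classical test theory standard error of measurement:
    SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

# Illustrative values only: a subtest with a scaled score SD of 80
# points and a reliability of 0.80 has an SEM of roughly 36 points,
# so cohort-to-cohort shifts of a few points sit well inside one SEM.
example_sem = sem(80, 0.80)
```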
For the SJT, the number and percentage of candidates in each band for the 35,625
candidates who took the 2023 UCAT are shown in Table 6 below. Candidates are
awarded a band for the SJT exam based on their underlying scaled score.
The proportions of candidates in the four bands deviated from the target. Specifically,
the percentages for Bands 1 and 2 exceeded the target by three and one percentage
points, respectively, while those for Bands 3 and 4 fell short by four and one percentage
points, respectively. This shows that candidates in this cohort performed slightly better
than anticipated, resulting in a higher proportion of Band 1-2 candidates and a lower
proportion of Band 3-4 candidates.
Figure 3 illustrates the distribution of candidates across SJT bands since 2017. From
2018, target proportions for each SJT band were introduced. Although these targets vary
annually, they typically fluctuate within a 1% to 2% range. The 2023 target proportions
are represented by dotted lines in Figure 3. Generally, the actual proportions align closely
with the targets, albeit with minor deviations. This year, the largest deviation is 4%,
which is consistent with the range of deviations observed in previous years. The
equating method undertaken when constructing test forms ensures that the difficulty of
the test forms is controlled year-on-year, meaning test construction is not the source of
the shifts in performance we see in Figure 3.
The distribution of scores is important because the band boundaries (defined in Table 2)
are set each year by candidate performance in the prior year. Candidate performance in
2020 was relatively high, with an increase in candidates being categorised as Band 1.
This increase resulted in the boundary for Band 1 being higher in 2021 than in 2020;
therefore, when candidate performance returned to normal, correspondingly fewer
candidates were categorised as Band 1. However, the 2022 band thresholds were based
on the 2021 population, and therefore the band distributions are much closer to the target.
Thresholds for the current year, established on the basis of the 2022 candidate cohort,
have yielded a slightly higher proportion of candidates in Bands 1 and 2 than the intended
target proportion, suggesting a small increase in the performance of the 2023 cohort
relative to that of 2022. These findings will guide the calibration of thresholds for the
subsequent year.
4.2 Special Educational Needs
Historically, candidates who take a SEN version of the exam usually outperform
candidates who take the non-SEN version. Table 9 summarises the scaled score
statistics by exam version. SEN candidates outperformed non-SEN candidates in all four
subtests. The sample sizes for the UCATSEN50, UCATSA, and UCATSENSA versions
are small, and results for those versions should be treated with caution.
The pattern of SEN candidates being stronger than non-SEN candidates is repeated for
the SJT results, where the UCAT version of the exam has the lowest proportion of
candidates in Band 1 and the highest in Band 4. The breakdown of SJT band proportions
by exam version is presented in Table 10 below. Candidates performed best on the
UCATSA version, where 83% of candidates were categorised as either Band 1 or Band 2;
however, as noted above, few candidates sat that version, so comparisons may not be
reliable.
One potential reason for SEN candidates outperforming non-SEN candidates is the extra
time they receive. After the 2020 exam, Pearson VUE undertook analysis to understand
whether some of this difference may also be due to demographic differences between the
SEN and non-SEN candidate groups. We matched 100 stratified samples of UCATSEN
candidates to the demographic makeup of the UCAT candidates according to first
language, gender, residency, age group, education level and SEC. The comparison of
average scaled scores of the stratified sample of UCATSEN candidates to the UCAT
candidates is shown in Table 11 below. We anticipated that when the samples were
matched demographically, the UCATSEN scores would come closer to the UCAT results,
and that is the case for the VR and DM subtests, as well as the total score. However, for
QR, the average score did not change and for AR, it increased.
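The matching procedure described above might look something like the following sketch. The row structure, stratum keys, and sampling with replacement are all illustrative assumptions; the report does not describe its implementation.

```python
import random
from collections import defaultdict

def stratified_match(sen_rows, target_proportions, n, seed=0):
    """Draw one sample of n SEN candidates whose stratum proportions
    match a (non-SEN) target distribution. Each row is (stratum_key,
    score), where stratum_key combines the matching variables, e.g.
    (first_language, gender, residency, age_group, education, sec).
    Structure and field choices are illustrative assumptions."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for key, score in sen_rows:
        by_stratum[key].append(score)
    sample = []
    for key, prop in target_proportions.items():
        k = round(n * prop)
        pool = by_stratum.get(key, [])
        if pool:
            # Sample with replacement so small strata can still be matched.
            sample.extend(rng.choices(pool, k=k))
    return sample
```

Repeating this with 100 different seeds and averaging the sampled scores would give matched-sample means in the spirit of Table 11.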
4.3 Medicine and Dentistry
The majority of candidates applied for medicine, accounting for 59% of candidates, a
reduction from 63% in 2022 and 69% in 2021. In contrast, 13% of candidates applied for
dentistry, an increase from 11% in 2022 and 9% in 2021. The remaining 29% applied for
neither or could not be matched with UCAS data.
Candidates who applied for medicine as a first choice outperformed those who applied
for dentistry, as illustrated in Table 12. The highest mean scaled score was achieved on
AR and the lowest on VR for both candidate groups. Candidates who did not apply for
medicine or dentistry or were not matched by UCAS performed less well than both other
groups.
Better performance by medicine candidates is also reflected in the SJT banding. As Table
13 shows, more medicine than dentistry candidates appeared in Band 1, and fewer
medicine than dentistry candidates appeared in Band 4.
In summary, UCAT candidates who applied for medicine performed better across all
subtests than those who applied for dentistry and both of these groups performed better
than those who applied to neither. This is consistent with test performance in previous
years.
4.4 Mode of Delivery
Given the large difference in volumes between the two modes and the low number of
candidates who took the test in the online mode in 2023, it is not possible to draw reliable
inferences on differences in performance for the 2023 cohort of candidates.
4.5 Examination Results by Demographic Variables
For the purpose of the demographic analysis, the SJT scaled score summary statistics
are included in the relevant tables to illustrate trends. These scores are not issued to
candidates and are not directly comparable to the scaled scores of the cognitive subtests.
4.5.2 Gender
The distribution of candidates by gender has remained stable since 2017, with a slight
increase in female candidates from 2017 to 2019 (Figure 4).
Males outperformed females on all subtests except the SJT, where females performed
better than males. The difference between male and female average scores is shown in
Table 15, ranging from 10 scaled score points on VR to 33 scaled score points on QR.
However, note that these differences are less than the SEM on the subtest and therefore
may not be significant. Further analysis can be found below.
A statistical test was used to examine whether the differences between the two groups
observed in Table 15 were statistically significant. Table 16 shows the t-statistic, degrees
of freedom and p value for each subtest and the total cognitive scores. The df column
represents the combined sample sizes of both groups minus two, reflecting independent
data points for comparison. A non-zero t-statistic indicates there is a difference in the
mean scaled score between two group samples. However, the difference may or may not
be statistically significant. That is, the difference may or may not be sufficient evidence of
a true difference in the entire population (e.g., between all eligible males and all eligible
females). The p value is the probability of observing a particular t-statistic (or something
more extreme) by chance alone. Lower p values (e.g., less than 0.01) indicate that we
would be unlikely to see such a difference in our sample if there were no true difference
in the population.
Therefore, Table 16 shows us that there are differences between male and female
performance on each subtest and on the total cognitive scores, and that these differences
are likely not to be the result of random chance.
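The test described above can be sketched with the standard library alone; the scores below are illustrative, not the report's data, and in practice a library routine such as scipy.stats.ttest_ind would normally be used to obtain the p value as well.

```python
import math
import statistics

def two_sample_t(a, b):
    """Student's two-sample t-statistic with pooled variance.
    Degrees of freedom are n1 + n2 - 2, as in Table 16."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * statistics.variance(a)
                  + (n2 - 1) * statistics.variance(b)) / (n1 + n2 - 2)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        pooled_var * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Illustrative scaled scores only -- not the report's data.
male = [650, 620, 680, 610, 640, 660, 630, 670]
female = [600, 590, 640, 580, 620, 610, 630, 600]
t_stat, df = two_sample_t(male, female)
# A non-zero t indicates a sample difference; the p value (from the t
# distribution with df degrees of freedom) tells us whether that
# difference is plausibly due to chance.
```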
4.5.3 Ethnicity
UCAT candidates who reside in the UK are requested to answer a question relating to
their ethnicity. The ethnic categories in the questionnaire were simplified in 2023 by
reducing the number of options. These options align closely with the groups used in
previous reports except for UK-Chinese, which is no longer a separate category. The
categories used are:
• White
• Mixed or multiple ethnic groups
• Asian or Asian British
• Black, African, Caribbean or Black British
• Other ethnic group
• I prefer not to say
Table 17 shows the breakdown of candidates by ethnicity in the 2023 exam. The biggest
candidate group was UK - Asian. Twenty-one percent of candidates were not categorised
due to being non-UK candidates.
UK - White candidates performed better on average on all subtests than other groups.
Table 18 shows the average scores in each subtest for each ethnic group. Performance
was the lowest for UK - Black candidates on average on all subtests except the SJT,
where non-UK candidates received the lowest average scaled scores.
Mean total cognitive scaled scores fell for all ethnic groups between 2017 and 2018,
reflecting the rescaling that took place (Figure 7). After 2018, scores remained fairly
stable for most of the ethnic groups, with small increases for non-UK candidates. The
UK - Chinese ethnic category has not appeared as a separate option in the survey since
2022.
Figure 7. Ethnic Group Mean Scaled Score for Total Scaled Score 2017–2023
In the SJT, there was a fairly large increase in scores for all ethnic groups between 2019
and 2020 and a slightly larger fall for all groups between 2020 and 2022, with a small
increase observed in 2023. The most notable thing about ethnic group trends for the SJT
is the margin by which non-UK candidates underperformed relative to the other groups,
as can be observed in Figure 8.
4.5.4 Socio-Economic Classification (SEC)
This issue is illustrated in Table 20, which shows that 23% of all candidates reside in the
UK but cannot be categorised into an SEC. The candidates who can be categorised fall
predominantly into SEC 1, representing Managerial and Professional Occupations.
Prior to 2021, SEC was calculated for up to two parents or carers, then candidates were
categorised as the highest of the two SECs. However, in 2021, the SEC questions
changed to ask candidates to enter responses for only the highest earning parent or carer.
The result is that proportionally more candidates appear in the NA category from 2021
than in previous years, as illustrated in Figure 9. This implies fewer candidates in SEC 1
since 2021 than in previous years; however, since the fall corresponds to a similar rise
in SEC NA, the new way of measuring SEC, rather than a genuine shift, is likely driving
this change. The trend in 2023 is similar to that observed in 2022.
As with the other demographic categories, hypothesis testing was used to examine
whether the scores are likely to be true reflections of the candidate population. Table 22
shows that the score differences observed in each subtest are likely to be due to true
differences.
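An F-test comparing means across several groups of this kind is typically a one-way ANOVA. A minimal standard-library sketch of the F statistic follows; the scores are illustrative, not the report's data.

```python
import statistics

def one_way_f(groups):
    """One-way ANOVA F statistic: between-group mean square over
    within-group mean square. Returns (F, df_between, df_within)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = statistics.mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (statistics.mean(g) - grand) ** 2
                     for g in groups)
    ss_within = sum((x - statistics.mean(g)) ** 2
                    for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k)), k - 1, n - k

# Three illustrative SEC groups; a large F relative to the F
# distribution with (df_between, df_within) degrees of freedom yields
# a small p value, as reported in Table 22.
f_stat, df_b, df_w = one_way_f([[600, 610, 620],
                                [640, 650, 660],
                                [580, 590, 600]])
```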
4.5.5 Age
The majority of UCAT candidates are aged 16–19 years old. A small minority of
candidates are 35 or older and an even smaller proportion are under 16 (Table 23). A
steady proportional increase in candidates aged 16–19 taking the test can be observed;
76% of the testing population was aged 16–19 in 2020, 78% in 2021, 81% in 2022 and
82% in 2023.
Candidates who were aged 16–19 tended to perform better in all cognitive subtests, as
illustrated in Figure 10 below. In the SJT, candidates who were 20–24 tended to perform
the best. Candidates who were under 16 and over 34 typically had the lowest performance
on the exam; however, the small group sizes for those categories mean it is difficult to
draw meaningful conclusions from that information. Overall, candidates who were aged
16–19 performed better than other candidates when evaluated by their total cognitive
scaled scores, followed by the candidates who were aged 20–24, as illustrated in Figure
11.
Hypothesis testing demonstrated that the differences observed among the groups are
unlikely to have occurred due to chance, as shown in Table 32.
To understand how age relates to subtest performance, Table 25 shows the correlation
between candidate age and their performance on each subtest. As the significance
column shows, all the subtests had statistically significant correlations except for the SJT.
For the cognitive subtests with significant correlations, age is slightly negatively correlated
with performance, meaning as candidates get older, they tend to perform less well. The
strongest negative correlation is for QR. No significant correlation between age and the
SJT was observed in 2023.
4.5.6 Education
Candidates are asked to state their highest academic qualification, and responses are
then grouped into broader categories.
The majority of candidates in 2023 had a school leaver qualification (84%), 15% had a
degree or above (down from 16% in 2022), and a small minority had no formal
qualifications.
Candidates with a degree or above performed better on average on the SJT. For the
cognitive subtests and the total cognitive score, below-honours degree candidates
performed better on average, as shown in Table 26.
Table 27 shows that the differences observed in Table 26 are statistically significant.
As in past technical reporting, EU and Rest of World are combined into one category
called Non-UK. Since 2017, the proportion of candidates who reside in the UK has been
relatively stable, as shown in Figure 12 below.
Across all subtests, candidates who stated that English was their first language
outperformed those who stated that English was not their first language regardless of
their country of residence, as shown in Table 31 below.
In line with the other demographic categories, a test was carried out to understand
whether the differences observed in Table 31 can be considered true reflections of the
differences between the two groups. Table 32 shows that such differences are
unlikely to have occurred by chance.
The results of these analyses tend to support the statistical testing of each demographic
characteristic; that is, testing that the differences we observe between demographics are
true reflections of the differing abilities of the demographic groups. They also tend to show
that SEN status does interact with certain demographic characteristics to have a
combined influence on scores, although this is only apparent on QR for qualification, SEC
and gender; and VR for qualification.
A shortened version of that analysis was also conducted this year to continue monitoring
the differences in the performance between UCAT candidates and UCATSEN
candidates, as presented in Table 33. After controlling for the effect of the demographic
variables (see the note in Table 33), the difference in exam version still explains a
significant amount of variance in the candidates’ performance, as candidates who took
the UCATSEN performed better than those who took the UCAT. The largest difference
Pearson VUE Confidential P a g e | 28
was observed in the AR subtest, and the smallest difference was observed in the QR
subtest. In 2022, the largest difference observed was in QR and the smallest was in SJT,
which correspond to the most speeded and least speeded subtests of the exam
respectively. The pattern in 2022 led to the hypothesis that the SEN exam advantage
is positively associated with the speededness of the exam. This year's results contradict
that hypothesis, as both QR and AR are relatively speeded subtests. The performance
differences between UCAT and UCATSEN will continue to be monitored in future years
to ensure test fairness for all candidates.
Table 33. Subtest Performance Differences: UCAT and UCATSEN (controlling for
demographic variables)
Subtest F p η2
VR 99.43 <.0001 0.0026
DM 109.43 <.0001 0.0029
QR 75.79 <.0001 0.0020
AR 128.47 <.0001 0.0035
SJT 98.25 <.0001 0.0026
Note. The comparison was only made between UCAT and UCATSEN exam codes, which accounted for 99% of the candidates. The
rest of the accommodated exam codes were not included because of the small number of candidates. The demographic variables
that were controlled included gender, SEC, age group, highest academic qualification, country of residence and first language.
Candidates’ ethnicity was not included in the analysis as more than 20% of candidates did not provide this information.
Despite the consistent differences observed in the SEN exam across the years, the
effect size (eta-squared, η2) of these differences is less than 0.005 for every subtest
after controlling for the demographic variables, indicating that the differences are very
small. This suggests that the performance gap is not worryingly large relative to the
normal variation in candidates' performance, once differences in the candidates'
demographic composition are accounted for.
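As an illustrative sketch only (not the production analysis code), the eta-squared effect size reported above can be computed from group scores as the between-group sum of squares divided by the total sum of squares; the function name and the example data are hypothetical:

```python
import numpy as np

def eta_squared(groups):
    """Effect size for a one-way group comparison: SS_between / SS_total.

    groups: list of 1-D arrays of scores, one per group (illustrative data,
    not the report's actual candidate scores).
    """
    all_scores = np.concatenate(groups)
    grand_mean = all_scores.mean()
    ss_total = ((all_scores - grand_mean) ** 2).sum()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    return ss_between / ss_total
```

On this scale, a value below 0.005 (as in Table 33) means the group variable explains less than half a percent of score variance.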
Test timing can be examined in more detail in Table 35. It shows that the most speeded
non-SEN subtests are VR and QR: in each, 87% of candidates reached all the items and
between 6% and 7% of candidates did not reach five or more items. The SJT is the least
speeded in all exam versions.
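The three timing statistics reported per subtest can be sketched as follows (a hypothetical helper over per-candidate counts of unreached items; the name `speededness_summary` and the example data are assumptions, not the report's pipeline):

```python
def speededness_summary(unreached_counts):
    """Summarise subtest timing from each candidate's number of unreached items.

    unreached_counts: list of ints, one per candidate (hypothetical data).
    Returns the share reaching all items, the share leaving five or more
    items unreached, and the mean unreached count among incomplete tests.
    """
    n = len(unreached_counts)
    reached_all = sum(1 for u in unreached_counts if u == 0)
    five_plus = sum(1 for u in unreached_counts if u >= 5)
    incomplete = [u for u in unreached_counts if u > 0]
    mean_unreached = sum(incomplete) / len(incomplete) if incomplete else None
    return reached_all / n, five_plus / n, mean_unreached
```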
Table 35. Subtest Section Timing: Non-SEN and SEN UCAT Incomplete Tests
Exam | Subtest | Reached All Items N | Reached All Items % | Five or More Items Unreached N | Five or More Items Unreached % | Mean Number of Unreached Items (Incomplete Tests Only)
VR 29,148 87% 2,283 7% 6.78 (4483)
DM 31,245 93% 719 2% 3.57 (2386)
UCAT QR 29,183 87% 2,101 6% 6 (4448)
AR 30,144 90% 1,638 5% 6.61 (3487)
SJT 32,973 98% 118 0% 3.66 (658)
VR 1,207 93% 39 3% 5.36 (94)
DM 1,254 96% 7 1% 2.53 (47)
UCATSEN QR 1,207 93% 40 3% 5.32 (94)
AR 1,242 95% 17 1% 4.25 (59)
SJT 1,288 99% 1 0% 2.23 (13)
VR 388 92% 16 4% 6.53 (34)
DM 402 95% 4 1% 3.35 (20)
UCATSENSA QR 386 91% 19 5% 6.14 (36)
AR 384 91% 19 5% 7.05 (38)
SJT 416 99% 1 0% 4 (6)
VR 86 92% 3 3% 4.29 (7)
UCATSEN50 DM 91 98% 0 0% 2.5 (2)
QR 89 96% 1 1% 2.75 (4)
AR 90 97% 1 1% 4.33 (3)
SJT 93 100% 0 0% N/A
VR 159 89% 8 4% 5 (19)
DM 169 95% 2 1% 2.67 (9)
UCATSA QR 157 88% 11 6% 6.1 (21)
AR 166 93% 6 3% 7.17 (12)
SJT 177 99% 0 0% 1 (1)
Over time, VR, QR and AR have tended to become less speeded, when speededness is
defined as the proportion of candidates who reach all the items. Figure 15 shows that
although there is a lot of fluctuation year on year, the SJT and DM have fluctuated within
a fairly narrow band, whereas the proportion of candidates seeing all the items in the
other subtests has gently increased from 2017 to 2021.
In 2022, a change was made to the timing of the AR and QR subtests with the aim of
reducing the speededness of QR. One minute was taken from the AR subtest (with the
removal of 5 pretest items) and this was added to the QR subtest (where no additional
items were included). The item time has been considered in the form build for QR and AR
for a number of years, but this was also extended to VR and DM in 2022. A notable
increase in the percentage of candidates reaching all items has been observed since
2022. There are no major changes regarding test speededness in 2023 and the
percentages of candidates reaching all items are similar to those in 2022.
The further examination of speededness for the VR, DM, and QR subtests involved
excluding responses based on various guessing thresholds. The threshold for exclusion
is a relatively subjective decision that would yield different results. A 1-second threshold,
used in previous years, predominantly excluded only the most hasty responses; a 5-
second threshold effectively removed the peak and those below the peak of the guessing
distribution, eliminating most guessed responses and a minor portion of overlapping non-
guessed responses; a 10-second threshold, surpassing the valley for both VR and QR
and approximating that of DM, likely filtered out nearly all guessed responses but also
removed a significant number of non-guessed responses.
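The threshold-based exclusion described above can be sketched as a simple filter over response times (illustrative only; the tuple representation and function name are assumptions):

```python
def exclude_guesses(responses, threshold_seconds):
    """Drop item responses answered faster than a guessing threshold.

    responses: list of (response_time_seconds, is_correct) tuples
    (hypothetical data). Per the text, a 1 s threshold removes only the
    hastiest responses, 5 s removes most of the guessing distribution, and
    10 s also removes a significant number of genuine responses.
    """
    return [(t, c) for (t, c) in responses if t >= threshold_seconds]
```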
Although a similar analysis was conducted for AR and the SJT, it serves comparative
purposes only. Due to the overlapping distributions of guessed and non-guessed
responses in these subtests, as previously discussed, applying a fixed threshold is less
effective and could inadvertently exclude a substantial number of non-guessed
responses. Consequently, the results for AR and the SJT, detailed in Table 36, should be
interpreted with caution.
Form | Candidates
Form 1 | 46
Form 2 | 9,415
Form 3 | 9,404
Form 4 | 8,428
Form 5 | 8,332
Table 38 shows the raw score summary for each subtest on each form. It also includes
the reliability statistic, Cronbach’s alpha. Alpha is based on the intercorrelations or internal
consistency among the items, and it reflects the reproducibility of the test results. High
reliability is desirable because it indicates that a test is consistent in measuring the desired
construct. All subtests have satisfactorily high reliabilities. Notably, QR emerged as the
subtest with the highest reliability, a distinction previously held by AR for several years.
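Cronbach's alpha can be computed from a candidates-by-items score matrix using its standard formula, α = k/(k−1)·(1 − Σσ²ᵢ/σ²ₜ). The sketch below is illustrative only; the function name and example data are assumptions, not the report's analysis code:

```python
import numpy as np

def cronbach_alpha(scores):
    """Internal-consistency reliability of a test.

    scores: candidates x items matrix of item scores (hypothetical data).
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                                # number of items
    item_vars = scores.var(axis=0, ddof=1).sum()       # sum of item variances
    total_var = scores.sum(axis=1).var(ddof=1)         # variance of total scores
    return k / (k - 1) * (1 - item_vars / total_var)
```

Perfectly parallel items yield an alpha of 1; the more the items covary, the closer alpha gets to that ceiling.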
The SJT is analysed in a similar way to the cognitive sections above; however, because
the maximum raw score available on the SJT can change year on year, an additional
column called mean percent raw score is added (Table 39). Similar to the cognitive
results, the reliability is adequately high and the SEM adequately low for the SJT.
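The SEM column in Table 39 is consistent with the classical relation SEM = SD·√(1 − α); for Form 1, 21.39·√(1 − 0.86) ≈ 8.00. A minimal sketch, assuming that relation (the function name is hypothetical):

```python
import math

def sem(sd, alpha):
    # Standard error of measurement from the raw-score SD and reliability
    return sd * math.sqrt(1 - alpha)
```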
Table 39. SJT Raw Score Test Statistics (252 score points)
Form | Mean | SD | Min | Max | Mean Percent Raw Score | Alpha | SEM
Form 1 199.07 21.39 123 240 78.99% 0.86 8.00
Form 2 197.65 23.53 42 240 78.43% 0.88 8.15
Form 3 197.10 21.19 70 242 78.21% 0.85 8.21
Form 4 197.86 22.14 40 242 78.52% 0.87 7.98
Form 5 196.49 23.34 56 240 77.97% 0.87 8.42
Subtest reliability has been consistent since 2017. Figure 21 shows the mean Cronbach’s
alpha for each subtest in each form since 2017. Note that prior to 2019, it is the mean of
three forms, whereas since 2019, it is the mean of five forms. DM has become more
reliable since its launch in 2017, and the reliability of VR has slightly dropped but remained
consistent since 2020, with a small improvement in 2023. The reliability of both QR and
the SJT has continued to improve this year.
Raw scores are scaled and reported as scaled scores. The summary statistics for scaled
scores on each form are presented below in Table 40. Instead of alpha, the scaled score
reliability is the conditional reliability at each scaled score point. Similar to the results for
raw scores, the scaled score reliability is adequately high for each subtest and each form.
Table 40 also includes the results for the SJT.
The cognitive items are analysed using item response theory, whereas the SJT items are
analysed using classical test theory, so they are dealt with separately here.
• Point biserial: the degree to which a test item discriminated between strong and
weak candidates. For operational items, it must be greater than 0.1 for the item to
remain in the bank. For pretest items, it must be greater than 0.05.
• p Value: the proportion of candidates who answered the item correctly—the item
difficulty. This must be between 0.1 and 0.95 for the item to remain in the bank.
• IRT b: the difficulty parameter from the item response theory (IRT) analysis of the
items. It must be between -3 and 3 for the item to remain active.
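The first two checks in the list above can be sketched as follows (illustrative only; `item_screen`, its inputs, and the strict treatment of the boundaries are assumptions, and the IRT b check is omitted because b comes from a separate calibration):

```python
import numpy as np

def item_screen(item_responses, total_scores, operational=True):
    """Screen one dichotomous item against the retention criteria above.

    item_responses: 0/1 array over candidates; total_scores: their test
    totals (both hypothetical data).
    """
    p = item_responses.mean()                               # p value (difficulty)
    pbis = np.corrcoef(item_responses, total_scores)[0, 1]  # point biserial
    min_pbis = 0.1 if operational else 0.05
    passes = (pbis > min_pbis) and (0.1 < p < 0.95)
    return p, pbis, passes
```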
Items that do not meet the statistical criteria laid out above are retired from the bank. It
may be possible for them to be revised and reused under a different item ID, but typically
they are used for training purposes to show item writers what type of item does not work
well.
Table 41 below summarises the number of items that passed the quality criteria by
subtest, and by whether they were operational or pretest items. More pretest items tend
to fail at this stage since they are new unscored items being tested for the first time. The
scored items by contrast have all been previously tested.
Consistent with previous years, only four operational items failed the analysis; those
items did not discriminate highly enough. Among the pretest items, a few failed in the VR,
DM and QR subtests, due to low discrimination as well as to items being too easy or too
difficult. There were no pretest items for AR this year. Figure 22 and Figure 23 show that
the pretest pass rate has been consistent, with excellent pass rates for VR, DM and QR.
Table 42 shows a summary of the point biserial values. The maximum point biserial is 1,
and higher values are better because they indicate that an item can discriminate well
between strong and weak candidates. Given that the unscored items have not been
tested before, it is expected that those items, on average, will discriminate less well than
the scored items, and that is the case across all the cognitive subtests.
Historically, the point biserial values for scored items have been high and stable, whereas
the values for unscored items have been lower and less consistent, as illustrated in Figure
24. Despite a small drop in QR and AR in 2023, the operational items appear to have
become slightly more discriminating over time for all subtests. This is an indication that
the quality of the subtests has improved over time.
Table 43 shows the summary of p values for the cognitive subtests. p values reflect the
proportion of candidates who answered an item correctly, so higher values indicate easier
items, and lower values harder items. Of the operational items, DM items appear to have
been the most difficult on average for 2023 candidates and AR items were the easiest on
average. The pretest pools appear to have been somewhat more difficult overall than the
operational test items for all subtests.
Since 2017, pretesting has been successful in identifying items that are too difficult and
too easy. Figure 25 shows that the items in the pretest pools are usually more difficult
than the operational items on average. Note that the subtests are equated year-on-year,
meaning changes in difficulty of individual items does not have an impact on the ability
required for candidates to achieve a given scaled score.
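Because the IRT difficulty parameter b sits on a common scale, harder items simply shift the item response function to the right. The report does not state which IRT model is used, so the one-parameter logistic function below is purely an illustration of how b relates ability to the probability of a correct response:

```python
import math

def prob_correct(theta, b):
    # One-parameter logistic item response function: higher b = harder item
    return 1.0 / (1.0 + math.exp(-(theta - b)))
```

Under this sketch, a candidate whose ability equals the item's difficulty (theta = b) has a 50% chance of answering correctly, which is why year-on-year equating on the b scale keeps scaled scores comparable even as individual item difficulties change.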
Table 44 shows that the four-option multiple-choice items are better at discriminating
between stronger and weaker candidates than the three-option items. The lower point
biserials in the pretest pool show that pretesting is successfully removing items that do
not discriminate effectively. The operational items are also rather easier on average than
the pretest pool items.
The DM subtest contains multiple-choice items, scored out of one, and drag-and-drop
items, which are scored out of two. The drag-and-drop items are more difficult than the
multiple-choice items and they discriminate better, as shown in Table 45.
In addition to different response types, the DM subtest also contains different item types.
Among the drag-and-drop items, interpreting information items are more difficult than
syllogism items but the latter discriminate slightly better than the former, as presented in
Table 46. For the multiple-choice items, the items on statistical reasoning and Venn
diagrams are the most discriminating. Logical Puzzles were found to be the most difficult
item type in DM, while Syllogisms were found to be the easiest.
Table 46. DM Response and Item Type Point biserial and p Value
The QR subtest has item sets and standalone items. Each item set contains four items.
As with the pretest pool as a whole, the pretest items discriminate less well on average
than the ones that have already been pretested prior to appearing in the 2023 exam, as
shown in Table 47.
The AR subtest consists of four different types. Table 48 below shows that the
discrimination of all four item types is similarly strong across the operational items.
Table 48. AR Type Point biserial and p Value
Table 50 compares the summary statistics for the operational item analysis for the UCAT
2023 to the UCAT ANZ 2023 values. Across all the subtests, the point biserial summary
statistics were similar, with the results from the ANZ population showing slightly higher
values, indicating that all operational items discriminated as strongly as expected for the
UCAT ANZ population. In terms of the p value, which is sample-dependent, the UCAT
ANZ population had higher (i.e. easier) average values across subtests. The IRT difficulty,
on the other hand, is on a common scale. Table 50 shows that for all subtests, the 2023
UCAT and UCAT ANZ had very similar mean IRT difficulty values, indicating a
comparable level of difficulty for both populations.
Table 50. Comparison of Operational Item Statistics: UCAT & UCAT ANZ 2023
Subtest | Item Statistic | N Items | UCAT 2023 Mean | UCAT 2023 SD | UCAT ANZ 2023 Mean | UCAT ANZ 2023 SD
VR | p Value | 160 | 0.57 | 0.13 | 0.60 | 0.13
VR | Point biserial | 160 | 0.29 | 0.06 | 0.30 | 0.06
VR | IRT Difficulty | 160 | -0.21 | 0.61 | -0.19 | 0.61
DM | Facility | 104 | 0.55 | 0.15 | 0.58 | 0.15
DM | Point biserial | 104 | 0.37 | 0.08 | 0.39 | 0.09
DM | IRT Difficulty | 104 | 0.23 | 0.72 | 0.20 | 0.72
QR | p Value | 128 | 0.62 | 0.14 | 0.65 | 0.13
QR | Point biserial | 128 | 0.38 | 0.07 | 0.41 | 0.07
QR | IRT Difficulty | 128 | -0.26 | 0.71 | -0.26 | 0.72
AR | p Value | 200 | 0.66 | 0.13 | 0.67 | 0.13
AR | Point biserial | 200 | 0.32 | 0.09 | 0.35 | 0.10
AR | IRT Difficulty | 200 | 0.15 | 0.73 | 0.17 | 0.73
In addition, during the standard UCAT and UCAT ANZ item analysis, any item that shows
item drift more extreme than +/-0.5 is removed from the anchor and re-calibrated, as
the item difficulty is considered to have changed significantly. This can give an indication
of whether the relative difficulty of the items for the UCAT ANZ population is comparable
to that for the UCAT population.
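The ±0.5 drift check can be sketched as a comparison of difficulty estimates across the two calibrations (a hypothetical helper; the item IDs and dict representation are assumptions):

```python
def flag_drift(anchor_b, recalibrated_b, threshold=0.5):
    """Flag anchor items whose IRT difficulty drifted beyond +/- threshold.

    anchor_b / recalibrated_b: dicts mapping item IDs (hypothetical) to
    difficulty estimates from the two calibrations.
    """
    return sorted(
        item for item, b in anchor_b.items()
        if abs(recalibrated_b[item] - b) > threshold
    )
```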
Table 51 summarises the number of items showing drift in the UCAT since 2017
compared to the UCAT ANZ since 2019. Compared to the UCAT 2023, the number of
drift items is slightly higher for the UCAT ANZ 2023. These items were reviewed by the
Content Team and there was no clear explanation for the differences in terms of the
cultural sensitivity of the items.
Table 51. Number of Operational Items Showing Drift in UCAT vs UCAT ANZ
Subtest | UCAT 2017 | UCAT 2018 | UCAT 2019 | UCAT 2020 | UCAT 2021 | UCAT 2022 | UCAT 2023 | ANZ 2019 | ANZ 2020 | ANZ 2021 | ANZ 2022 | ANZ 2023
VR | 2 (2%) | 3 (3%) | 6 (3%) | 4 (2%) | 4 (2%) | 5 (3%) | 6 (4%) | 12 (10%) | 13 (6%) | 13 (6%) | 8 (4%) | 9 (6%)
DM | 11 (14%) | 6 (8%) | 17 (13%) | 37 (28%) | 12 (9%) | 7 (5%) | 3 (3%) | 7 (9%) | 47 (36%) | 11 (8%) | 9 (7%) | 8 (8%)
QR | 2 (2%) | 0 (0%) | 1 (1%) | 0 (0%) | 2 (1%) | 6 (4%) | 2 (2%) | 3 (3%) | 2 (1%) | 4 (2%) | 5 (3%) | 4 (3%)
AR | 7 (5%) | 5 (3%) | 21 (8%) | 25 (10%) | 40 (16%) | 19 (8%) | 5 (3%) | 22 (15%) | 24 (10%) | 37 (15%) | 13 (5%) | 7 (4%)
At present, it is recommended that the degree of drift be monitored in 2024. We would not
recommend taking any action to create a separate item bank for the UCAT ANZ at this
time.
Prior to calculating the item statistics, outlier candidates are removed from the sample
according to the criteria outlined in Table 52. The candidates that are removed are judged
as not interacting with the test as expected and are therefore not representative of the
UCAT population.
The following item statistics are calculated for the SJT items:
• Item facility: the mean score on the items as a percentage of the maximum score
available. It represents the difficulty of the item.
• Item SD: the SD of the scores on the items. It gives an indication of how well the
item is differentiating among candidates.
• Item partial correlation: the correlation of the item score with the total score for the
operational items and the scaled score for the pretest items. It compares how
individuals perform on a given item with how they perform on the test overall and
is a measure of discrimination. Item correlations can be interpreted in the following
way:
o Below 0.13 – poor correlation with the test overall and items within this band
are unlikely to be used in an operational test.
o 0.13 to 0.17 – acceptable correlations. Items within this band will only be
included if other items within the scenario have higher item partials.
o 0.17 to 0.25 – reasonable item performance.
o Above 0.25 – good item performance.
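The SJT item statistics and the interpretation bands above can be sketched as follows (illustrative only: the function name is hypothetical, and the report's item partial correlation is approximated here by a plain Pearson correlation against a supplied criterion score):

```python
import numpy as np

def sjt_item_stats(item_scores, criterion_scores, max_score):
    """Facility, SD and a discrimination band for one polytomous SJT item.

    item_scores: per-candidate scores on the item; criterion_scores: the
    score each is correlated against; max_score: the maximum available on
    the item (all hypothetical data).
    """
    item_scores = np.asarray(item_scores, dtype=float)
    facility = item_scores.mean() / max_score * 100     # % of maximum available
    sd = item_scores.std(ddof=1)                        # spread across candidates
    r = np.corrcoef(item_scores, np.asarray(criterion_scores, dtype=float))[0, 1]
    if r < 0.13:
        band = "poor"
    elif r < 0.17:
        band = "acceptable"
    elif r <= 0.25:
        band = "reasonable"
    else:
        band = "good"
    return facility, sd, band
```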
In 2023, there were discussions about adjusting the SJT item quality criteria to align them
with the criteria used for the cognitive items. The revised criteria are expected to be
slightly more lenient than the current criteria, which would result in more operational and
pretest items being deemed successful and would support the continued development
and improvement of the item bank. As these changes are still under discussion, the
following results are based on the existing quality criteria.
Table 53 shows the number of items that met and did not meet the quality criteria. The
most/least item type was more successful than the standard items, with all operational
items and 63% of the pretest items meeting the criteria.
The proportion of items meeting the quality criteria is fairly consistent with previous years.
Figure 26 shows that the proportion of operational standard items meeting the criteria is
consistent between 2022 and 2023. The proportion of pretest most/least items not
meeting the criteria increased from 25% in 2022 to 37% in 2023. In 2023, 43% of the
standard rating items met the criteria, the same percentage as for the pretest
dichotomous items in 2022. However, it should be noted that a proportion of the
dichotomous items pretested in 2022 were adapted from already successful items in the
item bank, so a higher pass rate was expected. A similar pass rate among newly
developed items in 2023 therefore indicates an improvement in pretest item quality.
The summary of all operational SJT items is shown below in Table 54.
Since 2017, the item mean score and facility have tended to increase, as illustrated in
Figure 27, indicating that items are becoming somewhat easier. Figure 28 shows an
increase in item partial correlation, which indicates that despite the test being relatively
easy, it has progressively improved in consistently measuring the same ability, and the
items are getting better overall at discriminating between strong and weak candidates.
Better discrimination implies that the test results can be considered more reliable in
distinguishing stronger from weaker candidates; in other words, item quality has
improved. However, a relatively high facility could imply that the test is too easy to
distinguish between strong and very strong candidates. Even though high-facility items
provide face validity to the SJT, harder items will also be developed in future test
development to limit the upward trend in item facility.
Table 55 shows the summary statistics for the SJT pretest items. While the Most/Least
items showed a slightly higher discriminating ability than the standard rating items, the
average item total facility is relatively high.
Since the SJT makes extensive use of polytomous scoring, the DIF analysis was
performed with a hierarchical regression approach using the equated scaled score.
In both approaches, items were classified into one of three categories: A, B or C. Category
A contains items with negligible DIF, Category B contains items with slight to moderate
DIF and Category C contains items with moderate to large DIF. For the cognitive subtests,
these categories are derived from the DIF classification categories developed by
Educational Testing Service (ETS) and are defined below:
A: DIF is not significantly different from zero or has an absolute value < 1.0
B: DIF is significantly different from zero and has an absolute value >= 1.0 and < 1.5
C: DIF is significantly larger than 1.0 and has an absolute value >= 1.5
Items flagged in Category C are removed from the item bank on the basis that they may
contain bias. Items flagged in Categories A and B are not removed because of the small
effect or lack of statistical significance.
For the SJT, effects that explain less than 1% of score variance (R-squared change
< 0.01) are considered negligible for flagging purposes and items that do not reach
significance or explain less than this proportion of variance are labelled ‘A’, meaning that
they can be considered free of DIF. Larger effects, where the group variable has a
significant beta coefficient, are labelled ‘B’ or ‘C’. Changes of 0.01 or above are
considered slight to moderate and labelled ‘B’, unless all of the change is explained by
the interaction term, in which case they are labelled ‘A’. Changes above 0.05 (5% of the
variance in responses) are considered moderate to large and are labelled ‘C’, where there
is a significant main effect of the group difference variable.
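The SJT flagging rules described above can be sketched as a small classifier (illustrative only; the function name and the boolean summaries of the hierarchical regression output are assumptions):

```python
def sjt_dif_category(r2_change, significant, interaction_only=False):
    """Label an SJT item's DIF per the flagging rules described above.

    r2_change: R-squared change from adding the group terms; significant:
    whether the group variable's coefficient is significant; interaction_only:
    whether all of the change is explained by the interaction term.
    """
    if not significant or r2_change < 0.01:
        return "A"    # negligible DIF
    if interaction_only:
        return "A"    # change explained entirely by the interaction term
    if r2_change > 0.05:
        return "C"    # moderate to large DIF
    return "B"        # slight to moderate DIF
```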
Category | VR | DM | QR | AR | SJT
Operational A | 160 (100%) | 101 (97%) | 125 (98%) | 199 (100%) | 193 (98%)
Operational B | 0 (0%) | 3 (3%) | 3 (2%) | 0 (0%) | 3 (2%)
Operational C | 0 (0%) | 0 (0%) | 0 (0%) | 1 (0%) | 0 (0%)
Operational NA | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%)
Pretest A | 240 (100%) | 239 (100%) | 217 (100%) | N/A | 336 (95%)
Pretest B | 0 (0%) | 1 (0%) | 1 (0%) | N/A | 15 (4%)
Pretest C | 0 (0%) | 0 (0%) | 0 (0%) | N/A | 0 (0%)
Pretest NA | 0 (0%) | 0 (0%) | 0 (0%) | N/A | 0 (0%)
In contrast to previous years, the age comparison has been changed to increase the
number of items for which a comparison can be made. Since 2022, the comparison has
been between candidates aged under 20 and those aged over 25, rather than under 20
and over 35 as in previous years (Table 57). One operational VR item was found to exhibit
Category C DIF favouring younger candidates. Three Category C items were identified
in DM, with one item favouring older candidates and two favouring younger candidates.
Two AR items showed Category C DIF; both favoured younger candidates. These items
will be reviewed by the Content Team and removed from the bank.
Table 58 shows there were six instances of Category C DIF identified in the ethnicity
comparisons. Of these, one operational AR item favoured White candidates over Black
candidates; one pretest QR item favoured White candidates over both Asian and non-
White candidates; and one SJT pretest item showed a reverse pattern of favouring Asian
and non-White candidates over White candidates.
Comparison | Category | VR | DM | QR | AR | SJT
Operational White/Asian C | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%)
Operational White/Asian NA | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%)
Operational White/Mixed A | 159 (99%) | 103 (99%) | 128 (100%) | 200 (100%) | 196 (100%)
Operational White/Mixed B | 1 (1%) | 1 (1%) | 0 (0%) | 0 (0%) | 0 (0%)
Operational White/Mixed C | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%)
Operational White/Mixed NA | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%)
Operational White/Non-White A | 160 (100%) | 102 (98%) | 128 (100%) | 200 (100%) | 187 (95%)
Operational White/Non-White B | 0 (0%) | 2 (2%) | 0 (0%) | 0 (0%) | 9 (5%)
Operational White/Non-White C | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%)
Operational White/Non-White NA | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%)
Pretest White/Black A | 42 (18%) | 7 (3%) | 140 (64%) | N/A | 79 (23%)
Pretest White/Black B | 0 (0%) | 1 (0%) | 0 (0%) | N/A | 9 (3%)
Pretest White/Black C | 0 (0%) | 0 (0%) | 0 (0%) | N/A | 0 (0%)
Pretest White/Black NA | 198 (82%) | 232 (97%) | 78 (36%) | N/A | 0 (0%)
Pretest White/Asian A | 240 (100%) | 215 (90%) | 217 (100%) | N/A | 322 (92%)
Pretest White/Asian B | 0 (0%) | 0 (0%) | 0 (0%) | N/A | 15 (4%)
Pretest White/Asian C | 0 (0%) | 0 (0%) | 1 (0%) | N/A | 1 (0%)
Since 2022, a comparison between SEC 1 and non-SEC 1 candidates has also been
included to allow more comparisons to be made. One pretest QR item was categorised
as DIF Category C, favouring SEC 1 candidates over non-SEC 1 candidates. Four
Category C DIF items were identified among the SJT pretest items, all favouring SEC 1
candidates over non-SEC 1, SEC 2, SEC 3 and SEC 4 candidates respectively.
Comparison | Category | VR | DM | QR | AR | SJT
Operational SEC 1/4 A | 160 (100%) | 104 (100%) | 128 (100%) | 200 (100%) | 196 (100%)
Operational SEC 1/4 B | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%)
Operational SEC 1/4 C | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%)
Operational SEC 1/4 NA | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%)
Operational SEC 1/5 A | 160 (100%) | 104 (100%) | 127 (99%) | 200 (100%) | 196 (100%)
Operational SEC 1/5 B | 0 (0%) | 0 (0%) | 1 (1%) | 0 (0%) | 0 (0%)
Operational SEC 1/5 C | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%)
Operational SEC 1/5 NA | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%)
Operational SEC 1/(2-5) A | 160 (100%) | 104 (100%) | 128 (100%) | 200 (100%) | 196 (100%)
Operational SEC 1/(2-5) B | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%)
Operational SEC 1/(2-5) C | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%)
Operational SEC 1/(2-5) NA | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%)
Pretest SEC 1/2 A | 0 (0%) | 0 (0%) | 0 (0%) | N/A | 216 (62%)
Pretest SEC 1/2 B | 0 (0%) | 0 (0%) | 0 (0%) | N/A | 5 (1%)
Pretest SEC 1/2 C | 0 (0%) | 0 (0%) | 0 (0%) | N/A | 1 (0%)
Pretest SEC 1/2 NA | 240 (100%) | 240 (100%) | 218 (100%) | N/A | 0 (0%)
Pretest SEC 1/3 A | 138 (57%) | 40 (17%) | 167 (77%) | N/A | 265 (75%)
Pretest SEC 1/3 B | 0 (0%) | 0 (0%) | 0 (0%) | N/A | 6 (2%)
Pretest SEC 1/3 C | 0 (0%) | 0 (0%) | 0 (0%) | N/A | 1 (0%)
Pretest SEC 1/3 NA | 102 (42%) | 200 (83%) | 51 (23%) | N/A | 0 (0%)
Pretest SEC 1/4 A | 0 (0%) | 0 (0%) | 0 (0%) | N/A | 237 (68%)
Pretest SEC 1/4 B | 0 (0%) | 0 (0%) | 0 (0%) | N/A | 1 (0%)
Pretest SEC 1/4 C | 0 (0%) | 0 (0%) | 0 (0%) | N/A | 1 (0%)
Pretest SEC 1/4 NA | 240 (100%) | 240 (100%) | 218 (100%) | N/A | 0 (0%)
Pretest SEC 1/5 A | 1 (0%) | 0 (0%) | 8 (4%) | N/A | 246 (70%)
Pretest SEC 1/5 B | 0 (0%) | 0 (0%) | 0 (0%) | N/A | 3 (1%)
Pretest SEC 1/5 C | 0 (0%) | 0 (0%) | 0 (0%) | N/A | 0 (0%)
Pretest SEC 1/5 NA | 239 (100%) | 240 (100%) | 210 (96%) | N/A | 0 (0%)
Pretest SEC 1/(2-5) A | 240 (100%) | 232 (97%) | 217 (100%) | N/A | 339 (97%)
Pretest SEC 1/(2-5) B | 0 (0%) | 0 (0%) | 0 (0%) | N/A | 6 (2%)
Pretest SEC 1/(2-5) C | 0 (0%) | 0 (0%) | 1 (0%) | N/A | 1 (0%)
Pretest SEC 1/(2-5) NA | 0 (0%) | 8 (3%) | 0 (0%) | N/A | 0 (0%)
As Table 60 illustrates, there was one Category C DIF item detected in the comparison
between candidates who had an honours degree or above and those who did not. This
item was a QR pretest item and favoured candidates with a degree education over those
without. There were high candidate volumes across the board, meaning comparisons
could be made for all subtests.
Comparison was also possible for the most part across all subtests for candidates who
reported English as being their first or primary language and those who reported that it
was not. As Table 61 shows, two pretest items were flagged as having Category C DIF.
The pretest DM item was found to favour native English speakers over non-native English
speakers while the pretest SJT item was found to favour non-native English speakers
over native English speakers.
Four Category C DIF items were identified in the comparison of candidates who reported
the UK as their residence with those who reported the UK as not being their residence
(as presented in Table 62). A pretest VR item and a pretest SJT item were found to favour
UK residents over non-UK residents; and a pretest VR item and a pretest QR item were
found to show the reverse pattern.
Very few candidates took the online version of the UCAT (31 candidates; see Section
4.4), so comparison was not possible.
In conclusion, 24 Category C DIF items were identified: 8 operational items and 16
pretest items. This is higher than in 2022, when only 10 Category C DIF items were
identified. The increase may be partially attributed to a slight increase in the number of
candidates from ethnic minority groups, particularly UK - Asian and non-UK candidates.
The greater diversity among candidates might have contributed to more varied
responses to the items, aiding the detection of item bias. These items have been
removed from the item bank to ensure they are not used in future tests, and additional
efforts will be made to review them to reduce potential bias in future item development.
The scores in the 2023 administration of the UCAT were broadly in line with scores in
previous years. The proportion of candidates taking the SEN version remained
unchanged, and the demographic composition of the test-takers stayed mostly the same,
except for a continued decline in UK - White candidates and an increase in UK - Asian
and non-UK candidates.
Candidates taking a SEN version of the exam continue to score better than candidates
taking the non-SEN version, and demographic trends in scores and candidate volumes
were consistent with previous years’ administrations of the exam. Higher scores continue
to be associated with candidates who are resident in the UK, have White ethnicity, are in
SEC 1, and speak English as a first language. Certain scoring patterns by demographic
also persist in the 2023 version of the exam. Male candidates outperformed female
candidates on the cognitive sections and vice versa on the SJT.
In terms of test quality, the test forms were reliable, with appropriately low measurement
error, and individual items performed well, with very few operational cognitive items
needing to be retired. More SJT items than cognitive items failed to meet the required
criteria, which is consistent with previous years. However, the SJT criteria are currently
under review to align them with those of the cognitive tests, which will result in more
items meeting the criteria. A relatively high number of Category C DIF items were
identified this year.
8.1 Recommendations
The outcome of the UCAT 2023 analysis identifies certain small operational changes that
have improved the ongoing performance of the test, as well as several areas that might
provide fertile ground for further research.
As it stands, certain subtests have a greater impact than others on the total cognitive
score that candidates receive. AR and QR, as the highest scoring subtests, have a
greater influence on the total score than VR, which is the lowest scoring subtest. In 2023,
the subtests were slightly rescaled to bring them closer to an average score of 600, which
was found to be effective. Pearson VUE recommend continuing the rescaling year on
year until a more balanced scaled score distribution is achieved.
In 2022, adjustments were implemented to reduce the speededness of the subtests and
these modifications were continued in 2023, resulting in a level of speededness