Module 1 - 4 ASSESSMENT IN LEARNING 1
Module 1 consists of three (3) lessons covering the Prelim term, as follows:
Each lesson contains theory, questions, and an activity. A quiz is provided for you to
answer as required.
Assessment in Learning
The word assessment is rooted in the Latin word assidere, which means "to sit
beside another." Assessment is generally defined as the process of gathering
quantitative and/or qualitative data for the purpose of making decisions. Assessment in
learning is as vital to the educational process as curriculum and instruction. Schools
and teachers will not be able to determine the impact of curriculum and instruction on
students or learners without assessing learning.
The objective format provides for more bias-free scoring because the test items have
exact correct answers, while the subjective format allows for a less objective means of
scoring, especially if no rubric is used.
A Table of Specifications (TOS) is used to map out the essential aspects of a test
(e.g., test objectives, contents, topics covered by the test, item distribution).
Descriptive statistics are used to describe and interpret the results of tests. A test is
said to be good and effective if it is valid, reliable, has an acceptable level of difficulty,
and can discriminate between learners with higher and lower ability.
The two most common psychometric theories that serve as frameworks for
assessment and measurement are the Classical Test Theory (CTT) and the Item
Response Theory (IRT).
Classical Test Theory (CTT) is also known as the true score theory. It explains
that variations in the performance of examinees on a given measure are due to variations
in their abilities. The CTT assumes that an examinee’s observed score in a given
measure is the sum of the examinee’s true score and some degree of error in the
measurement caused by some internal and external conditions. The CTT also assumes
that all measures are imperfect, and the scores obtained from a measure could differ
from the true score (i.e., true ability) of an examinee.
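This assumption can be written as a simple equation, where the observed score X is the
sum of the true score T and the measurement error E:

Observed Score (X) = True Score (T) + Error (E)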
The CTT provides an estimation of item difficulty based on the frequency or
number of examinees who correctly answer a particular item; items that fewer
examinees answer correctly are considered more difficult. The CTT also provides
an estimation of item discrimination based on the number of examinees with higher or
lower ability to answer a particular item. Test reliability can also be estimated using
approaches from CTT (e.g., Kuder-Richardson 20, Cronbach’s Alpha). Item analysis
based on CTT has been the dominant approach because of the simplicity of calculating
the statistics (e.g., item difficulty index, item discrimination index, item-total correlation).
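As an illustration of how such reliability estimates are computed, here is a minimal
Python sketch of the Kuder-Richardson 20 formula; the score matrix is hypothetical, with
rows as examinees and columns as dichotomously scored (1 = correct, 0 = incorrect) items:

```python
import numpy as np

def kr20(scores):
    """Kuder-Richardson 20 reliability for dichotomous (1/0) item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    p = scores.mean(axis=0)                      # proportion correct per item
    q = 1.0 - p                                  # proportion incorrect per item
    var_total = scores.sum(axis=1).var(ddof=1)   # variance of examinees' total scores
    return (k / (k - 1)) * (1.0 - (p * q).sum() / var_total)

# Hypothetical data: 5 examinees x 4 items.
data = [[1, 1, 1, 0],
        [1, 0, 1, 0],
        [1, 1, 0, 0],
        [0, 0, 1, 0],
        [1, 1, 1, 1]]
print(round(kr20(data), 2))
```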
The Item Response Theory (IRT) analyses test items by estimating the probability
that an examinee answers an item correctly or incorrectly. One of the central differences
of IRT from CTT is that in IRT, it is assumed that the characteristic of an item can be
estimated independently of the characteristic or ability of the examinee and vice-versa.
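To make this concrete, here is a minimal sketch of the simplest IRT model, the
one-parameter (Rasch) model; the text does not name a specific model, and the ability
and difficulty values below are purely illustrative:

```python
import math

def rasch_probability(theta, b):
    """One-parameter (Rasch) IRT model: probability of a correct answer
    for an examinee with ability theta on an item with difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# The item's difficulty b is a property of the item itself, separate from
# the ability theta of whoever takes it.
print(rasch_probability(0.0, 0.0))   # 0.5 when ability equals difficulty
print(rasch_probability(1.0, 0.0))   # higher ability -> higher probability
```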
Assessment in learning could be of different types. The most common types are
formative, summative, diagnostic, and placement. Other experts would describe the
types of assessment as traditional and authentic.
Formative Assessment. Refers to assessment activities that provide information to
both teachers and learners on how they can improve the teaching-learning process. It is
formative because it is used at the beginning and during instruction for teachers to
assess learner’s understanding. The information collected on student learning allows
teachers to make adjustments to their instructional process and strategies to facilitate
learning. It also informs learners about their strengths and weaknesses to enable them to
take steps to learn better and improve their performance as the class progresses.
Diagnostic Assessment. Refers to assessment used to detect learners' difficulties
or learning problems in the course of teaching. It can also be done at the beginning of
the school year for spirally-designed curriculum so that corrective actions are applied if
pre-requisite knowledge and skills for the targets of instruction have not been mastered
yet.
There are many principles of assessment in learning. The following are
considered core principles.
LESSON 2 :
ASSESSMENT PURPOSES, LEARNING
TARGETS, AND APPROPRIATE METHODS
THINK ABOUT THESE EXPECTATIONS:
Assessment works best when its purpose is clear. Without a clear purpose, it is
difficult to design or plan assessment effectively and efficiently. In classrooms, teachers
are expected to know the instructional goals and learning outcomes, which will inform
how they will design and implement their assessment. In general, the purpose of
classroom assessment may be classified in terms of the following:
1. Assessment of Learning. This refers to the use of assessment to determine
learners' acquired knowledge and skills from instruction and whether they were
able to achieve the curriculum outcomes. It is generally summative in nature.
2. Assessment for Learning. This refers to the use of assessment to identify the
needs of learners in order to modify instruction or learning activities in the
classroom. It is formative in nature and it is meant to identify gaps in the learning
experiences of learners so that they can be assisted in achieving the curriculum
outcomes.
3. Assessment as Learning. This refers to the use of assessment to help learners
become self-regulated. It is formative in nature and meant to use assessment
tasks, results, and feedback to help learners practice self-regulation and make
adjustments to achieve the curriculum outcomes.
It is very important that assessment is aligned with instruction and the identified
learning outcomes for learners. Knowing what will be taught (curriculum content,
competency, and performance standards) and how it will be taught (instruction) are as
important as knowing what we want from the very start (curriculum outcome) in
determining the specific purpose and strategy for assessment. The alignment is easier if
teachers have a clear purpose for why they are performing the assessment. Typically,
teachers use classroom assessment for assessment of learning more than assessment
for learning and assessment as learning.
Motivational. Classroom assessment can serve as a mechanism for learners to be
motivated and engaged in learning and achievement in the classroom. Grades, for
instance, can motivate or demotivate learners.
Learning Targets
Before discussing what learning targets are, it is important to first define educational
goals, standards, and objectives.
Goals. Goals are general statements about desired learner outcomes in a given
year or during the duration of a program (e.g., senior high school).
Standards. Standards are specific statements about what learners should know
and are capable of doing at a particular grade level, subject, or course. McMillan
(2014) described four different types of educational standards:
1. Content - desired outcomes in a content area.
2. Performance - what students do to demonstrate competence.
3. Developmental - sequence of growth and change over time.
4. Grade-level - outcomes for a specific grade.
Educational Objective. Educational objectives are specific statements of learner
performance at the end of an instructional unit. These are sometimes referred to as
behavioural objectives and are typically stated with the use of verbs. The most
popular taxonomy of educational objectives is Bloom's Taxonomy of Educational
Objectives.
Example: The learner should be able to differentiate qualitative research from
quantitative research.

In this example, differentiate is the verb that represents the type of cognitive
process (in this case, analyse), while qualitative research and quantitative research
is the noun phrase that represents the type of knowledge (in this case, conceptual).
See Table 2.2 and Table 2.3 below.
Table 2.2.
Cognitive Process Dimensions in the Revised Bloom’s Taxonomy
of Educational Objectives
Table 2.3
Knowledge Dimensions in the Revised Bloom's Taxonomy of
Educational Objectives
Other experts consider a fifth type of learning target – affect, which refers to
affective characteristics that students can develop and demonstrate because of
instruction. This includes attitudes, beliefs, interests, and values. Some experts use
disposition as an alternative term for affect. The example is shown below.
I can appreciate the importance of addressing potential ethical issues in the conduct
of thesis research.
Once the learning targets are identified, appropriate assessment methods can be
selected to measure student learning. The match between a learning target and the
assessment method used to measure if students have met the target is very critical.
A matrix of the different types of learning targets and sample assessment methods is
shown in Table 2.5.1 and Table 2.5.2 below.
Table 2.5.1
Matching Learning Targets with Paper-and-Pencil Types of Assessment
Table 2.5.2
Matching Learning Targets with Other Types of Assessment
There are other types of assessment, and it is up to the teachers to select the
method of assessment and design appropriate assessment tasks and activities to
measure the identified learning targets.
LESSON 3 :
DIFFERENT CLASSIFICATIONS OF
ASSESSMENT
THINK ABOUT THESE EXPECTATIONS:
Classification                Types
Purpose                       Educational, Psychological
Form                          Paper-and-Pencil, Performance-Based
Function                      Teacher-made, Standardized
Kind of Learning              Achievement, Aptitude
Ability                       Speed, Power
Interpretation of Learning    Norm-Referenced, Criterion-Referenced
Educational assessments are used in the school setting for the purpose of tracking
the growth of learners and grading their performance. This assessment in the
educational setting comes in the form of formative and summative assessment.
The purpose of formative assessment is to track and monitor student learning and
their progress toward the learning target. Formative assessment can be any form of
assessment (paper-and-pencil or performance-based) that is conducted before, during,
and after instruction. Before instruction begins, formative assessment serves as a
diagnostic tool to determine whether learners already know about the learning target.
More specifically, formative assessment given at the start of the lesson determines the
following:
1. What learners know and do not know so that instruction can supplement what
learners do not know.
2. Misconceptions of learners so that they can be corrected.
3. Confusion of learners so that they can be clarified.
4. What learners can and cannot do so that enough practice can be given to
perform the task.
The information from educational assessment at the beginning of the lesson is used
by the teacher to prepare relevant instruction for learners. During instruction,
educational assessment is done where the teacher stops at certain parts of the teaching
episodes to ask learners questions, assign exercises, short essays, board work, and
other tasks. If the majority of the learners are still unable to accomplish the task, then
the teacher realizes that further instruction is needed by learners.
When the teacher observes that majority or all of the learners are able to
demonstrate the learning target, then the teacher can now conduct the summative
assessment. The purpose of summative
assessment is to determine and record what the learners have learned. It is best to
have a summative assessment for each learning target so that there is evidence that
learning has taken place.
Psychological assessments, such as tests and scales, are measures that determine
the learner's cognitive and non-cognitive characteristics. Examples of cognitive tests
are those that measure ability, aptitude, intelligence, and critical thinking. Affective
measures are for personality, motivation, attitude, interest, and disposition. The results
of these assessments are used by the school’s guidance counsellor to perform
interventions on the learners’ academic, career, and social and emotional development.
Standardized tests have fixed directions for administering and scoring. They can be
purchased with test manuals, booklets, and answer sheets. When these tests were
developed, the items were tried out on a large sample of the target group, called the
norm group. The norm group's performance is used as the basis for comparing the
results of those who take the test.
Non-standardized or teacher-made tests are usually intended for classroom
assessment. They are used for classroom purposes, such as determining whether
learners have reached the learning target. These intend to measure behaviour (such as
learning) in line with the objectives of the course. Examples are quizzes, long tests, and
exams. Formative and summative assessments are usually teacher-made tests.
Achievement tests measure what learners have learned after instruction or after
going through a specific curricular program. Achievement tests provide information on
what learners can do and have acquired after training and instruction. Achievement is a
measure of what a person has learned within or up to a given time.
Aptitudes are characteristics that influence a person's behavior and aid goal
attainment in a particular situation (Lohman 2005). Specifically, aptitude refers to the
degree of readiness to learn and perform well in a particular situation or domain (Corno
et al. 2002). Examples include the ability to comprehend instructions, manage one’s
time, use previously acquired knowledge appropriately, make good inferences and
generalizations, and manage one’s emotions.
Speed tests consist of easy items that need to be completed within a time limit.
Power tests consist of items with increasing level of difficulty, but time is sufficient to
complete the whole test.
Example of Power Test:
A test developed by the National Council of Teachers of Mathematics that determines
the ability of the examinees to utilize data to reason and become creative, formulate,
solve, and reflect critically on the problems provided.

Example of Speed Test:
A typing test in which examinees are required to correctly type as many words as
possible given a limited amount of time.
There are two types of tests based on how the scores are interpreted: Norm-
Referenced and Criterion-Referenced Tests. A Criterion-Referenced Test has a given set
of standards, and the scores are compared to the given criterion.

For example:

In a 50-item test: 40 - 50 is very high, 30 - 39 is high, 20 - 29 is average,
10 - 19 is low, and 0 - 9 is very low.

One approach in criterion-referenced interpretation is to compare the score to
a specific cut-off.
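A cut-off-based interpretation like the one above is straightforward to express in code;
the following minimal Python sketch simply encodes the bands listed for the 50-item test:

```python
def interpret_score(score):
    """Criterion-referenced interpretation for the 50-item test above."""
    if score >= 40:
        return "very high"
    elif score >= 30:
        return "high"
    elif score >= 20:
        return "average"
    elif score >= 10:
        return "low"
    return "very low"

print(interpret_score(35))  # high
```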
Having an established norm for a test means obtaining the normal or average
performance in the distribution of scores. A normal distribution is obtained by increasing
the sample size. A norm is a standard and is based on a very large group of samples.
Norms are reported in the manual of standardized tests.
What is the use of a norm? (1) A norm is the basis for interpreting a test score. (2) A
norm can be used to interpret a particular score.
MODULE 2:
DEVELOPMENT AND ADMINISTRATION OF TESTS
Module 2 consists of two (2) lessons covering the Midterm, as follows:
Each lesson contains theory, questions, and an activity. A quiz is provided for you to
answer as required.
Opening
In designing a well-planned written test, first and foremost you should be able to
identify the intended learning outcomes in a course, where a written test is an
appropriate method to use. These learning outcomes are knowledge, skills, attitudes,
and values that every student should develop throughout the course. Clear articulation
of learning outcomes is a primary consideration in lesson planning because it serves as
the basis for evaluating the effectiveness of the teaching and learning process
determined through testing or assessment. Learning objectives or outcomes are
measurable statements that articulate, at the beginning of the course, what students
should know and be able to do or value as a result of taking the course.
Original Taxonomy       Revised Taxonomy
Evaluation              Create
Synthesis               Evaluate
Analysis                Analyze
Application             Apply
Comprehension           Understand
Knowledge               Remember
Step 4. Determine the number of items for the whole test. To determine the number of
items to be included in the test, the amount of time needed to answer the items is
considered. As a general rule, students are given 30 - 60 seconds for each item in
test formats with choices. For a one-hour class, this means that the test should
not exceed 60 items, or maybe just 50 items.
Step 5. Determine the number of items per topic. To determine the number of
items to be
included in the test, the weights per topic are considered as shown below.
Topic                        % of Time (Weight)    No. of Items
Theories and Concepts 10.0 5
Psychoanalytic Theories 30.0 15
Trait Theories 20.0 10
Humanistic Theories 10.0 5
Cognitive Theories 10.0 5
Behavioral Theories 10.0 5
Social Learning Theories 10.0 5
TOTAL 100 50 Items
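The allocation in the table is just each topic's weight applied to the total number of
items; a minimal Python sketch of the computation:

```python
weights = {  # topic: % of time (weight)
    "Theories and Concepts": 10.0,
    "Psychoanalytic Theories": 30.0,
    "Trait Theories": 20.0,
    "Humanistic Theories": 10.0,
    "Cognitive Theories": 10.0,
    "Behavioral Theories": 10.0,
    "Social Learning Theories": 10.0,
}
total_items = 50
for topic, pct in weights.items():
    print(f"{topic}: {round(total_items * pct / 100)} items")
```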
Different Formats of a Table of Specifications (TOS)
There are three (3) types of TOS:
1. One-Way TOS. A one-way TOS maps out the content or topic, test objectives,
number of hours spent, and format, number, and placement of items. This type
of TOS is easy to develop and use because it just works around the objectives
without considering the different levels of cognitive behaviors. However, a one-
way TOS cannot ensure that all levels of cognitive behaviors that should have
been developed are covered in the test.
2. Two-Way TOS. A two-way TOS reflects not only the content, time spent, and
number of items but also the levels of cognitive behaviour targeted per test
content based on the theory behind cognitive testing.
For example:
One advantage of this format is that it allows one to see the levels of
cognitive skills and dimensions of knowledge that are emphasized by the test. It
also shows the framework of assessment used in the development of the test.
However, this format is more complex than the one-way format.
Content           Time      No. &       KD*   Level of Cognitive Behavior, Item Format,
                  Spent     Percent           No. and Placement of Items
                            of Items          K             C             AP   AN   SY   E
Theories and      0.5       5           F     1,3 #1-3
Concepts          hours     (10.0%)     C     1,2 #4-5
Psychoanalytic    1.5       15          F     1,2 #6-7
Theories          hours     (30.0%)     C     1,2 #8-9      1,2 #10-11
                                        P     1,2 #12-13    1,2 #14-15
                                        M     1,3 #16-18    11,1 #41      11,1 #42
etc.
Scoring                     1 point per item     2 points per item     3 points per item
OVERALL TOTAL     50 (100.0%)                    20            20            10
3. Three-Way TOS. This type of TOS reflects the features of one-way and two-way
TOS. One advantage of this format is that it challenges the test writer to classify
objectives based on the theory behind the assessment. It also shows the
variability of thinking skills targeted by the test. However, it takes much longer to
develop this type of TOS.
LESSON 5 :
CONSTRUCTION OF WRITTEN TESTS
THINK ABOUT THESE EXPECTATIONS:
Opening
Classroom assessments are an integral part of learners' learning. They do more than
measure learning. They also inform learners about what needs to be learned, to what
extent, and how to learn it. They also provide the parents some feedback about their
child’s achievement of the desired learning outcomes. The schools also get to benefit
from classroom assessments because learners’ test results can provide them evidence-
based data that are useful for instructional planning and decision making. It is important
that assessment tasks or tests are meaningful and further promote deep learning, as
well as fulfil the criteria and principles of test construction.
3. Is the test matched or aligned with the course's DLOs and the course content and
learning activities?
The assessment tasks should be aligned with the instructional activities and the
DLOs.
For Example:
If you want learners to articulate and justify their stand on ethical decision-
making and social practices in business (i.e., a DLO), then an essay test and a
class debate are appropriate measures and tasks for this learning outcome.
A multiple-choice test may be used, but only if you intend to assess learners'
ability to recognize what is ethical versus unethical decision-making practice.
Matching-type items may be appropriate if you want to know whether your
students can differentiate and match the different approaches or terms to
their definitions.
Items about content that was not discussed in the class or that learners have never
encountered, read, or heard about should be minimized or avoided.
For the purposes of classroom assessment, traditional tests fall into two general
categories:
1. Selected-Response Type - in which learners select the correct response from
given options.
2. Constructed-Response Type – in which the learners are asked to formulate
their own answers.
Selected-Response Tests. The learners are required to choose the correct answer
or best alternative from several choices. They are limited when assessing learning
outcomes that involve more complex and higher-level thinking skills. Selected-response
tests include:
1. Multiple Choice Test. It is the most commonly used format in formal testing
and typically consists of a stem (problem), one correct or best alternative
(correct answer), and three or more incorrect or inferior alternatives (distractors).
2. True-False or Alternative Response Test. It generally consists of a statement
that the learner judges to be true (accurate/correct) or false (inaccurate/
incorrect).
3. Matching-Type Test. It consists of two sets of items to be matched with each
other based on a specified attribute.
Content:
1. Write items that reflect only one specific content and cognitive processing skill.
2. Do not lift and use statements from the textbook or other learning materials as
test questions.
3. Keep the vocabulary simple and understandable based on the level of learners /
examinees.
4. Edit and proofread the items for grammatical and spelling errors before
administering them to the learners.
Stem:
1. Write the directions in the stem in a clear and understandable manner.
Faulty: Read each question and indicate your answer by shading the circle
corresponding to your answer.

Good: The test consists of two parts. Part A is a reading comprehension test, and
Part B is a grammar/language test. Each question is a multiple-choice test with
five (5) options. You are to answer each question but will not be penalized for a
wrong answer or for guessing. You can go back and review your answers during
the time allotted.
2. Write stems that are consistent in form and structure, that is, present all items in
question form or in descriptive or declarative form.
Faulty: 1) Who was the Philippine President during the Martial Law?
2) The first president of the Commonwealth of the Philippines was
______.
Good: 1) Who was the Philippine president during Martial Law?
2) Who was the first president of the Commonwealth of the Philippines?
3. Word the stem positively and avoid double negatives, such as NOT and
EXCEPT in a stem. If a negative word is necessary, underline or capitalize the
words for emphasis.
Faulty: Which of the following is not a measure of variability?
Good: Which of the following is NOT a measure of variability?
4. Refrain from making the stem too wordy or containing too much information
unless the problem/question requires the facts presented to solve the problem.
Faulty: What does DNA stand for, and what is the organic chemical of complex
molecular structure found in all cells and viruses that codes genetic information
for the transmission of inherited traits?
Options:
1. Provide three (3) to five (5) options per item, with only one being the correct or
best answer/alternative.
2. Write options that are parallel or similar in form and length to avoid giving clues
about the correct answer.
Faulty: What is an ecosystem?
a) It is a community of living organisms in conjunction with the non-living
components of their environment that interact as a system.
b) It is a place on Earth’s surface where life dwells.
c) It is an area that one or more individual organisms defend against
competition from other organisms.
d) It is the biotic and abiotic surroundings of an organism or population.

Good: Which experimental gas law describes how the pressure of a gas tends to
increase as the volume of the container decreases? (i.e., "The absolute pressure
exerted by a given mass of an ideal gas is inversely proportional to the volume
it occupies.")
a) Avogadro's Law    c) Charles' Law
b) Boyle's Law       d) Faraday's Law
4. Place the correct response randomly to avoid a discernible pattern of correct
answers.
5. Use None of the above carefully and only when there is one absolutely correct
answer, such as in spelling or math items.
Faulty: Which of the following is a nonparametric statistic?
a) ANCOVA    b) ANOVA    c) Correlation    d) None of the Above
Good: Which of the following is a nonparametric statistic?
a) ANCOVA b) ANOVA c) Correlation d) t-test
6. Avoid All of the Above option, especially if it is intended to be the correct answer.
Faulty: Who among the following has become President of the Philippine Senate?
a) Ferdinand Marcos c) Quintin Paredes
b) Manuel Quezon d) All of the Above
Good: Who was the first ever President of the Philippine Senate?
a) Ferdinand Marcos c) Manuel Quezon
b) Quintin Paredes d) Manuel Roxas
7. Make all options realistic and reasonable.
2. Ensure that the stimuli are longer and the responses are shorter.
Faulty: Match the description of the flag to its country.

          A                     B
_____ Bangladesh     a) Green background with red circle in the center.
_____ Indonesia      b) One red strip on top and white strip at the bottom.
_____ Japan          c) Red background with white five-petal flower in the center.
_____ Singapore      d) Red background with large yellow in the center.
_____ Thailand       e) Red background with large yellow pointed star in the center.
                     f) White background with large red circle in the center.
Item #1 is considered an unacceptable item because its response options are not
parallel and include different kinds of information that can provide clues to the
correct/wrong answers. On the other hand, Item #2 details the basis for matching,
and the response options only include related concepts.
True or false items are used to measure learners' ability to identify whether a
statement or proposition is correct/true or incorrect/false. They are best used when
learners’ ability to judge or evaluate is one of the desired learning outcomes of the
course.
There are different variations of the true or false items. These include the following:
1. T – F Correction or Modified True-or-False Question. In this format, the
statement is presented with a key word or phrase that is underlined, and the
learner has to supply the correct word or phrase.
e.g., Multiple-Choice is authentic.
2. Yes – No Variation. In this format, the learner has to choose yes or no, rather
than true or false.
e.g., The following are kinds of tests. Circle Yes if it is an authentic test and
No if not.
Multiple Choice test Yes No
Debates Yes No
End-of-the-Term Project Yes No
True or False Test Yes No
3. A – B Variation. In this format, the learner has to choose A or B, rather than
true or false.
e.g., Indicate which of the following are traditional or authentic tests by
circling A if it is a traditional test and B if it is authentic.
Traditional Authentic
Multiple Choice test A B
Debates A B
End-of-the-Term Project A B
True or False Test A B
Essay tests are the preferred form of assessment when teachers want to measure learners'
higher-order thinking skills, particularly their ability to reason, analyse, synthesize,
and evaluate.
Problem-solving test items are used to measure learners’ ability to solve problems
that require quantitative knowledge and competencies and/or critical thinking skills.
These items present a problem situation or task that requires learners to demonstrate
work procedures or come up with a correct solution.
There are different variations of the quantitative problem-solving items. These
include the following:
1. One Answer Choice. This type of question contains four or five options, and
students are required to choose the best answer.
Example: What is the mean of the following score distribution: 32, 44, 56, 69,
75, 77, 95, 96?
a) 68 b) 69 c) 72 d) 74
The correct answer is a) 68.
2. All Possible Answer Choices. This type of question has four or five options,
and students are required to choose all of the options that are correct.
Example: Consider the following score distribution: 12, 14, 14, 14, 17, 24, 27, 28, 30.
Which of the following is/are the correct measure(s) of central tendency?
Indicate all possible answers.

a) Mean = 20    c) Median = 17
b) Mean = 22    d) Mode = 14

Correct answers are options a, c, and d.
3. Type-In Answer. This type of question does not provide options to choose
from. Instead, the learners are asked to supply the correct answer. The teacher
should inform the learners at the start how their answers will be rated. The
teacher may require just the correct answer or may require learners to present
the step-by-step procedures in coming up with their answers. On the other hand, for
non-mathematical problem solving, such as a case study, the teacher may present a
rubric on how their answers will be rated.
Example: Compute the mean of the following score distribution: 32, 44, 56, 69, 75,
77, 95, 96. Indicate your answer in the blank provided.
In this case, the learners will only need to give the correct answer without having
to show the procedures for computation.
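The expected answers to the examples above can be checked with Python's statistics
module; a minimal sketch:

```python
import statistics

scores = [32, 44, 56, 69, 75, 77, 95, 96]
print(statistics.mean(scores))    # 68, the expected type-in answer

dist = [12, 14, 14, 14, 17, 24, 27, 28, 30]
print(statistics.mean(dist))      # 20 -> option a
print(statistics.median(dist))    # 17 -> option c
print(statistics.mode(dist))      # 14 -> option d
```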
In Item #1, the question asks "about how many" and does not indicate whether
learners need to give the exact weight or whether they need to round off their
answer, and to what extent.
2. Be specific and clear about the type of response required from the students.
Faulty: ASEANA Bottlers, Inc. has been producing and selling Tutti Fruity juice in
the Philippines, aside from their Singapore market. The sales for the juice in the
Singapore market were S$5 million more than those of their Philippine market in
2016, S$3 million more in 2017, and S$4.5 million more in 2018. If the sales in the
Philippine market in 2018 were P35 million, what were the sales in the Singapore
market during that year?

This is a faulty question because it does not specify in what currency the
answer should be presented.

Good: ASEANA Bottlers, Inc. has been producing and selling Tutti Fruity juice in
the Philippines, aside from their Singapore market. The sales for the juice in the
Singapore market were S$5 million more than those of their Philippine market in
2016, S$3 million more in 2017, and S$4.5 million more in 2018. If the sales in the
Philippine market in 2018 were P35 million, what were the sales in the Singapore
market during that year? Provide the answer in Singapore dollars (S$1 = P36.50).

This is a better item because it specifies in what currency the answer should be
presented, and the exchange rate is given.
MODULE 3:
ADMINISTRATION OF TESTS AND
ORGANIZATION OF TEST RESULTS
Module 3 consists of two (2) lessons covering the Midterm, as follows:
Each lesson contains theory, questions, and an activity. A quiz is provided for you to
answer as required.
LESSON 6 :
4. Use Procedures and Statistical Analysis to Establish Test Validity and Reliability.
5. Decide whether a Test is Valid or Reliable.
6. Decide which Test Items are Easy and Difficult.
Opening
In order to establish the validity and reliability of an assessment tool, you need to
know the different ways of establishing test validity and reliability.
Test Reliability
In the first condition, a consistent response is expected when the test is given to the
same participants. In the second condition, reliability is attained if the responses to
the same test are consistent with the same test or its equivalent. In the third
condition, there is reliability when the person responds in the same way or consistently
across items that measure the same characteristic.
1. Test-Retest Method. You have a test, and you need to administer it at one time
to a group of examinees. Administer it again at another time to the “same group”
of examinees. There is a time interval of not more than 6 months between the
first and second administration of tests that measure stable characteristics, such
as standardized aptitude tests. The post test can be given with a minimum time
interval of 30 minutes.
Test-retest is applicable for tests that measure stable variables, such as aptitude
and psychomotor measures (e.g., a typing test, a task in physical education).
Correlate the test scores from the first and second administration. A significant and
positive correlation indicates that the test has temporal stability. Correlation refers
to a statistical procedure in which a linear relationship is expected between two
variables. You may use the Pearson Product Moment Correlation, or Pearson r,
because test data are usually on an interval scale.
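A minimal Python sketch of the test-retest computation, using hypothetical scores from
two administrations (Python 3.10+ provides statistics.correlation for Pearson r):

```python
import statistics

first = [15, 12, 18, 9, 14]    # hypothetical first administration
second = [16, 11, 17, 10, 15]  # hypothetical second administration

r = statistics.correlation(first, second)  # Pearson r
print(round(r, 2))  # a high positive r suggests temporal stability
```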
2. Parallel Forms Method. There are two versions of a test. The items need to
exactly measure the same skill. Each test version is called a “form.” Administer
one form at one time and the other form at another time to the "same" group of
participants. The responses on the two forms should be more or less the same.
Parallel Forms are applicable if there are two versions of the test. This is usually
done when the test is repeatedly used for different groups, such as entrance
examinations and licensure examinations. Different versions of the test are given
to a different group of examinees.
Correlate the test results for the first form and the second form. A significant and
positive correlation coefficient is expected. The significant and positive
correlation indicates that the scores on the two forms are the same or consistent.
Pearson r is usually used for this analysis.
3. Split-Half Method. The test is administered once, and the items are split into
two halves (e.g., odd- and even-numbered items), yielding two scores for each
examinee. Correlate the two sets of scores using Pearson r. After the correlation,
use another formula called the Spearman-Brown Coefficient. The correlation
coefficients obtained using Pearson r and Spearman-Brown should be significant and
positive to mean that the test has internal consistency reliability.

This technique will work well when the assessment tool has a large number of
items. It is also applicable for scales and inventories (e.g., a Likert Scale from
"strongly agree" to "strongly disagree").
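A minimal Python sketch of the split-half procedure with the Spearman-Brown step-up,
using hypothetical half-test scores:

```python
import statistics

def spearman_brown(r_half):
    """Step up the half-test correlation to estimate full-test reliability."""
    return (2 * r_half) / (1 + r_half)

odd_half = [8, 6, 9, 5, 7]     # hypothetical scores on odd-numbered items
even_half = [7, 6, 9, 4, 8]    # hypothetical scores on even-numbered items

r = statistics.correlation(odd_half, even_half)  # Pearson r (Python 3.10+)
print(round(spearman_brown(r), 2))
```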
You will notice that statistical analysis is required to determine the reliability of a
measure. The very basis of the statistical analysis used to determine reliability is
linear regression.

1. Linear Regression.
Linear regression is demonstrated when you have two variables that are
measured, such as two sets of scores on a test taken at two different times by
the same participants. When the two sets of scores are plotted in a graph (with an
x-axis and a y-axis), they tend to form a straight line. The straight line formed
by the two sets of scores is the linear regression line. When a straight line is
formed, you can say that there is a correlation between the two sets of scores.
This can be seen in the graph shown. The graph is called a scatterplot. Each
point in the scatterplot is a respondent with two scores (one for each test).
Given points: P(2, 2), M(4, 6), and Q(10, 8).

[Scatterplot: the points P(2, 2), M(4, 6), and Q(10, 8) plotted on the x-axis and y-axis.]
Example. Suppose that a teacher gave a 20-item spelling test of two-syllable words
on Monday and Tuesday. The teacher wanted to determine the reliability of the two
sets of scores by computing the Pearson r.

The value of a correlation coefficient does not exceed 1.00 or -1.00. A value of
1.00 or -1.00 indicates a perfect correlation.
3. Difference Between a Positive and a Negative Correlation
When the value of the correlation is positive, it means that the higher the scores
in x, the higher the scores in y. This is called a positive correlation.
When the value of the correlation is negative, it means that the higher the
scores in x, the lower the scores in y. This is called a negative correlation.
4. Determining the Strength of a Correlation
The strength of the correlation also indicates the strength of the reliability of
the test. This is indicated by the value of the correlation coefficient. The closer
the value to 1.00 or - 1.00, the stronger is the correlation. Below is the guide:
±1.00          - Perfect (±) correlation
±0.91 - ±0.99  - Very strong relationship
±0.71 - ±0.90  - Strong relationship
±0.41 - ±0.70  - Moderately strong relationship
±0.21 - ±0.40  - Low relationship
±0.01 - ±0.20  - Negligible relationship
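The guide above can be encoded directly; a minimal Python sketch:

```python
def interpret_r(r):
    """Verbal interpretation of a correlation coefficient, per the guide above."""
    a = abs(r)
    if a >= 1.00:
        return "Perfect correlation"
    if a >= 0.91:
        return "Very strong relationship"
    if a >= 0.71:
        return "Strong relationship"
    if a >= 0.41:
        return "Moderately strong relationship"
    if a >= 0.21:
        return "Low relationship"
    if a >= 0.01:
        return "Negligible relationship"
    return "No relationship"

print(interpret_r(0.95))  # Very strong relationship
```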
5. Determining the Significance of the Correlation
The correlation obtained between two variables could be due to chance. In
order to determine if the correlation is free of certain errors, it is tested for
significance. When a correlation is significant, it means that the probability of the
two variables being related is free of certain errors.
In order to determine if a correlation coefficient value is significant, it is
compared with an expected probability of correlation coefficient values called a
critical value. When the computed value is greater than the critical value, it
means that the correlation obtained has more than a 95% chance of being a true
correlation; hence, it is considered significant.
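In practice, the significance test is usually delegated to software; a minimal sketch
using scipy with hypothetical data, where the reported p-value is compared against 0.05:

```python
from scipy import stats

first = [15, 12, 18, 9, 14]    # hypothetical scores, first administration
second = [16, 11, 17, 10, 15]  # hypothetical scores, second administration

r, p = stats.pearsonr(first, second)
print(round(r, 2), round(p, 3))  # significant at the 0.05 level if p < 0.05
```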
Another statistical analysis mentioned to determine the internal consistency
of a test is Cronbach's alpha. Follow the procedure to determine internal
consistency.
Illustration:
The checklist has five items. The teacher wanted to determine if the
items have internal consistency.
Formula: Kendall's ω = 12ΣD² / [m²(N)(N² - 1)]

Substitute: Kendall's ω = 12(33.2) / [3²(5)(5² - 1)] = 398.4 / [9(5)(24)] = 398.4 / 1080

Kendall's ω = 0.3688888 or 0.37
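A minimal Python sketch of this formula, assuming ΣD² (the sum of squared deviations of
the item rank totals from their mean) has already been computed, with m raters and N items:

```python
def kendalls_w(sum_d_squared, m, n):
    """Kendall's coefficient of concordance: W = 12*ΣD² / (m²·N·(N² - 1))."""
    return 12 * sum_d_squared / (m ** 2 * n * (n ** 2 - 1))

# Values from the illustration above: ΣD² = 33.2, m = 3, N = 5 items.
print(round(kendalls_w(33.2, 3, 5), 2))  # 0.37
```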
Validity
A measure is valid when it measures what it is supposed to measure. If a quarterly
examination is valid, then the contents should directly measure the objectives of the
curriculum. If a scale that measures personality is composed of five factors, then the
scores on the five factors should have items that are highly correlated. If an entrance
examination is valid, it should predict students’ grades after the first semester.
1. Content Validity. When the items of the test reflect the content or objectives
that the test is supposed to measure.

A coordinator in science is checking the science test paper for grade 4. She
asked the grade 4 science teacher to submit the table of specifications containing
the objectives of the lesson and the corresponding items. The coordinator
checked whether each item is aligned with the objectives.
2. Face Validity. When the test is presented well, free of errors and administered
well. The test items and layout are reviewed and tried out on a small group of
respondents. A manual for administration can be made as a guide for the test
administrator.
The assistant principal browsed the test paper made by the math teacher. She
checked if the contents of the items are about mathematics. She examined if the
instructions are clear. She browsed through the items to see if the grammar is
correct and if the vocabulary is within the students' level of understanding.
4. Construct Validity. The components or factors of the test should contain items
that are strongly correlated. The Pearson r can be used to correlate the items
for each factor. However, there is a technique called factor analysis to determine
which items are highly correlated to form a factor.
5. Concurrent Validity. When two or more measures are present for each
examinee that measure the same characteristic. The scores on the measures
should be correlated.
An item is difficult if the majority of students are unable to provide the correct
answer. The item is easy if the majority of the students are able to answer it
correctly. An item can discriminate if the examinees who score high on the test
answer more items correctly than the examinees who got low scores.
Step 2. Select 27% of the papers from the lower group and 27% from the upper group.
For smaller classes such as a group of only 20 students, you may just divide
it in half with 10 test papers (students) belonging to the lower group and 10
test papers (students) belonging in the upper group.
In the example (40 high school students), 27% would be 10.8 or 11. You
are going to get the bottom 11 test papers (lower group) and the upper 11 test
papers (upper group) and set aside the middle 18 test papers.
Step 3. Tabulate the number of students in both the upper and lower groups who
selected each alternative.
Example: A tabulation of the number of students who selected each alternative for the
first five items of the given test is shown in Table 6.1.
Table 6.1. Sample Tabulation of Students' Responses

Item   Group (upper          Alternatives        No. of students who    Total
No.    and lower 27%)      a    b    c    d      got the item right
1      Upper               0    0    1    10     10                     11
       Lower               1    0    1    9      9                      11
2      Upper               8    1    1    1      8                      11
       Lower               4    2    2    3      4                      11
3      Upper               8    1    2    0      8                      11
       Lower               5    2    3    1      5                      11
4      Upper               1    0    0    10     10                     11
       Lower               0    1    0    10     10                     11
5      Upper               3    2    1    5      5                      11
       Lower               5    4    2    0      0                      11
The difficulty index of each item is computed using the formula below:

Item Difficulty = R / T
Where:
R = number of students who got the item right from both groups
T = total number of students from both groups
Example: Compute for the difficulty index of the first five test items given earlier.

Formula: Item Difficulty = R / T
Solutions: (see the summary table below)
Example: Compute for the discrimination index of the first five test items
given earlier.
Item   Upper   Lower   Difficulty   Discrimination   Verbal Interpretation      Decision
No.    Group   Group   Index        Index            (Discriminating Index)
1      10      9       0.86         0.09             Poor                       Reject/Revise
2      8       4       0.55         0.36             Good                       Retain
3      8       5       0.59         0.27             Moderate                   Retain
4      10      10      0.91         0                Poor                       Reject/Revise
5      5       0       0.23         0.45             High                       Retain
Formula: Discrimination Index = (RU - RL) / (½T)

Where: RU = number of students in the upper group who got the item right;
RL = number of students in the lower group who got the item right;
T = total number of students in both groups.

Solutions: The difficulty and discrimination indices for the five items are shown in
the table above.
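The two indices are simple proportions, so they are easy to script; a minimal Python
sketch using Item 2 from the table:

```python
def item_analysis(upper_correct, lower_correct, group_size):
    """Difficulty (R/T) and discrimination ((RU - RL) / ½T) indices,
    where T = 2 * group_size is the total of both groups."""
    difficulty = (upper_correct + lower_correct) / (2 * group_size)
    discrimination = (upper_correct - lower_correct) / group_size
    return round(difficulty, 2), round(discrimination, 2)

# Item 2: 8 of 11 upper-group and 4 of 11 lower-group students got it right.
print(item_analysis(8, 4, 11))  # (0.55, 0.36)
```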
In extreme cases, a negative value for the discriminating index might occur. This
would mean that there are more students in the lower group who got the item correctly
compared to the upper group. This could mean that the item is questionable and there
might be high degree of ambiguity in the test item. Remember however, that these are
assumptions or guesses as to the reasons why it occurred. The data from item analysis
tell us only which specific items are functioning poorly; they do not tell us the
reasons for or causes of their poor performance.
LESSON 7 :
ORGANIZATION OF TEST DATA USING TABLES AND
GRAPHS
THINK ABOUT THESE EXPECTATIONS:
Opening
The appropriate statistical tools and procedures to apply for the results of testing are
as follows:
For Traditional assessment, the common statistical tools to assess the scores are
measures of central tendency, point measures, and measures of variability.
For Authentic assessment, particularly performance tests, the same statistical tools
(measures of central tendency, point measures, and measures of variability) are still
applied.
For Rubric assessment, weighted arithmetic mean is used.
For Investigatory projects, usually the mean, t-test (bivariate experimental design),
z-test (bivariate descriptive design), F-test or ANOVA (analysis of variance), and many
others are employed.
The scores collected from assessments are arranged in a methodical order by
grouping them into classes in the form of a frequency distribution. This Lesson 7 of
Module 3 presents frequency distributions, tallying of scores, and graphical
representations like the bar graph, line graph, pictograph, and circle graph.
Frequency Distribution
Frequency distribution is applicable when the number of cases (N) is 30 or more.
The scores in Table 2.1 are the results of 50 teacher education students in a 110-item
test in Assessment of Learning 2 in a certain State University in Metro Manila.
50 97 96 95 48 55 58 59 51 53
85 80 83 77 70 60 62 63 64 65
90 91 92 93 90 83 82 66 67 68
98 70 71 72 73 74 75 76 77 69
98 71 72 73 75 78 79 84 86 87
In arranging the scores in the form of a frequency distribution, the steps are as follows:
Step 1. Find the absolute range. The range is obtained by subtracting the lowest score
(LS) from the highest score (HS).

R = HS - LS; R = 98 - 48; R = 50
Step 2. Find the class interval. In finding the class interval, divide the range by 10
and by 20 such that the class limits are not less than 10 and not more than 20,
provided that the classes cover the total number of scores.

Step 2 can be modified in finding the class interval by the use of Sturges'
formula to obtain a common result, as follows:

Formula: c = R / k, where k = 1 + 3.32 log N

Where: c = class interval; R = range; k = definite divisor

Computation:
a) Solve for k: k = 1 + 3.32 log 50 = 1 + 3.32(1.69897) = 6.64
b) Solve for c: c = R / k = 50 / 6.64 = 7.53, or c = 8 (rounded off)
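The same computation in a minimal Python sketch:

```python
import math

n_cases = 50            # number of scores
high, low = 98, 48      # highest and lowest scores

r = high - low                        # absolute range
k = 1 + 3.32 * math.log10(n_cases)    # Sturges' formula
c = r / k                             # class interval
print(round(k, 2), round(c, 2), math.ceil(c))  # 6.64 7.53 8
```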
Step 3. Set up the classes. Look for a multiple of c whose product is less than or
equal to the lowest score.

8 x 6 = 48
Step 4. Choose the starting lower class limit. The product (48) serves as the lowest
class limit; the upper limit of each class is obtained by adding c - 1 (that is, 7)
to its lower limit.
Step 5. List down the class limits or class intervals and tally the scores for each
class interval. The procedure starts from the lowest class limit in a vertical
column going upward.
Class Interval    Tally              Frequency (f)
96 - 103          IIII               4
88 - 95           IIII-I             6
80 - 87           IIII-III           8
72 - 79           IIII-IIII-II       12
64 - 71           IIII-IIII          10
56 - 63           IIII               5
48 - 55           IIII               5
Total                                N = 50
The tallies must be carefully checked so that the sum of the class frequencies equals
the number of cases. If an unequal tally occurs, tallying must be repeated and rechecked
to arrive at an exact tally and frequency. At the bottom of column 3, the symbol N or Σf
is written, which means the number of cases (N) or the "sum of" frequencies (Σf) equals 50.
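The tallying can be checked in code; a minimal Python sketch that groups the 50 scores
from Table 2.1 into the classes above:

```python
from collections import Counter

scores = [50, 97, 96, 95, 48, 55, 58, 59, 51, 53,
          85, 80, 83, 77, 70, 60, 62, 63, 64, 65,
          90, 91, 92, 93, 90, 83, 82, 66, 67, 68,
          98, 70, 71, 72, 73, 74, 75, 76, 77, 69,
          98, 71, 72, 73, 75, 78, 79, 84, 86, 87]

c, start = 8, 48  # class interval and lowest class limit
freq = Counter((s - start) // c for s in scores)
for i in sorted(freq, reverse=True):
    lo = start + i * c
    print(f"{lo} - {lo + c - 1}: {freq[i]}")  # matches the tally column
```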
There are many types of graphs, but the most common methods of graphing a
frequency distribution are the following:
1. Histogram. [Figure 7.1. Histogram of the test scores, with frequencies (5 to 25)
on the y-axis and scores (20.00 to 80.00) on the x-axis.]
2. Frequency Polygon. This is also used for quantitative data, and it is one of the
most commonly used methods of presenting test scores. It is essentially a line
graph of the frequency distribution. It is very similar to a histogram, but instead
of bars, it uses lines to compare sets of test data on the same axes. Figure 7.2
illustrates a frequency polygon.
[Figure 7.2. Frequency polygon, with frequencies (8 to 14) on the y-axis and scores
(90 to 150) on the x-axis.]
You can construct a frequency polygon manually using the histogram in Figure 7.1
by following these simple steps.
Step 1. Locate the midpoint on the top of each bar. Bear in mind that the height of
each bar represents the frequency in each class interval, and the width of the
bar is the class interval. As such, that point in the middle of each bar is
actually the midpoint of that class interval. In the histogram in Figure 7.1,
there are two spaces without bars. In such a case, the midpoint falls on the
line.
Step 2. Draw a line to connect all the midpoints in consecutive order.
Step 3. The line graph is an estimate of the frequency polygon of the test scores.
Following the above steps, you can draw a frequency polygon using the histogram
presented earlier in Figure 7.1.
[Figure 7.3. Frequency polygon drawn from the midpoints of the histogram in Figure 7.1.]
4. Pie Graph. One commonly used method to represent categorical data is the use
of a circle graph. You have learned in basic mathematics that there are 360° in a
full circle. As such, the categories can be represented by slices of the circle
that appear like a pie, thus the name pie graph. The size of each slice is
determined by the percentage of students who belong in each category.
Example. In a class of 100 students, results were categorized according to
different levels, as shown below.
Group            No. of Students    Percentage (%)    Percent Equivalent in the Circle
Above Average    10                 10%               0.10 x 360 = 36°
Average          40                 40%               0.40 x 360 = 144°
Below Average    30                 30%               0.30 x 360 = 108°
Poor             20                 20%               0.20 x 360 = 72°
Total            100                100%              360°
Graph: [Pie graph of the results: Above Average 10%, Average 40%, Below Average 30%,
Poor 20%.]
Skewness
[Figures 7.6 to 7.8. A symmetrical (normal) distribution and two skewed distributions.]
Figure 7.6 is labeled as a normal distribution. Note that half the area of the curve is
a mirror reflection of the other half. It is a symmetrical distribution, which is also
referred to as a bell-shaped distribution. The higher frequencies are concentrated in
the middle of the distribution. A number of experiments have shown that IQ scores,
height, and weight of human beings follow a normal distribution.
The graphs in Figure 7.7 and Figure 7.8 are asymmetrical in shape. The degree of
asymmetry of a graph is its skewness. A basic principle of a coordinate system tells you
that, as you move toward the right of the x-axis, the numerical value increases.
Likewise, as you move up the y-axis, the scale value becomes higher. Thus, in a
negatively-skewed distribution, there are more who get higher scores, and the tail,
indicating lower frequencies of the distribution, points to the left or to the lower
scores. In a positively-skewed distribution, lower scores are clustered on the left
side. This means that there are more who get lower scores, and the tail indicates that
the lower frequencies are on the right or toward the higher scores.
Kurtosis
[Figure: three distributions of test scores (frequency f plotted against test scores)
differing in peakedness.]
What difference can you observe among the three distributions of test scores?
It is the flatness of the distribution, which is also the consequence of how high or
peaked the distribution is. This property is referred to as kurtosis.
What curve has more extreme scores than the normal distribution?
What curve has more scores that are far from the central value (or average) than
does the normal distribution?
For the meantime, the characteristics are simply described visually. The next lesson
will connect these visual characteristics to important statistical measures.
MODULE 4:
UTILIZATION AND COMMUNICATION
OF TEST RESULTS
Module 4 consists of two (2) lessons covering the Final Term, as follows:
Each lesson contains theory, questions, and an activity. A quiz is provided for you to
answer as required.
LESSON 8 :
ANALYSIS, INTERPRETATION, AND THE USE OF TEST DATA
Opening
The discussion in this lesson will build upon the concepts and examples presented
in Lesson 7, which focused on the tabular and graphical presentation and interpretation
of test results. In this lesson, other ways of summarizing test data using descriptive
statistics, which provides a more precise means of describing a set of scores, will be
discussed. The word “measures” is commonly associated with numerical and
quantitative data.
The phrase "measures of central tendency" refers to the central location or point of
convergence of a set of values. Test scores have a tendency to converge toward a central
value. This value is the average of the set of scores. In other words, a measure of
central tendency gives a single value that represents a given set of scores. Three
commonly used measures of central tendency or central location are the mean, the median,
and the mode.
Mean. This is the most preferred measure of central tendency for use with test
scores, also referred to as the “arithmetic mean”. The computation is very simple.
That is, x̄ = Σx / N

Where: x̄ = the mean; Σx = the sum of all the scores; N = the number of scores in the set
Consider the test scores of 15 students given in Table 8.1.
50 97 96 95 48 55 58 59
85 80 83 77 70 51 53
The given data are ungrouped, so use the formula for the mean of ungrouped data:

x̄ = Σx / N = 1057 / 15 = 70.466667 or 70.47
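The same result with Python's statistics module:

```python
import statistics

scores = [50, 97, 96, 95, 48, 55, 58, 59, 85, 80, 83, 77, 70, 51, 53]
print(round(statistics.mean(scores), 2))  # 70.47
```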
You have many ways of computing the mean. The traditional long computation
techniques have outlived their relevance due to the advancement of technology and the
emergence of statistical software. Using your scientific calculator, you will see the
symbols x̄ and Σx. Just follow the steps indicated in the guide. There are also simple
steps in Excel.
Total    N = 50    Σmf = 3751
In the traditional way, it cannot be denied that you can see at a glance how the
scores are distributed among the range of values in a condensed manner. You can
even estimate the average of the scores by looking at the frequency in each class
interval. In the absence of a statistical program, the mean can be computed with the
following formula:
x̄ = Σmf / N

Where: x̄ = the mean; m = midpoint of the class interval; f = frequency of each class
interval; N = total frequency
Thus, the mean of the test scores in Table 8.2 is calculated as follows:
x̄ = Σmf / N = 3751 / 50 = 75.02
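A minimal Python sketch of the grouped-data computation, using the class midpoints and
frequencies from the distribution built earlier (interval width c = 8):

```python
midpoints   = [51.5, 59.5, 67.5, 75.5, 83.5, 91.5, 99.5]  # class midpoints
frequencies = [5, 5, 10, 12, 8, 6, 4]                     # class frequencies

n = sum(frequencies)                                      # N = 50
mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n
print(round(mean, 2))  # 75.02, with sum(m*f) = 3751
```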
The easiest way is to use SPSS (Statistical Package for the Social Sciences) by
simply following these steps:
1. Open the Data Editor window. It is understood you have prepared the data set
earlier.
2. On the menu bar, click Analyze, then Descriptive Statistics, then Frequencies.
This opens the Frequencies dialog box.
3. Press Continue on the Descriptives Option box, then press OK on the left
Descriptives box, and you will finally see the output.
Median
Median is the value that divides the ranked score into halves, or the middle value of
the ranked scores. If the number of scores is odd, then there is only one middle value
that gives the median. However, if the number of scores in the set is even, then there
are two middle values. In this case, the median is the average of these two middle
values.
If there are more than 30 scores, arranging the scores and finding the middle value
will take time. The scientific calculator will not give you the median. Again, statistical
software can do this for you with simple steps similar to finding the mean.
1. On the menu bar, click Analyze, then Descriptive Statistics, then Frequencies.
This opens the Frequencies dialog box.
2. Click on the desired variable name in the left box. In the data set, let us consider
the test scores also in the table. Move your cursor to Statistics, and the
Frequencies box will pop out. Click Median.
3. You will also see that you can use the same process in finding the mean. Earlier,
you opted to use Descriptives instead of Frequencies. Then click Continue, then
press OK.
Again, how do you work it out the conventional way? Either you rank the 50 scores,
which takes time, or you arrange the scores in the frequency distribution as shown here:
Class Interval    Midpoint (m)    Frequency (f)    mf    Less than Cumulative
                                                         Frequency (<Cf)
Mdn = L + c[(N/2 - Cf) / f1]

Where: L = lower boundary of the median class; c = class interval; N = number of cases;
Cf = cumulative frequency below the median class; f1 = frequency of the median class

Solution: Mdn = L + c[(N/2 - Cf) / f1] = 71.5 + 8[(25 - 20) / 12]
Mdn = 71.5 + 8[5 / 12] = 71.5 + 8(0.41666667)
Mdn = 71.5 + 3.3333333 = 74.8333333 or 74.83
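A minimal Python sketch of the interpolation formula, using the values from the
solution above:

```python
def grouped_median(L, c, n, cf_below, f_median):
    """Median of grouped data: Mdn = L + c * ((N/2 - Cf) / f1)."""
    return L + c * ((n / 2 - cf_below) / f_median)

# Median class 72-79: L = 71.5, c = 8, N = 50, Cf = 20, f1 = 12.
print(round(grouped_median(71.5, 8, 50, 20, 12), 2))  # 74.83
```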
Mode
Mode is the easiest measure of central tendency to obtain. It is the score or value
with the highest frequency in the set of scores. If the scores are arranged in a frequency
distribution, the mode is estimated as the midpoint of the class interval which has the
highest frequency. The class interval with the highest frequency is also called the modal
class. In a graphical representation of the frequency distribution, the mode is the value
in the horizontal axis at which the curve is at its highest point. If there are two highest
points, then, there are two modes. When all the scores in a group have the same
frequency, the group of scores has no mode.
Considering the test data in Table 8.2, it can be seen that the highest frequency of 12
occurs in the class interval 72 - 79. The rough estimate of the mode is 75.5, which
is the midpoint of that class interval. Using statistical software and following the
steps for finding the mean and the median, a similar output will appear.
Measures of Dispersion
You can see that different distributions may be asymmetrical and may have the same
average value (mean, median, mode), but how the scores in each distribution are
spread out around these measures differs.
[Figure 8.1. Measures of variability of sets of test scores, plotted on a score scale
from 10 to 90.]
There are several indices of variability, and the most commonly used in the area of
assessment are the following:
Range. It is the difference between the highest score and the lowest score in a
distribution. It is the simplest measure of variability but also considered as the least
accurate measure of dispersion because its value is determined by just two scores in
the group. It does not take into consideration the spread of all scores; its value simply
depends on the highest and lowest scores. Its value could be drastically changed by a
single value. Consider the following examples:
Now, replace a high score in one of the scores, say, the last score and make it 40.
The range becomes:
Range = HS - LS
Range = 40 - 9
Range = 31
You will see that with just a single score, the range increased greatly, which can be
interpreted as a large dispersion of test scores; however, when you look at the
individual scores, the spread is not actually large.
Variance and Standard Deviation. Standard Deviation is the most widely used
measure of variability and is considered as the most accurate to represent the
deviations of individual scores from the mean values in the distribution.
Solve for the mean: x̄A = 120 / 10 = 12;  x̄B = 120 / 10 = 12;  x̄C = 120 / 10 = 12
You will note that while the distributions contain different scores, they have the
same mean. If you ask how well each mean represents the scores in its respective
distribution, there will be no doubt about the mean of distribution C because every score
in the distribution is 12. How about distributions A and B? For these two distributions,
the mean of 12 is a better estimate of the scores in distribution B than in distribution A.
You can see that no score in B is more than 4 points away from the mean of 12.
However, in distribution A, many of the scores are 4 points or more away from the
mean. You can see that there is less variability of scores in B than in A.
Recall that Σ(x - x̄), the sum of the deviation scores from the mean, is equal to
zero. As such, you square each deviation score, sum all the squared deviation scores,
and divide the sum by the number of cases. This yields the variance. Taking its square
root gives the standard deviation.
σ² = Σ(x − μ)² / N
Where:
σ² = population variance
μ = population mean
x = score in the distribution
N = number of scores in the distribution
Taking the square root gives us the formula for the standard deviation. That is,
σ = √[Σ(x − μ)² / N]
Where:
σ = population standard deviation; x = score in the distribution
μ = population mean; N = number of scores in the distribution
If you are dealing with sample data and wish to calculate an estimate of σ, the
following formula is used for the statistic:
s = √[Σ(x − x̄)² / (N − 1)]
Where:
s = sample standard deviation; x = raw score in the distribution
x̄ = sample mean; N = number of scores in the distribution
The corresponding sample variance is:
s² = Σ(x − x̄)² / (N − 1)
Where:
s² = sample variance; x = raw score in the distribution
x̄ = sample mean; N = number of scores in the distribution
Using the scores in Class A and Class B in the above data set, you can apply the
formula:
Class A                              Class B
x     (x − x̄)         (x − x̄)²      x     (x − x̄)         (x − x̄)²
22    22 − 12 = 10       100          16    16 − 12 = 4        16
18    18 − 12 = 6         36          15    15 − 12 = 3         9
16    16 − 12 = 4         16          15    15 − 12 = 3         9
14    14 − 12 = 2          4          14    14 − 12 = 2         4
12    12 − 12 = 0          0          12    12 − 12 = 0         0
11    11 − 12 = −1         1          11    11 − 12 = −1        1
9     9 − 12 = −3          9          11    11 − 12 = −1        1
7     7 − 12 = −5         25          9     9 − 12 = −3         9
6     6 − 12 = −6         36          9     9 − 12 = −3         9
5     5 − 12 = −7         49          8     8 − 12 = −4        16
Σx = 120   Σ(x − x̄)² = 276           Σx = 120   Σ(x − x̄)² = 74

a) Solve for the mean:  x̄ = Σx / N = 120 / 10 = 12

b) Solve for the variance and the standard deviation:
Class A:  s² = 276 / (10 − 1) = 276 / 9 = 30.67;  s = √30.67 = 5.54
Class B:  s² = 74 / (10 − 1) = 74 / 9 = 8.22;  s = √8.22 = 2.87
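You can verify the computation with Python's statistics module, which uses the same N − 1 (sample) formulas shown above:

import statistics

class_a = [22, 18, 16, 14, 12, 11, 9, 7, 6, 5]
class_b = [16, 15, 15, 14, 12, 11, 11, 9, 9, 8]

# statistics.variance and statistics.stdev divide by N - 1 (sample formulas).
print(statistics.mean(class_a), statistics.mean(class_b))   # 12 12
print(round(statistics.variance(class_a), 2))               # 30.67
print(round(statistics.variance(class_b), 2))               # 8.22
print(round(statistics.stdev(class_a), 2))                  # 5.54
print(round(statistics.stdev(class_b), 2))                  # 2.87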
You may be thinking that the process will be difficult if you are dealing with many
scores in a distribution. This is not really a problem if you have a scientific calculator.
Secure a scientific calculator for easy and accurate solutions.
Measures of Position
While measures of central tendency and measures of dispersion are used often in
assessment, there are other methods of describing data distributions such as using
measures of position or location. What are these measures?
Quartile. In our discussion of the measures of central tendency, you learned that the
median of a distribution divides the data into two equal groups. In a similar way, the
quartiles are the three values that divide a set of scores into four equal parts, with one-
fourth of the data values in each part. This means that about 25% of the data fall at or
below the first quartile (Q1); 50% of the data fall at or below the second quartile (Q2); and
75% fall at or below the third quartile (Q3).
Notice that Q2 is also the median. You can also say that Q1 is the median of the first
half of the values, and Q3 is the median of the second half of the values.
Example: Given the following scores, find the first quartile, the third quartile, and the
quartile deviation.
90 85 85 86 100 105 109 110 88 105 100 112
Q = (Q3 − Q1) / 2

Solutions:
85 85 86 88 90 100 100 105 105 109 110 112
Q1 = (86 + 88) / 2 = 174 / 2 = 87
Q3 = (105 + 109) / 2 = 214 / 2 = 107
Note that in the above example, each half contains an even number of values, so the
median of each half is the average of its two center values.
Consequently, applying the formula Q = (Q3 − Q1) / 2 gives the quartile deviation.
That is,
Q = (107 − 87) / 2 = 20 / 2 = 10
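A minimal Python sketch of this median-of-halves procedure (implemented directly so that it matches the method in the text rather than a library's interpolation-based quantiles):

import statistics

def quartiles_by_halves(scores):
    # Q1 and Q3 as the medians of the lower and upper halves of the data;
    # for an odd number of scores, the middle value is excluded from both halves.
    data = sorted(scores)
    half = len(data) // 2
    q1 = statistics.median(data[:half])
    q3 = statistics.median(data[-half:])
    return q1, q3

scores = [90, 85, 85, 86, 100, 105, 109, 110, 88, 105, 100, 112]
q1, q3 = quartiles_by_halves(scores)
print(q1, q3)            # 87.0 107.0
print((q3 - q1) / 2)     # 10.0 -- the quartile deviation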
Decile. It divides the distribution into ten equal parts. There are nine (9) deciles
such that 10% of the scores are equal to or less than the first decile (D1), 20% of the
scores are equal to or less than the second decile (D2), and so on. A student whose
mark is below the first decile is said to belong to decile 1. A student whose mark is
between the first and second deciles is in decile 2. A student whose mark is above the
ninth decile belongs to decile 10.
Percentile. It divides the distribution into 100 equal parts. In the same manner, for
percentiles, there are 99 percentiles such that 1% of the scores are less than the first
percentile, 2% of the scores are less than the second percentile, and so on.
Example: If you scored 95 in a 100-item test, and your percentile rank is 99th,
this means that 99% of those who took the test performed lower than you. It also
means that you belong to the top 1% of those who took the test.
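Under the convention that a percentile rank is the percentage of scores falling at or below a given score, a minimal Python sketch would be:

def percentile_rank(scores, score):
    # Percentage of scores at or below the given score (one common convention).
    at_or_below = sum(1 for s in scores if s <= score)
    return 100 * at_or_below / len(scores)

# Hypothetical class scores for illustration.
scores = [60, 65, 70, 72, 75, 78, 80, 85, 90, 95]
print(percentile_rank(scores, 90))   # 90.0 -- 9 of the 10 scores are at or below 90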
Normal Distribution
The normal distribution is also called the Gaussian Distribution, named after Carl
Friedrich Gauss. This distribution has been used as a standard reference for many
statistical decisions in the field of research and evaluation.
In assessment, the area under the curve refers to the proportion of scores that fall
within a specific number of standard deviations from the mean score. In other words,
each portion under the curve contains a fixed percentage of cases, as follows:
68% of the scores fall between one standard deviation below and above the
mean.
95% of the scores fall between two standard deviations below and above the
mean.
99.7% of the scores fall between three standard deviations below and
above the mean.
[Figure: the normal curve, with 68%, 95%, and 99.7% of the scores falling within ±1, ±2, and ±3 standard deviations of the mean, respectively]
From the above figure, you can state the properties of the normal curve:
1. The mean, median, and mode are all equal.
2. The curve is symmetrical. As such, the value of a specific area on the left is
equal to the value of its corresponding area on the right.
3. The curve changes from concave to convex and approaches the x-axis, but the
tails never touch the horizontal axis.
4. The total area under the curve is equal to one (1).
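These fixed percentages can be verified with Python's statistics.NormalDist, which gives the cumulative area under the standard normal curve:

from statistics import NormalDist

z = NormalDist()                       # standard normal: mean 0, SD 1
for k in (1, 2, 3):
    area = z.cdf(k) - z.cdf(-k)        # area between -k and +k standard deviations
    print(f"within ±{k} SD: {area:.2%}")
# within ±1 SD: 68.27%
# within ±2 SD: 95.45%
# within ±3 SD: 99.73%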
Standard Scores
In the preceding topic, you discussed raw scores, which are the original scores
collected from an actual testing situation. However, there are situations where
computing measures from raw scores may not be enough.
Consider a situation where you, as a student, want to know in what subjects you
performed best and poorest to determine where you need to exert more effort. In cases
like these, you cannot find the answer by merely relying on a single score. More
concretely, if you get a score of 86 in Science and 90 in English, you cannot conclude
that you performed better in English simply because 90 is higher than 86. Say you later
learned that the mean score of the class in Science was 80, and in English, the mean
score was 95. This indicates that a single score like 86 or 90 is not meaningful
unless it is compared with other test scores.
In particular, a score can be interpreted more meaningfully if you know the mean
and variability of the other scores where the single score belongs. Knowing this, a raw
score can be converted into Standard Scores.
z = (x − x̄) / s
The standard deviation helps you locate the relative position of the score in
a distribution. The equation gives you the z-score, which can indicate the
number of standard deviations the score is above or below the mean. A z-score
is called a standard score, simply because it is a deviation score expressed in
standard deviation units.
If raw scores are expressed as z-scores, you can see their relative position
in their respective distribution. If the raw scores are already converted into
standard scores, you can now compare the two scores even when these scores
come from different distributions or when scores are measuring two different
things, like knowledge in English or Science. The following figure illustrates this
point.
[Figure 8.4. A Comparison of Score Distributions with Different Means and Standard Deviations: Science scores centered at a mean of 80; English scores centered at a mean of 95]
In the above figure, a score of 86 in Science indicates better performance than a
score of 90 in English. Let us suppose that the standard deviations in Science and
English are 3 and 2, respectively. You can express these raw scores as z-scores.
Science:  z = (x − x̄) / s = (86 − 80) / 3 = 6 / 3 = 2
English:  z = (x − x̄) / s = (90 − 95) / 2 = −5 / 2 = −2.5
From the above, if 86 and 90 are your scores in the two subjects, you can
confidently say that, compared with the rest of your class, you performed better
in Science than in English.
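A minimal Python sketch of the same comparison:

def z_score(x, mean, sd):
    # Number of standard deviations a raw score lies above or below the mean.
    return (x - mean) / sd

science_z = z_score(86, mean=80, sd=3)
english_z = z_score(90, mean=95, sd=2)
print(science_z, english_z)   # 2.0 -2.5 -- better relative standing in Science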
2. T-score. As you can see from the computation of the z-score, it can give you a
negative number, which simply means the score is below the mean. However,
communicating a negative z-score as below the mean may not be understandable
to others. You would not even tell students that they got a negative z-score. A z-
score may also be a repeating or non-repeating decimal, which may not be
convenient to communicate. One option is to convert the z-score into a T-score,
which is a transformed standard score. In this scaling, the mean of 0 in the
z-score distribution is transformed into a mean of 50, and each z-score standard
deviation unit is multiplied by 10. The corresponding equation is:
T-score = 50 + 10z

For z = −2:  T-score = 50 + 10(−2) = 50 − 20 = 30
For z = 2:   T-score = 50 + 10(2) = 50 + 20 = 70
3. Stanine. Short for "standard nine," the stanine scale expresses scores on a
nine-point scale with a mean of 5 and a standard deviation of 2. A z-score is
converted with the equation:
Stanine = 2z + 5
Example: for z = 2, Stanine = 2(2) + 5 = 4 + 5 = 9
Scores in the stanine scale have some limitations. Since they are on a 9-point
scale and expressed as whole numbers, they are not precise. Different z-scores
or T-scores may have the same stanine score equivalent.
With the above percentage distribution of scores in each stanine, you can
directly convert a set of raw scores into stanine scores. Simply arrange the raw scores
from lowest to highest, and using the percentage of scores in each stanine, assign the
appropriate stanine score to each raw score. A code sketch of the formula-based
conversions appears below.
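A minimal Python sketch of the z-to-T and z-to-stanine conversions described above (the stanine is rounded to a whole number and clamped to the 1 - 9 range, as the scale requires):

def t_score(z):
    # T-score: mean 50, standard deviation 10.
    return 50 + 10 * z

def stanine(z):
    # Stanine: mean 5, SD 2, rounded and clamped to the 1-9 scale.
    return max(1, min(9, round(2 * z + 5)))

for z in (-2.0, 0.0, 2.0, 2.6):
    print(f"z = {z}: T = {t_score(z):.0f}, stanine = {stanine(z)}")
# z = -2.0: T = 30, stanine = 1
# z = 0.0: T = 50, stanine = 5
# z = 2.0: T = 70, stanine = 9
# z = 2.6: T = 76, stanine = 9  (different z-scores can share a stanine)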
LESSON 9:
ANALYSIS, INTERPRETATION, AND THE USE OF TEST DATA
It is important that you review your prior knowledge and experiences, as well as the
standards or policies used by your institution in grading and reporting learners’
performance in the test and the course as a whole. You may also need to read books
and other references on the topics to validate your a priori knowledge and to further
enhance your knowledge and skills.
Grades do not exist in a vacuum but are part of the instructional process and serve
as a feedback loop between the teacher and learners. They give feedback on what
specific topic/s learners have mastered and what they need to focus on more when they
review for a summative assessment or final exams. Grades serve as a motivator for
learners to study and do better in the next tests to maintain or improve their final grade.
Grades also give the parents information about their children's achievements. They
provide teachers some bases for improving their teaching and learning practices and for
identifying learners who need further educational intervention. They are also useful to
school administrators who want to evaluate the effectiveness of the instructional
programs in developing the needed skills and competencies of the learners.
There are various ways to score and grade results in multiple-choice tests.
Traditionally, the two most commonly-used scoring methods are number right scoring
(NR) and negative marking (NM).
Number Right Scoring (NR). It entails assigning positive values only to correct
answers while giving a score of zero to incorrect answers. The test score is the sum of
the scores for the correct responses. One major concern with this scoring method is that
learners may get the correct answer by guessing, thus affecting the test's reliability and
validity.
Example: Solve for 3(x + 8) - (x - 2) = -38.
a) x = 32   b) x = 8   c) x = -8   d) x = -32
For the above item, the correct answer is d) x = -32, and this response will be given a
score. Responses other than d will be given zero (0) points.
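A minimal Python sketch of number right scoring, using a hypothetical answer key and response set:

def number_right_score(key, responses):
    # NR scoring: 1 point per correct answer, 0 for an incorrect one; no penalty.
    return sum(1 for k, r in zip(key, responses) if k == r)

key       = ["d", "a", "c", "b", "d"]    # hypothetical answer key
responses = ["d", "a", "b", "b", "c"]    # one learner's responses
print(number_right_score(key, responses))    # 3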
There are four types of rating scales for the assessment of writing, which can also
be applied to other authentic or performance-type assessment. These four types of
scoring are (1) Holistic, (2) Analytic, (3) Primary Trait, and (4) Multiple Trait Scoring.
Rating/Grade       Characteristics
A (Exemplary)      Is very organized. Has a clear opening statement that catches the audience's interest. Content of the report is comprehensive and demonstrates substance and depth. Delivery is very clear and understandable. Uses slides/multimedia equipment effortlessly to enhance the presentation.
B (Satisfactory)   Is mostly organized. Has an opening statement relevant to the topic. Covers the important topics. Has an appropriate pace and no distracting mannerisms. Looks at slides to keep on track.
C (Emerging)       Has an opening statement relevant to the topic but does not give an outline of the speech; is somewhat disorganized. Lacks content and depth in the discussion of the topic. Delivery is fast and not clear; some items are not covered well. Relies heavily on slides and notes and makes little eye contact.
D (Unacceptable)   Has no opening statement regarding the focus of the presentation. Does not give adequate coverage of the topic. Is often hard to understand, with a voice that is too soft or too loud and a pace that is too quick or too slow. Just reads the slides, which contain too much text.
Primary Trait Scoring. It focuses on only one aspect or criterion of a task, and a
learner's performance is evaluated on only one trait. This scoring system defines a
primary trait in the task that will then be scored. For example, if a teacher in a political
science class asks his students to write an essay on the advantages and disadvantages
of Martial Law (i.e., the writing task), the basic question addressed in scoring is, "Did the
writer successfully accomplish the purpose of this task?" With this focus, the teacher
would ignore errors in the conventions of written language and instead focus on overall
rhetorical effectiveness. One disadvantage of this scoring scheme is that it is often
difficult to focus exclusively on one trait, such that other traits may be included when
scoring. Thus, it is important that a very detailed scoring guide is used for each specific task.
1. Raw Score. It is simply the number of items answered correctly on a test. A raw
score provides an indication of the variability in the performance of students in
the class. However, a raw score has no meaning unless you know what the test
is measuring and how many items it contains. A raw score also does not mean
much because it cannot be compared with a standard or with the performance of
another learner or of the class as a whole.
For example, a raw score of 95 would look impressive if the test contains only 100
items. However, if the test contains 500 items, then a raw score of 95 is not good
at all.
A test report that gives only a raw score but not the total number of items does not
meaningfully communicate the learner's performance or achievement. Raw
scores may be useful if everyone knows the test and what it covers, how many
possible right answers there are, and how learners typically do in the test.
3.2 Letter Grade. This is one of the most commonly used grading systems.
Letter grades are composed of a five-level grading scale labelled from A to E or F, with A
representing the highest level of achievement or performance, and E or F, the lowest
grade, representing a failing grade. These are often used for all forms of learners'
work, such as quizzes, essays, projects, and assignments. An example of the
descriptors for letter grades is presented below:
3.3 Plus (+) and Minus (−) Letter Grades. This grading provides a more
detailed description of the level of learners' achievement or task/test
performance by dividing each grade category into three levels, such that a
grade of A can be assigned as A+, A, and A−; B+, B, and B−; and so on. Plus
(+) and minus (−) grades provide a finer discrimination between achievement
or performance levels. They also increase the accuracy of grades as a
reflection of a learner's performance; enhance student motivation (i.e., to get a
high A rather than an A−); and discriminate among courses or star sections.
However, the +/− grading system is also viewed as unfair, particularly for learners
in the highest category; it creates stress for learners; and it is more difficult for
teachers, as they need to deal with more grade categories when grading learners.
Examples of the descriptors for plus (+) and minus (−) letter grades are
presented below:
Categorical grading methods have the same drawbacks as letter grades. Like
letter grades, categorical grades provide cut-offs between levels that are
often arbitrary, lack the richness of more detailed reporting methods, and fail
to provide feedback or information that can be used to diagnose learners'
weaknesses and refer them for remediation.
For example, a grade-equivalent score of 7.5 means that the learner did as well as a
typical Grade 7 learner taking the test at the end of the fifth month of the school year.
4.1.2 Age-Equivalent Score. It indicates the age level for which a given raw
score is typical. It reflects a learner's performance in terms of
chronological age as compared with those in the norm group. Age-
equivalent scores are written with a hyphen between years and
months.
For example, a learner's score of 11-5 means that his age equivalent is
11 years and 5 months, indicating a test performance similar to that of
11½-year-olds in the norm group.
4.2 Percentile Rank. This indicates the percentage of scores that fall at or below
a given score. Percentile ranks range from 1 to 99.
4.3 Stanine Score. This system expresses test results in nine equal steps,
which range from one (lowest) to nine (highest). A stanine score of 5 is
interpreted as “average” stanine. Percentile ranks are grouped into stanines,
with the following verbal interpretations:
4.4 Standard Scores. These are raw scores converted into a common
scale of measurement that provides a meaningful description of the individual
scores within the distribution. A standard score describes the distance of a
raw score from the sample mean, expressed in standard deviation units. The
two most commonly used standard scores are 1) the z-score and 2) the T-score.
4.4.1 z-score. It indicates the number of standard deviations a raw score
lies above or below the mean. It is computed as:
z = (x − x̄) / s   or   z = (x − μ) / σ
4.4.2 T-score. It is another type of standard score, where the mean is equal
to 50 and the standard deviation is equal to 10. It is a linear
transformation of z-scores, which have a mean of 0 and a standard
deviation of 1. It is computed from a z-score with the following formula:
T-score = 50 + 10z
General Guidelines in Grading Tests or Performance Tasks
The following are the general guidelines in grading tests or performance tasks:
1. Stick to the Purpose of the Assessment.
Before coming up with an assessment, it is important to first determine the
purpose of the test. Will the assessment be used for diagnostic purposes?
Will it be a formative assessment or a summative assessment?
Diagnostic and formative assessments are generally not graded.
Diagnostic assessments are primarily used to gather feedback about the
learners' prior knowledge or misconceptions before the start of a learning
activity, while results from formative assessments are used to determine
what learners need to improve on or what topics or course contents need
to be addressed and given emphasis by the teacher.
5. Decide on What Type of Test Scores to Use. As discussed earlier, there are
different ways by which students' learning can be measured and presented.
Performance in a particular test can be measured and reported through raw
scores, percentage scores, criterion-referenced scores, or norm-referenced
scores. It is important that different types of grading schemes be used for different
tests, assignments, or performance tasks. Learners should also be informed at
the start about what grading system is to be used for a particular test or task.
Essays require more time to grade than the other types of traditional tests. Grading
essay tests can also be influenced by extraneous factors, such as learners’ handwriting
legibility and raters’ biases. The following are the general guidelines in scoring essay
tests:
1. Identify the Criteria for Rating the Essay. The criteria or standards for
evaluating the essay should be predetermined. Some of the criteria that can be
used include content, organization/format, grammar proficiency, development
and support, focus and details, etc. It is important that the specific standards and
criteria included are relevant to the type of performance task given.
2. Determine the Type of Rubric to Use. There are two types of rubric: the holistic
and the analytic scoring systems. A holistic rubric requires evaluating the essay
while taking into consideration all the criteria. Only a single score is given based
on an overall judgment of the learner's writing composition. A holistic rubric is
viewed as more convenient for teachers as it requires fewer aspects of the
writing to evaluate. However, it does not provide feedback on which course
topics/contents are weak and need improvement. On the other hand, an analytic
scoring system requires that the essay be evaluated on each of the criteria. It
provides useful feedback on the learner's strengths and weaknesses for each
course content or criterion.
3. Prepare the Rubric. In developing a rubric, the skills and competencies related
to essay writing should first be identified. These skills and competencies
represent the criteria. Then, performance benchmarks and point values are
determined. Performance benchmarks can be numerical categories, but the
most frequently used are descriptors with corresponding rating scales.
5. Score One Essay Question at a Time. This is to ensure that the same thinking
and standards are applied for all learners in the class. The rater should try to
avoid any distraction or interruption when evaluating the same item.
6. Be Conscious of Your Own Biases when Evaluating a Paper. The rater should
not be affected by learners' handwriting, writing style, length of responses, and
other factors. He/she should stick to the criteria included in the rubric when
evaluating the essay.
7. Review Initial Scores and Comments Before Giving the Final Rating. This is
important especially for essays that were initially given a barely passing or failing
grade.
8. Get Two or More Raters for essays that are high-stakes, such as those used for
admission, placement, or scholarship screening purposes. The final grade will be
the average of all the ratings given.
Steps                                        Examples
1. Get the total score for each component.   WW1 + WW2 + WW3 + . . . = WWT (e.g., 145 out of 160)
                                             PT1 + PT2 + PT3 + . . . = PTT (e.g., 100 out of 120)
                                             QA = 40 out of 50
2. Convert each total to a Percentage        WW = 145/160 x 100 = 90.63
   Score (PS).                               PT = 100/120 x 100 = 83.33
                                             QA = 40/50 x 100 = 80.00
3. Convert each PS to a Weighted Score       (See the assigned weights for each component in the next table.)
   (WS).                                     WS for WW in English = 90.63 x 0.30 = 27.19
                                             WS for PT in English = 83.33 x 0.50 = 41.67
                                             WS for QA in English = 80.00 x 0.20 = 16.00
4. Add the weighted scores to get the        Initial Grade for English = 27.19 + 41.67 + 16.00 = 84.86
   Initial Grade.
5. Transmute the Initial Grade to the        (Use the Transmutation Table from DepEd Order 8, s. 2015.)
   Quarter Grade (QG).                       For 84.86, the transmuted grade is 90, which is the QG.
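A minimal Python sketch of steps 1 - 4 (step 5, the transmutation, is a lookup in the DepEd Order 8, s. 2015 table, which is not reproduced here):

def initial_grade(components, weights):
    # Initial Grade = sum of the weighted percentage scores of the
    # WW, PT, and QA components, each given as (raw_total, highest_possible).
    weighted = [round(100 * raw / top * w, 2)
                for (raw, top), w in zip(components, weights)]
    return round(sum(weighted), 2)

# Worked example for English; (0.30, 0.50, 0.20) are the Languages weights.
english = [(145, 160), (100, 120), (40, 50)]
print(initial_grade(english, weights=(0.30, 0.50, 0.20)))   # 84.86
# Transmuting 84.86 using the DepEd table yields a Quarter Grade of 90.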
Weights of the Three (3) Components for Grades 1 - 10 and Senior High School

Grades 1 - 10
Component               Languages/AP/EsP    Science/Math    MAPEH/EPP/TLE
Written Work                  30%               40%              20%
Performance Tasks             50%               40%              60%
Quarterly Assessment          20%               20%              20%
Weights for the Three (3) Components for Senior High School
For MAPEH, individual grades are given in each area (i.e., Music, Art, PE, and Health).
The quarterly grade for MAPEH is the average of the grades across the four areas:
Quarterly Grade for MAPEH = (QG in Music + QG in Art + QG in PE + QG in Health) / 4
The final grade for each subject is then computed by getting the average of the four
quarterly grades, as seen below:
Final Grade for Each Learning Area = (1st QG + 2nd QG + 3rd QG + 4th QG) / 4
The General Average, on the other hand, is computed by getting the average of the
final grades for all subject areas, with each subject area given equal weight:
General Average = (Sum of Final Grades of All Learning Areas) / (Total Number of Learning Areas in a Grade Level)
All grades reflected in the report card are reported as whole numbers. See an example of
a report card:

                                                Quarter
Subject Area                               1     2     3     4     Final Grade
Filipino                                  86    88    85    90         87
English                                   83    82    83    85         83
Mathematics                               87    92    93    95         92
Science                                   82    84    88    86         85
Araling Panlipunan                        90    92    92    93         92
Edukasyon sa Pagpapakatao                 80    83    85    88         84
Edukasyong Pantahanan at Pangkabuhayan    86    82    85    83         84
MAPEH                                     90    92    93    94         92
General Average                                                        87
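The report card values can be reproduced with a short Python sketch that averages the four quarter grades per subject and then averages the resulting final grades:

quarter_grades = {
    "Filipino": [86, 88, 85, 90],
    "English": [83, 82, 83, 85],
    "Mathematics": [87, 92, 93, 95],
    "Science": [82, 84, 88, 86],
    "Araling Panlipunan": [90, 92, 92, 93],
    "Edukasyon sa Pagpapakatao": [80, 83, 85, 88],
    "Edukasyong Pantahanan at Pangkabuhayan": [86, 82, 85, 83],
    "MAPEH": [90, 92, 93, 94],
}

# Final grade per learning area: average of the four quarter grades, rounded
# to a whole number, as required for the report card.
final_grades = {s: round(sum(q) / 4) for s, q in quarter_grades.items()}
print(final_grades["Mathematics"])                              # 92

# General Average: equal-weight mean of all final grades, also a whole number.
print(round(sum(final_grades.values()) / len(final_grades)))    # 87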
Learners' grades are then communicated to parents and guardians every quarter during
the parent-teacher conference by showing and discussing with them the report card. The
grading system and the descriptors are as follows: