Module 1 - 4 ASSESSMENT IN LEARNING 1

MODULE 1:

INTRODUCTION TO ASSESSMENT IN LEARNING

Module 1 consists of three (3) lessons covered during the Prelim term, as
follows:

Lesson 1: Basic Concepts and Principles in Assessing Learning


Lesson 2: Assessment Purposes, Learning Targets, and Appropriate Methods
Lesson 3: Different Classifications of Assessment

Each lesson contains theory, questions, and an activity. A quiz is provided for you to
answer as required.

LESSON 1: BASIC CONCEPTS AND PRINCIPLES IN ASSESSING LEARNING

THINK ABOUT THESE EXPECTATIONS:

1. Describe assessment in learning and related concepts.


2. Demonstrate understanding of the different principles in assessing learning
through the preparation of an assessment plan.

Assessment in Learning

The word assessment is rooted in the Latin word assidere, which means “to sit
beside another.” Assessment is generally defined as the process of gathering
quantitative and/or qualitative data for the purpose of making decisions. Assessment in
learning is as vital to the educational process as curriculum and instruction. Schools
and teachers will not be able to determine the impact of curriculum and instruction on
students or learners without assessing learning.

Assessment in Learning can be defined as the systematic and purpose-oriented


collection, analysis, and interpretation of evidence of student learning in order to make
informed decisions relevant to the learners. In essence, the aim of assessment is to use
evidence on student learning to further promote and manage learning. Assessment in
learning can be characterized as (a) a process, (b) based on specific objectives, and (c)
from multiple sources.

Assessment in learning is different from the concept of measurement or evaluation


of learning. Measurement can be defined as the process of quantifying the attributes of
an object, whereas evaluation may refer to the process of making value judgments on
the information collected from measurement based on specified criteria. In the context
of assessment in learning, measurement refers to the actual collection of information
on student learning through the use of various strategies and tools, while evaluation
refers to the actual process of making a decision or judgment on student learning based
on the information collected from measurement.

The most common form of assessment is testing. In the educational context,


testing refers to the use of a test or battery of tests to collect information on student
learning over a specific period of time. A test is a form of assessment, but not all
assessments use tests or testing. A test can be categorized as:

a) Selected response (e.g., matching type of test)


b) Constructed response (e.g., essay test, short answer test)
c) Objective format (e.g., multiple choice, enumeration)
d) Subjective format (e.g., essay)

The objective format provides for more bias-free scoring because the test items have exact
correct answers, while the subjective format allows for a less objective means of scoring,
especially if no rubric is used.
A Table of Specifications (TOS) is used to map out the essential aspects of a test
(e.g., test objectives, contents, topics covered by the test, item distribution).
Descriptive statistics are used to describe and interpret the results of tests. A test is
said to be good and effective if it is valid, reliable, has an acceptable level of difficulty, and
can discriminate between learners with higher and lower ability.


A related concept to assessment in learning is grading. Grading is defined as the


process of assigning value to the performance or achievement of a learner based on
specified criteria or standards. Aside from tests, other classroom tasks can serve as
bases for grading learners. These may include a learner’s performance in recitation,
seatwork, homework, and project. The final grade of a learner in a subject or course is
the summation of information from multiple sources (i.e., several assessment tasks or
requirements). Grading is a form of evaluation which provides information on whether a
learner passed or failed a subject or a particular assessment task.

Different Measurement Frameworks Used in Assessment

The two most common psychometric theories that serve as frameworks for
assessment and measurement are the Classical Test Theory (CTT) and the Item
Response Theory (IRT).

Classical Test Theory (CTT) is also known as the true score theory. It explains
that variations in the performance of examinees on a given measure are due to variations
in their abilities. The CTT assumes that an examinee’s observed score in a given
measure is the sum of the examinee’s true score and some degree of error in the
measurement caused by some internal and external conditions. The CTT also assumes
that all measures are imperfect, and the scores obtained from a measure could differ
from the true score (i.e., true ability) of an examinee.

The CTT provides an estimation of item difficulty based on the frequency or
number of examinees who correctly answer a particular item; items answered correctly by
fewer examinees are considered more difficult. The CTT also provides
an estimation of item discrimination based on the number of examinees with higher or
lower ability to answer a particular item. Test reliability can also be estimated using
approaches from CTT (e.g., Kuder-Richardson 20, Cronbach’s Alpha). Item analysis
based on CTT has been the dominant approach because of the simplicity of calculating
the statistics (e.g., item difficulty index, item discrimination index, item-total correlation).

The Item Response Theory (IRT) analyses test items by estimating the probability
that an examinee answers an item correctly or incorrectly. One of the central differences
of IRT from CTT is that in IRT, it is assumed that the characteristic of an item can be
estimated independently of the characteristic or ability of the examinee and vice-versa.
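As a brief added illustration (the module does not name a specific IRT model; a one-parameter Rasch model is assumed here), the probability that an examinee with ability theta answers an item of difficulty b correctly can be sketched as:

import math

def rasch_probability(theta, b):
    """P(correct) = exp(theta - b) / (1 + exp(theta - b))."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# An examinee whose ability matches the item difficulty has a 50% chance of a
# correct answer; higher ability (or an easier item) raises that probability.
print(rasch_probability(theta=0.0, b=0.0))   # 0.5
print(rasch_probability(theta=1.0, b=0.0))   # about 0.73
print(rasch_probability(theta=0.0, b=1.0))   # about 0.27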

Different Types of Assessment in Learning

Assessment in learning could be of different types. The most common types are
formative, summative, diagnostic, and placement. Other experts would describe the
types of assessment as traditional and authentic.
Formative Assessment. Refers to assessment activities that provide information to
both teachers and learners on how they can improve the teaching-learning process. It is
formative because it is used at the beginning and during instruction for teachers to
assess learners’ understanding. The information collected on student learning allows
teachers to make adjustments to their instructional process and strategies to facilitate
learning. It also informs learners about their strengths and weaknesses to enable them to
take steps to learn better and improve their performance as the class progresses.

Summative Assessment. Refers to assessment activities that aim to determine


learners’ mastery of content or attainment of learning outcomes. It also provides
information on the quantity or quality of what students have learned or achieved at the
end of instruction. The data from summative assessment are typically used for
evaluating learners’ performance in class. These data also provide teachers with
information about the effectiveness of their teaching strategies and how they can
improve their instruction in the future. Through performance reports and teacher
feedback, summative assessment can also inform learners about what they have done
well and what they need to improve on in their future classes or subjects.

Diagnostic Assessment. It aims to detect the learning problems or difficulties


of the learners so that corrective measures or interventions are done to ensure learning.
It is usually done right after seeing signs of learning problems in the course of teaching.
It can also be done at the beginning of
the school year for spirally-designed curriculum so that corrective actions are applied if
pre-requisite knowledge and skills for the targets of instruction have not been mastered
yet.

Placement Assessment. It is usually done at the beginning of the school year to


determine what the learners already know and what their needs are, which could inform the
design of instruction. Grouping of learners based on the results of placement
assessment is usually done before instruction to make it relevant to address the needs
or accommodate the entry performance of the learners. The entrance examination given
in schools is an example of a placement assessment.

Traditional Assessment. Refers to the use of conventional strategies or tools to


provide information about the learning of students. Typically, objective (e.g., multiple
choice) and subjective (e.g., essay) paper-and-pencil tests are used. Traditional
assessments are often used as the basis for evaluating and grading learners. They are more
commonly used in classrooms because they are easier to design and quicker to score.

Authentic Assessment. Refers to the use of assessment strategies or tools that


allow learners to perform a task or create a product that is meaningful to them, as it
is based on real-world contexts. The authenticity of assessment tasks is best
described in terms of degree rather than the presence or absence of authenticity.

Different Principles in Assessing Learning

There are many principles in the assessment in learning. The following are
considered as core principles.

1. Assessment should have a clear purpose. The methods used in collecting


information should be based on this purpose. The interpretation of the data
collected should be aligned with the purpose that has been set. This assessment
principle is congruent with the outcome-based education (OBE) principles of
clarity of focus and design down.
2. Assessment is not an end in itself. Assessment serves as a means to
enhance student learning. It is not a simple recording or documentation of what
learners know and do not know. Collecting information about student learning,
whether formative or summative, should lead to decisions that will allow
improvement of the learners.
3. Assessment is an ongoing, continuous, and a formative process.
Assessment consists of a series of tasks and activities conducted over time. It is
not a one-shot activity and should be cumulative. Continuous feedback is an
important element of assessment.
4. Assessment is learner-centered. Assessment is not about what the teacher
does but what the learner can do. Assessment of learners provides teachers with
an understanding on how they can improve their teaching which corresponds to
the goal of improving student learning.
5. Assessment is both process- and product-oriented. Assessment gives equal
importance to learner performance or product and the process they engage in to
perform or produce a product.
6. Assessment must be comprehensive and holistic. Assessment should be
performed using a variety of strategies and tools designed to assess student
learning in a holistic way. This assessment principle is also congruent with the
OBE principle of expanded opportunity.
7. Assessment requires the use of appropriate measures. For assessment to
be valid, the assessment tools or measures used must have sound psychometric
properties including but not limited to validity and reliability. This assessment
principle is consistent with the OBE principle of high expectations.
8. Assessment should be as authentic as possible. Assessment tasks or
activities should closely, if not fully, approximate real-life situations or
experiences. Authenticity of assessment can be thought of as a continuum from
least authentic to most authentic, with more authentic tasks expected to be more
meaningful for learners.

LESSON 2 :
ASSESSMENT PURPOSES, LEARNING
TARGETS, AND APPROPRIATE METHODS
THINK ABOUT THESE EXPECTATIONS:

1. Explain the purpose of classroom assessment.


2. Formulate learning targets that match appropriate assessment methods.

The Purpose of Classroom Assessment

Assessment works best when its purpose is clear. Without a clear purpose, it is
difficult to design or plan assessment effectively and efficiently. In classrooms, teachers
are expected to know the instructional goals and learning outcomes, which will inform
how they will design and implement their assessment. In general, the purpose of
classroom assessment may be classified in terms of the following:
1. Assessment of Learning. This refers to the use of assessment to determine
learners’ acquired knowledge and skills from instruction and whether they were
able to achieve the curriculum outcomes. It is generally summative in nature.
2. Assessment for Learning. This refers to the use of assessment to identify the
needs of learners in order to modify instruction or learning activities in the
classroom. It is formative in nature and it is meant to identify gaps in the learning
experiences of learners so that they can be assisted in achieving the curriculum
outcomes.
3. Assessment as Learning. This refers to the use of assessment to help learners
become self-regulated. It is formative in nature and meant to use assessment
tasks, results, and feedback to help learners practice self-regulation and make
adjustments to achieve the curriculum outcomes.

It is very important that assessment is aligned with instruction and the identified
learning outcomes for learners. Knowing what will be taught (curriculum content,
competency, and performance standards) and how it will be taught (instruction) are as
important as knowing what we want from the very start (curriculum outcome) in
determining the specific purpose and strategy for assessment. The alignment is easier if
teachers have a clear purpose for why they are performing the assessment. Typically,
teachers use classroom assessment for assessment of learning more than assessment
for learning and assessment as learning.

The Roles of Classroom Assessment in the Teaching-Learning Process

While the purpose of assessment may be classified as assessment of learning,


assessment for learning, and assessment as learning, the specific purpose of an
assessment depends on the teacher’s objective in collecting and evaluating assessment
data from learners. More specific objectives for assessing student learning are
congruent to the following roles of classroom assessment in the teaching-learning
process:
Formative. Teachers conduct assessment because they want to acquire information on the
current status and level of learners’ knowledge and skills or competencies. Teachers may
need information (e.g., prior knowledge, strengths) about the learners prior to instruction,
so they can design their instructional plan to better suit the needs of the learners.
Diagnostic. Teachers can use assessment to identify specific learners’ weaknesses or
difficulties that may affect their achievement of the intended learning outcomes. Identifying
these weaknesses allows teachers to focus on specific learning needs and provide
opportunities for instructional intervention or remediation inside or outside the classroom.
Evaluative. Teachers conduct assessment to measure learners’ performance or achievement
for the purpose of making a judgment or grading in particular. Teachers need information on
whether the learners have met the intended learning outcomes after the instruction is fully
implemented.
Facilitative. Classroom assessment may affect student learning. On the teachers’ part,
assessment for learning provides information on students’ learning and achievement that
teachers can use to improve instruction and the learning experience of learners. On the part
of learners, assessment as learning allows them to monitor, evaluate, and improve their own
learning strategies. In both cases, student learning is facilitated.

Motivational. Classroom assessment can serve as a mechanism for learners to be motivated
and engaged in learning and achievement in the classroom. Grades, for instance, can
motivate and demotivate learners.

Learning Targets

Before discussing what learning targets are, it is important to first define educational
goals, standards, and objectives.
Goals. Goals are general statements about desired learner outcomes in a given year or
during the duration of a program (e.g., senior high school).
Standards. Standards are specific statements about what learners should know and are
capable of doing at a particular grade level, subject, or course. McMillan (2014) described
four different types of educational standards:
1. Content - desired outcomes in a content area.
2. Performance - what students do to demonstrate competence.
3. Developmental - sequence of growth and change over time.
4. Grade-level - outcomes for a specific grade.
Educational Objectives. Educational objectives are specific statements of learner
performance at the end of an instructional unit. These are sometimes referred to as
behavioural objectives and are typically stated with the use of verbs. The most
popular taxonomy of educational objectives is Bloom’s Taxonomy of Educational
Objectives.

The Bloom’s Taxonomy of Educational Objectives

Bloom’s Taxonomy consists of three domains: cognitive, affective, and
psychomotor. These three domains correspond to the three types of goals that
teachers want to assess: knowledge-based goals (cognitive), skills-based goals
(psychomotor), and affective goals (affective). Each taxonomy consists of different
levels of expertise with varying degrees of complexity. The most popular among the
three taxonomies is Bloom’s Taxonomy of Educational Objectives for Knowledge-
Based Goals. The taxonomy describes six levels of expertise: knowledge,
comprehension, application, analysis, synthesis, and evaluation, as shown in Table
2.1 below.

Table 2.1. Bloom’s Taxonomy of Educational Objectives in the Cognitive Domain

Knowledge. Recall or recognition of learned materials like concepts, events, facts, ideas, and procedures. Illustrative verbs: defines, recalls, names, enumerates, labels. Sample objective: Enumerate the six levels of expertise in Bloom’s Taxonomy of objectives in the cognitive domain.

Comprehension. Understanding the meaning of a learned material, including interpretation, explanation, and literal translation. Illustrative verbs: explains, describes, summarizes, discusses, translates. Sample objective: Explain each of the six levels of expertise in Bloom’s Taxonomy of objectives in the cognitive domain.

Application. Use of abstract ideas, principles, or methods in specific concrete situations. Illustrative verbs: applies, demonstrates, produces, illustrates, uses. Sample objective: Demonstrate how to use Bloom’s Taxonomy in formulating learning objectives.

Analysis. Separation of a concept or idea into constituent parts or elements and an understanding of the nature of and association among the elements. Illustrative verbs: compares, contrasts, categorizes, classifies, calculates. Sample objective: Compare and contrast the six levels of expertise in Bloom’s Taxonomy of objectives in the cognitive domain.

Synthesis. Construction of elements or parts from different sources to form a more complex or novel structure. Illustrative verbs: composes, constructs, creates, designs, integrates. Sample objective: Compose learning targets using Bloom’s Taxonomy.

Evaluation. Making judgments of ideas or methods based on sound and established criteria. Illustrative verbs: appraises, evaluates, judges, concludes, criticizes. Sample objective: Evaluate the congruence between learning targets and assessment methods.

(3)

The Revised Bloom’s Taxonomy of Educational Objectives

Anderson and Krathwohl proposed a revision of Bloom’s Taxonomy in the cognitive domain
by introducing a two-dimensional model for writing learning objectives. The first dimension,
the knowledge dimension, includes four types: factual, conceptual, procedural, and
metacognitive. The second dimension, the cognitive process dimension, consists of six
types: remember, understand, apply, analyze, evaluate, and create. A learning objective
formulated from this two-dimensional model contains a noun (type of knowledge) and a verb
(type of cognitive process). Below is an example of a learning objective.

Students will be able to differentiate qualitative research and quantitative research.

In this example, differentiate is the verb that represents the type of cognitive process
(in this case, analyze), while qualitative research and quantitative research is the noun
phrase that represents the type of knowledge (in this case, conceptual). See Table 2.2 and
Table 2.3 below.

Table 2.2.
Cognitive Process Dimensions in the Revised Bloom’s Taxonomy
of Educational Objectives

Create. Combining parts to make a whole. Illustrative verbs: compose, produce, develop, formulate, devise, prepare, design, construct, propose, re-organize. Sample objective: Propose a program of action to help solve Metro Manila’s traffic congestion.

Evaluate. Judging the value of information or data. Illustrative verbs: assess, measure, estimate, evaluate, critique, judge. Sample objective: Critique the latest film that you have watched; use the critique guidelines and format discussed in the class.

Analyze. Breaking down information into parts. Illustrative verbs: analyze, calculate, examine, test, compare, differentiate, organize, classify. Sample objective: Classify the following chemical elements based on some categories/areas.

Apply. Applying the facts, rules, concepts, and ideas in another context. Illustrative verbs: apply, employ, practice, relate, use, implement, carry out, solve. Sample objective: Solve the following problems using the…

Understand. Understanding what the information means. Illustrative verbs: describe, determine, interpret, translate, paraphrase, explain. Sample objective: Explain the causes of malnutrition in the country.

Remember. Recognizing and recalling. Illustrative verbs: identify, list, name, underline, recall. Sample objective: Name the 7th President of the…

Table 2.3
Knowledge Dimensions in the Revised Bloom’s Taxonomy of Educational Objectives

Factual. This type of knowledge is basic in every discipline. It tells the facts or bits of information one needs to know in a discipline. This type of knowledge usually answers questions that begin with “who”, “where”, “what”, and “when”. Sample question: What is the capital city of the Philippines?

Conceptual. This type of knowledge is also fundamental in every discipline. It tells the concepts, generalizations, principles, theories, and models that one needs to know in a discipline. This type of knowledge usually answers questions that begin with “what”. Sample question: What makes the Philippines the “Pearl of the Orient Seas”?

Procedural. This type of knowledge is also fundamental in every discipline. It tells the processes, steps, techniques, methodologies, or specific skills needed in performing a specific task. This type of knowledge usually answers questions that begin with “how”. Sample question: How do we develop items for an achievement test?

Metacognitive. This type of knowledge makes the discipline relevant to one’s life. It makes one understand the value of learning in one’s life. It requires reflective knowledge and strategies on how to solve problems. It usually answers questions that begin with “why”. Sample question: Why is engineering the most suitable course for you?
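To tie the two dimensions together, below is a minimal Python sketch added for illustration only (the names and structure are assumptions, not part of the module). It shows how a learning objective is placed in the two-dimensional Revised Bloom’s model: a verb names the cognitive process and a noun phrase names the knowledge dimension.

KNOWLEDGE_DIMENSIONS = ["factual", "conceptual", "procedural", "metacognitive"]
COGNITIVE_PROCESSES = ["remember", "understand", "apply", "analyze", "evaluate", "create"]

def classify_objective(cognitive_process, knowledge_dimension):
    """Return the cell of the two-dimensional table where an objective belongs."""
    if cognitive_process not in COGNITIVE_PROCESSES:
        raise ValueError(f"Unknown cognitive process: {cognitive_process}")
    if knowledge_dimension not in KNOWLEDGE_DIMENSIONS:
        raise ValueError(f"Unknown knowledge dimension: {knowledge_dimension}")
    return (cognitive_process, knowledge_dimension)

# The example objective "differentiate qualitative and quantitative research"
# pairs the verb "differentiate" (analyze) with a conceptual noun phrase.
print(classify_objective("analyze", "conceptual"))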


A learning target is a statement of student performance for a relatively restricted
type of learning outcome that will be achieved in a single lesson or a few days. It contains
both a description of what students should know, understand, and be able to do at the end
of instruction and something about the criteria for judging the level of performance
demonstrated. In other words, learning targets are statements on what
learners are supposed to learn and what they can do because of instruction. Compared
with educational goals, standards, and objectives, learning targets are the most specific
and lead to more specific instructional and assessment activities.

Types of Learning Targets

Many experts consider four primary types of learning targets: knowledge,


reasoning, skill, and product. See Table 2.4 below.
Table 2.4
Description and Sample Learning Targets
Knowledge Targets. Refers to factual, conceptual, and procedural information that learners must learn in a subject or content area. Sample: “I can explain the role of conceptual framework in a research.”

Reasoning Targets. Knowledge-based thought processes that learners must learn. It involves application of knowledge in problem solving, decision making, and other tasks that require mental skills. Sample: “I can justify my research problems with a theory.”

Skills Targets. Use of knowledge and/or reasoning to perform or demonstrate physical skills. Sample: “I can facilitate a focus group discussion (FGD) with research participants.”

Product Targets. Use of knowledge, reasoning, and skills in creating a concrete or tangible product. Sample: “I can write a thesis proposal.”

Other experts consider a fifth type of learning target – affect, which refers to
affective characteristics that students can develop and demonstrate because of
instruction. This includes attitudes, beliefs, interests, and values. Some experts use
disposition as an alternative term for affect. The example is shown below.
I can appreciate the importance of addressing potential ethical issues in the conduct
of thesis research.

Appropriate Methods of Assessment

Once the learning targets are identified, appropriate assessment methods can be
selected to measure student learning. The match between a learning target and the
assessment method used to measure if students have met the target is very critical.
Matrices of the different types of learning targets and sample assessment methods are
shown in Table 2.5.1 and Table 2.5.2 below.

Table 2.5.1
Matching Learning Targets with Paper-and-Pencil Types of Assessment

Learning Targets    Multiple Choice    True or False    Matching Type    Short Answer    Problem Solving    Essay
Knowledge           ///                ///              ///              ///             ///                ///
Reasoning           //                 /                /                /               ///                ///
Skills              /                  /                /                /               //                 //
Product             /                  /                /                /               /                  /
(Multiple Choice, True or False, and Matching Type are selected-response formats; Short Answer, Problem Solving, and Essay are constructed-response formats.)

Note: More checks mean better matches.


Table 2.5.2
Matching Learning Targets with Other Types of Assessment

Learning Targets Project-Based Portfolio Recitation Observation


Knowledge / /// /// //
Reasoning // // /// //
Skills // /// / //
Product /// /// / /

Note: More checks mean better matches.

There are other types of assessment, and it is up to the teachers to select the
method of assessment and design appropriate assessment tasks and activities to
measure the identified learning targets.

LESSON 3 :
DIFFERENT CLASSIFICATIONS OF
ASSESSMENT
THINK ABOUT THESE EXPECTATIONS:

1. Illustrate Scenarios in the Use of Different Classifications of Assessment.


2. Rationalize the Purpose of Different Forms of Assessment.
3. Decide on the kind of Assessment to be Used.

Different Classifications of Assessment

The different forms of assessment are classified according to purpose, form,


interpretation of learning, function, ability, and kind of learning.

Classification and Types:
Purpose: Educational, Psychological
Form: Paper-and-Pencil, Performance-Based
Function: Teacher-made, Standardized
Kind of Learning: Achievement, Aptitude
Ability: Speed, Power
Interpretation of Learning: Norm-Referenced, Criterion-Referenced

When do we use Educational and Psychological Assessment?

Educational assessments are used in the school setting for the purpose of tracking
the growth of learners and grading their performance. This assessment in the
educational setting comes in the form of formative and summative assessment.

Formative assessment is a continuous process of gathering information about


student learning at the beginning, during, and after instruction so that teachers can
decide how to improve their instruction until learners are able to meet the learning
targets. When the learners are provided with enough scaffolding, as indicated by the
formative assessment, then summative assessment is conducted.

The purpose of formative assessment is to track and monitor student learning and
their progress toward the learning target. Formative assessment can be any form of
assessment (paper-and-pencil or performance-based) that is conducted before, during,
and after instruction. Before instruction begins, formative assessment serves as a
diagnostic tool to determine whether learners already know about the learning target.
More specifically, formative assessment given at the start of the lesson determines the
following:
1. What learners know and do not know so that instruction can supplement what
learners do not know.
2. Misconceptions of learners so that they can be corrected.
3. Confusion of learners so that they can be clarified.
4. What learners can and cannot do so that enough practice can be given to
perform the task.

The information from educational assessment at the beginning of the lesson is used
by the teacher to prepare relevant instruction for learners. During instruction,
educational assessment is done where the teacher stops at certain parts of the teaching
episodes to ask learners questions, assign exercises, short essays, board work, and
other tasks. If the majority of the learners are still unable to accomplish the task, then
the teacher realizes that further instruction is needed by learners.

When the teacher observes that majority or all of the learners are able to
demonstrate the learning target, then the teacher can now conduct the summative
assessment. The purpose of summative assessment is to determine and record what the
learners have learned. It is best to have a summative assessment for each learning target
so that there is evidence that
learning has taken place.

Psychological assessments, such as tests and scales, are measures that determine
the learner’s cognitive and non-cognitive characteristics. Examples of cognitive tests
are those that measure ability, aptitude, intelligence, and critical thinking. Affective
measures are for personality, motivation, attitude, interest, and disposition. The results
of these assessments are used by the school’s guidance counsellor to perform
interventions on the learners’ academic, career, and social and emotional development.

When do We Use Paper-and-Pencil and Performance-Based Types of Assessment?

Paper-and-pencil type of assessments are cognitive tasks that require a single


correct answer. They usually come in the form of test types, such as binary (true or
false), short answer (identification), matching type, and multiple choice. The items
usually pertain to a specific cognitive skill, such as recalling, understanding, applying,
analysing, and creating. On the other hand, performance-based types of assessment
require learners to perform tasks, such as giving demonstrations, producing a product,
showing strategies, and presenting information.

The use of paper-and-pencil and performance-based tasks depends on the nature


and content of the learning target. Below are examples of learning targets that require a
paper-and-pencil type of assessment:
 Identify the parts of the plants
 Label the parts of microscope
 Compute the compound interest
 Classify the phase of a given matter
 Provide the appropriate verb in the sentence
 Identify the type of sentence

Below are learning targets that require performance-based assessment:


 Varnish a wooden cabinet
 Draw a landscape using paintbrush in the computer
 Write a word problem involving multiplication of polynomials
 Deliver a speech convincing your classmates that you are a good candidate for
the student council
 Write an essay explaining how humans and plants benefit from each other
 Mount a plant specimen on a glass slide

How do We Distinguish Teacher-Made from Standardized Test?

Standardized tests have fixed directions for administering and scoring. They can be
purchased with test manuals, booklets, and answer sheets. When these tests were
developed, the items were administered to a large sample of the target group, called the norm group.
The norm group’s performance is used to compare the results of those who took the
test.
Non-standardized or teacher-made tests are usually intended for classroom
assessment. They are used for classroom purposes, such as determining whether
learners have reached the learning target. These intend to measure behaviour (such as
learning) in line with the objectives of the course. Examples are quizzes, long tests, and
exams. Formative and summative assessments are usually teacher-made tests.

Can a teacher-made test become a standardized test? Yes, as long as it is valid,


reliable, and with a standard procedure for administering, scoring, and interpreting
results.

What Information is Sought from Achievement and Aptitude Tests?

Achievement tests measure what learners have learned after instruction or after
going through a specific curricular program. Achievement tests provide information on
what learners can do and have acquired after training and instruction. Achievement is a
measure of what a person has learned within or up to a given time (Yaremko et al. 1982).
Achievement can be measured by a variety of means.


Achievement can be reflected in the final grades of learners within a quarter. A quarterly
test composed of several learning targets is also a good way of determining the
achievement of learners.

Aptitudes are the characteristics that influence a person’s behavior that aid goal
attainment in a particular situation (Lohman 2005). Specifically, aptitude refers to the
degree of readiness to learn and perform well in a particular situation or domain (Corno
et al. 2002). Examples include the ability to comprehend instructions, manage one’s
time, use previously acquired knowledge appropriately, make good inferences and
generalizations, and manage one’s emotions.

How do We Differentiate Speed from Power Test?

Speed tests consist of easy items that need to be completed within a time limit.
Power tests consist of items with increasing level of difficulty, but time is sufficient to
complete the whole test.
Example of a Power Test: The test developed by the National Council of Teachers of
Mathematics that determines the ability of the examinees to utilize data to reason and
become creative, formulate, solve, and reflect critically on the problems provided.
Example of a Speed Test: A typing test in which examinees are required to correctly type
as many words as possible given a limited amount of time.

How do We Differentiate Norm-Referenced from Criterion-Referenced Test?

There are two types of test based on how the scores are interpreted: Norm-
Referenced and Criterion-Referenced Tests. Criterion-Referenced Test has a given set
of standards, and the scores are compared to the given criterion.
For example, in a 50-item test: 40 – 50 is very high, 30 – 39 is high, 20 – 29 is average,
10 – 19 is low, and 0 – 9 is very low.
One approach in criterion-referenced interpretation is to compare the score to a specific
cut-off.
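As an added illustration (based on the 50-item example above, not part of the original module), the cut-offs can be expressed as a simple scoring function in Python:

def criterion_level(score):
    """Interpret a raw score on a 50-item test against fixed criteria."""
    if score >= 40:
        return "very high"
    if score >= 30:
        return "high"
    if score >= 20:
        return "average"
    if score >= 10:
        return "low"
    return "very low"

print(criterion_level(35))  # high
print(criterion_level(8))   # very low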

The norm-referenced test interprets results using the distribution of scores of a


sample group. The mean and standard deviations are computed for the group. The
standing of every individual in a norm-referenced test is based on how far they are from
the mean and standard deviation of the sample. Standardized tests usually interpret
scores using a norm set from a large sample.
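As another added illustration (the scores below are hypothetical, not from the module), a norm-referenced interpretation expresses each examinee's standing relative to the group's mean and standard deviation as a z-score:

import statistics

norm_scores = [32, 45, 27, 38, 41, 30, 36, 29]  # hypothetical norm group scores
mean = statistics.mean(norm_scores)
sd = statistics.stdev(norm_scores)

def z_score(raw_score):
    """Number of standard deviations a raw score lies above or below the group mean."""
    return (raw_score - mean) / sd

print(f"mean = {mean:.2f}, sd = {sd:.2f}")
print(f"z-score for 45 = {z_score(45):.2f}")  # above the mean
print(f"z-score for 27 = {z_score(27):.2f}")  # below the mean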

Having an established norm for a test means obtaining the normal or average
performance in the distribution of scores. A normal distribution is obtained by increasing
the sample size. A norm is a standard and is based on a very large group of samples.
Norms are reported in the manual of standardized tests.

What is the use of a norm? (1) A norm is the basis for interpreting a test score. (2) A
norm can be used to interpret a particular score.

MODULE 2:
DEVELOPMENT AND ADMINISTRATION OF TESTS

Module 2 consists of two (2) lessons covered during the Midterm period, as follows:

Lesson 4 : Planning a Written Test


Lesson 5 : Construction of Written Test

Each lesson contains theory, questions, and an activity. A quiz is provided for you to
answer as required.

LESSON 4: PLANNING A WRITTEN TEST
THINK ABOUT THESE EXPECTATIONS:

1. Set appropriate instructional objectives for a written test.


2. Prepare a Table of Specifications (TOS) for a written test.

Opening

In designing a well-planned written test, first and foremost you should be able to
identify the intended learning outcomes in a course, where a written test is an
appropriate method to use. These learning outcomes are knowledge, skills, attitudes,
and values that every student should develop throughout the course. Clear articulation
of learning outcomes is a primary consideration in lesson planning because it serves as
the basis for evaluating the effectiveness of the teaching and learning process
determined through testing or assessment. Learning objectives or outcomes are
measurable statements that articulate, at the beginning of the course, what students
should know and be able to do or value as a result of taking the course.

Objectives for Testing


In developing a written test, the cognitive behaviors of learning outcomes are usually
targeted. Traditionally, Bloom’s Taxonomy was used to classify learning objectives
based on levels of complexity of the cognitive behaviors. With knowledge at the base
(i.e., lower order thinking skills), the categories progress to comprehension, application,
analysis, synthesis, and evaluation. However, Anderson and Krathwohl came up with a
revised taxonomy in which the nouns used to represent the levels of cognitive behaviour
were replaced by verbs, and the synthesis and evaluation were switched. See Figure
4.1 below.

Bloom (1956)          Anderson and Krathwohl (2001)

Evaluation            Create
Synthesis             Evaluate
Analysis              Analyze
Application           Apply
Comprehension         Understand
Knowledge             Remember

Figure 4.1. Taxonomies of Instructional Objectives

Table of Specifications (TOS)

A Table of Specifications (TOS), sometimes called a test blueprint, is a tool used by


teachers to design a test. It is a table that maps out the test objectives, contents, or
topics covered by the test; the level of cognitive behaviour to be measured; the
distribution of items, number, placement, and weights of test items; and the test format.


Generally, the TOS is prepared before a test is created. However, it is ideal to


prepare one even before the start of instruction. Teachers need to create a TOS for
every test that they intend to develop. The TOS is important because it does the
following:
1. Ensures that the instructional objectives and what the test captures match.
2. Ensures that the test developer will not overlook details that are considered
essential to a good test.
3. Makes developing a test easier and more efficient.
4. Ensures that the test will sample all important content areas and processes.
5. Useful in planning and organizing.
6. Offers an opportunity for teachers and students to clarify achievement
expectations.

General Steps in Developing a Table of Specifications


The following are the steps in developing a Table of Specifications:
Step 1. Determine the objectives of the test. The first step is to identify the test
objectives. These should be based on the instructional objectives. In general, the
instructional objectives or the intended learning outcomes are identified at the start,
when the teacher creates the course syllabus. There are three types of objectives:
(1) cognitive, (2) affective, and (3) psychomotor. Cognitive objectives are designed to
increase an individual’s knowledge, understanding, and awareness. On the other hand,
affective objectives aim to change an individual’s attitude into something desirable,
while psychomotor objectives are designed to build physical or motor skills.
Step 2. Determine the coverage of the test. The next step in creating the TOS is to
determine the contents of the test. Only topics or contents that have been discussed in
class and are relevant should be included in the test.
Step 3. Calculate the weight for each topic. Once the test coverage is determined, the
weight of each topic covered in the test is determined. The weight assigned per topic is
based on the relevance of each topic and the time spent to cover it during instruction.
The percentage of time for a topic in a test is determined by dividing the time spent on
that topic during instruction by the total amount of time spent on all topics covered in
the test.
For example, in a test on the Theories of Personality for a General Psychology 101 class,
the teacher spent from 0.5 to 1.5 class sessions (30 to 90 minutes) per topic. As such,
the weight for each topic is as follows:

Topic                        No. of Sessions       Time Spent    % of Time (Weight)
Theories and Concepts 0.5 class session 30 min 10.0
Psychoanalytic Theories 1.5 class sessions 90 min 30.0
Trait Theories 1 class session 60 min 20.0
Humanistic Theories 0.5 class session 30 min 10.0
Cognitive Theories 0.5 class session 30 min 10.0
Behavioral Theories 0.5 class session 30 min 10.0
Social Learning Theories 0.5 class session 30 min 10.0
TOTAL                        5 class sessions      300 min (5 hours)    100

Step 4. Determine the number of items for the whole test. To determine the number of
items to be included in the test, the amount of time needed to answer the items is
considered. As a general rule, students are given 30 to 60 seconds for each item in test
formats with choices. For a one-hour class, this means that the test should not exceed
60 items, or perhaps just 50 items.
Step 5. Determine the number of items per topic. To determine the number of items per
topic, the weights per topic are considered, as shown below.
Topic                        % of Time (Weight)    No. of Items
Theories and Concepts 10.0 5
Psychoanalytic Theories 30.0 15
Trait Theories 20.0 10
Humanistic Theories 10.0 5
Cognitive Theories 10.0 5
Behavioral Theories 10.0 5
Social Learning Theories 10.0 5
TOTAL 100 50 Items
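As an added illustration (using the General Psychology example above, not part of the original module), Steps 3 to 5 can be computed directly: each topic's weight is its instruction time divided by total instruction time, and items are allocated in proportion to that weight.

time_spent_minutes = {
    "Theories and Concepts": 30,
    "Psychoanalytic Theories": 90,
    "Trait Theories": 60,
    "Humanistic Theories": 30,
    "Cognitive Theories": 30,
    "Behavioral Theories": 30,
    "Social Learning Theories": 30,
}

total_items = 50  # about one item per minute for a one-hour testing period
total_time = sum(time_spent_minutes.values())  # 300 minutes

for topic, minutes in time_spent_minutes.items():
    weight = minutes / total_time          # Step 3: percentage of time (weight)
    items = round(weight * total_items)    # Step 5: number of items per topic
    print(f"{topic}: weight = {weight:.0%}, items = {items}")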
Different Formats of a Test Table of Specifications
There are three (3) types of TOS:
1. One-Way TOS. A one-way TOS maps out the content or topic, test objectives,
number of hours spent, and format, number, and placement of items. This type
of TOS is easy to develop and use because it just works around the objectives
without considering the different levels of cognitive behaviors. However, a one-
way TOS cannot ensure that all levels of cognitive behaviors that should have
been developed are covered in the test.

Topic: Theories and Concepts. Test objective: Recognize important concepts in personality theories. No. of hours spent: 0.5. Format and placement of items: Multiple Choice, Items #1-5. No. and percent of items: 5 (10.0%).
Topic: Psychoanalytic Theories. Test objective: Identify the different theories of personality under the psychoanalytic model. No. of hours spent: 1.5. Format and placement of items: Multiple Choice, Items #6-20. No. and percent of items: 15 (30.0%).
etc.
TOTAL: 5 hours; 50 items (100%)

2. Two-Way TOS. A two-way TOS reflects not only the content, time spent, and
number of items but also the levels of cognitive behaviour targeted per test
content based on the theory behind cognitive testing.

For example, the common framework for testing at present in the DepEd Classroom
Assessment Policy is the Revised Bloom’s Taxonomy (DepEd 2015).

One advantage of this format is that it allows one to see the levels of
cognitive skills and dimensions of knowledge that are emphasized by the test. It
also shows the framework of assessment used in the development of the test.
However, this format is more complex than the one-way format.

Content Time No. & KD* Level of Cognitive Behavior, Item Format, No.
Spent Percent and
of Items Placement of Items
K C AP AN SY E
F 1,3
Theories and 0.5 5 #1 - 3
Concepts hours (10.0%) C 1,2
#4 - 5
F 1,2
#6 - 7
Psychoanalytic 1.5 15 C 1,2 1,2
Theories hours (30.0%) #8 - 9 #10 -
11
P 1,2 1,2
#12 - #14 -
13 15
M 1,3 11,1 11,1
#16 - #41 #42
18
etc.
Scoring 1 point per item 2 points per 3 points per
item item
OVERALL 50
TOTAL (100.0%) 20 20 10

*Legend: KD = Knowledge Dimension (F = Factual, C = Conceptual, P = Procedural, M = Metacognitive); I = Multiple Choice, II = Open-Ended

3. Three-Way TOS. This type of TOS reflects the features of one-way and two-way
TOS. One advantage of this format is that it challenges the test writer to classify
objectives based on the theory behind the assessment. It also shows the
variability of thinking skills targeted by the test. However, it takes much longer to
develop this type of TOS.

No. of Level of Cognitive Behavior and


Content Learning Time Items Knowledge Dimension*, Item
Objectives Spent Format, No. and
Placement of Items
K C AP AN SY E
Theories and Recognize 0.5 5 1,3 1,2
Concepts important concepts hours (10.0%) #1- #4-
in personality 3 5
theories. (F) ©
1.5 15 1,2 1,2 1,2 1,2 11,1 11,1
hours (30.0%) #6- #8- #10- #14- #41 #42
Psychoanalytic Identify the 7 9 11 15 (M) (M)
Theories different theories of (F) (C) (C) (P)
personality under 1,2 1,3
psychoanalytic #12- #16-
model. 13 18
(P) (M)
etc. 1 point 3 points per 5 points
per item item per item
Scoring
OVERALL 50
TOTAL (100.0%) 20 20 10

Legend: KD = Knowledge Dimension (F = Factual, C = Conceptual, P = Procedural, M = Metacognitive); I = Multiple Choice, II = Open-Ended

LESSON 5 :
CONSTRUCTION OF WRITTEN TESTS
THINK ABOUT THESE EXPECTATIONS:

1. Identify the appropriate test format to measure learning outcomes.


2. Apply the general guidelines in constructing test items for different test formats.
3. Submit constructed test items for different test formats to my instructor.

Opening
Classroom assessments are an integral part of learners’ learning. They do more than
measure learning. They also inform the learners what needs to be learned, to what extent,
and how to learn it. They also provide the parents some feedback about their
child’s achievement of the desired learning outcomes. The schools also get to benefit
from classroom assessments because learners’ test results can provide them evidence-
based data that are useful for instructional planning and decision making. It is important
that assessment tasks or tests are meaningful and further promote deep learning, as
well as fulfil the criteria and principles of test construction.

General Guidelines in Choosing the Appropriate Test Format


Not every test is universally valid for every type of learning outcome. To guide you
on choosing the appropriate test format and designing fair and appropriate yet
challenging tests, you should ask the following important questions:
1. What are the objectives or desired learning outcomes of the subject/unit/lesson
being assessed?
Deciding on what test format to use generally depends on your learning
objectives or the desired learning outcomes of the subject/unit/lesson. Desired
learning outcomes (DLOs) are statements of what learners are expected to do or
demonstrate as a result of engaging in the learning process.
2. What level of thinking is to be assessed (i.e., remembering, understanding,
applying, analyzing, evaluating, and creating)? Does the cognitive level of the
test question match your instructional objectives or DLOs?
The level of thinking to be assessed is also an important factor to consider when
designing your test, as this will guide you in choosing the appropriate test format.
For Example:
 If you intend to assess how much your learners are able to identify
important concepts discussed in class (i.e., remembering or understanding
level), a selected-response format such as Multiple Choice test would be
appropriate.
 If you intend to assess how your students will be able to explain and apply
in another setting a concept or framework learned in class (i.e., applying
and/or analyzing level), you may consider giving constructed-response test
formats such as Essays.

3. Is the test matched or aligned with the course’s DLOs and the course contents and
learning activities?
The assessment tasks should be aligned with the instructional activities and the
DLOs.
For Example:
 If you want learners to articulate and justify their stand on ethical decision-
making and social practices in business (i.e., DLO), then an essay test and
class debate are appropriate measures and task for this learning outcome.
A multiple-choice test may be used but only if you intend to assess learners’ ability
to recognize what is ethical versus unethical decision-making practice.
 Matching-type items may be appropriate if you want to know whether your
students can differentiate and match the different approaches or terms to
their definitions.

4. Are the test items realistic to the students?


Test items should be meaningful and realistic to the learners. They should be
relevant or related to their everyday experiences. The use of concepts, terms, or
situations that have not been discussed in the class or that they have never encountered, read, or heard about
should be minimized or avoided.

The Major Categories and Formats of Traditional Tests

For the purposes of classroom assessment, traditional tests fall into two general
categories:
1. Selected-Response Type – in which learners select the correct response from
given options.
2. Constructed-Response Type – in which the learners are asked to formulate
their own answers.
Selected-Response Tests. The learners are required to choose the correct answer or best
alternative from several choices. They are limited when assessing learning outcomes that
involve more complex and higher-level thinking skills. Selected-response tests include:
1. Multiple Choice Test. It is the most commonly used format in formal testing
and typically consists of a stem (problem), one correct or best alternative
(correct answer), and three or more incorrect or inferior alternatives (distractors).
2. True-False or Alternative Response Test. It generally consists of a statement
that learners must judge to be true (accurate/correct) or false (inaccurate/incorrect).
3. Matching-Type Test. It consists of two sets of items to be matched with each
other based on a specified attribute.

Constructed-Response Tests. The learners are required to supply answers to a given
question or problem. These include:
1. Short Answer Test. It consists of open-ended questions or incomplete
sentences that require learners to create an answer for each item, which is
typically a single word or short phrase. This includes the following types:
 Completion. It consists of incomplete statements that require the learners to
fill in the blanks with correct word or phrase.
 Identification. It consists of statements that require the learners to identify or
recall the terms/concepts, people, places, or events that are being described.
 Enumeration. It requires the learners to list down all possible answers to the
question.

2. Essay Test. It consists of problems/questions that require learners to compose


or construct written responses, usually long ones with several paragraphs.

3. Problem-Solving Test. It consists of problems/questions that require learners


to solve problems in quantitative or non-quantitative settings using knowledge and
skills in mathematical concepts and procedures, and/or other higher-order
cognitive skills (e.g., reasoning, analysis, and critical thinking).

General Guidelines in Writing Multiple-Choice Test Items


Writing multiple-choice items requires content mastery, writing skills, and time. Only
good and effective items should be included in the test. Poorly written test items can
be confusing and frustrating to learners and yield test scores that are not appropriate
for evaluating their learning and achievement. The following are the general guidelines in
writing good multiple-choice items:

Content:

1. Write items that reflect only one specific content and cognitive processing skills.

Faulty: Which of the following is a type of statistical procedure used to test a


hypothesis regarding significant relationship between variables, particularly in
terms of the extent and direction of association?
a) ANCOVA b) ANOVA c) Chi-Square d) t-test

Good: Which of the following is an inferential statistical procedure used to test
a hypothesis regarding significant differences between two qualitative variables?
a) ANCOVA   b) ANOVA   c) Chi-Square   d) Mann-Whitney Test

2. Do not lift and use statements from the textbook or other learning materials as
test questions.

3. Keep the vocabulary simple and understandable based on the level of learners /
examinees.

4. Edit and proofread the items for grammatical and spelling errors before administering
them to the learners.

Stem:
1. Write the directions in the stem in a clear and understandable manner.
Faulty: Read each question and indicate your answer by shading the circle corresponding
to your answer.
Good: The test consists of two parts. Part A is a reading comprehension test, and Part B
is a grammar/language test. Each question is a multiple-choice item with five (5) options.
You are to answer each question but will not be penalized for a wrong answer or for
guessing. You can go back and review your answers during the time allotted.
2. Write stems that are consistent in form and structure, that is, present all items in
question form or in descriptive or declarative form.
Faulty: 1) Who was the Philippine President during the Martial Law?
2) The first president of the Commonwealth of the Philippines was
______.
Good: 1) Who was the Philippine president during Martial Law?
2) Who was the first president of the Commonwealth of the Philippines?
3. Word the stem positively and avoid double negatives, such as NOT and
EXCEPT in a stem. If a negative word is necessary, underline or capitalize the
words for emphasis.
Faulty: Which of the following is not a measure of variability?
Good: Which of the following is NOT a measure of variability?
4. Refrain from making the stem too wordy or containing too much information
unless the problem/question requires the facts presented to solve the problem.
Faulty: What does DNA stand for, and what is the organic chemical of complex molecular
structure found in all cells and viruses and codes genetic information for the
transmission of inherited traits?

Good: As a chemical compound, what does DNA stand for?

Options:

1. Provide three (3) to five (5) options per item, with only one being the correct or
best answer/alternative.
2. Write options that are parallel or similar in form and length to avoid giving clues
about the correct answer.
Faulty: What is an ecosystem?
a) It is a community of living organisms in conjunction with the non-living
components of their environment that interact as a system.
b) It is a place on Earth’s surface where life dwells.
c) It is an area that one or more individual organisms defend against
competition from other organisms.
d) It is the biotic and abiotic surroundings of an organism or population.

Good: What is an ecosystem?


a) It is a place on the Earth’s surface where life dwells.
b) It is the biotic and abiotic surroundings of an organism or population.
c) It is the largest division of the Earth’s surface filled with living organisms.
d) It is a large community of living and non-living organisms in a particular area.

3. Place options in a logical order (e.g., alphabetical, from shortest to longest).


Faulty: Which experimental gas law describes how the pressure of a gas tends
to increase as the volume of the container decreases? (i.e., “The absolute
pressure exerted by a given mass of an ideal gas is inversely proportional to the
volume it occupies.”)
a) Boyle’s Law c) Avogadro’s Law
b) Charles Law d) Faraday’s Law

Good: Which experimental gas law describes how the pressure of gas tends to
increase as the volume of the container decreases? (i.e., “The absolute pressure
exerted by a given mass of an ideal gas is inversely proportional to the volume
it occupies.”)
a) Avogadro’s Law c) Charles Law
b) Boyle’s Law d) Faraday’s Law
4. Place the correct response randomly to avoid a discernible pattern of correct
answers.
5. Use None of the above carefully and only when there is one absolutely correct
answer, such as in spelling or math items.
Faulty: Which of the following is a nonparametric statistic?
a) ANCOVA   b) ANOVA   c) Correlation   d) None of the Above
Good: Which of the following is a nonparametric statistic?
a) ANCOVA b) ANOVA c) Correlation d) t-test
6. Avoid All of the Above option, especially if it is intended to be the correct answer.
Faulty: Who among the following has become President of the Philippine Senate?
a) Ferdinand Marcos c) Quintin Paredes
b) Manuel Quezon d) All of the Above
Good: Who was the first ever President of the Philippine Senate?
a) Ferdinand Marcos c) Manuel Quezon
b) Quintin Paredes d) Manuel Roxas
7. Make all options realistic and reasonable.

The General Guidelines in Writing Matching-Type Items


The matching test item format requires learners to match a word, sentence, or
phrase in one column (i.e., premise) to a corresponding word, sentence, or phrase in a
second column (i.e., response). It is most appropriate when you need to measure the
learners’ ability to identify the relationship or association between similar items.
The following are the general guidelines in writing good and effective matching-type
tests:
1. Clearly state in the directions the basis for matching the stimuli with the
responses.
Faulty: Directions: Match the following.
Good: Directions: Column I is a list of countries while Column II presents the
continent where these countries are located. Write the letter of the continent
corresponding to the country on the line provided in Column I.
Item #1’s instruction is less preferred as it does not detail the basis for matching
the stem and the response options.

2. Ensure that the stimuli are longer and the responses are shorter.
Faulty: Match the description of the flag to its country.
A                         B
_____ Bangladesh     a) Green background with red circle in the center.
_____ Indonesia      b) One red strip on top and white strip at the bottom.
_____ Japan          c) Red background with white five-petal flower in the center.
_____ Singapore      d) Red background with large yellow in the center.
_____ Thailand       e) Red background with large yellow pointed star in the center.
                     f) White background with large red circle in the center.

Good: Match the description of the flag to its country.
A                                                                  B
_____ 1. Green background with red circle in the center.          a) Bangladesh
_____ 2. One red strip on top and white strip at the bottom.      b) Hong Kong
_____ 3. Red background with white five-petal flower in the center. c) Indonesia
_____ 4. Red background with large yellow pointed star in the center. d) Japan
_____ 5. White background with large red circle in the center.    e) Singapore
                                                                   f) Vietnam
Item #2 is a better version because the descriptions are presented in the first
column while the response options are in the second column. The stems are also
longer than the options.
3. For each item, include only topics that are related to one another and share the
same foundation of information.
Faulty: Match the following:
A B
_____ 1. Indonesia a) Asia
_____ 2. Malaysia b) Bangkok
_____ 3. Philippines c) Jakarta
_____ 4. Thailand d) Kuala Lumpur
_____ 5. Year ASEAN was established e) Manila
f) 1967
Good: On the line to the left of each country in Column I, write the letter of the
country’s capital presented in Column II.
A B
_____ 1. Indonesia a) Bandar Seri Begawan
_____ 2. Malaysia b) Bangkok
_____ 3. Philippines c) Jakarta
_____ 4. Thailand d) Kuala Lumpur
e) Manila

Item #1 is considered an unacceptable item because its response options are not
parallel and include different kinds of information that can provide clues to the
correct/wrong answers. On the other hand, Item #2 details the basis for matching,
and the response options only include related concepts.

The General Guidelines in Writing True or False Items

True or false items are used to measure learners’ ability to identify whether a
statement or proposition is correct/true or incorrect/false. They are best used when
learners’ ability to judge or evaluate is one of the desired learning outcomes of the
course.
There are different variations of the true or false items. These include the following:
1. T – F Correction or Modified True-or-False Question. In this format, the
statement is presented with a key word or phrase that is underlined, and the
learner has to supply the correct word or phrase.
e.g., Multiple-Choice is authentic.
2. Yes – No Variation. In this format, the learner has to choose yes or no, rather
than true or false. e.g., The following are kinds of tests. Circle Yes if it is an
authentic test and No if not.
Multiple Choice test Yes No
Debates Yes No
End-of-the-Term Project Yes No
True or False Test Yes No
3. A – B Variation. In this format, the learner has to choose A or B, rather than
true or false. e.g., Indicate which of the following are traditional or authentic
tests by circling A if it is a traditional test and B if it is authentic.
Traditional Authentic
Multiple Choice test A B

Debates A B
End-of-the-Term Project A B
True or False Test A B

The General Guidelines in Writing Short Answer Test Items


A short answer test item requires the learner to answer a question or to finish an
incomplete statement by filling in the blank with the correct word or phrase.
The following are the general guidelines in writing good fill-in-the-blank or
completion test items.
1. Omit only significant words from the statement.
Faulty: Every atom has a central _______ called a nucleus.
Good: Every atom has a central core called a(n) _________.
In Item #1, the word “core” is not the significant word. The item is also prone to
many and varied interpretations, resulting in many possible answers.
2. Do not omit too many words from the statement such that the intended meaning
is lost.
Faulty: ______ is to Spain as _______ is to the United States and as _______ is
to Germany.
Good: Madrid is to Spain as ________ is to France.
Item #1 is prone to many and varied answers. Item #2 is preferred because it is
more specific and requires only one correct answer.
3. Avoid obvious clues to the correct response.
Faulty: Ferdinand Marcos declared martial law in 1972. Who was the president
during that period?
Good: The president during the martial law years was ________.
Item #1 already gives a clue that Ferdinand Marcos was the president during this
time because only the president can declare martial law.
4. Be sure that there is only one correct response.
Faulty: The government should start using renewable energy sources for
generating electricity,
such as ____________.
Good: The government should start using renewable sources of energy by using
turbines called
____________.
Item #1 has many possible answers because the statement is very general (e.g.,
wind, solar, biomass, geothermal, and hydroelectric). Item # 2 is more specific
and only requires one correct answer (i.e., wind).
5. Avoid grammatical clues to the correct response.
Faulty: A subatomic particle with a negative electric charge is called an
________.
Good: A subatomic particle with a negative electric charge is called a(n)
________.
The word an in Item #1 provides a clue that the correct answer starts with a vowel.
6. If possible, put the blank at the end of a statement rather than at the beginning.
Faulty: _________ is the basic building block of matter.
Good: The basic building block of matter is _____________.
In Item #1, learners may need to read the sentence until the end before they can
recognize the problem, and then re-read it before answering the question.
On the other hand, in Item #2, learners can already identify the context of the
problem by reading through the sentence only once and without having to go
back and re-read the sentence.

The General Guidelines in Writing Essay tests


Teachers generally choose and employ essay tests over other forms of assessment
because essay tests require learners to create a response rather than to simply select a
response from among alternatives.


They are the preferred form of assessment when teachers want to measure learners’
higher-order thinking skills, particularly their ability to reason, analyse, synthesize, and
evaluate.

The following are the guidelines in constructing good essay questions.


1. Clearly define the intended learning outcome to be assessed by the essay test.
2. Refrain from using essay test for intended learning outcomes that are better
assessed by other kinds of assessment.
3. Clearly define and situate the task within a problem situation as well as the type
of thinking required to answer the test.
4. Present tasks that are fair, reasonable, and realistic to the students.
5. Be specific in the prompts about the time allotment and criteria for grading the
response.

The General Guidelines in Problem–Solving Test Items

Problem-solving test items are used to measure learners’ ability to solve problems
that require quantitative knowledge and competencies and/or critical thinking skills.
These items present a problem situation or task that requires learners to demonstrate
work procedures or come up with a correct solution.
There are different variations of the quantitative problem-solving items. These
include the following:
1. One Answer Choice. This type of question contains four or five options, and
students are required to choose the best answer.
Example: What is the mean of the following score distribution: 32, 44, 56, 69,
75, 77, 95, 96?
a) 68 b) 69 c) 72 d) 74
The correct answer is a) 68.
2. All Possible Answer Choices. This type of question has four or five options,
and students are required to choose all of the options that are correct.
Example: Consider the following score distribution: 12, 14, 14, 14, 17, 24,
27, 28, 30. Which of the following is/are the correct measure(s) of central
tendency? Indicate all possible answers.
a) Mean = 20 c) Median = 17
b) Mean = 22 d) Mode = 14
Correct answers are options a, c, and d.

3. Type-In Answer. This type of question does not provide options to choose
from. Instead, the learners are asked to supply the correct answer. The teacher
should inform the learners at the start how their answers will be rated. The
teacher may require just the correct answer or may require learners to present
the step-by-step procedures in coming up with their answers. On the other hand,
for non-mathematical problem solving, such as a case study, the teacher may
present a rubric on how their answers will be rated.
Example: Compute the mean of the following score distribution: 32, 44, 56,
69, 75, 77, 95, 96. Indicate your answer in the blank provided.

In this case, the learners will only need to give the correct answer without having
to show the procedures for computation.

The following are some of the general guidelines in constructing problem-solving
test items:

1. Identify and explain the problem clearly.

Faulty: Marcela was 135.6 lbs when she started with her zumba/aerobics
exercises. After three months of attending the sessions three times a week, her
weight was down to 122.8 lbs. About how many lbs did she lose after three
months? Write your final answer in the space provided.

Good: Marcela was 135.6 lbs when she started with her zumba/aerobics
exercises. After three months of attending the sessions three times a week, her
weight was down to 122.8 lbs. How many lbs did she lose after three months?
Write your final answer in the space provided and show your computations.
Write the exact weight; do not round off.

Item #1 asks “about how many” and does not indicate whether learners need to
give the exact weight or whether they need to round off their answer and to
what extent.

2. Be specific and clear about the type of response required from the students.

Faulty: ASEANA Bottlers, Inc. has been producing and selling Tutti Fruity juice in
the Philippines, aside from their Singapore market. The sales for the juice in the
Singapore market were S$5 million more than those of their Philippine market in
2016, S$3 million more in 2017, and S$4.5 million more in 2018. If the sales in
the Philippine market in 2018 were P35 million, what were the sales in the
Singapore market during that year?

This is a faulty question because it does not specify in what currency the answer
should be presented.

Good: ASEANA Bottlers, Inc. has been producing and selling Tutti Fruity juice in
the Philippines, aside from their Singapore market. The sales for the juice in the
Singapore market were S$5 million more than those of their Philippine market in
2016, S$3 million more in 2017, and S$4.5 million more in 2018. If the sales in
the Philippine market in 2018 were P35 million, what were the sales in the
Singapore market during that year? Provide the answer in Singapore dollars
(S$1 = P36.50).

This is a better item because it specifies in what currency the answer should be
presented, and the exchange rate is given.

MODULE 3:
ADMINISTRATION OF TESTS AND
ORGANIZATION OF TEST RESULTS
Module 3 consists of two (2) lessons attainable for coverage of Midterm as follows:

Lesson 6 : Establishing Test Validity and Reliability


Lesson 7 : Organization of Test Data Using Tables and Graphs

Each lesson contains theory, questions and activity. Quiz is provided for you to
answer as required.

LESSON 6 : ESTABLISHING TEST VALIDITY AND RELIABILITY
THINK ABOUT THESE EXPECTATIONS:

4. Use Procedures and Statistical Analysis to Establish Test Validity and Reliability.
5. Decide whether a Test is Valid or Reliable.
6. Decide which Test Items are Easy and Difficult.

Opening

In order to establish the validity and reliability of an assessment tool, you need to
know the different ways of establishing test validity and reliability.

Test Reliability

Reliability is the consistency of the responses to measure under three conditions:


1. when retested on the same persons;
2. when retested on the same measure; and,
3. similarity of responses across items that measure the same characteristic.

In the first condition, consistent responses are expected when the test is given to the
same participants. In the second condition, reliability is attained if the responses to the
test are consistent with those on the same test or its equivalent form. In the third
condition, there is reliability when the person responds in the same way or consistently
across items that measure the same characteristic.

Methods in Testing Reliability

1. Test-Retest Method. You have a test, and you need to administer it at one time
to a group of examinees. Administer it again at another time to the “same group”
of examinees. There is a time interval of not more than 6 months between the
first and second administration of tests that measure stable characteristics, such
as standardized aptitude tests. The post test can be given with a minimum time
interval of 30 minutes.
Test-retest is applicable for tests that measure stable variables, such as aptitude
and psychomotor measure (e.g., typing test, task in physical education).

Correlate the test scores from the first and second administration. A significant and
positive correlation indicates that the test has temporal stability. Correlation refers to
a statistical procedure in which a linear relationship is expected between two
variables. You may use the Pearson Product Moment Correlation or Pearson r
because test data are usually in an interval scale.

2. Parallel Forms Method. There are two versions of a test. The items need to
exactly measure the same skill. Each test version is called a “form.” Administer
one form at one time and the other form to another time to the “same” group of
participants. The responses on the two forms should be more or less the same.

Parallel Forms are applicable if there are two versions of the test. This is usually
done when the test is repeatedly used for different groups, such as entrance
examinations and licensure examinations. Different versions of the test are given
to a different group of examinees.

Correlate the test results of the first form and the second form. A significant and
positive correlation coefficient is expected. The significant and positive correlation
indicates that scores on the two forms are the same or consistent. Pearson r is
usually used for this analysis.

3. Split-Half Method. Administer a test to a group of examinees. The items need


to be split into halves, usually using the odd-even technique. In this technique,
get the sum of the points in the odd numbered items and correlate it with the sum
of points of the even numbered items. Each examinee will have two scores
coming from the same test. The scores on each set should be close or
consistent.

Split-Half is applicable when the test has a large number of items.

Correlate the two sets of scores using Pearson r. After the correlation, use
another formula called Spearman Brown Coefficient. The correlation coefficient
obtained using Pearson r and Spearman Brown should be significant and
positive to mean that the test has internal consistency reliability.

4. Test of Internal Consistency Using Kuder-Richardson and Cronbach’s


Alpha Method. This procedure involves determining if the scores for each item
are consistently answered by the examinees. After administering the test to a
group of examinees , it is necessary to determine and record the scores for each
item. The idea here is to see if the responses per item are consistent with each
other.

This technique will work well when the assessment tool has a large number of
items. It is also applicable for scales and inventories (e.g., a Likert scale from
“strongly agree” to “strongly disagree”).

A statistical analysis called Cronbach’s Alpha or the Kuder Richardson is


used to determine the internal consistency of the items. A Cronbach’s Alpha
value of 0.60 and above indicates that the test items have internal consistency.

5. Inter-Rater Reliability Method. This procedure is used to determine the


consistency of multiple raters when using rating scales and rubrics to judge
performance. The reliability here refers to the similar or consistent ratings
provided by more than one rater or judge when they use an assessment tool.
Inter-Rater is applicable when the assessment requires the use of multiple
raters.

A statistical analysis called Kendall’s Tau Coefficient of Concordance is used


to determine if the ratings provided by multiple raters agree with each other.
Significant Kendall’s Tau value indicates that the raters concur or agree with
each other in their ratings.

You will notice that statistical analysis is required to determine the reliability of a
measure. The very basis of statistical analysis to determine reliability is the use of linear
regression.

1. Linear Regression.

Linear regression is demonstrated when you have two variables that are
measured, such as two sets of scores in a test taken at two different times by
the same participants. When the two scores are plotted in a graph (with x-axis
and y-axis), they tend to form a straight line. The straight line formed by the
two sets of scores can produce a linear regression. When a straight line is
formed, you can say that there is a correlation between the two sets of scores.
This can be seen in the graph shown. The graph is called a scatterplot. Each
point in the scatterplot is a respondent with two scores (one for each test).
Given points: P (2, 2), M (4, 6), and Q (10, 8).

[Scatterplot of the points P (2, 2), M (4, 6), and Q (10, 8) plotted on the x- and y-axes.]

2. Computation of Pearson r Correlation


The index of the linear relationship is called a correlation coefficient. When
the points in a scatterplot tend to fall along a straight line, the correlation is said
to be strong. When the direction of the scatterplot is directly proportional, the
correlation coefficient will have a positive value. If the relationship is inverse, the
correlation will have a negative value. The statistical analysis used to determine
the correlation coefficient is called the Pearson r. How the Pearson r is obtained
is illustrated in the example below.

Example. Suppose that a teacher gave a 20-item spelling test of two-syllable
words on Monday and on Tuesday. The teacher wanted to determine the
reliability of the two sets of scores by computing for the Pearson r.

Formula: r = [N(Σxy) − (Σx)(Σy)] / √{[N(Σx²) − (Σx)²][N(Σy²) − (Σy)²]}

Where: r = Pearson r
x = first variable; y = second variable
Σxy = summation of x times y
(Σx)(Σy) = summation of x times summation of y
Σx² = sum of squares of the first variable; Σy² = sum of squares of the second variable
(Σx)² = square of the summation of x; (Σy)² = square of the summation of y

x (Monday Test)   y (Tuesday Test)   x²   y²   xy
10 20 100 400 200
9 15 81 225 135
6 12 36 144 72
10 18 100 324 180
12 19 144 361 228
4 8 16 64 32
5 7 25 49 35
7 10 49 100 70
16 17 256 289 272
8 13 64 169 104
Σx = 87   Σy = 139   Σx² = 871   Σy² = 2125   Σxy = 1328

Substitute:
r = [N(Σxy) − (Σx)(Σy)] / √{[N(Σx²) − (Σx)²][N(Σy²) − (Σy)²]}
r = [10(1328) − 87(139)] / √{[10(871) − (87)²][10(2125) − (139)²]}
r = (13280 − 12093) / √{[8710 − 7569][21250 − 19321]}
r = 1187 / √{[1141][1929]}
r = 1187 / √2200989
r = 1187 / 1483.573052
r = 0.80

The value of a correlation coefficient does not exceed 1.00 or – 1.00. A value of
1.00 and – 1.00 indicates perfect correlation.
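
For readers who want to check the hand computation above, the following is a minimal Python sketch (an illustration added here, not part of the module) that reproduces the Pearson r for the Monday and Tuesday spelling scores.

# Minimal sketch: Pearson r for the Monday/Tuesday worked example.
from math import sqrt

x = [10, 9, 6, 10, 12, 4, 5, 7, 16, 8]      # Monday test scores
y = [20, 15, 12, 18, 19, 8, 7, 10, 17, 13]  # Tuesday test scores
n = len(x)

sx, sy = sum(x), sum(y)
sxx = sum(v * v for v in x)
syy = sum(v * v for v in y)
sxy = sum(a * b for a, b in zip(x, y))

r = (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
print(round(r, 2))   # 0.8, matching the worked example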
3. Difference Between a Positive and a Negative Correlation
When the value of the correlation is positive, it means that the higher the scores
in x, the higher the scores in y. This is called a positive correlation.
When the value of the correlation is negative, it means that the higher the
scores in x, the lower the scores in y. This is called a negative correlation.
4. Determining the Strength of a Correlation
The strength of the correlation also indicates the strength of the reliability of
the test. This is indicated by the value of the correlation coefficient. The closer
the value to 1.00 or - 1.00, the stronger is the correlation. Below is the guide:
± 1.00 - Perfect (±) correlation
±0.91 - ± 0.99 - Very strong relationship
±0.71 - ± 0.90 - Strong relationship
±0.41 - ± 0.70 - Moderately strong relationship
±0.21 - ± 0.40 - Low relationship
±0.01 - ± 0.20 - Negligible relationship
5. Determining the Significance of the Correlation
The correlation obtained between two variables could be due to chance. In
order to determine whether the correlation is not simply due to chance, it is tested for
significance. When a correlation is significant, it means that the observed relationship
between the two variables is unlikely to have occurred by chance.
In order to determine if a correlation coefficient value is significant, it is
compared with an expected probability of correlation coefficient values called a
critical value. When the computed value is greater than the critical value, the
correlation is considered significant at the 95% confidence level.
Another statistical analysis mentioned to determine the internal consistency
of a test is Cronbach’s alpha. Follow the procedure below to determine the internal
consistency.

Illustration:

Suppose that five students answered a checklist about their hygiene with a scale
of 1 to 5, where the following are the corresponding scores:

5 - always, 4 - often, 3 - sometimes, 2 - rarely, 1 - never

The checklist has five items. The teacher wanted to determine if the
items have internal consistency.

Student   Item 1   Item 2   Item 3   Item 4   Item 5   Total for each case (x)   Score − Mean   (Score − Mean)²
A         5        5        4        4        1        19                        2.8            7.84
B         3        4        3        3        2        15                       −1.2            1.44
C         2        5        3        3        3        16                       −0.2            0.04
D         1        4        2        3        3        13                       −3.2           10.24
E         3        3        4        4        4        18                        1.8            3.24
Total (Σx) 14      21       16       17       13       Mean = 16.2               Σ(S − M)² = 22.8
Σx²       48       91       54       59       39
SD²       2.2      0.7      0.7      0.3      1.3      ΣSD² = 5.2

σ² = Σ(S − M)² / (n − 1) = 22.8 / (5 − 1) = 22.8 / 4 = 5.7
Formula: Cronbach’s α = [n / (n − 1)] × [(σ² − ΣSD²) / σ²]

Supporting solution for finding SD² (illustrated for Item 1):
Formula: CF = (Σx)² / N;  SD² = (Σx² − CF) / (N − 1)
a) CF = (14)² / 5 = 196 / 5 = 39.2
b) SD² = (48 − 39.2) / (5 − 1) = 8.8 / 4 = 2.2

Substitute:
Cronbach’s α = [5 / (5 − 1)] × [(5.7 − 5.2) / 5.7]
Cronbach’s α = (5 / 4) × (0.5 / 5.7)
Cronbach’s α = 2.5 / 22.8 = 0.109649 or 0.11
The internal consistency of the responses to the hygiene checklist is 0.11,
indicating negligible internal consistency.
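
The hand computation above can also be automated. Below is a minimal Python sketch (an assumption about how one might script it, not the module's procedure) that reproduces the Cronbach's alpha of 0.11 for the hygiene checklist data.

# Minimal sketch: Cronbach's alpha for the five-item, five-student checklist.
def cronbach_alpha(rows):
    """rows: one list of item scores per student."""
    n_items = len(rows[0])
    totals = [sum(r) for r in rows]

    def sample_var(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / (len(values) - 1)

    item_vars = [sample_var([r[i] for r in rows]) for i in range(n_items)]
    total_var = sample_var(totals)
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)

data = [
    [5, 5, 4, 4, 1],   # student A
    [3, 4, 3, 3, 2],   # student B
    [2, 5, 3, 3, 3],   # student C
    [1, 4, 2, 3, 3],   # student D
    [3, 3, 4, 4, 4],   # student E
]
print(round(cronbach_alpha(data), 2))   # 0.11, matching the illustration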

The consistency of ratings can also be obtained using a coefficient of
concordance. The Kendall’s ω coefficient of concordance is used to test
the agreement among raters.

Below is a performance task demonstrated by five students and rated by
three raters. The rubric used a scale of 1 to 4, wherein 4 is the highest
and 1 is the lowest.

Five Demonstrations   Rater 1   Rater 2   Rater 3   Sum of Ratings   D = R − Mean   D²
A                     4         4         3         11                2.6            6.76
B                     3         2         3         8                −0.4            0.16
C                     3         4         4         11                2.6            6.76
D                     3         3         2         8                −0.4            0.16
E                     1         1         2         4                −4.4           19.36
Total                                               42 (Mean = 8.4)                  ΣD² = 33.2

Formula: Kendall’s ω = 12ΣD² / [m²(N)(N² − 1)]

Where: ω = Kendall’s coefficient of concordance
m = number of raters
N = number of observations
ΣD² = summation of the squared differences
Mean = mean of the rating sums

Substitute: Kendall’s ω = 12(33.2) / [3²(5)(5² − 1)]
Kendall’s ω = 398.4 / [9(5)(24)]
Kendall’s ω = 398.4 / 1080
Kendall’s ω = 0.3688888 or 0.37

A Kendall’s ω coefficient value of 0.37 indicates the degree of agreement among
the three raters across the five demonstrations. There is weak concordance among
the three raters because the value is far from 1.00.
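
As a quick check, here is a minimal Python sketch (an added illustration, not part of the module) that applies the same formula, W = 12ΣD² / [m²N(N² − 1)], to the three raters' scores for the five demonstrations.

# Minimal sketch: Kendall's coefficient of concordance for the rating data above.
ratings = [            # rows = demonstrations A..E, columns = raters 1..3
    [4, 4, 3],
    [3, 2, 3],
    [3, 4, 4],
    [3, 3, 2],
    [1, 1, 2],
]
m = len(ratings[0])                  # number of raters
N = len(ratings)                     # number of demonstrations
sums = [sum(row) for row in ratings]
mean_sum = sum(sums) / N
ssd = sum((s - mean_sum) ** 2 for s in sums)   # ΣD²

W = (12 * ssd) / (m ** 2 * N * (N ** 2 - 1))
print(round(W, 2))   # 0.37, matching the worked example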

Validity
A measure is valid when it measures what it is supposed to measure. If a quarterly
examination is valid, then the contents should directly measure the objectives of the
curriculum. If a scale that measures personality is composed of five factors, then the
scores on the five factors should have items that are highly correlated. If an entrance
examination is valid, it should predict students’ grades after the first semester.

Different Ways to Establish Test Validity

There are different ways to establish test validity:


1. Content Validity. When the items represent the domain being measured. The
items are compared with the objectives of the program. The items need to
measure directly the objectives (for achievement) or definition (for scales). A
reviewer conducts the checking.

A coordinator in science is checking the science test paper for grade 4. She
asked the grade 4 science teacher to submit the table of specifications containing
the objectives of the lesson and the corresponding items. The coordinator
checked whether each item is aligned with the objectives.

2. Face Validity. When the test is presented well, free of errors and administered
well. The test items and layout are reviewed and tried out on a small group of
respondents. A manual for administration can be made as a guide for the test
administrator.

The assistant principal browsed the test paper made by the math teacher. She
checked if the contents of the items are about mathematics. She examined if
instructions are clear. She browsed through the items if the grammar is correct
and if the vocabulary is within the students’ level of understanding.

3. Predictive Validity. A measure should predict a future criterion. Example is an


entrance exam predicting the grades of the students after the first semester. A
correlation coefficient is obtained where the x-variable is used as the predictor
and the y-variable as the criterion.

The school admission’s office developed an entrance examination. The officials


wanted to determine if the results of the entrance examination are accurate in
identifying good students. They took the grades of the students accepted for the
first quarter. They correlated the entrance exam results and the first quarter
grades. They found significant and positive correlations between the entrance
examination scores and grades. The entrance examination results predicted the
grades of students after the first quarter. Thus, there was predictive validity.

4. Construct Validity. The components or factors of the test should contain items
that are strongly correlated. The Pearson r can be used to correlate the items
for each factor. However, there is a technique called factor analysis to determine
which items are highly correlated to form a factor.

A science test was made by a grade 10 teacher composed of four domains:


matter, living things, force, and earth and space. There are 10 items under each
domain. The teacher wanted to determine if the 10 items made under each
domain really belonged to that domain. The teacher consulted an expert in test
measurement. They conducted a procedure called factor analysis. Factor
analysis is a statistical procedure done to determine if the items written will load
under the domain they belong to.

5. Concurrent Validity. When two or more measures are present for each
examinee that measure the same characteristic. The scores on the measures
should be correlated.

A School guidance counsellor administered a math achievement test to grade 6


students. She also has a copy of the students’ grade in math. She wanted to
verify if the math grades of the students are measuring the same competencies
as the math achievement test. The school counsellor correlated the math
achievement scores and math grades to determine if they are measuring the
same competencies.
6. Convergent Validity. When the components or factors of a test are
hypothesized to have a positive correlation. Correlation is done for the factors of
the test.

A math teacher developed a test to be administered at the end of the school


year, which measures number sense, patterns and algebra, measurement,
geometry, and statistics. After administering the test, the scores were separated
for each area, and these five domains were intercorrelated using Pearson r. The
positive correlation between number sense and patterns and algebra indicates
that, when number sense scores increase, the patterns and algebra scores also
increase. This shows that students’ learning of number sense scaffolds their
patterns and algebra competencies.


7. Divergent Validity. When the components or factors of a test are hypothesized


to have a negative correlation. An example is to correlate the scores in a test
on intrinsic and extrinsic motivation.

An English teacher taught metacognitive awareness strategy to comprehend a


paragraph for grade 11 students. She wanted to determine if the performance of
her students in reading comprehension would reflect well in the reading
comprehension test. She administered the same reading comprehension test to
another class which was not taught the metacognitive awareness strategy. She
compared the results using a t-test for independent samples and found that the
class that was taught metacognitive awareness strategy performed significantly
better than the other group. The test has divergent validity.

How to Determine if an Item is Easy or Difficult (Item Analysis)

An item is difficult if the majority of students are unable to provide the correct answer.
The item is easy if the majority of the students are able to answer correctly. An item can
discriminate if the examinees who score high in the test can answer more items
correctly than examinees who got low scores.

The Item Analysis Procedure is as follows:

Step 1. Arrange the test papers from highest to lowest score.

Step 2. Select 27% of the papers from the lower group and 27% from the upper group.
 For smaller classes such as a group of only 20 students, you may just divide
it in half with 10 test papers (students) belonging to the lower group and 10
test papers (students) belonging in the upper group.
 In the example (40 high school students), 27% would be 10.8 or 11. You
are going to get the bottom 11 test papers (lower group) and the upper 11 test
papers (upper group) and set aside the middle 18 test papers.

Step 3. Tabulate the number of students in both the upper and lower groups who
selected each alternative.

Example: A tabulation of the number of students who selected each alternative for the
first five items of the given test is shown in Table 6.1.

Table 6.1. Sample Tabulation of Students’ Responses
Item No.   Group (upper/lower 27%)   a    b    c    d    No. of Students who got the item right   Total
1          Upper                     0    0    1   10    10                                       11
           Lower                     1    0    1    9     9                                       11
2          Upper                     8    1    1    1     8                                       11
           Lower                     4    2    2    3     4                                       11
3          Upper                     8    1    2    0     8                                       11
           Lower                     5    2    3    1     5                                       11
4          Upper                     1    0    0   10    10                                       11
           Lower                     0    1    0   10    10                                       11
5          Upper                     3    2    1    5     5                                       11
           Lower                     5    4    2    0     0                                       11

A. Determining the Difficulty Index

Compute for the difficulty index of each item using the formula below:

Item Difficulty = R / T

Where:
R = Number of students who got the item right from both groups.
T = Total number of students from both groups.

Example: Compute for the difficulty index of the first five test items given earlier.

Table 6.2. Difficulty Index of the Sample Test Items
Item No.   No. of Students who Got the Item Correct (From both Groups)   Difficulty Index   Verbal Interpretation   Decision
1          19                                                             0.86               Very Easy               Reject/Revise
2          12                                                             0.55               Ideal Difficulty        Retain
3          13                                                             0.59               Ideal Difficulty        Retain
4          20                                                             0.91               Very Easy               Reject/Revise
5          5                                                              0.23               Difficult               Retain

Formula: Item Difficulty = R / T

Solutions:
Item No. 1: D = 19 / 22 = 0.86
Item No. 2: D = 12 / 22 = 0.55
Item No. 3: D = 13 / 22 = 0.59
Item No. 4: D = 20 / 22 = 0.91
Item No. 5: D = 5 / 22 = 0.23

Guide in Interpreting the Computed Difficulty Index

1.00 - 0.81 - Very Easy
0.80 - 0.61 - Easy
0.60 - 0.41 - Ideal Difficulty
0.40 - 0.21 - Difficult
0.20 - 0.00 - Very Difficult
B. Determining the Discrimination Index
Compute for the discrimination index of each item using the formula below:

Discrimination Index = (RU − RL) / (1/2 T)

Where:
RU = Number of students in the upper group who answered the item correctly.
RL = Number of students in the lower group who answered the item correctly.
1/2 T = One half of the total number of students included in the analysis, which is
also equal to the number of students in one of the two groups (lower and upper group).

Example: Compute for the discrimination index of the first five test items
given earlier.


Table 6.3. Discrimination Index of the Sample Test Items
Item No.   Upper Group   Lower Group   Difficulty Index   Discrimination Index   Verbal Interpretation (Discriminating Index)   Decision
1          10            9             0.86               0.09                   Poor                                           Reject/Revise
2          8             4             0.55               0.36                   Good                                           Retain
3          8             5             0.59               0.27                   Moderate                                       Retain
4          10            10            0.91               0                      Poor                                           Reject/Revise
5          5             0             0.23               0.45                   High                                           Retain

Formula: Discrimination Index = (RU − RL) / (1/2 T)

Solutions:
Item No. 1: Disc = (10 − 9) / 11 = 1 / 11 = 0.09
Item No. 2: Disc = (8 − 4) / 11 = 4 / 11 = 0.36
Item No. 3: Disc = (8 − 5) / 11 = 3 / 11 = 0.27
Item No. 4: Disc = (10 − 10) / 11 = 0 / 11 = 0
Item No. 5: Disc = (5 − 0) / 11 = 5 / 11 = 0.45
Guide in Interpreting the Computed Discrimination Index

Below 0.20 - Poor Discriminating Index
0.21 - 0.30 - Moderate Discriminating Index
0.31 - 0.40 - Good Discriminating Index
0.41 - 1.00 - High Discriminating Index

In extreme cases, a negative value for the discriminating index might occur. This
would mean that there are more students in the lower group who got the item correctly
compared to the upper group. This could mean that the item is questionable and there
might be a high degree of ambiguity in the test item. Remember, however, that these are
assumptions or guesses as to the reasons why it occurred. The data from item analysis
tell us only which specific items are poorly functioning; they do not tell us the reasons
or causes of the poor performance.
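
The two indices above are easy to script. Here is a minimal Python sketch (an added illustration, not part of the module) that reproduces the difficulty and discrimination indices in Tables 6.2 and 6.3 from the upper- and lower-group counts in Table 6.1.

# Minimal sketch: difficulty and discrimination indices for the five sample items.
upper_correct = [10, 8, 8, 10, 5]   # upper 27% who answered each item correctly
lower_correct = [9, 4, 5, 10, 0]    # lower 27% who answered each item correctly
group_size = 11                     # examinees per group

for i, (ru, rl) in enumerate(zip(upper_correct, lower_correct), start=1):
    difficulty = (ru + rl) / (2 * group_size)    # R / T
    discrimination = (ru - rl) / group_size      # (RU - RL) / (1/2 T)
    print(f"Item {i}: difficulty = {difficulty:.2f}, "
          f"discrimination = {discrimination:.2f}")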

LESSON 7 :
ORGANIZATION OF TEST DATA USING TABLES AND
GRAPHS
THINK ABOUT THESE EXPECTATIONS:

1. Organize data using tables and graphs.


2. Interpret frequency distribution of test data.

Opening

The appropriate statistical tools and procedures to apply for the results of testing are
as follows:
For traditional assessment, the common statistical tools to assess the scores are
measures of central tendency, point measures, and measures of variability.
For authentic assessment, particularly performance tests, the same statistical
tools (measures of central tendency, point measures, and measures of variability)
are still applied.
For rubric assessment, the weighted arithmetic mean is used.
For investigatory projects, the mean, t-test (bivariate experimental design), z-test
(bivariate descriptive design), F-test or ANOVA (analysis of variance), and many
others are usually employed.
The scores collected from assessments are arranged in a methodical order by
grouping them in classes in a form of frequency distribution. This Lesson 7 of Module 3
presents the frequency distributions, tallying the scores, and graphical representation
like bar graph, line graph, pictograph, and circle graph.

Frequency Distribution
Frequency distribution is applicable when the number of cases (N) is 30 or more.
Table 7.1 presents the scores of 50 teacher education students in a 110-item test in
Assessment of Learning 2 in a certain State University in Metro Manila.

Table 7.1. Scores of 50 Teacher Education Students in a 110-item Test in


Assessment
of Learning 2 in a certain State University in Metro Manila (Artificial Data)

50 97 96 95 48 55 58 59 51 53

85 80 83 77 70 60 62 63 64 65
90 91 92 93 90 83 82 66 67 68

98 70 71 72 73 74 75 76 77 69

98 71 72 73 75 78 79 84 86 87

Generally speaking, frequency distribution is any arrangement of data that shows


the frequency of occurrence of different values of a variable, or the frequency of occurrence of
values falling within arbitrarily defined ranges of the variable known as class intervals.

In arranging the scores in a form of frequency distribution, the steps are as follows:
Step 1. Find the absolute range. The range is obtained by subtracting the lowest score
(LS) from the highest score (HS).
R = HS - LS; R = 98 - 48; R = 50
Step 2. Find the class interval. In finding the class interval, divide the range by 10 and
by 20 such that the class limits are not less than 10 and not more than 20, provided
that the classes cover the total number of scores.

Step 2 can be modified in finding the class interval by the use of Sturges’ formula to
obtain a common result, as follows:

Formula: k = 1 + 3.32 log N;  c = R / k

Where: c = class interval
R = range
k = definite divisor

Computation:
a) Solve for k:
k = 1 + 3.32 (log 50)     (use the logarithm function of your calculator or cellphone)
k = 1 + 3.32 (1.69897)
k = 1 + 5.64058
k = 6.64

b) Solve for c:
c = R / k
c = 50 / 6.64; c = 7.53 or c = 8 (rounded off)

Step 3. Set up the classes. Look for a multiple of c whose product is less than or
equal to the lowest
score.
8 x 6 = 48

Step 4. Choose the starting lower class limit. The multiple obtained (48) serves as the
lower limit of the first class. To get the upper limit, add c − 1 (that is, 7) to the
lower limit.

48 - 55   (lower limit - upper limit)

Step 5. List down the class limits or class interval and tally the score for each class
interval. The
procedure is starting from the lower class limit in a vertical column going
upward.

Table 7.2. Frequency Distribution of Scores in Assessment of Learning 2


Test Taken by 50 Teachers Education Students in a
State University in Metro Manila (Artificial Data)

Class Interval Tally Frequency (f)

96 - 103 IIII 4
88 - 95 IIII – I 6
80 - 87 IIII - III 8
72 - 79 IIII –IIII - II 12
64 - 71 IIII - IIII 10
56 - 63 IIII 5
48 - 55 IIII 5

Total N = 50

The tally must be carefully checked so that the sum of the frequencies equals the
number of cases. If an unequal tally occurs, tallying must be repeated and rechecked to
arrive at an exact tally and frequency. At the bottom of column 3, the symbol N or Σf is
written, which means the number of cases (N) or the sum of the frequencies (Σf) equals 50.
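
As an optional check, the five steps above can be carried out in a short Python sketch (an added illustration, not part of the module): compute the range, apply Sturges' formula for the class width, and tally the 50 scores of Table 7.1 into classes of width 8.

# Minimal sketch: building the frequency distribution of Table 7.2.
from math import log10

scores = [50, 97, 96, 95, 48, 55, 58, 59, 51, 53,
          85, 80, 83, 77, 70, 60, 62, 63, 64, 65,
          90, 91, 92, 93, 90, 83, 82, 66, 67, 68,
          98, 70, 71, 72, 73, 74, 75, 76, 77, 69,
          98, 71, 72, 73, 75, 78, 79, 84, 86, 87]

R = max(scores) - min(scores)          # absolute range = 50
k = 1 + 3.32 * log10(len(scores))      # Sturges' formula ≈ 6.64
c = round(R / k)                       # class width ≈ 8

lower = (min(scores) // c) * c         # multiple of c at or below the lowest score
for low in range(lower, max(scores) + 1, c):
    high = low + c - 1
    f = sum(low <= s <= high for s in scores)   # tally for this class interval
    print(f"{low} - {high}: {f}")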

Present Test Data Graphically

There is a saying, “A picture is worth a thousand words.” In a similar manner, “a


graph can be worth a hundred or a thousand numbers.” The use of tables may not be
enough to give a clear picture of the properties of a group of test scores. If numbers
presented in tables are transformed into visual models, then the reader becomes more
interested in reading the material.

There are many types of graphs, but the most common methods of graphing a
frequency distribution are the following:

1. Histogram. A histogram is a type of graph appropriate for quantitative data such


as test scores. This graph consists of columns – each has a base that represents
one class interval, and its height represents the number of observations or simply
the frequency in that class interval. There are statistical software that are
available to help construct histograms and other forms of graphs. Look at the
graph in Figure 7.1 below.

[Figure 7.1. Histogram of Test Scores of College Students]

2. Frequency Polygon. This is also used for quantitative data, and it is one of the
most commonly used methods in presenting test scores. It is the line graph of a
frequency distribution. It is very similar to a histogram, but instead of bars, it uses
lines to compare sets of test data in the same axes. Figure 7.2 illustrates a
frequency polygon.
[Figure 7.2. Frequency Polygon in a Reading Comprehension Test]


You can construct a frequency polygon manually using the histogram in Figure 7.1
by following these simple steps.
Step 1. Locate the midpoint on the top of each bar. Bear in mind that the height of
each bar represents the frequency in each class interval, and the width of the
bar is the class interval. As such, that point in the middle of each bar is
actually the midpoint of that class interval. In the histogram on Figure 7.1,
there are two spaces without bars. In such a case, the midpoint falls on the
line.
Step 2. Draw a line to connect all the midpoints in consecutive order.
Step 3. The line graph is an estimate of the frequency polygon of the test scores.
Following the above steps, you can draw a frequency polygon using the histogram
presented earlier in Figure 7.1.

[Figure 7.3. Frequency Polygon of Test Scores of College Students]

3. Bar Graph. This graph is often used to present frequencies in categories of a


qualitative variable. It looks similar to a histogram, constructed in the same
manner, but spaces are placed in between the consecutive bars. The columns
represent the categories and the height of each bar as in a histogram represents
the frequency. Figure 7.4 is shown below.

[Figure 7.4. Bar Graph of Test Scores of College Students]

4. Pie Graph. One commonly used method to represent categorical data is the use
of a circle graph. You have learned in basic mathematics that there are 360° in a
full circle. As such the categories can be represented by the slices of the circle
that appear like a pie, thus, the name pie graph. The size of the pie is
determined by the percentage of students who belong in each category.
Example. In a class of 100 students, results were categorized according to
different levels, as shown below.

Group            No. of Students   Percentage (%)   Percent Equivalent in the Circle
Above Average    10                10 %             0.10 x 360 = 36°
Average          40                40 %             0.40 x 360 = 144°
Below Average    30                30 %             0.30 x 360 = 108°
Poor             20                20 %             0.20 x 360 = 72°
Total            100               100 %            360°

[Figure 7.5. Pie Graph of Students’ Results According to Different Groups]

Skewness

Examine the graphs below.

[Figure 7.6. Symmetrical Distribution of Test Scores]
[Figure 7.7. Negatively Skewed Distribution]     [Figure 7.8. Positively Skewed Distribution]

Figure 7.6 is labeled as a normal distribution. Note that half the area of the curve is
a mirror reflection of the other half. It is a symmetrical distribution, which is also referred
to as a bell-shaped distribution. The higher frequencies are concentrated in the middle of
the distribution. A number of experiments have shown that IQ scores, height, and weight
of human beings follow a normal distribution.

The graphs in Figure 7.7 and Figure 7.8 are asymmetrical in shape. The degree of
asymmetry of a graph is its skewness. A basic principle of a coordinate system tells you
that, as you move toward the right of the x-axis, the numerical value increases.
Likewise, as you move up the y-axis, the scale value becomes higher. Thus, in a
negatively skewed distribution, there are more who get higher scores, and the tail,
indicating the lower frequencies of the distribution, points to the left or to the lower
scores. In a positively skewed distribution, lower scores are clustered on the left side.
This means that there are more who get lower scores, and the tail indicates that the
lower frequencies are on the right or toward the higher scores.

Kurtosis

Another way of differentiating frequency distributions is shown below. Consider now
the graphs of three frequency distributions, labeled x, y, and z, in Figure 7.9.

[Figure 7.9. Three frequency distributions (x, y, and z) of test scores.]

What is common among the three distributions?

What difference can you observe among the three distribution of test scores?

It is the flatness of the distribution, which is also the consequence of how high or
peaked the distribution is. This property is referred to as kurtosis.

x is the flattest distribution. It has a platykurtic (platy, meaning broad or flat)


distribution. y is the normal distribution and it is a mesokurtic (meso, meaning
intermediate) distribution. z is the steepest or slimmest, and is called leptokurtic (lepto,
meaning narrow) distribution.

What curve has more extreme scores than the normal distribution?

What curve has more scores that are far from the central value (or average) than
does the normal distribution?

For the meantime, the characteristics are simply described visually. The next lesson
will connect these visual characteristics to important statistical measures.

MODULE 4:
UTILIZATION AND COMMUNICATION
OF TEST RESULTS
Module 4 consists of two (2) lessons attainable for coverage of Final Term as
follows:

Lesson 8 : Analysis, Interpretation, and the Use of Test Data.


Lesson 9 : Grading and Reporting Test Results

Each lesson contains theory, questions and activity. Quiz is provided for you to
answer as required.

LESSON 8 :
ANALYSIS, INTERPRETATION, AND THE USE OF TEST DATA

THINK ABOUT THESE EXPECTATIONS:

7. Analyze, interpret, and use test data applying:


a) measures of central tendency;
b) measures of variability;
c) measures of position; and,
d) measures of co-variability.

Opening

The discussion in this lesson will build upon the concepts and examples presented
in Lesson 7, which focused on the tabular and graphical presentation and interpretation
of test results. In this lesson, other ways of summarizing test data using descriptive
statistics, which provides a more precise means of describing a set of scores, will be
discussed. The word “measures” is commonly associated with numerical and
quantitative data.

Measures of Central tendency

The term “measures of central tendency” refers to the central location or point of
convergence of a set of values. Test scores have a tendency to converge toward a central
value. This value is the average of the set of scores. In other words, a measure of
central tendency gives a single value that represents a given set of scores. Three
commonly used measures of central tendency or central location are the mean, median,
and the mode.

Mean. This is the most preferred measure of central tendency for use with test
scores, also referred to as the “arithmetic mean”. The computation is very simple.
That is, x = Σx / N
Where: x = the mean
Σx = the sum of all the scores
N = the number of scores in the set
Consider the test scores of 15 students given in Table 8.1.

Table 8.1. Scores of 15 College Students in a Final Examination

50 97 96 95 48 55 58 59

85 80 83 77 70 51 53
The given data are ungrouped; use the formula for finding the mean.

x = Σx / N = (50 + 97 + 96 + 95 + 48 + 55 + 58 + 59 + 85 + 80 + 83 + 77 + 70 + 51 + 53) / 15

x = 1057 / 15 = 70.466667 or 70.47
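
If you prefer to verify this quickly, the following two-line Python sketch (an added illustration, not part of the module) reproduces the mean of the 15 scores.

# Minimal sketch: mean of the ungrouped scores in Table 8.1.
from statistics import mean

scores = [50, 97, 96, 95, 48, 55, 58, 59, 85, 80, 83, 77, 70, 51, 53]
print(round(mean(scores), 2))   # 70.47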

There are many ways of computing the mean. The traditional long computation
techniques have outlived their relevance due to advancement of technology and the
emergence of statistical software. Using your scientific calculator, you will see the
symbols x, Σx. Just follow the steps indicated in the guide. There are also simple steps
in Excel.

While you recognize the power of technology, there is information that is


unappreciated because of the short-hand processing of data through mechanical
computations. Look at the conventional way of presenting data in a frequency
distribution table as done in Lesson 7.

Table 8.2. Frequency Distribution of Scores in Assessment of Learning 2


Test Taken by 50 Teachers Education Students in a
State University in Metro Manila (Artificial Data)

Class Interval   Midpoint (m)   Frequency (f)   mf
96 - 103         99.5           4               398
88 - 95          91.5           6               549
80 - 87          83.5           8               668
72 - 79          75.5           12              906
64 - 71          67.5           10              675
56 - 63          59.5           5               297.5
48 - 55          51.5           5               257.5
Total                           N = 50          Σmf = 3751

In the traditional way, it cannot be argued that you can see at a glance how the
scores are distributed among the range of values in a condensed manner. You can
even estimate the average of the scores by looking at the frequency in each class
interval. In the absence of statistical program, the mean can be computed with the
following formula:

x = Σmf / N
Where:
x = the mean
m = midpoint of the class interval
f = frequency of each class interval
N = total frequency

Thus, the mean of the test scores in Table 8.2 is calculated as follows:

x = Σmf / N = 3751 / 50 = 75.02

The easiest way is to use SPSS (Statistical Package for the Social Sciences) by
simply following these steps:
1. Open the Data Editor window. It is understood you have prepared the data set
earlier.
2. On the menu bar, click Analyze, then Descriptive Statistics, then Frequencies. This
opens the Frequencies dialog box.
3. Press Continue on the Descriptive Option box, then press OK on the left
Descriptive Box, and you will finally see the following image.

[SPSS Descriptives output for the variable “scores,” showing the Mean, Std. Deviation, Minimum, and Maximum.]

Median

Median is the value that divides the ranked score into halves, or the middle value of
the ranked scores. If the number of scores is odd, then there is only one middle value
that gives the median. However, if the number of scores in the set is even, then there
are two middle values. In this case, the median is the average of these two middle
values.

If there are more than 30 scores, arranging the scores and finding the middle value
will take time. The scientific calculator will not give you the median. Again, statistical
software can do this for you with simple steps similar to finding the mean.
1. On the menu bar, click Analyze, then Descriptive Statistics, then Frequencies.
This opens the Frequencies dialog box.
2. Click on the desired variable name in the left box. In the data set, let us consider
the test scores also in the Table. Move your cursor to Statistics and the
frequency box will pop out. Click Median.
3. You will also see that you can use the same process in finding the mean. Earlier,
you opted to use Descriptives instead of Frequencies. Then click Continue and
press OK.

Again, how do you work it out the conventional way? Either, you rank the 50 scores,
which takes time, or you arrange the scores in the frequency distribution as shown here:

Table 8.2. Frequency Distribution of Scores in Assessment of Learning 2


Test Taken by 50 Teachers Education Students in a
State University in Metro Manila (Artificial Data)

Class Interval   Midpoint (m)   Frequency (f)   mf      Less than Cumulative Frequency (<Cf)
96 - 103         99.5           4               398      50
88 - 95          91.5           6               549      46
80 - 87          83.5           8               668      40
72 - 79          75.5           12              906      32   (median class)
64 - 71          67.5           10              675      20
56 - 63          59.5           5               297.5    10
48 - 55          51.5           5               257.5    5
Total                           N = 50          Σmf = 3751

This formula will help you determine the median:

Mdn = L + c [ (N/2 − <Cf) / f1 ]

Where: L = lower limit of the median class
c = size of the class interval
N/2 = half the number of cases
<Cf = less than cumulative frequency below the median class
f1 = frequency of the median class

Solution: Mdn = 71.5 + 8 [ (25 − 20) / 12 ]
Mdn = 71.5 + 8 [ 5 / 12 ] = 71.5 + 8 (0.41666667)
Mdn = 71.5 + 3.3333333 = 74.8333333 or 74.83

Mode

Mode is the easiest measure of central tendency to obtain. It is the score or value
with the highest frequency in the set of scores. If the scores are arranged in a frequency
distribution, the mode is estimated as the midpoint of the class interval which has the
highest frequency. The class interval with the highest frequency is also called the modal
class. In a graphical representation of the frequency distribution, the mode is the value
in the horizontal axis at which the curve is at its highest point. If there are two highest
points, then, there are two modes. When all the scores in a group have the same
frequency, the group of scores has no mode.

Considering the test data in Table 8.2, it can be seen that highest frequency of 12
occurred in the class interval 72 - 79. The rough estimate of the mode is 75.5, which
is the midpoint of the class interval. Using statistical software and following the steps in
finding the mean and the median, the mode will also appear in the output.
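
In the absence of statistical software, the grouped-data computations above are also easy to script. Here is a minimal Python sketch (an added illustration, not part of the module) that reproduces the grouped mean (75.02), the median (74.83), and the mode estimate (75.5) from Table 8.2.

# Minimal sketch: grouped-data mean, median, and mode for Table 8.2.
classes = [            # (lower limit, upper limit, frequency), lowest class first
    (48, 55, 5), (56, 63, 5), (64, 71, 10), (72, 79, 12),
    (80, 87, 8), (88, 95, 6), (96, 103, 4),
]
N = sum(f for _, _, f in classes)
c = 8                                          # class width

# Mean = sum(midpoint * f) / N
mean = sum(((lo + hi) / 2) * f for lo, hi, f in classes) / N

# Median = L + c * ((N/2 - <cf) / f1), using the class that contains N/2
cum = 0
for lo, hi, f in classes:
    if cum + f >= N / 2:
        median = (lo - 0.5) + c * ((N / 2 - cum) / f)   # lo - 0.5 is the lower boundary
        break
    cum += f

# Mode estimate = midpoint of the class with the highest frequency
modal_lo, modal_hi, _ = max(classes, key=lambda cls: cls[2])
mode = (modal_lo + modal_hi) / 2

print(round(mean, 2), round(median, 2), mode)   # 75.02 74.83 75.5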

Measures of Dispersion

One important descriptive statistic in the area of assessment is the measure of
dispersion, which indicates “variability,” “spread,” or “scatter.” See Figure 8.1.

You can see that different distributions may be symmetrical and may have the same
average value (mean, median, mode), but how the scores in each distribution are
spread out around these measures differs.

In A, as shown in Figure 8.1, scores range between 40 and 60; in B, between 30


and 70, and in C between about 20 and 80. Measures of variability give us the estimate
to determine how the scores are compressed, which contributes to the “flatness” of the
distribution.

[Figure 8.1. Measures of Variability of Sets of Test Scores]

There are several indices of variability, and the most commonly used in the area of
assessment are the following:

Range. It is the difference between the highest score and the lowest score in a
distribution. It is the simplest measure of variability but also considered as the least
accurate measure of dispersion because its value is determined by just two scores in
the group. It does not take into consideration the spread of all scores; its value simply
depends on the highest and lowest scores. Its value could be drastically changed by a
single value. Consider the following examples:

Determine the range for the following scores:


9, 9, 9, 12, 12, 13, 15, 15, 17, 17, 18, 18, 20, 20, 20.

Range = Highest Score (HS) - Lowest Score (LS)


Range = 20 - 9 = 11

Now, replace one of the scores with a higher value, say, change the last score to 40.
The range becomes:

Range = HS - LS
Range = 40 - 9
Range = 31

You will see that, with just a single score, the range increased greatly, which can be
interpreted as a large dispersion of test scores; however, when you look at the individual
scores, this is not the case.

Variance and Standard Deviation. Standard Deviation is the most widely used
measure of variability and is considered as the most accurate to represent the
deviations of individual scores from the mean values in the distribution.

Examine the following test score distributions:

Class A Class B Class C


22 16 12
18 15 12
16 15 12
14 14 12
12 12 12
11 11 12
9 11 12
7 9 12
6 9 12
5 8 12
Σx = 120 Σx = 120 Σx = 120

Solve for the mean: xA = 120 / 10 = 12;  xB = 120 / 10 = 12;  xC = 120 / 10 = 12

You will note that while the distributions contain different scores, they have the
same mean. If you ask how each mean represents the score in their respective
distribution, there will be no doubt with the mean of distribution C because each score in
the distribution is 12. How about in distributions A and B? For these two distributions,
the mean of 12 is a better estimate of the scores in distribution B than in distribution A.
You can see that no score in B is more than 4 points away from the mean of 12.
However, in distribution A, many of the scores are 4 points or more away from the
mean. You can see that there is less variability of scores in B than in A.

Recall that Σ(x - x ) is the sum of the deviation scores from the mean, which is
equal to zero. As such, you square each deviation score, then sum up all the squared
deviation scores, and divide it by the number of cases. This yields the variance. Getting
its square root is the standard deviation.

The measure is generally defined by the formula:

σ² = Σ(x − μ)² / N

Where:
σ² = population variance
μ = population mean
x = score in the distribution
N = number of scores in the distribution

Taking the square root gives us this formula for the standard deviation. That is,

σ = √[ Σ(x − μ)² / N ]

Where:
σ = population standard deviation; x = score in the distribution
μ = population mean; N = number of scores in the distribution

If you are dealing with sample data and wish to calculate an estimate of σ, the
following formula is used for such a statistic:

s = √[ Σ(x − x̄)² / (N − 1) ]

Where:
s = sample standard deviation; x = raw score in the distribution
x̄ = sample mean; N = number of scores in the distribution

The measure is also defined by the sample variance formula:

s² = Σ(x − x̄)² / (N − 1)

Where:
s² = sample variance; x = raw score in the distribution
x̄ = sample mean; N = number of scores in the distribution

Using the scores in Class A and Class B in the above data set, you can apply the
formula:

Class A                              Class B
x     x − x̄     (x − x̄)²             x     x − x̄     (x − x̄)²
22    10        100                  16    4         16
18    6         36                   15    3         9
16    4         16                   15    3         9
14    2         4                    14    2         4
12    0         0                    12    0         0
11    -1        1                    11    -1        1
9     -3        9                    11    -1        1
7     -5        25                   9     -3        9
6     -6        36                   9     -3        9
5     -7        49                   8     -4        16
Σx = 120        Σ(x − x̄)² = 276      Σx = 120        Σ(x − x̄)² = 74

a) Solve for the mean:

   x̄ = Σx / N = 120 / 10 = 12   (the same for both classes)

b) Solve for the variance:

   Class A: s² = Σ(x − x̄)² / (N − 1) = 276 / 9 = 30.67
   Class B: s² = Σ(x − x̄)² / (N − 1) = 74 / 9 = 8.22

c) Solve for the standard deviation (s = √s²):

   Class A: s = √30.67 = 5.54
   Class B: s = √8.22 = 2.87

You may be thinking that the process will be difficult if you are dealing with many
scores in a distribution. This is not really a problem if you have a scientific calculator.
Secure a scientific calculator for quick and accurate solutions.
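As an optional check on the hand computations for Class A and Class B, the following Python sketch applies the sample-variance formula with N − 1 in the denominator; the function name is illustrative only:

# Sample variance (N - 1 denominator) and standard deviation
def sample_variance(scores):
    n = len(scores)
    mean = sum(scores) / n
    return sum((x - mean) ** 2 for x in scores) / (n - 1)

class_a = [22, 18, 16, 14, 12, 11, 9, 7, 6, 5]
class_b = [16, 15, 15, 14, 12, 11, 11, 9, 9, 8]

for label, scores in (("Class A", class_a), ("Class B", class_b)):
    variance = sample_variance(scores)
    print(label, round(variance, 2), round(variance ** 0.5, 2))
    # Class A: 30.67 and 5.54;  Class B: 8.22 and 2.87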

Measures of Position

While measures of central tendency and measures of dispersion are used often in
assessment, there are other methods of describing data distributions such as using
measures of position or location. What are these measures?

Quartile. In our discussion of the measures of central tendency, you learned that the median of a distribution divides the data into two equal groups. In a similar way, the quartiles are the three values that divide a set of scores into four equal parts, with one-fourth of the data values in each part. This means that about 25% of the data fall at or below the first quartile (Q1), 50% of the data fall at or below the second quartile (Q2), and 75% fall at or below the third quartile (Q3).

Notice that Q2 is also the median. You can also say that Q1 is the median of the first
half of the values, and Q3 the median of the 2nd half of the values.
Example: Given the following scores, find the first quartile, the third quartile, and the quartile deviation.
90 85 85 86 100 105 109 110 88 105 100 112

Steps:
1. Arrange the scores in increasing order.
2. From the bottom, find the points below which 25% of the score values and 75% of the score values fall.
3. Find the average of the two scores at each of these points to determine Q1 and Q3, respectively.
4. Find the quartile deviation Q using the formula:

   Q = (Q3 − Q1) / 2
Solutions:
85  85  86  88  90  100  100  105  105  109  110  112

Q1 = (86 + 88) / 2 = 174 / 2 = 87
Q3 = (105 + 109) / 2 = 214 / 2 = 107

Note that in the above example, the left and right halves each contain an even number of values, so the median in each half is the average of the two center values.

Consequently, applying the formula Q = (Q3 − Q1) / 2 gives the quartile deviation. That is,

Q = (107 − 87) / 2 = 20 / 2 = 10
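The same median-of-halves procedure can be scripted. The Python sketch below follows the steps above and assumes, as in the example, an even number of scores so that each half has a clear pair of middle values:

# Quartiles and quartile deviation by taking the median of each half of the scores
def median(values):
    values = sorted(values)
    n = len(values)
    mid = n // 2
    return values[mid] if n % 2 else (values[mid - 1] + values[mid]) / 2

scores = sorted([90, 85, 85, 86, 100, 105, 109, 110, 88, 105, 100, 112])
half = len(scores) // 2

q1 = median(scores[:half])          # median of the lower half -> 87
q3 = median(scores[half:])          # median of the upper half -> 107
quartile_deviation = (q3 - q1) / 2  # (107 - 87) / 2 = 10
print(q1, q3, quartile_deviation)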

Decile. It divides the distribution into ten equal parts. There are nine (9) deciles, such that 10% of the scores are equal to or less than decile 1 (D1), 20% of the scores are equal to or less than decile 2 (D2), and so on. A student whose mark is below the first decile is said to belong to decile 1. A student whose mark is between the first and second deciles is in decile 2. A student whose mark is above the ninth decile belongs to decile 10.

Percentile. It divides the distribution into 100 equal parts. In the same manner, there are 99 percentiles, such that 1% of the scores are less than the first percentile, 2% of the scores are less than the second percentile, and so on.

Example: If you scored 95 in a 100-item test and your percentile rank is 99th, this means that 99% of those who took the test performed lower than you. This also means that you belong to the top 1% of those who took the test.


Another Example: A percentage score of 75% means you got 75 items correct out of a hundred items, which is a mark or grade reflecting your performance level. A percentile, however, is a measure of position, so a mark at the 75th percentile means that 75% of the students who took the test got a lower score than you, or that your score is located in the upper 25% of the class that took the same test. For very large data sets, percentiles are appropriate to use for accuracy.
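There are several conventions for computing a percentile rank; one simple convention counts the percentage of scores that are lower than a given score, as in this illustrative Python sketch (the data set is made up for the example):

# Percentile rank as the percentage of scores lower than a given score
def percentile_rank(score, all_scores):
    below = sum(1 for s in all_scores if s < score)
    return 100 * below / len(all_scores)

scores = [60, 65, 70, 72, 75, 78, 80, 85, 88, 95]
print(percentile_rank(88, scores))   # 80.0 -> 80% of this group scored lower than 88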
The Normal Distribution

The normal distribution is a special kind of symmetrical distribution that is most frequently used to compare scores. It has been found that when a frequency polygon for a large distribution of scores on naturally occurring characteristics (IQ, height, income, test scores, etc.) is drawn as a smooth curve, one curve stands out: the bell-shaped curve. As seen below, this curve has a small percentage of observations on both tails and a bigger percentage on the inner part of the curve. The shape of this particular curve is known as the normal curve, hence the name normal distribution.

Figure 8.2. The Normal Curve

It is also called Gaussian Distribution, named after Carl Friedrich Gauss. This
distribution has been used as a standard reference for many statistical decisions in the
field of research and evaluation.

In assessment, the area in the curve refers to the number of scores that fall within a
specific standard deviation from the mean score. In other words, each portion under the
curve contains a fixed percentage of cases as follows:

 68% of the scores fall between one standard deviation below and above the mean.
 95% of the scores fall between two standard deviations below and above the mean.
 99.7% of the scores fall between three standard deviations below and above the mean.

The following figure further illustrates this theoretical model, showing the 68%, 95%, and 99.7% bands within one, two, and three standard deviations of the mean (from −3 to +3).

Figure 8.3. The Areas Under the Normal Curve

From the above figures, you can state the properties of the normal curve:
1. The mean, median, and mode are all equal.
2. The curve is symmetrical. As such, the area of a specific region on the left is equal to the area of its corresponding region on the right.
3. The curve changes from concave to convex and approaches the x-axis, but the tails never touch the horizontal axis.
4. The total area under the curve is equal to one (1).
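The fixed percentages listed above can be verified from the cumulative distribution of the standard normal curve, which can be written in terms of the error function. The following Python sketch uses only the standard math module:

import math

# Cumulative area under the standard normal curve up to z
def normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

for k in (1, 2, 3):
    area = normal_cdf(k) - normal_cdf(-k)   # area within k standard deviations of the mean
    print(k, round(area * 100, 2))          # prints about 68.27, 95.45, and 99.73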
Standard Scores

In the preceding topic, you discussed raw scores, which are the original scores collected from an actual testing situation. However, there are situations where computing measures from raw scores may not be enough.

Consider a situation where you, as a student, want to know in what subjects you
performed best and poorest to determine where you need to exert more effort. In cases
like these, you cannot find the answer by merely relying on a single score. More
concretely, if you get a score of 86 in Science and 90 in English, you cannot conclude
that you performed better in English simply because 90 is higher than 86. Say you later
learned that the mean score of the class in Science was 80, and in English, the mean
score was 95. This situation indicates that a single score like 86 or 90 is not meaningful
unless it is compared with other test scores.

In particular, a score can be interpreted more meaningfully if you know the mean
and variability of the other scores where the single score belongs. Knowing this, a raw
score can be converted into Standard Scores.

There are many kinds of standard scores.


1. z-score. The most useful standard score, often used to express a raw score in relation to the mean and standard deviation. This relationship is expressed in the following formula:

   z = (x − x̄) / s

   Where:
   z = z-score          x̄ = mean
   x = raw score        s = standard deviation

The standard deviation helps you locate the relative position of the score in
a distribution. The equation gives you the z-score, which can indicate the
number of standard deviations the score is above or below the mean. A z-score
is called a standard score, simply because it is a deviation score expressed in
standard deviation units.

If raw scores are expressed as z-scores, you can see their relative position
in their respective distribution. If the raw scores are already converted into
standard scores, you can now compare the two scores even when these scores
come from different distributions or when scores are measuring two different
things, like knowledge in English or Science. The following figure illustrates this
point.

Figure 8.4. A Comparison of Score Distributions with Different Means and Standard Deviations
In the above figure, a score of 86 in Science indicates better performance than a score of 90 in English. Let us suppose that the standard deviations in Science and English are 3 and 2, respectively. You can express these raw scores as z-scores.


Science:                         English:
z = (x − x̄) / s                  z = (x − x̄) / s
z = (86 − 80) / 3                z = (90 − 95) / 2
z = 6 / 3 = 2                    z = −5 / 2 = −2.5

From the above, if 86 and 90 are your scores in the two subjects, you can
confidently say that, compared with the rest of your class, you performed better
in Science than in English.
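The same comparison can be written as a brief Python sketch; the means and standard deviations are the ones assumed in the example above:

# z-score: number of standard deviations a raw score lies above or below the mean
def z_score(raw, mean, sd):
    return (raw - mean) / sd

science_z = z_score(86, 80, 3)   # 2.0  -> two SDs above the Science mean
english_z = z_score(90, 95, 2)   # -2.5 -> two and a half SDs below the English mean
print(science_z, english_z)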

2. T-score. As you can see in the computation of the z-score, it can give you a negative number, which simply means the score is below the mean. However, communicating a negative z-score as being below the mean may not be understandable to others; you would not even tell students that they got a negative z-score. A z-score may also be a repeating or non-repeating decimal, which may be awkward to communicate. One option is to convert a z-score into a T-score, which is a transformed standard score. To do this, the z-score scale is rescaled so that the mean of 0 becomes a mean of 50 and each standard deviation unit is multiplied by 10. The corresponding equation is:

T-score = 50 + 10z

For example, a z-score of – 2 is equivalent to a T-score of 30. That is:

T-score = 50 + 10(- 2)
T-score = 50 - 20
T-score = 30

Looking back at the Science score of 86, which resulted in a z-score of 2 as shown above, its T-score equivalent is:

T-score = 50 + 10( 2)
T-score = 50 + 20
T-score = 70
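Because a T-score is simply a rescaled z-score, the conversion is a one-line computation, as this small Python sketch shows:

# T-score rescales a z-score to a mean of 50 and a standard deviation of 10
def t_score(z):
    return 50 + 10 * z

print(t_score(-2))   # 30
print(t_score(2))    # 70 -> the T-score equivalent of the Science z-score of 2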

3. Stanine Scores. Another standard score is the stanine, shortened from standard nine. With nine in its name, the scores are on a nine-point scale. In a z-score distribution, the mean is 0 and the standard deviation is 1. Each stanine is one-half standard deviation wide. Like the T-score, a stanine score can be calculated from the z-score by multiplying the z-score by 2 and adding 5. That is,

Stanine = 2z + 5

Going back to our example of a score of 86 in Science, which is equivalent to a z-score of 2, its stanine equivalent is:

Stanine = 2(2) + 5 = 4 + 5 = 9

Scores in the stanine scale have some limitations. Since they are on a 9-point scale and expressed as whole numbers, they are not precise. Different z-scores or T-scores may have the same stanine score equivalent. Example:

z-score     T-score     Stanine
2.1         71          9
2.0         70          9
1.9         69          9

On the assumption that stanine scores are normally distributed, the percentages of cases in each band or range of scores in the scale are as follows:

Stanine Score Percentage of Scores


1 Lowest 4%
2 Next Low 7%
3 Next Low 12%
4 Next Low 17%
5 Middle 20%
6 Next High 17%
7 Next High 12%
8 Next High 7%
9 Highest 4%

With the above percentage distribution of scores in each stanine, you can directly convert a set of raw scores into stanine scores. Simply arrange the raw scores from lowest to highest, and with the percentage of scores in each stanine, you can directly assign the appropriate stanine score to each raw score.
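A rough sketch of that direct conversion is shown below in Python. It assigns a stanine from each score's percentile position using the cumulative cut-offs implied by the table above (4, 11, 23, 40, 60, 77, 89, and 96 percent); tie-handling rules vary in practice, so treat this only as an illustration.

# Assign stanines from percentile position using the standard percentage bands
CUMULATIVE_CUTS = [4, 11, 23, 40, 60, 77, 89, 96]   # upper limits (in %) of stanines 1 to 8

def stanines_for(scores):
    ordered = sorted(scores)
    n = len(ordered)
    result = []
    for rank, score in enumerate(ordered):
        percent_below = 100 * rank / n                          # % of scores below this one
        stanine = 1 + sum(percent_below >= cut for cut in CUMULATIVE_CUTS)
        result.append((score, stanine))
    return result

for score, stanine in stanines_for([5, 7, 9, 11, 12, 14, 16, 18, 20, 22]):
    print(score, stanine)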

On the interpretation of stanine scores, let us say Minerva has a stanine score of 2. You can see that her score falls within the next lowest 7 percent of the scores. In the same way, if Dondre's score is in the 6th stanine, it falls between the 60th and 77th percentiles, simply because 60 percent of the scores are below the 6th stanine and 23 percent of the scores are above it.

For qualitative description, stanine scores of 1, 2, and 3 are considered below average; 4, 5, and 6 are average; and 7, 8, and 9 are above average. Thus, you can say that your score of 86 in Science is above average. Similarly, Minerva's score is below average, while Dondre's is average.

LESSON 9 :
ANALYSIS, INTERPRETATION, AND THE USE OF TEST DATA

THINK ABOUT THESE EXPECTATIONS:

1. Assess and communicate learners' level of achievement and performance through:
   a) fair;
   b) accurate; and,
   c) meaningful grading and reporting methods.
Opening

Grading and reporting are fundamental elements in the teaching-learning process. The assignment of grades represents the teacher's assessment of the learners' performance on the tests and on the desired learning outcomes as a whole. It is important that the bases and criteria for grading (i.e., scoring) and reporting test results are clearly established and articulated from the very start of the course. Teachers should ensure that grading and reporting of learners' test results are meaningful, fair, and accurate.

It is important that you review your prior knowledge and experiences, as well as the
standards or policies used by your institution in grading and reporting learners’
performance in the test and the course as a whole. You may also need to read books
and other references on the topics to validate your prior knowledge and to further enhance your knowledge and skills.

Purposes of Grading and Reporting Learners’ Test Performance

Grades do not exist in a vacuum but are part of the instructional process and serve as a feedback loop between the teacher and learners. They give feedback on what specific topic/s learners have mastered and what they need to focus on more when they review for summative assessments or final exams. Grades serve as a motivator for learners to study and do better in the next tests to maintain or improve their final grade.

Grades also give the parents information about their children's achievements. They provide teachers some bases for improving their teaching and learning practices and for identifying learners who need further educational intervention. They are also useful to school administrators who want to evaluate the effectiveness of the instructional programs in developing the needed skills and competencies of the learners.

Different Methods in Scoring Tests or Performance Tasks

There are various ways to score and grade results in multiple-choice tests.
Traditionally, the two most commonly-used scoring methods are number right scoring
(NR) and negative marking (NM).

Number Right Scoring (NR). It entails assigning positive values only to correct
answers while giving a score of zero to incorrect answers. The test score is the sum of
the scores for correct responses. One major concern with this scoring method is that
learners may get the correct answer by guessing; thus, affecting the test reliability and
validity.
Example: Solve for 3(x + 8) − (x − 2) = −38.
a) x = 32          b) x = 8          c) x = −8          d) x = −32

For the above item, the correct answer is d) x = −32, and this response will be given a score. Responses other than d will be given zero (0) points.

Negative Marking (NM). It entails assigning positive values to correct answers while penalizing the learners for incorrect responses (i.e., the right-minus-wrong scoring method). In this model, a fraction of the number of wrong answers is subtracted from the number of correct answers. Other models for this type of scoring method include:
1) giving a positive score to a correct answer while assigning no mark to omitted items; and,
2) rewarding learners for not guessing by awarding points rather than penalizing learners for incorrect answers. The recommended penalty for an incorrect answer is 1 / (n − 1), where n stands for the number of choices.
For the above item, scoring will be as follows: learners who chose letter d will be given a score, those who left the item unanswered will be given zero (0) points, and those who chose a, b, or c will get a negative score. With four options, the penalty per wrong answer is 1 / (4 − 1), or about 0.33. The total score is computed by adding the scores (e.g., 1, 0, −0.33) across all items.
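The two scoring rules can be expressed in a few lines of Python. The sketch below assumes four-option items, so the right-minus-wrong penalty is 1/(4 − 1) per incorrect answer, and omitted items are represented as None; the answer key and responses are invented for illustration.

# Number Right (NR) versus Negative Marking (NM) scoring of a multiple-choice test
def number_right(responses, key):
    return sum(r == k for r, k in zip(responses, key))

def negative_marking(responses, key, n_options=4):
    penalty = 1 / (n_options - 1)          # recommended penalty per wrong answer
    score = 0.0
    for r, k in zip(responses, key):
        if r is None:                      # omitted item: no mark
            continue
        score += 1 if r == k else -penalty
    return score

key       = ["d", "b", "a", "c", "d"]
responses = ["d", "b", "c", None, "a"]     # 2 right, 2 wrong, 1 omitted
print(number_right(responses, key))                 # 2
print(round(negative_marking(responses, key), 2))   # 2 - 2/3 = 1.33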

Both NR and NM methods of scoring multiple-choice tests are prone to guessing, which affects test validity and reliability.

There are four types of rating scales for the assessment of writing, which can also
be applied to other authentic or performance-type assessment. These four types of
scoring are (1) Holistic, (2) Analytic, (3) Primary Trait, and (4) Multiple Trait Scoring.

Holistic Scoring. It involves giving a single, overall assessment score for an essay, writing composition, or other performance-type assessment as a whole. Although the scoring rubric for holistic scoring lays out specific criteria for evaluating a task, raters do not assign a score for each criterion. Holistic scoring is considered efficient in terms of time and cost. It also does not penalize poor performance based on only one aspect (e.g., content, delivery, organization, vocabulary, or coherence for an oral presentation).

The following is an example of a holistic rubric for an oral presentation:

Rating/Grade       Characteristics

A (Exemplary)      Is very organized. Has a clear opening statement that catches the audience's interest. Content of the report is comprehensive and demonstrates substance and depth. Delivery is very clear and understandable. Uses slides/multimedia equipment effortlessly to enhance the presentation.

B (Satisfactory)   Is mostly organized. Has an opening statement relevant to the topic. Covers important topics. Has appropriate pace and no distracting mannerisms. Looks at slides to keep on track.

C (Emerging)       Has an opening statement relevant to the topic but does not give an outline of the speech; is somewhat disorganized. Lacks content and depth in the discussion of the topic. Delivery is fast and not clear; some items are not covered well. Relies heavily on slides and notes and makes little eye contact.

D (Unacceptable)   Has no opening statement regarding the focus of the presentation. Does not give adequate coverage of the topic. Is often hard to understand, with a voice that is too soft or too loud and a pace that is too quick or too slow. Just reads the slides; slides contain too much text.

Analytic Scoring. It involves assessing each aspect of a performance task (e.g., essay writing, oral presentation, class debate, or research paper) and assigning a score for each criterion. Sometimes, an overall score is given by averaging the scores in all criteria. One advantage of analytic scoring is its reliability. It also provides information that can be used diagnostically, as it reveals learners' strengths and weaknesses and in what area/s, and can eventually serve as a basis for remedial instruction. However, it is more time-consuming and, therefore, expensive. It is also more difficult to create.
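When an analytic rubric also assigns a weight to each criterion, the overall mark is a weighted combination of the criterion scores. The Python sketch below shows one possible arrangement; the criteria, weights, and scores are illustrative only, not a prescribed rubric:

# Weighted overall score from an analytic rubric (criterion scores on a 1-4 scale)
criterion_scores = {"content": 4, "organization": 3, "grammar": 3, "support": 2}
weights          = {"content": 0.4, "organization": 0.2, "grammar": 0.2, "support": 0.2}

overall = sum(criterion_scores[c] * weights[c] for c in criterion_scores)
print(round(overall, 2))   # 3.2 on the 1-4 scale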

Primary Trait Scoring. It focuses on only one aspect or criterion of a task, and a
learner’s performance is evaluated on only one trait. This scoring system defines a
primary trait in the task that will then be scored. For example, if a teacher in a political
science class asks his students to write an essay on the advantages and disadvantages
of Martial Law (i.e., the writing task), the basic question addressed in scoring is, “Did the
writer successfully accomplish the purpose of this task?” With this focus, teacher would
ignore errors in conventions of written language but instead focus on overall rhetorical
effectiveness. One disadvantage of this scoring scheme is that it is often difficult to
focus exclusively on one trait, such that other traits may be included when scoring.
Thus, it is important that a very detailed scoring guide is used for each specific task.

Multiple-Trait Scoring. It requires that an essay test or performance task be scored on more than one aspect, with scoring criteria in place so that they are consistent with the prompt. Multiple-trait scoring is task-specific, and the features to be scored vary from task to task, thus requiring separate scores for different criteria. Multiple-trait scoring is similar to analytic scoring because of its focus on several categories or criteria. For example, scoring criteria for writing performance may include the abilities to present an argument clearly, to organize one's ideas, and so on.


Different Types of Test Scores

Grading methods communicate the teachers' evaluative appraisal of learners' level of achievement or performance in a test or task. In grading, teachers convert different types of descriptive information and various measures of learners' performance into grades or marks that will provide feedback to learners, parents, and other stakeholders about learners' achievement. Test scores can take the form of any of the following: 1) raw scores, 2) percentage scores, 3) criterion-referenced scores, and 4) norm-referenced scores.

1. Raw Score. It is simply the number of items answered correctly on a test. A raw
score provides an indication of the variability in the performance of students in
the class. However, a raw score has no meaning unless you know what the test
is measuring and how many items it contains. A raw score also does not mean
much because it cannot be compared with a standard or with the performance of
another learner or of the class as a whole.

For example, a raw score of 95 would look impressive, but only if there are 100
items in the test. However, if the test contains 500 items, then the raw score of
95 is not good at all.

A test that only gives a raw score but not the total number of items does not
measure and communicate the learner’s performance or achievement. Raw
scores may be useful if everyone knows the test and what it covers, how many
possible right answers there are, and how learners typically do in the test.

2. Percentage Score. This refers to the percent of items answered correctly in a test. The number of items answered correctly is typically converted to percent based on the total possible score. The percentage score is interpreted as the percent of content, skills, or knowledge that the learner has a solid grasp of. Just like the raw score, the percentage score has a limitation because there is no way of comparing the percentage correct obtained in one test with the percentage correct in another test with a different difficulty level.

The percentage score is most appropriate to use in teacher-made or criterion-referenced tests that are administered commonly to a class or to students taking the same course with the same contents or syllabus. In this way, the students' test performances can be compared with each other in the class or with their peers in another section.

3. Criterion-Referenced Grading System. This is a grading system wherein learners' test scores or achievement levels are based on their performance against specified learning goals, outcomes, and performance standards. Criterion-referenced grading is premised on the assumption that a learner's performance is independent of the performance of the other learners in the group/class.

The following are some of the types of criterion-referenced scores or grades:


3.1 Pass or Fail Grade. This type of score is most appropriate if the test or assessment is used primarily or entirely to make a pass-or-fail decision. In this type of scoring, a standard or cut-off score is set, and a learner is given a score of pass if he or she surpasses the expected level of performance or the cut-off score. Pass or fail scoring is most appropriate for comprehensive or licensure exams because there is no limit to the number of examinees who can pass or fail. Each individual examinee's performance is compared to an absolute standard and not to the performance of others.

Pass or fail grading has the following advantages:


1. It takes pressure off the learners in getting a high letter or numerical
grade, allowing them to relax while still getting the needed education;
2. It gives learners a clear-cut idea of their strengths and weaknesses; and,
3. It allows learners to focus on true understanding or learning of the course
content rather than on specific details that will help them receive a high
letter or numerical score.

3.2 Letter Grade. This is one of the most commonly used grading systems. Letter grades are composed of a five-level grading scale labelled from A to E or F, with A representing the highest level of achievement or performance and E or F, the lowest grade, representing a failing grade. These are often used for all forms of learners' work, such as quizzes, essays, projects, and assignments. An example of the descriptors for letter grades is presented below:

Letter Grades Interpretation


A Excellent
B Good
C Satisfactory
D Poor
E Unacceptable
The above evaluative descriptors indicate the teachers’ criterion-referenced
judgment of the learners’ achievement or performance level. However, it would be best
that these descriptors are paired with specific performance indicators that identify the
qualitative differences between categories.

The disadvantage of letter grades is that the cut-offs between grade categories are always arbitrary and difficult to justify.

For example, if a score of C ranges from 76 to 85, learners who get a grade of 76 in a writing test and those who receive a grade of 85 will both get the same letter grade of C despite the nine-point difference. If the next range of grades is 86 to 96, then the one who gets an 86 receives a grade of B although it is just one point higher than 85, which receives a grade of C. Furthermore, letter grades lack the richness of more detailed grading methods.
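Using only the illustrative cut-offs in the example above (76 to 85 = C, 86 to 96 = B), a simple mapping function in Python makes the arbitrariness of the boundaries easy to see; these ranges are not a recommended scale:

# Map a numeric score to a letter grade using the example cut-offs
def letter_grade(score):
    if score >= 86:
        return "B"        # example range: 86 - 96
    if score >= 76:
        return "C"        # example range: 76 - 85
    return "below C"      # outside the example ranges

print(letter_grade(85), letter_grade(86))   # C B -> one point apart, different letter grades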

3.3 Plus (+) and Minus (−) Letter Grades. This grading provides a more detailed description of the level of learners' achievement or task/test performance by dividing each grade category into three levels, such that a grade of A can be assigned as A+, A, and A−; B+, B, and B−; and so on. Plus (+) and minus (−) grades provide a finer discrimination between achievement or performance levels. They also increase the accuracy of grades as a reflection of the learner's performance; enhance student motivation (i.e., to get a high A rather than an A−); and discriminate among courses or star sections.

However, the +/− grading system is viewed as unfair, particularly for learners in the highest category; it creates more stress for learners; and it is more difficult for teachers, as they need to deal with more grade categories when grading learners. Examples of the descriptors for plus (+) and minus (−) letter grades are presented below:

(+)/(−) Letter Grades     Interpretation
A+ Excellent
A Superior
A− Very Good
B+ Good
B Very Satisfactory
B− High Average
C+ Average
C Fair
C− Pass
D Conditional
E/F Failed

3.4 Categorical Grades. This system of grading is generally more descriptive than letter grades, especially if coupled with verbal labels. Verbal labels eliminate the need for a key or legend to explain what each grade category means. Examples of categorical grades are:

Exceeding        Meeting          Approaching      Emerging         Not Meeting
Standards        Standards        Standards        Standards        Standards
Advanced         Intermediate     Basic            Novice           Below Basic
Exemplary        Accomplished     Developing       Beginning        Inadequate
Expert           Proficient       Competent        Apprentice       Novice
Master           Distinguished    Proficient       Intermediate     Novice

Categorical grading methods have the same drawbacks as letter grades. Like
letter grades, the categorical grades provide cut-offs between levels that are
often arbitrary, lack the richness of more detailed reporting methods, and fail
to provide feedback or information that can be used to diagnose learners’
weaknesses and refer for remediation.

4. Norm-Referenced Grading System. In this method of grading, learners' test scores are compared with those of their peers. Norm-referenced grading involves rank ordering learners and expressing a learner's score in relation to the achievement of the rest of the group (i.e., class or grade level, school, etc.).
Norm-referenced grading allows teachers to:
1) Compare learners’ test performance with that of other learners;
2) Compare learners’ performance in one test (subtest) with another test
(subtest); and,
3) Compare learners’ performance in one form of the test with another form of
the test administered at an earlier date.

There are different types of norm-referenced scores:


4.1 Developmental Score. This is a score that has been transformed from raw scores and reflects the average performance at age and grade levels. There are two kinds of developmental scores: 1) grade-equivalent and 2) age-equivalent scores.

4.1.1 Grade-Equivalent Score. It is described as both a growth score and a status score. The grade equivalent of a given raw score on any test indicates the grade level at which the typical learner earns this raw score. It describes the test performance of a learner in terms of a grade level and the months since the beginning of the school year. A decimal point is used between the grade and the month in a grade equivalent.

For example, a score of 7.5 means that the learner did as well as a typical Grade 7 learner taking the test at the end of the fifth month of the school year.
4.1.2 Age-Equivalent Score. It indicates the age level for which a given raw score is typical. It reflects a learner's performance in terms of chronological age as compared to those in the norm group. Age-equivalent scores are written with a hyphen between years and months.

For example, a learner's score of 11-5 means that his or her age equivalent is 11 years and 5 months old, indicating a test performance that is similar to that of 11½-year-olds in the norm group.

4.2 Percentile Rank. This indicates the percentage of scores that fall at or below a given score. Percentile ranks range from 1 to 99.

For example, if a learner obtained a percentile rank of 75 in a standardized achievement test, it means that the learner was able to get a higher score than 75% of the learners or peers in the norm group.

4.3 Stanine Score. This system expresses test results in nine equal steps,
which range from one (lowest) to nine (highest). A stanine score of 5 is
interpreted as “average” stanine. Percentile ranks are grouped into stanines,
with the following verbal interpretations:

Description Stanine Percentile Rank


Very High 9 96 and above
Above Average 8 90 - 95
7 77 - 89
6 60 - 76
Average 5 40 - 59
4 23 - 39
Below Average 3 11 - 22
2 4 - 10
Very Low 1 3 and below

4.4 Standard Scores. These are raw scores that are converted into a common scale of measurement that provides a meaningful description of the individual scores within the distribution. A standard score describes how far the raw score is from the sample mean, expressed in standard deviation units. The two most commonly used standard scores are 1) the z-score and 2) the T-score.

4.4.1 z-score. It is one type of standard score. Z-scores have a mean of 0 and a standard deviation of 1. It is computed using the following formula:

z = (x − x̄) / s     or     z = (x − μ) / σ

4.4.2 T-score. It is another type of standard score, where the mean is equal to 50 and the standard deviation is equal to 10. It is a linear transformation of the z-score, which has a mean of 0 and a standard deviation of 1. It is computed from a z-score with the following formula:

T-score = 50 + 10z

A T-score of 50 is considered "average", with T-scores ranging from 40 to 60 considered within the normal range. T-scores of 30 and below and T-scores of 70 and above are interpreted as low and high test performance, respectively.
General Guidelines in Grading Tests or Performance Tasks

The following are the general guidelines in grading tests or performance tasks:
1. Stick to the Purpose of the Assessment.
 Before coming up with an assessment, it is first important to determine the
purpose of the test. Will the assessment be used for diagnostic purposes?
Will it be a formative assessment or a summative assessment?
Diagnostic and formative assessments are generally not graded.
 Diagnostic assessments are primarily used to gather feedback about the
learners’ prior knowledge or misconception before the start of a learning
activity, while results from formative assessments are used to determine
what learners need to improve on or what topics or course contents need
to be addressed and given emphasis by the teacher.

2. Be Guided by the Desired Learning Outcomes.


 The learners should be informed early on what is expected of them
insofar as learning outcomes are concerned, as well as how they will be
assessed and graded in the test. Such information can be disseminated
through the course syllabus or during course introduction.

3. Develop Grading Criteria.


Grading criteria to be used in traditional tests and performance tasks should be
made clear to the students. Similarly, learners should also be informed of the
weight of each criterion. Grading criteria and weights should be applied fairly and
consistently.

4. Inform Learners what Scoring Methods are to be Used. Learners should be


made aware before the start of testing, whether their test responses are to be
scored based on the number right, negative marking, or through non-
conventional scoring methods.

5. Decide on what Type of Test Scores to Use. As discussed earlier, there are
different ways by which student learning can be measured and presented.
Performance in a particular test can be measured and reported through raw
scores, percentage scores, criterion-referenced scores, or norm-referenced
scores. It is important that different types of grading scheme be used for different
tests, assignments, or performance tasks. Learners should also be informed at
the start of what grading system is to be used for a particular test or task.

General Guidelines in Grading Essay Tests

Essays require more time to grade than the other types of traditional tests. Grading
essay tests can also be influenced by extraneous factors, such as learners’ handwriting
legibility and raters’ biases. The following are the general guidelines in scoring essay
tests:
1. Identify the Criteria for Rating the Essay. The criteria or standards for
evaluating the essay should be predetermined. Some of the criteria that can be
used include content, organization/format, grammar proficiency, development
and support, focus and details, etc. It is important that the specific standards and
criteria included are relevant to the type of performance task given.

2. Determine the Type of Rubric to Use. There are two types of rubric: holistic or
analytic scoring system. Holistic rubrics require evaluating the essay and taking
into consideration all the criteria. Only a single score is given based on the
overall judgment of the learner's writing composition. A holistic rubric is viewed as more convenient for teachers, as it requires fewer areas or aspects of writing to evaluate. However, it does not provide feedback on which course topics/contents are weak and need improvement. On the other hand, an analytic scoring system requires that the essay be evaluated based on each of the criteria. It provides useful feedback on the learner's strengths and weaknesses for each course content or criterion.

3. Prepare the Rubric. In developing rubric, the skills and competencies related to
essay writing should first be identified. These skills and competencies represent
the criteria. Then, performance benchmarks and point values are determined.
Performance marks can be numerical categories, but the most frequently used
are descriptors with corresponding rating scales.

Point Values     Sample Performance Benchmarks
1                Needs Improvement    Beginning       Novice          Inadequate
2                Satisfactory         Developing      Apprentice      Developing
3                Good                 Accomplished    Proficient      Proficient
4                Exemplary            Exceptional     Distinguished   Skilled

4. Evaluate Essays Anonymously. Checking of essays should be done anonymously. It is important that the rater does not know whose paper he/she is rating.

5. Score One Essay Question at a Time. This is to ensure that the same thinking
and standards are applied for all learners in the class. The rater should try to
avoid any distraction or interruption when evaluating the same item.

6. Be Conscious of Own Biases when Evaluating a Paper. The rater should not
be affected by learners’ handwriting, writing style, length of responses, and other
factors. He/she should stick to the criteria included in the rubric when evaluating
the essay.

7. Review Initial Scores and Comments Before Giving the Final Rating. This is
important especially for essays that were initially given a barely passing or failing
grade.

8. Get Two or More Raters for essays that are high-stakes, such as those used for admission, placement, or scholarship screening purposes. The final grade will be the average of all the ratings given.

9. Write Comments. Write comments next to the learner's responses to provide feedback on how well he or she has performed in the essay test.

The New Grading System of the Philippine K-12 Program

On April 1, 2015, the Department of Education, through DepEd Order No. 8, s. 2015, announced the implementation of a new grading system for all grade levels in public schools from elementary to Senior High School. Although private schools are not required to implement the same guidelines, they are encouraged to follow them and are permitted to modify them in accordance with their institution's philosophy, vision, and mission. The grading system is described as a standards- and competency-based grading system, where 60 is the minimum grade needed to pass a specific learning area, which is transmuted to 75 in the report card. The lowest mark that can appear in the report card is 60 for quarterly grades and final grades. Grades are based on the weighted raw scores of the learners' summative assessments in three components: Written Work (WW), Performance Task (PT), and Quarterly Assessment (QA).

Steps                                         Examples
Get the total raw score for each              WW1 + WW2 + WW3 + . . . = Total WW (e.g., 145 out of 160)
component.                                    PT1 + PT2 + PT3 + . . . = Total PT (e.g., 100 out of 120)
                                              QA = 40 out of 50
Convert each total to a Percentage            WW = 145/160 x 100 = 90.63
Score (PS).                                   PT = 100/120 x 100 = 83.33
                                              QA = 40/50 x 100 = 80.00
Convert each PS to a Weighted Score           (See the assigned weights for each component in the next tables.)
(WS).                                         WS for WW in English = 90.63 x 0.30 = 27.19
                                              WS for PT in English = 83.33 x 0.50 = 41.67
                                              WS for QA in English = 80.00 x 0.20 = 16.00
Add the weighted scores to get the            Initial Grade for English = 27.19 + 41.67 + 16.00 = 84.86
Initial Grade.
Transmute the Initial Grade to the            (Use the transmutation table from DepEd Order No. 8, s. 2015.)
Quarterly Grade (QG).                         For 84.86, the transmuted grade is 90, which is the QG.
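The first four steps can be scripted directly, as in the Python sketch below; the final transmutation step still requires the table in DepEd Order No. 8, s. 2015, so it is not reproduced here. The weights used are those for English (a Languages subject) in Grades 1 to 10, and the small difference from 84.86 comes only from rounding at each step in the worked example.

# Quarterly initial grade from Written Work (WW), Performance Tasks (PT), and Quarterly Assessment (QA)
def percentage_score(raw, highest_possible):
    return 100 * raw / highest_possible

ww = percentage_score(145, 160)   # about 90.63
pt = percentage_score(100, 120)   # about 83.33
qa = percentage_score(40, 50)     # 80.00

# Weights for Languages in Grades 1-10: WW 30%, PT 50%, QA 20%
initial_grade = ww * 0.30 + pt * 0.50 + qa * 0.20
print(round(initial_grade, 2))    # about 84.85; transmute with the DepEd table to get the QG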

Weights of the Three (3) Components for Grades 1 – 10

                           Grades 1 - 10
Component                  Languages / AP / EsP     Science / Math     MAPEH / EPP / TLE
Written Work               30%                      40%                20%
Performance Tasks          50%                      40%                60%
Quarterly Assessment       20%                      20%                20%

Weights of the Three (3) Components for Senior High School

                           Senior High School
                           Academic Track                                      Tech-Voc and Livelihood / Sports /
                                                                               Arts and Design Track
Components                 Core Subjects     Immersion / Research /            All Other Subjects / Immersion /
                                             Business Simulation /             Research / Exhibit / Performance
                                             Exhibit / Performance
Written Work               30%               40%                               20%
Performance Tasks          50%               40%                               60%
Quarterly Assessment       20%               20%                               20%

For MAPEH, individual grades are given to each area (i.e., Music, Arts, PE, and Health). The quarterly grade for MAPEH is the average of the grades across the four areas, as follows:

QG for MAPEH = (QG for Music + QG for Arts + QG for PE + QG for Health) / 4

The final grade for each subject is then computed by getting the average of the four quarterly grades, as seen below:

Final Grade for each Learning Area = (1st QG + 2nd QG + 3rd QG + 4th QG) / 4


The general average, on the other hand, is computed by getting the average of the final grades for all subject areas, with each subject area given equal weight:

General Average = (Sum of Final Grades of All Learning Areas) / (Total Number of Learning Areas in a Grade Level)

All grades reflected in the report card are reported as whole numbers. See an example of a report card:

                                                         Quarter
Subject Area                                     1     2     3     4     Final Grade
Filipino                                         86    88    85    90    87
English                                          83    82    83    85    83
Mathematics                                      87    92    93    95    92
Science                                          82    84    88    86    85
Araling Panlipunan                               90    92    92    93    92
Edukasyon sa Pagpapakatao                        80    83    85    88    84
Edukasyong Pantahanan at Pangkabuhayan           86    82    85    83    84
MAPEH                                            90    92    93    94    92
General Average                                                          87
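The averaging rules above translate directly into code. The Python sketch below recomputes the final grades and the general average from the quarterly grades on the sample report card; rounding to whole numbers mirrors how grades appear on the card.

# Final grade per learning area = average of the four quarterly grades, rounded to a whole number
def final_grade(quarterly_grades):
    return round(sum(quarterly_grades) / len(quarterly_grades))

report_card = {
    "Filipino":                               [86, 88, 85, 90],
    "English":                                [83, 82, 83, 85],
    "Mathematics":                            [87, 92, 93, 95],
    "Science":                                [82, 84, 88, 86],
    "Araling Panlipunan":                     [90, 92, 92, 93],
    "Edukasyon sa Pagpapakatao":              [80, 83, 85, 88],
    "Edukasyong Pantahanan at Pangkabuhayan": [86, 82, 85, 83],
    "MAPEH":                                  [90, 92, 93, 94],
}

final_grades = {subject: final_grade(qgs) for subject, qgs in report_card.items()}
print(final_grades)   # Filipino 87, English 83, Mathematics 92, ..., MAPEH 92

# General average = mean of the final grades of all learning areas (equal weights)
general_average = round(sum(final_grades.values()) / len(final_grades))
print(general_average)   # 87

# Each MAPEH quarterly grade is itself the average of the four area grades, e.g. (illustrative values):
mapeh_q1 = round((92 + 90 + 88 + 90) / 4)   # Music, Arts, PE, Health -> 90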

Learners’ grades are then communicated to parents and guardians every quarter during the parent-teacher conference by showing and discussing with them the report card. The grading scale and its descriptors are as follows:

Descriptor Grading System Remarks


Outstanding 90 - 100 Passed
Very Satisfactory 85 - 89 Passed
Satisfactory 80 - 84 Passed
Fairly Satisfactory 75 - 79 Passed
Did Not Meet Expectations Below 75 Failed
