INTRODUCTION
Education aims at imparting new knowledge, and every educational institute or system is guided by this goal. To determine whether, and to what extent, these aims are met, a careful appraisal and study is needed. This is the significance of evaluation in any educational system. The realization of goals and objectives in the educative process depends on the accuracy of the judgements and inferences made by decision makers at every stage.
TERMINOLOGY
Evaluation: "It is a process of determining to what extent the educational objectives are being realized." - RALPH TYLER
Evaluation includes both quantitative and qualitative means: quantitative description of pupil achievement, qualitative description of pupil ability, and value judgements about those achievements and abilities.
Achievement test: It is an important tool in school evaluation and has great significance in measuring instructional progress and the progress of the student in the subject area.
Definition of achievement test: "Any test that measures the attainments or accomplishments of an individual after a period of training or learning." - N. M. DOWNIE
ITEM ANALYSIS
The last step in the construction of a test is appraising the test, or item analysis. Item analysis describes the statistical analyses which allow measurement of the effectiveness of individual test items. An understanding of the factors which govern effectiveness (and a means of measuring them) can enable us to frame more effective test questions and to make sounder judgements about them. As an educator you should be able to recognize the most critical pieces of data from an item analysis report and evaluate whether or not an item needs revision.
ITEM ANALYSIS or APPRAISING THE TEST: It is the procedure used to judge the quality of an item. To ascertain whether the questions/items do their job effectively, a detailed analysis has to be done before a meaningful and scientific inference about the test can be made in terms of its validity, reliability, objectivity and usability.
Item analysis is a post-administration examination of a test (Remmers, Gage and Rummel, 1967: 267). Item analysis is a process which examines students' responses to individual test items (questions) in order to assess the quality of those items and of the test as a whole. Item analysis is especially valuable in improving items which will be used again in later tests, but it can also be used to eliminate ambiguous or misleading items in a single test administration. In addition, item analysis is valuable for increasing instructors' skills in test construction and for identifying specific areas of course content which need greater emphasis or clarity.
ITEM STATISTICS: Item statistics are used to assess the performance of individual test items, on the assumption that the overall quality of a test derives from the quality of its items.
The item analysis report provides the following item information:
Item number: the question number taken from the student answer sheet and the key sheet.
Mean and S.D.: the mean is the "average" student response to an item, computed by adding up the number of points earned by all students for the item and dividing that total by the number of students. The standard deviation, or S.D., is a measure of the dispersion of student scores on that item; that is, it indicates how "spread out" the responses were. The item standard deviation is most meaningful when comparing items which have more than one correct alternative and when scale scoring is used.
Means: the mean total test score (minus that item) is shown for students who selected each of the possible response alternatives.
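As an illustration of how the item mean and S.D. described above can be computed, the following is a minimal Python sketch; the score matrix item_scores and its values are hypothetical, not taken from any report described here.

import numpy as np

# Hypothetical points earned by 6 students on 3 items (rows = students, columns = items)
item_scores = np.array([[1, 2, 0],
                        [1, 1, 1],
                        [0, 2, 1],
                        [1, 0, 0],
                        [1, 2, 1],
                        [0, 1, 0]])

item_means = item_scores.mean(axis=0)   # average student response to each item
item_sds = item_scores.std(axis=0)      # dispersion ("spread") of scores on each item
for i, (m, s) in enumerate(zip(item_means, item_sds), start=1):
    print(f"Item {i}: mean = {m:.2f}, S.D. = {s:.2f}")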
Items can be analysed qualitatively, in terms of their content and form, and quantitatively, in terms of their statistical properties.
Qualitative analysis of items: content validity is the degree to which the test contains a representative sample of the material taught in the course, and is determined against the course content. Qualitative analysis also includes evaluation of items in terms of effective item-writing procedures.
Quantitative analysis of items: measurement of item difficulty and measurement of item discrimination.
Both the reliability and the validity of any test depend ultimately on the characteristics of its items. High reliability and validity can be built into a test in advance through item analysis. Tests can be improved through the selection, substitution or revision of items.
CONCEPTS OF MEASUREMENT:
VALIDITY: It refers to the appropriateness, meaningfulness and usefulness of inferences made from test scores. Validity is the judgement made about a test's ability to measure what it is intended to measure (according to the Standards for Educational and Psychological Testing, by the American Psychological Association, American Educational Research Association and National Council on Measurement in Education, 1985). This judgement is based on three categories of evidence: content-related, criterion-related and construct-related.
RELIABILITY: The ability of a test to give dependable and consistent scores. A judgement about reliability can be made based on the extent to which two similar measures agree. Jacobs & Chase (1992) recommended a minimum test length of 25 multiple-choice questions, with an item difficulty sufficient to ensure heterogeneous performance of the group. Reliability coefficients theoretically range in value from zero (no reliability) to 1.00 (perfect reliability); for classroom tests a reliability of 0.7-0.8 is acceptable. High reliability means that the questions of a test tended to "pull together": students who answered a given question correctly were more likely to answer other questions correctly, and if a parallel test were developed using similar items, the relative scores of students would show little change. Low reliability means that the questions tended to be unrelated to each other in terms of who answered them correctly; the resulting test scores reflect peculiarities of the items or the testing situation more than students' knowledge of the subject matter.
Reliability interpretation:
0.90 and above: Excellent reliability; at the level of the best standardized tests.
0.80 - 0.90: Very good for a classroom test.
0.70 - 0.80: Good for a classroom test; in the range of most classroom tests.
0.60 - 0.70: Somewhat low. The test needs to be supplemented by other measures (e.g., more tests) to determine grades. There are probably some items which could be improved.
0.50 - 0.60: Suggests need for revision of the test, unless it is quite short (ten or fewer items). The test definitely needs to be supplemented by other measures (e.g., more tests) for grading.
0.50 or below: Questionable reliability; the test needs revision.
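For dichotomously scored classroom tests, the reliability coefficient discussed above is often estimated with the Kuder-Richardson formula 20 (KR-20). The following is a minimal Python sketch under that assumption; the 0/1 score matrix is hypothetical.

import numpy as np

def kr20(scores: np.ndarray) -> float:
    """KR-20 reliability for a students-by-items matrix of 0/1 scores."""
    k = scores.shape[1]                         # number of items
    p = scores.mean(axis=0)                     # proportion answering each item correctly
    q = 1 - p                                   # proportion answering each item incorrectly
    total_variance = scores.sum(axis=1).var()   # variance of total test scores
    return (k / (k - 1)) * (1 - (p * q).sum() / total_variance)

# Hypothetical 0/1 scores for 6 students on 5 items
scores = np.array([[1, 1, 1, 0, 1],
                   [1, 0, 1, 0, 0],
                   [1, 1, 1, 1, 1],
                   [0, 1, 0, 0, 0],
                   [1, 1, 1, 0, 1],
                   [0, 0, 1, 0, 0]])
print(f"KR-20 reliability = {kr20(scores):.2f}")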
OBJECTIVITY: A test is said to be objective when the scorer's personal judgement does not affect the scoring. Objectivity is a prerequisite of reliability and validity.
USABILITY or PRACTICABILITY: It is the overall simplicity of the test for both the constructor and the user. It depends on various factors such as ease of administration, scoring, interpretation and economy. It is an important criterion used for assessing the value of a test.
TYPES OF ITEM ANALYSIS
There are three main types of item analysis: Classical Test Theory, Rasch Measurement and Item Response Theory.
1. CLASSICAL TEST THEORY: (traditionally the main method used in the United Kingdom) it utilises two main statistics, Facility and Discrimination. Facility is essentially a measure of the difficulty of an item, arrived at by dividing the mean mark obtained by a sample of candidates by the maximum mark available. As a whole, a test should aim to have an overall facility of around 0.5; however, it is acceptable for individual items to have a higher or lower facility (ranging from 0.2 to 0.8). Discrimination measures how performance on one item correlates with performance on the test as a whole. There should always be some correlation between item and test performance; it is expected that discrimination will fall in a range between 0.2 and 1.0.
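A minimal Python sketch of these two classical statistics, with discrimination taken here as the correlation between item marks and total scores; the candidate marks and total scores below are hypothetical.

import numpy as np

def facility(item_marks: np.ndarray, max_mark: float) -> float:
    """Facility: mean mark obtained on the item divided by the maximum mark available."""
    return item_marks.mean() / max_mark

def discrimination(item_marks: np.ndarray, total_scores: np.ndarray) -> float:
    """Discrimination: correlation between performance on the item and on the test as a whole."""
    return float(np.corrcoef(item_marks, total_scores)[0, 1])

# Hypothetical marks of 6 candidates on a 1-mark item, and their total test scores
item_marks = np.array([1, 0, 1, 1, 0, 1])
total_scores = np.array([18, 9, 15, 20, 11, 17])
print(f"Facility = {facility(item_marks, max_mark=1):.2f}")                 # aim for roughly 0.2-0.8
print(f"Discrimination = {discrimination(item_marks, total_scores):.2f}")   # expect roughly 0.2-1.0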
2. RASCH MEASUREMENT: Rasch measurement is very similar to IRT1 (described below) in that it considers only one parameter (difficulty), and the ICC is calculated in the same way. When it comes to using these theories to categorise items, however, there is a significant difference. If you have a set of data and analyse it with IRT1, you arrive at an ICC that fits the data observed. If you use Rasch measurement, extreme data (e.g. questions which are consistently well or poorly answered) are discarded and the model is fitted to the remaining data.
3. ITEM RESPONSE THEORY
Item Response Theory (IRT) assumes that there is a correlation between the score gained by a candidate on one item/test (which is measurable) and their overall ability on the latent trait which underlies test performance. Critically, the 'characteristics' of an item are said to be independent of the ability of the candidates who were tested. Item Response Theory comes in three forms - IRT1, IRT2 and IRT3 - reflecting the number of parameters considered in each case. For IRT1, only the difficulty of an item is considered (difficulty is the level of ability required to be more likely to answer the question correctly than to answer it wrongly). For IRT2, difficulty and discrimination are considered (discrimination is how well the question separates candidates of similar ability). For IRT3, difficulty, discrimination and chance are considered (chance is the random factor which enhances a candidate's probability of success through guessing). IRT can be used to create a unique plot for each item, the Item Characteristic Curve (ICC). The ICC is a plot of the probability that the item will be answered correctly against ability.
ITEM CHARACTERISTIC CURVES (ICC) AND ITEM RESPONSE THEORY (IRT):
Test developers use the concepts associated with IRT for item analysis. In essence, IRT relates each test item's performance to a complex statistical estimate of the test taker's knowledge or ability on the measured construct. A basic characteristic of IRT is the ICC.
Item Characteristic Curves (ICC): An ICC is a graphical representation of the probability of answering an item correctly plotted against the level of ability on the construct being measured. The shape of the ICC reflects the influence of three factors: 1) the item's difficulty; 2) the item's discriminatory power; and 3) the probability of answering correctly by guessing.
Increasing the difficulty of an item causes the curve to shift right - as candidates need to be more able to have the same chance of passing. Increasing the discrimination of an item causes the gradient of the curve to increase. Candidates below a given ability are less likely to answer correctly, whilst candidates above a given ability are more likely to answer correctly. Increasing the chance raises the baseline of the curve. IRT logically assumes that individuals who have high scores on the test have greater ability than those who score low on the test. With this in mind it can be concluded that the greater the slope of the ICC the better the item is at discriminating between high and low test performers. Difficulty, on an ICC, is operationally defined by the point at which the curve indicates a chance probability of 0.5 (a 50-50 chance) for answering the item correctly. The higher the level of ability needed to obtain a 0.5 probability (curve shifted to the right) the more difficult the item.
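These three factors correspond to the parameters of the three-parameter logistic (3PL) model commonly used in IRT. The following is a minimal Python sketch of an ICC under that model; the parameter values are illustrative only.

import math

def icc_3pl(ability: float, difficulty: float, discrimination: float, chance: float) -> float:
    """Probability of a correct response under the three-parameter logistic IRT model.

    chance sets the lower asymptote (guessing), difficulty shifts the curve left or right,
    and discrimination controls the steepness of the slope.
    """
    return chance + (1 - chance) / (1 + math.exp(-discrimination * (ability - difficulty)))

# Illustrative item: moderate difficulty, good discrimination, guessing floor of a 4-option MCQ
for theta in (-2, -1, 0, 1, 2):
    p = icc_3pl(ability=theta, difficulty=0.0, discrimination=1.2, chance=0.25)
    print(f"ability = {theta:+d}   P(correct) = {p:.2f}")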
Item Difficulty: The difficulty level of an item is defined as the proportion or percentage of the examinees or individuals who answered the item correctly (Singh, 1986; Remmers, Gage and Rummel, 1967). According to J. P. Guilford, "the difficulty value of an item is defined as the proportion or percentage of the examinees who have answered the item correctly" (Freeman, 1962: 112-113; Sharma, 2000). It is frequently also called the p-value: p for an item = (number of examinees responding correctly) / (total number of examinees taking the test). Jacobs & Chase (1992) recommended that most items in a test be of approximately p = 0.5 (a 50% chance of success) to help ensure that questions separate learners from non-learners (a good discrimination index). The upper limit of item difficulty is 1.00 (100% of students answered the question correctly). The lower limit of item difficulty depends on the number of possible responses and the probability of guessing the answer (for example, for an item with 4 response options, p = 0.25 by chance alone). Thus, most test developers seek to develop tests where the average difficulty score is about 0.5. Items with difficulty levels between 0 - 0.2 or 0.8 - 1.0 are often discarded because they are too difficult or too easy, respectively, and do not differentiate within the population.
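A minimal Python sketch of the p-value calculation; the 0/1 response matrix is hypothetical.

import numpy as np

# Hypothetical 0/1 responses: 6 examinees (rows) x 4 items (columns)
responses = np.array([[1, 1, 0, 1],
                      [1, 0, 0, 1],
                      [1, 1, 1, 1],
                      [0, 1, 0, 1],
                      [1, 0, 0, 1],
                      [1, 1, 0, 1]])

p_values = responses.mean(axis=0)    # proportion of examinees answering each item correctly
for i, p in enumerate(p_values, start=1):
    flag = "consider discarding (too easy or too difficult)" if p <= 0.2 or p >= 0.8 else "acceptable"
    print(f"Item {i}: p = {p:.2f} ({flag})")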
In a test it is customary to arrange items in order of difficulty, so that test takers begin with relatively easy items and proceed to items of increasing difficulty. This arrangement gives test takers confidence in approaching the test and also reduces the likelihood of their wasting time on items beyond their ability while neglecting the items they can complete correctly.
Item Discrimination: Item discrimination refers to the way an item differentiates students who know the content from those who do not; it is a measure of how well an item distinguishes between students who are knowledgeable and those who are not.
Definition: According to Marshall and Hales (1972), the discriminating power of an item may be defined as the extent to which success or failure on that item indicates the possession of the achievement being measured. Discrimination can be measured as a point-biserial correlation; if a question discriminates well, the point-biserial correlation will be highly positive for the correct answer and negative for the distracters.
The Discrimination Index: A discrimination index (D) is calculated to measure how well a test item separates those test takers who show a high degree of a skill, knowledge, attitude or personality trait from those who show a low degree of it. This index compares, for each test item, the performance of those who scored the best (U, the upper group) with those who scored the worst (L, the lower group):
1) Rank-order the test scores from lowest to highest.
2) The upper 25-35% and the lower 25-35% form the analysis groups.
3) Calculate the proportion of test takers passing each item in both groups: U = (number of uppers who responded correctly) / (total number in the upper group); L = (number of lowers who responded correctly) / (total number in the lower group).
4) D = U - L.
Item discrimination indices range from -1.0 to 1.0; the higher the value, the better the item is able to discriminate between strong and weak students. The logic of the D statistic is simple: tests are more difficult for those who score poorly (the lower group), so if an item is measuring the same thing as the test, the item should be more difficult for the lower group. The D statistic provides a measure of each item's discriminating power with respect to the upper and lower groups.
On the basis of discriminating power, items are classified into three types (Sharma, 2000: 201). Positive discrimination: if an item is answered correctly by the superiors (upper group) but not by the inferiors (lower group), the item possesses positive discrimination. Negative discrimination: if an item is answered correctly by the inferiors (lower group) but not by the superiors (upper group), the item possesses negative discrimination. Zero discrimination: if an item is answered correctly by the same number of superior and inferior examinees, it cannot discriminate between them and its discriminating power is zero.
Inter-Item Correlations: The inter-item correlation matrix is another important component of item analysis. This matrix displays the correlation of each item with every other item. Usually each item is coded as dichotomous (incorrect = 0, correct = 1), and the resulting matrix is composed of phi coefficients, which are interpreted much like Pearson product-moment correlation coefficients. This matrix provides important information about a test's internal consistency and what could be done to improve it. Ideally each item should be correlated highly with the other items measuring the same construct; items that do not correlate with the other items measuring the same construct can be dropped without reducing the test's reliability.
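Because phi coefficients for dichotomous (0/1) items equal Pearson correlations of the item columns, the inter-item matrix can be computed directly. A minimal Python sketch with a hypothetical response matrix:

import numpy as np

# Hypothetical 0/1 responses: 6 examinees (rows) x 4 items (columns)
responses = np.array([[1, 1, 0, 1],
                      [1, 0, 0, 0],
                      [1, 1, 1, 1],
                      [0, 0, 0, 0],
                      [1, 1, 0, 1],
                      [0, 0, 1, 0]])

# Phi coefficients between dichotomous items = Pearson correlations of the 0/1 columns
inter_item = np.corrcoef(responses, rowvar=False)
print(np.round(inter_item, 2))   # items whose row is low or negative are candidates for revision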
Item-Total Correlations: Point-biserial or item-total correlations assess the usefulness of an item as a measure of individual differences in knowledge, ability or a personality characteristic. Here each dichotomous test item (incorrect = 0; correct = 1) is correlated with the person's total test score. Interpretation of the item-total correlation is similar to that of the D statistic. A modest positive correlation indicates two things: 1) that the item in question is measuring the same construct as the test; and 2) that the item is successfully discriminating between those who perform well and those who perform poorly.
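A minimal Python sketch of the item-total (point-biserial) correlation described above, using the uncorrected total score as in the text (some analysts prefer a corrected total that excludes the item itself); the data are hypothetical.

import numpy as np

def item_total_correlations(responses: np.ndarray) -> np.ndarray:
    """Point-biserial correlation of each 0/1 item with the total test score."""
    totals = responses.sum(axis=1)
    return np.array([np.corrcoef(responses[:, j], totals)[0, 1]
                     for j in range(responses.shape[1])])

# Hypothetical 0/1 responses: 6 examinees (rows) x 4 items (columns)
responses = np.array([[1, 1, 0, 1],
                      [1, 0, 0, 0],
                      [1, 1, 1, 1],
                      [0, 0, 0, 0],
                      [1, 1, 0, 1],
                      [0, 1, 1, 0]])
print(np.round(item_total_correlations(responses), 2))   # modest positive values are desirable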
Distractor Analysis: Distractor difficulties and distractor discriminations measure the proportion of students selecting each wrong answer and the ability of each wrong answer to distinguish between strong and weak students. In multiple choice tests there is usually one correct answer and a few wrong answers or distractors. A lot can be learned from analysing the frequency with which test takers choose each distractor. Effective distractors should appeal to the non-learner, as indicated by a negative point-biserial correlation. Distractors with a point-biserial correlation of zero indicate that students did not select them and that they need to be revised or replaced with a more plausible option for students who did not understand the content. A perfect multiple-choice question should have two distinctive features: 1) persons who know the answer pick the correct answer; 2) persons who do not know the answer guess among the possible responses. This means that each distractor should be equally popular, and that the number of correct answers = those who truly know + some random amount. To account for this, should professors subtract the randomness factor from each person's score to get a more accurate view of that person's true knowledge?
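A minimal Python sketch of a distractor frequency count for a single item (the chosen options and the key are hypothetical); options that nobody selects would be flagged for replacement.

from collections import Counter

# Hypothetical responses of 10 test takers to one 4-option item, and the correct key
choices = ["A", "B", "A", "C", "A", "B", "A", "D", "B", "A"]
key = "A"

counts = Counter(choices)
for option in ("A", "B", "C", "D"):
    share = counts.get(option, 0) / len(choices)
    role = "correct answer" if option == key else "distractor"
    print(f"Option {option} ({role}): chosen by {share:.0%} of test takers")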
STEPS IN ITEM ANALYSIS
1. Score the exam and sort the results by score.
2. Arrange the results in order of merit and identify the high and low groups. Select an equal number of students from each end, e.g. the top 25% (upper 1/4) and the bottom 25% (lower 1/4).
3. For each item, count the number of students in each group who answered the item correctly. For alternative-response type questions, count the number of students in each group who chose each alternative.
4. Compare the performance of these two groups on each of the test items. For any well-written item:
-- a greater proportion of students in the upper group should have selected the correct answer;
-- a greater proportion of students in the lower group should have selected each of the distracter (incorrect) answers.
5. Compute the difficulty index of each question.
Item difficulty index: the percentage of students who get the item correct,
D = (R / N) x 100
where R is the number of pupils who answered the item correctly and N is the total number of students who attempted it.
Item difficulty level or facility level of a test: it is the index of how easy or difficult the test is from the point of view of the teachers. It is the ratio of the average score of a sample on the test to the maximum possible score on the test:
Difficulty level = (average score on the test / maximum possible score) x 100
Using the high and low groups, the difficulty index of an item can also be computed as
Difficulty index = ((H + L) / N) x 100
where H is the number of correct answers in the high group, L is the number of correct answers in the low group, and N is the total number of students in both groups.
In the case of objective tests, find the facility value:
Facility value = (number of students answering the question correctly / number of students who have taken the test) x 100
Interpretation of the facility value: <= 30% - high difficulty (difficult item); > 30% and < 80% - medium (moderate) difficulty; >= 80% - low difficulty (easy item).
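A minimal Python sketch of the difficulty index and facility classification above; the counts are hypothetical.

def difficulty_index(correct_high: int, correct_low: int, total_in_groups: int) -> float:
    """Difficulty index = (H + L) / N * 100, with N the total number of students in both groups."""
    return (correct_high + correct_low) / total_in_groups * 100

def classify_facility(facility: float) -> str:
    """Classify an item by its facility value (percentage answering correctly)."""
    if facility <= 30:
        return "high difficulty (difficult)"
    if facility < 80:
        return "medium (moderate)"
    return "low difficulty (easy)"

# Hypothetical item: 18 of 25 high-group and 9 of 25 low-group students answered it correctly
facility = difficulty_index(correct_high=18, correct_low=9, total_in_groups=50)
print(f"Difficulty index = {facility:.0f}% -> {classify_facility(facility)}")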
6. Item difficulty level - points for discussion: Is a test that nobody failed too easy? Is a test on which nobody got 100% too difficult? Should items that are too easy or too difficult be thrown out?
7. Compute item discrimination. Generally, students who did well on the exam should select the correct answer to any given item on the exam. The discrimination index distinguishes, for each item, between the performance of students who did well on the exam and students who did poorly. For each item, subtract the number of students in the lower group who answered correctly (RL) from the number of students in the upper group who answered correctly (RU), then divide the result by the number of students in one group (N/2):
DI = (RU - RL) / (N/2)
Formula 2:
DI = (number of HAQ - number of LAQ) / number of HAG
where HAQ = number of students in the high-ability group answering the question correctly, LAQ = number of students in the low-ability group answering the question correctly, and HAG = number of students in the high-ability group.
The discrimination index is listed in decimal format and ranges between -1 and 1. For exams with a normal distribution, a discrimination of 0.3 and above is good; 0.6 and above is very good.
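A minimal Python sketch of the discrimination index DI = (RU - RL) / (N/2); the counts are hypothetical.

def discrimination_index(upper_correct: int, lower_correct: int, group_size: int) -> float:
    """DI = (RU - RL) / (N/2), where N/2 is the number of students in one group."""
    return (upper_correct - lower_correct) / group_size

# Hypothetical item: 20 of 25 upper-group and 8 of 25 lower-group students answered correctly
di = discrimination_index(upper_correct=20, lower_correct=8, group_size=25)
print(f"DI = {di:.2f}")   # 0.48: above 0.3, so the item discriminates well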
Guidelines based on item discrimination (D) and item difficulty:

Item Discrimination (D)   High difficulty   Medium difficulty   Low difficulty
D =< 0%                   Review            Review              Review
0% < D < 30%              Review            Ok                  Ok
D >= 30%                  Ok                Ok                  Ok
ADVANTAGES OF ITEM ANALYSIS.
- Helps in judging the worth or quality of a test.
- Aids in subsequent test revisions.
- Increases skill in test construction.
- Provides diagnostic value and helps in planning future learning activities.
- Provides a basis for discussing test results.
- Helps in making decisions about the promotion of students to the next higher grade.
- Brings about improvement in teaching methods and techniques.
ITEM REVISION
Items with negative or low positive item discrimination should either be revised or deleted from the item bank. Distractors whose item discrimination is positive, or negative but too low considering the item difficulty, should be replaced. For an item to be revised successfully, it is often necessary to have at least one solid distractor that will not be changed. If all the distractors are poor, or none is particularly strong, delete the item and write a new one. Change only the parts of the item that caused problems. If an item fails even after revision, it should be deleted and replaced by a new one.
REVIEW OF LITERATURE
Item analysis for the written test of Taiwanese board certification examination in anaesthesiology using the Rasch model AUTHORS:- K.-Y. Chang, M.-Y. Tsou,
ABSTRACT
Background: On the written test of the board certification examination for anaesthesiology, the probability of a question being answered correctly is subject to two main factors, item difficulty and examinee ability. Thus, item analysis can provide insight into the appropriateness of a particular test, given the ability of examinees.
Methods: Study subjects were 36 Taiwanese examinees tested with 100 questions related to anaesthesiology. We used the Rasch model to perform item analysis of the questions answered by each examinee to assess the effects of question difficulty and examinee ability on a common logit scale. Additionally, we evaluated test reliability and virtual failure rates under different criteria.
Results: The mean examinee ability was higher than the mean item difficulty in this written test by 1.28 (SD = 0.57) logit units, which means that the examinees, on average, were able to answer 78% of items correctly. Item difficulty ranged from -4.25 to 2.43 on the logit scale, corresponding to probabilities of a correct answer from 98% down to 5%. There were 60 items with difficulty lower than the ability of the least able examinee and seven difficult items beyond the most able one. The agreement on item difficulty between the test developers and our Rasch model was poor (weighted kappa = 0.23).
Conclusions: We demonstrated how to assess the construct validity and reliability of the written examination in order to provide useful information for future board certification examinations.
CONCLUSION
Developing, administering and analysing a test may seem like a monumental task, but a step-by-step approach simplifies the process. Validating tests using item analysis is an effective method of assessing outcomes in the classroom. Item analysis is especially valuable in improving items which will be used again in later tests, but it can also be used to eliminate ambiguous or misleading items in a single test administration. In addition, item analysis is valuable for increasing instructors' skills in test construction and in identifying specific areas of course content which need greater emphasis or clarity.