Article info

Article history: Received 22 May 2013; received in revised form 9 October 2013; accepted 14 October 2013.

Keywords: Peer assessment; Reliability; Construct validity; Rubric; Friendship

Abstract

Construct validity of peer assessment (PA) is important for PA application, yet difficult to achieve. The present study investigated the impact of an assessment rubric and friendship between the assessor and assessee on construct validity of PA. Two-hundred and nine bachelor students participated: half of them assessed a peer's concept map with a rubric, whereas the other half did not use a rubric. The results revealed a substantial reliability and construct validity for PA. All students over-scored their peers' performance, but students using a rubric were more valid. Moreover, when using a rubric, a high level of friendship between assessor and assessee resulted in more over-scoring. Use of a rubric resulted in higher quality concept maps according to both peer and expert ratings.
In the past decades peer assessment (PA) has been promoted in higher education as a valuable approach to formative assessment (Falchikov, 2003, 2005; Falchikov & Goldfinch, 2000). PA is an educational arrangement where students evaluate a peer's performance with scores, and/or with written or oral feedback (Topping, 1998). PA is also regarded as a specific type of collaborative learning (Boud, Cohen, & Sampson, 1999; Strijbos, Ochoa, Sluijsmans, Segers, & Tillema, 2009).

Despite positive effects of PA, such as increased 'perceived learning', essay and writing revision, and presentation skills (Topping, 2003; Van Gennip, Segers, & Tillema, 2009), a consistent challenge for the acceptance of PA among students and teachers is the reliability and validity of PA (Cho, Schunn, & Wilson, 2006; Dochy, Segers, & Sluijsmans, 1999; Falchikov & Goldfinch, 2000; Van Zundert, Sluijsmans, & Van Merriënboer, 2010). Typical sources that can reduce reliability and validity, such as poorly designed assessment instruments or lack of assessor training, might be alleviated via structured assessment scaffolds, for example an assessment rubric. However, the reliability and validity of PA can also suffer from a friendship between the assessor and assessee. Although friendship is typically addressed in terms of 'reciprocity effects' in the PA literature, it has remained an under-researched area. Furthermore, the value of rubrics to alleviate the potential scoring bias due to friendship has not been systematically investigated.

The following sections will elaborate on rubrics and evidence of their use in PA, the influence of friendship on PA (in collaborative learning in general and within PA specifically), the potential influence of a rubric on perceived fairness and comfort with PA, and the quality of the performance (concept map).

Rubrics for assessment purposes

A rubric articulates the expectations for an assignment by listing the assessment criteria and by describing levels of quality in relation to each criterion (Andrade & Valtcheva, 2009). Historically, research on rubrics has followed either a summative or a formative approach (Panadero & Jonsson, 2013). The summative approach aims to increase the inter-rater and intra-rater reliability of assessors (Jonsson & Svingby, 2007; Stellmack, Konheim-Kalkstein, Manor, Massey, & Schmitz, 2009). The formative approach applies rubrics to enhance students' learning by promoting reflections on their own work (Andrade & Valtcheva, 2009; Panadero, Alonso-Tapia, & Reche, 2013) or the work by a peer (Sadler & Good, 2006). Irrespective of a summative or formative approach, indicators for reliability and validity must be examined for the use of rubrics in general, as well as for their use with PA in particular.
Indicators for reliability and validity of rubrics and for construct validity of rubrics in PA

Reliability can be expressed in general as inter-rater and intra-rater agreement, both of which in turn can be expressed as consensus agreement (often also referred to as objectivity) or consistency agreement. Validity typically addresses the degree to which a performance in an educational setting is captured by a specific measurement instrument; also referred to as the 'content aspect' (Messick, 1995) of construct validity. Finally, in the assessment literature the term 'accuracy' is also used (e.g., Brown, Glasswell, & Harland, 2004), but refers to overall psychometric quality combining the reliability and validity indicators – as such, accuracy is at present too ill-defined for practical application.
Inter-rater consensus

Inter-rater agreement in terms of consensus refers to different assessors awarding the same score to the same performance (Brown, Glasswell, & Harland, 2004), which is typically expressed as percent exact agreement, adjacent agreement (same score plus or minus one category), Cohen's kappa and/or Krippendorff's alpha to correct for agreement by chance, and intraclass correlations (ICC) when each assessor also evaluates multiple performances (Cho et al., 2006). In their review on reliability and validity of rubrics, Jonsson and Svingby (2007) found that more than half of the studies reviewed (40 of 75) reported inter-rater reliability. In terms of percent exact agreement with rubrics, Jonsson and Svingby (2007) found that the majority of consensus agreement varied from 55 to 75%, but in terms of adjacent agreement the consensus increased to 90%. In general, the 70% criterion is used for percent agreement, and for Cohen's kappa and Krippendorff's alpha, values between .40 and .75 represent a fair agreement beyond chance. Jonsson and Svingby (2007) concluded that the number of levels of a rubric directly affects consensus agreement.
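To make these consensus indicators concrete, the following is a minimal Python sketch; the ratings are invented for illustration and are not data from any study reviewed here.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings by two assessors of the same ten performances (1-4 scale)
rater_a = np.array([3, 2, 4, 3, 1, 2, 3, 4, 2, 3])
rater_b = np.array([3, 3, 4, 2, 1, 2, 3, 3, 2, 4])

exact = np.mean(rater_a == rater_b)                 # percent exact agreement
adjacent = np.mean(np.abs(rater_a - rater_b) <= 1)  # same score plus/minus one category
kappa = cohen_kappa_score(rater_a, rater_b)         # agreement corrected for chance

print(f"exact = {exact:.2f}, adjacent = {adjacent:.2f}, kappa = {kappa:.2f}")
```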
Inter-rater consistency

Inter-rater agreement in terms of consistency refers to the pattern in the distribution of scores across a set of assessors and is expressed by a Pearson (or Spearman) correlation in the case of two assessors, or by Cronbach's alpha or intraclass correlations when each assessor evaluates multiple performances. Jonsson and Svingby (2007) observed in the case of rubrics that the majority of studies reported a Pearson or Spearman correlation which varied between .55 and .75 (wherein values of .70 and above are considered acceptable). With more than two assessors Cronbach's alpha is reported, and Jonsson and Svingby (2007) found eight studies with coefficients ranging from .50 to .92 (with .70 as the threshold for an acceptable consistency). Overall, in the studies reviewed by Jonsson and Svingby, the inter-rater agreement (consensus and consistency) varied extensively.
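The consistency indicators can be sketched in the same spirit. The cronbach_alpha helper below is a hypothetical convenience function written for this illustration, not a library call, and the ratings matrix is invented (rows are assessors, columns are performances).

```python
import numpy as np
from scipy.stats import pearsonr

def cronbach_alpha(ratings):
    """Cronbach's alpha for a raters x performances matrix."""
    k = ratings.shape[0]                      # number of assessors (treated as 'items')
    item_vars = ratings.var(axis=1, ddof=1)   # variance of each assessor's scores
    total_var = ratings.sum(axis=0).var(ddof=1)  # variance of the summed scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

ratings = np.array([[3, 2, 4, 3, 1, 2],   # assessor 1
                    [3, 3, 4, 2, 1, 2],   # assessor 2
                    [4, 2, 3, 3, 2, 2]])  # assessor 3

r, p = pearsonr(ratings[0], ratings[1])   # two assessors: Pearson correlation
alpha = cronbach_alpha(ratings)           # more than two assessors
print(f"r = {r:.2f} (p = {p:.3f}), alpha = {alpha:.2f}")
```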
Intra-rater consensus and consistency

Intra-rater consensus agreement refers to the reliability of a single assessor when scoring the performance of an assessee at two different occasions, and it can be expressed by a Pearson or Spearman correlation, or by Cohen's kappa and Krippendorff's alpha. However, this type is rarely reported – instead most studies on rubrics report intra-rater consistency (Jonsson & Svingby, 2007). Intra-rater consistency refers to the reliability of an assessor when scoring the performance of different assessees at a single occasion (Cronbach's alpha), for example a teacher scoring the performance of all students. In their review on rubrics, Jonsson and Svingby (2007) found only seven studies that reported Cronbach's alpha as a consistency indicator for intra-rater agreement; the majority reported a value over .70, considered as sufficient (Brown et al., 2004).

Construct validity

Another important aspect to consider is construct validity of assessment. In the case of scores on performance assessment, construct validity is based on what Kane (2001) refers to as 'observable attributes', and such scores typically reflect what Messick (1995) refers to as the content aspect of construct validity, which includes content relevance and representativeness. In education it is not a property of a test or an assessment, but rather an interpretation of the outcomes (Jonsson & Svingby, 2007). It is typically determined via Pearson correlations of different raters (usually experts) that use the same instrument to measure the same construct.

In PA construct validity can be determined by comparing the peer score(s) with the teacher score(s), which results in construct validity indicators from both the teacher perspective and the student perspective (see Cho et al., 2006). PA validity from the teacher perspective (TPv) (i.e., a pattern in the scoring distribution) is typically expressed as a Pearson correlation (for pairs) or Cronbach's alpha (with more than two peer assessors), where the teacher assessment serves as the unbiased baseline. Validity from the student perspective (SPv) is examined by the deviation between a peer (or multiple peers) and teacher assessment, which reveals the degree of under- or over-scoring when compared to the unbiased teacher score (Wang & Imbrie, 2010). Cho et al. (2006) advocate to take "the square-root (. . .) of the sum of the squared differences divided by the number of peer ratings minus 1 (to produce an unbiased estimate) (. . .) the higher the score, the lower the perceived validity" (p. 896). Yet, this calculation provides the average deviation to the expert and not the expert-peer deviations at the individual level for each peer. As such, variance within research conditions will be lost. Thus, the expert-peer deviation at the individual level constitutes a second student perspective of validity in PA, which also enables the determination of over-scoring and under-scoring by a peer relative to the expert. The average deviation to the expert will be referred to as SPv1, and the expert-peer deviation at the individual level will be referred to as SPv2.
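Spelling out the verbal description above in formula form (our reconstruction of the notation; p_ij is the rating by peer j of performance i, e_i is the expert rating of that performance, and n is the number of peer ratings), the two indicators are:

\[
\mathrm{SPv1}_i \;=\; \sqrt{\frac{\sum_{j=1}^{n}\left(p_{ij}-e_i\right)^{2}}{n-1}},
\qquad
\mathrm{SPv2}_{ij} \;=\; p_{ij}-e_i .
\]

A higher SPv1 thus indicates lower validity, whereas the sign of SPv2 distinguishes over-scoring (SPv2 > 0) from under-scoring (SPv2 < 0) for each individual peer.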
Measurement estimates

The application of measurement estimates spans inter-rater reliability and intra-rater reliability, as well as (construct) validity aspects. Their application thus depends on the indicators reported as relevant for a specific study and research purpose, although in general these techniques (multi-faceted Rasch model or Generalizability Theory) are applicable to address most of the indicators.

In PA it is crucial that the assessor knows the assessment criteria in order to provide a reliable and valid assessment (Panadero & Jonsson, 2013). Rubrics are particularly helpful as they provide the assessment criteria in a structured format. Moreover, when conducting PA students are (a) reluctant to being assessed by a peer who is not an expert in the domain, and (b) they believe that assessment is the responsibility of the teacher (Ballantyne, Hughes, & Mylonas, 2002). Rubrics hold the potential to alleviate both of these issues (Hafner & Hafner, 2003) and thus enhance perceived fairness and comfort with PA.
Nevertheless, research on the application of rubrics to PA is scarce and predominantly focused on inter-rater agreement between peers and validity from the teacher perspective. The majority of studies revealed that rubrics increase the inter-rater reliability and validity of PA from the teachers' perspective (i.e., consistency in terms of a Pearson correlation or Cronbach's alpha; see for example Hafner & Hafner, 2003; Sadler & Good, 2006; Tsai & Liang, 2009; Tseng & Tsai, 2007), yet few studies have focused explicitly on the effect of rubric-supported PA on students' performance (Panadero & Jonsson, 2013). Moreover, the potential of rubrics to decrease over-scoring due to friendship has not been systematically explored. Since PA is regarded as a specific type of collaborative learning (Boud et al., 1999; Strijbos et al., 2009), the following sections will elaborate on friendship in collaborative learning in general and within PA specifically.

Friendship in collaborative learning

Friendship has been of interest in collaborative learning with respect to group formation, organisation of the activity, and assessment of the collaboration (Barron, 2003; Jones & Issroff, 2005). Wang and Imbrie (2010), for example, observed that students – when offered to assemble their own groups – chose peers with whom they were better befriended. In addition, they observed that group-member selection on the basis of friendship positively influenced interdependence, individual accountability and sense of community – leading to a warmer atmosphere and a better collaborative learning process. Wang and Imbrie (2010) attributed this effect partly to implicit rules of collaboration and trust that already exist between friends. Vass (2002) observed the same implicit collaboration rules between friends in the context of creative collaborative writing. Finally, Tolmie et al. (2010) observed that friendship pairs performed better on collaborative assignments and showed a better ability to be critical and develop their peers' ideas when the task was challenging.

Friendship in PA and its relation to perceived comfort and fairness

Within the PA literature the term 'reciprocity effects' is typically used to refer to bias in PA caused by interpersonal processes (Strijbos et al., 2009). Specific indicators for reciprocity when conducting PA within groups are collusive marking (high ratings to fellow group members), decibel marking (high ratings to dominant group members) and parasite marking (profiting from the efforts invested by fellow group members) (Cheng & Warren, 1997; Harris & Brown, 2013; Pond & Ul-Haq, 1997). Magin (2001) is one of the rare studies that explicitly investigated reciprocity effects, operationalised as follows: if person A rates person B higher than expected, person B also rates person A higher than expected. Reciprocity effects were found to be minuscule (1% explained variance); however, the operationalisation only considered directionality and excluded other aspects of interpersonal processes such as friendship.

Although friendship has been acknowledged as a potential bias in PA that could lead to over-scoring (Pond & Ul-Haq, 1997; Strijbos et al., 2009), empirical studies that specifically investigate the effect of friendship are scarce. Students indeed typically prefer not to assess their friends too harshly (Cheng & Warren, 1997) and/or fear that "(. . .) other students did not take the exercise seriously and might have been cheating by favouring friends" (Smith, Cooper, & Lancaster, 2002, p. 74). A higher degree of friendship could lead to over-scoring (Pond & Ul-Haq, 1997) and negatively affect the reliability, validity, and perceived fairness of assessment (Sambell, McDowell, & Brown, 1997). Moreover, teachers consider friendship a negative factor influencing fairness and thus reducing the reliability and validity of PA (Karaca, 2009). For example, Papinczak, Young, and Groves (2007) observed that "(. . .) a strong reaction to peer assessment was the widespread perception that this process could be corrupted by bias due to friendship marking or lack of honesty" (p. 180). Similarly, Harris and Brown (2013) conducted interviews with K-12 teachers and found that one of them (as well as her students) specifically reported the experience that friendship(s) resulted in biased PA.

In more recent PA literature specific interpersonal variables have been identified, such as psychological safety and trust, as well as the impact of structural features (i.e., type of peer interaction and group constellation; Van Gennip, Segers, & Tillema, 2009). Trust and psychological safety are particularly relevant processes in relation to friendship. First of all, trust during PA positively influences students' conceptions about PA (Van Gennip et al., 2009) and could foster a critical analysis of a peer's performance (Jehn & Shah, 1997; Tolmie et al., 2010). Secondly, friendship can foster a psychologically safer atmosphere for criticising peers (MacDonald & Miell, 2000), which in turn fosters interpersonal risk-taking in a group: "(. . .) a sense of confidence that the team will not embarrass, reject, or punish someone for speaking up" (Edmondson, 1999, p. 354).

A high degree of trust and psychological safety is likely to result in perceived comfort with PA on the part of the students, irrespective of the degree of friendship. In fact, several researchers have emphasised that stress and (dis)comfort should be considered when determining the impact of PA on students' emotions (Hanrahan & Isaacs, 2001; Lindblom-Ylänne, Pihlajamäki, & Kotkas, 2006; Pope, 2001, 2005). Comfort with the procedure of assessing peers and being assessed by peers might enhance the reliability and validity of PA. Finally, specificity and a common understanding of the criteria (e.g., via a rubric) might also increase students' comfort with PA.

Application of rubrics to assess concept map performance

Concept mapping is a learning strategy that increases students' performance and is an effective technique to evaluate students' domain knowledge (Nesbit & Adesope, 2006), and especially the structure of declarative knowledge (Shavelson, Ruiz-Primo, & Wiley, 2005). A concept map is defined by Shavelson et al. (2005) as "(. . .) a graph in which the nodes represent concepts, the lines represent relations, and the labels on the lines represent the nature of the relation between concepts. A pair of nodes and the labeled line connecting them is defined as a proposition." (p. 417). Designing a concept map is cognitively demanding, yet leads to better learning and is therefore an interesting strategy to train among higher education students (Berry & Chew, 2008; Jacobs-Lawson & Hershey, 2002). The use of concept maps has been studied previously in combination with metacognitive activities (e.g., Hilbert & Renkl, 2009) and with the use of rubrics to enhance learning and the reliability and validity of students' self- and peer assessment (Besterfield-Sacre, Gerchak, Lyons, Shuman, & Wolfe, 2004; Moni, Beswick, & Moni, 2005; Moni & Moni, 2008; Panadero, Alonso-Tapia, & Huertas, 2014; Toth, Suthers, & Lesgold, 2002). Therefore, the present study's use of a rubric and peer assessment will add empirical evidence on the application of rubrics for the assessment of concept maps and the extent to which rubrics promote learning gains.

Research questions and hypotheses

The current study investigates the potential moderating effect of the application of a rubric and degree of friendship between the
assessor and assessee on the reliability and construct validity of PA, perceived comfort and fairness during PA, and quality of performance (concept map). This quasi-experimental study investigates two variations: (a) rubric vs. non-rubric and (b) low vs. medium vs. high level of friendship between the assessor and assessee. The research questions and hypotheses are as follows:

RQ 1. What is the effect of a rubric on the reliability and construct validity of PA of a concept map? It is expected that the use of a rubric results in a higher construct validity from the student perspective (i.e., a smaller deviation between the peer score and the teacher score) compared to no use of a rubric (Hypothesis 1).

RQ 2. What is the effect of a rubric and the level of friendship on the construct validity from the student perspective of PA of a concept map, and is there an interaction between the use of a rubric and the level of friendship? It is expected that rubrics reduce the bias of friendship (Hypothesis 2a). It is expected that students with a low level of friendship evaluate their peers more validly compared to students with a high level of friendship, who will show a tendency to favour their peers by over-scoring (Hypothesis 2b).

RQ 3. What is the effect of a rubric on students' perceived comfort and fairness while conducting PA? It is expected that rubrics increase participants' comfort while performing PA (Hypothesis 3a) and increase perceived fairness of PA (Hypothesis 3b).

RQ 4. What is the effect of a rubric on the quality of the concept map in terms of the peer score and expert score? It is expected that a rubric will lead to a higher quality concept map in terms of the peer score (Hypothesis 4a), and to a higher quality concept map in terms of the teacher score (Hypothesis 4b).

Method

Participants

Two-hundred and nine third-year pre-service bachelor student teachers at a public university in Spain participated in this study, of which 182 were female (87.08%) and 27 male (12.92%), with a mean age of 22.17 (SD = 3.92). The high presence of females is representative for pre-service teacher programmes.

Design

The students were enrolled in the "Learning and development II" course and recruited from four classrooms. Each classroom was taught by a different teacher.

The students read an excerpt from an instructional text and subsequently constructed a concept map identifying the main concepts and relations between them. Students from two classrooms were randomly selected for the rubric condition (N = 104) and the other two classrooms were assigned to the non-rubric condition (N = 105). Each student was randomly assigned to assess the concept map by one of his or her peers. The activity counted for students' course credit, but not their actual performance in terms of the quality of the concept map.

Instruments and measures

Task
Students read an excerpt from the text "Estrategias docentes para un aprendizaje significativo" [Teachers' strategies for a meaningful learning] by Diaz Barriga (2002). The text is a mandatory reading from the official curriculum. They subsequently constructed a concept map of the text. This concept mapping task was selected due to the widespread application of concept maps as a learning strategy (Nesbit & Adesope, 2006) and because, when combined with a rubric, concept mapping resulted in strategy enhancement and better performance (Panadero et al., 2014; Toth et al., 2002).

The students were required to extract instructional concepts that were related to declarative, procedural and attitudinal knowledge, then organise them in a hierarchical order and specify relationships between the concepts. Fig. 1 provides an illustration of a concept map by an expert in which the expected components are indicated.

Fig. 1. Illustration of a concept map by an expert (Note: the concept map has been simplified to visualise the task).

Concept map assessment
A concept map was assessed on five criteria: (1) concepts, (2) hierarchy, (3) relationships among concepts in different hierarchical levels, (4) relationships among concepts from different columns, and (5) simplicity and easiness of understanding.

Assessment by condition. Students and experts in the rubric condition used the same rubric to assess the concept maps. Students in the non-rubric condition rated each of the criteria on a 4-point Likert scale (1 = poor performance to 4 = best performance), whereas the experts used the rubric to ensure comparability of conditions to determine construct validity.

Rubric. The rubric was created using expert models of concept maps in other studies (Panadero, Alonso-Tapia, & Huertas, 2012). Each criterion was further specified into four levels ranging from the poorest (1) to the best (4) performance (see Appendix A for the rubric used). Inter-rater reliability for the experts was determined by three independent experts scoring 63 concept maps (31 from the rubric condition, 32 from the non-rubric condition), which were selected at random and represented 30% of all maps (>10% minimum threshold; see Neuendorf, 2002). Consensus agreement was calculated for each criterion scored by the three independent experts with Krippendorff's alpha (Hayes & Krippendorff, 2007), at the ordinal scale level: (1) concepts [.92, 95% CI: .88–.95], (2) hierarchy [.96, 95% CI: .93–.99], (3) relationships among concepts in different hierarchical levels [.90, 95% CI: .85–.95],
(4) relationships among concepts from different columns [.94, 95% CI: .87–.99], and (5) simplicity and easiness of understanding [.97, 95% CI: .93–1.00]. Consensus agreement was also calculated for the sum-scores by the three independent experts with Krippendorff's alpha (ordinal scale) and proved to be excellent: .96 [95% CI: .94–.98]. Consistency agreement of the sum-scores for the maps rated was excellent as well (Cronbach's α = .99). Once the scoring system was reliable, one of the experts scored the remaining concept maps.
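As an illustration of this kind of reliability check, a minimal sketch using the third-party krippendorff Python package; the ratings matrix below is invented and is not the study data.

```python
import numpy as np
import krippendorff

# Rows = three independent experts, columns = the concept maps they scored;
# np.nan marks a missing rating. Values are criterion scores on the 1-4 scale.
reliability_data = np.array([[4, 2, 3, 3, 1, 4],
                             [4, 2, 3, 2, 1, 4],
                             [4, 3, 3, 3, 1, np.nan]])

alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement='ordinal')
print(f"Krippendorff's alpha (ordinal) = {alpha:.2f}")
```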
Concept map score. The score for each assessment of a concept map – whether by a peer or the expert, and irrespective of condition – consists of the sum-score of all criteria and ranges from a minimum of 5 to a maximum of 20 points.
Friendship
The degree of friendship was determined with a single 9-point Likert-scale item: "My level of friendship with the peer whose concept map I have to evaluate is . . ." (1 = enmity to 9 = close friends). The item was based on Hays (1984), who used a 7-point Likert-scale. In line with the three levels discerned by Hays (1984) and Rose and Serafica (1986), we applied a 9-point Likert-scale which could be easily grouped into three levels of friendship: low (1–3), medium (4–6) and high (7–9). Friendship levels were distributed across the rubric and non-rubric condition as follows: low (N = 23; with 15 rubric and 8 non-rubric), medium (N = 172; with 78 rubric and 94 non-rubric) and high (N = 14; with 11 rubric and 3 non-rubric). The research conditions were equivalent in their mean level of friendship, F(1, 207) = 0.60, p = .438, Mrubric = 4.90 (SD = 1.26), Mnon-rubric = 4.79 (SD = 0.81).

Peer assessment construct validity
In line with Brown et al. (2004), Cho et al. (2006) and Wang and Imbrie (2010), construct validity from the teacher perspective (TPv) was determined by a Pearson correlation between the peer and expert rating, with the expert rating as unbiased. Construct validity from the student perspective (SPv) was calculated as the deviation between the peer and expert score for each individual (SPv2; see the section on construct validity), instead of the average deviation to the expert (SPv1; see Cho et al., 2006). More specifically, SPv was calculated as the peer rating minus the expert rating. Therefore, a deviation value of zero represents that the peer rating does not deviate from the expert rating. A positive deviation value represents over-scoring compared to the expert (i.e., the peer rating was higher than the expert rating). A negative deviation value represents under-scoring compared to the expert (i.e., the peer rating was lower than the expert rating).

Perceived comfort
Perceived comfort was determined with a single 7-point Likert-scale item specifically developed for this study: "My level of comfort when scoring the concept map from a peer is . . ." (1 = none to 7 = high).

Perceived fairness
Perceived fairness was determined with a single 5-point Likert-scale item specifically developed for this study: "Do you believe that your peer will conduct a fair assessment of your concept map?" (1 = no to 5 = yes).
Procedure

Students were informed one week in advance that they would be asked to conduct a task during their seminar, for which they were required to read a designated text-excerpt, and that the activity would count for their course credit (but not their actual performance in terms of the quality of the concept map).

Subsequently, the students were instructed to design a concept map from the text-excerpt they had read. The rubric was handed out to the corresponding condition and its application was explained. The rubric condition was asked to score each rubric criterion independently and then add them into a sum-score. In the non-rubric condition the five assessment criteria were stated aloud (see the criteria column of the rubric): "When assessing a concept map an expert would consider the following features: all relevant concepts have to be included, the hierarchy has to be clearly defined". Next, the students in the non-rubric condition assessed their peers' concept map by awarding 1–4 points for each criterion, with a higher score reflecting a better performance, resulting in a possible sum-score of 5–20 points (similar to the rubric condition). All students then worked for 30 min on their concept map. Afterwards they received the concept map by a peer, with 15 min to assess it. In a follow-up session, students received their scores (peer and expert assessment).
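Before turning to the results, the two construct validity indicators defined in the measures section can be sketched as follows; the score vectors are hypothetical, not the study data.

```python
import numpy as np
from scipy.stats import pearsonr

peer = np.array([16, 14, 18, 12, 15, 17])    # hypothetical peer sum-scores (5-20)
expert = np.array([15, 13, 16, 13, 14, 15])  # expert sum-scores for the same maps

tpv_r, tpv_p = pearsonr(peer, expert)  # TPv: Pearson correlation, expert as baseline
spv = peer - expert                    # SPv: positive = over-scoring, negative = under-scoring

print(f"TPv r = {tpv_r:.2f} (p = {tpv_p:.3f}); mean SPv = {spv.mean():.2f}")
```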
Results

Data inspection

Prior to the analyses, distribution assumptions were checked. The standardised skewness and kurtosis were within the +3 and −3 criterion (Tabachnick & Fidell, 2001) for all variables, except for peer assessment SPv (zskewness = 3.5 (.589/.168); zkurtosis = 3.43 (1.149/.335)) and friendship (zkurtosis = 3.81 (1.242/.335)). Since the standardised skewness for peer assessment SPv and friendship were not extremely outside the criterion range, a normal distribution was assumed.
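The standardised values reported here are each statistic divided by its standard error (e.g., zskewness = .589/.168 ≈ 3.5). A sketch of this check on hypothetical data, using the common large-sample formulas for the standard errors:

```python
import numpy as np
from scipy.stats import skew, kurtosis

x = np.random.default_rng(0).normal(size=209)  # hypothetical variable, N = 209
n = len(x)

# Large-sample standard errors of skewness and (excess) kurtosis
se_skew = np.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
se_kurt = 2 * se_skew * np.sqrt((n**2 - 1) / ((n - 3) * (n + 5)))

z_skew = skew(x) / se_skew      # flag if |z| exceeds the +/-3 criterion
z_kurt = kurtosis(x) / se_kurt  # excess kurtosis, same criterion
print(f"z_skewness = {z_skew:.2f}, z_kurtosis = {z_kurt:.2f}")
```

With n = 209, se_skew ≈ .168 and se_kurt ≈ .335, which matches the denominators reported in the text.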
Checks for equivalence of conditions over classrooms

Firstly, prior to the intervention, students' experience with concept map design was checked. A sample of concept maps (N = 62) previously created by students, evenly distributed over the rubric and non-rubric condition, revealed no significant differences (p = .652). Although the intervention was conducted by one of the researchers, analyses were performed on all dependent variables comparing the four classrooms, as they were taught by four different teachers. The following differences were found. Expert rating: rubric classrooms differ, F(1, 102) = 9.19, p = .003, η² = .08, Mgroup1 = 15.18 (SD = 2.14), Mgroup2 = 13.94 (SD = 1.99). Peer rating: non-rubric classrooms differ, F(1, 103) = 4.25, p = .042, η² = .04, Mgroup3 = 13.65 (SD = 1.74), Mgroup4 = 14.44 (SD = 2.18). Peer assessment SPv: non-rubric classrooms differ, F(1, 103) = 4.53, p = .036, η² = .04, Mgroup3 = 1.96 (SD = 1.79), Mgroup4 = 2.77 (SD = 2.09). Peer assessment SPv: rubric classrooms differ, F(1, 102) = 5.27, p = .024, η² = .05, Mgroup1 = 0.37 (SD = 2.21), Mgroup2 = 1.40 (SD = 2.38). Given the observed differences, the instructional setting within the courses was further examined. The four teachers used the same pedagogical method in a highly structured programme, sharing their activities, lectures and assessment tasks, aimed at creating the same instructional setting for the four groups. Hence, the observed differences can be regarded as natural teacher variance and do not affect the internal validity of the study.

Reliability and construct validity of peer assessment from the teacher perspective

Each concept map was assessed by one peer and one expert, which is a special case where reliability and construct validity from the teacher perspective (TPv) are expressed by the same consistency indicator, that is, a Pearson correlation between the experts' rating and peers' rating. A moderate correlation was found (r = .47, p < .001; Cohen, 1988: .10 = small, .30 = medium, .80 = high), which reveals that peer assessment scores were overall fairly reliable and valid, but still reflect large individual variations. When split by condition, the reliability and construct validity were moderate for students in the rubric (r = .34, p < .001) and non-rubric (r = .38, p < .001) condition.
Effect of rubrics and friendship level on peer assessment construct validity from the student perspective (SPv)

Since each concept map was assessed by one peer, inter-rater reliability of peer ratings (consensus and consistency) could not be determined, because each concept map would in that case have to be assessed by a minimum of two raters. The peer assessment construct validity from the student perspective (SPv) was calculated as the peer rating minus the expert rating.
A two-way ANOVA was conducted to investigate main and interaction effects of rubrics and friendship levels on SPv of peer assessment. There was a main effect for the rubric versus non-rubric condition on SPv of peer assessment, F(1, 203) = 5.66, p = .018, η² = .03 (small effect; Cohen, 1988: .01 < .06 = small, .06 < .14 = medium, >.14 = large). Students in the rubric condition (M = 0.84, SD = 2.33) showed less deviation from the expert score compared to students in the non-rubric condition (M = 2.33, SD = 1.96). The lower deviation from the expert (i.e., a score closer to zero) reflects that students in the rubric condition were more construct valid than the non-rubric condition. Both groups tended to over-score their peers, as reflected by the positive values. No main effect was observed for friendship level on SPv, F(1, 203) = 1.66, p = .193, η² = .016 (small effect), Mlow = 2.18 (SD = 0.46), Mmedium = 1.43 (SD = .16), Mhigh = 2.23 (SD = 0.69).

The analysis revealed that the RUBRIC × FRIENDSHIP interaction for SPv of peer assessment was not significant, F(2, 203) = 2.06, p = .131, η² = .018 (small effect).

The means and standard deviations by friendship level (low, medium, and high) signalled a potential difference between friendship levels in the rubric condition: Low: Mrubric = 0.73 (SD = 2.25), Mnon-rubric = 3.63 (SD = 1.59); Medium: Mrubric = 0.63 (SD = 2.40), Mnon-rubric = 2.23 (SD = 1.97); and High: Mrubric = 2.45 (SD = 1.29), Mnon-rubric = 2.00 (SD = 2.00). A follow-up one-way ANOVA for each condition revealed no significant difference between friendship levels for the non-rubric condition, F(2, 102) = 1.93, p = .151. However, for the rubric condition there was a significant difference between the friendship levels, F(2, 101) = 3.08, p = .050, and a contrast analysis comparing the low and medium levels combined to the high level of friendship was also significant, t(101) = 2.33, p = .022, d = 0.98 (large effect: >0.80, see Cohen, 1988). There is more over-scoring in the rubric condition by students with a high level of friendship compared to the combined over-scoring by students with a low or medium level of friendship.
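A sketch of such a 2 × 3 ANOVA with statsmodels follows; the data frame, its column names, and the values are hypothetical and merely mimic the design, and the study data are not reproduced here.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical assessor-level rows: SPv deviation score plus the two factors.
df = pd.DataFrame({
    'spv':        [0.7, 0.8, 0.6, 0.7, 2.4, 2.5, 3.6, 3.7, 2.2, 2.3, 2.0, 2.1],
    'rubric':     ['yes'] * 6 + ['no'] * 6,
    'friendship': ['low', 'low', 'medium', 'medium', 'high', 'high'] * 2,
})

model = smf.ols('spv ~ C(rubric) * C(friendship)', data=df).fit()
table = sm.stats.anova_lm(model, typ=2)                    # Type II sums of squares
table['eta_sq'] = table['sum_sq'] / table['sum_sq'].sum()  # classical eta-squared per
print(table)                                               # term (ignore Residual row)
```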
Perceived comfort and fairness of peer assessment

A one-way ANOVA revealed no significant difference for perceived comfort, F(1, 207) = 0.08, p = .77, between the rubric (M = 4.59, SD = 0.81) and non-rubric (M = 4.62, SD = 0.83) condition. Likewise, a one-way ANOVA revealed no significant difference for perceived fairness, F(1, 207) = 0.00, p = .99, between the rubric (M = 3.88, SD = 0.82) and non-rubric (M = 3.89, SD = 0.74) condition.

We explored the relations between perceived comfort, perceived fairness, PA rating, and the student perspective of PA validity (SPv) within each research condition. In the non-rubric condition, comfort and fairness are weakly but positively correlated (r = .21, p = .030). In the rubric condition, a weak positive correlation was observed between fairness and the peer rating (r = .23, p = .020), and a moderate positive correlation between comfort and the student perspective of validity (r = .30, p = .002).
Effect of a rubric on performance quality in terms of peer score and expert score

A one-way ANOVA revealed a significant difference for the quality in terms of the peer score, F(1, 207) = 28.82, p < .001, η² = .12 (medium effect), between the rubric (M = 15.45, SD = 1.90) and non-rubric (M = 14.01, SD = 1.98) condition.

A one-way ANOVA revealed a significant difference for the quality in terms of the expert score, F(1, 207) = 133.93, p < .001, η² = .39 (large effect), between the rubric (M = 14.62, SD = 2.16) and non-rubric (M = 11.68, SD = 1.45) condition.

Discussion and conclusion

In line with the need for more (quasi-)experimental studies on PA (see Strijbos & Sluijsmans, 2010), a quasi-experimental study was conducted to investigate the impact of a rubric and friendship on construct validity of peer assessment, perceived fairness and comfort, and performance. We will first present a summary and interpretation of the results organised around the research questions.

Effect of a rubric on PA reliability and construct validity

Correlation of the expert and peer ratings revealed that peer ratings were reliable (consistency) and construct valid from the teacher perspective (TPv), reflected by their overall moderate correlation with the experts' rating, as well as the moderate correlation between peer and expert ratings within each research condition. Although lower than the .60–.80 range reported by Falchikov and Goldfinch (2000) and Cho et al. (2006), these correlations also contain a degree of systematic bias. Nevertheless, the observed correlation warrants closer scrutiny of expert and peer ratings. Construct validity from the student perspective (SPv) revealed that the students in the rubric condition were also more valid (i.e., a smaller deviation from the expert as reflected by less over-scoring) in comparison to students in the non-rubric condition (Hypothesis 1 was confirmed). This is in line with previous research on rubrics which showed that rubrics increase peer assessment construct validity (e.g., Jonsson & Svingby, 2007; Sadler & Good, 2006). It is also noteworthy that the deviation between students' rating with a rubric and the expert rating was small (M = 0.84), i.e., within one point on a scale of 5–20 points. Hence, it can be concluded that the use of a rubric has a strong potential to increase construct validity of PA. These results emphasise the importance of rubrics as a scaffold for implementing peer assessment in the classroom (Jonsson & Svingby, 2007), because rubrics contain the assessment criteria and facilitate more reliable and valid PA (Andrade, 2010; Panadero & Jonsson, 2013).

Effect of a rubric and friendship level on PA construct validity

No main effect was observed for friendship levels on construct validity from the student perspective (SPv). The rubric × friendship interaction was not significant; however, the descriptives for SPv by friendship level hinted at differences within the rubric and non-rubric conditions. Univariate analyses revealed no significant difference between friendship levels for the non-rubric condition; however, in the rubric condition the
students with a high level of friendship significantly over-scored compared to the medium and low friendship levels. Thus, rubrics do not appear to reduce friendship bias (Hypothesis 2a was rejected), and only students with low and medium friendship in the rubric condition appear to be more valid than students with a high level of friendship (Hypothesis 2b was partially confirmed). The use of a rubric leads to more valid assessment on the one hand, but also seems to amplify – or make more visible – potential friendship bias. Finally, irrespective of friendship level, all students over-scored their peers' concept map. These results are supported by previous research on friendship marking (i.e., students over-scoring friends) (Cheng & Warren, 1997; Pond & Ul-Haq, 1997), although it should be noted that previous studies made no explicit differentiation between friendship levels, as was the case in the present study. It seems that students in the rubric condition with low and medium levels of friendship are less "afraid" to provide a realistic rating because they have the rubric to justify their decision, whereas students with a high level of friendship did not want to "confront" a close friend with their real performance, or overlooked mistakes as they were more lenient towards a friend; however, these assertions should be examined in future studies – for example, by "assessing the assessor" (Kali & Ronen, 2008).

In sum, the findings are promising, as friendship has not been systematically investigated within the area of peer assessment; however, they also highlight the need for a follow-up study containing a more even distribution of friendship levels, because the operationalization of friendship following Hays (1984) (three levels derived from one item) resulted in relatively low variance. A more fine-grained operationalization of friendship – treating friendship as a multi-layered and compound construct – could be more accurate. A multi-layer operationalization of friendship might consider (a) whether a student collaborates for their school work with specific students, (b) whether they meet with other students outside of class for social activities, (c) whether they were already acquainted before enrolling in a programme, and (d) what their relationship is to specific classmates (inside and outside of class).
Effect of a rubric on students’ perceived comfort and fairness during PA Although anonymity is often advocated as a solution, and with
online voting and debating it revealed promising results (Ains-
The results revealed no difference with respect to perceived worth et al., 2011), use of anonymity in combination with
comfort and fairness between the rubric and non-rubric condition assessment is still an open issue. Whether anonymity counters the
(Hypotheses 3a and 3b were rejected). impact of friendship or decreases authenticity in terms of future
Nevertheless, perceived comfort and fairness are correlated work contexts (where assessment is usually not anonymous)
in the non-rubric condition (small effect), and in the rubric needs to be determined. Nevertheless, rubrics have a positive
condition perceived fairness is correlated with peer rating (small effect on the reliability and construct validity of peer assessment
effect) and perceived comfort with student perspective of through clear assessment criteria and a structured format and it
validity (moderate effect). Apparently a higher degree of might even reduce potential friendship bias, but it is up to future
perceived fairness is related to a higher peer rating of the research to further elaborate on our findings.
concept map, and a higher degree of perceived comfort is related
to a lower degree of under-estimation and a higher degree of Acknowledgements
over-estimation in the rubric condition. Nevertheless, future
research could adopt longer interventions to test whether Research funded through Grant to Ernesto Panadero by the
perceived comfort and fairness increase over time and enhance Alianza 4 Universidades Project EDU2012-37382 by the Spanish
a positive view of peer assessment. Education Deparment and EUROCAT project (FP7 Program).
Appendix A. Rubric used for expert and peer assessment of concept maps
Each criterion is scored on four levels, from 4 (best) to 1 (poorest):

Concepts
  4: All the important and secondary concepts are included.
  3: Contains the important and some secondary concepts, but not all.
  2: The important concepts are included but not the secondary ones.
  1: Some key concepts are lacking.

Hierarchy
  4: The organisation is complete and correct, and reflected by the map.
  3: The organisation is correct but incomplete: some levels or elements are lacking.
  2: The organisation is complete but incorrect: there are concepts in the wrong places.
  1: The organisation is incomplete and incorrect.

Relationships among concepts in different hierarchical levels
  4: Relationships: they are correct, making connections among the correct concepts. Links: they are explicit and help to better understand the relationships among concepts.
  3: Relationships: they are correct but incomplete: some connections are lacking. Links: they are incomplete; only some are explicit, but they are correct.
  2: Relationships: some are incorrect, making connections among concepts without relationship. Links: only some are explicit, but some are incorrect.
  1: Relationships: the majority are incorrect or there are only a few. Links: they are incomplete and incorrect.

Relationships among concepts from different columns
  4: There are several connections making relevant relationships.
  3: There is only one.
  2: None.
  1: None.

Simplicity and easiness of understanding
  4: Its design is simple and easily understandable.
  3: There are examples; some relationships are difficult to understand.
  2: Contains a few examples; there is an excessive number of connections.
  1: There are no examples; neither the relationships nor the hierarchy are understandable.
References

Ainsworth, S., Gelmini-Hornsby, G., Threapleton, K., Crook, C., O'Malley, C., & Buda, M. (2011). Anonymity in classroom voting and debating. Learning and Instruction, 21, 365–378.
Andrade, H. (2010). Students as the definitive source of formative assessment: Academic self-assessment and the self-regulation of learning. In H. J. Andrade & G. J. Cizek (Eds.), Handbook of formative assessment (pp. 90–105). New York: Routledge.
Andrade, H., & Valtcheva, A. (2009). Promoting learning and achievement through self-assessment. Theory into Practice, 48(1), 12–19.
Ballantyne, R., Hughes, K., & Mylonas, A. (2002). Developing procedures for implementing peer assessment in large classes using an action research process. Assessment & Evaluation in Higher Education, 27(5), 427–441.
Barron, B. (2003). When smart groups fail. Journal of the Learning Sciences, 12(3), 307–359.
Berry, J. W., & Chew, S. L. (2008). Improving learning through interventions of student-generated questions and concept maps. Teaching of Psychology, 35(4), 305–312.
Besterfield-Sacre, M., Gerchak, J., Lyons, M., Shuman, L. J., & Wolfe, H. (2004). Scoring concept maps: An integrated rubric for assessing engineering education. Journal of Engineering Education, 93(2), 105–115.
Boud, D., Cohen, R., & Sampson, J. (1999). Peer learning and assessment. Assessment & Evaluation in Higher Education, 24(4), 413–426.
Brown, G. T. L., Glasswell, K., & Harland, D. (2004). Accuracy in the scoring of writing: Studies of reliability and validity using a New Zealand writing assessment system. Assessing Writing, 9(2), 105–121.
Cheng, W., & Warren, M. (1997). Having second thoughts: Student perceptions before and after a peer assessment exercise. Studies in Higher Education, 22(2), 233–239.
Cho, K., Schunn, C. D., & Wilson, R. W. (2006). Validity and reliability of scaffolded peer assessment of writing from instructor and student perspectives. Journal of Educational Psychology, 98(4), 891–901.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum Associates.
Diaz Barriga, F. (2002). Estrategias docentes para un aprendizaje significativo: Una interpretación constructivista [Teachers' strategies for a meaningful learning: A constructivist interpretation]. Mexico, DF: McGraw-Hill.
Dochy, F., Segers, M., & Sluijsmans, D. (1999). The use of self-, peer- and co-assessment in higher education: A review. Studies in Higher Education, 24(3), 331–350.
Edmondson, A. C. (1999). Psychological safety and learning behavior in work teams. Administrative Science Quarterly, 44, 350–385.
Falchikov, N. (2003). Involving students in assessment. Psychology Learning and Teaching, 3(2), 102–108.
Falchikov, N. (2005). Improving assessment through student involvement: Practical solutions for aiding learning in higher and further education. Oxon, UK: Routledge.
Falchikov, N., & Goldfinch, J. (2000). Student peer assessment in higher education: A meta-analysis comparing peer and teacher marks. Review of Educational Research, 70(3), 287–322.
Hafner, O. C., & Hafner, P. (2003). Quantitative analysis of the rubric as an assessment tool: An empirical study of student peer-group rating. International Journal of Science Education, 25(12), 1509–1528.
Hanrahan, S. J., & Isaacs, G. (2001). Assessing self- and peer-assessment: The students' views. Higher Education Research & Development, 20(1), 53–70.
Harris, L. R., & Brown, G. T. L. (2013). Opportunities and obstacles to consider when using peer- and self-assessment to improve student learning: Case studies into teachers' implementation. Teaching and Teacher Education, 36, 101–111.
Hayes, A. F., & Krippendorff, K. (2007). Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1, 77–89.
Hays, R. B. (1984). The development and maintenance of friendship. Journal of Social and Personal Relationships, 1(1), 75–98.
Hilbert, T. S., & Renkl, A. (2009). Learning how to use a computer-based concept-mapping tool: Self-explaining examples helps. Computers in Human Behavior, 25(2), 267–274.
Jacobs-Lawson, J. M., & Hershey, D. A. (2002). Concept maps as an assessment tool in psychology courses. Teaching of Psychology, 29(1), 25–29.
Jehn, K. A., & Shah, P. P. (1997). Interpersonal relationships and task performance: An examination of mediation processes in friendship and acquaintance groups. Journal of Personality and Social Psychology, 72(4), 775–790.
Jones, A., & Issroff, K. (2005). Learning technologies: Affective and social issues in computer-supported collaborative learning. Computers & Education, 44(4), 395–408.
Jonsson, A., & Svingby, G. (2007). The use of scoring rubrics: Reliability, validity and educational consequences. Educational Research Review, 2, 130–144.
Kali, Y., & Ronen, M. (2008). Assessing the assessors: Added value in web-based multi-cycle peer assessment in higher education. Research and Practice in Technology Enhanced Learning, 3, 3–32.
Kane, M. T. (2001). Current concerns in validity theory. Journal of Educational Measurement, 38(4), 319–342.
Karaca, E. (2009). An evaluation of teacher trainees' opinions of the peer assessment in terms of some variables. World Applied Sciences Journal, 6(1), 123–128.
Lindblom-Ylänne, S., Pihlajamäki, H., & Kotkas, T. (2006). Self-, peer- and teacher-assessment of student essays. Active Learning in Higher Education, 7(1), 51–62.
MacDonald, R. A. R., & Miell, D. (2000). Creativity and music education: The impact of social variables. International Journal of Music Education, os-36(1), 58–68.
Magin, D. (2001). Reciprocity as a source of bias in multiple peer assessment of group work. Studies in Higher Education, 26(1), 53–63.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry in score meaning. American Psychologist, 50(9), 741–749.
Moni, R. W., Beswick, E., & Moni, K. B. (2005). Using student feedback to construct an assessment rubric for a concept map in physiology. Advances in Physiology Education, 29, 197–203.
Moni, R. W., & Moni, K. B. (2008). Student perceptions and use of an assessment rubric for a group concept map in physiology. Advances in Physiology Education, 32, 47–54.
Nesbit, J. C., & Adesope, O. O. (2006). Learning with concept and knowledge maps: A meta-analysis. Review of Educational Research, 76(3), 413–448.
Neuendorf, K. A. (2002). The content analysis guidebook. Thousand Oaks, CA: Sage.
Panadero, E., Alonso-Tapia, J., & Huertas, J. A. (2012). Rubrics and self-assessment scripts effects on self-regulation, learning and self-efficacy in secondary education. Learning and Individual Differences, 22(6), 806–813.
Panadero, E., Alonso-Tapia, J., & Huertas, J. A. (2014). Rubrics vs. self-assessment scripts: Effects on first year university students' self-regulation and performance. Infancia y Aprendizaje, 37(1).
Panadero, E., Alonso-Tapia, J., & Reche, E. (2013). Rubrics vs. self-assessment scripts effect on self-regulation, performance and self-efficacy in pre-service teachers. Studies in Educational Evaluation.
Panadero, E., & Jonsson, A. (2013). The use of scoring rubrics for formative assessment purposes revisited: A review. Educational Research Review, 9, 129–144.
Papinczak, T., Young, L., & Groves, M. (2007). Peer assessment in problem-based learning: A qualitative study. Advances in Health Sciences Education, 12(2), 169–186.
Pond, K., & Ul-Haq, R. (1997). Learning to assess students using peer review. Studies in Educational Evaluation, 23(4), 331–348.
Pope, N. K. L. (2001). An examination of the use of peer rating for formative assessment in the context of the theory of consumption values. Assessment & Evaluation in Higher Education, 26(3), 235–246.
Pope, N. K. L. (2005). The impact of stress in self- and peer assessment. Assessment & Evaluation in Higher Education, 30(1), 51–63.
Rose, S., & Serafica, F. C. (1986). Keeping and ending casual, close and best friendships. Journal of Social and Personal Relationships, 3(3), 275–288.
Sadler, P. M., & Good, E. (2006). The impact of self- and peer-grading on student learning. Educational Assessment, 11(1), 1–31.
Sambell, K., McDowell, L., & Brown, S. (1997). But is it fair? An explorative study of student perceptions of the consequential validity of assessment. Studies in Educational Evaluation, 23(4), 349–371.
Shavelson, R. J., Ruiz-Primo, M. A., & Wiley, E. W. (2005). Windows into the mind. Higher Education, 49(4), 413–430.
Smith, H., Cooper, A., & Lancaster, L. (2002). Improving the quality of undergraduate peer assessment: A case for student and staff development. Innovations in Education and Teaching International, 39(1), 71–81.
Stellmack, M. A., Konheim-Kalkstein, Y. L., Manor, J. E., Massey, A. R., & Schmitz, J. A. P. (2009). An assessment of reliability and validity of a rubric for grading APA-style introductions. Teaching of Psychology, 36(2), 102–107.
Strijbos, J. W., Ochoa, T. A., Sluijsmans, D. M. A., Segers, M. S. R., & Tillema, H. H. (2009). Fostering interactivity through formative peer assessment in (web-based) collaborative learning environments. In C. Mourlas, N. Tsianos, & P. Germanakos (Eds.), Cognitive and emotional processes in web-based education: Integrating human factors and personalization (pp. 375–395). Hershey, PA: IGI Global.
Strijbos, J. W., & Sluijsmans, D. (2010). Unravelling peer assessment: Methodological, functional, and conceptual developments. Learning and Instruction, 20(4), 265–269.
Tabachnick, B. G., & Fidell, L. S. (2001). Using multivariate statistics. Needham Heights, MA: Allyn & Bacon.
Tolmie, A. K., Topping, K. J., Christie, D., Donaldson, C., Howe, C., Jessiman, E., et al. (2010). Social effects of collaborative learning in primary schools. Learning and Instruction, 20(3), 177–191.
Topping, K. (1998). Peer assessment between students in colleges and universities. Review of Educational Research, 68(3), 249–276.
Topping, K. (2003). Self and peer assessment in school and university: Reliability, validity and utility. In M. Segers, F. Dochy, & E. Cascallar (Eds.), Optimising new modes of assessment: In search of qualities and standards (pp. 55–87). Dordrecht, The Netherlands: Springer.
Toth, E. E., Suthers, D. D., & Lesgold, A. M. (2002). Mapping to know: The effects of representational guidance and reflective assessment on scientific inquiry. Science Education, 86(2), 264–286.
Tsai, C.-C., & Liang, J.-C. (2009). The development of science activities via on-line peer assessment: The role of scientific epistemological views. Instructional Science, 37, 293–310.
Tseng, S.-C., & Tsai, C.-C. (2007). On-line peer assessment and the role of peer feedback: A study of a high school computer course. Computers and Education, 49, 1161–1174.
Van Gennip, N. A. E., Segers, M. S. R., & Tillema, H. H. (2009). Peer assessment for learning from a social perspective: The influence of interpersonal variables and structural features. Educational Research Review, 4(1), 41–54.
Van Zundert, M., Sluijsmans, D., & Van Merriënboer, J. (2010). Effective peer assessment processes: Research findings and future directions. Learning and Instruction, 20(4), 270–279.
Vass, E. (2002). Friendship and collaborative creative writing in the primary classroom. Journal of Computer Assisted Learning, 18(1), 102–110.
Wang, J., & Imbrie, P. K. (2010, June). Students' peer evaluation calibration through the administration of vignettes. Paper presented at the 2010 American Society for Engineering Education Annual Conference & Exposition.