Formative and summative assessments are fundamental components of student-centered teaching and learning (NRC 2001, 2007), and faculty administering these assessments are frequently faced with choices about when and how to collect data from students. Educators and administrators in higher education settings would therefore benefit from evidence-based guidelines to inform these choices. Although many studies have investigated the impact of testing conditions on assessment scores, few have done so using constructed-response items. It therefore remains an open question whether implementation biases characteristic of closed-response assessments generalize to more contemporary item types (cf. AAAS 2011; NRC 2012). The Next Generation Science Standards and Vision and Change emphasize that undergraduate educators should move away from recognition-based approaches to science assessment and toward more authentic performance tasks such as model building, explaining, and arguing using evidence (AAAS 2011; NRC 2012). As more biology instructors heed these guidelines, questions about how to administer constructed-response assessments like the ACORNS in ways that minimize bias become increasingly important. Ultimately, understanding how best to foster student understanding of core biology topics such as evolution will rely on both the quality of assessment tools and administration procedures that minimize bias.
In this study, two test administration conditions that instructors routinely employ were examined using the ACORNS instrument: participation incentives (in this case, regular credit vs. extra credit) and end-of-course time point (final exam vs. post-test). A quasi-experimental design was used in which students were randomly assigned to a treatment condition. The findings indicated that variations in these two administration conditions did not meaningfully impact inferences about evolution understanding; all differences between conditions were either non-significant or, when significant, of small effect size. Furthermore, these administration conditions did not meaningfully impact inferences about evolution learning in terms of reasoning approach, increases in core concepts, or declines in misconceptions. Importantly, these findings were consistent across race/ethnicity and gender groups.
Prior work on the impact of testing conditions on closed-response assessment scores has produced results that both align with and diverge from those presented here. For example, Smith et al. (2012) reported that scores on a True/False genetics assessment did not differ between biology majors who took it on the last day of the course and those who took it as part of the final exam (with no relevant interceding instruction). However, students were told that they would receive extra credit only if they scored 100% on the assessment. Although this work conflated incentive-related and timing-related testing conditions, the conclusions were similar to those of the present study: the end-of-semester time point did not meaningfully impact instrument scores.
Ding et al. (2008) tested a variety of incentive conditions on a multiple-choice physics assessment using a cross-sectional design. In contrast to Smith et al., Ding et al. found that some conditions they tested were associated with different instrument scores. In particular, the extra credit and regular credit incentive conditions produced significant differences. However, the Ding et al. (2008) study has several design limitations, notably (i) a lack of controls for pre-test measures as well as background variables (all of which may differ among the students in cross-sectional study designs), and (ii) the conflation of multiple testing conditions (i.e., timing and incentives). Thus, the Ding et al. study lacks many of the controls used in the current study.
The only study in our literature review that tested administration condition effects on a constructed-response evolution test found results remarkably similar to those presented here. Specifically, Nehm and Reilly (2007) administered a constructed-response item about evolutionary change at two end-of-semester time points (one week apart: as an extra credit item on a post-test and as an extra credit item on a final exam). Responses were scored for seven key concepts of evolution (three of which overlapped with the ACORNS core concepts used in this study) and six misconceptions (three of which overlapped with the ACORNS misconceptions used in this study). In alignment with the findings reported in the present study, Nehm and Reilly found that the number of misconceptions did not differ between administration time points, but that students used significantly fewer evolutionary key concepts on the post-test than on the final exam. As in our study, the size of this difference (an average of 0.5 key concepts) was relatively small.
As one might predict, the assessment time points used for each of our research questions were associated with different participation rates (96% of students completed the final exam items [RQ1], whereas 85% completed both the final exam and post-test items [RQ2], and 76% completed all three assessments [RQ3]). Although all three participation rates are very high for introductory biology settings, they nevertheless differed across the sets of assessments; Ding et al. (2008) likewise reported lower participation rates for some of their assessment conditions. Because reduced participation coincided with significantly different assessment scores in their study, the authors concluded that the different conditions attracted differently motivated “fractions” of the class. Although the present study was not designed to investigate student motivation, our findings do not align with this conclusion because none of the analyses (regardless of participation rate) resulted in meaningful differences between conditions. More specifically, performance on the ACORNS items in both the incentive and timing conditions led to similar inferences about the magnitudes of evolution learning in the course. Additionally, because the percentages of URM and male students who completed each assessment were similar (as shown in Additional file 1: Table S5 [RQ1 vs. RQs 2–3]), it does not appear that participation motivation in our sample was strongly related to gender, race/ethnicity, or assessment outcomes.
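One way to probe whether assessment completion is related to demographic group is a simple cross-tabulation; the minimal sketch below illustrates this in Python under assumed data. The file name ("participation.csv") and column names ("completed_all", "gender") are hypothetical placeholders, not the data or analyses reported in this study.

```python
# Hypothetical sketch: is completing all assessments independent of a
# demographic grouping variable? File and column names are placeholders.
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("participation.csv")  # assumed columns: 'completed_all', 'gender'

# Cross-tabulate completion status by group and test for independence.
table = pd.crosstab(df["completed_all"], df["gender"])
chi2, p, dof, expected = chi2_contingency(table)
print(table)
print(f"chi2={chi2:.2f}, p={p:.3f}")
```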
Directions for future work on testing conditions in biology education
Four areas of work on testing conditions in biology education would benefit from further attention: (1) anchoring research in relevant conceptual and theoretical frameworks, (2) conceptualizing the array of possible testing condition dimensions and studying them independently, (3) implementing longitudinal study designs, and (4) analyzing a broader array of assessment types and time points. These four points are discussed below.
Most of the studies of testing conditions in biology contexts that we reviewed were not anchored in explicit conceptual or theoretical frameworks (cf. Nehm 2019; see also Sbeglia et al. 2021). In other educational fields, in contrast, the impact of test incentives on assessment scores has been grounded in motivation-related perspectives, such as variations of the Expectancy-Value Framework (for examples, see Eccles 1983; Duckworth et al. 2011; Wigfield and Eccles 2000; Wise and DeMars 2005; for an exception in biology education see Uminski and Couch 2021). Other categories of testing conditions beyond incentives (e.g., assessment timing, assessment administration within an exam or independent of exams) could also introduce construct-irrelevant variation (i.e., “noise”) into biology assessment scores, yet these conditions have not been explicitly situated within appropriate frameworks. Future work should ground empirical studies within theoretical models that seek to explain (rather than only test for) testing condition outcomes.
Many testing conditions that have the potential to impact assessment scores have not been clearly defined in biology education, which may explain why prior work has conflated distinct categories of conditions instead of isolating salient dimensions within categories. For example, participation incentive is a category of assessment condition that can differ along many axes (e.g., regular credit vs. extra credit; no incentive vs. small incentive vs. large incentive; scored for accuracy vs. scored for completion). The present study focused on only one dimension of the incentive condition (regular credit vs. extra credit) and controlled for the other dimensions. The development of a matrix of possible conditions would allow more complete testing and prevent weak research designs (e.g., conflating testing conditions).
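To make the idea of a condition matrix concrete, the sketch below enumerates the full crossing of several incentive- and timing-related dimensions. The dimension names and levels are illustrative examples rather than a complete taxonomy of administration conditions.

```python
# Illustrative condition matrix; dimensions and levels are examples only.
from itertools import product

dimensions = {
    "credit_type": ["regular credit", "extra credit"],
    "incentive_size": ["none", "small", "large"],
    "scoring_basis": ["completion", "accuracy"],
    "timing": ["final exam", "post-test"],
}

# Each cell of the matrix is one fully specified administration condition;
# comparing cells that differ on only one dimension isolates that dimension.
conditions = [dict(zip(dimensions, levels)) for levels in product(*dimensions.values())]
print(len(conditions), "possible conditions")  # 2 x 3 x 2 x 2 = 24
print(conditions[0])
```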
Studying how a broader array of testing dimensions impacts student performance would be valuable because interaction effects among conditions are likely. Some participation incentives, for example, have been proposed as contributors to the perceived “stakes” of a test (low vs. high), which in turn influence students’ test-taking motivation (Cole et al. 2008; Ding et al. 2008; Duckworth et al. 2011; Wise and DeMars 2005). Test-taking motivation, in turn, may impact assessment scores and learning inferences (Wise and DeMars 2005). Although studying these interactions in controlled settings may be possible, some of these conditions may not apply to real classroom settings. For example, requiring students to complete an assessment after a course has ended does not align with standard university practices. Nevertheless, many more testing conditions need to be investigated.
Future studies of test administration conditions should include the collection of datasets that permit longitudinal analyses of how these conditions impact inferences about changes in response to instruction. For many educators in undergraduate settings, the goal of pre-post assessment is to understand how instruction impacts learning objectives. At present, it is unclear whether prior findings about the impacts of test administration conditions, derived largely from static datasets, translate to longitudinal contexts. Our findings suggest that we should not necessarily expect that they will. For example, although ACORNS core concept (CC) scores differed significantly between the two end-of-course time points, these time points generated similar magnitudes of pre-post change; for both end-of-course time points, assessment scores indicated that significant and large magnitudes of learning occurred in the two semesters.
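As an illustration of the kind of longitudinal analysis this would entail, the sketch below fits a score-by-time-by-condition model. It is a minimal example and not necessarily the modeling approach used in this study; the file and column names ("acorns_long.csv", "score", "time", "condition", "student_id") are assumptions.

```python
# Minimal sketch: does an administration condition moderate pre-post change?
# File and column names are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

# Long format: one row per student per time point (pre, post),
# with the administration condition recorded for each student.
df = pd.read_csv("acorns_long.csv")

# A random intercept per student accounts for repeated measures; the
# time:condition interaction indicates whether pre-post gains differ
# between administration conditions.
model = smf.mixedlm("score ~ time * condition", data=df, groups=df["student_id"]).fit()
print(model.summary())
```

A negligible interaction term would support the inference that the administration condition does not alter conclusions about the magnitude of learning.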
Finally, it is important to emphasize that faculty inferences about student understanding can be derived from formative and/or summative assessments. These assessment artifacts can vary widely (e.g., take-home assignments, in-class writing tasks). This study examined only traditional testing approaches used for summative purposes (i.e., Did students learn evolution in this course?). Analyzing a broader array of assessment approaches (formative, summative) and types (assignments, in-class tasks) would be a valuable direction for future research. In addition, many studies in higher education focus on beginning- and end-of-course time points (pre-post). Yet numerous assessment events occur throughout a course, and future work should therefore not be restricted to traditional summative testing time points.
Study limitations
This study focused on two test administration conditions and their impacts on ACORNS scores: test participation incentive and end-of-course timing. Several limitations apply to these study conditions.
Participation incentives
The analyses for RQ1 focused on one dimension of the participation incentive (extra credit vs. regular credit) and controlled for the size of the incentive (i.e., the amount of credit given was held constant). However, another dimension of the participation incentive condition (the scoring procedure, e.g., graded for completion vs. graded for accuracy) was not explicitly mentioned to students. Furthermore, because only the final exam included both an extra credit and a regular credit incentive for the ACORNS items (the pre-test and post-test included only an extra credit incentive for these items), the study design was unbalanced for this assessment condition, which precluded us from answering questions about how the participation incentive interacted with end-of-semester timing (or impacted inferences about pre-post learning). Answering these questions would require a “regular credit” post-test, which was not realistic or appropriate in our instructional setting (and perhaps most others). Simply put, requiring students to complete an assessment after a course has ended would be unusual. The participation incentive was therefore included in our analyses as a control variable (for RQ2 and RQ3) to account for potential impacts. For both research questions, evolutionary knowledge outcomes were not significantly associated with the participation incentive, which aligns with the finding that the participation incentive did not impact final exam scores.
End-of-course timing
Many studies of the impact of test administration conditions on assessment scores conflate multiple conditions (e.g., Ding et al. 2008; Smith et al. 2012). The present study was designed to tease apart two test administration conditions: participation incentive and end-of-course assessment timing. This design goal focused on the ACORNS items themselves (i.e., two participation incentives for the ACORNS were studied while controlling for test timing, and two test time points for the ACORNS were studied while controlling for participation incentive). However, at each time point the two ACORNS items were situated within a broader assessment, and this broader assessment may have inadvertently conflated the participation incentive and the timing. Specifically, the final exam as a whole was a required assessment (in which ACORNS items were randomly assigned as either extra credit or regular credit), whereas the post-test as a whole was a purely voluntary assessment. Therefore, although the design of the ACORNS items themselves effectively controlled for participation incentive condition across these time points (and vice versa), the design of the broader assessments did not.

Whether the participation incentive of the broader test impacts scores even when it differs from the incentive of the ACORNS items themselves (e.g., an extra credit ACORNS item within a required final exam) is not clear, but our findings taken together with those of Nehm and Reilly (2007) suggest that it might. Specifically, the results of these two studies collectively suggest that administering voluntary ACORNS items within a required test (e.g., a final exam) vs. within a voluntary assessment may indeed impact ACORNS scores. Although the two studies employed opposite sequencing of their voluntary and required assessments (Nehm and Reilly administered their voluntary assessment before the required test, whereas in this study it was administered after the required test), both nonetheless found that students scored consistently lower on the voluntary assessment. This consistent pattern of student performance may therefore be better explained by the participation incentive of the broader assessment than by the participation incentive of the items themselves or by the timing/sequencing of test administration. Regardless, both studies found that the few differences between administrations were small or not significant, resulting in similar inferences about the magnitude of evolution learning.
Interpreting effect sizes
Although the benchmarks for small, medium, and large effects are generally well accepted for many effect size measures, interpretation frameworks differ, and exactly how to use these benchmarks to draw inferences varies in the literature. For example, published effect size benchmarks can be treated as minimum values for each level of effect (e.g., a medium benchmark of 0.6 would imply that only values at or above this benchmark be classified as a medium effect; Olejnik and Algina 2000). Conversely, effect sizes can be interpreted based on which published benchmark a value is closest to (e.g., Olejnik and Algina 2000). Our prior work uses the former interpretation framework (e.g., Sbeglia and Nehm 2018), which we maintain in the present study so that interpretations remain comparable from study to study. Regardless, these discrepancies in the broader literature indicate that authors and readers should be careful about how definitively they position effect size claims.
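To illustrate how these two conventions can label the same value differently, the sketch below classifies a few effect sizes under each rule. The benchmark values (0.2, 0.6, 1.2) are placeholders chosen for illustration and are not the benchmarks applied to any particular statistic in this study.

```python
# Illustration only: benchmark values are placeholders, not the benchmarks
# used for any specific effect size measure in this study.
BENCHMARKS = {"small": 0.2, "medium": 0.6, "large": 1.2}

def classify_minimum_threshold(es, benchmarks=BENCHMARKS):
    """Treat each benchmark as the minimum value for its label
    (e.g., 0.55 remains 'small' because it falls below the 0.6 medium cutoff)."""
    label = "negligible"
    for name, cutoff in sorted(benchmarks.items(), key=lambda kv: kv[1]):
        if abs(es) >= cutoff:
            label = name
    return label

def classify_nearest_benchmark(es, benchmarks=BENCHMARKS):
    """Assign the label of the benchmark closest to the observed value
    (e.g., 0.55 becomes 'medium' because it is nearer to 0.6 than to 0.2)."""
    return min(benchmarks, key=lambda name: abs(abs(es) - benchmarks[name]))

for es in (0.15, 0.55, 0.95):
    print(es, classify_minimum_threshold(es), classify_nearest_benchmark(es))
```

The divergence at 0.55 and 0.95 shows how the same result can be reported as a smaller effect under the minimum-threshold convention than under the nearest-benchmark convention.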