
Applying measurement standards to evolution education assessment instruments


Over the past 25 years a number of instruments have been published that attempt to measure understanding and acceptance of evolution. Science educators have been administering these instruments and reporting results; however, it is not clear that these instruments are being used appropriately. The goal of this paper is to review these instruments, noting the original criteria and population for which evidence of validity and reliability was assessed, and to survey other publications that report their use, examining each for evidence of validity and reliability with subsequent populations. Our hope is that such a comprehensive review will engage researchers and practitioners in a careful examination of how they intend to use a particular instrument and whether it can provide an accurate and meaningful assessment of the desired outcomes. We encourage the community to administer evolution education assessments with consideration of an instrument’s measurement support and past use with similar populations. We also encourage researchers to add evidence of validity and reliability for these instruments, especially if modifications have been made to the instrument or if its use has been extended to new populations.


Evolution is both a foundational concept and organizing principle in biology and as such has secured a central place in biology education as evidenced by science education reforms (National Research Council 2012; Brownell et al. 2014). Yet, a disconnect still exists between the central role of evolution in biology, student understanding of evolutionary mechanisms, and the general level of public acceptance as measured by polling questions administered by organizations such as Gallup (Swift 2017) and Pew Research Center (Funk and Rainie 2015). To further complicate its teaching and learning, the various relationships between acceptance and understanding of evolution and the nature of science (Smith 2010a; Smith and Siegel 2004), along with religiosity and the use of teleological reasoning (Allmon 2011; Shtulman 2006), impact student understanding and potentially their ability to successfully integrate evolutionary concepts into their understanding of the biological world (Sinatra et al. 2003; Smith 2010b). In a recent study of the general public, Weisberg et al. (2018) found that knowledge of evolution predicted level of acceptance, possibly suggesting student views may be amenable to change. However, a different study suggests teleological reasoning and not acceptance of evolution influences understanding of natural selection (Barnes et al. 2017). The relationship between understanding and acceptance is complex, and while not addressed directly in this paper, it is important to be aware of this complexity when assessing students and evaluating instruments. The wording and content of an assessment can impact student responses if their acceptance hinders their ability to answer questions addressing understanding. 
There are a number of papers that provide extensive discussion of this particular challenge to teaching and learning evolution (Smith 2010a, b). However, we have not addressed this directly in our review of instruments aside from potential issues associated with a particular instrument based on our review criteria.

Educational research has also found that how a student responds to questions on the topic of evolution is context dependent, e.g. taxa, or the direction of change via trait gain vs. loss (Nehm et al. 2012; Nehm and Ha 2011), and many students retain naive or non-scientific concepts even after instruction (Ha et al. 2015; Nehm and Reilly 2007). Given these findings, and the various challenges to student understanding of evolution (Branch and Mead 2008; Mead and Scott 2010a, b; Petto and Mead 2008), many science educators are now interested in assessing how well students understand, and in some cases, accept, the basic premise and mechanisms underlying evolutionary change, in either formative or summative ways. In addition, instructors seek to assess the effectiveness of curricular interventions designed to improve student understanding.

Perhaps as a result of recent interest in the teaching and assessment of evolution, or the growing field of discipline-based education research, a number of instruments designed to assess student understanding and acceptance of evolution have been created over the last 25 years (see Table 1 for examples). At the undergraduate biology level, these include, but are not limited to, assessments designed to measure student understanding of natural selection (e.g. concept inventory of natural selection—CINS, Bishop and Anderson 1990; concept assessment of natural selection—CANS, Kalinowski et al. 2016), macroevolution (e.g. measure of understanding of macroevolution—MUM, Nadelson and Southerland 2009); genetic drift (e.g. genetic drift inventory—GeDI, Price et al. 2014); and acceptance of evolution (e.g. Measure of the Acceptance of the Theory of Evolution—MATE, Rutledge and Warden 1999; Evolutionary Attitudes and Literacy Survey—EALS, Hawley et al. 2011; generalized acceptance of evolution evaluation—GAENE, Smith et al. 2016). These instruments can provide an opportunity for instructors to measure gains in student understanding; however, the conclusions drawn from them are dependent on the quality, accuracy, and relevancy of the instrument. For example, in a review of assessments addressing student understanding of bioinformatics concepts, Campbell and Nehm (2013) found many of the instruments they reviewed provided only minimal evidence of reliability or validity.

Table 1 List of published instruments that measure understanding and/or acceptance of evolution reviewed in current paper

The decision to use any instrument should include an examination of the instrument and its development to ascertain if it meets the accepted measurement standards, specifically whether there is strong evidence that the instrument provides valid and reliable results. Evidence that an instrument provides valid results suggests the variable being measured by the instrument accurately represents the construct or item of interest. Evidence that an instrument provides reliable results suggests the instrument gives consistent results when implemented under similar circumstances. There are multiple forms of evidence for reliability (e.g. stability, internal consistency, interrater reliability) and validity (e.g. content, internal and external structure, generalization). Box 1 provides examples of the different sources of evidence that can be used to evaluate validity and reliability (Messick 1995; Campbell and Nehm 2013; AERA 2014).

Box 1. Methods and descriptions for various sources of validity and reliability (modified from Messick 1995; Campbell and Nehm 2013; AERA 2014)



Source | Description | Methodology (examples)

Validity: do scores represent the variable(s) intended?
Content | Assessment represents knowledge domain | Expert survey, textbook analysis, Delphi study
Substantive | Thinking processes used to answer are as intended | “Think aloud” interviews, cognitive task analysis
Internal structure | Items capture intended construct structure | Factor analysis, Rasch analysis
External structure | Construct aligns with expected external patterns | Correlational analysis
Generalization | Scores meaningful across populations and contexts | Comparisons across contextual diversity, differential item functioning
Consequences | Scores lead to positive or negative consequences | Studying social consequences resulting from use of test score

Reliability: refers to the consistency of the measure
Stability | Scores consistent from one administration to another | Stability coefficient
Alternate forms | Scores comparable when using similar items | Spearman-Brown double length formula: split half
Internal consistency | Items correlate with one another | Coefficient alpha (Cronbach’s), Kuder-Richardson 20
Inter-rater agreement | Assessment scored consistently by different raters | Cohen’s or Fleiss’s kappa
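As an illustration of the internal consistency entry in Box 1, the following is a minimal sketch of how coefficient alpha (Cronbach’s) could be computed from a respondents-by-items score table. The function name `cronbach_alpha` and the small Likert data set are hypothetical and are not taken from any of the instruments reviewed here.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Coefficient alpha for a list of respondents, each a list of item scores.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance of total scores)
    """
    k = len(scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item (column) across respondents.
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    # Variance of each respondent's summed score.
    total_var = pvariance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical five-point Likert responses: 5 respondents x 4 items.
likert = [
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 2, 3],
    [1, 2, 1, 2],
]
alpha = cronbach_alpha(likert)
```

Because the same variance estimator is used in the numerator and the denominator, the choice between population and sample variance does not change the result. The value describes the scores from one administration to one population, so it would need to be recomputed whenever the instrument is given to a new group.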

Assessment of student understanding in educational settings should include systematic evaluation of instruments in order to meet the quality control benchmarks established by, for example, the American Educational Research Association (AERA et al. 2014). Not doing so is “at odds with the principles of scientific research in education” (Campbell and Nehm 2013). Because relying on faulty or misleading information for the purposes of evaluation and reform is misguided, it is necessary to establish that such information is actually useful. Campbell and Nehm (2013) are careful to point out that validity and reliability are not properties of the instrument itself, but rather relate to the inferences derived from the scores it produces. It is therefore incorrect to describe an assessment instrument itself as being valid and reliable. Instead, our interpretation of validity and reliability needs to shift such that an assessment’s scores and implementation contexts are foremost. For example, a correct statement is that the instrument produces valid and reliable inferences under the particular circumstances in which it was administered. One cannot assume that an instrument developed using a population of undergraduate non-majors in their 1st year of college necessarily has the same evidence of reliability and validity for a population of students in an upper level evolution course.

In our own efforts to identify ways of assessing understanding of evolutionary concepts, we found that many studies simply reported using a published instrument, often modified from an earlier published instrument, without any additional information about the implementation or adherence to measurement standards. To address these issues, we (1) reviewed the various published instruments designed to measure understanding and acceptance of evolution, (2) examined the types of evidence of validity and reliability provided in the original publication(s), and (3) characterized the use of these instruments in subsequent publications, specifically noting any additional evidence of reliability and validity.


In 2016 and 2017 we (LM, CK, AW, KS) carried out searches of Google Scholar, ERIC, and Web of Science using the following keyword searches: “student understanding of evolution”; “student understanding of natural selection”; “student acceptance of evolution”. We compiled a list of papers that referenced these key phrases, focusing on ones that were aimed at college undergraduates. We reviewed abstracts to identify papers that specifically mentioned measuring student understanding or acceptance of evolution using the following criteria: population—undergraduates; level/course—any; content assessed—evolution understanding, evolution acceptance, natural selection, genetic drift. If the information could not be readily assessed from the abstract, we examined the methods section of the paper in more detail. In this initial review of the published literature it became clear that many of the papers we reviewed referenced using some portion of an earlier published instrument or set of questions. For example, many studies reported using portions of the original assessment developed by Bishop and Anderson (1990). We used this information to identify a set of 13 instruments that would become the focus of the remainder of our research, and that appeared to form the basis of many studies.

Our more in-depth analysis was restricted to instruments created with the intention of being used by others to assess understanding and acceptance of evolution. We made three exceptions to this criterion: the ECT referenced in Bishop and Anderson (1990), the KEE (knowledge of evolution exam) referenced in Moore and Cotner (2009), and the ATEEK (assessment tool for evaluating evolution knowledge) referenced in White et al. (2013). We chose to include these because they were subsequently treated as instruments by other researchers who used them as the basis of assessing student understanding. Two of these, the KEE and ATEEK, were given a specific name for use and referenced by others. We did not include instruments measuring genetics only or combinations of other biological sub-disciplines (e.g. EcoEvo-MAPS in Summers et al. 2018) because we wanted to evaluate only instruments reported to measure student understanding and/or acceptance of evolution. We also chose to exclude the topic of phylogenetics for a number of reasons. First, phylogenetic trees are visual representations of both patterns and processes, and therefore it can be difficult to isolate specific elements from a cognitive perspective (Novick and Catley 2012). Second, at the time of our review, the only published instruments were the Basic Tree Thinking Assessment provided in Baum et al. (2005), which was developed as a formative quiz and not meant to be used as an assessment instrument (pers. comm.), and the PhAT (Phylogeny Assessment Tool), which comprised only three questions (Smith et al. 2013), all related to a single phylogenetic tree.

Our final list included 13 focal instruments (Table 1). We first reviewed the original publication and characterized the instrument (i.e., content and population assessed, type and number of questions, how it was developed) and the evidence of reliability and validity described in the publication. These original instruments were reviewed and discussed by all co-authors so as to ensure consistency.

Next, we performed a citation search for each of the focal instruments to generate a list of publications that cited the instrument, suggesting possible use. We performed these searches using Google Scholar, first performing a search of the original paper (e.g. Bishop and Anderson 1990) and then examining all of the papers listed as “cited by” (e.g. at the time of our search Google Scholar reported 703 papers had cited Bishop and Anderson 1990). Our data represent publications that appeared in Google Scholar through March 2018. Our review of these secondary publications involved an initial read of the abstract, followed by a search for the original reference. These methods allowed us to ascertain if the secondary publication used the original instrument. If the paper did use the focal instrument, the paper was marked for later review. Once we identified papers that reported use of the focal instruments, all authors reviewed a subset in their entirety, checking for consistency in identifying new populations and new uses. Each author then took one or more of the focal instruments and reviewed all secondary uses, further characterizing these citations and recording the use of the focal instrument. For each publication (secondary usage) we recorded the population, a description of the portion of the instrument used (e.g. Andrews et al. (2011) reported using an abbreviated CINS comprising 10 of the original 20 questions), and any additional evidence for reliability/validity (e.g. Rissler et al. (2014) reported Cronbach’s alpha associated with administration of the MATE to undergraduates at the University of Alabama). To determine whether the study used the instrument on a new population we considered: (1) geographic area; (2) grade level; (3) field of study; and (4) academic level (introductory courses, advanced courses, or graduating seniors). 
We categorized the population based on the geographic region of the United States (midwestern, southwestern, southeastern, western, northwestern, northeastern) or the country. In the case of papers that were in languages other than English we relied on Google Translate to evaluate if and how an instrument was used. In some cases, the description of the population in the new implementation was less specific than that of the original population, in which case we did not consider it a new population because we could not tell whether the new implementation was potentially inclusive of the original population. For grade, field of study, and academic level we identified the following categories: undergraduates not enrolled in a specific course, undergraduates enrolled in a non-majors introductory biology course, undergraduates enrolled in a majors-level introductory biology course, undergraduates enrolled in an advanced biology course, undergraduates enrolled in a psychology course, undergraduate preservice teachers, high school teachers, high school students. When questions arose regarding how to characterize a particular use, we discussed it as a group that included at least three of the authors at any given point. For studies suggesting new implementations we were especially interested to know whether new uses of the instrument also included new measures of reliability/validity, as applicable. We evaluated these based on the criteria and examples outlined in Box 1. We recorded these data for each study we encountered.


Initial review of focal instruments

Our initial review of the 13 focal instruments published between 1990 and 2016 found that two instruments included multiple versions (MATE, EALS). For the MATE we considered two of the versions unique enough to evaluate separately. The EALS Short-form was created directly from the Long-form and we therefore combined results for this instrument. Two of the assessments included only open-ended, constructed response questions: the ACORNS (assessing contextual reasoning about natural selection) and the ATEEK. Two included both constructed response and multiple-choice questions (ECT, MUM), and the remainder were some form of multiple choice, including Likert, agree/disagree, etc. (CINS, MATE, I-SEA, EALS, KEE, GAENE, GeDI, EvoDevoCI, CANS). We recorded information on instrument design, concepts covered, initial population, and evidence of validity and reliability. One (KEE) reported evidence of neither validity nor reliability, one reported some form of evidence of reliability only (ATEEK), and one reported evidence of validity only (ECT). Given the limitations of the KEE and ATEEK we do not discuss them further in this section, but results of our analysis can be found in Table 2. The remainder of the instruments had at least one type of evidence of both validity and reliability reported in the original publication. All assessments included undergraduates, either majors or non-majors, at some point during development. The early version of the MATE assessed high school biology teachers, but a later version was used with undergraduates. The I-SEA and GAENE included high school students in addition to undergraduates during development.

Table 2 Summary of review of citations reporting new implementations of each instrument

Assessments measuring natural selection

The ECT developed by Bishop and Anderson (1990) clearly served as the foundation for a number of subsequent studies, and the ORI (open response instrument) in particular noted questions coming directly from the ECT. The original instrument developed by Bishop and Anderson consisted of six questions and claimed to measure understanding of natural selection among non-major undergraduates at a large midwestern university. The authors indicated that interrater reliability (IRR) was evaluated, stating that reliability was checked “by comparing the codes assigned to randomly selected student responses by two different coders” and that if disagreements occurred “coding was modified to produce better agreement”. However, no statistic for IRR was provided. The authors also report two sources of evidence of validity: review of textbook material as content evidence, and student interviews as substantive evidence.
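To make concrete what an IRR statistic adds over a description of the coding procedure, the sketch below computes both raw percent agreement and Cohen’s kappa (listed in Box 1) for two coders. The function names and the coded responses are hypothetical; they are not data from Bishop and Anderson (1990) or any other study reviewed here.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of responses given the same code by both raters."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters corrected for chance."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal proportions per code.
    expected = sum(freq_a[code] * freq_b[code] for code in freq_a) / n ** 2
    if expected == 1:
        raise ValueError("both raters used a single identical code; kappa is undefined")
    return (observed - expected) / (1 - expected)


# Hypothetical codes assigned to four constructed responses by two coders.
coder_1 = ["scientific", "scientific", "naive", "naive"]
coder_2 = ["scientific", "naive", "naive", "naive"]

agreement = percent_agreement(coder_1, coder_2)  # 0.75
kappa = cohens_kappa(coder_1, coder_2)           # 0.5
```

In this small example the coders agree on three of four responses, yet kappa is only 0.5 because half of that agreement would be expected by chance. This is why percent agreement and kappa are not interchangeable when reported.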

The ACORNS instrument, developed following the ORI (open response instrument), which was itself based on the ECT, evaluates student “ability to use natural selection to explain evolutionary change” across a range of conditions (trait gain, trait loss, etc.). The instrument focuses on assessing elements of natural selection and non-scientific explanations (misconceptions) but also provides the option of scoring student responses for non-adaptive explanations for change (e.g. random changes in response to sampling error and drift). Nehm et al. (2012) report evidence of internal consistency by measuring Cronbach’s alpha for key concepts and misconceptions (0.77 and 0.67 respectively) and report that IRR was greater than 80%. Content validity was assumed because the questions represent a number of possible biological scenarios. Evidence of substantive validity was provided by student interviews, and external structure was evaluated by comparing student responses on ACORNS questions to scores on the CINS. Using the ACORNS does require training in how to score student responses. Alternatively, instructors can use EvoGrader (Moharreri et al. 2014), a machine-learning program that has been trained to score ACORNS questions.

The CINS was originally developed as a 20-question instrument with evidence of validity and reliability provided for undergraduate non-majors in the southwestern region of the United States. The authors used Kuder-Richardson 20 to examine reliability, obtaining measurements of 0.58 and 0.64 on initial sections of the instrument. A good classroom instrument should have a reliability coefficient of 0.60 or higher. Expert reviewers provided evidence of content validity, interviews were used to evaluate whether student responses on the multiple-choice questions reflected their thinking, and principal component analysis (PCA) was used to examine internal structure. The authors also claimed that the instrument was generalizable because the original population used during development came from “large, ethnically diverse, community colleges”. However, specific information about the demographics of the population was not provided and this claim has not been directly tested.
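For readers unfamiliar with Kuder-Richardson 20, the following minimal sketch shows how the coefficient could be computed for right/wrong (1/0) scored multiple-choice items. The function name `kr20` and the response matrix are hypothetical and do not reproduce the CINS data.

```python
from statistics import pvariance


def kr20(responses):
    """Kuder-Richardson 20 for dichotomous items.

    responses: list of respondents, each a list of 0/1 item scores.
    KR-20 = k / (k - 1) * (1 - sum(p_i * q_i) / variance of total scores),
    where p_i is the proportion answering item i correctly and q_i = 1 - p_i.
    """
    n = len(responses)
    k = len(responses[0])
    if k < 2:
        raise ValueError("KR-20 needs at least two items")
    p = [sum(row[i] for row in responses) / n for i in range(k)]
    sum_pq = sum(pi * (1 - pi) for pi in p)
    total_var = pvariance([sum(row) for row in responses])
    if total_var == 0:
        raise ValueError("total scores do not vary; KR-20 is undefined")
    return (k / (k - 1)) * (1 - sum_pq / total_var)


# Hypothetical scored responses: 4 students x 3 items (1 = correct).
scored = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
coefficient = kr20(scored)       # 0.75
meets_0_60 = coefficient >= 0.60  # threshold quoted in the text above
```

KR-20 is the special case of coefficient alpha for dichotomous items, so applying it to Likert-type responses would not be appropriate.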

The CANS is composed of 24 multiple choice questions designed to measure five concepts related to natural selection: variation, selection, inheritance, mutation, and how these elements work together to cause evolution. Initial development was iterative, relying on student interviews and expert review to assess evidence of substantive and content validity, respectively. Kalinowski et al. (2016) also applied Item Response Theory to assess how well sets of questions assessed the same concept and if student responses fit a priori expectations. The authors also compared scores before and after instruction to evaluate reliability, reporting Cronbach’s alpha before and after instruction (0.87 and 0.86, respectively), providing good evidence of reliability. The authors estimated that 88% of the variance in test scores in the experimental classroom was due to differences in student understanding of natural selection.

Assessments measuring additional evolutionary concepts

We found a single instrument purported to measure student understanding of macroevolution. The MUM was developed to measure student understanding of five essential concepts related to macroevolution: deep time, phylogenetics, fossils, speciation, and nature of science. Development of the instrument relied on responses generated by undergraduates taking courses in either introductory biology or upper-level evolution at a large southeastern university. Textbook analysis and expert reviews were used as evidence of content validity. The authors used Cronbach’s alpha as a measure of internal consistency and report a value for the entire sample that is considered acceptable (0.86). However, Cronbach’s alpha varied across their samples, ranging from values considered questionable to values considered acceptable, possibly suggesting the instrument provides better evidence for some populations than others. No additional evidence was provided.

The GeDI was developed to measure upper-level biology majors’ understanding of genetic drift as a process of evolutionary change. The authors used an iterative development process that included open-ended questions, student interviews, multiple expert reviews, and item analysis. The final instrument was also evaluated for evidence of reliability. A coefficient of stability of 0.82 was reported in a test–retest administration. Cronbach’s alpha varied across populations (0.58–0.88), and the authors note that the concepts covered in the instrument align best with upper-level evolution courses.

The EvoDevo CI is a concept inventory developed specifically to measure student understanding of six core concepts related to evolutionary changes caused by development. The authors relied on iterative development that included expert review, student interviews, testing and item revision. They reported Cronbach’s alpha, calculated for different groups, as a measure of whether the instrument assessed the intended construct among biology majors. In addition, tests for evidence of reliability reported good stability as measured by Pearson correlation of 0.960, P < 0.01.
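The stability evidence reported for the GeDI and the EvoDevo CI rests on correlating scores from two administrations of the same instrument. The sketch below shows one way such a test-retest coefficient could be computed as a Pearson correlation. The function name `pearson_r` and the paired scores are hypothetical and are not drawn from either instrument’s data.

```python
from math import sqrt


def pearson_r(first, second):
    """Pearson correlation between paired scores from two administrations."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equally long lists with at least two scores")
    n = len(first)
    mean_1, mean_2 = sum(first) / n, sum(second) / n
    covariation = sum((a - mean_1) * (b - mean_2) for a, b in zip(first, second))
    spread_1 = sqrt(sum((a - mean_1) ** 2 for a in first))
    spread_2 = sqrt(sum((b - mean_2) ** 2 for b in second))
    if spread_1 == 0 or spread_2 == 0:
        raise ValueError("scores do not vary; correlation is undefined")
    return covariation / (spread_1 * spread_2)


# Hypothetical total scores for the same six students, tested twice.
test_scores = [12, 15, 9, 18, 14, 11]
retest_scores = [13, 14, 10, 17, 15, 10]
stability = pearson_r(test_scores, retest_scores)
```

A high coefficient indicates that students keep roughly the same rank order across administrations. It does not show that mean scores are unchanged, so it is best read alongside the other sources of evidence in Box 1.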

Assessments purporting to measure acceptance of evolution

The MATE was designed to measure overall acceptance of evolutionary theory by assessing perceptions of concepts considered fundamental to evolution. Originally developed using a population of high school biology teachers (Rutledge and Warden 1999), it was then updated using undergraduate non-majors (Rutledge and Sadler 2007). Both versions include 20 items assessed using a five-point Likert scale. The original version published by Rutledge and Warden (1999) reported internal consistency using Cronbach’s alpha (0.98) as evidence of reliability, expert review by a panel of five experts as evidence of content validity, and a principal factor analysis as evidence of internal structure validity. The second version of the MATE examined reliability of the instrument for a population of non-major undergraduate students and reported a Cronbach’s alpha reliability coefficient of 0.94 as evidence of internal consistency. No additional evidence was reported.

The EALS Long-Form was developed to assess predominant regional belief systems and their roles in science understanding and attitudes, particularly as they pertain to evolution, drawing from previous literature and published instruments to generate Likert scale items. The EALS Short-Form was then tested on undergraduates in an introductory biology course. Both forms included items for the 16 lower-order constructs and then used confirmatory analysis to determine the six higher-order constructs. We suspect the EALS Short-Form is more likely to be used, and therefore provide a summary here. Additional information on the Long-Form can be found in Table 2. The authors reported a range of alpha coefficients for the 16 lower-order constructs as evidence of internal consistency and suggested loadings from a confirmatory factor analysis provided evidence of internal structure validity.

The I-SEA was also designed to measure student acceptance of evolution, based on three subscales: microevolution, macroevolution, and human evolution. Development included using open-ended questions and student interviews. An initial 49-item Likert scale instrument was developed and tested, and then modified to the current 24-item instrument. The overall Cronbach’s alpha was 0.95, providing evidence of internal consistency. Experienced biology teachers, science teacher educators, and college biology faculty served as expert reviewers, providing evidence of content validity. Evidence of internal structure was measured using an exploratory factor analysis. However, only loadings for the first four items for each subscale were reported, making it difficult to fully evaluate these measures. The populations used in development included high school students and undergraduates, predominantly at institutions in the western United States.

The most recently published instrument that measures acceptance of evolution is the GAENE, which was specifically designed to measure only acceptance of evolution, defined as “the mental act or policy of deeming, positing, or postulating that the current theory of evolution is the best current available scientific explanation of the origin of new species from preexisting species”. The GAENE was also developed based on other instruments, relying on extensive interviews and testing, followed by multiple rounds of revision and expert feedback. Smith et al. (2016) reported Cronbach’s alpha of 0.956 for later versions, providing excellent evidence of internal consistency. Evidence of validity was provided by Rasch analysis, demonstrating discrimination between respondents with low and high levels of acceptance, and by PCA that supported a unidimensional structure accounting for 60% of the variance. A range of populations were used in developing the instrument, including high school students and undergraduates at a range of institutions.

Secondary uses of focal instruments

Using the “cited by” link provided in Google Scholar for each of the publications associated with the 13 focal instruments, we examined over 2000 peer-reviewed citations that made reference to one or more of the 13 focal instruments. Many of the citations simply referenced the publication but did not use any portion of the instrument. We identified 182 studies that used at least one of the 13 instruments we reviewed. Figure 1 shows the relative frequency of re-use of each of the instruments, ranging from 0 (CANS) to 70 (MATE). We defined a new use of the instrument as using a different version (altered measurement scale or item set, item rewording, or language translation) and/or administering the instrument to a new population. Our review found that most new uses of the instruments did involve administration to a new population and/or the use of a revised version, particularly if the instrument was published more than 5 years ago (Fig. 2, Table 2). Figure 2a shows the proportion of studies that indicated a new use of the instrument for six of the 13 instruments. Figure 2b shows the proportion of these new uses that reported new evidence of reliability or validity. Figure 2 shows only a subset of the instruments because a number of instruments were so recently published that there have been few secondary uses. Table 2 summarizes all data, indicating the specific types of reliability and validity evidence provided. Additional file 1: Table S1 is a searchable database with additional details for each of the secondary uses of the instruments.

Fig. 1

Proportional re-use of all instruments. For example, the MATE was used in 70 subsequent studies, the I-SEA in only three. Gray text indicates the assessment has yet to be used in a new study. Instruments are organized according to construct (content and psychology dimension)

Fig. 2

a The proportion of instrument uses categorized by type of use, e.g. the proportion of secondary uses of the ECT that altered the original version. b The proportion of secondary uses that reported additional or new evidence of reliability or validity, whether for a new population or new implementation of the instrument

The ECT, first published by Bishop and Anderson (1990), was initially used with undergraduate non-majors. Our analysis suggests the instrument (or some approximation of the instrument) has been used in 27 subsequent studies. Two studies (Nehm and Reilly 2007; Andrews et al. 2011) altered the ECT, three studies administered the complete instrument to a new population (Settlage 1994; Demastes et al. 1995), and 20 of the re-administrations of the ECT involved a new population and used only a subset of the original questions presented in Bishop and Anderson (1990). Included in this category were studies that report using the ORI (open response instrument) because Nehm and Reilly (2007) report modifying questions from Bishop and Anderson (1990) in creating the ORI. We also found references to the ACORNS questions as being derived from the ECT; however, we evaluated the ACORNS separately. In many cases, reuse of the ECT did not include any new evidence of reliability and validity (Fig. 2b). The exceptions involved uses of the ORI, for which new implementations often included new measures (Ha et al. 2012; Nehm and Schonfeld 2007). For example, Nehm and Schonfeld (2007) provided additional evidence of both reliability (i.e., internal consistency and IRR) and validity (e.g. content and substantive) for students in a graduate teacher education program.

We identified 31 publications that referenced using the Concept Inventory for Natural Selection (CINS). One used some version of the instrument (Pope et al. 2017), most likely administering a portion of the full instrument; 19 administered the instrument to a new population; and ten reported using the instrument with a new population and changing the question structure. A few of these studies reported additional evidence of reliability and validity. Athanasiou and Mavrikaki (2013) reported evidence of reliability (Cronbach’s alpha) and validity (construct validity using PCA) for biology and non-biology majors in Greece. Nehm and Schonfeld (2008) reported additional evidence of convergent validity (between the CINS and ORI) and discriminant validity for undergraduate biology majors in the northeast. Ha et al. (2012) also looked at the correlation between scores on the ORI and the CINS, and reported Cronbach’s alpha for undergraduates in preservice biology. Weisberg et al. (2018) administered the CINS to a sample from the general public and reported Cronbach’s alpha. Finally, Pope et al. (2017) also reported Cronbach’s alpha and interrater reliability for biology majors in the northeast.
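Several of the studies above report Cronbach’s alpha as evidence of internal consistency. As a rough illustration of what that statistic summarizes, the sketch below computes alpha from a table of item scores using the standard formula, alpha = k/(k - 1) × (1 - sum of item variances / variance of total scores), where k is the number of items. The function name `cronbach_alpha` and the response data are hypothetical and are not drawn from any of the reviewed studies.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Return Cronbach's alpha for a list of respondents.

    `scores` is a list of rows; each row holds one respondent's
    numeric score on every item (all rows the same length).
    """
    n_items = len(scores[0])
    if n_items < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two respondents")
    if any(len(row) != n_items for row in scores):
        raise ValueError("every respondent must answer every item")

    # Sample variance of each item (column) across respondents.
    item_variances = [variance(column) for column in zip(*scores)]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")

    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses: four respondents, three dichotomously scored items.
example_scores = [
    [1, 1, 1],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
]
print(round(cronbach_alpha(example_scores), 3))
```

Because alpha depends on the item variances and covariances observed in a particular sample, a value reported for one population does not automatically transfer to another, which is why re-estimating it for each new use is informative.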

The ACORNS instrument has been used in nine subsequent studies. The ability to vary the open-ended questions allows researchers to create new versions without altering the general framework of the instrument; therefore, none of the subsequent uses were considered new versions. Nehm et al. (2012) stated that the population used to assess reliability and validity was undergraduates at a midwestern university. The instrument was then used in subsequent studies, most commonly listing the population as undergraduate biology majors. It was therefore not possible to determine if the re-uses of the instrument qualified as new populations. However, all of these studies did report IRR as evidence of reliability.
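Inter-rater reliability for open-response instruments can be quantified in several ways, and the studies summarized above do not necessarily use the same statistic. As one common option, and purely as an assumption for illustration, the sketch below computes Cohen’s kappa for two raters who each assign one category to every response. The function name `cohens_kappa`, the category labels, and the codes are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Return Cohen's kappa for two raters' categorical codes.

    `rater_a` and `rater_b` are equal-length sequences; position i
    holds each rater's code for the same student response.
    """
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("raters must code the same non-empty set of responses")

    n = len(rater_a)
    # Observed proportion of responses on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected by chance, from each rater's marginal frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1:
        raise ValueError("both raters used a single identical code; kappa is undefined")

    return (observed - expected) / (1 - expected)


# Hypothetical codes for four open-ended responses.
codes_a = ["key concept", "key concept", "naive idea", "naive idea"]
codes_b = ["key concept", "naive idea", "naive idea", "naive idea"]
print(cohens_kappa(codes_a, codes_b))
```

Kappa corrects raw percent agreement for the agreement expected by chance, so it is usually lower than the simple proportion of matching codes.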

The MUM has been used infrequently, perhaps because of issues identified by Novick and Catley (2012) or because instructors are often more interested in students’ understanding of natural selection. However, Romine and Walter (2014) administered the MUM to undergraduates enrolled in non-majors’ biology and found internal construct validity to be strongly supported using Rasch analysis, although they found a single construct as opposed to the five in the original study. Of the studies that do report using the instrument, two report using slightly modified versions and one modified the version and administered it to a new population.

At the time of our analysis, the concept assessment of natural selection (CANS), the knowledge of evolution exam (KEE), the Assessment Tool for Evaluating Evolutionary Knowledge (ATEEK), the genetic drift inventory (GeDI), and the EvoDevo Concept Inventory (EvoDevoCI) had rarely been used, and no additional evidence of reliability or validity had been provided for these instruments.

For the MATE, of the total 88 new uses of the instrument, 48 of the implementations provided new evidence of reliability while 18 provided new evidence of validity, although with widely varying rigor (Fig. 2b). As one of the original and seemingly most versatile instruments, the MATE has been implemented in quite diverse contexts and forms, including use in fourteen countries and translation into five other languages, often with multiple independent translations. The primary use of the MATE outside the USA and in a language other than English is in Turkey and in Turkish, with likely six independent translations. Many populations that differ from the original in educational background have been assessed, including early childhood or primary school teachers, university faculty, and museum visitors. The number of items administered has fluctuated between 4 and 27 through item reduction, splitting, and/or combination with other items (not including other identified instruments). Finally, the measurement scale has varied between four-, six-, and seven-point Likert scales. Notable implementations that introduce validity and reliability evidence are largely limited to Turkish populations (Akyol et al. 2010, 2012a, b; Irez and Özyeral Bakanay 2011; Tekkaya et al. 2012; Yüce and Önel 2015), with two notable studies (Manwaring et al. 2015; Romine et al. 2017) providing the strongest evidence of internal structure validity with populations similar to the original American undergraduate implementations. The evidence regarding validity for the MATE is sparse relative to its diversity of implementations, an undesirable state for measurement standards.

We found eight additional uses of the Evolution Attitudes and Literacy Survey (EALS), either the short or long form. Three studies reported using the EALS in the original format and administered it to similar populations as those used in the initial studies. One altered the format and another four changed both the version and the population. Of these only one reported new evidence of reliability or validity (Mead et al. 2015).

The Inventory of Student Evolution Acceptance (I-SEA) and the Generalized Acceptance of Evolution Evaluation (GAENE) have also not been used very often. In the case of the I-SEA, only one publication reported using the instrument, and it was not possible to determine if it involved a new population or a new version. However, no additional evidence of reliability or validity was provided. We suspect the GAENE has not been used because it was so recently published. However, the strong evidence offered in the initial description of the instrument suggests it may be used more often in the future.


The ability of any instrument to measure student understanding is dependent on a number of factors, for example, the development process, initial population assessed, evidence of validity and reliability, evaluation of what we think it measures, and consistency in measurement (Campbell and Nehm 2013). We found that new uses of the original instruments overall provided sparse new evidence of validity or reliability, and we encountered various issues while evaluating the instruments and their subsequent reuse. These included the narrow character of the original population (e.g. MATE) and the failure to adhere to measurement standards by entirely lacking validity and reliability evidence (e.g. KEE). In reviewing subsequent uses, it was often difficult to ascertain what portion and/or version of the original instrument was used. For example, some studies simply referenced using questions from Bishop and Anderson (1990) but did not indicate which questions were used (Gregory and Ellis 2009). Further, the authors of the MATE have published four distinct versions (Rutledge and Sadler 2007, 2011; Rutledge and Warden 1999, 2000) that differ with respect to item wording and/or ordering, and this fact has remained unremarked upon in the literature.

Use of the MATE is further complicated by the fact that, although there is evidence of validity, it is not clear what is meant by “acceptance” (Smith 2010a). More recently, the internal structure of the MATE in terms of the number and identity of measurable constructs (i.e., named sets of items measuring the same concept) has been found to be unclear. Wagler and Wagler (2013) challenged the content and internal structure validity for the MATE, and studies report the MATE represents one (Rutledge and Warden 1999; Rissler et al. 2014; Deniz et al. 2008), two (Romine et al. 2017), four (Manwaring et al. 2015), six (untested: Rutledge and Sadler 2007), or an unidentifiable number of constructs (e.g. Wagler and Wagler 2013; Hermann 2012, 2016; Rowe et al. 2015). However, Romine et al. (2017) have suggested the MATE is psychometrically sound.

We also encountered published debates regarding validity, including content and substantive validity, for the MUM (i.e., Novick and Catley 2012; Nehm and Kampourakis 2014). Novick and Catley (2012) found significant issues with respect to validity evidence for the MUM, suggesting it does not adequately measure student understanding of macroevolution. However, Romine and Walter (2014) challenged the findings of Novick and Catley (2012) suggesting that their analysis provided evidence that the MUM is a psychometrically sound instrument. These debates emphasize again the importance of testing any instrument for evidence of reliability and validity when using it in a new implementation.

Instruments developed more recently (GeDI, EvoDevoCI, CANS, GAENE) have not yet been used widely. However, we note that the studies describing them included relatively broad initial populations in their development and provided multiple lines of evidence for both reliability and validity, suggesting these instruments may be useful across a wide range of future implementations.

Conclusions and recommendations

The focus on evaluating teaching and learning in undergraduate biology has led to the creation of a number of different instruments that can be used to assess student understanding and acceptance of evolution. However, it is clear that examining each instrument for evidence of reliability and validity for a particular intended use is important for being able to make accurate and valid inferences. Our analysis of published instruments provides useful information to consider. We strongly recommend that research on student understanding and acceptance of evolution include continued evaluation. For example, owing to its popularity in the literature, we have specific recommendations for readers if they intend to administer the MATE. The authors’ most recent version (Rutledge and Sadler 2011) is the soundest grammatically, although further study on this is warranted. Therefore, this English version is the one most highly recommended as a starting point if modifications are desired due to cultural incongruence, ESL (English Second Language) interpretation, non-English usability, neutrality avoidance, etc. Doing so would maintain adherence to measurement standards and aid comparison within the literature by reducing the increasing diversity of versions lacking any, let alone adequate, evidence of validity and reliability. However, unease regarding the content and internal structure validity for the MATE (see above) was a driving factor in the creation of alternative instruments to measure acceptance (i.e., EALS, I-SEA, GAENE). The GAENE in particular went through multiple iterations, included a broad population in its testing, and meets criteria for measuring “acceptance of evolution” (Smith et al. 2016), in addition to having evidence of reliability and validity.

In addition to concerns about evidence of validity and reliability, many studies reported using only portions of a particular instrument. In some cases, however, it may be important to use the instrument as developed—administering all of the items and using their original wording and measurement scale—if one wishes to draw comparisons or rely on previous evidence of validity and reliability for similar populations. While some forms of validity (for example substantive or content) may not be affected, instruments are developed to measure a particular construct, or set of related constructs, and changing the structure of the assessment may influence how well it measures the constructs of interest.
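One way to get a rough sense of why administering only a subset of items can matter is the Spearman-Brown prediction formula, which estimates how reliability changes when a test is shortened or lengthened. This is a sketch under a strong assumption that the removed or added items are parallel to the retained ones, which real instruments may not satisfy, and it is not an analysis used in this review. The function name `spearman_brown` and the numbers are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predict reliability after changing test length.

    `reliability` is the reliability of the original instrument (0 to 1).
    `length_factor` is new length / original length, e.g. 0.5 when only
    half of the items are administered. Assumes parallel items.
    """
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must be between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical case: an instrument with reliability 0.80, half the items used.
predicted = spearman_brown(0.80, 0.5)
print(round(predicted, 2))
```

The prediction is only a planning aid; it does not replace estimating reliability directly from responses to the shortened instrument in the new population.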

We strongly support extending measurement criteria to all the instruments reviewed here and recommend against using instruments for which the original publication did not report evidence of reliability and validity, or for which this evidence is weak. Researchers should review the literature, paying particular attention to alignment between learning goals and choice of instrument. Furthermore, as instruments are modified and/or used on new populations, measurement standards should be adhered to, and reported in the literature. Such reports will further extend the uses of these instruments and strengthen the ability of researchers to draw meaningful conclusions from studies.

In addition, we want to recognize that many of the instruments developed more recently (e.g. CANS, GeDI, EvoDevoCI, GAENE) include multiple lines of evidence referencing strong reliability and validity, and these should be used as models for continued development of new instruments. Developers of scientific instruments need to clearly lay out under what conditions their assessment is to be used and to encourage those using the assessment outside of those parameters to gather more evidence. Ziadie and Andrews (2018) point out that any assessment should include the dimensions of the topic that are important to assess and include consistent methodology and interpretation of results.

Our review highlights the importance of applying measurement standards to instruments, hopefully helping researchers to assess student understanding and acceptance of evolution. We have provided a supplemental database that allows researchers to easily examine a particular instrument, and any subsequent uses that may help determine if it is an appropriate instrument for a given population. We cannot emphasize enough, however, that it is imperative that any new implementation of these instruments be tested according to accepted measurement criteria and that researchers publish any new evidence of reliability and validity.



Abbreviations

ACORNS: assessing contextual reasoning about natural selection

ATEEK: assessment tool for evaluating evolution knowledge

CANS: concept assessment of natural selection

ECT: evolution concept test

CINS: concept inventory of natural selection

EALS: Evolutionary Attitudes and Literacy Survey

ESL: English second language

EvoDevoCI: evolutionary developmental concept inventory

GAENE: generalized acceptance of evolution evaluation

GeDI: genetic drift inventory

IRR: inter-rater reliability

I-SEA: inventory of student acceptance of evolution

KEE: knowledge of evolution exam

MATE: measure of acceptance of the theory of evolution

MUM: measure of understanding of macroevolution

ORI: open response instrument

PCA: principal component analysis



Authors’ contributions

LSM, CK, AW and KS all contributed equally to reviewing the literature. LSM took the lead on writing. CK and AW contributed specific sections associated with instruments they reviewed. KS presented original results at Evolution 2016. All authors read and approved the final manuscript.


The BEACON Evo Ed Curriculum group provided helpful feedback on initial discussions of the manuscript. Initial ideas for the project were also discussed with Dr. Ross Nehm, whom we thank for emphasizing the importance of measurement standards.

Competing interests

The authors declare that they have no competing interests.

Availability of data and materials

We have created a searchable Google spreadsheet for the Additional file 1: Table S1. Please see file at the following link:


Howard Hughes Medical Institute; National Science Foundation, Grant Numbers: OIA-0939454; DBI-144683.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author information


Corresponding author

Correspondence to Louise S. Mead.

Additional file

Additional file 1.

Searchable database of an overview of each instrument reviewed and characterization of any published studies that report using the instrument, specifying additional evidence of reliability and validity for new implementation.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article


Cite this article

Mead, L.S., Kohn, C., Warwick, A. et al. Applying measurement standards to evolution education assessment instruments. Evo Edu Outreach 12, 5 (2019).
