RQ1: Does the SECM adhere to well-accepted criteria of robust measurement?
To address RQ1, we modeled each scale using an approach that is appropriate for the type of response data (i.e., ordered) and the structure of the latent construct (i.e., continuous). For both of these considerations, Rasch analysis is appropriate (de Ayala 2019; Liu 2010). Specifically, we modeled the SECM using a partial credit Rasch model (i.e., item + item*step; PCM2 in TAM) with the R package Test Analysis Modules (TAM, v. 2.10-24, Robitzsch et al. 2018). Rasch analysis, and IRT more generally, estimates respondents’ latent measures using a probabilistic approach, and thus does not claim to measure a true score. Rather, a respondent’s likelihood of selecting a particular response is based on the difference between their measure on the trait and each item’s level of agreeability (or difficulty). These approaches theorize that, in order to generate robust measures of a latent construct, the following characteristics of the underlying data must be present: (1) acceptable item fit, (2) acceptable rating scale functioning, (3) unidimensionality, (4) acceptable item and person reliability, (5) acceptable person-item alignment (Wright maps), and (6) measurement invariance (Boone 2017; Boone et al. 2014). These six criteria may be considered a benchmark for productive measurement using the Rasch model and, if met, suggest that the instrument can generate robust measures of the latent construct (Borsboom et al. 2005). Here, “measures” refers to both an item measure (i.e., the agreeability or difficulty of an item) and a person measure (i.e., the agreeability or ability of a person). Item and person measures are on the same logit scale and can be compared to each other (Boone et al. 2014).

In Rasch analysis, unlike IRT, the item measure is the only item parameter considered in the calculation of the person measure (person measures are estimated via weighted maximum likelihood estimation [WLE], conditional on the item parameters). IRT models, on the other hand, also include other parameters that can be added or removed to improve the fit of the model. Rasch analysis assumes that no additional parameters are needed for productive measurement of a latent construct (Boone et al. 2014). Therefore, although Rasch analysis and IRT are considered to be conceptually different approaches, the Rasch model is mathematically equivalent to a 1-parameter (1PL) IRT model (Boone et al. 2014). A benefit of the strict 1-parameter assumption of the Rasch model is that it calibrates instruments using an equivalent standard (Romine et al. 2017); the probability of selecting a particular level of conflict for an item depends only on the difference between the agreeability of the item and the level of conflict of the respondent. Furthermore, this approach converts raw, ordered data to a continuous linear scale, making Rasch and IRT measures suitable for parametric statistical analyses. We briefly summarize each of these evaluation criteria below.
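For illustration, the core model-fitting step can be sketched in R as follows. This is a minimal sketch rather than our actual analysis script; the object secm_resp is a hypothetical data frame of ordered SECM item responses.

```r
# Minimal sketch of fitting a partial credit Rasch model (PCM2) with TAM.
# 'secm_resp' is a hypothetical data frame of ordered item responses (0, 1, 2, ...).
library(TAM)

pcm_mod <- tam.mml(resp = secm_resp, irtmodel = "PCM2")

summary(pcm_mod)             # item (xsi) parameters on the logit scale
persons <- tam.wle(pcm_mod)  # Warm's weighted likelihood estimates of person measures
```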
Item fit
To address whether the items that compose the instrument have an acceptable fit to model expectations (RQ1.1), we analyzed the information-weighted (i.e., Infit) and unweighted (i.e., Outfit, which is sensitive to outliers) mean square fit (MNSQ) statistics for each item. In alignment with psychometric standards, we considered MNSQ fit values of 0.5–1.5 to be acceptable (Boone et al. 2014). Values slightly outside this range indicate that an item does not meaningfully contribute to measurement, and values > 2 indicate that the item degrades measurement (Boone et al. 2014).
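A sketch of how these statistics can be obtained in TAM is shown below (pcm_mod refers to the hypothetical fitted model from the sketch above).

```r
# Sketch: Infit and Outfit MNSQ statistics for each item parameter of the fitted PCM.
fit <- tam.fit(pcm_mod)
fit$itemfit[, c("parameter", "Infit", "Outfit")]

# Flag parameters falling outside the 0.5-1.5 acceptability range
subset(fit$itemfit, Infit < 0.5 | Infit > 1.5 | Outfit < 0.5 | Outfit > 1.5)
```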
Rating scale functioning
To evaluate whether the rating scale of the SECM functions as expected (RQ1.5), we used two approaches. First, we examined the correspondence between participants’ answer choices and their overall Rasch person measures (Boone et al. 2014; Sbeglia and Nehm 2018, 2019); well-functioning items should show a high correspondence. Second, we examined the Rasch-Andrich thresholds (also called step parameters or Andrich deltas), which represent the locations on the Rasch category probability curves where the curves for adjacent answer options meet, and indicate the point at which two adjacent answer categories are equally probable (Linacre 1999). Thresholds that are close together, or not in the expected sequential order (e.g., “strongly agree,” “disagree,” “agree”), are said to be disordered. Depending upon the cause of the anomaly, threshold disorder may or may not indicate that the item is unable to predictably discriminate abilities on the latent trait (Adams et al. 2012; Andrich 2013; Boone et al. 2014). Collectively, we used rating scale functioning and item fit to assess the overall functioning and appropriateness of each item in the SECM.
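The first approach can be sketched as follows, under the assumption that the mean person measure should increase monotonically across answer categories; object names follow the earlier sketches, and item1 is a placeholder column name.

```r
# Sketch: mean Rasch person measure for respondents selecting each answer
# category of a single (placeholder) item; well-functioning categories should
# show monotonically increasing means.
theta <- persons$theta
tapply(theta, secm_resp$item1, mean, na.rm = TRUE)
```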
Dimensionality
The items of an instrument must measure only one construct or topic (i.e., be unidimensional) in order for the resulting latent measures to indicate the relative position of respondents along the same trait. Therefore, it is necessary to evaluate the dimensionality of the item sets. We conducted two analyses to determine if the instrument is best modeled as one dimension (all conflict scales combined) or three dimensions (each scale on a separate dimension) (RQ1.2). First, we used a principal components analysis (PCA) of the residuals from a unidimensional Rasch model to evaluate patterns of unexplained variance. An eigenvalue greater than 2 for the first contrast indicates sufficient unexplained variation among the residuals to suggest the possibility of additional, unmodeled dimensions (Boone et al. 2014). We also plotted the loading of each item on the first PCA contrast against the agreeability of that item to visualize the pattern of shared unexplained variation among items. Items that cluster together can be hypothesized to represent a distinct dimension. This approach allows additional dimensions to be discovered based on patterns of unexplained variation.
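A hedged sketch of this residual PCA is shown below; it assumes that TAM’s IRT.residuals() method supplies standardized residuals for the fitted unidimensional model.

```r
# Sketch: PCA of standardized residuals from the unidimensional Rasch model.
res <- IRT.residuals(pcm_mod)
pca <- prcomp(na.omit(res$stand_residuals), scale. = TRUE)

pca$sdev[1]^2      # eigenvalue of the first contrast (> 2 suggests unmodeled dimensions)
pca$rotation[, 1]  # item loadings on the first contrast (plotted against item agreeability)
```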
Second, we used a likelihood ratio test to compare the relative fit of unidimensional and multidimensional models of the response data to Rasch expectations (see Robitzsch et al. 2018). In this approach, dimensions are hypothesized a priori and the resulting models are tested for data-model fit.
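In TAM, this comparison can be sketched with a Q-matrix that assigns items to dimensions; the column assignments below are placeholders for the three SECM scales.

```r
# Sketch: unidimensional vs. three-dimensional partial credit models.
# 'Q' assigns each of the nine (placeholder) SECM items to one of three dimensions.
Q <- matrix(0, nrow = ncol(secm_resp), ncol = 3)
Q[1:3, 1] <- 1   # personal-conflict items (placeholder positions)
Q[4:6, 2] <- 1   # family-conflict items
Q[7:9, 3] <- 1   # community-conflict items

mod_1d <- tam.mml(secm_resp, irtmodel = "PCM2")
mod_3d <- tam.mml(secm_resp, irtmodel = "PCM2", Q = Q)

anova(mod_1d, mod_3d)   # likelihood ratio test of relative data-model fit
```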
Item and person reliability
Item reliability quantifies the extent to which the instrument is able to consistently order items by their difficulties, and may be measured using the expected a posteriori/plausible value (EAP/PV) reliability (Bond and Fox 2001). Person reliability quantifies the extent to which an instrument is able to order respondents based on their abilities, and may be measured using Warm’s weighted maximum likelihood estimates (WLE) (Bond and Fox 2001). Reliabilities range from 0 to 1 and can be interpreted much like Cronbach’s alpha (Boone et al. 2017). Values > 0.70 are considered acceptable (Grigg and Manderson 2016; Yang et al. 2017). Collectively, these measures indicate the ability of the instrument to reliably order items by their agreeability and respondents by their level on the latent trait (RQ1.3).
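Both reliability estimates are available from the TAM objects created above (a minimal sketch).

```r
# Sketch: reliability estimates from the fitted model.
pcm_mod$EAP.rel          # EAP/PV reliability
persons <- tam.wle(pcm_mod)
persons$WLE.rel[1]       # WLE reliability (repeated across respondent rows)
```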
Person-item alignment
The alignment of an instrument to the sample in which it was administered indicates the level of measurement precision the instrument can achieve. Precise measurement occurs when the agreeability of items, or of the categories on the rating scale (for polytomous items), spans the full spectrum of respondent abilities, and precision declines when the items and respondents are less well aligned. Items or categories that differ in agreeability act like tick marks on a ruler that allow respondents to be binned based on their abilities. The fewer distinct tick marks on the ruler, the fewer bins respondents can populate, and the lower the precision of measurement. To measure how precisely the SECM measures the latent trait (RQ1.4), we visualized person-item alignment using Wright maps. Wright maps plot Rasch item difficulties against Rasch person measures. If the instrument is polytomous (i.e., not dichotomous) in nature, Thurstonian thresholds for each rating scale category may also be plotted for each item. Thurstonian thresholds are the locations on the Wright map where a respondent has a 50% probability of selecting a particular answer category (or higher) for an item. For this format of instrument, item agreeability is the mean of the Thurstonian thresholds (see Sbeglia and Nehm 2019 for more detail). Respondents with the highest measures on the latent trait are positioned at the top of the Wright map. Likewise, the items and thresholds with the highest agreeability measures are positioned at the top of the map; these are the hardest to endorse because even respondents at the top of the map have only a 50% probability of selecting a given answer (respondents with lower measures have lower probabilities).
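A sketch of how such a Wright map can be produced, assuming the WrightMap package and the hypothetical objects from the earlier sketches:

```r
# Sketch: Wright map of person measures against Thurstonian thresholds.
library(WrightMap)

thurst <- tam.threshold(pcm_mod)   # Thurstonian thresholds (items x categories)
wle    <- tam.wle(pcm_mod)

wrightMap(thetas = wle$theta, thresholds = thurst,
          label.items = colnames(secm_resp))
```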
Measurement invariance
Measurement invariance describes situations in which the underlying measurement structure of an instrument (e.g., item discrimination/factor loadings, item thresholds, residual variances, dimensionality) remains stable through time (or across groups) (van de Schoot et al. 2015). While test respondents are often expected to show a change in their amount of a particular latent trait through time (e.g., knowledge of evolution before and after taking a biology course), the underlying measurement structure of the instrument must remain stable in order for a comparison of latent measures to be meaningful (Lommen et al. 2014). To establish whether the SECM displayed measurement invariance from pre- to post-instruction (RQ1.6), we conducted a differential item functioning (DIF) analysis on the SECM items. An item displays DIF when respondents with equal abilities, but from different groups or time points, differ in their expected responses to the item. An item is “non-DIF” if respondents with equal abilities have the same expected response, regardless of group or time. A finding of non-DIF from the pre- to the post-survey would suggest measurement invariance, and would thus allow for the meaningful comparison of SECM measures across time. DIF may be calculated by running a multifaceted Rasch model in which the variable being examined (the facet; in this case, time) is modeled as having an interaction with each item (Robitzsch et al. 2018). To evaluate the significance of DIF, the absolute value of the t-ratio for the interaction parameter must be greater than 2. If the SECM does not exhibit DIF from pre- to post-course, it may be considered to have measurement invariance, and pre-post comparisons can therefore be made meaningfully.
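A hedged sketch of this facets model in TAM is shown below; all object names are placeholders (resp_prepost stacks the pre- and post-course response matrices, time_facet records the time point of each row, and person_ids identifies respondents).

```r
# Sketch: DIF across time via a multifaceted Rasch model with an item-by-time
# interaction. 'resp_prepost', 'time_facet', and 'person_ids' are placeholders.
dif_mod <- tam.mml.mfr(resp     = resp_prepost,
                       facets   = data.frame(time = time_facet),
                       formulaA = ~ item + item:step + item * time,
                       pid      = person_ids)

summary(dif_mod)   # |t| > 2 for an item:time parameter indicates DIF
```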
RQ2: Are respondents interpreting items as anticipated?
In order to gather evidence to test the claim that respondents were interpreting SECM items as anticipated (i.e., substantive validity evidence), a sample of 619 students who completed the SECM was also asked to answer a follow-up question. This question was used to examine the correspondence between the intended interpretation of the “community” item and participants’ actual definitions of community. After answering the “community” item, respondents were asked to select the specific groups that they considered to be part of their community. Each respondent was allowed to choose and rank up to three of the following options, or no option at all: (1) My friends at college, (2) My friends from high school, (3) My significant other or partner, (4) People in my major or professional track, (5) People from my race group, (6) People from my neighborhood, (7) People from my church or who share my religion, (8) People from my place of work, and (9) People from my online social network. The first-ranked choice indicated the group most important to one’s community.
We performed two analyses. First, we analyzed the correspondence between our intended interpretation of the community item (see above) and participants’ actual chosen definitions by evaluating the proportion of the sample that selected “Not applicable” for one or more of the three specific community categories. This response was interpreted as indicating that the categories of community defined in our conceptual framework and offered to students were not well matched to their definition of community. Second, we analyzed whether respondents defined their communities similarly to one another by evaluating whether a subset of categories was selected more frequently than others, and whether this pattern differed by conflict level. A 2-sample z-test was used to test for the equality of proportions between high- and low-conflict respondents. For this analysis, respondents were separated into high and low conflict categories based on whether their Rasch measures were above or below the population’s mean conflict level. Overall, these analyses on a large sample were used to test the claim that respondents were interpreting the item as anticipated and that respondents from different conflict groups were interpreting the features of the items as designed. We used a critical p-value of 0.01 for all analyses.
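The equality-of-proportions test can be sketched with base R’s prop.test; the counts below are placeholders, not results.

```r
# Sketch: 2-sample z-test for equality of proportions selecting a given
# community category among high- vs. low-conflict respondents.
# 'x_high', 'x_low', 'n_high', and 'n_low' are placeholder counts.
prop.test(x = c(x_high, x_low), n = c(n_high, n_low), correct = FALSE)
# Significance evaluated against a critical p-value of 0.01
```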
RQ3: Are latent SECM measures convergent with measures of similar constructs?
To address RQ3, we used a Spearman correlation to compare each respondent’s latent measure of perceived conflict between their family and evolutionary ideas (i.e., the SECM Family item set) with their response to the modified IOS item. As described above, the modified IOS item asked about the perceived compatibility between respondents’ families and their evolutionary ideas.
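This correlation can be sketched as follows, with placeholder vectors for the Family Rasch measures and the IOS responses.

```r
# Sketch: Spearman correlation between SECM Family measures and the modified IOS item.
# 'family_theta' and 'ios_response' are placeholder vectors.
cor.test(family_theta, ios_response, method = "spearman")
```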
RQ4: Does the SECM contribute to the explanation of evolution acceptance above and beyond the contributions of religiosity and evolution knowledge?
To address RQ4, we shifted our approach from a Rasch framework to a structural equation modeling (SEM) framework. Whereas Rasch or IRT analysis is a preferred approach when the test and its categorical items are the focus of study (Wright 1996), latent variable path analysis (LVPA; an SEM method) is preferred when modeling putative causal relationships among latent variables (Mueller and Hancock 2019). LVPA models include a measurement component and a structural (i.e., theoretical) component. The measurement component of an LVPA is akin to a confirmatory factor analysis (CFA), which models latent traits based on the patterns of covariation among their items (i.e., measured variables). CFA and IRT are similar in this regard (though their modeling assumptions may differ). However, CFA fits within a broader path analysis framework, in which the measurement model is situated within a structural model of causal relationships among variables. Though CFA and LVPA have traditionally been reserved for traits with continuous items (not Likert-scale items, as in the SECM) due to the use of maximum likelihood estimation (Wright 1996), recent work has produced more flexible estimation approaches, including those appropriate for ordered categorical data (e.g., diagonally weighted least squares [DWLS] and its robust variants [e.g., WLSMV]) (Rosseel 2020).
SEM allows the testing of a priori, theory-driven hypotheses, and is not designed to generate hypotheses post hoc (or to model hypotheses derived from previous exploration of the same data set) (Mueller and Hancock 2019). Therefore, the theoretical framework underlying the model being tested must be articulated and justified, which we do in the following section (see the section titled Theoretical framework for SECM factor and item relationships). Using this theoretical framework, which seeks to outline how SECM factors and items may relate to each other, we built a structural model using LVPA in the R package lavaan v. 0.6-6 (Rosseel 2020a). However, this particular theoretical framework need not be adopted in order to use the SECM, and we encourage continued discussion on the appropriateness of our proposed relationships.
Theoretical framework for SECM factor and item relationships
Individuals who experience personal conflict with normative scientific ideas do so because of the ways in which they perceive or process relevant information and events. These perceptions (along with perceptions more generally) may be linked to a person’s group memberships and resulting social identities (Xiao et al. 2016; Kahan et al. 2007). For example, individuals who identify themselves as being members of a particular group may align their perceptions and perspectives with those of the group (Kahan et al. 2007), which is a phenomenon that has been explicitly connected to evolution acceptance, evolution rejection, and science denial more broadly (Walker et al. 2017). Furthermore, exposures to social groups during human development are thought to calibrate people’s perceptual systems (Xiao et al. 2016), possibly forming cognitive models that can be broadly applied across contexts. Therefore, we propose that aspects of social identity (e.g., the ideas and perspectives held by the social group with which one identifies) may have a causal relationship with one’s personal perceptions of conflict with evolution. Other aspects of identity (e.g., one’s values, cultures, and beliefs) may be indicative of (i.e., manifestations of) one’s latent level of perceived conflict with evolution.
Description of the measurement model
Before implementing a structural model that aligns with the theoretical framework for SECM factor and item relationships (described above), we first evaluated the fit of the measurement model. The measurement model is the part of the model that relates the items (i.e., measured variables) with the factors (i.e., latent variables). A well-fitting measurement model establishes that each factor and its associated items acceptably measure the intended construct. Once a well-fitting measurement model is established, hypothesized causal paths among factors may be modeled and evaluated. In a measurement model, factors are linked to their associated items and all factors (or their residuals [i.e., disturbances] if the factors are endogenous) are allowed to covary with each other (Mueller and Hancock 2019). Next, theory should be used to model covariances between the residual variances (i.e., error variances) of appropriate items. Error variance is the part of a measured variable that does not relate to the factor. If two items have something in common that is not captured by the factor, then their error variances may be correlated with each other (Rosseel 2020b). In order for the measurement model to fit the underlying data, possible error covariances among the items must be considered a priori using theory, and then modeled. Below we detail how we modeled each latent trait in the measurement model.
In alignment with the conceptual framework for conflict perception (see introduction) and the theoretical framework for SECM factor and item relationships (see methods above), the SECM was modeled as three factors, one for each scale of conflict. For each factor, the items (i.e., the culture, values, and belief items) were modeled as indicators (i.e., a reflective relationship between the latent trait and the measured variables [see Mikulić and Ryan 2018 for more on reflective vs. formative models]). Error covariances were modeled among items from different SECM factors that had parallel forms (e.g., the error variances of the three items about “values” were allowed to covary). The CANS was modeled as one factor and error covariances were modeled among items with parallel forms, and among items that focused on the same taxon. Taxon is a feature of instrument items that has been hypothesized to impact evolutionary reasoning and test performance (Kalinowski et al. 2016; Opfer et al. 2012). The I-SEA was modeled as three factors (microevolution, macroevolution, and human evolution) as recommended by the instrument’s authors (Nadelson and Southerland 2012), and error covariances were modeled among items with negative valence, among items about human microevolution, and among items about human macroevolution, all of which have been hypothesized as possible additional dimensions within the instrument (see Sbeglia and Nehm 2019). Religiosity was modeled as one factor and error covariance was modeled between the two religious participation items. Background variables (i.e., plan, prior biology coursework, level, ELL status, reading and writing ability, gender, race) were also included in this model. All factors were allowed to covary. Modification indices were computed and evaluated for possible theory-based changes to the model. We used the WLSMV estimator, which allowed all indicators to be modeled as ordered. Given an acceptable data-model fit for the measurement model, the structural portion of the model could then be estimated (van Riper and Kyle 2014).
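To illustrate the general form of the lavaan syntax, the SECM portion of the measurement model can be sketched as follows. This is a hedged sketch only: item names such as per_cul are placeholders, and the CANS, I-SEA, religiosity, and background portions of the model are omitted for brevity.

```r
# Sketch of the SECM portion of the measurement model in lavaan.
library(lavaan)

secm_items <- c("per_cul", "per_val", "per_bel",
                "fam_cul", "fam_val", "fam_bel",
                "com_cul", "com_val", "com_bel")   # placeholder item names

meas_model <- '
  personal  =~ per_cul + per_val + per_bel
  family    =~ fam_cul + fam_val + fam_bel
  community =~ com_cul + com_val + com_bel

  # error covariances among parallel-form items from different factors
  per_cul ~~ fam_cul + com_cul
  fam_cul ~~ com_cul
  per_val ~~ fam_val + com_val
  fam_val ~~ com_val
  per_bel ~~ fam_bel + com_bel
  fam_bel ~~ com_bel
'

meas_fit <- cfa(meas_model, data = secm_data, estimator = "WLSMV",
                ordered = secm_items)
modindices(meas_fit, sort. = TRUE)   # candidates for theory-based modifications
```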
Description of the structural model
Structural models are built from measurement models, but in structural models, only theoretically important paths are retained. Theoretically important paths are those that align with the theoretical framework for factor and item relationships laid out by the researcher. Specifically, in line with our theoretical framework for SECM factor and item relationships, we built an LVPA model with the following features: the latent traits of family and community conflict perception were modeled as being causal to personal conflict perception, and personal conflict perception was modeled as causal to the three scales of evolution acceptance. Family and community conflict were allowed to covary, and the three factors of evolution acceptance were allowed to covary. Background variables (i.e., plan, prior biology coursework, level, ELL status, reading and writing ability, gender, race), evolution knowledge, and religiosity were modeled as having structural paths to all factors within the model, which removes the linear effects of these variables on parameter estimates (i.e., it controls for them) (Mueller and Hancock 2019). This model is visualized in the results section. With these controls in place, we estimated the significance of the causal paths among the scales of conflict, and between personal conflict perception and the scales of evolution acceptance, by generating asymptotic standard errors of parameter estimates using the Delta method (Rosseel 2020b). This analysis allowed the investigation of the unique contribution of the causal paths between the SECM and evolution acceptance, above and beyond religiosity and evolution knowledge (RQ4).
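The structural portion can be sketched in the same way. Again, this is a hedged illustration with placeholder item names; religiosity, evolution knowledge, and the background covariates, which were modeled as predictors of every factor, are omitted for brevity.

```r
# Sketch of the structural model: family and community conflict as causal to
# personal conflict, and personal conflict as causal to the three I-SEA
# acceptance factors. All item names are placeholders.
isea_items <- c("isea_mi1", "isea_mi2", "isea_mi3",
                "isea_ma1", "isea_ma2", "isea_ma3",
                "isea_hu1", "isea_hu2", "isea_hu3")

struct_model <- '
  personal  =~ per_cul + per_val + per_bel
  family    =~ fam_cul + fam_val + fam_bel
  community =~ com_cul + com_val + com_bel
  micro     =~ isea_mi1 + isea_mi2 + isea_mi3
  macro     =~ isea_ma1 + isea_ma2 + isea_ma3
  human     =~ isea_hu1 + isea_hu2 + isea_hu3

  # hypothesized causal paths
  personal ~ family + community
  micro    ~ personal
  macro    ~ personal
  human    ~ personal

  # allowed covariances (disturbances for endogenous factors)
  family ~~ community
  micro  ~~ macro + human
  macro  ~~ human
'

struct_fit <- sem(struct_model, data = secm_data, estimator = "WLSMV",
                  ordered = c(secm_items, isea_items))
summary(struct_fit, standardized = TRUE)
```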
Fit statistics
We used the following fit statistics and cutoffs: root mean square error of approximation (RMSEA) < 0.05, standardized root mean square residual (SRMR) < 0.08, and Comparative Fit Index (CFI) > 0.95 (Mueller and Hancock 2019). If a model has acceptable fit, then the parameters are considered interpretable.
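For reference, these statistics can be extracted from a fitted lavaan model as sketched below; scaled/robust variants are also reported under the WLSMV estimator.

```r
# Sketch: extracting the fit statistics compared against the cutoffs above.
fitMeasures(struct_fit, c("rmsea", "srmr", "cfi"))
```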