Duff, P. (to appear). In M. Chalhoub-Deville, C. Chapelle, & P. Duff (Eds.), Inference and generalizability in applied linguistics: Multiple research perspectives. Amsterdam: John Benjamins.

Beyond Generalizability:

Contextualization, Complexity, and Credibility in Applied Linguistics Research

Patricia A. Duff

University of British Columbia

Introduction

Research in education and the social sciences has long been concerned with the basis for inferences and conclusions drawn from empirical studies, and applied linguistics is no exception. With the emergence of qualitative, mixed-method, and other innovative approaches to research in recent decades, issues connected with validity in research and testing have received renewed attention from applied linguists. These issues concern not only how to design and interpret one’s own studies in order to make legitimate claims, but also how to interpret others’ research or the tools and products of research, such as tests and typologies. One particular area of concern is the nature and scope of insights that can be generated from qualitative research such as case study, ethnography, narrative inquiry, and conversation analysis, as well as within more familiar quantitative research paradigms or inquiry traditions, especially when small sample sizes are involved. Seeking and reaching consensus regarding criteria for conducting and evaluating high quality research within each tradition has therefore become a priority of late (e.g., Chapelle & Duff, 2003; Edge & Richards, 1998; Lazaraton, 2003).

This chapter discusses inference and generalizability as they apply to qualitative applied linguistic research primarily, while conceding that the terms qualitative and quantitative are overstated binaries when describing contemporary so-called “new-paradigm” research designs. First, I define inference and generalizability as they are often understood and used in quantitative research and consider their relevance to qualitative research. I then explore such themes as contextualization, complexity, and credibility in relation to generalizability in most qualitative research as fundamental ways of broaching validity. Case studies and ethnographies in second language acquisition (SLA) and second language education (SLE) are selected to illustrate these principles. I conclude that both quantitative and qualitative applied linguistic research should seek to maximize the awareness and importance of contextualization, complexity, and credibility as well as analytic or naturalistic generalizability (however these concepts are taken up) as we promote rigorous, systematic, and meaningful forms of inquiry and as we consider the inferences and interpretations that can be drawn from our research.

Inference

The aim of research is to generate new insights and knowledge--in other words, to make various kinds of inferences based on observations. Quantitative research often stresses the importance of inference. In that paradigm, inference may refer to the degree to which we infer from people’s observable behaviors aspects of their underlying competence or knowledge systems; or it may refer to the nature of claims that are inferred from various kinds of evidence (e.g., about the relationships among variables). Inference, unlike generalizability, is a term that is seldom addressed explicitly or in a technical sense in meta-methodological discussions of qualitative research, as a perusal of the subject indexes of many qualitative research methods textbooks reveals (e.g., Crabtree & Miller, 1999; Creswell, 1998; Denzin & Lincoln, 2000; Eisner & Peshkin, 1990; Holliday, 2002; Merriam, 1998; Miles & Huberman, 1994; Neuman, 1994; Silverman, 2000). Even in quantitative or mixed-methods textbooks, the term is mainly used in conjunction with particular types of statistics.[1] Inferential statistics, such as t-tests or analysis of variance, make it possible to draw certain kinds of conclusions about data in relation to research questions (e.g., causality), and especially about the relationship between the sample data and the characteristics of the larger population (Brown & Rogers, 2002; Fraenkel & Wallen, 1996; Gall, Borg & Gall, 1996; Palys, 1997). Inference, therefore, is closely related to the notion of generalizability: namely, whether results are generalizable to a larger group or to theoretical principles, that is, whether we can infer generality. In some respects, then, inference and generalization can refer to the same process, since generalization is a kind of inference. However, inference refers to a broader cognitive process connected with logical reasoning, of which inferring generality is just one sort.
Thus, although inference is a concept not typically discussed at length--or even in passing--in most qualitative research, it is a form of reasoning that is implicit in the presentation and interpretation of research within any paradigm. For example, qualitative research in language education seeks to draw inferences about such topics as the values, conditions, linguistic and sociocultural knowledge, and personal experience that underpin observable behaviors in order to understand how and why people behave or interact in certain ways, how they interpret their behaviors and situations, how learning proceeds, what instructional processes are deemed effective, the characteristics of learning cultures, the attributes of certain kinds of learners and teachers, and so on. Instead of inference, most qualitative methodology textbooks discuss processes of interpretation, a kind of inference, a search for patterns and understandings, which is central to the meaning-making in qualitative research.

Generalizability in Quantitative vs. Qualitative Research: Problematics and Possibilities

Generalizability, a crucial concept in positivist (generally quantitative) experimental research, aims to establish the relevance, significance, and external validity of findings for situations or people beyond the immediate research project. That is, it is part of the process of establishing the nature of inferences that can be made about the findings and their applicability to the larger population, to different environmental conditions, and to theory more universally. Generalizability (or generality as some describe it, e.g., Krathwohl, 1993), while typically discussed in connection with inferences about populations (whether the observations about sampled individuals can be generalized to others in the same population), can also involve the ability to generalize (effects) to “treatments, measures, study designs, and procedures other than those used in a given study,” according to Krathwohl (p. 735). Sampling procedures are one of several elements (e.g., research design) that affect the kinds of inferences that can be drawn about generality. It is commonly accepted that quantitative research, with appropriate sampling (random selection, large numbers, etc.), research design (e.g., counterbalancing of treatments, ideally with a control group, pre-post measures, and careful testing and coding procedures), and inferential statistics where appropriate, has the potential to yield generalizable results.

However, carefully controlled research may nevertheless provide inadequate contextualization of the study, the participants, the tasks/treatments, and so on, and therefore may be less easily generalized than might otherwise be the case. As Gall et al. (1996) point out, “generalizing research findings from [an] experimentally accessible population to a target population is risky” (p. 474). The two populations must be compared for crucial similarities. In addition, “if the treatment effects can be obtained only under a limited set of conditions or only by the original researcher, the experimental findings are said to have low ecological validity” (p. 45), and thus low external validity as well. Safeguards must be in place both in designing and carrying out the experiments and also in reporting the results. Gall et al. (1996) provide a list of factors associated with external validity in experiments that researchers need to take into account (see Table 1), listed under the headings of population validity and ecological validity (after Bracht & Glass, 1968).

Insert Table 1 about here

Key sociocultural variables such as institutional context, first language (L1) background or the relationship between first and second language/culture, other characteristics of the sample/population, the tasks, and even the relationship between the researcher and those researched might be underspecified or omitted, because of space limitations or because the variables are not considered important or central to the study. The generalizability of findings to wider populations and contexts can inadvertently be reduced as a result--regardless of the claims made by the researchers about generality. In a field as international, interlingual, and intercultural as SLA or SLE, the sociocultural, educational, and linguistic contexts of research are of great importance (both macro-level social contexts and micro-level discursive/task contexts, Duff, 1995). To date, most of the published research conducted in TESOL, for example, has taken place in American university or college programs with students within a particular range of proficiency and educational preparedness, and its findings may or may not be easily generalized to other types of programs (e.g., with children) or to countries with very different educational systems, cultures, histories, and economies (e.g., in EFL regions).

The more controlled and laboratory-like the SLA studies (e.g., Hulstijn & DeKeyser, 1997), often using very contrived tasks (even involving artificial languages or nonsense images in some cases in order to control for prior knowledge), the less generalizable the findings are, in my view, either to larger populations from which samples are drawn or to broader understandings of language teaching, learning, or use in classrooms or other naturally occurring settings. That is, it may be difficult and unwise to generalize from behaviors of unfamiliar pairs of interlocutors doing unclassroom-like research tasks under laboratory-like conditions to how language learners, as familiar classmates with their own history of interacting with one another and undertaking tasks, would do classroom tasks or engage with interlocutors in natural, non-experimental settings.[2] Furthermore, while the research may speak to issues of how learners would engage in one particular type of task, it does not shed light on how curriculum can be developed linking such tasks in meaningful, educationally sound ways. Most SLA studies continue to examine L2 learners of English or other Indo-European languages and much less often Indo-European-L1 learners of non-Indo-European L2s (e.g., English learners of Chinese or Arabic; see Duff & Li, 2004, on this point). They have also privileged a small set of fairly basic tasks associated with communicative language teaching (e.g., spatial “spot-the-difference” or “plant the garden” tasks), have not explored other instructional contexts such as EFL instruction to the same extent, and have far too seldom investigated interactions or language development over an extended period of time. These issues, in my view, also reduce the generalizability and utility of the findings of such studies, no matter how rigorously they are conducted.
The onus is on researchers in quantitative studies to convincingly demonstrate the external validity of their findings (if that is their objective), rather than take it for granted that generalizability is possible in quantitative research but categorically impossible in qualitative research.

Table 1 captures some basic differences between quantitative and qualitative research, especially in relation to generalizability and the strengths and weaknesses inherent in each paradigm with respect to validity concerns more broadly. Here I summarize some of the most commonly cited differences. Whereas quantitative research emphasizes both internal and external validity (or generalizability), in addition to reliability, qualitative research, especially postpositivist, naturalistic, interpretive studies, typically emphasizes elements associated with a combination of internal validity and reliability (to borrow terms from quantitative research). Internal validity in quantitative research, like its counterpart in qualitative research, is related to the credibility of results and interpretations, as a function of the conceptual foundations and the evidence that is provided. As Krathwohl (1993) puts it, internal validity in quantitative research is related to the study’s “conceptual evidence linking the variables,” supported by “empirical evidence linking the variables”: demonstrated results, the elimination of alternative explanations, and judgments of the overall credibility of the results (p. 271). In qualitative research, internal validity is addressed by means of contextualization; thick description; holistic, inductive analysis; triangulation (or “crystallization,” to use a more multifaceted metaphor; Richardson, 1994); prolonged engagement; ecological validity of tasks; and a recognition of the complex and dynamic interactions that may exist among factors; as well as the need for the credibility or trustworthiness of observations and interpretations (Davis, 1995; Watson-Gegeo, 1988).
Thick description, one of the most touted strengths of case study and ethnography (but not of conversation analysis or certain other kinds of qualitative inquiry), may draw on the following sources of information: documentation, archival records, interviews, direct observations, participant-observation, and physical artifacts (Yin, 2003). Gall, Gall, and Borg (2003) suggest that a suitably thick description of research participants and concepts allows “readers of a case study report [to] determine the generalizability of findings to their particular situation or to other situations” (p. 466). The aim is to understand and accurately represent people’s experiences and the meanings they have constructed, whether as learners, immigrants, teachers, or administrators, or members of a particular culture.

Table 1: Generalizability and Quantitative vs. Qualitative Research

Research designs/methods
  Quantitative: quasi-/experiments, correlations, surveys, regressions, factor analyses, etc.
  Qualitative: case study, ethnography, conversation analysis (CA), other micro-discourse analyses, interview research, document analysis, or some combination of these

View of generalizability (external validity)
  Quantitative:
    emphasis on external validity (e.g. in experiments; Gall et al., 1996, p. 466):
    1. population validity
    2. ecological validity, with attention to:
       explicit description of experimental treatment
       “representative design” (reflects real-life environments and natural characteristics of learners)
       multiple-treatment interference
       Hawthorne effect
       novelty and disruption effects
       experimenter effect
       pretest sensitization
       posttest sensitization
       interaction of history and treatment effects
       measurement of dependent variable
       interaction of time of measurement and treatment effects
    Krathwohl (1993): explanation generality, translation generality, demonstrated generality, restrictive explanations (conditions) eliminated, replicable result
  Qualitative:
    limited relevance, in traditional views, to the goals of qualitative research; validity is described instead in terms of “transferability,” catalytic and ecological validity, credibility, dependability, confirmability, etc.; “validity” and “reliability” are often not compartmentalized (Gall et al., 1996)
    some (positivist) qualitative researchers believe that through the aggregation of qualitative studies from multiple sites (e.g. case studies through a case survey method, through “meta-ethnography” or cross-case “translation,” and the comparative method) generalizations may be warranted; others see generalizability in terms of “fit” between one study/situation and another, based on thick description (Schofield, 1990)

Types of generalization
  Quantitative:
    to populations, settings, conditions, treatments
    to theories/models (hypotheses)
  Qualitative:
    to populations (similar cases/contexts; also known as case-to-case translation; Lincoln & Guba, 1985), based on similarities across situations/contexts, and enhanced by the representativeness of sites/cases/situations studied
    to theories/models (this is the more common application; also referred to as “analytic generalizability,” Firestone, 1993)

Strengths
  Quantitative:
    multiple occurrences of phenomenon/effects in relatively controlled setting; quantification helps establish clear trends and relationships; statistical inferences or extrapolation of results to defined wider population possible (especially with random or probability sampling)
    normally based on data from large sample populations
    internal, external validity and other types of validity often demonstrated convincingly; attention paid to reliability of coding, testing, etc.
  Qualitative:
    great potential for rich contextualization, accounting for complexity of social/linguistic phenomena
    potential to resonate with readers because of accessibility, description, narrative genre and examples
    in critical/feminist research, the potential to move readers to action to create change; i.e. “catalytic validity”
    more opportunities to document ecological or internal validity than in much quantitative research; but less potential or desire to make claims of external validity
    based on theoretical or purposive sampling
    transferability of findings is determined not only by researcher but also by reader and by the typicality, representativeness or depth of description of case/situation
    focus on multiple meanings and interpretations

Ways of strengthening further
  Quantitative:
    using ecologically valid tasks, settings, procedures; addressing underlying constraints on generalizability
    documenting contexts, sampling, procedures, instruments, etc. in as much detail as possible
  Qualitative:
    having an aggregation of multiple-case or multiple-site studies; triangulation
    thick description
    include participants’ own judgment of generalizability/representativeness or typicality (Hammersley, 1992); corroboration from other studies (cf. Maxwell, 1996), in addition to observations

Potential problems
  Quantitative:
    over-emphasis on external validity in some cases, at the expense of internal validity (e.g., task-based research, with strangers, or using artificial languages, contrived tasks): How far can findings about interaction patterns be generalized to “normal/natural” learning conditions?
    failure to replicate or follow up on studies with different populations and in different contexts may lead to de facto generalization
    too often, the research focuses on only the analysts’ analysis/perspectives and not those of participants or other observers (although coding is usually strengthened by getting 2nd raters, using operational definitions/constructs, etc.)
    statistical inferences may be based on inappropriate statistical procedures for types of data involved (cf. Hatch & Lazaraton, 1991)
  Qualitative:
    over-emphasis on possibly atypical, critical, extreme, ideal, unique, or pathological cases, rather than typical or representative cases (e.g., Genie, Alberto, Wes; i.e. in terms of fossilization, critical period studies, exceptionally good or ineffective language learners)
    “telling cases” may not always be highly representative cases; they may be very helpful in providing insights into SLA but conditions or insights may not apply broadly to others; e.g., autobiographical cases of metalinguistically sophisticated learners (Schmidt & Frota, 1996) who may be atypical
    tendency to generalize widely to theory (acculturation model, notice-the-gap principle), despite disclaimers, with few similar case studies conducted (i.e. in lieu of replication, multiple-case-studies)
    observations may lack relevance to field more widely, outside of immediate context
    need for enduring key themes, constructions, or relationships among factors that will be helpful to other researchers in other contexts; going beyond particularistic local observations to more general trends
    rigorous evidence, or authentication may seem to be lacking (Edge & Richards, 1998) but goal should be to “interpret and … offer a [context-specific] understanding” (p. 350)
    see Lazaraton’s (2003) discussion of “criteriology” and qualitative social research

Qualitative research in education, according to Schofield (1990), first began to address issues of generalizability because of large-scale, primarily quantitative, multi-method program evaluation research in the 1980s and 1990s that incorporated significant qualitative components and yet, given the over-arching quantitative structure, still framed discussions in terms of generalizability. She observed that a general “rapprochement” of the two major paradigms since then, emphasizing their complementarity as opposed to fundamental incompatibility, has further prompted researchers to examine reliability and validity or their proxies. Although it is often said that qualitative research is neither interested in nor able to achieve generalizability or to generate causal models or explanations, there are in fact many diverging opinions on this issue, as we will see in what follows (Bogdan & Biklen, 1992).