LLL: Language, Linguistics & Learning: March 2012

Another post (back in November) announced the publication of "Linguistic Nativism and the Poverty of the Stimulus" by Alex Clark & Shalom Lappin. This post is a review of the book. Below is the extended version; an edited version appears on linguistlist.

AUTHORS: Clark, Alexander and Lappin, Shalom
TITLE: Linguistic Nativism and the Poverty of the Stimulus
PUBLISHER: Wiley-Blackwell
YEAR: 2011

Nick Moore, Khalifa University, United Arab Emirates

In 12 chapters, with front material, contents, a preface, references, an author index and a subject index, Alexander Clark and Shalom Lappin’s “Linguistic Nativism and the Poverty of the Stimulus” tackles a key issue in linguistics over 260 pages. The book is intended for a general linguistics audience, but the reader needs some familiarity with basic concepts in formal linguistics, at least an elementary understanding of computational linguistics, and enough statistical, programming or mathematical knowledge not to shy away from algorithms. It is written, however, for a wide range of undergraduate, graduate and practicing linguists, particularly researchers working in formal grammar, computational linguistics and linguistic theory.

The main aim of the book is to replace the view that humans have an innate bias towards learning language that is specific to language with the view that the innate bias towards language acquisition depends on abilities that are used in other domains of learning. The first view is characterised as the argument for a strong bias, or linguistic nativism, while the second view is characterised as a weak bias or domain-general view. The principal line of argument is that computational, statistical and machine-learning methods demonstrate superior success in modeling, describing and explaining language acquisition, especially when compared to studies from formal linguistic models based on strong bias arguments, typically from Universal Grammar, Principles and Parameters, the Minimalist Program and other theories inspired by the work of Noam Chomsky.

Summary

Chapter 1, the introduction, establishes the boundaries of the discussion for the book. The authors focus throughout on a viable computationally-explicit model of language acquisition. While briefly presenting arguments on the evolutionary and biological nature of linguistic nativism, they rarely consider these questions again in this book. Their first of many criticisms of Universal Grammar (UG) theories proposed and inspired by Noam Chomsky is that the Minimalist Program appears to disregard the puzzle of acquisition, unlike earlier versions of the theory which placed innate language-specific learning at the core of the theory, in order to explain how children consistently learn language from apparently inadequate data. Clark and Lappin (hereafter C&L) do not dismiss nativism entirely. They point out that it is fairly uncontroversial, first, that humans alone acquire language, and, second, that the environment plays a significant role in determining the language and the level of acquisition. What they intend to establish in the book, however, is that any innate cognitive faculties employed in the acquisition of language are not specific to language, as suggested by Chomsky and the UG framework, but are general cognitive abilities that are employed in other learning tasks. It is from this ‘weak bias’ angle that they critique the Argument from the Poverty of Stimulus (APS).

Chapter 2 focuses on the Argument from the Poverty of Stimulus (APS). The APS is considered central to the nativist position because it provides an explanation for Chomsky’s core assumptions for UG: (1) grammar is rich and abstract; (2) data-driven learning is inadequate; (3) the linguistic data that the child is exposed to is qualitatively and quantitatively degenerate; and (4) the acquired grammar for a given language is uniform despite variations in intelligence. Most of these assumptions are dealt with throughout the book, and many replaced with assumptions from a machine-learning, ‘weak bias’ perspective, although little evidence is supplied to counter assumption three; rather, the reader is referred to other sources. Some APS theories have attacked connectionist approaches to the language acquisition puzzle by proposing that frequency alone would predict a radically different learning order than that observed. C&L counter that an argument against a connectionist claim does not prove the UG assumption of linguistic nativism correct – the strong bias towards language-specific learning mechanisms remains unproven – and they contend that the puzzle of language acquisition can be more explicitly delineated by computational methods than the proposals so far provided by the various UG frameworks. Reviewing the UG evidence for a strong bias, C&L claim that many arguments become self-serving: “a particular grammatical representation is not motivated by the APS, but rather it becomes an argument for the APS.” (p.33) Two examples of the APS – auxiliary inversion and pronominal ‘one’ – are provided as cases in point. In this chapter C&L introduce machine-learning alternatives to linguistic nativism based on distributional, or relative frequency, criteria and a Bayesian statistical model to account for these same learning phenomena. These alternatives are elucidated further in the book.

Chapter 3 examines the extent to which the stimulus (accurately, the Primary Linguistic Data) really is impoverished. C&L avoid arguments between the more empirically-demonstrated richer linguistic environment and the naively-assumed impoverished input (the reader is referred to MacWhinney and others), but examine a key question for the computational modelling of the learning process: the existence or prevalence of negative evidence in the learning process. Even a small amount of negative evidence, for instance through reformulation, can make a significant difference to the problem of learning, and C&L allow for far less negative evidence than has been demonstrated in a range of corpora of child-directed speech. Other indirect negative evidence, such as the non-existence of hypothesised structures, can also significantly alter the learning task assuming that the learner is free to make these hypotheses. C&L challenge the premise of no negative evidence, partly because it forms such a central tenet of Gold’s “Identification In the Limit” – a theory of language acquisition that provides considerable support for the UG position of linguistic nativism. Gold’s highly-influential study argues that because learning cannot succeed within the cognitive, developmental and time limits imposed, then children must have prior knowledge of the language system. For instance, Gold claims that since the Primary Linguistic Data does not contain structural information, such as tagging for correctness, the knowledge that children rapidly acquire (e.g. knowing if a string is grammatical) can only be explained by strong linguistic nativism. However, C&L point out that this view is partly a consequence of ignoring non-syntactic information in the learning environment. 
Maybe it is not possible to explain structural acquisition independently, but the addition of the semantic, pragmatic and prosodic information of the learning context cannot be ignored in the learning process without producing a circular argument. The “Identification In the Limit” theory is examined further in the following chapter.

Clark and Lappin’s overall goal is to establish formal, computational descriptions of the language learning process that can be demonstrated to be viable and feasible, although they are keen to point out that demonstrating the tractability of the learning problem does not equate to modelling the actual neurological or psychological processes that may be employed in language acquisition. Chapter 4 discusses a major argument against the machine learning approach: Gold’s “Identification In the Limit” theory, which concludes that language acquisition is not viable for a ‘learning machine’ and so only linguistic nativism can explain success in language acquisition. C&L reject a number of assumptions in Gold’s model. They do not agree that presentation of the target language should be considered unlimited, as this allows for the unnatural condition of the malevolent presentation of data – intentionally withholding crucial evidence samples and offering evidence in a sequence detrimental to learning. They reject Gold’s lack of time limitation placed on learning. They reject the impossibility of learners querying the grammaticality of a string. They reject the argument that ‘side’ information – information relating to the pragmatics and semantics of the learning context – has no influence on learning syntax, and they provide further evidence to reject the view that learning is through positive evidence only. Perhaps the most significant assumption made in the Gold model that C&L reject is the insistence on limiting the hypothesis space available to the learner. Rather, C&L insist that it is the learnable class of a language that is limited, while the learner is free to form unlimited hypotheses on the limited language. It seems that Gold’s approach is an argument for APS which does not consider alternative approaches: “The argument for the subset principle rests on similar misconceptions as the argument that a target-learnable class must be known to the learner.” (p.97)

Proceeding from a critique of the UG position of a strong bias towards linguistic nativism, C&L begin to introduce their alternative machine-learning approach in chapter 5, “Probabilistic Learning Theory”. The initial step in the weak bias argument is to replace a binary definition of a convergent grammar, typical in UG theories, with a probabilistic definition as this more accurately reflects natural learning processes. C&L on this, and a number of other occasions, object to the simplistic lines of argument employed by Chomsky and his followers in their rejection of statistically-based models of learning. While it may be true that the primitive statistical models critiqued by Chomsky are incapable of producing a satisfactory distinction between grammatical and ungrammatical strings, this does not prove that all statistical methods are inferior to UG descriptions: “the failure of a simple word bigram model to predict a difference in probability between observed events does not entail that statistical language modeling methods in general are unable to handle grammar induction.” (p.101) Consequently C&L introduce a range of statistical methods that they propose can better represent the nature of language acquisition than the under-specified domain-specific mechanisms presumed in UG theories. Central to a plausible probabilistic approach to modelling language is the distinction between simple measures of frequency and distributional measures. Here C&L are proposing that a learner will hypothesise the likelihood of a sequence, based on observations, in order to converge on the most likely grammar. Recent studies using such probability-based grammars are reported in this and later chapters to offer very reliable results, even approaching a reliability factor close to 90%. This general framework uses Probably Approximately Correct (PAC) learning algorithms.
PAC algorithms predict efficient, time-limited learning, without requiring the learner to know that their grammar has converged on the correct grammar. However, traditional PAC models have problems, including the requirement that data samples are labelled and the conditions under which a language can be learned, which make them unlikely candidates for reliable models of natural language processing. These issues are dealt with in the following chapters by modifying PAC algorithms in order to better reflect the conditions of natural language processing.
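As a rough illustration of what "efficient, time-limited learning" means in the traditional PAC setting, the short Python sketch below computes the standard textbook sample bound for a learner that picks any hypothesis consistent with labelled data from a finite hypothesis class. It is not taken from the book: the function name and the example numbers are hypothetical, and the bound assumes exactly the labelled-data condition that C&L regard as unrealistic for natural language.

```python
import math


def pac_sample_bound(hypothesis_count, epsilon, delta):
    """Number of labelled examples sufficient for a consistent learner.

    Textbook bound for a finite hypothesis class:
        m >= (ln|H| + ln(1/delta)) / epsilon
    With at least m examples, any hypothesis consistent with the data
    has error at most epsilon with probability at least 1 - delta.
    """
    if hypothesis_count < 1:
        raise ValueError("hypothesis_count must be at least 1")
    if not (0 < epsilon < 1) or not (0 < delta < 1):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil((math.log(hypothesis_count) + math.log(1 / delta)) / epsilon)


# Hypothetical numbers: 1000 candidate grammars, 10% error, 95% confidence.
print(pac_sample_bound(1000, 0.1, 0.05))
```

The point of the sketch is only that the required sample size grows slowly (logarithmically) with the number of hypotheses, and that the learner never needs to know that it has actually converged.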

Replacing Gold’s paradigm and the PAC learning with three key assumptions (1. language data is presented to the learner unlabelled; 2. the data includes a proportion of ungrammatical sentences; and 3. efficient learning takes place despite negative examples) allows C&L to introduce the Disjoint Distribution Assumption to more accurately reflect natural language learning, in chapter 6. This probabilistic algorithm depends on the observed distribution of segmented strings, and on the adoption of the principle of converging on a probabilistic grammar (a string is probably correct) in place of a categorical grammar (a string is definitely correct). Using a distributional measure ensures that the probability of any utterance will never be zero, allowing for errors in the presented language data, but each string will be measured against its observed likelihood of distribution. In fact, this model predicts over-generalisation and under-generalisation in the learner’s output because, with an unlimited hypothesis space, “It is the ungrammatical strings that an incorrect hypothesis would wrongly predict to be grammatical, and of high probability, that provide crucial data for learning.” (p.133) The addition of word class distributional data – the likelihood of a certain word class in preceding and succeeding position – also ensures greater reliability of judging the probability of a string being grammatical.
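The idea that a probabilistic grammar never assigns zero probability to a string, yet still ranks strings by how well they fit the observed distribution, can be sketched with a deliberately simple stand-in. The Python below is not C&L's algorithm: it is a word bigram model with add-one smoothing, swapped in purely for illustration, and the toy sentences and function names are hypothetical.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their left contexts over whitespace-tokenised sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            contexts[prev] += 1
            bigrams[(prev, cur)] += 1
    return contexts, bigrams, vocab


def log_prob(sentence, model):
    """Add-one smoothed log probability; finite for every string."""
    contexts, bigrams, vocab = model
    size = len(vocab) + 1  # one extra slot reserved for unseen words
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + size))
    return total


# Hypothetical toy data standing in for the learner's input.
model = train_bigram(["the dog barks", "the cat sleeps", "a dog sleeps"])
print(log_prob("the dog sleeps", model))   # unseen but well-formed order
print(log_prob("sleeps dog the", model))   # unseen and scrambled order
```

Neither test string occurs in the toy input, and neither receives zero probability, but the well-ordered one scores higher than the scrambled one. That graded contrast, rather than a yes/no verdict, is the kind of judgement a probabilistic grammar delivers.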

A major aim of this book is to provide a computational account of the language learning puzzle that may not necessarily replicate natural language acquisition, but will make the problem tractable – possible within the defined assumptions. It is the contention of the authors that UG theories have made the wrong assumptions in relation to the learning task and the learning conditions, and in chapter 7 “Computational Complexity and Efficient Learning” they set out the assumptions that allow learning to be efficient without positing a strong bias toward linguistic nativism. To achieve this, they examine the amount and complexity of the data, and the nature and constraints of the learning task, in order to propose learning algorithms that can simulate learning under these conditions, while warning that the purpose here is to demonstrate the possibility in a computational environment, not the actual psychological processes that enable human language learning. That is, demonstrating a computational or a linguistic model of learning and language does not entail its psychological reality. A central assumption that is essential to C&L’s machine learning approach is that the input data is not homogeneous, resulting in some parts of language being ‘more learnable’ than others. Using a standard domain-general capacity to cluster, the language learner can focus on the easier learning tasks leaving the more difficult parts of grammar for later. More complex learning tasks can then be attacked class-by-class according to a Fixed Parameter Tractability algorithm. Ultimately, C&L argue that complex grammatical problems are no better solved by a UG Principles and Parameters approach; the learning problem remains just as complex and learning need not be achieved any more efficiently. Thus, when UG theories use the ‘strong bias’ position as the only argument to deal with complexity, they have not solved the problem posed by a seemingly intractable learning task.

If we are to reject the presumption of the strong bias in linguistic nativism, we need to be confident that its replacement can produce reliable results. Chapter 8 starts to provide those results, illustrated in a range of proposed algorithms. The process of hypothesis generation in Gold’s ‘Identification In the Limit’, a key support of UG, is described as being close to random, and consequently “hopelessly inefficient” (p.153). Various replacements that have been tested, initially in non-linguistic contexts, include (Non-)Deterministic Finite State Automata. These algorithms have then proved effective in restricted language learning contexts. Simple distributional and statistical (including hidden Markov models) learning algorithms offer promising results, but must be adapted to also simulate grammar deduction. Lattice-based formalisms are offered as one prospect to “demonstrate tractable learnability for a nontrivial subclass of context-sensitive representations.” (p.161)

Despite promising results, there are still objections to distributional models, and these are countered in chapter 9, “Grammar Induction through Implemented Machine Learning,” which describes the results of real algorithms working on real data. In general, learning algorithms are tested against a ‘gold standard.’ Typically the algorithm performs a parsing, tagging, segmenting or similar task on a corpus, which may or may not be labelled in some way, and the results are measured against a previously-annotated version of the corpus. Corpora in these experiments tend to be samples of English intended for adults – such as the extracts from the Wall Street Journal included in the Penn Treebank (Marcus, Marcinkiewicz and Santorini, 1993). Success is measured by how closely the algorithm matches the previous results and is typically presented as a percentage. Learning algorithms can be divided into “supervised” – requiring the corpus to carry some form of information such as part of speech tagging – and “unsupervised” – working on a ‘bare’ corpus of language. Not surprisingly, supervised learning algorithms, such as the Maximum Likelihood Estimate, match the ‘gold standard’ in about 88-91% of cases. More surprising, perhaps, are the success rates of unsupervised learning algorithms in word segmentation, in learning word classes and morphology, and in parsing. For instance, parsing algorithms match from 52 to 87% of cases when compared to the ‘gold standard’ of previously-annotated corpora. However, close examination of results often reveals explainable differences – that is, the algorithm produces results which are plausible even if they do not match the human annotation. In brief, various experiments have demonstrated that in the vast majority of cases, learning algorithms can accurately segment and categorise adult language without guidance, and make suitable hypotheses about their grammatical role.
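The percentage figures quoted above come from comparing an algorithm's output with a human-annotated 'gold standard', item by item. A minimal Python sketch of that comparison follows; the function name and the tag sequences are hypothetical, and real evaluations of parsers typically use more refined measures than a simple match rate.

```python
def match_rate(predicted, gold):
    """Percentage of positions where the algorithm's label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty annotation")
    matches = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * matches / len(gold)


# Hypothetical part-of-speech tags for a four-word sentence.
gold = ["DET", "NOUN", "VERB", "ADV"]
predicted = ["DET", "NOUN", "VERB", "NOUN"]
print(match_rate(predicted, gold))  # 75.0
```

A mismatch in such a score is not automatically an error: as noted above, the algorithm's answer may be a plausible analysis that simply differs from the annotator's choice.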

Chapter 10 returns to UG models of language learning, and ‘Principles and Parameters’ arguments in particular, in order to compare them with statistical models of learning. C&L claim that the strong bias in this UG model would require an innate phonological segmenter and part of speech tagger, and that by limiting the hypothesis and parameter space available to the learner, the language learning task actually becomes far more complex, particularly as the highly abstract nature of UG parameters appears to have very little direct relationship to the primary language data. Tellingly, C&L lament the paucity of theoretical and experimental evidence for the Principles and Parameters (P&P) framework to provide examples of language variation and learning that fit with the proposal that learning one ‘parameter’ in a language automatically entails the learning of a set of features:

“The burden of argument falls on the advocates of the P&P view to construct a workable and empirically adequate type hierarchy that captures the main patterns of language variation. The fact that such a theory has not yet emerged, even in general outline, after so many years of work within the P&P framework provides good grounds for questioning the concept of UG that it presupposes.” (p.184)

Even more worrying for UG supporters of a strong bias towards innate language acquisition is the near-indifference to questions of acquisition in Minimalist Program research, the latest theory in UG. A strong bias towards linguistic nativism requires the human brain to evolve these biases in order to learn language efficiently. However, this places language prior to humans, as part of the environment to which humans must adapt. Christiansen and Chater (2008) are credited by C&L with discrediting this notion: it is language that must adapt to fit the human mind. Seen in this light, it is far more likely that languages adapt to learning biases that have already evolved in the human brain, providing another ‘weak bias’ argument, and so general tendencies in human languages are not examples of Universal Grammar but “emergent properties of the processes through which natural language is adapted over generational cycles of transmission.” (p.185) In place of the UG models, C&L again offer sophisticated statistical models of language learning and generation. Promising results using “Probabilistic Context-Free Grammars” and “Hierarchical Bayesian Models” currently provide the best alternatives to UG models by accounting for language acquisition through comprehensive descriptions and high levels of success in simulations.

In chapter 11, “A Brief Look at Some Biological and Psychological Evidence” C&L quickly review accounts of language learning that support a weak bias. Even where evidence from genetic, biological or psychological studies has been used to support a strong bias, C&L are able to show that this evidence does not necessarily favour nativist arguments. For instance, the FOXP2 gene has been identified, by Pinker and Jackendoff among others, as the gene responsible for the capacity for human language. In a family where this gene is mutated, all family members suffer a similar severe language disability. C&L point out, though, that this gene is not unique to humans and, as it is responsible for a disruption in motor learning and development in other animals, it is far more likely that this genetic mutation results in a more general learning disability that also severely affects speech and language development. Similarly, psychological disabilities such as Williams Syndrome that may appear to be language-specific are, on closer inspection, related to a range of learning and developmental abnormalities.

In the final chapter, ‘Conclusion,’ C&L review their evidence against the argument from the poverty of the stimulus. They remind us that they are arguing for a weak innate bias towards learning language based on general-domain learning abilities rather than language-specific abilities. They point out that the UG framework has produced few concrete experiments or cases that explain the language variation or acquisition described in theoretical accounts or that “produce an explicit, computationally viable, and psychologically credible account of language acquisition” (p.214). What they have attempted to demonstrate in “Linguistic Nativism and the Poverty of the Stimulus” is that there are explicit computational models of learning that have produced a credible account of learning without requiring language-specific parameters. Although they are far from perfect and much work needs to be done, computational models have already provided a more adequate account than the UG models:

“We hope that if this book establishes only one claim it is to finally expose, as without merit, the claims that Gold’s negative results motivate linguistic nativism. IIL is not an appropriate model. Conditions on learnable classes need not be domain specific. Finally, the learner does not have to (and in general, cannot) restrict its hypothesis to the elements of the learnable class.” (p.215)

Instead C&L advocate the use of domain-general statistical models of language acquisition.

Evaluation

In 12 chapters, Clark and Lappin use “Linguistic Nativism and the Poverty of the Stimulus” to evaluate a key concept in modern linguistics, taking a clearly computational perspective and examining a wide variety of topics. This monograph presupposes familiarity with most of the core issues but does not demand in-depth knowledge of computational linguistics. In such a review I have, naturally, simplified or ignored a number of important arguments presented in this book, and skimmed over significant presentations of learning algorithms. I would suggest, however, that C&L present a very cogent and coherent argument.

There are so many sides from which to attack linguistic nativism in general, and the argument from the poverty of stimulus in particular. Opponents have argued that most UG theories are unfalsifiable (e.g. Pullum and Scholz, 2002), that corpora designed to reflect children’s exposure to language demonstrate that the stimulus is not impoverished (e.g. MacWhinney, 2004), that it is absurd to posit the notion that the brain adapted to language, as if language exists in the environment prior to man, rather than language adapting to the general abilities of the brain (Deacon, 1998; Christiansen and Chater, 2008). These arguments, alongside alternatives to linguistic nativism from functional linguistics (e.g. Halliday, 1975; 1993; Halliday and Matthiessen, 1999), are often easily dismissed as being irrelevant to the theory of UG. Some of these arguments are mentioned in this book, but what sets Clark and Lappin’s book apart, and why it must be taken seriously by everyone who proposes some form of UG to explain language acquisition and typology, is that it attacks from within. It claims the very ground claimed by theories of UG. UG attempts to formally and explicitly account for the apparent mismatch between the complexity of the language learning task and the near-universal success of humans in achieving it with such apparently meagre resources. The methods proposed by Clark and Lappin identify what methods could be applied to make the complex task tractable. Specifically, these methods are not restricted to language, but are generally useful learning methods – they are domain-general. If there is one criticism I would make of Clark and Lappin’s argument it is that they do not demonstrate clearly enough – at least to this rather naive reader – just how likely it is that we all use the domain-general learning methods that they propose. 
For instance, we are left to presume that Probably Approximately Correct and Probabilistic Context-Free Grammar algorithms represent general, non-language specific, models of learning, but this is not demonstrated.

My biggest fear with this approach is that we may be fooled by the apparent sophistication of the tools at our disposal. We need to remember that when using computational tools to help us understand a phenomenon far more complex than computers, we must not allow the tools to force us to see the phenomenon as the tool understands it. It seems more than coincidental that a computational approach to language acquisition mirrors the findings about language that corpus linguistics has revealed; for instance, that language can be viewed as inherently statistically structured. That it can be analysed in this way, or in the form of transformational trees, does not prove that this is how humans learn language. Fortunately, Clark and Lappin are well aware of this trap and frequently warn readers that no matter how well their computational theories may reflect language acquisition facts, the aim of computational models is to demonstrate what is possible, or even likely, not what really happens in the human mind. Until we better understand exactly what neurological processes are actually involved in language acquisition, our task is to try to represent acquisition as best we can. In this endeavour, we have been expertly assisted by Clark and Lappin’s book.

Linguistic Nativism and the Poverty of the Stimulus is a challenging book. It challenges the reader to deal with a range of linguistic, philosophical, mathematical and computational issues. It challenges the reader to remember a dizzying array of acronyms and abbreviations (APS, CFG, DDA, DDL, EFS, GB, HBM, IID, IIL, MP, PCFG, PLD and UG to name but a few). Most of all, it challenges basic concepts in mainstream linguistics. It examines key tenets of UG in the light of advances in machine learning theory and research in the computational modelling of the language acquisition process, and finds them sorely deficient. It exposes so-called proofs supporting the poverty of stimulus, and reveals alternatives that are formally more comprehensive than the explanations previously provided by the linguistic nativism inherent in UG, and empirically more likely to match natural language acquisition processes.

REFERENCES

Christiansen, Morten H. and Chater, Nick. 2008. Language as Shaped by the Brain. Behavioral and Brain Sciences 31, pp.489-558

Deacon, Terrence W. 1998. The Symbolic Species: The Co-Evolution of Language and the Brain. New York: W.W. Norton

Halliday, Michael A.K. 1975. Learning How to Mean: Explorations in the Development of Language. London: Edward Arnold

Halliday, Michael A.K. 1993. Towards a Language-Based Theory of Learning. Linguistics and Education 5, pp.93-116

Halliday, Michael A.K. & Matthiessen, Christian M.I.M. 1999. Construing Experience Through Meaning. London: Continuum

MacWhinney, Brian. 2004. A Multiple Solution to the Logical Problem of Language Acquisition. Journal of Child Language 31, pp. 883-914

Marcus, Mitchell P., Marcinkiewicz, Mary Ann and Santorini, Beatrice. 1993. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics 19/2, pp.313-330

Pullum, Geoffrey K. and Scholz, Barbara C. 2002. Empirical Assessment of Stimulus Poverty Arguments. The Linguistic Review 19, pp.9-50