Four distributional arguments

by Eric Baković

0. Introduction

From July 2004 to June 2013, I ran a blog called phonoloblog. It was relatively unsuccessful, in the sense that it didn’t achieve the kind of wide and/or devoted readership that one hopes for when one blogs. But it was also a huge success for me personally because it was a great way to somewhat-less-than-formally work through some ideas I was having about phonology at the time. Several of those ideas eventually made their way into formal research publications, but of course many did not (or, at least, not yet).

What follows are four of these underdeveloped phonoloblog posts, spanning from December 2004 to October 2007, all tied together by a theme reflected in the shared portion of their titles: “distributional arguments”. I have done some editing, mostly to remove some of the less relevant (or just more phonoloblog-y) bits of the original. But I also provide links to each original post for the sake of posterity, context, and scholarship (see in particular the comments from several thoughtful readers at the end of the original posts). I also attempt to provide a kind of summary conclusion at the end.

I thought it appropriate to reproduce these here, as I think there’s a point (maybe even two) in them that Alan would appreciate — not to mention the fact that I kick things off with discussion of an argument from McCarthy & Prince (1993).

1. Distributional arguments

[original post: December 2, 2004]

Consider the following argument by McCarthy & Prince (1993: 181):

(1) The (velar glide-final) Axininca Campa root /iraɰ/ behaves as if it were /raɰ/; that is, a single syllable as opposed to two (for the purposes of the phonology of the velar glide).
(2) Suppose that the /i/ in /iraɰ/ (and in all /ir/-initial roots) is epenthetic, and that the monosyllabic behavior of /iraɰ/ is calculated before epenthesis applies (or however epenthetic segments are ignored).
(3) As it turns out, /r/-initial roots are unknown in Axininca Campa, save for a single borrowing (rapisi ‘pencil’, from Spanish lápiz). This is expected if underlyingly /r/-initial roots undergo /i/-epenthesis, becoming /ir/-initial roots.
(4) Furthermore, /ir/-initial roots are far more common than other /Vr/-initial roots. This is expected if /ir/-initial roots have two underlying sources, as opposed to only one for other /Vr/-initial roots.

(3) may already be convincing enough for some folks to believe (2) as an explanation for (1). (Note: the empirical claim in (3) is based on “an examination of [the] root lexicon of [David Payne’s (1981) The Phonology and Morphology of Axininca Campa], containing approximately 850 entries”.) I’m not going to address that here; what I’m interested in is (4), which appears to rely on the following (unstated) assumption:

(5) Underlyingly, all segmental strings (of equal length) have equal distributions (= probabilities of occurrence).
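
To make the work that (5) is doing concrete, here is a minimal sketch (in Python) of the comparison that the argument in (4) implicitly invites: under (5), the /Vr/-initial portion of the root lexicon should split roughly evenly across the vowels, so one can ask how far the observed counts depart from that uniform split. The counts below are invented placeholders, not Payne’s actual figures; only the shape of the comparison is the point.

```python
# Hypothetical illustration only: the counts are invented, not Payne's (1981) data.
from scipy.stats import chisquare

hypothetical_counts = {"ir": 40, "ar": 9, "or": 7, "er": 4}   # invented /Vr/-initial root counts
observed = list(hypothetical_counts.values())
n = sum(observed)
expected = [n / len(observed)] * len(observed)  # what (5) predicts: a uniform split

stat, p = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.2f}, p = {p:.4f}")
# A small p-value says only that the split is not uniform; it does not say why,
# which is precisely what is at issue with (5).
```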

I find the assumption in (5) to be less than convincing, though perhaps I wouldn’t have given it another thought had I not also encountered an invocation of a very similar assumption (also unstated) in a talk by Gessiane Picanço in 2004, when she was still a graduate student at UBC. The talk concerned facts about Mundurukú (a.k.a. Mundurucú), a small part of Picanço’s very interesting work on the phonetics, phonology, and diachrony (through comparative reconstruction) of the Mundurukú subfamily of Tupi languages. (Tupi also includes the Guaraní subfamily.) Some but not all of Picanço’s argument from that talk as summarized below appears in her 2005 UBC dissertation, §§5.7-5.8, pp. 206-216.

Picanço’s talk was specifically about the possible diachronic sources of some phonotactic restrictions in Mundurukú. Picanço counted the relative distributions of consonants and vowels in 1,252 CV(C) syllables (one instance per alternant per morpheme from Picanço’s fieldnotes); among the other interesting findings discussed in the talk were the facts in (6) and (7) below.

(6) There are 37 instances of /ʃi/ sequences in Mundurukú. (That’s just a hair under 3% of the counted CV(C) syllables.)
(7) /si/ sequences, on the other hand, are nonexistent. (One exceptional /si/ sequence exists in a borrowing, pasí ‘go for a walk’, from Portuguese passear.)

Picanço considers and rejects the synchronic analysis in (8), plausible-seeming though it may be.

(8) Suppose that the distribution in (6) and (7) is due to a palatalization rule, s → ʃ / _ i.

Instead, Picanço argues that the absence of /si/ in Mundurukú may be an “emergent phonotactic”, arising from the vagaries of regular sound change. Picanço offers comparative evidence between Mundurukú and Kuruaya (another Mundurukú language), showing that Mundurukú /ʃi/ has at least two diachronic sources, **/ci/ and **/ki/ — but, crucially, not **/si/ (where the double asterisks in the preceding denote reconstructed Proto-Mundurukú sequences). Regardless of whether you’re partial to the synchronic analysis or the diachronic one, there is also the following to consider:

(9) /ʃi/ sequences are far more common than both other /Ci/ sequences and other /ʃV/ sequences. This is expected if /ʃi/ sequences have two (underlying or historical) sources, as opposed to only one for other /Ci/ and /ʃV/ sequences.

Sound familiar? If not, recall (4) above. Like (4), (9) also appears to be based on the same assumption cited earlier in (5), repeated here as (10) and modified to include the diachronic analytical possibility:

(10) Underlyingly/historically, all segmental strings (of equal length) have equal distributions (= probabilities of occurrence).

Given what (I think!) we know about the myriad factors that contribute to the inequality of surface string distributions in a given language, the assumption in (10) just seems like a non-starter to me. (This is most obvious to me in the historical case, where what we’re comparing are two sets of surface distributions; one is reconstructed, but that’s beside the point.) If I’m right, then I think that any argument based on this assumption — such as the arguments in (4) and (9) — is invalid. (Of course, both Picanço and McCarthy & Prince offer other arguments for their respective claims, which must be assessed on their own. As already noted, I have nothing in particular to say about those other arguments here.)

But I could also be missing the point; after all, the assumption in (10) is not stated in either of the works cited above (as I have already noted). The arguments in (4) and (9) are not even pursued very far by the respective authors; numbers are cited and pointed to in the relevant discussion, but that’s about it. Note that in addition to (10), I’m also roughly inferring the “two underlying/historical sources vs. one” thing in (4) and (9); neither Picanço nor McCarthy & Prince invoke any such numbers in this context. I recognize that in both cases things are more complicated than two vs. one, and that the distributional numbers are not expected to correlate exactly with the relative numbers of underlying/historical sources — but my point is, do we even expect them to correlate somewhat, or for any correlation we may find to be a positive indication of one-way or mutual influence?
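
Here is a minimal sketch of that last worry. If the underlying distribution of strings is skewed (as surface distributions invariably are), then a string with two underlying sources need not end up more frequent than a string with a single source. Everything in the snippet (the strings, the Zipf-like weights, the lexicon size) is invented for illustration.

```python
# Simulated illustration only; no real lexical counts are used.
import random

random.seed(1)

def skewed_lexicon(strings, size=1000):
    # Zipf-like weights: earlier strings are much more probable than later ones
    weights = [1 / (i + 1) ** 2 for i in range(len(strings))]
    return random.choices(strings, weights=weights, k=size)

underlying = ["ar", "or", "er", "ir", "r"]            # "r"-initial roots will merge with "ir"
lexicon = skewed_lexicon(underlying)
surface = ["ir" if s == "r" else s for s in lexicon]  # epenthesis: /r/-initial becomes [ir]-initial

for s in ("ar", "or", "er", "ir"):
    print(s, surface.count(s))
# Despite having two sources, surface "ir" comes out far rarer than "ar" under these
# weights, so the two-sources-vs.-one inference only goes through given something like (10).
```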


2. More distributional arguments

[original post: January 17, 2005]

There’s an excellent problem set from Russian discussed at the beginning of Chapter 3 (titled simply “Alternations”) of my favorite textbook, Kenstowicz & Kisseberth’s Generative Phonology: Description and Theory (Academic Press, 1979). This problem set packs a lot of punch for the beginning phonology student: the final devoicing rule that demonstrates that the basic alternant is not necessarily the unsuffixed one, a crucial feeding order between l-drop and final devoicing, and an equally crucial bleeding order between l-drop and dental stop deletion (also a crucial counterbleeding order). And, as usual, K&K’79 proceed through the problem set with some of the most thorough argumentation that you’re likely to see anywhere.

In their discussion of the dental stop deletion rule, K&K’79 present an argument for deletion as opposed to epenthesis that is highly reminiscent of the distributional arguments commented on in §1 above.

Here is the relevant paragraph (from p. 58) in its entirety; I think that no further background is necessary to understand it (assuming you’re a phonologist, anyway). The most relevant part is the final sentence, the one beginning “The point here”:

There is another reason for rejecting the insertion analysis. If stems of such shapes as me- and kra- were set up as basic, we would be creating an odd gap in the inventory of basic stem shapes. There would be stems ending in labial consonants (greb-, skreb-), in velars (mog-, pek-), and dental fricatives (nes-, lez-), but none in dental stops, despite the fact that dental stops are basic sounds in Russian and occur in other positions (for example, stem initially). Furthermore, it would just happen to be the case that the consonants that get inserted before the -u suffix are dental stops, precisely the sounds that would be absent from stem-final position in the proposed URs. The point here is that typically the distribution of sounds is fairly symmetrical in underlying representations, and a skewed distribution in phonetic representations is characteristically the result of the application of some rule (e.g., in Russian there are no voiced obstruents at the end of a word in phonetic representation because of the rule of final devoicing).

Let me clarify that I have no dispute whatsoever with the basic argument here; the deletion analysis is clearly more plausible than the insertion alternative (even without the argument that K&K’79 present first, that the voicing of the hypothetically inserted dental stop would be unpredictable). That final “point here” sentence, however, is a lot more than just a restatement of the basic argument. K&K’79 could have written the following instead:

The point here is that the rule we propose to account for the alternation between dental stops and Ø should ideally provide a full account of the distribution of dental stops. A hypothetical insertion rule requires that there be an accidental gap in the lexicon such that there are no stems ending in dental stops. The distribution of dental stops would thus not be fully accounted for.

Or something like that, anyway. Instead, K&K’79 talk about the “typically symmetrical” distribution of sounds in underlying representations. This does not strike me as something we would want to elevate to the status of a result; it’s more like a low-level problem-solving heuristic in generative phonology, and moreover it’s one that has plenty of apparent counterexamples (think of the evidence we have for the underlying distributions of /ŋ/ and /h/ in English, for example).

It’s true that we often derive intuitive satisfaction from an analysis that achieves underlying symmetry when none is found on the surface. But what is responsible for this intuition? Why should we presume underlying inventories or distributions to be symmetrical when surface inventories or distributions are not?

A particularly notable example of the argument from underlying symmetry that I’ve found is in the literature on Nez Perce vowel harmony immediately surrounding its mention in SPE (e.g. Aoki 1966, 1968, 1970, Jacobsen 1968, Kiparsky 1968, Rigsby & Silverstein 1969, Zwicky 1971 — see the bibliography in my dissertation for full references). Nez Perce has a small and relatively skewed surface vowel inventory i u o æ a, but the form of the dominant-recessive vowel harmony process (in particular, the fact that some i’s are dominant while others are recessive) suggests a more symmetrical underlying inventory. Underlying symmetry is specifically noted to be an analytical desideratum in several of the analyses cited above, as helpfully summarized by Zwicky (“More on Nez Perce Vowels: On Alternative Analyses”, IJAL 37.2, 1971). Zwicky lists “the character of the underlying vowel system” as the first of “at least four sorts of consideration bear[ing] upon the plausibility of an analysis” on p. 123, soon thereafter making it clear that this consideration would be satisfied by filling the asymmetrical gap in the Nez Perce inventory with /e/. Other considerations may conflict with and supersede this one, but what evidence do we have for this sort of consideration in the first place?


3. Still more distributional arguments

[original post: October 14, 2005]

A seminar discussion of Jaye Padgett’s “Unabridged feature classes in phonology” (an abridged published version appeared in Language in 2002; the paper itself dates back to two earlier versions) relates to the topic of the previous two sections. But first, a little background on Padgett’s paper.

Back in the heyday of feature geometry, there seemed to be plenty of good reasons why you would want to group a set of features that pattern (e.g., spread) together under a class node. Perhaps the most compelling of these reasons can be loosely characterized as follows. Borrowing an example from Padgett, suppose you’re looking at Turkish high vowels and you want to express the generalization that both [back] and [round] spread. If these are completely separate features with no class node, the best you could do would be to write two rules, one for each feature. By lumping [back] and [round] under the class node Color, however, you could write one rule spreading Color, which takes both [back] and [round] along for the ride.

So what? Well, the feature-geometric heyday of which we speak was still under the heavy influence of the SPE evaluation metric: roughly, a grammar with fewer and simpler rules (that express linguistically significant generalizations) is more highly valued than a grammar with more and more complex rules (that do not express linguistically significant generalizations). Translated into OT: more general constraints and more explanation-from-interaction are more highly valued than more specific and ad hoc constraints.

One of Padgett’s points is that standard feature geometry has a problem with partial class behavior of the type you find in Turkish: [back] spreading affects high and nonhigh vowels alike, while [round] spreading only affects high vowels. Padgett’s proposal is that feature classes like Color are sets that can be referred to by gradiently violable constraints: Spread(Color) prefers spreading of both [back] and [round], but [round] is prevented from spreading to nonhigh vowels, in which case it’s better to at least spread [back] than to spread nothing at all. This results from the ranking in (1):

(1) *Nonhigh/round >> Spread(Color) >> { Ident(back), Ident(round) }
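
To make the gradient evaluation of Spread(Color) concrete, here is a toy rendering of how the ranking in (1) is meant to work for a single nonhigh suffix vowel after a [+back, +round] trigger. The feature values, candidate names, and constraint implementations are my own simplifications (violations are summed within a stratum), not Padgett’s formal definitions.

```python
# Toy OT evaluation of the ranking in (1); a sketch, not Padgett's system.
TRIGGER = {"back": True, "round": True}                        # the [+back, +round] root vowel
INPUT_TARGET = {"back": False, "round": False, "high": False}  # roughly /e/, a nonhigh suffix vowel

CANDIDATES = {
    "e (no spreading)":     {"back": False, "round": False, "high": False},
    "a (spread back only)": {"back": True,  "round": False, "high": False},
    "o (spread both)":      {"back": True,  "round": True,  "high": False},
}

def nonhigh_round(cand):                 # *Nonhigh/round
    return int(cand["round"] and not cand["high"])

def spread_color(cand):                  # gradient: one violation per unspread Color feature
    return sum(cand[f] != TRIGGER[f] for f in ("back", "round"))

def ident(feature):                      # Ident(back), Ident(round)
    return lambda cand: int(cand[feature] != INPUT_TARGET[feature])

# the ranking in (1); the curly brackets in (1) become a single stratum here
RANKING = [[nonhigh_round], [spread_color], [ident("back"), ident("round")]]

def profile(cand):
    return [sum(c(cand) for c in stratum) for stratum in RANKING]

for name, cand in CANDIDATES.items():
    print(name, profile(cand))
print("winner:", min(CANDIDATES, key=lambda name: profile(CANDIDATES[name])))
# -> "a (spread back only)": it is better to spread [back] alone than to spread nothing,
#    and spreading [round] onto a nonhigh vowel is ruled out by top-ranked *Nonhigh/round.
```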

As Cahill & Parkinson (1997) point out, the really significant part of Padgett’s proposal is the bit about gradient evaluation of Spread(Class), which they argue is not incompatible with standard feature geometry. (There’s more to their argument, and to Padgett’s, that I’m not going into here; see also Halle (1995) and Halle, Vaux, & Wolfe (2000) for a model of feature geometry that allows partial class behavior in a rule-based framework.)

What exactly is gained by referring to the spread of [back] and [round] in Turkish with a single Spread(Color) constraint? As McCarthy (2003: 85) notes (in a different context), if we still acknowledge the independent existence of Spread(back) and Spread(round), then the additional existence of Spread(Color) does not alter the predicted factorial typology. More to the point, the fact that [back] and [round] both spread in Turkish but that [round] spreading is limited to [+high] vowels can be equally described with some combination of the rankings in (2) and (3):

(2) Spread(back) >> Ident(back) — [back] spreading
(3) *Nonhigh/round >> Spread(round) >> Ident(round) — [+high]-limited [round] spreading

You might be able to argue that Spread(back) and Spread(round) don’t exist, and that to describe cases where only one of the features spreads at all, you have (for example) Ident(round) >> Spread(Color) >> Ident(back), which would describe [back] spreading only. Without Spread(Color) you’d need the two independent rankings Ident(round) >> Spread(round) and Spread(back) >> Ident(back). (Possible interesting difference: the ranking with Spread(Color) has Ident(round) >> Ident(back) by transitivity, whereas the ranking without Spread(Color) does not.)

But Padgett claims (p. 10) that “some researchers have taken it to be significant that [back] and [round] harmonize within a single language recurrently”. What appears to be assumed here is that having a feature class Color will somehow lead to an explanation for this recurrence, and I think the only plausible argument for this assumption must take something like the following form:

(4) Given Spread(Color), there are more total rankings of constraints compatible with [back] and [round] spreading together than there would be without Spread(Color).
(5) Given no Spread(Class)-type constraint for an arbitrary pair of features (say, [back] and [high]), there are fewer total rankings of constraints compatible with [back] and [high] spreading together.
(6) Assuming that total rankings are roughly evenly distributed throughout the world’s languages, [back] and [round] spreading together is more likely to be found than [back] and [high] spreading.

Let me clarify that Padgett did not make this argument explicitly, but it’s hard to see how there’s any other way to interpret the appeal to “recurrence” in an overall argument for a UG construct such as the one Padgett proposes. Even if we were able to unequivocally state that [back] and [round] spreading recur in a controlled/balanced sample of (attested!) languages in a way that [back] and [high] spreading do not, why should we assume that total rankings are evenly distributed throughout that sample? I suppose what I’m questioning here is (a) whether it’s the job of the formal theory of grammar to account for recurrence, and (b) if so, whether the assumption of even distribution at the heart of the argument above is the right way (or even a right way) to go about it.
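
For what it’s worth, the counting premise in (4) and (5) is easy to make concrete. The sketch below enumerates every total ranking of a small constraint set, with and without Spread(Color), and asks in what share of those rankings both [back] and [round] end up spreading. The candidate space and violation profiles are my own toy simplification (no *Nonhigh/round, a single trigger-target pair); the step from these shares to typological recurrence is exactly the evenly-distributed-rankings assumption in (6).

```python
# Toy factorial-typology count; a sketch of the reconstructed argument, not Padgett's.
from itertools import permutations

CANDIDATES = ("neither", "back", "round", "both")   # which Color features spread

def violations(constraint, cand):
    spread_back = cand in ("back", "both")
    spread_round = cand in ("round", "both")
    return {
        "Spread(back)":  0 if spread_back else 1,
        "Spread(round)": 0 if spread_round else 1,
        "Spread(Color)": (not spread_back) + (not spread_round),  # gradient
        "Ident(back)":   1 if spread_back else 0,
        "Ident(round)":  1 if spread_round else 0,
    }[constraint]

def winner(ranking):
    # standard OT evaluation: compare violation vectors lexicographically
    return min(CANDIDATES, key=lambda c: [violations(con, c) for con in ranking])

def cospread_share(constraint_set):
    rankings = list(permutations(constraint_set))
    return sum(winner(r) == "both" for r in rankings) / len(rankings)

base = ["Spread(back)", "Spread(round)", "Ident(back)", "Ident(round)"]
print("co-spreading share without Spread(Color):", cospread_share(base))                      # 0.25
print("co-spreading share with Spread(Color):   ", cospread_share(base + ["Spread(Color)"]))  # ~0.47
```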

I should add that Padgett’s paper concludes (p. 35) with the following remark about how research into the phonetic underpinnings of phonology may provide an answer to the “deeper questions” of why and which feature classes exist:

Why do features pattern into classes at all, and why the particular classes found? Though Feature Geometry and Feature Class Theory are noteworthy in capturing feature class patterning, their formal mechanisms do not provide any answer to these deeper questions. Instead, the answers have been attributed to phonetic underpinnings: feature classes have a basis in the phonetic parameters of place of articulation, laryngeal state, and so on. Yet this assumption deserves further scrutiny, not because it is likely to be wrong, but rather because more attention to the phonetic bases would probably bring a new depth of explanation to the research program.

Recurrence seems to me to be a subcase of these deeper questions, and it also seems to me that if we find the answers in the phonetic underpinnings, we should even more seriously question whether we even need feature classes as formal objects (whether in feature geometry or in Padgett’s feature class theory). That is, if [back] and [round] pattern together for good phonetic reasons, then those reasons might simply be held responsible for the fact that some rankings/grammars are more commonly attested than others, without the need for a separate Color class/node.


4. Distributional arguments noch einmal

[original post: October 28, 2007]

In the Notes and Discussion section of JLing 43.3 (2007) there are two articles that I discuss here: Dick Hudson’s “Inherent variability and Minimalism: Comments on Adger’s ‘Combinatorial variability’” and David Adger’s “Variability and modularity: A response to Hudson”. (Adger’s “Combinatorial Variability” is in JLing 42.3, 2006.) Some of Hudson’s comments echo issues I’ve brought up in the preceding three sections, and the exchange between Hudson and Adger bears directly on some current work in phonology; specifically, some of the work that addresses variation.

The Hudson and Adger exchange is short (12 pp. and 6 pp., respectively), and what I’m interested in discussing here does not necessarily require reading Adger’s original article (28 pp.). Let me quickly summarize what is most relevant to this post.

Adger (2006) is a “plausibility argument for a new way of thinking about intra-personal morphosyntactic variation” (p. 503). (Note: Adger’s overall approach doesn’t sound so “new” to me, but I suppose it depends on how narrowly you construe the content of that “about” phrase.) In short, Adger argues that an observed 2:1 ratio of occurrence between two morphosyntactic forms (you/we was and you/we were) in Buckie English (the vernacular of a small, isolated Scottish community, as described in Jennifer Smith’s 2000 U. of York dissertation) is the result of a lexicon in which there are two morphosyntactic items that both happen to spell out as was but only one that happens to spell out as were. The trick is this assumption (p. 510):

[I]f there is a random choice of which [lexical item x, y, or z] is entered into the system, then we should find x, y and z in equal proportions. However, if some of the PF outputs of the lexical items are the same, we predict a disproportionality in the final output […] [For example, suppose there are] two ways that the grammar can output an x, but only one way to make a z. We therefore predict a statistical variance in the output, such that we will find x more often than z.
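
The arithmetic of this prediction is easy enough to verify; the sketch below simulates uniform random choice among three hypothetical lexical items, two of which spell out as was and one as were. The item names are placeholders rather than Adger’s actual feature bundles; the point is simply that the 2:1 surface ratio falls out of the spell-out mapping plus the uniform-choice premise, and nothing else.

```python
# Simulated illustration of the uniform-choice prediction; item names are placeholders.
import random
from collections import Counter

SPELLOUT = {"item_x1": "was", "item_x2": "was", "item_z": "were"}

random.seed(0)
tokens = Counter(SPELLOUT[random.choice(list(SPELLOUT))] for _ in range(10_000))
print(tokens)                          # roughly two thirds "was", one third "were"
print(tokens["was"] / tokens["were"])  # close to 2.0
# The dispute between Hudson and Adger is over the uniform-choice premise, not this arithmetic.
```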

Hudson (2007: §2.3, pp. 689-691) rightly (in my view) questions this assumption, and offers good reasons for seriously doubting its validity. It’s worth reading the section in full, but I quote here the most salient passages.

The explanation is ingenious, but no evidence is offered for the underlying assumption that every lexical item has an equal chance of being used, which predicts that whenever two items share the same meaning, they should each have about 50% of the total usage. Common experience suggests that this is not so; for example, pairs of synonyms like try and attempt (as in try/attempt to open the door) offer speakers a lexical choice, but stylistic differences strongly favour try in ordinary casual conversation. Research evidence supports this conclusion. […] This conclusion is typical of findings in quantitative sociolinguistics, where the data normally show context-sensitive bias in favour of one of two synonymous alternatives […]. Adger’s theory therefore rests on the unsupported assertion that in general lexical choices are random: ‘I have assumed that there is a random choice of lexical items (that is, that there is an equal probability that any of the three lexical items is chosen)’ (p. 511). […]

Adger’s (2007: 699) defense of the relevant assumption is limited to a couple of paragraphs, slightly modified here to better fit my truncation of the Hudson passage above.

The grammar G predicts n variants for a particular meaning with a uniform probability distribution. This uniform distribution of variants does not predict a uniform distribution of phonological forms, since the phonological forms are not themselves always evenly distributed over the variants. Moreover, in any particular speech event, the speaker’s choice of variant will be given by [the performance choice function] U, which is sensitive to speaker-internal properties such as intention, processing and memory and to (ultimately internalised) properties of the utterance context, such as who the interlocutor is and what conversation has gone before. Across sociolinguistically categorised groupings we may (or may not) see emergent patterns of higher or lower frequencies for particular forms. Collating the data together is one way to empirically bring out the effect of the uneven distribution of phonological forms over variants predicted by the theory. […] I assume that every lexical item has an equal chance of being used; that is, I assume a uniform probability distribution in the set of variants. [Hudson] claims that common experience shows this is not so […]. But this argument is backwards. It is an empirical finding that the distribution of [e.g. try and attempt] is non-uniform, and a departure from the null hypothesis. As is well known in Bayesian probability theory, it is crucial to assume a prior probability distribution and I simply assumed a uniform distribution, the null hypothesis; it would have been a curious move to assume anything else. That this assumption led to the correct predictions is itself an interesting finding.

In the hedge in this passage, Adger acknowledges — but at the same time essentially dismisses — the influence of non-linguistic factors on lexical choice. A similar hedge makes a brief appearance in Adger (2006: 511):

Choice of a lexical item by a speaker in any particular utterance is potentially influenced by social and/or psychological factors, so that a particular lexical item may have a higher probability of being chosen in a particular utterance (for example, if that lexical item has been recently accessed, it may be easier to access again; or if a lexical item is simply more frequent overall, it may be easier to access). […] Assuming we can, in fact, control for input probabilities, what we have seen here is that the combinatorics of the syntactic system itself, working on the featural specifications of lexical items, predicts not only variability, but also particular frequencies of surface variants.

Popping back to Adger (2007: 699), we see this justification for the hedge:

[In Adger 2006] I explicitly discuss the fact that various factors will impact directly on the use of a particular variant in any speech event. My question was whether one could see a general pattern emerging when these factors were controlled for, and my suggestion was that such a pattern could be attributed to the structure of the pool of variants, and hence ultimately to the grammar. I argued that this was precisely what happened when we look at the patterns as a whole (see p. 527, where this is discussed).

But it’s also worth quoting what Adger (2006) says on p. 527 (in a single paragraph right before the “Conclusions and Implications” section):

The extra assumption I am making is that every community member will have the same grammar and that it is legitimate to collapse the data from a number of individuals into a single analysis. I think that this assumption is reasonably motivated by the fact that the general patterns seen across individuals hold, for the most part, within a single individual’s data (for example, all individuals have a categorical/variable split exactly as described here). However, it is true that there just isn’t enough data to be sure that the detailed FREQUENCY effects discussed here actually hold for every individual. This is a shortcoming of the analysis which I am aware of, and it is the reason that I offer this analysis as a plausibility argument rather than as a detailed empirical study.

My view (and Hudson’s, as I read it) is that Adger’s “plausibility argument” is founded on at least the following four assumptions, all but Assumption 3 of which are problematic (or at least questionable):

Assumption 1. Individual grammars are significantly responsible for the average distribution of forms across (a sample of) a speech community.

Assumption 2. Forms are evenly distributed in our mental lexicons in some significant sense.

Assumption 3. There is lexical homophony (specifically, the was of I was and the was of you was are different lexical items that happen to both be spelled out as was, while the were of we were and of you were are one and the same).

Assumption 4. The “average distribution of forms across (a sample of) a speech community” of Assumption 1 significantly reflects Assumptions 2 and 3.

Looking at it this way, these four assumptions pretty straightforwardly map onto assumptions made in e.g. work on variation by Arto Anttila. Those assumptions, as I understand them, are as follows.

Assumption 1. Individual grammars are significantly responsible for the average distribution of forms across (a sample of) a speech community.

Assumption 2. The members of the set of totally-ordered constraint rankings consistent with a given partial order are evenly distributed in our mental grammars in some significant sense.

Assumption 3. Some different total orders of constraints result in identical surface forms.

Assumption 4. The “average distribution of forms across (a sample of) a speech community” of Assumption 1 significantly reflects Assumptions 2 and 3.

Note that Assumptions 1 and 4 here are identical to Adger’s corresponding assumptions, whereas Assumptions 2 and 3 in each case differ only slightly to accommodate the different basic tools of analysis, rankings vs. lexical forms. I’ve already noted that I think Adger’s Assumption 3 is not questionable, and I feel the same way about Anttila’s corresponding assumption. But Assumptions 1 and 2 (and by implication, Assumption 4) are no safer, in my view, in Anttila’s theory than they are in Adger’s.
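
For concreteness, here is a minimal sketch of the Anttila-style counterparts of Assumptions 2 and 3: enumerate the total rankings consistent with a partial ranking, and read predicted output frequencies off the proportion of those total rankings that select each output. The three constraints, the fixed ranking, and the outcome function are invented for illustration and are not taken from any of Anttila’s analyses.

```python
# Invented illustration of partial-ranking frequency prediction; not Anttila's actual constraints.
from itertools import permutations

CONSTRAINTS = ("C1", "C2", "C3")
PARTIAL_ORDER = {("C1", "C2")}   # only C1 >> C2 is fixed; C3 is freely ranked

def consistent(total):
    return all(total.index(a) < total.index(b) for a, b in PARTIAL_ORDER)

def output(total):
    # hypothetical grammar: variant A wins whenever C3 outranks C2, variant B otherwise
    return "A" if total.index("C3") < total.index("C2") else "B"

extensions = [t for t in permutations(CONSTRAINTS) if consistent(t)]
counts = {}
for t in extensions:
    counts[output(t)] = counts.get(output(t), 0) + 1

for variant, n in counts.items():
    print(variant, n, "out of", len(extensions))   # A: 2 of 3, B: 1 of 3
# Assumption 2 is the step from these proportions to observed frequencies: it requires
# that the total rankings be "evenly distributed" across usage in some relevant sense.
```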


5. Concluding remarks

The preceding four sections identify (and critique) several situations in which linguists make the assumption, either implicitly or explicitly, that a set of linguistic structures (phonemes, segmental strings, or lexical items) or a set of analytical devices (feature/class nodes or constraint rankings) have an even distribution within some domain (underlyingly, historically, as usage choice sets, etc.). I haven’t quite put my finger on why I think this is a highly questionable assumption to make, but I do.


Suggested citation:
Baković, Eric. 2015. Four distributional arguments. In Short ’schrift for Alan Prince, compiled by Eric Baković. https://princeshortschrift.wordpress.com/squibs/bakovic/.
