corpus-based analysis of linguistic trends

I’ve been invited to comment on the following article and the work which it summarises:

http://online.wsj.com/article/SB10001424052702304459804577285610212146258.html

Studies of this kind are by no means ‘alien’ to linguists, who are concerned with empirical aspects of language studies and thus are perhaps (understandably) more interested in the ‘scientific’ approach than many scholars in more traditional humanities subjects (indeed, at times we’ve arguably been overly concerned with the goal of behaving like scholars in the ‘harder’ sciences, and with the associated philosophical issues). The view of a language as a system rather than a set of unordered phenomena (‘atoms and molecules’) – while not to be pressed too far (especially where vocabulary and/or ongoing change are at issue) – goes back at least a hundred years. And there is much discussion of ‘cultural evolution’ in the linguistic literature.

More specifically, much modern linguistics is ‘corpus’-based. This has led at times to absurd claims (e.g. some linguists deny that phenomena with extensive anecdotal exemplification are genuine, on the grounds that for some reason they do not occur in the relevant corpora), but the development has in general been highly beneficial, especially by way of putting hitherto-unknown figures to observed trends (both synchronic and diachronic) and thus assisting in their explanation in linguistic and extra-linguistic terms. Even so, new input from new sources, of the kind instantiated here, is wholly welcome. And some of the points made here – for example the role of spell-checkers in promoting some variant forms at the expense of others, and the role of technology more generally in contributing to language change – are, if not wholly novel, striking, and warrant closer examination.

On the other hand, the search for universal principles and ‘laws’ in this domain is fraught with difficulties. Strong claims on such fronts would require stronger evidence than is usually forthcoming. A genuinely balanced database, even for one language (particularly one as rich and varied as English), would be enormous and highly complex and would involve a plethora of factors, some of them yet to be fully understood. Some such factors would involve pre-existing dialectological diversity and the varied and dynamically changing statuses of different varieties of the language. For instance, the increasing worldwide preference for snuck over sneaked involves (doubtless among other things) the fact that the former form has long been dominant in American usage specifically; in this particular case, this factor has been strong enough to outweigh the greater simplicity and learnability (with no apparent counter-balancing ‘cost’) of regular past tense forms in -ed such as sneaked.

Incidentally, this example points up the possibility that grammatical changes may operate differently from those involving vocabulary per se. The former (which are more clearly parts of linguistic systems) are certainly fewer (inevitably; there are far more common words than there are grammatical constructions) and slower-moving than the latter, as is shown by e.g. contemporary teenage British and Australian usage, heavily ‘Americanised’ at the lexical level but less so in respect of grammar – and still less for phonology (pronunciation), where English continues to diversify in some respects internationally and even within each country (this has been explained to a degree).

Linguists will look forward eagerly to further work of this kind.

Mark

This entry was posted on Thursday, March 22nd, 2012 at 4:05 am and is filed under Uncategorized.

3 Responses to corpus-based analysis of linguistic trends

Pacal says:

March 22, 2012 at 7:16 pm

This is fascinating. I would like to know, though, how they did the searches. Google frequently limits how much of a copyrighted text can be viewed, i.e. limited preview or snippet view.

I’m wondering if this would affect the results. I’m aware that 5 million+ books is a large database, but is it in fact an accurate view of how English is spoken, as against written?

As for snuck over sneaked: I haven’t heard anyone in my milieu use snuck in years. Well, maybe it’s the company I keep. It is also a damn ugly word.

Skeptical Humanities