Monday, December 20, 2010

Google Corpus

Google has announced a new corpus tool, as reported in this NYTimes article. It draws on a 500-billion-word corpus (yes, you read that right) of written English, taken from a selection of their scanned books published in the 200 years between 1800 and 2000. Searches can be restricted to British or American English or run across all varieties, and there is also a selection of publications in other languages.

The researchers who developed the resource have enabled online searching of words and n-grams on a dedicated Google website. The website provides handy graphical comparisons of relative frequency over time, and these can include combinations of words or phrases (n-grams). The example here compares common quantifying expressions. Links are then provided to 'search for' your word or phrase in Google Books, making an instant web concordance.
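For anyone unsure what a relative frequency comparison of n-grams involves, here is a minimal sketch in Python. It is not how the Google site works internally; the function names and the two-line toy corpus are my own inventions for illustration, and tokenising by whitespace is a deliberate simplification.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequency(texts_by_year, phrase):
    """Map each year to (occurrences of phrase) / (all n-grams of that length)."""
    target = tuple(phrase.lower().split())
    n = len(target)
    result = {}
    for year, text in sorted(texts_by_year.items()):
        grams = ngrams(text.lower().split(), n)
        counts = Counter(grams)
        # Guard against texts shorter than the phrase itself.
        result[year] = counts[target] / len(grams) if grams else 0.0
    return result


# Hypothetical toy data: one short "book" per year.
toy_corpus = {
    1800: "a great deal of rain and a great deal of wind",
    1900: "lots of rain and a great deal of wind",
}
freq = relative_frequency(toy_corpus, "a great deal of")
for year, value in freq.items():
    print(year, round(value, 3))
```

Dividing by the number of n-grams in each year is what makes years comparable, since far more text survives from 1900 than from 1800.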

If that is not enough for you, you can also download the datasets from Google Labs. These provide the already analysed relative frequency and distribution of strings, although the files are so enormous that neither OpenOffice nor MS Office can open them.
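Files too large for a spreadsheet can still be read one line at a time with a short script. The sketch below assumes a tab-separated layout with the n-gram, the year and a count in the first three columns; that layout, the sample rows and the file name in the final comment are assumptions of mine, so check them against the downloaded files before relying on it.

```python
from collections import defaultdict


def yearly_counts(lines, phrase):
    """Sum the count column per year for one n-gram, reading line by line."""
    totals = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        ngram, year, count = fields[0], fields[1], fields[2]
        if ngram == phrase:
            totals[int(year)] += int(count)
    return dict(totals)


# Hypothetical sample rows in the assumed layout: ngram, year, count.
sample = [
    "a great deal\t1800\t12\n",
    "a great deal\t1801\t15\n",
    "lots of\t1800\t3\n",
    "a great deal\t1800\t4\n",
]
counts = yearly_counts(sample, "a great deal")
print(counts)

# With a real download, pass the open file instead of a list, e.g.:
#   with open("ngrams-sample.tsv", encoding="utf-8") as handle:
#       counts = yearly_counts(handle, "a great deal")
```

Because the function only ever holds one line in memory, the size of the file stops being a problem; it just takes longer to run.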

Two things that have struck me as surprising about the announcement of this resource are the reaction by some linguists and the approach taken by some of the principal researchers. In the NYTimes article, "Alan Brinkley, the former provost at Columbia and a professor of American history, said it was too early to tell what the impact of word and phrase searches would be. “I could imagine lots of interesting uses, I just don’t know enough about what they’re trying to do statistically,” he said." Admittedly this is a historian rather than a linguist talking, but the project should not be a surprise to anyone who has been involved in corpus linguistics over the last 20 years. The ever-increasing size of corpora, and access to the internet as a corpus in itself, have inspired projects such as webcorp (and see this special edition of Computational Linguistics from 2003). The article reporting the project (available from Science with a free subscription) also describes what n-grams are and what they do, and the statistical background to n-grams is not hard to find.

Also surprising is the inclusion of Steven Pinker, not just in the NYTimes story but also on the list of authors. He claims an interest in language change, but neglects to point out that access to vast amounts of language data is dramatically eating away at his many claims for an innate language. Real evidence of language use and acquisition suggests ever more strongly that language emerges as a result of constant meaningful interaction with the environment (more on that in another post).

The other surprise is that the principal researchers make fairly exaggerated claims about culture, based on the changes in frequency of use that the data reveal. Clearly patterns of use will change over time (and 200 years is still only a snapshot for many words), but claiming to have invented a new subject, "culturomics", is probably taking things too far.