Google’s Book Search
I am one of the people who tend to assume that Google knows everything there is to know about information storage and retrieval; it is, after all, what they do.
But this article written last year by Geoffrey Nunberg proves otherwise.
Start with publication dates. To take Google’s word for it, 1899 was a literary annus mirabilis, which saw the publication of Raymond Chandler’s Killer in the Rain, The Portable Dorothy Parker, André Malraux’s La Condition Humaine, Stephen King’s Christine, The Complete Shorter Fiction of Virginia Woolf, Raymond Williams’s Culture and Society 1780-1950, and Robert Shelton’s biography of Bob Dylan, to name just a few. And while there may be particular reasons why 1899 comes up so often, such misdatings are spread out across the centuries. A book on Peter F. Drucker is dated 1905, four years before the management consultant was even born; a book of Virginia Woolf’s letters is dated 1900, when she would have been 8 years old. Tom Wolfe’s Bonfire of the Vanities is dated 1888, and an edition of Henry James’s What Maisie Knew is dated 1848.
Of course, there are bound to be occasional howlers in a corpus as extensive as Google’s book search, but these errors are endemic. A search on “Internet” in books published before 1950 produces 527 results; “Medicare” for the same period gets almost 1,600. Or you can simply enter the names of famous writers or public figures and restrict your search to works published before the year of their birth. “Charles Dickens” turns up 182 results for publications before 1812, the vast majority of them referring to the writer. The same type of search turns up 81 hits for Rudyard Kipling, 115 for Greta Garbo, 325 for Woody Allen, and 29 for Barack Obama. (Or maybe that was another Barack Obama.)
It seems to me that there is only one explanation for such egregious errors: carelessness. According to the article, Google tries to shift the blame onto the folks who supplied the information.
But here’s the thing: don’t put information in your database if you don’t know what that information contains.
It’s pretty simple.
Not sure whether this huge file is going to be a mess? Take a small sample and see what your system does with it. If it spits out gibberish (and folks, 19k+ results for “Internet” in books published before 1957 is gibberish), then you don’t put it in.
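To make the "take a small sample" idea concrete, here is a rough Python sketch of the kind of check I have in mind. It is not how Google (or anyone else) actually does it. The record layout, the function names, and the 1% threshold are all my own assumptions for illustration. The test borrows the trick from the article: flag any record dated before the person it is by or about was born.

```python
import random

# Birth years used as a plausibility floor. Drucker's and Dickens's come
# from the quoted article; the table itself is a hypothetical lookup.
BIRTH_YEAR = {
    "Peter F. Drucker": 1909,
    "Charles Dickens": 1812,
    "Stephen King": 1947,
}


def is_implausible(record, birth_year=BIRTH_YEAR):
    """Return True if the record is dated before its person was born."""
    born = birth_year.get(record["person"])
    # Unknown people cannot be checked, so they are not flagged.
    return born is not None and record["year"] < born


def sample_error_rate(records, sample_size, seed=0):
    """Return the fraction of a random sample that fails the check."""
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    if not sample:
        return 0.0
    return sum(is_implausible(r) for r in sample) / len(sample)


def safe_to_load(records, sample_size=100, max_error_rate=0.01, seed=0):
    """Decide whether a batch is clean enough to load (assumed 1% cutoff)."""
    return sample_error_rate(records, sample_size, seed) <= max_error_rate


# A made-up batch in an assumed record format. The two bad dates are
# misdatings reported in the article.
BATCH = [
    {"title": "Christine", "person": "Stephen King", "year": 1899},
    {"title": "Christine", "person": "Stephen King", "year": 1983},
    {"title": "A book on Peter F. Drucker", "person": "Peter F. Drucker", "year": 1905},
]

if __name__ == "__main__":
    print("error rate in sample:", sample_error_rate(BATCH, sample_size=3))
    print("safe to load:", safe_to_load(BATCH))
```

A check like this only catches one kind of error, and it needs a trustworthy list of birth years to begin with. Still, if a sample of a few hundred records fails it at anything like the rate in the article, that is your cue to stop and clean the file before it goes anywhere near the database.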
The information you pull out of any system is only as good as the information you put into it. I live by this rule on a daily basis.
When I saw that this article had been published a year ago, I very nearly decided not to post about it. Surely they’ve come a long way in correcting the mess in a year! But… no. I ran some searches myself and found more problems than were reported a year ago. It seems the errors are getting worse with time, not better.
I understand that it’s a huge undertaking to manually go in and fix these errors, but it must be done. Otherwise, Google Book Search becomes virtually useless as anything other than another e-book reader.