Some of the work I have been doing lately at school has had an aspect of analyzing "reading level" -- or how difficult some text is to read and understand. One research project I am involved with is creating some reading comprehension tests, and we wanted to find good copyright-free text at our students' reading level that we could use on the test. And one of my class assignments is also about analyzing text to determine how complex it is.
So the other day I was just using Google as usual to search for something, and I saw an interesting feature that sorts results based on the reading levels of beginner, intermediate, and advanced. I can't find much information on how this filter works, but it seems to deliver good results. I did a search on the word "quokka" and this is what I found:
The Quokka is a marsupial from brushy areas in southwestern Australia. It is a small wallaby, a type of kangaroo. Quokkas can hop with their powerful legs and walk on all four limbs. Their life span is about 5 years in captivity.(1)
The quokka (Setonix brachyurus), the only member of the genus Setonix, is a small macropod about the size of a domestic cat. Like other marsupials in the macropod family (such as the kangaroos and wallabies), the quokka is herbivorous and mainly nocturnal. It can be found on some smaller islands off the coast of Western Australia, in particular on Rottnest Island just off Perth and Bald Island near Albany.(2)
I'd have to agree that's a pretty big difference in reading levels!
By the way, that last quote came from Wikipedia, so I was wondering what the reading level for Wikipedia is in general. Doing a search for "site:en.wikipedia.org/wiki" shows these results:
The supposedly "Simple" Wikipedia website doesn't show much difference, according to Google:
So then I wondered how this website stacks up:
Next I wondered when those few "Intermediate" level posts were written, so I checked by month, and this is what I found:
This seems to show that I was writing at a higher level in 2006-2008 than in the years before and after.
What does this mean? Since I can't figure out what Google is using to rank these pages, I can't tell. The only explanation I can find from a Google employee is this:
The feature is based primarily on statistical models we built with the help of teachers. We paid teachers to classify pages for different reading levels, and then took their classifications to build a statistical model. With this model, we can compare the words on any webpage with the words in the model to classify reading levels. We also use data from Google Scholar, since most of the articles in Scholar are advanced.(3)
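To make that description a bit more concrete, here is a minimal sketch of one kind of word-based statistical model: a tiny naive Bayes classifier in Python. It is only a guess at the general idea, not Google's actual method, which isn't public as far as I can tell. The function names (tokenize, train, classify) and the toy training sentences are made up for illustration; a real model would be trained on many teacher-labeled pages.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(labeled_pages):
    """Build per-level word counts from (level, text) pairs.

    The pairs stand in for the teacher-classified pages in the quote.
    """
    word_counts = {}          # level -> Counter of words seen at that level
    page_counts = Counter()   # level -> number of labeled pages
    for level, text in labeled_pages:
        word_counts.setdefault(level, Counter()).update(tokenize(text))
        page_counts[level] += 1
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return word_counts, page_counts, vocab


def classify(text, model):
    """Return the level whose word statistics best match the text."""
    word_counts, page_counts, vocab = model
    total_pages = sum(page_counts.values())
    words = [w for w in tokenize(text) if w in vocab]  # skip unseen words
    best_level, best_score = None, -math.inf
    for level, counts in word_counts.items():
        # Log prior: how common this level is among the labeled pages.
        score = math.log(page_counts[level] / total_pages)
        # Add-one (Laplace) smoothing so a missing word never gives log(0).
        denom = sum(counts.values()) + len(vocab)
        for w in words:
            score += math.log((counts[w] + 1) / denom)
        if score > best_score:
            best_level, best_score = level, score
    return best_level


# Toy training data (hypothetical, far too small for real use).
model = train([
    ("beginner", "the quokka is a small animal it can hop and walk"),
    ("advanced", "the quokka is a herbivorous nocturnal macropod of the genus setonix"),
])
print(classify("a small animal that can hop", model))
print(classify("herbivorous nocturnal macropod", model))
```

With training data this small the output doesn't mean much, but it shows the basic shape of the idea: words that show up mostly in pages labeled "advanced" pull a new page toward that level.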
But perhaps a more detailed analysis is needed, which is exactly what I am doing for the class assignment I mentioned at the start of this post. Stay tuned!
(1) Quokka article at Enchanted Learning
(2) Quokka article on Wikipedia
(3) Google Product Forums