Monday, April 18, 2011

reCAPTCHA-ing the web


Where It Started


reCAPTCHA is a newer program that grew out of the original CAPTCHA program, whose name stands for Completely Automated Public Turing Test To Tell Computers and Humans Apart. Luis von Ahn, Manuel Blum, Nicholas Hopper and John Langford of Carnegie Mellon University invented CAPTCHA in 2000. The program protects websites from bots and worms by applying challenge-response tests that a human can pass but present-day computer programs cannot. It generates an image containing a string of distorted letters or digits that is easy enough for a human to read but that automated software cannot reliably make out. CAPTCHAs are widely used to protect systems such as the webmail services Gmail and Yahoo! Mail from e-mail spam, as well as to minimize automated posts to blogs, wikis, and forums. This form of human verification has been successful in preventing unwanted attacks, and since its original implementation it has become so widely used that an estimated 200 million or more CAPTCHAs are displayed every day.



Next level of CAPTCHA


This leads us to reCAPTCHA. Luis von Ahn realized that the average human spends about ten seconds on each CAPTCHA, which adds up to roughly 150,000 hours of human effort each day, essentially wasted. He wanted to put all the time and effort that people spend typing CAPTCHAs to positive use, and the result was reCAPTCHA. This new program is the same CAPTCHA service, except that it also helps in the digitization of books, newspapers, and old-time radio shows. Books that were published before the computer age are having all their pages scanned and transformed into text using Optical Character Recognition (OCR). The major issue with OCR is that it is not perfect, but the transformation to text is still worthwhile because scanned images are not searchable, are expensive to download, and are too large to store on smaller devices.


This is where the reCAPTCHA program steps in. It improves the process of digitizing books by taking the words that the OCR cannot read and sending them out across the World Wide Web as CAPTCHAs for humans to decipher. reCAPTCHA presents two distorted images: one is the word that could not be read by the OCR, and the other is a control word that is already known. The human must type both words, and the computer assumes that if the control word is typed correctly, the questionable word has been typed correctly as well. The human is not told which word is the control and which is the questionable one. The questionable word is also given to a number of other people so that its spelling can be determined with higher confidence. A spelling must receive at least 2.5 votes before it is considered correct in the digitizing process. It is possible for a word to be genuinely unreadable, and reCAPTCHA accounts for this: it provides a button for users to request a new pair of words if the current pair is unreadable. If six users reject a word before a correct spelling is chosen, the word is considered unreadable.
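The voting rule described above can be sketched in a few lines of Python. This is only an illustration, not reCAPTCHA's actual code: the function name and data layout are made up for the example, and the vote weights (one vote per human answer, half a vote per OCR guess) are an assumption that this post does not state. The sketch also assumes each answer has already passed the control-word check.

```python
from collections import Counter

VOTES_NEEDED = 2.5   # threshold quoted in the post
MAX_REJECTIONS = 6   # rejections before a word is treated as unreadable
HUMAN_VOTE = 1.0     # assumption: each human answer counts as one vote
OCR_VOTE = 0.5       # assumption: each OCR guess counts as half a vote


def resolve_word(events, ocr_guesses=()):
    """Decide the status of one questionable word (hypothetical helper).

    events is an ordered list of human responses that already passed the
    control-word check: a string is a typed spelling, and None means the
    user asked for a new pair of words.
    Returns ("accepted", spelling), ("unreadable", None) or ("pending", None).
    """
    votes = Counter()
    for guess in ocr_guesses:
        votes[guess.strip().lower()] += OCR_VOTE

    rejections = 0
    for event in events:
        if event is None:
            rejections += 1
            if rejections >= MAX_REJECTIONS:
                return ("unreadable", None)
            continue
        spelling = event.strip().lower()
        votes[spelling] += HUMAN_VOTE
        if votes[spelling] >= VOTES_NEEDED:
            return ("accepted", spelling)

    return ("pending", None)


# Two humans agreeing with one OCR guess reach 2.5 votes.
print(resolve_word(["morning", "morning"], ocr_guesses=["morning"]))
# Two humans who disagree leave the word undecided.
print(resolve_word(["morning", "moming"]))
# Six requests for a new pair mark the word as unreadable.
print(resolve_word([None] * 6))
```

Under these assumed weights, a fractional threshold like 2.5 makes sense: two humans plus an OCR guess are enough, while two humans alone are not.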



It may seem strange to imagine that the average individual using the World Wide Web could be part of such a large ongoing project, but studies of the program have shown that reCAPTCHA achieves an accuracy of 99.1%. This is a really simple way to kill two birds with one stone: reCAPTCHA provides the protection that we want from web attacks and also lets humans inadvertently help digitize our world. The current projects that reCAPTCHA is helping to digitize are old editions of the New York Times and books from Google Books. In 2009, Google purchased reCAPTCHA to get a jump-start on digitizing its Google Books collection. The program comes full circle: when we find a book or article online, it was essentially made possible by our own fingers at work.


In the first year of running this system, over 1.2 billion CAPTCHAs were solved, which is approximately 17,600 books transcribed. Now over 4 million suspicious words are being solved daily, which is approximately 160 books per day. reCAPTCHA is currently used by over 40,000 websites. Alongside all the amazing work it does in digitizing our world, reCAPTCHA also provides better security on the Web. Some algorithms have been created to read the distorted text in CAPTCHAs, but reCAPTCHA images carry three layers of distortion rather than one. First, they still have all the manual distortion that normal CAPTCHAs apply. Second, the scanning process adds noise to the image. And lastly, many words already suffer from natural distortion because the underlying texts have faded over time. These distortions are very difficult for computers to decipher but simple for humans to solve. In this sense, reCAPTCHA is three times stronger in protecting users on the World Wide Web.
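As a rough sanity check on the throughput figures at the start of this paragraph, the ratios they imply can be worked out directly. The sketch below uses only the numbers quoted in this post; the function names are made up for the example, and the interpretation in the comments is a guess rather than something the sources state.

```python
def words_per_book(words_per_day, books_per_day):
    """Average number of solved words that make up one book."""
    return words_per_day / books_per_day


def captchas_per_book(captchas_solved, books_transcribed):
    """Average number of solved CAPTCHAs spent on one transcribed book."""
    return captchas_solved / books_transcribed


# Figures quoted in the post.
daily = words_per_book(4_000_000, 160)
first_year = captchas_per_book(1_200_000_000, 17_600)

print(f"Daily rate: about {daily:,.0f} words per book")
print(f"First year: about {first_year:,.0f} CAPTCHAs per book")
```

The daily figures work out to 25,000 words per book, while the first-year figures work out to roughly 68,000 CAPTCHAs per book. One plausible reason for the gap is that each word has to be shown to several people before its spelling is accepted, so more CAPTCHAs are solved than words are transcribed.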


Where is it headed?


The future is unknown. People are already having fun with reCAPTCHAs through something called CAPTCHArt (http://www.captchart.com/): when the two words that come up happen to make a coincidental pair, a screenshot is taken and a corresponding image is drawn around it. When all of humanity’s books and newspapers have been digitized, the program might be useful for digitizing other material, ranging from historical documents to art pieces with letterforms in them. There is no telling where reCAPTCHA will go next, but there is time to develop something: the World Wide Web has only been around for approximately 20 years, and a lot of books were created between the beginning of time and 1989.


On another note, Luis von Ahn and his team are working on an exciting project to translate the entire World Wide Web. Since the majority of the Web is in English, more people around the world would be inclined to use it if they could understand what is on it. Taking the same concept of using human processing power to make an impact on the Web, they have created duolingo.com. The site is free and helps individuals learn a new language while at the same time translating the Web. Innovation is happening all around us, and it’s free.


References:

  1. http://en.wikipedia.org/wiki/ReCAPTCHA
  2. http://www.google.com/recaptcha
