Recently AOL released the logs of all online searches done by nearly 658'000 of its users over a period of three months earlier this year. The intention was honorable (to benefit academic researchers), the way it was done absolutely inept (putting a file up on a public website), and what the data reveal is staggering.
Quick summary: At the beginning of August, AOL - the big American Internet player - published on its research site a large file (2 GB, compressed to 439 MB) for anyone to download. The file contained the complete list of all searches done through the AOL search engine by 658'000 people, randomly chosen among the 40-plus million AOL users. Each record was stripped of the person's screen name, which was replaced by an ID number. Each record includes the ID, the term or terms used for the search, the time stamp, whether the user clicked on a result and, if so, the website visited. There are an estimated 20 million web query records in the file. It looks like this (I've picked 15 lines randomly, relating to two different users):
Where is the problem, you may ask. Stripping the names and replacing them with ID numbers has allowed AOL to claim that the information was "anonymized", that there was no personally-identifiable data linked to the records. However, even a slight familiarity with data-mining techniques makes it apparent that search data of this kind are enough to identify people. Combining information about a specific city (searches for local news, for real estate listings, for shops or doctors or services, for train schedules, for school plays) with more personal data can bring up pretty detailed profiles of specific individuals. Many people routinely search for their own names or those of friends and family, or for their phone number or street address, to satisfy their ego but also to check, for example, whether someone else is using their personal information (think identity theft). The New York Times' reporters, for example, tracked down AOL user number 4417749, Thelma Arnold, a 62-year-old widow who lives in Lilburn, Georgia:
In the privacy of her four-bedroom home, Ms. Arnold searched for the answers to scores of life’s questions, big and small. How could she buy “school supplies for Iraq children”? What is the “safest place to live”? What is “the best season to visit Italy”? Her searches are a catalog of intentions, curiosity, anxieties and quotidian questions. There was the day in May, for example, when she typed in “termites,” then “tea for good health” then “mature living,” all within a few hours. (...) While these searches can tell the casual observer - or the sociologist or the marketer - much about the person who typed them, they can also prove highly misleading. At first glance, it might appear that Ms. Arnold fears she is suffering from a wide range of ailments. Her search history includes “hand tremors,” “nicotine effects on the body,” “dry mouth” and “bipolar.” But in an interview, Ms. Arnold said she routinely researched medical conditions for her friends to assuage their anxieties.
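To make the point concrete, here is a minimal sketch of how little work it takes to turn "anonymized" records into per-person profiles. Everything in it is an assumption made for illustration: the records are invented, the tab-separated layout is only a guess based on the fields listed above, and the helper names `build_profiles` and `queries_mentioning` are hypothetical. It simply groups queries by ID number and then looks for a place name.

```python
import csv
import io
from collections import defaultdict

# Invented records in an ASSUMED tab-separated layout:
# user ID, query, time stamp, rank of clicked result, clicked site.
# None of these lines come from the real file.
SAMPLE = (
    "1001\tlandscapers in springfield\t2006-03-01 10:15:00\t\t\n"
    "1001\tjane doe\t2006-03-02 09:00:00\t1\thttp://example.com\n"
    "2002\tcheap flights\t2006-03-01 11:00:00\t\t\n"
    "1001\tspringfield school play\t2006-03-05 18:30:00\t\t\n"
)


def build_profiles(text):
    """Group distinct queries by user ID, in order of first appearance."""
    profiles = defaultdict(list)
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        user_id, query = row[0], row[1]
        if query not in profiles[user_id]:
            profiles[user_id].append(query)
    return dict(profiles)


def queries_mentioning(profile, term):
    """Return the queries in one profile that contain the given term."""
    return [q for q in profile if term.lower() in q.lower()]


profiles = build_profiles(SAMPLE)
for user_id, queries in profiles.items():
    print(user_id, queries)
```

In this toy data, the ID number alone says nothing, but the grouped queries for one user combine a town name with what looks like a personal name, which is exactly the combination described above.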
The file remained available for about 10 days, then AOL took it down and fired a few people. But in the meantime it had been downloaded hundreds, maybe thousands of times, and mirrored on other sites around the world. If downloading is not for you, someone has already put a search interface on the data. Which means that all that personal information can no longer be stopped from circulating online, forever. Marketers are already imagining ways to exploit the data (basically, what AOL did was give legitimate marketers as well as non-legit spammers a list of most-searched terms). The event triggered a huge privacy debate in the US and beyond. And groups such as the Electronic Frontier Foundation have already taken legal steps, since AOL, whatever its intention ("This collection is distributed for NON-COMMERCIAL RESEARCH USE ONLY. Any application of this collection for commercial purposes is STRICTLY PROHIBITED", reads the disclaimer), is obviously in total breach of consumer protection and confidentiality rules.
This story involves AOL, but it is more generally symptomatic of the quantity and the quality of the traces we leave behind when we do even "normal" things on the Internet such as searching for information (a topic that I've already blogged here and here and here, and fictionalized here). Thelma Arnold's reaction when the NYT reporters called her - "I had no idea somebody was looking over my shoulder" - is typical of most Internet users. AOL's customers are identified by a nickname when they log in, but Google, MSN or Yahoo - and their competitors, and their local counterparts such as Search.ch in Switzerland or Alice in Italy - also monitor users through various combinations of digital tracking methods (profiles, IP addresses, cookies, etc.), log these data and keep them for various periods of time, as do most ISPs. They claim they need the data to fine-tune their search algorithms or develop new features - in short: to better serve the users. But of course they also use the data for marketing purposes, and the less scrupulous among them sell parts of the information. The mere fact that the data are logged and kept is troubling, and generally speaking their privacy policies are not the most transparent, as a recent article in the San José Mercury News showed. In the US this led, for example, to a long struggle between Google and the Department of Justice a few months ago, when the DOJ used subpoenas to gain access to Google's search logs. A judge rejected those efforts - but AOL has now publicly given away those very data.
Which, by the way, include search strings that betray intimate dilemmas ("fear that spouse is contemplating cheating", user 7268042), sound like desperate calls ("how to kill oneself", 9486162), may point to a crime ("how to kill your wife", user 17556639; "underage lolitas", 4797906) and much more. (In these last three examples, one wonders, should social services or the police intervene? That's just one of the many questions this story raises.)
Figuring out what to do is tricky. Indisputably, search history needs more stringent privacy protection. Search engine expert John Battelle - who coined the term "database of intentions" to describe the collection of the world's search requests - wishes for broad access. Jason Calacanis, one of the AOL execs, has blogged that "I want us to not keep logs of our search data". That appears very unlikely, given the business value of those data. Maybe the better compromise is mandatory - and independently verified - regular data destruction. But most importantly, people must be aware that every step they take online leaves tracks in the digital sand.
Bruno Giussani is a writer, the European Director of the 








