Monday, October 22, 2007


On Saturday I had lunch with a friend of mine who works in software R & D.

One of his current projects is a system that allows computers to analyse blogs. In breaking down the blog posts and their attendant comment threads, the software will eventually be able to accomplish some very impressive feats.

For example, by looking at how often a commenter comments, and on what topics, and how their comments relate to the posts and the other comments, the software will be able to guess whether or not the blogger and the commenter know each other in real life, as well as online. That may not sound particularly useful, but once it identifies the difference, the software will be able to map your physical and electronic relationships and their overlaps. Unlike MySpace, which can only tell that you have 600 friends, this software will be able to analyse how many of those 600 you actually have a real relationship with, and even on what basis (electronic, physical, professional) that relationship exists.

The software does this by building an extraordinarily detailed image of each individual, based on language use and topics of interest. In terms of language, do I use emoticons? Do I use words like “basically” or “literally” a lot? Do I rely a little too heavily on ellipses, or do I always misspell certain words? In terms of topics, do my posts tend to contain certain words, like “scooter”, and never others, like “highchair”? Does my username frequently appear in comment threads that contain multiple instances of words like “jihad” or “lolcats” or “MST3K”?

Once the software has built an image of my language style and my buzz topics, it can compare that image to any given piece of text and calculate the likelihood of me being the author. A post of short, badly punctuated sentences bemoaning the lack of affordable childcare will score low. A post of long sentences full of parentheses about 'Zontar: The Thing From Venus' will score high.
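For the curious, here is a toy sketch of that profile-and-compare idea. It is emphatically not my friend's software: the function names, the handful of style markers and the use of cosine similarity are all my own assumptions, chosen only because they are the simplest thing that shows the principle.

```python
import math
import re
from collections import Counter


def features(text):
    """Count words plus a few crude style markers (hypothetical feature set)."""
    counts = Counter(re.findall(r"[a-z0-9']+", text.lower()))
    # Style quirks mentioned in the post: ellipses, parentheses, emoticons.
    counts["<ellipsis>"] = len(re.findall(r"\.\.\.|…", text))
    counts["<paren>"] = text.count("(")
    counts["<emoticon>"] = len(re.findall(r"[:;]-?[)(DP]", text))
    return counts


def build_profile(posts):
    """Merge the features of every known post by one author into one profile."""
    profile = Counter()
    for post in posts:
        profile.update(features(post))
    return profile


def similarity(profile, text):
    """Cosine similarity between an author profile and a piece of text."""
    other = features(text)
    dot = sum(profile[key] * other[key] for key in other)
    norm = math.sqrt(sum(v * v for v in profile.values())) * math.sqrt(
        sum(v * v for v in other.values())
    )
    return dot / norm if norm else 0.0


# Made-up sample posts, purely for illustration.
my_posts = [
    "Zontar: The Thing From Venus (the 1966 one) is basically my favourite film...",
    "I rode my scooter (literally all day) and then watched MST3K...",
]
profile = build_profile(my_posts)
print(similarity(profile, "Basically, Zontar (that thing from Venus) deserves the MST3K treatment..."))
print(similarity(profile, "Childcare costs too much. Cant afford a highchair."))
```

The first candidate should score higher than the second, since it shares both vocabulary and punctuation habits with the sample posts. A real system would presumably need far more text per author and far subtler features than this.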

There are two practical upshots.

Firstly, the software will be able to identify clusters or communities of bloggers, even if those bloggers don’t realise that they’re part of a cluster or a community. I will be able to ask it to find bloggers in my area who share my interests and lexicon, and it will identify them for me. It will look at the things about which I’ve written, my geographical location, the people I link to and the people they link to, and generate a list of possible correlations… neatly sidestepping dead blogs, spam blogs, subliterate LiveJournal entries, or any blog that contains the phrase “the wisdom of Kerry Nettle”. It becomes a kind of social networking tool, except that instead of laboriously filling out page upon page of my details, everything I’ve ever written becomes my details. And it tracks all bloggers, not just those who have signed up to it, so it’s working from the largest conceivable dataset.

Secondly, the software will be able to identify any individual via the quirks of their writing style and the subjects they write about, even if they use different pseudonyms and computers with different IP addresses. All of a sudden anonymous trolling and sock puppetry become a lot more difficult.

The more perceptive reader will already be able to tell that I’m not exactly thrilled at some of the implications of this analysis. Put crudely, while the final version of this software will be used primarily to map individuals within limited communities, its technological descendants could potentially fingerprint every single person on the internet, so intimately and thoroughly that tracking an individual’s movements through cyberspace becomes a piece of cake. Next to this rather insidious technology, the data-mining of Facebook looks positively one-dimensional. We will all be tagged and tracked like migratory birds.

As a result, swapping between identities, or changing your identity, will eventually be just as hard on the internet as it is in real life. You might think that this serves people right for being clandestine or sneaky or hypocritical, but imagine if there was software that could upon request tie together your formal politically-themed blog, that one drunken rant about fat chicks you wrote to a chat room in 1998, your online CV, the fan fiction you wrote as a teenager, the anonymous venting about your spouse you wrote to, your old Lavalife personal ad, and so on and so forth. At the touch of a button anyone in the world can know about every single ill-considered remark, angry flare-up, ignorant position and superseded opinion you’ve ever expressed… and then, of course, use it against you. And if you’re like me - a big-mouthed idiot who can’t go a day without saying something offensive - it’s a cause for concern.

To invert the famous New Yorker cartoon, on the internet soon everyone will know you’re a dog.


Blogger Nerd Goddess said...

Oh, that frightens me. That frightens me very, very much.

1:05 AM  
Anonymous Matthew Jarvis said...

Two thoughts.

Firstly. I'm jealous that I don't get to work on cool software like that.

Secondly. Before we panic and retire to a hermetic life in some primitive land beyond the reach of the internet, subsisting only on what we can conveniently gather, consider.

One plausible scenario is that the deployment of this technology (I guess it'll be bought up by Google any day now) results in an increase in private or restricted sites which don't make their data available to crawlers and bots. Or which *cough* anonymise... sorry anonymiZe *cough* their data so that all their undergraduate drunken rants about fat chicks appear to the crawler to be incisive and scholarly discussions about the social- and gender-political implications of the mass-mediated body image narrative of Ms B Spears....

An implausible, but infinitely preferred scenario, is that people are finally motivated to stop posting drunken rants about fat chicks, and the overall quality of the data on the internet goes up a notch.

Overall, more opportunities for the cautious posters and trolls among us to laugh at the discomfiture of the incautious. Schadenfreude!

10:07 AM  
Blogger Blandwagon said...

My friend did state that the time of everyone on the internet being "fingerprinted" is still a long way off. When you consider the millions of individuals who are all writing in English (of varying quality), definitively identifying them requires more computational grunt than is currently available. Even if they could be identified, the software could be thwarted by a few simple but deliberate changes of writing style.

My friend also stated that even as he creates this mapping software, other developers are creating obfuscation software, so internet anonymity might not be thwarted yet. However, my friend opined that it was good for people to be aware that such software either exists or will soon exist, and consider their contributions to cyberspace accordingly.

12:22 PM  
Blogger emawkc said...

Well, let's not be naive here. Do you really think the NSA doesn't already have this technology? Believe me, they're tracking everything we write and recording ev..... [TRANSMISSION INTERRUPTED]

1:25 AM  
Blogger phaedrus said...

I have to admit a fair amount of skepticism after reading this. For one, statistical stylometric analysis requires a great deal of source text. Combine this with trying to find the real author in a medium that often incorporates inline quoted text. (Then trying to contextualize everything a la AdSense!) Anything approaching useful results would be a real accomplishment. I wish him the best!

11:40 AM  
Blogger an9ie said...

Dammit - the anonymous bitching is one of my few joys!

12:55 AM  
