PLEASE NOTE: Nodalpoint recently moved servers (we got cracked). During the move I lost the figures, images and data from this article.
Abstract
Looking for any possible distraction from writing my PhD thesis, I have tracked and analyzed the first artificial meme (a GoMeme) through the blogosphere. The GoMeme was observed to mutate, and those mutations propagated. Limited evidence suggests that the network is scale-free.
1. Introduction
A meme can be defined as a self-propagating unit of information, similar to the biological concept of a gene. The term was first used by Richard Dawkins in his book The Selfish Gene. Familiar examples of Internet memes include I kiss you, All your base are belong to us, The Nike sweatshop story and, more recently, p23s5 and the order of words meme. Memes are often considered "idea viruses" that spread in communities, with many believing that various religions can be considered meme-like. The study of memes, termed "memetics", has seen recent interest in the blogosphere, with many people keen to understand how ideas spread through weblogs. However, the means of tracking memes, how network structure relates to their spread, and the defining characteristics of a meme have not been well studied.
The Hewlett-Packard Information Dynamics Laboratory has investigated the Implicit Structure and the Dynamics of Blogspace [1]. The study used data aggregated from weblogs over the period of a month, and then attempted to track the path of memes, initially using hints in the weblog posts such as the word "via". The provenance of a particular meme proved difficult to track, and SVMs (machine classification) were ultimately used to determine whether two entries were similar. It has also been shown that the dynamics of infectious agent spread depend on the structure of the network [2]. Clay Shirky has previously published an essay showing that the explicit structure of the blogosphere is "scale-free". Scale-free networks have the property that nodes in the network are unevenly connected: a few nodes have very many links while most have only a few. Research on viral epidemics in scale-free networks showed that these networks lack an "epidemic threshold" property [2]. This suggests that any infectious agent will always infect a constant proportion of nodes in a scale-free network. However, it has been suggested that there is a tipping point for Internet memes. It is unclear whether social networks, which we can consider the implicit linking in the blogosphere, are also scale-free. This may be the underlying cause of so-called tipping points.
On the first of August, Nova Spivack initiated a meme tracking experiment. An artificial meme, called a GoMeme, was created to study its infection pattern in the weblog community. The meme was a weblog post with instructions on how to "spread" the meme to your own weblog, so the GoMeme may be considered a meme about spreading memes. The GoMeme experiment included explicit via links, which will hopefully provide an opportunity to examine methods for tracking memes through the blogosphere, and possibly to confirm the implicit network topology of weblogs and the dynamics of the meme epidemic.
2. Methods
Details of how the meme was constructed can be found on the original post that started the experiment. The post contained instructions on how to copy the post to your own weblog, as well as 15 questions (six required and 8 optional). There was also a 72-character globally unique identifier (guid) incorporated into the post. This was subsequently changed to a shorter version (a substring of the longer guid) for formatting reasons. The first task was to track the meme via the various indexing services available, such as Technorati, Feedster and of course Google. Technorati was unfortunately broken at the time the experiment was being conducted, and Feedster was returning very few hits for the GoMeme guid. Google gave approximately 1100 hits for the long guid on Sunday the 9th via the Google webpage. I first retrieved all URLs that referenced the long guid via the Google API. This only turned up 198 URLs, even though Google was reporting over 1000 on the main web page. Paging through the results, I discovered that only 30 hits were displayed with filtering on and approximately 800 with filtering off. Searching again via the API (with filtering off) returned 770 URLs, so unfortunately the data set was limited, mainly because the Google API only allows a maximum of 1000 results (yes, Google are watching memes and you too).
A brief inspection of the results showed there were many duplicate links, for example the same host with /archive/ and /index.rdf showing up as well as the front page. To simplify the analysis I filtered URLs on the basis of hostname, which resulted in 262 unique hosts. I then used the Google API to retrieve the cached HTML front page for the first URL from each host. I further filtered these pages by checking for the guid using a Python script. Initial attempts to scrape the resulting pages for data failed because in most cases the structure of the meme was not maintained. I resorted to employing my personal army to manually transcribe results. This process resulted in around 209 URLs, excluding multiple domains pointing to the same host. I checked the consistency of the resulting graph by looking for the existence of all provenance links in the pool of URLs returned by Google. Any that were missing (i.e. Google didn't pick them up) were manually added (about 20-30). Finally, posts that pointed to sources that didn't exist were removed so the data was consistent. The only exceptions were aggregator sites (see results). This resulted in a total of 222 unique nodes (over 1-7th August).
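The filtering steps described above can be sketched in Python. This is a minimal illustration and not the script used for the article: the function names, the `GUID` placeholder and the shape of the `via` mapping (host to the host it copied the meme from) are all assumptions made for the example.

```python
from urllib.parse import urlparse

# Hypothetical placeholder; the real GoMeme guid is not reproduced here.
GUID = "example-guid-0000"


def first_url_per_host(urls):
    """Keep the first URL seen for each hostname, preserving input order."""
    seen = {}
    for url in urls:
        host = urlparse(url).hostname
        if host is None:
            continue  # skip strings that do not parse as absolute URLs
        if host not in seen:
            seen[host] = url
    return seen


def pages_with_guid(pages, guid=GUID):
    """Return only the hosts whose cached HTML contains the guid."""
    return {host: html for host, html in pages.items() if guid in html}


def drop_dangling(via, aggregators=frozenset()):
    """Remove posts whose provenance source is not itself a known node.

    `via` maps each host to the host it copied the meme from (None for the
    origin). Links to aggregator sites are kept as the only exception.
    The loop repeats because removing one node can orphan another.
    """
    nodes = dict(via)
    changed = True
    while changed:
        changed = False
        for host, source in list(nodes.items()):
            if source is None or source in aggregators or source in nodes:
                continue
            del nodes[host]
            changed = True
    return nodes
```

Filtering on hostname is the simplest way to collapse /archive/ and /index.rdf duplicates, but it also merges every weblog that shares a hosting domain into one node.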
3. Results
The GoMeme post was almost instantaneously mutated by different formatting, markup, different order of questions and different answers. The v1 GoMeme (four were tried) was also confused with v2 (both guids appearing on the same page). The times and dates were very inconsistent between posts: some people started blogging in the future, some used 24hr time with pm, some posted at 4am, and in some cases the date of the post (via CMS) and the date in the post didn't match. There were clear instances of people copying other people's answers and not adding their own. The other big issue was provenance links to aggregator sites like blogxdex.net, planetrdf.com and planet.gnome.org (these are the free-floating nodes in the figures below). Because the times are local and not UTC, only a limited time-based analysis was done. If anyone feels like normalizing for time zone, the data is provided. Nonetheless, the results provide an interesting glimpse of the blogosphere.
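For anyone who does want to normalize the times, a minimal Python sketch follows. It assumes a UTC offset can be guessed for each blogger (for example from the location answer), which is an assumption and not something the data set guarantees; the function name, timestamp format and example values are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def to_utc(local_string, utc_offset_hours, fmt="%Y-%m-%d %H:%M"):
    """Interpret a local timestamp with a known UTC offset and convert it to UTC."""
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    local_time = datetime.strptime(local_string, fmt).replace(tzinfo=local_zone)
    return local_time.astimezone(timezone.utc)


# Illustrative values only: a post stamped 01:30 on the 2nd at UTC+10
# and a post stamped 22:00 on the 1st at UTC-5.
early = to_utc("2004-08-02 01:30", 10)
late = to_utc("2004-08-01 22:00", -5)
print(early.isoformat(), late.isoformat())
```

Once every post is on the same clock, posts can be ordered properly, which matters when a post appears to predate the post it was copied from.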
3.1 General statistics
| Browser | IE* | Mozilla | Firefox | Opera | Safari | Camino | Lynx | Galeon | Epiphany | Total |
| Number | 40 | 12 | 116 | 10 | 21 | 3 | 1 | 1 | 1 | 205 |
| Percent | 20 | 6 | 57 | 5 | 10 | 1.5 | 0.5 | 0.5 | 0.5 | |
| OS | Mac OS X | Windows | Linux | Total |
| Number | 28 | 136 | 18 | 187 |
| Percent | 15 | 73 | 10 | |
| Gender | Male | Female | Total |
| Number | 154 | 50 | 204 |
| Percent | 75 | 25 | |
| Mean age (M and F) | Median age (M and F) | Stddev |
| 32 | 31 | 10 |
Table 1 - General statistics. And the award for the most popular browser is: Firefox. Unsurprisingly (to me anyway) the majority of bloggers are 30ish males who use Windows XP.
I compiled some general statistics on the survey, shown above in Table 1 (note the total numbers are small, so don't jump to conclusions, this is for fun). I was surprised to see that Firefox was the most popular browser given that Windows (I included all varieties, XP, ME, 2000 and 98 - oh, the humanity) was the most popular operating system. Most bloggers are about 30 years old, male and employed as web professionals of some kind or other (*sigh* where are all the supermodel bloggers). Location, blogging since, hosting, CMS etc. were not included in Table 1; however, they are in the data set provided.
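The percentages in Table 1 are simple shares of the row total. As a small sketch, here is the operating system row recomputed in Python. The `percentages` helper is hypothetical, and the reading that the total of 187 includes a few answers outside the three listed systems (which sum to 182) is an assumption.

```python
def percentages(counts, total=None):
    """Return each count as a percentage of the total, rounded to one decimal."""
    if total is None:
        total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# Counts taken from the OS row of Table 1.
os_counts = {"Mac OS X": 28, "Windows": 136, "Linux": 18}
os_percent = percentages(os_counts, total=187)
print(os_percent)
```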
3.2 Meme spread visualization
Now on to the real fun stuff (if graph visualization is your thing anyway). To visualize the meme structure (a directed tree) I used Graphviz as shown in Figure 1.
Figure 1 - Overview of the spread structure. Each blue circle represents a unique node (free-floating nodes are the aggregator sites). PDFs are provided with and without hostname labels (with labels, no labels).
Looks nice, but what does it mean? I tried overlaying the spread structure with the census data, as shown in Figure 2.
Figure 2 - Meme spread by gender: Blue for boys, pink for girls (grey = no data) (with labels, no labels)
Meme spread by gender: do girls read other girls' blogs? How old are all those highly infectious bloggers? Spread by age is shown in Figure 3.
Figure 3 - Meme spread by age: 10-19 (green), 20-29 (blue), 30-39 (red), 40-49 (yellow), 50-59 (grey) and no data (black). (with labels, no labels)
None of the bloggers were over 60 (do they have the Internet in nursing homes?) and the youngest was 15. I discounted dvorak as being 100, although a lot of people claimed to be old. Nothing particularly surprising so far, so I flagged alterations to the original post and characterized those as mutations (in a limited sense). Figure 4 shows the spread with mutations.
Figure 4 - Meme spread with mutations (with labels (pdf), no labels (pdf))
I considered any significant alteration or omission to the post to be a mutation. For example, kalsey.com omitted the operating system question. This omission was propagated by a number of other bloggers who found the post on that site. Some corrected the post by copying the original (I assume). Finally, temporal data was not normalized for time zones, so Figure 5 is most likely wrong.
Figure 5 - Date the meme was picked up: 01 (red), 02 (green), 03 (blue), 04 (pink), 05 (orange), 06 (yellow), 07 (grey). (with labels no labels)
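The colored overlays in Figures 2 to 5 come down to writing a DOT file in which each node carries a fill color. A minimal Python sketch of that idea is below. It is not the script used for the figures: the `to_dot` function, the hostnames and the color mapping are all assumptions made for the example.

```python
def to_dot(edges, colors=None, default_color="grey"):
    """Build a Graphviz DOT description of a directed spread graph.

    `edges` is a list of (source, target) hostname pairs, and `colors`
    optionally maps a hostname to a Graphviz color name.
    """
    colors = colors or {}
    nodes = sorted({node for edge in edges for node in edge} | set(colors))
    lines = ["digraph gomeme {", "  node [shape=circle, style=filled];"]
    for node in nodes:
        color = colors.get(node, default_color)
        lines.append(f'  "{node}" [fillcolor="{color}"];')
    for source, target in edges:
        lines.append(f'  "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical hosts: an origin, two blogs that copied it, and one more hop.
edges = [
    ("origin.example", "a.example"),
    ("origin.example", "b.example"),
    ("a.example", "c.example"),
]
gender_colors = {"origin.example": "blue", "a.example": "pink"}
dot_text = to_dot(edges, gender_colors)
print(dot_text)
```

Saving the output to a file such as spread.dot and running `dot -Tpdf spread.dot -o spread.pdf` should render it. Nodes without census data fall back to grey, matching the convention in Figure 2.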
3.3 Network analysis
Due to the small number of nodes in the data set, analysis of the network will not be very accurate. Nonetheless, Figures 7A and 7B show the connectivity of the nodes (blogs) in the network (blogosphere).
Figure 7A - Number of links vs. Number of nodes
Figure 7B - Number of links vs. Number of nodes plotted on a log scale
Is the network scale-free? There is not enough data to tell, although if you look hard you might see a straight line in Figure 7B. These figures don't really say anything about nodes with lower connectivity that are linked by nodes with higher connectivity that spread the meme further (see discussion).
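The kind of plot behind Figures 7A and 7B can be sketched in Python as follows. The sketch counts how many nodes have each number of links and then fits a straight line through the log-log points. It assumes that "links" means total links per node (incoming plus outgoing), which the figures do not specify, and the function names are hypothetical. A least-squares fit on log-log axes is only a crude check: a roughly straight line is consistent with a scale-free network but does not prove it, especially with so few nodes.

```python
import math
from collections import Counter


def degree_distribution(edges):
    """Map each degree k to the number of nodes that have k links."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return Counter(degree.values())


def loglog_slope(distribution):
    """Least-squares slope of log(node count) against log(degree)."""
    points = [
        (math.log(k), math.log(n))
        for k, n in distribution.items()
        if k > 0 and n > 0
    ]
    if len(points) < 2:
        raise ValueError("need at least two distinct degrees")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Toy example: one hub linked to three leaves.
toy_edges = [("hub", "a"), ("hub", "b"), ("hub", "c")]
toy_distribution = degree_distribution(toy_edges)
print(dict(toy_distribution), loglog_slope(toy_distribution))
```

A negative slope means highly linked nodes are rarer than sparsely linked ones, and the steepness of the line is the exponent people usually quote for scale-free networks.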
4. Discussion and conclusions
The two main problems with this experiment were tracking the GoMeme and the temporal data. I only had access to reliable data for 222 nodes, due mostly to the Google API limit of 1000 results. I didn't have enough time to normalize the temporal data included in the meme, so the dynamics of the meme remain unexamined. It was interesting to see the mutations spread (Figure 4), which suggests something about the malleability of the GoMeme structure (a free text post). It is clear that for any meme to be successfully spread it must have novelty value. I didn't have enough time to do things like check nodes that had few links but were linked by highly linked nodes (cf. the HP Labs paper and Technorati newcomers), or whether the connectivity correlates with Technorati results. The guid was probably redundant; in future, permalinks will be sufficient (that's what URIs are for after all). Will more mutations arise in the post structure after a longer time period? I guess we will see. It should be noted that the meme has reached livejournal.com. Figures 7A and 7B suggest that the network is scale-free (although there is not enough data to fully justify this assertion).
The GoMeme experiment did not help elucidate any useful properties of a meme or ways to classify those properties. For that experiment a means of successfully tracking native memes and their provenance needs to be developed (I'll leave that as an exercise for the reader). Comments and corrections welcome.
Data
The data is available as XML. As are the Dot files if you want to try Graphviz.
1. HP Information Dynamics Laboratory, Implicit Structure and the Dynamics of Blogspace, http://www.hpl.hp.com/research/idl/papers/blogs/index.html
2. R. Pastor-Satorras and A. Vespignani (2001), Phys. Rev. Lett. 86, 3200

Comments
23
23? Thou hast become an in-iniate. 23!
GoMeme from google's perspective
Nice idea. I came across it when I saw that someone was using g-metrics to keep track of the googlecount of a strange string.
You may be interested to see the results, although the watch was initiated on Aug 30, when the experiment was over. (GoMeme 1.0 googlecount)
It would have been much more interesting had it been tracked from the beginning of the experiment...
Panayotis.
Re: Analysis
By limiting your analysis to one entry per host name you miss a large number of blogs on places like radio, msdn or dotnetjunkies (my home).
With way too much work on your part you might be able to double or triple the sample size.
In addition you missed dotnetjunkies, which appeared fifth in Google's ranking: http://www.google.com/search?q=GoMeme+4.0&num=100&hl=en&lr=&ie=UTF-8&c2c...
Otherwise an interesting analysis
Cheers
Mark Levison
http://dotnetjunkies.com/WebLog/mlevison/
GUID
The data set I analyzed was from a Google search for the guid from GoMeme 1.0, which is why I missed your post on DotNetJunkies. You are correct about the one-entry-per-host problem and the amount of work needed to include those nodes. Using Google turned out to be problematic because the index is not consistent: it changes as the Google engine adds more pages to the index, returning what it thinks are the most relevant to your search. The number of hits is usually the Google engine's best estimate and not the absolute number present in the index. These are just impressions based on my use of the Google API; I have not had them confirmed.
I thought people would get over explicitly spreading GoMemes after the first one; however, I just noticed that GoMeme 2.0 seems to have some traction in the blogosphere.
the remark about people over 60
I was reading, absorbing, trying to understand what you were doing, then I came to this:
That's out and out age bashing and it ruined everything, changed the tone from something serious to something sophomoric.
I have a friend who is over 60, no one would ever guess from the way she writes as she's hip to the times and expresses herself well enough to have a very young following. She's said there's no reason to reveal her age as it doesn't mean anything except how long she's had to experience everything, be witness to new technologies which she's embraced and continues to push to their limits and if people knew how old her birth certificate says she is, she wouldn't be taken seriously. How sad that society is so judgemental.
I think you owe everyone over 60 an apology and when you find yourself that age, you'll look back at this and cringe with embarrassment.
No offence intended
While I do not feel that sentence implies anything negative about people over 60 not blogging, if it has offended anyone, that was not intended. I was made aware of age-related prejudice in the workplace after discussing this issue with my mother. She is reluctant to divulge her age given people's assumptions about what people of a certain age should be doing (i.e. not teaching computers in her case).
My comment about Internet access in nursing homes was simply a note to myself that I let slip through in this piece. I simply don't know the answer; maybe someone can enlighten me :)
I'm only 64
Don't be too hard on him (her?) - you were young once
Tracking native memes
Just to add a thought on tracking: I don't think that it will ever be an easy problem to solve. Universal ids on all posts along with the via link would require all blogging systems to include that as (non)optional metadata. I don't spend my life following blog technology, so this may exist somewhere and I've just missed it.
Meta-memetics and tracking
Hmm. Interesting. I'll throw a wrinkle in the works - might give you a different scale at which to work on the tracking scheme.
Given the definition of a meme as a self-propagating unit of information, it's my theory that all human knowledge is memetic in nature. We read to acquire it; we teach to pass it on. Some memes are more successful than others, at spreading or at sustainability; that's the self-propagation component. Memes will replicate in hard copy and in teachings that which successfully propagates in human minds.
Educational systems are propagation plants; every book that goes into a school has a life cycle. Some books are kept longer than others, even after many editions.
What would tracking text books look like?
It's not the human nodes that would be tracked, but the content itself.
Much larger scale, much longer time frames, more physical data (versus virtual).
Food for thought.
Or just a new meta-meme begging for propagation?
Best,
Rayne Today (http://blogs.salon.com/0001549)
Interesting thoughts, tracking is the problem...
Weblogs seem to be one of the only possibilities, as they provide a certain amount of metadata with each post (times/dates). Interestingly, the notion that ideas are not replicators but minds are (along with educational institutions/systems) has been explored by Liane Gabora in a paper with the same title: Ideas are not replicators but minds are.
Thanks for your comment.