Wikipedia and Search – Some Quick Numbers

Posted: March 4, 2009 | 10 Comments

Over the next several days I’m going to be posting some of my findings about, and analysis of, the relationship between search engines (specifically Google) and Wikipedia. In a post on the University of Amsterdam’s “Masters of Media Blog” (here), Arno de Natris suggests that Wikipedia writers use Google as a primary means for determining whether or not something exists. “If you are not on Google,” he writes, “you never existed.” He presents as evidence a hoax text he attempted to insert into Wikipedia, which was refuted via this Google test.

My results suggest there is something to Arno’s argument, but that things may be a bit more complex than that one case might make them appear. Just to get things rolling, I would like to post tonight some quick numbers extracted from my data blog (here), which is a recording of all (or at least nearly all) of the articles that users attempted to create on Wikipedia in the 24-hour period between 8am October 1, 2008, and 8am October 2, 2008. I recorded these by setting an RSS reader to download Wikipedia’s new article RSS feed once every five minutes for that 24-hour period (it is possible that this polling missed a few articles). This data is useful because it allows us to track which of the articles created in this 24-hour period were later deleted, even if they were deleted extremely quickly.
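For readers who want to try a similar recording, here is a minimal sketch of the polling approach in Python. It is not the setup I actually used (that was an off-the-shelf RSS reader). The feed URL, the function names, and the choice to key articles by their link are all assumptions made for illustration.

```python
import time
import urllib.request
import xml.etree.ElementTree as ET

# Assumed feed address; the post does not record which URL was polled.
FEED_URL = "https://en.wikipedia.org/w/index.php?title=Special:NewPages&feed=rss"
POLL_SECONDS = 5 * 60           # "once every five minutes"
WINDOW_SECONDS = 24 * 60 * 60   # the 24-hour recording period


def parse_items(rss_document):
    """Return (link, title) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_document)
    pairs = []
    for item in root.iter("item"):
        link = item.findtext("link", default="").strip()
        title = item.findtext("title", default="").strip()
        if link:
            pairs.append((link, title))
    return pairs


def merge_new(seen, pairs):
    """Add unseen items to `seen` (keyed by link) and return how many were new."""
    added = 0
    for link, title in pairs:
        if link not in seen:
            seen[link] = title
            added += 1
    return added


def record(feed_url=FEED_URL, poll=POLL_SECONDS, window=WINDOW_SECONDS):
    """Poll the feed until the window closes and return everything seen."""
    seen = {}
    deadline = time.time() + window
    while time.time() < deadline:
        request = urllib.request.Request(
            feed_url, headers={"User-Agent": "newpages-sketch/0.1"}
        )
        with urllib.request.urlopen(request) as response:
            merge_new(seen, parse_items(response.read()))
        time.sleep(poll)
    return seen
```

Because each poll only sees whatever the feed holds at that moment, any article that entered and scrolled off the feed between two polls would never be recorded, which is the source of the possible losses mentioned above.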

There were 1043 attempts to create articles during this 24-hour period. When I rechecked the articles approximately 5 months later, on March 2, 2009, I found that 410 articles had been deleted and 633 retained.

Of the 410 articles deleted, the vast majority, about 350, were subject to what is called a “speedy delete” process. A bit of explanation as to what that means is in order. Deleting an article from Wikipedia is a bit of a big deal. Wiki’s great strength is that it keeps records of everything, every edit to an article, so as to make the Wiki process transparent, and to allow for bad edits to be reversed. For this reason, deleting a well-established article requires a quite deliberative process, involving on-site debate and consensus seeking.

However, as Wikipedia became larger, and more attractive to vandalism, it quickly became apparent that this deliberative process could not keep up with the vast numbers of articles being created. So for certain categories of offense, it was agreed that any administrator could “Speedily Delete” a recent article, so long as they noted that the article fit into one of the agreed-upon categories. Today these “Criteria For Speedy Deletion” (CSD) encompass dozens of possible reasons for nixing an article, organized into several broad groups.

Of the 350 articles created in my test period that were ultimately subject to Speedy Deletion, far and away the most common criterion cited was CSD-A7, which accounted for 134 articles Speedily Deleted. CSD-A7 is explained by the Policy page describing CSD that would have been in effect on October 1 and 2, 2008, as pertaining to:

An article about a real person, organization (band, club, company, etc.), or web content that does not indicate why its subject is important or significant. This is distinct from questions of verifiability and reliability of sources, and is a lower standard than notability; to avoid speedy deletion an article does not have to prove that its subject is notable, just give a reasonable indication of why it might be notable. A7 applies only to articles about web content or articles on people and organizations themselves, not articles on their books, albums, software and so on. Other article types are not eligible for deletion by this criterion. If controversial, as with schools, list the article at Articles for deletion instead.

That is to say, CSD-A7 is intended to allow for the speedy deletion of articles about real people or organizations when the text of the article does not contain any meaningful assertion of why said person or group might be important enough for listing in an encyclopedia. Take, for example, this charming attempt to add an article on someone named “Sam Hoza”:

Sam Hoza iz a stud. yezzir

Clearly, this sort of petty vandalism, the equivalent of scratching one’s name onto the digital wall that is Wikipedia, can be speedily deleted by a Wikipedia admin without any real need to do much research work.

However, other articles that were Speedy Deleted under CSD-A7 during the test period are not so clearly lacking in claims of importance or significance. Take, for example, this attempt, also deleted under CSD-A7, to write an article on a Kurdish artist named “Hozan Kawa”:

Hozan ‘Kawa’ is a very popular Kurdish artist. He has a great love for Kurdish songs and poems. Kawa was born in a small place around Kurdistan of Turkey, instead (Palas) related to the city (Múshe). The city is famous for artists as Zeynike, Shakiro, Resho, Hüseyine Múshe and many other people.

His mother (Guleya Elif) said to him one day. Son after my death, I want you to remember never forget: If you want me to be proud of you and you lift my head up, where I shall love you. You must introduce the beautiful voice for the Kurdish people. They will never forget you and there you will be a legend in Music.

He has 11 siblings in her family. His father did many services to help him. He was a strong piece of ‘Kawa’ who would succeed to become a great artist.

‘Kawa’ started his studies in Kurdish. Besides his studies he began to learn Turkish. In Kawa’s world, he further in music. He really invested everything to be a professional Artist.

1987 in the city (Múshe), did he started a music group in the process to a large and useful artist carrier.

In 1995 he traveled to Europe and the country France. After a week in France, he started his career artist. In Newroz (1996) he was known by the group (Berxwedan). Now among the Kurdish, he is a big familiar face.

Between group Berxwedan he began to make his first album by the name ‘Ava Evine’. His album camed out in the year 2001. After that he published his second album ‘Taya Dila’.

Kawa eventually also gaved the third album ‘Ez ú Tu’. His second and third albums did make him to a amazing great artist. His voice is gold worth listening to. It’s just really fantastic for our Kurdish opinion.

‘Kawa’ has released the fourth album also more information about it coming later.

This article clearly makes a claim as to its subject’s significance. It claims its subject is a “popular Kurdish artist.” Clearly, the admin who deleted this article had to make a judgement call as to whether or not this claim was to be taken seriously. That is to say, they had to make a judgement call as to whether or not Hozan Kawa was, in fact, a notable Kurdish artist. Technically, such a decision would fall outside the letter of CSD-A7 and call for the more deliberate deletion process. However, given the perceived need of Wikipedia editors for a rapid means for deleting what they see as “vanity pages” posted by musical groups, authors, artists, and others, it is perhaps not surprising to see CSD-A7 pressed into service as a mechanism for Speedily Deleting articles of these sorts deemed not notable.

It is here that the use of search engines enters the picture. Two of the suggested uses of “Search Engine Tests” for policing articles on Wikipedia, as given by Wikipedia’s “Search Engine Test How-To Guide,” are:

3. Genuine or hoax – Identifying if something is genuine or a hoax (or spurious, unencyclopedic)

4. Notability – Confirm whether it is covered by independent sources or just within its own circles.

Thus, given the large number of articles Speedily Deleted under CSD-A7, and given the evidence that admins are using CSD-A7 to Speedily Delete articles on subjects they believe are non-notable (perhaps even hoaxes), it seems reasonable that testing with Google and other search engines might play an important role in informing their decisions in some of these cases. This is especially so given how quickly Speedy Delete decisions are often made, which makes the rapid information retrieval offered by a search engine even more attractive.

A further 17 articles were deleted under CSD-G12, which is to be used for cases of “blatant copyright infringement.” Detecting such infringement is another of the suggested uses of Search Engine Tests listed in the How-To Guide, and the use of search engines was suggested by Jimmy Wales as long ago as 2001 as a means of discovering copyrighted material being inserted into Wikipedia.
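To put the counts reported so far into proportion, here is a short Python tally. The figures are the ones given in this post; the variable names and the `share` helper are just illustrative, and because the Speedy Delete total is only "about 350," any share computed from it is approximate.

```python
# Counts reported in this post.
attempts = 1043   # article creation attempts in the 24-hour window
deleted = 410     # found deleted on recheck
retained = 633    # still present on recheck
speedy = 350      # "about 350" Speedy Deletes, so shares using it are approximate
a7 = 134          # Speedy Deletes citing CSD-A7
g12 = 17          # Speedy Deletes citing CSD-G12


def share(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


# Sanity check: deleted and retained articles account for every attempt.
assert deleted + retained == attempts

rows = [
    ("deleted, of all attempts", deleted, attempts),
    ("retained, of all attempts", retained, attempts),
    ("Speedy Deleted, of all deletions", speedy, deleted),
    ("CSD-A7, of Speedy Deletes", a7, speedy),
    ("CSD-G12, of Speedy Deletes", g12, speedy),
    ("CSD-A7, of all attempts", a7, attempts),
]

for label, part, whole in rows:
    print(f"{label}: {part}/{whole} = {share(part, whole)}%")
```

Roughly two in five attempted articles were gone five months later, and CSD-A7 alone accounts for well over a third of the Speedy Deletes.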

We can see that many Speedy Deletes may be informed by the use of Search Engines. While in some of these cases Speedy Deletes may occur because Google suggests the subject of an article “does not exist,” in other cases it may be that Google results simply imply the subject is not important enough to deserve listing in an encyclopedia.

This can also be seen in many cases of the more deliberative deletion process. (CONTINUED TOMORROW)
