Friday, March 02, 2007

How TIOBE Says its Programming Community Index is calculated


Recently I mentioned and then featured the TIOBE Programming Community Index (PCI). This is one measure of the relative popularity of about 100 different programming languages.

As has been noted elsewhere, popularity is important to the user base of a particular language, because a popular language is an employable language. There are considerations of both vested interest ("I spent 2 years and $1400 in books, courses and conference fees learning C#") and personal preference ("I would love to write OCaml code all day long"). Unless you're a Paul Graham with your own business and the executive authority to decree All Will Be Done in LISP, you want to proselytize other developers to use your language, and monitoring popularity statistics is a way to track progress towards adoption of your pet language. (I should note that 'Siamese Fighting Fish' would make a great name for a programming language, especially if it were in a beta release.)

How TIOBE says the PCI is calculated


On the front page of the Index, TIOBE says:
"the ratings are based on the world-wide availability of skilled engineers, courses and third party vendors. The popular search engines Google, MSN, and Yahoo! are used to calculate the ratings."

However, there's a link to the "definition" of the PCI, and it says something a little different.

TIOBE rating comes from search engine result counts


If you follow the link in the sentence that says "the definition of the TIOBE index can be found here" (TIOBE loses points for web design, as the definition page redirects itself if loaded outside a frameset), it explains how the various columns are calculated. The "rating" column is the one used to sort the table, and the one most people will quote when they say "Language X is number 12". It turns out this value is calculated from a cleaned-up page hit count for the search query:
+"<language> programming"
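
For the curious, here's roughly what building that query looks like in Python. This is only a sketch: the function name tiobe_query is my own invention, and I'm assuming the query gets URL-encoded the usual way before being handed to a search engine. The definition page doesn't say how TIOBE actually submits it.

```python
from urllib.parse import quote_plus


def tiobe_query(language):
    """Build the exact-phrase query quoted above for one language name."""
    return f'+"{language} programming"'


for name in ["OCaml", "Logo", "D"]:
    query = tiobe_query(name)
    # quote_plus is the standard-library way to escape a query-string value.
    print(query, "->", quote_plus(query))
```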

Jobs, courses and vendors not mentioned in the definition


The definition page doesn't seem to mention the elements named on the front page: "availability of skilled engineers, courses and third party vendors." The FoxPro community was kind of counting on those being part of the calculation. I can't reconcile the discrepancy between the description on the index page and the definition given later. I've checked the TIOBE index via the Wayback Machine for 2004, the year of that FoxPro post, and the description and definition are unchanged in this respect.

"<language> programming" as a choice of search query


So now we all know what we need to do: stop writing about "programming in OCaml", "OCaml hacking", and "writing a widget in OCaml", and start writing about "making a widget through OCaml programming". I can see how this search query phrase might generate fewer false positives than simply using the language name (where "C", "D", "Natural", and "Logo" have an unfair edge), but is this really the best and only query to use? Fortunately TIOBE does a few more queries to fix up the results. They have a list of "groupings" and "exceptions", where searches for "J2EE programming" are counted towards "Java", and "3-D Programming" is excluded from results for "D programming". (There, I just artificially inflated some counts.) Still, I think there might be additional queries of different patterns that could be helpful.
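
Here's a minimal Python sketch of how groupings and exceptions might be applied to raw hit counts. Everything in it is an assumption on my part: the table names, the function adjusted_hits, and the hit counts are invented, and I'm guessing that grouped phrases are added and excepted phrases subtracted. Only the J2EE/Java and "3-D programming"/D rules come from TIOBE's lists as described above.

```python
# Hypothetical rule tables; only these two rules are mentioned in this post.
GROUPINGS = {"Java": ["J2EE"]}            # extra terms counted towards a language
EXCEPTIONS = {"D": ["3-D programming"]}   # phrases excluded from a language's hits


def adjusted_hits(language, hits_for_phrase):
    """Return a cleaned-up hit count for one language on one search engine.

    hits_for_phrase maps an exact phrase to its (made-up) hit count.
    """
    total = hits_for_phrase.get(f"{language} programming", 0)
    # Groupings: add hits for related terms, e.g. "J2EE programming" -> Java.
    for alias in GROUPINGS.get(language, []):
        total += hits_for_phrase.get(f"{alias} programming", 0)
    # Exceptions: remove hits for known false positives.
    for phrase in EXCEPTIONS.get(language, []):
        total -= hits_for_phrase.get(phrase, 0)
    return max(total, 0)


# Invented numbers, purely to exercise the function.
sample = {
    "Java programming": 1000,
    "J2EE programming": 200,
    "D programming": 500,
    "3-D programming": 450,
}
print(adjusted_hits("Java", sample))
print(adjusted_hits("D", sample))
```

With these invented counts, Java picks up the J2EE hits and D loses most of its hits to the 3-D exception, which is the kind of cleanup the definition page seems to describe.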

Page hit count not well-understood


Technorati CEO David Sifry wonders where the Google and Yahoo! page hit counts come from when you can only view about 1000 results, and this thread is continued here in a discussion on blog.searchenginewatch.com. These numbers are also presented as approximations, so I wonder how far off the counts can be and how that might affect the TIOBE rating.

Reproducing the PCI


Since TIOBE has given a recipe for the way the rating is calculated, it should be easy to reproduce it and compare the results to those published. The current recipe is a little unclear on the rating calculation (I think they're missing some parentheses), but the April 2006 description was a little clearer.

The algorithm seems to say the rating is an average of the ratings on each search engine, and the rating for a search engine is computed by dividing the page hit count for the language in question by the sum of the page hit counts for the top 50 languages.
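
As I read it, the calculation would look something like the Python sketch below. The function names and all the hit counts are made up, and three languages stand in for the top 50. This is my interpretation of the recipe, not TIOBE's actual code.

```python
def engine_rating(hits, language, top_languages):
    """Share of one engine's hits that go to `language`.

    The denominator is the summed hit count of the top languages.
    """
    total = sum(hits[name] for name in top_languages)
    return hits[language] / total


def tiobe_rating(hits_by_engine, language, top_languages):
    """Average the per-engine ratings across all search engines."""
    per_engine = [
        engine_rating(hits, language, top_languages)
        for hits in hits_by_engine.values()
    ]
    return sum(per_engine) / len(per_engine)


# Invented hit counts; a real run would use the cleaned-up counts per engine.
hits_by_engine = {
    "Google": {"Java": 600, "C": 300, "Logo": 100},
    "MSN": {"Java": 50, "C": 40, "Logo": 10},
    "Yahoo!": {"Java": 500, "C": 400, "Logo": 100},
}
top_languages = ["Java", "C", "Logo"]

for name in top_languages:
    print(f"{name}: {tiobe_rating(hits_by_engine, name, top_languages):.2%}")
```

One consequence of this reading is that each engine counts equally, no matter how many pages it indexes, and the ratings of the top languages add up to 100%.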

How do you know what the top 50 languages are until you calculate the ratings? Or do they start with the previous top 50? I also wonder how they handle languages not in the top 50.
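
One possible way out of the chicken-and-egg problem (my guess, not anything the definition page says): on a single search engine the denominator is the same for every language, so sorting by raw hit count gives the same order as sorting by rating. You could pick the top 50 from the raw counts first and divide afterwards. Here's a Python sketch with invented counts, hypothetical function names, and n=3 standing in for 50:

```python
def top_n_by_hits(hits, n=50):
    """Return the n language names with the largest raw hit counts."""
    return sorted(hits, key=hits.get, reverse=True)[:n]


def ratings_for_engine(hits, n=50):
    """Rate every language against the summed hits of the top n."""
    top = top_n_by_hits(hits, n)
    denominator = sum(hits[name] for name in top)
    # Languages outside the top n still get a rating from the same denominator.
    return {name: count / denominator for name, count in hits.items()}


# Invented counts for one engine.
hits = {"Haskell": 40, "Java": 600, "Logo": 100, "C": 300}
top = top_n_by_hits(hits, n=3)
ratings = ratings_for_engine(hits, n=3)
print(top)
for name in sorted(ratings, key=ratings.get, reverse=True):
    print(f"{name}: {ratings[name]:.1%}")
```

Whether TIOBE picks the top 50 per engine, overall, or from last month's table, I can't tell from the definition.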

So, if you have the inclination, you can try to recreate the TIOBE PCI and see what you get. I have great confidence I can do this quickly in my spare time with my favorite web scraping secret weapon, but for now, I leave you with this thought:
Logo (#27) beats out Haskell (#41), so all we have to do is add monads to Logo and we'll be cranking out MIT-bound schoolkids like there's no tomorrow.

4 comments:

David said...

You might be interested in checking out my own results, although they are a little bit out of date at this point (I promise I'll re-run them before summer!):

Language Popularity

Anonymous said...

RE: David, your charts don't have nearly enough languages...what about CL, or Scheme, or Erlang, or OCaml?

JFKBits said...

See comments on reddit for a recent post about TIOBE that also discuss the ways the index is calculated.

JFKBits said...

At sucks-rocks.com (Slogan: "Does it suck? Or does it rock?") they have canned results comparing some popular languages based on search engine data.