Drupal site crawler project

For years now, people have been asking me how many Drupal sites there are. This is, of course, something of a moving target and I figured that the only way to answer that question was to count all the Drupal sites out there one by one. Three years ago, I was finally motivated enough to write a Drupal site crawler that looks over the millions of websites online to find those powered by Drupal. The crawler initially ran for about 3 months, and it returned a lot of Drupal sites. For each one it found, I then did some data-mining to extract the location of their hosting servers. I made a heatmap visualization of all the Drupal sites on a map of the world, and was able to start tracking Drupal's geographical growth patterns over time. As a bonus, the crawler also counted Joomla!, Mambo, and Wordpress sites, as well as a number of other open source content management systems - it is good data on Drupal and the competition.
Writing that crawler was a really fun project. It taught me a ton about how Drupal is used, where Drupal was growing, and how we compared to other content management systems. On the technical side, I learned a ton about scalability. Thanks to my engineering background and my work on Drupal, scalability issues weren't new to me, but writing a crawler that processes information from billions of pages on the web, is a whole different ball park. At various different stages of the project and index sizes, some of the crawler’s essential algorithms and data structures got seriously bogged down. I made a lot of trade-offs between scalability, performance and resource usage. In addition to the scalability issues, you also have to learn to deal with massive sites, dynamic pages, wild-card domain names, rate limiting and politeness, cycles in the page/site graph, discovery of new sites, etc.
Even though the crawler ran for many months, I never really launched it or talked about it publicly. I personally lacked the resources (both my time and the money to run the servers) to keep it running all the time. I ultimately stopped the crawler altogether and to put the project on hold. Since then, a couple of things have changed: I learned a lot more about scalability thanks to my ongoing work on Drupal and on Mollom, which now processes hundreds of thousands of spam messages a day. I also co-founded Acquia along the way. Acquia shares my personal interest in tracking Drupal's growth and has the resources to help me revive my crawler. Last but not least, there has been a lot of innovation and knowledge sharing around how to solve scalability problems. As such, I'd now like to pick up the project where I left off 3 years ago, and relaunch it under the Acquia umbrella.
One thing hasn’t changed: the day remains +/- 24 hours long. I still don't have enough time to work on this, and have many priorities I didn’t have 3 years ago. That is why I'm looking for a summer intern who wants to take the lead on this project for a few months. The crawler is written in Java, so I'm looking for a student proficient in Java, and who also understands fundamental internet protocols such as DNS, HTTP and HTML. Experience with building scalable multi-server systems is a plus but not strictly required. I don't care where you live, but I only want to work with people who work hard and who have strong programming skills. Are you interested in being my intern for the summer? Contact me at my contact form and send me a copy of your resume. Or if you know someone who would be a good candidate, please point them to this blog post.
Related Content
AcquiaBlog

2010 has been an inflection point for the Acquia partner program. We are doing more business than ever with partners, including case studies with Palantir.net, Blink Reaction, and IBM Global Services.
Bryan House
It is that phase of my life! I'm just turning 30 in a month, working with Drupal for 7 years and just had my third Acquia anniversary a week ago. Time to look back and evaluate how things went, all the good and bad things; even better if the wisdom can be shared with others. This was part of my thinking when I submitted the session titled "Come for the software, stay for the community" for Drupalcon Copenhagen.
Gábor Hojtsy
It sounded like a really simple request: "Is it easy to add a search filter for 'My posts'?". In other words, add a search result facet for posts by the current (logged in) user through the Apache Solr Search Integration module APIs?
But then the wheels start turning - we want not just one blind link, but a real facet link that tells us how many results we'll get. Also, if we are filtering by 'My posts' then we probably have an equal use case for the opposite filter 'Posts not by me'. So we really need a facet block with two links and facets counts.
Peter Wolanin






