Use Apache Solr to search in files

Drupal's file handling capabilities keep getting better. Beyond the core upload module, the filefield module for CCK has enabled us to build sites with all sorts of files: documents, images, music, videos, and so forth. Searching within these documents, however, has never been a common feature on Drupal sites. Some solutions have existed, particularly for extracting text from PDFs and common word processing documents. With Apache Solr, the attachments module, and an extension library called Tika, things can be much better. With Tika you can extract text not only from Microsoft Office, Open Office, and PDF documents, but also text and metadata from images, songs, Flash movies, and zipped archives. Searching this text is done as part of the normal Apache Solr driven site search.
This article shows how I set up Tika and the Apache Solr Attachments module on my MacBook Pro running Snow Leopard (OSX 10.6). There are two ways to run Tika: either as a client-side component (where the client is Drupal), or as a server-side component (the server being Solr). The advantage of running Tika client-side is that the files don't need to travel over the wire to have their text extracted. Especially in the case of rich media (movies, images, music) this is quite desirable. Why send a 20 MB video over the network just to get 15-20 lines of text from it? Another important advantage of running Tika client-side is that it works with Acquia Search.
The disadvantages of running Tika client-side are that you have to install it on every client (in a multi-webserver environment, for example), and the processing workload then falls onto your webserver instead of offloading it to the Solr server. Acquia Search also doesn't currently support the option of offloading extraction to the Solr server, though it is a feature we might add.
This article will show you how to install Tika on the client.
What you need
You need Java 1.6 (1.5 should work, but not as many document types are supported). Test this by typing java -version at the command line. Here's what I see on my machine:
robert$ java -version
java version "1.6.0_17"
Java(TM) SE Runtime Environment (build 1.6.0_17-b04-248-10M3025)
Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01-101, mixed mode)
You also need the Java build tool called Maven. If you're on OSX 10.6 like I am, it should already be installed. Check by typing mvn -v at the command line. Here's what I see on my machine:
robert$ mvn -v
Apache Maven 2.2.0 (r788681; 2009-06-26 15:04:01+0200)
Java version: 1.6.0_17
Java home: /System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home
Default locale: en_US, platform encoding: MacRoman
OS name: "mac os x" version: "10.6.2" arch: "x86_64" Family: "mac"
Building Tika
No matter how you decide to run Tika (client or server side), you'll need to get Tika first. A zip file of the source code is available from the download page on lucene.apache.org. Alternatively, you can get the source directly from the Subversion repository with this command:
svn export -r901084 http://svn.apache.org/repos/asf/lucene/tika/trunk tika-0.6
Either way you'll end up with a directory called tika-0.6. Next we're going to use a tool called Maven to build Tika from the source files. From the command line, change directories into the new tika-0.6 directory. Check to see that you're in the directory containing the pom.xml file. Then type the following two commands. The first gives Maven enough memory, and the second tells it to build Tika:
export MAVEN_OPTS="-Xmx1024m -Xms512m"
mvn install
The first time I tried this I was on the train between Cologne and Brussels using the spotty wireless connection on the Thalys. The build process broke three times and each time I just typed mvn install and it picked up where it had left off, eventually succeeding. Later I tried the build process again in normal conditions and it worked seamlessly.
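If your connection is as unreliable as mine was, the "just type mvn install again" routine can be wrapped in a small retry loop. This is only a sketch: retry is a hypothetical helper name of my own, not something that ships with Maven or Tika, and the limit of five attempts is an arbitrary assumption.

```shell
# retry MAX CMD [ARGS...]: run CMD until it succeeds, at most MAX times.
# "retry" is a hypothetical helper name, not part of Maven or Tika.
retry() {
  max=$1
  shift
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    # Stop as soon as the command exits successfully.
    if "$@"; then
      return 0
    fi
    echo "attempt $attempt of $max failed: $*" >&2
    attempt=$((attempt + 1))
  done
  # Every attempt failed.
  return 1
}

# Example (run from the tika-0.6 directory containing pom.xml):
#   export MAVEN_OPTS="-Xmx1024m -Xms512m"
#   retry 5 mvn install
```

The function itself knows nothing about Maven; it simply re-runs whatever command you hand it and gives up after the stated number of failures.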
When Maven is done building, your nugget of gold will be tika-app/target/tika-app-0.6.jar. Let's test it out! Still from the tika-0.6 directory, try extracting some text from a file using the new tika-app-0.6.jar file:
java -jar ./tika-app/target/tika-app-0.6.jar -t [path/to/a/file]
Replace [path/to/a/file] with the path to some interesting file you'd like to test. If everything goes right you'll get the text from that file dumped to stdout (which means you'll see it scrolling by in your terminal).
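To try several files in one go, the same command can be run in a loop. The sketch below is my own illustration, not part of Tika: extract_all and the TIKA_JAR variable are hypothetical names, and the default jar path assumes you are still in the tika-0.6 directory. It uses only the -t option shown above and reports how many bytes of text came out of each file.

```shell
# TIKA_JAR: path to the jar built above (an assumed variable name).
# The default assumes the current directory is tika-0.6.
TIKA_JAR=${TIKA_JAR:-./tika-app/target/tika-app-0.6.jar}

# extract_all DIR: run Tika's text extraction on every regular file in DIR
# and print the number of bytes of text extracted from each one.
# "extract_all" is a hypothetical helper, not a Tika command.
extract_all() {
  dir=$1
  if [ ! -f "$TIKA_JAR" ]; then
    echo "Tika jar not found at $TIKA_JAR" >&2
    return 1
  fi
  for f in "$dir"/*; do
    # Skip subdirectories and an unmatched glob.
    [ -f "$f" ] || continue
    bytes=$(java -jar "$TIKA_JAR" -t "$f" | wc -c | tr -d ' ')
    echo "$f: $bytes bytes of text"
  done
}

# Example:
#   extract_all ~/Documents
```

A file that reports 0 bytes is a hint that Tika found no text in it, or that the format isn't supported by your build.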
As a final step I moved the tika-app-0.6.jar file to ~/bin (the directory where I keep my custom scripts and libraries) and named it tika.jar. This is optional. You can keep the jar file wherever it makes sense to you. Just take note of its absolute path, as you'll need it when configuring the apachesolr_attachments module.
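For reference, that final step could look something like the following. It is a sketch under my own assumptions: install_jar is a hypothetical helper name, and it copies rather than moves the jar so the build output stays in place. It prints the absolute path of the copy, which is the value to note down.

```shell
# install_jar SRC DEST_DIR: copy the built jar to DEST_DIR/tika.jar and
# print the absolute path of the copy. "install_jar" is a hypothetical helper.
install_jar() {
  src=$1
  dest_dir=$2
  mkdir -p "$dest_dir" || return 1
  cp "$src" "$dest_dir/tika.jar" || return 1
  # Resolve DEST_DIR to an absolute path.
  abs_dir=$(cd "$dest_dir" && pwd) || return 1
  echo "$abs_dir/tika.jar"
}

# Example, run from the tika-0.6 directory:
#   install_jar ./tika-app/target/tika-app-0.6.jar "$HOME/bin"
```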
Now we're ready to use Tika within the context of Drupal and Apache Solr searching.
Drupal, Solr and the Apache Solr Attachments module
For instructions on installing Drupal, Solr, or the Apache Solr module, please refer to the linked resources. You can also get up and running very quickly using the Acquia Drupal Stack Installer and Acquia Search (try it for free).
Note that you may need to give Solr more memory when doing attachment searching. For this example I tested using the Jetty container that comes with the Solr download, but I started it using this command:
java -Xmx1024m -Xms512m -jar start.jar
Get and install the Apache Solr Attachments module in the normal fashion. There is one configuration screen, found at q=admin/settings/apachesolr/attachments. Most of the options are self-explanatory. You may want to allow a wider set of file extensions. For exact information about what is available, see the Tika supported formats page.

Upload files, run cron, and search
The only thing left to do is to upload some files, run cron, and do a search. Search results that match text in files link both to the file and to the node to which it belongs. Here's an example of me searching for "merlinofchaos" and finding the views-6.x-3.0-alpha2.tar.gz file that I uploaded (yes, Tika can search in tar.gz files, and yes, that's the whole Views 3 module).
Here's an example of me searching for "Drupal" and finding both a Word document and an iWork Keynote file.
Comments
Wim Mostrey
I'm curious how the module responds to files that are attached to multiple nodes, or files that are in the files table but that are no longer attached to a node.
Peter Wolanin
As of now, the module only examines files that are attached to nodes.
I think if the same file is attached to multiple nodes it may get indexed once per node. Since this is not common in D6 I haven't really tried to optimize the behavior.
How wonderful! Thanks!