Understanding the ApacheSolr CCK API

UPDATE: If you're working with the DRUPAL-6--2 branch please see the updated code example below.
In this article I will show you how a tiny bit of code can expose new fields and facets for searching with the ApacheSolr module and Acquia Search. Using Acquia Drupal we'll write an example module that takes the file type from CCK file and image fields and turns it into its own search field. This lets us filter our search results by file type. It covers the case where you want, for example, to find a specific post that has a JPEG image, or all of the posts with PDFs that match a particular keyword.
To start you may want to download the PDF file of screenshots that trace all of the steps I took to set up Acquia Drupal, Acquia Search, and the custom module. The broad steps are to:
- Sign up for a free trial.
- Download and install Acquia Drupal.
- Follow along with the code examples below to create the example.module.
- Set up a content type to have image and file fields. Create some content and upload a variety of files.
- Run cron to make sure your content has been indexed.
- Enable the new filters and blocks that the example.module is responsible for having created.
- Search!
What are facets?
The ApacheSolr module is a revolution in Drupal search. It allows you to search for the most general keyword that applies to what you're looking for, and then use the provided facet links to drill down to exactly the right content. An example would be searching for "Drupal search" on Drupal.org, then filtering by project and robertDouglass to get just the modules that I have written that deal with search. Facets, in other words, are the better version of advanced search forms, which, to be honest, suck.
What facets are available?
By default, ApacheSolr makes facets available for content type, author, language, taxonomy terms, and all CCK fields that are text fields with option widgets (select, radio, checkbox). This article assumes that you want some different facets, and you're using CCK fields. The file field and image field both have some interesting information that would make a great facet - their file type. Every file you upload has a distinct type: pdf, png, gif, tiff, doc, and so forth. Wouldn't it be nice to have this available as a facet? Of course!
What needs to happen to add new facets?
The ApacheSolr module comes equipped with an API for extending what gets indexed and how searching works. One of the important hooks in this API is hook_apachesolr_cck_field_mappings(). We're going to write a module that implements this hook, and use it to tell ApacheSolr how to make facets out of the file type on file and image fields.
To do this the hook is only going to have to tell ApacheSolr three things:
- What data type should be used in the index.
- Which CCK widget types to look for during indexing.
- A callback function to use for extracting the data from the CCK field. We write this function ourselves.
The callback function that we write will then receive each node and each field name as they are being indexed. From that it must extract or generate whatever information interests us. In this case we're just extracting the file type, which is already present in the field. We could, however, return any amount of data doing any arbitrary processing that we care to. See the code example below to understand the structure of the array that the callback has to return.
The example module implementing hook_apachesolr_cck_field_mappings()
The first step in writing any module is to create an .info file. Here's ours:
; file example/example.info
name = example
description = Example module showing custom CCK facets.
core = 6.x
The next step is to have a module file. This is the example/example.module file:
<?php
/**
 * Implementation of hook_apachesolr_cck_field_mappings().
 */
function example_apachesolr_cck_field_mappings() {
  $mappings = array();
  // 'filefield' is the CCK field_type. Correlates to $field['field_type']
  $mappings['filefield'] = array(
    // The callback function gets called at indexing time to get the values.
    'callback' => 'example_callback',
    // Common types are 'text', 'string', 'integer',
    // 'double', 'float', 'date', 'boolean'
    'index_type' => 'string',
    // These are the CCK formatting widgets for which this mapping applies.
    // If we wanted to target images but not generic files, for example,
    // we could say 'filefield_widget' => FALSE
    'widget_types' => array(
      'filefield_widget' => TRUE,
      'imagefield_widget' => TRUE,
    ),
  );
  return $mappings;
}

/**
 * A function that gets called during indexing.
 *
 * @param $node The current node being indexed
 * @param $fieldname The current field being indexed
 *
 * @return an array of arrays. Each inner array is a value, and must be
 * keyed 'safe' => $value
 */
function example_callback($node, $fieldname) {
  $fields = array();
  foreach ($node->$fieldname as $field) {
    // In this case we are indexing the filemime type. While this technically
    // makes it possible that we could search for nodes based on the mime type
    // of their file fields, the real purpose is to have facet blocks during
    // searching.
    $fields[] = array('safe' => check_plain($field['filemime']));
  }
  return $fields;
}
?>
The example_apachesolr_cck_field_mappings() function returns an array that says "for any filefield CCK fields (this includes imagefields), use the function example_callback() while indexing, store the data as strings, and apply these instructions to filefield_widgets and imagefield_widgets".
The example_callback() function, which we specified as the callback, will get called with the $node and $fieldname during indexing. We use that information to dig around and get the file type, which is found in $field['filemime']. Important note: the return value of the callback is an array of arrays. The inner arrays have one key, 'safe', and that key's value is the actual value we want to be indexed and used for faceting. The "safe" name of the key is there to remind you, as the developer, to sanitize the value and not allow any cross-site scripting.
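To make that return structure concrete, here is a minimal sketch that runs the callback against a hand-built node outside of Drupal. The check_plain() stand-in, the field name field_attachment, and the sample file data are assumptions made only so the snippet is self-contained; inside Drupal the real check_plain() is used and the node comes from the indexer.

```php
<?php
// Stand-in for Drupal's check_plain(), defined only so this sketch runs
// outside Drupal. It is an approximation, not the real implementation.
if (!function_exists('check_plain')) {
  function check_plain($text) {
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
  }
}

// Same callback as in example.module, repeated so the snippet stands alone.
function example_callback($node, $fieldname) {
  $fields = array();
  foreach ($node->$fieldname as $field) {
    $fields[] = array('safe' => check_plain($field['filemime']));
  }
  return $fields;
}

// Hypothetical node with one multi-value file field. The second mime type
// is deliberately malformed to show what the sanitizing step does.
$node = new stdClass();
$node->field_attachment = array(
  array('filename' => 'report.pdf', 'filemime' => 'application/pdf'),
  array('filename' => 'odd.bin', 'filemime' => 'image/<svg>'),
);

$values = example_callback($node, 'field_attachment');
print_r($values);
```

Each uploaded file contributes one inner array, and the markup characters in the second value come back escaped, which is the point of the 'safe' key.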
Below are screenshots of the return values of each of these functions for those who learn visually.


Results
Now when we search, we have two new facet blocks available letting us drill down into the search results based on the type of files that are uploaded to each one. Not bad for 15 lines of code (excluding comments)!






Robert Douglass
Peter Wolanin points out
Peter Wolanin points out that, at the moment, if you try to use a variation of this code on textfields, they get clobbered by the existing optionwidgets definitions in the ApacheSolr module itself. Just a warning for those of you with talents for discovering edge cases =)
Robert Douglass
Senior Drupal Advisor, Acquia
Robert, it is amazing how
Robert, it is amazing how far solr has come with Drupal in the last 2 years. I started researching this back at the open source CMS conference at Yahoo! and you had also just begun to work on this. I gave up and you didn't so thank you. Wish I was more of a developer I guess.
Anyhow, your outsourced solr hosting is a fantastic idea. Eased the implementation process greatly for those who can't do Tomcat, etc.
Again, thank you.
1) Using horse height as a
1) Using horse height as a facet example (these comments are from my first drupal-solr project, so please feel free to correct them): a simpler way to create a custom facet than using the above function is to
* define a new text field - say called field_horseheight_class (assuming your actual horse height is stored in field_horseheight)
* and in the drupal GUI define the 'possible values' eg: 'Upto 12 hands', '12-14 hands'
* At the point where you are creating / updating the node - you can
// ------
// CCK stores field values as an array of items, so read the first item's value.
$height = $node->field_horseheight[0]['value'];
$horseheightClass = 'Upto 12 hands';
if ($height < 12) {
  $horseheightClass = 'Upto 12 hands';
}
else if ($height >= 12 && $height <= 14) {
  $horseheightClass = '12-14 hands';
}
...
$node->field_horseheight_class = array('0' => array('value' => $horseheightClass));
* set the widget type for the field_horseheight_class to "check box/radio buttons"; the apachesolr config panel will then automatically display the option to enable this as a facet
2) You may want to look at the apachesolr_og module - it's a nice example of how to create your own facet (without using the apachesolr_cck_field_mappings function above) - (simply cut and paste the apachesolr_og class and replace with your values)
3) None of the documentation points this out, but I think it's important to explain that apachesolr learns about new fields without needing a change to schema.xml by using dynamic fields and a naming convention. You can see the dynamic fields defined in schema.xml, for example:
<!-- Dynamic field definitions. If a field name is not found, dynamicFields will be used if the name matches any of the patterns.
RESTRICTION: the glob-like pattern in the name attribute must have
a "*" only at the start or the end.
EXAMPLE: name="*_i" will match any field ending in _i (like myid_i, z_i)
Longer patterns will be matched first. if equal size patterns
both match, the first appearing in the schema will be used. -->
<dynamicField name="is_*" type="integer" indexed="true" stored="true" multiValued="false"/>
<dynamicField name="im_*" type="integer" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="sis_*" type="sint" indexed="true" stored="true" multiValued="false"/>
<dynamicField name="sim_*" type="sint" indexed="true" stored="true" multiValued="true"/>
4) You can see the dynamic fields that have been created by going to your solr admin port - by default http://localhost:8983/solr/ and clicking on "schema browser" -> "dynamic fields" and then "ss_*" or one of the other dynamic field prefixes
5) By default - apachesolr integration will take all the fields associated with your node and dump them into the document 'body' that is sent for indexing to apachesolr. This means that if you have
A) A private field that you do not want indexed or
B) Some field that you are using for admin purposes (not displayed to users)
it will still match on searches.
For example: let's say you have a private field containing the data "evil company", and a user searches for "evil"; the node will match.
One way to prevent certain fields from being indexed is to
function apachesolr_myproject_nodeapi(&$node, $op, $a3 = NULL, $a4 = NULL) {
  // This function gets called a _lot_
  if ($op == 'view') {
    // we are in the process of running cron.php to build the search index
    if (isset($node) && ($node->build_mode == NODE_BUILD_SEARCH_INDEX)) {
      if (isset($node->content['body'])) {
        // we don't want fields like field_video_rights to go into the search index
        // (because they match the search and appear in the snippet shown in the search result),
        // so we remove all of them and only allow body to go through to solr
        $content = (array) $node->content;
        $body = $content['body'];
        $node->content = array('body' => $body);
      }
    }
  }
}
6) During development of your search functionality, it really pays to cut the data in your dev database down to 2 or 3 nodes per content type - this saves a LOT of time in reindexing cycles. The simplest way to do this is to identify the nodes you want to keep and then on your dev database run
delete from node where nid not in ( 1111, 1112, .... etc )
and then don't forget to flush the cache
7) Also just to point out - in general, if you want a field to be listed in the apachesolr panel as a possible facet, you should set the widget type to "check box/radio buttons" and NOT "single on/off checkbox"
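The matching rules quoted from schema.xml in point 3 can be illustrated with a small sketch. This is not Solr's implementation; example_match_dynamic_field() is a hypothetical helper that only mirrors the rules stated in that comment (a "*" at the start or end, longer patterns first, ties going to the earlier pattern). The "*_i" pattern is taken from the comment's own example, and the field names are made-up sample values.

```php
<?php
// Hypothetical helper: returns the pattern that would claim $name under the
// rules described in the schema.xml comment, or NULL if nothing matches.
function example_match_dynamic_field($name, array $patterns) {
  $best = NULL;
  foreach ($patterns as $pattern) {
    if (substr($pattern, -1) === '*') {
      // Trailing "*": the rest of the pattern must be a prefix of the name.
      $prefix = substr($pattern, 0, -1);
      $matches = strncmp($name, $prefix, strlen($prefix)) === 0;
    }
    elseif (substr($pattern, 0, 1) === '*') {
      // Leading "*": the rest of the pattern must be a suffix of the name.
      $suffix = substr($pattern, 1);
      $matches = strlen($name) >= strlen($suffix)
        && substr($name, -strlen($suffix)) === $suffix;
    }
    else {
      $matches = FALSE;
    }
    // Strictly longer wins, so on equal length the earlier pattern is kept.
    if ($matches && ($best === NULL || strlen($pattern) > strlen($best))) {
      $best = $pattern;
    }
  }
  return $best;
}

$patterns = array('is_*', 'im_*', 'sis_*', 'sim_*', '*_i');
var_dump(example_match_dynamic_field('sim_field_height', $patterns));
var_dump(example_match_dynamic_field('is_count_i', $patterns));
var_dump(example_match_dynamic_field('title', $patterns));
```

The second lookup matches both "is_*" and "*_i"; the longer pattern is the one chosen, which is the behaviour the schema comment describes.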
updated example code for
updated example code for apachesolr-6.1-rc2
hi robert,
thanks for the great apachesolr project and this article! I noticed though that the code in this article isn't compatible with the current rc2-api/code. i have adjusted the code snippets that should work w/ rc2 (see below & please review :) ). perhaps you could make a brief note in the article or update the examples.
thanks again for the amazing effort - best, fredrik
1. adjust hook name and structure of return array
/**
 * Implementation of hook_apachesolr_cck_fields_alter
 */
function example_apachesolr_cck_fields_alter(&$mappings) {
  // either for all CCK of a given field_type and widget option
  // 'filefield' is here the CCK field_type. Correlates to $field['field_type']
  $mappings['filefield'] = array(
    'filefield_widget' => array('callback' => 'example_callback', 'index_type' => 'string'),
    'imagefield_widget' => array('callback' => 'example_callback', 'index_type' => 'string')
  );
  // or per-field indexing assuming field_example is a filefield cck field
  $mappings['per-field']['field_example'] = array(
    // The callback function gets called at indexing time to get the values.
    'callback' => 'example_callback',
    // Common types are 'text', 'string', 'integer',
    // 'double', 'float', 'date', 'boolean'
    'index_type' => 'string',
  );
}
2. return 'value'=>'XXXX' instead of 'safe'=>check_plain('XXXX') as apachesolr_clean_text() is called on the value in apachesolr_node_to_document()?
/**
 * A function that gets called during indexing.
 * @node The current node being indexed
 * @fieldname The current field being indexed
 *
 * @return an array of arrays. Each inner array is a value, and must be
 * keyed 'value' => $value
 */
function example_callback($node, $fieldname) {
  $fields = array();
  foreach ($node->$fieldname as $field) {
    // In this case we are indexing the filemime type. While this technically
    // makes it possible that we could search for nodes based on the mime type
    // of their file fields, the real purpose is to have facet blocks during
    // searching.
    $fields[] = array('value' => $field['filemime']);
  }
  return $fields;
}
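As a rough way to see how the two rc2 pieces above fit together, the sketch below runs the alter hook on an empty array and then invokes the callback named in one of the mappings. The lookup by field type and widget, the call_user_func() dispatch, and the sample node are assumptions for illustration only; they stand in for whatever the ApacheSolr module actually does at indexing time.

```php
<?php
// Widget-level mapping copied from the comment above so the sketch stands alone.
function example_apachesolr_cck_fields_alter(&$mappings) {
  $mappings['filefield'] = array(
    'filefield_widget' => array('callback' => 'example_callback', 'index_type' => 'string'),
    'imagefield_widget' => array('callback' => 'example_callback', 'index_type' => 'string'),
  );
}

// rc2-style callback: values are keyed 'value' and returned unsanitized.
function example_callback($node, $fieldname) {
  $fields = array();
  foreach ($node->$fieldname as $field) {
    $fields[] = array('value' => $field['filemime']);
  }
  return $fields;
}

// The hook receives $mappings by reference and fills it in place.
$mappings = array();
example_apachesolr_cck_fields_alter($mappings);

// Hypothetical node carrying a single image in field_example.
$node = new stdClass();
$node->field_example = array(array('filemime' => 'image/png'));

// Assumed lookup: CCK field type 'filefield' shown with 'imagefield_widget'.
$mapping = $mappings['filefield']['imagefield_widget'];
$values = call_user_func($mapping['callback'], $node, 'field_example');
print_r($values);
```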
Robert Douglass
Thanks for the updated code,
Thanks for the updated code, fredrik. I've added a note to the article referencing it.
Robert Douglass
Senior Drupal Advisor, Acquia
robert i tried all of codes
Robert, I tried all of the code and it works wonderfully. At first I didn't understand this part, so I asked a friend, who told me it is very easy:
 * keyed 'safe' => $value
 */
function example_callback($node, $fieldname) {
  $fields = array();
  foreach ($node->$fieldname as $field) {
    // In this case we are indexing the filemime type. While this technically
    // makes it possible that we could search for nodes based on the mime type
    // of their file fields, the real purpose is to have facet blocks during
    // searching.
    $fields[] = array('safe' => check_plain($field['filemime']));
  }
  return $fields;
}
But I understand it now. Thank you, Robert.
I've created a cck filed of
I've created a CCK field of type textarea with the name field_desc. How do I get this field indexed in Solr?
I have tried the code below, but it is not indexing the field. Can somebody help?
<?php
// $Id$
/**
 * Implementation of hook_apachesolr_cck_fields_alter
 */
function example_apachesolr_cck_fields_alter(&$mappings) {
  // either for all CCK of a given field_type and widget option
  // 'filefield' is here the CCK field_type. Correlates to $field['field_type']
  $mappings['text'] = array(
    'text_textarea' => array('callback' => 'example_callback', 'index_type' => 'string'),
  );
}
/**
 * A function that gets called during indexing.
 * @node The current node being indexed
 * @fieldname The current field being indexed
 *
 * @return an array of arrays. Each inner array is a value, and must be
 * keyed 'value' => $value
 */
function example_callback($node, $fieldname) {
  $fields = array();
  foreach ($node->$fieldname as $field) {
    // In this case we are indexing the filemime type. While this technically
    // makes it possible that we could search for nodes based on the mime type
    // of their file fields, the real purpose is to have facet blocks during
    // searching.
    $fields[] = array('value' => $field['field_desc']);
  }
  return $fields;
}
?>
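One thing that stands out in the code above, assuming field_desc is a standard CCK text field: each item in $node->field_desc normally keeps its content under the 'value' key, not under the field's own name, so $field['field_desc'] would be empty. The sketch below shows a callback that reads 'value' instead. The function name example_text_callback and the sample node are hypothetical, and this is not a verified fix; the earlier note in this thread about text field mappings being overridden by the module's own optionwidgets definitions may also apply.

```php
<?php
// Hypothetical callback for a CCK text field such as field_desc.
function example_text_callback($node, $fieldname) {
  $fields = array();
  foreach ($node->$fieldname as $item) {
    // Assumption: CCK text items store their content under the 'value' key.
    if (isset($item['value']) && $item['value'] !== '') {
      $fields[] = array('value' => $item['value']);
    }
  }
  return $fields;
}

// Hand-built node, used only so the sketch runs outside Drupal.
$node = new stdClass();
$node->field_desc = array(
  array('value' => 'A long description.', 'format' => 1),
  array('value' => '', 'format' => 1),
);

$values = example_text_callback($node, 'field_desc');
print_r($values);
```

For long free text it may also be worth trying 'text' rather than 'string' as the index_type, since both appear in the list of common types quoted earlier.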