Nov 14, 2014

How to write a PostFilter for Solr 4.9


The problem

When working with Solr, you may reach a point where custom filter functionality is needed. Solr offers several ways to filter results, such as ValueSourceParsers or QParserPlugins.

In our case, each document stored a boolean expression, ranging from a simple true to very complex expressions. This string was evaluated after the query to decide whether the document should be in the results or not.

Doesn't sound very fancy, right? Just calculate it in advance!
But the boolean expression had to be interpreted differently depending on our filter query parameters, so its result could vary massively from query to query.

This post explains how to implement a QParserPlugin with PostFilter capabilities.
To convey the concept, the filter itself is heavily simplified: it is only a modulo operation. Solr has shipped a mod function since version 4.0, but for understanding the idea this example is sufficient.


What is a PostFilter

A PostFilter in Solr is a filter that is executed after the main query (q) and the filter queries (fq), because it may involve expensive logic or depend on document information.


What we will be doing

  • Write a basic PostFilter function for Solr 
  • Extend it to be adjustable
  • Give a thought on performance (using DocValues)


Implementing a basic PostFilter

Two Java classes are needed: a QParserPlugin and a Query.


1. ModuloQueryParserPlugin.java

Let's create the ModuloQueryParserPlugin class, which extends QParserPlugin. This is the class we will register in Solr later. QParserPlugin is abstract, which forces us to implement two methods. For our filter we can leave the init method empty and focus on the createParser method.
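import org.apache.lucene.search.Query;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;
import org.apache.solr.search.SyntaxError;

public class ModuloQueryParserPlugin extends QParserPlugin {

  @Override
  public QParser createParser(String qstr, SolrParams localParams,
    SolrParams params, SolrQueryRequest req) {

     return new QParser(qstr, localParams, params, req) {

       @Override
       public Query parse() throws SyntaxError {
         // All filtering happens inside the ModuloQuery
         return new ModuloQuery();
       }
     };
  }

  @Override
  public void init(NamedList args) {
    // Empty on purpose
  }
}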


2. ModuloQuery.java

In our ModuloQueryParserPlugin we return a ModuloQuery (where all the filtering will be done). Create a new class called ModuloQuery which extends ExtendedQueryBase and implements the PostFilter interface.

Now we have to follow the PostFilter API documentation, which states two things:

Firstly, we need to make sure that our filter is always executed after the query and the filter queries. For this purpose, let's override the getCost method.
@Override
public int getCost() {
  // We make sure that the cost is at least 100 to be a post filter
  return Math.max(super.getCost(), 100);
}
This way we can still change the cost of the function while making sure that it always remains a post filter.

Secondly, the getCache method should always return false.

@Override
public boolean getCache() {
  return false;
}
From the PostFilter interface we need to implement the getFilterCollector method, which
"returns a DelegatingCollector to be run after the main query and all of it's filters, but before any sorting or grouping collectors" (PostFilter API Doc).
The DelegatingCollector has one method that matters for our filtering purpose: collect.
@Override
public DelegatingCollector getFilterCollector(IndexSearcher idxS) {

  return new DelegatingCollector() {

    @Override
    public void collect(int docNumber) throws IOException {
      // Our filter magic -> call super.collect()
    }
  };
}
Before we implement our filter logic, let me explain how Solr handles the filtering. Solr chains all the filter functions sorted by their cost parameter (functions with higher cost will be run later).
For instance, if there are two post filters (cost >= 100) the one with the lower cost will delegate its result set to the one with the higher cost.

We tell Solr to keep a specific document in the result set by collecting it. This is where the collect method comes into play. For a document to stay in the result set, we need to call super.collect().
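
The chaining idea can be sketched without any Solr classes. The following is a minimal, self-contained illustration and not Solr's actual implementation: DocCollector, ModuloStage, ResultSink and CollectorChainSketch are hypothetical names made up for this sketch. Each stage forwards a document to its delegate only if the document passes its own check, which mirrors calling super.collect().

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for a collector; NOT a Solr or Lucene type. */
interface DocCollector {
  void collect(int docNumber);
}

/** Last link in the chain: remembers every document that reaches it. */
final class ResultSink implements DocCollector {
  final List<Integer> collected = new ArrayList<>();

  @Override
  public void collect(int docNumber) {
    collected.add(docNumber);
  }
}

/** Hypothetical filter stage: forwards only multiples of its modulo. */
final class ModuloStage implements DocCollector {
  private final int modulo;
  private final DocCollector delegate;

  ModuloStage(int modulo, DocCollector delegate) {
    this.modulo = modulo;
    this.delegate = delegate;
  }

  @Override
  public void collect(int docNumber) {
    if (docNumber % modulo == 0) {
      // Passing the document on keeps it in the result set.
      delegate.collect(docNumber);
    }
    // Not calling the delegate drops the document.
  }
}

final class CollectorChainSketch {
  /** Runs docs 0..maxDoc-1 through a cheaper stage, then a costlier one. */
  static List<Integer> run(int maxDoc, int firstModulo, int secondModulo) {
    ResultSink sink = new ResultSink();
    DocCollector second = new ModuloStage(secondModulo, sink);
    DocCollector first = new ModuloStage(firstModulo, second);
    for (int doc = 0; doc < maxDoc; doc++) {
      first.collect(doc);
    }
    return sink.collected;
  }

  public static void main(String[] args) {
    System.out.println(run(50, 2, 3));
  }
}
```

Only documents that survive every stage reach the sink, so the second stage never sees a document the first one has already dropped.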


Now let's write our awesome filter logic. Only documents with their id being a multiple of 42 will be in our result.
@Override
public void collect(int docNumber) throws IOException {
  // To be able to get documents, we need the reader
  AtomicReader reader = context.reader();

  // From the reader we get the current document by the docNumber
  Document currentDoc = reader.document(docNumber);
  
  // We get the id field from our document
  Number currentDocId = currentDoc.getField("id").numericValue();

  // Filter magic
  if (currentDocId.intValue() % 42 == 0) {
    super.collect(docNumber);
  }
}

3. Important information about hashCode and equals

This is a static filter, as its behavior cannot be changed, so we do not need to override hashCode and equals. In later steps we will make our post filter adjustable, and then those two methods must be overridden, because Solr uses them to tell queries apart, for example when looking up cached results.

Side Note:
If you expect different results but always get the result of the first query, take a look at those two methods.

4. Hooking up with Solr (small notes)

  • Build a jar file from our code
  • Put this jar in the lib folder
    • "...\solr-4.9.0\example\solr\collection1\lib"
  • Edit solrconfig.xml:
    • If you're running stock Solr, there might be a comment saying Query Parsers after which you can add this:
<queryParser
name="ModuloPostFilter"
class="com.yourpath.ModuloQueryParserPlugin"/>
    • ModuloPostFilter is the name under which Solr will execute our ModuloQueryParserPlugin. 
  • Restart Solr


5. Applying our PostFilter

For running the filter I used the Solr admin panel and a stock Solr installation with 2 million documents, each with only a unique id and no further information.

[Screenshot: No filter]

To make Solr use our PostFilter, just add {!ModuloPostFilter} to the fq field.


[Screenshot: Post filter used]

As we hoped, only documents with their id being a multiple of 42 are shown.



Extending our PostFilter to be adjustable

Now that we know how it's done, our ModuloPostFilter seems rather boring, since filtering is always done the same way (id modulo 42).


1. ModuloQueryParserPlugin.java

Therefore ModuloQueryParserPlugin needs some editing in order to have access to the parameters in our collect method.
Before diving into code, let me explain what the parameters params and localParams in the createParser method are about.

params:

Includes all the Solr request parameters (q, wt, fq, etc).

For instance params.getParams("fq") will return an array of strings with one string for every fq-field. So we could possibly bind our filter to any given parameter.


localParams:
Includes all the parameters for our function, with one speciality: two defaults are always set (type: the function's name; v: the part after the closing curly bracket).

For example fq={!ModuloPostFilter}... will result in type=ModuloPostFilter and v=....

Any key=value pair can be defined within the curly brackets {!ModuloPostFilter key1=value1 key2=value2}.

This is the option we are going to use for the sake of our extension {!ModuloPostFilter modulo=x}.
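
To make the shape of the local parameters more tangible, here is a deliberately simplified sketch of how such a string breaks down into type, key=value pairs and v. It is an illustration only and not Solr's parser: LocalParamsSketch and its parse method are hypothetical names, and the sketch ignores quoting and other syntax that the real parser handles.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class LocalParamsSketch {
  /** Splits "{!name k=v ...}rest" into type, key=value pairs and v. */
  static Map<String, String> parse(String fq) {
    Map<String, String> result = new LinkedHashMap<>();
    int close = fq.indexOf('}');
    if (!fq.startsWith("{!") || close < 0) {
      throw new IllegalArgumentException("expected {!name ...}: " + fq);
    }
    // Everything between "{!" and "}" is split on whitespace.
    String[] tokens = fq.substring(2, close).trim().split("\\s+");
    result.put("type", tokens[0]);
    for (int i = 1; i < tokens.length; i++) {
      int eq = tokens[i].indexOf('=');
      if (eq > 0) {
        result.put(tokens[i].substring(0, eq), tokens[i].substring(eq + 1));
      }
    }
    // The part behind the closing curly bracket becomes "v".
    result.put("v", fq.substring(close + 1));
    return result;
  }

  public static void main(String[] args) {
    System.out.println(parse("{!ModuloPostFilter modulo=7}"));
  }
}
```

Inside the plugin none of this parsing is needed, since Solr hands the finished localParams to createParser.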


Edit the createParser method to look like:

@Override
public QParser createParser(String qstr, SolrParams localParams,
  SolrParams params, SolrQueryRequest req) {
     
   return new QParser(qstr, localParams, params, req) {
       
     @Override
     public Query parse() throws SyntaxError {
       // The ModuloQuery knows the function parameters
       return new ModuloQuery(localParams);
     }
   };
}


2. ModuloQuery.java

We must add a constructor which checks for the desired key-value pair (modulo=x) in the localParams.
public class ModuloQuery extends ExtendedQueryBase
 implements PostFilter {
  private final int moduloX;

  public ModuloQuery(SolrParams localParams) {
    // We try to get the modulo pair
    // if there is none we will still be using 42
    moduloX = localParams.getInt("modulo", 42);
  }
  
  // previously added methods 
  // ...
}
The collect method must be adapted to use our new moduloX field.
@Override
public void collect(int docNumber) throws IOException {
  AtomicReader reader = context.reader();
  Document currentDoc = reader.document(docNumber);
  Number currentDocId = currentDoc.getField("id").numericValue();

  // new Filter magic
  if (currentDocId.intValue() % moduloX == 0) {
    super.collect(docNumber);
  }
}
Let your IDE implement hashCode and equals for ModuloQuery making use of the moduloX field.
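
Why the two methods matter can be shown with a plain HashMap standing in for a cache. This is a sketch under assumptions: ModuloKey is a hypothetical stand-in for ModuloQuery without any Solr dependencies, and the map is not Solr's real cache. In the actual ModuloQuery you would probably also want to take the superclass state into account.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical stand-in for ModuloQuery; it has no Solr dependencies. */
final class ModuloKey {
  private final int moduloX;

  ModuloKey(int moduloX) {
    this.moduloX = moduloX;
  }

  @Override
  public boolean equals(Object other) {
    if (this == other) {
      return true;
    }
    if (!(other instanceof ModuloKey)) {
      return false;
    }
    return moduloX == ((ModuloKey) other).moduloX;
  }

  @Override
  public int hashCode() {
    return Objects.hash(moduloX);
  }
}

final class EqualsHashCodeSketch {
  /** Counts the cache entries created by two filters. */
  static int distinctEntries(int firstModulo, int secondModulo) {
    Map<ModuloKey, String> cache = new HashMap<>();
    cache.put(new ModuloKey(firstModulo), "result of first query");
    cache.put(new ModuloKey(secondModulo), "result of second query");
    return cache.size();
  }

  public static void main(String[] args) {
    System.out.println(distinctEntries(42, 7));  // different filters
    System.out.println(distinctEntries(42, 42)); // same filter twice
  }
}
```

Filters with different modulo values end up as separate entries, while two filters with the same value share one. If equals ignored moduloX, both filters would map to the same entry, which matches the symptom from the side note above.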

3. Applying the adjustable PostFilter

We can now modify our filter call in the admin panel.

[Screenshot: Using a different number for filtering]

As you can see, we changed our filter behavior by setting the modulo pair.

Improving the performance

Iterating over 2 million documents turns out to be a bit of a performance issue, even though our documents have a very simple structure (in fact, only an id field).
As accessing the current document via reader.document(docNumber) is a relatively expensive call, the result set should be limited before running our post filter.


This illustration shows that getting the document from the reader can lead to slow queries. To improve the query time, we are going to take a look at DocValues.


2. What are DocValues and how can we use them?

"With a search engine you typically build an inverted index for a field: where values point to documents. DocValues is a way to build a forward index so that documents point to values." - Solr Wiki
Let's see how a forward and an inverted index differ.

Forward-Index:
{
  'doc1': {'field-A':3, 'field-B':2, 'field-C':3},
  'doc2': {'field-A':1, 'field-B':3, 'field-C':4},
  'doc3': {'field-A':2, 'field-B':3, 'field-C':2},
  'doc4': {'field-A':4, 'field-B':4, 'field-C':4}
}
Inverted-Index:
{
  'field-A': {1:['doc2'], 2:['doc3'], 3:['doc1'], 4:['doc4']},
  'field-B': {2:['doc1'], 3:['doc2','doc3'], 4:['doc4']},
  'field-C': {2:['doc3'], 3:['doc1'], 4:['doc2','doc4']}
}

Iterating over the forward index is faster in this context. It is not an all-in-one solution (but definitely worth mentioning); check the link for appropriate use cases.
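
The two directions from the quoted definition can also be sketched in a few lines of plain Java. This is only an illustration with ordinary maps, not how Solr or Lucene store either index. IndexDirectionSketch is a hypothetical name, and the data is the field-B column of the forward index above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class IndexDirectionSketch {
  /** Forward view of 'field-B': each document points to its value. */
  static Map<String, Integer> forwardIndex() {
    Map<String, Integer> forward = new LinkedHashMap<>();
    forward.put("doc1", 2);
    forward.put("doc2", 3);
    forward.put("doc3", 3);
    forward.put("doc4", 4);
    return forward;
  }

  /** Inverted view: each value points to the documents containing it. */
  static Map<Integer, List<String>> invert(Map<String, Integer> forward) {
    Map<Integer, List<String>> inverted = new TreeMap<>();
    for (Map.Entry<String, Integer> entry : forward.entrySet()) {
      inverted.computeIfAbsent(entry.getValue(), v -> new ArrayList<>())
              .add(entry.getKey());
    }
    return inverted;
  }

  public static void main(String[] args) {
    Map<String, Integer> forward = forwardIndex();
    // Forward lookup: which value does doc3 have?
    System.out.println(forward.get("doc3"));
    // Inverted lookup: which documents have the value 3?
    System.out.println(invert(forward).get(3));
  }
}
```

A post filter asks the first kind of question for every candidate document, which is why the forward direction is the one that helps here.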

To make a Solr field use this capability, just add docValues="true" to that field's definition (in schema.xml).

This is what our id field looks like now:

<field name="id" type="long" indexed="true" stored="true"
required="true" multiValued="false" docValues="true"/>
Usually a field should not be indexed, stored and use docValues at the same time, as Solr can't profit from its optimization algorithms. For this post let's ignore that advice.

Don't forget: before we can use DocValues in our code, Solr must re-index its data (Solr Wiki - HowToReindex).



3. ModuloQuery.java

We must change the collect method to make use of the new DocValues capability for the id-field.

@Override
public void collect(int docNumber) throws IOException {
  // SLOW: Document currentDoc = reader.document(docNumber);
  // FAST: gets the id field from the document with the docNumber
  long currentDocId = DocValues.getNumeric(context.reader(), "id")
                               .get(docNumber);

  if (currentDocId % moduloX == 0) {
    super.collect(docNumber);
  }
}


4. Testing the difference

Now let's apply the filter again and see the impact.

[Screenshot: Using DocValues for accessing the id]

Compared to the previous version, we can observe a decrease from ~3500ms to ~120ms.
The performance boost cannot be considered representative, as we were using a simplified schema and it heavily depends on your data structure. For our production system the impact was comparatively low.
It's just a hint at what might be worth looking at when tweaking a post filter's performance.


5. Final words on DocValues

This approach has some downsides and its usage depends on your overall data structure and environment. It is also possible to change how the forward index is stored by Solr via the "docValuesFormat" attribute.

The example shown in this post is very artificial and should not be taken too seriously, but rather as a first impression of DocValues. Also consider using FieldCache instead for accessing the document fields if you're unable to reindex.




4 comments:

  1. Excellent work. Simple language !! Simple words !! Deep concept explained.

  2. Awesome work. It is really hard to find doc like this. Thanks a lot!

  3. Hi, I have tried to write a custom postfilter class for solr 5.3 using this very helpful article. I have the following exception can you please point out where the problem might be.
    org.apache.solr.common.SolrException: Error Instantiating queryParser, Pathology.Parser.NLPFilterPlugin failed to instantiate org.apache.solr.search.QParserPlugin
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:820)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:659)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:727)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:447)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:438)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:210)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
    Caused by: org.apache.solr.common.SolrException: Error Instantiating queryParser, Pathology.Parser.NLPFilterPlugin failed to instantiate org.apache.solr.search.QParserPlugin
    at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:588)
    at org.apache.solr.core.PluginBag.createPlugin(PluginBag.java:122)
    at org.apache.solr.core.PluginBag.init(PluginBag.java:217)
    at org.apache.solr.core.PluginBag.init(PluginBag.java:206)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:764)
    ... 9 more
    Caused by: java.lang.ClassCastException: class Pathology.Parser.NLPFilterPlugin
    at java.lang.Class.asSubclass(Class.java:3208)
    at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:475)
    at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:422)
    at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:567)
    ... 13 more

  4. This comment has been removed by the author.
