Lukemapper: A Super Fast ORM for Lucene.net

Lucene is a document storage search engine library that utilizes inverted indexes and has great capabilities. It has been very popular and has been ported to almost every language, including .Net.

We use Lucene.Net for search here at Tech.Pro. Lucene has incredibly useful features, but it is built to work solely with strings, which can be a strong disadvantage if one is trying to store data of other types in a document. In addition, translating documents to strongly typed objects can be an annoying and redundant task.

We also use Dapper.Net here at Tech.Pro, and we love it's simple API and extremely satisfactory performance.

##Purpose

The concept I am trying to achieve is something similar in spirit to Dapper, except is meant to deal with mapping Lucene Documents to generic Objects, rather than Rows from a database.

Although Lucene is schema-less, in practice there is often an implicit schema in a Document which corresponds to a class or object in your code-base. Although you can easily use ORMs like Dapper or EntityFramework to map data from an RDBMS to CLR objects, doing so in Lucene is cumbersome and error-prone. Enter LukeMapper:

The desired API is something like the following:

Given some generic class in .Net like as follows:

class PocoClass
{
    public int Id;
    public string Name;

    public int PropId { get; set; }
    public string PropName { get; set; }
}

##Read Operations

If I wanted to run a query against an IndexSearcher in Lucene, and return the corresponding documents mapped to a List, I could do the following:

IndexSearcher searcher;
Query qry;
int numberToReturn = 10;

List results = searcher.Query(qry, numberToReturn);

Thus, the .Query<T>(Query,int) method is implemented as an extension method to an IndexSearcher, similar to how Dapper's .Query<T> method is implemented as an extension method to an IDBConnection object.

##Write Operations

Similarly, for Write operations, I would do the following:

IndexWriter writer;
IEnumerable objects;

// insert objects into index
writer.Write(objects)

And similarly, an update operation:

IndexWriter writer;
IEnumerable objects;
//method to find the corresponding document to update
Func identifyingQuery = o => new TermQuery(new Term("Id",o.Id.ToString()));

// update objects in index
writer.Update(objects, identifyingQuery);

Similar to Dapper and other Micro-ORMs out there, the implementation of the mapping will be done by generating a Deserializer/Serializer method via IL-Generation and caching it.

For the .Query() operation, the desired IL method generated should be semantically similar to the IL generated from the following method:

public static PocoClass ExampleDeserializerMethod(Document document)
{
    var poco = new PocoClass();

    poco.Id = Convert.ToInt32(document.Get("Id"));
    poco.Name = document.Get("Name");

    poco.PropId = Convert.ToInt32(document.Get("PropId"));
    poco.PropName = document.Get("PropName");

    return poco;
}

Similarly, for the .Write() and Update() methods, the Serializer methods will be semantically similar to the IL generated from the following method:

public static Document ExampleSerializerMethod(PocoClass obj)
{
    var doc = new Document();

    doc.Add(new Field("Id", obj.Id.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
    doc.Add(new Field("Name", obj.Name, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
    doc.Add(new Field("PropId", obj.PropId.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));
    doc.Add(new Field("PropName", obj.PropName, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS));

    return doc;
}

Although, some error handling may need to be inserted, among other things to make the method a bit more robust.

##Enhancing / Customizing with Attributes

Although basic functionality works essentially out of the box, with no attributes needed, further flexibility is garnered by the use of various Attributes.

[LukeMapper(IgnoreByDefault = true)]
public class ExampleClass
{
    // doesn't get indexed/stored
    [Luke(Store = Store.YES)]
    public int Id { get; set; }

    // doesn't get stored, but is indexed in "searchtext" field
    [Luke(Store = Store.NO, Index = Index.ANALYZED, FieldName = "searchtext")]
    public string Title { get; set; }

    // doesn't get stored, but is indexed in "searchtext" field
    [Luke(Store = Store.NO, Index = Index.ANALYZED, FieldName = "searchtext")]
    public string Body { get; set; }

    // doesn't get indexed/stored
    public int IgnoredProperty { get; set; }
}

[LukeMapper(DefaultIndex = Index.ANALYZED)]
public class ExampleClass
{
    // doesn't get indexed/stored
    [Luke(Index = Index.NOT_ANALYZED_NO_NORMS)]
    public int Id { get; set; }

    // get's analyzed, AND stored
    public string Title { get; set; }

    // get's analyzed, AND stored
    public string Body { get; set; }
}

public class ExampleClass
{
    // everything get's indexed and stored by default
    public int Id { get; set; }
    public string Title { get; set; }
    public string Body { get; set; }

    //opt-in ignored per property/field
    [Luke(Ignore=true)]
    public int Ignored { get; set; }
}

public class ExampleClass
{
    // everything get's indexed and stored by default
    public int Id { get; set; }
    public string Title { get; set; }
    public string Body { get; set; }

    //opt-in ignored per property/field
    public int Ignored { get; set; }
}

##Custom Serialization/Deserialization

You can override the serialization of certain properties, even more complex ones which are not supported, if it is needed for your application.

For instance, a common example might be that I have a list or array of something that I would like to serialize/deserialize into the document.

In this case, you can simply specify a static method to use for the serialization (and deserialization) using the LukeSerializerAttribute and LukeDeserializerAttribute.

public class TestCustomSerializerClass
{
    public int Id { get; set; }

    //this list would typically be ignored
    public List CustomList { get; set; }

    // if you specify a serializer, it will get serialized
    [LukeSerializer("CustomList")]
    public static string CustomListToString(List list)
    {
        return string.Join(",", list);
    }

    // and similarly, deserialized
    [LukeDeserializer("CustomList")]
    public static List StringToCustomList(string serialized)
    {
        return serialized.Split(',').ToList();
    }
}


public class TestCustomSerializerClass
{
    public int Id { get; set; }

    // maybe you just want to index the list for search, but don't need it on .Query()
    [Luke(Store = Store.NO,Index = Index.ANALYZED)]
    public List CustomList { get; set; }

    // in this case, only a serializer is needed
    [LukeSerializer("CustomList")]
    public static string CustomListToString(List list)
    {
        return string.Join(" ", list);
    }
}

As of now, the cacheing is done via a hashcode which should be unique to the declared fields in the IndexSearcher's index, and the object type which it is being mapped to.

##Data Types supported:

Textual:
- string
- char
Numeric:
- int
- int?
- long
- long?
Other:
- bool
- bool?
- DateTime
- DateTime?
In Progress (Not Yet Supported):
- char?
- byte
- byte?

##Performance

With an example class:

public class TestClass
{
    public int Id;

    public string PropString { get; set; }
}

The test was to instantiate 500 instances of TestClass, and compared inserting the

Operation	LukeMapper	Lucene.Net (native)
Insert 500 Documents (With no Serializer Cached)	89ms	19ms
Insert 500 Documents (Subsequent Calls)	3.31ms	4.52ms
Query 500 Documents (With no Deserializer Cached)	49ms	1ms
Query 500 Documents (Subsequent Calls)	1.26ms	1.05ms

This is a simple class, with only a string and an int, so I ran a second test with 2 more properties:

public class TestClass1
{
    public int Id;
    public string PropString { get; set; }
    public DateTime DateTime { get; set; }
    public int? NullId { get; set; }
}

Which had the following similar performance:

Operation	LukeMapper	Lucene.Net (native)
Insert 500 Documents (With no Serializer Cached)	103ms	23ms
Insert 500 Documents (Subsequent Calls)	5.84ms	6.05ms
Query 500 Documents (With no Deserializer Cached)	42ms	2ms
Query 500 Documents (Subsequent Calls)	1.84ms	1.15ms

What these benchmarks show is essentially what is to be expected. The first time .Write() is called on a class, it takes O(10^2) ms to generate the deserializer/serializer method. Once it is cached, the write and read operations are of the same order as the native calls (which they should be, since we are generating essentially the same CIL as when we are hand-coding it). What is a bit mysterious to me is why LukeMapper seems to consistently be writing to the index faster. This may be an issue with the benchmark. You can find the actual code used to find these numbers here

If anyone would like me to compare it to anything else, let me know. As far as I know there aren't really any other ORMs for lucene out there to compare against.

##Pseudo-Code: Under The Hood

The code of LukeMapper is essentially a single file which exposes several extension methods to IndexWriter and IndexSearcher.

The .Query() method might look like this:

public static IEnumerable Query(
    this IndexSearcher searcher,
    Query query,
    int n /*, Sort sort*/)
{
    // run actual search
    TopDocs td = searcher.Search(query, n);

    // if no results, nothing to do
    if (td.TotalHits == 0)
    {
        yield break;
    }

    //check to see if we have a deserializer
    var deserializer = Cache.Get(typeof(T),searcher);

    if(deserializer = null){
        // need to generate deserializer
        deserializer = GenerateDeserializer(typeof(T),searcher);
    }

    //perform mapping
    foreach(var document in td.ScoreDocs.Select(sd=>searcher.Doc(sd.doc)))
    {
        object next;
        next = deserializer(document);
        yield return (T)next;
    }
}

All of the magic is essentially in the GenerateDeserializer method. This is where reflection is used to determine what IL to generate and cache as a method. In some seriously simplified pseudo-code:

private static Func GenerateDeserializer(Type type, IndexSearcher searcher)
{
    var dm = new DynamicMethod(...);
    var il = dm.GetILGenerator();

    var properties = GetSettableProps(type);
    var fields = GetSettableFields(type);
    var attributes = GetLukeAttributes(type);

    foreach(var prop in properties){
        // figure out prop type, attributes, etc. and emit proper IL.
        il.Emit(...);
        il.Emit(...);
        il.Emit(...);
    }
    foreach(var field in fields){
        // figure out prop type, attributes, etc. and emit proper IL.
        il.Emit(...);
        il.Emit(...);
        il.Emit(...);
    }

    return (Func)dm.CreateDelegate(typeof(Func));
}

The methods for generating serializer methods are very similar.

If you are interested in the code, you can see it all on github

###Notes

In many ways this is not as practical as Dapper and is more of a specific application; Lucene is only meant to handle textual data and is schema-less, so mapping to objects of non-textual type with a specific schema is more error prone. The reality, though, is that most Lucene indexes are implemented with a relatively uniform schema.

###Current Status

I have started working on this project more and think it has promise and will likely use it in some projects of my own. If anyone is interested in helping out, I would certainly love the help. On the other hand, if anyone has any suggestions or feature requests, bring them on. Although this is not currently used in the Tech.Pro codebase, it will be soon.

Right now, I am focusing on the following:

Improve the error handling / feedback currently
Build in some support for NumericFields
Attribute to specify the "Identifier" of an object, and auto-generate the "identifyingQuery" needed for the Update() method.
Attribute to utilize term vectors usefully
Build in some automatic support for handling lists in typical fashion (ie csv, json-encoding, etc)
get char's and byte's working (seriously, why are they so difficult?)
I would like to get the project hosted on NuGet. Need to look into this as I have never done it.

Link to LukeMapper GitHub Repo

Any and all comments/feedback appreciated!

Leland Richardson