Lucene.Net Multiline Regular Expression Search

Question

We use Lucene.Net 3.0.3 with the WhitespaceAnalyzer. We index each file's content twice under the same field name ("Script"), once with the ANALYZED option and once with the NOT_ANALYZED option, as shown below:

    public static void WriteIndexes()
    {
        string indexPathRegex = ConfigurationManager.TfSettings.Application.CustomSettings["dbScritpsAddressRegex"];

        var analyzerRegex = new WhitespaceAnalyzer();
        var indexWriterRegex = new IndexWriter(indexPathRegex, analyzerRegex, IndexWriter.MaxFieldLength.UNLIMITED);

        foreach (LuceneIndex l in Indexes)
        {
            var doc = new Document();
            doc.Add(new Field("ServerName", l.ServerName.ToLowerInvariant(), Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.NO));
            doc.Add(new Field("DatabaseName", l.DatabaseName.ToLowerInvariant(), Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.NO));
            doc.Add(new Field("SchemaName", l.SchemaName.ToLowerInvariant(), Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.NO));
            doc.Add(new Field("ObjectType", l.ObjectType.ToLowerInvariant(), Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.NO));
            doc.Add(new Field("ObjectName", l.ObjectName.ToLowerInvariant(), Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.NO));
            doc.Add(new Field("Script", l.Script, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.NO));
            doc.Add(new Field("Script", l.Script, Field.Store.YES, Field.Index.NOT_ANALYZED, Field.TermVector.NO));

            indexWriterRegex.AddDocument(doc);
        }

        indexWriterRegex.Optimize();
        analyzerRegex.Close();
        indexWriterRegex.Close();
    }

Searching with a single-line regular expression works. Searching with a multiline regular expression also works as long as the indexed file is smaller than 16 KB, but when the file is larger than 16 KB, Lucene doesn't find the search keyword. Is this a bug? How can we fix this?

Sample keyword: .*taxId.*\n.*customerNo.*

    public List<item> SearchAllScriptInIndex()
    {
        string indexPathRegex = ConfigurationManager.TfSettings.Application.CustomSettings["dbScritpsAddressRegex"];
        var searcher = new Lucene.Net.Search.IndexSearcher(indexPathRegex, false);

        const int hitsLimit = 1000000;
        var analyzer = new WhitespaceAnalyzer();

        var parser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_29, new[] { "Script", "DatabaseName", "ObjectType", "ServerName" }, analyzer);

        Term t = new Term("Script", Expression);
        RegexQuery scriptQuery = new RegexQuery(t);

        string s = string.Format("({0}) AND {1}", serverAndDatabasescript, objectTypeScript);
        var query = parser.Parse(s);

        BooleanQuery booleanQuery = new BooleanQuery();
        booleanQuery.Add(query, BooleanClause.Occur.MUST);
        booleanQuery.Add(scriptQuery, BooleanClause.Occur.MUST);

        var hits = searcher.Search(booleanQuery, null, hitsLimit, Sort.RELEVANCE).ScoreDocs;

        List<item> results = new List<item>();
        List<string> values = new List<string>();
        Dictionary<int, string> newLineIndices = new Dictionary<int, string>();
        foreach (var hit in hits)
        {
            var hitDocument = searcher.Doc(hit.Doc);
            string contentValue = hitDocument.Get("Script");
            LuceneIndex item = new LuceneIndex();
            item.ServerName = hitDocument.Get("ServerName");
            item.DatabaseName = hitDocument.Get("DatabaseName");
            item.ObjectName = hitDocument.Get("ObjectName");
            item.ObjectType = hitDocument.Get("ObjectType");
            item.SchemaName = hitDocument.Get("SchemaName");
            item.Script = hitDocument.Get("Script");
            results.Add(item);
        }
        return results;
    }

sisve sisve · Accepted Answer · 2012-11-16T08:47:53

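The maximum supported term length is 16 383 characters according to the documentation for IndexWriter.AddDocument, and the field IndexWriter.MAX_TERM_LENGTH. Because your Script field is also indexed as NOT_ANALYZED, the whole script becomes a single term, so any script longer than that exceeds the limit. It seems that terms longer than this are simply ignored, causing the problem you describe.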

The documentation for AddDocument claims an exception is thrown, while the field just mentions that information is written to the infoStream [if one is set].

    /// <p/>Note that each term in the document can be no longer
    /// than 16383 characters, otherwise an
    /// IllegalArgumentException will be thrown.<p/>

    // [...]

    /// <summary> Absolute hard maximum length for a term.  If a term
    /// arrives from the analyzer longer than this length, it
    /// is skipped and a message is printed to infoStream, if
    /// set (see <see cref="SetInfoStream" />).
    /// </summary>
    public static readonly int MAX_TERM_LENGTH;

Source: IndexWriter.cs
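
To see which scripts are affected, one option is to compare each script's length against that limit before adding the NOT_ANALYZED field. The following is only a minimal sketch and does not call Lucene.Net at all: the class name TermLengthCheck, its members, and the hard-coded constant are hypothetical, and the constant simply assumes the 16383-character figure quoted above. In real code you would probably compare against IndexWriter.MAX_TERM_LENGTH instead.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Lucene.Net.
public static class TermLengthCheck
{
    // Assumption: mirrors the 16383-character limit quoted from the
    // IndexWriter.AddDocument documentation above.
    public const int AssumedMaxTermLength = 16383;

    // Returns true when the value is short enough to be indexed as one
    // NOT_ANALYZED term under the assumed limit.
    public static bool FitsInSingleTerm(string value)
    {
        if (value == null) throw new ArgumentNullException("value");
        return value.Length <= AssumedMaxTermLength;
    }

    // Returns the names of all scripts whose text exceeds the assumed limit.
    public static List<string> FindOversized(IDictionary<string, string> scriptsByName)
    {
        if (scriptsByName == null) throw new ArgumentNullException("scriptsByName");
        var oversized = new List<string>();
        foreach (var pair in scriptsByName)
        {
            if (!FitsInSingleTerm(pair.Value))
            {
                oversized.Add(pair.Key);
            }
        }
        return oversized;
    }
}
```

Calling FitsInSingleTerm(l.Script) inside the indexing loop, and logging the object name when it returns false, would at least tell you which documents a multiline regex against the NOT_ANALYZED field can never match.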
