We are Indexing a resume document by using elastic search java API. It works fine. When we are searching a keyword it's return the accurate response(Document) which has that keyword.
But we want to index document in deep. For example A resume has 'Skills' and their 'Skills Month'. Skills month may be 13 months in document. SO i search for that skill and set skill months between 10 to 15 months in elastic search query, then we want that record(Document).
How can we do this?
Here is the code for Indexing:-
IndexResponse response = client
.prepareIndex(userName, document.getType(),
document.getId())
.setSource(extractDocument(document)).execute()
.actionGet();
public XContentBuilder extractDocument(Document document) throws IOException, NoSuchAlgorithmException {
// Extracting content with Tika
int indexedChars = 100000;
Metadata metadata = new Metadata();
String parsedContent;
try {
// Set the maximum length of strings returned by the parseToString method, -1 sets no limit
parsedContent = tika().parseToString(new BytesStreamInput(
Base64.decode(document.getContent().getBytes()), false), metadata, indexedChars);
} catch (Throwable e) {
logger.debug("Failed to extract [" + indexedChars + "] characters of text for [" + document.getName() + "]", e);
System.out.println("Failed to extract [" + indexedChars + "] characters of text for [" + document.getName() + "]" +e);
parsedContent = "";
}
XContentBuilder source = jsonBuilder().startObject();
if (logger.isTraceEnabled()) {
source.prettyPrint();
}
// File
source
.startObject(FsRiverUtil.Doc.FILE)
.field(FsRiverUtil.Doc.File.FILENAME, document.getName())
.field(FsRiverUtil.Doc.File.LAST_MODIFIED, new Date())
.field(FsRiverUtil.Doc.File.INDEXING_DATE, new Date())
.field(FsRiverUtil.Doc.File.CONTENT_TYPE, document.getContentType() != null ? document.getContentType() : metadata.get(Metadata.CONTENT_TYPE))
.field(FsRiverUtil.Doc.File.URL, "file://" + (new File(".", document.getName())).toString());
if (metadata.get(Metadata.CONTENT_LENGTH) != null) {
// We try to get CONTENT_LENGTH from Tika first
source.field(FsRiverUtil.Doc.File.FILESIZE, metadata.get(Metadata.CONTENT_LENGTH));
} else {
// Otherwise, we use our byte[] length
source.field(FsRiverUtil.Doc.File.FILESIZE, Base64.decode(document.getContent().getBytes()).length);
}
source.endObject(); // File
// Path
source
.startObject(FsRiverUtil.Doc.PATH)
.field(FsRiverUtil.Doc.Path.ENCODED, SignTool.sign("."))
.field(FsRiverUtil.Doc.Path.ROOT, ".")
.field(FsRiverUtil.Doc.Path.VIRTUAL, ".")
.field(FsRiverUtil.Doc.Path.REAL, (new File(".", document.getName())).toString())
.endObject(); // Path
// Meta
source
.startObject(FsRiverUtil.Doc.META)
.field(FsRiverUtil.Doc.Meta.AUTHOR, metadata.get(Metadata.AUTHOR))
.field(FsRiverUtil.Doc.Meta.TITLE, metadata.get(Metadata.TITLE) != null ? metadata.get(Metadata.TITLE) : document.getName())
.field(FsRiverUtil.Doc.Meta.DATE, metadata.get(Metadata.DATE))
.array(FsRiverUtil.Doc.Meta.KEYWORDS, Strings.commaDelimitedListToStringArray(metadata.get(Metadata.KEYWORDS)))
.endObject(); // Meta
// Doc content
source.field(FsRiverUtil.Doc.CONTENT, parsedContent);
// Doc as binary attachment
source.field(FsRiverUtil.Doc.ATTACHMENT, document.getContent());
// End of our document
source.endObject();
return source;
}
Below code is used for getting response:
QueryBuilder qb;
if (query == null || query.trim().length() <= 0) {
qb = QueryBuilders.matchAllQuery();
} else {
qb = QueryBuilders.queryString(query);//query is a name or string
}
org.elasticsearch.action.search.SearchResponse searchHits = node.client()
.prepareSearch()
.setIndices("ankur")
.setQuery(qb)
.setFrom(0).setSize(1000)
.addHighlightedField("file.filename")
.addHighlightedField("content")
.addHighlightedField("meta.title")
.setHighlighterPreTags("<span class='badge badge-info'>")
.setHighlighterPostTags("</span>")
.addFields("*", "_source")
.execute().actionGet();