How to retain HTML coding while indexing HTML documents to Apache Solr?

Question

I am indexing HTML documents into Solr via the SimplePostTool on the command line:

post -c core0 /mnt/Vancouver/programming/datasci/solr/test/d*.html

Despite various edits to solrconfig.xml and schema.xml (solr.HTMLStripCharFilterFactory etc.), Solr will not retain HTML content (URLs) present in the HTML source documents.

A new <a href="https://news.ucr.edu/articles/2020/11/06/chemicals-your-living-room-cause-diabetes">UC Riverside study</a> shows flame retardants ...

appears in Solr as

"p":[" A new https://news.ucr.edu/articles/2020/11/06/chemicals-your-living-room-cause-diabetes UC Riverside study shows ...

It appears that Apache Tika is stripping the HTML markup from the content within HTML elements before it is passed to Solr.

https://lucene.apache.org/solr/guide/8_7/uploading-data-with-solr-cell-using-apache-tika.html#key-solr-cell-concepts

Rendered web page (note, e.g., A new https://news.ucr.edu/articles/2020/11/06/chemicals-your-living-room-cause-diabetes UC Riverside study ... in first document)

Victoria Stuart · Accepted Answer · 2020-12-21T23:27:34

Update: here is a workaround.

url_process.sh

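#!/bin/bash

# Work in the directory that holds the HTML source documents.
cd /mnt/Vancouver/programming/datasci/solr/test/url_test/ || exit 1

for FILE in *.html
do
  # Mask the "<" of every <a href ...> and </a> tag so Tika leaves the links alone.
  sed 's/<a href/LEFTANGLEBRACKETa href/g ; s%</a>%LEFTANGLEBRACKET/a>%g' "$FILE" > tmp
  post -c core0 tmp
done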

solrconfig.xml

  <updateRequestProcessorChain
    processor="uuid,remove-blank,field-name-mutating,
    parse-boolean,parse-long,parse-double,parse-date">

    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.DistributedUpdateProcessorFactory"/>

    <processor class="solr.RegexReplaceProcessorFactory">
      <str name="fieldName">p</str>
      <str name="pattern">LEFTANGLEBRACKET</str>
      <str name="replacement">&lt;</str>
      <bool name="literalReplacement">true</bool>
    </processor>

    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

Explanation

I work around Apache Tika's preprocessing in two steps:

1. A Bash script preprocesses the HTML source documents, replacing the < in each <a href="...">...</a> tag with an alphabetic placeholder string (LEFTANGLEBRACKET). This hides those links from Tika.
2. Upon indexing, a RegexReplaceProcessorFactory processor in solrconfig.xml swaps the placeholder back to <, regenerating the links.
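
The round trip described above can be tried outside Solr on a single line of text. The following is only a sketch: mask_links and restore_links are hypothetical helper names, and the second sed call merely stands in for what the RegexReplaceProcessorFactory does at index time; it is not how Solr itself is invoked.

```shell
#!/bin/bash
# Sketch of the placeholder round trip. mask_links and restore_links are
# hypothetical helper names, not part of Solr or Tika.

# Step 1 (before posting): hide the "<" of anchor tags behind a placeholder.
mask_links() {
  sed 's/<a href/LEFTANGLEBRACKETa href/g ; s%</a>%LEFTANGLEBRACKET/a>%g'
}

# Step 2 (stand-in for the Solr update processor): turn the placeholder back into "<".
restore_links() {
  sed 's/LEFTANGLEBRACKET/</g'
}

original='A new <a href="https://news.ucr.edu/articles/2020/11/06/chemicals-your-living-room-cause-diabetes">UC Riverside study</a> shows flame retardants ...'

masked=$(printf '%s\n' "$original" | mask_links)
restored=$(printf '%s\n' "$masked" | restore_links)

printf 'masked:   %s\n' "$masked"
printf 'restored: %s\n' "$restored"
```

If the restored line is identical to the original, the placeholder does not collide with anything in the text. A document that already contains the string LEFTANGLEBRACKET would break this assumption, so a different placeholder would be needed in that case.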

Result

Solr:

"p":[" A new <a href=\"https://news.ucr.edu/articles/2020/11/06/chemicals-your-living-room-cause-diabetes\">UC Riverside study</a> shows flame retardants ..."],

A working hyperlink now appears in the Ajax-rendered web page.
