I am indexing HTML documents into Solr with the SimplePostTool on the command line:
post -c core0 /mnt/Vancouver/programming/datasci/solr/test/d*.html
Despite various edits to solrconfig.xml and schema.xml (solr.HTMLStripCharFilterFactory, etc.), Solr does not retain the HTML markup (URLs) present in the HTML source documents. For example,
A new <a href="https://news.ucr.edu/articles/2020/11/06/chemicals-your-living-room-cause-diabetes">UC Riverside study</a> shows flame retardants ...
appears in Solr as
"p":[" A new https://news.ucr.edu/articles/2020/11/06/chemicals-your-living-room-cause-diabetes UC Riverside study shows ...
It appears that Apache Tika is stripping the HTML markup from the content of HTML
elements before it is passed to Solr.
Rendered web page (note, e.g., "A new https://news.ucr.edu/articles/2020/11/06/chemicals-your-living-room-cause-diabetes UC Riverside study ..." in the first document)