My web app is an e-mail service. It stores email messages in a MySQL database, and the email attachments are stored on disk.
The database is similar to:
-------------------------------------------------------------------------
| id | sender | receiver | subject | body  | attach_dir  | attachments  |
-------------------------------------------------------------------------
| 2  | 444    | 555      | Apples  | Hey!  | /mnt/emails | att1.doc\r\n |
|    |        |          |         |       |             | att2.doc\r\n |
-------------------------------------------------------------------------
| 3  | 77     | 22       | Pears   | Hola! | /mnt/emails | att1.zip\r\n |
-------------------------------------------------------------------------
I index it with the following data-config.xml:
<dataConfig>
  <!-- JDBC source for the email rows; "&" must be written as "&amp;" inside an XML attribute -->
  <dataSource name="mysql"
              type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/email?useUnicode=true&amp;characterEncoding=UTF-8&amp;useTimezone=true&amp;serverTimezone=UTC"
              user="user"
              password="pass"/>
  <!-- Binary file source used by Tika to read the attachments from disk -->
  <dataSource name="files"
              type="BinFileDataSource"/>
  <document>
    <!-- One Solr document per email row -->
    <entity name="email" dataSource="mysql"
            query="SELECT id, subject, body, date, attach, attach_dir FROM email"
            transformer="RegexTransformer">
      <field column="id" name="id"/>
      <field column="subject" name="subject"/>
      <field column="body" name="content"/>
      <field column="date" name="last_modified"/>
      <field column="attach" name="attach" splitBy="\r\n"/>
      <field column="attach_dir" name="attach_dir"/>
      <!-- List every file in the attachment directory of this email -->
      <entity name="attach_glob" dataSource="null"
              processor="FileListEntityProcessor"
              baseDir="/mnt/attach/${email.attach_dir}" fileName=".*"
              recursive="false" onError="skip">
        <!-- Extract text from each listed file with Tika -->
        <entity name="email_attachment" dataSource="files"
                processor="TikaEntityProcessor"
                url="${attach_glob.fileAbsolutePath}">
          <field column="text" name="attach_content"/>
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>
This works well for all files except compressed archives such as .zip. For .zip files, the attach_content field is filled only with the names of the files inside the archive, not with the content extracted from those files.
However, if I use SimplePostTool like this:
/opt/solr/bin/post -c mycollection /mnt/attach/message3/att1.zip
then the content of all the files inside the zip archive is extracted, which is what I need. However, I need this content to be part of the documents added by the Data Import Handler with the data-config.xml above.
Is this possible?
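One direction that may be worth checking, offered as an unverified sketch rather than a confirmed fix: the TikaEntityProcessor documentation lists an extractEmbedded attribute which, assuming it is available in the Solr version in use, asks Tika to also parse documents embedded in a container such as a zip archive. The inner entity would then look roughly like this; everything except the added attribute is copied from the data-config.xml above.

```xml
<!-- Sketch only: extractEmbedded is assumed to be supported by the
     TikaEntityProcessor in the Solr version in use; not verified here. -->
<entity name="email_attachment" dataSource="files"
        processor="TikaEntityProcessor"
        url="${attach_glob.fileAbsolutePath}"
        extractEmbedded="true">
  <field column="text" name="attach_content"/>
</entity>
```

If the attribute is honored, the text of the archived files should end up in the same attach_content field, so the schema would not need to change.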