I'm using Tika-server to parse a batch of .eml files. Extracting both the content and the metadata of the emails and their attachments works fine with the /rmeta endpoint.
The problem is with the attachment file name. When the attachment part in the raw .eml file has the following structure:
Content-Type: application/pdf; name="filename_a.pdf"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="filename_a.pdf"
everything works fine: the extracted file path in the metadata object (in the API response) is:
"X-TIKA:embedded_resource_path": "/filename_a.pdf"
However, some of my emails have a malformed header structure (the filename parameter is missing from Content-Disposition), e.g.:
Content-Type: application/pdf; name="filename_a.pdf"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
Then, after parsing the whole .eml, I get:
"X-TIKA:embedded_resource_path": "/embedded-1"
I checked in Tika's source code that this name is determined in \org\apache\tika\parser\RecursiveParserWrapper.class, here:
    private String getResourceName(Metadata metadata, RecursiveParserWrapper.ParserState state) {
        String objectName = "";
        if (metadata.get("resourceName") != null) {
            objectName = metadata.get("resourceName");
        } else if (metadata.get("embeddedRelationshipId") != null) {
            objectName = metadata.get("embeddedRelationshipId");
        } else {
            objectName = "embedded-" + ++state.unknownCount;
        }
        objectName = FilenameUtils.getName(objectName);
        return objectName;
    }
I tried to get at the name parameter by inspecting the Content-Type key in the metadata object, but it's not there. (I assume Tika does not derive the Content-Type value solely from the raw header, which is why the name parameter I need is absent.)
So my question, since I haven't been able to figure it out: is there a way to modify Tika's source code to force file name extraction from the Content-Type header when the filename parameter in the Content-Disposition header is missing?
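To make the fallback I have in mind concrete, here is a minimal standalone sketch in plain Java. It is not Tika code: the class AttachmentNameFallback and its methods are hypothetical names of my own, and it assumes the raw Content-Disposition and Content-Type header values of the attachment part are available as strings. It only handles plain quoted or unquoted parameter values, not RFC 2231 (name*=) or RFC 2047 encoded names.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: picks an attachment name from raw MIME header values.
final class AttachmentNameFallback {

    // Returns the value of a header parameter such as name="a.pdf" or name=a.pdf, if present.
    static Optional<String> param(String headerValue, String param) {
        if (headerValue == null) {
            return Optional.empty();
        }
        // The parameter must start the value or follow a ';', so "name" does not match inside "filename".
        Pattern p = Pattern.compile(
                "(?:^|;)\\s*" + param + "\\s*=\\s*(?:\"([^\"]*)\"|([^;\\s]+))",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(headerValue);
        if (!m.find()) {
            return Optional.empty();
        }
        String value = m.group(1) != null ? m.group(1) : m.group(2);
        return value.isEmpty() ? Optional.empty() : Optional.of(value);
    }

    // Prefers Content-Disposition's filename; falls back to Content-Type's name.
    static Optional<String> resolveName(String contentDisposition, String contentType) {
        Optional<String> fromDisposition = param(contentDisposition, "filename");
        return fromDisposition.isPresent() ? fromDisposition : param(contentType, "name");
    }

    public static void main(String[] args) {
        // The malformed case from my emails: no filename in Content-Disposition.
        Optional<String> name = resolveName(
                "attachment;",
                "application/pdf; name=\"filename_a.pdf\"");
        System.out.println(name.orElse("(no name found)"));
    }
}
```

If neither header carries a name, resolveName returns an empty Optional, so the existing "embedded-N" naming could still be used as the last resort.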