0
votes

I have a task: "to get a duplicates (documents with same property value) from Alfresco database with count duplicates amounts". In MySql there will be something like that:

mysql> SELECT COUNT(*) AS repetitions, last_name, first_name
-> FROM cat_mailing
-> GROUP BY last_name, first_name
-> HAVING repetitions > 1

But I have read that "The CMIS query language in Alfresco does not support GROUP BY or HAVING." . Is there any query (in any supported language) to perform described task? Thank you!

UPD: for now I am counting in JVM this way (redefining hashCode/equals for Form20Row)

Map<Form20Row, Form20Row> rowsMap =  results.stream().parallel().map(doc -> {
            Form20Row row = new Form20Row();
            String propMark = propertyHelper.getNodeProp(doc, NDBaseDocumentModel.PROP_MARK);
            row.setGroupName(systemMessages.getString("form20.nss.name"));
            row.setDocMark(propMark);
            row.setDupesNumber(1);
            return row;
        }).collect(Collectors.toConcurrentMap(form20Row -> form20Row, form20Row -> form20Row,
                (existing, replacement) ->  {
                    int count = existing.getDupesNumber();
                    existing.setDupesNumber(++count);
                    return existing;
                }));
2
The question is unclear, duplicates of what? - Lista
of some property - Catherine Ivanova

2 Answers

2
votes

Alfresco uses SOLR for search on nodes but SOLR is very limited on joins, aggregate functions, counting ... What you may do is querying the SOLR index using facets like facet.field=field1&facet.mincount=1.

Personally I would prefer to query the alfresco db directly to find nodes having the same property values for specific properties. This will not depend on the solr index and gives you the full flexibility of SQL.

1
votes

I agree with Heiko.

This won't be trivial either, since Alfresco keeps "everything" in "alf_node_properties" table, and in case of strings in "string_value" column. So you can't know "which" property this is without joining with "alf_qname" table, and for larger databases this can take a longish time.

You probably want to filter out deleted and node versions, and not compare them at all.