I am trying to harvest some datasets into CKAN 2.8, and the harvester fails with an email validation error. The datasets come from different sources, so the email fields may be malformed, contain multiple addresses, or hold a URL instead of an email. CKAN 2.6, which I used previously, harvested these datasets without complaint. Here is the error message I get:
ERROR [ckanext.harvest.harvesters.base] {'maintainer_email': ['Email [email protected]; [email protected] is not a valid format']}
Traceback (most recent call last):
  File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/harvesters/base.py", line 369, in _create_or_update_package
    else 'package_create_rest')(context, package_dict)
  File "/usr/lib/ckan/default/src/ckan/ckan/logic/__init__.py", line 464, in wrapped
    result = _action(context, data_dict, **kw)
  File "/usr/lib/ckan/default/src/ckan/ckan/logic/action/create.py", line 177, in package_create
    raise ValidationError(errors)
ValidationError: {'maintainer_email': ['Email [email protected]; [email protected] is not a valid format']}
Digging into the source, I found that the CKAN harvester uses the default schema from schema.py:
schema = default_create_package_schema()
In 2.8 this schema includes email_validator for the field:
'maintainer_email': [ignore_missing, unicode_safe, email_validator]
In 2.6 the same field had no email_validator:
'maintainer_email': [ignore_missing, unicode]
My first thought for skipping this validation is to remove email_validator from default_create_package_schema() in schema.py.
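Rather than editing schema.py in place, the same effect could in principle be had by patching a copy of the schema. The following is only a sketch: the three validators and default_create_package_schema() are stand-ins defined here so that it runs without CKAN, and without_validator is a hypothetical helper, not part of CKAN. Whether the harvester will accept a patched schema (for example through the context passed to package_create) is an assumption I have not verified.

```python
# Stand-in validators; the real ones live in CKAN and have different bodies.
def ignore_missing(value):
    return value


def unicode_safe(value):
    return value


def email_validator(value):
    return value


def default_create_package_schema():
    # Stand-in for CKAN's function, reduced to the one field from the error.
    return {
        "maintainer_email": [ignore_missing, unicode_safe, email_validator],
    }


def without_validator(schema, field, validator_name):
    """Return a shallow copy of schema with one validator dropped from field.

    Hypothetical helper: validators are matched by function name, so the
    original schema dict and its lists are left untouched.
    """
    patched = dict(schema)
    patched[field] = [
        v for v in schema.get(field, [])
        if getattr(v, "__name__", "") != validator_name
    ]
    return patched


original = default_create_package_schema()
schema = without_validator(original, "maintainer_email", "email_validator")
remaining = [v.__name__ for v in schema["maintainer_email"]]
```

After this runs, remaining holds only ignore_missing and unicode_safe, which matches the 2.6 behaviour for that field.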
While it makes sense to validate emails, I think the validation should be configurable, because in some cases (e.g. multiple maintainers, as in the error above) the strict check needs to be skipped.
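An alternative that leaves validation switched on is to clean the email fields before the package dict reaches package_create. Below is a minimal sketch in plain Python. first_valid_email and sanitize_emails are hypothetical helpers, the regular expression is deliberately loose and is not the one CKAN uses, and including author_email in the default field list is my assumption. I believe the CKAN harvester in ckanext-harvest has a modify_package_dict hook that a subclass could use to call something like this, but that should be checked against the installed version.

```python
import re

# Loose "looks like an email" pattern: one @, no whitespace or separators,
# and at least one dot in the domain part. Not CKAN's own validator.
EMAIL_RE = re.compile(r"^[^@\s;,]+@[^@\s;,]+\.[^@\s;,]+$")


def first_valid_email(value):
    """Return the first email-like token in value, or None if there is none."""
    if not value:
        return None
    # Split on semicolons, commas and whitespace to handle multiple addresses.
    for token in re.split(r"[;,\s]+", value.strip()):
        if EMAIL_RE.match(token):
            return token
    return None


def sanitize_emails(package_dict, fields=("maintainer_email", "author_email")):
    """Return a copy of package_dict whose email fields hold one address.

    Fields without any email-like token (e.g. a URL) are dropped entirely,
    on the assumption that a missing value is acceptable to the schema.
    """
    cleaned = dict(package_dict)
    for field in fields:
        if field not in cleaned:
            continue
        email = first_valid_email(cleaned[field])
        if email is None:
            del cleaned[field]
        else:
            cleaned[field] = email
    return cleaned


harvested = {
    "name": "example-dataset",
    "maintainer_email": "first@example.org; second@example.org",
    "author_email": "http://example.org/contact",
}
cleaned = sanitize_emails(harvested)
```

Here cleaned keeps only the first maintainer address and drops author_email, since a URL contains no email-like token. The obvious cost is that the additional maintainers are lost unless they are copied somewhere else, such as an extras field.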
Has anyone run into this issue, or found a way to harvest these datasets despite the invalid emails?