How do we get the document file url using the Watson Discovery Service?

1

votes

I don't see a solution to this using the available api documentation.

It is also not available on the web console.

Is it possible to get the file url using the Watson Discovery Service?

Can you clarify precisely what you mean by the "file" url? The query API has a GET method that probably gets what you want: watson-api-explorer.mybluemix.net/apis/discovery-v1#!/Queries/… - Colin Dean

get a document from the collection, the "text" field returned in the response along with the url of the document containing the text - johnrao07

The query GET response has a "results" array containing objects that have a "text" attribute containing the original text or "html" attribute containing the converter output. Do you mean the original URL of the document? - Colin Dean

yeah the original url of the document uploaded, I learned which is not possible from Anish, so got another way around - johnrao07

3

votes

If you need to store the original source/file URL, you can include it as a field within your documents in the Discovery service, then you will be able to query that field back out when needed.

1

votes

I also struggled with this request but ultimately got it working using Python bindings into Watson Discovery. The online documentation and API reference is very poor; here's what I used to get it working:

(Assume you have a Watson Discovery service and have a created collection):

# Programmatic upload and retrieval of documents and metadata with Watson Discovery

from watson_developer_cloud import DiscoveryV1
import os
import json

discovery = DiscoveryV1(
    version='2017-11-07',
    iam_apikey='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
    url='https://gateway-syd.watsonplatform.net/discovery/api'
)

environments = discovery.list_environments().get_result()
print(json.dumps(environments, indent=2))

This gives you your environment ID. Now append to your code:

collections = discovery.list_collections('{environment-id}').get_result()
print(json.dumps(collections, indent=2))

This will show you the collection ID for uploading documents into programmatically. You should have a document to upload (in my case, an MS Word document), and its accompanying URL from your own source document system. I'll use a trivial fictitious example.

NOTE: the documentation DOES NOT tell you to append , 'rb' to the end of the open statement, but it is required when uploading a Word document, as in my example below. Raw text / HTML documents can be uploaded without the 'rb' parameter.

url = {"source_url":"http://mysite/dis030.docx"}
with open(os.path.join(os.getcwd(), '{path to your document folder with trailing / }', 'dis030.docx'), 'rb') as fileinfo:
    add_doc = discovery.add_document('{environment-id}', '{collections-id}', metadata=json.dumps(url), file=fileinfo).get_result()
    print(json.dumps(add_doc, indent=2))
    print(add_doc["document_id"])

Note the setting up of the metadata as a JSON dictionary, and then encoding it using json.dumps within the parameters. So far I've only wanted to store the original source URL but you could extend this with other parameters as your own use case requires.

This call to Discovery gives you the document ID.

You can now query the collection and extract the metadata using something like a Discovery query:

my_query = discovery.query('{environment-id}', '{collection-id}', natural_language_query="chlorine safety")
print(json.dumps(my_query.result["results"][0]["metadata"], indent=2))

Note - I'm extracting just the stored metadata here from within the overall returned results - if you instead just had: print(my_query) you'll get the full response from Discovery ... but ... there's a lot to go through to identify just your own custom metadata.

How do we get the document file url using the Watson Discovery Service?

2 Answers