I am using HDInsight on Azure to research the scalability of ranking machine learning methods (learning to rank, for the insiders) on Hadoop. I managed to test run my implementation of a learning to rank algorithm on an HDInsight cluster and measured how long it took to complete.

Now I want to run the same code repeatedly with different numbers of cores to see how the running time scales as a function of the number of cores. From other questions on this forum I understood that HDInsight does not allow changing the number of cores of an existing cluster. Would it instead be possible to delete the current cluster and then create a new cluster that uses the exact same container on my Azure Storage? I tried to do this by simply giving the new cluster the same name as the previous one (the container that is created for a new cluster is automatically named after the cluster at creation time), but that doesn't work: the container created for the new cluster gets "-1" appended to the cluster name. The data file that I am trying to process is around 15 GB in size, so it would be a real pain if I had to upload it to the cluster container for each cluster that I create.

Any help on how I can run my algorithms on HDInsight with varying numbers of cores without having to re-upload my input data for each point of measurement would be very much appreciated!

Kind Regards,

Niek Tax

1 Answer

You should be able to link your existing storage container to an HDInsight cluster, according to http://azure.microsoft.com/en-us/documentation/articles/hdinsight-use-blob-storage/#benefits

When you create the cluster with the Custom Create option, you can choose one of the following for the default storage account:

  • Use existing storage
  • Create new storage
  • Use storage from another subscription

You also have the option to create your own Blob container or use an existing one.

The link shows how you can do that through the Windows Azure Portal.
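
Once a cluster is linked to the storage account that holds your data, jobs can point at the existing file by its full Blob storage URI, so nothing has to be uploaded again. Here is a minimal sketch of building such a URI in Python. The helper `wasb_uri` and the container, account and file names are hypothetical placeholders, not values from your setup; the `wasb://<container>@<account>.blob.core.windows.net/<path>` layout is the addressing scheme described in the linked article.

```python
def wasb_uri(container: str, account: str, path: str = "", secure: bool = False) -> str:
    """Build a wasb:// (or wasbs://) URI for a blob in an Azure Storage container."""
    # wasbs:// is the TLS-encrypted variant of the same scheme.
    scheme = "wasbs" if secure else "wasb"
    # Drop any leading slash so the URI does not end up with a double slash.
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


# Hypothetical names: replace with your own container, storage account and file.
input_uri = wasb_uri("ltr-data", "mystorageaccount", "input/train.txt")
print(input_uri)
```

You would then pass the resulting URI as the input path of your Hadoop job on each new cluster, assuming that cluster was created with the same storage account linked to it.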