0
votes

I have a bunch of machines with GPUs (my friends and I own them), mostly used for gaming and machine learning. Instead of using each machine independently, I thought it would be better to use them as a distributed system (e.g. for distributed training of machine learning models) to reduce training times. I have no experience in developing applications (let alone cloud apps), but I thought it would be fun to create a client-server application where:

  • on the front-end side, clients (e.g. my friends and I) can access the system and see which machines are available for work. A machine should be available only if its GPU is idle; if someone is working or playing on it, it should not be available. The client can select several available machines and then launch a virtual machine (containing the code to be run and all the necessary data) on them.

  • on the back-end side, the selected servers receive the virtual machine and execute the code inside it in a distributed way (e.g. TensorFlow supports distributed training).
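To make the "idle GPU means available" rule concrete, here is a minimal sketch of how each machine could report its own status. It assumes NVIDIA GPUs with the `nvidia-smi` tool installed; the thresholds and the function names `parse_gpu_status`, `is_idle`, and `query_local_gpus` are hypothetical, chosen only for illustration.

```python
import subprocess

# Hypothetical thresholds: a GPU counts as idle below these values.
MAX_UTIL_PERCENT = 5
MAX_MEM_USED_MIB = 500


def parse_gpu_status(csv_text):
    """Parse nvidia-smi CSV output into (utilization %, memory used MiB) pairs."""
    gpus = []
    for line in csv_text.strip().splitlines():
        util, mem = (field.strip() for field in line.split(","))
        gpus.append((int(util), int(mem)))
    return gpus


def is_idle(gpus):
    """A machine is available only if it has GPUs and every one is under both thresholds."""
    return bool(gpus) and all(
        util <= MAX_UTIL_PERCENT and mem <= MAX_MEM_USED_MIB for util, mem in gpus
    )


def query_local_gpus():
    """Ask nvidia-smi for one 'utilization, memory' line per GPU (requires NVIDIA drivers)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=utilization.gpu,memory.used",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_status(result.stdout)
```

A small agent on each machine could call `is_idle(query_local_gpus())` every few seconds and report the result to whatever server the front end talks to. A single reading can be misleading (a game in a loading screen looks idle), so averaging over several samples is probably safer.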

In my opinion, virtual machines are necessary for privacy and safety reasons. I am basically letting my friends into my system, and they are letting me into theirs, so I want to prevent clients from messing with a server. All of the machines run Ubuntu, except one that runs Windows, so I would implement this on Ubuntu first.

Having said that, I have no idea where to start implementing all this. Besides choosing a language (I lean towards Java or Python, but I would consider any other option), what are the main steps I should take? I know it's probably a common client-server application, but as I said, I have no experience in app development. Thanks


1 Answer

1
votes

If I understand correctly, you want to set up a distributed computing system for machine learning and access it through a browser/server (B/S) model, right? If so, you could check out the TensorFlow guide on distributed training, which is designed to solve this problem. The link is below.

https://github.com/tensorflow/examples/blob/master/community/en/docs/deploy/distributed.md
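As a rough illustration of what the back end would have to hand to each selected machine: current TensorFlow versions read the cluster layout from a `TF_CONFIG` environment variable containing JSON with a `cluster` entry (the list of workers) and a `task` entry (this machine's role and index). The sketch below only builds those JSON strings; the helper name `build_tf_configs`, the addresses, and the port are made up for the example.

```python
import json


def build_tf_configs(worker_addresses):
    """Return one TF_CONFIG JSON string per worker, in the same order as the addresses."""
    cluster = {"worker": list(worker_addresses)}
    return [
        json.dumps({"cluster": cluster, "task": {"type": "worker", "index": index}})
        for index in range(len(worker_addresses))
    ]


# Hypothetical addresses of the machines the client selected.
configs = build_tf_configs(["192.168.1.10:12345", "192.168.1.11:12345"])
for config in configs:
    print(config)
```

Each machine would then set `os.environ["TF_CONFIG"]` to its own string before the training script creates a `tf.distribute.MultiWorkerMirroredStrategy()`. Every worker gets the same `cluster` section and differs only in its `task` index.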