Server Scenario:

Ubuntu 12.04 LTS
Torque w/ Maui Scheduler
Hadoop

I am building a small cluster (10 nodes). Users will be able to ssh into any child node (LDAP auth), but this is really unnecessary, since every computation job they want to run can be submitted from the head node using Torque, Hadoop, or another resource manager tied to a scheduler to ensure priority and proper resource allocation across the nodes. Some users will have priority over others.

Problem:

You can't force a user to use a batch system like torque. If they want to hog all the resources on one node or the head node they can just run their script / code directly from their terminal / ssh session.

Solution:

My main users or "superusers" want me to set up a remote login timeout, which is what their current cluster uses to eliminate this problem. (I do not have access to that cluster, so I cannot grab the configuration.) I want to set up a 30-minute timeout on all remote sessions that are inactive (no keystrokes). If the session has running processes, I also want the session to be killed along with all of its processes. This should stop people from bypassing the available batch system / scheduler.

Question:

How can I implement something like this? Thanks for all the help!

Did you try the Torque PAM module? It prevents users from ssh-ing into a node unless they have resources allocated on that node. Also, this serverfault question might be of use. - Dmitri Chubarov
Thanks for the help. Although I may consult with my users and end up implementing this, I'm more interested in setting up a remote login / job process timeout because the login node (head node) is also a compute node itself, and the most powerful one. - Joe deNecola
@JoedeNecola I think any solution is going to be very difficult and imprecise if you continue with the model of having the login node be a compute node. - dbeer

1 Answer

I've mostly seen sysadmins solve this by not allowing ssh access to the nodes (often done using the PAM module in TORQUE), but there are other techniques. One is to use pbstools: its reaver script can be set up to kill user processes that aren't part of jobs (or shouldn't be on those nodes). I believe it can also be configured to simply notify you. Some admins forcibly kill things, others educate users; that part is up to you.
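To make the "kill processes that aren't part of jobs" idea concrete, here is a rough Python sketch of that kind of check. It is not the pbstools reaver script and does not reproduce its options; the function names, the UID threshold of 1000, and the `uids_with_jobs` input are all assumptions made for illustration. How you obtain the set of users with jobs on a node (for example, from your resource manager's job listing) is deliberately left out.

```python
import os
import signal

MIN_USER_UID = 1000  # assumption: regular (non-system) accounts start here


def list_processes(proc_root="/proc"):
    """Yield (pid, uid) for each process directory under /proc (Linux only).

    The owner of /proc/<pid> is used as the process owner, which is an
    approximation that is good enough for a sketch.
    """
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            uid = os.stat(os.path.join(proc_root, entry)).st_uid
        except OSError:
            continue  # the process exited while we were scanning
        yield int(entry), uid


def find_stray_pids(processes, uids_with_jobs,
                    min_uid=MIN_USER_UID, exempt_uids=frozenset()):
    """Return sorted PIDs owned by regular users with no job on this node.

    processes      -- iterable of (pid, uid) pairs
    uids_with_jobs -- UIDs that currently have a job on this node
                      (hypothetical input; obtain it from your scheduler)
    exempt_uids    -- UIDs never to touch (admins, 'nobody', and so on)
    """
    return sorted(
        pid for pid, uid in processes
        if uid >= min_uid
        and uid not in uids_with_jobs
        and uid not in exempt_uids
    )


def reap(pids, dry_run=True):
    """Report the given PIDs, or send them SIGTERM when dry_run is False."""
    for pid in pids:
        if dry_run:
            print("would terminate", pid)
            continue
        try:
            os.kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            pass  # already gone
```

Running it with `dry_run=True` from cron gives you the "simply notify" behaviour; switch to killing only once users have been warned. Note that on a node that doubles as the login node, this would also terminate the login shells of users without jobs, so you would need to exempt those or treat that node differently.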

Once you get people using jobs instead of ssh'ing directly, you may want to look into the cpuset feature in TORQUE as well. It can help you as you try to get users to use the amount of resources they request. Best of luck.

EDIT: noted that the pam module is one of the most common ways to restrict ssh access to the compute nodes.