Engineering Cluster Template
From Engineering Grid Wiki
In general, our clusters have four main components:
- Head Node
- Cluster Control and Applications
- Storage
- Cluster Nodes
In many cases, the Head Node, Cluster Control, and Storage are combined on a single machine.
Head Node
The "Head Node" is the main public access point of the cluster: the machine users log into to submit jobs, work with data, and so on. It is often the same hardware as the cluster nodes; in some cases (for example, when it also acts as the storage server) it is a distinct machine.
Often this machine will run any public-facing webservers that serve out pages referencing data on the cluster storage.
As it is usually the only public-facing portion of the cluster, it often provides access to the Internet and other networking functions.
Cluster Control and Applications
All of the cluster management and research applications on EIT-managed clusters are installed on an NFS share (either from the storage server or head node) mounted as /cluster/clustername, where clustername is the generally accepted name for that cluster. This is a common mountpoint (usually managed by automount) on every cluster member.
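The shared-application mountpoint described above can be expressed as an autofs map. The following is an illustrative sketch only; the server name, map file name, export path, and mount options are assumptions, not the actual configuration of any particular cluster:

```
# /etc/auto.master: hand the /cluster directory to the auto.cluster map
/cluster    /etc/auto.cluster

# /etc/auto.cluster: mount the application share from the storage/head node
clustername    -rw,hard,intr    headnode:/export/cluster/clustername
```

With this in place, every cluster member sees the same /cluster/clustername tree without any per-node fstab entries.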
All of the EIT-managed computing clusters use Grid Engine as the cluster management mechanism. This is installed under /cluster/clustername/sge, using the default SGE installation method of having all local spool and log directories on the NFS mount. The SGE Cell name is usually the clustername.
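Given that layout, a user's shell is typically pointed at the cell by sourcing the SGE settings file from the NFS mount. A sketch, assuming the paths and cell name described above:

```
# Illustrative: point the environment at the cluster's SGE cell
# (paths follow the /cluster/clustername/sge layout described above)
export SGE_ROOT=/cluster/clustername/sge
export SGE_CELL=clustername
. $SGE_ROOT/$SGE_CELL/common/settings.sh
```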
Parallel operations can take place over a variety of interconnects: Ethernet, Myrinet, InfiniBand, and others. Most Engineering clusters communicate over Gigabit Ethernet.
Applications vary by cluster. Specific details on any special application will be found on each cluster's page.
Most parallel operations on EIT-managed clusters are enabled through MPICH2. We are moving to a modification of the "daemon-based smpd" MPICH2 integration method. With MPICH2 1.0.8 or later, an additional environment flag causes smpd to forgo the ".smpd" file in the user's home directory, which can otherwise cause job failures.
A base parallel job submission script:
 #!/bin/sh
 #
 #$ -q all.q
 #$ -N testpsmp
 #$ -pe mpich2_psmp 8
 #$ -cwd
 #$ -v SMPD_OPTION_NO_DYNAMIC_HOSTS=1

 # This line defines the MPI communications port, do not change.
 smpdport=$(($JOB_ID % 5000 + 20000))

 /cluster/clustername/mpi/bin/mpiexec -n $NSLOTS -machinefile $TMPDIR/machines \
     -port $smpdport -phrase $smpdport \
     /cluster/clustername/mpi/share/examples_logging/cpilog

 exit 0
The SMPD_OPTION_NO_DYNAMIC_HOSTS flag, passed through qsub, instructs smpd not to use the dynamic hosts file; the SMPD passphrase and port are instead specified on the mpiexec command line.
Storage
Cluster storage on Engineering School clusters is generally provided by direct-attached storage of varying types attached to the head node.
This storage is shared out to the nodes via NFS, usually managed by an automount pointing to /research-projects/clustername (often linked, for historical reasons, to /project/clustername).
Depending on needs, the head node shares this out via NFS and CIFS.
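A minimal sketch of the corresponding NFS export on the head node; the network range and export options are assumptions, and CIFS access would be configured separately through Samba:

```
# /etc/exports (illustrative): share project storage read-write
# to the private cluster network only
/research-projects/clustername  10.0.0.0/24(rw,sync,no_root_squash)
```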
Cluster Nodes
Cluster nodes are set up to be identical, usually with a scripted Kickstart installation or another cloning method (Advanced Clustering's Cloner utility, Partimage, or any number of other applications).
Node addresses are managed with static DHCP reservations, and storage access via NFS+automount. Only default distribution software is installed locally on each node; unless there's a very special need, all cluster applications are installed under the /cluster/clustername mount.
With hard drive sizes being what they are, there is usually a large partition set aside for temporary data (it is quicker to do temp storage locally than over NFS). This is usually accessed via a /cluster-tmp link to the actual storage partition. It is important that this is a separate partition from the other system partitions. In general, it is the user's responsibility to move any data that should be kept to main cluster storage at the end of the job.
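The local-scratch convention above can be sketched in a job script as follows. The fallback paths and the copy-back destination are assumptions for illustration; on a real node the scratch directory would live under the /cluster-tmp link:

```shell
#!/bin/sh
# Use node-local scratch for temporary data, then clean up.
# Under SGE $JOB_ID is set; fall back to $$ so the sketch runs standalone.
JOB=${JOB_ID:-$$}
# /tmp is a stand-in so the sketch is self-contained; on a cluster node
# this would be the /cluster-tmp link to the local scratch partition.
SCRATCH=${CLUSTER_TMP:-/tmp}/scratch-$JOB
mkdir -p "$SCRATCH"
cd "$SCRATCH"
# ... the real computation writes its temporary files here ...
echo "intermediate data" > temp.dat
# Copy anything worth keeping back to main cluster storage, e.g.:
#   cp results.dat /research-projects/clustername/$USER/
cd /
rm -rf "$SCRATCH"
```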
Networking
IP address space in the private cluster network is up to the administrators; any of the private address spaces (aside from 172.16.0.0/12, which we use internally at Engineering) will do. Most of the Engineering clusters use networks in the 10.0.0.0/8 range, which may or may not be proper practice.
It's important to note that internal services such as DNS, DHCP, or TFTP should only listen on the internal cluster interfaces so as not to interfere with the rest of the Engineering network.
Usually done on the head node, network address translation (NAT) allows cluster nodes to access the Internet for updates and installation, and lets users download external data as part of a job. This is done through iptables; the cluster nodes use the head node as their default gateway.
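A minimal iptables sketch of that setup, assuming eth0 is the public interface and eth1 faces the private cluster network (the interface names are assumptions, and this is a configuration fragment, not a complete firewall):

```
# Enable forwarding and masquerade cluster traffic out the public interface
echo 1 > /proc/sys/net/ipv4/ip_forward
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
iptables -A FORWARD -i eth1 -o eth0 -j ACCEPT
iptables -A FORWARD -i eth0 -o eth1 -m state --state ESTABLISHED,RELATED -j ACCEPT
```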
DNS is important for all facets of cluster communication, be it cluster applications or storage. Using DNS eliminates the requirement to manage /etc/hosts or the like locally on the nodes.
The default 'caching-nameserver' RPM provided with RHEL/CentOS is a good place to start, as it provides a caching nameserver for the cluster (reducing external traffic). Add to that definitions for the internal cluster nodes and resources.
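Those internal definitions might look like the following illustrative zone file added alongside the caching-nameserver configuration; the domain name and addresses are assumptions:

```
; cluster.internal zone (illustrative)
$TTL 86400
@       IN SOA  head.cluster.internal. root.cluster.internal. (
                2024010101  ; serial
                3600 7200 604800 86400 )
        IN NS   head.cluster.internal.
head    IN A    10.0.0.1
node01  IN A    10.0.0.11
node02  IN A    10.0.0.12
```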
The head node can also run DHCP, providing the cluster nodes with their network settings. This aids in installation and management of the nodes: when configured correctly, each node will gain an IP address based on its MAC address, which allows you to push out a default image to all nodes without worrying about setting the hostname or IP address.
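A sketch of the corresponding dhcpd.conf, with one static reservation per node; the addresses and MAC are illustrative assumptions:

```
subnet 10.0.0.0 netmask 255.255.255.0 {
    option routers 10.0.0.1;              # head node as default gateway
    option domain-name-servers 10.0.0.1;
    next-server 10.0.0.1;                 # TFTP/PXE server for installs
}
host node01 {
    hardware ethernet 00:16:3e:00:00:01;  # node's MAC address
    fixed-address 10.0.0.11;
    option host-name "node01";
}
```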
TFTP is used to PXE boot nodes, most often for installation, but some clusters (not here at Engineering) will boot their nodes off of NFS as well.
Notes:
1. A discussion of how SGE can reduce NFS usage can be found at http://gridengine.sunsource.net/howto/nfsreduce.html
2. CIFS access is provided using Samba (www.samba.org), authenticating to our AD domains.