  
  
==== Differences between CPUs and tasks ====
As a new user writing your first SLURM job script, the difference between ''--ntasks'' and ''--cpus-per-task'' is typically quite confusing. Assuming you want to run your program on a single node with 16 cores, which SLURM parameters should you specify?\\
The answer depends on whether your application supports MPI. MPI (Message Passing Interface) is a communication interface used for developing parallel computing programs on distributed-memory systems. It is what allows an application running on multiple computers (nodes) to share (intermediate) results.\\
To decide which set of parameters you should use, check whether your application uses MPI and would therefore benefit from running on multiple nodes simultaneously. If your application is not MPI-enabled (or your setup is not configured for it), it does not make sense to request more than one node.\\
__IMPORTANT__: Currently, psychp01 has only a single node, so when running your analyses you only need to choose carefully how many CPUs your analysis (task) requires, using ''--cpus-per-task''.
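
To make the distinction concrete, here is a rough sketch of the two cases for a 16-core request (illustrative values only, not a recommendation for psychp01): a non-MPI, multi-threaded application is parallelized over the CPUs of a single task, whereas an MPI application is parallelized over tasks.

  # Non-MPI (e.g., multi-threaded) application: one task, many CPUs, one node
  #SBATCH --nodes=1
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=16
  
  # MPI application: many tasks (MPI ranks), one CPU each, possibly on several nodes
  #SBATCH --ntasks=16
  #SBATCH --cpus-per-task=1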
  
===== 2. Job Blueprint =====
You can save the following example to a file (e.g., ''run.sh'') on psychp01. Comment out the two ''cp'' commands that are only there for illustration purposes (lines 46 and 55) and change the ''SBATCH'' directives where applicable. You can then run the script by typing:

  sbatch run.sh

Please note that all values that you define with ''SBATCH'' directives are hard limits. If you, for example, ask for 6000 MB of memory (''--mem=6000MB'') and your job uses more than that, the job will be automatically killed by the resource manager.
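
If a job disappears from the queue and you suspect it hit its memory or time limit, SLURM's accounting tool ''sacct'' can usually tell you afterwards (assuming job accounting is enabled on psychp01; the job ID below is just a placeholder):

  sacct -j 1234 --format=JobID,JobName,State,Elapsed,ReqMem,MaxRSS

The full blueprint script is shown below.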
  
  
  #!/bin/bash -l
  
  ##############################
  #       Job blueprint        #
  ##############################
  
  # Give your job a name, so you can recognize it in the queue overview
  #SBATCH --job-name=example
  
  # Define how many nodes you need. For now, psychp01 has only 1 node with 100 CPUs.
  #SBATCH --nodes=1
  
  # By default, SLURM will allocate 1 CPU per task. You can define the number of CPUs you need for your job as follows:
  #SBATCH --cpus-per-task=20
  
  # You can further define the number of tasks with --ntasks-per-*
  # See "man sbatch" for details. E.g., --ntasks=4 will ask for 4 tasks
  # (each with the number of CPUs set by --cpus-per-task).
  
  # Define how long the job will run in real time. This is a hard cap, meaning
  # that if the job runs longer than what is written here, it will be
  # force-stopped by the server. If you make the expected time too long, it will
  # take longer for the job to start. Here, we say the job will take 5 minutes.
  #              d-hh:mm:ss
  #SBATCH --time=0-00:05:00
  
  # Define the partition on which the job shall run. May be omitted.
  #SBATCH --partition=test
  
  # How much memory you need.
  # --mem will define memory per node and
  # --mem-per-cpu will define memory per CPU/core. Choose one of them.
  #SBATCH --mem-per-cpu=1500MB
  ##SBATCH --mem=5GB    # this one is not in effect, due to the double hash
  
  # Turn on mail notification. There are many possible self-explaining values:
  # NONE, BEGIN, END, FAIL, ALL (includes all of the aforementioned)
  # For more values, check "man sbatch"
  #SBATCH --mail-type=END,FAIL
  
  # You may not place any commands before the last SBATCH directive
  
  # Define and create a unique scratch directory for this job
  SCRATCH_DIRECTORY=/home/${USER}/${SLURM_JOBID}
  mkdir -p ${SCRATCH_DIRECTORY}
  cd ${SCRATCH_DIRECTORY}
  
  # You can copy everything you need to the scratch directory
  # ${SLURM_SUBMIT_DIR} points to the path where this script was submitted from
  cp ${SLURM_SUBMIT_DIR}/myfiles*.txt ${SCRATCH_DIRECTORY}
  
  # This is where the actual work is done. In this case, the script only waits.
  # The time command is optional, but it may give you a hint on how long the
  # command took.
  time sleep 10
  #sleep 10
  
  # After the job is done we copy our output back to $SLURM_SUBMIT_DIR
  cp ${SCRATCH_DIRECTORY}/my_output ${SLURM_SUBMIT_DIR}
  
  # In addition to the copied files, you will also find a file called
  # slurm-1234.out in the submit directory. This file will contain all output that
  # was produced during runtime, i.e. stdout and stderr.
  
  # After everything is saved to the home directory, delete the work directory to
  # save space on /home
  cd ${SLURM_SUBMIT_DIR}
  rm -rf ${SCRATCH_DIRECTORY}
  
  # Finish the script
  exit 0
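
After submitting the blueprint with ''sbatch run.sh'', you can monitor or cancel it from the command line (the job ID is a placeholder):

  squeue -u ${USER}        # list your pending and running jobs
  scontrol show job 1234   # show detailed information about one job
  scancel 1234             # cancel a job you no longer need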

==== Running many sequential jobs in parallel using job arrays ====
In this example we wish to run many similar sequential jobs in parallel using job arrays. We take Python as an example, but the language does not matter for job arrays:

  #!/usr/bin/env python
  
  import time
  
  print('start at ' + time.strftime('%H:%M:%S'))
  
  print('sleep for 10 seconds ...')
  time.sleep(10)
  
  print('stop at ' + time.strftime('%H:%M:%S'))

Save this to a file called ''test.py'' and try it out:

  $ python test.py
  
  start at 15:23:48
  sleep for 10 seconds ...
  stop at 15:23:58

Good. Now we would like to run this script 16 times at the same time. For this we use the following script:

  #!/bin/bash -l
  
  #####################
  # job-array example #
  #####################
  
  #SBATCH --job-name=example
  
  # 16 jobs will run in this array at the same time
  #SBATCH --array=1-16
  
  # run for five minutes
  #              d-hh:mm:ss
  #SBATCH --time=0-00:05:00
  
  # 500MB memory per core
  # this is a hard limit
  #SBATCH --mem-per-cpu=500MB
  
  # you may not place bash commands before the last SBATCH directive
  
  # define and create a unique scratch directory
  SCRATCH_DIRECTORY=/ptmp/${USER}/job-array-example/${SLURM_JOBID}
  mkdir -p ${SCRATCH_DIRECTORY}
  cd ${SCRATCH_DIRECTORY}
  
  cp ${SLURM_SUBMIT_DIR}/test.py ${SCRATCH_DIRECTORY}
  
  # each job will see a different ${SLURM_ARRAY_TASK_ID}
  echo "now processing task id: " ${SLURM_ARRAY_TASK_ID}
  python test.py > output_${SLURM_ARRAY_TASK_ID}.txt
  
  # after the job is done we copy our output back to $SLURM_SUBMIT_DIR
  cp output_${SLURM_ARRAY_TASK_ID}.txt ${SLURM_SUBMIT_DIR}
  
  # we step out of the scratch directory and remove it
  cd ${SLURM_SUBMIT_DIR}
  rm -rf ${SCRATCH_DIRECTORY}
  
  # happy end
  exit 0

Submit the script and after a short while you should see 16 output files in your submit directory.
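
The same mechanism can be used to give each array task a different input, since ''${SLURM_ARRAY_TASK_ID}'' is just a number. A minimal sketch, assuming a hypothetical ''subjects.txt'' (one subject ID per line) in the submit directory and a script that accepts the subject as a command-line argument:

  # pick the n-th line of subjects.txt, where n is the array task id
  SUBJECT=$(sed -n "${SLURM_ARRAY_TASK_ID}p" ${SLURM_SUBMIT_DIR}/subjects.txt)
  python test.py ${SUBJECT} > output_${SUBJECT}.txt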


===== 3. R example =====
Running R scripts on psychp01 is very easy. Save the following R script in your home directory (e.g., ''/home/gbellucci'') as ''r_script.R'':

  # Create a 2000x2000 matrix with random values
  A1 <- matrix(runif(2000*2000), nrow = 2000, ncol = 2000)
  
  # Perform 1000 fft operations on it
  start_time <- as.numeric(Sys.time())
  for (i in 1:1000) {
      A1 <- fft(A1)
  }
  time1 <- as.numeric(Sys.time()) - start_time
  
  # write result time to file:
  cat(paste('CPU time:', round(time1*1000,2), 'ms'), file = 'results.txt')

Save the following job script in your home directory as ''run_r_script.sh'':

  #!/bin/bash -l
  # My job script to run R code
  
  #SBATCH -o ./job_output.%A_%a
  #SBATCH -e ./job_errors.%A_%a
  #SBATCH -D ./
  #SBATCH -J run_r_test
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=10
  #SBATCH --mem=6000
  #SBATCH --time=00:05:00
  
  # Use the command Rscript and set the path to the R script to run it. Replace /home/gbellucci with the location of your script on psychp01.
  Rscript /home/gbellucci/r_script.R

Now on the command line, change your current directory to the home directory where you saved your ''run_r_script.sh'' and use the ''sbatch'' command to run the R script:

  sbatch run_r_script.sh

When the job is finished you can display the output:

  vim results.txt
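
If you want to run the same R analysis for several subjects or parameter values, you can combine this job script with the job-array pattern from section 2. A minimal sketch of the lines you would add or change, assuming your R script reads its argument with ''commandArgs(trailingOnly = TRUE)'':

  #SBATCH --array=1-10
  
  # the array task id is passed to the R script as a command-line argument
  Rscript /home/gbellucci/r_script.R ${SLURM_ARRAY_TASK_ID}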


===== 4. Python example =====
Below is an example of a Python script to be used on the cluster. Save the following Python script in your home directory (e.g., ''/home/gbellucci'') as ''python_script.py'':

  import numpy as np
  from time import time
  
  # Create a 2000x2000 matrix with random values
  A1 = np.random.rand(2000, 2000).astype('float32')
  
  # Perform 1000 fft operations on it
  start_time = time()
  for i in range(1000):
      A1 = np.fft.fft(A1)
  end_time = time()
  time1 = end_time - start_time
  
  # write result time to file:
  with open('results.txt', 'w') as fh:
      fh.write(f'CPU time: {time1*1000:.2f} ms\n')

The job script below will run your Python script. Save it in your home directory as ''run_python_script.sh'':

  #!/bin/bash -l
  # My job script to run Python code
  
  #SBATCH -o ./job_output.%A_%a
  #SBATCH -e ./job_errors.%A_%a
  #SBATCH -D ./
  #SBATCH -J run_test
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=10
  #SBATCH --mem=6000
  #SBATCH --time=00:05:00
  
  # Load below any modules or environments providing the Python packages required by your script
  
  # Use the python command and set the path to your script to run it. Replace /home/gbellucci with the location of your script on psychp01.
  python /home/gbellucci/python_script.py

Now on the command line, change your current directory to the home directory where you saved your ''run_python_script.sh'' and use the ''sbatch'' command to run the Python script:

  sbatch run_python_script.sh

When the job is finished you can display the output:

  vim results.txt

On page 37 of the PDF version of the manual on the [[cluster-guide|main page]], there is a short video that demonstrates what was just described.
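
One practical note: NumPy's numerical routines may start their own threads through the underlying BLAS library, independently of SLURM. To keep the actual CPU usage in line with the allocation, a common (optional) convention is to pin the thread count to the allocated CPUs in the job script, before the ''python'' call:

  # use exactly the CPUs that SLURM allocated to this task
  export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
  export OPENBLAS_NUM_THREADS=${SLURM_CPUS_PER_TASK}
  export MKL_NUM_THREADS=${SLURM_CPUS_PER_TASK}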


===== 5. MATLAB example =====
==== Simple example ====
Running MATLAB scripts on psychp01 is pretty straightforward. Save the following MATLAB script in your home directory (e.g., ''/home/gbellucci'') as ''matlab_script.m'':

  % Create a 2000x2000 matrix with random values
  A1 = rand(2000,2000,'single');
  
  % Perform 1000 fft operations on it
  tic;
  for i=1:1000
      A1 = fft(A1);
  end
  time1 = toc;
  
  % write result time to file:
  fh = fopen('results.txt','w+');
  fprintf(fh,'CPU time: %.2f ms\n',time1*1000);
  fclose(fh);

Save the following job script in your home directory as ''run_matlab_script.sh'':

  #!/bin/bash -l
  # This is a comment for this job script to run the above MATLAB script
  
  #SBATCH -o ./job_output.%A_%a # this is just a label to name the output filename of your job
  #SBATCH -e ./job_errors.%A_%a # this is just a label to name the error filename of your job
  #SBATCH -D ./
  #SBATCH -J run_test_for_matlab_script # this is just a label to name your job. When you run "squeue" on the command line, that's the name you'll see
  # --- resource specification (which resources for how long) ---
  #SBATCH --partition=test
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=10
  #SBATCH --mem=6000        # memory in MB required by the job
  #SBATCH --time=00:05:00   # run time in h:m:s, up to 24h possible
  
  # Use the srun command to run the MATLAB script. Replace /home/gbellucci with the location of your script on psychp01.
  # (-batch expects a MATLAB statement, so the script is invoked via run())
  srun matlab -nodisplay -batch "run('/home/gbellucci/matlab_script.m')"

Now on the command line, change your current directory to the home directory where you saved your ''run_matlab_script.sh'' and use the ''sbatch'' command to run the MATLAB script:

  sbatch run_matlab_script.sh

When the job is finished you can display the output:

  vim results.txt


==== Example with Parallel Computing ====
For **parallel computing**, we would need to modify our MATLAB script ''matlab_script.m'' to run processes in parallel as follows:

  % Open a parallel pool with 10 workers
  parpool(10);
  
  % Create a 2000x2000 matrix with random values
  A1 = rand(2000,2000,'single');
  
  % Perform 1000 independent fft operations on it in parallel
  % (inside parfor each iteration must be independent, so the result is not fed back into A1)
  tic;
  parfor i=1:1000
      A2 = fft(A1);
  end
  time1 = toc;
  
  % write result time to file:
  fh = fopen('results.txt','w+');
  fprintf(fh,'CPU time: %.2f ms\n',time1*1000);
  fclose(fh);

No further changes are required to your job script ''run_matlab_script.sh'' in your home directory. Just make sure that you are not requesting more CPUs in your MATLAB script than you allow in the job script. That is, in MATLAB, ''parpool'' should request at most the number of CPUs that you set with ''#SBATCH --cpus-per-task=10''. Otherwise, your job will crash, as MATLAB will not find enough available CPUs when running.\\
Now run your job script as before using the ''sbatch'' command:

  sbatch run_matlab_script.sh
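
To verify afterwards that the parallel run actually used the requested cores, you can compare the accumulated CPU time with the elapsed wall-clock time once the job has finished (again assuming job accounting is enabled on psychp01; the job ID is a placeholder):

  sacct -j 1234 --format=JobID,AllocCPUS,TotalCPU,Elapsed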


==== Example with GPU Computing ====
For now, **GPU computing** is not supported on psychp01. However, in general, to run MATLAB processes on a GPU partition, we would need to add another block to our MATLAB script. Save it as ''matlab_gpu.m'':

  % Create a 2000x2000 matrix with random values
  A1 = rand(2000,2000,'single');
  
  % Perform 1000 fft operations on it
  tic;
  for i=1:1000
      A1 = fft(A1);
  end
  time1 = toc;
  
  % copy the matrix onto the GPU
  A2 = gpuArray(A1);
  % perform the same 1000 operations
  tic;
  for i=1:1000
      A2 = fft(A2);
  end
  time2 = toc;
  
  % write result time to file:
  fh = fopen('results.txt','w+');
  fprintf(fh,'CPU time: %.2f ms GPU time: %.2f ms Speedup factor: %.2f\n',time1*1000,time2*1000,time1/time2);
  fclose(fh);

We also have to modify the job script a bit. We will select the GPU partition. Save the new job script as ''run_matlab_gpu_script.sh'':

  #!/bin/bash -l
  # This is a comment for this job script to run the above MATLAB script on the GPU partition
  
  #SBATCH -o ./job_output.%A_%a
  #SBATCH -e ./job_errors.%A_%a
  #SBATCH -D ./
  #SBATCH -J mat_gpu
  # --- resource specification (which resources for how long) ---
  #SBATCH --partition=gpu
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=10
  #SBATCH --mem=6000        # memory in MB required by the job
  #SBATCH --time=00:05:00   # run time in h:m:s, up to 24h possible
  
  # --- start from a clean state and load necessary environment modules ---
  
  # run MATLAB (-batch expects a MATLAB statement, so the script is invoked via run())
  srun matlab -nodisplay -batch "run('/home/gbellucci/matlab_gpu.m')"

And now run it like before:

  sbatch run_matlab_gpu_script.sh
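
Note that on most SLURM clusters a GPU must also be requested explicitly, in addition to selecting a GPU partition. Should GPU computing become available on psychp01, you would typically add a directive like the following to the job script (the exact GRES name depends on the cluster configuration, so treat it as an assumption):

  #SBATCH --gres=gpu:1      # request one GPU for the job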
  
[[{:backward_arrow.png?40|width: 12em}cluster-advanced_slurm|Advanced SLURM]][[{:forward_arrow.png?40|width: 12em}cluster-ijobs|Interactive jobs]]\\
[[{:toc.png?40|width: 12em}cluster-toc|Return to Table of Contents]][[{:main_page.png?40|width: 12em}cluster-guide|Return to main page]]
  
  