Pysano reference manual

Contents
  1. Job directories
    1. Location
    2. Permission
    3. Contents
  2. Cmd.txt files
    1. Directives
      1. #e
      2. #c
      3. #a
      4. #t
    2. Macros
      1. @align
      2. @alignmetrics
      3. @collectrnaseqmetrics
      4. @sortsam
      5. @SamTranscriptomeParser
    3. Java commands
    4. Shell commands
    5. Path substitution
  3. Starting and stopping jobs
    1. pstart
    2. pstop
  4. Dry runs
  5. Limits and quotas
    1. Concurrent jobs
    2. Disk quotas on home directories
  6. Monitoring jobs on the web page

Job directories

Each pysano job is contained within a directory on the Linux file system. When the job is run by pysano, the directory and its contents (including subdirectories) will be copied to a computer cluster prior to being submitted to the cluster for execution.

Location

Job directories may be located within the /home, /scratch, or /tomato file systems. Directories that are local to a particular server, for example /tmp or /usr/tmp, are not visible to the pysano service. If a job directory contains symbolic or soft links to other files or directories, those files must also be within /home, /scratch, or /tomato.

Permission

Job directories must be writable by the pysano service. The best way to guarantee this is to make the directories group-writable.
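
For example, one way to create a group-writable job directory (myjob is a placeholder name):

mkdir myjob
chmod g+w myjob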

Contents

Each job directory must contain a file named cmd.txt at its top level; nested subdirectories do not need their own copies. This file contains the statements and directives that will be executed by pysano. It must be readable by the pysano service, i.e. it must be group-readable.
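
A minimal job directory might therefore look like this (all names other than cmd.txt are placeholders):

myjob/
    cmd.txt
    reads.fastq.gz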

Cmd.txt files

The file "cmd.txt" must be found in the job directory for each job. This script may contain pysano directives, shell statements, pysano macros, and calls to java programs.
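
For example, a minimal cmd.txt might contain a single directive and a single command (the email address is a placeholder; directives, macros, and java commands are described below):

#e my_address@example.edu
SamTranscriptomeParser.jar -f *.sam.gz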

Directives

Directives are commands to pysano in your cmd.txt file that instruct pysano how to run your job.

#e contact_email_address

Pysano will email notifications about this job to the given email address. This directive is required; without it your job will not run. You can optionally suppress email messages from pysano with the options -a, -b, or -e: -a suppresses the job terminate message, -b suppresses the job begin message, and -e suppresses the job end message. These can be combined, e.g. "-abe" will suppress all of the email messages. These options must follow the email address, for example "#e my_address@example.edu -b". You cannot suppress warning and error messages, however.

#c preferred_cluster_name

Pysano will direct the job to the named cluster, assuming the cluster is accepting jobs. If that cluster is not accepting jobs, or if you don't specify which cluster you want, your job will be sent to the least busy cluster that is available, which allows pysano to balance the load across the clusters. The available clusters are listed below.
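
For example, to request the ember cluster (one of the clusters listed under Limits and quotas below):

#c ember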

#a analysis_number

Pysano will store your results in the given GNomEx analysis. Analysis numbers begin with the letter A, for example: "#a A2178". The analysis must already exist in GNomEx before a pysano job can store results there.

#t maximum_runtime_hours

This sets the maximum "wall time" (i.e. time on the clock on the wall, not time on the CPU) allowed for your job, in hours. Pysano will use this value or the default for the assigned cluster, whichever is less.
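
For example, to allow a job at most 24 hours of wall time:

#t 24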

Macros

Macros are shorthand notation for performing a complex command or a series of commands in pysano. These are expanded by pysano before your job is executed. A good practice when using a macro for the first time is to execute your job as a "dry run" (see below) so you can see exactly what commands will be run.

@align -i fastq_files -g genome_index [-p bisulphite] [-novoalign [novoalign_options]]

Run an alignment using novoalign on the specified genome index file with the given Fastq files. The genome_index argument is expanded to the full path name of the matching genome index file plus the extension ".nov.illumina.nix". For example, "@align -g hg19" will align to the hg19.nov.illumina.nix index in the Human/Hg19 directory in pysano. The "-p bisulphite" option turns on bisulphite-mode alignment, which requires a special bisulphite genome index file; in this case the suffix used for the genome index file will be ".nov.bisulphite.nix". Extra options for novoalign can be given with the "-novoalign [more options]" notation. Common novoalign options are "-o SAM" (produce SAM output) and "-r Random" (select one repetitive alignment at random). The "-gzip" option will gzip the alignment output.
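
For example, a standard alignment against hg19 and its bisulphite variant might look like this (the Fastq file name is a placeholder):

@align -i reads.fastq.gz -g hg19
@align -i reads.fastq.gz -g hg19 -p bisulphite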

@alignmetrics

@collectrnaseqmetrics

Calls the Picard CollectRnaSeqMetrics program to collect RNA-seq metrics for an alignment.

@sortsam

Calls the Picard SortSam program to sort a SAM file by genome coordinate and produce an indexed BAM file.

@SamTranscriptomeParser

Calls the USeq SamTranscriptomeParser program.

@SamTranscriptomeParser -f input_files [ -options [ samtranscriptomeparser_options ]]

The "-f" option is required, and must be followed by one or more input files or directories (e.g. ".").

The "-options [ extra_options]" argument is optional.

A better way to call this program is like this:

SamTranscriptomeParser.jar -f input_files ...

This gets expanded by Pysano to the command:

java -Xmx22G -jar SamTranscriptomeParser.jar -f input_files ...

The benefit of calling SamTranscriptomeParser.jar this way is that the -Xmx option will be customized for the cluster where your job is run, and the version of SamTranscriptomeParser.jar used will be the most up-to-date one available.
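
For example, to run the program over all input files in the current job directory:

SamTranscriptomeParser.jar -f .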

Java commands

You can use the name of a java .jar file as a command in your cmd.txt file, and pysano will locate the jar file and expand the command to include the full path of the jar file, plus any extra command-line arguments you provide. For example, if your cmd.txt file contains the statement:

SamTranscriptomeParser.jar -f *.sam.gz


pysano will expand that statement to:

java -Xmx24G -jar /tomato/dev/app/useq/8.8.0/Apps/SamTranscriptomeParser.jar -f *.sam.gz

The -Xmx option to java indicates the amount of RAM to be allocated to the java process, and is set by pysano to an amount appropriate to the cluster on which your job is run.

Shell commands

You can enter any shell command in your cmd.txt file. Pysano will execute these commands on the compute cluster, after editing any path names.
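
For example, a cmd.txt might include ordinary shell statements such as these (the file names are placeholders):

gunzip sample1.fastq.gz sample2.fastq.gz
cat sample1.fastq sample2.fastq > combined.fastq
gzip combined.fastq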

Path substitution

Because each job directory is copied to a cluster before execution (see Job directories above), pysano edits the path names in your cmd.txt so that they refer to the job's working directory on the cluster's local file system rather than to the original location under /home, /scratch, or /tomato. This means you can refer to files in your job directory by their usual paths, and your commands will still find them when the job runs on the cluster.

Starting and stopping jobs

Pysano jobs are started and stopped using the "pstart" and "pstop" commands. Each of these commands requires the name or names of directories where your pysano jobs are located.

pstart

The pstart command takes one or more job directory names as its arguments, for example "pstart job_directory_1 job_directory_2". Each job directory must exist, must be writable by pysano, and must contain a cmd.txt file with a valid #e email_address directive. If all of the given job directories can be validated then pstart will start all of the jobs; otherwise none of the jobs will be started. If there are any problems starting the jobs, pstart will print an error message and exit with a non-zero exit status.
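
Because pstart exits with a non-zero status on failure, it can be used safely in shell scripts (the directory name is a placeholder):

pstart my_job_directory || echo "pstart failed; fix the job directory and try again"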

pstop

Use the pstop command to stop running pysano jobs, e.g. "pstop job_directory_1". pstop can only be used on running jobs that are owned by you. Otherwise, pstop will produce an error message.

Dry runs

Sometimes it is useful to submit a job to pysano without actually running the job. This is called a dry run. This is useful when creating a new script, or when using an unfamiliar macro. To start a dry run use the pdryrun command, which takes the same arguments as the pstart command.

pdryrun will first validate each job directory, and pysano will then process the jobs without actually submitting them to a cluster for execution. You will receive the appropriate email messages about the status of your jobs, and once the dry run is complete a pbs.sh script will appear in each job directory, containing all the commands that would have been executed on the cluster to which your job was assigned.
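
For example, to perform a dry run and inspect the generated script (the directory name is a placeholder):

pdryrun my_job_directory
less my_job_directory/pbs.sh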

Limits and quotas

Pysano jobs are limited by the CPU, memory, and disk space resources available on each node within the computer clusters. Jobs are run within the local disk space on each node, in the node's /scratch/local file system. These file systems vary in size from 250 GB to 400 GB depending on the cluster.

Cluster       Location   # Nodes   # CPUs/node         RAM      Local Disk Space
ember         CHPC       14        12, hyperthreaded   24 GB    400 GB
kingspeak     CHPC       4         16, hyperthreaded   64 GB    400 GB
kingspeak_20  CHPC       4         20, hyperthreaded   64 GB    400 GB
timbuktu      HCI        5         24                  132 GB   250 GB

Concurrent jobs

Users are limited to 5 concurrent running jobs in pysano. You may start as many jobs as you like, but only 5 jobs will actively run at a time. Your other jobs will remain in the queue until some of your active jobs finish.

Disk quotas on home directories

Your pysano jobs are subject to disk quotas on the Linux file systems, like any other Linux activity. Home directory quotas are typically set to 100 GB. If your pysano jobs fail while transferring results back to your home directory, a full quota may be the problem. There are no user quotas on the /scratch file system, so it is a good place to run pysano jobs. Other ways to optimize your disk space use are to compress your files with gzip, remove unnecessary temporary files, and direct your results into a GNomEx analysis using the #a directive.
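
For example, the end of a cmd.txt might remove temporary files and compress results before they are copied back (the file names are placeholders):

rm -f *.tmp
gzip -f *.sam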

Monitoring jobs on the web page

When your job is accepted for execution by pysano, you will receive an email message containing a link to a web page where you can monitor your jobs (assuming you supplied pysano with the correct email address, and did not suppress that message with the -b option to the #e directive). That web page will update every 10 seconds to show your job's progress through the pysano system. The elapsed time reported by the clusters is typically updated every 60 seconds. When your jobs reach the "running" state in pysano, they have been submitted for execution to the queue of a particular cluster. Once execution of the job begins, the elapsed time of the job will show its progress.