From 5507ba36584693c7153547ed1457b7ed22fc8b7a Mon Sep 17 00:00:00 2001
From: Kimmo Mattila
Date: Fri, 10 Jan 2025 15:44:37 +0200
Subject: [PATCH 01/15] Using SD Connect with a-commands and allas-dir-to-bucket

---
 .../sd-connect-and-a-commands.md              | 109 +++++++++++++++++
 .../sd-connect-sharing-for-import.md          | 111 ++++++++++++++++++
 .../sequencing_center_tutorial.md             |  18 +--
 3 files changed, 230 insertions(+), 8 deletions(-)
 create mode 100644 docs/data/sensitive-data/sd-connect-and-a-commands.md
 create mode 100644 docs/data/sensitive-data/sd-connect-sharing-for-import.md

diff --git a/docs/data/sensitive-data/sd-connect-and-a-commands.md b/docs/data/sensitive-data/sd-connect-and-a-commands.md
new file mode 100644
index 0000000000..ec28aeabc1
--- /dev/null
+++ b/docs/data/sensitive-data/sd-connect-and-a-commands.md
@@ -0,0 +1,109 @@
+# Using SD Connect service with a-commands

SD Connect is part of the CSC sensitive data services, which provide a free-of-charge sensitive data processing environment for
academic research projects at Finnish universities and research institutes. SD Connect adds an automatic encryption layer to the Allas object storage system of CSC, so that it can be used for securely storing sensitive data. Data stored in SD Connect can also be accessed from the SD Desktop secure virtual desktops.

In most cases SD Connect is used through the [SD Connect web interface](https://sd-connect.csc.fi), but in some cases command line tools
provide a more efficient way to manage data in SD Connect.

In this document we describe how you can use the a-commands provided by [allas-cli-utils](https://github.com/CSCfi/allas-cli-utils) to upload and download data from SD Connect. These tools are available on the CSC supercomputers (Puhti, Mahti and Lumi), and they can be installed on local Linux and Mac machines too.

Note that Allas itself does not separate data stored with SD Connect from other data stored in
Allas. Data buckets can contain a mixture of SD Connect data, other encrypted data and normal data,
and it is up to the user to know the type of the data. However, it is a good idea to keep SD Connect data
in buckets and folders that don't contain other types of data.


## Opening a connection to SD Connect

To open an SD Connect compatible Allas connection, you must add the option *--sdc* to the configuration command. On the CSC supercomputers the connection is opened with the commands:

```text
module load allas
allas-conf --sdc
```

In local installations the connection is typically opened with commands like:

```text
export PATH=/some-local-path/allas-cli-utils:$PATH
source /some-local-path/allas-cli-utils/allas_conf -u your-csc-account --sdc
```

The setup process first asks for your CSC password (Haka or Virtu passwords can't be used here).
After that you select the CSC project to be used. This is the normal login process for Allas.
However, when SD Connect is enabled, the process also asks you to provide an *SD Connect API token*. This
token must be retrieved from the [SD Connect web interface](https://sd-connect.csc.fi). Note that the tokens
are project specific. Make sure you have selected the same SD Connect project on the command line and in the web
interface.

In the web interface the token can be created in the dialog that opens when you select *Create API tokens* from the *Support* menu.

Copy the token, paste it to the command line and press enter.

The SD Connect compatible Allas connection is now valid for the next eight hours. You can now use commands like
*a-list* and *a-delete* to manage both normal Allas objects and SD Connect objects.
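For example, listing the contents of a bucket works the same way as with a normal Allas connection. A minimal sketch (the bucket name below is just an illustration):

```text
a-list 2000123-sens
```

The listing shows SD Connect objects with their *.c4gh* suffixes, alongside any other objects stored in the bucket.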
## Data upload

Data can be uploaded to SD Connect using the command *a-put* with the option *--sdc*.
For example, to upload the file *my-secret-table.csv* to the location *2000123-sens/dataset2* in Allas, use the command:

```text
a-put --sdc my-secret-table.csv -b 2000123-sens/dataset2
```

This will produce the SD Connect object: 2000123-sens/dataset2/my-secret-table.csv.c4gh

All other a-put options and features can be used too. For example, directories are
stored as tar files if the --asis option is not used.

The command:

```text
a-put --sdc my-secret-directory -b 2000123-sens/dataset2
```

will produce the SD Connect object: 2000123-sens/dataset2/my-secret-directory.tar.c4gh

For massive data uploads, you can use *allas-dir-to-bucket* in combination with the option *--sdc*:

```text
allas-dir-to-bucket --sdc my-secret-directory 2000123-new-sens
```

The command above will copy all the files from the directory my-secret-directory to the bucket 2000123-new-sens in SD Connect compatible format.


## Data download

Data can be downloaded from Allas with the command a-get. If an SD Connect connection is enabled, a-get will automatically try to decrypt objects with the suffix *.c4gh*.

For example, the command:

```text
a-get 2000123-sens/dataset2/my-secret-table.csv.c4gh
```

will produce the local file: my-secret-table.csv

Similarly, the command:

```text
a-get 2000123-sens/dataset2/my-secret-directory.tar.c4gh
```

will produce the local directory: my-secret-directory

Note that this automatic decryption works only for files that have
been stored using the new SD Connect that was taken into use in October 2024.

For older SD Connect files and other Crypt4gh encrypted files you must still
provide the matching secret key with the option *--sk*:

```text
a-get --sk my-key.sec 2000123-sens/old-date/sample1.txt.c4gh
```

Unfortunately there is no easy way to know which encryption method has been used in
a .c4gh file stored in Allas.
\ No newline at end of file
diff --git a/docs/data/sensitive-data/sd-connect-sharing-for-import.md b/docs/data/sensitive-data/sd-connect-sharing-for-import.md
new file mode 100644
index 0000000000..8154f8c162
--- /dev/null
+++ b/docs/data/sensitive-data/sd-connect-sharing-for-import.md
@@ -0,0 +1,111 @@
+# Using SD Connect to receive sensitive research data

This document provides instructions on how a research group can use SD Connect to receive **sensitive data** from an external
data provider such as a sequencing center. The procedure presented here is applicable in cases where the data will be analyzed in
SD Desktop or on a computer that has an internet connection.

In some sensitive data environments an internet connection is not available. In those cases, please check the alternative
approach described in:

 * [Using Allas to receive sensitive research data](./sequencing_center_tutorial.md)


## SD Connect

SD Connect is part of the CSC sensitive data services, which provide a free-of-charge sensitive data processing environment for
academic research projects at Finnish universities and research institutes. SD Connect adds an automatic encryption layer to the Allas object storage system of CSC, so that it can be used for securely storing sensitive data. SD Connect can be used for storing any kind of sensitive research data during the active working phase of a research project.
SD Connect is, however, not intended for data archiving.
You must remove your data from SD Connect when the research project ends.

There are no automatic backup processes in SD Connect. On a technical level SD Connect is very reliable and fault-tolerant,
but if you, or some of your project members, remove or overwrite some data in SD Connect,
it is permanently lost. Thus, you might consider making a backup copy of your data to some other location.

Please check the [SD Connect documentation](./sd_connect.md) for more details about SD Connect.


## 1. Obtaining storage space in SD Connect

If you are already using the SD Connect service, you can skip this chapter and start from chapter 2.
Otherwise, do the following steps to get access to SD Connect.


### 1.1. Create a user account

If you are not yet a CSC customer, register yourself with CSC. You can do these steps in
CSC's customer portal [MyCSC](https://my.csc.fi).

Create a CSC account by logging in to MyCSC with Haka or Virtu. Remember to activate multi-factor
authentication for your CSC account in order to be able to use SD Connect.


### 1.2. Create or join a project

In addition to a CSC user account, users must either join an existing CSC computing project
or set up a new computing project. You can use the same project to access other
CSC services too, such as SD Desktop, Puhti, or Allas.

If you are eligible to act as a [project manager](https://research.csc.fi/prerequisites-for-a-project-manager), you can create a new CSC project in MyCSC and apply for access to SD Connect.
Select 'Academic' as the project type. As a project manager, you can invite other users as members to your project.

If you wish to join an existing project, please ask the project manager to add your CSC user account to the
project member list.

### 1.3. Add SD Connect access for your project

Add the _SD Connect_ service to your project in MyCSC. Only the project manager can add services.
After you have added SD Connect to the project, the other project members need to log in to
MyCSC and approve the terms of use for the service before getting access to SD Connect.

After these steps, your project has 10 TB of storage space available in SD Connect.
Please [contact CSC Service Desk](../../support/contact.md) if you need more storage space.


## 2. Creating a shared folder

### 2.1. Creating a new root folder in SD Connect

Once the service is enabled, you can log in to the [SD Connect interface](https://sd-connect.csc.fi).
After connecting, check that the **Current project** setting refers to the CSC project
that you want to use. After that you can click the **Create folder** button to
create a new folder to be shared with the data provider.

Avoid using spaces (use _ instead) and special characters in folder names, as they may cause problems in some cases.
Further, add some project specific feature, such as a project acronym, to the name, as the root folder needs to have a unique name
among all root folders of all SD Connect and Allas projects.

### 2.2 Sharing the folder

For sharing you need to know the _Sharing ID_ string of the data producer. You should request this 32-character
random string from the data producer by email.

To do the sharing, go to the folder list in SD Connect and press the share icon of the folder you wish to share.
Then copy the Sharing ID to the first field of the sharing tool and select **Collaborate** as the sharing permission type.

Now the sharing is done and you can send the name of the shared bucket to the data producer by email.
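While the transfer is ongoing, project members can follow what has already arrived in the shared folder, either in the SD Connect web interface or with the a-commands described in [Using SD Connect data with a-commands](sd-connect-and-a-commands.md). A minimal sketch that lists the received objects and counts them (the bucket name below is just an illustration):

```text
a-list 2001234-incoming-myproject
a-list 2001234-incoming-myproject | wc -l
```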
### 2.3 Revoke bucket sharing after data transport

Moving large datasets (several terabytes) to SD Connect can take a long time.
Once the producer tells you that all the data has been imported to the shared folder in Allas, you should remove the external
access rights in the SD Connect interface. Click the _share_ icon of the shared
folder and press **Delete** next to the project ID of the data producer.


## 3. Using encrypted data

By default, data stored in SD Connect is accessible only to the members of the CSC project. However, project members can
share the folder with other CSC projects.

The project members can download the data to their own computers using the SD Connect web interface,
which automatically decrypts the data after downloading.

The data can also be accessed in [SD Desktop](https://sd-desktop.csc.fi), using the _Data Gateway_
tool.

On Linux and Mac computers, you can install a local copy of the _allas-cli-utils_ package, which provides command line
tools to download (_a-get_) and upload (_a-put --sdc_) data from and to SD Connect.

* [Using SD Connect data with a-commands](sd-connect-and-a-commands.md)


diff --git a/docs/data/sensitive-data/sequencing_center_tutorial.md b/docs/data/sensitive-data/sequencing_center_tutorial.md
index 1e957a67d0..24fbd9dbab 100644
--- a/docs/data/sensitive-data/sequencing_center_tutorial.md
+++ b/docs/data/sensitive-data/sequencing_center_tutorial.md
@@ -1,5 +1,9 @@
 # Using Allas storage service to receive sensitive research data
 
+This document provides an example of how a research group can use the Allas service to receive **sensitive data** from an external
+data provider such as a sequencing center. In many cases [SD Connect](sd-connect-sharing-for-import.md) provides an easier way to receive sensitive data, but in some cases SD Connect can't be used. For example, SD Connect cannot provide an encrypted file that you could later decrypt in an environment that does not have an internet connection.
+
+## Allas
 Allas storage service is a general purpose data storage service maintained by CSC. It provides free-of-charge
 storage space for academic research projects at Finnish universities and research institutes.
@@ -10,9 +14,6 @@ There is no automatic backup processes in Allas. In technical level Allas is ver
 but if you, or some of your project members, remove or overwrite some data in Allas,
 it is permanently lost. Thus, you might consider making a backup copy of your data to some other location.
 
-This document provides an example of how a research group can use Allas service to receive **sensitive data** from external
-data provider like a sequencing center.
-
 The steps 1 (Obtaining storage space in Allas), and 2 (Generating encryption keys) require some work, but they
 need to be done only once. Once you have the keys in place you can move directly to step 3 when you
 need to prepare a new shared bucket.
@@ -34,17 +35,18 @@ Create a CSC account by logging in to MyCSC with Haka or Virtu.
 
 ### Step 1.2. Create or join a project
 
+
-In addition to CSC user account, new users must either join a CSC computing project
+In addition to a CSC user account, users must either join an existing CSC computing project
 or set up a new computing project. You can use the same project to access other
-CSC services too like Puhti, cPouta, or SD desktop.
+CSC services too, such as SD Desktop, SD Connect or Puhti.
 
-Create a CSC project in MyCSC and apply access to Allas. See if you are eligible to act as a project manager.
-If your work belongs to any of the free-of-charge use cases, select 'Academic' as the project type.
-As a project manager, you can invite other users as members to your project.
+If you are eligible to act as a [project manager](https://research.csc.fi/prerequisites-for-a-project-manager), you can create a new CSC project in MyCSC and apply for access to Allas.
+Select 'Academic' as the project type. As a project manager, you can invite other users as members to your project.
 
 If you wish to join an existing project, please ask the project manager to add your CSC user account to the
 project member list.
 
+
 ### Step 1.3. Add Allas access for your project
 
 Add _Allas_ service to your project in MyCSC. Only the project manager can add services.

From 9de47016ebababe992ca1d821f538e16cd87e1c7 Mon Sep 17 00:00:00 2001
From: Kimmo Mattila
Date: Mon, 3 Feb 2025 08:39:59 +0200
Subject: [PATCH 02/15] sdsi page

---
 docs/data/sensitive-data/tutorials/sdsi.md | 279 +++++++++++++++++++++
 1 file changed, 279 insertions(+)
 create mode 100644 docs/data/sensitive-data/tutorials/sdsi.md

diff --git a/docs/data/sensitive-data/tutorials/sdsi.md b/docs/data/sensitive-data/tutorials/sdsi.md
new file mode 100644
index 0000000000..c9574cae97
--- /dev/null
+++ b/docs/data/sensitive-data/tutorials/sdsi.md
@@ -0,0 +1,279 @@
+# Submitting jobs from SD Desktop to the HPC environment of CSC

The limited computing capacity of an SD Desktop virtual machine can prevent running heavy analysis tasks
for sensitive data. This document describes how heavy computing tasks can be submitted from SD Desktop
to the Puhti HPC cluster.

Please note the following details that limit the usage of this procedure:

 * You have to contact servicedesk@csc.fi to enable the job submission tools for your project. By default the job submission tools don't work.
 * Each job always reserves one, and only one, full Puhti node for your task. Try to construct your batch job so that it effectively uses all 40 computing cores of one Puhti node.
 * The input files that the job uses must be uploaded to SD Connect before the job submission. Even though the job is submitted from SD Desktop, you can't utilize any files from the SD Desktop VM in the batch job.
 * The jobs submitted from SD Desktop to Puhti have a higher security level than normal Puhti jobs, but lower than that of SD Desktop.


# Getting started

Add the Puhti service to your project, then contact CSC (servicedesk@csc.fi) and request that Puhti access be created for your SD Desktop environment. In this process a robot account is created for your project and a project specific server process is launched for your project by CSC.

The job submission is done with the command `sdsi-client`. This tool can be added to your SD Desktop machine by installing `CSC Tools` with the SD tool installer.

# Submitting jobs

## Data Upload

The batch jobs submitted by sdsi-client read their input data from the SD Connect service. Thus all the input data must be uploaded to SD Connect before the job is submitted. Note that you can't use data on the local disks of your SD Desktop virtual machine or unencrypted files as input files for your batch job. However, local files in Puhti can be used if the access permissions allow all group members to use the data.

Thus the first step in constructing a sensitive data batch job is to upload the input data to SD Connect.
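For example, assuming an SD Connect compatible Allas connection has been opened as described in [Using SD Connect service with a-commands](../sd-connect-and-a-commands.md), for instance on your own computer or on Puhti, the input files used in the sample job below could be uploaded with commands like:

```text
a-put --sdc data1.txt -b 2008749-sdsi-input
a-put --sdc data2.txt -b 2008749-sdsi-input
```

This produces the objects *2008749-sdsi-input/data1.txt.c4gh* and *2008749-sdsi-input/data2.txt.c4gh* that the job definition file below refers to.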
## Constructing a batch job file

When you submit a batch job from SD Desktop, you must define the following information:

1. What files need to be downloaded from SD Connect to Puhti to be used as input files
2. What commands will be executed
3. What data will be exported from Puhti to SD Connect when the job ends
4. How many resources (time, memory, temporary disk space) the job needs.

You can define these things on the command line as _sdsi-client_ command options, but normally
it is more convenient to give this information as a batch job definition file.
Below is a sample of a simple sdsi job definition file, named _job1.sdsi_:

```text
data:
  recv:
    - 2008749-sdsi-input/data1.txt.c4gh
    - 2008749-sdsi-input/data2.txt.c4gh
run: |
  md5sum 2008749-sdsi-input/data1.txt
  md5sum 2008749-sdsi-input/data2.txt
sbatch:
- --time=00:15:00
- --partition=test
```

More sdsi batch job examples can be found below.

## Submitting the job

The batch job defined in the file can be submitted with the command:

```text
sdsi-client new -input job1.sdsi
```

The submission command will ask for your CSC password, after which it prints the ID number of the job.
You can use this ID number to check the status of your job. For example, for job 123456 you can check the status
in *SD Desktop* with the command:

```text
sdsi-client status 123456
```

Alternatively, you can use this ID in *Puhti* with the `sacct` command:

```text
sacct -j 123456
```

## Steps of processing

The task submitted with sdsi-client is transported to the batch job system of Puhti,
where it is processed among other batch jobs. The resource requirements for the batch job (computing time, memory, local disk size, GPUs) are
set according to the values defined in the _sbatch:_ section of the job description file.

The actual computing starts only when a suitable Puhti node is available. Queueing times may be long, as
the job always reserves one full node with sufficient local disk and memory.

The execution of the actual computing includes the following steps:

 1. The input files, defined in the job description file, are downloaded and decrypted to the
 local temporary disk space of the computing node.

 2. Commands defined in the _run:_ section are executed.

 3. Output files are encrypted and uploaded to SD Connect.

 4. Local temporary disk space is cleaned.


## Output

By default the exported files include the standard output and standard error of the batch job (meaning the information
that in interactive work is written to the terminal screen) and the files moved to the directory _$RESULTS_.


In SD Connect the results are uploaded to a bucket named sdhpc-results-project_number, in a subfolder named after the
batch job ID. In the example above the project used was 2008749 and the job ID was 123456. Thus the job would produce two
new files in SD Connect:

```txt
 sdhpc-results-2008749/123456/slurm.err.tar.c4gh
 sdhpc-results-2008749/123456/slurm.out.tar.c4gh
```

You can change the output bucket with the sdsi-client option `-bucket bucket_name`. Note that the bucket
name must be unique in this case too.


## Practicalities

The jobs that sdsi submits reserve one full Puhti node. These nodes have 40 computing cores,
so you should use these batch jobs only for tasks that can utilize multiple computing cores.
Preferably all 40.

In the previous example, the actual computing task consisted of calculating md5
checksums for two files.
The command used, `md5sum`, is able to use just one computing core, so
the job wasted resources: 40 cores were reserved for the job but only one was used.

However, if you need to run a large number of unrelated tasks that are able to use only one
or a few computing cores, you can use tools like gnuparallel, nextflow or snakemake to run several
computing tasks at the same time.

In the example below we have a tar file that has been stored in SD Connect: 2008749-sdsi-input/data_1000.tar.c4gh.
The tar file contains 1000 files for which we want to compute md5sums. Now the batch job could look like
the following:


```text
data:
  recv:
    - 2008749-sdsi-input/data_1000.tar.c4gh
run: |
  source /appl/profile/zz-csc-env.sh
  module load parallel
  tar xf 2008749-sdsi-input/data_1000.tar
  ls 2008749-sdsi-input/data_1000 | parallel -j 40 md5sum
sbatch:
- --time=04:00:00
- --partition=small
```


```
data:
  recv:
    - sdsi-poc/rand1.c4gh
    - sdsi-poc/rand2.c4gh
  send:
    - from: /dev/shm/slurm.err
      to: subfolder
    - from: /dev/shm/slurm.out
      to: another_folder
bucket: results_bucket
run: cat sdsi-poc/rand1 sdsi-poc/rand2
time-limit: 00:15:00
queue: test
```

txt
data:
  recv:
    - 2008749-data/data1.txt.c4gh
run: |
  md5sum 2008749-data/data1.txt

```

```txt
data:
  recv:
    - 2008749-data/genotype_1.fam.c4gh
    - 2008749-data/genotype_1.bim.c4gh
    - 2008749-data/genotype_1.bed.c4gh
run: |
  source /appl/profile/zz-csc-env.sh
  module load plink/1.90b7.2
  pli
```

The tools for running the backup process are not installed by default in SD Desktop virtual machines. Thus, the first step is that the
manager installs the **SD Backup tool** package using the [SD Software installer](../../sensitive-data/sd-desktop-software.md#customisation-via-sd-software-installer).

Log in to your SD Desktop and open **Data Gateway**. If the software installation help tools are enabled for your project, you should have the folder
`tools-for-sd-desktop` included in the directory that Data Gateway created (in `Projects/SD-Connect/your-project-name`). If you don't find the `tools-for-sd-desktop`
directory through Data Gateway, **send a request to [CSC Service Desk](../../../support/contact.md)**. In the request, indicate that you wish the SD Desktop software installation help tools to
be made available for your project. You must also include in the message the **Project identifier string** of your project.
You can check this random string, for example, in the [SD Connect service](https://sd-connect.csc.fi). There you find the
Project Identifier in the **User information** view.

Open the `tools-for-sd-desktop` folder and from there, drag/copy the file `sd-installer.desktop` to your desktop.

[![Installing-sd-installer](../images/desktop/sd-installer1.png)](../images/desktop/sd-installer1.png)

**Figure 1.** Copying the `sd-installer.desktop` file to SD Desktop.

Double-click the copy of `sd-installer.desktop` to start the software installation tool. Use this tool to install the _SD Backup_ tool
to your SD Desktop virtual machine if you have not yet done so.
+ +## Project Mangers Starts backup server + +When the SD Backup tool is installed, the Project Manager should start a new terminal session and there start a virtual terminal session with command: + +```text +screen +``` + +and then launch the backup process with command: + +```text +sd-backup-server.sh +``` + +When launched, `sd-backup-server.sh` asks for the CSC password of the Project Manager. + +After that the project manager can leave the virtual session running in background by pressing: +`Ctrl+a+d`. + +This way the `sd-backup-server.sh` command remains active in the virtual terminal session even when the connection to SD Desktop is closed. + +The actual server process is very simple. It checks the content of the backup directory once in a minute and moves the contents of this directory +to a bucket in SD Connect. The data is encrypted with CSC public key so that the backups can be used only in SD Desktop environment. +The default backup directory is `/shared-data/auto-backup` and target bucket in SD Connect is `sdd-backup-`. + +Note that the server is not able to check if the given password was correct. If a wrong password was given then backup requests will fail. +Thus, it may be good to test the backup process once the server is running. + +## Doing backups + +When the backup server is running, all users of the VM can use command `sd-backup` to make a backup copy of the dataset to SD Connect. +The syntax of the `sd-backup` command is: + +```text +sd-backup file.csv +``` + +or + +```text +sd-backup directory +``` + +The command copies the given file or directory to the backup directory from where the server process is able to move it to SD Connect. +In SD Connect a timestamp is added to the file name in order to make the file name unique. In addition, a metadata file is +created. This file contains information about the user who requested the backup, original host and location of the file. If backup is done for +a directory, then the content of the directory is stored as one tar-archive file and the metadata file will contain list of the backed-up files. + +For example, for a file called `my_data.csv` that locates in SD Desktop virtual machine called `secserver-1683868755`, a backup command: + +```text +sd-backup my_data.csv +``` + +Will create a backup file that will be available through Data Gateway in path: + +```text +Projects/SD-Connect/project_number/sdd-backup-secserver-1683868755/my_data.csv-2023-05-15-07:41 +``` + +Note that you have to refresh the Data Gateway connection in order to see the changes in SD Connect. From 6b9f9d0311fa5475f49f4318ae2387aff63fe2a9 Mon Sep 17 00:00:00 2001 From: Kimmo Mattila Date: Mon, 3 Feb 2025 08:50:26 +0200 Subject: [PATCH 03/15] sdsi page --- docs/data/sensitive-data/tutorials/sdsi.md | 122 ++++++--------------- 1 file changed, 31 insertions(+), 91 deletions(-) diff --git a/docs/data/sensitive-data/tutorials/sdsi.md b/docs/data/sensitive-data/tutorials/sdsi.md index c9574cae97..1898ef3403 100644 --- a/docs/data/sensitive-data/tutorials/sdsi.md +++ b/docs/data/sensitive-data/tutorials/sdsi.md @@ -65,10 +65,10 @@ You can use this ID number to check the status of your job. 
For example for job in *SD Desk desktop* with command: ```text -sdsi-client status 1234 +sdsi-client status 123456 ``` -Aternatively, you can use this ID in *Puhti* with `sacct` command: +Alternatively, you can use this ID in *Puhti* with `sacct` command: ```text sacct -j 123456 @@ -80,7 +80,7 @@ The task submitted with sdsi-client is transported to the batch job system of Pu where it is processed among other batch jobs. The resource requirements for the batch job: computing time, memory, local disk size, GPUs, are set according to the values defined in the _sbatch:_ section in the job description file. -The actual computing and starts only when a suitable Puhti node is available. Queueing times may be long as +The actual computing starts only when a suitable Puhti node is available. Queueing times may be long as the jobs always reserves one full node with sufficient local disk and memory. The execution of the actual computing includes following steps: @@ -98,10 +98,10 @@ The execution of the actual computing includes following steps: ## Output By default the exported files include standard output and standard error of the batch job (meaning the information -that in interactive working is written to to the terminal screen ) and files that moved in directory _$RESULTS_. +that in interactive working is written to the terminal screen ) and files that moved in directory _$RESULTS_. -In SD Connect the results are uploaded to a bucket named as: sdhpc-results-project_number, in a subfolder named after the +In SD Connect the results are uploaded to a bucket named as: *sdhpc-results-*_project_number_, in a subfolder named after the batch job ID. In the example above the project used was 2008749 and the job id was 123456. Thus the job would produce two new files in SD Connect: @@ -109,7 +109,7 @@ new files in SD Connect: sdhpc-results-2008749/123456/slurm.err.tar.c4gh sdhpc-results-2008749/123456/slurm.out.tar.c4gh ``` - You change the output bucket with sdsi-client option `-bucket bucket_name`. Note that the bucket + You can change the output bucket with sdsi-client option `-bucket bucket_name`. Note that the bucket name must be uniq in this case too. @@ -119,12 +119,12 @@ The jobs that sdsi submits reserve one full Puhti node. These nodes have 40 comp so you should use these batch jobs only for tasks can utilize multiple computing cores. Preferably all 40. -In the previous example, the batch job the actual computing task consisted of calculating md5 +In the previous example, the actual computing task consisted of calculating md5 checksums for two files. The command used, `md5sum`, is able to use just one computing core so -the job waisted resources as 40 cores were reserved for the job but only one was used. +the job waisted resources as 40 cores were reserved but only one was used. However if you need to calculate a large amount of unrelated tasks that are able to use only one -or few computing cores, you can use tools like gnuparallel, nextfllow or snakemake to submit several +or few computing cores, you can use tools like _gnuparallel_, _nextfllow_ or _snakemake_ to submit several computing tasks to be executed in the same time. In the example below we have a tar file that has been stored to SD Connect: 2008749-sdsi-input/data_1000.tar.c4gh. 
@@ -140,15 +140,36 @@ run: | source /appl/profile/zz-csc-rnv.sh module load parallel tar xf 2008749-sdsi-input/data_1000.tar - ls 2008749-sdsi-input/data_1000 | parallel -j 40 md5sun + ls data_1000 | parallel -j 40 md5sun sbatch: - --time=04:00:00 - --partition=small ``` +In the sample job above, the first source command is used to add +module command and other Puhti settings to the execution environment. +The GNUparallel is enabled command `module load parallel`. +Next the tar file containing 1000 files is extracted to the temporary local disk area. +Finally, the file listing of the extracted directory is guided to `parallel` command that runs +the given command, `md5sum`, for each file using 40 parallel processes (`-j 40`). +In the next example, GPU computing are used to speed up whisper speech recognition tool that +the user has installed to her own python virtual environment in Puhti +```text +data: + recv: + - 2008749-sdsi-input/interview-52.mp4.c4gh +run: | + source /appl/profile/zz-csc-rnv.sh + module load pytorch + source /projappl/project_2008749/whisper-python/bin/activate + whisper --model medium 2008749-sdsi-input/interview-52.mp4 --threads 40 +sbatch: +- --time=01:00:00 +- --gres=gpu:v100:1 +``` @@ -196,84 +217,3 @@ run: | module load plink/1.90b7.2 pli ``` - -The tools for running backup process are not by default installed in SD Desktop Virtual Machines. Thus, the first step is that the -manager installs the **SD Backup tool** package using the [SD Software installer](../../sensitive-data/sd-desktop-software.md#customisation-via-sd-software-installer) - -Log in to your SD Desktop and open **Data Gateway**. If the software installation help tools are enabled for your project, then you should have folder: -`tools-for-sd-desktop` included in the directory that Data Gateway created (in `Projects/SD-Connect/your-project-name`). If you don't find `tools-for-sd-desktop` -directory through Data Gateway **send a request to [CSC Service Desk](../../../support/contact.md)**. In the request, indicate that you wish that the SD Desktop software installation help tools would -be made available for your project. You must also include to the message the **Project identifier string** of your project. -You can check this random string for example in the [SD Connect service](https://sd-connect.csc.fi). There you find the -Project Identifier in the **User information** view. - -Open `tools-for-sd-desktop` folder and from there, drag/copy file `sd-installer.desktop` to your desktop. - -[![Installing-sd-installer](../images/desktop/sd-installer1.png)](../images/desktop/sd-installer1.png) - -**Figure 1.** Copying `sd-installer.desktop` file to SD desktop. - -Double-click the copy of `sd-installer.desktop` to start the software installation tool. Use this tool to install _SD Backup_ tool -to your SD Desktop virtual machine if you have not yet done so. - -## Project Mangers Starts backup server - -When the SD Backup tool is installed, the Project Manager should start a new terminal session and there start a virtual terminal session with command: - -```text -screen -``` - -and then launch the backup process with command: - -```text -sd-backup-server.sh -``` - -When launched, `sd-backup-server.sh` asks for the CSC password of the Project Manager. - -After that the project manager can leave the virtual session running in background by pressing: -`Ctrl+a+d`. - -This way the `sd-backup-server.sh` command remains active in the virtual terminal session even when the connection to SD Desktop is closed. 
- -The actual server process is very simple. It checks the content of the backup directory once in a minute and moves the contents of this directory -to a bucket in SD Connect. The data is encrypted with CSC public key so that the backups can be used only in SD Desktop environment. -The default backup directory is `/shared-data/auto-backup` and target bucket in SD Connect is `sdd-backup-`. - -Note that the server is not able to check if the given password was correct. If a wrong password was given then backup requests will fail. -Thus, it may be good to test the backup process once the server is running. - -## Doing backups - -When the backup server is running, all users of the VM can use command `sd-backup` to make a backup copy of the dataset to SD Connect. -The syntax of the `sd-backup` command is: - -```text -sd-backup file.csv -``` - -or - -```text -sd-backup directory -``` - -The command copies the given file or directory to the backup directory from where the server process is able to move it to SD Connect. -In SD Connect a timestamp is added to the file name in order to make the file name unique. In addition, a metadata file is -created. This file contains information about the user who requested the backup, original host and location of the file. If backup is done for -a directory, then the content of the directory is stored as one tar-archive file and the metadata file will contain list of the backed-up files. - -For example, for a file called `my_data.csv` that locates in SD Desktop virtual machine called `secserver-1683868755`, a backup command: - -```text -sd-backup my_data.csv -``` - -Will create a backup file that will be available through Data Gateway in path: - -```text -Projects/SD-Connect/project_number/sdd-backup-secserver-1683868755/my_data.csv-2023-05-15-07:41 -``` - -Note that you have to refresh the Data Gateway connection in order to see the changes in SD Connect. From c2a04737dba9e466af9e144ed845207d89ff6531 Mon Sep 17 00:00:00 2001 From: Kimmo Mattila Date: Mon, 3 Feb 2025 16:03:20 +0200 Subject: [PATCH 04/15] sdsi examples fixed --- docs/data/sensitive-data/tutorials/sdsi.md | 89 ++++++++++++++++++---- 1 file changed, 75 insertions(+), 14 deletions(-) diff --git a/docs/data/sensitive-data/tutorials/sdsi.md b/docs/data/sensitive-data/tutorials/sdsi.md index 1898ef3403..d65fc42994 100644 --- a/docs/data/sensitive-data/tutorials/sdsi.md +++ b/docs/data/sensitive-data/tutorials/sdsi.md @@ -113,11 +113,11 @@ new files in SD Connect: name must be uniq in this case too. -## Practicalities +## Running serial jobs effectively -The jobs that sdsi submits reserve one full Puhti node. These nodes have 40 computing cores -so you should use these batch jobs only for tasks can utilize multiple computing cores. -Preferably all 40. +The jobs that sdsi submits reserve always one full Puhti node. These nodes have 40 computing cores +so you should use these batch jobs only for tasks that can utilize multiple computing cores. +Preferably all 40. In the previous example, the actual computing task consisted of calculating md5 checksums for two files. The command used, `md5sum`, is able to use just one computing core so @@ -127,35 +127,96 @@ However if you need to calculate a large amount of unrelated tasks that are able or few computing cores, you can use tools like _gnuparallel_, _nextfllow_ or _snakemake_ to submit several computing tasks to be executed in the same time. 
-In the example below we have a tar file that has been stored to SD Connect: 2008749-sdsi-input/data_1000.tar.c4gh. -The tar file contains 1000 files for which we want to compute md5sum. Now the batch job could look like -following: +In the examples below we have a tar-arcvive file that has been stored to SD Connect: `2008749-sdsi-input/data_1000.tar.c4gh`. +The tar file contains 1000 text files (_.txt_) for which we want to compute md5sum. Bellow we have three alternative ways to run the tasks +so that all 40 cores are effectively used. + +### GNUparallel +In the case of GNUparallel based parallelization the workflow could look like +following: ```text data: recv: - 2008749-sdsi-input/data_1000.tar.c4gh run: | - source /appl/profile/zz-csc-rnv.sh + source /appl/profile/zz-csc-env.sh module load parallel tar xf 2008749-sdsi-input/data_1000.tar - ls data_1000 | parallel -j 40 md5sun + cd data_1000 + ls *.txt | parallel -j 40 md5sum {} ">" {.}.md5 + tar -cvf md5sums.tar *.md5 + mv md5sums.tar $RESULTS/ sbatch: - --time=04:00:00 - --partition=small ``` -In the sample job above, the first source command is used to add -module command and other Puhti settings to the execution environment. +In the sample job above, the first command, `source /appl/profile/zz-csc-env.sh` is used to add +_module_ command and other Puhti settings to the execution environment. The GNUparallel is enabled command `module load parallel`. Next the tar file containing 1000 files is extracted to the temporary local disk area. -Finally, the file listing of the extracted directory is guided to `parallel` command that runs -the given command, `md5sum`, for each file using 40 parallel processes (`-j 40`). +Finally, the file listing of the .txt filesmin the extracted directory is guided to `parallel` command that runs +the given command, `md5sum`, for each file (_{}_) using 40 parallel processes (`-j 40`). + + +### snakemake + +If we want to use SnakeMake we must first upload a SnakeMake job file (_md5sums.snakefile_ in this case) to SD Connect. +This file defines the input files to be processed, commands to be executed and outputs to be create. +Note that you can't upload this file to the SD Connect form SD Desktop, but you must upload it for +example from your own computer or from Puhti. + +Content of SnakeMake file _md5sums.snakefile_ + +```text +txt_files = [f for f in os.listdir(".") if f.endswith(".txt")] + +rule all: + input: + expand("{file}.md5", file=txt_files) + +rule md5sum: + input: + "{file}" + output: + "{file}.md5" + shell: + "md5sum {input} > {output}" +``` + +The actual sdsi job file could look like this: + +```text +data: + recv: + - 2008749-sdsi-input/md5sums.snakefile.c4gh + - 2008749-sdsi-input/data_1000.tar.c4gh +run: | + source /appl/profile/zz-csc-env.sh + module load snakemake + mkdir snakemake_cache + export SNAKEMAKE_OUTPUT_CACHE=$(pwd)"/snakemake_cache" + tar xf 2008749-sdsi-input/data_1000.tar + cp 2008749-sdsi-input/md5sums.snakefile data_1000 + cd data_1000 + snakemake --cores 40 --snakefile md5sums.snakefile + tar -cvf md5sums.tar *.md5 + mv md5sums.tar $RESULTS/ + +sbatch: +- --time=04:00:00 +- --partition=small +``` + In the next example, GPU computing are used to speed up whisper speech recognition tool that -the user has installed to her own python virtual environment in Puhti +the user has installed to her own python virtual environment in Puhti. 
+ + + ```text data: From 5bd0f4df397e19e576d5abdedc47951e34840d0a Mon Sep 17 00:00:00 2001 From: Kimmo Mattila Date: Mon, 3 Feb 2025 19:14:23 +0200 Subject: [PATCH 05/15] sdsi examples fixed --- docs/data/sensitive-data/tutorials/sdsi.md | 61 ++++++++++++++++++++-- 1 file changed, 56 insertions(+), 5 deletions(-) diff --git a/docs/data/sensitive-data/tutorials/sdsi.md b/docs/data/sensitive-data/tutorials/sdsi.md index d65fc42994..42f843eb45 100644 --- a/docs/data/sensitive-data/tutorials/sdsi.md +++ b/docs/data/sensitive-data/tutorials/sdsi.md @@ -160,11 +160,64 @@ Next the tar file containing 1000 files is extracted to the temporary local disk Finally, the file listing of the .txt filesmin the extracted directory is guided to `parallel` command that runs the given command, `md5sum`, for each file (_{}_) using 40 parallel processes (`-j 40`). +### nextfllow -### snakemake +If we want to use NextFlow we must first upload a NextFlow task file (_md5sums.nf_ in this case) to SD Connect. +This file defines the input files to be processed, commands to be executed and outputs to be created. +Note that you can't upload this file to the SD Connect form SD Desktop, but you must upload it for +example from your own computer or from Puhti. + +Content of NextFlow file _md5sums.nf_ + +```text +nextflow.enable.dsl=2 + +process md5sum { + tag "$filename" + + input: + path txt_file from files("*.txt") + + output: + path "${txt_file}.md5" + + script: + """ + md5sum $txt_file > ${txt_file}.md5 + """ +} + +workflow { + md5sum() +} +``` +The actual sdsi job file could look like this: + +```text +data: + recv: + - 2008749-sdsi-input/md5sums.nf.c4gh + - 2008749-sdsi-input/data_1000.tar.c4gh +run: | + source /appl/profile/zz-csc-env.sh + module load nextflow + tar xf 2008749-sdsi-input/data_1000.tar + cp 2008749-sdsi-input/md5sums.nf data_1000 + cd data_1000 + nextflow run md5sums.nf -process.executor local -process.maxForks 40 + tar -cvf md5sums.tar *.md5 + mv md5sums.tar $RESULTS/ + +sbatch: +- --time=04:00:00 +- --partition=small +``` + + +### SnakeMake If we want to use SnakeMake we must first upload a SnakeMake job file (_md5sums.snakefile_ in this case) to SD Connect. -This file defines the input files to be processed, commands to be executed and outputs to be create. +This file defines the input files to be processed, commands to be executed and outputs to be created. Note that you can't upload this file to the SD Connect form SD Desktop, but you must upload it for example from your own computer or from Puhti. @@ -210,14 +263,12 @@ sbatch: - --partition=small ``` - +### GPU computing In the next example, GPU computing are used to speed up whisper speech recognition tool that the user has installed to her own python virtual environment in Puhti. 
- - ```text data: recv: From 314470ba7dd8998de79cb1109fb7fe6a2d21aa07 Mon Sep 17 00:00:00 2001 From: kkmattil Date: Mon, 23 Jun 2025 16:26:51 +0300 Subject: [PATCH 06/15] Update sdsi.md --- docs/data/sensitive-data/tutorials/sdsi.md | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/docs/data/sensitive-data/tutorials/sdsi.md b/docs/data/sensitive-data/tutorials/sdsi.md index 42f843eb45..51b17d3d47 100644 --- a/docs/data/sensitive-data/tutorials/sdsi.md +++ b/docs/data/sensitive-data/tutorials/sdsi.md @@ -13,9 +13,9 @@ Please note following details that limit the usage of this procedure: # Getting stared -Add Puhti service to your project and contact CSC (sevicedesk@csc.fi) and request that Puhti access will be created for your SD Desktop environment. In this process a robot account will be create for your project and a project specific server process is launched for you project by CSC Puhti. +Add Puhti service to your project and contact CSC (sevicedesk@csc.fi) and request that Puhti access will be created for your SD Desktop environment. In this process a robot account will be created for your project and a project specific server process is launched for you project by CSC Puhti. -The job submission is done with command `sdsi-client`. This tool can be added to your SD desktop machine by installing `CSC Tools` with SD tool installer to your SD Desktop machine. +The job submission is done with command `sdsi-client`. This tool can be added to your SD desktop machine by installing `CSC Tools` with [SD tool installer](../sd-desktop-software.md/#customisation-via-sd-software-installer) to your SD Desktop machine. # Submitting jobs @@ -23,7 +23,7 @@ The job submission is done with command `sdsi-client`. This tool can be added to The batch josb submitted by sdsi-client read the input data from SD Connect service. Thus all the input data must be uploaded to SD Connect before the job is submitted. Note that you can't use data in the local disks of your SD Desktop virtual machine or unencrypted files as input files for your batch job. However, local files in Puhti can be used, if the access permissions allow all group members to use the data. -Thus the first step in constructing a sensitive data batch job is to upload the input data to SD Coonnect. +Thus the first step in constructing a sensitive data batch job is to upload the input data to SD Connect. ## Constructing a batch job file @@ -35,8 +35,8 @@ When you submit a batch job from SD Desktop, you must define following informati 4. How much resources (time, memory, temporary dick space ) the job needs. You can define this thins in command line as _sdsi-client_ command options, but normally -it is more convenient to give this information a as batch job definition file. -Below is a sample of a simple sdsi job definition file, named as job1.sdsi +it is more convenient to give this information as a batch job definition file. +Below is a sample of a simple sdsi job definition file, named as _job1.sdsi_ ```text data: @@ -110,12 +110,12 @@ new files in SD Connect: sdhpc-results-2008749/123456/slurm.out.tar.c4gh ``` You can change the output bucket with sdsi-client option `-bucket bucket_name`. Note that the bucket - name must be uniq in this case too. + name must be unique in this case too. ## Running serial jobs effectively -The jobs that sdsi submits reserve always one full Puhti node. These nodes have 40 computing cores +The jobs that sdsi-client submits reserve always one full Puhti node. 
These nodes have 40 computing cores so you should use these batch jobs only for tasks that can utilize multiple computing cores. Preferably all 40. @@ -128,8 +128,7 @@ or few computing cores, you can use tools like _gnuparallel_, _nextfllow_ or _sn computing tasks to be executed in the same time. In the examples below we have a tar-arcvive file that has been stored to SD Connect: `2008749-sdsi-input/data_1000.tar.c4gh`. -The tar file contains 1000 text files (_.txt_) for which we want to compute md5sum. Bellow we have three alternative ways to run the tasks -so that all 40 cores are effectively used. +The tar file contains 1000 text files (_.txt_) for which we want to compute md5sum. Bellow we have three alternative ways to run the tasks so that all 40 cores are effectively used. ### GNUparallel @@ -155,7 +154,7 @@ sbatch: In the sample job above, the first command, `source /appl/profile/zz-csc-env.sh` is used to add _module_ command and other Puhti settings to the execution environment. -The GNUparallel is enabled command `module load parallel`. +GNUparallel is enabled with command `module load parallel`. Next the tar file containing 1000 files is extracted to the temporary local disk area. Finally, the file listing of the .txt filesmin the extracted directory is guided to `parallel` command that runs the given command, `md5sum`, for each file (_{}_) using 40 parallel processes (`-j 40`). From e93b69c828b5ecb11bd677d3807e3707ee86c918 Mon Sep 17 00:00:00 2001 From: kkmattil Date: Thu, 3 Jul 2025 12:46:37 +0300 Subject: [PATCH 07/15] Update sdsi.md --- docs/data/sensitive-data/tutorials/sdsi.md | 103 +++++---------------- 1 file changed, 24 insertions(+), 79 deletions(-) diff --git a/docs/data/sensitive-data/tutorials/sdsi.md b/docs/data/sensitive-data/tutorials/sdsi.md index 51b17d3d47..5f7488058b 100644 --- a/docs/data/sensitive-data/tutorials/sdsi.md +++ b/docs/data/sensitive-data/tutorials/sdsi.md @@ -1,6 +1,6 @@ # Submitting jobs from SD Desktop to the HPC environment of CSC -The limited computing capacity of a SD Desktop virtual machine can prevent running heavy analysis tasks +The limited computing capacity of a SD Desktop virtual machines can prevent running heavy analysis tasks for sensitive data. This document describes, how heavy compting tasks can be submitted form SD Desktop to the Puhti HPC cluster. @@ -13,15 +13,15 @@ Please note following details that limit the usage of this procedure: # Getting stared -Add Puhti service to your project and contact CSC (sevicedesk@csc.fi) and request that Puhti access will be created for your SD Desktop environment. In this process a robot account will be created for your project and a project specific server process is launched for you project by CSC Puhti. +Add Puhti service to your project and contact CSC (sevicedesk@csc.fi) and request that Puhti access will be created for your SD Desktop environment. In this process a robot account will be created for your project and a project specific server process is launched for you project by CSC. -The job submission is done with command `sdsi-client`. This tool can be added to your SD desktop machine by installing `CSC Tools` with [SD tool installer](../sd-desktop-software.md/#customisation-via-sd-software-installer) to your SD Desktop machine. +The job submission is done with command `sdsi-client`. This command can be added to your SD desktop machine by installing `CSC Tools` with [SD tool installer](../sd-desktop-software.md/#customisation-via-sd-software-installer) to your SD Desktop machine. 
# Submitting jobs

## Data Upload

-The batch josb submitted by sdsi-client read the input data from SD Connect service. Thus all the input data must be uploaded to SD Connect before the job is submitted. Note that you can't use data in the local disks of your SD Desktop virtual machine or unencrypted files as input files for your batch job. However, local files in Puhti can be used, if the access permissions allow all group members to use the data.
+The batch jobs submitted by sdsi-client read their input data from the SD Connect service. Thus all the input data must be uploaded to SD Connect before the job is submitted. Note that you can't use data on the local disks of your SD Desktop virtual machine or unencrypted files as input files for your batch job. However, local files in Puhti can be used if the access permissions allow all group members to use the data.

Thus the first step in constructing a sensitive data batch job is to upload the input data to SD Connect.

## Constructing a batch job file

When you submit a batch job from SD Desktop, you must define the following information:

-1. What files need be downloaded from SD Connect to Puhti to be used as input files
-2. What commands will be executed
+1. What files need to be downloaded from SD Connect to Puhti to be used as input files (`data:`)
+2. What commands will be executed (`run:`)
3. What data will be exported from Puhti to SD Connect when the job ends
-4. How much resources (time, memory, temporary dick space ) the job needs.
+4. How many resources (time, memory, temporary disk space) the job needs (`sbatch:`)

-You can define this thins in command line as _sdsi-client_ command options, but normally
+You can define these things on the command line as _sdsi-client_ command options, but normally
it is more convenient to give this information as a batch job definition file.
Below is a sample of a simple sdsi job definition file, named _job1.sdsi_

## Submitting the job

The batch job defined in the file can be submitted with the command:

```text
sdsi-client new -input job1.sdsi
```
-The submission command will ask for your CSC password, after which it prints you the ID number of the job.
-You can use this ID number to check the status of your job. For example for job 123456 you can check the status
-in *SD Desk desktop* with command:
+The submission command will ask for your CSC password, after which it submits the task and prints the ID number of the job.
+You can use this ID number to check the status of your job. For example, for job 123456 you can check the status in *SD Desktop* with the command:

```text
sdsi-client status 123456
```
Alternatively, you can use this ID in *Puhti* with the `sacct` command:

```text
sacct -j 123456
```

## Output

-By default the exported files include standard output and standard error of the batch job (meaning the information
-that in interactive working is written to the terminal screen ) and files that moved in directory _$RESULTS_.
+By default the exported files include the standard output and standard error of the batch job (this is the text that in interactive work is written to the terminal screen) and the files that are in the directory _$RESULTS_.

-In SD Connect the results are uploaded to a bucket named as: *sdhpc-results-*_project_number_, in a subfolder named after the
-batch job ID. In the example above the project used was 2008749 and the job id was 123456.
Thus the job would produce two
-new files in SD Connect:
+The results are uploaded from Puhti to SD Connect into a bucket named *sdhpc-results-*_project_number_, in a subfolder named after the batch job ID. In the example above the project used was 2008749 and the job ID was 123456. Thus the job would produce two new files in SD Connect:

```txt
 sdhpc-results-2008749/123456/slurm.err.tar.c4gh
 sdhpc-results-2008749/123456/slurm.out.tar.c4gh
```
- You can change the output bucket with sdsi-client option `-bucket bucket_name`. Note that the bucket
+ You can change the output bucket with the sdsi-client option `-bucket bucket-name`. Note that the bucket
 name must be unique in this case too.


## Running serial jobs effectively

-The jobs that sdsi-client submits reserve always one full Puhti node. These nodes have 40 computing cores
-so you should use these batch jobs only for tasks that can utilize multiple computing cores.
+The jobs that sdsi-client submits always reserve one full Puhti node. These nodes have 40 computing cores,
+so you should use these batch jobs for tasks that can utilize multiple computing cores.
 Preferably all 40.

In the previous example, the actual computing task consisted of calculating md5
checksums for two files. The command used, `md5sum`, is able to use just one computing core, so
the job wasted resources: 40 cores were reserved but only one was used.

However, if you need to run a large number of unrelated tasks that are able to use only one
or a few computing cores, you can use tools like _gnuparallel_, _nextflow_ or _snakemake_ to run several
computing tasks at the same time.

-In the examples below we have a tar-arcvive file that has been stored to SD Connect: `2008749-sdsi-input/data_1000.tar.c4gh`.
-The tar file contains 1000 text files (_.txt_) for which we want to compute md5sum. Bellow we have three alternative ways to run the tasks
-so that all 40 cores are effectively used.
+In the examples below we have a tar archive file that has been stored in SD Connect: `2008749-sdsi-input/data_1000.tar.c4gh`. The tar file contains 1000 text files (_.txt_) for which we want to compute md5sums. Below are three alternative ways to run the tasks so that all 40 cores are effectively used.

### GNUparallel

In the case of GNUparallel based parallelization the workflow could look like the following:

```text
data:
  recv:
    - 2008749-sdsi-input/data_1000.tar.c4gh
run: |
  source /appl/profile/zz-csc-env.sh
  module load parallel
  tar xf 2008749-sdsi-input/data_1000.tar
  cd data_1000
  ls *.txt | parallel -j 40 md5sum {} ">" {.}.md5
  tar -cvf md5sums.tar *.md5
  mv md5sums.tar $RESULTS/
sbatch:
- --time=04:00:00
- --partition=small
```

In the sample job above, the first command, `source /appl/profile/zz-csc-env.sh`, is used to add
the _module_ command and other Puhti settings to the execution environment.
GNUparallel is enabled with the command `module load parallel`.
Next, the tar file containing the 1000 files is extracted to the temporary local disk area.
Finally, the listing of the .txt files in the extracted directory is piped to the `parallel` command, which runs the given command, `md5sum`, for each file (_{}_) using 40 parallel processes (`-j 40`).

### Nextflow

-If we want to use NextFlow we must first upload a NextFlow task file (_md5sums.nf_ in this case) to SD Connect.
-This file defines the input files to be processed, commands to be executed and outputs to be created.
-Note that you can't upload this file to the SD Connect form SD Desktop, but you must upload it for
-example from your own computer or from Puhti.
+If you want to use NextFlow, you must first upload a NextFlow task file (_md5sums.nf_ in this case) to SD Connect. This file defines the input files to be processed, the commands to be executed and the outputs to be created. Note that you can't upload this file to SD Connect from SD Desktop; you must upload it, for example, from your own computer or from Puhti.

 Content of NextFlow file _md5sums.nf_

@@ -215,10 +205,7 @@ sbatch:

 ### SnakeMake

-If we want to use SnakeMake we must first upload a SnakeMake job file (_md5sums.snakefile_ in this case) to SD Connect.
-This file defines the input files to be processed, commands to be executed and outputs to be created.
-Note that you can't upload this file to the SD Connect form SD Desktop, but you must upload it for
-example from your own computer or from Puhti.
+If you want to use SnakeMake you must first upload a SnakeMake job file (_md5sums.snakefile_ in this case) to SD Connect. This file defines the input files to be processed, the commands to be executed and the outputs to be created. Note that you can't upload this file to SD Connect from SD Desktop, but you must upload it for example from your own computer or from Puhti.

 Content of SnakeMake file _md5sums.snakefile_

@@ -264,8 +251,9 @@ sbatch:

 ### GPU computing

-In the next example, GPU computing are used to speed up whisper speech recognition tool that
-the user has installed to her own python virtual environment in Puhti.
+sdsi-client can also be used to submit jobs that utilize the GPU capacity of Puhtu.
+In example blelow example, GPU computing are used to speed up whisper speech recognition tool.
+Whisper is installed in Puhti and activated there with the command `module load whisper`.

 ```text

@@ -273,10 +261,9 @@ data:
   recv:
     - 2008749-sdsi-input/interview-52.mp4.c4gh
 run: |
-  source /appl/profile/zz-csc-rnv.sh
-  module load pytorch
-  source /projappl/project_2008749/whisper-python/bin/activate
-  whisper --model medium 2008749-sdsi-input/interview-52.mp4 --threads 40
+  source /appl/profile/zz-csc-env.sh
+  module load whisper
+  whisper --model large -f all -o $RESULTS --language Italian 2008749-sdsi-input/interview-52.mp4
 sbatch:
   - --time=01:00:00
   - --gres=gpu:v100:1
@@ -284,47 +271,5 @@ sbatch:


-```
-data:
-  recv:
-    - sdsi-poc/rand1.c4gh
-    - sdsi-poc/rand2.c4gh
-  send:
-    - from: /dev/shm/slurm.err
-      to: subfolder
-    - from: /dev/shm/slurm.out
-      to: another_folder
-bucket: results_bucket
-run: cat sdsi-poc/rand1 sdsi-poc/rand2
-time-limit: 00:15:00
-queue: test
-```
-
-
-
-
-
-txt
-data:
-  recv:
-    - 2008749-data/data1.txt.c4gh
-run: |
-  md5sum 2008749-data/data1.txt
-
-
-
-```
-```txt
-data:
-  recv:
-    - 2008749-data/genotype_1.fam.c4gh
-    - 2008749-data/genotype_1.bim.c4gh
-    - 2008749-data/genotype_1.bed.c4gh
-run: |
-  source /appl/profile/zz-csc-env.sh
-  module load plink/1.90b7.2
-  pli
-```

From f8640c2d7dcea2f29dfea5e167c0c440105c1dee Mon Sep 17 00:00:00 2001
From: kkmattil
Date: Thu, 3 Jul 2025 12:48:02 +0300
Subject: [PATCH 08/15] Update sdsi.md

---
 docs/data/sensitive-data/tutorials/sdsi.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/data/sensitive-data/tutorials/sdsi.md b/docs/data/sensitive-data/tutorials/sdsi.md
index 5f7488058b..962f3c106b 100644
--- a/docs/data/sensitive-data/tutorials/sdsi.md
+++ b/docs/data/sensitive-data/tutorials/sdsi.md
@@ -251,8 +251,8 @@ sbatch:

 ### GPU computing

-sdsi-client can also be used to submit jobs that utilize the GPU capacity of Puhtu.
-In example blelow example, GPU computing are used to speed up whisper speech recognition tool.
+sdsi-client can also be used to submit jobs that utilize the GPU capacity of Puhti.
+In the example below, GPU computing is used to speed up the whisper speech recognition tool.
 Whisper is installed in Puhti and activated there with the command `module load whisper`.

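+
+Such a GPU job is submitted and monitored in the same way as the CPU examples. As a sketch, assuming the job definition below were saved as _whisper-job.sdsi_ (the file name is only an illustration):
+
+```text
+sdsi-client new -input whisper-job.sdsi
+# check the job with the ID printed by the command above (here 123456):
+sdsi-client status 123456
+```
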
From c3848569e84fc8769a8a843dd545fda821534030 Mon Sep 17 00:00:00 2001 From: kkmattil Date: Thu, 3 Jul 2025 13:11:47 +0300 Subject: [PATCH 09/15] Update sd-desktop-working.md --- docs/data/sensitive-data/sd-desktop-working.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/docs/data/sensitive-data/sd-desktop-working.md b/docs/data/sensitive-data/sd-desktop-working.md index 53d79bca22..06e11680c2 100644 --- a/docs/data/sensitive-data/sd-desktop-working.md +++ b/docs/data/sensitive-data/sd-desktop-working.md @@ -133,3 +133,7 @@ Read next: - [How to import data for analysis in your desktop](./sd-desktop-access.md) - [Customisation: adding software](./sd-desktop-software.md) - [How to manage your virtual desktop (delete, pause, detach volume etc.)](./sd-desktop-manage.md) + +## Submitting jobs from SD Desktop to HPC environments + +- [How to use sdsi-client to submit batch jobs from SD Desktop to Puhti](./sdsi.md) From 6fb922602fdad9a0f5e4775aa9e81bab4b60b5a9 Mon Sep 17 00:00:00 2001 From: kkmattil Date: Thu, 3 Jul 2025 13:16:56 +0300 Subject: [PATCH 10/15] Update sd-desktop-working.md --- docs/data/sensitive-data/sd-desktop-working.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/data/sensitive-data/sd-desktop-working.md b/docs/data/sensitive-data/sd-desktop-working.md index 06e11680c2..4fab332add 100644 --- a/docs/data/sensitive-data/sd-desktop-working.md +++ b/docs/data/sensitive-data/sd-desktop-working.md @@ -131,7 +131,7 @@ The computing environment i.e. virtual desktop (visible from your browser) is is Read next: - [How to import data for analysis in your desktop](./sd-desktop-access.md) -- [Customisation: adding software](./sd-desktop-software.md) +- [Customisation: adding software](./tutorials/sd-desktop-software.md) - [How to manage your virtual desktop (delete, pause, detach volume etc.)](./sd-desktop-manage.md) ## Submitting jobs from SD Desktop to HPC environments From c62ed534542573b87f367d904c98b7b67f989636 Mon Sep 17 00:00:00 2001 From: kkmattil Date: Thu, 3 Jul 2025 13:22:40 +0300 Subject: [PATCH 11/15] Update sd-desktop-working.md --- docs/data/sensitive-data/sd-desktop-working.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/data/sensitive-data/sd-desktop-working.md b/docs/data/sensitive-data/sd-desktop-working.md index 4fab332add..7214477916 100644 --- a/docs/data/sensitive-data/sd-desktop-working.md +++ b/docs/data/sensitive-data/sd-desktop-working.md @@ -136,4 +136,4 @@ Read next: ## Submitting jobs from SD Desktop to HPC environments -- [How to use sdsi-client to submit batch jobs from SD Desktop to Puhti](./sdsi.md) +- [How to use sdsi-client to submit batch jobs from SD Desktop to Puhti](./turorials/sdsi.md) From fbd023c27cc0eac1134d35b6a324bc24f64d4e7e Mon Sep 17 00:00:00 2001 From: kkmattil Date: Thu, 3 Jul 2025 13:27:11 +0300 Subject: [PATCH 12/15] Update sd-desktop-working.md --- docs/data/sensitive-data/sd-desktop-working.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/data/sensitive-data/sd-desktop-working.md b/docs/data/sensitive-data/sd-desktop-working.md index 7214477916..65a6fa276e 100644 --- a/docs/data/sensitive-data/sd-desktop-working.md +++ b/docs/data/sensitive-data/sd-desktop-working.md @@ -136,4 +136,4 @@ Read next: ## Submitting jobs from SD Desktop to HPC environments -- [How to use sdsi-client to submit batch jobs from SD Desktop to Puhti](./turorials/sdsi.md) +- [How to use sdsi-client to submit batch jobs from SD Desktop to Puhti](./tutorials/sdsi.md) From 
3d393a705e5a61c520802bfd0c43cd62064e9a83 Mon Sep 17 00:00:00 2001
From: kkmattil
Date: Thu, 3 Jul 2025 13:37:29 +0300
Subject: [PATCH 13/15] Update sd-desktop-working.md

---
 docs/data/sensitive-data/sd-desktop-working.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/data/sensitive-data/sd-desktop-working.md b/docs/data/sensitive-data/sd-desktop-working.md
index 65a6fa276e..45e5a93c21 100644
--- a/docs/data/sensitive-data/sd-desktop-working.md
+++ b/docs/data/sensitive-data/sd-desktop-working.md
@@ -131,7 +131,7 @@ The computing environment i.e. virtual desktop (visible from your browser) is is

 Read next:

 - [How to import data for analysis in your desktop](./sd-desktop-access.md)
-- [Customisation: adding software](./tutorials/sd-desktop-software.md)
+- [Customisation: adding software](./sd-desktop-software.md)
 - [How to manage your virtual desktop (delete, pause, detach volume etc.)](./sd-desktop-manage.md)

 ## Submitting jobs from SD Desktop to HPC environments

From c6ca8a86c8763609e7bfc9f4e1f7ea67083ea078 Mon Sep 17 00:00:00 2001
From: kkmattil
Date: Thu, 3 Jul 2025 13:43:53 +0300
Subject: [PATCH 14/15] Update sdsi.md

---
 docs/data/sensitive-data/tutorials/sdsi.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/data/sensitive-data/tutorials/sdsi.md b/docs/data/sensitive-data/tutorials/sdsi.md
index 962f3c106b..0f4cc33db3 100644
--- a/docs/data/sensitive-data/tutorials/sdsi.md
+++ b/docs/data/sensitive-data/tutorials/sdsi.md
@@ -15,7 +15,7 @@ Please note following details that limit the usage of this procedure:

 Add the Puhti service to your project and contact CSC (servicedesk@csc.fi) to request that Puhti access be created for your SD Desktop environment. In this process a robot account will be created for your project and a project-specific server process is launched for your project by CSC.

-The job submission is done with command `sdsi-client`. This command can be added to your SD desktop machine by installing `CSC Tools` with [SD tool installer](../sd-desktop-software.md/#customisation-via-sd-software-installer) to your SD Desktop machine.
+The job submission is done with the command `sdsi-client`, which can be added to your SD Desktop machine by installing `CSC Tools` with the [SD tool installer](../sd-desktop-software.md).

 # Submitting jobs

From ca0e32b94c5a2634a0020d02b47699a5e683d329 Mon Sep 17 00:00:00 2001
From: kkmattil
Date: Fri, 15 Aug 2025 09:47:39 +0300
Subject: [PATCH 15/15] Update sdsi.md

---
 docs/data/sensitive-data/tutorials/sdsi.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/data/sensitive-data/tutorials/sdsi.md b/docs/data/sensitive-data/tutorials/sdsi.md
index 0f4cc33db3..8b9d696aad 100644
--- a/docs/data/sensitive-data/tutorials/sdsi.md
+++ b/docs/data/sensitive-data/tutorials/sdsi.md
@@ -30,7 +30,7 @@ Thus the first step in constructing a sensitive data batch job is to upload the

 When you submit a batch job from SD Desktop, you must define the following information:

 1. What files need to be downloaded from SD Connect to Puhti to be used as input files (`data:`)
-2. What commands will be executed (`run : `)
+2. What commands will be executed (`run: `)
 3. What data will be exported from Puhti to SD Connect when the job ends
 4. What resources (time, memory, temporary disk space) the job needs (`sbatch:`)
@@ -51,11 +51,11 @@ sbatch:
   - --partition=test
 ```

-More sdsi batch job examples can be found below
+More sdsi batch job examples can be found below.
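+
+For instance, the following sketch just downloads one file and computes its checksum (the object name _data1.txt_ is a placeholder for illustration; as in the other examples, the decrypted file is referenced by its bucket path without the _.c4gh_ suffix):
+
+```text
+data:
+  recv:
+    - 2008749-sdsi-input/data1.txt.c4gh
+run: |
+  source /appl/profile/zz-csc-env.sh
+  # assumption: data1.txt.c4gh has been uploaded to the input bucket beforehand
+  md5sum 2008749-sdsi-input/data1.txt
+sbatch:
+  - --time=00:15:00
+  - --partition=test
+```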

 ## Submitting the job

-The batch job defined in the file can be submitted with command
+The batch job defined in the file can be submitted with the command:

 ```text
 sdsi-client new -input job1.sdsi
 ```
@@ -98,7 +98,7 @@ The execution of the actual computing includes following steps:

 By default the exported files include the standard output and standard error of the batch job (this is the text that in interactive use would be written to the terminal screen) and the files that are in the directory _$RESULTS_.

-The results are uploaded from Puhti to SD Connecti into a bucket named *sdhpc-results-*_project_number_, in a subfolder named after the batch job ID. In the example above the project used was 2008749 and the job id was 123456. Thus the job would produce two new files in SD Connect:
+The results are uploaded from Puhti to SD Connect into a bucket named *sdhpc-results-*_project_number_, in a subfolder named after the batch job ID. In the example above the project used was 2008749 and the job id was 123456. Thus the job would produce two new files in SD Connect:

 ```txt
 sdhpc-results-2008749/123456/slurm.err.tar.c4gh