diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000..e43b0f9 --- /dev/null +++ b/.gitignore @@ -0,0 +1 @@ +.DS_Store diff --git a/_chapters/single-cell-analysis/01-introduction/01-sc-workflow.jpeg b/_chapters/single-cell-analysis/01-introduction/01-sc-workflow.jpeg new file mode 100644 index 0000000..783b3e0 Binary files /dev/null and b/_chapters/single-cell-analysis/01-introduction/01-sc-workflow.jpeg differ diff --git a/_chapters/single-cell-analysis/01-introduction/index.md b/_chapters/single-cell-analysis/01-introduction/index.md new file mode 100644 index 0000000..9eaa5f1 --- /dev/null +++ b/_chapters/single-cell-analysis/01-introduction/index.md @@ -0,0 +1,24 @@ +--- +title: 'Introduction to Single-Cell Expression' +--- + + An expression profile is a representation of the activity (the expression) of thousands of genes for a single biological sample. + +In traditional, bulk gene expression studies we usually compare two or more types of tissue samples, for instance a healthy and pathological one. More specifically, we compare their **expression profiles**: the set of genes expressed in one sample is contrasted with the corresponding set expressed in the other in order to identify systematic differences in gene activity. Because a single sample is made up of hundreds to millions of cells, the measured expression level of any given gene effectively reflects something like **the average expression** across all the cells present in that sample. + +However, the cells that make up a sample can differ widely in their characteristics: they may have different functions and morphologies, or be in different developmental or cell-cycle states. For example, in the human retina we can find as many as 5 different types of neurons, each specialized to perform a certain function! 
Averaging gene expression across all cells therefore masks cell-to-cell variation, making it impossible to determine whether observed expression patterns arise uniformly across cells or from specific cell subpopulations. If we want to study the differences in gene expression between different cells within a sample, we need techniques with a finer-grained resolution than bulk gene expression sequencing techniques allow. + + Single-cell sequencing examines the sequence information (e.g. DNA or RNA) from individual cells + +Here’s where single-cell sequencing comes in. Using optimized next-generation sequencing, it allows us to measure **sequence information at the level of a single cell**. We can sequence either the _genome_ or the _transcriptome_ of a single cell, but in this tutorial we'll be focusing on gene expression or _transcriptomic_ studies. These simultaneously **measure the RNA concentration** (conventionally only messenger RNA (mRNA)) of hundreds to thousands of genes in a single cell. + + +![](01-sc-workflow.jpeg) + + A gene expression profile of a single cell tells us something about its function and state + +The genes that are expressed in a certain cell are characteristic of its **function and of its state**. For instance, we expect similar gene expression profiles from two healthy liver cells and different expression profiles between a healthy liver cell and a cancerous one. This fine-grained resolution opens up new avenues for understanding complex biological processes, such as development, disease progression, and cellular responses to stimuli. + +The technology behind sequencing at the level of a single cell, however interesting, is not the topic of this tutorial. Rather, what we want to cover over the next few chapters is how to approach single-cell data once it has been obtained. In order to make sense of such data we need to analyze it. We will use **Orange** to perform just that in an easy and intuitive manner. 
+ + diff --git a/_chapters/single-cell-analysis/02-sc-data/index.md b/_chapters/single-cell-analysis/02-sc-data/index.md new file mode 100644 index 0000000..369cb87 --- /dev/null +++ b/_chapters/single-cell-analysis/02-sc-data/index.md @@ -0,0 +1,80 @@ +--- +title: 'Structure of Single Cell Data' +--- + +## Single Cell Datasets in Orange + +In single-cell expression studies, the data are typically first represented as a count matrix. Each row usually corresponds to an individual cell, and each column corresponds to a gene, with the entries recording how many RNA molecules from a given gene were detected (counted) in a given cell. + +Let's look at an example. You can find a number of preloaded, publicly available single cell datasets which can be accessed through the [Single Cell Datasets](https://orangedatamining.com/widget-catalog/single-cell/single_cell_datasets/) widget. We will explore some of them in the following chapters. + + +![](sc-notes-01-00_80.jpg) + +Let us start by constructing a workflow that consists of a [Single Cell Datasets](https://orangedatamining.com/widget-catalog/single-cell/single_cell_datasets/) widget and a [Data Table](https://orangedatamining.com/widget-catalog/data/datatable/) widget. The [Single Cell Datasets](https://orangedatamining.com/widget-catalog/single-cell/single_cell_datasets/) widget reads the data from the server. Open the widget by double-clicking its icon. The window shows a list of available datasets. Let's start with a smaller dataset, a sample from the study conducted by [Baron et al. (2017)](https://pubmed.ncbi.nlm.nih.gov/27667365/) composed of pancreatic cells from a single human donor. Double click on the line with this data set to instruct the widget to send the data to its output. After loading the data, open the [Data Table](https://orangedatamining.com/widget-catalog/) to see the data we have just loaded in the spreadsheet. 
+ +![](sc-notes-01-01_80.jpg)  + + +Counts signify how many copies of the expressed gene were detected in the cell + +There are 1631 cells and 5010 genes in this dataset sample. Orange data items are stored in rows - in single cell transcriptomics, our data items are cells. The cell **expression profiles** are therefore stored in rows. Columns refer to meta-features and genes. Our example data includes _cell class_, _barcode_, _cell id_ and some other meta information. When the gene expression values are represented with **whole numbers**, this usually indicates that we are dealing with **counts**, which **signify how many copies of the expressed gene were detected in the cell**. In other words, the numbers in the matrix tell us how many times we en**count**ered an RNA molecule of a gene in a particular cell. So, for instance, in the third row, we find a cell in which 0 RNA molecules of the genes AAAS and AACS were detected, but we encountered 12 transcripts of the gene AADAC. We call this kind of matrix a **count matrix**. + + +![](sc-notes-01-02_80.jpg)  + + Dropout refers to the phenomenon where a gene is expressed in a cell but not detected due to technical limitations, leading to false zero values + +By scrolling through the data, you will notice that there are many zero values in our count matrix - single cell data is **sparse**. This is completely normal. Since scRNA-seq captures RNA from individual cells, lowly expressed genes may have only a few RNA molecules present in a given cell, making them easy to miss during sequencing. This phenomenon is called **dropout**. A zero value in the count matrix can therefore signify either that the gene was truly not expressed in a given cell or, more likely, that its expression was not detected. + +Let's look at another sample dataset. 
Open the [Single Cell Datasets](https://orangedatamining.com/widget-catalog/single-cell/single_cell_datasets/) widget again and select the dataset composed of bone marrow mononuclear cells, a sample of the data from [Zheng et al. (2017)](https://www.nature.com/articles/ncomms14049). Double click to load the data. + +![](sc-notes-01-021_80.jpg)  + +Single-cell datasets can be quite large, so it often makes sense to begin analysis on a subset of the data. Although this dataset is already relatively small (as it is itself a sample of a larger dataset), we will use it to demonstrate how to create an even smaller sample. Let's forward the data to a [Data Sampler](https://orangedatamining.com/widget-catalog/transform/datasampler/) widget. Open the widget and sample 100 cells from the data. There are several sampling types to choose from: we select the Fixed sample size option, set the number of instances to 100 and press Sample Data. Forward the data to a new [Data Table](https://orangedatamining.com/widget-catalog/data/datatable/) and open it. + + +![](sc-notes-01-04_80.jpg) + +![](sc-notes-01-03_80.jpg)  + + +You can find the number of input and output instances displayed at the bottom of an Orange widget. By clicking on them, you can take a quick glimpse at the data in a pop-up data table. + +In the columns, we can again identify meta-features such as _cell type_, _replicate_, _ID_, and _barcode_, along with genes. The rows correspond to individual cells. However, this time, expression values are represented as decimals rather than whole numbers. This indicates that the counts have most likely already been normalized. + +Now, let's augment our workflow to visualize the data. Because single-cell gene expression data are high-dimensional - each cell is described by the expression levels of thousands of genes - they cannot be visualized directly. 
To make visualization possible, we first need to reduce the dimensionality of the data while preserving as much of its underlying structure as possible. + +For now let's take a quick glance at our data using a popular dimensionality reduction technique called t-SNE. Draw a line from the Data Sampler and search for the t-SNE widget. Click and wait for the widget to process the data. We will also add another Data Table at the output of t-SNE. + +Open the t-SNE widget and select a few data points by drawing a rectangle around them. Now open the [Data Table(2)](https://orangedatamining.com/widget-catalog/) to observe how the data on the selected cells are passed to the output of t-SNE. In Orange, most of the widgets are interactive and send out the data upon any change in selection or any change of the widget's parameters. + + +![](sc-notes-01-05_75.jpg) + + +## Loading your own dataset + +The datasets we have worked with in the previous section come from the server. Orange can also read data from spreadsheet file formats, which include tab-separated, comma-separated and Excel files. Let us prepare a toy dataset in Excel and save it on a local disk. + +![](sc-notes-01-06_75.jpg)  + +We can use the [File](https://orangedatamining.com/widget-catalog/data/file/) widget to load this dataset. + + + +Instead of using Excel, we could also use Google Sheets, a free online spreadsheet alternative. Then, instead of finding the file on the local disk, we would enter its URL into the URL entry box of the [File](https://orangedatamining.com/widget-catalog/data/file/) widget. + +![](sc-notes-01-07_75.jpg)  + +Orange has correctly guessed that cell IDs are character strings and that this column in the dataset is special, meant to provide additional information and not to be used for any kind of modeling. All other columns are numeric features except for the type, which is a categorical feature. 
The type is not something we want to include in the expression profile of a cell; we should rather treat it as the cell’s class. Double-click on “feature” in the Role column and change the role of the feature type to “target”. Then click the Apply button. + +![](sc-notes-01-08_75.jpg)  + + +It is always good to check if all the data was read correctly. We can connect our [File](https://orangedatamining.com/widget-catalog/data/file/) widget with the [Data Table](https://orangedatamining.com/widget-catalog/) widget, and double-click on the [Data Table](https://orangedatamining.com/widget-catalog/) to see the data in the spreadsheet format. + +![](sc-notes-01-09_75.jpg)  + + +There is more to input data formatting and loading. We can define the type and kind of the data column, specify that the column is actually a web address of an image, and more. But enough for now. If you would like to dive in deeper, check out the documentation page on [Loading your Data](https://orange3.readthedocs.io/projects/orange-visual-programming/en/latest/loading-your-data/index.html), or one of our [videos](https://www.youtube.com/watch?v=MHcGdQeYCMg&list=PLmNPvQr9Tf-ZSDLwOzxpvY-HrE0yv-8Fy&index=4&ab_channel=OrangeDataMining) on this subject. 
\ No newline at end of file diff --git a/_chapters/single-cell-analysis/02-sc-data/sc-notes-01-00_80.jpg b/_chapters/single-cell-analysis/02-sc-data/sc-notes-01-00_80.jpg new file mode 100644 index 0000000..f4356c7 Binary files /dev/null and b/_chapters/single-cell-analysis/02-sc-data/sc-notes-01-00_80.jpg differ diff --git a/_chapters/single-cell-analysis/02-sc-data/sc-notes-01-01_80.jpg b/_chapters/single-cell-analysis/02-sc-data/sc-notes-01-01_80.jpg new file mode 100644 index 0000000..8406e08 Binary files /dev/null and b/_chapters/single-cell-analysis/02-sc-data/sc-notes-01-01_80.jpg differ diff --git a/_chapters/single-cell-analysis/02-sc-data/sc-notes-01-021_80.jpg b/_chapters/single-cell-analysis/02-sc-data/sc-notes-01-021_80.jpg new file mode 100644 index 0000000..431f1c0 Binary files /dev/null and b/_chapters/single-cell-analysis/02-sc-data/sc-notes-01-021_80.jpg differ diff --git a/_chapters/single-cell-analysis/02-sc-data/sc-notes-01-02_80.jpg b/_chapters/single-cell-analysis/02-sc-data/sc-notes-01-02_80.jpg new file mode 100644 index 0000000..caf543b Binary files /dev/null and b/_chapters/single-cell-analysis/02-sc-data/sc-notes-01-02_80.jpg differ diff --git a/_chapters/single-cell-analysis/02-sc-data/sc-notes-01-03_80.jpg b/_chapters/single-cell-analysis/02-sc-data/sc-notes-01-03_80.jpg new file mode 100644 index 0000000..d051048 Binary files /dev/null and b/_chapters/single-cell-analysis/02-sc-data/sc-notes-01-03_80.jpg differ diff --git a/_chapters/single-cell-analysis/02-sc-data/sc-notes-01-04_80.jpg b/_chapters/single-cell-analysis/02-sc-data/sc-notes-01-04_80.jpg new file mode 100644 index 0000000..d7472e9 Binary files /dev/null and b/_chapters/single-cell-analysis/02-sc-data/sc-notes-01-04_80.jpg differ diff --git a/_chapters/single-cell-analysis/02-sc-data/sc-notes-01-05_75.jpg b/_chapters/single-cell-analysis/02-sc-data/sc-notes-01-05_75.jpg new file mode 100644 index 0000000..c0b8290 Binary files /dev/null and 
b/_chapters/single-cell-analysis/02-sc-data/sc-notes-01-05_75.jpg differ diff --git a/_chapters/single-cell-analysis/02-sc-data/sc-notes-01-06_75.jpg b/_chapters/single-cell-analysis/02-sc-data/sc-notes-01-06_75.jpg new file mode 100644 index 0000000..7a84fa7 Binary files /dev/null and b/_chapters/single-cell-analysis/02-sc-data/sc-notes-01-06_75.jpg differ diff --git a/_chapters/single-cell-analysis/02-sc-data/sc-notes-01-07_75.jpg b/_chapters/single-cell-analysis/02-sc-data/sc-notes-01-07_75.jpg new file mode 100644 index 0000000..fe1b016 Binary files /dev/null and b/_chapters/single-cell-analysis/02-sc-data/sc-notes-01-07_75.jpg differ diff --git a/_chapters/single-cell-analysis/02-sc-data/sc-notes-01-08_75.jpg b/_chapters/single-cell-analysis/02-sc-data/sc-notes-01-08_75.jpg new file mode 100644 index 0000000..d8c12e2 Binary files /dev/null and b/_chapters/single-cell-analysis/02-sc-data/sc-notes-01-08_75.jpg differ diff --git a/_chapters/single-cell-analysis/02-sc-data/sc-notes-01-09_75.jpg b/_chapters/single-cell-analysis/02-sc-data/sc-notes-01-09_75.jpg new file mode 100644 index 0000000..745eba0 Binary files /dev/null and b/_chapters/single-cell-analysis/02-sc-data/sc-notes-01-09_75.jpg differ diff --git a/_chapters/single-cell-analysis/03-visualisation/index.md b/_chapters/single-cell-analysis/03-visualisation/index.md new file mode 100644 index 0000000..50e604d --- /dev/null +++ b/_chapters/single-cell-analysis/03-visualisation/index.md @@ -0,0 +1,39 @@ +--- +title: 'Visualizing Single Cell Landscapes' +--- + +Let us load some single-cell gene expression data and organize the cells in two-dimensional visualizations. We will use the following workflow, and within it, compare two popular data visualization approaches, principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE). 
+ + +   +[Single Cell Datasets](https://orangedatamining.com/widget-catalog/single-cell/single_cell_datasets/) connects to Orange’s data server that contains examples of datasets. You have to be connected to a network for this widget to work correctly. + +![](sc-notes-02-01_75.jpg) + +From a list of examples in [Single Cell Datasets](https://orangedatamining.com/widget-catalog/single-cell/single_cell_datasets/), let us again choose the data on mononuclear cells from bone marrow (Zheng et al., Nat Comm 2017). This data set has already been preprocessed (to some degree) and comes with a selection of 1,000 genes. + + +![](sc-notes-02-02_75.jpg) + + +To pass only the [PCA](https://orangedatamining.com/widget-catalog/unsupervised/PCA/) components to [Scatter Plot](https://orangedatamining.com/widget-catalog/visualize/scatterplot/) try rewiring the connection between the two widgets. + +We pass the data to [PCA](https://orangedatamining.com/widget-catalog/unsupervised/PCA/), which displays a scree diagram, a chart that shows how much of the variance is explained by the first few components. [PCA](https://orangedatamining.com/widget-catalog/unsupervised/PCA/) transforms our data to a new coordinate system defined by principal components, where the components are orthogonal to each other and where the transformation is constructed so that the first component explains most of the variance, the second explains most of the remaining variance, and so on. + +A conceptually very different technique to PCA is [t-SNE](https://orangedatamining.com/widget-catalog/unsupervised/tsne/), which embeds the data into two dimensions so that cells with similar expression stay together. + + +   +The [t-SNE](https://orangedatamining.com/widget-catalog/unsupervised/tsne/) widget does not show axes. In fact, axes in t-SNE make no sense. Why? 
Because the coordinates of the points are not any two features of the original dataset, but a complex non-linear mapping of the original multidimensional data into only two dimensions. + +![](sc-notes-02-03_75.jpg) + + +   +To explore the differences between t-SNE and PCA, have both windows open, select the data in [t-SNE](https://orangedatamining.com/widget-catalog/unsupervised/tsne/), and observe the changes in the [Scatter Plot](https://orangedatamining.com/widget-catalog/visualize/scatterplot/) showing the [PCA](https://orangedatamining.com/widget-catalog/unsupervised/PCA/) projection. If the Orange canvas window is getting in your way, use the "Bring Widgets to the Front" command from the View menu. + +PCA and t-SNE are two popular visualizations of single-cell gene expression data. Their visual depictions are often very different. PCA is a linear transformation that aims to be “more faithful” to the original data, while t-SNE aims to expose the clustering structure and focuses on preserving local similarities. We can compare the layout of the two visualizations by adding a connection from the [t-SNE](https://orangedatamining.com/widget-catalog/unsupervised/tsne/) widget to the [Scatter Plot](https://orangedatamining.com/widget-catalog/visualize/scatterplot/) showing the [PCA](https://orangedatamining.com/widget-catalog/unsupervised/PCA/) projection. With it, a subset of cells selected in [t-SNE](https://orangedatamining.com/widget-catalog/unsupervised/tsne/) will be highlighted in the [PCA](https://orangedatamining.com/widget-catalog/unsupervised/PCA/) plot. 
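
The statement made earlier in this chapter, that the first principal component explains most of the variance, can be checked on a toy example outside Orange. The following is a minimal sketch, not what the PCA widget runs internally: it assumes only two features per cell (so the eigenvalues of the covariance matrix have a closed form), and both the function name `explained_variance_2d` and the data points are made up for illustration.

```python
import math

def explained_variance_2d(points):
    """Return the share of variance explained by the two principal
    components of 2-D data, using the closed-form eigenvalues of the
    2x2 sample covariance matrix."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Entries of the sample covariance matrix.
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    # Eigenvalues of [[sxx, sxy], [sxy, syy]] are the component variances.
    trace = sxx + syy
    det = sxx * syy - sxy ** 2
    disc = math.sqrt(max(trace * trace / 4 - det, 0.0))
    lam1 = trace / 2 + disc
    lam2 = trace / 2 - disc
    return lam1 / trace, lam2 / trace

# Hypothetical expression of two strongly correlated genes in five cells.
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
first, second = explained_variance_2d(points)
print(f"PC1: {first:.3f}, PC2: {second:.3f}")
```

Because the two toy genes are almost perfectly correlated, nearly all of the variance falls on the first component, which is the pattern a scree diagram makes visible for real data with many more components.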
+ + +![](sc-notes-02-06_75.jpg) + diff --git a/_chapters/single-cell-analysis/03-visualisation/sc-notes-02-01_75.jpg b/_chapters/single-cell-analysis/03-visualisation/sc-notes-02-01_75.jpg new file mode 100644 index 0000000..d7f4824 Binary files /dev/null and b/_chapters/single-cell-analysis/03-visualisation/sc-notes-02-01_75.jpg differ diff --git a/_chapters/single-cell-analysis/03-visualisation/sc-notes-02-02_75.jpg b/_chapters/single-cell-analysis/03-visualisation/sc-notes-02-02_75.jpg new file mode 100644 index 0000000..f22ee21 Binary files /dev/null and b/_chapters/single-cell-analysis/03-visualisation/sc-notes-02-02_75.jpg differ diff --git a/_chapters/single-cell-analysis/03-visualisation/sc-notes-02-03_75.jpg b/_chapters/single-cell-analysis/03-visualisation/sc-notes-02-03_75.jpg new file mode 100644 index 0000000..b79864f Binary files /dev/null and b/_chapters/single-cell-analysis/03-visualisation/sc-notes-02-03_75.jpg differ diff --git a/_chapters/single-cell-analysis/03-visualisation/sc-notes-02-04_75.jpg b/_chapters/single-cell-analysis/03-visualisation/sc-notes-02-04_75.jpg new file mode 100644 index 0000000..dda41b5 Binary files /dev/null and b/_chapters/single-cell-analysis/03-visualisation/sc-notes-02-04_75.jpg differ diff --git a/_chapters/single-cell-analysis/03-visualisation/sc-notes-02-05_75.jpg b/_chapters/single-cell-analysis/03-visualisation/sc-notes-02-05_75.jpg new file mode 100644 index 0000000..a6defba Binary files /dev/null and b/_chapters/single-cell-analysis/03-visualisation/sc-notes-02-05_75.jpg differ diff --git a/_chapters/single-cell-analysis/03-visualisation/sc-notes-02-06_75.jpg b/_chapters/single-cell-analysis/03-visualisation/sc-notes-02-06_75.jpg new file mode 100644 index 0000000..ce9c615 Binary files /dev/null and b/_chapters/single-cell-analysis/03-visualisation/sc-notes-02-06_75.jpg differ diff --git a/_chapters/single-cell-analysis/04-preprocessing/index.md b/_chapters/single-cell-analysis/04-preprocessing/index.md new 
file mode 100644 index 0000000..42a6b90 --- /dev/null +++ b/_chapters/single-cell-analysis/04-preprocessing/index.md @@ -0,0 +1,49 @@ +--- +title: 'Data Filtering and Preprocessing' + +--- +Single-cell datasets can have a lot of technical variability issues. Each cell will generally capture a varying number of reads. This will cause some cells to have too low of a signal to be useful. Additionally, genes range from ever-active housekeeping genes to specialized genes that are only expressed in particular cell types or under certain conditions. Employing filtering techniques and preprocessing steps becomes crucial to prepare the data for subsequent analyses. + +## Filtering or quality control + +To illustrate the process of filtering out low-quality data points and features, also called quality control, we'll use a dataset of pancreas cells from a human donor. We begin by loading the data using the [Single Cell Datasets](https://orangedatamining.com/widget-catalog/single-cell/single_cell_datasets/) widget. Once loaded, we examine its structure and individual cell features via the Data Table widget. + + +![](sc-notes-03-01_75.jpg) + +This dataset comprises over 8000 cells and more than 20,000 genes! + +Genes range from ever-active housekeeping genes to specialized genes only expressed in particular cell types or under certain conditions. Usually, we want to filter out both of these extremes - we can do so using the [Filter](https://orangedatamining.com/widget-catalog/single-cell/filter/) widget. + +You can filter out cells by gene counts or genes by cell counts. Additionally, you can choose whether to filter by detection counts or total counts. For instance, filtering cells by gene detection count will use only the number of expressed genes in a cell as the filtering criterion, whereas filtering them by total count will use all the transcripts (the sum of the expression values) in a cell. 
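
To make the difference between detection counts and total counts concrete, here is a minimal sketch in plain Python. It is not the Filter widget's implementation; the toy count matrix and the helper names `detection_count` and `total_count` are assumptions made for illustration only.

```python
# Toy count matrix (hypothetical values): rows are cells, columns are genes.
counts = [
    [0, 3, 0, 12],
    [5, 0, 0, 1],
    [0, 0, 0, 0],
]

def detection_count(values):
    """Number of non-zero entries, i.e. how many genes (or cells) were detected."""
    return sum(1 for v in values if v > 0)

def total_count(values):
    """Sum of all counts, i.e. the total number of transcripts."""
    return sum(values)

# Per-cell statistics: computed over each row.
cell_detection = [detection_count(row) for row in counts]
cell_total = [total_count(row) for row in counts]

# Per-gene statistics: computed over each column (transpose the matrix).
genes = list(zip(*counts))
gene_detection = [detection_count(col) for col in genes]
gene_total = [total_count(col) for col in genes]

print(cell_detection, cell_total)
print(gene_detection, gene_total)
```

In this toy matrix the first two cells both have two detected genes but very different totals (15 versus 6 transcripts), which is why the two criteria can keep or discard different cells.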
+ +Since we want to first filter out genes, we select Genes as the filter type and further select to filter by the detection count of each gene. + + +![](sc-notes-03-02_75.jpg) + +Each dot on the plot on the right now represents one gene. The y-axis marks the number of cells the gene has been detected in. We can choose to log scale the data for a better visualization. There are quite a lot of genes expressed in fewer than ten cells and quite a lot of housekeeping genes that are expressed in a vast number of cells. We can select which genes to keep by dragging the upper and lower thresholds on the plot. Alternatively, we can simply type in the minimum and maximum number of cells a gene must be detected in to be kept. Let's retain genes that have been detected in at least 20 and at most 3000 cells. This has reduced the number of genes by more than a quarter. + +Alternatively, we could use the Dropout Gene Selection widget to filter out uninformative genes. This widget implements a method proposed in a 2018 paper that selects genes based on the interplay of mean expression across the cells and the frequency of dropouts, that is, the proportion of cells where the gene was not detected. Any gene that has a high dropout rate and high mean expression could potentially be a marker of some particular subpopulation of cells. Dragging the threshold changes how many genes are filtered out. + + +![](sc-notes-03-03_75.jpg) + +Apart from filtering out non-informative genes, we might want to filter out whole cells. Each cell will generally capture a varying number of reads. This will cause some cells to have too low of a signal to be useful. Specifically, cells with few detected genes may suffer from damage or poor technical processing and thus provide less valuable information. But cells that express a very large number of genes or just contain a very high amount of expressed material are usually also not very informative. 
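
The gene filter described above (keep genes detected in at least 20 and at most 3000 cells) boils down to a threshold on the per-gene detection count. The sketch below shows that logic on a toy matrix with much smaller thresholds; the function `filter_genes`, the gene names and the counts are hypothetical and are not taken from the Filter widget's code.

```python
def filter_genes(counts, gene_names, min_cells, max_cells):
    """Keep only genes detected (count > 0) in at least `min_cells`
    and at most `max_cells` cells. Rows of `counts` are cells."""
    n_genes = len(gene_names)
    # For every gene (column), count the cells in which it was detected.
    detected_in = [
        sum(1 for row in counts if row[j] > 0) for j in range(n_genes)
    ]
    keep = [j for j in range(n_genes) if min_cells <= detected_in[j] <= max_cells]
    filtered = [[row[j] for j in keep] for row in counts]
    return filtered, [gene_names[j] for j in keep]

# Toy data: four cells, four hypothetical genes.
counts = [
    [1, 0, 2, 0],
    [3, 0, 1, 0],
    [2, 0, 0, 4],
    [5, 0, 0, 0],
]
gene_names = ["G1", "G2", "G3", "G4"]

# G1 is detected in every cell (housekeeping-like), G2 in none, G4 in one.
filtered, kept = filter_genes(counts, gene_names, min_cells=2, max_cells=3)
print(kept)
```

Filtering cells works the same way with rows and columns swapped: the detection count is then the number of genes detected in each cell.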
+ +We can stack [Filter](https://orangedatamining.com/widget-catalog/single-cell/filter/) widgets one after another. Since we want to filter out cells, we choose Cells as the filter type. There are again two options. We can either filter cells by the number of detected genes or by the total count of all transcripts. Let's select Detection count. Each point on the plot on the right now represents a cell, and the y-axis marks the number of expressed genes in those cells. Again, we can drag the thresholds on the plot or simply type the desired minimum and maximum thresholds on the left side of the widget. Let's filter out cells that have fewer than 400 or more than 2400 expressed genes. + + +![](sc-notes-03-04_75.jpg) + +## Preprocessing + +After filtering out non-informative features (genes) and data samples (cells), we can proceed to preprocessing the expression values themselves. In Orange, this can be done through the [Single Cell Preprocess](https://orangedatamining.com/widget-catalog/single-cell/single_cell_preprocess/) widget by specifying an ordered list of preprocessing and data transformation steps. By default, the widget shows some standard preprocessing steps. Let's first remove these default steps and start from a blank pane. + +One of the most common preprocessing steps is to transform the expression values so that they are comparable across cells. This process is called normalization. + +Let us first normalize the gene expression of each cell so that the gene expressions for each cell sum to the same number. Choose the "Count per Million" normalization. This normalization method scales the expression values based on the total number of reads in each cell and converts them into counts per million. Additionally, it's common to apply a logarithmic transformation after normalization to achieve a more symmetric distribution and to better handle extreme values. 
We simply drag and drop the Logarithmic Scale preprocessor from the list on the left to the right, just after the normalization step. We select to scale the data with the natural logarithm. You can choose from additional preprocessing steps from the list on the left. + +![](sc-notes-03-05_75.jpg) + +Our data is now ready for further analysis! diff --git a/_chapters/single-cell-analysis/04-preprocessing/sc-notes-03-01_75.jpg b/_chapters/single-cell-analysis/04-preprocessing/sc-notes-03-01_75.jpg new file mode 100644 index 0000000..69b81ee Binary files /dev/null and b/_chapters/single-cell-analysis/04-preprocessing/sc-notes-03-01_75.jpg differ diff --git a/_chapters/single-cell-analysis/04-preprocessing/sc-notes-03-02_75.jpg b/_chapters/single-cell-analysis/04-preprocessing/sc-notes-03-02_75.jpg new file mode 100644 index 0000000..f3f6844 Binary files /dev/null and b/_chapters/single-cell-analysis/04-preprocessing/sc-notes-03-02_75.jpg differ diff --git a/_chapters/single-cell-analysis/04-preprocessing/sc-notes-03-03_75.jpg b/_chapters/single-cell-analysis/04-preprocessing/sc-notes-03-03_75.jpg new file mode 100644 index 0000000..7f6d9b8 Binary files /dev/null and b/_chapters/single-cell-analysis/04-preprocessing/sc-notes-03-03_75.jpg differ diff --git a/_chapters/single-cell-analysis/04-preprocessing/sc-notes-03-04_75.jpg b/_chapters/single-cell-analysis/04-preprocessing/sc-notes-03-04_75.jpg new file mode 100644 index 0000000..49dfdbd Binary files /dev/null and b/_chapters/single-cell-analysis/04-preprocessing/sc-notes-03-04_75.jpg differ diff --git a/_chapters/single-cell-analysis/04-preprocessing/sc-notes-03-05_75.jpg b/_chapters/single-cell-analysis/04-preprocessing/sc-notes-03-05_75.jpg new file mode 100644 index 0000000..82d9005 Binary files /dev/null and b/_chapters/single-cell-analysis/04-preprocessing/sc-notes-03-05_75.jpg differ diff --git a/_chapters/single-cell-analysis/05-batch-effects/index.md b/_chapters/single-cell-analysis/05-batch-effects/index.md 
new file mode 100644 index 0000000..21eaff9 --- /dev/null +++ b/_chapters/single-cell-analysis/05-batch-effects/index.md @@ -0,0 +1,53 @@ +--- +title: 'Batch Effects Correction' +--- + +## Batch Effects Correction + + + + +In single cell analysis, we often deal with data from several different sources, be it from a different provider, different experiments or simply different batches. A “batch” refers to an individual group of samples that are processed differently relative to other samples. These differences in processing can introduce variation into the obtained data. The technical, non-biological factors that affect variation in batches are referred to as batch effects. + +Batch effects are problematic because they hinder our ability to measure true biological variation between samples, which is what we are interested in. Luckily, they can be dealt with computationally by aligning data from different batches. + +There are different approaches to align data sets and remove batch effects. Orange currently implements three: one through the [Batch Effect Removal](https://orangedatamining.com/widget-catalog/single-cell/batch_effect_removal/) widget, the second, more standard, using canonical correlation implemented in the widget [Align Datasets](https://orangedatamining.com/widget-catalog/single-cell/align_datasets/) and the third, most recently added, in the Harmony widget. + +We will consider human pancreas cell data from two separate research studies: a sample from a study by [Baron et al.](https://pubmed.ncbi.nlm.nih.gov/27667365/) and the other a sample from a study by [Xin et al.](https://pubmed.ncbi.nlm.nih.gov/27667665/). We want to merge our two sample datasets, coming from distinct batches, and plot them together. But before we do that, let's apply some preprocessing steps. + +For the dataset from [Baron et al.](https://pubmed.ncbi.nlm.nih.gov/27667365/) we have used the same preprocessing workflow as introduced in the previous chapter. 
To quickly recap, we start by filtering out some non-informative genes, then proceed to filter the cells by detection count, and then normalize the samples so that the gene expressions represent counts per million. We also apply a logarithmic transformation after normalization to achieve a more symmetric distribution. + +Let's apply some basic preprocessing steps to the second dataset as well. We can add another Single Cell Preprocess widget onto the canvas and pass it the loaded data. We normalize using "Counts per Million" and apply a logarithmic transformation (natural logarithm). + + +![](sc-notes-04-01_75.jpg) + +Now we can concatenate the two data sets by sending both to the [Concatenate](https://orangedatamining.com/widget-catalog/transform/concatenate/) widget. Here, we select the option to keep only the variables that appear in both tables and append the data source ID as a new class attribute. When we open the concatenated dataset in a table, we now see a new column called Source ID. + + +![](sc-notes-04-02_75.jpg) + +We now plot the concatenated data with t-SNE. Set the color to represent the cell class. Clearly, we got some nice clusters, but some of the same cell types were split into separate clusters. We would expect cells of a specific type to have similar expression profiles and therefore cluster together. If we set the shape of the plotted cells to represent the Source ID, we can see that the two datasets are clearly separated: the data from Baron et al., marked by circles, appears on the left, and the data from Xin et al., marked with crosses, on the right. This clearly indicates the presence of batch effects that we want to remove. + + +![](sc-notes-04-03_75.jpg) + +In our exploration of data integration methods, we'll first look at a technique that uses canonical correlation, implemented by the Align Datasets widget. This approach aims to optimize the transformation of data to improve correlations between expression vectors from two datasets. 
Let's navigate to the Align Datasets widget and examine its visualization of the correlation. First, we'll leave the parameters at their default settings and revisit the t-SNE plot. At first glance, it appears that the correction has yielded promising results, with similar cell types clustering together as expected. Furthermore, when we use shapes to represent the source of the data, we notice a closer proximity between cells from different sources. + + +![](sc-notes-04-04_75.jpg) + +Next, let's revisit the Align Datasets widget to fine-tune the parameters for potentially improved clustering. A common strategy is to start with a reduced number of components. Orange seamlessly propagates the transformed data to t-SNE, where we see an updated plot. The alignment between the two datasets appears significantly improved. + +Let's explore the second data alignment method implemented by the Batch Effect Removal widget. We add this widget to our canvas and feed it the combined data. When opening the widget we have to set the distinguishing feature for different batches; in our case, this is the Source ID. We'll also leave the "Skip zero expressions" option unchecked. After applying this correction and visualizing the results in t-SNE, with colors representing cell classes and shapes representing data sources, we observe an interesting result. While some clusters of identical cell types from different sources merge into cohesive units, others remain distinct. It appears that the Align Datasets widget outperformed the Batch Effect Removal method in this case. + + +![](sc-notes-04-05_75.jpg) + + +Let us now try Harmony, the third method available in Orange. Harmony has achieved good results in the [Batch integration benchmark study](https://openproblems.bio/benchmarks/batch_integration?version=v2.0.0), making it a robust general-purpose method for integrating single-cell datasets across batches. 
As before, we pass the concatenated data to the Harmony widget and specify Source ID as the batch-defining variable, leaving the remaining parameters at their default values for now. We then visualize the transformed data using t-SNE. The resulting plot shows that cells of the same type remain clustered together. What about the batch effect correction? To assess batch mixing, we set the Shape parameter to indicate the data source; this reveals that cells from different batches are now well mixed rather than forming separate clusters. Thus by using Harmony we have effectively reduced batch effects while preserving biologically meaningful structure for downstream analysis. + + +Among the three main parameters in Harmony (sigma, theta, and lambda), it is often useful to adjust theta, which controls the strength of batch correction: lower values result in weaker batch correction, whereas higher values enforce stronger mixing between batches. + +![](sc-notes-04-06_75.jpg) \ No newline at end of file diff --git a/_chapters/single-cell-analysis/05-batch-effects/sc-notes-04-01_75.jpg b/_chapters/single-cell-analysis/05-batch-effects/sc-notes-04-01_75.jpg new file mode 100644 index 0000000..23e348f Binary files /dev/null and b/_chapters/single-cell-analysis/05-batch-effects/sc-notes-04-01_75.jpg differ diff --git a/_chapters/single-cell-analysis/05-batch-effects/sc-notes-04-02_75.jpg b/_chapters/single-cell-analysis/05-batch-effects/sc-notes-04-02_75.jpg new file mode 100644 index 0000000..1d91af6 Binary files /dev/null and b/_chapters/single-cell-analysis/05-batch-effects/sc-notes-04-02_75.jpg differ diff --git a/_chapters/single-cell-analysis/05-batch-effects/sc-notes-04-03_75.jpg b/_chapters/single-cell-analysis/05-batch-effects/sc-notes-04-03_75.jpg new file mode 100644 index 0000000..a1cfba9 Binary files /dev/null and b/_chapters/single-cell-analysis/05-batch-effects/sc-notes-04-03_75.jpg differ diff --git 
a/_chapters/single-cell-analysis/05-batch-effects/sc-notes-04-04_75.jpg b/_chapters/single-cell-analysis/05-batch-effects/sc-notes-04-04_75.jpg new file mode 100644 index 0000000..183983d Binary files /dev/null and b/_chapters/single-cell-analysis/05-batch-effects/sc-notes-04-04_75.jpg differ diff --git a/_chapters/single-cell-analysis/05-batch-effects/sc-notes-04-05_75.jpg b/_chapters/single-cell-analysis/05-batch-effects/sc-notes-04-05_75.jpg new file mode 100644 index 0000000..9131a05 Binary files /dev/null and b/_chapters/single-cell-analysis/05-batch-effects/sc-notes-04-05_75.jpg differ diff --git a/_chapters/single-cell-analysis/05-batch-effects/sc-notes-04-06_75.jpg b/_chapters/single-cell-analysis/05-batch-effects/sc-notes-04-06_75.jpg new file mode 100644 index 0000000..b70c16a Binary files /dev/null and b/_chapters/single-cell-analysis/05-batch-effects/sc-notes-04-06_75.jpg differ diff --git a/_chapters/single-cell-analysis/06-marker-genes/index.md b/_chapters/single-cell-analysis/06-marker-genes/index.md new file mode 100644 index 0000000..08f5e9c --- /dev/null +++ b/_chapters/single-cell-analysis/06-marker-genes/index.md @@ -0,0 +1,31 @@ +--- +title: 'Cell Types and Marker Genes' +--- + +## Identifying cell types + +Gene markers are one of the standard methods for discovering cell populations, determining cell states (e.g. cell cycle) and more. Here, we will continue to use a sample of the data on bone marrow mononuclear cells from previous lessons. We will score the cells according to gene markers, and observe the scored cells in the t-SNE visualization. We will assume, as is often the case, that marker genes are keyed in from some publication. We will only later resort to a list of marker genes from a public database. + + +![](sc-notes-06-01_75.jpg) + +So, we start with the marker genes. We picked a few that are contained in our data. Here they are, entered in Excel. 
+ +For Excel, we can save this list in, say, marker-genes.xlsx, and load it with the File widget. Notice that we have used only the names of the genes, and not any official codes. Orange deals with genes through NCBI’s IDs, and we add them using the Genes widget. This widget will also tell us if the names of the genes have been resolved correctly. + + +![](sc-notes-06-02_75.jpg) + + + +![](sc-notes-06-03_75.jpg) Score Cells adds a column Score to the input data table. + +The output of the widget is a table that includes a gene name and cell type, both as specified in the input file, and Entrez ID. + +The idea now is to select the gene(s) from the data table, and then score the cells according to the mean expression of the selected genes. The Score Cells widget assigns a numerical score to each cell that is proportional to the average expression of the marker genes at the input of the widget. The score is added as a meta attribute to the cell data on the output of Score Cells. Check this using the Data Table! We can now feed this data into t-SNE and set the color and size of the points to the cell score. + +Notice that with any change in the selection of marker genes, we find a group of cells in the t-SNE plot where these genes are expressed. It looks like T cells are in the bottom right cluster, B cells somewhere in the middle, and erythrocytes in the left cluster. Did we say cluster? 
Oh, we are not there yet… + + +![](sc-notes-06-04_75.jpg) + diff --git a/_chapters/single-cell-analysis/06-marker-genes/sc-notes-06-01_75.jpg b/_chapters/single-cell-analysis/06-marker-genes/sc-notes-06-01_75.jpg new file mode 100644 index 0000000..22867b5 Binary files /dev/null and b/_chapters/single-cell-analysis/06-marker-genes/sc-notes-06-01_75.jpg differ diff --git a/_chapters/single-cell-analysis/06-marker-genes/sc-notes-06-02_75.jpg b/_chapters/single-cell-analysis/06-marker-genes/sc-notes-06-02_75.jpg new file mode 100644 index 0000000..97bf20d Binary files /dev/null and b/_chapters/single-cell-analysis/06-marker-genes/sc-notes-06-02_75.jpg differ diff --git a/_chapters/single-cell-analysis/06-marker-genes/sc-notes-06-03_75.jpg b/_chapters/single-cell-analysis/06-marker-genes/sc-notes-06-03_75.jpg new file mode 100644 index 0000000..7c14c9b Binary files /dev/null and b/_chapters/single-cell-analysis/06-marker-genes/sc-notes-06-03_75.jpg differ diff --git a/_chapters/single-cell-analysis/06-marker-genes/sc-notes-06-04_75.jpg b/_chapters/single-cell-analysis/06-marker-genes/sc-notes-06-04_75.jpg new file mode 100644 index 0000000..de38235 Binary files /dev/null and b/_chapters/single-cell-analysis/06-marker-genes/sc-notes-06-04_75.jpg differ diff --git a/_chapters/single-cell-analysis/07-clusters/index.md b/_chapters/single-cell-analysis/07-clusters/index.md new file mode 100644 index 0000000..4b2ace1 --- /dev/null +++ b/_chapters/single-cell-analysis/07-clusters/index.md @@ -0,0 +1,32 @@ +--- +title: 'Cell Clustering' +--- + + +We will again use a sample of the Bone marrow mononuclear cells with AML dataset, with 1000 cells and the 1000 most variable genes. + +A typical task with single-cell data is to find clusters of cells. An advanced method that does this is Louvain clustering. Given the expression matrix, Louvain clustering creates a network of cells based on pairwise distances. 
Then, it searches for local communities, that is, parts of the network that are more strongly interconnected than expected by chance (think of friendships on social networks). Each local community is a cluster, and cells are assigned cluster labels accordingly. + + +![](sc-notes-07-01_75.jpg) +Louvain Clustering has a number of parameters. Here, we will stay with defaults, but you can experiment, change them and see the effect in the t-SNE visualization. + +In Orange, Louvain clustering is in its own widget that appends a column of cluster labels to the cell data. Quite neatly, the number of clusters is determined automatically. Let us construct a workflow that displays the results of the clustering in the t-SNE plot and that examines the frequency of the cells in each of the clusters. Let us observe the t-SNE plot first. + +![](sc-notes-07-02_75.jpg) + +We have colored the points (cells) in t-SNE according to the cluster membership. Notice a nice separation of the clusters in the t-SNE plot. It looks like the cells are also well-separated in the original space of features (genes). A common mistake would be to compute the data projection first and then cluster the projected points. Obviously, then, the clusters would be separated perfectly and there would be no overlap. + +![](sc-notes-07-03_75.jpg) + +We can now use the Distributions widget to observe the frequency of the cells within each cluster. + +![](sc-notes-07-04_75.jpg) + +Nice, the clusters are well represented and there is no need for any filtering at this stage. Notice also that some of the clusters mix healthy and diseased cells, and while this could be interesting, we will refrain from exploring this aspect in this workshop. + + +Most visualizations in Orange are interactive. In Distributions, you can click on a bar to select the associated data. Try connecting Distributions to the t-SNE widget to explore where each cluster lies in the embedding space. 
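
The community idea behind Louvain clustering can be illustrated outside Orange with a small Python sketch. It is not the widget's implementation: the k-nearest-neighbour graph, the toy 2D points and the helper names `knn_graph` and `modularity` are assumptions made for illustration. The sketch only computes modularity, the quantity that Louvain clustering tries to maximize, for a good and a bad assignment of cluster labels.

```python
import math


def knn_graph(points, k):
    """Build an undirected k-nearest-neighbour graph (assumed graph type).

    Returns a set of edges (i, j) with i < j.
    """
    edges = set()
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        for _, j in dists[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges


def modularity(edges, labels):
    """Modularity Q = sum over communities of L_c / m - (d_c / (2m))^2,

    where m is the number of edges, L_c the number of edges inside
    community c and d_c the sum of node degrees in community c.
    """
    m = len(edges)
    degree = {}
    inside = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
        if labels[a] == labels[b]:
            inside[labels[a]] = inside.get(labels[a], 0) + 1
    total = {}
    for node, d in degree.items():
        total[labels[node]] = total.get(labels[node], 0) + d
    return sum(inside.get(c, 0) / m - (total[c] / (2 * m)) ** 2 for c in total)


# Toy "cells": two well-separated groups of three points each.
cells = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
graph = knn_graph(cells, k=2)

good = modularity(graph, [0, 0, 0, 1, 1, 1])  # labels follow the two groups
bad = modularity(graph, [0, 1, 0, 1, 0, 1])   # labels ignore the structure
print(good, bad)
```

Labels that follow the densely connected groups give a higher modularity than labels that cut across them, which is why a method that maximizes modularity does not need the number of clusters specified in advance.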
+ +![](sc-notes-07-05_75.jpg) + diff --git a/_chapters/single-cell-analysis/07-clusters/sc-notes-07-01_75.jpg b/_chapters/single-cell-analysis/07-clusters/sc-notes-07-01_75.jpg new file mode 100644 index 0000000..4e1af9d Binary files /dev/null and b/_chapters/single-cell-analysis/07-clusters/sc-notes-07-01_75.jpg differ diff --git a/_chapters/single-cell-analysis/07-clusters/sc-notes-07-02_75.jpg b/_chapters/single-cell-analysis/07-clusters/sc-notes-07-02_75.jpg new file mode 100644 index 0000000..dbb9236 Binary files /dev/null and b/_chapters/single-cell-analysis/07-clusters/sc-notes-07-02_75.jpg differ diff --git a/_chapters/single-cell-analysis/07-clusters/sc-notes-07-03_75.jpg b/_chapters/single-cell-analysis/07-clusters/sc-notes-07-03_75.jpg new file mode 100644 index 0000000..942e305 Binary files /dev/null and b/_chapters/single-cell-analysis/07-clusters/sc-notes-07-03_75.jpg differ diff --git a/_chapters/single-cell-analysis/07-clusters/sc-notes-07-04_75.jpg b/_chapters/single-cell-analysis/07-clusters/sc-notes-07-04_75.jpg new file mode 100644 index 0000000..a5de99b Binary files /dev/null and b/_chapters/single-cell-analysis/07-clusters/sc-notes-07-04_75.jpg differ diff --git a/_chapters/single-cell-analysis/07-clusters/sc-notes-07-05_75.jpg b/_chapters/single-cell-analysis/07-clusters/sc-notes-07-05_75.jpg new file mode 100644 index 0000000..36d2890 Binary files /dev/null and b/_chapters/single-cell-analysis/07-clusters/sc-notes-07-05_75.jpg differ diff --git a/_chapters/single-cell-analysis/08-cluster-exploration/index.md b/_chapters/single-cell-analysis/08-cluster-exploration/index.md new file mode 100644 index 0000000..61eb23c --- /dev/null +++ b/_chapters/single-cell-analysis/08-cluster-exploration/index.md @@ -0,0 +1,45 @@ +--- +title: 'Cluster Exploration and Discovery of Markers' +--- + +Here, we aim to discover “new” marker genes for B cells. We use quotes, of course, because it is likely that all markers for these cell types are already known. 
All we can do here is to rediscover some of them. The workflow we will use is our most complex one so far. + + +![](sc-notes-08-01_75.jpg) + + +![](sc-notes-08-02_75.jpg) + +We are already familiar with the part up to t-SNE. In Data Table, we select one marker we have for B cells, CD19. Then, in t-SNE, we select a subset of cells of the red cluster in the center of the graph. + +![](sc-notes-08-03_75.jpg) + +We want to find genes that are expressed in the selected cells, but not expressed in all the other cells. We thus need to get all the data out from t-SNE, not just the selection, and have a column that tells us if the cells were included in the selection. Whereas the default output of t-SNE is “Selected Data,” the output called Data of the t-SNE widget has all the data required, and we should rewire the connection. + + +Double click on the connection between t-SNE and the Differential Expression widget and, instead of Selected Data, connect the Data channel of the t-SNE. + +![](sc-notes-08-04_75.jpg) + +Differential Expression shows the distribution of the differences of gene expression in selected and all other cells. We have to set this widget properly. Set the scoring method to T-test, the Label to Selected, and mark Yes as our target value. Differential Expression can also compare the observed distribution of changes to the null distribution (the thin grey line), where data cells in each row are randomly permuted. Click on Compute null distribution to switch on the visualization of the null distribution. + + +The grey line shows the null distribution, that is, the distribution of gene scores under an arbitrary selection of target cells. + +![](sc-notes-08-05_75.jpg) + +The Differential Expression widget outputs the data with genes that are in the extremes of the distribution. That is, those for which the difference between selected and non-selected cells is the largest. 
The most differentially expressed genes lie to the left and to the right of the two vertical splitters, and their scores fall in the shaded part of the distribution. Move the two vertical splitters such that there are only about 60 selected genes which are highly expressed in the selected group. + +So, where are the genes that are selected in the Differential Expression widget? In the output of the widget. We can observe the output dataset and analyze the set of selected genes with widgets that we connected to Differential Expression. Observe the data in the Data Table, a list of selected genes in Gene Info and the results of analysis of Gene Ontology term enrichment in GO Browser. From all these choices, our workflow shows only the GO Browser, but you are welcome to explore other widgets as well. + + +![](sc-notes-08-06_75.jpg) + +In GO Browser, we find that our differentially expressed genes are characteristically present in several Gene Ontology terms. That is, several GO terms are enriched with our selection of genes. + +Interestingly, one of the most enriched terms is immune system process, with 34 genes from our differential expression set. Selecting this term, we get the data on these genes on the output of GO Browser, and we can check them out in the Genes widget. + + +![](sc-notes-08-07_75.jpg) + +In the list of genes, there are also CD22 and CD38. “CD” stands for cluster of differentiation. Googling it, we find that both CD22 and CD38 are markers for B-cells. Oh, what a rediscovery! 
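
For readers who prefer to see the scoring idea in code, here is a minimal Python sketch of the approach described in this chapter: score each gene by a t statistic between selected and non-selected cells, then shuffle the selection labels to get scores under the null hypothesis. It is an illustration under stated assumptions, not the Differential Expression widget's implementation. The gene names, the toy values, the helper names and the use of Welch's variant of the t statistic are all assumptions.

```python
import math
import random
from statistics import mean, variance


def t_score(selected, others):
    """Welch's t statistic between two groups of expression values."""
    se = math.sqrt(variance(selected) / len(selected) + variance(others) / len(others))
    return 0.0 if se == 0 else (mean(selected) - mean(others)) / se


def score_genes(expr, is_selected):
    """Score every gene; expr maps a gene name to one value per cell."""
    scores = {}
    for gene, values in expr.items():
        sel = [v for v, s in zip(values, is_selected) if s]
        rest = [v for v, s in zip(values, is_selected) if not s]
        scores[gene] = t_score(sel, rest)
    return scores


def null_scores(expr, is_selected, seed=0):
    """Scores after randomly permuting which cells count as selected."""
    shuffled = list(is_selected)
    random.Random(seed).shuffle(shuffled)
    return score_genes(expr, shuffled)


# Hypothetical toy data: 8 cells, the first 3 are the selected ones.
expr = {
    "GENE_A": [5, 6, 7, 0, 1, 0, 1, 0],  # high only in the selected cells
    "GENE_B": [1, 2, 1, 2, 1, 2, 1, 2],  # similar in both groups
}
is_selected = [True, True, True, False, False, False, False, False]

scores = score_genes(expr, is_selected)
null = null_scores(expr, is_selected)
print(scores)
print(null)
```

Genes whose observed score lies far out in the tails, compared with scores obtained from shuffled labels, are the candidates that the splitters in the widget are meant to isolate.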
\ No newline at end of file diff --git a/_chapters/single-cell-analysis/08-cluster-exploration/sc-notes-08-01_75.jpg b/_chapters/single-cell-analysis/08-cluster-exploration/sc-notes-08-01_75.jpg new file mode 100644 index 0000000..91ffbb1 Binary files /dev/null and b/_chapters/single-cell-analysis/08-cluster-exploration/sc-notes-08-01_75.jpg differ diff --git a/_chapters/single-cell-analysis/08-cluster-exploration/sc-notes-08-02_75.jpg b/_chapters/single-cell-analysis/08-cluster-exploration/sc-notes-08-02_75.jpg new file mode 100644 index 0000000..7409368 Binary files /dev/null and b/_chapters/single-cell-analysis/08-cluster-exploration/sc-notes-08-02_75.jpg differ diff --git a/_chapters/single-cell-analysis/08-cluster-exploration/sc-notes-08-03_75.jpg b/_chapters/single-cell-analysis/08-cluster-exploration/sc-notes-08-03_75.jpg new file mode 100644 index 0000000..bbd9373 Binary files /dev/null and b/_chapters/single-cell-analysis/08-cluster-exploration/sc-notes-08-03_75.jpg differ diff --git a/_chapters/single-cell-analysis/08-cluster-exploration/sc-notes-08-04_75.jpg b/_chapters/single-cell-analysis/08-cluster-exploration/sc-notes-08-04_75.jpg new file mode 100644 index 0000000..8c3a906 Binary files /dev/null and b/_chapters/single-cell-analysis/08-cluster-exploration/sc-notes-08-04_75.jpg differ diff --git a/_chapters/single-cell-analysis/08-cluster-exploration/sc-notes-08-05_75.jpg b/_chapters/single-cell-analysis/08-cluster-exploration/sc-notes-08-05_75.jpg new file mode 100644 index 0000000..7325f03 Binary files /dev/null and b/_chapters/single-cell-analysis/08-cluster-exploration/sc-notes-08-05_75.jpg differ diff --git a/_chapters/single-cell-analysis/08-cluster-exploration/sc-notes-08-06_75.jpg b/_chapters/single-cell-analysis/08-cluster-exploration/sc-notes-08-06_75.jpg new file mode 100644 index 0000000..75df2f8 Binary files /dev/null and b/_chapters/single-cell-analysis/08-cluster-exploration/sc-notes-08-06_75.jpg differ diff --git 
a/_chapters/single-cell-analysis/08-cluster-exploration/sc-notes-08-07_75.jpg b/_chapters/single-cell-analysis/08-cluster-exploration/sc-notes-08-07_75.jpg new file mode 100644 index 0000000..96b7548 Binary files /dev/null and b/_chapters/single-cell-analysis/08-cluster-exploration/sc-notes-08-07_75.jpg differ diff --git a/_chapters/single-cell-analysis/09-visual-annotation/index.md b/_chapters/single-cell-analysis/09-visual-annotation/index.md new file mode 100644 index 0000000..c2fdcc4 --- /dev/null +++ b/_chapters/single-cell-analysis/09-visual-annotation/index.md @@ -0,0 +1,25 @@ +--- +title: 'Visual Annotation of Cell Types' +--- + +There is a neat widget in Orange that can analyze two-dimensional cell maps and annotate the clusters in such visualizations. The widget is called Annotator. On its input, it requires a data set with two features that define the positions of the cells in a two-dimensional embedding, and a set of marker genes with associated cell types. We use the following workflow to get all these. We will again use the Bone marrow mononuclear cells with AML (sample dataset). + + +Some rewiring is required to get this workflow to work right. Notice that t-SNE’s output channel is Data, and that the Marker Genes output goes to the Genes channel of the Annotator. + +![](sc-notes-09-01_75.jpg) + +Marker Genes uses a large collection of markers from one of two open marker gene databases, Panglao or CellMarker. We want to select all the markers from the CellMarker database. The Annotator will select the most expressed genes for each cell with the Mann-Whitney U test and compute the p-value of each cell type for a cell based on the selected statistical test. + + +Click on a marker and use the keyboard shortcut Command+A on Mac (or Ctrl+A on Windows) to select all. Move them to the right window. + + + + +The Annotator has to be set to use t-SNE-x and t-SNE-y to position each cell, but once this is set, the display is cute and perhaps quite relevant. 
+ + +![](sc-notes-09-02_75.jpg) + +It visualizes groups of cells and for each group, it shows few most present cell types. \ No newline at end of file diff --git a/_chapters/single-cell-analysis/09-visual-annotation/orange-marker-genes1.gif b/_chapters/single-cell-analysis/09-visual-annotation/orange-marker-genes1.gif new file mode 100644 index 0000000..4320444 Binary files /dev/null and b/_chapters/single-cell-analysis/09-visual-annotation/orange-marker-genes1.gif differ diff --git a/_chapters/single-cell-analysis/09-visual-annotation/sc-notes-09-01_75.jpg b/_chapters/single-cell-analysis/09-visual-annotation/sc-notes-09-01_75.jpg new file mode 100644 index 0000000..cfb9217 Binary files /dev/null and b/_chapters/single-cell-analysis/09-visual-annotation/sc-notes-09-01_75.jpg differ diff --git a/_chapters/single-cell-analysis/09-visual-annotation/sc-notes-09-02_75.jpg b/_chapters/single-cell-analysis/09-visual-annotation/sc-notes-09-02_75.jpg new file mode 100644 index 0000000..aae9fc5 Binary files /dev/null and b/_chapters/single-cell-analysis/09-visual-annotation/sc-notes-09-02_75.jpg differ diff --git a/_chapters/single-cell-analysis/quiz-01/file-workflow.png b/_chapters/single-cell-analysis/quiz-01/file-workflow.png new file mode 100644 index 0000000..3f08ec9 Binary files /dev/null and b/_chapters/single-cell-analysis/quiz-01/file-workflow.png differ diff --git a/_chapters/single-cell-analysis/quiz-01/index.md b/_chapters/single-cell-analysis/quiz-01/index.md new file mode 100644 index 0000000..e667e77 --- /dev/null +++ b/_chapters/single-cell-analysis/quiz-01/index.md @@ -0,0 +1,184 @@ +--- +title: 'Quiz' +--- + +### Warmup Questions + + answer.toLowerCase() === "expression profile"} + options={["Expression Stamp", "Expression Profile"]} + trials={2} + timeout={10} +/> + + + answer.toLowerCase() === "false"} + options={["True", "False"]} + trials={2} + timeout={10} +/> + + answer.toLowerCase() === "true"} + options={["True", "False"]} + trials={2} + 
timeout={10} +/> + + + +### Download the quiz data + + Throughout the main part of this quiz you will be using a fraction of the data from [Liang et al.'s](https://pubmed.ncbi.nlm.nih.gov/31848347/) single-nuclei transcriptomic study (snRNA-seq) of the human retina. Using your newly acquired knowledge of single-cell analytics you will try to replicate some of their analytical insights. + +Retinal tissue is composed of multiple cell types with distinct functions. The data contains samples from the **macular and peripheral region** of the retina from **a single healthy donor**. Sequence reads have already been aligned to the human genome, and the aligned reads were counted within exons. + +**Download the following data:** +1. The gene expression matrix [sc-quiz-sample1500.tab.gz](http://file.biolab.si/datasets/sc-quiz-sample1500.tab.gz) +2. Marker genes for cell types [sc-quiz-marker-genes.xlsx](http://file.biolab.si/datasets/sc-quiz-marker-genes.xlsx) + + +### Task 1 - Inspecting the expression matrix + +First, use Orange's File widget to load the retinal single cell gene expression data ([sc-quiz-sample1500.tab.gz](http://file.biolab.si/datasets/sc-quiz-sample1500.tab.gz)) into Orange and view it in the Data Table widget. Answer the following questions. + + +![](file-workflow.png) + + answer === "cells"} + options={["Cells", "Genes", "Cell source"]} + neutralOptions={["I don't understand the question."]} + trials={2} + timeout={10}> + + + We can load the dataset using the File widget. If we then open our dataset in the Data Table widget, we can see a metafeature for the Cell ID, indicating that each row represents one cell. Additionally, we can observe that the gene names are used as the column headers. 
+ + + ![](sc-ex1-q1-exp.png) + + + + + + answer === "more than 14000"} + options={["1500", "14000", "More than 14000"]} + neutralOptions={["I don't understand the question."]} + trials={2} + timeout={10}> + + + In the top left corner of the Data Table widget you can find information on the number of instances (1500), in this case cells, as well as the number of features (14191), in this case genes. + + + ![](sc-ex1-q2-exp.png) + + + + + answer === "likely no"} + options={["Likely no", "Likely yes", "There is no way for me to know"]} + neutralOptions={["I don't understand the question."]} + trials={2} + timeout={10}> + + + In raw (unprocessed) single-cell RNA sequencing data, we typically obtain whole numbers representing count data — specifically, the number of times a transcript was detected for a given gene in a given cell. In the case above, the data has most likely not been normalized, since the data values are comprised of either zeros or whole numbers that represent counts. + + + +### Task 2 - Dimensionality reduction and visualisation + + + answer === "to simplify data while keeping important patterns and relationships."} + options={[ + "To remove genes that are not expressed in all cells.", + "To simplify data while keeping important patterns and relationships.", + "To randomly reduce the number of genes in the dataset.", + "To increase the number of dimensions in the dataset for better visualization." 
+ ]} + neutralOptions={["I don't understand the question."]} + trials={2} + timeout={10}> + + + + + + answer === "t-sne is a nonlinear method that highlights local similarities between cells, whereas pca is a linear transformation that captures major sources of variance in gene expression."} + options={["PCA is a nonlinear technique that emphasizes local clustering, while t-SNE preserves global variance in gene expression data.","t-SNE is a nonlinear method that highlights local similarities between cells, whereas PCA is a linear transformation that captures major sources of variance in gene expression.", "Both PCA and t-SNE generate identical visualizations of single-cell gene expression data.", "In single-cell analysis, PCA is preferred over t-SNE when identifying distinct cell clusters because it preserves the local structure of the data."]} + neutralOptions={["I don't understand the question."]} + trials={2} + timeout={10}> + + + + + +Plot the data in a tSNE plot. + + answer === "the tissue from which a cell originates"} + options={["The tissue from which a cell originates", "The donor from which a cell originates", "The sequencing technique used to obtain the data"]} + neutralOptions={["I don't understand the question."]} + trials={2} + timeout={10}> + + The two colors of the data points in the t-SNE plot represent the source of the cells, which in this case corresponds to their tissue origin — either from the macular or peripheral region of the retina. 
+ + + ![](sc-ex1-q4-exp.jpg) + + + + + + + + + + + + diff --git a/_chapters/single-cell-analysis/quiz-01/sc-ex1-q1-exp.png b/_chapters/single-cell-analysis/quiz-01/sc-ex1-q1-exp.png new file mode 100644 index 0000000..6fe65cd Binary files /dev/null and b/_chapters/single-cell-analysis/quiz-01/sc-ex1-q1-exp.png differ diff --git a/_chapters/single-cell-analysis/quiz-01/sc-ex1-q2-exp.png b/_chapters/single-cell-analysis/quiz-01/sc-ex1-q2-exp.png new file mode 100644 index 0000000..e24ed27 Binary files /dev/null and b/_chapters/single-cell-analysis/quiz-01/sc-ex1-q2-exp.png differ diff --git a/_chapters/single-cell-analysis/quiz-01/sc-ex1-q4-exp.jpg b/_chapters/single-cell-analysis/quiz-01/sc-ex1-q4-exp.jpg new file mode 100644 index 0000000..4c05041 Binary files /dev/null and b/_chapters/single-cell-analysis/quiz-01/sc-ex1-q4-exp.jpg differ diff --git a/_chapters/single-cell-analysis/quiz-02/1abc_tile_annot.jpg b/_chapters/single-cell-analysis/quiz-02/1abc_tile_annot.jpg new file mode 100644 index 0000000..21ed045 Binary files /dev/null and b/_chapters/single-cell-analysis/quiz-02/1abc_tile_annot.jpg differ diff --git a/_chapters/single-cell-analysis/quiz-02/2abc_tile_annot.jpg b/_chapters/single-cell-analysis/quiz-02/2abc_tile_annot.jpg new file mode 100644 index 0000000..9607fa4 Binary files /dev/null and b/_chapters/single-cell-analysis/quiz-02/2abc_tile_annot.jpg differ diff --git a/_chapters/single-cell-analysis/quiz-02/3abc_tile_annot.jpg b/_chapters/single-cell-analysis/quiz-02/3abc_tile_annot.jpg new file mode 100644 index 0000000..716beb3 Binary files /dev/null and b/_chapters/single-cell-analysis/quiz-02/3abc_tile_annot.jpg differ diff --git a/_chapters/single-cell-analysis/quiz-02/index.md b/_chapters/single-cell-analysis/quiz-02/index.md new file mode 100644 index 0000000..e8703ce --- /dev/null +++ b/_chapters/single-cell-analysis/quiz-02/index.md @@ -0,0 +1,154 @@ +--- +title: 'Quiz' +--- + +### Task 1 - Quality control + +Perform quality control on the 
[expression matrix](http://file.biolab.si/datasets/sc-quiz-sample1500.tab.gz) of the retinal dataset from the study by Liang et al. (2019). + +a) Discard genes that were not detected in at least 1% of all cells. + + answer === "b"} + options={["A", "B", "C"]} + neutralOptions={["I don't understand the question."]} + trials={2} + timeout={10}> + + Since we want to filter out genes, we have to set the Filter Type to Genes. Furthermore, since we are interested only in whether or not a gene was detected, and not in the quantity of its expression, we choose to filter by Detection count. + + + + +![](1abc_tile_annot.jpg) + + + answer === "to remove uninformative genes that are never expressed or expressed in too few cells."} + options={[ + "To remove uninformative genes that are never expressed or expressed in too few cells.", + "To ensure that every gene is expressed in at least half of the cells.", + "To reduce the number of samples in the dataset.", + "To improve batch effect correction." + ]} + neutralOptions={["I don't understand the question."]} + trials={2} + timeout={10}> + + + +b) Filter cells based on a minimum number of 500 and a maximum number of 3000 expressed genes per cell. + + answer === "c"} + options={["A", "B", "C"]} + neutralOptions={["I don't understand the question."]} + trials={2} + timeout={10}> + + Since we want to filter out cells, we have to set the Filter Type to Cells. Furthermore, we are interested in the **number** of expressed genes per cell, not in the total amount of transcripts per cell, so we again choose to filter by Detection count. + + + + +![](2abc_tile_annot.jpg) + + + answer === "All of the above."} + options={[ + "Cells with very few detected genes may be damaged or of low quality.", + "Cells with an unusually high number of expressed genes may be multiplets or artifacts.", + "Removing extreme cells improves the accuracy of downstream clustering and visualization.", + "All of the above." 
+ ]} + neutralOptions={["I don't understand the question."]} + trials={2} + timeout={10}> + + + +c) Filter cells based on a minimum number of 6000 and a maximum number of 80000 transcripts per cell. + + answer === "cells"} + options={["Genes", "Cells", "Transcripts"]} + neutralOptions={["I don't understand the question."]} + trials={2} + timeout={10}> + + Since we are again filtering out cells we have to set the Filter Type to Cells. This means that the points on the graph represent cells (the threshold marks which cells are going to be excluded from (or included in) further analysis). + + + + + +### Task 2 - Normalization and scaling + +Normalize expression values for each gene in each cell to counts per 10000, logarithmize the values with natural logarithm and perform standardization with the Single Cell Preprocess widget. + + answer === "to account for sequencing depth and make gene expression values comparable across cells"} + options={["To eliminate biological variation between different cell types.", "To account for sequencing depth and make gene expression values comparable across cells.", "To change gene expression values so that all genes have the same expression level.", "Normalization is only needed for datasets with a small number of cells."]} + neutralOptions={["I don't understand the question."]} + trials={2} + timeout={10}> + + Raw gene expression values are influenced by technical factors such as sequencing depth. Normalization adjusts for these differences, allowing for meaningful comparisons of gene expression across cells. + + + + +### Task 3 - Gene annotation + +Map the genes in the dataset to the Entrez database. + + answer === "approximately 11000"} + options={["Approximately 300", "Approximately 14000", "Approximately 11000"]} + neutralOptions={["I don't understand the question."]} + trials={2} + timeout={10}> + + + ![](sc-ex2-q8-exp.jpg) + + + + +Plot the preprocessed and annotated data in a new t-SNE plot and compare it to the previous one. 
Quite the difference! + + + +### Task 4 - Batch Effect Correction + + + diff --git a/_chapters/single-cell-analysis/quiz-02/sc-ex2-q7-exp.jpg b/_chapters/single-cell-analysis/quiz-02/sc-ex2-q7-exp.jpg new file mode 100644 index 0000000..e13c990 Binary files /dev/null and b/_chapters/single-cell-analysis/quiz-02/sc-ex2-q7-exp.jpg differ diff --git a/_chapters/single-cell-analysis/quiz-02/sc-ex2-q8-exp.jpg b/_chapters/single-cell-analysis/quiz-02/sc-ex2-q8-exp.jpg new file mode 100644 index 0000000..3cdeba7 Binary files /dev/null and b/_chapters/single-cell-analysis/quiz-02/sc-ex2-q8-exp.jpg differ diff --git a/_chapters/single-cell-analysis/quiz-02/sc_filtering_workflow.jpg b/_chapters/single-cell-analysis/quiz-02/sc_filtering_workflow.jpg new file mode 100644 index 0000000..75e0c6b Binary files /dev/null and b/_chapters/single-cell-analysis/quiz-02/sc_filtering_workflow.jpg differ diff --git a/_chapters/single-cell-analysis/quiz-03/index.md b/_chapters/single-cell-analysis/quiz-03/index.md new file mode 100644 index 0000000..3498fe3 --- /dev/null +++ b/_chapters/single-cell-analysis/quiz-03/index.md @@ -0,0 +1,85 @@ +--- +title: 'Quiz' +--- + + +### Task 1 - Identifying clusters + + +Above you can see a t-SNE plot of the retinal dataset showing expected clusters (the number of PCA components in the t-SNE widget set to 10). Identify the most likely cell type corresponding to each cluster. Use the data table of known marker genes for each cell type and set the aggregation parameter in the Score Cells widget to **Fraction of expressed markers**. + +![](tsne-clusters.png) + + + answer === "pink"} + options={["Orange", "Pink", "Yellow", "Red"]} + neutralOptions={["I don't understand the question."]} + trials={2} + timeout={10}> + + + + answer === "red"} + options={["Red", "Light Blue", "Green", "Pink"]} + neutralOptions={["I don't understand the question."]} + trials={2} + timeout={10}> + + + +Liang et al. 
report that the proportion of rods is higher in the peripheral tissue than in the macular tissue. Does this hold for our dataset sample? Try using the Distributions widget to figure this out. + + answer === "true"} + options={["True", "False"]} + neutralOptions={["I don't understand the question."]} + trials={1} + timeout={10}> + + + +Select the top 100 genes that are differentially expressed in cones in comparison to non-cones (t-test). Forward them to the GO widget. Sort the lower list by increasing p-value. + + answer === "visual perception, sensory perception of light stimulus, detection of light stimulus"} + options={["Visual perception, Sensory perception of light stimulus, Detection of light stimulus", "Signal transduction, Nervous system process, Sensory perception", "Detection of abiotic stimulus, Detection of external stimulus, Sensory perception"]} + neutralOptions={["I don't understand the question."]} + trials={2} + timeout={10}> + + + +Try to determine [the tissue source of the single-cell dataset from a human](https://file.biolab.si/tmp/sc-quiz-anonymous-sample.tab).
(Try using the Marker Genes and Annotator widgets and, if need be, a quick web search.) + + answer === "pancreas"} + options={["Eye", "Kidney", "Pancreas", "Heart"]} + neutralOptions={["I don't understand the question."]} + trials={2} + timeout={10}> + + + + diff --git a/_chapters/single-cell-analysis/quiz-03/tsne-clusters.png b/_chapters/single-cell-analysis/quiz-03/tsne-clusters.png new file mode 100644 index 0000000..a6f6ce4 Binary files /dev/null and b/_chapters/single-cell-analysis/quiz-03/tsne-clusters.png differ diff --git a/single-cell-analysis/01-sc-data/index.md b/single-cell-analysis/01-sc-data/index.md new file mode 100644 index 0000000..4151335 --- /dev/null +++ b/single-cell-analysis/01-sc-data/index.md @@ -0,0 +1,37 @@ +--- +title: 'Single-Cell Expression Data' +subTitle: '' +requireLogin: false +quizThreshold: 0.6 +chapters: + - single-cell-analysis/01-introduction + - single-cell-analysis/02-sc-data + - single-cell-analysis/03-visualisation + - single-cell-analysis/quiz-01 +--- + +**What does it mean to study gene expression at the level of individual cells? How is single-cell data structured, and how can we make sense of such high-dimensional information? In these chapters, we introduce the basic concepts of single-cell transcriptomics and guide you through the process of exploring and visualizing single-cell data.** + +We begin by contrasting single-cell expression analysis with traditional bulk approaches, highlighting why cell-level resolution is crucial for understanding biological variability. We then examine how single-cell datasets are organized, focusing on count matrices, metadata, and common features such as sparsity and dropout. Finally, we explore methods for visualizing high-dimensional data, using techniques such as PCA and t-SNE to uncover patterns and structure in cellular populations. Throughout the tutorial, we will work with example datasets and use the Orange data mining platform to perform the analysis in an intuitive, visual workflow.
+ +Following are the main concepts we will cover: + +- **Expression Profile:** A representation of gene activity in a single biological sample, such as a tissue or an individual cell. + +- **Count Matrix:** A table of gene expression counts, with genes along one dimension and cells (or samples) along the other. + +- **Count:** The number of times RNA from a specific gene is detected in a single cell. + +- **Dropout:** The phenomenon where a gene is actually expressed in a cell but is not detected in the data, resulting in a recorded value of zero. + +- **Sparse Data:** Data in which most values are zero (or missing), and only a small fraction of entries contain non-zero values. + +- **Dimensionality Reduction:** Techniques that represent high-dimensional data in fewer dimensions, often so that it can be visualized. + +- **PCA (Principal Component Analysis):** A linear method that reduces dimensionality by capturing the main sources of variation in the data. + +- **t-SNE (t-distributed Stochastic Neighbor Embedding):** A non-linear method that groups similar cells together in a low-dimensional space, emphasizing local structure. + + +**Start by reading the lecture notes below and then answer the quiz questions.** + diff --git a/single-cell-analysis/02-qc-preprocessing-batch/index.md b/single-cell-analysis/02-qc-preprocessing-batch/index.md new file mode 100644 index 0000000..59d7259 --- /dev/null +++ b/single-cell-analysis/02-qc-preprocessing-batch/index.md @@ -0,0 +1,36 @@ +--- +title: 'Quality Control, Preprocessing and Batch Effect Correction' +subTitle: '' +requireLogin: false +quizThreshold: 0.6 +chapters: + - single-cell-analysis/04-preprocessing + - single-cell-analysis/05-batch-effects + - single-cell-analysis/quiz-02 +--- + + +**Why do we need to clean and adjust single-cell data before analysis? How can technical differences between experiments affect our results?
In these chapters, we introduce basic steps for preparing single-cell datasets for analysis.** + +We begin with filtering, where we remove low-quality cells and genes that do not provide useful information. This helps reduce noise in the data. We then apply preprocessing steps such as normalization and log transformation to make gene expression values comparable across cells. Finally, we look at batch effects (differences between datasets caused by technical factors) and show how to correct them so that data from different sources can be analyzed together. + +Following are the main concepts we will cover: + +- **Filtering (Quality Control):** Removing low-quality cells and uninformative genes. + +- **Detection Count:** Measures how often something is observed, regardless of how strongly it is expressed. + - For a **gene**: in how many cells it appears (has a non-zero value). + - For a **cell**: how many genes are detected in that cell. + +- **Total Count:** Measures how much expression is present overall. + - For a **gene**: the sum of its expression across all cells. + - For a **cell**: the total number of transcripts detected in that cell. + +- **Normalization:** A preprocessing step that adjusts gene expression values to make cells comparable, usually by correcting for differences in the total number of detected transcripts in each cell. + +- **Log Transformation:** Replacing expression values with their logarithms to compress large differences between values. + +- **Batch Effects:** Non-biological differences between datasets caused by technical variation during data collection or processing. + +- **Batch Correction:** Methods that adjust data to remove batch effects, allowing datasets from different sources to be compared and combined.
+ diff --git a/single-cell-analysis/03-cluster-exploration/index.md b/single-cell-analysis/03-cluster-exploration/index.md new file mode 100644 index 0000000..8da8559 --- /dev/null +++ b/single-cell-analysis/03-cluster-exploration/index.md @@ -0,0 +1,34 @@ +--- +title: 'Clusters' +subTitle: '' +requireLogin: false +quizThreshold: 0.6 +chapters: + - single-cell-analysis/06-marker-genes + - single-cell-analysis/07-clusters + - single-cell-analysis/08-cluster-exploration + - single-cell-analysis/09-visual-annotation + - single-cell-analysis/quiz-03 +--- + + +**How do we identify different cell types in single-cell data? How can we group similar cells together and discover genes that define them? In these chapters, we introduce basic methods for cell type identification, clustering, and marker gene discovery.** + +We begin by using known marker genes to score cells and visualize where different cell types appear in the data. We then apply clustering methods to group cells based on their gene expression profiles. Finally, we explore how to find new marker genes by comparing selected groups of cells to the rest of the dataset, and how to annotate cell types using existing marker databases. + +Following are the main concepts we will cover: + +- **Marker Genes:** Genes whose expression is characteristic of specific cell types. + +- **Cell Scoring:** Assigning scores to cells based on marker gene expression. + +- **Clustering:** Grouping similar cells based on gene expression patterns. + +- **Louvain Clustering:** A graph-based community detection method that finds groups (clusters) of similar cells. + +- **Differential Expression:** Identifying genes whose expression differs between groups of cells. + +- **Cell Annotation:** Assigning biological meaning (cell types) to clusters using marker genes.
+ + + diff --git a/single-cell-analysis/cc-by-nc-nd.png b/single-cell-analysis/cc-by-nc-nd.png new file mode 100644 index 0000000..c7c5545 Binary files /dev/null and b/single-cell-analysis/cc-by-nc-nd.png differ diff --git a/single-cell-analysis/collection.md b/single-cell-analysis/collection.md new file mode 100644 index 0000000..586a69e --- /dev/null +++ b/single-cell-analysis/collection.md @@ -0,0 +1,13 @@ +--- +title: Single Cell Analysis +subTitle: A Self-Paced Tutorial on Visual Analytics for Single Cell Gene Expression Analysis +coverImg: '' +--- + +This material includes a set of assignments and links to tutorial material on single cell gene expression analysis. The tutorial aims to provide an accessible introduction to single cell analysis, covering topics such as preprocessing single cell data and conducting complex data analysis using visual programming. To facilitate the learning process, we will be using [Orange](http://orangedatamining.com), our data mining tool of choice for this course. If you have not already installed Orange, please visit its [download page](https://orangedatamining.com/download/#macos) now. Once Orange is installed, please also install its Bioinformatics and Single Cell add-ons. + +These course notes were prepared by Blaž Zupan and Ela Praznik with help from the members of the [Bioinformatics Lab](http://biolab.si) at [University of Ljubljana](http://www.uni-lj.si), Slovenia. + +The material is offered under Creative Commons [CC BY-NC-ND licence](https://creativecommons.org/licenses/by-nc-nd/4.0/). + +![](cc-by-nc-nd.png) \ No newline at end of file