1 change: 1 addition & 0 deletions .gitignore
@@ -0,0 +1 @@
.DS_Store
24 changes: 24 additions & 0 deletions _chapters/single-cell-analysis/01-introduction/index.md
@@ -0,0 +1,24 @@
---
title: 'Introduction to Single-Cell Expression'
---

<!!! float-aside !!!> An expression profile is a representation of the activity (the expression) of thousands of genes for a single biological sample.

In traditional, bulk gene expression studies we usually compare two or more types of tissue samples, for instance a healthy and a pathological one. More specifically, we compare their **expression profiles**: the set of genes expressed in one sample is contrasted with the corresponding set expressed in the other in order to identify systematic differences in gene activity. Because a single sample is made up of hundreds to millions of cells, the measured expression level of any given gene effectively reflects **the average expression** across all the cells present in that sample.

However, the cells that make up a sample can differ widely in their characteristics: they may have different functions and morphologies, or be in different developmental or cell-cycle states. For example, in the human retina we can find as many as 5 different types of neurons, each specialized to perform a certain function! Averaging gene expression across all cells therefore masks cell-to-cell variation, making it impossible to determine whether observed expression patterns arise uniformly across cells or from specific cell subpopulations. If we want to study the differences in the gene expression between different cells within a sample, we need techniques with a finer-grained resolution than what bulk gene expression sequencing techniques allow.
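To make the masking effect concrete, here is a minimal sketch with made-up numbers (the cell counts and expression levels are assumptions chosen for illustration, not measurements). Two hypothetical samples give the same bulk average for a gene even though the gene is expressed very differently across their cells.

```python
import numpy as np

# Hypothetical expression of one gene in two samples of 1,000 cells each.
# Sample A: every cell expresses the gene at a moderate level (5 copies).
uniform = np.full(1000, 5.0)

# Sample B: only 10% of the cells express the gene, but at a high level (50 copies).
subpopulation = np.zeros(1000)
subpopulation[:100] = 50.0

# A bulk experiment effectively reports the average over all cells.
bulk_uniform = uniform.mean()
bulk_subpopulation = subpopulation.mean()

# Single-cell resolution lets us ask what fraction of cells express the gene.
expressing_uniform = (uniform > 0).mean()
expressing_subpopulation = (subpopulation > 0).mean()

print("bulk averages:", bulk_uniform, bulk_subpopulation)
print("fraction of expressing cells:", expressing_uniform, expressing_subpopulation)
```

The two bulk averages are identical, while the per-cell view shows that in the second sample the signal comes from a small subpopulation.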

<!!! float-aside !!!> Single-cell sequencing examines the sequence information (e.g. DNA or RNA) from individual cells.

Here’s where single-cell sequencing comes in. Using optimized next-generation sequencing, it allows us to measure **sequence information at the level of a single cell**. We can sequence either the _genome_ or the _transcriptome_ of a single cell, but in this tutorial we'll focus on gene expression, or _transcriptomic_, studies. These simultaneously **measure the RNA concentration** (conventionally only messenger RNA, or mRNA) of hundreds to thousands of genes in a single cell.


![](01-sc-workflow.jpeg)

<!!! float-aside !!!> A gene expression profile of a single cell tells us something about its function and state.

The genes that are expressed in a certain cell are characteristic of its **function and of its state**. For instance, we expect similar gene expression profiles from two healthy liver cells and different expression profiles between a healthy liver cell and a cancerous one. This fine-grained resolution opens up new avenues for understanding complex biological processes, such as development, disease progression, and cellular responses to stimuli.

The technology behind sequencing at the level of a single cell, however interesting, is not the topic of this tutorial. Rather, what we want to cover over the next few chapters is how to approach single-cell data once it has been obtained. In order to make sense of such data we need to analyze it. We will use **Orange** to perform just that in an easy and intuitive manner.


80 changes: 80 additions & 0 deletions _chapters/single-cell-analysis/02-sc-data/index.md
@@ -0,0 +1,80 @@
---
title: 'Structure of Single Cell Data'
---

## Single Cell Datasets in Orange

In single-cell expression studies, the data are typically first represented as a count matrix. Each row usually corresponds to an individual cell, and each column corresponds to a gene, with the entries recording how many RNA molecules from a given gene were detected (counted) in a given cell.

Let's look at an example. You can find a number of preloaded, publicly available single cell datasets which can be accessed through the [Single Cell Datasets](https://orangedatamining.com/widget-catalog/single-cell/single_cell_datasets/) widget. We will explore some of them in the following chapters.

<!!! float-aside !!!>
![](sc-notes-01-00_80.jpg)

Let us start by constructing a workflow that consists of a [Single Cell Datasets](https://orangedatamining.com/widget-catalog/single-cell/single_cell_datasets/) widget and a [Data Table](https://orangedatamining.com/widget-catalog/data/datatable/) widget. The [Single Cell Datasets](https://orangedatamining.com/widget-catalog/single-cell/single_cell_datasets/) widget reads the data from the server. Open the widget by double-clicking its icon. The window shows a list of available datasets. Let's start with a smaller dataset, a sample from the study conducted by [Baron et al. (2017)](https://pubmed.ncbi.nlm.nih.gov/27667365/), composed of pancreatic cells from a single human donor. Double-click the line with this dataset to instruct the widget to send the data to its output. After loading the data, open the [Data Table](https://orangedatamining.com/widget-catalog/data/datatable/) to see the loaded data in a spreadsheet.

![](sc-notes-01-01_80.jpg)&nbsp;

<!!! float-aside !!!>
Counts signify how many copies of the expressed gene were detected in the cell

There are 1631 cells and 5010 genes in this dataset sample. Orange stores data items in rows, and in single-cell transcriptomics our data items are cells. The cell **expression profiles** are therefore stored in rows. Columns refer to meta-features and genes. Our example data includes _cell class_, _barcode_, _cell id_ and some other meta information. When the gene expression values are represented with **whole numbers**, this usually indicates that we are dealing with **counts**, which **signify how many copies of the expressed gene were detected in the cell**. In other words, the numbers in the matrix tell us how many times we en**count**ered an RNA molecule of a gene in a particular cell. So, for instance, in the third row, we find a cell in which 0 RNA molecules of the genes AAAS and AACS were detected, but we encountered 12 transcripts of the gene AADAC. We call this kind of matrix a **count matrix**.
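Outside of Orange, a count matrix can be pictured as a plain two-dimensional array. The sketch below is only an illustration: the third row mirrors the values quoted above (0, 0 and 12), while the cell names and the first two rows are invented.

```python
import numpy as np

genes = ["AAAS", "AACS", "AADAC"]
cells = ["cell_1", "cell_2", "cell_3"]  # hypothetical cell identifiers

# Rows are cells, columns are genes, entries are detected RNA molecules.
# Only the third row follows the example in the text; the rest is made up.
counts = np.array([
    [2, 0, 0],
    [0, 1, 3],
    [0, 0, 12],
])

# Expression profile of the third cell, keyed by gene name.
third_cell_profile = dict(zip(genes, counts[2].tolist()))

# Total number of molecules detected in each cell.
molecules_per_cell = counts.sum(axis=1)

# Number of cells in which each gene was detected at least once.
cells_per_gene = (counts > 0).sum(axis=0)

print(third_cell_profile)
print(molecules_per_cell.tolist())
print(cells_per_gene.tolist())
```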

<!!! width-max !!!>
![](sc-notes-01-02_80.jpg)&nbsp;

<!!! float-aside !!!> Dropout refers to the phenomenon where a gene is expressed in a cell but not detected due to technical limitations, leading to false zero values

By scrolling through the data, you will notice that there are many zero values in our count matrix - single cell data is **sparse**. This is completely normal. Since scRNA-seq captures RNA from individual cells, lowly expressed genes may have only a few RNA molecules present in a given cell, making them easy to miss during sequencing. This phenomenon is called **dropout**. A zero value in the count matrix can therefore signify either that the gene was truly not expressed in a given cell or, more likely, that its expression was not detected.
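A toy simulation can show how dropout inflates the number of zeros. This is a deliberately simplified model and not how any particular protocol behaves: the Poisson expression levels and the capture probability of 0.1 are assumptions picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" molecule counts: 200 cells x 50 lowly expressed genes.
true_counts = rng.poisson(lam=2.0, size=(200, 50))

# Assumed capture efficiency: each molecule is detected with probability 0.1.
capture_probability = 0.1
observed = rng.binomial(true_counts, capture_probability)

# Fraction of zero entries before and after the simulated capture step.
true_zero_fraction = (true_counts == 0).mean()
observed_zero_fraction = (observed == 0).mean()

# Zeros in the observed matrix that hide a gene that was actually expressed.
dropout_fraction = ((observed == 0) & (true_counts > 0)).mean()

print(f"zeros in true counts:     {true_zero_fraction:.2f}")
print(f"zeros in observed counts: {observed_zero_fraction:.2f}")
print(f"entries lost to dropout:  {dropout_fraction:.2f}")
```

In this sketch most observed zeros belong to genes that were present in the cell but went undetected, which is the situation the term dropout describes.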

Let's look at another sample dataset. Open the [Single Cell Datasets](https://orangedatamining.com/widget-catalog/single-cell/single_cell_datasets/) widget again and select the dataset composed of bone marrow mononuclear cells, a sample of the data from [Zheng et al. (2017)](https://www.nature.com/articles/ncomms14049). Double-click to load the data.

![](sc-notes-01-021_80.jpg)&nbsp;

Single-cell datasets can be quite large, so it often makes sense to begin analysis on a subset of the data. Although this dataset is already relatively small (as it is itself a sample of a larger dataset), we will use it to demonstrate how to create an even smaller sample. Let's forward the data to a [Data Sampler](https://orangedatamining.com/widget-catalog/transform/datasampler/) widget. Open the widget and sample 100 cells from the data. There are several sampling types to choose from: we select the Fixed sample size option, set the number of instances to 100 and press Sample Data. Forward the data to a new [Data Table](https://orangedatamining.com/widget-catalog/data/datatable/) and open it.
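Conceptually, a fixed-size sample is a random choice of rows without replacement. The sketch below imitates that idea with NumPy on a made-up matrix. It is not the Data Sampler implementation, and the matrix dimensions and random seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical dataset: 1,000 cells x 20 genes.
n_cells, n_genes = 1000, 20
data = rng.poisson(1.0, size=(n_cells, n_genes))

# Pick 100 distinct row indices, so no cell is sampled twice.
sample_size = 100
chosen = rng.choice(n_cells, size=sample_size, replace=False)

sample = data[chosen]                        # the sampled cells
remaining = np.delete(data, chosen, axis=0)  # the cells that were not chosen

print(sample.shape, remaining.shape)
```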

<!!! float-aside !!!>
![](sc-notes-01-04_80.jpg)

![](sc-notes-01-03_80.jpg)&nbsp;

<!!! float-aside !!!>
You can find the number of input and output instances displayed at the bottom of an Orange widget. By clicking on them, you can take a quick glimpse at the data in a pop-up data table.

In the columns, we can again identify meta-features such as _cell type_, _replicate_, _ID_, and _barcode_, along with genes. The rows correspond to individual cells. However, this time, expression values are represented as decimals rather than whole numbers. This indicates that the counts have most likely already been normalized.
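We do not know which normalization was applied to this particular dataset. One common approach, shown here purely as an assumed example, scales each cell to the same total count and then applies a log transform. The small matrix below is invented.

```python
import numpy as np

# Hypothetical raw counts: 3 cells x 3 genes.
counts = np.array([
    [2, 0, 8],
    [0, 1, 3],
    [10, 0, 30],
], dtype=float)

# Total number of molecules detected in each cell (the "library size").
library_size = counts.sum(axis=1, keepdims=True)

# Scale every cell to the median library size so cells become comparable.
target = np.median(library_size)
scaled = counts / library_size * target

# log(1 + x) dampens the influence of very highly expressed genes.
log_normalized = np.log1p(scaled)

print(scaled)
print(np.round(log_normalized, 3))
```

After such a step the values are no longer whole numbers, which is why decimals in an expression matrix hint at earlier preprocessing.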

Now, let's augment our workflow to visualize the data. Because single-cell gene expression data are high-dimensional - each cell is described by the expression levels of thousands of genes - they cannot be visualized directly. To make visualization possible, we first need to reduce the dimensionality of the data while preserving as much of its underlying structure as possible.

For now, let's take a quick glance at our data using a popular dimensionality reduction technique called t-SNE. Draw a line from the Data Sampler and search for the t-SNE widget. Click and wait for the widget to process the data. We will also add another Data Table at the output of t-SNE.

Open the t-SNE widget and select a few data points by drawing a rectangle around them. Now open the [Data Table(2)](https://orangedatamining.com/widget-catalog/data/datatable/) to observe how the data on the selected cells are passed to the output of t-SNE. In Orange, most widgets are interactive and send out data upon any change in selection or in the widget's parameters.

<!!! width-max !!!>
![](sc-notes-01-05_75.jpg)


## Loading your own dataset

The datasets we have worked with in the previous section come from the server. Orange can also read data from spreadsheet file formats, including tab-separated, comma-separated and Excel files. Let us prepare a toy dataset in Excel and save it on a local disk.

![](sc-notes-01-06_75.jpg)&nbsp;

We can use the [File](https://orangedatamining.com/widget-catalog/data/file/) widget to load this dataset.


<!!! float-aside !!!>
Instead of using Excel, we could also use Google Sheets, a free online spreadsheet alternative. Then, instead of finding the file on the local disk, we would enter its URL into the [File](https://orangedatamining.com/widget-catalog/data/file/) widget’s URL entry box.

![](sc-notes-01-07_75.jpg)&nbsp;

Orange has correctly guessed that cell IDs are character strings and that this column in the dataset is special, meant to provide additional information and not to be used for any kind of modeling. All other columns are numeric features except for the type, which is a categorical feature. This is also the feature we wouldn't want to include in the profile of the cell and should rather consider it as a cell’s class. Double-click on the “feature” in the Role column and change the role of the feature type to “target”. Then click the Apply button.
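The three roles described above (meta, feature and target) can be pictured as a split of the columns. The sketch below parses a small tab-separated table with Python's standard library. The column names, gene names and values are hypothetical and do not reproduce the toy file shown in the screenshots.

```python
import csv
import io

# Hypothetical toy dataset in tab-separated form.
raw = (
    "cell_id\ttype\tgeneA\tgeneB\n"
    "c1\tneuron\t3\t0\n"
    "c2\tneuron\t5\t1\n"
    "c3\tglia\t0\t7\n"
)

reader = csv.DictReader(io.StringIO(raw), delimiter="\t")
rows = list(reader)

meta_column = "cell_id"   # extra information, not used for modeling
target_column = "type"    # the cell's class
feature_columns = [
    name for name in reader.fieldnames
    if name not in (meta_column, target_column)
]

cell_ids = [row[meta_column] for row in rows]
targets = [row[target_column] for row in rows]

# Numeric expression profile of each cell, one list per row.
profiles = [[float(row[name]) for name in feature_columns] for row in rows]

print(feature_columns)
print(cell_ids, targets)
print(profiles)
```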

![](sc-notes-01-08_75.jpg)&nbsp;


It is always good to check that all the data was read correctly. We can connect our [File](https://orangedatamining.com/widget-catalog/data/file/) widget to the [Data Table](https://orangedatamining.com/widget-catalog/data/datatable/) widget, and double-click on the [Data Table](https://orangedatamining.com/widget-catalog/data/datatable/) to see the data in spreadsheet format.

![](sc-notes-01-09_75.jpg)&nbsp;


There is more to input data formatting and loading. We can define the type and kind of the data column, specify that the column is actually a web address of an image, and more. But enough for now. If you would really like to dive in for more, check out the documentation page on [Loading your Data](https://orange3.readthedocs.io/projects/orange-visual-programming/en/latest/loading-your-data/index.html), or one of our [videos](https://www.youtube.com/watch?v=MHcGdQeYCMg&list=PLmNPvQr9Tf-ZSDLwOzxpvY-HrE0yv-8Fy&index=4&ab_channel=OrangeDataMining) on this subject.
39 changes: 39 additions & 0 deletions _chapters/single-cell-analysis/03-visualisation/index.md
@@ -0,0 +1,39 @@
---
title: 'Visualizing Single Cell Landscapes'
---

Let us load some single-cell gene expression data and organize the cells in two-dimensional visualizations. We will use the following workflow, and within it, compare two popular data visualization approaches, principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE).

<!!! float-aside !!!>
&nbsp;
[Single Cell Datasets](https://orangedatamining.com/widget-catalog/single-cell/single_cell_datasets/) connects to Orange’s data server that contains examples of datasets. You have to be connected to a network for this widget to work correctly.

![](sc-notes-02-01_75.jpg)

From the list of examples in [Single Cell Datasets](https://orangedatamining.com/widget-catalog/single-cell/single_cell_datasets/), let us again choose the data on mononuclear cells from bone marrow (Zheng et al., Nat Comm 2017). This dataset has already been preprocessed (to some degree) and comes with a selection of 1,000 genes.

<!!! width-max !!!>
![](sc-notes-02-02_75.jpg)

<!!! float-aside !!!>
To pass only the [PCA](https://orangedatamining.com/widget-catalog/unsupervised/PCA/) components to [Scatter Plot](https://orangedatamining.com/widget-catalog/visualize/scatterplot/) try rewiring the connection between the two widgets.

We pass the data to [PCA](https://orangedatamining.com/widget-catalog/unsupervised/PCA/), which displays a scree diagram, a chart that shows how much of the variance is explained by the first few components. [PCA](https://orangedatamining.com/widget-catalog/unsupervised/PCA/) transforms our data to a new coordinate system defined by principal components. The components are orthogonal to each other, and the transformation is constructed so that the first component explains the largest share of the variance, the second explains the largest share of the remaining variance, and so on.
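For readers who like to see the arithmetic, PCA can be sketched with a singular value decomposition of the centered data. The example uses randomly generated data as a stand-in for expression profiles and is not the code the PCA widget runs. The explained-variance ratios it computes are the numbers a scree diagram plots.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 100 cells x 5 genes, with two strongly correlated genes.
X = rng.normal(size=(100, 5))
X[:, 1] = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=100)

# PCA works on centered data (each gene has zero mean).
centered = X - X.mean(axis=0)

# Rows of Vt are the principal components, ordered by explained variance.
U, singular_values, Vt = np.linalg.svd(centered, full_matrices=False)

# Share of the total variance explained by each component.
explained_variance = singular_values ** 2 / (len(X) - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Project the cells onto the first two components for a 2D scatter plot.
components = Vt[:2]
projection = centered @ components.T

print(np.round(explained_ratio, 3))
print(projection.shape)
```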

A conceptually very different technique to PCA is [t-SNE](https://orangedatamining.com/widget-catalog/unsupervised/tsne/), which embeds the data into two dimensions so that cells with similar expression stay together.

<!!! float-aside !!!>
&nbsp;
The [t-SNE](https://orangedatamining.com/widget-catalog/unsupervised/tsne/) widget does not include axes. In fact, axes in t-SNE make no sense. Why? Because the coordinates of the points are not any two features of the original dataset, but a complex non-linear mapping of the original multidimensional data into only two dimensions.

![](sc-notes-02-03_75.jpg)

<!!! float-aside !!!>
&nbsp;
To explore the differences between t-SNE and PCA, have both windows open, select the data in [t-SNE](https://orangedatamining.com/widget-catalog/unsupervised/tsne/), and observe the changes in the [Scatter Plot](https://orangedatamining.com/widget-catalog/visualize/scatterplot/) showing the [PCA](https://orangedatamining.com/widget-catalog/unsupervised/PCA/) projection. If the Orange canvas window is getting in your way, use the "Bring Widgets to the Front" command from the View menu.

PCA and t-SNE are two popular visualizations of single-cell gene expression data. Their visual depictions are often very different. PCA is a linear transformation that aims to be “more faithful” to the original data, while t-SNE aims to expose the clustering structure and focuses on preserving local similarities. We can compare the layout of the two visualizations by adding a connection from the [t-SNE](https://orangedatamining.com/widget-catalog/unsupervised/tsne/) widget to the [Scatter Plot](https://orangedatamining.com/widget-catalog/visualize/scatterplot/) showing the [PCA](https://orangedatamining.com/widget-catalog/unsupervised/PCA/) projection. With it, a subset of cells selected in [t-SNE](https://orangedatamining.com/widget-catalog/unsupervised/tsne/) will be highlighted in the [PCA](https://orangedatamining.com/widget-catalog/unsupervised/PCA/) plot.
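What "preserving local similarities" means can be hinted at with the first step of t-SNE, where distances between cells are turned into neighbor probabilities. The sketch below covers only that step and simplifies it: it uses one fixed Gaussian bandwidth (`sigma`, an assumption), whereas t-SNE tunes a bandwidth per point from the perplexity setting. The five two-gene profiles are invented.

```python
import numpy as np

# Hypothetical 2-gene expression profiles forming two tight groups of cells.
X = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [0.0, 0.1],
    [5.0, 5.0],
    [5.1, 5.0],
])

# Squared Euclidean distances between all pairs of cells.
diff = X[:, None, :] - X[None, :, :]
sq_dist = (diff ** 2).sum(axis=2)

# Gaussian affinities with a single, assumed bandwidth.
sigma = 1.0
affinity = np.exp(-sq_dist / (2 * sigma ** 2))
np.fill_diagonal(affinity, 0.0)  # a cell is not its own neighbor

# Row i holds the probabilities that cell i picks each other cell as a neighbor.
P = affinity / affinity.sum(axis=1, keepdims=True)

print(np.round(P, 3))
```

Nearby cells receive almost all of the probability mass and distant cells receive practically none, so the embedding is driven by who is close to whom rather than by large global distances.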

<!!! width-max !!!>
![](sc-notes-02-06_75.jpg)
