Merged
2 changes: 1 addition & 1 deletion _chapters/single-cell-analysis/04-preprocessing/index.md
@@ -2,7 +2,7 @@
title: ' Data Filtering and Preprocessing'

---
Single-cell datasets can have a lot of technical variability issues. Each cell will generally capture a varying number of reads. This will cause some cells to have too low of a signal to be useful. Additionally, genes range from ever-active housekeeping genes to specialized genes that are only expressed in particular cell types or under certain conditions. Employing filtering techniques and preprocessing steps becomes crucial to prepare the data for subsequent analyses.
Single-cell datasets can have a lot of **technical variability issues**. In contrast to bulk RNA sequencing, where measurements come from many cells at once and random noise is averaged out, in single-cell data each measurement comes from just one cell. Because the amount of RNA is very small, **small technical differences and random sampling effects have a much larger impact**, leading to high variability and many low or zero measurements (dropouts). As a result, some cells may have signal that is too low to be useful. In addition, **gene expression levels vary widely**, from consistently expressed housekeeping genes to genes that are only active in specific cell types or conditions. Therefore, filtering and preprocessing steps are essential to obtain reliable results for downstream analysis.
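
The filtering idea described above can be sketched in a few lines of NumPy. This is only an illustration, not what the Orange widgets do internally: the function name `filter_counts` and both thresholds are assumptions chosen for the toy matrix, not recommended values.

```python
import numpy as np

def filter_counts(counts, min_counts_per_cell=3, min_cells_per_gene=2):
    """Drop low-signal cells, then genes detected in too few remaining cells.

    counts: 2-D array, rows are cells, columns are genes.
    Both thresholds are illustrative assumptions.
    """
    counts = np.asarray(counts)
    # A cell is kept when its total count reaches the threshold.
    keep_cells = counts.sum(axis=1) >= min_counts_per_cell
    filtered = counts[keep_cells]
    # A gene is kept when enough of the remaining cells express it.
    keep_genes = (filtered > 0).sum(axis=0) >= min_cells_per_gene
    return filtered[:, keep_genes], keep_cells, keep_genes

# Toy matrix: 4 cells x 3 genes; the last cell has almost no signal.
toy = np.array([
    [5, 0, 2],
    [3, 1, 0],
    [4, 0, 6],
    [0, 0, 1],
])
filtered, keep_cells, keep_genes = filter_counts(toy)
print(filtered.shape)
```

Cells are filtered before genes so that a gene is not kept merely because it appears in cells that are discarded anyway.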

## Filtering or quality control

12 changes: 6 additions & 6 deletions _chapters/single-cell-analysis/05-batch-effects/index.md
@@ -4,12 +4,12 @@ title: 'Batch Effects Correction'

## Batch Effects Correction

<!!! float-aside !!!>
A batch is a group of samples (cells) that are processed together under the same experimental conditions.

In single-cell analysis, we often work with data from multiple sources, such as different experiments, laboratories, or patient samples. A batch refers to a group of samples that were processed under the same technical conditions, for example, in the same lab, using the same protocol, reagents, or sequencing run. Differences between batches can introduce batch effects, which are sources of unwanted technical variation. These must be distinguished from biological variation, which is typically the focus of analysis.


In single cell analysis, we often deal with data from several different sources, be it from a different provider, different experiments or simply different batches. A “batch” refers to an individual group of samples that are processed differently relative to other samples. This different processing when gathering a batch of data can affect variation in the obtained data. The technical, non-biological factors that affect variation in batches are referred to as batch effects.

Batch effects are problematic because they hinder our ability to measure true biological variation between samples, which is what we are interested in. Luckily, they can be dealt with computationally by aligning data from different batches.
Batch effects are problematic because they obscure these biological differences and make comparisons between samples unreliable. Luckily, they can be dealt with computationally by aligning data from different batches.

There are different approaches to align data sets and remove batch effects. Orange currently implements three: one through the [Batch Effect Removal](https://orangedatamining.com/widget-catalog/single-cell/batch_effect_removal/) widget, the second, more standard, using canonical correlation implemented in the widget [Align Datasets](https://orangedatamining.com/widget-catalog/single-cell/align_datasets/) and the third, most recently added, in the Harmony widget.
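
To make the idea of aligning batches concrete, here is a deliberately simple sketch that subtracts each batch's own per-gene mean. This is a swapped-in toy technique, not the method used by Batch Effect Removal, Align Datasets, or Harmony; the function name `center_per_batch` and the toy data are assumptions made for illustration.

```python
import numpy as np

def center_per_batch(expression, batch_labels):
    """Subtract the per-gene mean of each batch from the cells in that batch.

    expression: 2-D array, rows are cells, columns are genes.
    batch_labels: one label per cell.
    A toy correction only; the real widgets use more careful methods.
    """
    expression = np.asarray(expression, dtype=float)
    batch_labels = np.asarray(batch_labels)
    corrected = expression.copy()
    for batch in np.unique(batch_labels):
        mask = batch_labels == batch
        corrected[mask] -= expression[mask].mean(axis=0)
    return corrected

# Two batches with the same structure, but batch "b" is shifted by +10.
expr = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [11.0, 12.0],
    [13.0, 14.0],
])
batches = ["a", "a", "b", "b"]
corrected = center_per_batch(expr, batches)
print(corrected)
```

Mean centering also removes any genuine biological difference that happens to coincide with the batch, which is one reason dedicated methods try to preserve cell-type structure while mixing batches.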

@@ -39,13 +39,13 @@ In our exploration of data integration methods, we'll first look at a technique

Next, let's revisit the Align Datasets widget to fine-tune the parameters for potentially improved clustering. A common strategy is to start with a reduced number of components. Orange seamlessly propagates the transformed data to t-SNE, where we see an updated plot. The alignment between the two datasets appears significantly improved.

Let's explore the second data alignment method implemented by the Batch Effect Removal widget. We add this widget to our canvas and feed it the combined data. When opening the widget we have to set the distinguishing feature for different batches; in our case, this is the Source ID. We'll also leave the "Skip zero expressions" option unchecked. After applying this correction and visualizing the results in t-SNE, with colors representing cell classes and shapes representing data sources, we observe an interesting result. While some clusters of identical cell types from different sources merge into cohesive units, others remain distinct. It appears that the Align Datasets widget outperformed the Batch Effect Removal method in this case.
Let's explore the second data alignment method implemented by the Batch Effect Removal widget. We add this widget to our canvas and feed it the combined data. When opening the widget we have to set the distinguishing feature for different batches: in our case, this is the Source ID. We'll also leave the "Skip zero expressions" option unchecked. After applying this correction and visualizing the results in t-SNE, with colors representing cell classes and shapes representing data sources, we observe an interesting result. While some clusters of identical cell types from different sources merge into cohesive units, others remain distinct. It appears that the Align Datasets widget outperformed the Batch Effect Removal method in this case.

<!!! width-max !!!>
![](sc-notes-04-05_75.jpg)


Let us now try Harmony, the third method available in Orange. Harmony has achieved good results in the [Batch integration benchmark study](https://openproblems.bio/benchmarks/batch_integration?version=v2.0.0), making it a robust general-purpose method for integrating single-cell datasets across batches. As before, we pass the concatenated data to the Harmony widget and specify Source ID as the batch-defining variable, leaving the remaining parameters at their default values for now. We then visualize the transformed data using t-SNE. The resulting plot shows that cells of the same type remain clustered together. What about the batch effect correction? To assess batch mixing, we set the Shape parameter to indicate the data source; this reveals that cells from different batches are now well mixed rather than forming separate clusters. Thus by using Harmony we have effectively reduced batch effects while preserving biologically meaningful structure for downstream analysis.
Let us now try Harmony, the third method available in Orange. Harmony has achieved good results in the [Batch integration benchmark study](https://openproblems.bio/benchmarks/batch_integration?version=v2.0.0), making it a robust general-purpose method for integrating single-cell datasets across batches. As before, we pass the concatenated data to the Harmony widget and specify Source ID as the batch-defining variable, leaving the remaining parameters at their default values for now. We then visualize the transformed data using t-SNE. The resulting plot shows that cells of the same type remain clustered together. What about the batch effect correction? To assess batch mixing, we set the Shape parameter to indicate the data source: this reveals that cells from different batches are now well mixed rather than forming separate clusters. By using Harmony we have effectively reduced batch effects while preserving biologically meaningful structure for downstream analysis.

<!!! float-aside !!!>
Among the three main parameters in Harmony (sigma, theta, and lambda), it is often useful to adjust theta, which controls the strength of batch correction: lower values result in weaker batch correction, whereas higher values enforce stronger mixing between batches.
3 changes: 2 additions & 1 deletion _chapters/single-cell-analysis/06-marker-genes/index.md
@@ -24,7 +24,8 @@ The output of the widget is a table that includes a gene name and cell type, bot

The idea is now that we would select the gene(s) from the data table, and then score the cells according to the mean expression of selected genes. Widget Score Cells assigns a numerical score to each cell that is proportional to an average expression of the marker genes at the input of the widget. The score is added as a meta attribute to the cell data on the output of Score Cells. Check this using the Data Table! We can now feed this data into t-SNE and set the color and size of the points to the cell score.
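
The scoring step described above, a per-cell score proportional to the average expression of the selected marker genes, can be sketched as follows. This is a minimal illustration and not the Score Cells implementation; the function name `score_cells`, the toy expression values, and the choice of example genes are assumptions.

```python
import numpy as np

def score_cells(expression, gene_names, markers):
    """Return one score per cell: the mean expression of the marker genes.

    expression: 2-D array, rows are cells, columns are genes.
    gene_names: list of column names, in column order.
    markers: marker genes to average; names missing from the data are ignored.
    """
    idx = [gene_names.index(g) for g in markers if g in gene_names]
    if not idx:
        # No marker is present in the data, so every cell scores zero.
        return np.zeros(expression.shape[0])
    return expression[:, idx].mean(axis=1)

# Toy data: 3 cells x 3 genes (values are made up for illustration).
genes = ["CD3E", "CD79A", "HBB"]
expr = np.array([
    [4.0, 0.0, 0.0],
    [0.0, 3.0, 1.0],
    [2.0, 0.0, 6.0],
])
scores = score_cells(expr, genes, ["CD3E"])
scores_two = score_cells(expr, genes, ["CD79A", "HBB"])
print(scores, scores_two)
```

Changing the list of markers changes which cells score highly, which mirrors what happens in the t-SNE plot when a different selection of marker genes is made.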

Notice that with any change in the selection of marker genes, we find a group of cells in t-SNE plot where these genes are expressed. Looks like T cells are in the bottom right cluster, B cells somewhere in the middle, and erythrocytes in the left cluster. Did we say cluster? Oh, we are not there yet…
Notice that with any change in the selection of marker genes, we find a group of cells in the t-SNE plot where these genes are expressed. Looks like T cells are in the bottom right cluster, B cells somewhere in the middle, and erythrocytes in the left cluster. Did we say cluster?
<br />

<!!! width-max !!!>
![](sc-notes-06-04_75.jpg)
57 changes: 38 additions & 19 deletions _chapters/single-cell-analysis/quiz-02/index.md
@@ -150,51 +150,70 @@ Plot the preprocessed and annotated data in a new t-SNE plot and compare it to t

### Task 4 - Batch Effect Correction

Download the sample of a pancreas single cell gene expression dataset ([pancreas_sampled_1k5k.tab](http://file.biolab.si/datasets/pancreas_sampled_1k5k.tab)) and load it into Orange. Generate a t-SNE plot.


<Question
id="sc-ex2-q8"
points={1}
type="multi"
question="How many different batches are present in the dataset?"
scorer={(answer) => answer === "3"}
options={["11", "2", "3"]}
question="What is a batch in single cell analysis?"
scorer={(answer) => answer === "a group of cells processed under the same technical conditions"}
options={["A group of cells from the same tissue", "A group of cells from the same patient", "A group of cells processed under the same technical conditions", "A procedure that removes unwanted technical variability from the data"]}
neutralOptions={["I don't understand the question."]}
trials={2}
timeout={10}>
<Explanation after="correctOrMaxTrials">

<!!! retina !!!>
![](sc-ex2-q8-exp.jpg)
</Explanation>
</Question>


<Question
id="sc-ex2-q9"
points={1}
type="multi"
question="Why do we need to apply batch-correction?"
scorer={(answer) => answer === "to align datasets from different sources"}
options={["To normalize the data", "To align datasets from different sources", "To reduce the size of the dataset", "To separate datasets from different sources"]}
scorer={(answer) => answer === "to correct for technical differences so datasets can be compared"}
options={["To normalize the data", "To correct for technical differences so datasets can be compared", "To reduce the size of the dataset", "To separate datasets from different sources"]}
neutralOptions={["I don't understand the question."]}
trials={2}
timeout={10}>
</Question>



**Perform batch effect correction on the following data:**

a) Download a sample of a pancreas single-cell gene expression dataset ([pancreas_sampled_1k5k.tab](http://file.biolab.si/datasets/pancreas_sampled_1k5k.tab)) and load it into Orange. The dataset already includes a metafeature, _Batch_, which indicates the sequencing procedure used to obtain each measurement. Generate a t-SNE plot.


<Question
id="sc-ex2-q10"
points={1}
type="multi"
question="How many different batches are present in the dataset?"
scorer={(answer) => answer === "3"}
options={["11", "2", "3"]}
neutralOptions={["I don't understand the question."]}
trials={2}
timeout={10}>
<Explanation after="correctOrMaxTrials">

<!!! retina !!!>
![](sc-ex2-q8-exp.jpg)
</Explanation>
</Question>


**Apply two different batch-effect correction methods to the dataset:**

a) Using Align Datasets widget (set the Data source indicator to Batch and leave all other parameters at default values)
b) Apply two different batch-effect correction methods to the dataset:

i) Using the Align Datasets widget (set the Data source indicator to Batch and leave all other parameters at default values)

b) Using Harmony widget (leave all parameters at their default values)

**For each method, generate a t-SNE embedding of the corrected data. Compare t-SNE plots (uncorrected, Align Datasets corrected, Harmony corrected) side by side.**
ii) Using the Harmony widget (leave all parameters at their default values)

For each method, generate a t-SNE embedding of the corrected data. Compare t-SNE plots (uncorrected, Align Datasets corrected, Harmony corrected) side by side.


<Question
id="sc-ex2-q10"
id="sc-ex2-q11"
points={1}
type="multi"
question="Just by looking at the t-SNE plots, which method more effectively removes batch effects (i.e., shows better mixing of batches and separation of cell type clusters)?"
@@ -215,11 +234,11 @@ b) Using Harmony widget (leave all parameters at their default values)

Start from the uncorrected dataset and create a second Harmony workflow: add a new Harmony widget, set the parameter theta to 2.5, and leave all other parameters at their default values. Connect the output of this widget to a new t-SNE plot and set the number of PC components used to 30.

Compare this plot with the previous t-SNE plot obtained using Harmony with default parameters. Focus on how the change in theta affects the mixing of batches and the separation of clusters.
Compare this plot with the previous t-SNE plot obtained using Harmony with default parameters. Focus on how the changes in theta and the number of PC components affect the mixing of batches and the separation of clusters.


<Question
id="sc-ex2-q11"
id="sc-ex2-q12"
points={1}
type="multi"
question="Compared to the default Harmony settings, does increasing theta to 2.5 and using 30 principal components improve batch mixing and cluster separation?"
67 changes: 52 additions & 15 deletions _chapters/single-cell-analysis/quiz-03/index.md
@@ -6,56 +6,93 @@ title: 'Quiz'
### Task 1 - Identifying clusters


Above you can see a t-SNE plot of the retinal dataset showing expected clusters (the number of PCA components in the t-SNE widget set to 10). Identify the most likely cell type corresponding to each cluster. Use the data table of known marker genes for each cell type and set the aggregation parameter in the Score Cells widget to **Fraction of expressed markers**.

![](tsne-clusters.png)


<Question
id="sc-ex3-q1"
points={1}
type="multi"
question="Which cluster on the plot above most likely corresponds to cone cells?"
scorer={(answer) => answer === "pink"}
options={["Orange", "Pink", "Yellow", "Red"]}
question="What are marker genes?"
scorer={(answer) => answer === "genes whose expression is characteristic of specific cell types or states"}
options={[
"Genes whose expression is characteristic of specific cell types or states",
"Genes that are expressed in all cells at the same level",
"Genes used to normalize gene expression data",
"Genes consistently expressed in most or all cells because they are required for basic cellular functions necessary for survival"
]}
neutralOptions={["I don't understand the question."]}
trials={2}
timeout={10}>
</Question>

Perform cluster exploration on the retinal dataset. Use the data table of known marker genes ([sc-quiz-marker-genes.xlsx](http://file.biolab.si/datasets/sc-quiz-marker-genes.xlsx)) for each cell type (don't forget to pass the marker genes data through the Genes widget to annotate it!) and set the aggregation parameter in the Score Cells widget to **Fraction of expressed markers**.

![](sc-ex3-q2.jpg)


<Question
id="sc-ex3-q2"
points={1}
type="multi"
question="Which cluster on the plot above most likely corresponds to retinal ganglion cells?"
scorer={(answer) => answer === "red"}
options={["Red", "Light Blue", "Green", "Pink"]}
question="Which cell type is likely picked out in the t-SNE plot above?"
scorer={(answer) => answer === "rods"}
options={["Cones", "Horizontal cells", "Retinal ganglion cells", "Rods", "Amacrine cells"]}
neutralOptions={["I don't understand the question."]}
trials={2}
trials={4}
timeout={10}>
<Explanation after="correctOrMaxTrials">
<!!! retina !!!>
![](sc-ex3-q2-exp.jpg)
</Explanation>
</Question>


Liang et al. report that in the peripheral tissue the proportion of rods is higher than the proportion of rods in the macular tissue. Does this hold for our dataset sample? Try using the Distributions widget to figure this out.

![](sc-ex3-q3.jpg)


<Question
id="sc-ex3-q3"
points={1}
type="multi"
question="Which cell type is likely picked out in the t-SNE plot above?"
scorer={(answer) => answer === "horizontal cells"}
options={["Cones", "Horizontal cells", "Retinal ganglion cells", "Rods", "Amacrine cells"]}
neutralOptions={["I don't understand the question."]}
trials={4}
timeout={10}>
<Explanation after="correctOrMaxTrials">
<!!! retina !!!>
![](sc-ex3-q3-exp.jpg)
</Explanation>
</Question>


Liang et al. report that rods make up a higher proportion of all cells in the peripheral tissue than in the macular tissue. Does this hold for our dataset sample? Try using the Distributions widget to figure this out.

<!!! float-aside !!!>
(This is a hard question, so here is a hint: In the Distributions widget you need to differentiate between Rods (Selected) and non-Rods (Not Selected), which means sending all data to Distributions, not just the selected data (rewire!), as well as between tissue Source (Macular or Peripheral). In addition, check _Stack columns_, _Show probabilities_ and _Show cumulative distribution_.)

<Question
id="sc-ex3-q4"
points={1}
type="multi"
question="In our sample, the proportion of rods is higher in the peripheral tissue than the proportion of rods in the macular tissue:"
scorer={(answer) => answer === "true"}
options={["True", "False"]}
neutralOptions={["I don't understand the question."]}
trials={1}
timeout={10}>
<Explanation after="correctOrMaxTrials">
<!!! retina !!!>
![](sc-ex3-q41-exp.jpg)
![](sc-ex3-q42-exp.jpg)
</Explanation>
</Question>


Select the top 100 genes that are differentially expressed in cones in comparison to non-cones (T-test). Forward them to the GO widget. Sort the lower list by increasing p-value.

<Question
id="sc-ex3-q4"
id="sc-ex3-q5"
points={1}
type="multi"
question="Which among these GO terms have a high p-value and have an enrichment score above 40?"
@@ -70,7 +107,7 @@ Select the top 100 genes that are differentially expressed in cones in compariso
Try to determine [the tissue source of the single-cell dataset from a human](https://file.biolab.si/tmp/sc-quiz-anonymous-sample.tab). (Try using the Marker Genes widget, Annotator and, if need be, a quick web search.)

<Question
id="sc-ex3-q5"
id="sc-ex3-q6"
points={1}
type="multi"
question="From which organ tissue do the cells from the dataset most likely come from?"