-
Notifications
You must be signed in to change notification settings - Fork 1
Expand file tree
/
Copy pathFull_Tutorial.Rmd
More file actions
2192 lines (1775 loc) · 85.1 KB
/
Full_Tutorial.Rmd
File metadata and controls
2192 lines (1775 loc) · 85.1 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Full Tutorial"
date: '`r Sys.Date()`'
output:
html_document
knit: knitted
vignette: >
%\VignetteIndexEntry{Full Tutorial}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
# Full Tutorial for SigBridgeR
## 0. Preface
### 0.1 Contents
- [Full Tutorial for SigBridgeR](#full-tutorial-for-sigbridger)
- [0. Preface](##0-preface)
- [0.1 Contents](###01-contents)
- [0.2 Introduction to SigBridgeR](###02-introduction-to-sigbridger)
- [1. Load and Preprocess data](##1-load-and-preprocess-data)
- [1.1 Single-cell RNA-seq Data](###11-single-cell-rna-seq-data)
- [1.1.1 (Option A) Start from Raw Matrix](###111-option-a-start-from-raw-matrix)
- [1.1.2 (Option B) Start from AnnData Object](####112-option-b-start-from-anndata-object)
- [1.1.8 (Optional) Filter Out Tumor Cells](####118-optional-filter-out-tumor-cells)
- [1.2 Bulk expression data](###12-bulk-expression-data)
- [1.2.1 Evaluate the quality of your bulk RNA-seq data](####121-evaluate-the-quality-of-your-bulk-rna-seq-data)
- [Quality Control Metrics Reported](####quality-control-metrics-reported)
- [Recommended Parameter Adjustments](####recommended-parameter-adjustments)
- [1.2.2 Gene Symbol Conversion](###122-gene-symbol-conversion)
- [1.3 Phenotype Data](##13-phenotype-data)
- [2. Screen Cells Associated with Phenotype](##2-screen-cells-associated-with-phenotype)
- [2.1 (Option A) Scissor Screening](###21-option-a-scissor-screening)
- [2.2 (Option B) scPAS Screening](###22-option-b-scpas-screening)
- [2.3 (Option C) scAB Screening](###23-option-c-scab-screening)
- [2.4 (Option D) scPP Screening](###24-option-d-scpp-screening)
- [2.5 (Option E) DEGAS Screening](###25-option-e-degas-screening)
- [2.6 (Option F) LP_SGL Screening](###26-option-f-lp_sgl-screening)
- [2.7 (Option G) PIPET Screening](###27-option-g-pipet-screening)
- [2.8 (Option H) SIDISH Screening](###28-option-h-sidish-screening)
- [2.9 (Option I) SCIPAC Screening](###29-option-i-scipac_screening)
- [2.F Merge screening results](###2f-merge-screening-results)
- [3. Visualization](##3-visualization)
- [3.1 UMAP for screening results](###31-umap-for-screening-results)
- [3.2 Stack bar plot for screening results](###32-stack-bar-plot-for-screening-results)
- [3.3 Venn diagram for screening results](###33-venn-diagram-for-screening-results)
- [3.4 Upset plot for screening results](###34-upset-plot-for-screening-results)
- [4. Example](##4-example)
- [4.1 Survival-associated cell screening](###41-survival-associated-cell-screening)
- [4.2 Continuous Phenotype-associated cell screening](###42-continuous-phenotype-associated-cell-screening)
- [4.3 Binarized phenotype-associated cell screening](###43-binarized-phenotype-associated-cell-screening)
- [5. Troubleshooting](##5-troubleshooting)
- [6. References](##6-references)
### 0.2 Introduction to SigBridgeR
SigBridgeR (short for **Sig**nificant cell-to-phenotype **Bridge** in **R**) is an R package for screening cells highly associated with phenotype data using single-cell RNA-seq, bulk RNA expression and sample related phenotype data (e.g. patient survival, age, etc). It integrates many single cell phenotypic screening methods ([8. References](#8-references)) and provides unified preprocessing, parameter tuning, and visualization approaches,performing as a unified integration panel.
------------------------------------------------------------------------
## 1. Load and Preprocess data
First load the packages, you will see a version number message indicating successful loading:
```{r,load_package,eval=FALSE}
library(SigBridgeR)
# ✔ SigBridgeR v3.x.x loaded
```
### 1.1 Single-cell RNA-seq Data
You can use function `SCPreProcess` to preprocess your single-cell RNA-seq data. This function utilizes a flexible pipeline system, allowing you to customize the analysis workflow using character codes:
**Pipeline Code Table:**
| Code | Function | Description |
| --- | --- | --- |
| **o** | `CreateSeuratObject` | **Required.** Must be the first step. |
| **n** | `NormalizeData` | Standard normalization. |
| **s** | `ScaleData` | Scales data for PCA. |
| **v** | `FindVariableFeatures` | Selects highly variable genes. |
| **p** | `RunPCA` | Principal Component Analysis. |
| **e** | `FindNeighbors` | Computes SNN graph. |
| **c** | `FindClusters` | Louvain algorithm clustering. |
| **t** | `RunTSNE` | t-SNE reduction. |
| **u** | `RunUMAP` | UMAP reduction. |
| **r** | `SCTransform` | **SCT workflow.** Replaces n, s, v. |
#### 1.1.1 (Option A) Start from Raw Matrix
When starting from a raw count matrix (data.frame, matrix or dgCMatrix), `SCPreProcess` executes the steps defined in the pipeline argument (Default: `"onsvpcetu"`). You can customize parameters for each step via the `params` list.
```{r,scpreprocessing_raw_matrix,eval=FALSE}
your_seurat <- SCPreProcess(
sc = your_matrix,
...,
pipeline = "onsvpcetu",
params = list(
# * CreateSeuratObject
o = list(
project = "SC_Screen_Proj",
min.cells = 400L
),
# * NormalizeData
n = list(),
# * ScaleData
s = list(),
# * FindVariableFeatures
v = list(),
# * RunPCA
p = list(),
# * FindNeightbors
e = list(),
# * FindClusters
c = list(
resolution = 0.6
),
# * RunTSNE
t = list(),
# * RunUMAP
u = list()
# * SCTransform
# r = list()
),
quality_control = list(
pattern = c("^MT-")
),
data_filter = list(
nFeature_RNA_thresh = c(200L, 6000L),
nCount_RNA_thresh = c(500L, 50000L),
# * only used when specifed in `quality_control.pattern`
percent.mt = 20L, # mitochondrial genes
percent.rp = 60L # ribosomal protein genes
# ? When combined pattern is used, like `quality_control$pattern <- "^MT-|^RP[LS]"`
# ? Use `_` to separate different patterns like this:
# percent.mt_rp = 60L
# ? When filtering for non-mitochondrial genes and non-ribosomal proteins RNA genes,
# ? the column names are in lowercase letter form with regular expression symbols removed.
# `quality_control$pattern <- "^[rt]rna"`
# Correct threshhold setting is `percent.rt_rna = 60L`
# ? Use `SigBridgeR::Pattern2Colname()` to get the correct colname if still confused.
),
column2only_tumor = NULL
)
```
**Key Workflow Steps:**
1. Pipeline Execution: The function runs Seurat methods in the order specified by the `pipeline` string (e.g., 'o' then 'n').
2. QC & Filtering:
- Detection: Regex patterns in `quality_control` (e.g., `^MT-`) automatically generate metadata columns (e.g.,` percent.mt`).
- Filtering: Cells are filtered based on `data_filter` thresholds. Note: Custom patterns generate specific column names (e.g., `^[rt]rna` becomes `percent.rt_rna`).
3. Tumor Filtering: If `column2only_tumor` is provided, the function retains only cells with `"Tumor/Cancer/Malignant"`(case-insensitive) labels in that metadata column.
#### 1.1.2 (Option B) Start from AnnData Object
SCPreProcess natively supports AnnData object. You can use package `anndata` or `anndataR` to read in your AnnData object from `.h5ad` file.
First, import the data
```{r,scpreprocessing_anndata,eval=FALSE}
reticulate::use_pythonenv("The_path_to_your_python")
# reticulate::use_condaenv("conda_env_name")
anndata_obj <- anndata::read_h5ad("path_to_your_file.h5ad") # Or other file formats, make sure the matrix is in obj$X.
```
Or use `anndataR` to read in your file:
```{r,scpreprocessing_anndatar,eval=FALSE}
reticulate::use_pythonenv("The_path_to_your_python")
# reticulate::use_condaenv("conda_env_name")
anndata_obj <- anndataR::read_h5ad("path_to_your_file.h5ad") # basically no difference
```
Then just pass it to `SCPreProcess`:
```{r,scpreprocessing_in_anndata,eval=FALSE}
your_seurat <- SCPreProcess(
anndata_obj,
pipeline = "onsvpcetu", # Standard pipeline
column2only_tumor = "Tissue" # Optional: keep only tumor cells
)
```
The description of data (meta.data) in `anndata_obj$obs` will be add to `[email protected]`.
**helpful documentation:**
For the structure of `anndata`, you can refer to
- [AnnData for R](https://github.com/dynverse/anndata)
- [{anndataR}: An R package for working with AnnData objects](https://github.com/scverse/anndataR)
For more quality control, please use [scCustmoize](https://samuel-marsh.github.io/scCustomize/index.html), like `scCustomize::Add_Cell_QC_Metrics()`.
#### 1.1.8 (Optional) Filter Out Tumor Cells
If you already have a Seurat object and aim to filter out phenotype-associated cells in tumor cells. `SCPreProcess` will validate its structure and filter out tumor cells using the specified metadata column:
```{r,scpreprocessing_seurat,eval=FALSE}
your_seurat <- SCPreProcess(sc = your_seurat, column2only_tumor = "Tissue")
```
> Note: I don't recommend using columns like `column2only_tumor = "Celltype"` as tumor cell identities vary across tissues. instead:
>
> - Create a Dedicated Column: Add a new metadata column (e.g., is_tumor) to explicitly label cells:"Tumo(u)r"/"Normal"
>
> - Code Example:
>
> ```{r,choose_tumor_column,eval=FALSE}
> # For glioblastoma (GBM)
> seurat_obj$is_tumor <- ifelse(
> grepl("GBM|glioblastoma|astrocytoma_grade_IV", seurat_obj$Celltype, ignore.case = TRUE),
> "Tumor", # or "Tumour"
> "Normal" # or "Non-Tumor"
> )
> ```
### 1.2 Bulk expression data
Here are some methods for processing bulk RNA-seq gene expression data matrices.
#### 1.2.1 Evaluate the quality of your bulk RNA-seq data
`BulkPreProcess` performs comprehensive quality control on raw bulk RNA-seq count matrix data.
Key parameters for `BulkPreProcess`:
- `data`: Expression matrix with genes as rows and samples as columns, or a list containing count_matrix and sample_info.
- `sample_info`: Sample information data frame (optional), ignored if data is a list. A qualified `sample_info` should contain both `sample` and `condition` columns (case-sensitive), and there are no specific requirements for the data type stored in the `condition` column.
- `gene_symbol_conversion`: Whether to convert Ensembl version IDs and TCGA version IDs to genes with [SymbolConvert in Section 1.2.2](#122-gene-symbol-conversion), default TRUE.
- `check`: Whether to perform detailed quality checks, default TRUE.
- `min_count_threshold`: Minimum count threshold for gene filtering, default 10.
- `min_gene_expressed`: Minimum number of samples a gene must be expressed in, default 3.
- `min_total_reads`: Minimum total reads per sample, default 1e6.
- `min_genes_detected`: Minimum number of genes detected per sample, default 10000.
- `min_correlation`: Minimum correlation threshold between samples, default 0.8.
- `n_top_genes`: Number of top variable genes for PCA analysis, default 500.
- `show_plot_results`: Whether to generate PCA visualization plots, default TRUE.
- `verbose`: Whether to output detailed information, default TRUE.
Suppose you have bulk RNA-seq count data and sample information, and you want to perform comprehensive preprocessing and quality control. You can refer to and use the following code:
```{r,bulk_preprocess_example,eval=FALSE}
# Example usage of BulkPreProcess
filtered_counts <- BulkPreProcess(
data = your_count_matrix,
sample_info = your_sample_info,
gene_symbol_conversion = TRUE,
check = TRUE,
min_count_threshold = 10,
min_samples_expressed = 3,
min_total_reads = 1e6,
min_genes_detected = 10000,
min_correlation = 0.8,
n_top_genes = 500,
show_plot_results = TRUE,
verbose = TRUE
)
```
The function returns a filtered bulk count matrix after applying quality control steps and does not perform log2 transformation on the data.
##### Quality Control Metrics Reported
- Gene Filtering:
First, lowly expressed genes are removed, retaining genes that have an expression level reaching `min_count_threshold` in at least `min_gene_expressed` samples. This step is based on statistical power considerations, as lowly expressed genes are prone to false-positive results in differential analysis.
- Sample Filtering:
Subsequently, low-quality samples are filtered based on three core indicators:\
- Sequencing Depth: The total number of reads must reach `min_total_reads`.
- Library Complexity: The number of detected genes must exceed `min_genes_detected`.
- Sample Consistency: When quality checking is enabled, the average correlation between samples must meet the `min_correlation` threshold.
- Data Integrity Check: Identifies technical anomalies; too many missing values may indicate experimental issues.
- Feature Expression Filtering: Uses a count threshold rather than a proportion threshold to avoid over-penalizing small samples. Genes must be stably expressed in multiple samples to ensure the reliability of statistical tests.
- Sample-Level Filtering:
Comprehensively evaluates technical quality: \
Sequencing depth ensures detection sensitivity, the number of detected genes reflects library diversity, and sample correlation (Pearson) validates experimental reproducibility. PCA analysis further identifies batch effects and outlier samples.
- PCA Variance:
Variance explained by first two principal components. \
When `show_plot_results = TRUE`, the function generates PCA plot colored by experimental condition
- Batch Effects: Proportion of genes significantly affected by batch
##### Recommended Parameter Adjustments
Parameter settings need to balance data quality and information retention:
- For Low-Depth Sequencing Data:
Should relax `min_total_reads` and `min_genes_detected` due to technical limitations rather than quality issues causing low indicators.
- For Noisy Datasets:
Need to increase the `min_correlation` threshold to strengthen sample consistency requirements and exclude technical variation interference.
- For Large Datasets:
Consider setting `check = FALSE` to enable only partial filtering functions and speed up the process.
All parameters are set based on empirical values, and users should make appropriate adjustments according to the specific experimental design and sequencing platform characteristics.
#### 1.2.2 Gene Symbol Conversion
`SymbolConvert` performs a straightforward task: converting common gene identifiers (e.g., Ensemble IDs, Entrez) to standardized gene symbols by using the [IDConverter](https://github.com/ShixiangWang/IDConverter) package.
```{r,symbol_convert_example,eval=FALSE}
# genes * samples
your_bulk_data <- read.csv("path_to_your_file.csv", header = TRUE, row.names = 1)
your_bulk_data <- SymbolConvert(your_bulk_data)
```
You can also use other package like `org.Hs.eg.db` for gene symbol matching if you prefer not to use SymbolConvert's built-in `IDConverter`.
```{r,symbol_convert_example2,eval=FALSE}
library(org.Hs.eg.db)
your_bulk_data <- read.csv("path_to_your_file.csv", header = TRUE, row.names = 1)
your_bulk_data <- BulkPreProcess(your_bulk_data, gene_symbol_conversion = FALSE)
ensembl_ids <- sub("\\..*", "", rownames(your_bulk_data))
gene_symbols <- mapIds(org.Hs.eg.db,
keys = ensembl_ids,
column = "SYMBOL",
keytype = "ENSEMBL",
multiVals = "first")
rownames(your_bulk_data) <- gene_symbols
```
### 1.3 Phenotype Data
Basically you can just use your phenotype data directly. If you are confused about the structure `Screen()` requires, please refer to [Section 4](#example).
We provide some functions helping formatting and checking your phenotype data.
**Checking NA and report the position**:
```{r,NA_check_function,eval=FALSE}
# `max_print`: output how many NA location messages at one time in console if NA exists
CheckNA(your_phenotype_data, max_print = 5L)
```
```{r, Check_NA_example, eval=FALSE}
mat <- matrix(c(NA,1,1,NA),2, dimnames = list(c("Gene1","Gene2"),c("Sample1","Sample2")))
na_position <- CheckNA(mat)
# ! Found 2 NA values in data
# First 2 positions:
# Row 1 ("Gene1"), col 1 ("Sample1")
# Row 2 ("Gene2"), col 2 ("Sample2")
```
**In-place data transformation**
Use it just in tidyverse-like style
```{r,col_convert,eval=FALSE}
# ? if a vector
v <- 1:2000
v2 <- PhenoMap(v, v < 1000 ~ "0" ,v > 1000 ~ "1",.default = "here_is_1k")
table(v2)
# 0 1 here_is_1k
# 999 1000 1
v3 <- PhenoMap(v, v < 100 ~ 0, v < 1000 ~ 1, v > 1000 ~ 2, .default = NA_real_)
table(v3)
# 0 1 2
# 99 900 1000
# ? if a data
d <- mtcars
d2 <- PhenoMap(d, mpg > 15 ~ 1, mpg <= 15 ~ 0)
table(d2$mpg)
# 0 1
# 6 26
head(d2)
# mpg cyl disp hp drat wt qsec vs am gear carb
# <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
# 1: 1 6 160 110 3.90 2.620 16.46 0 1 4 4
# 2: 1 6 160 110 3.90 2.875 17.02 0 1 4 4
# 3: 1 4 108 93 3.85 2.320 18.61 1 1 4 1
# 4: 1 6 258 110 3.08 3.215 19.44 1 0 3 1
# 5: 1 8 360 175 3.15 3.440 17.02 0 0 3 2
# 6: 1 6 225 105 2.76 3.460 20.22 1 0 3 1
```
**Direct Interface for Foolproof Data Processing**
```{r,pheno_pre_process,eval=FALSE}
your_phenotype_data <- PhenoPreProcess(
bulk = your_bulk_data,
phenotype = your_phenotype_data,
phenotype_class = "survival",
status == "death" ~ 1,
status == "alive" ~ 0,
selelct = c("time","status")
)
```
Key parameters for `PhenoPreProcess`:
- `bulk`: Bulk RNA-seq expression data with genes as rows and samples as columns
- `phenotype`: Phenotype data (Named numeric vector/ data.frame)
- `phenotype_class`: Type of phenotype data, e.g., "binary", "survival", "continuous"
- `...`: pass to `PhenoMap`
- `select`: Columns to select from the phenotype data when it is a data.frame
------------------------------------------------------------------------
## 2. Screen Cells Associated with Phenotype
The function **`Screen`** provide 8 different options for screening cells associated with phenotype, These 8 algorithms come from the repositories mentioned in [6. References](#6-references), and you can choose one of them to screen your cells.
Key parameters for `Screen`:
- `matched_bulk`: A data frame of bulk expression data after intersecting samples. Make sure the rownames of `matched_bulk` is identical to `phenotype`.
- `sc_data`: A Seurat object after preprocessing, you can use the output of `Preprocess` function or your own preprocessed Seurat object.
- `phenotype`: A data frame or named vacor of phenotype data after intersecting samples. See [5. Example](#5-example) for more details.
- `label_type`: A character value specifying the filtering labels are stored in the `Seurat_object@misc` . Default: `NULL`, meaning the name of metho will be used.
- `phenotype_class`: A character value specifying the phenotype data type, i.e. `"binary"`, `"survival"` or `"continuous"`.
- `screen_method`: A character value specifying the screening method, i.e. "Scissor", "scPAS", "scAB", "scPP", "DEGAS", "LP_SGL", or "PIPET"
- `...`: Other parameters for the screening methods, see below
### 2.1 (Option A) Scissor Screening
Parameters pass to `...` when using `Scissor` method:
- `path2save_scissor_inputs`: A character value specifying the path to save intermediate data, you can also set `path2load_scissor_cache = NULL` to suppress the saving of intermediate files. Default: `Scissor_inputs.RData`
- `path2load_scissor_cahce`: A character value specifying the path to load intermediate data
- `alpha`: Parameter used to balance the effect of the l1 norm and the network-based penalties. It can be a number or a searching vector. If alpha = NULL, a default searching vector is used. The range of alpha is in `[0,1]`. A larger alpha lays more emphasis on the l1 norm.
- `cutoff`: Cutoff for the percentage of the Scissor selected cells in total cells. This parameter is used to restrict the number of the Scissor selected cells. A cutoff less than 50% (default 20%) is recommended depending on the input data. Only used when `alpha = NULL`.
- `reliability_test`: A logical value specifying whether to perform reliability test. Default: `FALSE`
- `reliability_test.n`: Permutation times (default: 10)
- `reliability_test.nfold`: The fold number in cross-validation (default: 10)
- `cell_evaluation`: A logical value specifying whether to perform cell evaluation. Default: `FALSE`
- `cell_evaluation.benchmark_data`: Path to benchmark data (RData file).
- `cell_evaluation.FDR`: FDR threshold for cell evaluation (default: 0.05).
- `cell_evaluation.bootstrap_n`: Number of bootstrap iterations for cell evaluation (default: 100).
**Usage**:
```{r,scissor_screening,eval=FALSE}
scissor_result = Screen(
matched_bulk = matched_bulk,
sc_data = sc_dataset, # A Seurat object after preprocessing
phenotype = matched_phenotype_data,
label_type = "TP53", # The filtering labels are stored in the `@misc`, you can change it to your own label
phenotype_class = "binary",
screen_method = c("Scissor"),
path2save_scissor_inputs = "Tmp/Scissor_inputs.RData" # Intermediate data
)
```
You can use the intermediate data for repeated runs. This is an inherent feature of the `Scissor`.
```{r,scissor_screening_cache,eval=FALSE}
scissor_result = Screen(
sc_data = sc_dataset,
label_type = "TP53",
phenotype_class = "binary",
screen_method = c("Scissor"),
path2load_scissor_cahe = "Tmp/Scissor_inputs.RData" # Intermediate data
)
```
If only the parameters `alpha` and `cutoff` are adjusted, this method can also be applied.
```{r,scissor_screening_param_adjusted,eval=FALSE}
# When `alpha = NULL`, an alpha iteration will continue until phenotype-associated cells are screened out or no cells are screened out even after exceeding the `cutoff`.
scissor_result = Screen(
sc_data = sc_dataset,
label_type = "TP53",
phenotype_class = "binary",
screen_method = c("Scissor"),
path2load_scissor_cahce = "Tmp/Scissor_inputs.RData", # Intermediate data
alpha = NULL,
cutoff = 0.2
)
```
**Returning structure**: A list containing:
- `scRNA_data`: A Seurat object after screening
- `scissor_result`: The result of Scissor screening, including the parameters used
- `reliability_test`: Reliability test results
- `cell_evaluation`: Cell evaluation results
**Cell level Evaluation & Reliability Test**:
You can use `cell_evalutaion = TRUE` and `reliability_test = TRUE` to obtain some supporting information for each Scissor selected cell. First, prepare a benchmark dataset yourself for cell evalutaion.
```{r,scissor_screening_cell_level_evaluation,eval=FALSE}
scissor_result = Screen(
sc_data = sc_dataset,
label_type = "TP53",
phenotype_class = "binary",
screen_method = c("Scissor"),
path2load_scissor_cahce = "Tmp/Scissor_inputs.RData", # Intermediate data
reliability_test = TRUE,
cell_evaluation = TRUE,
cell_evaluation.benchmark_data = "path_to_benchmark_data.RData",
alpha = NULL,
cutoff = 0.05
)
```
helpful documentation:
[Scissor-Cell Level Evaluations](https://sunduanchen.github.io/Scissor/vignettes/Scissor_Tutorial.html#cell-level-evaluations)
### 2.2 (Option B) scPAS Screening
Parameters pass to `...` when using `scPAS` method (basically adapted from the `scPAS`'s documentation):
- Parameters passed to `scPAS::scPAS()`
These parameters directly interface with the core `scPAS`() function from the original package:
- `assay`: Name of Assay to get.
- `imputation`: Logical. imputation or not.
- `imputation_method`: Character. Name of alternative method for imputation.
- `nfeature`: Numeric. The Number of features to select as top variable features in `sc_data`. Top variable features will be used to intersect with the features of `matched_bulk`. Default is NULL and all features will be used.
- `alpha`: Numeric. Parameter used to balance the effect of the l1 norm and the network-based penalties. It can be a number or a searching vector. If `alpha = NULL`, a default searching vector is used. The range of alpha is in `[0,1]`. A larger alpha lays more emphasis on the l1 norm.
- `cutoff`: Numeric. Cutoff for the percentage of the scPAS selected cells in total cells when `alpha = NULL`. This parameter is used to restrict the number of the scPAS selected cells. A cutoff less than 50% (default 20%) is recommended depending on the input data.
- `network_class`: The source of feature-feature similarity network. By default this is set to sc and the other one is bulk.
- `FDR_threshold`: Numeric. FDR value threshold for identifying phenotype-associated cells (default: 0.05)
- `independent`: Logical. The background distribution of risk scores is constructed independently of each cell. (default: TRUE)
- `permutation_times`: Number of permutations to perform (default: 2000)
**usage**:
```{r,scPAS_screening,eval=FALSE}
scpas_result = Screen(
matched_bulk = matched_bulk,
sc_data = A_Seurat_object,
phenotype = phenotype,
label_type = "TP53", # The filtering labels are stored in the `@misc`
screen_method = "scPAS",
phenotype_class = "binary",
assay = 'RNA',
imputation = FALSE,
imputation_method = c("KNN", "ALRA"),
nfeature = 3000L,
alpha = c(0.01, NULL),
cutoff = 0.2,
network_class = c("SC", "bulk"),
permutation_times = 2000L,
FDR_threshold = 0.05,
independent = TRUE
)
```
**returning structure**: A list containing:
- `scRNA_data`: A Seurat object after screening
- `stats`: A data.frame the significance of scPAS screening results
- `para`: A list containing the parameters used in scPAS screening
### 2.3 (Option C) scAB Screening
Parameters pass to `...` when using `scAB` method (basically adapted from the `scAB`'s documentation):
- `alpha`: Coefficient of phenotype regularization, default is `0.005`. When specified `NULL`, a default searching vector is used. A custom numeric vector is also supported.
- `alpha_2`: Coefficient of cell-cell similarity regularization, default is `0.005`. When specified `NULL`, a default searching vector is used. A custom numeric vector is also supported
- `maxiter`: Maximum number of iterations, default is `2000`
- `tred`: Threshold for early stopping, default is `2`
**usage**:
```{r,scAB_screening,eval=FALSE}
scab_result = Screen(
matched_bulk = your_matched_bulk,
sc_data = A_Seurat_object,
phenotype = your_matched_phenotype,
label_type = "TP53", # The filtering labels are stored in the `@misc`
screen_method = "scAB",
phenotype_class = "binary",
alpha = c(0.005, NULL),
alpha_2 = c(0.005, NULL),
maxiter = 2000L,
tred = 2L
)
```
**note**:
When both `alpha` and `alpha_2` are specified as `NULL` or as search vectors, the total number of searches equals the product of their lengths, which may lead to long runtimes. In such cases, we recommend enabling parallel computation.
```{r,scAB_screening_parallel,eval=FALSE}
setFuncOption(parallel = TRUE, parallel.type = 'multisession', workers = 4L)
```
Then simply run the function directly.
**returning structure**: A list containing:
- `scRNA_data`: A Seurat object after screening
- `scAB_result`: A list with the submatrix and loss value
### 2.4 (Option D) scPP Screening
Parameters pass to `...` when using `scPP` method :
- `ref_group`: The reference group for the binary analysis, default is `0`
- `Log2FC_cutoff`: The cutoff for the log2 fold change of the binary analysis, default is `0.585`
- `estimate_cutoff`: Effect size threshold for continuous traits, default is `0.2`
- `probs`: Quantile cutoff in (0, 0.5) for cell classification, default is `0.2`. When specified `NULL`, a default searching vector is used. A custom searching vector is also supported.
**usage**:
```{r,scPP_screening,eval=FALSE}
# This will take several hours
scpp_result = Screen(
matched_bulk = your_matched_bulk,
sc_data = A_Seurat_object,
phenotype = your_matched_phenotype,
label_type = "TP53", # The filtering labels are stored in the `@misc`
screen_method = "scpp",
phenotype_class = "binary",
ref_group = 0,
Log2FC_cutoff = 0.585,
estimate_cutoff = 0.2,
probs = c(0.2, NULL)
)
```
**returning structure**: A list containing:
- `scRNA_data`: A Seurat object after screening
- `gene_list`: A list containing positive genes and negative genes
### 2.5 (Option E) DEGAS Screening
Parameters pass to `...` when using `DEGAS` method
- `sc_data.pheno_colname`: The column name of the phenotype in the `[email protected]` slot, used to specify the phenotype for more accurate screening. Default is `NULL`.
- `tmp_dir`: The directory for storing the intermediate files. Default is `NULL`, a directory named `tmp` will be created.
- `env_params` : A list of parameters for the environment, default is `list()`. Use `?DoDEGAS` to see the details.
- `degas_params`: A list of parameters for the DEGAS algorithm, default is `list()`. Use `?DoDEGAS` to see the details.
- `normality_test_method`: Method for normality testing: `"jarque-bera", "d'agostino", or "kolmogorov-smirnov"`, default is "jarque-bera".
```{r,DEGAS_screening,eval=FALSE}
degas_result <- Screen(
matched_bulk = your_matched_bulk,
sc_data = A_Seurat_object,
phenotype = your_matched_phenotype,
label_type = "TP53", # The labels are stored in the `@misc` and are used to identify the screening results.
screen_method = "DEGAS",
# The type of phenotype
phenotype_class = c("binary", "continuous", "survival"),
# Environment parameters
env_params = list(
env.name = "r-reticulate-degas",
env.type = "conda",
# Environment.yml file will be used to create the conda environment in default, so other parameters can be ignored
env.method = "environment",
# The path of the environment.yml file
env.file = system.file(
"conda/DEGAS_environment.yml",
package = "SigBridgeR"
),
env.python_verion = "3.9.15",
env.packages = c(
"tensorflow" = "2.4.1",
"protobuf" = "3.20.3"
),
# Force re-creating the env
env.recreate = FALSE,
# Use conda-forge channel for env creating
env.use_conda_forge = TRUE,
# Message output when creating the env
env.verbose = FALSE
),
# DEGAS parameters
degas_params = list(
# DEGAS.model_type will be automatically determined by the `phenotype_class`
DEGAS.model_type = c(
"BlankClass", # only bulk level phenotype specified
"ClassBlank", # only single cell level phenotype specified
"ClassClass", # when both single cell level phenotype and bulk level phenotype specified
"ClassCox", # when both single cell level phenotype and bulk level survival data specified
"BlankCox" # only bulk level survival data specified
),
# DEGAS.architecture will be `DenseNet` in default
DEGAS.architecture = c(
"DenseNet", # a dense net network
"Standard" # a feed forward network
),
# The path to save intermediate data
path.data = '',
path.result = '',
# The python executable path of the conda environment, auto detected in default
DEGAS.pyloc = NULL,
# Some functions will be called in the DEGAS algorithm
DEGAS.toolsPath = paste0(.libPaths()[1], "/DEGAS/tools/"), # or `file.path(.libaPaths()[1], "SigBridgeR/DEGAS_tools/")`
# Screening parameters
DEGAS.ff_depth = 3,
DEGAS.bag_depth = 5,
DEGAS.train_steps = 2000,
DEGAS.scbatch_sz = 200,
DEGAS.patbatch_sz = 50,
DEGAS.hidden_feats = 50,
DEGAS.do_prc = 0.5,
DEGAS.lambda1 = 3.0,
DEGAS.lambda2 = 3.0,
DEGAS.lambda3 = 3.0,
DEGAS.seed = 2
),
# default: "jarque-bera".
normality_test_method = c(
"jarque-bera",
"d'agostino",
"kolmogorov-smirnov"
)
)
```
This code chunk will CREATE a conda environment called "**r-reticulate-degas**" with python 3.9.15, and install the required packages (tensorflow, protobuf) for DEGAS. If an environment with the same name already exists, it will directly use this environment without creating a new one (unless specified `env_params = list(env.recreate=TRUE)`).
To obtain the default parameters for DEGAS, you can use `SigBridgeR:::DEGASParamSet()`
```{r,get_default_params1,eval=FALSE}
degas_param <- SigBridgeR:::DEGASParamSet(list())
env_param <- SigBridgeR:::DEGASEnvSet(list())
```
You can use `ListPyEnvs()` to list all the python environments in your system, including virtual environments. Both Windows and Unix-like systems are supported. More information can be found in [https://wanglabcsu.github.io/SigBridgeR/articles/Other_Function_Details.html](https://wanglabcsu.github.io/SigBridgeR/articles/Other_Function_Details.html)
```{r,show_python_envs2,eval=FALSE}
# * On Unix-like system it goes like this
ListPyEnv()
# name python type
# 1 base /home/user/miniconda3/bin/python conda
# 2 r-reticulate-degas /home/user/miniconda3/envs/r-reticulate-degas/bin/python conda
# 3 test /home/user/.virtualenvs/test/bin/python venv
```
Please note that the environmental dependencies required for DEGAS to run are quite stringent, and conflicts are highly likely to occur.
**returning structure**: A list containing:
- `scRNA_data`: A Seurat object after screening
- `model`: A model trained using single-cell RNA expression matrix, tissue bulk RNA sequencing expression matrix, and phenotypic data.
- `DEGAS_prediction`: Using the model to conduct prediction for each cell, resulting in a data.frame where each phenotype has a predicted probability score.
### 2.6 (Option F) LP_SGL Screening
Parameters pass to `...` when using `LP_SGL` method
- `resolution`: Resolution parameter for Leiden clustering (default: `0.6`)
- `alpha`: Alpha parameter for SGL balancing L1 and L2 penalties (default: `0.5`)
- `nfold`: Number of folds for cross-validation (default: `5`)
- `dge_analysis`: List controlling differential expression analysis:
- `run`: Whether to run DEG analysis (default: `FALSE`)
- `logFC_threshold`: Log fold change threshold (default: `1`)
- `pval_threshold`: P-value threshold (default: `0.05`)
```{r,LP_SGL_screening,eval=FALSE}
lpsgl_result <- Screen(
matched_bulk = your_matched_bulk,
sc_data = A_Seurat_object,
phenotype = your_matched_phenotype,
label_type = "TP53",
resolution = 0.6,
alpha = 0.5,
nfold = 5,
dge_analysis = list(
run = FALSE, # whether to run DEG analysis
logFC_threshold = 1,
pval_threshold = 0.05
),
... # Additional parameters like verbose, seed
)
```
**returning structure**: A list containing:
- `scRNA_data`: A Seurat object after screening
- `sgl_fit` : Fitted SGL model object
- `cvfit` : Cross-validation results
- `dge_res` : Differential expression results if requested (NULL otherwise)
### 2.7 (Option G) PIPET Screening
Parameters pass to `...` when using `PIPET` method
- **Phenotype adaptation parameters**:
- `discretize_method`: Character: "kmeans"/"quantile"/"custom" (default: "kmeans")
- `cutoff`: Numeric vector for custom discretization when `discretize_method` is "custom" (default: NULL)
- **Marker generation parameters**:
- `log2FC`: Numeric: log2FC threshold (default: 1)
- `p.adjust`: Numeric: adjusted p-value threshold (default: 0.05)
- **Single-cell annotation parameters**:
- `distance`: Character: "cosine"/"pearson"/"spearman" (default: "cosine")
- `nPerm`: Integer: permutation times for statistical test (default: 1000L)
**Usage**:
```{R,PIPET_screening,eval=FALSE}
pipet_result = Screen(
matched_bulk = matched_bulk,
sc_data = sc_dataset, # A Seurat object after preprocessing
phenotype = matched_phenotype_data,
label_type = "TP53", # The filtering labels are stored in the `Seurat_object@misc`
phenotype_class = "binary", # `survival` is not supported
screen_method = "PIPET",
# PIPET specific parameters
lg2FC = 1,
p.adjust = 0.05,
distance = "cosine",
nPerm = 1000,
parallel = FALSE # Whether to use parallel computing, before using this parameter, please make sure that future::plan() has been called
)
```
**Returning structure**: A list containing:
- `scRNA_data`: A Seurat object after screening (with PIPET annotations in meta.data)
- `markers`: Phenotype-specific marker genes
### 2.8 (Option H) SIDISH Screening
Parameters passed to `...` when using `SIDISH` method:
- `verbose`: Logical. Whether to print verbose output during execution. Default is `TRUE`.
- `label_type`: Character specifying the phenotype label type stored in `sc_data@misc`. Default is `"SIDISH"`.
- `assay`: Seurat assay name to use for expression data. Default is `"RNA"`.
- `sidish_param`: List of parameters for SIDISH algorithm. See below for details.
- `env_params`: List of parameters for environment setup. See below for details.
```{r,SIDISH_screening,eval=FALSE}
sidish_result <- Screen(
matched_bulk = your_matched_bulk,
sc_data = A_Seurat_object,
phenotype = your_survival_phenotype, # Must contain "time" and "status" columns
label_type = "SIDISH", # Labels stored in `@misc` slot to identify screening results
screen_method = "SIDISH",
phenotype_class = "survival", # Currently only survival phenotype is supported
# SIDISH algorithm parameters
sidish_param = list(
# Preprocessing parameters
patient_id = "Sample",
celltype_name = "celltype_major",
processed = TRUE,
n_genes_by_counts = 5000L,
pct_counts_mt = 10L,
batch_correction = FALSE,
survival_ = "time", # Column name for survival time
status = "status", # Column name for event status
# Execution environment
device = "cuda", # "cuda" for GPU acceleration, "cpu" for CPU-only
use_spatial_graph = FALSE,
k_neighbors = NULL,
# Phase 1: VAE training parameters
phase1_epochs = 225L,
phase1_i_epochs = 20L,
phase1_latent_size = 32L,
phase1_layer_dims = c(512L, 128L),
phase1_batch_size = 256L,
phase1_optimizer = "Adam",
phase1_lr = 1e-4,
phase1_lr_3 = 1e-4,
phase1_dropout = 0L,
phase1_type = "Dense", # "Dense" or "Normal"
# Phase 2: Deep Cox training parameters
phase2_epochs = 500L,
phase2_hidden = 128L,
phase2_lr = 1e-4,
phase2_dropout = 0L,
phase2_test_size = 0.2,
phase2_batch_size_bulk = 256L,
# Training & risk definition
train_iterations = 5L,
train_percentile = 0.95,
train_steepness = 30L,
train_path = "./SIDISH_res/", # Path to save intermediate data
train_num_workers = 0L,
train_distribution_fit = "fitted" # "fitted" or "default"
),
# Python environment parameters
env_params = list(
env.name = "r-reticulate-sidish-nvidia", # Auto-selected based on device parameter
env.type = "conda",
env.method = "environment",
env.file = system.file(
"conda/SIDISH_nvidia_environment.yml",
package = "SigBridgeR"
),
env.python_verion = "3.12.12",
env.packages = c(
"numpy" = "1.26.4"
# Additional packages defined in environment.yml
),
env.recreate = FALSE, # Force re-creating environment if TRUE
env.use_conda_forge = TRUE,
env.verbose = FALSE
),
verbose = TRUE
)
```
This code chunk will CREATE a conda environment called **`r-reticulate-sidish-nvidia`** (for GPU) or **`r-reticulate-sidish-cpu`** (for CPU) with Python 3.12.12 and install all required dependencies for SIDISH. If an environment with the same name already exists, it will be reused without recreation (unless `env_params = list(env.recreate = TRUE)` is specified).
**Note**: SIDISH currently **only supports survival phenotypes** (`phenotype_class = "survival"`). The phenotype data frame must contain columns named `"time"` (survival time) and `"status"` (event indicator).
To obtain the default parameters for SIDISH, use `SigBridgeR:::SIDISHParamSet()` and `SigBridgeR:::SIDISHEnvSet()`:
```{r,get_default_params,eval=FALSE}
# Get default SIDISH algorithm parameters
sidish_default_params <- SigBridgeR:::SIDISHParamSet(list())
# Get default environment parameters (CPU version)
env_default_cpu <- SigBridgeR:::SIDISHEnvSet(list(), device = "cpu")
# Get default environment parameters (GPU version)
env_default_gpu <- SigBridgeR:::SIDISHEnvSet(list(), device = "cuda")
```
You can use `ListPyEnvs()` to list all Python environments available on your system (both conda and virtualenv):
```{r,show_python_envs,eval=FALSE}
ListPyEnv()
# name python type
# 1 base /home/user/miniconda3/bin/python conda
# 2 r-reticulate-sidish-nvidia /home/user/miniconda3/envs/r-reticulate-sidish-nvidia/bin/python conda
# 3 r-reticulate-sidish-cpu /home/user/miniconda3/envs/r-reticulate-sidish-cpu/bin/python conda
```
> **Important considerations**:
> - GPU acceleration requires CUDA-compatible hardware. If GPU is not detected but `device = "cuda"` is specified, the function will abort with an error message. Set `sidish_param = list(device = "cpu")` to run on CPU. And of course the python environment must be recreated.
**Return structure**: A named list containing:
- `scRNA_data`: A Seurat object with SIDISH screening results integrated into the `@misc` and `@meta.data` slots under the specified `label_type`. The object contains cell-level risk scores and survival-related annotations generated by the SIDISH algorithm.
### 2.9 (Option I) SCIPAC Screening
Parameters pass to `...` when using `PIPET` method
- `hvg`: Integer. Number of highly variable genes to use for preprocessing. Default is `1000L`.
- `do_pca_sc`: Logical. Whether to perform PCA on single-cell data and apply the rotation matrix to bulk data; if FALSE, PCA is performed on bulk data and applied to single-cell data. Default is `FALSE`.
- `n_pc`: Integer. Number of principal components to use. Default is `60L`.
- `sc_batch_col`: Character or vector. Batch variable for single-cell data. Default is `NULL`.
- `resolution`: Integer. Clustering resolution for cell type identification. Default is `2L`.
- `ela_net_alpha`: Numeric. Elastic net mixing parameter (0 = ridge, 1 = lasso). Default is `0.4`.
- `bt_size`: Integer. Bootstrap sample size for stability assessment. Default is `50L`.
- `ncore`: Integer. Number of CPU cores for parallel computation. Default is `7L`.
- `ci_alpha`: Numeric. Significance level for confidence intervals. Default is `0.05`.
- `nfold`: Integer. Number of folds for cross-validation for regression models. Default is `10L`.
- `...`: Additional arguments. Supports `assay` (character), `verbose` (logical), and `seed` (integer).
```{r,SCIPAC_screening,eval=FALSE}
scipac_result <- Screen(
matched_bulk = your_matched_bulk,
sc_data = A_Seurat_object,
phenotype = your_matched_phenotype,
label_type = "TP53",
hvg = 1000L,
do_pca_sc = FALSE,
n_pc = 60L,
sc_batch_col = NULL,
resolution = 2L,
ela_net_alpha = 0.4,
bt_size = 50L,
ncore = 7L,
ci_alpha = 0.05,
nfold = 10L,