SigBridgeR/vignettes/Full_Tutorial.Rmd at 90d54ce29fe916eeabd61a22dcdd6d1b0377f67c · WangLabCSU/SigBridgeR

History

2192 lines (1775 loc) · 85.1 KB

Raw

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

178

179

180

181

182

183

184

185

186

187

188

189

190

191

192

193

194

195

196

197

198

199

200

201

202

203

204

205

206

207

208

209

210

211

212

213

214

215

216

217

218

219

220

221

222

223

224

225

226

227

228

229

230

231

232

233

234

235

236

237

238

239

240

241

242

243

244

245

246

247

248

249

250

251

252

253

254

255

256

257

258

259

260

261

262

263

264

265

266

267

268

269

270

271

272

273

274

275

276

277

278

279

280

281

282

283

284

285

286

287

288

289

290

291

292

293

294

295

296

297

298

299

300

301

302

303

304

305

306

307

308

309

310

311

312

313

314

315

316

317

318

319

320

321

322

323

324

325

326

327

328

329

330

331

332

333

334

335

336

337

338

339

340

341

342

343

344

345

346

347

348

349

350

351

352

353

354

355

356

357

358

359

360

361

362

363

364

365

366

367

368

369

370

371

372

373

374

375

376

377

378

379

380

381

382

383

384

385

386

387

388

389

390

391

392

393

394

395

396

397

398

399

400

401

402

403

404

405

406

407

408

409

410

411

412

413

414

415

416

417

418

419

420

421

422

423

424

425

426

427

428

429

430

431

432

433

434

435

436

437

438

439

440

441

442

443

444

445

446

447

448

449

450

451

452

453

454

455

456

457

458

459

460

461

462

463

464

465

466

467

468

469

470

471

472

473

474

475

476

477

478

479

480

481

482

483

484

485

486

487

488

489

490

491

492

493

494

495

496

497

498

499

500

501

502

503

504

505

506

507

508

509

510

511

512

513

514

515

516

517

518

519

520

521

522

523

524

525

526

527

528

529

530

531

532

533

534

535

536

537

538

539

540

541

542

543

544

545

546

547

548

549

550

551

552

553

554

555

556

557

558

559

560

561

562

563

564

565

566

567

568

569

570

571

572

573

574

575

576

577

578

579

580

581

582

583

584

585

586

587

588

589

590

591

592

593

594

595

596

597

598

599

600

601

602

603

604

605

606

607

608

609

610

611

612

613

614

615

616

617

618

619

620

621

622

623

624

625

626

627

628

629

630

631

632

633

634

635

636

637

638

639

640

641

642

643

644

645

646

647

648

649

650

651

652

653

654

655

656

657

658

659

660

661

662

663

664

665

666

667

668

669

670

671

672

673

674

675

676

677

678

679

680

681

682

683

684

685

686

687

688

689

690

691

692

693

694

695

696

697

698

699

700

701

702

703

704

705

706

707

708

709

710

711

712

713

714

715

716

717

718

719

720

721

722

723

724

725

726

727

728

729

730

731

732

733

734

735

736

737

738

739

740

741

742

743

744

745

746

747

748

749

750

751

752

753

754

755

756

757

758

759

760

761

762

763

764

765

766

767

768

769

770

771

772

773

774

775

776

777

778

779

780

781

782

783

784

785

786

787

788

789

790

791

792

793

794

795

796

797

798

799

800

801

802

803

804

805

806

807

808

809

810

811

812

813

814

815

816

817

818

819

820

821

822

823

824

825

826

827

828

829

830

831

832

833

834

835

836

837

838

839

840

841

842

843

844

845

846

847

848

849

850

851

852

853

854

855

856

857

858

859

860

861

862

863

864

865

866

867

868

869

870

871

872

873

874

875

876

877

878

879

880

881

882

883

884

885

886

887

888

889

890

891

892

893

894

895

896

897

898

899

900

901

902

903

904

905

906

907

908

909

910

911

912

913

914

915

916

917

918

919

920

921

922

923

924

925

926

927

928

929

930

931

932

933

934

935

936

937

938

939

940

941

942

943

944

945

946

947

948

949

950

951

952

953

954

955

956

957

958

959

960

961

962

963

964

965

966

967

968

969

970

971

972

973

974

975

976

977

978

979

980

981

982

983

984

985

986

987

988

989

990

991

992

993

994

995

996

997

998

999

1000

---

title: "Full Tutorial"

date: '`r Sys.Date()`'

output:

html_document

knit: knitted

vignette: >

%\VignetteIndexEntry{Full Tutorial}

%\VignetteEngine{knitr::rmarkdown}

%\VignetteEncoding{UTF-8}

---

# Full Tutorial for SigBridgeR

## 0. Preface

### 0.1 Contents

- [Full Tutorial for SigBridgeR](#full-tutorial-for-sigbridger)

- [0. Preface](##0-preface)

- [0.1 Contents](###01-contents)

- [0.2 Introduction to SigBridgeR](###02-introduction-to-sigbridger)

- [1. Load and Preprocess data](##1-load-and-preprocess-data)

- [1.1 Single-cell RNA-seq Data](###11-single-cell-rna-seq-data)

- [1.1.1 (Option A) Start from Raw Matrix](###111-option-a-start-from-raw-matrix)

- [1.1.2 (Option B) Start from AnnData Object](####112-option-b-start-from-anndata-object)

- [1.1.8 (Optional) Filter Out Tumor Cells](####118-optional-filter-out-tumor-cells)

- [1.2 Bulk expression data](###12-bulk-expression-data)

- [1.2.1 Evaluate the quality of your bulk RNA-seq data](####121-evaluate-the-quality-of-your-bulk-rna-seq-data)

- [Quality Control Metrics Reported](####quality-control-metrics-reported)

- [Recommended Parameter Adjustments](####recommended-parameter-adjustments)

- [1.2.2 Gene Symbol Conversion](###122-gene-symbol-conversion)

- [1.3 Phenotype Data](##13-phenotype-data)

- [2. Screen Cells Associated with Phenotype](##2-screen-cells-associated-with-phenotype)

- [2.1 (Option A) Scissor Screening](###21-option-a-scissor-screening)

- [2.2 (Option B) scPAS Screening](###22-option-b-scpas-screening)

- [2.3 (Option C) scAB Screening](###23-option-c-scab-screening)

- [2.4 (Option D) scPP Screening](###24-option-d-scpp-screening)

- [2.5 (Option E) DEGAS Screening](###25-option-e-degas-screening)

- [2.6 (Option F) LP_SGL Screening](###26-option-f-lp_sgl-screening)

- [2.7 (Option G) PIPET Screening](###27-option-g-pipet-screening)

- [2.8 (Option H) SIDISH Screening](###28-option-h-sidish-screening)

- [2.9 (Option I) SCIPAC Screening](###29-option-i-scipac_screening)

- [2.F Merge screening results](###2f-merge-screening-results)

- [3. Visualization](##3-visualization)

- [3.1 UMAP for screening results](###31-umap-for-screening-results)

- [3.2 Stack bar plot for screening results](###32-stack-bar-plot-for-screening-results)

- [3.3 Venn diagram for screening results](###33-venn-diagram-for-screening-results)

- [3.4 Upset plot for screening results](###34-upset-plot-for-screening-results)

- [4. Example](##4-example)

- [4.1 Survival-associated cell screening](###41-survival-associated-cell-screening)

- [4.2 Continuous Phenotype-associated cell screening](###42-continuous-phenotype-associated-cell-screening)

- [4.3 Binarized phenotype-associated cell screening](###43-binarized-phenotype-associated-cell-screening)

- [5. Troubleshooting](##5-troubleshooting)

- [6. References](##6-references)

### 0.2 Introduction to SigBridgeR

SigBridgeR (short for **Sig**nificant cell-to-phenotype **Bridge** in **R**) is an R package for screening cells highly associated with phenotype data using single-cell RNA-seq, bulk RNA expression and sample related phenotype data (e.g. patient survival, age, etc). It integrates many single cell phenotypic screening methods ([8. References](#8-references)) and provides unified preprocessing, parameter tuning, and visualization approaches,performing as a unified integration panel.

------------------------------------------------------------------------

## 1. Load and Preprocess data

First load the packages, you will see a version number message indicating successful loading:

```{r,load_package,eval=FALSE}

library(SigBridgeR)

# ✔ SigBridgeR v3.x.x loaded

```

### 1.1 Single-cell RNA-seq Data

You can use function `SCPreProcess` to preprocess your single-cell RNA-seq data. This function utilizes a flexible pipeline system, allowing you to customize the analysis workflow using character codes:

**Pipeline Code Table:**

| Code | Function | Description |

| --- | --- | --- |

| **o** | `CreateSeuratObject` | **Required.** Must be the first step. |

| **n** | `NormalizeData` | Standard normalization. |

| **s** | `ScaleData` | Scales data for PCA. |

| **v** | `FindVariableFeatures` | Selects highly variable genes. |

| **p** | `RunPCA` | Principal Component Analysis. |

| **e** | `FindNeighbors` | Computes SNN graph. |

| **c** | `FindClusters` | Louvain algorithm clustering. |

| **t** | `RunTSNE` | t-SNE reduction. |

| **u** | `RunUMAP` | UMAP reduction. |

| **r** | `SCTransform` | **SCT workflow.** Replaces n, s, v. |

#### 1.1.1 (Option A) Start from Raw Matrix

When starting from a raw count matrix (data.frame, matrix or dgCMatrix), `SCPreProcess` executes the steps defined in the pipeline argument (Default: `"onsvpcetu"`). You can customize parameters for each step via the `params` list.

```{r,scpreprocessing_raw_matrix,eval=FALSE}

your_seurat <- SCPreProcess(

sc = your_matrix,

...,

pipeline = "onsvpcetu",

params = list(

# * CreateSeuratObject

o = list(

project = "SC_Screen_Proj",

min.cells = 400L

# * NormalizeData

n = list(),

# * ScaleData

s = list(),

# * FindVariableFeatures

v = list(),

# * RunPCA

p = list(),

# * FindNeightbors

e = list(),

# * FindClusters

c = list(

resolution = 0.6

# * RunTSNE

t = list(),

# * RunUMAP

u = list()

# * SCTransform

# r = list()

quality_control = list(

pattern = c("^MT-")

data_filter = list(

nFeature_RNA_thresh = c(200L, 6000L),

nCount_RNA_thresh = c(500L, 50000L),

# * only used when specifed in `quality_control.pattern`

percent.mt = 20L, # mitochondrial genes

percent.rp = 60L # ribosomal protein genes

# ? When combined pattern is used, like `quality_control$pattern <- "^MT-|^RP[LS]"`

# ? Use `_` to separate different patterns like this:

# percent.mt_rp = 60L

# ? When filtering for non-mitochondrial genes and non-ribosomal proteins RNA genes,

# ? the column names are in lowercase letter form with regular expression symbols removed.

# `quality_control$pattern <- "^[rt]rna"`

# Correct threshhold setting is `percent.rt_rna = 60L`

# ? Use `SigBridgeR::Pattern2Colname()` to get the correct colname if still confused.

column2only_tumor = NULL

)

```

**Key Workflow Steps:**

1. Pipeline Execution: The function runs Seurat methods in the order specified by the `pipeline` string (e.g., 'o' then 'n').

2. QC & Filtering:

- Detection: Regex patterns in `quality_control` (e.g., `^MT-`) automatically generate metadata columns (e.g.,` percent.mt`).

- Filtering: Cells are filtered based on `data_filter` thresholds. Note: Custom patterns generate specific column names (e.g., `^[rt]rna` becomes `percent.rt_rna`).

3. Tumor Filtering: If `column2only_tumor` is provided, the function retains only cells with `"Tumor/Cancer/Malignant"`(case-insensitive) labels in that metadata column.

#### 1.1.2 (Option B) Start from AnnData Object

SCPreProcess natively supports AnnData object. You can use package `anndata` or `anndataR` to read in your AnnData object from `.h5ad` file.

First, import the data

```{r,scpreprocessing_anndata,eval=FALSE}

reticulate::use_pythonenv("The_path_to_your_python")

# reticulate::use_condaenv("conda_env_name")

anndata_obj <- anndata::read_h5ad("path_to_your_file.h5ad") # Or other file formats, make sure the matrix is in obj$X.

```

Or use `anndataR` to read in your file:

```{r,scpreprocessing_anndatar,eval=FALSE}

reticulate::use_pythonenv("The_path_to_your_python")

# reticulate::use_condaenv("conda_env_name")

anndata_obj <- anndataR::read_h5ad("path_to_your_file.h5ad") # basically no difference

```

Then just pass it to `SCPreProcess`:

```{r,scpreprocessing_in_anndata,eval=FALSE}

your_seurat <- SCPreProcess(

anndata_obj,

pipeline = "onsvpcetu", # Standard pipeline

column2only_tumor = "Tissue" # Optional: keep only tumor cells

)

```

The description of data (meta.data) in `anndata_obj$obs` will be add to `[email protected]`.

**helpful documentation:**

For the structure of `anndata`, you can refer to

- [AnnData for R](https://github.com/dynverse/anndata)

- [{anndataR}: An R package for working with AnnData objects](https://github.com/scverse/anndataR)

For more quality control, please use [scCustmoize](https://samuel-marsh.github.io/scCustomize/index.html), like `scCustomize::Add_Cell_QC_Metrics()`.

#### 1.1.8 (Optional) Filter Out Tumor Cells

If you already have a Seurat object and aim to filter out phenotype-associated cells in tumor cells. `SCPreProcess` will validate its structure and filter out tumor cells using the specified metadata column:

```{r,scpreprocessing_seurat,eval=FALSE}

your_seurat <- SCPreProcess(sc = your_seurat, column2only_tumor = "Tissue")

```

> Note: I don't recommend using columns like `column2only_tumor = "Celltype"` as tumor cell identities vary across tissues. instead:

> - Create a Dedicated Column: Add a new metadata column (e.g., is_tumor) to explicitly label cells:"Tumo(u)r"/"Normal"

> - Code Example:

> ```{r,choose_tumor_column,eval=FALSE}

> # For glioblastoma (GBM)

> seurat_obj$is_tumor <- ifelse(

> grepl("GBM|glioblastoma|astrocytoma_grade_IV", seurat_obj$Celltype, ignore.case = TRUE),

> "Tumor", # or "Tumour"

> "Normal" # or "Non-Tumor"

> )

> ```

### 1.2 Bulk expression data

Here are some methods for processing bulk RNA-seq gene expression data matrices.

#### 1.2.1 Evaluate the quality of your bulk RNA-seq data

`BulkPreProcess` performs comprehensive quality control on raw bulk RNA-seq count matrix data.

Key parameters for `BulkPreProcess`:

- `data`: Expression matrix with genes as rows and samples as columns, or a list containing count_matrix and sample_info.

- `sample_info`: Sample information data frame (optional), ignored if data is a list. A qualified `sample_info` should contain both `sample` and `condition` columns (case-sensitive), and there are no specific requirements for the data type stored in the `condition` column.

- `gene_symbol_conversion`: Whether to convert Ensembl version IDs and TCGA version IDs to genes with [SymbolConvert in Section 1.2.2](#122-gene-symbol-conversion), default TRUE.

- `check`: Whether to perform detailed quality checks, default TRUE.

- `min_count_threshold`: Minimum count threshold for gene filtering, default 10.

- `min_gene_expressed`: Minimum number of samples a gene must be expressed in, default 3.

- `min_total_reads`: Minimum total reads per sample, default 1e6.

- `min_genes_detected`: Minimum number of genes detected per sample, default 10000.

- `min_correlation`: Minimum correlation threshold between samples, default 0.8.

- `n_top_genes`: Number of top variable genes for PCA analysis, default 500.

- `show_plot_results`: Whether to generate PCA visualization plots, default TRUE.

- `verbose`: Whether to output detailed information, default TRUE.

Suppose you have bulk RNA-seq count data and sample information, and you want to perform comprehensive preprocessing and quality control. You can refer to and use the following code:

```{r,bulk_preprocess_example,eval=FALSE}

# Example usage of BulkPreProcess

filtered_counts <- BulkPreProcess(

data = your_count_matrix,

sample_info = your_sample_info,

gene_symbol_conversion = TRUE,

check = TRUE,

min_count_threshold = 10,

min_samples_expressed = 3,

min_total_reads = 1e6,

min_genes_detected = 10000,

min_correlation = 0.8,

n_top_genes = 500,

show_plot_results = TRUE,

verbose = TRUE

)

```

The function returns a filtered bulk count matrix after applying quality control steps and does not perform log2 transformation on the data.

##### Quality Control Metrics Reported

- Gene Filtering:

First, lowly expressed genes are removed, retaining genes that have an expression level reaching `min_count_threshold` in at least `min_gene_expressed` samples. This step is based on statistical power considerations, as lowly expressed genes are prone to false-positive results in differential analysis.

- Sample Filtering:

Subsequently, low-quality samples are filtered based on three core indicators:\

- Sequencing Depth: The total number of reads must reach `min_total_reads`.

- Library Complexity: The number of detected genes must exceed `min_genes_detected`.

- Sample Consistency: When quality checking is enabled, the average correlation between samples must meet the `min_correlation` threshold.

- Data Integrity Check: Identifies technical anomalies; too many missing values may indicate experimental issues.

- Feature Expression Filtering: Uses a count threshold rather than a proportion threshold to avoid over-penalizing small samples. Genes must be stably expressed in multiple samples to ensure the reliability of statistical tests.

- Sample-Level Filtering:

Comprehensively evaluates technical quality: \

Sequencing depth ensures detection sensitivity, the number of detected genes reflects library diversity, and sample correlation (Pearson) validates experimental reproducibility. PCA analysis further identifies batch effects and outlier samples.

- PCA Variance:

Variance explained by first two principal components. \

When `show_plot_results = TRUE`, the function generates PCA plot colored by experimental condition

- Batch Effects: Proportion of genes significantly affected by batch

##### Recommended Parameter Adjustments

Parameter settings need to balance data quality and information retention:

- For Low-Depth Sequencing Data:

Should relax `min_total_reads` and `min_genes_detected` due to technical limitations rather than quality issues causing low indicators.

- For Noisy Datasets:

Need to increase the `min_correlation` threshold to strengthen sample consistency requirements and exclude technical variation interference.

- For Large Datasets:

Consider setting `check = FALSE` to enable only partial filtering functions and speed up the process.

All parameters are set based on empirical values, and users should make appropriate adjustments according to the specific experimental design and sequencing platform characteristics.

#### 1.2.2 Gene Symbol Conversion

`SymbolConvert` performs a straightforward task: converting common gene identifiers (e.g., Ensemble IDs, Entrez) to standardized gene symbols by using the [IDConverter](https://github.com/ShixiangWang/IDConverter) package.

```{r,symbol_convert_example,eval=FALSE}

# genes * samples

your_bulk_data <- read.csv("path_to_your_file.csv", header = TRUE, row.names = 1)

your_bulk_data <- SymbolConvert(your_bulk_data)

```

You can also use other package like `org.Hs.eg.db` for gene symbol matching if you prefer not to use SymbolConvert's built-in `IDConverter`.

```{r,symbol_convert_example2,eval=FALSE}

library(org.Hs.eg.db)

your_bulk_data <- read.csv("path_to_your_file.csv", header = TRUE, row.names = 1)

your_bulk_data <- BulkPreProcess(your_bulk_data, gene_symbol_conversion = FALSE)

ensembl_ids <- sub("\\..*", "", rownames(your_bulk_data))

gene_symbols <- mapIds(org.Hs.eg.db,

keys = ensembl_ids,

column = "SYMBOL",

keytype = "ENSEMBL",

multiVals = "first")

rownames(your_bulk_data) <- gene_symbols

```

### 1.3 Phenotype Data

Basically you can just use your phenotype data directly. If you are confused about the structure `Screen()` requires, please refer to [Section 4](#example).

We provide some functions helping formatting and checking your phenotype data.

**Checking NA and report the position**:

```{r,NA_check_function,eval=FALSE}

# `max_print`: output how many NA location messages at one time in console if NA exists

CheckNA(your_phenotype_data, max_print = 5L)

```

```{r, Check_NA_example, eval=FALSE}

mat <- matrix(c(NA,1,1,NA),2, dimnames = list(c("Gene1","Gene2"),c("Sample1","Sample2")))

na_position <- CheckNA(mat)

# ! Found 2 NA values in data

# First 2 positions:

# Row 1 ("Gene1"), col 1 ("Sample1")

# Row 2 ("Gene2"), col 2 ("Sample2")

```

**In-place data transformation**

Use it just in tidyverse-like style

```{r,col_convert,eval=FALSE}

# ? if a vector

v <- 1:2000

v2 <- PhenoMap(v, v < 1000 ~ "0" ,v > 1000 ~ "1",.default = "here_is_1k")

table(v2)

# 0 1 here_is_1k

# 999 1000 1

v3 <- PhenoMap(v, v < 100 ~ 0, v < 1000 ~ 1, v > 1000 ~ 2, .default = NA_real_)

table(v3)

# 0 1 2

# 99 900 1000

# ? if a data

d <- mtcars

d2 <- PhenoMap(d, mpg > 15 ~ 1, mpg <= 15 ~ 0)

table(d2$mpg)

# 0 1

# 6 26

head(d2)

# mpg cyl disp hp drat wt qsec vs am gear carb

# <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>

# 1: 1 6 160 110 3.90 2.620 16.46 0 1 4 4

# 2: 1 6 160 110 3.90 2.875 17.02 0 1 4 4

# 3: 1 4 108 93 3.85 2.320 18.61 1 1 4 1

# 4: 1 6 258 110 3.08 3.215 19.44 1 0 3 1

# 5: 1 8 360 175 3.15 3.440 17.02 0 0 3 2

# 6: 1 6 225 105 2.76 3.460 20.22 1 0 3 1

```

**Direct Interface for Foolproof Data Processing**

```{r,pheno_pre_process,eval=FALSE}

your_phenotype_data <- PhenoPreProcess(

bulk = your_bulk_data,

phenotype = your_phenotype_data,

phenotype_class = "survival",

status == "death" ~ 1,

status == "alive" ~ 0,

selelct = c("time","status")

)

```

Key parameters for `PhenoPreProcess`:

- `bulk`: Bulk RNA-seq expression data with genes as rows and samples as columns

- `phenotype`: Phenotype data (Named numeric vector/ data.frame)

- `phenotype_class`: Type of phenotype data, e.g., "binary", "survival", "continuous"

- `...`: pass to `PhenoMap`

- `select`: Columns to select from the phenotype data when it is a data.frame

------------------------------------------------------------------------

## 2. Screen Cells Associated with Phenotype

The function **`Screen`** provide 8 different options for screening cells associated with phenotype, These 8 algorithms come from the repositories mentioned in [6. References](#6-references), and you can choose one of them to screen your cells.

Key parameters for `Screen`:

- `matched_bulk`: A data frame of bulk expression data after intersecting samples. Make sure the rownames of `matched_bulk` is identical to `phenotype`.

- `sc_data`: A Seurat object after preprocessing, you can use the output of `Preprocess` function or your own preprocessed Seurat object.

- `phenotype`: A data frame or named vacor of phenotype data after intersecting samples. See [5. Example](#5-example) for more details.

- `label_type`: A character value specifying the filtering labels are stored in the `Seurat_object@misc` . Default: `NULL`, meaning the name of metho will be used.

- `phenotype_class`: A character value specifying the phenotype data type, i.e. `"binary"`, `"survival"` or `"continuous"`.

- `screen_method`: A character value specifying the screening method, i.e. "Scissor", "scPAS", "scAB", "scPP", "DEGAS", "LP_SGL", or "PIPET"

- `...`: Other parameters for the screening methods, see below

### 2.1 (Option A) Scissor Screening

Parameters pass to `...` when using `Scissor` method:

- `path2save_scissor_inputs`: A character value specifying the path to save intermediate data, you can also set `path2load_scissor_cache = NULL` to suppress the saving of intermediate files. Default: `Scissor_inputs.RData`

- `path2load_scissor_cahce`: A character value specifying the path to load intermediate data

- `alpha`: Parameter used to balance the effect of the l1 norm and the network-based penalties. It can be a number or a searching vector. If alpha = NULL, a default searching vector is used. The range of alpha is in `[0,1]`. A larger alpha lays more emphasis on the l1 norm.

- `cutoff`: Cutoff for the percentage of the Scissor selected cells in total cells. This parameter is used to restrict the number of the Scissor selected cells. A cutoff less than 50% (default 20%) is recommended depending on the input data. Only used when `alpha = NULL`.

- `reliability_test`: A logical value specifying whether to perform reliability test. Default: `FALSE`

- `reliability_test.n`: Permutation times (default: 10)

- `reliability_test.nfold`: The fold number in cross-validation (default: 10)

- `cell_evaluation`: A logical value specifying whether to perform cell evaluation. Default: `FALSE`

- `cell_evaluation.benchmark_data`: Path to benchmark data (RData file).

- `cell_evaluation.FDR`: FDR threshold for cell evaluation (default: 0.05).

- `cell_evaluation.bootstrap_n`: Number of bootstrap iterations for cell evaluation (default: 100).

**Usage**:

```{r,scissor_screening,eval=FALSE}

scissor_result = Screen(

matched_bulk = matched_bulk,

sc_data = sc_dataset, # A Seurat object after preprocessing

phenotype = matched_phenotype_data,

label_type = "TP53", # The filtering labels are stored in the `@misc`, you can change it to your own label

phenotype_class = "binary",

screen_method = c("Scissor"),

path2save_scissor_inputs = "Tmp/Scissor_inputs.RData" # Intermediate data

)

```

You can use the intermediate data for repeated runs. This is an inherent feature of the `Scissor`.

```{r,scissor_screening_cache,eval=FALSE}

scissor_result = Screen(

sc_data = sc_dataset,

label_type = "TP53",

phenotype_class = "binary",

screen_method = c("Scissor"),

path2load_scissor_cahe = "Tmp/Scissor_inputs.RData" # Intermediate data

)

```

If only the parameters `alpha` and `cutoff` are adjusted, this method can also be applied.

```{r,scissor_screening_param_adjusted,eval=FALSE}

# When `alpha = NULL`, an alpha iteration will continue until phenotype-associated cells are screened out or no cells are screened out even after exceeding the `cutoff`.

scissor_result = Screen(

sc_data = sc_dataset,

label_type = "TP53",

phenotype_class = "binary",

screen_method = c("Scissor"),

path2load_scissor_cahce = "Tmp/Scissor_inputs.RData", # Intermediate data

alpha = NULL,

cutoff = 0.2

)

```

**Returning structure**: A list containing:

- `scRNA_data`: A Seurat object after screening

- `scissor_result`: The result of Scissor screening, including the parameters used

- `reliability_test`: Reliability test results

- `cell_evaluation`: Cell evaluation results

**Cell level Evaluation & Reliability Test**:

You can use `cell_evalutaion = TRUE` and `reliability_test = TRUE` to obtain some supporting information for each Scissor selected cell. First, prepare a benchmark dataset yourself for cell evalutaion.

```{r,scissor_screening_cell_level_evaluation,eval=FALSE}

scissor_result = Screen(

sc_data = sc_dataset,

label_type = "TP53",

phenotype_class = "binary",

screen_method = c("Scissor"),

path2load_scissor_cahce = "Tmp/Scissor_inputs.RData", # Intermediate data

reliability_test = TRUE,

cell_evaluation = TRUE,

cell_evaluation.benchmark_data = "path_to_benchmark_data.RData",

alpha = NULL,

cutoff = 0.05

)

```

helpful documentation:

[Scissor-Cell Level Evaluations](https://sunduanchen.github.io/Scissor/vignettes/Scissor_Tutorial.html#cell-level-evaluations)

### 2.2 (Option B) scPAS Screening

Parameters pass to `...` when using `scPAS` method (basically adapted from the `scPAS`'s documentation):

- Parameters passed to `scPAS::scPAS()`

These parameters directly interface with the core `scPAS`() function from the original package:

- `assay`: Name of Assay to get.

- `imputation`: Logical. imputation or not.

- `imputation_method`: Character. Name of alternative method for imputation.

- `nfeature`: Numeric. The Number of features to select as top variable features in `sc_data`. Top variable features will be used to intersect with the features of `matched_bulk`. Default is NULL and all features will be used.

- `alpha`: Numeric. Parameter used to balance the effect of the l1 norm and the network-based penalties. It can be a number or a searching vector. If `alpha = NULL`, a default searching vector is used. The range of alpha is in `[0,1]`. A larger alpha lays more emphasis on the l1 norm.

- `cutoff`: Numeric. Cutoff for the percentage of the scPAS selected cells in total cells when `alpha = NULL`. This parameter is used to restrict the number of the scPAS selected cells. A cutoff less than 50% (default 20%) is recommended depending on the input data.

- `network_class`: The source of feature-feature similarity network. By default this is set to sc and the other one is bulk.

- `FDR_threshold`: Numeric. FDR value threshold for identifying phenotype-associated cells (default: 0.05)

- `independent`: Logical. The background distribution of risk scores is constructed independently of each cell. (default: TRUE)

- `permutation_times`: Number of permutations to perform (default: 2000)

**usage**:

```{r,scPAS_screening,eval=FALSE}

scpas_result = Screen(

matched_bulk = matched_bulk,

sc_data = A_Seurat_object,

phenotype = phenotype,

label_type = "TP53", # The filtering labels are stored in the `@misc`

screen_method = "scPAS",

phenotype_class = "binary",

assay = 'RNA',

imputation = FALSE,

imputation_method = c("KNN", "ALRA"),

nfeature = 3000L,

alpha = c(0.01, NULL),

cutoff = 0.2,

network_class = c("SC", "bulk"),

permutation_times = 2000L,

FDR_threshold = 0.05,

independent = TRUE

)

```

**returning structure**: A list containing:

- `scRNA_data`: A Seurat object after screening

- `stats`: A data.frame the significance of scPAS screening results

- `para`: A list containing the parameters used in scPAS screening

### 2.3 (Option C) scAB Screening

Parameters pass to `...` when using `scAB` method (basically adapted from the `scAB`'s documentation):

- `alpha`: Coefficient of phenotype regularization, default is `0.005`. When specified `NULL`, a default searching vector is used. A custom numeric vector is also supported.

- `alpha_2`: Coefficient of cell-cell similarity regularization, default is `0.005`. When specified `NULL`, a default searching vector is used. A custom numeric vector is also supported

- `maxiter`: Maximum number of iterations, default is `2000`

- `tred`: Threshold for early stopping, default is `2`

**usage**:

```{r,scAB_screening,eval=FALSE}

scab_result = Screen(

matched_bulk = your_matched_bulk,

sc_data = A_Seurat_object,

phenotype = your_matched_phenotype,

label_type = "TP53", # The filtering labels are stored in the `@misc`

screen_method = "scAB",

phenotype_class = "binary",

alpha = c(0.005, NULL),

alpha_2 = c(0.005, NULL),

maxiter = 2000L,

tred = 2L

)

```

**note**:

When both `alpha` and `alpha_2` are specified as `NULL` or as search vectors, the total number of searches equals the product of their lengths, which may lead to long runtimes. In such cases, we recommend enabling parallel computation.

```{r,scAB_screening_parallel,eval=FALSE}

setFuncOption(parallel = TRUE, parallel.type = 'multisession', workers = 4L)

```

Then simply run the function directly.

**returning structure**: A list containing:

- `scRNA_data`: A Seurat object after screening

- `scAB_result`: A list with the submatrix and loss value

### 2.4 (Option D) scPP Screening

Parameters pass to `...` when using `scPP` method :

- `ref_group`: The reference group for the binary analysis, default is `0`

- `Log2FC_cutoff`: The cutoff for the log2 fold change of the binary analysis, default is `0.585`

- `estimate_cutoff`: Effect size threshold for continuous traits, default is `0.2`

- `probs`: Quantile cutoff in (0, 0.5) for cell classification, default is `0.2`. When specified `NULL`, a default searching vector is used. A custom searching vector is also supported.

**usage**:

```{r,scPP_screening,eval=FALSE}

# This will take several hours

scpp_result = Screen(

matched_bulk = your_matched_bulk,

sc_data = A_Seurat_object,

phenotype = your_matched_phenotype,

label_type = "TP53", # The filtering labels are stored in the `@misc`

screen_method = "scpp",

phenotype_class = "binary",

ref_group = 0,

Log2FC_cutoff = 0.585,

estimate_cutoff = 0.2,

probs = c(0.2, NULL)

)

```

**returning structure**: A list containing:

- `scRNA_data`: A Seurat object after screening

- `gene_list`: A list containing positive genes and negative genes

### 2.5 (Option E) DEGAS Screening

Parameters pass to `...` when using `DEGAS` method

- `sc_data.pheno_colname`: The column name of the phenotype in the `[email protected]` slot, used to specify the phenotype for more accurate screening. Default is `NULL`.

- `tmp_dir`: The directory for storing the intermediate files. Default is `NULL`, a directory named `tmp` will be created.

- `env_params` : A list of parameters for the environment, default is `list()`. Use `?DoDEGAS` to see the details.

- `degas_params`: A list of parameters for the DEGAS algorithm, default is `list()`. Use `?DoDEGAS` to see the details.

- `normality_test_method`: Method for normality testing: `"jarque-bera", "d'agostino", or "kolmogorov-smirnov"`, default is "jarque-bera".

```{r,DEGAS_screening,eval=FALSE}

degas_result <- Screen(

matched_bulk = your_matched_bulk,

sc_data = A_Seurat_object,

phenotype = your_matched_phenotype,

label_type = "TP53", # The labels are stored in the `@misc` and are used to identify the screening results.

screen_method = "DEGAS",

# The type of phenotype

phenotype_class = c("binary", "continuous", "survival"),

# Environment parameters

env_params = list(

env.name = "r-reticulate-degas",

env.type = "conda",

# Environment.yml file will be used to create the conda environment in default, so other parameters can be ignored

env.method = "environment",

# The path of the environment.yml file

env.file = system.file(

"conda/DEGAS_environment.yml",

package = "SigBridgeR"

env.python_verion = "3.9.15",

env.packages = c(

"tensorflow" = "2.4.1",

"protobuf" = "3.20.3"

# Force re-creating the env

env.recreate = FALSE,

# Use conda-forge channel for env creating

env.use_conda_forge = TRUE,

# Message output when creating the env

env.verbose = FALSE

# DEGAS parameters

degas_params = list(

# DEGAS.model_type will be automatically determined by the `phenotype_class`

DEGAS.model_type = c(

"BlankClass", # only bulk level phenotype specified

"ClassBlank", # only single cell level phenotype specified

"ClassClass", # when both single cell level phenotype and bulk level phenotype specified

"ClassCox", # when both single cell level phenotype and bulk level survival data specified

"BlankCox" # only bulk level survival data specified

# DEGAS.architecture will be `DenseNet` in default

DEGAS.architecture = c(

"DenseNet", # a dense net network

"Standard" # a feed forward network

# The path to save intermediate data

path.data = '',

path.result = '',

# The python executable path of the conda environment, auto detected in default

DEGAS.pyloc = NULL,

# Some functions will be called in the DEGAS algorithm

DEGAS.toolsPath = paste0(.libPaths()[1], "/DEGAS/tools/"), # or `file.path(.libaPaths()[1], "SigBridgeR/DEGAS_tools/")`

# Screening parameters

DEGAS.ff_depth = 3,

DEGAS.bag_depth = 5,

DEGAS.train_steps = 2000,

DEGAS.scbatch_sz = 200,

DEGAS.patbatch_sz = 50,

DEGAS.hidden_feats = 50,

DEGAS.do_prc = 0.5,

DEGAS.lambda1 = 3.0,

DEGAS.lambda2 = 3.0,

DEGAS.lambda3 = 3.0,

DEGAS.seed = 2

# default: "jarque-bera".

normality_test_method = c(

"jarque-bera",

"d'agostino",

"kolmogorov-smirnov"

)

```

This code chunk will CREATE a conda environment called "**r-reticulate-degas**" with python 3.9.15, and install the required packages (tensorflow, protobuf) for DEGAS. If an environment with the same name already exists, it will directly use this environment without creating a new one (unless specified `env_params = list(env.recreate=TRUE)`).

To obtain the default parameters for DEGAS, you can use `SigBridgeR:::DEGASParamSet()`

```{r,get_default_params1,eval=FALSE}

degas_param <- SigBridgeR:::DEGASParamSet(list())

env_param <- SigBridgeR:::DEGASEnvSet(list())

```

You can use `ListPyEnvs()` to list all the python environments in your system, including virtual environments. Both Windows and Unix-like systems are supported. More information can be found in [https://wanglabcsu.github.io/SigBridgeR/articles/Other_Function_Details.html](https://wanglabcsu.github.io/SigBridgeR/articles/Other_Function_Details.html)

```{r,show_python_envs2,eval=FALSE}

# * On Unix-like system it goes like this

ListPyEnv()

# name python type

# 1 base /home/user/miniconda3/bin/python conda

# 2 r-reticulate-degas /home/user/miniconda3/envs/r-reticulate-degas/bin/python conda

# 3 test /home/user/.virtualenvs/test/bin/python venv

```

Please note that the environmental dependencies required for DEGAS to run are quite stringent, and conflicts are highly likely to occur.

**returning structure**: A list containing:

- `scRNA_data`: A Seurat object after screening

- `model`: A model trained using single-cell RNA expression matrix, tissue bulk RNA sequencing expression matrix, and phenotypic data.

- `DEGAS_prediction`: Using the model to conduct prediction for each cell, resulting in a data.frame where each phenotype has a predicted probability score.

### 2.6 (Option F) LP_SGL Screening

Parameters pass to `...` when using `LP_SGL` method

- `resolution`: Resolution parameter for Leiden clustering (default: `0.6`)

- `alpha`: Alpha parameter for SGL balancing L1 and L2 penalties (default: `0.5`)

- `nfold`: Number of folds for cross-validation (default: `5`)

- `dge_analysis`: List controlling differential expression analysis:

- `run`: Whether to run DEG analysis (default: `FALSE`)

- `logFC_threshold`: Log fold change threshold (default: `1`)

- `pval_threshold`: P-value threshold (default: `0.05`)

```{r,LP_SGL_screening,eval=FALSE}

lpsgl_result <- Screen(

matched_bulk = your_matched_bulk,

sc_data = A_Seurat_object,

phenotype = your_matched_phenotype,

label_type = "TP53",

resolution = 0.6,

alpha = 0.5,

nfold = 5,

dge_analysis = list(

run = FALSE, # whether to run DEG analysis

logFC_threshold = 1,

pval_threshold = 0.05

... # Additional parameters like verbose, seed

)

```

**returning structure**: A list containing:

- `scRNA_data`: A Seurat object after screening

- `sgl_fit` : Fitted SGL model object

- `cvfit` : Cross-validation results

- `dge_res` : Differential expression results if requested (NULL otherwise)

### 2.7 (Option G) PIPET Screening

Parameters pass to `...` when using `PIPET` method

- **Phenotype adaptation parameters**:

- `discretize_method`: Character: "kmeans"/"quantile"/"custom" (default: "kmeans")

- `cutoff`: Numeric vector for custom discretization when `discretize_method` is "custom" (default: NULL)

- **Marker generation parameters**:

- `log2FC`: Numeric: log2FC threshold (default: 1)

- `p.adjust`: Numeric: adjusted p-value threshold (default: 0.05)

- **Single-cell annotation parameters**:

- `distance`: Character: "cosine"/"pearson"/"spearman" (default: "cosine")

- `nPerm`: Integer: permutation times for statistical test (default: 1000L)

**Usage**:

```{R,PIPET_screening,eval=FALSE}

pipet_result = Screen(

matched_bulk = matched_bulk,

sc_data = sc_dataset, # A Seurat object after preprocessing

phenotype = matched_phenotype_data,

label_type = "TP53", # The filtering labels are stored in the `Seurat_object@misc`

phenotype_class = "binary", # `survival` is not supported

screen_method = "PIPET",

# PIPET specific parameters

lg2FC = 1,

p.adjust = 0.05,

distance = "cosine",

nPerm = 1000,

parallel = FALSE # Whether to use parallel computing, before using this parameter, please make sure that future::plan() has been called

)

```

**Returning structure**: A list containing:

- `scRNA_data`: A Seurat object after screening (with PIPET annotations in meta.data)

- `markers`: Phenotype-specific marker genes

### 2.8 (Option H) SIDISH Screening

Parameters passed to `...` when using `SIDISH` method:

- `verbose`: Logical. Whether to print verbose output during execution. Default is `TRUE`.

- `label_type`: Character specifying the phenotype label type stored in `sc_data@misc`. Default is `"SIDISH"`.

- `assay`: Seurat assay name to use for expression data. Default is `"RNA"`.

- `sidish_param`: List of parameters for SIDISH algorithm. See below for details.

- `env_params`: List of parameters for environment setup. See below for details.

```{r,SIDISH_screening,eval=FALSE}

sidish_result <- Screen(

matched_bulk = your_matched_bulk,

sc_data = A_Seurat_object,

phenotype = your_survival_phenotype, # Must contain "time" and "status" columns

label_type = "SIDISH", # Labels stored in `@misc` slot to identify screening results

screen_method = "SIDISH",

phenotype_class = "survival", # Currently only survival phenotype is supported

# SIDISH algorithm parameters

sidish_param = list(

# Preprocessing parameters

patient_id = "Sample",

celltype_name = "celltype_major",

processed = TRUE,

n_genes_by_counts = 5000L,

pct_counts_mt = 10L,

batch_correction = FALSE,

survival_ = "time", # Column name for survival time

status = "status", # Column name for event status

# Execution environment

device = "cuda", # "cuda" for GPU acceleration, "cpu" for CPU-only

use_spatial_graph = FALSE,

k_neighbors = NULL,

# Phase 1: VAE training parameters

phase1_epochs = 225L,

phase1_i_epochs = 20L,

phase1_latent_size = 32L,

phase1_layer_dims = c(512L, 128L),

phase1_batch_size = 256L,

phase1_optimizer = "Adam",

phase1_lr = 1e-4,

phase1_lr_3 = 1e-4,

phase1_dropout = 0L,

phase1_type = "Dense", # "Dense" or "Normal"

# Phase 2: Deep Cox training parameters

phase2_epochs = 500L,

phase2_hidden = 128L,

phase2_lr = 1e-4,

phase2_dropout = 0L,

phase2_test_size = 0.2,

phase2_batch_size_bulk = 256L,

# Training & risk definition

train_iterations = 5L,

train_percentile = 0.95,

train_steepness = 30L,

train_path = "./SIDISH_res/", # Path to save intermediate data

train_num_workers = 0L,

train_distribution_fit = "fitted" # "fitted" or "default"

# Python environment parameters

env_params = list(

env.name = "r-reticulate-sidish-nvidia", # Auto-selected based on device parameter

env.type = "conda",

env.method = "environment",

env.file = system.file(

"conda/SIDISH_nvidia_environment.yml",

package = "SigBridgeR"

env.python_verion = "3.12.12",

env.packages = c(

"numpy" = "1.26.4"

# Additional packages defined in environment.yml

env.recreate = FALSE, # Force re-creating environment if TRUE

env.use_conda_forge = TRUE,

env.verbose = FALSE

verbose = TRUE

)

```

This code chunk will CREATE a conda environment called **`r-reticulate-sidish-nvidia`** (for GPU) or **`r-reticulate-sidish-cpu`** (for CPU) with Python 3.12.12 and install all required dependencies for SIDISH. If an environment with the same name already exists, it will be reused without recreation (unless `env_params = list(env.recreate = TRUE)` is specified).

**Note**: SIDISH currently **only supports survival phenotypes** (`phenotype_class = "survival"`). The phenotype data frame must contain columns named `"time"` (survival time) and `"status"` (event indicator).

To obtain the default parameters for SIDISH, use `SigBridgeR:::SIDISHParamSet()` and `SigBridgeR:::SIDISHEnvSet()`:

```{r,get_default_params,eval=FALSE}

# Get default SIDISH algorithm parameters

sidish_default_params <- SigBridgeR:::SIDISHParamSet(list())

# Get default environment parameters (CPU version)

env_default_cpu <- SigBridgeR:::SIDISHEnvSet(list(), device = "cpu")

# Get default environment parameters (GPU version)

env_default_gpu <- SigBridgeR:::SIDISHEnvSet(list(), device = "cuda")

```

You can use `ListPyEnvs()` to list all Python environments available on your system (both conda and virtualenv):

```{r,show_python_envs,eval=FALSE}

ListPyEnv()

# name python type

# 1 base /home/user/miniconda3/bin/python conda

# 2 r-reticulate-sidish-nvidia /home/user/miniconda3/envs/r-reticulate-sidish-nvidia/bin/python conda

# 3 r-reticulate-sidish-cpu /home/user/miniconda3/envs/r-reticulate-sidish-cpu/bin/python conda

```

> **Important considerations**:

> - GPU acceleration requires CUDA-compatible hardware. If GPU is not detected but `device = "cuda"` is specified, the function will abort with an error message. Set `sidish_param = list(device = "cpu")` to run on CPU. And of course the python environment must be recreated.

**Return structure**: A named list containing:

- `scRNA_data`: A Seurat object with SIDISH screening results integrated into the `@misc` and `@meta.data` slots under the specified `label_type`. The object contains cell-level risk scores and survival-related annotations generated by the SIDISH algorithm.

### 2.9 (Option I) SCIPAC Screening

Parameters pass to `...` when using `PIPET` method

- `hvg`: Integer. Number of highly variable genes to use for preprocessing. Default is `1000L`.

- `do_pca_sc`: Logical. Whether to perform PCA on single-cell data and apply the rotation matrix to bulk data; if FALSE, PCA is performed on bulk data and applied to single-cell data. Default is `FALSE`.

- `n_pc`: Integer. Number of principal components to use. Default is `60L`.

- `sc_batch_col`: Character or vector. Batch variable for single-cell data. Default is `NULL`.

- `resolution`: Integer. Clustering resolution for cell type identification. Default is `2L`.

- `ela_net_alpha`: Numeric. Elastic net mixing parameter (0 = ridge, 1 = lasso). Default is `0.4`.

- `bt_size`: Integer. Bootstrap sample size for stability assessment. Default is `50L`.

- `ncore`: Integer. Number of CPU cores for parallel computation. Default is `7L`.

- `ci_alpha`: Numeric. Significance level for confidence intervals. Default is `0.05`.

- `nfold`: Integer. Number of folds for cross-validation for regression models. Default is `10L`.

- `...`: Additional arguments. Supports `assay` (character), `verbose` (logical), and `seed` (integer).

```{r,SCIPAC_screening,eval=FALSE}

scipac_result <- Screen(

matched_bulk = your_matched_bulk,

sc_data = A_Seurat_object,

phenotype = your_matched_phenotype,

label_type = "TP53",

hvg = 1000L,

do_pca_sc = FALSE,

n_pc = 60L,

sc_batch_col = NULL,

resolution = 2L,

ela_net_alpha = 0.4,

bt_size = 50L,

ncore = 7L,

ci_alpha = 0.05,

nfold = 10L,

View remainder of file in raw view

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FilesExpand file tree

Full_Tutorial.Rmd

Latest commit

History

Full_Tutorial.Rmd

File metadata and controls