Tutorial1: Using scMagnifier on single-batch datasets

This tutorial demonstrates how to use scMagnifier on single-batch data. Our workflow includes data preprocessing, GRN construction, TF perturbation, consensus clustering and cluster merging.
Here we use LUNG_N30 dataset as our example.

Step1: Data preprocessing

In this step, we define preprocess() fuction, which will preprocess your dataset. You need to input the path to the h5ad file in the function (here, the gene expression matrix in the h5ad file is required to be the raw matrix). You can select the clustering method (Leiden is the default method, and you can also choose Louvain by setting cluster_method="Louvain"). You can input the clustering resolution you wish to use (the default value is 0.75, which can be adjusted via resolution=""). Additionally, if there are cell annotations in your h5ad file, you can input the name of the annotation column (set label_key=""), thereby obtaining the visualization results of cell annotations on UMAP.
You can refer to the preprocess_core.py file for the remaining detailed parameters.

from scMagnifier import preprocess
preprocess(input_path="/mnt/disk1/hzh/LUNG_N30.h5ad")

Loading outdir：/mnt/disk1/hzh/Tutorial/preprocessed_result
Loading data from: /mnt/disk1/hzh/LUNG_N30.h5ad
Original data shape: (2886, 29634)
Filtering genes with min_counts >= 1 ...
After gene filtering shape: (2886, 20658)
Normalizing per cell (total counts -> target_sum=1e4) ...
Selecting top 2000 highly variable genes (on non-log data) ...
Number of HVG selected: 2000
Subsetting AnnData to HVGs ...
After subsetting shape: (2886, 2000)
Re-normalizing per cell after gene filtering ...
Saving un-logged copy to adata.raw and adata.layers['raw_count'] ...
Applying log1p and scaling ...
Before log1p: min=0.0, max=7687.08642578125, mean=5.000003337860107
Running PCA (n_comps=20) ...
Computing neighbors (n_neighbors=10, n_pcs=20) ...
Computing UMAP ...
Running Leiden clustering (resolution=0.75) ...
Plotting UMAP colored by leiden clusters ...
UMAP (leiden) figure saved to: preprocessed_result/umap_leiden.png
Plotting UMAP colored by Cell_subtype labels  ...
UMAP (Cell_subtype) figure saved to: /mnt/disk1/hzh/Tutorial/preprocessed_result/umap_Cell_subtype.png
Preprocessed AnnData saved to: /mnt/disk1/hzh/Tutorial/preprocessed_result/preprocessed.h5ad
==================================================
Preprocessing finished! All outputs saved in: /mnt/disk1/hzh/Tutorial/preprocessed_result
Note: adata.raw and adata.layers['raw_count'] contain un-logged data for downstream use.

Then you can find the preprocessed h5ad file and generated images in the preprocessed_result directory.
You do not need to execute the code below; it is only intended to display the generated images.

# You do not need to execute this code
from IPython.display import HTML
import base64

def img_to_base64(img_path):
    with open(img_path, "rb") as f:
        return base64.b64encode(f.read()).decode()

img_path1 = "/mnt/disk1/hzh/Tutorial/preprocessed_result/umap_leiden.png"
img_path2 = "/mnt/disk1/hzh/Tutorial/preprocessed_result/umap_Cell_subtype.png"
img1_b64 = img_to_base64(img_path1)
img2_b64 = img_to_base64(img_path2)
HTML(f"""
<div style="display: flex; gap: 20px; scope: local;">
    <img src="data:image/png;base64,{img1_b64}" width="300">
    <img src="data:image/png;base64,{img2_b64}" width="400">
</div>
""")

Step2: GRN Construction

In this step, we define GRN() function to perform GRN (Gene Regulatory Network) construction, and the CellOracle method is used for this process. In the function, we use the human promoter GRN as the base GRN. If you are using data from other species, you can specify it via the base_grn="" parameter – for this, you may refer to the built-in GRNs of CellOracle. Additionally, if you used the Louvain algorithm during preprocessing, you need to set cluster_key="louvain".
You can refer to the grn_core.py file for the remaining detailed parameters.

from scMagnifier import GRN
GRN()

Then you will find in the GRN directory: TF_HVG_intersect_genes.txt, which stores candidate perturbed TFs (Transcription Factors), and preadata.final.celloracle.oracle, which stores the final GRN.

# You do not need to execute this code
import os
folder_path = "/mnt/disk1/hzh/Tutorial/GRN"
print(f"All files in folder {folder_path}:")
for file_name in os.listdir(folder_path):
    full_path = os.path.join(folder_path, file_name)
    if os.path.isfile(full_path): 
        print(f"  - {file_name}")

All files in folder /mnt/disk1/hzh/Tutorial/GRN:
  - preadata_links.celloracle.links
  - preadata.final.celloracle.oracle
  - preadata.initial.celloracle.oracle
  - TF_HVG_intersect_genes.txt
  - pca_variance_explained.png

Step3: TF perturbation

In this step, we define perturb() function to simulate the process of gene perturbation propagation. If you used the Louvain method during the preprocessing step, you need to set cluster_key="louvain" in the perturb() function to maintain consistency. Additionally, if you customized the resolution parameter during preprocessing, you should also set resolution="" here to keep it consistent. You can also adjust the perturbation multiplier by modifying the multiple="" parameter. If your dataset contains cell annotation information, you can specify the column name via label_key="".
You can refer to the perturb_core.py file for the remaining detailed parameters.

Next, we use two genes (ARID5B and ARNT2) as examples to demonstrate the perturbation process.You can perform perturbation for all genes by running the same code.

from scMagnifier import perturb
perturb()

[INFO] Running multipliers: ['0.1', '-0.1']

[INFO] === Running for multiplier=0.1 ===
[INFO] Results will be saved under: perturb_results/0p1
[INFO] Found coefficient matrix in oracle.coef_matrix_per_cluster
[INFO] Found 16 clusters in leiden: ['0', '1', '10', '11', '12', '13', '14', '15', '2', '3', '4', '5', '6', '7', '8', '9']
[SAVED] Original expression matrix: perturb_results/0p1/original_matrix.csv
[INFO] Loaded 2 genes from txt file.
================================================================================
[RUNNING] Perturbing gene: ARID5B (multiplier=0.1)
[SAVED] Perturbed matrix for ARID5B: perturb_results/0p1/perturbed_matrix_ARID5B_0.1.csv
[INFO] Running log1p + scale + PCA + neighbors + UMAP + clustering ...
WARNING: adata.X seems to be already log-transformed.
[SAVED] perturb_results/0p1/cluster_ARID5B_0.1.csv
[INFO] Skipped saving perturbed AnnData (save_h5ad=False)


... storing 'pert_cluster' as categorical


[DONE] Perturbation for ARID5B completed.
================================================================================
================================================================================
[RUNNING] Perturbing gene: ARNT2 (multiplier=0.1)
[SAVED] Perturbed matrix for ARNT2: perturb_results/0p1/perturbed_matrix_ARNT2_0.1.csv
[INFO] Running log1p + scale + PCA + neighbors + UMAP + clustering ...
WARNING: adata.X seems to be already log-transformed.
[SAVED] perturb_results/0p1/cluster_ARNT2_0.1.csv
[INFO] Skipped saving perturbed AnnData (save_h5ad=False)


... storing 'pert_cluster' as categorical


[DONE] Perturbation for ARNT2 completed.
================================================================================

[INFO] All genes done for multiplier=0.1.

[INFO] === Running for multiplier=-0.1 ===
[INFO] Results will be saved under: perturb_results/neg0p1
[INFO] Found coefficient matrix in oracle.coef_matrix_per_cluster
[INFO] Found 16 clusters in leiden: ['0', '1', '10', '11', '12', '13', '14', '15', '2', '3', '4', '5', '6', '7', '8', '9']
[SAVED] Original expression matrix: perturb_results/neg0p1/original_matrix.csv
[INFO] Loaded 2 genes from txt file.
================================================================================
[RUNNING] Perturbing gene: ARID5B (multiplier=-0.1)
[SAVED] Perturbed matrix for ARID5B: perturb_results/neg0p1/perturbed_matrix_ARID5B_-0.1.csv
[INFO] Running log1p + scale + PCA + neighbors + UMAP + clustering ...
WARNING: adata.X seems to be already log-transformed.
[SAVED] perturb_results/neg0p1/cluster_ARID5B_-0.1.csv
[INFO] Skipped saving perturbed AnnData (save_h5ad=False)


... storing 'pert_cluster' as categorical


[DONE] Perturbation for ARID5B completed.
================================================================================
================================================================================
[RUNNING] Perturbing gene: ARNT2 (multiplier=-0.1)
[SAVED] Perturbed matrix for ARNT2: perturb_results/neg0p1/perturbed_matrix_ARNT2_-0.1.csv
[INFO] Running log1p + scale + PCA + neighbors + UMAP + clustering ...
WARNING: adata.X seems to be already log-transformed.
[SAVED] perturb_results/neg0p1/cluster_ARNT2_-0.1.csv
[INFO] Skipped saving perturbed AnnData (save_h5ad=False)


... storing 'pert_cluster' as categorical


[DONE] Perturbation for ARNT2 completed.
================================================================================

[INFO] All genes done for multiplier=-0.1.

[INFO] All multipliers completed.

After perturbation, the perturb_results directory will store the clustering results following each gene perturbation (e.g., cluster_ARID5B_0.1.csv), the perturbed gene expression matrix (e.g., perturbed_matrix_ARID5B_0.1.csv), as well as the visualization results of the perturbed clustering results on the new UMAP, the visualization results of the perturbed clustering results on the original UMAP, and the visualization results of cell annotations on the new UMAP.

Next, we demonstrate the visualization plots of the new clustering results on the new UMAP, the original UMAP, and cell annotations on the new UMAP following perturbation of the ARID5B gene.

# You do not need to execute this code
img_path1 = "/mnt/disk1/hzh/Tutorial/perturb_results/0p1/umap_new_leiden_perturbed_ARID5B_0.1.png"
img_path2 = "/mnt/disk1/hzh/Tutorial/perturb_results/0p1/umap_original_on_new_leiden_perturbed_ARID5B_0.1.png"
img_path3 = "/mnt/disk1/hzh/Tutorial/perturb_results/0p1/umap_Cell_subtype_ARID5B_0.1.png"
img1_b64 = img_to_base64(img_path1)
img2_b64 = img_to_base64(img_path2)
img3_b64 = img_to_base64(img_path3)
HTML(f"""
<div style="display: flex; gap: 20px; scope: local;">
    <img src="data:image/png;base64,{img1_b64}" width="300">
    <img src="data:image/png;base64,{img2_b64}" width="300">
    <img src="data:image/png;base64,{img3_b64}" width="400">
</div>
""")

Step4: Consensus clustering

In this step, we define consensus() function to perform consensus clustering. If you have used the Louvain method for clustering in the preceding steps, you need to set cluster_method="louvain" in the function for consistency. In addition, if cell annotations are available, you can input the column name of the cell annotations for label_key="". We use a default resolution of 1.5 here; if you want to modify this value, you can set resolution="".
You can refer to the consensus_core.py file for the remaining detailed parameters.

from scMagnifier import consensus
consensus()

[INFO] Found 158 cluster CSV files.
[INFO] Loaded h5ad with 2886 cells.
[INFO] 2886 cells after alignment.
[INFO] One-hot matrix shape: (2886, 2509)
[INFO] Using standard PCA from X_pca
[INFO] PCA matrix shape: (2886, 20)
[INFO] Computing PCA (Euclidean) distance matrix ...
[INFO] Computing One-hot (cosine) distance matrix ...
[INFO] Running new UMAP (precomputed distances) ...
[INFO] New UMAP embedding shape: (2886, 2)
[SAVED] Custom UMAP + combined_distance saved to consensus_result/adata_with_rpcumap.h5ad

[INFO] Building kNN graph from the combined distance matrix (precomputed)...


... storing 'cluster' as categorical


[INFO] kNN graph placed into adata_cluster.obsp

[INFO] Running leiden clustering at resolution=1.5
[SAVED] consensus_result/leiden_res1.5.csv


... storing 'cluster' as categorical



[DONE] All resolutions processed.

After consensus clustering, you can obtain the h5ad file storing the rpcUMAP embeddings (the embedding information is saved in "X_umap_custom") in the consensus_result folder, along with the results of consensus clustering in leiden_res1.5.csv, the visualizations of consensus clustering results on the original UMAP and rpcUMAP, and the visualization of cell annotations on rpcUMAP.

Next, we present the visualizations of the consensus clustering results on the original UMAP and rpcUMAP, as well as the visualization of the cell annotation results on rpcUMAP.

# You do not need to execute this code
img_path1 = "/mnt/disk1/hzh/Tutorial/consensus_result/leiden_UMAP_res1.5.png"
img_path2 = "/mnt/disk1/hzh/Tutorial/consensus_result/leiden_rpcUMAP_res1.5.png"
img_path3 = "/mnt/disk1/hzh/Tutorial/consensus_result/rpcumap_realLabel_Cell_subtype.png"
img1_b64 = img_to_base64(img_path1)
img2_b64 = img_to_base64(img_path2)
img3_b64 = img_to_base64(img_path3)
HTML(f"""
<div style="display: flex; gap: 20px; scope: local;">
    <img src="data:image/png;base64,{img1_b64}" width="300">
    <img src="data:image/png;base64,{img2_b64}" width="300">
    <img src="data:image/png;base64,{img3_b64}" width="400">
</div>
""")

Step5: Cluster merging

This is the final step of scMagnifier. In this step, we define merge() function to perform cluster merging. In the function, we set the default value of min_size_fraction to 0.01. In practical use, you can lower this value—especially for tasks involving the identification of rare cells, where you can set it to 0.001.Additionally, if you need to control the number of clusters in the final output, you can also do so by adjusting the value of min_size_fraction.
You can refer to the merge_core.py file for the remaining detailed parameters.

from scMagnifier import merge
merge()

[INFO] Auto-selected cluster CSV: consensus_result/leiden_res1.5.csv
[INFO] Loading h5ad and cluster CSV ...
[INFO] Read cluster CSV with 2886 rows.


... storing 'merged_cluster' as categorical


[INFO] 2886 cells after alignment (intersection).
[INFO] Found 22 initial D-clusters.
[INFO] Centroid coords shape (cells x features): (2886, 2000)
[INFO] Computed 22 centroids based on HVG expression.
[INFO] THM raw = 8.41936, th_scaler = 0.75, THM_scaled = 6.31452
[INFO] After threshold merging, 21 merged clusters created.
[INFO] 21 clusters after the first merge step.
[INFO] Minimum cluster size threshold = 29 cells (1.00%).
[INFO] 20 clusters after merging small clusters.
[SAVED] Merged cluster CSV -> merged_result/merged_clusters.csv


... storing 'merged_cluster' as categorical


[SAVED] Old UMAP merged plot -> merged_result/umap_merged.png
[SAVED] New UMAP merged plot -> merged_result/rpcumap_merged.png
[DONE] Merge + visualization complete.

After cluster merging, we obtain the final results of scMagnifier, including the final clustering results saved in merged_clusters.csv, as well as the visualizations of the clustering results on the original UMAP and rpcUMAP.

Next, we demonstrate the visualization plots of the final clustering results on the original UMAP and the rpcUMAP.

# You do not need to execute this code
img_path1 = "/mnt/disk1/hzh/Tutorial/merged_result/umap_merged.png"
img_path2 = "/mnt/disk1/hzh/Tutorial/merged_result/rpcumap_merged.png"
img1_b64 = img_to_base64(img_path1)
img2_b64 = img_to_base64(img_path2)
HTML(f"""
<div style="display: flex; gap: 20px; scope: local;">
    <img src="data:image/png;base64,{img1_b64}" width="300">
    <img src="data:image/png;base64,{img2_b64}" width="300">
</div>
""")

Tool1: umap_compare() function

We develop the umap_compare() function to compare the positions of clusters in the final results on UMAP and rpcUMAP. You can input selection=[0,1] into the function to view the comparison of the positions of clusters 0 and 1 on UMAP and rpcUMAP.

from scMagnifier import umap_compare
umap_compare(selection=[0,1])

[INFO] Loading h5ad file: consensus_result/adata_with_rpcumap.h5ad
[INFO] Loading cluster definition CSV: merged_result/merged_clusters.csv
[SUCCESS] UMAP comparison plot saved to: umap_compare/umap_compare.png

# You do not need to execute this code
img_path1 = "/mnt/disk1/hzh/Tutorial/umap_compare/umap_compare.png"
img1_b64 = img_to_base64(img_path1)
HTML(f"""
<div style="display: flex; gap: 20px; scope: local;">
    <img src="data:image/png;base64,{img1_b64}" width="600">
</div>
""")

Tool2: TFscore() function

We defined the TFscore() function to calculate the importance scores of TFs in each cluster obtained through the perturbation process.

from scMagnifier import TFscore
TFscore()

[INFO] ===== Processing folder: 0p1 =====
[INFO] Loaded adata with 2886 cells
[INFO] Found 79 cluster_ files and 79 perturbed_matrix_ files
[DONE] Folder 0p1 processed successfully -> TFscore_output/analysis_0p1/analysis_result

[INFO] All TFscore analysis completed -> TFscore_output

By running the TFscore() function, you can obtain bar charts of the top 10 ranked TFs for each cluster. Here, we present the results for clusters 0, 1, and 2.

# You do not need to execute this code
img_path1 = "/mnt/disk1/hzh/Tutorial/TFscore_output/analysis_0p1/analysis_result/top10_genes_cluster_0.png"
img_path2 = "/mnt/disk1/hzh/Tutorial/TFscore_output/analysis_0p1/analysis_result/top10_genes_cluster_1.png"
img_path3 = "/mnt/disk1/hzh/Tutorial/TFscore_output/analysis_0p1/analysis_result/top10_genes_cluster_2.png"
img1_b64 = img_to_base64(img_path1)
img2_b64 = img_to_base64(img_path2)
img3_b64 = img_to_base64(img_path3)
HTML(f"""
<div style="display: flex; gap: 20px; scope: local;">
    <img src="data:image/png;base64,{img1_b64}" width="300">
    <img src="data:image/png;base64,{img2_b64}" width="300">
    <img src="data:image/png;base64,{img3_b64}" width="300">
</div>
""")