Visium HD使用Cellpose进行细胞分割获取单细胞精度空间数据

原创

生信大杂烩

发布于 2025-05-29 21:35:22

6552

Visium HD为了获取单细胞级数据，前面我们介绍了基于高清的H&E图像使用StarDist进行细胞核分割，然后将覆盖在识别出来的细胞核区域上的2um bin进行合并，将这些合并的bins当做一个细胞进行后续分析。

不管是10x官方提供的细胞核分割方法(https://www.10xgenomics.com/analysis-guides/segmentation-visium-hd), 或是bin2cell (https://github.com/Teichlab/bin2cell), 抑或是上节我们测试的ENACT (https://github.com/Sanofi-Public/enact-pipeline)其本质都是一样的，使用StarDist预训练的2D_versatile_he模型首先识别细胞核轮廓，然后将覆盖在细胞核轮廓上的bins进行合并，无非是bin-to-cell合并的方法不同罢了。

需要特别强调下，只要是使用StarDist预训练的2D_versatile_he模型进行分割的，那一定分割的是细胞核，尽管StarDist也是可以进行细胞轮廓识别，但是2D_versatile_he模型是使用DAPI染的细胞核图像训练的，而上面我们说的3种分割都是用的2D_versatile_he模型，所以记住它们都是细胞和分割，分割后bin-to-cell获取的是接近snRNA-seq结果会丢失细胞质的表达信息。

为了识别出细胞边界，我们可以使用Cellpose (https://www.cellpose.org/)进行单细胞分割，然后再bin-to-cell获得的就是单细胞精度的数据了。

Cellpose提供了多种预训练模型，主要包括2类：

1. 细胞质模型（cyto）：适用于细胞质分割，通常用于处理细胞形态较为复杂的图像。

2. 细胞核模型（nuclei）：专门用于细胞核的分割，适合于需要高精度核定位的应用。

Cellpose在Visium HD数据细胞分割中的优势：

高分辨率细胞分割：

Cellpose作为一种深度学习模型，能够有效地处理这些高分辨率图像，提供精确的细胞分割，从而确保后续分析的准确性；

自动化处理：

Cellpose能够自动识别和分割图像中的细胞，减少了人工标注的需求；

适应性强：

Cellpose经过多样化的数据集训练，具有较强的通用性，能够适应不同类型的细胞图像，而无需用户进行大量的重新训练或调整参数；

支持复杂组织结构：

Cellpose能够有效处理复杂且细胞密度较高且复杂的组织结构，能准确识别重叠细胞和多样化形态，提高分割精度。

以下是使用Cellpose进行细胞分割，然后将2um bin基因表达合并得到单细胞精度表达数据，代码我都进行了详细注释。

# 导入必要的库import numpy as npimport pandas as pdimport scanpy as scimport tifffileimport geopandas as gpdfrom shapely.geometry import Pointfrom cellpose import models, core
# 定义读取图像的函数，并可选择下采样因子def read_dapi_image(img_path: str, downscale_factor: int = 2) -> np.ndarray:    img = io.imread(img_path)  # 读取图像文件    print(img.shape)  # 打印图像的形状    return downscale_image(img, downscale_factor=downscale_factor)  # 返回下采样后的图像
# 定义下采样图像的函数def downscale_image(img: np.ndarray, downscale_factor: int = 2) -> np.ndarray:    # 计算每个轴上所需的填充量    pad_height = (downscale_factor - img.shape[0] % downscale_factor) % downscale_factor    pad_width = (downscale_factor - img.shape[1] % downscale_factor) % downscale_factor    new_shape = (img.shape[0] + pad_height, img.shape[1] + pad_width, img.shape[2])  # 新图像形状    new_image = np.zeros(new_shape, dtype=img.dtype)  # 创建新的零填充图像    for channel in range(img.shape[2]):        new_image[:, :, channel] = np.pad(            img[:, :, channel], ((0, pad_height), (0, pad_width)), mode="constant"        )  # 对每个通道进行零填充    return new_image  # 返回填充后的图像
# 定义运行Cellpose进行细胞分割的函数def run_cellpose(img: np.ndarray, model_path: str):    use_GPU = core.use_gpu()  # 检查是否使用GPU    model = models.CellposeModel(gpu=use_GPU, pretrained_model=model_path)  # 加载Cellpose模型    channels = [0, 0]  # 设置通道（这里假设为单通道）    masks, flows, styles = model.eval(        [img],        channels=channels,        diameter=model.diam_labels,        flow_threshold=1,        cellprob_threshold=0,        batch_size=4,    )  # 执行细胞分割并返回掩膜、流和样式信息    return (masks, flows, styles)  # 返回分割结果
# 设置Visium HD数据的文件路径并读取图像img_path = "./data/visiumHD_colorectal_cancer/P1_CRC/Visium_HD_Human_Colon_Cancer_P1_tissue_image.btf"maxed_visium = read_dapi_image(img_path, downscale_factor=1)  # 读取并下采样图像
# 设置分块参数和模型路径chunk_per_axis = 2  # 每个轴上分块数目mp = 'cyto'  # 使用细胞质模型进行分割
# 对图像进行分块并调用Cellpose进行细胞分割masks_top_left, flows, styles = run_cellpose(    maxed_visium[        : np.shape(maxed_visium)[0] // chunk_per_axis,        : np.shape(maxed_visium)[1] // chunk_per_axis,        :,    ],    model_path=mp,)masks_top_right, flows, styles = run_cellpose(    maxed_visium[        : np.shape(maxed_visium)[0] // chunk_per_axis,        np.shape(maxed_visium)[1] // chunk_per_axis :,        :,    ],    model_path=mp,)masks_bottom_left, flows, styles = run_cellpose(    maxed_visium[        np.shape(maxed_visium)[0] // chunk_per_axis :,        : np.shape(maxed_visium)[1] // chunk_per_axis,        :,    ],    model_path=mp,)masks_bottom_right, flows, styles = run_cellpose(    maxed_visium[        np.shape(maxed_visium)[0] // chunk_per_axis :,        np.shape(maxed_visium)[1] // chunk_per_axis :,        :,    ],    model_path=mp,)
# 初始化全掩膜数组，准备合并分块结果constant = 1000000  # 用于区分不同块的常量值full_mask = np.zeros_like(maxed_visium[:, :, 0], dtype=np.uint32)  # 创建与原图相同大小的掩膜数组
# 将各个块的掩膜合并到全掩膜中，并加上常量以区分不同区域full_mask[    : np.shape(maxed_visium)[0] // chunk_per_axis,    : np.shape(maxed_visium)[1] // chunk_per_axis,] = masks_top_left[0]full_mask[    : np.shape(maxed_visium)[0] // chunk_per_axis,    np.shape(maxed_visium)[1] // chunk_per_axis :,] = masks_top_right[0] + (constant)full_mask[    np.shape(maxed_visium)[0] // chunk_per_axis :,    : np.shape(maxed_visium)[1] // chunk_per_axis,] = masks_bottom_left[0] + (constant * 2)full_mask[    np.shape(maxed_visium)[0] // chunk_per_axis :,    np.shape(maxed_visium)[1] // chunk_per_axis :,] = masks_bottom_right[0] + (constant * 3)
# 将全掩膜中值为常量的部分设置为0，表示没有细胞full_mask = np.where(full_mask % constant == 0, 0, full_mask)
# 将全掩膜保存为PNG格式文件tifffile.imsave("./visium_hd/segmentation.png", full_mask)# 设置Visium HD数据目录路径以加载空间转录组数据dir_base = "./data/visiumHD_colorectal_cancer/P1_CRC/outs/binned_outputs/square_002um/"
# 加载Visium HD数据中的基因表达矩阵（HDF5格式）raw_h5_file = dir_base + "filtered_feature_bc_matrix.h5"adata = sc.read_10x_h5(raw_h5_file)  # 使用Scanpy读取数据
# 加载组织位置坐标文件（Parquet格式）tissue_position_file = dir_base + "spatial/tissue_positions.parquet"df_tissue_positions = pd.read_parquet(tissue_position_file)
# 将数据框的索引设置为条形码，以便后续合并操作使用df_tissue_positions = df_tissue_positions.set_index("barcode")
# 在数据框中创建索引以检查合并情况df_tissue_positions["index"] = df_tissue_positions.index
# 将组织位置添加到AnnData对象的元数据中（obs）adata.obs = pd.merge(adata.obs, df_tissue_positions, left_index=True, right_index=True)
# 从坐标数据框创建GeoDataFrame，以便进行空间分析geometry = [    Point(xy)    for xy in zip(        df_tissue_positions["pxl_col_in_fullres"],        df_tissue_positions["pxl_row_in_fullres"],    )]gdf_coordinates = gpd.GeoDataFrame(df_tissue_positions, geometry=geometry)
## 将捕获区域分配给单个细胞# 检查点掩膜是否在全掩膜范围内，以确保坐标有效性point_mask = (    gdf_coordinates["pxl_row_in_fullres"].values.astype(int) < np.shape(full_mask)[0]) & (gdf_coordinates["pxl_col_in_fullres"].values.astype(int) < np.shape(full_mask)[1])
# 根据有效坐标从全掩膜中提取细胞ID信息cells = full_mask[    gdf_coordinates["pxl_row_in_fullres"].values.astype(int)[point_mask],    gdf_coordinates["pxl_col_in_fullres"].values.astype(int)[point_mask],]gdf_coordinates = gdf_coordinates[point_mask]gdf_coordinates["cells"] = cells  # 将细胞ID添加到GeoDataFrame中assigned_regions = gdf_coordinates[gdf_coordinates["cells"] > 0]
# 将细胞ID信息合并到AnnData对象中，以便后续分析使用adata.obs = adata.obs.merge(    assigned_regions[["cells"]], left_index=True, right_index=True, how="left")adata = adata[~pd.isna(adata.obs["cells"])]  # 移除没有细胞ID的数据行adata.obs["cells"] = adata.obs["cells"].astype(int)  # 将细胞ID转换为整数类型
## 在捕获区域内汇总转录本计数from scipy import sparse
# 按唯一细胞ID对数据进行分组，以便汇总转录本计数groupby_object = adata.obs.groupby(["cells"], observed=True)counts = adata.X  # 提取基因表达计数矩阵spatial_coords = adata.obs[["pxl_row_in_fullres", "pxl_col_in_fullres"]].valuesN_groups = groupby_object.ngroups  # 获取唯一细胞ID数量（组数）N_genes = counts.shape[1]  # 获取基因数量
# 初始化稀疏矩阵以存储每个细胞的基因计数汇总结果summed_counts = sparse.lil_matrix((N_groups, N_genes))polygon_id = []  # 存储多边形ID列表cell_coords = []  # 存储细胞坐标列表row = 0
# 遍历每个唯一多边形，计算基因计数的总和。for polygons, idx_ in groupby_object.indices.items():    summed_counts[row] = counts[idx_].sum(0)  # 汇总该多边形内所有细胞的基因计数    cell_coords.append(np.mean(spatial_coords[idx_], axis=0))  # 获取该多边形内细胞坐标均值作为代表坐标    row += 1    polygon_id.append(polygons)  # 添加多边形ID到列表中cell_coords = np.array(cell_coords)  summed_counts = summed_counts.tocsr()  # 转换为压缩稀疏行矩阵格式
# 创建一个新的AnnData对象，用于存储汇总后的基因计数矩阵及其元数据。grouped_adata = anndata.AnnData(    X=summed_counts,    obs=pd.DataFrame(polygon_id, columns=["cells"], index=polygon_id),    var=adata.var,)grouped_adata.write('adata_cellpose.h5ad')

拿到细胞分割后的结果后，降维聚类看下结果，对比之前合并8um bin结果看起来，大群基本上分开了，效果挺好。

题外话

其实我们还可以使用Qupath导入H&E图像，使用Cellpose或者一些开源的机器学习算法可视化进行细胞分割，好处是可以自己训练模型，然后细胞分割后直观就能看到，如果不满意就可以微调参数进行分割，直到能将绝大多数细胞分割出来，最后可以导出这些细胞的坐标信息，然后就是使用上面bin-to-cell部分的代码，就能达到更好的结果。

原创声明：本文系作者授权腾讯云开发者社区发表，未经许可，不得转载。

如有侵权，请联系 [email protected] 删除。

图像处理