The present invention provides a computer-implemented method for classifying a test sample obtained from a tumour of a test subject, comprising: (a) providing a whole genome sequence of the tumour and of a non-tumour sample from the test subject; (b) analysing the genome sequence of the tumour to compute at least five general features and at least five recurrence features selected from the general and recurrence features shown in Table SF2-1;
(c) providing at least five reference clusters that have been obtained by: (i) analysing a training data set comprising tumour genomes of at least 1000 subjects to compute said at least five general features and at least five recurrence features defined in step (b) for each of said at least 1000 subjects; (ii) performing principal component analysis (PCA) using the features of step (i) as input, optionally after scaling and/or centring the features to account for their different scales, to obtain a plurality of principal components that together account for at least 50% of the variance of the training data set; and (iii) performing hierarchical clustering using said plurality of principal components to divide the samples of the training data set into at least five reference clusters, each reference cluster having a centre that is defined by computing the mean of the plurality of principal components of all samples in that cluster; (d) determining the Euclidean distance between said plurality of principal components of the test sample and the centre of each of said reference clusters; and (e) classifying the tumour sample obtained from the test subject as belonging to the reference cluster to which it is found to be nearest in step (d). Also provided are related methods and systems.
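Steps (c)(iii), (d) and (e) can be illustrated with a minimal sketch. It assumes that the principal component scores and the cluster labels of the training samples have already been obtained by PCA and hierarchical clustering, and that the test sample has been projected onto the same principal components using the same scaling as the training data. The function names `cluster_centres` and `classify` and the toy values are hypothetical; only two clusters and two components are used for brevity, whereas the method itself calls for at least five reference clusters.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cluster_centres(scores: List[Vector], labels: List[int]) -> Dict[int, List[float]]:
    """Step (c)(iii): centre of a cluster = per-component mean of its members' PC scores."""
    grouped = defaultdict(list)
    for row, label in zip(scores, labels):
        grouped[label].append(row)
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


def classify(test_scores: Vector, centres: Dict[int, List[float]]) -> int:
    """Steps (d)-(e): return the label of the centre nearest in Euclidean distance."""
    return min(centres, key=lambda label: math.dist(test_scores, centres[label]))


# Toy, made-up PC scores for four training samples in two clusters.
training_scores = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]]
training_labels = [0, 0, 1, 1]

centres = cluster_centres(training_scores, training_labels)
assigned = classify([9.0, 8.0], centres)  # test sample already in PC space
```

Storing only the centres means a new tumour can be classified without re-running PCA or clustering on the training data set.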