TECHNICAL FIELD
[0001] The subject-matter disclosed herein generally relates to the field of data-driven
medicine, more specifically to the computer-aided analysis of the health level of
cells at single-cell resolution, and even more specifically to techniques for predicting
a survival potential of a cell.
BACKGROUND
[0002] Neurodegenerative diseases affect millions of people worldwide. A neurodegenerative
disease is typically caused by the progressive loss of structure or function of neurons,
in the process known as neurodegeneration, which may ultimately lead to cell death.
Examples of neurodegenerative diseases include Alzheimer's disease, Parkinson's disease,
frontotemporal dementia (FTD) and amyotrophic lateral sclerosis (ALS). FTD and ALS
are progressive neurological disorders that involve cognitive deficits, language abnormalities
and muscle weakness. They mostly occur between the ages of 40 and 70 and typically lead
to death within three to eight years after the first symptom. For 2020, the estimated
numbers of ALS cases across 22 countries are 121,028 prevalent and 41,128 incident cases
(Brown, Carolyn A., Cathy Lally, Varant Kupelian, and W. Dana Flanders. 2021. "Estimated
Prevalence and Incidence of Amyotrophic Lateral Sclerosis and SOD1 and C9orf72 Genetic
Variants." Neuroepidemiology 55 (5): 342-53. https://doi.org/10.1159/000516752.), and no treatments are available to stop the degenerative process.
[0003] Some studies even claim that as much as one third of the population in Europe, which
amounts to roughly 180 million persons, was affected by brain disorders in 2010 (
Wittchen, H.U., F. Jacobi, J. Rehm, A. Gustavsson, M. Svensson, B. Jönsson, J. Olesen,
et al. 2011. "The Size and Burden of Mental Disorders and Other Disorders of the Brain
in Europe 2010." European Neuropsychopharmacology 21 (9): 655-79. https://doi.org/10.1016/j.euroneuro.2011.07.018.). In the report "
2021 Alzheimer's Disease Facts and Figures" of the Alzheimer's Association (Alzheimers
Dement 2021; 17(3)), the number of Americans with Alzheimer's disease is estimated to be as many as
6.2 million while the Parkinson's Foundation estimates that 1.2 million people in
the United States could be living with Parkinson's disease by 2030 (see https://www.parkinson.org/Understanding-Parkinsons/Statistics).
[0004] The current solutions for diagnosing neurodegenerative diseases are primarily based
on the identification of cell death when the cells are already dying. Moreover, the
known methods can typically classify the cells as only "live" or "dead". However,
this approach is limited for determining whether a given treatment has an effect on
cells, one reason being that it takes a long time with conventional methods to see
how the cell reacts.
[0005] Unrelated to neurodegenerative diseases,
WO 2021/127610 A1 is concerned with techniques for prognosing and diagnosing cancer and proposes a
method of determining a cancer progression risk score of a subject. The method may
include detecting expression levels of genes of a progression gene signature in a
sample obtained from a tumor, tissue, or body fluid, and calculating the cancer progression
risk score based thereon.
[0006] WO 2021/138548 A1 discloses methods for longevity-related applications. It suggests microscopy imaging
techniques for predicting a cell's state, function or age based on features such as
morphology or expression of certain biomarkers as detected by specific binding reagents.
Methods for identifying a drug capable of changing a cell's state, function or predicted
age are also provided and considered useful for drug discovery.
[0007] US 2014/0057258 A1 discloses a method to generate classifiers to diagnose osteoarthritis in patient
samples. Further technological background in the field of degenerative diseases is
included in, for example,
F. Arai et al.: "Machine Learning of Hematopoietic Stem Cell Divisions from Paired
Daughter Cell Expression Profiles Reveals Effects of Aging on Self-Renewal" (Cell
Syst. 2020 Dec 16;11(6):640-652.e5. doi: 10.1016/j.cels.2020.11.004. Epub 2020 Dec
8. PMID: 33296684).
KR 102144719 B1 titled "Parkinson's disease diagnosis apparatus based on AI (artificial intelligence)
using multiple prediction result by multiple learning models, and method",
CN 110222745 B titled "Similarity learning based and enhanced cell type identification method",
WO 2021/163706 A1 titled "Panomic genomic prevalence score" and
Szalai, Bence, Vigneshwari Subramanian, Christian H Holland, Robert Alföldi, László
G Puskas, and Julio Saez-Rodriguez. 2019. "Signatures of Cell Death and Proliferation
in Perturbation Transcriptomics Data-from Confounding Factor to Effective Prediction."
Nucleic Acids Research 47 (19): 10010-26. https://doi.org/10.1093/nar/gkz805.
[0008] In view of the above, it is a problem underlying the invention to provide techniques
which allow a higher-resolution assessment of the health of a cell, in particular
of its survival potential.
SUMMARY
[0009] Techniques for predicting the survival potential of a cell at single-cell resolution
are disclosed herein. The cell may be part of a biological sample, e.g., a cell sample
comprising mammalian cells, in particular human cells which may have been obtained
in a biopsy. Embodiments of the disclosed subject-matter include a computer-implemented
method of predicting the survival potential of a cell, a computer-implemented method
of training a machine-learning model, a corresponding machine-learning model data
structure, as well as corresponding computer programs and data processing systems.
[0010] Certain embodiments are based on the observation that the survival potential of the
cells composing a sample (e.g. a tissue) may not be uniform, but may depend, for example,
on the cell type. Certain embodiments can track the survival potential at single-cell
level and allow the identification of clusters of "unhealthy" cells, making them highly
valuable for investigating the effects of drugs on cell subpopulations.
[0011] Although embodiments of the invention are disclosed herein in the context of predicting
the survival potential, the concepts disclosed herein are more generally applicable
to predicting a health level of a cell, preferably at single-cell resolution, a health
level being any parameter associated with or relevant for the health of a cell. Examples
of a cell's health-level-related information besides the survival potential include,
without limitation, the capacity of proliferation, differentiation, synaptic activity,
ion transport activity, regeneration speed, age, and/or metabolism.
[0012] Certain embodiments of the invention may be based on the usage of a machine-learning
model and/or a machine-learning algorithm. Machine learning may refer to algorithms
and statistical models that computer systems may use to perform a specific task without
using explicit instructions, instead relying on models and inference. For example,
in machine-learning, instead of a rule-based transformation of data, a transformation
of data may be used that is inferred from an analysis of historical and/or training
data. For example, the content of data relating to cell samples or other biological
samples may be analyzed using a machine-learning model or using a machine-learning
algorithm. In order for the machine-learning model to analyze the content of the data,
the machine-learning model may be trained using training data as input and training
content information as output. By training the machine-learning model with a large
number of training data and associated training content information (e.g. labels or
annotations), the machine-learning model "learns" a transformation between the input
data and the output, which can be used to provide an output based on non-training
data provided to the machine-learning model. The provided data may be preprocessed
to obtain a feature vector, which is used as input to the machine-learning model.
An innovative aspect in applying machine-learning techniques in embodiments of the
invention involves framing the prediction problem appropriately for the given task and
building the appropriate prediction model for that task.
[0013] One embodiment of a computer-implemented method serves for predicting the survival
potential of a cell at single-cell resolution. The method may comprise receiving an
input dataset. The input dataset may comprise omics data associated with at least
one cell, multiple cells, or even all cells of a biological sample. Omics data may
comprise any information associated with a cell or any parameter of a cell which is
usable for inferring or determining a survival potential of the respective cell. Generally,
the branches of science known informally as omics are various disciplines in biology
whose names end in the suffix -omics, such as genomics, epigenomics, microbiomics,
lipidomics, proteomics, glycomics, foodomics, metabolomics, metagenomics and transcriptomics.
Accordingly, the -omics data may in one aspect comprise data associated with at least
one of a transcriptomic profile, a proteomic profile and an epigenomic profile of
the cell.
[0014] As one specific example, the input dataset may comprise a transcriptomic profile
which comprises a gene expression profile at single-cell resolution. In addition,
or alternatively, the transcriptomic profile may comprise single-cell mRNA sequencing
data.
[0015] The input dataset profiles are preferably in h5 or FASTQ format, and may also comprise
metadata (preferably in text format). The text may be converted into a format suitable
for further processing, such as the comma-separated values (CSV) format.
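As an illustrative sketch of the conversion mentioned above (the tab delimiter and column names are assumptions; real metadata files may differ), delimiter-separated metadata text can be rewritten as CSV using only the standard library:

```python
import csv
import io

def metadata_text_to_csv(text, delimiter="\t"):
    """Convert delimiter-separated metadata text (e.g. accompanying an h5 or
    FASTQ dataset) into CSV. The tab delimiter is an assumption; other
    separators can be passed in."""
    rows = [line.split(delimiter) for line in text.strip().splitlines()]
    out = io.StringIO()
    csv.writer(out).writerows(rows)
    return out.getvalue()
```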
[0016] The method may comprise inferring a numerical score, also referred to herein as "survival
outcome score" or "SOS", which represents a predicted survival potential value of
the at least one cell using a trained machine-learning model, in particular using
any machine-learning model disclosed herein. A numerical score may be inferred for
each one of several selected cells or one numerical score may be inferred per cell
of the cell sample. It is also conceivable to infer a combined numerical score for
a group of cells, or only for a selected subset of cells as representatives of the
cell sample. The method may generate an output dataset comprising the numerical score
or scores. The numerical score may be a number in the form of an integer or, more
preferably, a floating-point number. The numerical score may correspond to or fall
within one of several classes on a scale, non-limiting examples of which include "dead",
"ongoing cell death", "weak", "robust" and/or "regenerative".
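The mapping of a numerical score to the illustrative classes named above can be sketched as follows; the threshold values are hypothetical placeholders, not values prescribed by the disclosure:

```python
def sos_to_class(score):
    """Map a survival outcome score (SOS) to one of the illustrative classes
    named in the text. The thresholds are hypothetical placeholders."""
    thresholds = [
        (0.0, "dead"),
        (0.25, "ongoing cell death"),
        (0.5, "weak"),
        (0.75, "robust"),
    ]
    label = "regenerative"          # scores above the last threshold
    for upper, name in thresholds:
        if score <= upper:
            label = name
            break
    return label
```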
[0017] Accordingly, the method departs from the known approaches, in which a cell can be
classified only as either dead or alive, and provides a more proactive, forward-looking
approach in which the survival potential is effectively forecasted.
[0018] The method may further comprise generating a unified representation of data of the
input dataset, in particular a gene expression matrix based on at least part of the
input dataset. The gene expression matrix may encode a normalized expression of a
given gene in a given cell, preferably for all cells under consideration. In one aspect,
if the input dataset comprises a transcriptomic profile, the method may convert the
transcriptomic profile into a uniformized matrix representation which encodes a (normalized)
expression of a given gene in a given cell. The following is an illustrative non-limiting
example of a gene expression matrix:
|        | Cell 1 | Cell 2 | ... | Cell n |
| Gene 1 | 3.7    | 2.0    | ... | 13.8   |
| Gene 2 | 2.2    | 3.2    | ... | 1.1    |
| Gene 3 | 1.6    | 14.4   | ... | 18.6   |
| ...    | ...    | ...    | ... | ...    |
| Gene m | 25.5   | 0.0    | ... | 0.0    |
[0019] The gene expression matrix is preferably encoded in a matrix representation. The
matrix may be provided in a machine-readable format such as the Hierarchical Data
Format (HDF), e.g. HDF4 or HDF5, comma-separated values (CSV), tab-separated values
(TSV), Matrix Market (MM), a 3-columns sparse matrix representation, or the like.
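The 3-column sparse representation mentioned above can be sketched as follows; the function names are illustrative, and a production implementation would typically rely on an existing sparse-matrix library:

```python
def to_sparse_triplets(matrix):
    """Convert a dense gene-by-cell expression matrix (list of rows) into the
    3-column sparse representation: (gene_index, cell_index, value) for every
    non-zero entry."""
    triplets = []
    for g, row in enumerate(matrix):
        for c, value in enumerate(row):
            if value != 0.0:
                triplets.append((g, c, value))
    return triplets

def from_sparse_triplets(triplets, n_genes, n_cells):
    """Reconstruct the dense matrix from the triplet representation."""
    matrix = [[0.0] * n_cells for _ in range(n_genes)]
    for g, c, value in triplets:
        matrix[g][c] = value
    return matrix
```

Because a gene expression matrix is typically dominated by zeros, the triplet form usually occupies far less space than the dense form.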
[0020] The generated gene expression matrix may serve as the input for inferring the numerical
score (the "survival outcome score") described above. Using the gene expression matrix
as input for the machine-learning model has the advantage that the data can be structured
into biologically relevant units of information, and such a matrix can be generated
in many contexts, which makes the model applicable in a broad range of situations.
[0021] Generating the gene expression matrix may comprise at least one of:
- removing a cell if an associated number of expressed genes of the cell is below a
predefined threshold
- removing a cell if a unique molecular identifier (UMI) count of the cell is below
a predefined threshold
- removing a cell if an associated number of expressed genes is outside a global normal
distribution of the number of expressed genes in the input dataset
- removing a cell if an associated magnitude of expressed genes is outside a global
normal distribution of the magnitude of expressed genes in the input dataset
- removing a gene if a number of cells in which the gene is expressed is below a predefined
threshold
- removing a gene if the expression magnitude of a gene is steady across a number of
cells
- homogenizing, in particular by down-sampling, an amount of UMIs per cell to be within
a predefined range
- reorganizing matrix rows and/or matrix columns according to gene names
- identifying missing genes and, optionally, performing a remedying action (e.g. filling
the missing entry with a specific value, such as an average value or a value computed
by an imputation method)
- adjusting expression values so that they are expressed in an identical unit (e.g. log-transformation
and/or scaling by a pre-defined (or computed) normalization factor and/or variable
standardization, gene length normalization and/or expression centering).
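A subset of the steps listed above can be sketched as follows; the default thresholds and the scaling factor are hypothetical, not values prescribed by the disclosure:

```python
import numpy as np

def filter_matrix(counts, min_genes_per_cell=200, min_cells_per_gene=3):
    """Illustrative quality-control filtering of a genes-by-cells UMI count
    matrix, covering three of the listed steps: dropping cells that express
    too few genes, dropping genes detected in too few cells, and adjusting
    expression values onto an identical unit. Thresholds are hypothetical."""
    counts = np.asarray(counts, dtype=float)
    # Keep cells (columns) expressing at least min_genes_per_cell genes.
    genes_per_cell = (counts > 0).sum(axis=0)
    counts = counts[:, genes_per_cell >= min_genes_per_cell]
    # Keep genes (rows) detected in at least min_cells_per_gene cells.
    cells_per_gene = (counts > 0).sum(axis=1)
    counts = counts[cells_per_gene >= min_cells_per_gene, :]
    # Library-size normalization followed by log1p, so expression values
    # are comparable across cells.
    totals = counts.sum(axis=0, keepdims=True)
    totals[totals == 0] = 1.0
    return np.log1p(counts / totals * 1e4)
```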
[0022] The input dataset may be derived from different sources. One example includes a nucleic
acid sequence count, preferably in FASTA/FASTQ format, generated by a sequencing instrument.
Generating the gene expression matrix may comprise processing a cell barcode in the
nucleic acid sequence to identify the cell of origin, and comparing the nucleic acid
sequence to a reference genome to identify the gene or genes of origin.
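The barcode-processing step can be sketched as below; the 16-base barcode length is an assumption (it varies by protocol), and the alignment of the remaining sequence to a reference genome is not shown:

```python
def split_barcode(read_sequence, barcode_length=16):
    """Illustrative extraction of a cell barcode from the start of a read,
    as in droplet-based protocols where the first bases of a read encode the
    cell of origin. The 16-base length is an assumption."""
    barcode = read_sequence[:barcode_length]  # identifies the cell of origin
    cdna = read_sequence[barcode_length:]     # to be mapped to the genome
    return barcode, cdna
```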
[0023] Another example involves gene expression values which are already organized as a
matrix; such a matrix may be obtained from multiple sources, including processed sequencing
data or pre-computed gene expression values.
[0024] The machine-learning model may comprise a deep-learning model and an ordinal regression
model. Inferring the numerical score may comprise processing the input dataset by
the deep-learning model and the ordinal regression model, preferably in parallel,
and generating the numerical score based on a combination of outputs of the deep-learning
model and the ordinal regression model. The outputs may not be directly combined;
rather, the values may first be standardized independently over the datasets. In certain
embodiments, two pieces of information are pertinent when combining: the average score
and the difference between the scores. The combination of these two types of models is
particularly beneficial because an advantage of the ordinal regression model is that
it is easy to interpret. Compared to the deep-learning model, the ordinal regression
model may provide better insight into the genes that are used to compute the SOS score.
In some embodiments, the ordinal regression model was used to identify a reduced list of
genes for computing the survival score, onto which biological hypotheses underlying
embodiments of the invention are built. A drawback of the ordinal regression model
may be that it is more sensitive to noise in the input data. In contrast, the deep-learning
model can be designed to be more robust to noise in the input dataset because of a
number of factors: (1) The model can use all input genes to compute the SOS score,
and can then aggregate this information to correct a noisy input; (2) the deep-learning
model can be trained with noisy input data to push the network to produce robust scores;
(3) the deep-learning model may be designed to be robust to variations in the normalization
of the input: by introducing a feedback loop, the network normalizes distinct views
of the input data differently. A drawback of the deep-learning model may be its
interpretability: as the expression of virtually all genes may be used by the model,
it may be more difficult to extract a reduced list of genes from which to extrapolate
a biological hypothesis, as is possible with the ordinal regression model.
[0025] The method may further comprise generating a confidence index which indicates an
expected variability between the outputs of the deep-learning model and the ordinal
regression model. The output dataset may comprise the confidence index. This way,
it is possible to evaluate the quality of the scores produced by the method.
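The combination described in the two preceding paragraphs can be sketched as follows; the exact combination rule (independent standardization, averaging, and using the per-cell difference as a confidence index) is an assumption based on the description above:

```python
import numpy as np

def combine_scores(dl_scores, or_scores):
    """Sketch of the score combination: each model's scores are standardized
    independently over the dataset, the average of the standardized scores is
    the combined SOS, and their per-cell absolute difference serves as a
    confidence index (smaller difference = better agreement)."""
    def standardize(x):
        x = np.asarray(x, dtype=float)
        sd = x.std()
        return (x - x.mean()) / (sd if sd > 0 else 1.0)

    z_dl = standardize(dl_scores)
    z_or = standardize(or_scores)
    combined = (z_dl + z_or) / 2.0        # the average score
    confidence = np.abs(z_dl - z_or)      # the difference between scores
    return combined, confidence
```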
[0026] The method may further comprise displaying the output dataset on an electronic display
device. The displaying may comprise displaying the numerical values directly and/or
displaying a graphical representation of the numerical score. The predicted survival
potential may be indicated by a color associated with the numerical score and/or by
its position on a scale. Accordingly, this allows for an easy-to-perceive and intuitively
understandable presentation of the scores.
[0027] The method may further comprise repeating at least a subset of the steps disclosed
above with at least one additional input dataset associated with the same subject
to generate at least one additional output dataset. The output dataset and the at
least one additional output dataset may be compared. This makes it possible to derive meaningful
insights into the development of the subject over time, e.g. for monitoring the course
of a disease or the effectiveness of a drug.
[0028] In one embodiment, a machine-learning model data structure is provided. Machine-learning
algorithms are usually based on a machine-learning model. In other words, the term
"machine-learning algorithm" may denote a set of instructions that may be used to
create, train or use a machine-learning model. The term "machine-learning model" may
denote a data structure and/or set of rules that represents the learned knowledge
(e.g., based on the training performed by the machine-learning algorithm). In embodiments,
the usage of a machine-learning algorithm may imply the usage of an underlying machine-learning
model (or of a plurality of underlying machine-learning models). The usage of a machine-learning
model may imply that the machine-learning model and/or the data structure/set of rules
that is the machine-learning model is trained by a machine-learning algorithm.
[0029] For example, the machine-learning model may be an artificial neural network (ANN).
ANNs are systems that are inspired by biological neural networks, such as can be found
in a retina or a brain. ANNs comprise a plurality of interconnected nodes and a plurality
of connections, so-called edges, between the nodes. There are usually three types
of nodes, input nodes that receive input values, hidden nodes that are (only) connected
to other nodes, and output nodes that provide output values. Each node may represent
an artificial neuron. Each edge may transmit information, from one node to another.
The output of a node may be defined as a (non-linear) function of its inputs (e.g.,
of the sum of its inputs). The inputs of a node may be used in the function based
on a "weight" of the edge or of the node that provides the input. The weight of nodes
and/or of edges may be adjusted in the learning process. In other words, the training
of an artificial neural network may comprise adjusting the weights of the nodes and/or
edges of the artificial neural network, i.e. to achieve a desired output for a given
input.
[0030] Alternatively, the machine-learning model may be a support vector machine, a random
forest model or a gradient boosting model. Support vector machines (i.e., support
vector networks) are supervised learning models with associated learning algorithms
that may be used to analyze data (e.g., in classification or regression analysis).
Support vector machines may be trained by providing an input with a plurality of training
input values that belong to one of two categories. The support vector machine may
be trained to assign a new input value to one of the two categories. Alternatively,
the machine-learning model may be a Bayesian network, which is a probabilistic directed
acyclic graphical model. A Bayesian network may represent a set of random variables
and their conditional dependencies using a directed acyclic graph.
[0031] Alternatively, the machine-learning model may be based on a genetic algorithm, which
is a search algorithm and heuristic technique that mimics the process of natural selection.
[0032] The machine-learning model may be configured for predicting the survival potential
of a cell at single-cell resolution, in particular in any of the methods disclosed herein.
The machine-learning model may comprise an input module configured for receiving an
input dataset and an output module configured to produce an output dataset. The input
dataset may comprise omics data associated with at least one cell of a cell sample.
The output dataset may comprise a numerical score which represents a predicted survival
potential of the at least one cell.
[0033] The machine-learning model may comprise a deep-learning model and an ordinal regression
model. The machine-learning model may comprise a combiner module for combining outputs
of the deep-learning model and the ordinal regression model to generate the numerical
score or scores.
[0034] The deep-learning model may comprise an input layer comprising a plurality of nodes.
Each node may be associated with a particular gene of a cell. The deep-learning model
may comprise a first hidden layer comprising a plurality of nodes i, each node i being
associated with a linear weight w_i subject to the normalization constraint ∥w_i∥ = 1,
wherein x denotes the input of the first hidden layer (206) and the first hidden layer
(206) is configured for dividing the outputs <w_i.x> by d = <|w_i|.|x|>. The deep-learning
model may comprise an output layer comprising a node associated
with a numerical score which represents the predicted survival potential of the cell.
This allows the normalization to be more efficient. The normalization considerably
improves the adaptation of the model to new datasets.
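The normalized first hidden layer described above can be sketched as follows; each raw output <w_i.x> is divided by d = <|w_i|.|x|>, which bounds every output to [-1, 1] and makes it insensitive to a global rescaling of the input:

```python
import numpy as np

def normalized_hidden_layer(x, W):
    """Sketch of the first hidden layer: each node i has a weight vector w_i
    with ||w_i|| = 1, and the raw output <w_i, x> is divided by
    d = <|w_i|, |x|>, the inner product of the element-wise absolute values."""
    W = np.asarray(W, dtype=float)
    # Enforce the normalization constraint ||w_i|| = 1 on every node.
    W = W / np.linalg.norm(W, axis=1, keepdims=True)
    x = np.asarray(x, dtype=float)
    raw = W @ x                    # <w_i, x> for each node i
    d = np.abs(W) @ np.abs(x)      # d = <|w_i|, |x|>
    d[d == 0] = 1.0
    return raw / d
```

Since both the numerator and the denominator scale linearly with x, the layer output is unchanged if the whole input is multiplied by a constant, which illustrates how the normalization improves adaptation to differently scaled datasets.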
[0035] The ordinal regression model may comprise a loss function of the following form:

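The specific form of the disclosed loss is given by the formula above and is not reproduced here. Purely as an illustration of the general kind of loss used in ordinal regression (an assumption, not the disclosed form), an all-threshold hinge loss penalizes a score for every ordered threshold it falls on the wrong side of:

```python
def all_threshold_loss(score, label, thresholds):
    """Illustrative all-threshold ordinal regression loss (an assumption; not
    the specific loss of the disclosure). A sample with ordinal label k should
    have its score above thresholds 0..k-1 and below thresholds k..K-1; each
    violated threshold contributes a hinge penalty with margin 1."""
    loss = 0.0
    for j, t in enumerate(thresholds):
        if j < label:                          # score should exceed threshold j
            loss += max(0.0, 1.0 - (score - t))
        else:                                  # score should fall below threshold j
            loss += max(0.0, 1.0 + (score - t))
    return loss
```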
[0036] The output dataset produced by one of the machine-learning models disclosed herein
may be used for a variety of use cases. Non-limiting examples include:
- predicting an effectiveness of a candidate agent for a degenerative disease
- assessing the level of cytotoxicity in a population of cells
- indicating the severity of a degenerative disease to improve prognosis
- identifying which particular cell types are responding to a given treatment
- assessing the health status of a single cell
[0037] In one embodiment, a computer-implemented method of training a machine-learning model
is provided. The machine-learning model may be configured for predicting the survival
potential of a cell sample at single-cell resolution, in particular in accordance
with any of the characteristics disclosed herein.
[0038] The method may comprise receiving a training dataset. The training dataset may comprise
omics data associated with a plurality of cells and information on the health status
of the plurality of cells. The method may comprise training the machine-learning model
using the training dataset or data associated with the training dataset.
[0039] Machine-learning models may be trained using training input data. The above example
uses a training method called "supervised learning". In supervised learning, the machine-learning
model is trained using a plurality of training samples, wherein each sample may comprise
a plurality of input data values, and a plurality of desired output values, i.e.,
each training sample is associated with a desired output value. By specifying both
training samples and desired output values, the machine-learning model "learns" which
output value to provide based on an input sample that is similar to the samples provided
during the training. Apart from supervised learning, semi-supervised learning may
be used. In semi-supervised learning, some of the training samples lack a corresponding
desired output value. Supervised learning may be based on a supervised learning algorithm
(e.g., a classification algorithm, a regression algorithm or a similarity learning
algorithm). Classification algorithms may be used when the outputs are restricted
to a limited set of values (categorical variables), i.e., the input is classified
to one of the limited set of values. Regression algorithms may be used when the outputs
may have any numerical value (within a range). Similarity learning algorithms may
be similar to both classification and regression algorithms but are based on learning
from examples using a similarity function that measures how similar or related two
objects are.
[0040] Apart from supervised or semi-supervised learning, unsupervised learning may be used
to train the machine-learning model. In unsupervised learning, (only) input data might
be supplied and an unsupervised learning algorithm may be used to find structure in
the input data (e.g. by grouping or clustering the input data, finding commonalities
in the data). Clustering is the assignment of input data comprising a plurality of
input values into subsets (clusters) so that input values within the same cluster
are similar according to one or more (pre-defined) similarity criteria, while being
dissimilar to input values that are included in other clusters.
[0041] Reinforcement learning is a third group of machine-learning algorithms that may be
used to train the machine-learning model. In reinforcement learning, one or more software
actors (called "software agents") are trained to take actions in an environment. Based
on the taken actions, a reward is calculated. Reinforcement learning is based on training
the one or more software agents to choose the actions such, that the cumulative reward
is increased, leading to software agents that become better at the task they are given
(as evidenced by increasing rewards).
[0042] Furthermore, some techniques may be applied to some of the machine-learning algorithms.
For example, feature learning may be used. In other words, the machine-learning model
may at least partially be trained using feature learning, and/or the machine-learning
algorithm may comprise a feature learning component. Feature learning algorithms,
which may be called representation learning algorithms, may preserve the information
in their input but also transform it in a way that makes it useful, often as a pre-processing
step before performing classification or predictions. Feature learning may be based
on principal components analysis or cluster analysis, for example.
[0043] In some examples, anomaly detection (i.e., outlier detection) may be used, which
is aimed at providing an identification of input values that raise suspicions by differing
significantly from the majority of input or training data. In other words, the machine-learning
model may at least partially be trained using anomaly detection, and/or the machine-learning
algorithm may comprise an anomaly detection component.
[0044] In some examples, the machine-learning algorithm may use a decision tree as a predictive
model. In other words, the machine-learning model may be based on a decision tree.
In a decision tree, observations about an item (e.g., a set of input values) may be
represented by the branches of the decision tree, and an output value corresponding
to the item may be represented by the leaves of the decision tree. Decision trees
may support both discrete values and continuous values as output values. If discrete
values are used, the decision tree may be denoted a classification tree, if continuous
values are used, the decision tree may be denoted a regression tree.
[0045] Association rules are a further technique that may be used in machine-learning algorithms.
In other words, the machine-learning model may be based on one or more association
rules. Association rules are created by identifying relationships between variables
in large amounts of data. The machine-learning algorithm may identify and/or utilize
one or more relational rules that represent the knowledge that is derived from the
data. The rules may e.g. be used to store, manipulate or apply the knowledge.
[0046] The training dataset may be obtained in the context of a biological experiment of
one of the following designs:
- a perturbation experiment where beneficial and/or aggressive treatments, such as mechanical
aggression, are applied to cells and their transcriptomic response is followed
over time together with their health status, such as e.g. cell death and/or cell proliferation
- a regeneration experiment where transcriptomic profiles of cells from regenerative
tissues are followed over time so the regenerative ability of a cell can be estimated
from its transcriptomic profile
- a degeneration experiment where the transcriptomic profiles of cells in degenerative
diseases are compared or followed over time
- an aging experiment where transcriptomic profiles of cells in organisms of increasing
age are followed or compared between specimens so as to assess the impact of age on the
cell transcriptome
- a developmental experiment where transcriptomic profiles of cells are followed during
development of an organism
[0047] The training method may further comprise generating a gene expression matrix based
on at least part of the training dataset, wherein the gene expression matrix encodes
a normalized expression of a given gene in a given cell. The gene expression matrix
may be identical, or at least similar, to the gene expression matrix used during the
inference method described further above. The generated gene expression matrix may
serve as the input for the training method.
[0048] The training of the machine-learning model may comprise subjecting a deep-learning
model of the machine-learning model to a first training phase. The first training
phase may comprise a first plurality of epochs, preferably at least 30 epochs. The
training may comprise determining a pruning mask for a first hidden layer of the deep-learning
model. The pruning mask may be configured for selecting a first set of nodes with
the highest weights and a second set of nodes with the lowest weights. The training
may comprise subjecting the deep-learning model to a second training phase with the
pruning mask applied. The second training phase may comprise a second plurality of
epochs, preferably at least 30 epochs. The pruning serves to simplify the model, guarding
against overfitting, and to make the network more robust to global normalization changes.
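The pruning mask described above can be sketched as follows; the fraction of nodes kept at each end is a hypothetical parameter:

```python
import numpy as np

def pruning_mask(weights, keep_fraction=0.2):
    """Sketch of the pruning mask: for a first hidden layer with one weight
    vector per node, keep the nodes whose weight magnitudes are the highest
    and those whose magnitudes are the lowest, masking out the middle. The
    keep fraction per end is a hypothetical parameter."""
    magnitudes = np.linalg.norm(np.asarray(weights, dtype=float), axis=1)
    n = len(magnitudes)
    k = max(1, int(n * keep_fraction))
    order = np.argsort(magnitudes)      # indices sorted by ascending magnitude
    mask = np.zeros(n, dtype=bool)
    mask[order[:k]] = True              # second set: nodes with the lowest weights
    mask[order[-k:]] = True             # first set: nodes with the highest weights
    return mask
```

The returned boolean mask would then be applied to the first hidden layer during the second training phase, so that only the selected nodes contribute.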
[0049] Preferably, a dropout probability of the first hidden layer involved in the first
and/or second training phase may be selected from the range of 30% to 70%, more preferably
from the range of 40% to 60%, and most preferably is essentially 50%. With a dropout
probability of 50%, half of the values are hidden during the training process, which
in certain embodiments is considered a good estimate of what one may expect in a new
dataset. Setting the dropout probability lower may not make the output robust enough
to noise, while setting it higher may require a large number of epochs in the training
process.
[0050] Some embodiments of the invention provide a data carrier having electronically readable
control signals, which are capable of cooperating with a programmable computer system,
such that one of the methods described herein is performed.
[0051] Generally, embodiments of the invention can be implemented as a computer program
(product) with a program code, the program code being operative for performing one
of the methods when the computer program product runs on a computer. The program code
may, for example, be stored on a machine-readable carrier. Other embodiments comprise
the computer program for performing one of the methods described herein, stored on
a machine-readable carrier. In other words, an embodiment of the present invention
is, therefore, a computer program having a program code for performing one of the
methods described herein, when the computer program runs on a computer.
[0052] A further embodiment of the invention provides a storage medium (or a data carrier,
or a computer-readable medium) comprising, stored thereon, the computer program for
performing one of the methods described herein when it is performed by a processor.
The data carrier, the digital storage medium or the recorded medium are typically
tangible and/or non-transitory. A further embodiment of the present invention is
an apparatus as described herein comprising a processor and the storage medium.
[0053] A further embodiment of the invention provides a data stream or a sequence of signals
representing the computer program for performing one of the methods described herein.
The data stream or the sequence of signals may, for example, be configured to be transferred
via a data communication connection, for example, via the internet.
[0054] A further embodiment of the invention provides a processing means, for example, a
computer or a programmable logic device, configured to, or adapted to, perform one
of the methods described herein.
[0055] A further embodiment of the invention provides a computer having installed thereon
the computer program for performing one of the methods described herein.
[0056] A further embodiment of the invention provides an apparatus or a system configured
to transfer (e.g., electronically or optically) a computer program for performing
one of the methods described herein to a receiver. The receiver may, for example,
be a computer, a mobile device, a memory device or the like. The apparatus or system
may, for example, comprise a file server for transferring the computer program to
the receiver.
[0057] In some embodiments, a programmable logic device (for example, a field programmable
gate array) may be used to perform some or all of the functionalities of the methods
described herein. In some embodiments, a field programmable gate array may cooperate
with a microprocessor in order to perform one of the methods described herein. Generally,
the methods are preferably performed by any hardware apparatus.
BRIEF DESCRIPTION OF THE DRAWINGS
[0058] The disclosure may be better understood by reference to the following drawings:
- Fig. 1:
- A flowchart illustrating a machine-learning phase and an inference phase in accordance
with embodiments of the invention
- Fig. 2:
- A conceptual representation of a deep learning model usable in embodiments of the
invention
- Fig. 3:
- A graphical representation of survival outcome scores in accordance with embodiments
of the invention as compared to the prior art
- Fig. 4:
- Validation results to exemplify the robustness of survival outcome scores calculated
for a sample of cells in accordance with embodiments of the invention
- Fig. 5:
- Validation results to exemplify the robustness of survival outcome scores obtained
by a 200-genes classifier in accordance with embodiments of the invention
DESCRIPTION OF PREFERRED EMBODIMENTS
[0059] In the following, various techniques for predicting the survival potential of a cell
at single-cell resolution will be described. The disclosed techniques may be used
for identifying a candidate agent that may be effective to treat or ameliorate an
effect of a (neuro)degenerative disease in a subject, or for a specific treatment.
[0060] In one embodiment, the overall process may include a database creation phase, a model
building phase and a health score inference phase. The database creation phase may
include extracting a large dataset from single-cell RNA sequencing data. The database
creation phase may also include filtrating and/or normalizing the data. The model
building phase may include building a classifier based on a machine-learning strategy.
The model building phase may also include validating the classifier with cells which
respond positive or negative to the treatment under consideration. The health score
inference phase may include algorithmically identifying cells which respond to the
treatment to ultimately accelerate the discovery of a suitable drug.
[0061] Fig. 1 illustrates an overall workflow according to an embodiment including a machine-learning
phase 102 and an inference phase 110. The overall purpose of the machine-learning
phase 102 is to build a suitable model from training data, and the purpose of the
inference phase 110 is to use the model on real input data to predict the survival
potential. It shall be understood that the machine-learning phase 102 and the inference
phase 110 may be performed by different entities at different points in time, independent
of each other.
[0062] In the following, the embodiment of the machine-learning phase 102 shown in Fig.
1 will first be described. The machine-learning phase 102 operates on a training dataset
104 as input. In the illustrated example, the training dataset 104 comprises transcriptomic
profiles and information about the cell health status. The data may comprise raw datasets
such as FASTQ files, for example, which have been produced using single-cell genomics techniques
such as Drop-seq, Nuc-seq or 10X genomics scRNA-seq. The data may be converted to
a computer-readable format, such as comma-separated values (CSV).
[0063] The cell names may be barcoded. In single-cell transcriptomics, barcoding of the
cell typically involves (1) isolating cells into droplets, wells of a plate, or using
a microfluidic system, (2) lysing and extracting RNA from each cell, and (3) attaching a unique nucleic-acid sequence (the barcode) to the transcripts of each isolate (usually during reverse transcription). The barcode will typically be sequenced together
with the transcripts during sequencing.
[0064] In the example, the data has been produced using different genomics methods, and
thus a customization step is performed for every set of data to ensure proper normalization.
To this end, as illustrated in Fig. 1, the machine-learning phase 102 includes a matrix
building step 106. In the example, the matrix (also referred to herein as gene expression
matrix or GEX matrix) is generated by applying defined filters. Cells that express too low a number of genes and UMIs (fewer than 500, indicating cellular debris) are removed, as are cells that are outside the global normal distribution of the number of expressed genes (cell aggregation or doublets). Furthermore, genes that are expressed in only a low fraction of the cells are removed, as they cannot represent any meaningful variations or biological features. The number of UMIs per cell within the acceptable thresholds is homogenized by down-sampling to generate gene expression profiles that are comparable.
The resulting gene expression matrix is composed of the gene expression of each cell
present in the dataset.
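The filtering and down-sampling steps described above could be sketched as follows; the thresholds, function names, and parameters are illustrative assumptions, not the exact patented values:

```python
# Illustrative sketch (not the patented implementation) of the GEX matrix
# construction of paragraph [0064]; all thresholds are assumptions.
import numpy as np

def build_gex_matrix(counts, min_genes=500, z_max=3.0, min_cell_fraction=0.01,
                     target_umis=None, seed=0):
    """counts: cells x genes matrix of raw UMI counts."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=np.int64)

    # 1) Remove cellular debris: cells expressing too few genes.
    genes_per_cell = (counts > 0).sum(axis=1)
    counts = counts[genes_per_cell >= min_genes]

    # 2) Remove doublets/aggregates: cells outside the global normal
    #    distribution of the number of expressed genes.
    genes_per_cell = (counts > 0).sum(axis=1)
    z = (genes_per_cell - genes_per_cell.mean()) / (genes_per_cell.std() + 1e-9)
    counts = counts[np.abs(z) <= z_max]

    # 3) Remove genes expressed in too small a fraction of the cells.
    expressed_fraction = (counts > 0).mean(axis=0)
    counts = counts[:, expressed_fraction >= min_cell_fraction]

    # 4) Homogenize UMIs per cell by down-sampling to a common depth.
    if target_umis is None:
        target_umis = int(counts.sum(axis=1).min())
    downsampled = np.empty_like(counts)
    for i, row in enumerate(counts):
        # draw target_umis transcripts without replacement from this cell
        picks = rng.choice(np.repeat(np.arange(row.size), row),
                           size=target_umis, replace=False)
        downsampled[i] = np.bincount(picks, minlength=row.size)
    return downsampled
```

After this step, every remaining cell contributes the same total UMI count, which makes the per-gene expression profiles directly comparable across cells.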
[0065] The generated gene expression matrix is then used in the model building step 108
illustrated in Fig. 1. In the described embodiment, the machine-learning model is
a combination of two different machine-learning models, namely a deep-learning model
118 and an ordinal regression model 120. The ordinal regression model 120 of the illustrated
embodiment serves for extracting a list of marker genes, and may be sensitive to input
data changes. The deep-learning model 118 of the illustrated embodiment is designed
to be highly robust to input changes and thus allows more flexibility to adapt for
multi-modal datasets, such as single-cell RNA-seq that can be integrated in the analysis
as an extra set of parameters.
[0066] Fig. 2 is a more detailed illustration of the structure of the deep-learning model
118 according to an embodiment. The model 118 comprises an input layer 202, a first
custom linear layer 206, a second linear layer 210, a third linear layer 214, an output
layer 218, and four dropout layers 204, 208, 212, 216.
[0067] The input layer 202 comprises one node per gene. The inputs are expected to be Reads Per Million (RPM) normalized gene expression profiles on a logarithmic scale.
[0068] The output layer 218 comprises one node for the survival outcome score.
[0069] To make the model 118 robust to missing values, missing values are simulated with a high dropout rate throughout the network, including at the input layer. Accordingly, the model 118 in Fig. 2 comprises a first dropout layer 204 after the input layer 202 with a dropout
rate of 50%, and three more dropout layers 208, 212, 216, each with a dropout rate
of 30%.
[0070] The first linear layer 206 of the model 118 is a custom linear layer which is designed
to handle different input normalizations. Let x be the input of layer 206, and wi the weights for the i-th node of layer 206. Layer 206 is designed such that:
- The linear weights of all nodes have a normalization constraint of 1: ∥wi∥ = 1
- The outputs <wi.x> of the linear layer 206 are divided by d = <|wi|.|x|> (i.e., the dot product between the absolute values of wi and the absolute values of x).
[0071] The rationale behind the customization of layer 206 is that the linear weights wi can be seen as a mask on the input values that selects a view of the input data. Dividing the linear outputs by d is a way to dynamically normalize (as represented by the arrow 220 in Fig. 2) the data on a reduced fraction of the input that is considered by node i. This allows
the normalization to be more efficient. The inventors have found and validated in
experiments that this considerably improves the adaptation of the model to new datasets.
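Under the stated constraints, a minimal NumPy sketch of the forward pass of such a custom layer could look as follows; the class name and the random initialization are illustrative assumptions, and the actual layer is of course part of a trained network:

```python
# Sketch of the custom first layer of paragraphs [0070]-[0071] (forward pass
# only); names and initialization are illustrative assumptions.
import numpy as np

class DynamicNormLinear:
    """Linear layer whose weight rows are L2-normalized (||w_i|| = 1) and whose
    outputs <w_i, x> are divided by d_i = <|w_i|, |x|>, dynamically normalizing
    each node on the fraction of the input it actually looks at."""
    def __init__(self, n_in, n_out, seed=0):
        rng = np.random.default_rng(seed)
        w = rng.standard_normal((n_out, n_in))
        # normalization constraint: every node's weight vector has unit L2 norm
        self.w = w / np.linalg.norm(w, axis=1, keepdims=True)

    def forward(self, x):
        num = self.w @ x                    # <w_i, x>
        den = np.abs(self.w) @ np.abs(x)    # d_i = <|w_i|, |x|>
        return num / (den + 1e-12)          # small epsilon avoids division by 0
```

Because both the numerator and the denominator scale linearly with the magnitude of x, rescaling the input (e.g., under a different global normalization) leaves the outputs essentially unchanged, which illustrates the robustness argument above.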
[0072] One embodiment of a training process for the model 118 comprises a first training
phase in which there are no pruning constraints on layer 206, and the network 118
is trained on at least 30 epochs with an ADAM optimizer. After the first training
phase, a pruning mask is computed on layer 206. The pruning serves to analyze the
linear weights of layer 206 and to keep, for each node, 100 genes, namely the 50 genes
with the highest weights and the 50 genes with the lowest weights. In a second training
phase, the network weights are fitted again on at least 30 more epochs with the pruning
constraints applied, so the unpruned network weights can adapt to the pruning. The
pruning serves to simplify the model 118, guarding against overfitting, and to make the network more robust to global normalization changes.
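A minimal sketch of computing such a pruning mask, assuming the first-layer weights are available as a nodes-by-genes matrix (all names here are illustrative):

```python
# Hedged sketch of the pruning of paragraph [0072]: after the first training
# phase, keep for each node the 50 genes with the highest weights and the
# 50 genes with the lowest weights; names are assumptions.
import numpy as np

def pruning_mask(weights, n_top=50, n_bottom=50):
    """weights: (n_nodes, n_genes) first-layer weight matrix.
    Returns a boolean mask of the same shape keeping, per node, the n_top
    largest and the n_bottom smallest (most negative) weights."""
    n_nodes, n_genes = weights.shape
    mask = np.zeros_like(weights, dtype=bool)
    order = np.argsort(weights, axis=1)        # ascending order per node
    rows = np.arange(n_nodes)[:, None]
    mask[rows, order[:, :n_bottom]] = True     # most negative weights
    mask[rows, order[:, -n_top:]] = True       # largest weights
    return mask

# During the second training phase the mask would be re-applied after every
# update (e.g., weights *= mask), so pruned connections stay at zero while
# the unpruned weights adapt to the pruning.
```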
[0073] Returning to the model building step 108 illustrated in Fig. 1, the ordinal regression
model 120 is, in the illustrated example, based on the "bmrm" R package developed
by Julien Prados, which is available at https://github.com/pradosj/bmrm. bmrm is an
R package implementing a bundle method for minimization of convex and non-convex risk
under L1 or L2 regularization. It implements the algorithm proposed by
Teo et al.: "Bundle Methods for Regularized Risk Minimization" (JMLR 2010) as well as the extension proposed by
Do and Artieres: "Regularized bundle methods for convex and non-convex risks" (JMLR 2012). The package includes various loss functions for machine learning suitable for big
data analysis. Applications include structured prediction, linear SVM, multiclass
SVM, f-beta optimization, ROC optimization, ordinal regression, quantile regression,
epsilon insensitive regression, least mean square, logistic regression, and least
absolute deviation regression, all with L1 and L2 regularization. For use in the illustrated
embodiment, the preferred loss function used to train the network 118 is the ordinal
loss function

where f is the vector of network predictions for a set of samples, and y is the expected
network outcome. The arguments that can be used are as follows:
- x: matrix of training instances (one instance by row)
- y: integer vector of positive values (>=1) representing the training labels for each
instance in x
- C: the cost matrix to use, C[i,j] being the cost for predicting label i instead of
label j.
- impl: either the string "loglin" or "quadratic", which defines the implementation to use for the computation of the loss.
The loss function returns a function taking one argument w and computing the loss value and the gradient at point w.
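The exact ordinal loss function implemented by the bmrm package is defined in its documentation; the sketch below is only a generic pairwise-hinge ordinal loss consistent with the arguments listed above (predictions f, integer labels y, a cost matrix C weighting misordered pairs). It is an illustrative assumption, not the bmrm implementation:

```python
# Illustrative pairwise-hinge ordinal loss (an assumption, not bmrm's loss).
import numpy as np

def ordinal_hinge_loss(f, y, C=None):
    """f: predicted scores; y: integer labels (>= 1). For every pair with
    y[i] < y[j], penalize max(0, 1 - (f[j] - f[i])), weighted by C[y[i], y[j]],
    so that predictions respect the ordering of the labels."""
    f, y = np.asarray(f, float), np.asarray(y, int)
    k = y.max()
    if C is None:
        C = np.ones((k + 1, k + 1))   # uniform misordering cost by default
    total, n_pairs = 0.0, 0
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] < y[j]:
                total += C[y[i], y[j]] * max(0.0, 1.0 - (f[j] - f[i]))
                n_pairs += 1
    return total / max(n_pairs, 1)
```

Predictions that are well separated in the correct order incur zero loss; inverted or insufficiently separated pairs are penalized in proportion to the configured cost.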
[0074] Thus, in the illustrated example, the modeling step comprises a regression analysis which is used to predict an ordinal variable. In the present example, the variables are the survivability or the vulnerability of cells following toxic treatments, which were obtained from longitudinal studies and from which a genetic signature composed
of specific weighted interactions was extracted. The model is applicable to other
normalized datasets to infer the survival scores of each cell.
[0075] The trained machine-learning model is then used in the inference phase 110. Returning
to the example in Fig. 1, the inference takes as input an input dataset 112 with transcriptomic
profiles. A gene expression matrix is built in step 114, similar to the gene expression
matrix building step 106 performed during the machine-learning phase 102. The generated
gene expression matrix is then subjected to the model inference 116, which involves
the combined machine-learning model including the deep-learning model 118 and the
ordinal regression model 120 built during the machine-learning phase 102.
[0076] The output of the model inference step 116 is, in the illustrated example, post-processed
as follows: The generated survival outcome scores (one per evaluated cell) are standardized
across the cell sample. The respective scores produced by the deep-learning model
118 and the ordinal regression model 120 are merged (in the example by averaging),
and the standard deviation is calculated. The variability (e.g., as quantified by
the standard deviation) between the two scores is used to define a confidence index.
The scores for each cell are linked to their associated cell type. An output
dataset 122 (e.g., in the form of a matrix data structure) is generated comprising
the above information. Certain embodiments of the invention may also generate an output
dataset 122 with only a subset of the above information. In the output dataset 122,
the cell names may be barcoded for confidentiality reasons.
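The post-processing steps above might be sketched as follows; the function name, the exact standardization, and the mapping from score spread to a confidence index are illustrative assumptions:

```python
# Sketch of the post-processing of paragraph [0076]: standardize each model's
# scores across the cell sample, merge by averaging, and derive a confidence
# index from the disagreement between the two models. Details are assumptions.
import numpy as np

def postprocess_scores(dl_scores, or_scores):
    """dl_scores / or_scores: per-cell scores from the deep-learning model and
    the ordinal regression model. Returns merged scores and a per-cell
    confidence index (high when the two models agree)."""
    def standardize(s):
        s = np.asarray(s, float)
        return (s - s.mean()) / (s.std() + 1e-12)

    a, b = standardize(dl_scores), standardize(or_scores)
    merged = (a + b) / 2.0                     # merge by averaging
    spread = np.std(np.stack([a, b]), axis=0)  # per-cell std of the two scores
    confidence = 1.0 / (1.0 + spread)          # in (0, 1]; 1 = full agreement
    return merged, confidence
```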
[0077] Fig. 3 illustrates in the top section the approach taken predominantly in the prior
art, namely to identify a cell as either alive or dead. The bottom section in Fig.
3 illustrates an exemplary visual representation of survival outcome scores according
to an embodiment of the invention. As can be seen, the representation comprises a
scale of several classes (in the example: "dead cells", "ongoing cell death", "weak",
"robust cells", "regenerative"). Each dot represents a given cell of the evaluated
cell sample. The survival outcome score of a cell may be indicated by its position
on the scale, as shown in Fig. 3. The survival score of a cell may also be indicated
by the color of the corresponding dot, e.g., purple for "dead cells", orange for "ongoing
cell death", brown for "weak", dark green for "robust cells" and light green for "regenerative".
A combination of both position and color is also conceivable. While most methods in
the prior art assess the viability of the cells by scoring their status as dead (or
almost dead) or alive, the single-cell scores computed by embodiments of the invention
may reveal a continuum of states that have a predictive value for the capacity of
each single cell to survive. Thus, while some cells with poor values are likely to
be engaged in a cell death process, other cells may be alive but stated as weak, and
yet other cells may be alive and stated as robust, etc.
[0078] A proof of principle is provided in Fig. 4. Figs. 4A and 4B are cross-validations
on neurons from the retina. Fig. 4A illustrates results of an analysis performed on
over forty populations of cells, with the scores of the cells at time 0 plotted on the
x-axis and a value indicating the overall change of the fraction of cells remaining
two weeks after an injury on the same cells plotted on the y-axis. Each line represents
a group of one hundred single cells. The figure shows that cells with poor values (towards
the lower part of the chart) were depleted over time, while cells with high scores
remain and thus represent a larger fraction of the cell population.
[0079] Fig. 4B illustrates violin plots showing the progression of a cluster of about one
hundred cells with an initial low survival score. Over time, as cells die, the number
of cells in the group decreases, with the remining cells being the ones with the higher
scores.
[0080] Fig. 4C is a validation on cortical neurons (upper motor neurons) from a different
dataset. The pie charts represent the proportion of two upper motoneuron populations
with high SOS (labelled "SOS-HI") and low SOS (labelled "SOS-LOW") among the general
population. These motoneurons are shown three days after injury (upper pie chart)
and 15 days after injury (lower pie chart). The remaining population of cells share
a higher degree of molecular identity with the population with higher SOS.
[0081] The foregoing is a disclosure of various techniques for determining and predicting
the survival potential of cells, e.g., from mammalian cultures or human biopsies.
Embodiments provide methods for identifying a candidate agent that may be effective
to treat or ameliorate an effect of a degenerative disease in a subject or for a specific
treatment. Certain embodiments of the disclosed methods may comprise or use i) a 200-genes classifier ranking and weighting levels of mRNA abundance, together with ii) a combinatorial series of top-gene signatures for specific cell types which best explain the classifier performance, in order to identify therapeutic agents.
[0082] In one embodiment, the 200-genes classifier / signature is produced by the ordinal
regression model 120. The 200-genes classifier may be configured for ranking dynamics
of mRNA abundance in cells with different potential of survival in vivo and in vitro.
The strength of this approach was exemplified with a 32-genes signature (so-called "best-fit signature";
BFS) which best explains the classifier performance. The proof of concept (see Fig.
4) of its generalizability was provided by validating the classifier in three independent
single-cell datasets, including public data (Tran et al., 2019), data from a previous publication of the inventors (Lo Giudice et al., 2019), and a large unpublished single-cell
dataset (Leleu et al., in preparation).
[0083] Fig. 5 illustrates the robustness of the survival outcome scores obtained by using
the above-mentioned 200-genes classifier on retinal cells subjected to severe injury.
[0084] Certain embodiments compute health scores for cells of any type based on their transcriptomic
profile. One application is to compare health scores under different treatment conditions.
To compute the scores, one may rely on the outcome of biological experiments which
must be carefully designed to capture the expected outcome (cell health). A typical
focus is on time-course experiments where cells have been treated or subjected to an aggression. The
cells are followed over time and the survival outcome scores are observed at different
points of the experiment. The prediction models disclosed herein are designed with
the objective to be able to predict as early as possible the outcomes of cell survivability.
The disclosed methods may also be applied to experimental designs related to aging,
cell regeneration (e.g., for liver cells), cell duplication (e.g., for cancer-cells)
and perturb-seq experiments.
[0085] Embodiments of the invention may be used with human stem cells and cells derived from patients with cancer or other degenerative diseases. The robustness
of embodiments of the method is useful to refine high throughput technologies, including
image-based cytometry assays and genetically engineered organoid screening. Embodiments of the disclosed techniques will also be useful in clinical settings by way of their capacity to evaluate a strong clinical prognosis of cells. Embodiments of the disclosed techniques benefit from a solid read-out based on their strong sensitivity and predictive value. This allows new drugs and/or biophysical factors which are key to sustaining viability in human cells to be rapidly tested and identified.
[0086] Accordingly, the techniques disclosed herein may be used to identify agents that
can increase the survival ratio of cell-type-specific populations in animal models and
human cultures. The sensitivity of refined parameters is one beneficial characteristic
of the disclosed techniques, with each gene having a different impact on the assessment
of the survival outcome score. The result of the assessment represents a genetic interrogation
of survival and regenerative capacity. With the rapidly growing overall demand for
data-driven medicine, the disclosed techniques provide an analytical package for drug
and/or treatment screening that is usable by hospitals, pharmaceutical and/or medical
companies, as well as academic institutions.
[0087] In the following, various non-limiting examples of using the output dataset 122 produced
by one of the methods disclosed herein will be described:
Predicting an effectiveness of a candidate agent for a degenerative disease: For example,
during therapeutics screening of numerous compounds, while developing a new drug,
the compound may be assessed early on, before the classical readouts. While single-cells
share many morphological features of viability, they may be at risk based on their
vulnerable combinatorial transcriptomes. In a degenerative setting, it may take years
for the weakest cells to succumb. A candidate agent that can restore their viability
outcome before the cell death can thus be screened since it will increase the survival
outcome score (SOS) of these cells. Furthermore, some sub-populations of cells are
known to be at risk for particular diseases (such as beta-cell failure in diabetes,
lower motor neuron loss in spinal muscular atrophy, dopaminergic neurons from the
substantia nigra pars compacta in Parkinson's disease, retinal ganglion cells in glaucoma, etc.). Another example is the rescuing of a particular metabolic pathway which will
not change the test cells readouts on a classical compound assay, but that can be
detected by embodiments of the present disclosure. Also conceivable is a rescue of
a phenotype or readout by adding up the different drugs, "rebuilding" the correct
molecular pathway, so as to create an atlas of how to rebuild a pathway.
[0088] Assessing the level of cytotoxicity in a population of cells: a given treatment may
lower the SOS of some cells, and thus reveal a potential harmful effect of a given
drug, thereby allowing the fast identification of dangerous compounds during therapeutics
developments.
[0089] Indicating the severity of a degenerative disease to improve prognosis: most diseases
are not progressing at the same pace, and lower SOS values are associated with poorer
survival rate. By the quantification of the neurodegeneration level, the level and
composition of therapeutic drugs may be adapted.
[0090] Identifying which particular cell types are responding to a given treatment: Higher
SOS is a strong indication that a given population of cells is very robust and thus
will have higher survival potential. Cells sharing most of their molecular identity
will actually have distinctive SOS and thus can be distinguished on their survival
outcome. This may allow research to be done in a more specific way toward a disease
or a compound. It may also enable finding cell types not yet known to be reactive
to a specific compound, thereby enabling a new approach to drug screening.
[0091] Assessing the health status of a single cell: even in physiological settings one
can assess the differential robustness of cells in a large population, which is particularly
useful for rare and/or critical cell types in vivo.
[0092] Although some aspects have been described in the context of an apparatus, it is clear
that these aspects also represent a description of the corresponding method, where
a block or device corresponds to a method step or a feature of a method step. Analogously,
aspects described in the context of a method step also represent a description of
a corresponding block or item or feature of a corresponding apparatus.
[0093] Some or all of the method steps may be executed by (or using) a hardware apparatus,
such as a processor, a microprocessor, a programmable computer or an electronic circuit.
Depending on certain implementation requirements, embodiments of the invention can
be implemented in hardware or in software. The implementation can be performed using
a non-transitory computer-readable storage medium such as a digital storage medium,
for example a floppy disc, a DVD, a Blu-Ray, a CD, a ROM, a PROM, an EPROM, an EEPROM
or a FLASH memory, having electronically readable control signals stored thereon,
which cooperate (or are capable of cooperating) with a programmable computer system
such that the respective method is performed.
1. A computer-implemented method (110) of predicting the survival potential of a cell
at single-cell resolution, the method comprising:
receiving an input dataset (112), wherein the input dataset (112) comprises omics
data associated with at least one cell of a biological sample;
inferring (116) a numerical score which represents a predicted survival potential
of the at least one cell using a trained machine-learning model, in particular the
machine-learning model of any one of claims 6-9; and
generating an output dataset (122) comprising the numerical score.
2. The method of claim 1, further comprising:
generating (114) a unified representation of data of the input dataset (112), in particular
a gene expression matrix based on at least part of the input dataset (112), wherein
the gene expression matrix encodes a normalized expression of a given gene in a given
cell;
wherein generating (114) the gene expression matrix comprises at least one of:
- removing a cell if an associated number of expressed genes of the cell is below
a predefined threshold;
- removing a cell if a unique molecular identifier, UMI, count of the cell is below
a predefined threshold;
- removing a cell if an associated number of expressed genes is outside a global normal
distribution of the number of expressed genes in the input dataset (112);
- removing a cell if an associated magnitude of expressed genes is outside a global
normal distribution of the magnitude of expressed genes in the input dataset (112);
- removing a gene if a number of cells in which the gene is expressed is below a predefined
threshold;
- removing a gene if the expression magnitude of a gene is steady across a number
of cells;
- homogenizing, in particular by down-sampling, an amount of UMIs per cell to be within
a predefined range;
- reorganizing matrix rows and/or matrix columns according to gene names;
- identifying missing genes and, optionally, performing a remedying action such as
filling the missing entry with a specific value, such as an average value or a value
computed by an imputation method;
- adjusting expression values so they are expressed into an identical unit, such as
log-transformation and/or scaling by a pre-defined or computed normalization factor
and/or variable standardization, gene length normalization and/or expression centering.
3. The method of claim 1 or 2, wherein the input dataset (112) is derived from at least
one source comprising:
a nucleic acid sequence count, preferably in FASTA/FASTQ format, generated by a sequencing
instrument, wherein generating (114) the gene expression matrix comprises processing
a cell barcode in the nucleic acid sequence to identify the cell of origin, and comparing
the nucleic acid sequence to a reference genome to identify the gene or genes of origin;
gene expression values already organized as a matrix, which may be obtained from multiple sources including processed sequencing data and gene expression values.
4. The method of any one of the preceding claims, wherein the machine-learning model
comprises a deep-learning model (118) and an ordinal regression model (120); and
wherein inferring (116) the numerical score comprises:
processing the input dataset (112) by the deep-learning model (118) and the ordinal
regression model (120);
generating the numerical score based on a combination of outputs of the deep-learning
model (118) and the ordinal regression model (120);
generating a confidence index which indicates an expected variability between the
outputs of the deep-learning model (118) and the ordinal regression model (120); wherein
the output dataset (122) comprises the confidence index.
5. The method of any one of the preceding claims, further comprising displaying the output
dataset (122) on an electronic display device;
wherein the displaying comprises displaying a graphical representation of the numerical
score;
wherein, preferably, the predicted survival potential is indicated by a color associated
with the numerical score and/or by its position on a scale.
6. A machine-learning model data structure configured for predicting the survival potential
of a cell at single-cell resolution, wherein the machine-learning model comprises:
an input module configured for receiving an input dataset (112), wherein the input
dataset (112) comprises omics data associated with at least one cell of a biological
sample; and
an output module configured to produce an output dataset (122), wherein the output
dataset (122) comprises a numerical score which represents a predicted survival potential
of the at least one cell.
7. The machine-learning model of claim 6, further comprising:
a deep-learning model (118);
an ordinal regression model (120); and
a combiner module for combining outputs of the deep-learning model (118) and the ordinal
regression model (120) to generate the numerical score.
8. The machine-learning model of claim 7, wherein the deep-learning model (118) comprises:
an input layer (202) comprising a plurality of nodes, each node associated with a
particular gene of a cell;
a first hidden layer (206) comprising a plurality of nodes i, each node i being associated
with a linear weight wi with a normalization constraint of 1:∥wi∥=1, and wherein x denotes the input of the first hidden layer (206) and the first
hidden layer (206) is configured for dividing the outputs <wi.x> by d=<|wi|.|x|>; and
an output layer (218) comprising a node associated with a numerical score which represents
the predicted survival potential of the cell.
9. The machine-learning model of claim 7 or 8, wherein the ordinal regression model (120)
comprises a loss function
10. The use of an output dataset (122) produced by a machine-learning model according
to any one of claims 6-9 for at least one of:
- predicting an effectiveness of a candidate agent for a degenerative disease;
- assessing the level of cytotoxicity in a population of cells;
- indicating the severity of a degenerative disease to improve prognosis;
- identifying which particular cell types are responding to a given treatment;
- identifying which particular cell types are affected by a disease;
- assessing the health status of a single cell.
11. A computer-implemented method (102) of training a machine-learning model configured
for predicting the survival potential of a cell at single-cell resolution, in particular
the machine-learning model of any one of claims 6-9, comprising:
receiving a training dataset (104), wherein the training dataset (104) comprises omics
data associated with a plurality of cells and information on the health status of
the plurality of cells;
training (108) the machine-learning model using data associated with the training
dataset (104);
wherein the training dataset (104) is obtained in the context of a biological experiment
according to one of the following designs:
- a perturbation experiment where beneficial and/or aggressive treatments, such as
mechanical aggression, are applied to cells and their transcriptomic response is followed
over time together with their health status, such as cell death and/or cell proliferation;
- a regeneration experiment where transcriptomic profiles of cells from regenerative
tissues are followed over time so the regenerative ability of a cell can be estimated
from its transcriptomic profile;
- a degeneration experiment where the transcriptomic profiles of cells in degenerative
diseases are compared or followed over time;
- an aging experiment where transcriptomic profiles of cells in organisms of increasing
age are followed or compared between specimens so as to assess the impact of age on the
cell transcriptome;
- a developmental experiment where transcriptomic profiles of cells are followed during
development of an organism.
12. The method of claim 11, further comprising:
generating (106) a gene expression matrix based on at least part of the training dataset
(104), wherein the gene expression matrix encodes a normalized expression of a given
gene in a given cell.
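Claim 12 does not fix the normalization scheme for the gene expression matrix. One common choice in single-cell transcriptomics, shown here purely as an illustrative sketch (function name and the 10,000-count scale factor are assumptions, not recited in the claim), is per-cell count scaling followed by a log transform:

```python
import numpy as np

def gene_expression_matrix(counts):
    """One possible normalized gene expression matrix for claim 12.

    counts : (n_cells, n_genes) raw counts.
    Returns log1p of counts scaled so every cell sums to 10,000,
    a common single-cell normalization (assumed, not claimed).
    """
    per_cell = counts.sum(axis=1, keepdims=True)       # total counts per cell
    scaled = counts / np.maximum(per_cell, 1) * 1e4    # equalize sequencing depth
    return np.log1p(scaled)                            # compress dynamic range
```

Each entry of the returned matrix then encodes a normalized expression of a given gene in a given cell, as the claim requires.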
13. The method of claim 11 or 12, wherein the training (108) of the machine-learning model
comprises:
subjecting a deep-learning model (118) of the machine-learning model to a first training
phase, wherein the first training phase comprises a first plurality of epochs, preferably
at least 30 epochs;
determining a pruning mask for a first hidden layer (206) of the deep-learning model
(118), wherein the pruning mask is configured for selecting a first set of nodes with
the highest weights and a second set of nodes with the lowest weights;
subjecting the deep-learning model (118) to a second training phase with the pruning
mask applied, wherein the second training phase comprises a second plurality of epochs,
preferably at least 30 epochs;
wherein, preferably, a dropout probability of the first hidden layer (206) involved
in the first and/or second training phase is selected from the range of 30% to 70%,
more preferably from the range of 40% to 60%, and most preferably is a dropout probability
of essentially 50%.
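The pruning mask of claim 13 retains the nodes at both extremes of the weight distribution, i.e. those with the highest and those with the lowest weights, and prunes the rest. A minimal sketch, assuming a per-node weight summary and a hypothetical `keep_frac` parameter governing how many nodes are kept at each extreme:

```python
import numpy as np

def pruning_mask(w, keep_frac=0.2):
    """Pruning mask per claim 13 (illustrative sketch): keep the nodes
    with the highest weights and the nodes with the lowest weights.

    w         : (n_nodes,) per-node weight summary, e.g. mean weight per node
                (an assumption; the claim does not specify the summary)
    keep_frac : fraction of nodes kept at EACH extreme (hypothetical parameter)
    """
    n = len(w)
    k = max(1, int(n * keep_frac))
    order = np.argsort(w)                 # indices sorted by ascending weight
    mask = np.zeros(n, dtype=bool)
    mask[order[:k]] = True                # k lowest-weight nodes
    mask[order[-k:]] = True               # k highest-weight nodes
    return mask                           # True = node kept, False = pruned
```

During the second training phase the mask would be applied by zeroing the outputs (and gradients) of the pruned nodes, while the kept nodes continue training for the second plurality of epochs.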
14. A computer program comprising instructions which, when the program is executed by
a computer, cause the computer to carry out the method of any one of claims 1-5 and/or
11-13.
15. A data processing system comprising means for carrying out the method of any one of
claims 1-5 and/or 11-13.