## Abstract

In this review, the fundamental basis of machine learning (ML) and data mining (DM) are summarized together with the techniques for distilling knowledge from state-of-the-art omics experiments. This includes an introduction to the basic mathematical principles of unsupervised/supervised learning methods, dimensionality reduction techniques, deep neural networks architectures and the applications of these in bioinformatics. Several case studies under evaluation mainly involve next generation sequencing (NGS) experiments, like deciphering gene expression from total and single cell (scRNA-seq) analysis; for the latter, a description of all recent artificial intelligence (AI) methods for the investigation of cell sub-types, biomarkers and imputation techniques are described. Other areas of interest where various ML schemes have been investigated are for providing information regarding transcription factors (TF) binding sites, chromatin organization patterns and RNA binding proteins (RBPs), while analyses on RNA sequence and structure as well as 3D dimensional protein structure predictions with the use of ML are described. Furthermore, we summarize the recent methods of using ML in clinical oncology, when taking into consideration the current omics data with pharmacogenomics to determine personalized treatments. With this review we wish to provide the scientific community with a thorough investigation of main novel ML applications which take into consideration the latest achievements in genomics, thus, unraveling the fundamental mechanisms of biology towards the understanding and cure of diseases.

- Machine learning
- supervised-unsupervised learning
- NGS
- gene expression
- scRNA-seq
- TFs
- RBPs
- RNA structure
- sequence motifs
- review

The majority of the large-scale data in bioinformatics and systems biology include genome wide studies from next generation sequencing (NGS) experiments, such as, studies for deciphering gene expression from total and single cell (scRNA-seq), as well as data that provide information regarding the binding sites of transcription factors (TFs) and RNA binding proteins (RBPs) while incorporating information of the RNA substrate such as, sequence and RNA structure. NGS technology enables the decoding the genome of many organisms, learning the transcriptome and proteome per cell or deciphering differences from genome-wide association studies (GWAS) (1) between different organisms and clarifying the functions and properties of many biological systems. These topics of bioinformatics include a large repertoire of datasets. To analyze and interpret the big biomedical data, efficient algorithms are constantly being developed for processing, building, and matching the genomes (2) or determining the gene expression differences under normal or disease conditions (3). The constant evolution of ML will aid biologists to find patterns and associations in various studies while also enabling them to predict the outcome of biomodels under investigation, uncovering the fundamental mechanisms in biology.

## A General Overview of Machine Learning (the Basic Principles)

The general scope of machine learning ML is to devise algorithms that can run in an automated fashion to predict new behavior or classify patterns arising in complex data sets, based on sets of training data. ML consists of a large variety of methods designed to address a wide class of different problems, so in this section we refrain to a selection of methods and problems.

Classification is one of the main tasks in ML with many important applications in a variety of disciplines including medicine. The problem of classification can be abstracted as follows: Given points in a high dimensional space, corresponding to different entities (of different quality) can you separate these points into distinct groups each being homogeneous and comprising of points of the same quality? We will illustrate this analysis with a real example for defining and classifying the distribution of gene expression from RNA-seq experiments and their response upon treatment in different time points such as the distribution shown in Figure 1.

Each distribution can be modeled as an array of n real numbers, say *x=(x _{1},…,x_{n})*, each considered as a point in the Euclidean space

*R*of suitable dimension

^{n}*n*. Then, our data can be conceived as collections of points in

*R*. The rationale of such a “visualization” is to manage those points related to qualitatively different gene expressions located in distant parts of the underlying Euclidean space where the points are embedded, therefore obtaining different clusters.

^{n}One possible separation scheme between the qualitatively different clusters can be in terms of a separating hyperplane, *i.e*., a subset of *R ^{n}* defined as the set of points

*x=(x*such that H={

_{1},...,x_{n})*x*∈R

^{n}|w•x+

*w*

_{0}=0} for a suitable vector

*w=(w*and scalar w

_{1},...,w_{n})_{0}with

*w*•

*x*denoting the Euclidean inner product defined by . This set is called hyperplane since in 2- or 3- dimensions

*H*corresponds to a straight line or a two-dimensional (2D) plane (respectively) separating

*R*or

^{2}*R*in two distinct sub-spaces.

^{3}The same principle is applied in any dimension *n*. The subspace *H* corresponds to an *n-1* dimensional hyperplane which separates *R ^{n}* into two distinct parts H+={

*x*∈

*R*|

^{n}*w•x*+ w

_{0}>0} and H–={x∈R

^{n}|

*w•x*+ w

_{0}<0}. Our goal is to find if a separating hyperplane

*H*exists such that all points of class 1, satisfy the condition

*x*∈H

_{i}_{+}while all points of class 2,

*i.e*., points

*x*with

_{i}*i*∈N

_{2}, satisfy the condition

*x*∈H

_{i}_{–}. If we manage to identify such a hyperplane, then we have a separating rule, that is a function f:

*R*→

^{l}*R*, defined by

*f*(

*x*)=

*f(x*:=

_{1},...,x_{l})*w•x+w*such that for all points

_{o}*x*, with i∈

_{i}=(x_{i1},...,x_{il})*N*it holds that

_{1}*f(x*>0, whereas for all points

_{i})*x*, with

_{i}=(x_{i1},...,x_{il})*i*∈N

_{2}it holds that

*f(x*<0. The construction of such a function is in the very essence of ML where we will demonstrate the basic supervised and unsupervised learning techniques for solving such task.

_{i})## Supervised Machine Learning Principles

A supervised machine learning approach infers to a function derived from labeled training data or training set and a desired output value. The algorithm after learning will proceed in trying to correctly determine the class labels for unseen instances with the minimum error.

*Support vector machines (SVM)*. In our previous examples for classifying gene expression data (Figure 1), a possible criterion for assessing the “fitness” of a separating hyperplane in classifying the data (equiv. points) is the distance that such a criterion yields for the most difficult data to be separated, *i.e*., those data (equiv. points) on the boundaries of the geometric loci of the points in *x _{i}*∈N

_{1}and the points in

*x*∈N

_{i}_{2}. Clearly, under this interpretation the optimal hyperplane

*H*would be the one for which the distance between classified points on the boundaries is maximum. In such cases, an optimal hyperplane H, would be the one for which the misspecification error is minimum. The distance of a point

*x*from the hyperplane

_{i}*H*can be calculated in terms of the vector

**w**and the scalar

*w*as shown in Equation 1 (Eq. 1).

_{o}[1]

where ||w|| denotes the Euclidean norm of the vector **w**, defined by ||w||:=w*•*w. Let *δ=δ(H)=δ(w,w _{0})* be the minimum distance between the points in class 1 and 2 (Eq. 2).

[2]

Let us consider the problem of choosing the hyperplane *H* which maximizes the distance δ, maxδ _{{w,w0}} (w,w_{0}), subject to the separation conditions as determined *via* Eq. 2 and 3.

[3]

For a fixed ε>0, we can get the best separation result if we choose the separation hyperplane *H* such that: *w•x _{i}+w_{o}*>ε||w||, i∈N

_{1},

*w•x*+

_{j}*w*<-ε||w||, j∈

_{o}*N*

_{2}where the term on the left-hand sides correspond to the distance of the points from

*H*. Defining the new variables y

_{i}so that

*y*=1 if

_{i}*i*∈

*N*

_{1}and

*y*=-1 if i∈

_{i}*N*

_{2}, and considering the problem of maximizing the distance (or equivalently maximizing ε/||w|| and setting without loss of generality ε=1) corresponds to the more standard convex optimization problem (Eq. 4):

[4]

such that

[5]

The classification function may be readressed as *f(x)=sign(w•x+w _{o})*, with the classification assignment for any data point

*x*given in terms of the response

_{i}*y*. A geometric viewpoint of the above optimization problem is as trying to fit an empty slab of the maximum possible width between the two data classes. Convex optimization problems enjoy a long tradition and a powerful arsenal of analytic and numerical techniques that can be used for their treatment (4). One of these is their Lagrangian formulation, intimately related to the powerful and deep dual formulation of such problems. According to this viewpoint, a solution of the above problem can be understood as a saddle point (min-max) of the augmented Lagrangian function defined by Eq. (6):

_{i}=f(x_{i})[6]

where *λ _{i}*≥0, are the generalized Lagrange multipliers. The saddle point conditions for L are the celebrated KKT conditions (5) according to which

**w**and λ satisfy the conditions.

The first condition in Eq. (7) can help us identify a linear combination of the support vectors, depending on the Lagrange multipliers, hence the terminology SVM. An interesting alternative formulation of the problem, introduces the concept of the loss function which is fundamental in ML and is the following: Upon defining M as the set of misclassified data points, one possible indicator of misclassification error may be the loss function: *L(H)=L(w,w _{0})*=-Σ

*. The gradient of the loss function is defined as the direction of*

_{{i∈M}}y_{i}(w•x_{i}+w_{o})*n=(∂L/(∂w*, which can be normalized by dividing with ||n|| (if needed) and a typical gradient scheme would be to start at an initial point

_{1}),...,∂L/(∂w_{l}),∂L/(∂w_{o}))*w*to

_{o}*w*=(w

^{k}_{1}

^{k},...,w

_{l}

^{k},w

_{0}

^{k}) and then create a sequence w

^{(k+1)}=w

^{k}-ϱn, k=0,1..n for a scalar ϱ>0, often called the learning rate. If this scheme converges [which does under conditions, as described in (4)] then it converges to a point where the gradient of the loss function vanishes, which is a candidate for a local minimum (and in fact a global minimum if the loss function is convex). Figure 2 best describes the boundary conditions as obtained

*via*this condition for having the minimum optimal solution. Figure 3 represents the maximal margin hyperplane in the 2D space.

[7]

Often linear separation is not feasible on account of the geometry of the data. In this case, a kernel classifier techniques may prove useful. In such techniques the original feature space X is mapped through a nonlinear mapping *Φ*, into a higher dimensional space Z, where the data may display linear structure and hence linear separation techniques may be used. Convenient feature maps are those related with the so-called kernel functions *K:X×X*→*R*, in terms of the relation *K(x,x')=<Φ(x),Φ(x')> _{Z}*, where by <.,.>

_{Z}we denote the inner product in the feature space Z. Kernel functions are symmetric functions which quantify similarity between any two data points in the input space

*x,x*', in terms of the value

*K(x,x')*. Various choices for kernel function are possible, for example polynomial kernels of degree

*d*are defined as:

*K(x,x')=(x•x'+b)*, where

^{d}*x*and

*x*' are observations vectors,

*b*is a non-negative constant which is used to balance among higher and lower degree polynomial coefficients, and

*d*is the non-negative degree of the polynomial. A

*d*-degree polynomial kernel is used to create decision boundaries of that degree. Other widely adopted kernels are the Gaussian RBF kernels and the radial kernels (6). The nature of SVM classifier which uses multiple features to learn and drive a classification has led to a wide range of applications in biology. This can include image analysis approaches for predicting stages of tumor such as in (7) to more combinatorial methods of combining genomic data sets from gene expression, DNA methylation, GWAs studies and MRI images to drive clinical outcomes and treatments in oncology.

*Regression*. Regression aims to determine the relationship between a set of input features *x _{i}*∈

*R*and a continuous or otherwise quantitative outcome variable

^{n}*y*∈

_{i}*R*, in terms of a model of the form

^{m}*y*∈

_{i}=f(x_{i})+*i*, where f:R

^{n}→R

^{m}, is a suitable function to be determined and ∈

_{i}are (random) errors modeling fluctuations around the deterministic law

*y*. In ML, regression techniques are used to first fit and construct an appropriate function f, that models a set of training input data and corresponding responses, say

_{i}=f(x_{i})*(x*

_{i},y_{i}),i=1,...,N_{train}, along with a distribution of the errors.

Linear regression (8) is a widely adopted supervised learning technique. Linear regression uses a linear function f, (in the case m=1) of the form f(x)=b_{0}+b•x for a suitable scalar *b _{0}* and vector

*b=(b*∈

_{1},...,b_{n})*R*. By appropriately specifying the parameters of the model from test data, the model can be used to provide predictions for new data. The estimation of these parameter is obtained by minimizing an appropriate loss function. A large variety of models is built around linear regression depending on how the model parameters are determined. A common choice comprising four important models is to choose b so that the following loss function is minimized (Eq. 8):

^{n}[8]

where the first term corresponds to the mean square error of the model which focuses on the closeness of the model to the observations, and the rest correspond to regularization terms. The above model when a=0 corresponds to the standard linear regression model, which may give rise to issues related to over-fitting, multicolinearity or high dimensional data that might lead to models which are difficult to interpret. The case where r=0, α≠0 leads to Ridge Regression, also known as Tikhonov regularization (9), a variation of linear regression which attempts to solve the multi-collinearity problems of the latter approach. The case where r=1, α≠0 corresponds to least absolute shrinkage and selection operator (Lasso) regression (10) that uses the most important features of the dataset, while the rest are set to zero and consequently are ignored, thus leading to more economical and easier to interpret models. The last variation discussed here is the elastic nets (11) corresponding to r∈(0,1), α≠0 and can be considered as a combination of the previous two.

Logistic regression (12) estimates the probability that an input belongs to the target class (labeled as ‘1’, called the positive class) or to the other one (labeled as ‘-1’, the negative class –often labeled as 0). The basis for this approach is the logistic function, which connects the probability of assignment to each class with the features vector x connected to each data point, as in Eq. 9:

[9]

If for a feature vector x, p(x)<1/2 then this is classified as y=–1 and y=1 otherwise (often y=–1 is labeled as y=0). A common alternative form of this model is in terms of the log of the odds *O=p/(1-p)*, which turns out to be a linear function, *log(O)=b _{0}+b.x*, better known as the logit model. Model fitting is done by using the maximum likelihood estimation (MLE) (13), which intuitively can be seen as a way of determining coefficients so that by inserting them to the logit function which will produce outputs as close to 1 as possible for all data items that are indeed classified as such. Likewise, it should produce a number as close to zero as possible for all items that are classified using the other label. As soon as training is complete and the coefficients determined, predictions are made simply by inserting unknown inputs to the logit function. This model, which is extendible to more than two classes, is related to SVM providing thus a probabilistic based version of them.

Regression techniques have a broad range of applications in bioinformatics. These methods have been widely adopted in normalization procedures of RNA-seq data, such as in (14) where a loess regression has been applied on RPKM values by using a spike-in RNA as control to fit the loess regression model. More recently in (15) the authors make use of Pearson residuals from “regularized negative binomial regression,” to estimate for cellular sequence depth in order to be used as a covariate in a generalized linear model, which can efficiently normalize the data but also maintain the biological heterogeneity of scRNA-seq.

*Naïve Bayes*. Naïve Bayes (16) is a popular probabilistic approach for supervised learning which estimates the conditional probability for a random data point being in class j. Suppose we have data points described by n features, *x=(x _{1},x_{2}...,x_{n})*, and a target feature

*y*. This method assigns probabilities that a data point characterized by features x is classified in terms of y using the conditional probability

*P(y|x)*called the posterior probability. This is the desired outcome of the model, which is obtained in terms of information available from the training (

*i.e*., already classified) data,

*i.e*. the quantities

*P(x|y)*(the likelihood), and the joint probability P(

*x*) and P(y) of the input and target features respectively. This connection is obtained through Bayes’ theorem as

*P(y/x)=P(y)P(x/y)/P(x)*. We then classify a data point x, to this class y for which

*P(x|y)*is maximized, leading to an estimation for the classifier in terms of

*y^=argmax*, with P(y/x) given in terms of the training data as above. To simplify the optimization problem required for the classification, the likelihood term can be reduced by assuming that the conditional probabilities of each feature given the target feature y are independent (or in some cases conditional independent), so that approximately , where upon ignoring the unimportant factor

_{{y}}P(y|x)*1/P (x)*, which does not depend on y, we obtain the estimator , which is simpler to calculate. The maximum a posteriori (MAP) rule (17) can be used to find an estimate, that maximizes the product of the likelihood and the prior probabilities.

Bayesian classification procedures have been applied to a variety of biological problems. These include determining gene expression differences from bulk RNA-seq analysis to the analysis of more complex systems such as in scRNA-seq analysis where probabilistic models are able to learn cell-specific parameters in order to drive normalization (18).

*Random forests (RFs)*. A popular methodology based on decision trees (19) is random forests (RFs). An RF consists of a multitude of decision trees and is essentially an ensemble learning method the outcome of which is determined as either in terms of the majority or as some sort of average of the outcomes of the trees comprising the forest. To construct an RF, a procedure known as bagging is applied where the outcomes from multiple, randomly constructed decision trees are combined to determine the final decision. Bootstrap aggregation (20), which is also referred to as bootstrap aggregating or bagging, is a technique which enables the reduction of variance of ML methods, considerably improving their accuracy. This technique may prove useful to avoid overfitting. The predictions from each data instance on the corresponding testing data instances are combined through a process known as majority voting to produce the final prediction as the one prediction with the highest frequency across the data instances. In RFs, bagging is applied albeit with a key difference: each decision tree selects a random subset of the features at each split (Figure 4).

The sampling of features enables the random forest to eventually filter out the “weaker” predictors and use those that are strongly related to the outcome. Thus, bagging alongside with random features subsets selection, increases the diversification to produce better outcomes. Having as input the original training data, sampling must be done uniformly to yield the predetermined number of subsets which are to be used to train the same number of instances of the decision trees. The same item may be included to more than one subset, thus bagging sampling is done with replacement. If the training subsets have no overlaps the technique is called pasting. The quality of an overall random forest architecture depends, to a great extent, on the quality of each individual predictor (*i.e*., each individual decision tree). The correlation between two or more features in the random forest should be minimal. An ensemble of trees using bagging is expected to produce lower quality outcomes in comparison to an equivalent Random Forest.

Decision trees and random forest help drive prediction of complex problems in biology. This includes the prediction and applicability of specific drug treatments in oncology (21) or in combining metabolic labelling of RNA to extract RNA kinetic rates and to use a Random Forest approach to classify the function and use of coding and nc-RNAs in eukaryotes (22).

*Artificial neural networks (ANNs)*. Artificial neural networks (ANNs) are computational systems whose layout and operation were inspired by the way a biological brain works. An ANN consists of a set of interconnected artificial neurons, where each artificial neuron is described as a mathematical function which receives a set of input features that are aggregated in a proper way to produce an output that is inserted into a non-linear activation function that produces outcomes which can be transmitted to other neurons to repeat the same process until the final output layer is reached. Thus, a neuron is inspired by the operation of the nerve cells of a biological brain which operates in a similar manner. ANNs are organized in layers. In its simplest form an ANN consists of three main parts, namely: (i) the input layer, which provides the original inputs to the network, (ii) a single hidden layer which consists of a number of artificial neurons tasked with transforming inputs in order to produce activations, and (iii) an output layer, which produces the results of the network using the activations. Multiple hidden layers can exist between the input and output layers. A fully connected neural network with more than one hidden layer is called a multilayer ANN.

ANNs can modify their internal structure (by using their inputs which consist of actual measurements, weights, and biases) in relation to a set of desirable outcomes. This is the underlying concept of the learning process followed in all implementations. ANNs are used in fields, such as, pattern recognition, classification tasks and natural language processing (NLP). A variety of actual implementations has shown that they are particularly suitable for determining acceptable, almost optimal, solutions to complex non-linear problems. In a feed-forward ANN, the information flow has a single direction. To clarify the operation of an ANN let us start with the simple case of a single layer and k neurons. The key components there are the set of input vectors *x=(x _{1},...,x_{n})*, associated biases

*b=(b*, and weights

_{1},...,b_{k})*W=(w*considered as a k×n matrix. Each neuron i takes the input vector x, nonlinearly transforms it in terms of an activation function h and returns an output z

_{ij},i=1,...k,j=1,...n)_{i}which depends on the weights and the activation function as , where h

_{i}is the activation function and its argument is often called the activation of the neuron. The vector

*z=(z*is considered as the output of the neurons. Based on training data the parameters of the ANN, which are essentially the weight matrices and the biases at each layer are computed and are minimizers of an appropriate loss function which connects the actual input with the observed final output for the training data. Then, once trained, the ANN can be used to classify or predict new data. The optimization procedure involved in the training of an ANN is complicated and often time consuming but technically simple as it often relies on gradient descent methods or its variants.

_{1},...,z_{k})In classification scenarios, the output function is usually tasked with mapping numeric parameters to labels (or classes). The number of neurons in the output layer is typically the same as the number of classes. To calibrate a Neural Network, the loss function is required in order to measure the model’s error. The primary task of the training process is the minimization of the loss function. This is achieved by re-adjusting weights and biases through a repetitive process which concludes after a set number of iterations is completed and/or in case the loss function produces acceptable (*i.e*., sufficiently minimized) results. The two most widely adopted loss functions are the mean squared error (MSE) and the cross-entropy loss (CEL). Non-linear activation functions are required to solve complex problems. Among them some of the most prominent are the sigmoid, rectified linear unit (ReLU) (23) and softmax (24). The latter proves particularly suitable for the output layer. There is no intuitive way of determining the number of hidden layers. In many cases this is chosen through a trial-and-error process. It is generally acceptable that relatively “shallow” ANNs, consisting of up to three layers, produce good quality outcomes. On the other hand, a great number of layers (Figure 5) may lead to overfitting.

These basic principles provide the background for introducing deep learning approaches for machine learning which as we will investigate have been applied to a plethora of applications in biology.

*The rise of deep learning*. The increasing technological achievements has enabled us to implement network architectures, embedding thousands of neurons. These topologies use “deep” depth neural networks with or without prior knowledge for training and their fundamental principles use the general notion of Hopfield or Perceptron network architectures (25).

To extract the ultimate features that can increase a classification performance, filtering steps can be applied with the use of convolution layers. Convolution layers walk through the matrix of values and filter the important features in one or three dimensions. This can be done *via* adopting a specific kernel function for partitioning or grouping the data. Moreover, sub-sampling techniques can be used for extracting important features *via* the use of maximum pooling layers. The maximum pooling can be evaluated either *via* grouping every *n* data and extracting either the maximum, median, L2-norm, or average values. The data from all input sources after passing the previous steps can be concatenated and merged into a fully connected neural network. The output will be a set of weights after training where these will be evaluated on a test dataset. Such architectural schemes are defined as convolutional neural networks (CNNs). An example of such a network can be visualized bellow (Figure 6). The filter scanning or windowing can be set accordingly to either 2×2 or 3×3 depending on the data set being used. Other popular deep neural network architectures under investigation are encoders-decoders or autoencoders (Figure 7).

This type of architecture consists of three layers, an input layer, a hidden (encoding) layer, and a decoding layer. Regarding the decoding layer the input is partially reconstructed, based on the important features. The hidden layer serves to learn good representations of the inputs. In an encoder-decoder architecture the input is suppressed and only the features that differ are presented. The convolutional filters are reduced at the different layers. The decoders try to reproduce the input, while the convolution layers are increased oppositely than the encoders. This allows to distinguish the ultimate features that can define a classification regime in every turn of training. Recurrent neural networks (RNN), such as the long short-term memory, or LSTM can learn recursively (Figure 8).

These types of networks make use of an internal state of memory to process sequences of inputs. RNNs are popular for speech or handwriting recognition. Furthermore, this form of architecture makes all the inputs relate to each other. Deep–learning applications have gained a lot of use in bioinformatics and system biology as they can be used either in supervised or unsupervised ML schemes and predict an outcome using a number of features. The increasing amount of biological input makes them suitable in various learning approaches from motif finding of TFs and RBPs to single cell clustering as it will be extensively analyzed in Section 3.

## Unsupervised Learning

Unsupervised Learning is an ML technique where the training is done without supervision, instead the model tries to discover patterns through mimicry and tries to build a compact representation of the input data. Known unsupervised learning techniques include the dimensionality reduction techniques PCA, UMAP and k-means while neural networks can be used as well.

*k-means*. The k-means algorithm (26) is perhaps the most popular and simplest data clustering algorithm. Assume we are given an input vector with N-data points (samples), *x _{i}*∈

*R*(considered as embedded in some Euclidean space

^{n}*R*, hence described as elements of

^{n}*R*and visualized as points in this space) say

^{n}*X={x*. Their embedding in

_{1},x_{2},...,x_{N}}*R*implies that data points which are similar (in some qualitative sense) will display this feature as having their Euclidean distance minimized in

^{n}*R*. Under this perception, one way of obtaining clusters of similar data points is to arrange these data points in sets

^{n}*C*, each located around a common “center”

_{r}, r=1,...,k*m*with the points in each such set (cluster) being similar in terms of their distances between themselves and the corresponding cluster center. This procedure could be easily visualized if

_{r}*l*=2,3 however a more sophisticated abstract formulation is needed in the case of high dimensional data, which is the case of practical interest in most applications. The above variational problem can be solved in terms of the k-means algorithm which is an iterative process which starts by randomly selecting k data points from x that serve as the initial candidate for the clustering centroids. Then, it assigns the remaining data points (

*i.e*., the data points of

*x*∈X, where

_{j}*x*into the k clusters by: (i) calculating the Euclidean distance between each data point from each clustering centroid, and (ii) assigning a data point

_{j}≠m_{r},r=1,...,k)*x*to the cluster

_{j}*C*if the Euclidean distance of

_{i}*x*from the r-th clustering centroid is smaller than the distance from the rest of the clustering centroids.

_{j}This is the main essence of the k-means algorithm, which is an NP hard problem in terms of computational complexity. Furthermore k-means has been used extensively in bioinformatics, one example is in cases of scRNA-seq data when creating the different sub-classes of cell types. Such an implementation can be seen in mbkmeans (28) formulation, where the authors make use of sub-sampling approaches with the use of mini-batches, while evaluating k-means for each batch, thus minimizing the memory requirements. Other methods can also include least absolute shrinkage (Lasso) techniques that can cope with the sparse high-dimensional nature of scRNA-seq together with k-means thus to apply feature selection and clustering of the cell sub-types (29).

*Dimensionality reduction*. The primary objective of all dimensionality reduction techniques is to seek for the optimal data representation model, in terms of data compression, which describes the initial dataset without significant information loss which is feasible since most datasets contain redundancies. To put it differently, some sets of features can be regarded as an indicator of another, initially unobserved, latent feature. This also implies that these sets of original features are correlated. On the other hand, if no redundancies exist within a dataset compression, regardless of the method, is unable to produce significant results. Thus, a dimensionality reduction method reduces an original -dimensional feature space to a -dimensional feature space (where, *n>k*), which can then be used to train a machine learning algorithm in a smaller feature space.

*Principal component analysis (linear)*. Principal component analysis (PCA) is one of the most widely adopted dimensionality reduction methods. PCA aims to create a feature space of reduced dimensions while preserving as much variance as possible in the original dataset. To illustrate the idea, consider a data set consisting of N data points *x _{i}*∈

*R*, where n is the original dimension of the feature space (equivalently a sample of the vector valued random variable

^{n}, i=1,...,N*x=(x*and which may conveniently be considered as a data matrix

_{1},...,x_{n})*X=[x*∈

_{1},...,x_{N}]*R*. Alternatively, each row j,

^{(n×N)}*j*=1,...,n. of the matrix can be considered as N observations

*(x*, of the feature j of the multidimensional data set. Then,

_{ji},i=1,...,N)is an estimate for the mean value of feature *x _{j}*, whereas

is an estimate of the covariance between the features *x _{i}* and

*x*, with the matrix

_{j}*S=(s*called the covariance matrix of the features. Without loss of generality, we consider that

_{kj})*E[x*, otherwise we may center the data matrix by subtracting the mean of each feature (row). Note that these important statistical quantities can be expressed directly in terms of the data matrix X. PCA attempts to find linear transformations of the

_{j}]=0,j=1,...,n**x**, in terms of n vectors

*a*∈

_{1},...,a_{n}*R*such that the new features

^{n}*z*

_{m}=a_{m}^{•}x

*, m=1,...,n*can well describe the sample and moreover that the transformation is such that only

*k<n*of the transformed features capture as much as possible of the variance of the original dataset. Using the sample for the vector valued random variable x, we have samples for each of the scalar random variables

*z*. Recall that if

_{m}*z=a•x*then

*var(z)=a•Sa*. An alternative way of looking at this is to consider the new features as a new coordinate system with which we view the original data set, as taking projections of the original data set along directions that minimize the projection error and thus the information loss. These directions define the new coordinate system and will satisfy appropriate orthogonality conditions

*i.e*.,

*cov(z*=0, m≠l. The resulting directions (equivalently linear combination of features) will be called principal components and will carry most of the information of the original data set in their first

_{m},z_{l})*k<n*components. The choice of directions will be made as follows: Choose

*a*respectively

_{1}*z*such that

_{1}*var(z*is maximum. At the r level,

_{1})*r≤k*, choose

*a*, respectively and

_{r}*z*such that var(z

_{r}_{r}) is maximum, subject to the constraints

*cov(z*. These problems can be handled using the technique of Lagrange multipliers, and the solution of these problems reveals that the desired directions a

_{r},z_{m})=0 m=1,...,r-1 and a_{r}•a_{r}=1_{r}are solutions of the eigenvalue problems (S-λI)a=0, where

*S*∈

*R*is the covariance matrix of the random variable x, which for centered data is related to

^{(n×n)}*X•X'*∈

*R*. The eigenvalue problem has n solutions for

^{(n×n)}*(λ*. All the eigenvalues are positive, and the eigenvector α

_{r},a_{r}),r=1,...,n_{1}corresponding to the largest eigen value λ

_{1}will indicate the first (dominant in terms of capturing the variance) PC, the eigenvector

*α*corresponding to the second largest eigen value

_{2}*λ*will correspond to the next (best performing after the dominant in terms of capturing the variance) PC

_{2}*etc*. The number k of PCs retained is related to the spectral properties of the matrix

*X•X'*∈

*R*, and can be found by ordering the eigenvalues of the covariance matrix in descending order and keeping the first k dominant ones. The procedure is closely related to the singular value decomposition SVD. The above procedure could be performed using SVD of the data matrix X, according to which X admits a representation of the form

^{(n×n)}*X=UΣV'*, where

*U*∈

*R*,

^{(n×n)}*Σ*∈

*R*,

^{(n×N)}*V*∈

*R*with U, V being orthogonal matrices containing the eigenvectors of

^{(N×N)}*X•X*' and

*X'•X*respectively, and Σ is a matrix which contains nonzero elements only on the diagonal consisting of the singular values of X. The choice of k can be made by ordering the singular values of the data matrix, while the transformation to principal components can be made using U.

*tSNE and UMAP: Two well-known dimensionality reduction techniques*. t-SNE is one of the most commonly used dimensionality reduction techniques. It represents high-dimensional data by assigning each datapoint, being in a higher dimension, to a two- or three-dimensional map. This is done by using a Gaussian probability Eq.10 for observing the distance between any two points in the high-dimensional map where an optimal σ (Eq. 11) is defined or so-called perplexity. The goal is to minimize a loss function when projecting from a high dimensional distribution to a lower one *via* a gradient optimization technique.

[10]

the symmetrization being used is so to ensure that the any points will be glued efficiently together and is set as p_{ij}=(pi|j +pj|i)/2N

[11]

The student *t*-test distribution can be used in the lower dimension to declare the distances (Eq. 12); thus, the t-SNE after describing the distance of any two points aims to learn the similarities of a d-dimensional map y_{i}...y_{n}.

[12]

The locations of the point in the map of d-dimension will be learned by estimating the Kullback-Leibler divergence loss function of the distribution P from the distribution Q (Eq. 13)

[13]

UMAP stands for uniform manifold approximation and projection: the algorithm tries to approximate a manifold on which the data lie and constructs a simpler representation of it. UMAP assumes that there exists a Riemannian manifold where the data are uniformly distributed. A simpler version of a manifold is assumed and derived *via* constructing a weighted k-neighbor graph. A fuzzy topological connectivity is applied on the edges of the graph having as a constraint the minimization of the cross entropy to deal with any inherent asymmetry. The algorithm will proceed by iteratively applying attractive and repulsive forces at each edge or vertex. This will converge to a local minimum by dynamically decreasing the attractive and repulsive forces. Like t-SNE given an input data set as *X={x _{1},...,x_{n}}*, and an input hyper-parameter k, for each

*x*we compute the distance of each set of

_{i}*{xi*points from the

_{1},...,xi_{k}}*k*- nearest neighbors. If, for each

*x*, we denote by p

_{i}_{i}its distance from its first nearest neighbor

*x*then:

_{j}[14]

with the symmetrization here to be p_{ij}=p_{i|j}+p_{j|i}-p_{i|j}p_{j|i} and like t-SNE the optimal *σ _{i}* will be

[15]

Since ϱ demonstrated the distance from each i-th data point to its first nearest neighbor this demonstrates a local connectivity of a manifold. The process of UMAP is approximating the number of nearest neighbors *k*, where for these *k* neighbors the UMAP function tries to glue together points with locally varying metrics. UMAP, unlike t-SNE, doesn’t use a t-distribution but instead it uses a family of curves which demonstrate the connectivity or strength between any of two points in a manifold as these can be defined as attractive or repelling forces, where a, b are hyper parameters: *1/((1+a•y ^{2b}))*. A distance probability

*q*

_{(i,j)}is set as

*q*. The goal is to find the minimum distance between points i,j in a 2D space described by

_{ij}=(1+a(y_{i}-y_{j})^{2b})^{–1}*X={x*, where overall this will lead to compact clusters with the lowest cost estimate. Regarding the cost estimate, UMAP sets this as a binary cross entropy CE approximation, as in Eq. 16:

_{1}...x_{i}}, Y={y_{1}...y_{i}}[16]

To estimate the cost function, we need to apply the gradient of the cross-entropy *via* applying the gradient descent. Setting the distance *d _{i,j}=y_{i}-y_{j} and Q(d_{i,j})=1/(1+ab_{ij}^{2b})* we obtain (Eq. 17):

[17]

In general, a Laplacian graph is used by UMAP to initial low dimension coordinates. In this context a graph is first constructed using the kNN algorithm (30), where it is formalized *via* constructing the Laplacian matrix. The eigen-value-decomposition problem is then used to factor the matrix. Figure 9 demonstrates an example of UMAP clustering when classifying gene expression cancer data from TCGA (31).

## Machine Learning Applications in Biology and Bioinformatics

In the last decade, the technological breakthroughs of biochemistry and next generation sequencing (NGS) have led to the generation of a huge amount of information. The need for novel ML approaches to decipher fundamental mechanisms of biology has gained great importance (32). This section will provide and describe several ML techniques that have been previously analyzed, while taking advantage of state-of-the-art biochemical methods to generate NGS data. Several techniques will be described for recognizing the functional domains of TFs and RNA binding proteins while taking advantage of neural networks to acquire this information. Then an analysis of the latest methods will be given on using expression data and genome wide association studies (GWAS) from TCGA and other consortia to learn the pathways that may drive diseases. Also, the latest achievements in scRNA-seq will be examined together with the ML techniques and dimensionality reduction algorithms that may enable us to learn and predict the cell states under normal conditions or in disease.

*The art of binding “from transcription factors to RBPs” Learn the determinants of transcription*. A large majority of encoded proteins have functional and regulatory roles. These elements such as TFs can recognize and bind specific regulatory sites of the DNA. These sites are composed of specific nucleotide motifs and recognize a characteristic geometry with high accessibility of the DNA helix. The binding of TFs in the promoter of a gene either facilitates the pre-initiation of the transcription complex with RNA Pol II or the TFs upon binding can recruit transcriptional cofactors to alter the chromatin state. Furthermore, histone methylation such as H3K4me3 contributes to active transcription while H3K27me3 and H3K9me3 have a repressive role. Other histone modifications include phosphorylation, ubiquitination, sumoylation and ADP ribosylation (33).

Chip-seq (34) and ATAC-seq (35) are two established methods that enable the identification of binding sites of TFs. With chromatin immunoprecipitation sequencing (ChIP-seq) we are able to identify the binding of TFs on the DNA as a specific antibody is used to IP the complex. More precisely, crosslinking DNA and proteins uses formaldehyde, followed by sonication of the DNA into smaller fragments usually around 200–600 bp fragments. This is then followed by immunoprecipitation of the DNA-protein complexes of interest with antibodies. The DNA is then uncross-linked, and the DNA is adapter ligated according to the library preparation steps before it is sequenced. The obtained fragments are the binding sites which hold the region of interest. Regarding ATAC-seq the nuclei of cells are isolated and a Tn5 transposase is used while ligated to adapters to identify open chromatin regions which are more accessible. Overall, these techniques have enabled us to identify the regulatory domains that drive transcription.

To predict the binding domains and decipher the regulatory properties of such sites, ML strategies have been employed. Such methods include convolutional neural networks such as in the case of DeepBind (36) which uses the sequence fragments of TFs and the open chromatin regions from experiments such as ATAC-seq or Chip-seq as input. More specifically, DeepBind determines a score for the binding positions in four stages. The convolution stage scans and groups appropriately a set of sequences of length m. There is a motif detector step which is 4 × m matrix, that extracts frequencies and forms position weight matrixes (PWM), which are then fed by a rectification stage where positions with high scores are selected, and all negative values are set to zero. There is a pooling layer which uses maximum and average techniques to identify short-in-long motifs which are then given as input into a nonlinear neural network. Similar studies include more complex architectures such as recurrent neural networks (RNNs) or long short-term memory (LSTM) which can improve the binding accuracy. Other methods such as KEGRU (37) use bidirectional recurrent networks named as b-GRUs which makes use of a k-mer sequence representation in combination with the states of the recurrent method to capture more efficiently the dependencies and thus achieve better performance. Other combinatorial methods such as Janggu (38) use DNA sequences from DNAse tracks as input to a convolution neural network. This method uses a higher order one-hot encoding of the DNA sequence that captures di- or trinucleotide-based motifs. Similar combinatorial methods as in DeepSEA (39), make use of various inputs such as DNase I–hypersensitive sites (DHSs), histone marks and TF binding profiles and have as a major goal the identification of functional effects of noncoding variants. Inputs are genomic sequences of the positions of the marks which are used in a deep neural network architecture. The effects of individual SNPs on TF binding sites have been evaluated with high performance. Similar DNA genomic features that can be investigated with deep neural networks include the investigation of DNA methylation sites and the analysis of chromatin loops from genome-wide interaction matrices from Hi-C experiments (40). The latter uses as input genome wide interaction maps and a set of positive defined and negative training sets to a binary classifier. A hyperparameter search is then followed to find the best random forest model separating the two classes, which can be used to detect loops from genome-wide contact maps. Another interesting approach that uses Hi-C to determine nuclear compartmentalization is presented in (41), where Hi-C data were employed to construct a Hi-C interaction graph whereby graph embedding techniques 1^{st} and 2^{nd} order proximities are derived, thus transforming the graph into a lower-dimensional space where *k*-means clustering is applied to cluster the nodes.

*Learning the RNA binding properties*. Apart from the DNA binding elements the research community has started to use ML applications to investigate the binding properties of RBPs. This task has gained great attention due to the hard nature and the many functional features of RNA such as the various sequence motifs and RNA structure in 2D and 3D. Unlike DNA, RNA is a molecule of high plasticity which adopts various regulatory structures from its transcription to decay. Furthermore, more than 100 different modifications have been examined and more than 1,800 regulatory RBPs have been discovered. Only recently, biochemistry has enabled us to detect the RNA structure in 2D or even in 3D genome-wide (42) while adopting the RNA backbone confirmations. These can be accomplished *via* evaluating techniques that enable us to use biochemical reagents that can cause mutations (43) or RT stops (44) upon open accessible regions or that make use of specific cross link techniques with compounds such as AMT (45) that can capture the double stranded RNA regions.

Furthermore, several enzymes such as in (46) can distinguish and cleave either single or double stranded RNA conformations to distinguish RNA structure. In addition, a large plethora of experiments have been developed to identify the binding domains of ribonucleocomplexes (RNPs). These mainly include UV cross linking and IP with a specific antibody and pull down of the complex. Furthermore, in order to increase the sensitivity of the binding position a 4sU analog of uridine can be used to allow for a specific mutation during cDNA synthesis (T>C conversion) upon the binding of the RBP with the RNA (47). Similar methods such as HITS CLIP (48) include mutations on the binding position that can be identified and increase resolution. In iCLIP (49) and upon circular ligation after the IP pull-down, the identification of an RT stop can be seen during cDNA synthesis on the RBP contact point. Thus, the exact location of the binding site irrespective of the sequence composition can be extracted. In addition, if these methods are combined with specific enzymatic cleavage, one can determine the RNA structure composition of the binding domain. Protein occupancy profiles (50) have determined various RNA binding proteins that can crosslink RNA. Moreover, cell fractionation experiments coupled with IPs and specific biochemistry for in vivo identification of RNA structure can be used to determine with more specificity the binding properties per cell compartment. This vast amount of information, if combined with ML, can lead to identification of the binding specificities of RBPs, *i.e*., the sequence and structural motifs that may drive such binding. DeepBind, as examined earlier, has also investigated the binding properties of RBPs using as input information the sequence of exon-exon junctions and demonstrates the leading binding motifs for RBPs known to regulate splicing such as TIA1, NOVA, PTBP1 or hnRNPC. IDeep (51) uses eCLIP-seq data from ENCODE (52) and also the RNA structure to train a hybrid network with two CNNs and a long-short temporary memory (LSTM) network. The input from sequence and structure is driven to CNNs and the LSTM learns the binding properties in terms of sequences and structures to improve prediction. Other methods such as DeepRipe (53) one-hot encode the sequence or in the case of pysster (54) also include the sequence and RNA 2D structure using an extended alphabet into a convolutional max-pooling and dropout layers. After the dropout layers the information is used by a dense neural network which can be easily tuned. Similarly, DeepCLIP (55) applies a similar network architecture that uses 1D convolutional layers to find and enhance features of a set of presented sequences. This is followed by a bidirectional long, short-term memory (BLSTM) layer which uses the extracted features and contextual information of the sequences to find areas of the RNA-sequences associated with RBP binding. A different approach is being used by a kernel-based model called Graphprot (56). This scheme extracts the structure and sequence information from CLIP-seq data. The structure of the binding sites is calculated *via* RNAshapes (57). The structure and sequence are encoded as a hypergraph, where graph kernels are used to extract features to be set as input information to support vector machine (SVM) and support vector regression (SVR) (58) modules. A graph-kernel can extract large number of features and when comparing bound and unbound regions while using a k-mer similarity, the binding features per RBP can be extracted versus the background noise.

Extracting information of the RNA backbone using 3D RNA modeling has been a great challenge. Molecular dynamics using Monte Carlo techniques (59) can be applied. Usually in this type of modeling the RNA in its 2D form will be placed on a grid. The RNA molecules will be perturbed on the grid and in each position the Poisson Boltzmann equation describing the energy state of each atom will be deciphered having as goal to extract the position with the minimum energy as in (60). ML techniques can be employed to learn the 3D properties such as dihedral angles and total energies per cluster of molecules. One such method is RNA3DCNN (61), where the RNA molecules can be treated as a 3D image or voxels as input to 3D CNNs. The RNA molecule is described using a 3D grid representation of the RNA molecules on a cartesian coordinate system directly as input to the convolutional neural network. This network is arranged using an input layer, of a two-stage convolutional layer, followed by a maxpooling layer and another two-stage convolutional layer following an output layer. The output is a score per nucleotide, defining how a nucleotide fits in its surrounding, taking into account all the conformations. Deepnet-rbp (62) uses information of the tertiary structure motifs as predicted by JAR3D (63) together with the sequence and structure into a multimodal deep learning module to predict RBP binding sites and motifs. JAR3D is a computational framework that extracts probable structural motifs in the hairpins and internal loop regions using RNA 3D Motifs Atlas (R3DMA) (64). The deep learning module uses restricted Boltzmann machines (RBMs) (65) of multi connected layers based on Markov random fields (66) which define the probability distribution of the variables. NucleicNet (67) uses features of the RNA backbone and the physicochemical characteristics of the RBPs such as hydrophobicity, molecular charges and accessibility surfaces calculated from Fpocket (68) with the ultimate goal to predict on each location of the RBP’s surface, scores for RNA interactions. More precisely the surface contact points of the RNPs are extracted using the physiochemistry of the RNA and the RBP’s potential contact points. Furthermore, these are clustered into classes that correspond to the bound and non-bound RNA sites. A deep residual network is trained to determine the scores according to the physicochemical properties and the network is optimized through standard back-propagation of the categorical cross entropy loss.

All these methods presented examples of different ML architectures used to decipher mechanisms of fundamental principles of biochemistry which mainly define specific structural and sequence motifs or the biophysical properties that can drive RBPs or TFs to bind and thus contribute to the post transcriptional regulation that determines the cell fate.

*Prediction of protein 3D structures using ML*. Accurate prediction of 3D protein structures remains a very demanding task in terms of computational power as there can be more than 10^{300} possible different ways that a protein can be folded before setting into a final stable 3D structure manifold (69). The importance however of evaluating an accurate 3D protein folding structure is of great essence in order to investigate protein function. Incorporating ML methods for predicting the 3D protein structure has been recently adopted to improve this task. The most well-known example is the Alphafold algorithm (70) where a neural network was trained to make predictions regarding the distances between pairs of residues and construct a potential mean force, able to define the shape of the protein with high accuracy. The predictions include backbone torsion angles and pairwise distances between residues. A gradient descent method was applied to optimize the predictions.

*Machine learning to identify the per cell transcriptome*. The recent advances in the field have enabled scientists to distinguish the transcriptome properties per cell using novel sc-RNA seq techniques. This achievement allows for the first time to discover the progression of a disease as the signaling is driven from the cell-to-cell communication and define more precisely marker genes in each cell stage and each cell cycle. The power of ML, and dimensionality reduction techniques is mandatory to extract the driver genes and RNA properties from this ever increasing per cell information. A plethora of methods and analysis exist taking advantage of several dimensionality reduction techniques. These methods as explained previously can demonstrate the various cell types, emerging from a single cell experiment but also examine driver genes of each cellular sub-population. Furthermore, cell deconvolution is an important aspect for such analysis. ML methods such as linear regression can be used on gene expression profiles (GEPs) of specifically expressed genes per cell type to estimate cellular sub-populations. Other methods such as Scaden (71) make use of a deep neural network for cell deconvolution. This method uses gene expression information as input, to a DNN where the hidden layer nodes of the DNN would account for higher-order latent representations of cell types. This seems to be more robust in terms of technical bias and noise. Many methods such as k-means, gaussian mixture models and spectral clustering have been employed for clustering cells from sc-RNAseq experiments.

Recently force directed layout graphs have been applied as another dimensionality reduction technique. FLOW-MAP (72), uses a graph layout analysis and sequential time ordering to extract cellular trajectories in high-dimensional single-cell data-sets. Diffusion maps (73) have also gained interest as they can preserve the relations between the data points and thus are more suitable for re-ordering the differentiating cells and for reconstructing developmental traces. Also, diffusion distance is robust to noise. Other methods to provide an alternative dimensionality reduction approach use the power of neural networks such as in the work of (74) and (75). Variational autoencoders (VAE) together with Bayesian inference have also been used to learn a probabilistic encoding of the data. Dhaka (76) is such a method where the assumption is that the data is coming from a multivariate Gaussian probability distribution. The autoencoder encodes the means (*μ _{z}*) and variances (

*σ*) of the Gaussian distributions. The sampled latent representation is then passed through a similar decoder network to reconstruct the input.

_{z}The major problem to tackle, in scRNA-seq analysis, is the large dropout rates in gene expression. Spectral clustering (SC) (77) is an emerging popular method that combines multiple kernels to learn a distance metric using the KNN algorithm that best fits the structure of the data. To resolve the issue of drop-out in sc-RNAseq data, several imputation methods have been implemented such as scImpute (78), MAGIC (79) and SAVER (80). Deep Impute (81) uses standard deep neural networks to predict the missing values by using correlated genes with high gene expression values. In brief, DeepImpute imputes gene counts in a divide-and-conquer approach, by constructing multiple sub-neural networks where each sub-neural network is used to decipher the relationship of certain category of genes with a subset targeted gene. Each sub neural network can use 512 input genes, a hidden layer of 256 neurons using a Relu activation, a 20% dropout layer and an output dense fully connected network. Similarly, DCA (82) uses a deep autoencoder scheme to denoise scRNA-seq by defining a reconstruction error as the likelihood of the distribution of the noise model instead of reconstructing the input data itself. Probabilistic principal components analysis (PPCA) or factor analysis (FA) (83) can also account for these events and provide another form of clustering. Zero inflated factor analysis (ZIFA) (84) defines a dropout relationship which takes into account the mean level of non-zero expressed genes (log read counts) as *μ* and the dropout rate *P _{o}* for each gene. The relationship can thus be defined as

*P*, where λ is a fitted parameter, based on a double exponential function. If one assumes that the separable cell states lie as points in a low dimensional space, these can be transformed and projected into a higher dimensional space using a linear transformation while adding a Gaussian distributes noise. Each point or cell then has the probability of being set to zero based on the dropout relationship.

_{o}=exp(–λμ^{2})Trajectory inference methods have enhanced single cell analysis which enables us to determine genes that are associated with specific lineages or are differentially expressed between lineages *via* the use of branching events. Monocole and Beam (85) fit additive models of gene expresion as a function of pseudotime. Bifurcation analysis can be inferred to identify whether a gene expression can be differentially associated with any of two lineages.

Quite recently it was shown that, by a technique named velocity (86), the future states of individual cells can be predicted. In RNA velocity, the time derivative of the gene expression between spliced and unspliced genes can be distinguished from the reads that fall within introns and exons. This can result in an extra layer or feature in a ML scheme to predict the cell states and cell cycle if also coupled with Viterbi (87) or a Hidden Markov algorithm or even if it is introduced in a neural network. In addition, when introducing mRNA labeling techniques such as in SLAM-seq (88) with scRNA-seq this creates an extra dimension for modelling RNA kinetis. Dynamo (89) provides a framework that incorporates intrinsic splicing with mRNA labeling kinetics to determine RNA velocities and extract vector fields that determine and predict future cell fates. To accomplish this, a machine learning scheme is employed which contracts a kernel method to learn a vector field in Hilbert space *via* the use of weighted linear combination of functions that describe the field.

Apart from deciphering the dynamics of the transcriptome per cell type, single cell RNA-seq technologies have advanced and allow us to decipher the binding properties of TFs per cellular sub-type. This can be achieved with scATAC-seq which has emerged as the method of choice to map open chromatin regions, which can be used to infer TF binding events per cell type. ScFAN (90) uses a CNN pretrained on bulk Chip-seq or ATAc-seq data and is used to predict TF binding properties at the single cell level. The data input is a feature vector of 1000 base pair bins from total ATAC-seq, Chip-seq and DNA sequence into a CNN linked to two fully connected layers while using a sigmoid function to make predictions of motifs for TFs. This model is the used on scATAC-seq (91) data to predict the candidate active TFs per cell state.

More recent methods based upon these principles introduce a weighted-nearest neighbor approach as in (92), which enables embedding of multiple datasets to be used for clustering cell sub-populations. Thus, this allows us to consider various single-cell approaches such as scATAC-seq for determining chromatin accessibility or CITE_seq (93) to be combined and thus infer a specific cell sub-population.

In this section, we have studied several applications of scRNA-seq analysis while incorporating ML methods to better project and cluster cells from a high dimensional manifold to a reduced representation. This however, as studied extensively in (94), can result to bias regarding the interpenetration of results. The authors provide an example of a sphere to determine dimensionality reduction loss by comparing the local neighborhood of a point in the sphere with the neighborhood of the same point in the reduced dimensional space using the Jaccard distance as a metric. Implementing AI techniques for such a task such as using a VAE network, can greatly contribute to the dimensionality reduction and loss. Moreover, considering the reduced representation, data from various single cell biochemistries and applying specific weights while building a new graph topology when using all available information can also improve the observed bias.

*Training on gene expression patterns and GWAs studies to learn drug targeted therapies*. Repositories such as TCGA (31), Cosmic (95), cBioportal (96) and CPTAC (97) have enabled scientists to use a large repertoire of data sets targeted on oncology from samples that are retrieved directly from patients from a variety of cancer types. Drugs targeting specific genes are frequently used in chemotherapy, thus learning the expression values of certain genes, the mutation and methylation profiles in addition to other features such as alternative splicing events or specific regulatory features of the RNA for these transcripts can help determine ultimate treatments and move towards a more personalized medicine approach per patient and per cancel type. Recent advances in this field such as in (98) have used association rule mining techniques (99) to distinguish how mutations in compliance with gene expression can result in chemoresistance; this is done by generating gene association correlations while extracting scores and ranking them to prioritize pathways. Furthermore, by incorporating deep neural networks (DNNs) as in (21), which were trained and optimized on a 1,001 cell-line drug response database, a model was generated which was tested blindly on patient cohorts to determine the best treatment. This approach compared several ML implementations such as random forests (RFs) and elastic nets (Enets) (100) with DNNs to decipher the best performance model, while tested from patients derived from TCGA cohorts and the Multiple myeloma consortium (101). Another approach named DrugCell (102) combines conventional artificial neural networks (ANN) with a visible neural network (VNN) to learn, using as input mutations from GWAS studies per cancer patient, molecular subsystems from 2,086 biological processes of the Gene Ontology (GO) database (Gene Ontology Consortium 2004) and the drug chemical structure encoding the Morgan fingerprint of a drug (103) to learn specific drug treatments per groups of patients.

*New sequencing technologies and ML*. In the last five years nanopore technology has started to shape a new era into introducing fast and simple sequencing technologies (104). These methods rely on a protein nanopore from which a DNA or RNA molecule with appropriate adapter priming is let through the pore. The interaction of each base with the pore creates a unique current signal whereby the base pair is determined. Areas of high GC content or polyA repeats can cause noise during the read-out of the current; thus, ML is applied to learn and correct for such errors. A hidden Markov model (HMM) based approach, has been applied to detect 5mC DNA in CpG from events of Nanopore reads. Similarly, DeepSignal (105) makes use of convolutional neural networks (CNN) using as input raw electrical signals around methylated bases. In parallel, a sequence feature module uses a bidirectional recurrent neural network (BRNN) that determines features from sequences of signal information. Then, the output features from the two modules (CNN and BRNN) are concatenated and fed into a fully connected neural network. The advances of such technology are many; small size of devise makes it portable but also direct sequencing of RNA molecules to define RNA modifications makes this technology a promising tool for RNA biology.

## Concluding Remarks

This work provides an overview of the supervised and unsupervised ML techniques and the dimensionality reduction methods that are widely coupled with the fundamental mathematics behind them. Examples of how these applications can be used in bioinformatics for multiple integration of genomic data was shown regarding various tasks, from deciphering the elements that drive TFs and RBP binding sites to sc-RNAseq applications while leading to more holistic approaches for using such data sets to learn treatments in oncology. Thus, the aim of this review was to cover the importance and the vast applications of ML in biology. As a future perspective, a very promising contribution of AI and ML is strongly related to what we call “precision medicine”. The main concept behind this type of approach is to apply principles of medical science tailored according to the needs and the personal characteristics of each patient. The era of personalized medicine, based on omics data, such as genomics or proteomics, has started. T-cell specific immunotherapy (106) *via* seeking for neo-antigens is the next bet of the 21st century. In conclusion, the vast amounts of daily generated medical data, the clinical unmet needs, the complexity of rare and common diseases and the patient’s center strategies applied in most biomedical institutions, point out an emerging need for handling medical data in the most efficient way. ML approaches are necessary tools to be employed in every aspect of medical clinical practice where specific research methodologies need to be applied, capable of optimizing the current health policies and promoting the transition towards the precision medicine era.

## Supplementary Material

We provide a handbook of extra supplemental material and examples of code in Python for the most common ML and Bioinformatic analysis. This can be found under the URL: https://www.gorgoulis.gr/images/Supplementary_Material_Pezoulas_et_al.pdf

## Acknowledgements

VGG and his colleagues received financial support from the following grants: National Public Investment Program of the Ministry of Development and Investment/General Secretariat for Research and Technology, in the framework of the Flagship Initiative to address SARS-CoV-2 (2020ΣΕ01300001); Horizon 2020 Marie Sklodowska-Curie training program no. 722729 (SYNTRAIN); Welfare Foundation for Social & Cultural Sciences, Athens, Greece (KIKPE); H. Pappas donation; Hellenic Foundation for Research and Innovation (HFRI) grants no. 775 and 3782, NKUA-SARG grant 70/3/8916 and H. Pappas donation. This study was also co-financed by the European Regional Development Fund of the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH – CREATE - INNOVATE (project code: Τ2EDK-02939).

## Footnotes

↵* These Authors contributed equally to this study.

**Authors’ Contributions**V.C.P., O.H., N.L. wrote the original draft and prepared the original figures. V.C.P., O.H., N.L., T.P.E., A.V.G. wrote, reviewed and edited the manuscript as well as assisted in literature search. A.G.T., D.I.F., I.G.S. supervised the preparation of the subsections, A.N.Y. and V.G.G. conseptualized and supervised the process of all the study and the manuscript preparation. V.G.G. achieved funding acquisition. All Authors have read and agreed to the published version of the manuscript.

This article is freely accessible online.

**Conflicts of Interest**The Authors declare no conflicts of interest.

- Received June 25, 2021.
- Revision received July 21, 2021.
- Accepted August 3, 2021.

- Copyright© 2021, International Institute of Anticancer Research (Dr. George J. Delinasios), All rights reserved