JADBio Vocabulary

A

Accuracy

Accuracy is the ratio of the number of correct classifications (true positives and true negatives) to the total number of samples (positive and negative).

Algorithm

A well-specified, unambiguous computational procedure. E.g., the bubble-sort procedure that is used for sorting an array of values, or the Decision Tree learning algorithm that learns a Decision Tree model to classify given samples

Analysis

Analysis includes data cleaning, transformation, exploration, and modeling for decision making.

Applied model

A predictive model (configuration) whose performance was validated on a new dataset.

AUC (Area Under the ROC curve)

The area under the ROC curve. The AUC equals the average sensitivity over all possible specificities (equivalently, over all false positive rates). It also equals the probability that the model assigns a higher score to a randomly selected positive sample than to a randomly selected negative sample. It measures how well a model can distinguish between two groups (e.g. cases and controls). An area of 1 represents a perfect model; an area of 0.5 represents random guessing.
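
The pairwise reading of the AUC can be sketched directly in code. The function name and the example scores below are hypothetical, and counting ties as half a correct ranking is a common convention assumed here rather than something stated in this entry.

```python
def auc_pairwise(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher.

    Ties count as half a correct ranking (assumed convention).
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative sample")
    correct = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos_scores) * len(neg_scores))


# Hypothetical model scores for three positive and three negative samples.
auc = auc_pairwise([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
print(round(auc, 3))  # 0.889 (8 of the 9 pairs are ranked correctly)
```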

Average F1 Score

The average F1 score across all classes.

B

Balanced accuracy

Balanced accuracy is the average sensitivity over all classes. For binary problems this reduces to the average of the sensitivity and the specificity of the reference class.
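
As an illustration, balanced accuracy can be computed as the mean of the per-class sensitivities. This is a minimal sketch with hypothetical labels; the function name is not part of any particular tool.

```python
def balanced_accuracy(y_true, y_pred):
    """Average of the per-class sensitivities (recalls)."""
    recalls = []
    for c in sorted(set(y_true)):
        n_class = sum(1 for t in y_true if t == c)
        n_hit = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(n_hit / n_class)
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced data: 2 positives (1) and 8 negatives (0).
y_true = [1, 1] + [0] * 8
y_pred = [1, 0] + [0] * 8
print(balanced_accuracy(y_true, y_pred))  # 0.75
```

Plain accuracy on the same data is 0.9, which shows how balanced accuracy exposes the weak performance on the minority class.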

Balanced data

Refers to the categorical target variable. Balanced data have classes that are equally represented, i.e. equal number of samples per class.

Baseline

A basic predictive model that uses no features and is based solely on the outcome.

Best interpretable model

See interpretable model.

Best performing model

The predictive model with the highest performance.

Bias

The deviation of an estimated metric from its true value. Bias often arises when computing unadjusted estimates from cross-validation, typically resulting in over-estimation of the final model’s true performance.

Binarization

During binarization, a categorical feature with k classes is transformed into k-1 binary columns. Each row of the resulting matrix has the value 1 in the column that corresponds to the sample's class, and 0 elsewhere. A sample with 0s in every column belongs to the reference class, hence the k-1 columns.
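
A minimal sketch of this encoding follows. The function name, the alphabetical column order, and the example values are assumptions made for illustration.

```python
def binarize(values, reference):
    """Encode a categorical feature with k classes as k-1 binary columns.

    The reference class gets no column, so its rows are all zeros.
    """
    levels = sorted(set(values) - {reference})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


levels, rows = binarize(["red", "blue", "green", "red"], reference="red")
print(levels)  # ['blue', 'green']
print(rows)    # [[0, 0], [1, 0], [0, 1], [0, 0]]
```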

C

Categorical feature

Categorical features take discrete values that represent categories. The values can be numerical codes (such as 1 for red and 2 for blue) or strings.

Class

Classes are the different values of a categorical feature.

Classification analysis

An analysis that is used to predict the class (category) an instance of data belongs to. The input of a classification algorithm is a set of labeled examples, and its output is a classifier, which you can use to predict the class of new unlabeled instances. If the input data belong to two classes (categories), the analysis is a binary classification task. If the input data belong to more than two classes, the analysis is a multi-class classification task.

Classification threshold

A threshold on the probability prediction of a classification model above which the prediction is classified as an instance of a specific class.

Concordance Index (CI)

A performance metric for survival (time-to-event) models. Specifically, CI considers each pair of patients and asks whether the predictor's values for those patients match what actually happened to them, i.e. whether the patient predicted to be at higher risk experienced the event first.
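
The pairwise idea can be sketched as follows. This is a simplified version in the spirit of Harrell's C-index, not necessarily the exact variant any given tool reports: a pair is treated as comparable only when the sample with the shorter time experienced the event, ties in time are ignored, and ties in predicted risk count as half. All names and data are hypothetical.

```python
def concordance_index(times, events, risks):
    """Share of comparable pairs in which the higher predicted risk has the shorter time.

    events[i] is 1 if the event was observed and 0 if the sample was censored.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Comparable pair: i had the event, and strictly earlier than j's time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical follow-up times, event indicators, and predicted risk scores.
ci = concordance_index([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.3, 0.5, 0.1])
print(ci)  # 0.8 (4 of the 5 comparable pairs are ordered correctly)
```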

Confidence Interval

A range of plausible values for an unknown quantity, computed from the data. A 95% confidence interval is constructed by a procedure that, over repeated sampling, contains the true value of the estimated quantity 95% of the time.

Configuration

A pipeline of algorithms for preprocessing, transformations, imputation, feature selection, and modeling along with their hyper-parameter values that takes as input data and outputs a predictive model.

Confusion Matrix

A 2x2 double-entry matrix, where the first dimension refers to the observed class membership (i.e. “positive” or “negative”) frequencies, and the second dimension refers to the predicted class membership frequencies. Its four cells hold the counts of true positives, false negatives, false positives, and true negatives.
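
A short sketch of how such a matrix can be tallied, with rows for the observed class and columns for the predicted class as described above. Placing the positive class first, and the function name itself, are assumptions.

```python
def confusion_matrix(y_true, y_pred, positive=1):
    """Return [[TP, FN], [FP, TN]]: rows are observed, columns are predicted."""
    tp = fn = fp = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return [[tp, fn], [fp, tn]]


# Hypothetical observed and predicted labels.
matrix = confusion_matrix([1, 1, 1, 0, 0, 0, 0], [1, 1, 0, 0, 0, 1, 0])
print(matrix)  # [[2, 1], [1, 3]]
```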

Correlation Coefficient

The correlation coefficient is a measure of linear association between two variables. Its values are always between -1 and +1. A correlation coefficient of +1 indicates that two variables are perfectly related in a positive linear sense, a correlation coefficient of -1 indicates that they are perfectly related in a negative linear sense, and a value of 0 indicates no linear association.
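
For illustration, the Pearson form of the correlation coefficient can be computed as below. The entry does not name a specific coefficient, so choosing Pearson here is an assumption, as are the function name and data.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / (spread_x * spread_y)


r_up = pearson([1, 2, 3, 4], [2, 4, 6, 8])    # perfectly increasing, close to +1
r_down = pearson([1, 2, 3, 4], [8, 6, 4, 2])  # perfectly decreasing, close to -1
```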

Cross Validation (CV)

Cross-Validation (CV) is a technique for assessing how the results of a statistical analysis will generalize to an independent dataset. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. The dataset is partitioned according to specific rules (e.g. k-fold or holdout, with or without stratification), so that one part is used for training and the other for validation.

Classifier

An algorithm trained for classification tasks.

Constant Removal

A procedure during which all variables with zero-variance are discarded. These features do not contain any useful information for discrimination or regression.
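
A minimal sketch of constant removal, assuming the dataset is held as a dictionary that maps each feature name to its list of values (a hypothetical representation chosen for brevity):

```python
def remove_constant_features(columns):
    """Keep only the features that take more than one distinct value."""
    return {name: values for name, values in columns.items()
            if len(set(values)) > 1}


# Hypothetical dataset: "batch" is constant and carries no information.
data = {"age": [34, 51, 47], "batch": ["A", "A", "A"], "bmi": [22.1, 27.4, 22.1]}
kept = remove_constant_features(data)
print(list(kept))  # ['age', 'bmi']
```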

Censoring

In survival analysis, censoring of a sample occurs when the sample 1) has not experienced the event by the end of the study (right-censoring), 2) has left the study for other reasons before completion (right-censoring), or 3) entered the study at a later time (left-censoring).

D

Data cleaning

~--~

Data exploration

~--~

Dataset

A set of samples with their corresponding feature values.

Decision making

~--~

Dependent

The variable that we wish to predict.

Dimensionality reduction

A technique that maps the original features to a feature space with fewer dimensions. The mapping function can be a linear combination of the original features or a complex non-linear kernel function.

Datapoint

See sample.

E

Equivalent signatures

Signatures whose predictive performance is statistically indistinguishable.

Event feature

An event feature is a categorical feature that defines whether the outcome (e.g. death, relapse) occurred. It can take only two values, 1 and 0. The value “1” means the event occurred and “0” means it did not.

F

F1 score

The F1 score is the harmonic mean of the precision and recall of a reference class. It reaches its best value at 1 (perfect precision and recall) and its worst at 0.
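
As a worked sketch, the F1 score can be computed from the counts of true positives, false positives, and false negatives. The counts below are hypothetical, and returning 0 when there are no true positives is an assumed convention.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall for the reference class."""
    if tp == 0:
        return 0.0  # assumed convention when precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: precision = 40/50 = 0.8, recall = 40/60 (about 0.667).
f1 = f1_score(tp=40, fp=10, fn=20)
print(round(f1, 3))  # 0.727
```

The same value follows from the equivalent form 2*TP / (2*TP + FP + FN) = 80/110.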

False Negatives (FN)

False Negatives are positive samples that are incorrectly predicted as negative. For example, cancer patients that are predicted as control subjects.

False Positives (FP)

False Positives are negative samples that are incorrectly predicted as positive. For example, control subjects that are predicted as cancer patients.

Feature

A measured or computed quantity that characterizes a sample.

Feature importance

Feature importance refers to techniques that assign a score to independent features (predictors) based on how much each feature affects the outcome (dependent) variable.

Feature Selection

The task of selecting a minimal-size feature subset that is optimally predictive for a specific outcome.

Feature Type

The type of a feature, e.g. numerical, categorical, event, time to event, or identifier.

False Positive Rate (FPR)

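The ratio of false positives to the total number of negative samples (false positives + true negatives). Equal to 1 minus Specificity. See Specificity.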

False Discovery Rate (FDR)

The number of false positives divided by the number of predicted positive samples (true positives + false positives). Equal to 1 minus Precision.

H

Holdout

A partitioning technique where a portion of the samples are kept out of the training procedure to act as a test or validation set.

Hyper-parameter

A parameter of an algorithm that is set before training rather than learned from the data.

Heatmap

A heatmap is a data visualization technique that shows magnitude of a phenomenon as color in two dimensions.

Harmonic mean

The reciprocal of the arithmetic mean of the reciprocals of a set of values. Appropriate for averaging ratios or rates.

I

Identifier feature

~--~

In-sample prediction

The prediction derived by applying a model on data that was also used to train it.

Individual Conditional Expectation (ICE) plot

An ICE plot displays one line per instance (sample), showing how that instance’s prediction changes as the value of a feature changes.

Interpretable model

A machine learning algorithm's interpretability refers to how easy it is for humans to understand the processes it uses to arrive at its outcomes. Interpretable models are models that explain themselves; for instance, decision rules can easily be extracted from a decision tree. Linear regression, logistic regression, and decision trees are commonly used interpretable models.

Independent

A variable that affects the target variable. Member of the model matrix.

K

k-fold

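A partitioning technique where the samples are divided into k disjoint sets (folds). Each fold serves once as the validation set while the remaining k-1 folds are used for training.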
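
The partitioning can be sketched as follows. Assigning samples to folds in round-robin order is an assumption made to keep the example deterministic; in practice the samples are usually shuffled or stratified first.

```python
def k_fold_splits(n_samples, k):
    """Return k (train_indices, validation_indices) pairs over disjoint folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    # Round-robin assignment of sample indices to folds (assumed scheme).
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    splits = []
    for i, validation in enumerate(folds):
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        splits.append((train, validation))
    return splits


splits = k_fold_splits(n_samples=6, k=3)
print(splits[0])  # ([1, 2, 4, 5], [0, 3])
```

Every sample appears in exactly one validation fold, so each sample receives exactly one out-of-sample prediction.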

Kaplan Meier curve

The Kaplan-Meier curve is the visual representation of the Kaplan-Meier estimator. It shows the estimated probability that the event of interest has not yet occurred (for example, the probability of survival) at each point in time.
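
A minimal sketch of the Kaplan-Meier (product-limit) estimator behind the curve: at each observed event time, the running survival probability is multiplied by the fraction of at-risk samples that did not experience the event. The function name and data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each time with an observed event.

    events[i] is 1 if the event was observed and 0 if the sample was censored.
    """
    survival = 1.0
    curve = []
    for t in sorted({ti for ti, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for ti in times if ti >= t)
        n_events = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        survival *= 1 - n_events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical study: events at times 1, 3, 4; censoring at times 2 and 5.
curve = kaplan_meier([1, 2, 3, 4, 5], [1, 0, 1, 1, 0])
for t, s in curve:
    print(t, round(s, 3))
```

Censored samples lower the number at risk at later times without causing a drop in the curve themselves.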

L

Learning algorithm

An algorithm that outputs a model given a dataset. Examples: the linear regression, ridge regression, and decision tree algorithms.

M

Majority class

In categorical variables, the most prevalent class.

Mean Absolute Error (MAE)

MAE is the average of the absolute differences between the predicted values and the observed values. The MAE is a linear score, which means that all individual differences are weighted equally in the average. For example, an error of 10 contributes exactly twice as much as an error of 5.

Mean estimate

The average of a metric.

Mean Squared Error (MSE)

MSE is one of the most commonly used metrics for regression tasks. It is the average of the squared differences between the target values and the values predicted by the regression model. Because it squares the differences, it penalizes large errors disproportionately, so a few large errors can dominate the score and overstate how bad the model is.
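
The contrast between MAE and MSE can be seen in a short sketch with hypothetical values; the function names are illustrative only.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |observed - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of (observed - predicted) squared."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical observed and predicted values; the errors are 1, 0, 2, 3.
y_true = [3, 5, 2, 7]
y_pred = [2, 5, 4, 4]
print(mean_absolute_error(y_true, y_pred))  # 1.5
print(mean_squared_error(y_true, y_pred))   # 3.5
```

The single error of 3 contributes half of the MAE total (3 of 6) but about 64% of the MSE total (9 of 14).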

Mean Squared Logarithmic Error (MSLE)

A performance metric for regression analysis. It computes the expected value of the squared difference between the logarithms of the true and predicted values, which makes it sensitive to the relative (percentage) difference between them rather than the absolute difference.
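
A minimal sketch, assuming the common formulation that takes log(1 + value) so that zeros are allowed; this entry does not state which form is used. Values must be greater than -1.

```python
import math


def mean_squared_log_error(y_true, y_pred):
    """Average of (log(1 + observed) - log(1 + predicted)) squared."""
    return sum((math.log1p(t) - math.log1p(p)) ** 2
               for t, p in zip(y_true, y_pred)) / len(y_true)


# Both hypothetical predictions are off by a factor of 10 in (1 + value),
# so they receive (up to rounding) the same penalty despite very different
# absolute errors.
small_scale = mean_squared_log_error([9], [99])
large_scale = mean_squared_log_error([99], [999])
```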

Mean/Mode Imputation

The replacement of missing values (usually called NaN’s or NA’s) within a feature by its mean, if it is numeric, or its mode (most frequent), if it is categorical.
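
A short sketch of mean/mode imputation using only the Python standard library. Representing missing values as None and passing the feature type as a flag are assumptions made for illustration.

```python
from statistics import mean, mode


def impute(values, numeric):
    """Replace missing entries (None) by the mean or the mode of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if numeric else mode(observed)
    return [fill if v is None else v for v in values]


print(impute([1.0, None, 3.0], numeric=True))        # [1.0, 2.0, 3.0]
print(impute(["a", None, "b", "a"], numeric=False))  # ['a', 'a', 'b', 'a']
```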

Mode estimate

The most frequent value of a variable.

Model

An algorithm (function) that accepts one or more vectors of feature values and returns the corresponding predictions. If the outcome is the state of the sample (e.g., disease vs. control) the corresponding model is better described as a diagnostic model. If the outcome corresponds to a prognosis for a patient the model is better described as a prognostic model. If the outcome regards a value that becomes known in the future, then it is a predictive model.

Model matrix

A matrix containing the measured independent variables across samples.

N

Numerical feature

A feature whose values are quantitative (discrete or continuous). Such data has meaning as a measurement, such as height, weight, or BMI.

Null hypothesis

The null hypothesis (often denoted H0) is a default hypothesis that a quantity to be measured is zero (null). The quantity is usually expressed as a difference or ratio of measured quantities.

O

Observation

See sample.

Out-of-sample prediction

The prediction derived by applying a model on data other than the data used to train it.

Outcome (feature)

The feature for which a predictive model is to be constructed, dependent on the values of the other, independent features. Also called: target variable, dependent variable, response variable.

Over/Under – estimation

~--~

Outlier

A sample that resides outside of a given confidence interval, e.g. 95%, for a specific distribution of a variable, or variables.

P

Performance metric

A metric that describes/explains the performance of your predictive model.

Pipeline

Algorithmic tasks applied in sequence when performing an analysis.

Precision

The number of true positives divided by the number of predicted positive samples (true positives + false positives).

Precision Recall curve

The Precision Recall curve visualizes the performance of a binary classifier. In this curve, precision is plotted against recall for different classification thresholds. The higher the curve lies on the y-axis, the better the model's performance.

Predictive algorithm

The modelling algorithm used for prediction. Usually the final step of a pipeline.

Predictive Performance

The performance of the created predictive model.

Predictor

Any independent feature included in the produced model/signature.

Probabilities plot

In a probabilities plot, the probability density function of each class is plotted against the predicted probability of belonging to this specific class.

Project

~--~

Preprocessing

~--~

Q

Quantitative

A feature or variable that measures a size or quantity.

Qualitative

A feature or variable that expresses a state or a description.

R

R-squared (R2)

R2 is a statistical measure for regression tasks of how close the data are to the fitted regression line; it is the percentage of the response variable variation that is explained by a linear model. Typically ranges from 0 to 1. A value of 0 means that the data is random or otherwise cannot be fit to the model. A value of 1 means that the model exactly matches the data. Negative values can occur when the predictions are worse than the ones produced by the trivial model (always predicting with the average value of the response variable).
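
The three cases described above (perfect fit, trivial model, worse than trivial) can be reproduced with a small sketch using the usual formula R2 = 1 - SS_res / SS_tot. The function name and data are hypothetical.

```python
def r_squared(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares around the mean)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when the observed values are constant")
    return 1 - ss_res / ss_tot


y = [1, 2, 3]
print(r_squared(y, [1, 2, 3]))  # 1.0  (exact match)
print(r_squared(y, [2, 2, 2]))  # 0.0  (always predicting the average)
print(r_squared(y, [3, 2, 1]))  # -3.0 (worse than the trivial model)
```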

Recall

See Sensitivity term for a definition.

Reference signature

A set of variables output from a feature selection algorithm.

Regression analysis

A regression analysis is used to predict the value of a quantitative trait from an instance of data (set of feature values). The trait can take any real value. The output of a regression algorithm is a function, which you can use to predict the numeric value for any new set of input data. Examples of regression scenarios include predicting the height of a newborn or predicting the age of a person.

Relative Absolute Error (RAE)

The RAE is expressed as a ratio, comparing a mean error (residual) to errors produced by a trivial model. A reasonable model (one which produces results that are better than a trivial model) will result in a ratio of less than one.

Relative Squared Error (RSE)

RSE is the sum of squared errors divided by the sum of squared distances between the observed values and the average of all observed values.
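
Both relative errors can be sketched side by side. The sketch assumes that the trivial model always predicts the average of the observed values; the function names and data are hypothetical.

```python
def relative_absolute_error(y_true, y_pred):
    """Total absolute error relative to that of always predicting the mean."""
    mean_y = sum(y_true) / len(y_true)
    model = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    trivial = sum(abs(t - mean_y) for t in y_true)
    return model / trivial


def relative_squared_error(y_true, y_pred):
    """Total squared error relative to that of always predicting the mean."""
    mean_y = sum(y_true) / len(y_true)
    model = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    trivial = sum((t - mean_y) ** 2 for t in y_true)
    return model / trivial


# Hypothetical observed and predicted values; only the last prediction is off.
observed = [1, 2, 3, 4]
predicted = [1, 2, 3, 5]
print(relative_absolute_error(observed, predicted))  # 0.25
print(relative_squared_error(observed, predicted))   # 0.2
```

Values below 1 indicate that the model does better than the trivial model on this data.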

Residuals' plot

Residuals are the differences between observed values and those predicted by the model. Any pattern in the residuals is evidence of an inadequate model or of irregularities in the data, such as outliers, and suggests how the model may be improved.

ROC (Receiver Operating Characteristic) curve

A curve showing the achievable trade-offs between the False Positive Rate (equal to 1 minus Specificity) and the True Positive Rate (equal to the Sensitivity) of a binary classification model. The ROC curve is obtained by varying the model's classification threshold over all possible values. The higher the TPR and the lower the FPR at each threshold, the better, so classifiers whose curves lie closer to the top-left corner are better. The area under this curve is the model’s AUC.
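
A minimal sketch of how the curve can be traced: each distinct score is used as a threshold, and the resulting (FPR, TPR) point is recorded. Predicting positive when the score is at or above the threshold, and the trapezoidal rule for the area, are assumptions; the data are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points from the strictest to the loosest threshold."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under(points):
    """Trapezoidal area under a piecewise-linear curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
points = roc_points(labels, scores)
print(round(area_under(points), 3))  # 0.889
```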

Regressor

An algorithm trained for regression tasks.

S

Sample

The measurement of a datapoint’s features.

Scenario

An analysis setting defined by the characteristics of the input data (type of outcome, type of features, size of data) and user preferences.

Sensitivity

Sensitivity is the ratio of the number of true positive samples to the total number of positive samples (true positives + false negatives).

Signature

The result of applying feature selection on a dataset for a given outcome: a feature subset that is of minimal size and leads to an optimally predictive model.

Specificity

Specificity is the ratio of the number of true negative samples to the total number of negative samples (true negatives + false positives).
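
Sensitivity, specificity, and the false positive rate can be computed together from the four confusion-matrix counts. The sketch below uses hypothetical counts and function names.

```python
def sensitivity(tp, fn):
    """True positives over all positive samples (also called recall or TPR)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negatives over all negative samples (also called TNR)."""
    return tn / (tn + fp)


def false_positive_rate(tn, fp):
    """False positives over all negative samples; equals 1 - specificity."""
    return fp / (tn + fp)


# Hypothetical counts: 50 positive samples and 50 negative samples.
print(sensitivity(tp=40, fn=10))         # 0.8
print(specificity(tn=45, fp=5))          # 0.9
print(false_positive_rate(tn=45, fp=5))  # 0.1
```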

Standardization (z-score transformation)

Centering a feature by subtracting its mean value and scaling it by dividing by its standard deviation. In other words, a z-score transformation.
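
A short sketch of the transformation. Using the sample standard deviation (n-1 denominator) is an assumption; some tools use the population form instead.

```python
from statistics import mean, stdev


def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mu = mean(values)
    sd = stdev(values)  # sample standard deviation (assumed choice)
    if sd == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(v - mu) / sd for v in values]


z = standardize([2, 4, 6])
print(z)  # [-1.0, 0.0, 1.0]
```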

Statistical Significance

The result of a statistical test on a certain hypothesis (usually called null hypothesis, or H0). It represents the probability of observing something at least as extreme as actually observed under the aforementioned hypothesis, and thus is associated with the likelihood of observing such extreme values given that the null hypothesis holds. In other words, it is a compatibility indication of a certain hypothesis and the observed data.

Stratification feature

A stratification feature is a categorical feature that is used to stratify samples into groups.

Supervised Principal Component Analysis (SPCA)

A dimensionality reduction technique, used here to help visualise how well the selected features and the applied learning algorithm distinguish between classes. It seeks the linear combinations (Principal Components) of the original features such that the derived features capture maximal variance. JADBio performs PCA using only the features included in the signature. The appropriate visualization for this model is a PCA plot, in which the class predictions are visualized.

Stratification

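During dataset splitting into disjoint sets, stratification keeps the proportions of the classes (or of the groups defined by a stratification feature) approximately equal across the sets.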

T

Time to Event (Survival) analysis

Survival analysis computes a hazard (risk) function of time and assigns probabilities for a specific event to occur based on the occurrence of past events. Survival analysis is challenged by the presence of censored data, i.e. events that have not occurred yet but might in the future.

Time to Event feature

The Time to Event feature refers to the time elapsed until the event or the censoring occurred.

Transformation

Applying a mathematical formula on a value or vector of values.

Trimmed values

Transformation on outliers??

True Negatives (TN)

True negatives are correctly predicted negative samples. For example, the healthy samples that are predicted as negative samples.

True Positives (TP)

True Positives are correctly predicted positive samples. For example, the cancer patients that are predicted as positive samples.

Training

A procedure during which an algorithm (e.g. a classifier or a feature selection algorithm) tries to map the independent variables (model matrix) to a given dependent variable.

Trivial model

A model that produces the most basic predictions (null model) and is task-specific. For binary classification, the prediction is the most prevalent class, while for regression the average of the outcome.

True Positive Rate (TPR)

See Sensitivity.

True Negative Rate (TNR)

See Specificity.

Trait

See outcome.

U

Unadjusted estimate

The mean estimate produced by a cross-validation procedure, without bias correction.

Uniform Manifold Approximation and Projection (UMAP)

UMAP is a novel non-linear dimensionality-reduction technique. UMAP is constructed from a theoretical framework based in Riemannian geometry and algebraic topology. The result is a practical scalable algorithm that applies to real world data.

V

Validation

The procedure during which an external dataset is used to validate the final produced model.

Z

z-score

See Standardization.