Impute Missing Data

xynergy.impute.pre_impute(df: DataFrame, dose_cols: list[str] = ['dose_a', 'dose_b'], response_col: str = 'response', experiment_cols: str | list[str] = 'experiment_id', method: str = 'RBFSurface', target: str = 'response', reference_for_target: str = 'bliss', ensemble_response_weight: float = 0.6, clip_response_bounds: tuple[float, float] | None = (0.0, 100.0), use_single_drug_response_data: bool = True, additional_imputation_cols: str | list[str] | None = None, log: str = 'all')

Impute missing response data.

Parameters

df: polars.DataFrame

Usually the output from tidy or one of its downstream functions

dose_cols: list, default [“dose_a”, “dose_b”]

A list of exactly two column names that contain untransformed numeric values of agent dose.

response_col: string, default “response”

The name of the column containing responses and missing responses to be imputed

experiment_cols: list[str], string, or None, default “experiment_id”

The names of columns that should be used to distinguish one dose pair’s response from another. If none are supplied, two rows with the same doses will be considered replicates. Experiments are imputed separately, so as to prevent information leakage. These columns are used strictly for grouping and are not used for imputation.
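As a rough illustration of the grouping described above, the sketch below splits rows into per-experiment groups so that each group could be imputed on its own. It uses plain dictionaries rather than a polars DataFrame, and the helper name group_by_experiment is hypothetical, not part of xynergy.

```python
from collections import defaultdict


def group_by_experiment(rows, experiment_cols):
    """Split rows (dicts) into groups keyed by their experiment column values.

    Hypothetical helper for illustration only; not part of xynergy.
    """
    if experiment_cols is None:
        cols = []  # no grouping columns: every row lands in one group
    elif isinstance(experiment_cols, str):
        cols = [experiment_cols]
    else:
        cols = list(experiment_cols)

    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in cols)
        groups[key].append(row)
    return dict(groups)


rows = [
    {"experiment_id": "e1", "dose_a": 1.0, "dose_b": 1.0, "response": 40.0},
    {"experiment_id": "e1", "dose_a": 1.0, "dose_b": 10.0, "response": None},
    {"experiment_id": "e2", "dose_a": 1.0, "dose_b": 1.0, "response": 75.0},
]

groups = group_by_experiment(rows, "experiment_id")
ungrouped = group_by_experiment(rows, None)
```

With no experiment columns, the two rows at dose pair (1.0, 1.0) end up in the same group, which is the situation in which they would be treated as replicates.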

method: string, default “RBFSurface”
  • “RBFSurface” (recommended): RBF interpolation of Bliss residuals in log-dose space. Exploits the pharmacological smoothness of dose-response surfaces. Very fast and generally the most accurate method, especially when the observed cells are the single-drug edges and a positional diagonal.

  • “GaussianProcessSurface”: Gaussian-process regression in log-dose space. Slower than RBFSurface, but a strong non-parametric surface baseline for benchmarking.

  • “MatrixCompletion”: Iterative rank-truncated SVD that exploits the low-rank structure of dose-response matrices.

  • “XGBR”: slowest, but the most accurate of the tabular methods.

  • “RandomForest”: roughly medium speed and accuracy.

  • “LassoCV”: fast, but poor accuracy. Not recommended.

  • Any other value: falls back to the default sklearn IterativeImputer (fastest, and sometimes more accurate than LassoCV).
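To make the idea behind “RBFSurface” more concrete, here is a minimal pure-Python sketch of radial basis function interpolation of Bliss residuals in log-dose space. It is not the library's implementation: the Gaussian kernel, the length scale, the small ridge term, the residual values, and every function name are assumptions chosen for illustration, and only cells with nonzero doses are used so that the logarithm is defined.

```python
import math


def gaussian_kernel(p, q, length_scale=1.0):
    """Gaussian RBF between two points in (log10 dose_a, log10 dose_b) space."""
    d2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return math.exp(-d2 / (2.0 * length_scale ** 2))


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_rbf(points, values, length_scale=1.0, ridge=1e-9):
    """Fit RBF weights to the observed residuals and return a predictor."""
    kernel = [
        [gaussian_kernel(p, q, length_scale) + (ridge if i == j else 0.0)
         for j, q in enumerate(points)]
        for i, p in enumerate(points)
    ]
    weights = solve_linear(kernel, values)

    def predict(p):
        return sum(w * gaussian_kernel(p, q, length_scale)
                   for w, q in zip(weights, points))

    return predict


def log_dose(dose_a, dose_b):
    return (math.log10(dose_a), math.log10(dose_b))


# Illustrative observed diagonal cells: (dose_a, dose_b, observed - Bliss expected)
observed = [(1.0, 1.0, 2.0), (10.0, 10.0, 8.0), (100.0, 100.0, 5.0)]
points = [log_dose(a, b) for a, b, _ in observed]
residuals = [r for _, _, r in observed]

predict_residual = fit_rbf(points, residuals)

# Estimated residual at an unobserved off-diagonal cell
estimate = predict_residual(log_dose(10.0, 1.0))
```

Interpolating the residual rather than the raw response means the smooth surface only has to capture the departure from the no-interaction baseline; the predicted response would then be the baseline plus the interpolated residual.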

target: string, default “response”

What to impute. Options:

  • “response”: Impute response_col directly (existing behavior).

  • “combo_effect”: Impute residual interaction effect relative to reference_for_target, then reconstruct response.

  • “ensemble”: Blend the "response" and "combo_effect" predictions using ensemble_response_weight.

reference_for_target: string, default “bliss”

Only used when target = "combo_effect". Defines the no-interaction baseline used to create the target residual. Options:

  • “bliss”

  • “hsa”
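The two baselines can be sketched as follows. This is a minimal illustration, assuming responses are percent inhibition on a 0 to 100 scale and that the residual is defined as observed minus reference; neither convention is stated in this reference, and all function names are hypothetical.

```python
def bliss_reference(resp_a, resp_b):
    """Bliss independence for percent-inhibition responses (0-100 scale, assumed)."""
    return resp_a + resp_b - resp_a * resp_b / 100.0


def hsa_reference(resp_a, resp_b):
    """Highest single agent: the larger of the two single-agent inhibitions."""
    return max(resp_a, resp_b)


REFERENCES = {"bliss": bliss_reference, "hsa": hsa_reference}


def combo_effect(observed, resp_a, resp_b, reference="bliss"):
    """Residual interaction effect relative to the chosen baseline (sign assumed)."""
    return observed - REFERENCES[reference](resp_a, resp_b)


def reconstruct_response(effect, resp_a, resp_b, reference="bliss"):
    """Invert combo_effect: the baseline plus the (imputed) residual."""
    return REFERENCES[reference](resp_a, resp_b) + effect


# Drug A alone inhibits 40%, drug B alone 50%, the combination 80%.
effect_bliss = combo_effect(80.0, 40.0, 50.0, reference="bliss")
effect_hsa = combo_effect(80.0, 40.0, 50.0, reference="hsa")
rebuilt = reconstruct_response(effect_bliss, 40.0, 50.0, reference="bliss")
```

In this example the Bliss baseline is 70, so the residual is 10, while the HSA baseline is 50, giving a residual of 30. If responses were viability rather than inhibition, the Bliss baseline would instead be a product and HSA a minimum.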

ensemble_response_weight: float, default 0.6

Only used when target = "ensemble". Weight assigned to the direct "response" prediction; the "combo_effect" prediction receives 1 - ensemble_response_weight.
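The weighting described above amounts to a convex combination of the two predictions. A minimal sketch, with a hypothetical function name and illustrative input values:

```python
def blend_predictions(response_pred, combo_effect_pred, ensemble_response_weight=0.6):
    """Weighted blend of the direct prediction and the combo_effect-based one.

    Both inputs are on the response scale. Hypothetical helper, not part of xynergy.
    """
    w = ensemble_response_weight
    return w * response_pred + (1.0 - w) * combo_effect_pred


# Direct prediction of 80 and combo_effect-based prediction of 60
blended = blend_predictions(80.0, 60.0)           # default weight 0.6
response_only = blend_predictions(80.0, 60.0, 1.0)  # ignores the combo_effect prediction
```

With the default weight of 0.6, the blend of 80 and 60 is 72.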

clip_response_bounds: tuple[float, float] | None, default (0.0, 100.0)

Lower and upper bounds applied to the reconstructed response columns (resp_imputed, response_imputed_from_effect) when those columns are present. Use None to disable clipping.
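Clipping of this kind can be sketched in a few lines. The helper below is hypothetical and assumes the tuple is ordered as (lower, upper), which matches the default of (0.0, 100.0).

```python
def clip_response(value, bounds=(0.0, 100.0)):
    """Clamp a reconstructed response to (lower, upper); None disables clipping."""
    if bounds is None:
        return value
    lower, upper = bounds
    return min(max(value, lower), upper)


too_high = clip_response(104.2)        # clamped to the upper bound
too_low = clip_response(-3.0)          # clamped to the lower bound
unclipped = clip_response(104.2, None)  # left as-is
```

Clipping matters mostly for reconstructed values, because adding an imputed residual to a baseline can push the result outside the range a measured response could take.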

use_single_drug_response_data: bool, default True

Some methods, such as RandomForest, perform better when the dataset contains columns holding each agent's single-drug response, for example the response to drug A alone at dose_a (no combination). If this parameter is True, these values are calculated automatically and included in the data used for imputation. Set it to False if you want to supply your own single-drug data instead, in which case add the names of those columns to additional_imputation_cols. This step is relatively quick and, in general, can only help.
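One way to picture the single-drug columns is as a lookup of each agent's response when the other agent's dose is zero. The sketch below works on plain dictionaries for a single experiment without replicates; the helper name, the zero-dose convention for monotherapy rows, and the resulting column names (dose_a_resp, dose_b_resp) are assumptions based on the [dose_cols]_resp pattern mentioned under Returns.

```python
def add_single_drug_responses(rows, dose_cols=("dose_a", "dose_b"), response_col="response"):
    """Attach each agent's monotherapy response at its dose to every row.

    Hypothetical helper: assumes monotherapy rows have the other agent's dose == 0.
    """
    col_a, col_b = dose_cols
    # Response of agent A alone at each dose (agent B absent), and vice versa.
    alone_a = {r[col_a]: r[response_col] for r in rows
               if r[col_b] == 0 and r[response_col] is not None}
    alone_b = {r[col_b]: r[response_col] for r in rows
               if r[col_a] == 0 and r[response_col] is not None}
    return [
        {**r, f"{col_a}_resp": alone_a.get(r[col_a]), f"{col_b}_resp": alone_b.get(r[col_b])}
        for r in rows
    ]


rows = [
    {"dose_a": 0.0, "dose_b": 0.0, "response": 0.0},
    {"dose_a": 1.0, "dose_b": 0.0, "response": 20.0},
    {"dose_a": 10.0, "dose_b": 0.0, "response": 45.0},
    {"dose_a": 0.0, "dose_b": 1.0, "response": 10.0},
    {"dose_a": 0.0, "dose_b": 10.0, "response": 30.0},
    {"dose_a": 1.0, "dose_b": 1.0, "response": None},   # missing, to be imputed
    {"dose_a": 10.0, "dose_b": 10.0, "response": 60.0},
]

enriched = add_single_drug_responses(rows)
missing_cell = enriched[5]
```

Even though the combination response at (1.0, 1.0) is missing, the row now carries both monotherapy responses, which gives a tabular model informative features for that cell.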

additional_imputation_cols: string, list[str], optional

Additional column name(s) that should also be used for imputation. Columns not listed here will be dropped prior to imputation and rejoined afterwards.

log: string, default “all”

Verbosity of function. Options include “all”, “warn”, and “none”.

  • If “all”, will emit notes and warnings.

  • If “warn”, will emit only warnings.

  • If “none”, will not emit anything (except errors).

Returns

The input df with an added resp_imputed column, plus a [dose_cols]_resp column for each entry of dose_cols if use_single_drug_response_data = True.

If target = "combo_effect", additional columns are returned:

  • combo_effect_imputed

  • response_imputed_from_effect

If target = "ensemble", these additional columns are returned:

  • resp_imputed_response

  • resp_imputed_combo_effect

  • resp_imputed_ensemble

xynergy.impute.post_impute(df: DataFrame, dose_cols: list[str] = ['dose_a', 'dose_b'], response_col: str = 'response', experiment_cols: str | list[str] | None = 'experiment_id', imputed_response_cols: list[str] | None = None, imputed_resp_prefix: str = 'resp_imputed_', post_impute_tuning: str = 'Predefined', log: str = 'all')

Predict missing data using matrix factorization.

Parameters

df: polars.DataFrame

Usually the output from tidy or one of its downstream functions

dose_cols: list, default [“dose_a”, “dose_b”]

A list of exactly two column names that contain untransformed numeric values of agent dose.

response_col: string, default “response”

The name of the column containing responses and missing responses to be imputed

experiment_cols: list[str], string, or None, default “experiment_id”

The names of columns that should be used to distinguish one dose pair’s response from another. If none are supplied, two rows with the same doses will be considered replicates. Experiments are imputed separately, so as to prevent information leakage. These columns are used strictly for grouping and are not used for imputation.

imputed_response_cols: list[str], optional

Columns to use for imputation. If unspecified, all columns whose names match imputed_resp_prefix are used.

imputed_resp_prefix: str, default "resp_imputed_"

Only used if imputed_response_cols is None. When looking for columns to use for imputation, will use columns that have this prefix in their name.
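The column selection described by these two parameters can be sketched as below. The helper name is hypothetical, and matching with str.startswith is an assumption about how "prefix" is interpreted.

```python
def select_imputed_columns(columns, imputed_response_cols=None,
                           imputed_resp_prefix="resp_imputed_"):
    """Pick the columns to use: an explicit list wins, otherwise match by prefix."""
    if imputed_response_cols is not None:
        return list(imputed_response_cols)
    return [c for c in columns if c.startswith(imputed_resp_prefix)]


columns = [
    "experiment_id", "dose_a", "dose_b", "response",
    "resp_imputed",
    "resp_imputed_response", "resp_imputed_combo_effect", "resp_imputed_ensemble",
]

by_prefix = select_imputed_columns(columns)
explicit = select_imputed_columns(columns, imputed_response_cols=["resp_imputed"])
```

Under this reading, the default prefix ends in an underscore, so a column named exactly resp_imputed would not be picked up by prefix and would need to be listed in imputed_response_cols.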

post_impute_tuning: string, default “Predefined”

Strategy for tuning XGBoost hyperparameters in post-imputation.

  • “Predefined”: Use fixed hyperparameters (learning_rate=0.1, max_depth=3, subsample=0.9, gamma=0.5, n_estimators=50). Very fast.

  • “RandomizedSearchCV”: Sample a subset of the hyperparameter space (20 random combinations, up to 3-fold CV). Moderate speed.

  • “GridSearchCV”: Exhaustive search over the full hyperparameter grid (324 combinations, up to 3-fold CV). Slowest but most thorough.
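The difference in cost between the three strategies comes down to how many candidate settings are evaluated. The sketch below enumerates candidates with the standard library only. The fixed "Predefined" values are taken from the description above, but the grid itself is hypothetical: its values were chosen only so that the full product has 324 combinations, and the actual search space is not given in this reference.

```python
import itertools
import random

# Fixed hyperparameters listed for the "Predefined" strategy.
PREDEFINED = {
    "learning_rate": 0.1, "max_depth": 3, "subsample": 0.9,
    "gamma": 0.5, "n_estimators": 50,
}

# HYPOTHETICAL grid: 3 * 3 * 3 * 3 * 4 = 324 combinations.
HYPOTHETICAL_GRID = {
    "learning_rate": [0.01, 0.1, 0.3],
    "max_depth": [3, 5, 7],
    "subsample": [0.7, 0.9, 1.0],
    "gamma": [0.0, 0.5, 1.0],
    "n_estimators": [50, 100, 200, 400],
}


def candidate_params(strategy, grid=HYPOTHETICAL_GRID, n_iter=20, seed=0):
    """Return the hyperparameter settings each strategy would evaluate."""
    if strategy == "Predefined":
        return [dict(PREDEFINED)]
    names = list(grid)
    full = [dict(zip(names, combo))
            for combo in itertools.product(*(grid[n] for n in names))]
    if strategy == "GridSearchCV":
        return full
    if strategy == "RandomizedSearchCV":
        # Draw n_iter distinct settings from the full grid.
        return random.Random(seed).sample(full, n_iter)
    raise ValueError(f"Unknown strategy: {strategy!r}")


n_predefined = len(candidate_params("Predefined"))
n_random = len(candidate_params("RandomizedSearchCV"))
n_grid = len(candidate_params("GridSearchCV"))
```

With up to 3-fold cross-validation, an exhaustive search over 324 settings means at most 972 model fits, compared with at most 60 for the randomized search and none for the fixed settings.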

log: string, default “all”

Verbosity of function. Options include “all”, “warn”, and “none”.

  • If “all”, will emit notes and warnings.

  • If “warn”, will emit only warnings.

  • If “none”, will not emit anything (except errors).

Returns

The input df with the missing values in response_col imputed.