Impute Missing Data
- xynergy.impute.pre_impute(df: DataFrame, dose_cols: list[str] = ['dose_a', 'dose_b'], response_col: str = 'response', experiment_cols: str | list[str] = 'experiment_id', method: str = 'RBFSurface', target: str = 'response', reference_for_target: str = 'bliss', ensemble_response_weight: float = 0.6, clip_response_bounds: tuple[float, float] | None = (0.0, 100.0), use_single_drug_response_data: bool = True, additional_imputation_cols: str | list[str] | None = None, log: str = 'all')
Impute missing response data.
Parameters
- df: polars.DataFrame
Usually the output from tidy or one of its downstream functions.
- dose_cols: list, default [“dose_a”, “dose_b”]
A list of exactly two column names that contain untransformed numeric agent doses.
- response_col: string, default “response”
The name of the column containing the observed responses, with missing values marking the responses to be imputed.
- experiment_cols: list[str], string, or None, default “experiment_id”
The names of columns that should be used to distinguish one dose pair’s response from another. If none are supplied, two rows with the same doses will be considered replicates. Experiments are imputed separately, so as to prevent information leakage. These columns are used strictly for grouping and are not used for imputation.
- method: string, default “RBFSurface”
The imputation method. Options:
“RBFSurface” (recommended): RBF interpolation of Bliss residuals in log-dose space. Exploits the pharmacological smoothness of dose-response surfaces. Very fast and generally the most accurate method, especially when the observed cells are the single-drug edges and a positional diagonal.
“GaussianProcessSurface”: Gaussian-process regression in log-dose space. Slower than RBFSurface, but a strong non-parametric surface baseline for benchmarking.
“MatrixCompletion”: Iterative rank-truncated SVD that exploits the low-rank structure of dose-response matrices.
“XGBR”: Slowest, but the most accurate of the tabular methods.
“RandomForest”: Roughly medium speed and accuracy.
“LassoCV”: Fast, but poor accuracy. Not recommended.
Any other value: Falls back to the default sklearn IterativeImputer (fastest, and sometimes more accurate than LassoCV).
- target: string, default “response”
What to impute. Options:
“response”: Impute response_col directly (existing behavior).
“combo_effect”: Impute residual interaction effect relative to reference_for_target, then reconstruct response.
“ensemble”: Blend the "response" and "combo_effect" predictions using ensemble_response_weight.
- reference_for_target: string, default “bliss”
Only used when target = "combo_effect". Defines the no-interaction baseline used to create the target residual. Options:
“bliss”
“hsa”
- ensemble_response_weight: float, default 0.6
Only used when target = "ensemble". Weight assigned to the direct "response" prediction; the "combo_effect" prediction receives 1 - ensemble_response_weight.
- clip_response_bounds: tuple[float, float] | None, default (0.0, 100.0)
Bounds applied to reconstructed response columns (resp_imputed, response_imputed_from_effect) when available. Use None to disable.
- use_single_drug_response_data: bool, default True
Some methods, such as RandomForest, perform better when the dataset contains columns with the single-drug responses, for example the response to ‘drug A’ alone at ‘dose_a’ (no combination). If this parameter is True, these values are calculated automatically and included as data for imputation. You might set this to False if you want to supply your own data, in which case you would add the names of those columns to additional_imputation_cols. In general, this step can only help and is relatively quick.
- additional_imputation_cols: string, list[str], optional
Additional column name(s) that should also be used for imputation. Columns not listed here will be dropped prior to imputation and rejoined afterwards.
- log: string, default “all”
Verbosity of function. Options include “all”, “warn”, and “none”.
If “all”, will emit notes and warnings.
If “warn”, will emit only warnings.
If “none”, will not emit anything (except errors).
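As a rough illustration of how target, reference_for_target, ensemble_response_weight, and clip_response_bounds relate to one another, the sketch below works through a single missing well in plain Python. It is not xynergy's implementation. The helper names are hypothetical, the numbers are made up, and it assumes that responses are percent inhibition on a 0 to 100 scale, that the combo_effect residual is the response minus the reference, and that the ensemble blends the two response-scale predictions.

```python
def bliss_reference(resp_a, resp_b):
    """Bliss independence for percent inhibition on a 0-100 scale (assumed scale)."""
    return resp_a + resp_b - resp_a * resp_b / 100.0


def hsa_reference(resp_a, resp_b):
    """Highest single agent: the larger of the two single-drug responses."""
    return max(resp_a, resp_b)


def clip(value, bounds):
    """Mimic clip_response_bounds: None disables clipping."""
    if bounds is None:
        return value
    low, high = bounds
    return min(max(value, low), high)


def blend(response_pred, effect_based_pred, ensemble_response_weight=0.6):
    """Weighted blend; the effect-based prediction gets 1 - weight."""
    w = ensemble_response_weight
    return w * response_pred + (1.0 - w) * effect_based_pred


# Hypothetical values for one missing combination well.
resp_a, resp_b = 40.0, 50.0                    # single-drug responses
reference = bliss_reference(resp_a, resp_b)    # 70.0 (no-interaction baseline)

direct_pred = 75.0    # made-up prediction for target="response"
effect_pred = 12.0    # made-up predicted residual for target="combo_effect"

# Reconstruct a response from the predicted residual, then clip it.
from_effect = clip(reference + effect_pred, (0.0, 100.0))    # 82.0

# Blend the two response-scale predictions (roughly 77.8 with weight 0.6).
ensemble = clip(blend(direct_pred, from_effect), (0.0, 100.0))
print(reference, from_effect, round(ensemble, 2))
```

Swapping bliss_reference for hsa_reference changes only the baseline (50.0 here), which in turn shifts the response reconstructed from the same residual.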
Returns
Input df with a resp_imputed column (plus [dose_cols]_resp if use_single_drug_response_data = True).
If target = "combo_effect", these additional columns are returned:
combo_effect_imputed
response_imputed_from_effect
If target = "ensemble", these additional columns are returned:
resp_imputed_response
resp_imputed_combo_effect
resp_imputed_ensemble
- xynergy.impute.post_impute(df: DataFrame, dose_cols: list[str] = ['dose_a', 'dose_b'], response_col: str = 'response', experiment_cols: str | list[str] | None = 'experiment_id', imputed_response_cols: list[str] | None = None, imputed_resp_prefix: str = 'resp_imputed_', post_impute_tuning: str = 'Predefined', log: str = 'all')
Predict missing data using matrix factorization.
Parameters
- df: polars.DataFrame
Usually the output from tidy or one of its downstream functions.
- dose_cols: list, default [“dose_a”, “dose_b”]
A list of exactly two column names that contain untransformed numeric agent doses.
- response_col: string, default “response”
The name of the column containing the observed responses, with missing values marking the responses to be imputed.
- experiment_cols: list[str], string, or None, default “experiment_id”
The names of columns that should be used to distinguish one dose pair’s response from another. If none are supplied, two rows with the same doses will be considered replicates. Experiments are imputed separately, so as to prevent information leakage. These columns are used strictly for grouping and are not used for imputation.
- imputed_response_cols: list[str], optional
Columns to use for imputation. If unspecified, all columns that match imputed_resp_prefix are used.
- imputed_resp_prefix: str, default "resp_imputed_"
Only used if imputed_response_cols is None. When looking for columns to use for imputation, columns that have this prefix in their name are used.
- post_impute_tuning: string, default “Predefined”
Strategy for tuning XGBoost hyperparameters in post-imputation.
“Predefined”: Use fixed hyperparameters (learning_rate=0.1, max_depth=3, subsample=0.9, gamma=0.5, n_estimators=50). Very fast.
“RandomizedSearchCV”: Sample a subset of the hyperparameter space (20 random combinations, up to 3-fold CV). Moderate speed.
“GridSearchCV”: Exhaustive search over the full hyperparameter grid (324 combinations, up to 3-fold CV). Slowest but most thorough.
- log: string, default “all”
Verbosity of function. Options include “all”, “warn”, and “none”.
If “all”, will emit notes and warnings.
If “warn”, will emit only warnings.
If “none”, will not emit anything (except errors).
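The interplay between imputed_response_cols and imputed_resp_prefix can be sketched in a few lines of plain Python. This is an assumption about the selection rule, not xynergy's code: select_imputed_cols is a hypothetical helper, and it treats the prefix as a strict start-of-name match, which the description above does not spell out.

```python
def select_imputed_cols(columns, imputed_response_cols=None,
                        imputed_resp_prefix="resp_imputed_"):
    """Pick the columns to feed into post-imputation (assumed rule)."""
    if imputed_response_cols is not None:
        # An explicit list wins; the prefix is ignored.
        return list(imputed_response_cols)
    return [name for name in columns if name.startswith(imputed_resp_prefix)]


# Column names taken from the pre_impute Returns section (target="ensemble").
columns = [
    "experiment_id", "dose_a", "dose_b", "response",
    "resp_imputed",
    "resp_imputed_response", "resp_imputed_combo_effect", "resp_imputed_ensemble",
]

print(select_imputed_cols(columns))
print(select_imputed_cols(columns, imputed_response_cols=["resp_imputed"]))
```

Under this assumed rule the default prefix, with its trailing underscore, picks up the three suffixed columns but not the bare resp_imputed column, so passing imputed_response_cols explicitly is the safer choice when only resp_imputed is present.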
Returns
The same input, with the values in response_col imputed.