Distribution consistency-based missing value imputation algorithm for large-scale data sets (2024)

Abstract

Objective

As a significant research branch in the field of data mining, missing value imputation (MVI) aims to provide high-quality data support for the training of machine learning algorithms. However, MVI results for large-scale data sets are not ideal in terms of restoring the data distribution and improving prediction accuracy. To improve the performance of existing MVI algorithms, we propose a distribution consistency-based MVI (DC-MVI) algorithm that attempts to restore the original data structure by imputing the missing values of large-scale data sets.

Methods

First, the DC-MVI algorithm defines an objective function that determines the optimal imputation values based on the principle of probability distribution consistency. Second, the data set is preprocessed by random initialization of the missing values and normalization, and a feasible missing value update rule is derived to obtain the imputation values with the closest variance to, and the greatest consistency with, the complete original values. Next, in a distributed environment, the large-scale data set is divided into multiple groups of random sample partition (RSP) data blocks that have the same distribution as the entire data set, taking into account the statistical properties of the large-scale data set. Finally, the DC-MVI algorithm is trained in parallel to obtain the imputation value for each missing value of the large-scale data set while preserving distribution consistency with the non-missing values. The rationality experiments verify the convergence of the objective function and the contribution of DC-MVI to distribution consistency. In addition, the effectiveness experiments compare DC-MVI with eight other MVI algorithms (mean, KNN, MICE, RF, EM, SOFT, GAIN, and MIDA) on three indicators: distribution consistency, time complexity, and classification accuracy.
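The preprocessing and partitioning steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `preprocess` and `rsp_blocks` are hypothetical, min-max scaling is assumed as the normalization, missing entries are assumed to be initialized uniformly in [0, 1] after scaling, and a plain random shuffle stands in for a full RSP generation procedure. The DC-MVI update rule itself is not given in this abstract and is therefore not shown.

```python
import numpy as np


def preprocess(X, rng):
    """Min-max normalize observed entries, then randomly initialize missing ones.

    Assumptions (not from the paper): missing values are encoded as NaN,
    normalization is min-max per column, and initial values are U(0, 1).
    """
    X = np.asarray(X, dtype=float).copy()
    mask = np.isnan(X)                          # True where a value is missing
    col_min = np.nanmin(X, axis=0)              # statistics over observed entries only
    col_max = np.nanmax(X, axis=0)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    X = (X - col_min) / span                    # NaN entries stay NaN here
    X[mask] = rng.uniform(0.0, 1.0, size=mask.sum())
    return X, mask


def rsp_blocks(X, n_blocks, rng):
    """Shuffle the rows and split them into n_blocks roughly equal blocks.

    Each block is a simple random sample of the rows, so its empirical
    distribution approximates that of the whole data set.
    """
    perm = rng.permutation(X.shape[0])
    return [X[idx] for idx in np.array_split(perm, n_blocks)]


rng = np.random.default_rng(0)
X = np.array([[1.0, np.nan], [2.0, 4.0], [np.nan, 8.0], [3.0, 6.0]])
X_init, missing_mask = preprocess(X, rng)
blocks = rsp_blocks(X_init, n_blocks=2, rng=rng)
```

In a distributed setting, each block would then be imputed independently, which is what allows the per-block training to run in parallel.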

Results

The experimental results on seven selected large-scale data sets showed that: 1) the objective function of the DC-MVI method was effective and the missing value update rule was feasible, allowing the imputation values to remain stable throughout the adjustment process; 2) the DC-MVI algorithm obtained the smallest maximum mean discrepancy and Jensen-Shannon divergence on all data sets, showing that its imputed values had a probability distribution more consistent with the complete original values under the given significance level; 3) the running time of the DC-MVI algorithm tended to remain stable in the time comparison experiment, whereas the running time of the other state-of-the-art MVI methods increased linearly with data volume; 4) the DC-MVI approach produced imputation values that were more consistent with the original data set than those of existing methods, which was beneficial for subsequent data mining analysis.
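For readers unfamiliar with the two distribution-consistency indicators, the following Python sketch shows one common way to estimate them from samples. The choices here are assumptions, not details from the paper: the maximum mean discrepancy uses a biased squared estimate with an RBF kernel and a hypothetical bandwidth parameter `gamma`, and the Jensen-Shannon divergence is computed per feature from histograms with base-2 logarithms, so it lies in [0, 1].

```python
import numpy as np


def mmd_rbf(X, Y, gamma=1.0):
    """Biased estimate of the squared MMD between samples X and Y (rows = points)."""
    def kernel(A, B):
        # Pairwise squared Euclidean distances, clipped at 0 for numerical safety.
        d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
        return np.exp(-gamma * np.maximum(d2, 0.0))
    return kernel(X, X).mean() + kernel(Y, Y).mean() - 2.0 * kernel(X, Y).mean()


def js_divergence(x, y, bins=20):
    """Jensen-Shannon divergence (base 2) between two 1-D samples via histograms."""
    lo, hi = min(x.min(), y.min()), max(x.max(), y.max())
    p, _ = np.histogram(x, bins=bins, range=(lo, hi))
    q, _ = np.histogram(y, bins=bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)                           # mixture distribution

    def kl(a, b):
        nz = a > 0                              # 0 * log(0) is treated as 0
        return np.sum(a[nz] * np.log2(a[nz] / b[nz]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


rng = np.random.default_rng(0)
original = rng.normal(size=(50, 2))
shifted = original + 3.0                        # a deliberately poor "imputation"
mmd_same = mmd_rbf(original, original)
mmd_diff = mmd_rbf(original, shifted)
js_same = js_divergence(original[:, 0], original[:, 0])
js_disjoint = js_divergence(np.array([0.0, 0.1, 0.2]), np.array([0.8, 0.9, 1.0]))
```

Under both indicators, smaller values mean the imputed data are distributed more like the complete original data; identical samples give 0, and samples with disjoint histogram support give a Jensen-Shannon divergence of 1.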

Conclusions

Considering the characteristics and limitations of large-scale data with missing values, this paper incorporates RSP into the imputation algorithm and derives update rules for the imputation values to restore the data distribution. The experiments further confirm the effectiveness and practical performance of DC-MVI for large-scale data set imputation, in particular in preserving distribution consistency and improving imputation quality. The proposed method achieves the desired result and represents a viable solution to the problem of large-scale data imputation.

