Data-Driven Analysis for University Governance Modernization, Discipline Development, and Academic-Risk Management: A Spark-GPU Heterogeneous Acceleration Framework for Large-Scale Multi-Source Data Mining
Yong Huang
June 30, 2026

Abstract

Large-scale spatial data mining has two performance bottlenecks: iterative high-dimensional distance computation and irregular polygon verification. This study develops a Spark-compatible CPU-GPU acceleration framework for K-Means clustering and polygon spatial join. Spark is retained for data ingestion, partitioning, scheduling, and result reconstruction, whereas CUDA executes the dominant numerical and geometric kernels. SGK-Means combines flattened array communication, unified index mapping, K-Means++ initialization, Yinyang bound filtering, and a strategy that pairs a single Spark partition with a single GPU for the iterative clustering loop. SG-Join integrates CUDA refinement into Apache Sedona through Spark-side KD-Tree partitioning, GPU-side equal-grid indexing, MBR filtering, point-in-polygon (PIP) verification, edge-intersection (EI) verification, duplicate removal, and CUDA dynamic parallelism. Runtime is reported as end-to-end wall-clock time, including data conversion and host-device transfer. In the RTX6000 SGK-Means setting, the maximum speedup over the CPU baseline is 247.19× and the geometric-mean speedup is 6.21× across the sixteen dataset-k configurations; the Tesla A40 hardware comparison reports a maximum speedup of 270.15× on Higgs with k = 10,000. For SG-Join, the maximum speedup is 3.90× and the geometric-mean speedup is 1.99× across dataset and CPU-core settings. A configuration-level exact sign test is applied to the retained paired runtime records; the evaluation treats this as a directional check only, distinct from repeated-run uncertainty analysis. These results indicate that Spark-GPU cooperation reduces runtime relative to the CPU baseline when the data layout, communication path, and CUDA kernels are matched to the dependency structure of each algorithm. The conclusion is restricted to the evaluated datasets, hardware settings, and baseline protocols.
The revised evaluation treats output equivalence, run-to-run uncertainty, partial component sensitivity, and GPU-alternative comparison as explicit scope controls rather than as unsupported correctness or statistical-superiority claims.
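
The filter-and-refine pattern named in the abstract (MBR filtering followed by PIP verification) can be illustrated with a minimal CPU-only sketch. This is not the SG-Join implementation: it omits the KD-Tree partitioning, the equal-grid index, EI verification, and every CUDA component. The function names (`mbr`, `in_mbr`, `point_in_polygon`, `filter_and_refine`) are hypothetical and chosen only for this illustration, and the ray-casting rule used for PIP is an assumed choice rather than one confirmed by the abstract.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]  # vertices in order; the first vertex is not repeated at the end
Box = Tuple[float, float, float, float]


def mbr(polygon: Polygon) -> Box:
    """Return the minimum bounding rectangle as (min_x, min_y, max_x, max_y)."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def in_mbr(p: Point, box: Box) -> bool:
    """Cheap filter step: is the point inside the bounding rectangle?"""
    min_x, min_y, max_x, max_y = box
    return min_x <= p[0] <= max_x and min_y <= p[1] <= max_y


def point_in_polygon(p: Point, polygon: Polygon) -> bool:
    """Refine step: even-odd ray casting (assumed variant).

    Points lying exactly on the boundary get no special handling.
    """
    x, y = p
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # The edge straddles the horizontal line through p, so y1 != y2 here.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def filter_and_refine(points: List[Point], polygons: List[Polygon]) -> List[Tuple[int, int]]:
    """Return (point_index, polygon_index) pairs that pass both steps."""
    boxes = [mbr(poly) for poly in polygons]
    matches = []
    for pi, p in enumerate(points):
        for gi, (poly, box) in enumerate(zip(polygons, boxes)):
            # The exact test runs only for candidates that survive the MBR filter.
            if in_mbr(p, box) and point_in_polygon(p, poly):
                matches.append((pi, gi))
    return matches
```

For a triangle with vertices (0, 0), (4, 0), and (0, 4), the point (3, 3) passes the MBR filter but fails the exact test, which is the case the refinement step exists to reject.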

IPC Classification

G06; H04; B60

Keywords

data-driven, analysis, university, governance, modernization, discipline, development, academic-risk, management, spark-gpu, heterogeneous, acceleration, framework, large-scale, multi-source, data, mining, symmetry, spatial, contains, performance, bottlenecks, iterative, high-dimensional