
Scalable DBSCAN Implementation


Parallel DBSCAN on Spark is a distributed spatial data mining project that improves the scalability and efficiency of the DBSCAN clustering algorithm using Apache Spark. It combines distributed partitioning, KD-tree-based neighbor search, ghost point replication, and parallel cluster merging to process large-scale spatial datasets.
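
To make the partitioning and ghost point idea concrete, here is a minimal single-machine sketch in plain Python. It is not the project's code: the function name `partition_with_ghosts`, the uniform square grid, and the `cell_size` and `eps` parameters are assumptions chosen for illustration. Each 2D point is assigned to its home cell and copied as a ghost into every other cell that the eps-square around it overlaps, so each cell can see neighbors that sit just across its boundary.

```python
from collections import defaultdict
from math import floor


def partition_with_ghosts(points, cell_size, eps):
    """Map each grid cell to a list of (point, is_ghost) pairs.

    Hypothetical helper for illustration: a point is owned by the cell it
    falls in and replicated as a ghost into every other cell overlapped by
    the square of half-width eps centered on the point.
    """
    cells = defaultdict(list)
    for x, y in points:
        home = (floor(x / cell_size), floor(y / cell_size))
        cells[home].append(((x, y), False))

        # Range of cell indices touched by the eps-square around the point.
        x_lo, x_hi = floor((x - eps) / cell_size), floor((x + eps) / cell_size)
        y_lo, y_hi = floor((y - eps) / cell_size), floor((y + eps) / cell_size)

        for cx in range(x_lo, x_hi + 1):
            for cy in range(y_lo, y_hi + 1):
                if (cx, cy) != home:
                    cells[(cx, cy)].append(((x, y), True))
    return dict(cells)


if __name__ == "__main__":
    sample = [(0.5, 0.5), (0.95, 0.5), (1.5, 1.5)]
    for cell, members in sorted(partition_with_ghosts(sample, 1.0, 0.1).items()):
        print(cell, members)
```

In a Spark setting, the cell key would typically serve as the partitioning key, with local DBSCAN run independently on each cell's owned and ghost points. That mapping is a design assumption here, not something this page documents.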


Responsibilities

Designed and implemented a distributed version of the DBSCAN clustering algorithm on Apache Spark for scalable spatial data mining.
Developed grid-based partitioning strategies and ghost point replication mechanisms to preserve cluster continuity across partition boundaries.
Accelerated neighbor search with KD-tree indexing, replacing naive O(n²) all-pairs range queries.
Implemented distributed cluster merging using Union-Find techniques to maintain global cluster consistency across partitions.
Improved execution efficiency and scalability on large spatial datasets through parallel computation.
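
The Union-Find merging step listed above can be sketched as follows. This is an illustrative sketch, not the project's implementation: the `UnionFind` class, the `merge_local_clusters` function, and the input tuple layout `(point_id, partition_id, local_cluster_id, is_core)` are all hypothetical. It assumes noise points have already been filtered out, and it uses a common merge rule: two local clusters are joined when they share a point (for example, a ghost copy) that is a core point in at least one of them.

```python
from collections import defaultdict


class UnionFind:
    """Disjoint-set structure over arbitrary hashable keys."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root.
        while self.parent[x] != root:
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def merge_local_clusters(labels):
    """Map each (partition_id, local_cluster_id) to a global cluster id.

    labels: iterable of (point_id, partition_id, local_cluster_id, is_core),
    one entry per clustered copy of a point (noise excluded by the caller).
    """
    uf = UnionFind()
    by_point = defaultdict(list)
    for point_id, part, cluster, is_core in labels:
        uf.find((part, cluster))  # register the local cluster
        by_point[point_id].append(((part, cluster), is_core))

    for entries in by_point.values():
        # Merge only when a shared point is core in at least one partition.
        if len(entries) > 1 and any(core for _, core in entries):
            first = entries[0][0]
            for key, _ in entries[1:]:
                uf.union(first, key)

    global_ids, result = {}, {}
    for key in sorted(uf.parent):
        root = uf.find(key)
        result[key] = global_ids.setdefault(root, len(global_ids))
    return result
```

Requiring a core point for the merge keeps two clusters that merely share a border point from being fused, which matches DBSCAN's density-reachability definition. On Spark, the shared-point pairs would usually be collected to the driver or reduced in parallel before this step; that detail is left out of the sketch.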

Technologies & Domains

Apache Spark, Distributed Systems, DBSCAN, Spatial Data Mining, Parallel Processing, KD-Tree, Union-Find, Grid Partitioning, Scalable Computing, Big Data Processing, Cluster Analysis, Data Mining