As a typical big data application, geospatial analysis has been receiving extensive attention from both academia and industry. As they collect massive amounts of geospatial data, more and more manufacturers and research institutions find that analysis over geospatial data on existing legacy architectures does not scale. The reason is typically twofold. On one hand, extending traditional databases to support modern, complex geospatial analytics is rather challenging. On the other hand, integrating emerging techniques from other big data applications into traditional databases may suffer from compatibility issues, resulting in poor performance or even painful debugging tasks. Specifically, most of today's general-purpose relational databases (e.g., Oracle and Microsoft SQL Server, together with their geospatial components) are designed as OLTP systems. Their shared-disk or shared-everything architectures are optimized for high-throughput transaction execution at the expense of analytical query performance. In contrast to these existing relational database systems, Pivotal offers the Greenplum Database (GPDB), an extensible relational database platform that uses a shared-nothing, massively parallel processing (MPP) architecture to vastly accelerate online analytical processing (OLAP) over geospatial big data. Even better, GPDB can seamlessly integrate in-database analytical processing with our extended analytics stacks, such as heterogeneous Hadoop environments and in-memory data grids. Recent reports from Gartner rated Pivotal GPDB highly on data warehousing and analytics.
We design and develop geospatial analytics toolkits on GPDB along three lines. First, we migrate the latest PostGIS project into GPDB so that GPDB can run as a spatial database system for regular GIS users. Second, we extend the spatial component with various types of advanced geospatial functions, such as geospatial group-by, similarity search, and network-constrained scenarios. Third, we are working to support joint retrieval of data across geospatial and other data domains, i.e., queries that involve both geospatial and non-spatial information, such as RDF (GeoSPARQL queries), text (spatial keyword search), and time (trajectory search). Above all, we aim to bring the full breadth of big data developers together on geospatial analytics.
This talk will briefly introduce (1) the architecture of Pivotal GPDB that provides automatic high-performance parallelization of geospatial data loading and data processing, (2) GPDB’s extensive and growing library of in-database geospatial analytic functions, and (3) the capability to build up a comprehensive geospatial data analytics platform around Pivotal GPDB.
I will provide examples of how data science teams can transform billions of geo-tagged customer records in one minute to tackle the real-world problem of identity resolution. I will also discuss our plan to make Pivotal Greenplum Database open source in the coming quarters.