CRC Press, 2014. 306 p.
What is big data? Due to increased interest in this phenomenon, many recent papers and reports have focused on defining and discussing this subject. A review of these publications points to a consensus about how big data is perceived and explained. It is widely agreed that big data has three specific characteristics: volume, in terms of large-scale data storage and processing; variety, or the availability of data in different types and formats; and velocity, which refers to the fast rate of new data acquisition. These characteristics are widely referred to as the three Vs of big data, and while projects involving datasets that feature only one of these Vs are considered big, most datasets in fields such as science, engineering, and social media feature all three.
To better understand the recent spurt of interest in big data, I provide here a new and different perspective on it. I argue that the answer to the question "What is big data?" depends on when the question is asked, what application is involved, and what computing resources are available. In other words, understanding what big data is requires an analysis of time, applications, and resources. In light of this, I categorize the time element into three groups: past (since the introduction of computing several decades ago), near-past (within the last few years), and present (now). One way of looking at the time element is that, in general, big data in the past meant dealing with gigabyte-sized datasets; in the near-past, terabyte-sized datasets; and in the present, petabyte-sized datasets. I also categorize the application element into three groups: scientific (data used for complex modeling, analysis, and simulation), business (data used for business analysis and modeling), and general (data used for general-purpose processing). Finally, I classify the resource element into two groups: advanced computing (specialized computing platforms) and common computing (general-purpose workstations and desktops). It is my hope that analyzing these categories in combination will provide insight into what big data is, as summarized in the following:
Past-scientific, near-past-scientific, and present-scientific: Big data has routinely challenged scientists and researchers from various fields, as the problems are often data intensive in nature and require advanced computing methods, mainly high-performance computing resources (e.g., supercomputers, grids, and parallel computing).
Past-business and near-past-business: While business analysts occasionally had to deal with large datasets, they were faced with limited big data challenges (where the data volume was large and/or fast data processing was required) for which advanced computing, mainly distributed computing (e.g., clusters) and powerful common computing resources, was often utilized.
Present-business: Business analysts are now routinely challenged by big data problems as modern business applications typically call for analysis of massive amounts of data, which might be in various types and formats, and fast analysis of data to produce quick responses. Advanced computing, mainly cloud computing, which is becoming a common computing resource, is used to address these challenges.
Past-general, near-past-general, and present-general: Outside of science problems and business applications, users are occasionally faced with complex data that overwhelms available resources. In such cases, more powerful common computing resources are considered.
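The time/application/resource categorization above can be sketched as a small lookup table. This is purely an illustrative summary of the taxonomy described in the text; the dictionary names and the exact wording of the resource labels are my own shorthand, not part of the book:

```python
# Illustrative sketch of the era x application -> typical resource
# categorization described above (labels paraphrased from the text).
TYPICAL_RESOURCE = {
    ("past", "scientific"): "advanced computing (HPC: supercomputers, grids, parallel computing)",
    ("near-past", "scientific"): "advanced computing (HPC)",
    ("present", "scientific"): "advanced computing (HPC)",
    ("past", "business"): "distributed computing (clusters) / powerful common computing",
    ("near-past", "business"): "distributed computing (clusters) / powerful common computing",
    ("present", "business"): "advanced computing (cloud computing)",
    ("past", "general"): "more powerful common computing",
    ("near-past", "general"): "more powerful common computing",
    ("present", "general"): "more powerful common computing",
}

# Typical dataset scale by era, as characterized in the text.
TYPICAL_SCALE = {"past": "gigabytes", "near-past": "terabytes", "present": "petabytes"}

def describe(era: str, application: str) -> str:
    """Return a one-line summary for an (era, application) pair."""
    return (f"{era}/{application}: ~{TYPICAL_SCALE[era]} of data, "
            f"handled with {TYPICAL_RESOURCE[(era, application)]}")

print(describe("present", "business"))
```

Reading the combinations this way makes the author's point concrete: the same "general" user sees the same answer in every era, while the "business" column changes qualitatively in the present.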
These challenges are severely exacerbated when big data problems concern geospatial data. This is because geospatial applications are intrinsically complex and involve large datasets; data are collected frequently and rapidly through advanced geospatial data collection technologies that are widely available in mobile devices (e.g., smartphones); and geospatial data are inherently multidimensional.
In light of these challenging aspects of geospatial data, this book is focused on big data techniques and technologies in geoinformatics. The chapters of the book, contributed by experts in the field as well as in other domains such as computing and engineering, address technologies (e.g., distributed computing such as clusters, grids, supercomputers, and clouds), techniques (e.g., data mining and machine learning), and applications (in science, in business, and in social media).
Chapter 1 provides an overview of distributed computing, high-performance computing, cluster computing, grid computing, supercomputing, and cloud computing. Chapter 2 describes the Global Earth Observation System of Systems Clearinghouse, an infrastructure that facilitates integration and access to Earth observation data for global communities. Chapter 3 discusses a cloud computing environment (CCE) for processing large 3D spatial datasets and illustrates the application of the CCE using a case study. Chapter 4 describes building open environments as a means of overcoming the challenges of big data in Earth sciences. Chapter 5 discusses the development of visualization and analysis services for NASA's global precipitation products. Chapter 6 addresses the design of algorithms suitable for geospatial and temporal big data. In Chapter 7, various machine learning techniques for geospatial big data analytics are discussed. Chapter 8 describes the three Vs of geospatial big data and presents a case study for each of them. Big data opportunities in volunteered geographic information to improve routing and navigation services are explored in Chapter 9. Chapter 10 presents a discussion of data mining of taxi trips using road network shortcuts. Big data challenges in social media are outlined in Chapter 11. Chapter 12 presents a pattern detection technique called TCM-Pattern to provide insights into big data. Chapter 13 discusses a geospatial cyberinfrastructure for addressing big data challenges on the World Wide Web. Chapter 14 provides a review of Open Geospatial Consortium (OGC) standards that address geospatial big data.