Big Data offers enterprises the potential for predictive metrics and insightful statistics, but these data sets are often so large that they defy traditional data warehousing and analysis methods. When the data is properly stored and analyzed, however, businesses can track customer habits, fraud, advertising effectiveness, and other statistics on a scale previously unattainable. The challenge for enterprises is not so much how or where to store the data, but how to analyze it meaningfully for competitive advantage.
Big Data storage and Big Data analytics, while naturally related, are not identical. Technologies associated with Big Data analytics share three key characteristics in how they approach the problem of extracting meaningful information. First, they concede that traditional data warehouses are too slow and too small in scale. Second, they seek to combine and leverage data from widely divergent sources, in both structured and unstructured forms. Third, they acknowledge that the analysis must be both time- and cost-effective, even when it draws on a multitude of diverse data sources including mobile devices, the Internet, social networking, and radio-frequency identification (RFID).
The relative newness and desirability of Big Data analytics combine to make it a diverse and emergent field. Within it, one can identify four significant developmental segments: MapReduce, scalable databases, real-time stream processing, and Big Data appliances.
The open-source Hadoop framework pairs the Hadoop Distributed File System (HDFS), which stores data across computer nodes, with MapReduce, which distributes the processing of that data over the same nodes. Spreading the work this way reduces each computer’s workload and enables computations and analysis beyond the capacity of a single machine. Hadoop users usually assemble parallel computing clusters from commodity servers and keep the data on storage attached directly to each node, either a small disk array or solid-state drives. These are typically called “shared-nothing” architectures. They are often considered more desirable than storage-area networks (SAN) and network-attached storage (NAS) because they offer greater input/output (I/O) performance. Hadoop itself is available for free from Apache, and numerous commercial offerings build on or integrate with it, such as Cloudera’s distribution and Microsoft SQL Server 2012.
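To make the MapReduce idea concrete, here is a minimal single-process sketch in Python of the classic word-count job. It does not use Hadoop's API; the function names (map_phase, shuffle, reduce_phase, word_count) are illustrative assumptions, and on a real cluster each chunk would be mapped on a different node before the framework groups the results.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk stands in for a block of data held on a separate node.
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


chunks = ["big data big insight", "data beats opinion"]
counts = word_count(chunks)
print(counts)
```

Because each call to map_phase depends only on its own chunk, the map step can run on many machines at once; only the grouping step needs to move data between them.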
Not all Big Data is unstructured. NoSQL databases, many of them open source, use distributed and horizontally scalable designs that specifically target workloads such as streaming media and high-traffic websites. Again, many alternatives exist, with MongoDB and Terrastore among the favorites. Some enterprises also choose to use Hadoop and NoSQL in combination.
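Horizontal scaling generally means spreading records across many servers rather than buying one larger server. The sketch below shows one common way to do that, hash-based sharding, using in-memory dictionaries in place of real database nodes. The ShardedStore class and shard_for function are hypothetical names for illustration and do not reflect how MongoDB or Terrastore partition data internally.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ShardedStore:
    """Toy key-document store; each dict stands in for one server."""

    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key, document):
        index = shard_for(key, len(self.shards))
        self.shards[index][key] = document

    def get(self, key):
        index = shard_for(key, len(self.shards))
        return self.shards[index].get(key)


store = ShardedStore(num_shards=4)
for user_id in ["alice", "bob", "carol", "dave"]:
    store.put(user_id, {"name": user_id, "visits": 1})

print(store.get("carol"))
print([len(shard) for shard in store.shards])
```

A simple modulo scheme like this remaps most keys whenever the number of shards changes, which is one reason production systems tend to use more elaborate partitioning strategies.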
As the name suggests, real-time stream processing analyzes data as it arrives in order to provide up-to-the-minute information about an enterprise’s customers. StreamSQL, a SQL-based language for querying data streams, is available through numerous commercial avenues and has been used in this role for financial, surveillance, and telecommunications services since 2003.
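A core building block of stream processing is the sliding window: rather than querying stored data, the system keeps a running answer over only the most recent events. The following Python sketch counts events seen in the last N seconds. It is not StreamSQL; the SlidingWindowCounter class is a hypothetical illustration, and it assumes timestamps arrive in non-decreasing order.

```python
from collections import deque


class SlidingWindowCounter:
    """Count events that fall within the most recent window of time."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # timestamps, oldest on the left

    def _evict(self, now):
        # Drop timestamps that have aged out of the window.
        cutoff = now - self.window_seconds
        while self.events and self.events[0] <= cutoff:
            self.events.popleft()

    def add(self, timestamp):
        self.events.append(timestamp)
        self._evict(timestamp)

    def count(self, now):
        self._evict(now)
        return len(self.events)


counter = SlidingWindowCounter(window_seconds=60)
for timestamp in [0, 30, 59]:
    counter.add(timestamp)

print(counter.count(now=59))
print(counter.count(now=61))
print(counter.count(now=120))
```

Each event is appended once and removed once, so the cost per event stays small no matter how long the stream runs.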
Finally, Big Data “appliances” combine networking, server, and storage hardware with analytics software in order to accelerate user data queries. Vendors abound and include IBM/Netezza, Oracle, Teradata, and many others.
Enterprises seeking to edge out their rivals are looking to Big Data. Storage is only the first part of the battle, and those that can analyze the new wealth of information more efficiently than their competitors will almost certainly profit from it. These ambitious enterprises would do well to reassess their Big Data analytics methods regularly, as the technological landscape will change often and dramatically in the coming months and years.