New York, US (PRWEB) October 22, 2012
Read the full article here: http://bit.ly/RcjuMK
The excitement around Hadoop has reached frothy proportions in recent months. Everyone I speak to is asking about use cases for Hadoop. I came across an interesting one this week, so I decided to share it here.
Many companies need to retain certain types of data for long periods of time: seven years, 15 years, and sometimes even longer. The typical approach is to partition aging data into month-long segments, and then store those partitions in archives. One example of this is call detail records that carriers are required to store – they’re voluminous but must often be retained for extended periods for compliance and other purposes.
Most of the time, these records aren’t needed and they remain happily untouched in the archive. From time to time, however, it becomes necessary to serve up anywhere from six to 18 months or more of these records for a particular customer from a period several years in the past. This type of request is most often driven by an investigation, subpoena or other legal inquiry.
Complying with such a request can be a messy job for carriers. First, they must locate an entire month’s partition and load it from the archive. Typically, more than 99% of the data in each of these huge partitions is irrelevant to the records in question. They must then extract the relevant records into a staging area, close the partition, and move on to the next one. If 18 months of records are needed, this process repeats 18 times. Worse still, if the records are regionalized and the customer is present in, say, three regions, the process could repeat as many as 54 times. If the process is manual – and it often is at least partially manual – fulfilling the request could take days or even weeks.
Enter Hadoop. Instead of partitioning aging data off into a traditional archive solution, it can instead (or in addition) be stored as flat files in an online Hadoop Distributed File System (HDFS). When it comes time to extract customer-specific data spread across a great number of large files, Hadoop doesn’t bat an eye. Its core MapReduce functionality automatically divides the task across multiple nodes and returns consolidated results in seconds or minutes.
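To make that concrete, here is a minimal sketch of the kind of mapper a carrier might run with Hadoop Streaming to pull one customer’s records out of years of flat files. The CDR layout (comma-separated, customer ID in the first field) and the customer ID are hypothetical, chosen only for illustration; a real job would parameterize the target and likely add a pass-through reducer.

```python
import sys

# Hypothetical CDR layout: customer_id,timestamp,caller,callee,duration_secs
TARGET_CUSTOMER = "CUST-42"  # illustrative; a real job would pass this in

def map_line(line):
    """Return the record if its customer field matches the target, else None."""
    fields = line.strip().split(",")
    if fields and fields[0] == TARGET_CUSTOMER:
        return line.strip()
    return None

def main():
    # Hadoop Streaming feeds each node its share of the HDFS files on stdin;
    # everything printed to stdout becomes that node's map output.
    for line in sys.stdin:
        record = map_line(line)
        if record is not None:
            print(record)

if __name__ == "__main__":
    main()
```

Because each node filters only the file blocks stored locally, the scan over many months of data happens in parallel rather than one archived partition at a time.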
And because Hadoop is designed to provide fault tolerance using COTS (commercial off-the-shelf) servers, the cost of a Hadoop solution can be trivial compared with that of a traditional archival approach.
Have you had experience with this use case or a similar one? If so, please tell us about it in the comments section.