Big Data in simple words

Data generation is happening at a record rate. Between 2009 and 2010, the world generated over 1 zettabyte of data, and by 2014 that figure may rise to 8 zettabytes a year. This steady growth is the result of a drastic increase in devices located at the edge of the network, including satellites, 4G phones, and supercomputers. All of this data creates great opportunities to extract more value for human lifestyles and for any industry or sector.

What is Big Data?

Having moved from data warehouses to Business Intelligence (BI), we now have to think one level higher, because we are experiencing unexpected growth in both structured and unstructured data: Word, Excel and PowerPoint documents, images, videos, PDF and HTML documents, various database schemas, telecom data, satellite data and so on. Seeing all this, one naturally wonders how Amazon, Walmart, Google, Facebook, Yahoo, YouTube and other big players manage such massive volumes of information and day-to-day transactions while still delivering information quickly. This is where big data comes in. Although the term is relatively new, in principle big data is data that exceeds the processing capacity of conventional database systems: the data is too big, moves too fast, or doesn’t fit the structures of your existing database architectures. The most popular choice for a big data software stack is Hadoop.

Big Data has three main characteristics: Volume (amount of data), Velocity (speed of data in and out), Variety (range of data types and sources).

  • Volume - Volume describes the amount of data generated by organizations or individuals. Big Data is usually associated with this characteristic. Enterprises of all industries will need to find ways to handle the ever-increasing data volume that’s being created every day.
  • Velocity - Velocity describes the frequency at which data is generated, captured and shared. Recent developments mean that not only consumers but also businesses generate more data in much shorter cycles. Because of this speed, enterprises can only capitalize on the data if it is captured and shared in real time.
  • Variety - Big data means much more than rows and columns. It means unstructured text, video and audio that can have an important impact on company decisions, provided it is analyzed properly and in time.

Here are a few examples of Big Data to give an idea of the scale:

  • Twitter produces over 90 million tweets per day
  • Wal-Mart is logging one million transactions per hour
  • Facebook creates over 30 billion pieces of content, including web links, news, blogs and photos
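
To get a feel for these numbers, it can help to convert them to per-second rates. The short Python sketch below does only that arithmetic on the Twitter and Wal-Mart figures quoted above; the variable names are illustrative and nothing here comes from the companies themselves.

```python
# Convert the figures quoted in the list above into approximate per-second rates.
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR

tweets_per_day = 90_000_000          # Twitter figure from the list above
transactions_per_hour = 1_000_000    # Wal-Mart figure from the list above

tweets_per_second = tweets_per_day / SECONDS_PER_DAY
transactions_per_second = transactions_per_hour / SECONDS_PER_HOUR

print(f"Tweets per second: {tweets_per_second:.0f}")
print(f"Transactions per second: {transactions_per_second:.0f}")
```

In other words, the quoted figures work out to roughly a thousand tweets and a little under three hundred transactions every second, around the clock.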

Why Big Data?

Big Data allows corporate and research organizations to do things not previously possible economically.

  • Analysis
  • Business Trends
  • Prevent Diseases
  • Combat Crime
  • Centralization of the data, etc.

Potential of Big Data:

The use of Big Data offers tremendous untapped potential for creating value. Organizations in many industry sectors and business functions can leverage big data to improve their allocation and coordination of human and physical resources, cut waste, increase transparency and accountability, and facilitate the discovery of new ideas and insights.

Sectors with greatest potential for big data:

  • Healthcare
  • Public Sector
  • Retail
  • Manufacturing
  • Telecommunications

What is Hadoop?

Hadoop is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of computationally independent computers and petabytes of data. Hadoop was derived from Google’s MapReduce and Google File System (GFS) papers.

Hadoop is a top-level Apache project, written in the Java programming language, that is built and used by a global community of contributors. Yahoo! has been the largest contributor to the project and uses Hadoop extensively across its businesses.

Apache Hadoop has two main subprojects:

1. MapReduce - Map/Reduce is a term commonly thrown about these days. In essence, it is just a way to take a big task and divide it into discrete tasks that can be done in parallel.

2. HDFS - A file system that spans all the nodes in a Hadoop cluster for data storage. It links together the file systems on many local nodes to make them into one big file system. HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes.
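
The MapReduce idea described above can be sketched in a few lines of plain Python. This is only a toy illustration that runs on one machine and does not use Hadoop's actual API: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample text are made up for this example, and a thread pool stands in for the many nodes of a real cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key (here, the word)."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            groups[word].append(count)
    return groups


def reduce_phase(word, counts):
    """Reduce: combine all values for one key into a single result."""
    return word, sum(counts)


def word_count(chunks):
    # Each chunk is mapped independently of the others, which is what
    # lets the map tasks run in parallel (threads here, nodes in a cluster).
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_phase, chunks))
    groups = shuffle(mapped)
    return dict(reduce_phase(word, counts) for word, counts in groups.items())


# Hypothetical input: two chunks of one larger text.
chunks = ["big data is big", "hadoop stores big data"]
result = word_count(chunks)
print(result)
```

The big task (counting words in all the text) is split into small map tasks, one per chunk, and the reduce step then merges the partial results per word. In a real Hadoop job the chunks would typically be blocks of files stored in HDFS.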

Who Uses Hadoop?

Besides Facebook and Yahoo!, many other organizations are using Hadoop to run large distributed computations. Some of the notable users include: Amazon, American Airlines, AOL, Apple, eBay, Federal Reserve Board of Governors, foursquare, Fox Interactive Media, Hewlett-Packard, IBM, Intuit, Joost, Last.fm, LinkedIn, Microsoft, NetApp, Netflix, The New York Times, SAP AG, SAS Institute, StumbleUpon and Twitter.

Hadoop Vendors?

In line with this demand, many vendors such as AWS, IBM, EMC Greenplum, MapR, Cloudera, DataStax, Pentaho, Outerthought, HStreaming, Datameer, Zettaset, Microsoft and Oracle are evolving their offerings to support Hadoop.
