If you guessed that the term 'big data' simply means 'data that is big in size', congrats! You will easily follow the rest of this article. In general terms, big data is a collection of large datasets that are too complex to handle with traditional tools.
In other terms, big data can be defined as data arriving in greater Variety, in increasing Volumes and with ever higher Velocity. Big data is therefore often characterized by "the three Vs".
With big data, you may need to process large volumes of low-density, unstructured data. This could be data of unknown value, such as web pages or clickstreams from a mobile app, and it can run to terabytes or even petabytes in size. A terabyte (TB) consists of 1,024 gigabytes (GB), and a petabyte (PB) consists of 1,024 TB.
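The units above compound quickly, which a quick bit of arithmetic makes concrete (using the binary convention of 1,024, as stated above):

```python
# Storage units from the text, binary convention: 1 TB = 1,024 GB, 1 PB = 1,024 TB.
GB = 1
TB = 1_024 * GB
PB = 1_024 * TB

print(PB)  # gigabytes in one petabyte: 1,048,576
```

So a single petabyte already holds over a million gigabytes, which is why traditional single-machine tools struggle at this scale.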
Velocity is the speed at which data is generated, processed and analyzed. The more data that flows into your organization per second, the bigger your velocity challenge.
Variety in big data refers to both structured and unstructured data, which may be generated by humans or by machines. Structured data, such as database tables and spreadsheets, is the easiest to work with. Unstructured data such as texts, tweets, images, videos, emails, voicemails, handwritten notes, ECG readings and audio recordings is just as important under variety.
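The structured/unstructured split can be sketched with a toy example (the sample records below are made up for illustration): a structured record has a fixed schema whose fields you can address by name, while unstructured text gives you only raw tokens to start from.

```python
import csv
import io
import re

# Structured: schema is known up front, fields are addressable by name.
structured = "user_id,age,city\n42,31,Chennai\n"
row = next(csv.DictReader(io.StringIO(structured)))
print(row["city"])  # Chennai

# Unstructured: no schema; meaning must be mined out of free text.
unstructured = "Pls call back re: the ECG results!!"
words = re.findall(r"[A-Za-z]+", unstructured)
print(len(words))  # 7 raw tokens, nothing more
```

This is why unstructured data needs heavier processing before it yields value, and why tools like Hadoop that accept data in any form matter.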
Companies are now realizing that analyzing big data can drive major organizational predictions. Hadoop lets you store large data in whatever form it arrives, simply by adding servers to a Hadoop cluster.
Hadoop is an open-source distributed processing framework. It manages data processing and storage for big data applications across scalable clusters of commodity servers, and it can handle both structured and unstructured data.
Modules of Hadoop:
Hadoop Common:
A collection of common utilities that support the other Hadoop modules.
Hadoop Distributed File System (HDFS):
A distributed file system that provides high-throughput access to application data.
YARN (Yet Another Resource Negotiator):
A framework for cluster resource management and job scheduling.
MapReduce:
A MapReduce program works in two phases, Map and Reduce. Map tasks deal with splitting and mapping the input data, while Reduce tasks shuffle and reduce the data.
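The two phases can be sketched in plain Python with the classic word-count example. This is an in-memory simulation of what Hadoop distributes across a cluster, not real Hadoop code; the function names and sample documents are illustrative.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: split each document and emit (word, 1) pairs."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as Hadoop does between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data is big", "hadoop processes big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"])  # 3
```

In a real cluster, the map tasks run on the nodes holding the input splits, the shuffle moves intermediate pairs over the network, and the reduce tasks run in parallel per key group, but the logical flow is exactly this.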
The advantages of Hadoop include scalability, cost efficiency, high throughput, an open-source license and support for multiple programming languages.
Its disadvantages include an inherently vulnerable design, security issues, support for batch processing only, and poor handling of many small files.
For more on big data and Hadoop, see the official site: http://hadoop.apache.org/