Data storage
Computer technology is developing very rapidly. In the 1980s, the average capacity of a hard drive for a personal computer was 5MB (if the computer even had a hard drive at all). Today, it is possible to purchase drives with a capacity of up to 8TB. Although this may seem like a very large amount, it is actually quite small when compared to the estimated volume of new data produced daily, which is about 2.5 EB (roughly 2.5 million TB).
If data is structured, such as bank statements or electronic address books, it is stored in relational databases. A relational database management system (RDBMS) is used to create, store, share, and operate on such data. First, the database schema is designed, then relationships between tables are defined, and finally the database is populated with data. For many applications working with structured data, this process is fast and reliable. However, one problem with this solution is scalability. Relational databases are designed to run on a single server, and as the amount of data grows they become slow and inefficient. The only way to increase capacity in such a case is to scale vertically, that is, to add more computing power to that one server.
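The design-then-relate-then-populate workflow can be sketched with Python's built-in sqlite3 module. The table and column names here are illustrative, not taken from any real banking system:

```python
import sqlite3

# In-memory database; a minimal sketch of the design -> relate -> populate flow.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 1. Design the schema: each table has a fixed set of typed columns.
cur.execute("""
    CREATE TABLE accounts (
        id    INTEGER PRIMARY KEY,
        owner TEXT NOT NULL
    )
""")

# 2. Define relationships between tables via a foreign key.
cur.execute("""
    CREATE TABLE statements (
        id         INTEGER PRIMARY KEY,
        account_id INTEGER NOT NULL REFERENCES accounts(id),
        amount     REAL NOT NULL
    )
""")

# 3. Populate the tables with data.
cur.execute("INSERT INTO accounts (id, owner) VALUES (1, 'Alice')")
cur.executemany(
    "INSERT INTO statements (account_id, amount) VALUES (?, ?)",
    [(1, 120.50), (1, -40.00)],
)

# Query across the relationship.
cur.execute("""
    SELECT a.owner, SUM(s.amount)
    FROM accounts a JOIN statements s ON s.account_id = a.id
    GROUP BY a.id
""")
result = cur.fetchone()
print(result)  # ('Alice', 80.5)
conn.close()
```

Note that the schema is fixed up front: adding a new kind of attribute later means altering the tables, which is exactly the rigidity discussed next.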
For unstructured data, a relational database system is unsuitable. One of the main reasons is that once the database schema is established, it is very difficult to change. Moreover, such data cannot be easily stored in predefined rows and columns. Additionally, data generated in real time requires extensive processing, preferably also in real time. A relational database system is therefore ineffective in this scenario.
The solution to the above problem is the Distributed File System (DFS). Such a system provides an effective and reliable way of storing large datasets across multiple computers. One of the most popular implementations is Hadoop. Written in the widely used Java language, it is employed by Facebook, Twitter, and even eBay, and it enables the analysis of both semi-structured and unstructured data. The data is distributed across many nodes, often tens of thousands, located in data centers around the world.

A single Hadoop cluster consists of one main node, called the NameNode, and many subordinate nodes, called DataNodes. The NameNode receives client requests, manages disk space, and stores file metadata such as block locations and access paths. In addition, it manages operations such as opening and closing files and controls data access from client machines. The DataNodes, on the other hand, store the actual data: they handle the creation, deletion, and replication of data blocks.

Replication is a crucial feature of Hadoop. Each block is kept in several copies in case a DataNode fails, so if one node is lost, another can take over and the task continues without data loss. For monitoring, each DataNode sends the NameNode a Heartbeat message every three seconds; if heartbeats stop arriving, the NameNode assumes that the DataNode is no longer functioning. The blocks themselves are small, 64MB by default (128MB in newer Hadoop versions), so a large file is split into many of them. Moreover, adding DataNodes is inexpensive and requires no changes to the existing nodes; conversely, nodes that are no longer needed can easily be shut down.
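The NameNode/DataNode division of labor can be illustrated with a toy simulation. This is not Hadoop's actual API, just a sketch of the bookkeeping described above: the NameNode tracks which node holds which block replica, and when a heartbeat is missed it re-replicates the dead node's blocks elsewhere. The block size and replication factor are the HDFS defaults mentioned in the text:

```python
import random

BLOCK_SIZE_MB = 64  # default HDFS block size described above
REPLICATION = 3     # default HDFS replication factor

class DataNode:
    """Toy subordinate node: physically 'stores' block ids."""
    def __init__(self, name):
        self.name = name
        self.blocks = set()
        self.alive = True

class NameNode:
    """Toy main node: tracks which DataNode holds which block replica."""
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.block_map = {}  # block id -> set of DataNode names

    def store_file(self, filename, size_mb):
        n_blocks = -(-size_mb // BLOCK_SIZE_MB)  # ceiling division
        for i in range(n_blocks):
            block_id = f"{filename}#blk{i}"
            live = [d for d in self.datanodes if d.alive]
            targets = random.sample(live, REPLICATION)
            for d in targets:
                d.blocks.add(block_id)
            self.block_map[block_id] = {d.name for d in targets}

    def handle_missed_heartbeat(self, dead):
        """No heartbeat arrived: mark the node dead, re-replicate its blocks."""
        dead.alive = False
        live = [d for d in self.datanodes if d.alive]
        for block_id, holders in self.block_map.items():
            if dead.name in holders:
                holders.discard(dead.name)
                # Copy the block to a live node that lacks a replica.
                candidates = [d for d in live if d.name not in holders]
                target = random.choice(candidates)
                target.blocks.add(block_id)
                holders.add(target.name)

nodes = [DataNode(f"dn{i}") for i in range(6)]
nn = NameNode(nodes)
nn.store_file("log.txt", size_mb=200)   # 200MB -> 4 blocks of up to 64MB
nn.handle_missed_heartbeat(nodes[0])    # simulate a failed DataNode
print(sorted(len(h) for h in nn.block_map.values()))  # -> [3, 3, 3, 3]
```

Even after the simulated failure of dn0, every block still has its full replica count, which is the property that lets a real cluster continue without data loss.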
NoSQL databases are non-relational databases. The non-relational model allows new data to be added continuously, without a fixed schema, and provides the properties essential for managing large datasets: scalability, availability, and performance.
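The schemaless idea behind many NoSQL (document-oriented) databases can be sketched in a few lines of Python. This toy store is not any specific NoSQL product; it only shows that records with different shapes can coexist and that new fields need no schema change:

```python
# A toy document store: records are free-form dictionaries, so new
# fields can be added at any time without altering a schema.
store = {}

def put(collection, doc):
    """Append a document to a named collection, creating it on first use."""
    store.setdefault(collection, []).append(doc)

# Two records in the same collection with different shapes:
put("users", {"name": "Alice", "email": "alice@example.com"})
put("users", {"name": "Bob", "followers": 1200, "tags": ["ml", "iot"]})

# Queries simply filter on whatever fields happen to exist.
with_followers = [d for d in store["users"] if "followers" in d]
print(with_followers[0]["name"])  # -> Bob
```

Contrast this with the relational example earlier: there, adding a `followers` attribute would have required altering a table definition.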