As data volumes grow, so does your need to understand how to partition your data. Until you understand this distributed storage concept, you will be unable to choose the best approach for the job. This post gives an introductory explanation of partitioning and you will see why it is integral to the Hadoop Distributed File System (HDFS) increasingly used in modern big data architectures.
Balancing Lines of People
One of my earliest memories of (human) partitioning comes from registration day at university. New students lined up in the hall to sign up for classes. A registrar helper would punch our information into a computer and make sure everything was in order.
I remember looking for the shortest line only to find that it wasn’t a first come, first served scenario. Instead, at the start of each line there was a sign with a range of letters indicating which surnames were to use that line.
Naturally, not all lines were equal length, but they were close enough when the lines were long that no helper was ever allowed a break. When lines became short, they became unbalanced but at that point the queues were not used anymore.
This approach to distributing a large number of students into reasonably sized groupings is a type of partitioning. The result was that each helper was continually fed a group of students to work on.
It’s worth taking a moment to consider the alternative: one huge line of people being processed by one helper. Or similarly, multiple helpers all access students from a huge line. So much time would be wasted waiting for students to move around – not to mention the line going out the door.
(Ironically, with poorly distributed loads, there is often less incentive for workers to perform well, but that’s a psychological discussion for another day.)
This is the crux of distributed computing – providing bite-sized pieces of work to various “workers” to digest. Similar methods are used on our data all the time, often behind the scenes.
What Is Data Partitioning?
verb [ with obj. ]
divide into parts: an agreement was reached to partition the country.
• divide (a room) into smaller rooms or areas by erecting partitions: the hall was partitioned to contain the noise of the computers.
Partitioning data is the act of breaking a single dataset into multiple pieces. For example, you may break a file with one million lines into a 10 files with one hundred thousand lines. Each of those files would be stored in a different location or on a different computer.
As distributed data storage is usually done across a cluster of networked computers, we often call each computer a node in the cluster. A node that holds part of the distributed data is usually called a data node.
The purpose of partitioning is to distribute data across multiple machines or storage areas. These different locations are networked together so data can be reassembled (or redistributed) as needed.
There are several reasons to use partitioning and several methods for performing it. In this post we’ll mainly cover two of those reasons.
While it may seem counterintuitive to create added complexity by breaking data into smaller pieces, there are some good reasons for doing so. In fact, much of the recent innovation in scalable computing wouldn’t be possible without partitioning. This is why HDFS is such a big deal, as it provides a backbone for higher level Hadoop-based applications.
When Your Eyes Are Bigger Than Your Stomach
Big Data often refers to data that is so large that it cannot efficiently fit on a single disk or in memory of a single computer. Even if the data could fit on a single disk it does not mean that a single computer could process it for any meaningful purpose – it may just be too big to chew on when in one piece. Instead, large datasets are broken into smaller pieces and then pushed out to live on other parts of the infrastructure in a more digestible form.
Even if you are not dealing with colossal datasets now, partitioning becomes increasingly important when data volumes are expected to grow rapidly. Rather than having to continually increase the size of a storage unit and move your data, instead, growing data is spread out across the infrastructure. Often it’s as simple as adding a new node and instantly have more data storage available.
In essence, this approach allows the balancing of data across a network to prevent overloading a single storage unit.
Keep Your Friends Close
Perhaps more importantly, this type of distributed storage also allows us to store data in a way that keeps it “close to” the computing resources that are going to use it.
If you were sending and receiving lots of money through a bank account, you wouldn’t keep all your cash at home. Bringing it into the bank each time you wanted to send some would be horribly inefficient. Instead, keeping the cash on deposit with the bank means quick and efficient access to funds through that bank because the cash is easily at hand.
Likewise, modern distributed computing puts the data nearest to the machine that will be using the data for analysis or processing. The goal here is to keep all processors as busy as possible, while trying to reduce network activity. In most environments this means each CPU accesses locally stored data on a disk attached to the same machine.
Types of Partitioning
There will always be a limit to the amount of data that can be stored on a single device, but partitioning allows us to make optimal use of that storage. Also, as a rule, there will always be a limit to how much data a single resource can process over a certain period of time. But by leveraging additional processors you can process more data, just as fast or faster than with previous environments.
Knowing that partitioning is important is only half the battle – you have to know how to apply it. In a future post, I’ll describe the various types of partitioning and help you decide why/when to use them.
- VID: Solving Performance Problems on Hadoop - July 5, 2016
- Storing Zeppelin Notebooks in AWS S3 Buckets - June 7, 2016
- VirtualBox extension pack update on OS X - April 11, 2016
- Zeppelin Notebook Quick Start on OSX v0.5.6 - April 4, 2016
- Spark Analysis of Global Place Names (GeoNames) - January 20, 2016
- Serverspec checks settings on a Hadoop cluster - December 8, 2015
- Hadoop Options for SQL Databases - October 15, 2015
- “Big Data” off 2015 Hype Cycle? - August 18, 2015
- Spatial Data Made Useful - May 29, 2015
- Common Zeppelin Errors - May 17, 2015