Partitioned Data & Why It Matters

[Image: partitioned bread]

As data volumes grow, so does the need to understand how to partition your data.  Until you understand this distributed storage concept, you can't choose the best approach for the job.  This post gives an introductory explanation of partitioning and shows why it is integral to the Hadoop Distributed File System (HDFS), which is increasingly used in modern big data architectures.

Balancing Lines of People

One of my earliest memories of (human) partitioning comes from registration day at university.  New students lined up in the hall to sign up for classes.  A registrar helper would punch our information into a computer and make sure everything was in order.

I remember looking for the shortest line only to find that it wasn’t a first come, first served scenario.  Instead, at the start of each line there was a sign with a range of letters indicating which surnames were to use that line.

Naturally, not all lines were of equal length, but while the lines were long they were close enough that no helper ever got a break.  Once the lines grew short they became unbalanced, but by that point the queues hardly mattered anymore.

This approach to distributing a large number of students into reasonably sized groupings is a type of partitioning.  The result was that each helper was continually fed a group of students to work on.
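The registration lines are what databases call range partitioning: each worker owns a contiguous range of keys (here, first letters of surnames).  A minimal sketch, with made-up ranges and names, might look like this:

```python
# Hypothetical sketch of range partitioning, like the registration lines:
# each "line" handles a contiguous range of starting letters.
RANGES = {
    "A-F": ("A", "F"),
    "G-M": ("G", "M"),
    "N-S": ("N", "S"),
    "T-Z": ("T", "Z"),
}

def assign_line(surname):
    """Return the line whose letter range covers the surname's first letter."""
    first = surname[0].upper()
    for line, (lo, hi) in RANGES.items():
        if lo <= first <= hi:
            return line
    raise ValueError(f"no line covers surname {surname!r}")

for student in ["Anderson", "Mitchell", "Nguyen", "Zhang"]:
    print(student, "->", assign_line(student))
```

Just as at registration, whether the lines stay balanced depends entirely on how well the ranges match the actual distribution of surnames.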

It’s worth taking a moment to consider the alternative: one huge line of people being processed by one helper.  Or, similarly, multiple helpers all pulling students from one huge line.  So much time would be wasted waiting for students to move around – not to mention the line going out the door.

(Ironically, with poorly distributed loads, there is often less incentive for workers to perform well, but that’s a psychological discussion for another day.)

This is the crux of distributed computing – providing bite-sized pieces of work to various “workers” to digest.  Similar methods are used on our data all the time, often behind the scenes.

What Is Data Partitioning?

verb [ with obj. ]
    divide into parts: an agreement was reached to partition the country.
     • divide (a room) into smaller rooms or areas by erecting partitions: the hall was partitioned to contain the noise of the computers.

Partitioning data is the act of breaking a single dataset into multiple pieces.  For example, you may break a file with one million lines into 10 files with one hundred thousand lines each.  Each of those files would be stored in a different location or on a different computer.
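That million-line example can be sketched in a few lines of code.  This is purely illustrative – the file names and chunk sizes are assumptions, not any HDFS API:

```python
# Hypothetical sketch: split a large text file into fixed-size partition files.
import itertools
import os

def partition_file(path, lines_per_part, out_dir="parts"):
    """Write successive chunks of `path` to part-00000, part-00001, ... files."""
    os.makedirs(out_dir, exist_ok=True)
    part_paths = []
    with open(path) as src:
        for i in itertools.count():
            chunk = list(itertools.islice(src, lines_per_part))
            if not chunk:
                break
            part = os.path.join(out_dir, f"part-{i:05d}")
            with open(part, "w") as dst:
                dst.writelines(chunk)
            part_paths.append(part)
    return part_paths
```

Calling this on a one-million-line file with `lines_per_part=100_000` would yield ten part files, which could then be placed on different machines.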

As distributed data storage is usually done across a cluster of networked computers, we often call each computer a node in the cluster.  A node that holds part of the distributed data is usually called a data node.

Why Partition?

The purpose of partitioning is to distribute data across multiple machines or storage areas.  These different locations are networked together so data can be reassembled (or redistributed) as needed.

There are several reasons to use partitioning and several methods for performing it.  In this post we’ll mainly cover two of those reasons.

While it may seem counterintuitive to create added complexity by breaking data into smaller pieces, there are some good reasons for doing so.  In fact, much of the recent innovation in scalable computing wouldn’t be possible without partitioning.  This is why HDFS is such a big deal, as it provides a backbone for higher level Hadoop-based applications.

When Your Eyes Are Bigger Than Your Stomach

Big Data often refers to data that is so large that it cannot efficiently fit on a single disk or in the memory of a single computer.  Even if the data could fit on a single disk, that does not mean a single computer could process it for any meaningful purpose – it may just be too big to chew on in one piece.  Instead, large datasets are broken into smaller pieces and then pushed out to live on other parts of the infrastructure in a more digestible form.

Even if you are not dealing with colossal datasets now, partitioning becomes increasingly important when data volumes are expected to grow rapidly.  Rather than continually increasing the size of a single storage unit and moving your data, growing data is spread out across the infrastructure.  Often it’s as simple as adding a new node to instantly have more data storage available.

In essence, this approach allows the balancing of data across a network to prevent overloading a single storage unit.
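One common way to get that balance is hash partitioning: hash each record’s key and take the remainder modulo the number of nodes.  A rough sketch, with invented keys and a four-node cluster assumed:

```python
# Hypothetical sketch: hash partitioning spreads keys roughly evenly across nodes.
import hashlib

def node_for(key, num_nodes):
    """Pick a node by hashing the key; a stable hash keeps placement deterministic."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

NUM_NODES = 4
counts = [0] * NUM_NODES
for i in range(10_000):
    counts[node_for(f"user-{i}", NUM_NODES)] += 1
print(counts)  # each node ends up with close to a quarter of the keys
```

One caveat worth knowing: with plain modulo hashing, changing `num_nodes` reshuffles nearly every key, which is why real systems often layer consistent hashing or fixed partition counts on top of this idea.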

Keep Your Friends Close

Perhaps more importantly, this type of distributed storage also allows us to store data in a way that keeps it “close to” the computing resources that are going to use it.

If you were sending and receiving lots of money through a bank account, you wouldn’t keep all your cash at home.  Bringing it into the bank each time you wanted to send some would be horribly inefficient.  Instead, keeping the cash on deposit with the bank means quick and efficient access to funds through that bank because the cash is easily at hand.

Likewise, modern distributed computing puts the data nearest to the machine that will be using the data for analysis or processing.  The goal here is to keep all processors as busy as possible, while trying to reduce network activity.  In most environments this means each CPU accesses locally stored data on a disk attached to the same machine.
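The scheduling side of that idea can be sketched too.  The block IDs, node names, and replica map below are all invented for illustration – this is the locality preference in miniature, not Hadoop’s actual scheduler:

```python
# Hypothetical sketch of locality-aware scheduling: prefer a free node that
# already holds a replica of the data block, falling back to any free node.
block_locations = {            # block id -> nodes holding a replica
    "blk-1": {"node-a", "node-b"},
    "blk-2": {"node-b", "node-c"},
    "blk-3": {"node-c", "node-a"},
}

def schedule(block, free_nodes):
    """Return (chosen node, read type) for processing the given block."""
    local = block_locations[block] & free_nodes
    if local:
        return local.pop(), "local read"     # data is already on that disk
    return free_nodes.pop(), "remote read"   # data must cross the network

print(schedule("blk-2", {"node-b", "node-d"}))
```

Every "local read" keeps a CPU busy on data sitting on its own disk; every "remote read" spends time on the network instead.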

Types of Partitioning

There will always be a limit to the amount of data that can be stored on a single device, but partitioning allows us to make optimal use of that storage.  Likewise, there will always be a limit to how much data a single resource can process in a given period of time.  But by leveraging additional processors, you can process more data just as fast as, or faster than, before.
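This is where partitioning pays off for processing, not just storage: each worker takes one partition and works in parallel.  A toy word-count over three invented partitions, standing in for what Hadoop-style frameworks automate across machines:

```python
# Hypothetical sketch: process partitions in parallel, one worker per piece,
# then combine the per-partition results.
from concurrent.futures import ProcessPoolExecutor

def count_words(lines):
    """Work done on one partition: count the words in its lines."""
    return sum(len(line.split()) for line in lines)

partitions = [
    ["big data needs partitioning"],
    ["each worker gets one piece", "and stays busy"],
    ["results are combined at the end"],
]

with ProcessPoolExecutor() as pool:
    total = sum(pool.map(count_words, partitions))
print(total)
```

The combine step here is a simple sum, but the shape – map work onto partitions, then merge the results – is the same pattern a cluster uses at scale.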

Knowing that partitioning is important is only half the battle – you have to know how to apply it.  In a future post, I’ll describe the various types of partitioning and help you decide why/when to use them.

About Tyler Mitchell

Developer, author and technology writer in big data, graph analytics, geospatial and Internet of Things. Follow me @1tylermitchell or get my book from
