VID: Solving Performance Problems on Hadoop

The full video of my talk from Hadoop Summit (San Jose, June 28, 2016) is now available. In this talk I cover performance considerations when moving analytic workloads into production. I even give away the game-changing secret sauce for extreme performance in Actian’s Vector in Hadoop product for SQL analytics. VID: Solving Performance Problems […]

Storing Zeppelin Notebooks in AWS S3 Buckets


Zeppelin has the option to change the storage options of its notebook system to allow you to use AWS S3.  I’m not sure how long this has been around but I know it isn’t particularly new. However, I wanted to make a note of it as more users of cluster environments are spinning up resources to […]

VirtualBox extension pack update on OS X


Update the VirtualBox extension pack on OS X for improved performance and features. When the auto-update fails, you will likely have to remove the current packs manually and then either restart the automatic installation or install the pack by hand. This very short video shows you how.

Zeppelin Notebook Quick Start on OSX v0.5.6

Zeppelin tutorial example using Python instead of Scala for Spark SQL

This is a follow-up to my post from last year, Apache Zeppelin on OSX – Ultra Quick Start, but without building from source. Today I tested the latest version of Zeppelin (0.5.6) and, using their distributed binaries, was instantly able to launch Zeppelin and run both Scala and Python jobs on my Macbook. This was with zero configuration, […]

Spark Analysis of Global Place Names (GeoNames)


Spark Analysis on a Large File: GeoNames has free gazetteer data by country or for the world, provided in tab-separated text files. In this post I show you how to do some simple analysis using DataFrames in Spark. The global file is 280M compressed and 1.2G uncompressed. This size of file makes it difficult to […]

Serverspec checks settings on a Hadoop cluster

Table showing report from serverspec output

Serverspec is a Ruby-based system that can run RSpec-formatted tests against a host. It can check a long list of system details and statuses, such as total memory, CPU count, running services, and more, including custom shell callouts. It compares the results of the checks with predefined thresholds and reports a pass or […]
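The compare-against-thresholds idea can be sketched in a few lines. This is not Serverspec itself (which is Ruby and RSpec based) but a plain Python illustration. The function name `run_checks`, the fact names, and the sample numbers are all hypothetical, and the pass/fail wording is an assumption about what the report shows.

```python
def run_checks(facts, thresholds):
    """Compare observed host facts with minimum thresholds.

    Returns a dict mapping each check name to "pass" or "fail".
    A fact that is missing from the host counts as a failure.
    """
    report = {}
    for name, minimum in thresholds.items():
        actual = facts.get(name)
        passed = actual is not None and actual >= minimum
        report[name] = "pass" if passed else "fail"
    return report


# Hypothetical values gathered from one host.
facts = {"cpu_count": 8, "total_memory_gb": 16}

# Hypothetical minimums that every cluster node is expected to meet.
thresholds = {"cpu_count": 4, "total_memory_gb": 32, "disk_count": 2}

report = run_checks(facts, thresholds)
for name, status in sorted(report.items()):
    print(f"{name}: {status}")
```

Here the host passes the CPU check, fails the memory check, and fails the disk check because no disk count was collected.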

Hadoop Options for SQL Databases

SQL in Hadoop Solution Comparison Chart

Drowning while trying to understand your options for SQL-based database management in Hadoop?  This graphic is a simplified comparison of the various features of several popular products being used today.  I outline some of my biggest differentiators in this post. While this is a marketing slide for Actian’s SQL in Hadoop enterprise solution, I wish I saw it earlier so I could […]

“Big Data” off 2015 Hype Cycle?

See the official 2015 hype cycle video to get it straight from Gartner. In the video the presenter says, first, that big data has passed over the hump and is no longer just hype. Second, it is embedded within other items throughout the cycle now. I can understand how this can get confusing to track and qualify, but isn’t […]

Spatial Data Made Useful

Geospatial Power Tools book cover

Everyone who deals with geographic/spatial/geospatial data knows they need specialised tools for the job. Fortunately, there are a ton of open source tools up for the challenge. In my latest book, Geospatial Power Tools, I show how to use an advanced set of command line tools that you can start using today. From re-projecting point coordinates […]

Common Zeppelin Errors


A few different errors have popped up during my initiation into Apache Zeppelin. Here are a few of them, summarised with workarounds if you need them. Tutorial Failure Due To Spark Versions: Default Zeppelin comes with Spark 1.1 (though it may be updated by the time you read this). The current Zeppelin tutorial assumes Spark 1.3 or greater […]

Python Spark SQL – Zeppelin Tutorial – No Scala

Zeppelin tutorial example using Python instead of Scala for Spark SQL

My latest notebook aims to mimic the original Scala-based Spark SQL tutorial with one that uses Python instead. Above you can see the two parallel translations side-by-side. Python Spark SQL Tutorial Code: Here is the resulting Python data loading code. The SQL code is identical to the Tutorial notebook, so copy and paste if you need it. I […]

Zeppelin Notebook Tutorial Walkthrough


This is my short video (14 min) showing how to build and launch the Apache Zeppelin notebook platform – a web UI for interactive query and analysis.  This is all done running locally via OSX on a Macbook. In this video we focus on using the tutorial notebook that comes with Zeppelin and discuss each step – including interactive querying and charting – […]

Apache Zeppelin on OSX – Ultra Quick Start


The Zeppelin project provides a powerful web-based notebook platform for data analysis and discovery.  Behind the scenes it supports Spark distributed contexts as well as other language bindings on top of Spark. This post is a very simple introduction to show the first few steps to get started.  You’ll find all you need to know […]

Partitioned Data & Why It Matters

Partitioned Bread

As data volumes grow, so does your need to understand how to partition your data.  Until you understand this distributed storage concept, you will be unable to choose the best approach for the job.  This post gives an introductory explanation of partitioning and you will see why it is integral to the Hadoop Distributed File System (HDFS) increasingly […]
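As a rough illustration of the idea, here is a minimal Python sketch of hash partitioning, one common scheme among several. The excerpt does not say which scheme the post describes, so treat this as an assumption. The function names and the sample meter readings are hypothetical.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Map a key to a partition index using a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_records(records, key_field, num_partitions):
    """Group records by the partition that their key hashes to."""
    partitions = defaultdict(list)
    for record in records:
        index = partition_for(record[key_field], num_partitions)
        partitions[index].append(record)
    return dict(partitions)


# Hypothetical sample data: readings keyed by meter id.
readings = [
    {"meter": "A", "kwh": 1.2},
    {"meter": "B", "kwh": 0.7},
    {"meter": "A", "kwh": 1.5},
    {"meter": "C", "kwh": 2.1},
    {"meter": "B", "kwh": 0.9},
]

parts = partition_records(readings, "meter", 3)
for index in sorted(parts):
    print(index, parts[index])
```

Every record with the same key lands in the same partition, so work on a single key can proceed on one node without consulting the others. That property is what makes the choice of partition key matter.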

Home Energy Monitor Series – Internet of Things

IoT Energy Series Header Image showing home energy gateway and monitor services

Recently I started an Internet of Things series on my experiences installing, using and analysing data from a smart electrical meter. This included a BC Hydro smart meter, an Eagle monitoring gateway from Rainforest Automation, and a cloud-based analytics service from Bidgely. I’ve collated all the posts on the topic for you below. More will be […]

IoT Day 4: Bidgely Cloud Energy Monitor Dashboard

Bidgely Energy Monitor - Appliance Breakdown

After a week of collecting smart meter readings, I’m now ready to show results in a cloud-based energy monitor system – Bidgely – complete with graphs showing readings, cost and machine learning results breaking down my usage by appliance. This is part 4 of a series of posts about the Internet of Things applied to Home Energy […]

Review of 3 Recent Internet of Things (IoT) Announcements

Amazon dash buttons image - showing handheld scanning device

Working in the big data and analytics space, I’m always interested in parts of the Internet of Things (IoT) that will produce more data, require more backend systems, and help users/customers get on with their day better. The past week has shown a few interesting announcements relating to Internet of Things topics.  Here are just […]

Amazon Dash Button – Concept and Article re: IoT use

Amazon announces Dash Buttons

I like this concept: “Just press and never run out”.   It’s the Amazon Dash Button:… – intended to be stuck onto appliances, basically retrofitting ones that don’t have them built in (in the future).  Pressing a button orders refills of products, just like Amazon one click ordering online. Also read this article about how […]

IoT Day 3: Viewing data on the Eagle Energy Monitor

Eagle energy monitor data download as CSV file

The Eagle energy monitor from Rainforest Automation is a very handy device.  It reads the wireless signal from my electricity meter and makes it available through a web interface – both a graphical environment and a RESTful API.  In this post we look at the standard graphical screens and the data download option. Next time […]

Plug-In Solar: Moveable Solar Power For Renters and Do-It-Yourselfers

via Green Building Elements. This definitely sounds promising. Does anyone have experience running a similar setup? A small company, SpinRay Energy, has announced the production of a new UL listed, grid-tied solar power system that couldn’t be easier to […]