“One area where graph analytics particularly earns its stripes is in data discovery. While most of the discussion around big data has centered on how to answer a particular question or achieve a specific outcome, graph analytics enables us, in many cases, to discover the “unknown unknowns” — to see patterns in the data when we don’t know the right question to ask in the first place.”
In the remainder of this post I outline a few more of my thoughts on this topic and give you pointers to some more resources to help you understand what to do next.
Finding the unknown unknowns?
You are already familiar with the idea of writing SQL queries to extract data from a database. Perhaps, more recently, you have also had to wrap your head around consolidating data from a distributed data store.
But that will only answer your questions.
Say again? Of course you want answers to your questions, but what about discovering new insights via relationships you didn’t even know existed in your data? That means extracting data in response to a question you did not anticipate. For example, you may be looking for edge cases or anomalies that can only be seen when analysing multiple “hops” of relationships in a network (e.g. friends of friends who like the same music).
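To make the “friends of friends who like the same music” idea concrete, here is a minimal sketch in plain Python. The names, friendships and music tastes are made-up sample data, and the helper functions are hypothetical; a real graph engine would do this for you, but the two-hop walk is the same idea.

```python
from collections import defaultdict

# Hypothetical sample data: who knows whom, and what music each person likes.
friendships = [
    ("Harry", "Sally"),
    ("Sally", "Marie"),
    ("Marie", "Jess"),
    ("Harry", "Jess"),
]
likes = {
    "Harry": {"jazz", "folk"},
    "Sally": {"pop"},
    "Marie": {"jazz"},
    "Jess": {"folk", "pop"},
}


def build_graph(pairs):
    """Build an undirected adjacency map from (person, person) pairs."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def friends_of_friends_with_shared_taste(graph, likes, person):
    """Walk two hops out from `person` and keep anyone who is not already
    a direct friend but shares at least one music taste."""
    direct = graph[person]
    results = {}
    for friend in direct:                 # first hop
        for fof in graph[friend]:         # second hop
            if fof == person or fof in direct:
                continue
            shared = likes.get(person, set()) & likes.get(fof, set())
            if shared:
                results[fof] = shared
    return results


graph = build_graph(friendships)
print(friends_of_friends_with_shared_taste(graph, likes, "Harry"))
```

Harry never asked “does Marie like jazz?”; the connection falls out of walking the relationships.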
Until you try some graph analytics or think through a related problem, you probably won’t really understand the power in these ideas. In many ways our SQL experience has taught us to not ask some of the harder questions because they weren’t possible to answer with SQL! I’m speaking from my own personal experience on this and how challenged I was to break out of my previous paradigm.
When SQL is a hammer…
…every data problem that can fit in a table looks like a nail. However, when a single table is full of a myriad of complex relationships, that’s when things start to get really tricky. If that data should end up on your Data Analyst’s desk, just wait for the fireworks. That is, unless the DA has graph-like querying tools available.
Let’s look at an example: the many-to-many relationship table. In SQL, the table would just be two columns with names in them, each row representing a relationship between two people, e.g. Harry, Sally. Those from the graph world see them as two distinct subjects with a relationship, not two pieces of text in a table.
These two subjects may have a bidirectional relationship too. So when doing RDF querying or graph analytics, you will often be building queries to find both relationships: Harry->Knows->Sally and Sally->Knows->Harry.
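As a rough illustration of the subject->relationship->subject view, here is a small Python sketch that stores relationships as triples and looks for pairs that know each other in both directions. The data and the `match` and `mutual` helpers are hypothetical; this is not how any particular RDF store is implemented.

```python
# Hypothetical triple store: a set of (subject, predicate, object) tuples.
triples = {
    ("Harry", "knows", "Sally"),
    ("Sally", "knows", "Harry"),
    ("Sally", "knows", "Marie"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return {
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    }


def mutual(triples, predicate):
    """Return pairs linked by `predicate` in both directions,
    with each pair reported once in alphabetical order."""
    return {
        tuple(sorted((s, o)))
        for (s, p, o) in triples
        if p == predicate and s != o and (o, p, s) in triples
    }


print(match(triples, subject="Sally"))   # everything Sally is the subject of
print(mutual(triples, "knows"))          # pairs who know each other both ways
```

Sally knows Marie, but nothing here says Marie knows Sally, so only Harry and Sally come back as a mutual pair.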
You can emulate this in SQL with a self join or a second query, but don’t bother unless your dataset is really simple and your questions are very superficial. Each join only takes you one degree of relationship further, so you’ll be tempted to do 20 joins to emulate what can be much easier in a graph analysis engine.
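Here is a small sketch of that trade-off using Python’s built-in sqlite3 module. The `knows` table, its columns and the `reachable` helper are assumptions made up for illustration. Every extra hop costs another self join in the SQL, while the graph-style traversal just takes the number of hops as a parameter.

```python
import sqlite3

# Hypothetical two-column relationship table, as described above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE knows (person TEXT, friend TEXT)")
conn.executemany(
    "INSERT INTO knows VALUES (?, ?)",
    [("Harry", "Sally"), ("Sally", "Marie"), ("Marie", "Jess")],
)

# Two hops out from Harry: one self join.
two_hops = conn.execute(
    """
    SELECT DISTINCT b.friend
    FROM knows AS a
    JOIN knows AS b ON a.friend = b.person
    WHERE a.person = 'Harry'
    """
).fetchall()

# Three hops out from Harry: a second self join, and so on for every hop.
three_hops = conn.execute(
    """
    SELECT DISTINCT c.friend
    FROM knows AS a
    JOIN knows AS b ON a.friend = b.person
    JOIN knows AS c ON b.friend = c.person
    WHERE a.person = 'Harry'
    """
).fetchall()


def reachable(conn, start, max_hops):
    """Graph-style alternative: breadth-first walk that records how many
    hops away each person is, for any number of hops."""
    edges = {}
    for person, friend in conn.execute("SELECT person, friend FROM knows"):
        edges.setdefault(person, set()).add(friend)
    seen = {start}
    frontier = {start}
    hops = {}
    for depth in range(1, max_hops + 1):
        frontier = {n for p in frontier for n in edges.get(p, ())} - seen
        for name in frontier:
            hops[name] = depth
        seen |= frontier
    return hops


print(two_hops, three_hops)
print(reachable(conn, "Harry", 3))
```

The traversal here follows the rows in one direction only (person to friend); for a bidirectional relationship you would store or add the reverse rows as well.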
You need graph analytics in the future
“You don’t know my business!” Yes, but we know what the common data handling needs are for most organisations. Without fail, companies, governments and non-profits alike all deal with relationship-laden data:
- Customers “know” other customers – who are the key influencers in your buying community?
- Donors support certain kinds of causes and share about them in social media – what other causes are gaining more traction than yours with similar types of donors?
- Patients have various diagnoses that relate to other patients with similar issues – how can we predict better outcomes if certain factors of care are modified?
How are you planning to deal with such complexity in a proactive way? Your competitors (and all the Internet titans) are already using graph analytics in some way (likely to sell to you better), so it is not going away any time soon.
If you’re just getting started in understanding graph analytics, may I recommend my earlier blog post which is a mini HOWTO using the SPARQLverse analytic engine (from SPARQLcity). Install it in a couple minutes, push some simple data in and get your hands dirty with the ultra simple example there.
Also, if you don’t know some of the differences between RDF, graph data, SQL and SPARQL, it’s worth digging into. I’ll leave you with a great article by Robin Bloor on the subject, in which he clears the air very well. Here is a snippet from the article at Inside Analysis, but you should really read the whole thing as it’s very insightful:
“Where the RDF databases really score is when you want to do set processing (a la SQL) at the same time that you want to do graph processing. Consider a query such as “Who are the biggest influencers on Twitter over the past six months?”
Both the RDF and Graph database would handle such a query and return the same results quickly. But if you ask the very different question, “Which influencers have had the same pattern of influence on Twitter over the last six months?” you are asking both for graph processing and set processing at the same time to get to the answer, and the RDF databases do both well. Not only that, but this is an area of analytics, which was virtually untapped until recently, because there was no software that could easily do it.”
Follow me @1tylermitchell for further discussion or see my links in the sidebar.