My latest notebook recreates the original Scala-based Spark SQL tutorial in Python. Above you can see the two parallel versions side-by-side.
Python Spark SQL Tutorial Code
Here is the resulting Python data loading code. The SQL code is identical to the Tutorial notebook, so copy and paste if you need it.
I would have tried to make things look a little cleaner, but Python doesn’t easily allow multiline statements in a lambda function, so some lines get a little long. Improvements invited!
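from os import getcwd
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# sqlContext = SQLContext(sc) # Removed with latest version I tested.
# Note: Prior to 0.6.0 the sqlContext variable is called sqlc in %pyspark
zeppelinHome = getcwd()
bankText = sc.textFile(zeppelinHome+"/data/bank-full.csv")
bankSchema = StructType([StructField("age", IntegerType(), False),StructField("job", StringType(), False),StructField("marital", StringType(), False),StructField("education", StringType(), False),StructField("balance", IntegerType(), False)])
# Split each line on ";", drop the header row, and keep columns 0, 1, 2, 3 and 5 (age, job, marital, education, balance), as the Scala tutorial does.
bank = bankText.map(lambda s: s.split(";")).filter(lambda s: s[0] != "\"age\"").map(lambda s:(int(s[0]), str(s[1]).replace("\"", ""), str(s[2]).replace("\"", ""), str(s[3]).replace("\"", ""), int(s[5]) ))
bankdf = sqlContext.createDataFrame(bank,bankSchema)
# Register the DataFrame under the name the %sql paragraphs query (assumed to match the Scala tutorial's "bank" table).
bankdf.registerTempTable("bank")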
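One possible way around the long lambda line is to move the parsing into small named functions and pass those to map and filter. The following is only a minimal sketch: the names parse_bank_line and is_header are made up for illustration, the sample data line is illustrative rather than copied from the file, and the column positions (0, 1, 2, 3 and 5) are assumed from the Scala tutorial and the layout of bank-full.csv.

```python
def is_header(line):
    """Return True for the CSV header row, which starts with the quoted "age" column."""
    return line.startswith('"age"')


def parse_bank_line(line):
    """Turn one semicolon-delimited line into (age, job, marital, education, balance)."""
    fields = line.split(";")
    return (
        int(fields[0]),              # age
        fields[1].replace('"', ''),  # job, with the surrounding quotes stripped
        fields[2].replace('"', ''),  # marital
        fields[3].replace('"', ''),  # education
        int(fields[5]),              # balance (assumed to be column 5)
    )


# Illustrative lines in the shape of bank-full.csv; not real data from the file.
sample_lines = [
    '"age";"job";"marital";"education";"default";"balance"',
    '58;"management";"married";"tertiary";"no";2143',
]

rows = [parse_bank_line(l) for l in sample_lines if not is_header(l)]

# In the notebook the same helpers would replace the lambdas, along the lines of:
# bank = bankText.filter(lambda l: not is_header(l)).map(parse_bank_line)
```

Because the helpers are plain Python, they can be tried on a few sample lines without Spark before being handed to the RDD.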
Update: In a Zeppelin 0.6.0 snapshot I found that the “sqlContext = SQLContext(sc)” line worked in the Python interpreter, but I had to remove it to allow Zeppelin to share the sqlContext object with a %sql interpreter. After all, Zeppelin has already initialized it behind the scenes, so you should probably not be overwriting it here.
If you don’t comment it out, the %sql paragraph will fail with an error along the lines of:
Table "bank" does not exist
I assume this behaviour is newer than the last time I used Zeppelin and will continue going forward, so I’ve commented the line out to hopefully ease your pain. (Thanks Matt S. for the tip!)