An SSH tunnel is necessary in order to use Jupyter running on the cluster from your local browser. First SSH into the cluster with the command below:
ssh -L localhost:1122:localhost:1122 shoh@10.64.22.66
After logging into the cluster, launch the Jupyter notebook server with:
jupyter notebook --no-browser --port=1122 --ip=127.0.0.1
The command prints a URL that you paste into the address bar of your local browser. If everything is working, the browser loads the Jupyter notebook interface running on the cluster, and you are good to go.
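If the page does not load, it can help to first check whether anything is listening on the local end of the tunnel. The following is a minimal Python 3 sketch to run on your local machine, not part of the original setup. The helper name `tunnel_is_open` is hypothetical, and the port 1122 is taken from the ssh command above.

```python
import socket


def tunnel_is_open(host="127.0.0.1", port=1122, timeout=2.0):
    """Return True if something accepts TCP connections on host:port.

    Hypothetical helper: the default port matches the one forwarded
    by the ssh -L command in this guide.
    """
    try:
        # create_connection raises OSError if the connection is refused
        # or the attempt times out.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print(tunnel_is_open())
```

A `True` result only shows that the local port is being forwarded. It does not confirm that Jupyter is running on the cluster side, so a `True` with a blank page usually points at the notebook server rather than the tunnel.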
Each notebook runs as a separate Python process, but Spark itself is implemented in Scala, which runs on the Java Virtual Machine (JVM). The PySpark interface connects the two languages and handles passing data back and forth between them. To gain access to Spark, you first need to build a SparkSession, which loads the Scala binaries via PySpark. Once the connection is complete, PySpark returns the SparkSession, which is the central entry point to Spark.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master(
        "spark://10.64.22.66:7077"
    ) \
    .appName(  # Name of your application in the dashboard/UI
        "printNanoAodAPP"
    ) \
    .config(  # Tell Spark to load some extra libraries from Maven (the Java repository)
        'spark.jars.packages',
        'org.diana-hep:spark-root_2.11:0.1.13,org.diana-hep:histogrammar-sparksql_2.11:1.0.4',
    ) \
    .config('spark.cores.max', 3) \
    .getOrCreate()
# Read the ROOT file into a Spark DataFrame
df = spark.read.format("org.dianahep.sparkroot").load("hdfs:///user/shoh/nanoaod/bbdm80X_NANO.root")
# ... and print the number of events
print(df.count())
# Show the first 3 rows of GenMET_pt and HLT_AK8PFHT750_TrimMass50
df.select("GenMET_pt", "HLT_AK8PFHT750_TrimMass50").show(3)
This block demonstrates that the ROOT reading capability is intact. Operations on files stored in HDFS are described here.
Further verification can be achieved by running the script below in your Jupyter notebook:
# Verifying that Spark, spark-root, histogrammar, and matplotlib work
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master(
        "spark://10.64.22.66:7077"
    ) \
    .appName(  # Name of your application in the dashboard/UI
        "benchmark check"
    ) \
    .config(  # Tell Spark to load some extra libraries from Maven (the Java repository)
        'spark.jars.packages',
        'org.diana-hep:spark-root_2.11:0.1.13,org.diana-hep:histogrammar-sparksql_2.11:1.0.4',
    ) \
    .config('spark.cores.max', 3) \
    .getOrCreate()
#verifying spark
import random
def inside(p):
    # p is the sample index supplied by filter(); it is not used
    x, y = random.random(), random.random()
    return x*x + y*y < 1
# SparkContexts provide most of the "execution engine" of Spark
sc = spark.sparkContext
# One million
num_samples = 1000000
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4.0 * count / num_samples
print(pi)
#Verifying Spark-ROOT functionality
# Read the ROOT file into a Spark DataFrame...
import os
testPath = "hdfs:///user/shoh/nanoaod/bbdm80X_NANO.root"
df = spark.read.format("org.dianahep.sparkroot").load(testPath)
# ... and print the number of events
print(df.count())
#Verifying histogrammar
# Produce 200,000 random numbers
import random
data = [random.gauss(0, 1) for i in range(200000)]
# Define the histogram
import histogrammar as hg
gauss = hg.Bin(num=16,
               low=-4,
               high=4,
               quantity=lambda x: x,
               value=hg.Count())
# Fill the histogram with the values we generated earlier
for d in data:
    gauss.fill(d)
# Plot the resulting histogram
print(gauss.ascii())
#Verifying Matplotlib Functionality
import matplotlib.pyplot as plt
%matplotlib inline
plot = gauss.plot.matplotlib(name="Unit Gaussian")
If you manage to get output from each block, you have verified that Spark, spark-root, histogrammar, and matplotlib are working!
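To judge whether the value printed by the Spark pi check is plausible, the same Monte Carlo estimate can be reproduced locally without Spark. This is a minimal sketch and not part of the original script; the function names `estimate_pi` and `standard_error` are hypothetical. It assumes the same sampling rule as `inside` above: draw a point in the unit square and count it if it falls inside the quarter circle.

```python
import math
import random


def estimate_pi(num_samples, seed=None):
    """Estimate pi from the fraction of random points inside the quarter circle."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1:
            hits += 1
    return 4.0 * hits / num_samples


def standard_error(num_samples):
    """Expected one-sigma spread of the estimate for a given sample count."""
    p = math.pi / 4  # probability that a point lands inside the quarter circle
    return 4.0 * math.sqrt(p * (1 - p) / num_samples)


if __name__ == "__main__":
    n = 200000
    print(estimate_pi(n))
    print(standard_error(n))
```

With the one million samples used in the Spark check, the one-sigma spread is roughly 0.0016, so a printed value within a few thousandths of 3.1416 is what a healthy run should typically give.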