Traffic accidents in the UK, 1979-2004.
Whether you are a journalist, a researcher, or a data geek, getting started with large data sets usually means laborious setup work: standing up infrastructure, configuring an environment, learning new and unfamiliar tools, and coding complicated apps. With DC/OS you can start crunching those numbers within minutes.
Let’s start with a concrete problem: analyzing road safety data from Great Britain, 1979-2004. While the data set might seem small, some of the analysis may require distributed processing, so we want an environment that allows our processing jobs to scale horizontally. To achieve this, we’ll run a DC/OS cluster on top of a cluster of virtual machines. We’ll be using AWS EC2 in this scenario, but the same solution can be ported to other public and private clouds.
DC/OS sets up a cluster and deploys the pre-configured services needed for the task at hand. You don’t have to understand the full complexity of the infrastructure or how to set it up; DC/OS creates the necessary abstractions for you. Once complete, you will have a running cluster with an interactive research notebook (a container running a Jupyter Python notebook with Apache Spark) and a distributed file system (HDFS), ready to tackle any large-scale data processing task.
Step #1 – Set up your DC/OS cluster
Refer to the manual installation documentation at https://dcos.io/docs/latest/administration/installing/custom/scripted-installer and set up your cluster. You should have one bootstrap node, at least one master node, and at least five slave nodes. Please use m3.xlarge instances so that HDFS has enough memory. You can also set up DC/OS with Amazon CloudFormation, as described at https://dcos.io/installing.
IMPORTANT: If you’ve chosen an OS other than CoreOS, please add the user core to every Mesos slave.
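A minimal sketch of that step, run on each slave over SSH (assuming the user does not exist yet):

# create the core user on a non-CoreOS slave
sudo useradd -m core

CoreOS ships with this user out of the box, which is why the step is only needed on other distributions.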
Step #2 – Start HDFS
Following the instructions in the DC/OS dashboard, install the DC/OS CLI. You can use the bootstrap node for this.
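The dashboard shows the exact commands for your cluster; they look roughly like the following sketch (the master URL is a placeholder, and the download path depends on your DC/OS version):

# download the CLI binary and make it executable
curl -o dcos https://downloads.dcos.io/binaries/cli/linux/x86-64/latest/dcos
chmod +x dcos
sudo mv dcos /usr/local/bin/dcos
# point the CLI at your cluster (placeholder URL)
dcos config set core.dcos_url http://<your-master-host>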
Install the HDFS package with:
dcos package install hdfs
HDFS starts its services one by one; eventually it will have nine services ready.
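You can watch the deployment from the CLI; for example (assuming a CLI version that supports the task subcommand):

# list running tasks and filter for the HDFS services
dcos task | grep hdfs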
Step #3 – Upload a file to HDFS
We will upload the file using a one-time task executed by Marathon.
Download uk-data-to-hdfs.json, the Marathon app definition for this one-time task.
Then run it in Marathon with:
dcos marathon app add uk-data-to-hdfs.json
This task downloads the archive http://data.dft.gov.uk/road-accidents-safety-data/Stats19-Data1979-2004.zip and saves its contents to HDFS.
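The original app definition isn’t reproduced here, but a Marathon app along these lines would do the job. It is only a sketch: it assumes wget, unzip, and the hadoop client are available on the slaves, and it reuses the HDFS URL that appears later in the notebook:

{
  "id": "uk-data-to-hdfs",
  "cmd": "wget http://data.dft.gov.uk/road-accidents-safety-data/Stats19-Data1979-2004.zip && unzip Stats19-Data1979-2004.zip && hadoop fs -put Accidents7904.csv hdfs://namenode1.hdfs.mesos:50071/Accidents7904.csv",
  "cpus": 1,
  "mem": 1024,
  "instances": 1
}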
Step #4 – Run the Jupyter container
Get jupyter.json, the Marathon app definition for the notebook container.
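Again, the original jupyter.json isn’t shown here; a minimal sketch, assuming the publicly available jupyter/pyspark-notebook Docker image rather than whatever image the original post used, would look like this:

{
  "id": "jupyter",
  "cpus": 2,
  "mem": 4096,
  "instances": 1,
  "container": {
    "type": "DOCKER",
    "docker": {
      "image": "jupyter/pyspark-notebook",
      "network": "HOST"
    }
  }
}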
Run Jupyter with:
dcos marathon app add jupyter.json
Step #5 – Create a Python notebook
Open Jupyter in your browser and create a new Python notebook with the cells below. Note that HDFS runs two name nodes: if namenode1.hdfs.mesos is in standby and you get an error message, try namenode2.hdfs.mesos instead.
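If you’d rather check which name node is active up front, the stock HDFS client can report the HA state. This is a sketch only: it assumes a configured hdfs client on the node and uses hypothetical service IDs nn1 and nn2:

# report the HA state (active/standby) of each name node
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2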
from pyspark import SparkContext

# create a Spark context for this notebook
sc = SparkContext()
import time

start_time = int(round(time.time() * 1000))

# now we have a file
text_file = sc.textFile("hdfs://namenode1.hdfs.mesos:50071/Accidents7904.csv")

# getting the header as an array
header = text_file.first().split(",")

# getting data (skip the header row itself)
data = text_file \
    .map(lambda line: line.split(",")) \
    .filter(lambda w: w[header.index('Date')] != 'Date')

# count accidents per date; the key must be a tuple, not a list,
# because Spark needs hashable keys for reduceByKey
output = data.filter(lambda row: len(row[header.index('Date')].strip().split("/")) == 3) \
    .map(lambda row: tuple(row[header.index('Date')].strip().split("/"))) \
    .map(lambda date: (date, 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .sortByKey(True) \
    .collect()

for (line, count) in output:
    print("%s: %i" % (line, count))

print("Duration is '%i' ms" % (int(round(time.time() * 1000)) - start_time))
%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt

# plot the accident count per date; x is simply the row index
plt.plot(np.arange(len(output)), [count for (date, count) in output])
Run the notebook. First you will notice new tasks in Mesos; these are Spark executors:
Your Jupyter notebook will look like this:
As you’ve seen in this post, you can start containerized services in minutes. DC/OS gives you a complete environment and lets you focus on your problem, not on routine deployment and service configuration.