Pragith Prakash
Pragith was a part of the Data Science Team. His areas of expertise include mathematical modeling and statistical analysis.
[How-To] Run SparkR with RStudio

With the release of Apache Spark 1.4.0, SparkR, previously a third-party package from AMPLab, is officially integrated into the main distribution. This update is a delight for data scientists and analysts who are comfortable in their R ecosystem and still want the speed and performance of Spark.

In this article, I’ll walk you through creating an Ubuntu instance from scratch; installing R, RStudio, and Spark; and configuring SparkR with RStudio, concluding with a quick example of SparkR code running from RStudio. I’ll be using Google Cloud (GC) for this walkthrough; however, a similar process applies to Amazon Web Services or even a fresh Ubuntu installation on your local system.


Create a new Instance:

  • Log on to your Google Cloud Console
  • Click on Compute from the left-side Navigation Bar
  • Click on the blue New Instance button

[Screenshot: New Instance button]

Describe the Virtual Machine (VM) specifications:

  • Create a new instance with hardware specs of your choice
  • Make sure you choose Ubuntu 15.04 as your Operating System
  • Tick Allow HTTP traffic and Allow HTTPS traffic

[Screenshot: Create new instance form]

Configure Network Ports:

  • Once you’ve created your instance, you’ll see it in the Compute Engine page
  • Click on default as highlighted below

[Screenshot: Network ports]

  • Click on the blue Add firewall rule button
  • In the new window, enter tcp:4039-60000 exactly as shown, and save

[Screenshot: Firewall rules]

Open SSH terminal:

  • From the Compute Engine page, click on SSH at the rightmost side of your window

[Screenshot: SSH button]

If your terminal isn’t configured to use Bash, run this command:
sudo chsh -s /bin/bash

The GC instance is now set up. We will now proceed with the SparkR installation.

Update your sources to install R:
sudo nano /etc/apt/sources.list

Add the following line to the end of the file:
deb http://cran.cnr.berkeley.edu/bin/linux/ubuntu/ vivid/

Then refresh the package lists:
sudo apt-get update
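If apt-get update warns about a missing GPG key for the new CRAN entry, you may also need to import the repository’s signing key first. The key ID below is the one CRAN’s Ubuntu instructions publish; treat it as an assumption and verify it against CRAN’s current documentation:

```shell
# Import the CRAN/Ubuntu signing key (key ID per CRAN's Ubuntu docs),
# then refresh the package lists again
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
sudo apt-get update
```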

Download and unpack Spark 1.4.0:
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.4.0-bin-hadoop2.6.tgz
tar -xvzf spark-1.4.0-bin-hadoop2.6.tgz
mv spark-1.4.0-bin-hadoop2.6 spark

Install R:
sudo apt-get install r-base

Install RStudio Server:
sudo apt-get install gdebi-core
wget http://download2.rstudio.org/rstudio-server-0.99.447-amd64.deb
sudo gdebi rstudio-server-0.99.447-amd64.deb

Launch R:
R

Install any package to create a “user” library:
install.packages("magrittr") ## You can install any package, but we will be using this one in our SparkR example.

Quit R:
q()

Create a soft link to the Spark folder in R’s user library:
ln -s /home/<username>/spark/R/lib/SparkR /home/<username>/R/x86_64-pc-linux-gnu-library/3.2
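To see what the soft link accomplishes, here is a minimal sketch of the same mechanics run against throwaway paths in /tmp (the directory names are stand-ins, not the real Spark or R library locations):

```shell
# Stand-in for the Spark-bundled SparkR package directory
mkdir -p /tmp/sparkr-demo/spark/R/lib/SparkR
# Stand-in for R's user library
mkdir -p /tmp/sparkr-demo/Rlib/3.2

# The link makes the SparkR package appear inside the user library,
# so library(SparkR) can find it without copying any files
ln -sfn /tmp/sparkr-demo/spark/R/lib/SparkR /tmp/sparkr-demo/Rlib/3.2/SparkR

# The link resolves back to the Spark-bundled package
readlink /tmp/sparkr-demo/Rlib/3.2/SparkR
```

Because it is a link rather than a copy, upgrading Spark in place automatically updates the SparkR package R sees.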

Install OpenJDK:
sudo apt-get install openjdk-7-jdk
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
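Note that an export like the one above only lasts for the current shell session. To make it persistent, you can append it to your ~/.bashrc. The sketch below demonstrates the pattern against a stand-in file so it is safe to run as-is:

```shell
# Stand-in for ~/.bashrc; replace with "$HOME/.bashrc" on your instance
rcfile=/tmp/demo-bashrc

# Append the export so new shells pick up JAVA_HOME automatically
echo 'export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64' >> "$rcfile"

# Confirm the line landed in the file
grep JAVA_HOME "$rcfile"
```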

Launch RStudio from web browser:
http://<instance-external-ip>:8787

Within RStudio, run the following to test the setup:

Sys.setenv(SPARK_HOME="/home/<username>/spark")
Sys.setenv(SPARKR_SUBMIT_ARGS="--packages com.databricks:spark-csv_2.10:1.0.3 sparkr-shell")
library(SparkR)
library(magrittr)
# Initialize SparkContext and SQLContext
sc <- sparkR.init(appName="SparkR-Flights-example")
sqlContext <- sparkRSQL.init(sc)
  
# The SparkSQL context should already be created for you as sqlContext
sqlContext
# You should see output like: Java ref type org.apache.spark.sql.SQLContext id 1
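With the contexts up, here is a short follow-on sketch of what a SparkR session can look like. The file path and the DepDelay column are assumptions for illustration only; point read.df at your own CSV and columns:

```r
# Hypothetical data: assumes a flights.csv with a DepDelay column
# exists at this path -- adjust both to your own data
flights <- read.df(sqlContext, "/home/<username>/flights.csv",
                   source = "com.databricks.spark.csv", header = "true")

# magrittr's pipe chains SparkR operations left to right
delayed <- flights %>%
  filter(flights$DepDelay > 60) %>%
  select(flights$DepDelay)

head(delayed)
```

This is also where the magrittr package installed earlier pays off: the pipe keeps multi-step DataFrame transformations readable.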

Make sure you replace <username> with your Ubuntu username. Please leave a comment if you have any trouble with the process.

July 3rd, 2015 | Spark, Streaming, Technology