brainsteam.co.uk/brainsteam/content/posts/legacy/2017-11-12-spacy-spark-nlp-big.md at a3c39f0c494c7ef8b0d6feff825455c35da11755

Installing Spark + Hadoop

Installing Spark + Hadoop is actually relatively easy. Apache ships tarballs for Windows, Mac and Linux which you can simply download and extract (on Mac and Linux I recommend extracting to /usr/local/spark as a sensible home).

You’ll need Java. Spark also ships with its own Python launcher (in the bin folder you’ll find a script called pyspark, which launches Spark with a Python 2.7 session and a SparkContext object already set up), but I tend to use standalone Python and findspark, which I’ll explain now.

FindSpark

findspark is a Python library that finds a Spark installation and adds it to your Python path (sys.path) at runtime. This means you can use your existing Python install with a newly created Spark setup without any faff.
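To make that concrete, here is a rough sketch of the kind of path juggling findspark saves you from. It assumes the usual layout of a Spark download (a python directory containing a lib folder with a bundled py4j zip). The helper names spark_python_paths and add_spark_to_path are made up for illustration and are not part of findspark.

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the directory and zip files that make `import pyspark` work.

    Hypothetical helper: assumes <spark_home>/python holds the pyspark
    package and <spark_home>/python/lib holds a py4j-*.zip archive.
    """
    spark_python = os.path.join(spark_home, "python")
    py4j_zips = sorted(glob.glob(os.path.join(spark_python, "lib", "py4j-*.zip")))
    return [spark_python] + py4j_zips


def add_spark_to_path(spark_home):
    """Prepend any Spark paths that are not yet on sys.path and return them."""
    paths = [p for p in spark_python_paths(spark_home) if p not in sys.path]
    sys.path[:0] = paths
    return paths
```

In practice you would not write this yourself. The point is only that "finding Spark" boils down to a couple of entries on sys.path.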

First, run:

pip install findspark

Then you’ll want to export the SPARK_HOME environment variable so that findspark knows where to look for the libraries (if you don’t do this, you’ll get an error in your Python session):

export SPARK_HOME=/usr/local/spark

Obviously you’ll want to change this if you’re working with a Spark install at a different location: just point it to the root directory of the Spark installation that you extracted above.

A pro-tip here is to add this line to your .bashrc or .profile so that the variable is already set every time you start a new terminal.
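If you’d rather fail early with a readable message than dig through an import error, a small check along these lines can run before findspark is initialised. This is only a sketch: resolve_spark_home is a hypothetical helper, and the /usr/local/spark fallback simply mirrors the location suggested earlier.

```python
import os


def resolve_spark_home(env=None, default="/usr/local/spark"):
    """Return the Spark root directory, or raise if it does not exist.

    Hypothetical helper: reads SPARK_HOME from `env` (os.environ by
    default) and falls back to `default` when the variable is unset.
    """
    env = os.environ if env is None else env
    home = env.get("SPARK_HOME", default)
    if not os.path.isdir(home):
        raise FileNotFoundError(
            "Spark not found at %r; set SPARK_HOME to your Spark root" % home
        )
    return home
```

The returned path can be handed straight to findspark, for example findspark.init(resolve_spark_home()), since init accepts the Spark home as its first argument.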

Python and Findspark first steps

If you did the above properly, you can now launch Python and start your first Spark job.

Try running the following:

import sys
from operator import add
from random import random

import findspark
findspark.init()

import pyspark  # if findspark.init() didn't work then you'll get an error here
from pyspark.sql import SQLContext

if __name__ == "__main__":
    """
    Usage: pi [partitions]
    """
    sc = pyspark.SparkContext()
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def f(_):
        # pick a random point in the square [-1, 1] x [-1, 1];
        # return 1 if it lands inside the unit circle, else 0
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = sc.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    sc.stop()
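If the Spark version feels opaque, it may help to see the same estimate without a cluster. The sketch below mirrors the map and reduce steps in plain Python. The name estimate_pi and its seed parameter are made up for illustration, and nothing here runs in parallel.

```python
import random
from functools import reduce
from operator import add


def estimate_pi(n, seed=None):
    """Monte Carlo estimate of pi from n random points (no Spark involved)."""
    rng = random.Random(seed)  # seeded generator so a run can be repeated

    def inside(_):
        # same test as f() in the Spark job: is the point inside the unit circle?
        x = rng.random() * 2 - 1
        y = rng.random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    # map every sample to 0 or 1, then reduce with addition
    count = reduce(add, map(inside, range(n)), 0)
    return 4.0 * count / n


if __name__ == "__main__":
    print("Pi is roughly %f" % estimate_pi(200000))
```

The fraction of points landing inside the circle approximates the ratio of the circle's area to the square's, which is pi / 4, hence the multiplication by 4. Spark's contribution is to spread the sampling across partitions.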
