---
author: James
date: -001-11-30T00:00:00+00:00
draft: true
post_meta:
- date
tags:
- nlp
- python
- phd
title: Spacy, Spark and BIG NLP
type: posts
url: /?p=212
---

Recently I have been working on a project that involves trawling full-text newspaper articles from the JISC UK Web Domain Dataset, which covers all websites with a .uk domain suffix from 1996 to 2013. As you can imagine, this task is pretty gargantuan: the archives themselves are over 27 terabytes in size (that's enough space to store 5000 high definition movies).

I'll be writing more about my work with the JISC dataset another time. This article focuses on getting started with Apache Spark and spaCy, which can be a bit of a pain.

## Installing Spark + Hadoop

Installing Spark + Hadoop is actually relatively easy. Apache ship tarballs for [Windows, Mac and Linux][1] which you can simply download and extract (on Mac and Linux I recommend extracting to /usr/local/spark as a sensible home).

You'll need Java. Spark also ships with a Python launcher (in the bin folder you'll find a script called pyspark, which launches Spark with a Python 2.7 session and a SparkContext object already set up), but I tend to use standalone Python and findspark, which I'll explain now.

## FindSpark

[findspark][2] is a Python library for finding a Spark installation and adding it to your Python path (`sys.path`) at runtime. This means you can use your existing Python install with a newly created Spark setup without any faff.
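
To give a rough idea of what that involves, here is a minimal sketch of the kind of path manipulation findspark performs. It is not findspark's actual code: the helper name `add_spark_to_path` is hypothetical, and the layout it assumes (a `python` directory plus a py4j zip under `python/lib`) is that of a standard Spark download.

```python
import glob
import os
import sys


def add_spark_to_path(spark_home):
    """Hypothetical helper: make Spark's bundled Python libraries importable.

    Assumes the usual layout of a Spark download:
    <spark_home>/python and <spark_home>/python/lib/py4j-*.zip
    """
    spark_python = os.path.join(spark_home, "python")
    # py4j is the bridge pyspark uses to talk to the JVM; Spark ships it as a zip
    py4j_zips = glob.glob(os.path.join(spark_python, "lib", "py4j-*.zip"))

    added = []
    for path in [spark_python] + py4j_zips:
        if path not in sys.path:
            sys.path.insert(0, path)
            added.append(path)
    return added
```

Calling `add_spark_to_path("/usr/local/spark")` would return the list of entries it added. In practice findspark handles this for you, so treat the sketch purely as an illustration.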

Firstly run

<pre>pip install findspark</pre>

Then you'll want to export the SPARK_HOME environment variable so that findspark knows where to look for the libraries (if you don't do this, you'll get an error in your Python session).

<pre>export SPARK_HOME=/usr/local/spark</pre>

Obviously you'll want to change this if you're working with a Spark install at a different location. Just point it to the root directory of the Spark installation that you extracted above.

A pro-tip here is to add this line to your .bashrc or .profile so that every time you start a new terminal instance, this information is already available.
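
If you would rather fail early with a readable message when the variable is missing, a small check along these lines can help. This is only a sketch and not part of findspark: `resolve_spark_home` is a hypothetical helper, and testing for `bin/pyspark` is an assumed heuristic for "this looks like a Spark install".

```python
import os


def resolve_spark_home(environ=os.environ):
    """Hypothetical helper: return SPARK_HOME or raise a readable error."""
    spark_home = environ.get("SPARK_HOME")
    if not spark_home:
        raise RuntimeError("SPARK_HOME is not set; export it before starting Python")
    # Assumed heuristic: a Spark install has the pyspark launcher in bin/
    if not os.path.isfile(os.path.join(spark_home, "bin", "pyspark")):
        raise RuntimeError("SPARK_HOME=%s does not look like a Spark install" % spark_home)
    return spark_home
```

Passing the environment in as an argument keeps the function easy to try out with a plain dictionary instead of your real shell environment.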

## Python and Findspark first steps

If you did the above properly you can now launch Python and start your first Spark job.

Try running the following:
  
<pre><code lang="python">import sys
from operator import add
from random import random

import findspark
findspark.init()  # uses SPARK_HOME to locate the Spark installation

import pyspark  # if the above didn't work then you'll get an error here

from pyspark.sql import SQLContext

if __name__ == "__main__":
    """
        Usage: pi [partitions]
    """
    sc = pyspark.SparkContext()
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def f(_):
        # sample a random point in the square [-1, 1] x [-1, 1]
        x = random() * 2 - 1
        y = random() * 2 - 1
        # 1 if the point falls inside the unit circle, otherwise 0
        return 1 if x ** 2 + y ** 2 &lt;= 1 else 0

    # spread the samples across the partitions, then sum the hits
    count = sc.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    sc.stop()
</code></pre>
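
For comparison, the same estimate can be computed without Spark. The sketch below is plain Python, with a hypothetical function name `estimate_pi` and a seed parameter added for repeatability; it does serially what the `map(f).reduce(add)` pipeline does across partitions.

```python
import random


def estimate_pi(n, seed=None):
    """Monte Carlo estimate of pi from n random points in a 2x2 square."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        x = rng.random() * 2 - 1
        y = rng.random() * 2 - 1
        if x ** 2 + y ** 2 <= 1:
            inside += 1
    # (circle area) / (square area) = pi / 4, so scale the hit ratio by 4
    return 4.0 * inside / n


if __name__ == "__main__":
    print("Pi is roughly %f" % estimate_pi(200000, seed=42))
```

The Spark version only pays off once `n` is large enough that splitting the work across partitions outweighs the overhead of starting a job.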

 [1]: https://spark.apache.org/downloads.html
 [2]: https://pypi.python.org/pypi/findspark