brainsteam.co.uk/brainsteam/content/posts/2017-11-12-spacy-spark-nlp-...

---
title: Spacy, Spark and BIG NLP
author: James
type: post
date: -001-11-30T00:00:00+00:00
draft: true
url: /?p=212
categories:
  - Uncategorized
---

Recently I have been working on a project that involves trawling full-text newspaper articles from the JISC UK Web Domain Dataset, which covers all websites with a .uk domain suffix from 1996 to 2013. As you can imagine, this task is pretty gargantuan: the archives themselves are over 27 terabytes in size (that's enough space to store 5000 high definition movies).

I'll be writing more about my work with the JISC dataset another time. This article focuses on getting started with Apache Spark and spaCy, which can be a bit of a pain.

**Installing Spark + Hadoop**

Installing Spark + Hadoop is actually relatively easy. Apache ship tarballs for Windows, Mac and Linux, which you can simply download and extract (on Mac and Linux I recommend extracting to /usr/local/spark as a sensible home).

You'll need Java. Although Spark seems to ship with Python (in the bin folder you'll find a script called pyspark, which launches Spark with a Python 2.7 session and a SparkContext object already set up), I tend to use standalone Python and findspark, which I'll explain now.

**FindSpark**

findspark is a Python library that finds a Spark installation and adds it to your Python path (sys.path) at runtime. This means you can use your existing Python install with a newly created Spark setup without any faff.
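To make the idea concrete, here is a minimal sketch of the kind of path manipulation involved. It is not findspark's actual source: the helper names `spark_python_paths` and `add_spark_to_path` are made up for illustration, and the sketch assumes the conventional layout where the Python bindings live in `python/` and a bundled py4j zip sits in `python/lib/` under the Spark root.

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the paths a Spark install needs on sys.path.

    Assumption: the Python bindings are in <spark_home>/python and a
    bundled py4j archive is in <spark_home>/python/lib/py4j-*.zip.
    """
    python_dir = os.path.join(spark_home, "python")
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*.zip")))
    return [python_dir] + py4j_zips


def add_spark_to_path(spark_home):
    """Prepend any missing Spark paths to sys.path; return what was added."""
    added = [p for p in spark_python_paths(spark_home) if p not in sys.path]
    sys.path[:0] = added  # insert at the front so these win over other installs
    return added
```

In practice you would just call `findspark.init()`; the sketch only shows why `import pyspark` starts working afterwards.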

First, run:

pip install findspark

Then you'll want to export the SPARK_HOME environment variable so that findspark knows where to look for the libraries (if you don't do this, you'll get an error in your Python session).

export SPARK_HOME=/usr/local/spark

Obviously you'll want to change this if you're working with a Spark install at a different location. Just point it to the root directory of the Spark installation that you extracted above.

A pro-tip here is to add this line to your .bashrc or .profile file so that every time you start a new terminal instance, this information is already available.
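If you want to confirm that the variable is actually visible from Python before going any further, a small check along these lines can help. This is only a sketch: `resolve_spark_home` is a hypothetical helper, and it assumes that a valid Spark root contains the `bin/pyspark` launcher script mentioned earlier.

```python
import os


def resolve_spark_home(environ=None):
    """Return SPARK_HOME if it is set and looks like a Spark install."""
    if environ is None:
        environ = os.environ
    spark_home = environ.get("SPARK_HOME")
    if not spark_home:
        raise RuntimeError("SPARK_HOME is not set; export it before starting Python")
    # Assumption: a Spark root directory contains the bin/pyspark launcher.
    if not os.path.isfile(os.path.join(spark_home, "bin", "pyspark")):
        raise RuntimeError("SPARK_HOME=%s does not contain bin/pyspark" % spark_home)
    return spark_home
```

Calling `resolve_spark_home()` in a fresh terminal is a quick way to tell whether the export in your .bashrc or .profile has been picked up.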

**Python and findspark first steps**

If you did the above properly, you can now launch Python and start your first Spark job.

Try running the following:

```python
import sys
from operator import add
from random import random

import findspark
findspark.init()

import pyspark  # if the above didn't work then you'll get an error here

from pyspark.sql import SQLContext

if __name__ == "__main__":
    """
        Usage: pi [partitions]
    """
    sc = pyspark.SparkContext()
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def f(_):
        # pick a random point in the square [-1, 1] x [-1, 1]
        x = random() * 2 - 1
        y = random() * 2 - 1
        # 1 if the point falls inside the unit circle, otherwise 0
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = sc.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    sc.stop()
```
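For intuition, here is the same Monte Carlo estimate in plain Python with no Spark involved. It is a sketch rather than part of the original job: `estimate_pi` and its `seed` parameter are names introduced here for illustration. The fraction of random points in the square that land inside the unit circle approaches pi / 4, which is why the count is multiplied by 4.

```python
import random


def estimate_pi(n, seed=None):
    """Estimate pi by sampling n random points in the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)  # a private generator so runs can be repeated
    inside = 0
    for _ in range(n):
        x = rng.random() * 2 - 1
        y = rng.random() * 2 - 1
        if x ** 2 + y ** 2 <= 1:
            inside += 1
    # area of circle / area of square = pi / 4
    return 4.0 * inside / n


if __name__ == "__main__":
    print("Pi is roughly %f" % estimate_pi(200000))
```

The Spark version spreads the same sampling across partitions with `map` and then sums the hits with `reduce`.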