---
title: Spacy, Spark and BIG NLP
author: James
type: post
date: -001-11-30T00:00:00+00:00
draft: true
url: /?p=212
tags:
  - nlp
  - python
  - phd
---

Recently I have been working on a project that involves trawling full-text newspaper articles from the JISC UK Web Domain Dataset, which covers all websites with a .uk domain suffix from 1996 to 2013. As you can imagine, this task is pretty gargantuan and the archives themselves are over 27 terabytes in size (that’s enough space to store 5,000 high-definition movies).

I’ll be writing more about my work with the JISC dataset another time. This article focuses on getting started with Apache Spark and spaCy, which has the potential to be a bit of a pain.

## Installing Spark + Hadoop

Installing Spark + Hadoop is actually relatively easy. Apache ship tarballs for [Windows, Mac and Linux][1] which you can simply download and extract (on Mac and Linux I recommend extracting to /usr/local/spark as a sensible home).
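
For example, on Mac or Linux the steps look roughly like this (the exact archive name depends on the Spark and Hadoop versions you download, so treat this as a sketch):

<pre>tar -xzf spark-VERSION-bin-hadoopVERSION.tgz
sudo mv spark-VERSION-bin-hadoopVERSION /usr/local/spark</pre>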

You’ll also need Java installed. Although Spark ships with its own Python launcher (in the bin folder you’ll find a script called pyspark, which starts a Python 2.7 session with a SparkContext object already set up), I tend to use a standalone Python install with findspark, which I’ll explain now.
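
For reference, the bundled launcher can be started straight from the extracted directory (assuming the /usr/local/spark location from above):

<pre>/usr/local/spark/bin/pyspark</pre>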

## FindSpark

[findspark][2] is a Python library that finds a Spark installation and adds it to your PYTHONPATH at runtime. This means you can use your existing Python install with a newly created Spark setup without any faff.

Firstly, run:

<pre>pip install findspark</pre>

Then you’ll want to export the SPARK_HOME environment variable so that findspark knows where to look for the Spark libraries (if you don’t do this, you’ll get an error in your Python session):

<pre>export SPARK_HOME=/usr/local/spark</pre>

Obviously you’ll want to change this if you’re working with a Spark install at a different location – just point it at the root directory of the Spark installation that you extracted above.
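
Alternatively, findspark will also accept the path directly when you initialise it, so you don’t have to touch environment variables at all. A minimal sketch, assuming the /usr/local/spark location from above:

<pre><code lang="python">import findspark

# Point findspark at the extracted Spark directory explicitly
# rather than relying on the SPARK_HOME environment variable.
findspark.init("/usr/local/spark")

import pyspark</code></pre>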
|
||
|
||
a pro-tip here is to actually add this line to your .bashrc or .profile files so that every time you start a new terminal instance, this information is already available.
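
For example, assuming bash and the /usr/local/spark location used above:

<pre>echo 'export SPARK_HOME=/usr/local/spark' >> ~/.bashrc</pre>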

## Python and Findspark first steps

If you did the above properly you can now launch Python and start your first Spark job.

Try running the following:

<pre><code lang="python">import sys
from random import random
from operator import add

import findspark
findspark.init()

import pyspark  # if the above didn't work then you'll get an error here
from pyspark.sql import SQLContext  # not used in this example

if __name__ == "__main__":
    """
    Usage: pi [partitions]
    """
    sc = pyspark.SparkContext()
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    # Monte Carlo estimate of pi: sample random points in the unit square
    # and count how many fall inside the unit circle.
    def f(_):
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = sc.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    sc.stop()
</code></pre>
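
To check everything is wired up, drop the above into a file (say pi_estimate.py, the name is just an example) and run it with your normal Python interpreter:

<pre>python pi_estimate.py 4</pre>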

[1]: https://spark.apache.org/downloads.html
[2]: https://pypi.python.org/pypi/findspark