---
author: James
date: 2017-06-05 07:24:22+00:00
medium_post:
- O:11:"Medium_Post":11:{s:16:"author_image_url";N;s:10:"author_url";N;s:11:"byline_name";N;s:12:"byline_email";N;s:10:"cross_link";s:2:"no";s:2:"id";N;s:21:"follower_notification";s:3:"yes";s:7:"license";s:19:"all-rights-reserved";s:14:"publication_id";s:2:"-1";s:6:"status";s:4:"none";s:3:"url";N;}
- null
post_meta:
- date
tags:
- python
- webarchive
- PhD
title: Exploring Web Archive Data CDX Files
type: posts
url: /2017/06/05/exploring-web-archive-data-cdx-files/
---
I have recently been working in partnership with the [UK Web Archive][1] to identify and parse large amounts of historic news data for an NLP task that I will blog about in the future. The NLP portion of this task will surely present its own challenges, but for now there is the small matter of identifying news data amongst the noise of 60TB of [web archive dumps of the rest of the .UK top level domain.][1]
## WARC and CDX Files
The web archive project have produced standardized file formats for describing historic web resources in a compressed archive. The website is scraped and the content is stored chronologically in a [WARC][2] file. A CDX index file is also produced describing every URL scraped, the time it was retrieved at and which WARC file the content is in, along with some other metadata.
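To make the index format concrete, here is a minimal sketch of pulling named fields out of a single CDX line. It assumes the common 11-field, space-separated layout (header `CDX N b a m s k r M S V g`); the field names and the sample line are made up for illustration, so check the header line of the real files before relying on them.

```python
from collections import namedtuple

# Field names are illustrative labels for the assumed "N b a m s k r M S V g" layout.
CdxRecord = namedtuple(
    "CdxRecord",
    "urlkey timestamp original_url mime_type status digest redirect meta "
    "length offset warc_file",
)

def parse_cdx_line(line):
    """Split one space-separated CDX line into named fields."""
    parts = line.rstrip("\n").split(" ")
    if len(parts) != len(CdxRecord._fields):
        raise ValueError("expected 11 fields, got %d" % len(parts))
    return CdxRecord(*parts)

# A made-up example line, not taken from the real archive.
sample = (
    "uk,co,example)/news 20120105123456 http://www.example.co.uk/news "
    "text/html 200 ABCDEF123456 - - 2048 1024 example-2012.warc.gz"
)
record = parse_cdx_line(sample)
print(record.original_url, record.timestamp, record.warc_file)
```

The last field is the part that matters for narrowing the search: it names the WARC file that holds the content for that URL.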
Our first task is to identify news content in order to narrow down our search to a subset of WARC files (in order not to fill 60TB of storage or have to traverse that amount of data). The CDX files allow us to do this. These files are available for [free download from the Web Archive website.][3] They are compressed using Gzip compression down to around 10-20GB per file. If you try to expand these files locally, you’re looking at 60-120GB of uncompressed data – a great way to fill up your hard drive.
## Processing Huge Gzip Files
Ideally we want to explore these files without having to uncompress them explicitly. This is possible using Python 3’s gzip module but it took me a long time to find the right options.
Python file i/o typically allows you to read a file in line by line. If you have a text file, you can iterate over the lines using something like the following:
<pre lang="python">with open("my_text_file.txt", "r") as f:
    for line in f:
        print(line)
</pre>
Now clearly trying this approach with a .gz file isn't going to work. Using the [gzip][4] module we can open and uncompress gz as a stream, examining parts of the file in memory and discarding data that we've already seen. This is the most efficient way of dealing with a file of this magnitude, which won't fit into RAM on a modern machine and would fill a hard drive uncompressed.
I tried a number of approaches using the gzip library, trying to run the gzip command line utility using [subprocess][5] and combinations of [TextIOWrapper][6] and [BufferedReader][7] but to no avail.
## The Solution
The solution is actually incredibly simple in Python 3 and I wasn't far off the money with [TextIOWrapper.][6] The gzip library offers a file read/write flag for accessing gzipped text in a buffered line-by-line fashion as above for the uncompressed text file. Simply passing in "rt" to the gzip.open() function will wrap the input stream from Gzip in a TextIOWrapper and allow you to read the file line by line.
<pre lang="python">import gzip
with gzip.open("2012.cdx.gz","rt") as gzipped:
    for i,line in enumerate(gzipped):
        print(line)
        # stop this thing running off and printing the whole file.
        if i == 10:
            break</pre>
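If you want to try the "rt" approach before downloading a multi-gigabyte index, the sketch below builds a tiny gzipped stand-in file and reads only its first few lines. The helper name `head_gzip`, the demo file and the choice of `errors="replace"` (to survive any badly encoded bytes) are my own assumptions for illustration, not something the CDX files require.

```python
import gzip
import itertools
import os
import tempfile

def head_gzip(path, n=10):
    """Return the first n lines of a gzipped text file without unpacking it to disk."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as gzipped:
        # islice stops iterating after n lines, so the bulk of the file is never read.
        return [line.rstrip("\n") for line in itertools.islice(gzipped, n)]

# Build a small stand-in file; the real CDX files are far larger.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.cdx.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as out:
    for i in range(1000):
        out.write("line %d\n" % i)

print(head_gzip(demo_path, 3))
```

Using `itertools.islice` is just an alternative to the `enumerate` and `break` pattern; both stop reading early.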
If you're using an older version of Python (2.7 for example) or you would prefer to see what's going on beneath the covers here explicitly, you can also use the following code:
<pre lang="python">import io
import gzip
with io.TextIOWrapper(gzip.open("2012.cdx.gz","r")) as gzipped:
    for i,line in enumerate(gzipped):
        print(line)
        # stop this thing running off and printing the whole file.
        if i == 10:
            break</pre>
And it's as simple as that. You can now start to break down each line in the file using tools like [urllib][8] to identify content stored in the archive from domains of interest.
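As a rough illustration of picking out domains of interest, the sketch below checks whether a URL belongs to a given domain or one of its subdomains. The function name `is_from_domain` and the example.co.uk URLs are hypothetical; only `urlparse` comes from the standard library.

```python
from urllib.parse import urlparse

def is_from_domain(url, domain):
    """True if the URL's host is the domain itself or one of its subdomains."""
    # hostname is lower-cased and has any port stripped; it is None for URLs
    # without a scheme, so fall back to an empty string.
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)

print(is_from_domain("http://www.example.co.uk/news/story-1", "example.co.uk"))
print(is_from_domain("http://notexample.co.uk/", "example.co.uk"))
```

Matching on the "." boundary avoids counting look-alike hosts such as notexample.co.uk as part of example.co.uk.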
## Solving a problem
We may want to understand how much content is available in the archive for a given domain. To put this another way: which domains have the most pages stored in the web archive? In order to answer this, we can run a simple script that parses all of the URLs, examines the domain name and counts instances of each.
<pre>import gzip
from collections import Counter
from urllib.parse import urlparse

# tally of URLs seen per domain
urlcounter = Counter()

with gzip.open("2012.cdx.gz","rt") as gzipped:
    for line in gzipped:
        # fields are space-separated; the URL is at index 2
        parts = line.split(" ")
        urlbits = urlparse(parts[2])
        urlcounter[urlbits.netloc] += 1

# at the end we print out the top 10 domains
print(urlcounter.most_common(10))</pre>
Just to quickly explain what is going on here:
1. We load up the CDX file in compressed text mode as described above
2. We split each line using space characters. This gives us an array of fields, the order and content of which are described by the WebArchive team [here.][3]
3. We parse the URL (which is at index 2) using the [urlparse][9] function which will break the URL up into things like domain, protocol (HTTP/HTTPS), path, query, fragment.
4. We increment the counter for the current domain (described in the 'netloc' field of the parsed URL).
5. After iterating we print out the domains with the most URLs in the CDX file.
This will take a long time to complete since we're iterating over 60-120GB of uncompressed text per CDX file. I intend to investigate parallel processing of these CDX files as a next step.
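One possible shape for that parallel step, offered as an untested-at-scale sketch rather than a finished solution, is to count each CDX file in its own worker process and merge the per-file totals. The function names and the `executor_cls` parameter are assumptions of mine; `ProcessPoolExecutor` and `Counter.update` are standard library.

```python
import gzip
from collections import Counter
from concurrent.futures import ProcessPoolExecutor
from urllib.parse import urlparse

def count_domains(path):
    """Count URLs per domain in a single gzipped CDX file."""
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as gzipped:
        for line in gzipped:
            parts = line.split(" ")
            # Skip lines too short to hold a URL at index 2.
            if len(parts) > 2:
                counts[urlparse(parts[2]).netloc] += 1
    return counts

def count_domains_parallel(paths, executor_cls=ProcessPoolExecutor):
    """Count each file in its own worker, then merge the per-file totals."""
    total = Counter()
    with executor_cls() as pool:
        for counts in pool.map(count_domains, paths):
            # Counter.update adds counts together rather than replacing them.
            total.update(counts)
    return total
```

You would call `count_domains_parallel` with a list of the CDX files you have downloaded and then use `most_common(10)` on the result. On platforms that start worker processes by spawning, the call needs to sit under an `if __name__ == "__main__":` guard. This only parallelises across files; a single gzip stream still has to be read sequentially.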
## Conclusion
We've looked into how to dynamically unzip and examine a CDX file in order to understand which domains host the most content. The next step is to identify which WARC files are of interest and request access to them from the Web Archive.
[1]: https://www.webarchive.org.uk/ukwa/
[2]: http://commoncrawl.org/2014/04/navigating-the-warc-file-format/
[3]: http://data.webarchive.org.uk/opendata/ukwa.ds.2/cdx/
[4]: https://docs.python.org/3.6/library/gzip.html
[5]: https://docs.python.org/3/library/subprocess.html
[6]: https://docs.python.org/3/library/io.html#io.TextIOWrapper
[7]: https://docs.python.org/3/library/io.html#io.BufferedReader
[8]: https://docs.python.org/3/library/urllib.html
[9]: https://docs.python.org/3/library/urllib.parse.html