<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge"><title>Exploring Web Archive Data CDX Files - Brainsteam</title><meta name="viewport" content="width=device-width, initial-scale=1">
<meta itemprop="name" content="Exploring Web Archive Data CDX Files">
<meta itemprop="description" content="I have recently been working in partnership with UK Web Archive in order to identify and parse large amounts of historic news data for an NLP task that I will blog about in the future. The NLP portion of this task will surely present its own challenges, but for now there is the small matter of identifying news data amongst the noise of 60TB of web archive dumps of the rest of the ."><meta itemprop="datePublished" content="2017-06-05T07:24:22&#43;00:00" />
<meta itemprop="dateModified" content="2017-06-05T07:24:22&#43;00:00" />
<meta itemprop="wordCount" content="899">
<meta itemprop="keywords" content="cdx,python,webarchive," /><meta property="og:title" content="Exploring Web Archive Data CDX Files" />
<meta property="og:description" content="I have recently been working in partnership with UK Web Archive in order to identify and parse large amounts of historic news data for an NLP task that I will blog about in the future. The NLP portion of this task will surely present its own challenges, but for now there is the small matter of identifying news data amongst the noise of 60TB of web archive dumps of the rest of the ." />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://brainsteam.co.uk/2017/06/05/exploring-web-archive-data-cdx-files/" /><meta property="article:section" content="posts" />
<meta property="article:published_time" content="2017-06-05T07:24:22&#43;00:00" />
<meta property="article:modified_time" content="2017-06-05T07:24:22&#43;00:00" />
<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Exploring Web Archive Data CDX Files"/>
<meta name="twitter:description" content="I have recently been working in partnership with UK Web Archive in order to identify and parse large amounts of historic news data for an NLP task that I will blog about in the future. The NLP portion of this task will surely present its own challenges, but for now there is the small matter of identifying news data amongst the noise of 60TB of web archive dumps of the rest of the ."/>
<link href='https://fonts.googleapis.com/css?family=Playfair+Display:700' rel='stylesheet' type='text/css'>
<link rel="stylesheet" type="text/css" media="screen" href="https://brainsteam.co.uk/css/normalize.css" />
<link rel="stylesheet" type="text/css" media="screen" href="https://brainsteam.co.uk/css/main.css" />
<link id="dark-scheme" rel="stylesheet" type="text/css" href="https://brainsteam.co.uk/css/dark.css" />
<script src="https://brainsteam.co.uk/js/feather.min.js"></script>
<script src="https://brainsteam.co.uk/js/main.js"></script>
</head>
<body>
<div class="container wrapper">
<div class="header">
<div class="avatar">
<a href="https://brainsteam.co.uk/">
<img src="/images/avatar.png" alt="Brainsteam" />
</a>
</div>
<h1 class="site-title"><a href="https://brainsteam.co.uk/">Brainsteam</a></h1>
<div class="site-description"><p>The irregular mental expulsions of a PhD student and CTO of Filament, my views are my own and do not represent my employers in any way.</p><nav class="nav social">
<ul class="flat"><li><a href="https://twitter.com/jamesravey/" title="Twitter" rel="me"><i data-feather="twitter"></i></a></li><li><a href="https://github.com/ravenscroftj" title="Github" rel="me"><i data-feather="github"></i></a></li><li><a href="/index.xml" title="RSS" rel="me"><i data-feather="rss"></i></a></li></ul>
</nav></div>
<nav class="nav">
<ul class="flat">
<li>
<a href="/">Home</a>
</li>
<li>
<a href="/tags">Tags</a>
</li>
<li>
<a href="https://jamesravey.me">About Me</a>
</li>
</ul>
</nav>
</div>
<div class="post">
<div class="post-header">
<div class="meta">
<div class="date">
<span class="day">05</span>
<span class="rest">Jun 2017</span>
</div>
</div>
<div class="matter">
<h1 class="title">Exploring Web Archive Data CDX Files</h1>
</div>
</div>
<div class="markdown">
<p>I have recently been working in partnership with the <a href="https://www.webarchive.org.uk/ukwa/">UK Web Archive</a> in order to identify and parse large amounts of historic news data for an NLP task that I will blog about in the future. The NLP portion of this task will surely present its own challenges, but for now there is the small matter of identifying news data amongst the noise of 60TB of <a href="https://www.webarchive.org.uk/ukwa/">web archive dumps of the rest of the .UK top level domain.</a></p>
<h2 id="warc-and-cdx-files">WARC and CDX Files</h2>
<p>The Web Archive project has produced standardized file formats for describing historic web resources in a compressed archive. A website is scraped and its content is stored chronologically in a <a href="http://commoncrawl.org/2014/04/navigating-the-warc-file-format/">WARC</a> file. A CDX index file is also produced, describing every URL scraped, the time at which it was retrieved and which WARC file the content is stored in, along with some other metadata.</p>
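<p>To make this concrete, here is a rough sketch of how one line of a CDX index might be broken into the fields we care about. The column positions here are an assumption based on the common space-separated CDX layout (the original URL in the third column, the WARC filename in the last one); the exact ordering is described in the Web Archive&rsquo;s CDX documentation linked further down.</p>
<pre lang="python"># A rough sketch: split one CDX line into the fields we care about.
# The column positions are an assumption about the layout - check the
# CDX field documentation for the exact ordering.
def parse_cdx_line(line):
    parts = line.strip().split(" ")
    return {
        "timestamp": parts[1],   # when the page was retrieved
        "url": parts[2],         # the original URL that was scraped
        "warc_file": parts[-1],  # which WARC file holds the content
    }
</pre>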
<p>Our first task is to identify news content so that we can narrow our search down to a subset of WARC files (and avoid filling 60TB of storage or traversing that amount of data). The CDX files allow us to do this. They are available for <a href="http://data.webarchive.org.uk/opendata/ukwa.ds.2/cdx/">free download from the Web Archive website</a> and are compressed using Gzip down to around 10-20GB per file. If you try to expand one of these files locally, you're looking at 60-120GB of uncompressed data, which is a great way to fill up your hard drive.</p>
<h2 id="processing-huge-gzip-files">Processing Huge Gzip Files</h2>
<p>Ideally we want to explore these files without having to uncompress them explicitly. This is possible using Python 3's gzip module, but it took me a long time to find the right options.</p>
<p>Python file I/O typically allows you to read a file in line by line. If you have a text file, you can iterate over its lines using something like the following:</p>
<pre lang="python">with open("my_text_file.txt", "r") as f:
for line in f:
print(line)
</pre>
<p>Now clearly, trying this approach with a .gz file isn't going to work. Using the <a href="https://docs.python.org/3.6/library/gzip.html">gzip</a> module we can open and uncompress a .gz file as a stream, examining parts of the file in memory and discarding data that we've already seen. This is the most efficient way of dealing with a file of this magnitude, which won't fit into RAM on a modern machine and would fill a hard drive uncompressed.</p>
<p>I tried a number of approaches using the gzip library, trying to run the gzip command line utility using <a href="https://docs.python.org/3/library/subprocess.html">subprocess</a> and combinations of <a href="https://docs.python.org/3/library/io.html#io.TextIOWrapper">TextIOWrapper</a> and <a href="https://docs.python.org/3/library/io.html#io.BufferedReader">BufferedReader</a> but to no avail.</p>
<h2 id="the-solution">The Solution</h2>
<p>The solution is actually incredibly simple in Python 3, and I wasn't far off the money with <a href="https://docs.python.org/3/library/io.html#io.TextIOWrapper">TextIOWrapper.</a> The gzip library offers a file mode flag for accessing gzipped text in a buffered, line-by-line fashion, just as above for the uncompressed text file. Simply passing “rt” to the gzip.open() function will wrap the input stream from Gzip in a TextIOWrapper and allow you to read the file line by line.</p>
<pre lang="python">import gzip
with gzip.open("2012.cdx.gz","rt") as gzipped:
    for i,line in enumerate(gzipped):
print(line)
# stop this thing running off and printing the whole file.
if i == 10:
break</pre>
<p>If you're using an older version of Python (2.7 for example), or you would prefer to see what's going on beneath the covers here explicitly, you can also use the following code:</p>
<pre lang="python">import io
import gzip
with io.TextIOWrapper(gzip.open("2012.cdx.gz","r")) as gzipped:
for i,line in enumerate(gzipped):
print(line)
# stop this thing running off and printing the whole file.
if i == 10:
break</pre>
<p>And it's as simple as that. You can now start to break down each line in the file using tools like <a href="https://docs.python.org/3/library/urllib.html">urllib</a> to identify content stored in the archive from domains of interest, as sketched below.</p>
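<p>As a minimal illustration, the following sketch filters the index down to lines whose domain appears in a small set of news sites. The domain names here are purely hypothetical placeholders; in practice the list would come from whichever news sources you actually care about:</p>
<pre lang="python">import gzip
from urllib.parse import urlparse

# Hypothetical set of news domains - purely illustrative placeholders.
news_domains = {"www.bbc.co.uk", "www.telegraph.co.uk"}

with gzip.open("2012.cdx.gz", "rt") as gzipped:
    for line in gzipped:
        parts = line.split(" ")
        # the original URL sits in the third column of the CDX line
        url = urlparse(parts[2])
        if url.netloc in news_domains:
            print(line)
</pre>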
<h2 id="solving-a-problem">Solving a problem</h2>
<p>We may want to understand how much content is available in the archive for a given domain. To put this another way, which domains have the most pages stored in the web archive? To answer this, we can run a simple script that parses all of the URLs, examines each domain name and counts instances of each.</p>
<pre lang="python">import gzip
from collections import Counter
from urllib.parse import urlparse

# counter of pages seen per domain
urlcounter = Counter()

with gzip.open("2012.cdx.gz", "rt") as gzipped:
    for line in gzipped:
        parts = line.split(" ")
        urlbits = urlparse(parts[2])
        urlcounter[urlbits.netloc] += 1

# at the end we print out the top 10 domains
print(urlcounter.most_common(10))</pre>
<p>Just to quickly explain what is going on here:</p>
<ol>
<li>We load up the CDX file in compressed text mode as described above</li>
<li>We split each line using space characters. This gives us an array of fields, the order and content of which are described by the Web Archive team <a href="http://data.webarchive.org.uk/opendata/ukwa.ds.2/cdx/">here.</a></li>
<li>We parse the URL (which is at index 2) using the <a href="https://docs.python.org/3/library/urllib.parse.html">urlparse</a> function, which will break the URL up into things like domain, protocol (HTTP/HTTPS), path, query and fragment.</li>
<li>We increment the counter for the current domain (found in the netloc field of the parsed URL).</li>
<li>After iterating we print out the domains with the most URLs in the CDX file.</li>
</ol>
<p>This will take a long time to complete since each CDX file expands to tens of gigabytes of text. I intend to investigate parallel processing of these CDX files as a next step; a rough sketch of one possible approach follows.</p>
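<p>As a rough illustration (not a settled approach), the sketch below farms each yearly CDX file out to a separate worker process using Python's multiprocessing module and merges the per-file counts at the end. The file names are placeholders and the per-file counting is just the loop from above wrapped in a function:</p>
<pre lang="python">import gzip
from collections import Counter
from multiprocessing import Pool
from urllib.parse import urlparse

def count_domains(cdx_path):
    """Count pages per domain in a single gzipped CDX file."""
    counts = Counter()
    with gzip.open(cdx_path, "rt") as gzipped:
        for line in gzipped:
            parts = line.split(" ")
            counts[urlparse(parts[2]).netloc] += 1
    return counts

if __name__ == "__main__":
    # Placeholder file names - one CDX file per year in this sketch.
    cdx_files = ["2011.cdx.gz", "2012.cdx.gz", "2013.cdx.gz"]

    # process each file in its own worker process
    with Pool() as pool:
        per_file_counts = pool.map(count_domains, cdx_files)

    # merge the per-file counters and print the overall top 10 domains
    total = Counter()
    for counts in per_file_counts:
        total.update(counts)

    print(total.most_common(10))
</pre>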
<h2 id="conclusion">Conclusion</h2>
<p>We've looked at how to dynamically unzip and examine a CDX file in order to understand which domains host the most content. The next step is to identify which WARC files are of interest and request access to them from the Web Archive.</p>
</div>
<div class="tags">
<ul class="flat">
<li><a href="/tags/cdx">cdx</a></li>
<li><a href="/tags/python">python</a></li>
<li><a href="/tags/webarchive">webarchive</a></li>
</ul>
</div><div id="disqus_thread"></div>
<script type="text/javascript">
(function () {
if (window.location.hostname == "localhost")
return;
var dsq = document.createElement('script'); dsq.type = 'text/javascript'; dsq.async = true;
var disqus_shortname = 'brainsteam';
dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
(document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
})();
</script>
<noscript>Please enable JavaScript to view the comments powered by Disqus.</noscript>
<a href="http://disqus.com/" class="dsq-brlink">comments powered by <span class="logo-disqus">Disqus</span></a>
</div>
</div>
<div class="footer wrapper">
<nav class="nav">
<div>2021 © James Ravenscroft | <a href="https://github.com/knadh/hugo-ink">Ink</a> theme on <a href="https://gohugo.io">Hugo</a></div>
</nav>
</div>
<script type="application/javascript">
var doNotTrack = false;
if (!doNotTrack) {
window.ga=window.ga||function(){(ga.q=ga.q||[]).push(arguments)};ga.l=+new Date;
ga('create', 'UA-186263385-1', 'auto');
ga('send', 'pageview');
}
</script>
<script async src='https://www.google-analytics.com/analytics.js'></script>
<script>feather.replace()</script>
</body>
</html>