<!DOCTYPE html>
<html>
<head>
	<meta charset="utf-8" />
	<meta http-equiv="X-UA-Compatible" content="IE=edge"><title>Programmatically Downloading Open Access Papers - Brainsteam</title><meta name="viewport" content="width=device-width, initial-scale=1">
	<meta itemprop="name" content="Programmatically Downloading Open Access Papers">
<meta itemprop="description" content="(Cover image “Unlocked” by Sean Hobson)
If you’re an academic or you’ve got an interest in reading scientific papers, you’ve probably run into paywalls that demand tens or even hundreds of £ just to read a scientific paper. It’s ok if you’re affiliated with a university that has access to that journal but it can sometimes be luck of the draw as to whether your institute has access and even if they do, sometimes the SAML login processes don’t work and you still can’t see the paper."><meta itemprop="datePublished" content="2018-04-13T16:04:47&#43;00:00" />
<meta itemprop="dateModified" content="2018-04-13T16:04:47&#43;00:00" />
<meta itemprop="wordCount" content="357">
<meta itemprop="keywords" content="open access,scientific papers,unpaywall," /><meta property="og:title" content="Programmatically Downloading Open Access Papers" />
<meta property="og:description" content="(Cover image “Unlocked” by Sean Hobson)
If you’re an academic or you’ve got an interest in reading scientific papers, you’ve probably run into paywalls that demand tens or even hundreds of £ just to read a scientific paper. It’s ok if you’re affiliated with a university that has access to that journal but it can sometimes be luck of the draw as to whether your institute has access and even if they do, sometimes the SAML login processes don’t work and you still can’t see the paper." />
<meta property="og:type" content="article" />
<meta property="og:url" content="https://brainsteam.co.uk/2018/04/13/programmatically-downloading-open-access-papers/" /><meta property="article:section" content="posts" />
<meta property="article:published_time" content="2018-04-13T16:04:47&#43;00:00" />
<meta property="article:modified_time" content="2018-04-13T16:04:47&#43;00:00" />

<meta name="twitter:card" content="summary"/>
<meta name="twitter:title" content="Programmatically Downloading Open Access Papers"/>
<meta name="twitter:description" content="(Cover image “Unlocked” by Sean Hobson)
If you’re an academic or you’ve got an interest in reading scientific papers, you’ve probably run into paywalls that demand tens or even hundreds of £ just to read a scientific paper. It’s ok if you’re affiliated with a university that has access to that journal but it can sometimes be luck of the draw as to whether your institute has access and even if they do, sometimes the SAML login processes don’t work and you still can’t see the paper."/>
<link href='https://fonts.googleapis.com/css?family=Playfair+Display:700' rel='stylesheet' type='text/css'>
	<link rel="stylesheet" type="text/css" media="screen" href="https://brainsteam.co.uk/css/normalize.css" />
	<link rel="stylesheet" type="text/css" media="screen" href="https://brainsteam.co.uk/css/main.css" />

        <link id="dark-scheme" rel="stylesheet" type="text/css" href="https://brainsteam.co.uk/css/dark.css" />

	<script src="https://brainsteam.co.uk/js/feather.min.js"></script>
	
		<script src="https://brainsteam.co.uk/js/main.js"></script>
</head>

<body>
	<div class="container wrapper">
		<div class="header">
    
    <div class="avatar">
        <a href="https://brainsteam.co.uk/">
            <img src="/images/avatar.png" alt="Brainsteam" />
        </a>
    </div>
    
    <h1 class="site-title"><a href="https://brainsteam.co.uk/">Brainsteam</a></h1>
    <div class="site-description"><p>The irregular mental expulsions of a PhD student and CTO of Filament. My views are my own and do not represent my employers in any way.</p><nav class="nav social">
            <ul class="flat"><li><a href="https://twitter.com/jamesravey/" title="Twitter" rel="me"><i data-feather="twitter"></i></a></li><li><a href="https://github.com/ravenscroftj" title="Github" rel="me"><i data-feather="github"></i></a></li><li><a href="/index.xml" title="RSS" rel="me"><i data-feather="rss"></i></a></li></ul>
        </nav></div>

	<nav class="nav">
		<ul class="flat">
			
			<li>
				<a href="/">Home</a>
			</li>
			
			<li>
				<a href="/tags">Tags</a>
			</li>
			
			<li>
				<a href="https://jamesravey.me">About Me</a>
			</li>
			
		</ul>
	</nav>
</div>

		<div class="post">
			<div class="post-header">
				
					<div class="meta">
						<div class="date">
							<span class="day">13</span>
							<span class="rest">Apr 2018</span>
						</div>
					</div>
				
				<div class="matter">
					<h1 class="title">Programmatically Downloading Open Access Papers</h1>
				</div>
			</div>
					
			<div class="markdown">
				<p><em><a href="https://www.flickr.com/photos/seanhobson/6216334720/in/photolist-atjkJQ-QuYgDA-cb9bGo-4o84DP-9GAeQ5-5dopRY-hyQV19-ngTMst-4rRwgg-qQr5Sy-e4XhCg-mQJpZ-6ttPLT-6zQxh2-dsE6bM-qQcUxd-6msKYB-4HRo5J-8W2ryV-4B5rRC-xj9C8-2V5HKa-7zS5wE-Ldsdy-bwMFxR-nibhxt-5mKLS5-5m2URM-7CsC9C-4nJ5jt-a4mQik-6GPYgf-cb9c8s-363XxR-8R4jGd-4qHxrv-T4A8wx-T1NyJG-4tR45P-f5bde-4tV62J-cDEZ9L-Te2m9S-NLeKd-orGJh5-4j53Za-T4Abnn-fqPY88-T1NwPE-7deVVp" target="_blank" rel="noopener">(Cover image “Unlocked” by Sean Hobson)</a></em></p>
<p>If you’re an academic or you’ve got an interest in reading scientific papers, you’ve probably run into paywalls that demand tens or even hundreds of pounds to read a single paper. That’s fine if you’re affiliated with a university that subscribes to the journal, but it can be the luck of the draw as to whether your institution has access, and even if it does, sometimes the SAML login process doesn’t work and you still can’t see the paper. Thankfully, the guys at <a href="http://unpaywall.org/">Unpaywall</a> (actually built by <a href="http://impactstory.org/">Impact Story</a>) have been doing a fantastic job of making open access papers much more easily available to interested academics in the browser. If you end up at a publisher paywall and Unpaywall know about a legitimate free copy of the paper you’re trying to read, they’ll link you straight to it for direct download. Problem solved.</p>
<p>For me, as someone interested in text mining on large volumes of scientific papers, getting hold of high-quality, peer-reviewed open access papers that I can analyse can be a pain. I previously wrote about <a href="https://papro.org.uk/2013/02/26/plosget-py/">downloading batches of papers from PLOS One</a> for data mining purposes, but I’m currently interested in downloading papers that get mentioned and linked to in the news. That can sometimes include PLOS journals, but it also includes many other publishers, both open access and closed. Thankfully, Unpaywall come to the rescue again.</p>
<p>Unpaywall.org provide a free API that takes in a DOI and spits out any and all known free versions of that paper. That makes my life a lot easier: all I have to do is find a long list of DOIs that I’m interested in analysing and run them through the API.</p>
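<p>A lookup like this can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes the v2 REST endpoint at <code>api.unpaywall.org</code>, which expects a contact email as a query parameter, and it assumes the response fields <code>oa_locations</code> and <code>url_for_pdf</code>. The function names <code>build_url</code>, <code>free_pdf_urls</code> and <code>lookup</code> are made up for this example.</p>

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL of the Unpaywall v2 REST API.
UNPAYWALL_API = "https://api.unpaywall.org/v2/"


def build_url(doi, email):
    """Build the lookup URL for one DOI; the email is sent as a query parameter."""
    query = urllib.parse.urlencode({"email": email})
    return UNPAYWALL_API + urllib.parse.quote(doi) + "?" + query


def free_pdf_urls(record):
    """Collect direct PDF links from a decoded response (field names are assumptions)."""
    locations = record.get("oa_locations") or []
    return [loc["url_for_pdf"] for loc in locations if loc.get("url_for_pdf")]


def lookup(doi, email, timeout=10):
    """Fetch the record for a DOI over the network and return its free PDF links."""
    with urllib.request.urlopen(build_url(doi, email), timeout=timeout) as resp:
        return free_pdf_urls(json.load(resp))
```

<p>Keeping the URL building and the response parsing in separate functions means both can be checked without touching the network, and only <code>lookup</code> needs a live connection.</p>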
<p>I’ve provided a gist of the Python function I’ve written that wraps this API. I’ve been using it in a Jupyter notebook (which I’m not ready to publish just yet). Feel free to use it in your project. It might save you an hour or two of development time (it took me a while to work out which errors I needed to catch).</p>
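<p>To give a flavour of the kind of errors involved, here is a minimal sketch of one way to handle them. It is not the gist itself. It assumes the lookup is done with Python’s standard <code>urllib</code>, and that an unknown DOI comes back as an HTTP error such as a 404. The helper name <code>safe_lookup</code> and the <code>fetch</code> callable it takes are hypothetical.</p>

```python
import json
import socket
import urllib.error


def safe_lookup(doi, fetch):
    """Call fetch(doi) and return (result, None), or (None, reason) on failure.

    fetch is any callable that performs the actual API request; it is a
    hypothetical hook so the error handling can be tried without a network.
    """
    try:
        return fetch(doi), None
    except urllib.error.HTTPError as err:
        # Assumption: a DOI the service does not know about surfaces here (e.g. 404).
        # HTTPError is a subclass of URLError, so it has to be caught first.
        return None, "http %d" % err.code
    except urllib.error.URLError as err:
        # DNS failures, refused connections and similar network problems.
        return None, "network: %s" % err.reason
    except socket.timeout:
        # The server accepted the connection but did not answer in time.
        return None, "timeout"
    except json.JSONDecodeError:
        # The body was not valid JSON (for example an HTML error page).
        return None, "bad json"
```

<p>Returning a reason instead of raising lets a batch job over a long list of DOIs carry on and record which ones failed and why.</p>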

			</div>

			<div class="tags">
				
					
						<ul class="flat">
							
							<li><a href="/tags/open-access">open access</a></li>
							
							<li><a href="/tags/scientific-papers">scientific papers</a></li>
							
							<li><a href="/tags/unpaywall">unpaywall</a></li>
							
						</ul>
					
				
			</div><div id="disqus_thread"></div>
<script type="text/javascript">
	(function () {
		
		
		if (window.location.hostname == "localhost")
			return;

		var dsq = document.createElement('script'); dsq.type = 'text/javascript'; dsq.async = true;
		var disqus_shortname = 'brainsteam';
		dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
		(document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
	})();
</script>
<noscript>Please enable JavaScript to view the comments.</noscript>
<a href="http://disqus.com/" class="dsq-brlink">comments powered by <span class="logo-disqus">Disqus</span></a>
</div>
	</div>
	<div class="footer wrapper">
	<nav class="nav">
		<div>2021  © James Ravenscroft 2020 |  <a href="https://github.com/knadh/hugo-ink">Ink</a> theme on <a href="https://gohugo.io">Hugo</a></div>
	</nav>
</div>


<script type="application/javascript">
var doNotTrack = false;
if (!doNotTrack) {
	window.ga=window.ga||function(){(ga.q=ga.q||[]).push(arguments)};ga.l=+new Date;
	ga('create', 'UA-186263385-1', 'auto');
	
	ga('send', 'pageview');
}
</script>
<script async src='https://www.google-analytics.com/analytics.js'></script>
<script>feather.replace()</script>
</body>
</html>