#1 Brewer

Scraping Tweets using Python and Flask

Posted 19 June 2011 - 02:50 PM


One thing that Python is really good at doing is scraping information from web pages. In fact, most scripting languages can do this with relative ease. In this tutorial I will show you how to scrape tweets from any account using Python and the Flask micro-framework.

This can be done with any templating system; it just so happens that Flask comes with a built-in templating system called Jinja2. The reason we want to use templates is so that we can write a single block of code and have it show all of our tweets. You could create a page with all of your tweets using only HTML and CSS, but then you would have to hard-code everything, and let's be honest, none of us want to do that.

Also, I would like you to note that all of the commands I use in this tutorial are Linux/Ubuntu commands. While some MIGHT exist in Windows/OS X, the majority will not and it will be up to you to find an equivalent.

So, without further ado, let's get hacking!

Setting Up Your Flask Project


If you aren't using Flask then there is no reason for you to read the rest of this section, so feel free to skip ahead. Personally, I think Flask is a great framework, so I would recommend that you read this anyway.

If you're familiar with Django then you'll know that you should get started by going to the command line and typing in django-admin.py startproject <project name>. Now that's great and all, but personally I think it's annoying to remember. In Flask, all you have to do is create a folder using the mkdir command. I prefer to put all of my projects in a folder called 'workspace', but this is not by any means mandatory.

After you create the main folder, you'll need to create two more folders, 'static' and 'templates'. The static folder will be used to hold things like your CSS and JavaScript files and the templates folder is where you'll save your templates (obviously).

Finally, you'll need to create a Python file. The name doesn't really matter, although most people use <ProjectName>.py. This is where you'll put all of the views and other server-side code. Those familiar with Linux will know to use the touch <filename> command to do this from the command line.

Here are the commands I used throughout this section:

james@ubuntu:~$ cd workspace
james@ubuntu:~/workspace$ mkdir TweetScraper
james@ubuntu:~/workspace$ cd TweetScraper
james@ubuntu:~/workspace/TweetScraper$ mkdir static templates
james@ubuntu:~/workspace/TweetScraper$ touch TweetScraper.py


TweetScraper.py


The first thing I like to do when I start a new project is figure out everything that I will need to import, and go ahead and get that out of the way. For this particular project we will need to import 5 things.

First off, we'll need to import Flask, render_template, and url_for from the flask module. Flask is the class we instantiate to create our application object. render_template will be used in views to do exactly what you might guess: render a template. Lastly, url_for builds URLs for us; we will use it mostly to include our CSS file in our templates. It has more uses, but that's all we need for this project.

Next, we need to import urlopen from the urllib module (this is where it lives in Python 2; in Python 3 it moved to urllib.request). This will be used to open a JSON file that lists the tweets made by any given member.

Finally, we need to import the json module. We'll use this to extract data from the JSON file that we opened using urlopen.

Type this into TweetScraper.py:

from flask import Flask, render_template, url_for
from urllib import urlopen
import json


After we take care of the imports we need to create our Flask application. To do this we simply add this to our TweetScraper.py:

DEBUG = True
TweetScraper = Flask(__name__)
TweetScraper.config.from_object(__name__)


So let's explain this line by line. DEBUG = True tells Flask that we are in debug mode. The first thing I want to make clear is that if your app goes public, then you need to turn off debug mode by setting DEBUG = False. If you forget to do this then your users will be able to see full error tracebacks and even run code on your server through the interactive debugger, which is never good. Debug mode shows you error messages that you normally wouldn't be able to see. It makes it a lot easier to figure out what is going wrong and I highly recommend using it while developing.

TweetScraper = Flask(__name__) creates the Flask application object, which gives TweetScraper all of the methods available to a Flask project, such as route() and run(). TweetScraper.config.from_object(__name__) loads configuration from this module: it picks up any upper-case variables, such as DEBUG, that have been declared already. Debug mode is only one thing that you can configure. There are more, such as database information, but we don't need to get into all of that right now.
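
To make the configuration step a bit more concrete, here is a minimal sketch of how from_object picks up settings. It assumes a made-up ExampleConfig class and an app called demo_app (neither is part of this project) instead of the module-level DEBUG used above. Only upper-case names end up in the config.

```python
from flask import Flask


class ExampleConfig(object):
    """Hypothetical settings holder, used here instead of the module itself."""
    DEBUG = True            # upper-case: copied into the config
    scratch_value = "nope"  # lower-case: ignored by from_object


demo_app = Flask(__name__)
demo_app.config.from_object(ExampleConfig)

debug_enabled = demo_app.config["DEBUG"]
has_scratch = "scratch_value" in demo_app.config
```

Here debug_enabled comes out True, while the lower-case name never makes it into the config.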

The next thing we need to do is create a view that will define what shows up when a user visits the front page of our site.

@TweetScraper.route('/')
def home():

	return render_template('index.html')


The argument that we pass to route() is the URL path that we want this view to respond to. In this case, we want this to be the front page, so we simply put a forward slash.

The next line is def home():. The function name is up to you, as long as each view has a unique name; home() is just a sensible choice for the front page.

For the time being, the only thing we want to return is the template, without any variables. So to do that we use the return I've provided. This will render a web page using the index.html template, which we'll create later on. Once we finish putting in the rest of the code we will add more arguments to include the variables that we'll use in our template.
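
If you'd like to check a view without opening a browser, Flask's test client can do that. The following is only a sketch: demo_app and front_page are names I made up for illustration, and the view returns a plain string rather than a template so that it runs without any template files.

```python
from flask import Flask

demo_app = Flask(__name__)


@demo_app.route('/')
def front_page():
    # View function for the front page; returns plain text for this demo.
    return 'Hello from the front page'


# The test client sends a fake request without starting a server.
client = demo_app.test_client()
response = client.get('/')
body = response.get_data(as_text=True)

# A path with no matching route produces a 404 response.
missing = client.get('/nope')
```

The path passed to route() decides which requests reach the view; anything else gets a 404.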

Now let's actually do something cool.

Scraping the Tweets


So if you remember, we imported both urlopen and json, which we haven't used yet. Time to change that. The following code goes inside home(), above the return statement.

username = "jamsbrewr"
timelineURL = "http://api.twitter.com/1/statuses/user_timeline/" + username + ".json"

content = urlopen(timelineURL)
data = json.load(content)

tweets = [ dict(tweet = tweet["text"],
		  tweet_id = tweet["id"]) for tweet in data ]


The first thing we do here is create a variable to hold our Twitter username. My username is jamsbrewr, so that is the value I assigned to the variable. After this we create another variable that holds the URL of the JSON file for whatever username we decided to use.

After that we use urlopen to open the url provided by timelineURL. Then we use json to load all of the information in that file.
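
json.load reads from any file-like object, which is why it can take whatever urlopen hands back. Here is a rough sketch that runs offline: an io.StringIO object stands in for the network response, the payload is made up, and build_timeline_url is a hypothetical helper that just repeats the string concatenation from above.

```python
import io
import json


def build_timeline_url(username):
    """Hypothetical helper: same string concatenation as in the tutorial."""
    return "http://api.twitter.com/1/statuses/user_timeline/" + username + ".json"


# Made-up payload standing in for what urlopen(timelineURL) would return.
fake_response = io.StringIO(u'[{"id": 1, "text": "first"}, {"id": 2, "text": "second"}]')

timeline_url = build_timeline_url("jamsbrewr")
data = json.load(fake_response)  # parses the JSON array into a list of dicts
```

After this, data is an ordinary Python list, so you can loop over it or index into it like any other list.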

The last thing we need to do is to extract the information for each tweet in the JSON file. I find that the best way to do this is to build one dictionary per tweet, holding its text and its id, and collect them in a list, so that's what we will do. We'll store this list in a variable called tweets.
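
To see what the list comprehension produces, here is a small sketch using a made-up sample in the same shape as the timeline data. The extract_tweets function is my own wrapper for illustration, and real tweets carry many more fields than the ones shown.

```python
import json

# Made-up sample in the same shape as the timeline JSON.
sample_json = '[{"id": 101, "text": "Hello"}, {"id": 102, "text": "World", "source": "web"}]'


def extract_tweets(data):
    """Keep only the text and id of each tweet, dropping every other field."""
    return [dict(tweet=item["text"], tweet_id=item["id"]) for item in data]


tweets = extract_tweets(json.loads(sample_json))
```

Each entry in tweets is a dictionary with exactly two keys, tweet and tweet_id, which is all the template needs.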

Now, let's consider which of this information we will want to use in our template. Obviously we'll want to use 'tweets' as it holds the tweet itself and the id number for the tweet. Also, I think we could put 'username' to use somewhere, so we'll take that too.

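Now that we've decided which variables we want to use, we can alter our return statement to pass these two to the template as keyword arguments: return render_template('index.html', tweets=tweets, username=username).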

Finally, the very last thing we need to do is add one more bit of code.

if __name__ == "__main__":
	TweetScraper.run()


This tells Python that if we run TweetScraper.py directly, then it should call Flask's run() method. However, if TweetScraper.py is imported by another Python file, then we don't want to use run().
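
The guard works because Python sets __name__ to "__main__" only in the file you run directly; an imported file gets its own module name instead. Here is a tiny sketch of that decision, with start_server as a hypothetical stand-in for TweetScraper.run() and the module name passed in as an argument so both cases can be tried.

```python
def start_server():
    """Hypothetical stand-in for TweetScraper.run()."""
    return "server started"


def run_if_main(module_name):
    """Mimic the guard: act only when the module is the entry point."""
    if module_name == "__main__":
        return start_server()
    return None


direct = run_if_main("__main__")        # as if run with: python TweetScraper.py
imported = run_if_main("TweetScraper")  # as if imported from another file
```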

After all of that, here is our TweetScraper.py file:

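from flask import Flask, render_template, url_for
from urllib import urlopen

import json

DEBUG = True

TweetScraper = Flask(__name__)
TweetScraper.config.from_object(__name__)

@TweetScraper.route('/')
def home():
	# Fetch the timeline for this user as JSON
	username = "jamsbrewr"
	timelineURL = "http://api.twitter.com/1/statuses/user_timeline/" + username + ".json"

	content = urlopen(timelineURL)
	data = json.load(content)

	# Keep only the text and id of each tweet
	tweets = [dict(tweet=tweet["text"], tweet_id=tweet["id"]) for tweet in data]

	return render_template('index.html', tweets=tweets, username=username)

if __name__ == "__main__":
	TweetScraper.run()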


index.html


Seeing as this isn't a tutorial on HTML and CSS, these next two sections will be rather short. All I want to do is give you a quick idea of how the templating system works in Flask. To start off, we need to create a file called index.html in the templates folder we made earlier. Easy enough:

james@ubuntu:~/workspace/TweetScraper$ cd templates
james@ubuntu:~/workspace/TweetScraper/templates$ touch index.html


Here's the code that we'll be using for this file. Most of it is pretty self-explanatory, but I'll cover some of it.

<!DOCTYPE HTML>

<html>

	<head>
		<title>TweetScraper</title>
		<link rel="stylesheet" type="text/css" href="{{ url_for('static', filename='style.css') }}" />
	</head>

	<body>
		<div id="container">
			{% for tweet in tweets %}
				<p id="tweet">
					{{ tweet.tweet }} <br />
					<span>Posted by <a href="http://www.twitter.com/{{ username }}" target="_blank">{{ username }}</a></span><br />
				</p>
			{% endfor %}
		</div>
	</body>

</html>


There are only a couple of things that we need to focus on for the purposes of this tutorial: pretty much everything wrapped in {% %} or {{ }} tags. The {% %} delimiters are used in a template where one wants to make a statement, usually assigning a value to a variable or, in our case, using a for loop. The {{ }} delimiters are used to print the result of an expression to the template.
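
You can try the two kinds of delimiters outside of Flask by using the jinja2 package directly. This is just a sketch with a one-line template and made-up tweets; it assumes jinja2 is importable, which it should be wherever Flask is installed.

```python
from jinja2 import Template

# {% ... %} runs a statement (here a for loop); {{ ... }} prints an expression.
template = Template(
    "{% for tweet in tweets %}[{{ tweet.tweet }}]{% endfor %} by {{ username }}"
)

rendered = template.render(
    tweets=[{"tweet": "first"}, {"tweet": "second"}],
    username="jamsbrewr",
)
```

The loop body is repeated once per tweet, so rendered ends up as "[first][second] by jamsbrewr".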

In the <head></head> tags, you'll notice that the <link> tag we use to include our CSS file looks a bit odd. That's because we're using the url_for function, which Flask makes available inside templates. Basically all this is doing is building the URL for the file named 'style.css' in the folder we called 'static', so that we can include it in our template.
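
If you're curious what url_for actually returns for a static file, you can ask it from Python. This sketch uses a throwaway app called demo_app (not part of this project) and Flask's test_request_context, since url_for needs a request context to work.

```python
from flask import Flask, url_for

demo_app = Flask(__name__)

# url_for needs a request context; test_request_context fakes one.
with demo_app.test_request_context():
    css_url = url_for('static', filename='style.css')
```

With Flask's default settings, css_url is '/static/style.css', which is what ends up in the href attribute.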

Finally, to make things a little less ugly, let's style this shit up. Here's what we'll put in our CSS file, which we'll call style.css and save in our 'static' folder:

body { font-family: Arial; }

#container {
	margin: auto auto; /* Centers div */
	width: 500px;
	background: #000000;
	border: 1px solid #000000;
}

#tweet { background: #CCCCCC; }
#tweet span { font-size: 12px; }


So save that and then when you're ready you can navigate back to the TweetScraper folder and run this command:

james@ubuntu:~/workspace/TweetScraper$ python TweetScraper.py
 * Running on http://127.0.0.1:5000/
 * Restarting with reloader ...


Congratulations, your project is up and running! All you have to do is type 127.0.0.1:5000 into a web browser of your choice and you're done!

I hope this tutorial taught you something useful. Please feel free to ask any questions and let me know what you liked and what you didn't like. If things go well maybe I'll put up some more tutorials later on!

Enjoy!
