Python performance optimization techniques

If you want to optimize the performance of your Python script, you need to be able to analyze it. The best source of information I found on the web is the PerformanceTips page on the Python wiki. We are going to describe two types of performance analysis in Python. The first type uses a stopwatch to time the repeated execution of a specific piece of code. This allows you to change or replace the code and see whether or not that improved the performance. The other works by enabling a profiler that tracks every function call the code makes. These calls can then be related, aggregated and visually represented. This type of profiling allows you to identify which part of your code is taking the most time. We will show how to do both, starting with the stopwatch type.

Simple stopwatch profiling

You can apply basic stopwatch-style profiling using the “timeit” module. It outputs the time (in seconds) that a snippet of code takes to execute the specified number of times; the default is one million executions. You can specify a setup statement that will be executed once and not counted in the execution time, and you can specify the actual statement and the number of times it needs to be executed. You can also specify the timer object if you do not want wall-clock time but, for example, want to measure CPU time.
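For example, here is a minimal sketch of how the setup statement, the repetition count and a custom timer can be passed in; the statement being timed and the 10,000 repetitions are arbitrary choices for illustration only.

import time
import timeit

# Run the statement 10,000 times instead of the default one million.
# Passing time.clock as the timer measures CPU time on Unix under
# Python 2 (in Python 3 you would use time.process_time instead).
cpu_seconds = timeit.timeit(
  'sorted(range(1000))',  # the statement being timed (arbitrary example)
  setup='pass',           # executed once, not counted in the result
  timer=time.clock,
  number=10000,
)
print 'CPU Time: ' + str(cpu_seconds)

The comparison below uses the same module to time three ways of building a single string from a list of parts.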


import timeit

# Sample input; formatMethod below expects at least 11 parts
stringParts = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k']

def lazyMethod(stringParts):
  fullString = ''
  for part in stringParts:
    fullString += part
  return fullString

def formatMethod(stringParts):
  fullString = "%s%s%s%s%s%s%s%s%s%s%s" % (stringParts[0], stringParts[1],
                                           stringParts[2], stringParts[3],
                                           stringParts[4], stringParts[5],
                                           stringParts[6], stringParts[7],
                                           stringParts[8], stringParts[9],
                                           stringParts[10])
  return fullString

def joinMethod(stringParts):
  return ''.join(stringParts)

print 'Join Time: ' + str(timeit.timeit('joinMethod(stringParts)', 'from __main__ import joinMethod, stringParts'))
print 'Format Time: ' + str(timeit.timeit('formatMethod(stringParts)', 'from __main__ import formatMethod, stringParts'))
print 'Lazy Time: ' + str(timeit.timeit('lazyMethod(stringParts)', 'from __main__ import lazyMethod, stringParts'))

The output should be something like this:

Join Time: 0.358200073242
Format Time: 0.646985054016
Lazy Time: 0.792141914368

This shows us that the join method is more efficient in this specific case.

Advanced profiling using cProfile

To identify which parts of an application take the most time, we first need an application. Let us profile a very simple Flask web application. Below is the code of a simple “Hello World” application in Flask. We replaced “app.run()” with “app.test_client().get('/')” so that the application serves only a single request and then exits.

from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
  return "Hello World!"

if __name__ == "__main__":
  #app.run()
  app.test_client().get('/')

Running the application with the profiler enabled can be done from the command line, so there is no need to change the code. The command is:

python -m cProfile -o flask.profile flaskapp.py
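If you just want a quick look at the results before firing up a GUI, the standard library “pstats” module can read the same dump. A minimal sketch (the file name matches the one generated by the command above):

import pstats

# Load the dump written by "python -m cProfile -o flask.profile flaskapp.py"
stats = pstats.Stats('flask.profile')

# Sort by cumulative time and print the ten most expensive calls
stats.strip_dirs().sort_stats('cumulative').print_stats(10)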

Visualizing cProfile results

“RunSnakeRun is a GUI tool by Mike Fletcher which visualizes profile dumps from cProfile using square maps. Function/method calls may be sorted according to various criteria, and source code may be displayed alongside the visualization and call statistics.” – source: the PerformanceTips page on the Python wiki

We can now analyze the generated “flask.profile” file by running the “runsnake” tool with the following command:

runsnake flask.profile

It gave us some really nice insights:


Picture 1: The visual output of RunSnakeRun


Picture 2: The list of function calls shows 77 calls to the regex library (re.py), accounting for only 0.5 ms of the 79 ms total.


Picture 3: A map showing all calls, the rectangle in the upper right (testing.py) is the test client running.

We showed you how to profile your Python application; now go practice and optimize your code. One piece of advice though: go for the low-hanging fruit only, because over-optimized code is not Pythonic.


Leaseweb BigData and MapReduce (Part 2)


This is the second part of a series of posts about the current Leaseweb BigData and MapReduce setup. The first post can be found here.

In this one we will focus on understanding MapReduce and building a job that will run on the cluster. Let’s start by looking at a diagram of what a MapReduce job looks like (Source: http://discoproject.org/media/images/diagram.png):

MapReduce Example

When I first saw this image I started thinking: “That’s quite simple! I just need to provide 2 functions and things will fly!” After a while it became clear that this is not the case…

So, what is the problem? First, let me update the image above to what I think is a more realistic view of the steps you need to go through to effectively get something out of a MapReduce cluster:

My version

So now we have moved from 2 problems (writing a Map and a Reduce function) to 4. It could be 3, or still 2, depending on the case, but I think the majority of people will face 4 or even 5 (getting the data from wherever it lives onto the cluster).

To summarize, this is what you need to do to make it work (a code sketch for the Map and Reduce steps follows the list):

  • Get the data to the cluster (could be a non-issue)
  • Slice the data into appropriate chunks
  • Write the Map function
  • Write the Reduce function
  • Do something with the output! Store it somewhere, generate files, etc…
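To make the Map and Reduce steps concrete, here is a minimal sketch along the lines of the classic word-count example from the Disco documentation (the framework we use, see part 1). The input URL is only a placeholder, and the job assumes a running Disco master:

from disco.core import Job, result_iterator

def map(line, params):
  # Map: emit a (word, 1) pair for every word in the input line
  for word in line.split():
    yield word, 1

def reduce(iter, params):
  # Reduce: group the pairs by word and sum the counts
  from disco.util import kvgroup
  for word, counts in kvgroup(sorted(iter)):
    yield word, sum(counts)

if __name__ == '__main__':
  # 'http://example.com/input.txt' is a placeholder; in practice the
  # input would point at data that is already available to the cluster
  job = Job().run(input=['http://example.com/input.txt'],
                  map=map,
                  reduce=reduce)
  for word, count in result_iterator(job.wait(show=True)):
    print word, count

Note that, as the list above says, getting the data to the cluster, slicing it into chunks and doing something with the output are still up to you.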

This was the second post on BigData and MapReduce. The first post can be found here.


Leaseweb BigData and MapReduce (Part 1)

Today BigData is an enormous buzzword. So what is it all about? And how difficult is it to use?

I already wrote in a previous post (http://www.leaseweb.com/labs/2011/11/big-data-do-i-need-it/) about the need for it and gave some insight into our BigData structure here at Leaseweb. In this post I will dive deeper into how we process this data using MapReduce and Python.

First of all, as the name says, BigData is actually a lot of data, so to retrieve information from it in a timely manner (and we do it in real time) you need to build an infrastructure that can handle it. So without further delay, this is our cluster structure:

  • 6 machines
  • 48 processing units (4 cores per machine with hyper-threading)
  • 1 Gbit network

As said before, we want to use our in-house Python knowledge (although Java knowledge also exists), so we went with Disco (http://discoproject.org/) to build our MapReduce infrastructure. We didn’t benchmark Disco against a classic Hadoop setup (the all-knowing “Internet” says you take a hit on performance), but we are satisfied with the results we are getting, and the cluster is not even getting that much work.

Once the hardware and the framework are up and running, the second part starts: programming the MapReduce jobs.

MapReduce is not some miracle technology that will solve all your data problems. So before you start programming the jobs, you will have to actually understand how it works. But for that, wait for part 2 🙂


Big data – do I need it?

Big Data?

Big data is one of the most recent “buzz words” on the Internet. The term is normally associated with data sets so big that they are really complicated to store, process, and search through.

Big data is known to be a three-dimensional problem (as defined by Gartner, Inc.*), i.e. it has three challenges associated with it:
1. increasing volume (amount of data)
2. velocity (speed of data in/out)
3. variety (range of data types, sources)

Why Big Data?

The bigger your datasets grow, the more information you can extract from them, and the better the precision of the results you get (assuming you’re using the right models, but that is not relevant for this post). Better and more diverse analyses can also be run against the data. Corporations keep growing their datasets to get more “juice” out of them: some to build better business models, others to improve the user experience, others to get to know their audience better; the choices are virtually unlimited.

In the end, and in my opinion, big data analysis/management can be a competitive advantage for corporations. In some cases, a crucial one.

Big Data Management

Big data management software is not something you normally buy on the market as an “off-the-shelf” product (maybe Oracle wants to change this?). One of the biggest questions of big data management is: what do you want to do with it? Knowing this is essential to minimize the problems related to huge data sets. Of course you can just store everything and later try to make some sense of the data you have, but again, in my opinion, that is the way to get a problem and not a solution/advantage.

Since you cannot just buy a big data management solution, a strategy has to be designed and followed until something is found that can work as a competitive advantage for the product/company.

Internally at LeaseWeb we’ve got a big data set, and we can work on it at real-time speed (we are using Cassandra** at the moment) while obtaining the results we need. To get this working we went through several trial-and-error iterations, but in the end we got what we needed, and so far it is living up to the expectations. How much hardware? How much development time? It all depends; the question you have to ask yourself is “What do I need?”, and once you have an answer to that, normal software planning and development time applies. It may even turn out that you don’t need Big Data at all, or that you can solve your problem using standard SQL technologies.

In the end, our answer to “What do I need?” provided us with everything we needed to find out what was best for us. In this case it was a mix of technologies, one of them being a NoSQL database.

* http://www.gartner.com/it/page.jsp?id=1731916
** http://cassandra.apache.org/
