Python performance optimization techniques

If you want to optimize the performance of your Python script, you need to be able to analyze it. The best source of information I found on the web is the PerformanceTips page on the Python wiki. We are going to describe two types of performance analysis in Python. The first type uses a stopwatch to time the repeated execution of a specific piece of code. This allows you to change or replace the code and see whether or not the change improved the performance. The second type enables a profiler that tracks every function call the code makes. These calls can then be related, aggregated and visually represented. This type of profiling allows you to identify which part of your code takes the most time. We will show how to do both, starting with the stopwatch type.

Simple stopwatch profiling

You can apply basic stopwatch-style profiling using the “timeit” module. It reports the time, in seconds, that a snippet of code takes to execute a specified number of times; the default number of times is one million. You can specify a setup statement that is executed once and is not counted in the execution time, the actual statement to time, and the number of times it needs to be executed. You can also specify a timer function if you do not want wall clock time but, for example, want to measure CPU time.


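import timeit

# Sample input: formatMethod expects exactly 11 parts.
stringParts = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k']

def lazyMethod(stringParts):
  # Build the string by appending one part at a time.
  fullString = ''
  for part in stringParts:
    fullString += part
  return fullString

def formatMethod(stringParts):
  # Build the string with a single format expression.
  fullString = "%s%s%s%s%s%s%s%s%s%s%s" % (stringParts[0], stringParts[1],
                                           stringParts[2], stringParts[3],
                                           stringParts[4], stringParts[5],
                                           stringParts[6], stringParts[7],
                                           stringParts[8], stringParts[9],
                                           stringParts[10])
  return fullString

def joinMethod(stringParts):
  # Build the string with str.join.
  return ''.join(stringParts)

# The setup statement runs once and is not included in the measured time.
setup = 'from __main__ import lazyMethod, formatMethod, joinMethod, stringParts'
print('Join Time: ' + str(timeit.timeit('joinMethod(stringParts)', setup)))
print('Format Time: ' + str(timeit.timeit('formatMethod(stringParts)', setup)))
print('Lazy Time: ' + str(timeit.timeit('lazyMethod(stringParts)', setup)))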

The output should be something like this:

Join Time: 0.358200073242
Format Time: 0.646985054016
Lazy Time: 0.792141914368

This shows us that the join method is more efficient in this specific case.
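The number of repetitions and the timer can also be passed explicitly. The following is a minimal sketch, assuming Python 3.3 or later (where “time.process_time” is available as a CPU-time clock); the helper name “measure” and the sample statement are made up for illustration.

```python
import time
import timeit

def measure(statement, setup, repetitions):
    """Time a statement twice: once with the default wall clock timer
    and once with a CPU-time timer. Both results are in seconds."""
    wall = timeit.timeit(statement, setup=setup, number=repetitions)
    cpu = timeit.timeit(statement, setup=setup, number=repetitions,
                        timer=time.process_time)
    return wall, cpu

# 100000 repetitions instead of the default one million.
wall_seconds, cpu_seconds = measure("''.join(parts)",
                                    "parts = ['a'] * 11",
                                    100000)
print('Wall clock: ' + str(wall_seconds))
print('CPU time: ' + str(cpu_seconds))
```

Lowering the number of repetitions keeps slow snippets manageable, but the totals are then only comparable between runs that use the same number.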

Advanced profiling using cProfile

To identify what takes how much time within an application, we first need an application. Let us profile a simple “Hello World” web application written in Flask; its code is below. We replaced “app.run()” with “app.test_client().get('/')” so that the application handles only one request and then exits.

from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
  return "Hello World!"

if __name__ == "__main__":
  # app.run()
  app.test_client().get('/')

Running the application with the profiler enabled can be done from the command line, so there is no need to change the code. The command is:

python -m cProfile -o flask.profile flaskapp.py
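A profile dump can also be created and inspected from within Python, using the standard library “cProfile” and “pstats” modules. The sketch below is only an illustration: “busy_work” is a hypothetical stand-in for the application, and the dump is written to a temporary file rather than to “flask.profile”.

```python
import cProfile
import io
import os
import pstats
import tempfile

def busy_work(n):
    """Hypothetical stand-in for the code you want to profile."""
    return sum(i * i for i in range(n))

def profile_to_file(path, n=10000):
    """Run busy_work under the profiler and dump the statistics to 'path'."""
    profiler = cProfile.Profile()
    profiler.enable()
    busy_work(n)
    profiler.disable()
    profiler.dump_stats(path)

def summarize(path, limit=5):
    """Load a profile dump and return a text report of the top entries,
    sorted by cumulative time."""
    out = io.StringIO()
    stats = pstats.Stats(path, stream=out)
    stats.strip_dirs().sort_stats('cumulative').print_stats(limit)
    return out.getvalue()

profile_path = os.path.join(tempfile.mkdtemp(), 'example.profile')
profile_to_file(profile_path)
report = summarize(profile_path)
print(report)
```

The same “summarize” function should work on a dump produced with “python -m cProfile -o”, since both write the same statistics format.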

Visualizing cProfile results

“RunSnakeRun is a GUI tool by Mike Fletcher which visualizes profile dumps from cProfile using square maps. Function/method calls may be sorted according to various criteria, and source code may be displayed alongside the visualization and call statistics.” – source: Python PerformanceTips

We now analyze the generated “flask.profile” file by running the “runsnake” tool with the following command:

runsnake flask.profile

It gave us some really nice insights:

Picture 1: The visual output of RunSnakeRun

Picture 2: The list of function calls shows 77 calls to the regex library (re.py), accounting for only 0.5 ms of the 79 ms.

Picture 3: A map showing all calls, the rectangle in the upper right (testing.py) is the test client running.

We showed you how to profile your Python application; now go practice and optimize your code. One piece of advice though: go for low-hanging fruit only, because over-optimized code is not Pythonic.

Leaseweb BigData and MapReduce (Part 2)


This is the second part of a series of posts about BigData and MapReduce at Leaseweb. The first post can be found here.

In this one we will focus on understanding MapReduce and building a job that will run on the cluster. Let’s start by looking at a diagram of what a MapReduce job looks like (Source: http://discoproject.org/media/images/diagram.png):

MapReduce Example

When I first saw this image I started thinking “That’s quite simple! I just need to provide 2 functions and things will fly!” After a while it became clear that this is not the case…

So, what is the problem? First let me update the image above with what I think is a more realistic view of the steps you need to effectively get something out of a MapReduce cluster:

My version

So, now we have moved from 2 problems (writing a Map and a Reduce function) to 4. It could be 3 or still 2, depending on the case. But I do think the majority of people will face 4 or 5 (the fifth being getting the data from wherever it lives to the cluster).

To summarize, this is what you need to do to make it work:

  • Get the data to the Cluster (could be a non-issue)
  • Slice the data in appropriate chunks
  • Write the Map
  • Write the Reduce
  • Do something with the output! Store it in someplace, generate files, etc…
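The steps above can be sketched in plain Python with a word count. This is only an illustration of the flow (slice, map, reduce, output); it runs in a single process, does not use the Disco API, and all function names and the sample data are made up for the example.

```python
from collections import defaultdict

def slice_data(lines, chunk_size):
    """Slice the input into chunks of at most chunk_size lines."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]

def map_chunk(chunk):
    """Map step: emit a (word, 1) pair for every word in the chunk."""
    pairs = []
    for line in chunk:
        for word in line.split():
            pairs.append((word.lower(), 1))
    return pairs

def reduce_pairs(pairs):
    """Reduce step: sum the counts per word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Step 1: get the data (here simply a list in memory).
lines = ["big data is big", "map and reduce", "reduce the data"]

# Step 2: slice it into chunks; on a cluster each chunk would go to a worker.
chunks = slice_data(lines, 2)

# Steps 3 and 4: run the map on every chunk, then reduce all emitted pairs.
mapped = [pair for chunk in chunks for pair in map_chunk(chunk)]
word_counts = reduce_pairs(mapped)

# Step 5: do something with the output, here just printing it.
for word in sorted(word_counts):
    print(word + ': ' + str(word_counts[word]))
```

On a real cluster the slicing and the distribution of the map and reduce work are handled by the framework, which is exactly where the extra steps around the two functions come from.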

This was the second post on BigData and MapReduce. The first post can be found here.

Leaseweb BigData and MapReduce (Part 1)

Today BigData is an enormous buzzword. So what is it all about? How difficult is it to use?

I already wrote in a previous post (http://www.leaseweb.com/labs/2011/11/big-data-do-i-need-it/) about the need for it and gave some insight into our BigData structure here at Leaseweb. In this post I will dive more into how we process this data using MapReduce and Python.

First of all, as the name says, BigData is actually a lot of data, so to retrieve information from it in a timely manner (and we do it in real time) you need to build an infrastructure that can handle it. So without further delay, this is our cluster structure:

  • 6 machines
  • 48 processing units (4 cores per machine with hyper-threading)
  • 1Gb Network

As said before, we want to use our in-house Python knowledge (although Java knowledge also exists). So we went with Disco (http://discoproject.org/) to build our MapReduce infrastructure. We didn’t benchmark Disco against a classic Hadoop structure (the all-knowing “Internet” says you take a hit on performance), but we are satisfied with the results we are getting and the cluster is not even getting that much work.

Once the hardware and the framework are up and running, the second part starts: programming the MapReduce jobs.

MapReduce is not some miracle technology that will solve all your problems with data. So before starting programming the jobs you will have to actually understand how it works. But for that, wait for part 2 🙂

Interactive programming tutorials

Dennis Ritchie was an American computer scientist who created the C programming language and, together with Ken Thompson, the Unix operating system. He was also one of the authors (the ‘R’) of the famous K&R C book. The first edition of this book was published in 1978 and it was the first widely available book on the C programming language. It is one of the first books I read about programming and I still think it is one of the best.

While some people enjoy reading a book like K&R C to learn a new language (and I certainly did back in the day), I now think reading textbooks about new programming languages is not the most effective way of learning a new language. Although it may still be the most effective way of learning your first language, because everything is new and you need a thorough understanding of the concepts, I think there may be more effective ways of learning your second, third or fourth programming language.

It was when the world wide web became a commodity that programming books lost their use as a reference, even for code examples. This changed the nature of programming books. I notice most programmers today do not (know how to) implement from documentation. They use implementation examples that they copy/paste from code they find using Google and Koders.

Another concept in programming books is taking advantage of prior knowledge. I remember it was a delight to learn C++ from a book titled “from C to C++”. The author assumed the reader knew everything about C and only discussed the differences, therefore not wasting the reader’s precious time.

This idea is also the basis for a (relatively) new phenomenon: interactive web applications that behave as language tutors. They teach you programming by example and allow you to progress at your own speed. Learning by example skips a lot of the theory and this is why I don’t know how well it will work for real beginners, but for programmers that already know a few languages it is a fun and fast way to learn. I made a list of good (free) interactive language tutorials online:

  1. JavaScript: Codecademy JavaScript
  2. Regular Expression: RegexOne
  3. SQL: SQLzoo
  4. Ruby: Try Ruby
  5. Python: Try Python (Runs on Linux using Moonlight 2)
  6. Haskell: Try Haskell
  7. Scala: Simply Scala
  8. PHP: W3Schools
  9. CSS: CSS 101

Even though you should try all of the above, there may be one “traditional” textbook that you should read. It is the free and very well written book Eloquent JavaScript by Marijn Haverbeke.

http://www.w3schools.com/php/default.asp

Automated testing with Selenium part II

We are starting with test automation and, after evaluating a selection of tools, we decided to go with Selenium, an open source tool for user interface testing of web applications.

Features that are important in the Selenium tool

  •  A record & playback interface for quick scripting and a low learning curve.
  •  The record & playback interface works as a plug-in in Firefox.
  •  There is a plug-in for record & playback to do loops in IDE scripts, but it is limited.
  •  There is a plug-in for data driven testing.
  •  Selenium is usable with open source programming platforms (such as Ruby or Python).

Combination with Python

For now, within LeaseWeb we have experimented with the Python development platform. Python is also open source software, with a low learning curve. Once the scripts are programmed (or imported from the record and playback IDE interface), they can be used for more sophisticated automated testing. For example, you can read data from a database and check it against the web software.
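A check of database values against a web application could look roughly like the sketch below. It is an illustration under several assumptions: the “pages” table, its columns and the example URLs are hypothetical, and the demo at the bottom uses a fake title lookup so that it runs without a browser. Passing “fetch_title_with_selenium” instead would open each page in Firefox through Selenium WebDriver, which requires the selenium package to be installed.

```python
import sqlite3

def load_expected_titles(connection):
    """Read (url, expected title) rows from a hypothetical 'pages' table."""
    query = "SELECT url, title FROM pages ORDER BY url"
    return connection.execute(query).fetchall()

def fetch_title_with_selenium(url):
    """Open the page in Firefox via Selenium WebDriver and return its title."""
    # Imported here so the rest of the sketch runs without selenium installed.
    from selenium import webdriver
    driver = webdriver.Firefox()
    try:
        driver.get(url)
        return driver.title
    finally:
        driver.quit()

def find_mismatches(expected_rows, fetch_title):
    """Return (url, expected, actual) for every page whose title differs."""
    mismatches = []
    for url, expected in expected_rows:
        actual = fetch_title(url)
        if actual != expected:
            mismatches.append((url, expected, actual))
    return mismatches

# Demo data: an in-memory database with made-up expected titles.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE pages (url TEXT, title TEXT)")
connection.executemany("INSERT INTO pages VALUES (?, ?)", [
    ("http://example.com/", "Example Domain"),
    ("http://example.com/about", "About us"),
])

# Fake lookup standing in for the browser; swap in
# fetch_title_with_selenium to test against real pages.
fake_titles = {
    "http://example.com/": "Example Domain",
    "http://example.com/about": "Oops",
}
mismatches = find_mismatches(load_expected_titles(connection), fake_titles.get)
print(mismatches)
```

Keeping the comparison separate from the browser call makes it possible to test the checking logic itself without starting a browser.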

When this is set up with Selenium WebDriver, you can do automated checking across browsers and platforms.

  • Cross browser and cross-platform
  • Performance testing by using basic test scenarios

Install Selenium IDE (the record and playback tool)
Go to http://seleniumhq.org/ and follow the instructions on that page for installing Selenium IDE
Install the Selenium webdriver

Get Python

  • Download Python from www.python.org (please install Python 2.x, not Python 3.x; at the time of writing the new version has little support yet)

Add Python to your PATH

  • Right click on “My Computer”.
  • Select “Properties” from the context menu.
  • In the “System Properties” dialog box, click on the “Advanced” tab.
  • Click on the “Environment Variables” button.
  • Highlight the “Path” Variable in “System variables” section.
  • Click the “Edit” button.
  • Append the following to the text inside the “Variable value” text box, separated by semicolons:
  • C:\Python25\;C:\Python25\Scripts\ (where 25 is the version number of your downloaded Python version)
  • Click “OK” on the “Edit System Variable” dialog box then “OK” on the “Environment Variables” dialog box to commit the changes.
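To verify the result of the steps above, you can check from Python whether a directory is listed in the PATH variable. This is a small sketch; the function name and the example directories are made up, and when no PATH-style string is passed it falls back to the PATH of the current process, so open a new command prompt after changing the variable.

```python
import os

def directory_on_path(directory, path_value=None):
    """Return True if 'directory' is one of the entries of a PATH-style string.

    Uses the PATH environment variable when path_value is not given.
    """
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    wanted = os.path.normcase(os.path.normpath(directory))
    entries = [entry for entry in path_value.split(os.pathsep) if entry]
    return any(os.path.normcase(os.path.normpath(entry)) == wanted
               for entry in entries)

# Example with a made-up PATH value; trailing slashes are ignored.
example_path = os.pathsep.join(["/usr/bin", "/opt/python/", "/opt/python/Scripts"])
print(directory_on_path("/opt/python", example_path))
print(directory_on_path("/opt/ruby", example_path))
```

On Windows you would call it with the directory you added, for example the Python installation folder.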

Install setuptools
Download setuptools via http://pypi.python.org/pypi/setuptools#downloads
The file “setuptools-x.x.win32-pyx.x.exe (md5)” is an executable that will self extract and setup the setuptools.

Install the Selenium Python client library
For the next step you’ll need Python 2 installed, which you can get from http://www.python.org/getit/. You’ll also need to install setuptools from http://pypi.python.org/pypi/setuptools.
Once you have these, you can run the following to install the Selenium python client library:
– easy_install pip
– pip install selenium

Support for browsers
At the time of writing (1st of November 2011), only the Firefox and Chrome webdrivers work. This is because the webdriver is a new platform for Selenium. The old Selenium platform (Selenium RC and Selenium Grid) is used now in lots of companies, but within Leaseweb we decided to use the new one. This is something to experiment with within Leaseweb in the near future.
Install the Google Chrome driver and configure Internet Explorer
If you want to play around with the Selenium webdriver and browsers other than Firefox, you will have to install the drivers for these web browsers.

  • Google Chrome: download the Chrome driver and place it in a directory on the system path (you could for example put the driver in the c:\pythonxx folder)
  • Internet Explorer: turn ON protected mode in ALL Internet Explorer zones (Security tab in the IE settings)

Conclusion
The new version of Selenium is better and more user friendly than the old version, but the compatibility with web browsers is not good enough yet. For learning and getting started with this tool, however, it is good enough. Reasonably sophisticated scripts can be made in combination with Python as a programming environment, which makes it very flexible. Selenium is updated a lot, so I guess that in the near future all the internet browsers will work.
