3 popular GOTO conference talks

GOTO conferences are for developers, by developers. On gotocon.com you can find the upcoming conferences:

  • GOTO London: Sep. 14 – 18, 2015
  • GOTO Copenhagen: Oct. 5 – 8, 2015
  • GOTO Berlin: Dec. 2 – 4, 2015
  • GOTO Chicago: May 2016
  • GOTO Amsterdam: June 13 – 15, 2016

There have already been many GOTO conferences, and many of the past talks are available on YouTube. Below are three interesting talks from the YouTube GOTO conference channel.

1) Introduction to NoSQL

A GOTO classic from the 2012 Aarhus conference, and actually the most viewed talk on the YouTube GOTO conference channel. Martin Fowler explains what NoSQL is and when it should be applied. Like no other, he explains the advantages of SQL and the circumstances under which choosing NoSQL may make sense. Apart from the NoSQL topics, his explanation of offline locks and document-based databases is so good that it alone is reason enough to watch this video.

2) Challenges in implementing microservices

This video was recorded more recently (August 2015) in Amsterdam. Fred George explains how the Web and SOA have led to new paradigms for structuring enterprise applications. Instead of a few business-related services, he developed systems made of many small, short-lived services. This approach is called "microservices", and Fred George talks about his experience building applications in this paradigm.

3) How Go is making us faster

In July 2015, at the GOTO conference in Chicago, Wilfried Schobeiri explained why Go is a good match for microservices. I'm not sure about the factual correctness of the speed comparisons with C++ and Java, but Go is definitely very fast. The takeaway from this talk is that Go might be a great match for microservices, since it has a nice standard library/toolset, easy parallelism/concurrency and simple deployment.

Leaseweb BigData and MapReduce (Part 2)


This is the second part of a series of posts about the current Leaseweb BigData and MapReduce setup. The first post can be found here.

In this one we will focus on understanding MapReduce and building a job that will run on the Cluster. Let's start by looking at a diagram of what a MapReduce job looks like (source: http://discoproject.org/media/images/diagram.png):

MapReduce Example

When I first saw this image I started thinking "That's quite simple! I just need to provide 2 functions and things will fly!" After a while it became clear that this is not the case…

So, what is the problem? First, let me update the image above with what I think is a more realistic view of the steps you need to take to effectively get something out of a MapReduce cluster:

My version

So now we have moved from 2 problems (writing a Map and a Reduce function) to 4. It could be 3, or still 2, depending on the case, but I think the majority of people will face 4 or even 5 (getting the data from wherever it lives onto the cluster).

Summarizing, this is what you need to do to make it work (a minimal example job follows the list):

  • Get the data to the Cluster (could be a non-issue)
  • Slice the data into appropriate chunks
  • Write the Map
  • Write the Reduce
  • Do something with the output! Store it somewhere, generate files, etc…
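
To make the Map and Reduce parts concrete, here is a minimal sketch of a Disco job, closely based on the word-count example from the Disco tutorial. The input URL is the tutorial's sample text, not our data; on our cluster the input would point at the chunks you already pushed in.

```python
from disco.core import Job, result_iterator

def map(line, params):
    # The Map: emit a (word, 1) pair for every word in an input line.
    for word in line.split():
        yield word, 1

def reduce(iter, params):
    # The Reduce: group the pairs by word and sum the counts.
    from disco.util import kvgroup
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == '__main__':
    # Sample input from the Disco tutorial; replace with your own data.
    job = Job().run(input=["http://discoproject.org/media/text/chekhov.txt"],
                    map=map,
                    reduce=reduce)
    # "Do something with the output": here we just print it.
    for word, count in result_iterator(job.wait(show=True)):
        print(word, count)
```

Even in this tiny example you can see the other steps hiding in plain sight: the input has to be reachable by the cluster, and the results still have to end up somewhere useful.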

This was the second post on BigData and MapReduce. The first post can be found here.

Leaseweb BigData and MapReduce (Part 1)

Today BigData is an enormous buzzword. So what is it all about? How difficult is it to use?

I already wrote in a previous post (http://www.leaseweb.com/labs/2011/11/big-data-do-i-need-it/) about the need for it and gave some insight into our BigData structure here at Leaseweb. In this post I will dive more into how we process this data using MapReduce and Python.

First of all, as the name says, BigData is actually a lot of data, so to retrieve information from it in a timely manner (and we do it in real time) you need to build an infrastructure that can handle it. So without further delay, this is our cluster structure:

  • 6 machines
  • 48 processing units (4 cores per machine with hyper-threading)
  • 1 Gbit network

As said before, we want to use our in-house Python knowledge (although Java knowledge also exists), so we went with Disco (http://discoproject.org/) to build our MapReduce infrastructure. We didn't benchmark Disco against a classic Hadoop setup (the all-knowing "Internet" says you take a hit on performance), but we are satisfied with the results we are getting, and the cluster is not even under that much load.

Once the hardware and the framework are up and running, the second part starts: programming the MapReduce jobs.

MapReduce is not some miracle technology that will solve all your data problems. So before you start programming the jobs, you will have to actually understand how it works. But for that, wait for part 2 🙂

Big data – do I need it?

Big Data?

Big data is one of the most recent "buzz words" on the Internet. This term is normally associated with data sets so big that they are really complicated to store, process, and search through.

Big data is known to be a three-dimensional problem (as defined by Gartner, Inc.*), i.e. it has three challenges associated with it:
1. increasing volume (amount of data)
2. velocity (speed of data in/out)
3. variety (range of data types, sources)

Why Big Data?
The bigger a dataset grows, the more information you can extract from it, and the better the precision of the results you get (assuming you're using the right models, but that is not relevant for this post). Better and more diverse analyses can also be run against the data. Corporations of all kinds keep growing their datasets to get more "juice" out of them: some to build better business models, others to improve the user experience, others to get to know their audience better; the choices are virtually unlimited.

In the end, and in my opinion, big data analysis/management can be a competitive advantage for corporations. In some cases, a crucial one.

Big Data Management

Big data management software is not something you normally buy on the market as an "off-the-shelf" product (maybe Oracle wants to change this?). One of the biggest questions of big data management is: what do you want to do with it? Knowing this is essential to minimize the problems related to huge data sets. Of course, you can just store everything and later try to make some sense of the data you have, but in my opinion that is the way to get a problem rather than a solution/advantage.
Since you cannot just buy a big data management solution, a strategy has to be designed and followed until you find something that can work as a competitive advantage for the product/company.

Internally at LeaseWeb we've got a big data set, and we can work on it at real-time speed (we are using Cassandra** at the moment) and obtain the results we need. To get this working we went through several trial-and-error iterations, but in the end we got what we needed, and so far it is living up to expectations. How much hardware? How much development time? This all depends; the question you have to ask yourself is "What do I need?", and once you have an answer to that, normal software planning/development time applies. It may even turn out that you don't need Big Data at all, or that you can solve your problem using standard SQL technologies.

In the end, our answer to "What do I need?" provided us with all the information we needed to find out what was best for us. In our case it was a mix of technologies, one of them being a NoSQL database.

* http://www.gartner.com/it/page.jsp?id=1731916
** http://cassandra.apache.org/

Scalable RDBMS

My name is Mukesh. In my previous assignments I worked with fairly large (or medium-large) scale websites, and I am now part of LeaseWeb's cloud team as an innovation engineer. When I say large scale, I'm talking about a website serving 300 million web pages per day (all rendered within a second), a website storing about half a billion photos and videos, a website with an active user base of ~10 million, a web application with 3000 servers… and so on!

We all know it takes a lot to keep sites like these running, especially if the company has decided to run them on commodity hardware. Coming from this background, I'd like to dedicate my first blog post to the subject of scalable databases.

A friend of mine, a marketing manager by profession but inspired by technology, asked me why we are using MySQL when it is known not to scale (or is there some special Harry Potter# magic?). What he wanted to know was: for what reasons have we chosen MySQL? And are there any plans to move to another database?

Well, the answer to the latter one is easy: "No, we're not planning to move to another database". The former question, however, can't be answered in a single line.

# Talking of Harry Potter, what do you think about 'The Deathly Hallows Part II'?

Think about Facebook, a well-recognised social networking website. Facebook handles more than 25 billion page views per day, and even they use MySQL.

The bottleneck is not MySQL (or any common database). Generally speaking, every database product on the market has the following characteristics to some extent:

  1. PERSISTENCE:  Storage and (random) retrieval of data
  2. CONCURRENCY:  The ability to support multiple users simultaneously (lock granularity is often an issue here)
  3. DISTRIBUTION:  Maintenance of relationships across multiple databases (support of locality of reference, data replication)
  4. INTEGRITY:  Methods to ensure data is not lost or corrupted (features including automatic two-phase commit, use of dual log files, roll-forward recovery)
  5. SCALABILITY:  Predictable performance as the number of users or the size of the database increases

This post deals with scalability, a term we hear quite often when we talk about large systems/big data.

Data volume can be managed if you shard it. If you spread the data over different servers at the application level, the scalability of MySQL is not such a big problem. Of course, you cannot JOIN data from different servers, but choosing a non-relational database doesn't help with that either. There is no evidence that even Facebook uses Cassandra (its very own creation, back in early 2008) as primary storage, and it seems that the only thing it is needed for there is searching incoming messages.
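
As an illustration of what sharding at the application level can look like, here is a minimal sketch in Python; the four shard DSNs and the use of the user id as the sharding key are hypothetical, not a description of any real setup.

```python
import hashlib

# Hypothetical shard map: four MySQL servers addressed by DSN.
SHARDS = [
    "mysql://db01.example.com/users",
    "mysql://db02.example.com/users",
    "mysql://db03.example.com/users",
    "mysql://db04.example.com/users",
]

def shard_for(user_id):
    """Deterministically map a user id to one of the shards."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# Every query for a given user goes to the same server, so single-user
# lookups stay cheap; cross-shard JOINs are what you give up.
print(shard_for(42))
```

The routing logic lives in the application (or in a thin library on top of the driver), not in MySQL itself, which is exactly why MySQL's own scalability is not the bottleneck here.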

In reality, distributed databases such as Cassandra, MongoDB and CouchDB (or any new database, for that matter) lack proven scalability and stability until they have some real users (I keep seeing posts from users running into issues or annoyances). For example, the guys at Twitter spent about a year trying to move from MySQL to Cassandra (great to see that they have a bunch of features working).
I'm not saying these databases aren't good; they are getting better with time. My point is that any new database needs more time, and a few large production deployments, to be mature enough to be considered over MySQL. Of course, if someone tells me how he used one of these databases as primary storage for a billion cases over a year, then I'll change my opinion.

I believe it’s a bad idea to risk your main base on new technology. It would be a disaster to lose or damage the database, and you may not be able to restore everything. Besides, if you’re not a developer of one of these newfangled databases and one of those few who actually use them in combat mode, you can only pray that the developer will fix bugs and issues with scalability as they become available.

In fact, you can go very far on a single MySQL server without even caring about partitioning data at the application level. While it's easy to scale a server up with a bunch of cores and tons of RAM, do not forget about replication. In addition, if a memcached layer (which scales easily) sits in front of the database, the only thing your database really has to care about is writes. For storing large objects you can use S3 or any other distributed hash table. Until you are sure that you need to scale the database as it grows, do not shoulder the burden of making it an order of magnitude more scalable than you need.
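
Here is a minimal sketch of the read-through caching pattern described above, using the python-memcached client; the load_user_from_db function is a hypothetical stand-in for the real MySQL query.

```python
import memcache  # python-memcached client

mc = memcache.Client(["127.0.0.1:11211"])

def load_user_from_db(user_id):
    # Placeholder for the real (relatively expensive) MySQL query.
    return {"id": user_id, "name": "example"}

def get_user(user_id):
    key = "user:%d" % user_id
    user = mc.get(key)                      # 1) try the cache first
    if user is None:
        user = load_user_from_db(user_id)   # 2) fall back to MySQL on a miss
        mc.set(key, user, time=300)         # 3) keep it cached for 5 minutes
    return user

def invalidate_user(user_id):
    # After writing to MySQL, drop the cached copy so the next
    # read picks up the fresh row.
    mc.delete("user:%d" % user_id)
```

With this in place, most reads never reach MySQL at all, which is what makes "the only thing your database cares about is writes" a reasonable approximation.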

Most problems arise when you try to split the data over a large number of servers, but you can put an intermediate layer between the application and the database that is responsible for partitioning, as FriendFeed does, for example.
I believe that the relational model is the correct way of structuring data in most applications with user-created content. Schemas keep the data in a well-defined form as new versions of the service appear; they also serve as documentation and help avoid a heap of errors. SQL allows you to process just the data you need, instead of pulling tons of raw information that then still has to be reprocessed in the application. I think that once the whole hype around NoSQL is over, someone will finally develop a relational database with free semantics.

Conclusion!

  1. Use MySQL or other classic databases for important, persistent data.
  2. Use caching delivery mechanisms – or maybe even NoSQL – for fast delivery
  3. Wait until the dust settles, and the next generation, free-semantics relational database rises up.