Become a better programmer in 30 minutes

Programmers nowadays should all know these 3 very important rules:

  1. Don’t use floating point for money
  2. Store date/time in “UTC” timezone
  3. Always use the “UTF8” character set

Tom Scott brilliantly explains the above 3 topics in videos that last roughly 10 minutes each. He explains them clearly, with depth and in a passionate way. They are listed below. Start the videos by clicking on them.

1. Don’t use floating point for money

…never ever…

utf8_video
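As a small illustration of this rule (not taken from the video), the Python sketch below shows why binary floating point is a poor fit for money, and two common alternatives: integer cents and a decimal type. The amounts and variable names are made up for the example.

```python
from decimal import Decimal

# Binary floating point cannot represent most decimal fractions exactly.
float_total = 0.10 + 0.20
print(float_total == 0.30)   # False
print(float_total)           # 0.30000000000000004

# Option 1: keep amounts as integer cents and format only for display.
cents_total = 10 + 20
print(f"{cents_total // 100}.{cents_total % 100:02d}")   # 0.30

# Option 2: use a decimal type, constructed from strings rather than floats.
decimal_total = Decimal("0.10") + Decimal("0.20")
print(decimal_total == Decimal("0.30"))   # True
```

Which option fits best depends on the project; integer cents map well onto an integer database column, while a decimal type is convenient when fractions of a cent matter.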

2. Store date/time in “UTC” timezone

…yes, always…

timezone_video
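A minimal Python sketch of this rule, assuming the timestamp is stored as an ISO 8601 string: the value is kept in UTC and only converted to a local offset when it is shown. The date and the fixed +02:00 offset are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta, timezone

# Store: a timezone-aware timestamp in UTC, serialized as ISO 8601.
stored = datetime(2013, 3, 31, 0, 30, tzinfo=timezone.utc)
stored_text = stored.isoformat()
print(stored_text)           # 2013-03-31T00:30:00+00:00

# Display: convert to the viewer's offset only at the edge of the system.
# The fixed +02:00 offset is an assumption; real code would normally use a
# named time zone so that daylight saving rules are applied.
viewer_offset = timezone(timedelta(hours=2))
shown = datetime.fromisoformat(stored_text).astimezone(viewer_offset)
print(shown.isoformat())     # 2013-03-31T02:30:00+02:00

# Both values describe the same instant in time.
print(shown == stored)       # True
```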

3. Always use the “UTF8” character set

…no exceptions…

floating_points_video
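To make this rule concrete, here is a short Python sketch (my own example text, not from the video). It shows that UTF-8 can encode text that a single-byte character set such as Latin 1 cannot, and that decoding with the same character set gives the original text back.

```python
text = "förstår €5 日本語"

# UTF-8 can encode every Unicode character, using 1 to 4 bytes each,
# so the byte count is larger than the character count here.
utf8_bytes = text.encode("utf-8")
print(len(utf8_bytes) > len(text))   # True

# A single-byte legacy character set cannot represent all of this text.
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)                     # False

# Decoding with the same character set round-trips without loss.
print(utf8_bytes.decode("utf-8") == text)   # True
```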

IMHO every programmer should watch these videos, because they are very educational. And “Yes”, it will take you 30 valuable minutes of your precious time. But “No”, you will not regret it, because they teach you some valuable and elementary lessons. Lessons that apply to any software project, written in any programming language.

Thank you Tom Scott, you rock! Keep doing videos on your YouTube channel!


File name has � (invalid encoding) and CRLF issues


On Linux you sometimes get a “�” in a file name, followed by a trailing “(invalid encoding)”. This can happen when moving files from Windows to Ubuntu Linux. When uploading files to a Linux box you basically need two Linux tools to “repair” any incompatibility: “convmv” and “dos2unix”. The following commands will install them (on a Debian based Linux):

sudo apt-get install convmv
sudo apt-get install dos2unix

Character encoding

To remove the “(invalid encoding)” you use the “convmv” tool. It converts the character encoding used in the file name. You can try the conversion of file names from different character sets to UTF-8 using the following commands:

convmv -r -f windows-1252 -t UTF-8 .
convmv -r -f ISO-8859-1 -t UTF-8 .
convmv -r -f cp-850 -t UTF-8 .

These are the three most popular character encodings (for Western Europe). If you need another character encoding, “convmv --list” prints the encodings it supports (the “locale -m” command lists the character maps known to the system). Check out the Wikipedia character encoding page to find the characteristics of each of them. By default convmv only performs a dry run. After you have confirmed that the proposed renames are correct, you can run the actual conversion by adding the “--notest” flag. A typical run would look like this (use “-r” for recursive):

$ convmv -r -f windows-1252 -t UTF-8 .
Your Perl version has fleas #37757 #49830
Starting a dry run without changes...
mv "./jag f�rst�r inte.txt"    "./jag förstår inte.txt"
No changes to your files done. Use --notest to finally rename the files.
$ convmv -r -f windows-1252 -t UTF-8 . --notest
Your Perl version has fleas #37757 #49830
mv "./jag f�rst�r inte.txt"    "./jag förstår inte.txt"
Ready!
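What the conversion amounts to can be sketched in a few lines of Python. This is only an illustration of the idea, not how convmv is implemented: the file name bytes are decoded with the source character set and encoded again as UTF-8.

```python
# The file name as Windows wrote it: text encoded with windows-1252.
name = "jag förstår inte.txt"
raw = name.encode("cp1252")

# A UTF-8 system cannot decode those bytes, hence the replacement characters.
print(raw.decode("utf-8", errors="replace"))   # jag f�rst�r inte.txt

# Conceptually, "-f windows-1252 -t UTF-8" decodes with the source
# character set and then encodes with the target one.
fixed = raw.decode("cp1252").encode("utf-8")
print(fixed.decode("utf-8"))                   # jag förstår inte.txt
```

This also shows why the source character set has to be guessed correctly: decoding the same bytes with a different “-f” value would produce different, wrong characters.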

Line endings

Different operating systems have different line endings. The line endings are marked by one or two ASCII characters. These are the common styles:

  • CRLF: for the DOS/Windows world
  • CR: for the pre-OSX Mac world
  • LF: for the Unix and Unix-like world (including OSX)

Where the CR and LF characters are defined as such:

  • CR: Carriage Return is ASCII character 13 (0x0D)
  • LF: Line Feed is ASCII character 10 (0x0A)

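To detect what line endings a file has you can use “vi” and look for ^M (control-M) characters. Vim may detect the DOS format and hide them; in that case “vi -b” opens the file in binary mode and shows them: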

$ vi jag\ förstår\ inte.txt
Do you understand IT?^M
Yes I do!^M
~
~
"jag förstår inte.txt" 2 lines, 34 characters

Or you can use the “file” command, which reports CRLF line terminators when it finds them:

$ file jag\ förstår\ inte.txt
jag förstår inte.txt: ASCII text, with CRLF line terminators
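If you want to check line endings from a script, counting the terminator bytes is enough. The Python sketch below is a hypothetical helper (the function name is mine); the sample uses the same two lines as the file shown above.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF, lone CR and lone LF terminators in raw file content."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "CR": data.count(b"\r") - crlf,   # CRs that are not part of a CRLF
        "LF": data.count(b"\n") - crlf,   # LFs that are not part of a CRLF
    }

sample = b"Do you understand IT?\r\nYes I do!\r\n"
print(count_line_endings(sample))   # {'CRLF': 2, 'CR': 0, 'LF': 0}
```

For a real file you would pass in the result of reading it in binary mode, so that Python does not translate the line endings first.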

To convert the line endings from Windows to Linux, use “dos2unix”:

$ file jag\ förstår\ inte.txt
jag förstår inte.txt: ASCII text, with CRLF line terminators
$ dos2unix jag\ förstår\ inte.txt
dos2unix: converting file jag förstår inte.txt to Unix format ...
$ file jag\ förstår\ inte.txt
jag förstår inte.txt: ASCII text
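Roughly the same conversion can be sketched in Python. This is a simplified stand-in for dos2unix, not a description of its implementation: every CRLF pair is replaced with a single LF. The function name is hypothetical.

```python
def to_unix(data: bytes) -> bytes:
    """Replace every CRLF pair with a single LF; lone CRs are left alone."""
    return data.replace(b"\r\n", b"\n")

windows_text = b"Do you understand IT?\r\nYes I do!\r\n"
unix_text = to_unix(windows_text)
print(unix_text)   # b'Do you understand IT?\nYes I do!\n'
```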

Alternatively you can use an editor that supports conversion of line endings. Examples of open source text editors that support conversion of line endings are:

  • “TextMate” on OSX
  • “Notepad++” on Windows
  • “Gedit” on Ubuntu Linux

When I committed files from my OSX laptop to the Git repo, the “git diff” command showed way too many lines (since the line endings were changed). My colleagues showed me how to use the above commands to avoid any problems.


UTF-8 in PHP and MySQL under Ubuntu 12.04

UTF-8 is the de facto standard character set for PHP websites and there are few reasons not to use UTF-8 (with the “utf8_general_ci” collation) as the default for a MySQL database. However, anyone arguing that UTF-16 is a better standard would probably be right, but because UTF-8 is more popular, nobody cares. Unfortunately, the guys at Ubuntu (or upstream at Debian, PHP and MySQL) still have some strange defaults configured in their software, as follows:

  1. PHP connects explicitly to MySQL with a “Latin 1” character set unless you send the “set names utf8” query.
  2. Apache does not specify a character set by default (nor does PHP), letting the browser determine which character set is used.
  3. MySQL sets “latin1” as the default character set and “latin1_swedish_ci” as the default collation (for string comparison).
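Why a mismatch between these defaults matters can be shown with a short Python sketch. It only mimics the effect (UTF-8 bytes interpreted as Latin 1); it does not talk to MySQL, and the sample text is my own.

```python
text = "jag förstår inte"

# The application sends UTF-8 bytes...
sent = text.encode("utf-8")

# ...but a connection declared as Latin 1 treats every byte as one character.
misread = sent.decode("latin-1")
print(misread)                        # jag fÃ¶rstÃ¥r inte

# With matching character sets on both sides the text survives intact.
print(sent.decode("utf-8") == text)   # True
```

Garbled pairs such as “Ã¶” in a web page are a typical sign that one of the layers still uses the Latin 1 default.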

This is a longstanding issue. The reason for these western/Swedish defaults is that MySQL AB has a Swedish origin. Now that MySQL is the world’s most popular web database, and has been bought by Oracle (based in California/US), it seems like a strange choice. These days you would expect the following defaults:

  1. PHP connects to the server and uses the character set of the server, unless specified.
  2. Apache should assume all text content to be UTF-8 encoded.
  3. MySQL should have UTF-8 as the default character set and “utf8_general_ci” as the default collation.

It is easy to make Apache/MySQL/PHP (under Ubuntu 12.04) behave the way you like. First we add the character set to Apache:

echo "AddDefaultCharset utf-8" | sudo tee /etc/apache2/conf.d/utf8.conf
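With “AddDefaultCharset utf-8” in place, Apache should send a header like “Content-Type: text/html; charset=utf-8” for text content. The Python sketch below shows one way to read the character set out of such a header value; the helper name is hypothetical and the header strings are examples, not captured output.

```python
from email.message import Message
from typing import Optional

def charset_of(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type header value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()   # lower-cased, or None when absent

print(charset_of("text/html; charset=utf-8"))   # utf-8
print(charset_of("text/html"))                  # None
```

The second call corresponds to the default situation described above, where the browser is left to guess the character set.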

Now for MySQL, we open “/etc/mysql/my.cnf” and under the “[mysqld]” section we add the following 3 lines:

[mysqld]
...
character-set-server=utf8
collation-server=utf8_general_ci
init-connect='SET NAMES utf8'

For a default of UTF-8 in the MySQL command line client (optional) you must add the following line in the “/etc/mysql/my.cnf” file under the “[client]” section:

[client]
...
default-character-set=utf8

Now restart the Apache and MySQL servers with the following commands:

sudo service mysql restart
sudo service apache2 restart

This is really all you have to do on a default Ubuntu 12.04. To check whether or not everything works correctly put the following “utf8.php” file on your website:

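<?php
// Uses the old mysql_* extension that ships with PHP on Ubuntu 12.04.
mysql_connect('localhost', 'username', 'password');
mysql_select_db('database');
// List all server variables starting with "c"; this includes the
// character set and collation settings.
$re = mysql_query('SHOW VARIABLES LIKE "c%";') or die(mysql_error());
while ($r = mysql_fetch_assoc($re)) {
    echo $r["Variable_name"] . ': ' . $r["Value"];
    echo "<br />";
}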

The output should be:

character_set_client: utf8
character_set_connection: utf8
character_set_database: utf8
character_set_filesystem: binary
character_set_results: utf8
character_set_server: utf8
character_set_system: utf8
character_sets_dir: /usr/share/mysql/charsets/
collation_connection: utf8_general_ci
collation_database: utf8_general_ci
collation_server: utf8_general_ci
completion_type: NO_CHAIN
concurrent_insert: AUTO
connect_timeout: 10
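If you prefer to check this output automatically, a small Python sketch like the one below could do it. It assumes the output has been saved as plain “name: value” lines, as shown above; the function name and the list of checked variables are my own choices.

```python
# Variables that should read "utf8" after the changes described above.
EXPECTED_UTF8 = (
    "character_set_client",
    "character_set_connection",
    "character_set_database",
    "character_set_results",
    "character_set_server",
)

def non_utf8_variables(output: str) -> list:
    """Return the names of checked variables whose value is not 'utf8'."""
    values = {}
    for line in output.splitlines():
        name, sep, value = line.partition(": ")
        if sep:
            values[name.strip()] = value.strip()
    return [name for name in EXPECTED_UTF8 if values.get(name) != "utf8"]

# Example input with one variable still on the old default.
sample = """character_set_client: utf8
character_set_connection: latin1
character_set_database: utf8
character_set_results: utf8
character_set_server: utf8"""
print(non_utf8_variables(sample))   # ['character_set_connection']
```

An empty list means all checked variables are set to utf8.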

Let me know if you still have any trouble making it work. Good luck!
