Archlinux + CUDA on Ivy-Bridge + Kepler

June 2, 2012 Category :Archlinux| CPU| GPU| HOWTO 32

Introduction

My new Asus N56V arrived earlier this week, and the first thing I wanted to do was install ArchLinux on it. Being a CUDA programmer by profession I also needed to get it working with the Optimus configuration NVIDIA GPUs have to work with on board Intel Graphics cards.

 

Being a new arrival in the market it has had some issues. Some due to my unfamiliarity with them (such as grub2 and UEFI) and others such as lack of availability (such as drivers for the ethernet).

 

Thankfully, the internet has been helpful; Sometimes providing outright solutions, sometimes guiding you in the right direction. Since I had to scour through the web to figure out the many issues facing me, I think putting together this guide (by no means fool proof or comprehensive) should be helpful to those with similar hardware in the future.

 

Here is the hardware configuration:

 

CPU Intel 3610QM
GPU NVIDIA GT 650M
Ethernet AR8161
Wireless AR9485

Preinstallation

 

Installation Media

I chose to use the Netinstall Image (dated August 2011) because it is quicker to download. It can also automatically download and install the latest packages over the internet. If you have a slow ineternet connection or already have Core Images of archlnux it is best to use the core image.

 

Preparing the disks

The laptop came with UEFI enabled and the harddisk using GPT as its partitioning table. Why choose this over regular BIOS I am not sure. If your laptop uses UEFI, while you can partition and format the disks from the archlinux live image, it is best if you create the required partition (and format them as ntfs for now) from Windows or any other utility before you proceed with the installation.

 

Guides

Please have the Archlinux Beginners guide open at all times. It is absolutely necessary for beginners and highly recommended for everyone else. For a forgetful person like me it saved my ass more than a few times. This guide about setting up ArchLinux with UEFI is also helpful. I had to change a couple of paths, but we will get to that later.

 

UEFI setup

 

If you have an UEFI setup, you will need to create a bootable flash drive with an UEFI shell. Have a flash drive (256MB should be more than enough) formatted as FAT32. Download the appropriate shell from here. And rename it as bootx64.efi and place it in /path/to/flashdrive/efi/boot/bootx64.efi

This will be useful for later.

 

Installation

 

Network

The following command should show you the networking devices that are readily available.

ip addr

If you can see eth0 and have an ethernet cable, then running dhcpcd should allow you to connect to the internet directly. If you can not see eth0 or want to use wifi, you can set up WPA connections in the following manner.

 

ip link set wlan0 up
wpa_passphrase ssid “passphrase” > /etc/wpa_supplicant.conf
wpa_supplicant -B -Dwext -i wlan0 -c /etc/wpa_supplicant.conf
iwconfig wlan0
dhcpcd wlan0
ping -c3 www.google.com # Testing the network

You can find the instructions for setting up other kinds of networks over here.

 

Partitioning the disks

This section is in case you have not created the partitions you want already. If the laptop is using GPT tables, the archlinux installer will not be able to detect free space for you to set up the partitions. You will need to install gptfdisk and create partitions.

 

pacman -S gptfdisk
gdisk # instructions exactly like fdisk
# create partitions and write to disk

 

At this point I needed to reboot the system for the changes to be identified. Painfully I had to set up the wifi again (and everytime after that if you reboot your system and end up in the live session).

 

Starting the install process

 

from the prompt run the following

/arch/setup

The best thing to do here is to follow the Beginners guide. I will try to provide the options I chose.

You will be asked to select a server to download from if you are using a net install. Choose the server that is closest to you. For those in USA, rit.edu should work well.

When prompted to select repositories and packages, choose what is relavent to you. I chose the following.

Repositories: core-remote extra-remote community-remote

Packages: xorg, xorg-font, gnome, gnome-extra

Packages: boost, boost-libs, emacs, gcc-fortran, libreoffice

 

If you are using an Optimus setup machine, DO NOT SELECT nvidia or nvidia-utils. You can always install additional packages after archlinux is setup.

 

Next you will be prompted to prepare the hard disks. Do not wipe the entire disk (unless that is what you intend to do). Select the manual setup and use the layout you want. Now proceed to the installation step where the installer will try to download packages over the network and install them. This will take about 10 – 20 minutes (depending on the machine and the packages selected), so it would be a good time for a coffee break.

 

At the end of the install process you will be asked to configure your system. The only real things you need to care about are rc.conf and pacman.d/mirrorlist.

Add your machine’s name at the end of HOSTNAME in rc.conf like so:

HOSTNAME=”archer”

 

Edit pacman.d/mirrorlist and uncomment the servers that are nearest to you. I chose rit.edu because their servers seem to be in sync the most.

 

Next setup your root password, and proceed to finish the setup. You will be asked about the location grub needs to be installed. Choosing /dev/sdX (sdX is where archlinux is installed, for most it is /dev/sda) should finish the setup and you can boot into archlinux.

 

For those using UEFI however, you will be notified that the procedure failed. This is expected and can be remedied.

 

Setting up UEFI entry

 

If you are not using an UEFI machine, please skip over this part. For those of you still here, this is what needs to be done.

 

Chrooting to archlinux

 

mount /dev/your_root_partition /mnt/
mount /dev/your_boot_partition /mnt/boot # Only if you have seperate boot partition
mount -o bind /dev /mnt/dev
mount -t proc /proc /mnt/proc/
mount -t sysfs /sys /mnt/sys/
chroot /mnt bash
mkdir /boot/efi
mount /dev/your_efi_partition /boot/efi
pacman-db-upgrade
pacman -Syy

Time to install grub

pacman -S grub2-efi-x86_64
grub-install –directory=/usr/lib/grub/x86_64-efi –target=x86_64-efi –efi-directory=/boot/efi –bootloader-id=Archlinx –boot-directory=/boot –recheck –debug

However it will fail to add an entry to EFI with the following error

Fatal: Couldn’t open either sysfs or procfs directories for accessing EFI variables

 

Run the following to finish setting up grub 2 before fixing EFI.

 

grub-mkconfig -o /boot/grub/grub.cfg

 

The flash drive

 

This is when you use the flash drive created earlier. Plug it in, reboot, choose to boot from the flash drive. Once you are in the shell, type the following:

bcfg boot add 0 fs0:\efi\Archlinux\grubx64.efi “Arch Linux”

This will create a temporary entry with which you can boot into ArchLinux.

 

Finishing it off

 

Reboot machine, choose “Arch Linux” entry from EFI table. You will be logged into Arch Linux. However this is a one time thing and you will need to make it permanent by running the following command

grub-install –directory=/usr/lib/grub/x86_64-efi –target=x86_64-efi –efi-directory=/boot/efi –bootloader-id=Archlinx –boot-directory=/boot –recheck –debug

Reboot and choose “Archlinux” as the boot option and you should be good to go!

 

Post installation

Ethernet

Apparently the driver required for this card has not yet been merged into the linux kernel. However the linuxfoundation website explains how to set it up.

 

Step 1: Download the driver. Unpack it.

Step 2: Install the driver.

./scripts/driver-select alx
make
sudo make install

Step 3: Reboot. You now have ethernet.

 

X and Graphics

For those of you unfamiliar with Archlinux, the initial setup is very minimal. It is command line only and you will need to set up the X server and the relavent drivers

pacman -S xf86-video-intel xorg-server xorg-xinit xorg-server-utils

To test X

pacman -S xorg-twm xorg-xclock xterm
startx

You should a minimal X come up. Exit it to continue with the rest of the setup.

Creating users

adduser username
pacman -S sudo
visudo # Give the user sudo permissions

Installing a desktop manager

This is where the installation steps may not be relavent to most people. I prefer gnome 3 over the other alternatives. You can replace gnome and gdm with their equivalents for the dm of your choice.

Log out as root, login with your user name.

sudo pacman -S dbus gnome gnome-extras networkmanager network-manager-applet
sudo rc.d start dbus
sudo rc.d start gdm # should bring up the login screen

after logging in,

- edit /etc/rc.conf and add dbus to DEAMONS immediately after syslog-ng

gnome 3.4 uses systemd instead of initscripts for a few things, including time and rc.local

You will need to do the following to set them up.

pamcan -S systemd systemd-arch-tools initscripts-systemd
systemctl enable gdm.service
systemctl enable NetworkManager.service
systemctl enable rc-local.service

Edit /etc/default/grub and add the following just below a similar line

GRUB_CMDLINE_LINUX=”init=/bin/systemd”

run the following command and reboot

grub-mkconfig -o /boot/grub/grub.cfg

 

Installing other packages

Sound:

pacman -S pulseaudio

Touchpad:

pacman -S xf86-input-symantics

Skype:

Edit /etc/pacman.conf. Uncomment multilib repository.

pacman -Syy
pacman -S lib32-pulseaudio skype # skype only available for 32 bit

Packer:

A package manager for third party Arch User Repository (AUR)

Download packer tarball

unpack tarball and cd into the directory

makepkg
pacman -S jshon
pacman -U packer-*

Installing nvidia driver

Since NVIDIA does not officially support Optimus on linux, the opensource project bumblebee is filling its shoes. You can set this up quite easily if you already installed packer (or other AUR helpers)

packer -S bumblebee bbswitch
packer -S nvidia-bumblebee nvidia-utils-bumblebee
sudo usermod -a -G bumblebee $USER

edit /etc/rc.conf and add nvidia to end of MODULES and @bumblebeed to end of DAEMONS

For some of you this may be enough. But just to be extra careful, create the following script and name it nvwake

This makes sure the nvidia module is loaded and /dev/nvidia0 and /dev/nvidiactl are created (which are required for those who intend to use CUDA)

#!/bin/bash
 
/sbin/modprobe nvidia
 
if [ "$?" -eq 0 ]; then
 
# Count the number of NVIDIA controllers found.
 
N3D=`lspci | grep -i NVIDIA | grep "3D controller" | wc -l`
 
NVGA=`lspci | grep -i NVIDIA | grep "VGA compatible controller" | wc -l`
 
N=`expr $N3D + $NVGA - 1`
 
for i in `seq 0 $N`; do
mknod -m 666 /dev/nvidia$i c 195 $i
done
 
mknod -m 666 /dev/nvidiactl c 195 255
else
exit 1
fi

Now make it executible and make sure it gets run at startup (by putting it in rc.local)

chmod +x nvwake
sudo echo /path/to/nvwake >> /etc/rc.local # make sure it is run at startup

Installing CUDA

This is the easiest part.

pacman -S cuda-toolkit cuda-sdk # should install opencl-nvidia

Reboot!

 

Welcome back.

 

You don’t need to do anything special for CUDA programs.
You can test if everything is alright by running the following

sudo make -C /opt/cuda-sdk/C/ -j # why use only one thread?
/opt/cuda-sdk/C/bin/linux/release/deviceQuery
/opt/cuda-sdk/C/bin/linux/release/deviceQueryDrv

To run any graphics examples you need to use optirun like this:

optirun glxgears
optirun glxspheres

They may take some time to load though.

And that is it.
Happy hacking!

 

EDIT

Added some more information about CUDA since it felt a bit abandoned.

If any of you are having trouble, leave a comment or dm me @pavan_ky on twitter.

Installing MAGMA with GOTOBLAS2

December 30, 2011 Category :GPU| HOWTO| Parallel 0

I finally decided to give MAGMA a try since their latest release looked promising. I think I finally have it compiled properly, but I have had to jump through some hoops to get there. So here I am, putting my experience out there in the wild for other wanderers to come across.

OPTIONS

Firstly, MAGMA needs a CPU LAPACK and BLAS backend installed on your machine.
There are four options for this.

  1. Intel’s MKL
  2. AMD’s ACML
  3. Netlib’s LAPACK + ATLAS
  4. Netlib’s LAPACK + GotoBLAS2

Each of the four options can be configured by one of the files make.inc.$(LIB). LIB is either mkl, acml, atlas or goto. I wanted to go the opensource all the way with this. For reasons inexplicable, I chose GOTOBLAS2 over ATLAS.

GOTOBLAS2

That meant, I had to build GOTOBLAS2 first. It was mostly painless; Except, I had gcc 4.6. Which meant the compiler started  complaining about -l flags with nothing mentioned to the right. It was quickly evident that a parser was broken in the pipeline. After digging through perl code (with which I have *no* experience) for a few minutes, I had the fix. The following patch had to be made to f_check inside the root directory of gotoblas.

$link =~ s/\-rpath\s+/\-rpath\@/g;
$link =~ s/\-l\ /\-l/g; # Add this new line around line 237.

MAGMA

Finally, with everything setup, I had to make a change or two to make.inc.goto.
- Change GPU_TARGET = 1 (because I use a fermi card. Leave as 0 if you have pre-fermi cards).
- Change lgoto to lgoto2
- Copy make.inc.goto to make.inc

Doing a make at this point halts with this  linker error. The forum post linked above talks about how to fix the issue.
-  in zlatrd.cpp and clatrd.cpp by replacing blasf77_*dotc with cblas_*dotc_sub.
-  Be aware that the function is used twice. The first around line 256, and the second around line 325.

Here are the changes to be made in zlatrd.cpp

  cblas_zdotc_sub(i, W(0, iw), ione, A(0, i), ione, &value);  // Line 256
  //blasf77_zdotc(&value, &i, W(0, iw), &ione, A(0, i), &ione);
  ...
  ...
  cblas_zdotc_sub(i_n, W(i +1, i), ione,A(i +1, i), ione, &value); // Line 326
  //blasf77_zdotc(&value, &i_n, W(i+1,i), &ione, A(i+1, i), &ione)

Make similar changes in clatrd.cpp.

  make -j 4

You are now good to go!

Parallel programming 1: CPU, GCC, OpenMP

September 20, 2011 Category :Parallel| Programming 0

I have been waiting to do this for a long time. Experiment with parallel programming from scratch.

I decided I am going to use Matrix multiply as the benchmark algorithm for my experiments. Boring ? Yes. But used the algorithm for the following reasons.

  • Easily understandable.
  • Quick to implement (read as I am lazy).
  • Can provide good flop counts

With that said, let’s get to the code. The code assumes row major format. This can trivially be changed to column major format by changing the order of A, B.

The Naive implementation
void naive1(int M, int N, float *A,
            int K, float *B, float *C, float *S)
{
    for (int k = 0; k < K; k++)
        for (int m = 0; m < M; m++) {
            float tmp = 0;
            for (int n = 0; n < N; n++)
                tmp += A[m * N + n] * B[n * K + k];
            C[m * K + k] = tmp;
        }
}

This should be pretty obvious. A [M x N], B [N x K] are inputs. C [M x K] is the output. The only new variable here is S, which isn’t doing anything here (will be used later on).

My immediate thought after this implementation: What happens when you switch m and k loops. I figured the performance will differ once the memory required > cache size. To experiment, here is the second naive version.

Another naive implementation
void naive2(int M, int N, float *A,
            int K, float *B, float *C, float *S)
{
    for (int m = 0; m < M; m++)
        for (int k = 0; k < K; k++) {
            float tmp = 0;
            for (int n = 0; n < N; n++)
                tmp += A[m * N + n] * B[n * K + k];
            C[m * K + k] = tmp;
        }
 
}

Proceeding along this logic, I think the matrix that is larger (and hence referenced more times), should be pushed to the cache if possible. This is my silly, naive, attempt to make that possible.

Switching between naive implementations
void naivec(int M, int N, float *A,
            int K, float *B, float *C, float *S)
{
    return (M > K) ? naive1(M, N, A, K, B, C, S) : naive2(M, N, A, K, B, C, S);
}

Now, for the first and only code-optimization for the day. Looking at the naive code we notice

  • Reads from A are contiguous.
  • Writes to C are contiguous.
  • Reads from B are not contiguous.

However, reading the entire column (which is not contiguous in row major format), before it is read repeatedly reduces the number of non contiguous reads (from K * M non contiguous column reads to K non contiguous and K * M contiguous reads).

Getting slightly clever
void trans(int M, int N, float *A,
           int K, float *B, float *C, float *S)
{
    for (int k = 0; k < K; k++) {
        for (int n = 0; n < N; n++)
            S[k * N + n] = B[n * K + k];
        for (int m = 0; m < M; m++) {
            float tmp = 0;
            for (int n = 0; n < N; n++)
                tmp += A[m * N +n] * S[k * N + n];
            C[m * K + k] = tmp;
        }
    }
 
}

Now that we have the code, what else can be done ? Look for some compiler optimizations. Here is a list of optimization options available with gcc. General rule of thumb (read “laziness to understand in detail”), when in doubt use -O3.

Prior experience, suggested that this code can be trivially implemented as multi-threaded version using OpenMP. The modified version of the above function now looks like this.

Simple threaded version
void trans(int M, int N, float *A,
           int K, float *B, float *C, float *S)
{
#if defined (_OPENMP) // Check to see if OPENMP available
#pragma omp parallel for // Use them threads
#endif
    for (int k = 0; k < K; k++) {
        for (int n = 0; n < N; n++)
            S[k * N + n] = B[n * K + k];
        for (int m = 0; m < M; m++) {
            float tmp = 0;
            for (int n = 0; n < N; n++)
                tmp += A[m * N +n] * S[k * N + n];
            C[m * K + k] = tmp;
        }
    }
 
}

Barely any changes. Here is the code on github (may change over time).

Setup

  • Intel Core i7 740QM processor
  • Archlinux (linux 3.1,  x86_64) with gcc 4.6.2
$ gcc matrix_multiply.c -std=c99 -o mmul
$ gcc matrix_multiply.c -std=c99 -o mmul_O3 -O3
$ gcc matrix_multiply.c -std=c99 -o mmul_O3_OMP -O3 -fopenmp
$ ./mmul 1024 1024 1024 1 2 3
Time taken for naive1 matrix multiply: 19097.0628 mSecs
Time taken for naive2 matrix multiply: 19207.3578 mSecs
Time taken for trans matrix multiply: 17308.4158 mSecs
$ ./mmul_O3 1024 1024 1024 1 2 3
Time taken for naive1 matrix multiply: 4804.5552 mSecs
Time taken for naive2 matrix multiply: 4719.7902 mSecs
Time taken for trans matrix multiply: 3233.6918 mSecs
$ ./mmul_O3_omp 1024 1024 1024 1 2 3
Time taken for naive1 matrix multiply: 1830.1074 mSecs
Time taken for naive2 matrix multiply: 1803.4624 mSecs
Time taken for trans matrix multiply: 411.4554 mSecs

The naive version goes down from 19 seconds to 0.4 seconds with just a few minor tweaks.

  1. Reduce the non-contiguous data reads (1.1x to 4.6x)
  2. Use compiler optimizations (3.9x to 5.4x)
  3. Use OpenMP (2.7x to 7.1x)

 

Notes

  • GFLOP count for matrix multiply = 2 * (n^3) / (time in nano seconds).
  • Matrix size of 1024 x 1024 was chosen to simplify the formula to GFLOPS = (2 / time in seconds)
  • That would leave the best version at ~4 GFLOPS.
  • Not a bad start considering the amount of time spent on the code = ~ 1 hour

Hello World

October 9, 2010 Category :Programming 0

#include <stdio.h>
int main(int argc, char **args)
{
printf("Hello world");
while (true) while (false);
return 0;
}

Yes it compiles!