Sunday, February 15, 2015

Optimizing performance on the Raspberry Pi and Raspberry Pi 2

The Raspberry Pi is a popular platform that usually runs from a flash-based SD card as the root filesystem. Most of the tips from the previous article apply to the Raspberry Pi. They work best on the common 512 MB memory variant of Raspberry Pi Model B, rather than the early 256 MB version.

Raspbian, the standard OS for the Raspberry Pi, has an extensive set of rsyslog logging options enabled by default, and the file system is configured in ordered data mode. While such a configuration is understandable from the viewpoint of system stability and error reporting, it is not beneficial for performance, to say the least: flash write access becomes a bottleneck for overall system performance.

Reducing logging activity, using tmpfs and using ramdisk for cache directories


The Raspberry Pi's rsyslog configuration is stored in the file /etc/rsyslog.conf. Commenting out the many logging rules by prefixing them with a '#' will eliminate logging activity.
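
As a rough sketch of how this could be automated, the shell function below comments out single-line rules that write to a file under /var/log. The function name and the matching pattern are assumptions made for illustration, not something rsyslog provides. Rules that span several lines using trailing backslashes are deliberately left alone and still need to be commented out by hand.

```shell
# Hypothetical helper (not part of rsyslog): comment out active single-line
# rules whose target is a file under /var/log. Directives starting with '$',
# existing comments and indented continuation lines are left untouched.
# A backup of the original file is kept with a .bak suffix.
comment_out_log_rules() {
    sed -i.bak -E '/^[^#$[:space:]].*[[:space:]]-?\/var\/log\//s/^/#/' "$1"
}

# Suggested use: try it on a copy first and inspect the result.
#   cp /etc/rsyslog.conf /tmp/rsyslog.conf
#   comment_out_log_rules /tmp/rsyslog.conf
#   diff /etc/rsyslog.conf /tmp/rsyslog.conf
```

Once the result looks right, the same function can be run with sudo on /etc/rsyslog.conf itself; rsyslog has to be restarted (or the system rebooted) before the change takes effect.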

The following lines added to /etc/fstab cause /tmp and /var/tmp to be stored in ramdisks:
    tmpfs    /tmp       tmpfs    defaults    0 0
    tmpfs    /var/tmp   tmpfs    defaults    0 0
To move cache directories to ramdisk, create the file /etc/profile.d/xdg_cache_home.sh with the following content:
    #!/bin/bash
    export XDG_CACHE_HOME="/dev/shm/.cache"
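
After a reboot it is worth checking that the new mounts are really in place. The following is a minimal sketch that looks up the filesystem type of a mount point in /proc/mounts; the function name fstype_of is hypothetical, and the optional second argument (an alternative mounts table) exists only to make the function easy to try out.

```shell
# Hypothetical helper: print the filesystem type mounted at a mount point.
# $1: mount point, $2: mounts table in /proc/mounts format (optional).
# If several entries match, the last one (the one mounted on top) wins.
fstype_of() {
    awk -v mp="$1" '$2 == mp { t = $3 } END { if (t != "") print t }' \
        "${2:-/proc/mounts}"
}

# On the Pi, each of these is expected to print "tmpfs" once the fstab
# entries are active:
#   fstype_of /tmp
#   fstype_of /var/tmp
#   fstype_of /dev/shm
```

An empty result means that nothing is mounted at that exact path, in which case the directory still lives on the SD card.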

Optimizing the file system


The Raspberry Pi does not accept the journal_async_commit mount option. However, write-back mode can be enabled and barriers can be disabled. Note that the risk of file system corruption with these settings is greater for the Raspberry Pi than for battery-powered devices.

It is best to store the performance options as default mount options in the filesystem itself using tune2fs. Data write-back mode (as opposed to the default ordered data mode) and no barriers are configured with the following command, assuming the root file system is stored in the /dev/mmcblk0p2 partition as it is on Raspbian:
    sudo tune2fs -o journal_data_writeback,nobarrier /dev/mmcblk0p2
When an error is detected during mounting at boot (including an option that is not accepted), the root file system will be mounted read-only, so that configuration files cannot be changed. It is however possible to remount the filesystem in read-write mode using the following command:
    sudo mount -o remount,rw /dev/mmcblk0p2 / 
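
To confirm what is stored in the filesystem, the superblock can be listed with tune2fs -l. The small filter below is only a sketch (the function name is made up here) that extracts the "Default mount options" line from that listing.

```shell
# Hypothetical helper: read "tune2fs -l" output on standard input and print
# only the default mount options stored in the superblock.
default_mount_opts() {
    sed -n 's/^Default mount options:[[:space:]]*//p'
}

# On the Pi:
#   sudo tune2fs -l /dev/mmcblk0p2 | default_mount_opts
# To undo the change, tune2fs clears an option when it is prefixed with '^':
#   sudo tune2fs -o ^journal_data_writeback,^nobarrier /dev/mmcblk0p2
```

The options only take effect at the next mount, so a reboot is needed for the root file system.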

Disabling X Window System error logging


Edit the file /etc/X11/Xsession so that the relevant lines (somewhere in the middle of the file) look like this. The log-generating line is commented out and replaced with one that routes messages to /dev/null:
    #exec >> "$ERRFILE" 2>&1
    exec >> /dev/null 2>&1

Increasing the size of the console font


The Raspberry Pi comes configured with a relatively small font in the text console. However, it is easy to configure a larger font that greatly improves readability by editing /etc/default/console-setup to include, for example, the following two lines:
    FONT_FACE="TerminusBold"
    FONT_SIZE="16x32"
This configures a highly readable 16x32 font. This amounts to 80x32 characters on a 1280x1024 monitor. On a 1920x1080 monitor, this amounts to 120x33 characters with 24 scanlines unused.
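
The character grid follows from integer division of the screen resolution by the font cell size. A small sketch of that arithmetic (console_grid is a name invented for this example):

```shell
# Hypothetical helper: compute the text grid for a given screen resolution
# and font cell size. Arguments: width height font_width font_height.
console_grid() {
    local w=$1 h=$2 fw=$3 fh=$4
    echo "$((w / fw))x$((h / fh)) characters, $((h % fh)) scanlines unused"
}

console_grid 1280 1024 16 32   # prints: 80x32 characters, 0 scanlines unused
console_grid 1920 1080 16 32   # prints: 120x33 characters, 24 scanlines unused
```

The same calculation can be used to see what another font size would give before editing /etc/default/console-setup.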

Editing nano context highlighting configuration


The convenient nano text editor comes configured with context-sensitive color highlighting by default. It operates using rules read from a configuration file, which depends on the file extension of the file being edited. Some of these rules are very complex and much too slow for a device such as the Raspberry Pi, resulting in sluggish editing. For example, the rules for C/C++ source files in /usr/share/nano/c.nanorc include a rule labelled with the comment "This string is VERY resource intensive!". Commenting out this rule by putting a '#' in front of it results in a greatly improved experience when editing C/C++ source files. Similar resource-intensive rules exist in the configuration for assembler files in /usr/share/nano/asm.nanorc.
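
For those who prefer not to edit the file by hand, here is a sketch of a one-line sed approach. It assumes that the slow rule occupies exactly one line directly below the "VERY resource intensive" comment; that layout is an assumption and should be checked in the installed file first. The function name is made up for this example.

```shell
# Hypothetical helper: comment out the rule on the line directly after the
# "VERY resource intensive" comment, unless it is already commented out.
# A backup of the original file is kept with a .bak suffix.
disable_slow_nano_rule() {
    sed -i.bak '/VERY resource intensive/{n;s/^[^#]/#&/;}' "$1"
}

# On the Pi (after inspecting the file):
#   sudo bash -c "$(declare -f disable_slow_nano_rule); \
#       disable_slow_nano_rule /usr/share/nano/c.nanorc"
```

Running the function a second time changes nothing, because a line that already starts with '#' is skipped.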

Overclocking


Overclocking options can be set using the raspi-config configuration utility. Overclocking is often stable on the device. I had no problems with the highest "Turbo" setting, which overclocks the CPU to 1000 MHz and the RAM to 600 MHz. The "core frequency" is also doubled from 250 MHz to 500 MHz. This frequency is tied to the L2 cache used by the GPU (it does appear to cause a large increase in GPU performance).

On the Raspberry Pi 2, the default clock speed is 900 MHz, while the "core frequency" (which has a different meaning than it has for the original Raspberry Pi) is a conservative 250 MHz. The main overclocking option intended for the Raspberry Pi 2 clocks the CPU cores at 1000 MHz, RAM at 500 MHz, and doubles the "core frequency" to 500 MHz. This has a positive effect on performance (especially memory performance), suggesting that the core clock is indeed correlated with CPU cache speed on the Raspberry Pi 2, which has a separate L2 cache for the CPU. However, this setting turned out to be not quite 100% stable, with the culprit being an SDRAM speed that is slightly too high.
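
To check which clocks are actually in effect after changing the setting, the firmware can be queried with vcgencmd. The sketch below converts the reported value to MHz; clock_mhz is a hypothetical helper, and the "frequency(NN)=value" line format is what vcgencmd measure_clock printed on the systems I am aware of.

```shell
# Hypothetical helper: convert a line such as "frequency(45)=700000000"
# (as printed by "vcgencmd measure_clock") into whole MHz.
clock_mhz() {
    local hz=${1#*=}          # strip everything up to and including '='
    echo $((hz / 1000000))
}

# On the Pi:
#   clock_mhz "$(vcgencmd measure_clock arm)"
#   clock_mhz "$(vcgencmd measure_clock core)"
```

Note that the ARM clock may read lower than the configured maximum while the system is idle if frequency scaling is active, so it is best measured under load.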

The amount of memory allocated to the GPU (memory split) should be set to something like 128 if you want to be able to run a wider range of OpenGL ES 2.0 applications, assuming a 512 MB Raspberry Pi.

Benchmarks


The performance increase from overclocking the CPU and RAM is measurable in low-level benchmarks. The following benchmarks were run with framebuffer_depth=32 in /boot/config.txt and a monitor resolution of 1280x1024.

Running the benchmark program from the "fastarm" repository (http://www.github.com/hglm/fastarm) as follows shows a significant performance increase from the default settings:
    ./benchmark --memset a --test 4
    ./benchmark --memcpy a --test 43
    ./benchmark --memcpy a --test 0
The memset benchmark (which reflects sequential DRAM write access) shows a performance increase from 1220 MB/s to 1929 MB/s (58%). The memcpy benchmark (which reflects copying a sequential memory region of 4 KB) shows an increase from 320 MB/s to 432 MB/s (35%). The third benchmark (which reflects smaller memory copies of varying size) is also dependent on CPU performance. It shows an impressive increase from 206 MB/s to 353 MB/s (71%).
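
The percentages quoted here are simply the relative increase of the overclocked result over the default result, rounded to a whole number. A small sketch of the calculation (speedup_pct is a name chosen for this example; awk is used because the shell itself only does integer arithmetic):

```shell
# Hypothetical helper: relative speed-up in percent, rounded to an integer.
# Arguments: baseline result, overclocked result (same unit).
speedup_pct() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", (b / a - 1) * 100 }'
}

speedup_pct 1220 1929   # memset:          prints 58
speedup_pct 320 432     # 4 KB memcpy:     prints 35
speedup_pct 206 353     # varying memcpy:  prints 71
```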

On the Raspberry Pi 2, the memset benchmark only scores 1090 MB/s with default settings. This seems to be a bandwidth bottleneck related to CPU and L2 cache speed limitations, since other memset variants (including one using NEON SIMD instructions) show the same result. Multi-threaded benchmarks may potentially show higher throughput. However, when using the overclocking option, which significantly increases the clock speed of the L2 CPU cache, memset performance increases to 1730 MB/s, in line with the original Raspberry Pi.

The second benchmark (4K memcpy) reports 965 MB/s, which is significantly faster than the original Raspberry Pi. This seems to imply that write bandwidth provides the bottleneck. When overclocking, the result further increases to 1380 MB/s. The third benchmark (memcpy of regions of varying size) reports 673 MB/s, significantly higher than the original Raspberry Pi, which increases to 785 MB/s when overclocking.

The low-level benchx program (http://www.github.com/hglm/benchx) can be used to measure pixel performance in the X server. A command line similar to the one below was used:
    ./benchx --window All
Overclocking shows the following improvements (measurements in MBytes/s) on the Raspberry Pi:
    Test                          Standard     Overclock (Turbo)   Speed-up
    ScreenCopy (33x33)               84.7           150               77%
    ScreenCopy (549x549)            184             277               51%
    FillRect (33x33)                842            1282               52%
    FillRect (549x549)             1161            1835               58%
    PutImage (549x549)               45.1            72.8             74%
    ShmPutImage (549x549)           229             365               59%
    ShmPutImageFullWidth (109x109)  298             534               79%
    ShmPutImageFullWidth (549x549)  233             364               56%
The SRE real-time rendering library (http://www.github.com/hglm/sre) allows us to measure 3D GPU performance. The following benchmarks were run:
    ./sre-demo --benchmark --multi-pass --multiple-lights --shadow-volumes demo2
    ./sre-demo --benchmark demo2
    ./sre-demo --benchmark demo9
The first benchmark, a complex 3D benchmark with multiple lights, multi-pass rendering and shadows, shows an increase from 1.118 fps to 1.325 fps (19%) when overclocking. The second benchmark uses only a single light with single-pass rendering and no shadows, and shows an impressive increase from 13.2 fps to 27.2 fps (106%). The third benchmark, a lighter benchmark which uses a lot of alpha blending, increased from 30.2 fps to 48.0 fps (59%).

On the Raspberry Pi 2, the first benchmark reports an even slower framerate of 0.91 fps, while the second benchmark scores 20.5 fps, and the third benchmark scores 32.5 fps. Surprisingly, these results do remain about the same when overclocking. This is probably because on the Raspberry Pi 2, the core_freq setting does not directly affect the GPU.

File system benchmarks


Testing low-level file system performance using flash-bench (http://github.com/hglm/flash-bench) gives an indication of flash storage performance. A high-speed UHS 1 MicroSD card was used (using an SD card adapter). Measurements in MB/s.

                                          Seq.    Seq.    Random  Random
                                          Read    Write   Read    Write
    PC (using USB SD card adapter)        18.1    15.7     3.55   15.3
    Raspberry Pi, optimized fs options    17.2    13.3     4.35    0.96
    Raspberry Pi (turbo overclock)        17.2    14.4     4.83    0.86
    Raspberry Pi 2                        17.4    14.5     5.09    1.47
    Raspberry Pi 2 (same card via USB)    16.2    10.9     4.38    1.19
    Raspberry Pi 2 (different card)       17.7     9.5     4.46    1.31
    Raspberry Pi 2 (same card, overclock) 17.7     9.7     4.52    1.56
    Raspberry Pi 2 eMMC via USB           17.8    14.4     4.38    6.80
The results show very slow random write performance, despite the use of the journal_data_writeback and nobarrier options, when compared to testing on other devices. Whether this is reflected in actual real-world performance is unclear.

The last entry is for a high-performance eMMC flash module mounted on an eMMC-to-MicroSD adapter, which in turn was plugged into a USB MicroSD reader. This shows that higher random write performance is possible given high-performance flash memory. Although I have not tested this set-up with the Raspberry Pi 2's internal MicroSD card slot, it would probably deliver similar or better performance.

Updated February 25, 2015.

8 comments:

  1. This is some quality content, thanks

  2. This is why I want to get my Computer Engineering degree, great job!

  3. I would also consider reducing the RAM allocated to the GPU. You can do this under "sudo raspi-config".

    1. Small amount of RAM for GPU is not good idea either - of course depends what you actually want to do with Rpi.

      "Setting gpu_mem to low values may automatically disable certain firmware features (as there are some things the GPU simply can't do with too little memory). So if a certain feature you're trying to use isn't working, try setting a larger GPU memory split."

      For basic desktop use I found the 256 to be good compromise between usability (GUI performance) and available RAM for system.

    2. I was under the impression that dynamic memory splitting had been the default for a really long time now...

  4. I tried these tweaks without any overclocking, and I noted that I had to chmod -x ntp, and now my keyboard and mouse stop working when I start python, I have to unplug/replug.

  5. Found part of the problem:
    It seems that when python tries to open the USB serial port device I have, we lose the keyboard 'n mouse. So we've exposed an unknown problem.

  6. I only connected the tx/rx and ground pins to the robot, not VCC. It's on a hub.
