# Glamo slowness question. Glamo transfer speed improvements (+33%). Video software decoding profile. Glamo DMA transfer.

Gennady Kupava gb at bsdmn.com
Sat Jul 17 22:42:19 CEST 2010

Hi, list.

Here is my next rant about Freerunner slowness :)

Today's topic of discussion is the Freerunner's most famous beast - the
glamo. I want to show that it is much better than some people think it
is. I also propose several significant optimizations: using DMA and
changing the cpu<->glamo bus timings. You will also find an analysis of
high resolution software video decoding.

1. --->>> Glamo bus speed

The most famous problem with our beast is the legendary, mythical cpu <->
glamo bus slowness. Let's measure the actual bus speed and fix it.

1.1. --->>> Current situation

The old claim is that the glamo should have a 7MB/s transfer speed. It's
not easy to find out how this number was calculated. I hope someone who
did glamo speed measurements can comment on this mail if I am wrong
somewhere.

1.2. --->>> Theory

Actually, the glamo bus speed is limited by the memory controller
settings in our cpu. This speed can be calculated with the formula
(HCLK/TWORD)*WORDSIZE, where HCLK is the memory bus frequency == 100MHz
(under normal conditions, without overclocking), WORDSIZE = 2 bytes, and
TWORD is the wait period; by default TWORD = 4+4+4 bus clocks. So for
the default settings the cpu<->glamo bus speed is (100*10^6/(4+4
+4))*2/1024^2 MB/s = 15.8MB/s. This speed may or may not also be
influenced by the nwait state of the cpu. We'll measure the actual
practical speed in section 1.3.

Together with Thomas White I reviewed the memory controller settings in
the s3c and found that the 4+4+4 setting does not seem reasonable.
According to Thomas' analysis of the timings in the glamo documentation
it should be 2+4+2, which is 33% fewer wait clocks than the default and
gives us: (100*10^6/(2+4+2))*2/1024^2 = 23.8MB/s.

As you can see, both numbers are much higher than 7MB/s.

1.3. --->>> Synthetic bus speed measurements.

So I used a simple tool to measure the actual cpu<->glamo bus transfer
speed. The tool opens the framebuffer device and runs a memcpy or memset
session for whole video frames; its speed measurement is accurate to +-1
frame. I measured how many frames can be written with each method
(memcpy or memset) in 1 second for each s3c memory controller setting
(the default 4+4+4 or the better 2+4+2), and used the formula
(640*480*2)*nr_frames to get the transfer speed. Of course, memcpy is
always slower than memset, as the cpu first has to fetch a 4-byte burst
of data from main memory and only then send it to the glamo, but memcpy
is by far the most common operation involving the glamo (in both video
and mmc transfers).

After actual measurements I produced the following table:

                 theory (see 1.2)    memset       memcpy
4+4+4 speed:     15.8MB/s            12MB/s       10.5MB/s
2+4+2 speed:     23.8MB/s            17.5MB/s     14.0MB/s

So, as you can see, both the default and the new settings are very far
from 7MB/s. You can also see that the glamo can really do 14MB/s, which
is 22 full screen 640*480 frames per second. Finally, notice that
changing the settings increases throughput by 33%.

1.4. --->>> Profiling a real application: mpeg2 video decoding with
mplayer.

To check the effect of changing the timing settings, I profiled
different mplayer configurations decoding a 480x640 mpeg2 video at 5fps.
Such a low frame rate was used to avoid any overruns or framedrops.

I tested the default (AKA slow) and the fast timing settings with -vo
x11 and -vo fbdev. Results for different vo's are not directly
comparable, as they happened to use different libc builds, so only the
(slow/x11 vs fast/x11) and (slow/fbdev vs fast/fbdev) pairs are
comparable.

The profiling setup is opcontrol --start; mplayer ... ; opcontrol --stop

First, I recorded cpu usage values from mplayer:
         fbdev    x11
slow     35/31    40/35
fast     31/22    42/26

Here is a comparison of slow vs fast fbdev:
www.bsdmn.com/openmoko/glamo/oprofile/mplayerfbdev.png

And a comparison of slow vs fast x11:
www.bsdmn.com/openmoko/glamo/oprofile/mplayerx11.png

As you can see, cpu usage of the libc memcpy call decreased from 25.2%
to 17.6% for fbdev (I think most of these memcpys are copies to the
glamo). In the x11 version, Xorg memcpy usage decreased from 19.1% to
13.2%.

In both cases, the memcpy function from glibc is the most time-consuming
operation.

Speaking generally about software video decoding, the next biggest cpu
consumer is the yuv2rgb function. I checked the mplayer code and found
that this function does not seem to be optimized specifically for arm,
so it might be possible to do something about it.

In general, cpu usage during 5fps mpeg2 decoding breaks down as follows:
17% memcpy (probably to the glamo) / 9% yuv2rgb / 3% mpeg_decode_slice /
41% idle / 11% oprofile / 1.6% io / plus others; together everything
adds up to 100% of the cpu.

You may find full profile results at:
www.bsdmn.com/openmoko/glamo/oprofile/profiles

1.5. --->>> Open questions

Actually, according to the glamo documentation, the timings should be
set to 1+4+2. But this does not seem to work, and we haven't found out
why. So this question is open for further investigation.

1.6. --->>> How to test/use it

I have used these settings by default for several days with qtmokoV24
and with debian on uSD.

I prepared a u-boot image that sets the new timings by default:
www.bsdmn.com/openmoko/glamo/242/u-boot_glamo242.udfu

You can check the current timings value with the following tool:
www.bsdmn.com/openmoko/glamo/timings/memwrite

To check, run:
#./memwrite 1207959560
(1207959560 is decimal for 0x48000008, the s3c memory controller
register holding the glamo bank timings.)

With the new timings you should get:
Old value:
addr[48000008]=0x80 0x13 0x00 0x00

I'm expecting some reports from users - did it work flawlessly for you?

2. --->>> DMA transfers

2.1. --->>> Theory and current situation

Currently, both the mmc and the glamo X driver use memcpy for memory
transfers. This means two things:
a. the cpu is 100% busy during the transfer;
b. with DMA, the cpu could instead schedule other tasks while the
transfer runs.

Older investigations of DMA transfers concluded 'slightly slower on
small transfers' and 'small benefit on large transfers'.

2.2. --->>> Synthetic testing

I wrote a proof-of-concept kernel module to do dma transfers directly
from userspace to glamo memory.

Contrary to previous investigations, I found that the dma transfer
works at exactly the same speed as the usual memcpy to the framebuffer.

Also, a DMA transfer does not 'block' the memory bus: I ran lmbench
during a 100% continuous dma transfer to the glamo; it was significantly
slower, but the system kept working.

(all tests were done with default timings)

Conclusion -> implementing dma for mmc and glamo may speed up the
system, and it may also reduce energy consumption (as the cpu can sleep
while a dma transfer is active).