[PATCH 0/2] Improve GTA02 NAND read performance by 41%

Harald Welte laforge at openmoko.org
Tue Oct 21 11:22:31 CEST 2008


On Tue, Oct 21, 2008 at 12:33:34AM +0100, Andy Green wrote:
> | result (based on 512byte dd): 9.197MByte/sec	(98% speed-up)
>
> Wow... the 98% sounds good already but on a benefit-per-line-of-patch
> basis it's probably a record.

well, it is both the timings _and_ the hardware ECC benefit together (the
latter work by zecke is already in your kernel tree, so it's cheating
a bit).

> | However, I don't think that all of the time is spent copying data, but
> | rather polling for when data is finished. The s3c244x (not 2410) support a
> | RnB interrupt which should solve this issue.
> |
> | The mainline kernel NAND code doesn't have infrastructure for this yet,
> | but I'm working on this right now.
>
> Yes this is similar to the Glamo MCI thing, you ask for a block and then
> some time later you get a completion interrupt.  In the meanwhile the
> MCI stack has allowed other processes to get the CPU... it'd be cool to
> have that for NAND too because at boot-time there can easily be other
> processes floating around that have a use for the CPU inbetween NAND,
> and if not then parallel startup will increase the probability of it.

Interestingly, I now have a patchset that uses completion-interrupt based
logic and I still get the exact same throughput at the same CPU usage.

I have confirmed that the new code was actually used by printk's in the
interrupt and completion function.  Also, the interrupt count in /proc/interrupts
is visibly increasing every 2k that is read from the device.
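
As a rough cross-check of the interrupt count, the expected rate can be
estimated from the figures quoted in this thread.  This is only a sketch: it
assumes "2k" means one 2048-byte page per interrupt and that "MByte" means
10^6 bytes; the function and constant names are made up for illustration.

```python
# Expected interrupt rate if one RnB interrupt fires per page read.
# Assumptions (not stated in the thread): "2k" = 2048 bytes,
# "MByte" = 10**6 bytes.
PAGE_BYTES = 2048
THROUGHPUT_BYTES_PER_S = 9.197e6  # throughput figure quoted above


def interrupts_per_second(throughput, page_bytes=PAGE_BYTES):
    """Interrupts per second at one interrupt per page."""
    return throughput / page_bytes


def page_time_us(throughput, page_bytes=PAGE_BYTES):
    """Microseconds spent transferring one page at the given throughput."""
    return page_bytes / throughput * 1e6


print(f"{interrupts_per_second(THROUGHPUT_BYTES_PER_S):.0f} interrupts/s")
print(f"{page_time_us(THROUGHPUT_BYTES_PER_S):.1f} us per page")
```

Under these assumptions the counter in /proc/interrupts should grow by about
4500 per second during a sustained read, i.e. one interrupt roughly every
223 microseconds.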

So now I'm looking at it with oprofile and we get 35% in __raw_readsl, which
is expected.  At a 45ns theoretical byte clock, there are 18 CPU core cycles
(and thus at most 18 instructions) per byte we read from NAND.  Since we read
word-wise, that is 72 per 32-bit word.  Since our actual clock runs at 50ns,
it is something like 80 per word.  Given that our actual data rate is
108.73ns per byte, we actually get something like 174 core clocks (and thus
maximum CPU instructions) per word.  So there's not much time, and the
loads+stores in __raw_readsl will likely take significant time.
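
The cycle budget above can be reproduced with a few lines of arithmetic.  A
minimal sketch, assuming a 2.5ns core cycle (which is what "18 instructions
at 45ns" implies) and 4-byte words; the names are hypothetical and only used
here for illustration.

```python
# Core-cycle budget per word read from NAND.
# Assumption: 2.5 ns per core cycle, inferred from "18 per byte at 45 ns".
CORE_CYCLE_NS = 45 / 18


def cycles_per_word(ns_per_byte, bytes_per_word=4):
    """Core cycles that elapse while one word arrives from NAND."""
    return ns_per_byte / CORE_CYCLE_NS * bytes_per_word


budget = {
    "theoretical byte clock (45 ns)": cycles_per_word(45.0),
    "actual byte clock (50 ns)": cycles_per_word(50.0),
    "measured data rate (108.73 ns)": cycles_per_word(108.73),
}

for label, cycles in budget.items():
    print(f"{label}: {cycles:.1f} cycles per word")
```

This prints 72.0, 80.0 and 174.0 cycles per word.  The measured 108.73ns per
byte is simply the inverse of the 9.197 MByte/sec result, taking MByte as
10^6 bytes.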

However, what is really surprising is default_idle showing up with 21.7%!

Top reports 98% CPU load, but oprofile claims 21.7% idling.  Really strange.

The actual s3c2410 NAND driver shows up with a total of 1.12%, the
interrupt/completion related functions with 0.0625%.

I'm considering this to be some kind of artefact of using oprofile.  I've 
never used it on ARMv4 before, and it seems like it can only use the timer
tick.

dyn_tick is disabled, so the timer ticks should be monotonic (and powertop
proves they are).

I'm really lost here, don't know what else to do.  I'll get some profiles on a
soft-ECC and on a non-irq-based-NAND kernel to compare the results and see if
they also show this 'artefact'.  Maybe 'top' is actually wrong?  Any ideas?

Cheers,
-- 
- Harald Welte <laforge at openmoko.org>          	        http://openmoko.org/
============================================================================
Software for the world's first truly open Free Software mobile phone


