[PATCH 0/2] Improve GTA02 NAND read performance by 41%
Harald Welte
laforge at openmoko.org
Tue Oct 21 11:22:31 CEST 2008
On Tue, Oct 21, 2008 at 12:33:34AM +0100, Andy Green wrote:
> | result (based on 512byte dd): 9.197MByte/sec (98% speed-up)
>
> Wow... the 98% sounds good already but on a benefit-per-line-of-patch
> basis it's probably a record.
Well, it is both the timings _and_ the hardware ECC benefit together (the
latter work by zecke is already in your kernel tree, so it's cheating
a bit).
> | However, I don't think that all of the time is spent copying data, but
> | rather polling for when data is finished. The s3c244x (not 2410) support a
> | RnB interrupt which should solve this issue.
> |
> | The mainline kernel NAND code doesn't have infrastructure for this yet,
> | but I'm working on this right now.
>
> Yes this is similar to the Glamo MCI thing, you ask for a block and then
> some time later you get a completion interrupt. In the meanwhile the
> MCI stack has allowed other processes to get the CPU... it'd be cool to
> have that for NAND too because at boot-time there can easily be other
> processes floating around that have a use for the CPU inbetween NAND,
> and if not then parallel startup will increase the probability of it.
Interestingly, I now have a patchset that uses completion-interrupt based
logic and I still get the exact same throughput at the same CPU usage.
I have confirmed that the new code was actually used by printk's in the
interrupt and completion function. Also, the interrupt count in /proc/interrupts
is visibly increasing every 2k that is read from the device.
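For anyone who wants to repeat the interrupt-count check without printk's, here is a rough sketch that compares two snapshots of /proc/interrupts taken before and after a read. The IRQ description string you match on and the helper names are assumptions for illustration only; use whatever name your kernel actually shows for the NAND interrupt.

```python
def irq_count(proc_interrupts_text, name):
    """Return the summed per-CPU count of the first /proc/interrupts line
    whose description contains `name`, or None if no line matches."""
    for line in proc_interrupts_text.splitlines():
        if ":" not in line:
            continue  # header line ("CPU0 ...") has no "NN:" prefix
        _, rest = line.split(":", 1)
        if name not in rest:
            continue
        total = 0
        for field in rest.split():
            if not field.isdigit():
                break  # counts come first, then controller/description
            total += int(field)
        return total
    return None


def irqs_per_page(count_before, count_after, bytes_read, page_size=2048):
    """Interrupts seen per NAND page read (2k pages assumed by default)."""
    pages = bytes_read / page_size
    return (count_after - count_before) / pages
```

Read /proc/interrupts into a string, run the dd, read it again, and feed both counts to irqs_per_page(). A result close to 1.0 matches the "one interrupt every 2k" observation.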
So now I'm looking at it with oprofile and we get 35% in __raw_readsl, which
is expected. At a 45ns theoretical byte clock, there are 18 CPU core
instructions per byte we read from NAND. Since we read word-wise (32 bits at
a time), we get 72 instructions for each word. Since our actual clock runs at
50ns, it is something like 80 instructions per word. Given that our actual
data rate is 108.73ns per byte, we actually get something like 174 core clocks
(and thus maximum CPU instructions) per word. So there's not much time, and
the loads+stores in __raw_readsl will likely take significant time.
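The arithmetic above can be redone as a small sketch. It assumes a 400 MHz core clock (2.5ns per cycle), which is what the "18 instructions per 45ns byte" figure implies, and it takes MByte to mean 10^6 bytes. The function names are made up for illustration.

```python
# Back-of-envelope cycle budget for reading NAND word-wise.
# Assumption: 400 MHz core clock, i.e. 2.5 ns per core cycle.
CORE_CLOCK_HZ = 400_000_000
BYTES_PER_WORD = 4  # __raw_readsl transfers 32-bit words


def ns_per_byte_from_throughput(mbyte_per_sec):
    """Convert a throughput in 10^6 bytes/sec to nanoseconds per byte."""
    return 1e3 / mbyte_per_sec


def cycles_per_word(ns_per_byte, core_clock_hz=CORE_CLOCK_HZ):
    """Core clocks that elapse while one 32-bit word arrives from NAND."""
    ns_per_cycle = 1e9 / core_clock_hz
    return ns_per_byte / ns_per_cycle * BYTES_PER_WORD


if __name__ == "__main__":
    cases = (
        ("theoretical byte clock", 45.0),
        ("actual byte clock", 50.0),
        ("measured data rate", ns_per_byte_from_throughput(9.197)),
    )
    for label, ns in cases:
        print(f"{label}: {ns:.2f} ns/byte, "
              f"{cycles_per_word(ns):.0f} core clocks per word")
```

With the measured 9.197MByte/sec this comes out at about 108.73ns per byte and just under 174 core clocks per word.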
However, what is really surprising is default_idle showing up with 21.7%!
Top reports 98% CPU load, but oprofile claims 21.7% idling. Really strange.
The actual s3c2410 NAND driver shows up with a total of 1.12%; the
interrupt/completion related functions account for 0.0625%.
I'm considering this to be some kind of artefact of using oprofile. I've
never used it on ARMv4 before, and it seems like it can only use the timer
tick.
dyn_tick is disabled, so the timer ticks should be monotonic (and powertop
proves they are).
I'm really lost here, don't know what else to do. I'll get some profiles on a
soft-ECC and on a non-irq-based-NAND kernel to compare the results and see if
they also show this 'artefact'. Maybe 'top' is actually wrong? Any ideas?
Cheers,
--
- Harald Welte <laforge at openmoko.org> http://openmoko.org/
============================================================================
Software for the world's first truly open Free Software mobile phone