Centralization of graphical awesomeness

Carsten Haitzler (The Rasterman) raster at rasterman.com
Tue Oct 27 15:58:14 CET 2009


On Tue, 27 Oct 2009 17:11:08 +0300 Gennady Kupava <gb at bsdmn.com> said:

> I am sorry, but my letter is not about trolling and blaming but about
> optimization, qt and e, speed is interesting for me, not blaming. Calm
> down guys! I've numbered separate points overwise my letter will look
> endless :)
> 
> 1. First, bit about qt scrolling - It's not so simple Carlsen want it to
> be. I see background image, rendered text and 1-2 relativaly small image
> each line. "Apllications" menu have ~40 entries. All scrolling very
> smooth, and no rectangles where. Carlsen, have you run qtmoko? Buttons
> are changing only then you press them. What prevents E from prerendering
> contents of scrollable area, it is not changing on the fly? Lack of this
> optimization makes menus and scrollable areas much slower. 

scrolling isn't any special operation in efl. it's moving some objects around.
that's all it is. a scroller just moves its child around. moving an object
queues redraws for the previous and current positions. evas merges all redraws
at render time and just does them. it will avoid drawing things that will
later be overwritten by solid pixels, as long as it knows that they are solid
(eg a solid rect, an image without an alpha channel etc.). scrolling is done
very differently. you can't "pre-render" as things get rendered on the fly.
everything does. evas has caches to save copies of scaled images (if smooth
scaled) to save redoing the smooth scaling computation on every redraw. but
it's still a draw. it is done this way because it is incredibly flexible. you
get the ability to have translucent items and all sorts of goodies. a draw in
the end is a read from some source and a write to a dest in evas. the more
reads and writes you do - the worse it gets. worst is an alpha blend as it's
read source, read dest, then write to dest (after some calculations).
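
To make the memory traffic concrete, here is a rough sketch in C of the two kinds of draw described above. It is not evas code. It assumes 32-bit premultiplied ARGB pixels, and the names copy_span, blend_span and mul255 are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Opaque source: one read and one write per pixel, no math. */
static void copy_span(uint32_t *dst, const uint32_t *src, size_t n)
{
    memcpy(dst, src, n * sizeof *dst);
}

/* Rounded (a * b) / 255 for 8-bit channel values. */
static uint32_t mul255(uint32_t a, uint32_t b)
{
    uint32_t t = a * b + 128;
    return (t + (t >> 8)) >> 8;
}

/*
 * Alpha blend ("over") of a premultiplied ARGB source onto dst:
 * read source, read dest, compute, then write dest.
 * Assumes every source channel is <= the source alpha (valid
 * premultiplied data), so the per-channel sum cannot exceed 255.
 */
static void blend_span(uint32_t *dst, const uint32_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t s = src[i];          /* read source */
        uint32_t d = dst[i];          /* read dest   */
        uint32_t inv = 255 - (s >> 24);
        uint32_t out = 0;

        for (int shift = 0; shift < 32; shift += 8) {
            uint32_t sc = (s >> shift) & 0xff;
            uint32_t dc = (d >> shift) & 0xff;
            out |= (sc + mul255(dc, inv)) << shift;
        }
        dst[i] = out;                 /* write dest  */
    }
}
```

The copy touches each pixel twice (one read, one write). The blend touches it three times and does arithmetic in between, which is why it is the most expensive case.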

now... if your list in elementary had NO background except for the selected
item, ALL it would do is draw the changes in text items - ie fill in the
background (a solid color would be writes only, an image would be read then
write) and then draw text on top (an alpha blend op with the source data being
only 8bit alpha). and this only for where the text is.

for qte/qtopia/qtwhatever it is called now, if you have a background that moves
with the text that scrolls, then it is a simple copy (copy the current area up
or down N pixels), thus a read and write, then draw the new area. if the
display has a static bg and scrolling text - it's the same as evas. evas's
scrolling is ONLY this method. if you configured the theme to be the same as qt
(from memory it was solid black bg's etc.) you'd end up with approximately the
same speed. evas would do a bit more work as it'd alpha blend the text, but it
would avoid copying areas of the list that didn't change (eg strings don't fill
the entire line, only part of it).
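
The "simple copy" scroll can be sketched like this. It is an illustration only, not qt or evas code. It assumes a packed buffer of 32-bit pixels with no row padding, and scroll_up is a made-up name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Scroll a w x h pixel buffer up by n rows with a single overlapping
 * copy (one read and one write per surviving pixel). The bottom n rows
 * keep their old contents and must be redrawn by the caller.
 */
static void scroll_up(uint32_t *fb, size_t w, size_t h, size_t n)
{
    if (n == 0 || n >= h)
        return; /* nothing to move: either no scroll or a full redraw */
    memmove(fb, fb + n * w, (h - n) * w * sizeof *fb);
}
```

This only works when the whole area, background included, moves together. With a static background behind moving text there is nothing valid to copy, so every changed region has to be drawn again.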

it's Carsten btw. "t" not "l" :) and i have run qtmoko... before the freerunner
was even out. i remember it being orange for selected list items (rectangle),
empty if not (just text) and a greyish "qt" logo background with some visible
dither patterns on scale up.

> 2. Second point of my letter was that Glamo seem should not be blamed
> for everything. I wrote simple program to measure simple memcpy speed on
> om... This program just allocates 2 buffers of defined size and outputs
> count of memcpys of defined size it did in 1 second (interrupt via
> alarm()). Initally I want to see how arm cache cleanup and task switch
> influences parallel memory access tasks. Result were surprising for me:

glamo is one of the big problems. a write to video memory - eg a new screen
frame - is, based on your numbers below, about 1/5th the speed. it is as IF you
copied 5x the data from memory to memory. that's a heavy cost.

> OM:
> buffer_size average_number computed_throughput
> 128 1260880 153Mb/s
> 256  540900 132Mb/s
> 512  252399 123Mb/s
> 1024 121988 119Mb/s
> 2048  58827 114Mb/s
> 4096  29000 113Mb/s
> 8192  14000 109Mb/s
> 16384  3660 57Mb/s
> 32768  1105 36Mb/s
> 65536   553 34.5Mb/s
> 131072  274 34.2Mb/s
> 262144  135 33.7Mb/s
> 524288   69 34.5Mb/s

only the last few really count. the first ones are just caching effects.
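
For reference, a benchmark of the kind described above (two buffers, count memcpys until alarm() fires) might look roughly like the sketch below. It is not the original program, which was not posted. The names count_memcpys and throughput_mib are made up, and the throughput is assumed to be in MiB/s, which matches the table (524288 bytes x 69 copies = 34.5).

```c
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static volatile sig_atomic_t time_up;

static void on_alarm(int sig)
{
    (void)sig;
    time_up = 1;
}

/*
 * Count how many memcpy() calls of `size` bytes complete in `seconds`.
 * Returns -1 on bad arguments or allocation failure.
 */
static long count_memcpys(size_t size, unsigned seconds)
{
    /* Call through a volatile pointer so the compiler cannot drop the copy. */
    void *(*volatile copy)(void *, const void *, size_t) = memcpy;
    char *src, *dst;
    long count = 0;

    if (size == 0 || seconds == 0)
        return -1;
    src = malloc(size);
    dst = malloc(size);
    if (!src || !dst) {
        free(src);
        free(dst);
        return -1;
    }
    memset(src, 0xa5, size); /* touch both buffers before timing */
    memset(dst, 0, size);

    time_up = 0;
    signal(SIGALRM, on_alarm);
    alarm(seconds);
    while (!time_up) {
        copy(dst, src, size);
        count++;
    }

    free(src);
    free(dst);
    return count;
}

/* Throughput in MiB/s for `count` copies of `size` bytes in `seconds`. */
static double throughput_mib(size_t size, long count, unsigned seconds)
{
    return (double)size * (double)count / (1024.0 * 1024.0) / seconds;
}
```

Calling count_memcpys(524288, 1) and passing the result to throughput_mib gives one row of the table. Small sizes mostly measure the cpu cache, so the large sizes are the ones that say something about memory.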

> I did same test on my very-old-Celeron 600 router:
> 256     2522958         615Mb/s
> 512     2088723         1019Mb/s
> 1024    1554162         1571Mb/s
> 2048    1019996         1992Mb/s
> 3072    762667          2234Mb/s
> 4096    109489          427Mb/s
> 16384   27389           427Mb/s
> 262144  318             79Mb/s
> 524288  151             75Mb/s
> 1048576 76              76Mb/s

yes. better. the 2442 in the gta02 is an ooold arm cpu. it's not too modern.
given when the gta02 was released... it'd be like making a pentium4 laptop and
releasing it and selling it as new in today's shops.

> On desktop (xeon 2Gz):
> x64 binary:
> 26214400 74              1850 Mb/s
> 262144   31971           7992 Mb/s
> 256      53994512        13232 Mb/s
>                      
> x32 binary: 
> 26214400        59                 1475 Mb/s
> 262144          29068              7267 Mb/s
> 256             20810406           5080 Mb/s
> 
> Old arm-based device at my work (at91rm9200, 180Mhz)
> 2560000 16	39Mb/s
> 256000	167	40Mb/s
> 256 	418044	102Mb/s

yup. even that is better. :)

> So, we can see that we have speed of 34Mb/s (it's ever only 5 times
> declared 7Mb/s for Glamo!) can someone comment? why memcpy is so slow -

well it possibly is partly memcpy itself. i'd have to check but you may be able
to improve it with some asm. i know on newer arms you really want to use vfp or
neon especially for memcpy's - u can get something like 2x the speed.

> it is 2 times slower than ancient celeron, and on par with very old
> arm-based machine, it is not related to glamo anyhow! We can even skip
> results with cache, where om 10 times slower old machine.

correct. never claimed the cpu was fantastic :)

> 3. ... e - every N seconds (see config dialogs for what it is set
> to there, but let's say 60 seconds) will flush caches. ... and things are
> having to be repopulated from disk ...
> 
> From disk?! This is cost or having small memory footprint? This looks very
> wrong.

where do u think images come from? the arrows on buttons? the buttons
themselves? icons? all that data comes from a disk - from a file on disk. if it
isn't needed anymore (it's invisible) and it cycles into speculative cache, it
can be flushed. eg:

a png icon on disk might be 24kb - in ram it becomes 64kb. those 64kb's add up.
this is a tunable that can be disabled - if you want. but this keeps memory
usage low. the default cache for e is 4mb of ram. it will keep the most recent
decoded images in there. (that's for images - there are more caches for other
things). it's loaded then kept - dereferenced when not visible. if you want to
be sure, strace and see when it's actually opening files.
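
The arithmetic behind those numbers, as a small sketch. The 128x128 icon size and 4 bytes per decoded pixel are assumptions picked to match the 64kb figure, and both function names are made up.

```c
#include <stddef.h>

/* Size of a decoded image in ram, assuming 4 bytes (ARGB) per pixel. */
static size_t decoded_bytes(size_t w, size_t h)
{
    return w * h * 4;
}

/* How many decoded images of a given size fit in a cache. */
static size_t images_that_fit(size_t cache_bytes, size_t image_bytes)
{
    return image_bytes ? cache_bytes / image_bytes : 0;
}
```

Under those assumptions a 128x128 icon decodes to 65536 bytes, and a 4mb cache holds 64 of them, whatever their compressed size on disk was.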

> 3. ... Actually, yes the GTA01 is very noticeably faster in
> graphics. ...
> Can you expose a bit more details: How much it is faster: x2 times, x3,
> x1.5, x1.2?

ask kenyoung. he said it:

"Actually, yes the GTA01 is very noticeably faster in graphics.
I've got both, and I've run 'em side-by-side.   The glamo actually
is a graphics DEcellerator.   That's why GTA02-core is kicking it out."

quoted from his mail to this list yesterday. i don't have a gta01 anymore. it
went missing at some point, but i'd believe it.

> 4. ... for every second spent uploading contents to glamo, you CANNOT
> spend it calculating a new fram. ... 
> Yes, this is bad... But qt works :)

efl works too. see above. not comparing apples vs apples.

> 5. ... that's because you have 2 processes competing for the cpu to
> render. ...
> My measurements of parallel memcpy showed that this is neglibible.

that will be totally wrong. 2 processes doing 2 copies from 2 different source
and destination addresses WILL have performance suffer - likely less than half
performance as you'd have cache flushing between context switches etc. i'd say
the benchmarking method is not right? you can't have 2 processes compete for
memory and both get all they want when it's a limited resource and there is
only 1 cpu to share around. and my statement was not just about cpu, but x and
glamo. x will be sharing requests from 2 processes at once trying to draw to
it. and... i stated IO. :)

> 6. ... you wont find routines for rendering faster in most of the
> world. ...
> I will. I can recall you previous posts on the topic.

go for it. they exist. but not in most of the world. in some corners. find a
faster alpha blender or a faster super-sampling/sub-sampling scaler or... :) (i
am excluding neon asm here as it's off topic and not on gta02, but if you had
something with neon you wouldn't have this conversation as performance would be
fine)

> 7. but.. if i were smart.. i'd not develop apps for the freerunner. it's
> a "dead product". it has no more being produced. it has no evolution
> path. there won't be a gtao3, 04, 05 etc. everyone quit or was fired/let
> go from om that worked on phones.. or worked on pretty much anything.
> your future is other devices.. and these don't suck with EFL. i'd not
> compromise the future if i were smart.
> 
> Frankly speaking you never developed for GTA02, yes? you aim seem always
> were in future, and this is ok. I am sure that for example scrolling
> area pre-rendering if good for future.

wait? never? maybe you forget. i WORKED at openmoko before gta02 was released,
during and after. i very much did develop on and for it. i was kicking glamo
around long before it was sold. there was a REASON why i punted questions to
this list like "what would you guys say if we dropped to qvga?" as there was a
replacement lcd with the same dimensions but qvga. i stopped fiddling with my
gta02 some time late last year/early this year as i got hardware that is much
better in my hands.

> 8. ... most games i know of are written to work on the highest end
> graphics cards at the time. why? ...
> Best games are written with other objectives in mind, this games are
> really interesting for anyone from time to time and for sure will live
> in ages (chess, nethack and so), our grandchilds will play nethack, be
> sure. Is it better to make pefrect things? 
> And optimization is always good - you can feel that 10ms latency and
> 100ms latency is different even both are more than enoght for UI, but
> you feel that 10ms latency is much better.

ok. talking different worlds of games here. i'm talking the ones that come out
for ps3, xbox, and the pc games you buy on a shelf - not "chess" or
"solitaire". regardless.. there's a multi-billion dollar industry for the
quakes of this world. not so much for chess or solitaire :) so maybe i didnt
explain that well - i apologize. i was thinking THESE as games, not chess
etc. :)

> 9. ... BUSINESS CHOICE ...
> Everyone here follows it's goals. Carlsen make E. Other want to do
> hardware. Others want to use free hardware. Others want to increase
> development skills and hack that HW. Others just feel fun reading this
> book. Others have this job. Someone even makeing money from OM. ;) All
> this is ok, and I see nothing bad on making some great E developer to
> think a bit about optimizations - nobody loose from optimizing of E and
> writing a bit of technical descriptions :)

trust me. optimisation is what i do. i have an xrender engine for evas. it's
complete. it does everything. why isn't it used? because my own software
rendering code has outperformed xrender year on year. i am still waiting for
xrender with its partial hardware or claimed "full hardware acceleration" to
beat the software i wrote. i have been waiting for years. i have an OpenGL and
GLES engine. i have benchmark suites that compare engines.. apples vs apples.
they do the exact same operations. the same drawing (within the limits of their
system). and yes - OpenGL on my desktop (Nvidia 8600GTS vs core2-duo 3ghz)...
is 2x the speed of software. but considering that's software... that's
not too bad. a modern high end dedicated gpu is only doing 2x the software
speed. i know something of optimising. i know something of playing tricks to
avoid work. in fact evas is avoiding work all over the place. but none of the
themes are apples vs apples. i know just where evas has performance problems,
and some of them i just chalk up to "well.. it is simply not worth my time and
effort to try as frankly.. the problem is already solved - newer systems are
fast enough where it "doesn't matter"". for some others it's more a matter of
just not pushing efl so far. if you have to sit and compare, make sure your
comparison is fair. apples vs apples.

-- 
------------- Codito, ergo sum - "I code, therefore I am" --------------
The Rasterman (Carsten Haitzler)    raster at rasterman.com



