Friday, January 21, 2011

Skype discussion

  1. parallelize the 9/7 wavelet portion of JasPer
  2. quantization: vectorization? parallelization?
  3. memory: Intel asm vectorization of the copying; currently 64 bits, up to a factor of 2 by using the full 128 bits. pin down the problems.
  4. packetization, rate/distortion control; print the numbers separated by commas
  5. modify SimpleScalar to create a 3-D memory access pattern graph
now counting accesses:
  • count loads and writes separately, with 2 sets of counters
  • print the sum of the 2 counts
  • output them as 2 data sets for graphing
  • use higher resolution: now using 80 pixels; use more pixels for both memory and time
  • plot in color on a 2-D graph, with color indicating the access density
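A minimal sketch of the two-counter scheme, assuming the simulator can be made to emit one (time, address, is_store) record per memory access; that record format and the function names are hypothetical, not SimpleScalar's own interface. Loads and stores go into separate grids, the sum is computed afterwards, and each grid is written as comma-separated rows ready for a color density plot:

```python
import csv
import io


def bin_accesses(trace, n_time_bins, n_addr_bins, addr_lo, addr_hi, t_max):
    """Bucket (time, addr, is_store) records into separate load and store grids.

    Grids are indexed [addr_bin][time_bin]; each cell counts accesses.
    """
    loads = [[0] * n_time_bins for _ in range(n_addr_bins)]
    stores = [[0] * n_time_bins for _ in range(n_addr_bins)]
    addr_span = addr_hi - addr_lo
    for t, addr, is_store in trace:
        if not (addr_lo <= addr < addr_hi) or not (0 <= t < t_max):
            continue  # ignore accesses outside the plotted window
        ti = t * n_time_bins // t_max           # time pixel
        ai = (addr - addr_lo) * n_addr_bins // addr_span  # memory pixel
        (stores if is_store else loads)[ai][ti] += 1
    return loads, stores


def sum_grids(a, b):
    """Cell-by-cell sum of two equally sized grids (loads + stores)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def to_csv(grid):
    """One grid row per line, numbers separated by commas."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(grid)
    return buf.getvalue()
```

With the current 80-pixel resolution, n_time_bins and n_addr_bins would both be 80; raising them gives the finer memory and time resolution the notes ask for.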


Other images for testing
  • memory access pattern (8k by 8k)
  • understand the behavior from the different graphs. print out the loop iterations and graph them?
take a close look at the fusion
  • Cachegrind for specific lines of the code
  • count accesses and misses in SimpleScalar from the beginning to the end of the fused loop
  • use the graphical pixel approach to zoom into the fused loop's behavior
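The region-limited counting can be prototyped outside SimpleScalar with a toy direct-mapped cache, as a sketch. The "BEGIN"/"END" markers standing for the start and end of the fused loop are hypothetical, and the cache geometry defaults are placeholders, not the configuration used in the experiments:

```python
def count_in_region(trace, n_lines=256, line_size=32):
    """Count accesses and misses of a direct-mapped cache, but only for the
    part of the trace between a "BEGIN" and an "END" marker (hypothetical
    markers around the fused loop). The cache is simulated over the whole
    trace, so its contents on entry to the region are realistic."""
    tags = [None] * n_lines          # one tag (block number) per cache line
    inside = False
    accesses = misses = 0
    for item in trace:
        if item == "BEGIN":
            inside = True
        elif item == "END":
            inside = False
        else:
            block = item // line_size    # memory block containing the address
            idx = block % n_lines        # direct-mapped: exactly one candidate line
            hit = tags[idx] == block
            tags[idx] = block            # allocate on miss (no-op on hit)
            if inside:
                accesses += 1
                if not hit:
                    misses += 1
    return accesses, misses
```

Simulating the cache before the region starts matters: otherwise every first access inside the fused loop would be counted as a cold miss.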

I/O problem: takes a large fraction of the time
  • for large images
  • I/O performance can be improved

Lab work
2 large FPGA boards
look at the clear system installation
try the equipment and make sure it's working

Wednesday, January 19, 2011

Ideas for the next steps

  1. parallelize the 9/7 portion of JasPer
  2. Intel asm vectorization
  3. modify SimpleScalar to create a 3-D memory access pattern graph

Thursday, January 13, 2011

Skype call discussion

lossy compression: rate vs. time, increase the rate??
comparison with the original image? [confirmed]


isolate the quantization from the previous paper.

Fig. 4
a) drop the comment, replace it with '...'

Fig. 5
a)

Index Terms...
IEEE standard index terms... 3-4
ieee.ca -> next CCECE -> author kit...

Wednesday, January 12, 2011

Skype discussion for the paper

Section 3

  1. slightly longer
  2. look at the cache profile to shorten
  3. scale the memory access graph down... [the vertical axis should be smaller]
  4. look at the source code for qmfb.c: colgrp is defined as a constant, which is 16. follow up on whether 1.9 does this.
  5. does it work on entire columns, but 16 columns at a time?? or just on some columns, not all?
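A sketch of the column-group idea, assuming the constant of 16 is a group width: a vertical filter (a simple difference here, standing in for a real lifting step) is applied either one full column at a time or one group of 16 columns at a time. Both orders give the same result; the grouped order touches 16 adjacent row-major elements per row visit, which is the cache-related property worth checking in qmfb.c. The function names are hypothetical:

```python
def vertical_diff_full(img):
    """Reference order: process each entire column from top to bottom."""
    rows, cols = len(img), len(img[0])
    out = [row[:] for row in img]
    for c in range(cols):
        for r in range(1, rows):
            out[r][c] = img[r][c] - img[r - 1][c]
    return out


def vertical_diff_colgroups(img, group=16):
    """Same filter, but sweep down the image one group of `group` columns at
    a time, so each row visit touches `group` adjacent elements (contiguous
    in row-major memory) instead of a single element."""
    rows, cols = len(img), len(img[0])
    out = [row[:] for row in img]
    for c0 in range(0, cols, group):
        c1 = min(c0 + group, cols)   # the last group may be narrower
        for r in range(1, rows):
            for c in range(c0, c1):
                out[r][c] = img[r][c] - img[r - 1][c]
    return out
```

Because the filter reads only the input image, the traversal order changes the memory access pattern but not the output.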

Section 5

  1. which image is used in the discussion
  2. Figure 7 -> the graph should be smaller, the font size larger.
  3. rename the color image -> PCB _original _half _tiled
  4. Figure 5 for the presentation, remove ...
  5. move the open bracket.
  6. do the same thing for Figure 4
  7. retrieve pointer value... -> set
  8. figure changes for Fig. 2.
  9. Table 1 should be changed.
  10. image thumbnails in the paper??
  11. table captions go on top
  12. heading for Section 4: drop the word "initial"
  13. Section 4 heading: keep it as is...
  14. Title: drop the word "execution"


TOMORROW
** create a .zip file containing everything
phone number...

Friday, January 7, 2011

Skype call discussion

Section 2
  • they used 9/7
  • we chose to use 5/3, but we also tried 9/7
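For reference, a sketch of the reversible 5/3 lifting steps as defined in JPEG 2000 Part 1 (predict, then update, with whole-sample symmetric extension at the edges), written for a 1-D signal. This is not JasPer's code: it keeps the output interleaved (even positions low-pass, odd positions high-pass) instead of splitting the bands, and the 9/7 filter, which uses floating-point lifting coefficients, is not shown:

```python
def _mirror(i, n):
    """Map index i into 0..n-1 by whole-sample symmetric extension
    (... x[2] x[1] | x[0] ... x[n-1] | x[n-2] x[n-3] ...). Needs n >= 2."""
    period = 2 * (n - 1)
    i %= period
    return i if i < n else period - i


def fwd53(x):
    """Forward reversible 5/3 lifting; returns interleaved coefficients."""
    n = len(x)
    if n < 2:
        return list(x)  # a single sample is passed through unchanged
    y = list(x)
    # Predict: odd samples become high-pass (detail) coefficients.
    for i in range(1, n, 2):
        y[i] = x[i] - (x[_mirror(i - 1, n)] + x[_mirror(i + 1, n)]) // 2
    # Update: even samples become low-pass coefficients, using the new odd ones.
    for i in range(0, n, 2):
        y[i] = x[i] + (y[_mirror(i - 1, n)] + y[_mirror(i + 1, n)] + 2) // 4
    return y


def inv53(y):
    """Inverse of fwd53: undo the update step, then the predict step."""
    n = len(y)
    if n < 2:
        return list(y)
    x = list(y)
    for i in range(0, n, 2):
        x[i] = y[i] - (y[_mirror(i - 1, n)] + y[_mirror(i + 1, n)] + 2) // 4
    for i in range(1, n, 2):
        x[i] = y[i] + (x[_mirror(i - 1, n)] + x[_mirror(i + 1, n)]) // 2
    return x
```

Because every step uses integer floor division, inv53(fwd53(x)) returns x exactly, which is what makes 5/3 usable for lossless coding.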


Section 3
  • cache behavior: instead, start in general terms, show the Cachegrind results and the SimpleScalar results... show the figure (2-D memory graph). memory is not touched frequently; data accesses are concentrated in certain areas.
  • Cachegrind L2: 7 million misses, 0.34%; SimpleScalar direct-mapped: 2% misses (L1 = L2, worst case), best = 0.3%; RISC vs. CISC
  • later, generate the access count for each memory region that has been accessed, to produce a 3-D bar graph; numbers separated by commas.
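If the 0.34% in the Cachegrind line is the L2 miss count divided by all references Cachegrind counted (an assumption; check which denominator the summary line uses), the two numbers together give the rough size of the reference stream:

```latex
N_{\text{refs}} \approx \frac{N_{\text{miss}}}{r_{\text{miss}}}
  = \frac{7 \times 10^{6}}{0.0034}
  \approx 2.06 \times 10^{9}
```

The SimpleScalar runs use a different (RISC) instruction set, so their reference counts, and hence their miss percentages, are not directly comparable to this figure.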


Section 4 is appropriate.

The cluster used is not the latest and greatest with DDR3 memory; make this comment somewhere. we already get a relatively low cache miss rate.

Section 5 [me, adding material]
  • written comments.
  • confirming the number of instruction [done]
  • at the end of Section 5.2, add a brief discussion, along with a figure showing the various speedups, in one paragraph. /cats/color/color2/color4/galaxy/galaxy_4(4k by 4k)/
  • 9/7: rely on the profile; comment on the previous work [Section 3.5 ??]

Experiment
  • galaxy 4k*4k (and 8k*8k) / color (longest execution time): native run with 9/7, timed (30% improvement)
  • profile the galaxy 4k*4k native run

Send
  • section 5 - pdf and latex
  • section 3 - pdf and latex

Wednesday, January 5, 2011

Skype call discussion

1. update the caption for the fused loops
  • caption (a) original loop with pointer index variable.
  • caption (b) original loop with integer index variable.
  • add more detail to the loop body
  • add some explanation in the text of the paper
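A sketch of the two loop forms named in the captions, plus a fused version, using a Python list as a stand-in for memory. The loop bodies and function names are hypothetical placeholders, not the actual JasPer loops: the pointer-style loop advances an offset by the stride each iteration, the integer-index loop recomputes base + i * stride, and the fused loop merges two passes over the same elements into one:

```python
def scale_pointer_style(buf, base, count, stride, k):
    """Mimics `for (p = base; p < end; p += stride) *p *= k;` in C:
    the 'pointer' p is an offset advanced by stride each iteration."""
    p = base
    end = base + count * stride
    while p < end:
        buf[p] *= k
        p += stride


def scale_index_style(buf, base, count, stride, k):
    """Same loop with an integer index i; the address is recomputed
    from i on every iteration."""
    for i in range(count):
        buf[base + i * stride] *= k


def two_loops(buf, base, count, stride):
    """Unfused: two separate passes over the same elements."""
    for i in range(count):            # pass 1
        buf[base + i * stride] += 1
    for i in range(count):            # pass 2
        buf[base + i * stride] *= 2


def fused_loop(buf, base, count, stride):
    """Fused: one pass, each element read and written once."""
    for i in range(count):
        j = base + i * stride
        buf[j] = (buf[j] + 1) * 2
```

In the fused form each element is read and written once instead of twice, which is the kind of difference the Cachegrind comparison of the original and fused versions would show.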

2. Experiments
  1. confirm that changing pointer -> integer does not increase the execution time.... can compare the Cachegrind results (original vs. fused loop version)
  2. native hardware run (pointer-based loop (original) and integer-based loop only)
  3. try the 7/9 configuration, and run image encoding for 7/9 and 3/5
  • single processor, natively: see if there is any significant difference
  • gprof for 7/9 to see the difference in the cblk part and the DWT portion