Thursday, May 26, 2011

Skype Discussion

Cluster execution across the network.
The heap grows into the "mountain" shape; under the mountain there are only reads.
  • Confirm the read-only activity under the growing heap area (needed for parallel execution).
  • Create a load/store graph for the parallel execution.


Have each thread allocate its own memory range.
Parallelize the "bumps" at the end, then parallelize across the network.
Single writes: each location is written only once, and the writes are independent.
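
A minimal OpenMP sketch of the allocation idea above (hypothetical buffer and sizes, not JasPer's actual code): each thread allocates its own memory range and every output location receives exactly one, independent write. Compile with an OpenMP-enabled compiler (e.g. -fopenmp).

#include <stdio.h>
#include <stdlib.h>

#define N 1024

int main(void)
{
    double out[N];

    #pragma omp parallel
    {
        /* Each thread allocates its own memory range; nothing is shared. */
        double *scratch = malloc(N * sizeof(double));

        /* Partitioned loop: every output location is written exactly once,
           by exactly one thread (independent, single writes). */
        #pragma omp for
        for (int i = 0; i < N; i++) {
            scratch[i] = i * 0.5;    /* thread-private work */
            out[i] = scratch[i];     /* the single write for this location */
        }

        free(scratch);
    }

    printf("%f\n", out[N - 1]);
    return 0;
}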

  • Priority: the thesis chapter (thesis writing).

Saturday, May 7, 2011

Discussion

SimpleScalar 3-D memory tracking:
  • Test other images and create graphs.
  • Track memory for the parallel version of JasPer.
  • Track a particular interval of execution (e.g. zoom into the graph; see the sketch below).
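
A standalone sketch of the interval "zoom" (illustrative only; the real change would go into SimpleScalar's memory-access path, and the instruction counter there has its own name): accesses are only recorded when the current instruction count falls inside a chosen window.

#include <stdint.h>
#include <stdio.h>

/* Interval of interest, in committed instructions (illustrative values). */
#define WIN_LO 1000000ULL
#define WIN_HI 2000000ULL

static uint64_t tracked;   /* accesses recorded inside the window */
static uint64_t skipped;   /* accesses ignored outside it */

/* Called for every memory access with the current instruction count. */
static void maybe_track(uint64_t icount)
{
    if (icount >= WIN_LO && icount < WIN_HI)
        tracked++;          /* this is where the binning for the graph would go */
    else
        skipped++;
}

int main(void)
{
    for (uint64_t i = 0; i < 3000000; i++)
        maybe_track(i);
    printf("tracked=%llu skipped=%llu\n",
           (unsigned long long)tracked, (unsigned long long)skipped);
    return 0;
}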

Encode the large images using more than two nodes in the cluster and see whether there are any speedup gains.

Friday, April 22, 2011

Skype Discussion

Research:
  1. Revise the presentation.
  2. Cluster execution for the loops in their original form and parallelized individually (with/without fusion; any differences? see the sketch after this list).
  3. 3-D memory chart (report the read/write/total graphs separately).
  4. Vectorization: the paper that vectorized JasPer on a single processor (2005).
  5. Look at the thesis format.
  6. Profile the largest image with Cachegrind to check whether the cache miss rate is in line with the other images.
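
For item 2, a minimal sketch (hypothetical loops, not JasPer's actual code) of what is being compared: the same work written as two separate loops, each parallelizable on its own, and as one fused loop.

#include <stddef.h>

/* Original form: two separate loops over the same range.
   Each can be parallelized individually (e.g. an OpenMP
   "parallel for" on each loop). */
void separate(float *a, float *b, const float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = c[i] * 2.0f;

    for (size_t i = 0; i < n; i++)
        b[i] = a[i] + 1.0f;
}

/* Fused form: one loop, one pass over the data.  The question is
   whether fusing changes the cache behavior or the parallel speedup
   compared with parallelizing the two loops separately. */
void fused(float *a, float *b, const float *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = c[i] * 2.0f;
        b[i] = a[i] + 1.0f;
    }
}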

Lab work
  1. move tables/computers
  2. put FPGA in the box

Friday, April 15, 2011

Skype discussion

Conference:

1. Presentation: 15 minutes and 15 slides (including everything: title, outline, overview, technical content 3-4, results 2-3, conclusion 1).

2. Draft the presentation for the CCECE conference.

3. Choose the next machine to put into the cluster.



Research:

Small number of iterations --> overhead dominates.
Small granularity --> less work per processor (see the sketch below).

Maximum chart: use a linear scale and show all of the maxima together in one graph.
Iteration graphs for the two largest images.
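
A small sketch of the granularity point above, using OpenMP's if clause (the threshold is illustrative and would be tuned by measurement): with few iterations there is too little work per processor to cover the parallel overhead, so the loop stays serial.

#define PAR_THRESHOLD 4096   /* illustrative cutoff; tune by measurement */

/* When the iteration count is small, the fork/join overhead dominates
   (little work per processor), so the loop stays serial below the cutoff. */
void scale(float *x, int n, float k)
{
    #pragma omp parallel for if(n > PAR_THRESHOLD)
    for (int i = 0; i < n; i++)
        x[i] *= k;
}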

Wednesday, March 30, 2011

Skype Discussion

Conference:
1. presentation/poster
2. Xilinx hardware


Research:
1. Memory chart (for one of the PCB images).
2. Experiment: simulation for the fusion; the loops in their original form and parallelized individually (with/without fusion; any differences?).
3. Loop graph: number of iterations, pattern graph for another image (cats image; the black regions show lots of zeros).
4. 3-D memory chart (report the read/write/total graphs separately).
5. Vectorization: the paper that vectorized JasPer on a single processor (2005).

Monday, March 21, 2011

Latex related...

Generate the paper in 8.5x11 Letter size

dvips -t letter -o output.ps input.dvi
ps2pdf output.ps


PS F:\EclipseCode\E36_WS_Latex\latex_research_cal_ccece_paper> dvips -t letter -o .\research_cal_ccece_paper_lettersize.ps .\research_cal_ccece_paper.dvi

PS F:\EclipseCode\E36_WS_Latex\latex_research_cal_ccece_paper> ps2pdf .\research_cal_ccece_paper_lettersize.ps

Friday, February 18, 2011

Skype Discussion


Lab work next week: memory and hard disks.
Set up the computer and connect it to the network.
Check the type of memory.
Decide which system would be the right target to work with next.

Cachegrind for the original code:
  • L1 miss sums for line 219 and line 221; make a table showing the number of misses and the percentage.
  • Step into the jpc_enc_enccblk() function.
  • Total L1 and total L2 misses for each line in each of the two loops, summing up the misses in called functions.

Loop counting:
  • Repeat for the other images.
  • Create a bar graph of the loop counts (see the sketch below).
  • Needs more detail.
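
A sketch of the loop-counting idea (hypothetical instrumentation, not the actual encoder code): record the trip count of the instrumented loop each time it runs, then dump "iterations,occurrences" pairs that can be turned into the bar graph.

#include <stdio.h>

#define MAX_BOUND 256

static unsigned long bound_hist[MAX_BOUND + 1];

/* Call once per execution of the instrumented loop with its trip count. */
static void record_loop_bound(int iters)
{
    if (iters < 0) iters = 0;
    if (iters > MAX_BOUND) iters = MAX_BOUND;
    bound_hist[iters]++;
}

/* Dump "iterations,occurrences" pairs for the bar graph. */
static void dump_loop_bounds(FILE *out)
{
    for (int i = 0; i <= MAX_BOUND; i++)
        if (bound_hist[i])
            fprintf(out, "%d,%lu\n", i, bound_hist[i]);
}

int main(void)
{
    /* Fake data standing in for the instrumented encoder loop. */
    for (int run = 0; run < 1000; run++)
        record_loop_bound(run % 64);
    dump_loop_bounds(stdout);
    return 0;
}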

Monday, February 7, 2011

Skype Discussion


Quantization
  • Cache simulation to see the behavior: is the miss rate high?
  • Frequent data modification may impact the performance.
  • What has been called before, during, and after this routine; cached data (later).
  • Try simulating it in Cachegrind to check the misses for each line of each source file.

Memory access pattern
  • Similar to the cats image (pcb_large): the memory behaves very similarly and shows a relatively low cache miss rate.
  • sim_num_cycles: print the number of wrap-arounds.

Work
  • Use Cachegrind to analyze the fusion: check the cache misses before and after fusion (the line-inspection feature in Cachegrind).
  • Loop bound graph in cblk for the other color images.
  • 3-D plot in mpfast: separate the read and write counts and show the intensity.
  • Vectorization for quantization? On a 32-bit quad-core processor (see the sketch below).
  • Visualize the cache behavior.
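
A rough SSE2 sketch of what vectorizing the quantization could look like, assuming plain 32-bit integer coefficients and a power-of-two step size; JasPer's real quantizer works on sign-magnitude data with arbitrary step sizes, so this is only an illustration of the 128-bit idea.

#include <emmintrin.h>   /* SSE2 */

#define QSHIFT 3   /* illustrative power-of-two step size: 2^3 = 8 */

/* Quantize n coefficients by right-shifting, four at a time.
   n is assumed to be a multiple of 4 for brevity; note the arithmetic
   shift rounds toward minus infinity, unlike integer division. */
void quantize_pow2(int *coef, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128i v = _mm_loadu_si128((__m128i *)&coef[i]);
        v = _mm_srai_epi32(v, QSHIFT);
        _mm_storeu_si128((__m128i *)&coef[i], v);
    }
}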

Friday, January 21, 2011

Skype discussion

  1. Parallelize the JasPer 9/7 portion.
  2. Quantization: vectorization? parallelization?
  3. Memory: Intel asm vectorization for the copying; currently 64 bits, up to a factor of 2 by using 128 bits. Pin down the problems.
  4. Packetization, rate/distortion control: print the numbers separated by commas.
  5. Modify SimpleScalar to create the 3-D memory access pattern graph.
Currently counting accesses (see the sketch after this list):
  • Count the loads and writes separately: 2 sets of counters.
  • Print the sum of the 2 numbers.
  • Spill the result out as 2 sets of data for graphing.
  • Use more resolution; currently 80 pixels; more pixels for memory and for time.
  • Plot color on 2D; the color identifies the density.
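
A standalone sketch of the counting scheme for item 5 (illustrative; the real version hooks into SimpleScalar's memory-access path): two sets of counters, loads and stores, binned over an 80 x 80 address-by-time grid and spilled out as comma-separated data, with the sum of the two counts giving the density to plot.

#include <stdint.h>
#include <stdio.h>

#define TIME_BINS 80    /* resolution along the time axis (the 80 "pixels") */
#define ADDR_BINS 80    /* resolution along the address axis */

static uint64_t loads[TIME_BINS][ADDR_BINS];
static uint64_t stores[TIME_BINS][ADDR_BINS];

/* Record one access; is_store selects which counter set is bumped. */
static void count_access(uint64_t t, uint64_t t_max,
                         uint64_t addr, uint64_t a_base, uint64_t a_span,
                         int is_store)
{
    uint64_t tb = t * TIME_BINS / t_max;
    uint64_t ab = (addr - a_base) * ADDR_BINS / a_span;
    if (tb >= TIME_BINS || ab >= ADDR_BINS)
        return;
    if (is_store)
        stores[tb][ab]++;
    else
        loads[tb][ab]++;
}

/* Spill the two data sets (and their sum) as CSV for the density plot. */
static void dump(FILE *out)
{
    for (int t = 0; t < TIME_BINS; t++)
        for (int a = 0; a < ADDR_BINS; a++)
            fprintf(out, "%d,%d,%llu,%llu,%llu\n", t, a,
                    (unsigned long long)loads[t][a],
                    (unsigned long long)stores[t][a],
                    (unsigned long long)(loads[t][a] + stores[t][a]));
}

int main(void)
{
    /* Fake trace standing in for the simulator's access stream. */
    for (uint64_t i = 0; i < 100000; i++)
        count_access(i, 100000, 0x1000 + (i % 640) * 8, 0x1000, 640 * 8, i & 1);
    dump(stdout);
    return 0;
}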


Other images for testing
  • Memory access pattern (8k by 8k).
  • Understand the behavior based on the different graphs; print out the loop iterations and graph them.
Take a close look at the fusion
  • Cachegrind for the specific lines of code.
  • Count accesses and misses in SimpleScalar from the beginning to the end of the fused loop.
  • Use the graphical pixel approach to zoom into the fused loop's behavior.

I/O problem: a large fraction of the time
  • For the large images.
  • Can we get better I/O performance?

Lab work
2 large FPGA boards
Look at a clear system installation
Try the equipment and make sure it is working

Wednesday, January 19, 2011

Ideas for the next-step work

  1. Parallelize the JasPer 9/7 portion.
  2. Intel asm vectorization.
  3. Modify SimpleScalar to create the 3-D memory access pattern graph.

Thursday, January 13, 2011

Skype call discussion

Lossy compression rate/time: increase the rate?
Comparison with the original image? [confirmed]


Isolate the quantization from the previous paper.

Fig. 4
a) Drop the comment; replace it with '...'.

Fig. 5
a)

Index terms: use the IEEE standard index terms, 3-4 of them.
ieee.ca -> next CCECE -> author kit...

Wednesday, January 12, 2011

Skype discussion for the paper

Section 3

  1. Slightly longer.
  2. Look at the cache profile to shorten it.
  3. Scale the memory access graph down (the vertical dimension should be smaller).
  4. Look at the source code for qmfb.c: colgrp is defined as a constant, which is 16; follow up on whether version 1.9 does this.
  5. Is it working on the entire column but 16 columns at a time, or does it only work on some columns, not all? (See the sketch after this list.)
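
A simplified sketch of the column-group idea behind items 4-5 (illustrative, not the actual qmfb.c routine): walk the tile in vertical strips of up to 16 columns so a strip stays cache-resident while every row is visited, instead of processing one whole column, or all columns, at a time.

#define COLGRPSIZE 16   /* mirrors the constant discussed for qmfb.c */

/* Process a height x width tile in vertical strips of up to 16 columns,
   so each strip of columns stays cache-resident while every row is visited. */
void process_columns(int *buf, int width, int height, int stride)
{
    for (int j = 0; j < width; j += COLGRPSIZE) {
        int grp = (width - j < COLGRPSIZE) ? (width - j) : COLGRPSIZE;
        for (int i = 0; i < height; i++)
            for (int k = 0; k < grp; k++)
                buf[i * stride + j + k] += 1;   /* placeholder for the lifting step */
    }
}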

Section 5

  1. Which image is used in the discussion?
  2. Figure 7: the graph should be smaller and the font size larger.
  3. Rename the color image -> PCB _original _half _tiled.
  4. Figure 5 for the presentation: remove ...
  5. Move the open bracket.
  6. Do the same thing for figure 4.
  7. "retrieve pointer value..." -> "set".
  8. Figure changes for fig. 2.
  9. Table 1 should be changed.
  10. Image thumbnails in the paper?
  11. The table caption goes on top.
  12. Heading for section 4: drop the word "initial".
  13. Section 4 heading: keep it as is.
  14. Title: drop the word "execution".


TOMORROW
  • Create a .zip file containing everything.
  • Phone number...

Friday, January 7, 2011

Skype call discussion

Section 2
  • They used 9/7.
  • We chose to use 5/3, but we tried 9/7.


Section 3
  • Cache behavior: instead, start in general; show the Cachegrind results and the SimpleScalar results, and show the figure (2-D memory graph). Most data is not touched frequently; the accesses are concentrated in certain areas.
  • Cachegrind: 7 million L2 misses, 0.34%; SimpleScalar direct-mapped misses 2% (L1 = L2, worst case), best = 0.3%; RISC vs. CISC.
  • Later, generate the numbers for the memory regions that have been accessed to produce a 3-D bar graph; numbers separated by commas.


Section 4 is appropriate.

The cluster used is not the latest and greatest with DDR3 memory; make a comment about this somewhere. We already get a relatively low cache miss rate.

Section 5 [me, adding material]
  • Written comments.
  • Confirm the number of instructions [done].
  • At the end of section 5.2, add a brief discussion, along with a figure to show the various speedups, in one paragraph: cats / color / color2 / color4 / galaxy / galaxy_4 (4k by 4k).
  • 9/7 relies on the profile; comment on the previous work [section 3.5?].

Experiment
  • galaxy 4k x 4k (and 8k x 8k) / color (longest execution time): native runs with 9/7, timed (30% improvement).
  • Profile for the galaxy 4k x 4k native run.

Send
  • section 5 - pdf and latex
  • section 3 - pdf and latex

Wednesday, January 5, 2011

Skype call discussion

1. Update the captions for the fused loops
  • Caption (a): original loop with a pointer index variable.
  • Caption (b): original loop with an integer index variable.
  • Add more detail in the loop body.
  • Add some explanation in the text of the paper.

2. Experiments
  1. Confirm that the pointer -> integer change does not increase the execution time; compare the Cachegrind results (original vs. fused loop version).
  2. Native hardware run (pointer-based loop (original) and integer-based loop only); see the sketch below.
  3. Try the 9/7 configuration and run the image encoding for both 9/7 and 5/3.
  • On a single processor, natively; see whether there is any significant difference.
  • gprof for 9/7 to see the difference in the cblk part and the dwt portion.
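
For item 2, a minimal sketch (hypothetical loop body, not the actual JasPer code) of the two forms being compared: the original pointer-index loop and the integer-index rewrite, which compute the same result and, per item 1, should show roughly the same execution time and Cachegrind profile.

#include <stddef.h>

/* (a) Original form: pointer index variable. */
void scale_ptr(int *data, size_t n, int k)
{
    for (int *p = data; p < data + n; p++)
        *p *= k;
}

/* (b) Rewritten form: integer index variable (easier to parallelize,
   e.g. with an OpenMP "parallel for" over i). */
void scale_idx(int *data, size_t n, int k)
{
    for (size_t i = 0; i < n; i++)
        data[i] *= k;
}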