Thursday, May 26, 2011

Skype Discussion

Cluster execution across the network.
The heap grows into the "mountain" shape; under the mountain there are only reads.
  • Confirm the read-only activity under the growing heap area (needed for parallel execution).
  • Create a load/store graph for the parallel execution.


Have each thread allocate its own memory range.
Parallelize the "bumps" at the end, then parallelize across the network.
Single writes: each location is written only once, and the writes are independent.
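
A minimal OpenMP sketch of the allocation idea above (hypothetical buffer and sizes, not JasPer's actual code): each thread allocates its own memory range and every output location receives exactly one, independent write. Compile with an OpenMP-enabled compiler (e.g. -fopenmp).

#include <stdio.h>
#include <stdlib.h>

#define N 1024

int main(void)
{
    double out[N];

    #pragma omp parallel
    {
        /* Each thread allocates its own memory range; nothing is shared. */
        double *scratch = malloc(N * sizeof(double));

        /* Partitioned loop: every output location is written exactly once,
           by exactly one thread (independent, single writes). */
        #pragma omp for
        for (int i = 0; i < N; i++) {
            scratch[i] = i * 0.5;    /* thread-private work */
            out[i] = scratch[i];     /* the single write for this location */
        }

        free(scratch);
    }

    printf("%f\n", out[N - 1]);
    return 0;
}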

  • Priority: the thesis chapter (thesis writing).

Saturday, May 7, 2011

Discussion

SimpleScalar 3-D memory tracking:
  • Test other images and create graphs.
  • Track memory for the parallel version of JasPer.
  • Track a particular interval of execution (e.g. zoom into the graph; see the sketch below).
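
A standalone sketch of the interval "zoom" (illustrative only; the real change would go into SimpleScalar's memory-access path, and the instruction counter there has its own name): accesses are only recorded when the current instruction count falls inside a chosen window.

#include <stdint.h>
#include <stdio.h>

/* Interval of interest, in committed instructions (illustrative values). */
#define WIN_LO 1000000ULL
#define WIN_HI 2000000ULL

static uint64_t tracked;   /* accesses recorded inside the window */
static uint64_t skipped;   /* accesses ignored outside it */

/* Called for every memory access with the current instruction count. */
static void maybe_track(uint64_t icount)
{
    if (icount >= WIN_LO && icount < WIN_HI)
        tracked++;          /* this is where the binning for the graph would go */
    else
        skipped++;
}

int main(void)
{
    for (uint64_t i = 0; i < 3000000; i++)
        maybe_track(i);
    printf("tracked=%llu skipped=%llu\n",
           (unsigned long long)tracked, (unsigned long long)skipped);
    return 0;
}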

Encode the large images using more than two nodes in the cluster and see whether there are any speedup gains.

Friday, April 22, 2011

Skype Discussion

Research:
  1. Revise the presentation.
  2. Cluster execution for the loops in their original form and parallelized individually (with/without fusion; any differences? see the sketch after this list).
  3. 3-D memory chart (report the read/write/total graphs separately).
  4. Vectorization: the paper that vectorized JasPer on a single processor (2005).
  5. Look at the thesis format.
  6. Profile the largest image with Cachegrind to check whether the cache miss rate is in line with the other images.
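
For item 2, a minimal sketch (hypothetical loops, not JasPer's actual code) of what is being compared: the same work written as two separate loops, each parallelizable on its own, and as one fused loop.

#include <stddef.h>

/* Original form: two separate loops over the same range.
   Each can be parallelized individually (e.g. an OpenMP
   "parallel for" on each loop). */
void separate(float *a, float *b, const float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = c[i] * 2.0f;

    for (size_t i = 0; i < n; i++)
        b[i] = a[i] + 1.0f;
}

/* Fused form: one loop, one pass over the data.  The question is
   whether fusing changes the cache behavior or the parallel speedup
   compared with parallelizing the two loops separately. */
void fused(float *a, float *b, const float *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = c[i] * 2.0f;
        b[i] = a[i] + 1.0f;
    }
}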

Lab work
  1. move tables/computers
  2. put FPGA in the box

Friday, April 15, 2011

Skype discussion

Conference:

1. Presentation: 15 minutes and 15 slides (including everything: title, outline, overview, technical content 3-4, results 2-3, conclusion 1).

2. Draft the presentation for the CCECE conference.

3. Choose the next machine to put into the cluster.



Research:

Small number of iterations --> overhead dominates.
Small granularity --> less work per processor (see the sketch below).

Maximum chart: use a linear scale and show all of the maxima together in one graph.
Iteration graphs for the two largest images.
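
A small sketch of the granularity point above, using OpenMP's if clause (the threshold is illustrative and would be tuned by measurement): with few iterations there is too little work per processor to cover the parallel overhead, so the loop stays serial.

#define PAR_THRESHOLD 4096   /* illustrative cutoff; tune by measurement */

/* When the iteration count is small, the fork/join overhead dominates
   (little work per processor), so the loop stays serial below the cutoff. */
void scale(float *x, int n, float k)
{
    #pragma omp parallel for if(n > PAR_THRESHOLD)
    for (int i = 0; i < n; i++)
        x[i] *= k;
}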

Wednesday, March 30, 2011

Skype Discussion

Conference:
1. presentation/poster
2. Xilinx hardware


Research:
1. Memory chart (for one of the PCB images).
2. Experiment: simulation for the fusion; the loops in their original form and parallelized individually (with/without fusion; any differences?).
3. Loop graph: number of iterations, pattern graph for another image (cats image; the black regions show lots of zeros).
4. 3-D memory chart (report the read/write/total graphs separately).
5. Vectorization: the paper that vectorized JasPer on a single processor (2005).

Monday, March 21, 2011

Latex related...

Generate the paper in 8.5x11 Letter size

dvips -t letter -o output.ps input.dvi
ps2pdf output.ps


PS F:\EclipseCode\E36_WS_Latex\latex_research_cal_ccece_paper> dvips -t letter -o .\research_cal_ccece_paper_lettersize.ps .\research_cal_ccece_paper.dvi

PS F:\EclipseCode\E36_WS_Latex\latex_research_cal_ccece_paper> ps2pdf .\research_cal_ccece_paper_lettersize.ps

Friday, February 18, 2011

Skype Discussion


Lab work next week: memory and hard disks.
Set up the computer and connect it to the network.
Check the type of memory.
Decide which system would be the right target to work with next.

Cachegrind for the original code:
  • L1 miss sums for line 219 and line 221; make a table showing the number of misses and the percentage.
  • Step into the jpc_enc_enccblk() function.
  • Total L1 and total L2 misses for each line in each of the two loops, summing up the misses in called functions.

Loop counting:
  • Repeat for the other images.
  • Create a bar graph of the loop counts (see the sketch below).
  • Needs more detail.
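
A sketch of the loop-counting idea (hypothetical instrumentation, not the actual encoder code): record the trip count of the instrumented loop each time it runs, then dump "iterations,occurrences" pairs that can be turned into the bar graph.

#include <stdio.h>

#define MAX_BOUND 256

static unsigned long bound_hist[MAX_BOUND + 1];

/* Call once per execution of the instrumented loop with its trip count. */
static void record_loop_bound(int iters)
{
    if (iters < 0) iters = 0;
    if (iters > MAX_BOUND) iters = MAX_BOUND;
    bound_hist[iters]++;
}

/* Dump "iterations,occurrences" pairs for the bar graph. */
static void dump_loop_bounds(FILE *out)
{
    for (int i = 0; i <= MAX_BOUND; i++)
        if (bound_hist[i])
            fprintf(out, "%d,%lu\n", i, bound_hist[i]);
}

int main(void)
{
    /* Fake data standing in for the instrumented encoder loop. */
    for (int run = 0; run < 1000; run++)
        record_loop_bound(run % 64);
    dump_loop_bounds(stdout);
    return 0;
}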

Monday, February 7, 2011

Skype Discussion


Quantization
  • Cache simulation to see the behavior: is the miss rate high?
  • Frequent data modification may impact the performance.
  • What has been called before, during, and after this routine; cached data (later).
  • Try simulating it in Cachegrind to check the misses for each line of each source file.

Memory access pattern
  • Similar to the cats image (pcb_large): the memory behaves very similarly and shows a relatively low cache miss rate.
  • sim_num_cycles: print the number of wrap-arounds.

Work
  • Use Cachegrind to analyze the fusion: check the cache misses before and after fusion (the line-inspection feature in Cachegrind).
  • Loop bound graph in cblk for the other color images.
  • 3-D plot in mpfast: separate the read and write counts and show the intensity.
  • Vectorization for quantization? On a 32-bit quad-core processor (see the sketch below).
  • Visualize the cache behavior.
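
A rough SSE2 sketch of what vectorizing the quantization could look like, assuming plain 32-bit integer coefficients and a power-of-two step size; JasPer's real quantizer works on sign-magnitude data with arbitrary step sizes, so this is only an illustration of the 128-bit idea.

#include <emmintrin.h>   /* SSE2 */

#define QSHIFT 3   /* illustrative power-of-two step size: 2^3 = 8 */

/* Quantize n coefficients by right-shifting, four at a time.
   n is assumed to be a multiple of 4 for brevity; note the arithmetic
   shift rounds toward minus infinity, unlike integer division. */
void quantize_pow2(int *coef, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128i v = _mm_loadu_si128((__m128i *)&coef[i]);
        v = _mm_srai_epi32(v, QSHIFT);
        _mm_storeu_si128((__m128i *)&coef[i], v);
    }
}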

Friday, January 21, 2011

Skype discussion

  1. Parallelize the JasPer 9/7 portion.
  2. Quantization: vectorization? parallelization?
  3. Memory: Intel asm vectorization for the copying; currently 64 bits, up to a factor of 2 by using 128 bits. Pin down the problems.
  4. Packetization, rate/distortion control: print the numbers separated by commas.
  5. Modify SimpleScalar to create the 3-D memory access pattern graph.
Currently counting accesses (see the sketch after this list):
  • Count the loads and writes separately: 2 sets of counters.
  • Print the sum of the 2 numbers.
  • Spill the result out as 2 sets of data for graphing.
  • Use more resolution; currently 80 pixels; more pixels for memory and for time.
  • Plot color on 2D; the color identifies the density.
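
A standalone sketch of the counting scheme for item 5 (illustrative; the real version hooks into SimpleScalar's memory-access path): two sets of counters, loads and stores, binned over an 80 x 80 address-by-time grid and spilled out as comma-separated data, with the sum of the two counts giving the density to plot.

#include <stdint.h>
#include <stdio.h>

#define TIME_BINS 80    /* resolution along the time axis (the 80 "pixels") */
#define ADDR_BINS 80    /* resolution along the address axis */

static uint64_t loads[TIME_BINS][ADDR_BINS];
static uint64_t stores[TIME_BINS][ADDR_BINS];

/* Record one access; is_store selects which counter set is bumped. */
static void count_access(uint64_t t, uint64_t t_max,
                         uint64_t addr, uint64_t a_base, uint64_t a_span,
                         int is_store)
{
    uint64_t tb = t * TIME_BINS / t_max;
    uint64_t ab = (addr - a_base) * ADDR_BINS / a_span;
    if (tb >= TIME_BINS || ab >= ADDR_BINS)
        return;
    if (is_store)
        stores[tb][ab]++;
    else
        loads[tb][ab]++;
}

/* Spill the two data sets (and their sum) as CSV for the density plot. */
static void dump(FILE *out)
{
    for (int t = 0; t < TIME_BINS; t++)
        for (int a = 0; a < ADDR_BINS; a++)
            fprintf(out, "%d,%d,%llu,%llu,%llu\n", t, a,
                    (unsigned long long)loads[t][a],
                    (unsigned long long)stores[t][a],
                    (unsigned long long)(loads[t][a] + stores[t][a]));
}

int main(void)
{
    /* Fake trace standing in for the simulator's access stream. */
    for (uint64_t i = 0; i < 100000; i++)
        count_access(i, 100000, 0x1000 + (i % 640) * 8, 0x1000, 640 * 8, i & 1);
    dump(stdout);
    return 0;
}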


Other images for testing
  • Memory access pattern (8k by 8k).
  • Understand the behavior based on the different graphs; print out the loop iterations and graph them.
Take a close look at the fusion
  • Cachegrind for the specific lines of code.
  • Count accesses and misses in SimpleScalar from the beginning to the end of the fused loop.
  • Use the graphical pixel approach to zoom into the fused loop's behavior.

I/O problem: a large fraction of the time
  • For the large images.
  • Can we get better I/O performance?

Lab work
2 large FPGA boards
Look at a clear system installation
Try the equipment and make sure it is working

Wednesday, January 19, 2011

Ideas for the next-step work

  1. Parallelize the JasPer 9/7 portion.
  2. Intel asm vectorization.
  3. Modify SimpleScalar to create the 3-D memory access pattern graph.

Thursday, January 13, 2011

Skype call discussion

Lossy compression rate/time: increase the rate?
Comparison with the original image? [confirmed]


Isolate the quantization from the previous paper.

Fig. 4
a) Drop the comment; replace it with '...'.

Fig. 5
a)

Index terms: use the IEEE standard index terms, 3-4 of them.
ieee.ca -> next CCECE -> author kit...

Wednesday, January 12, 2011

Skype discussion for the paper

Section 3

  1. Slightly longer.
  2. Look at the cache profile to shorten it.
  3. Scale the memory access graph down (the vertical dimension should be smaller).
  4. Look at the source code for qmfb.c: colgrp is defined as a constant, which is 16; follow up on whether version 1.9 does this.
  5. Is it working on the entire column but 16 columns at a time, or does it only work on some columns, not all? (See the sketch after this list.)
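
A simplified sketch of the column-group idea behind items 4-5 (illustrative, not the actual qmfb.c routine): walk the tile in vertical strips of up to 16 columns so a strip stays cache-resident while every row is visited, instead of processing one whole column, or all columns, at a time.

#define COLGRPSIZE 16   /* mirrors the constant discussed for qmfb.c */

/* Process a height x width tile in vertical strips of up to 16 columns,
   so each strip of columns stays cache-resident while every row is visited. */
void process_columns(int *buf, int width, int height, int stride)
{
    for (int j = 0; j < width; j += COLGRPSIZE) {
        int grp = (width - j < COLGRPSIZE) ? (width - j) : COLGRPSIZE;
        for (int i = 0; i < height; i++)
            for (int k = 0; k < grp; k++)
                buf[i * stride + j + k] += 1;   /* placeholder for the lifting step */
    }
}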

Section 5

  1. Which image is used in the discussion?
  2. Figure 7: the graph should be smaller and the font size larger.
  3. Rename the color image -> PCB _original _half _tiled.
  4. Figure 5 for the presentation: remove ...
  5. Move the open bracket.
  6. Do the same thing for figure 4.
  7. "retrieve pointer value..." -> "set".
  8. Figure changes for fig. 2.
  9. Table 1 should be changed.
  10. Image thumbnails in the paper?
  11. The table caption goes on top.
  12. Heading for section 4: drop the word "initial".
  13. Section 4 heading: keep it as is.
  14. Title: drop the word "execution".


TOMORROW
  • Create a .zip file containing everything.
  • Phone number...

Friday, January 7, 2011

Skype call discussion

Section 2
  • They used 9/7.
  • We chose to use 5/3, but we tried 9/7.


Section 3
  • Cache behavior: instead, start in general; show the Cachegrind results and the SimpleScalar results, and show the figure (2-D memory graph). Most data is not touched frequently; the accesses are concentrated in certain areas.
  • Cachegrind: 7 million L2 misses, 0.34%; SimpleScalar direct-mapped misses 2% (L1 = L2, worst case), best = 0.3%; RISC vs. CISC.
  • Later, generate the numbers for the memory regions that have been accessed to produce a 3-D bar graph; numbers separated by commas.


Section 4 is appropriate.

The cluster used is not the latest and greatest with DDR3 memory; make a comment about this somewhere. We already get a relatively low cache miss rate.

Section 5 [me, adding material]
  • Written comments.
  • Confirm the number of instructions [done].
  • At the end of section 5.2, add a brief discussion, along with a figure to show the various speedups, in one paragraph: cats / color / color2 / color4 / galaxy / galaxy_4 (4k by 4k).
  • 9/7 relies on the profile; comment on the previous work [section 3.5?].

Experiment
  • galaxy 4k x 4k (and 8k x 8k) / color (longest execution time): native runs with 9/7, timed (30% improvement).
  • Profile for the galaxy 4k x 4k native run.

Send
  • section 5 - pdf and latex
  • section 3 - pdf and latex

Wednesday, January 5, 2011

Skype call discussion

1. Update the captions for the fused loops
  • Caption (a): original loop with a pointer index variable.
  • Caption (b): original loop with an integer index variable.
  • Add more detail in the loop body.
  • Add some explanation in the text of the paper.

2. Experiments
  1. Confirm that the pointer -> integer change does not increase the execution time; compare the Cachegrind results (original vs. fused loop version).
  2. Native hardware run (pointer-based loop (original) and integer-based loop only); see the sketch below.
  3. Try the 9/7 configuration and run the image encoding for both 9/7 and 5/3.
  • On a single processor, natively; see whether there is any significant difference.
  • gprof for 9/7 to see the difference in the cblk part and the dwt portion.
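
For item 2, a minimal sketch (hypothetical loop body, not the actual JasPer code) of the two forms being compared: the original pointer-index loop and the integer-index rewrite, which compute the same result and, per item 1, should show roughly the same execution time and Cachegrind profile.

#include <stddef.h>

/* (a) Original form: pointer index variable. */
void scale_ptr(int *data, size_t n, int k)
{
    for (int *p = data; p < data + n; p++)
        *p *= k;
}

/* (b) Rewritten form: integer index variable (easier to parallelize,
   e.g. with an OpenMP "parallel for" over i). */
void scale_idx(int *data, size_t n, int k)
{
    for (size_t i = 0; i < n; i++)
        data[i] *= k;
}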