CVMstoreImpl assembly code

gyu

Hi,

I have two questions about CVMstoreImpl in src/linux-arm/javavm/include/sync_arch.h.

/* NOTE: CVMmicrolockUnlock() could conceptually be implemented using
   a normal assignment statement. It is implemented using a call to
   CVMatomicSwapImpl() instead to enforce the 'volatile' status of the
   lockWord. Calling a function to set it ensures that it is done at
   the location that the programmer intended and not shuffled around
   by the compiler. */
#define CVMmicrolockUnlock(m) \
    CVMstoreImpl(CVM_MICROLOCK_UNLOCKED, &(m)->lockWord)

static inline void
CVMstoreImpl(CVMAddr new_value, volatile CVMAddr *addr)
{
    CVMAddr scratch;
#ifndef __RVCT__
    asm volatile (
        "str %1, [%2];"
        "ldr %0, [%2]"
        : "=r" (scratch)
        : "r" (new_value), "r" (addr)
        /* clobber? */);
#else
    __asm
    {
        str new_value, [addr];
        ldr scratch, [addr]
    }
#endif
}

Q1.

Why is "ldr %0, [%2]" needed? The variable scratch is never used.

Q2.

The comment says CVMmicrolockUnlock is implemented using a call to CVMatomicSwapImpl() to enforce the "volatile" status of the lockWord.

As far as I know, an inline function does not have its own body, and CVMstoreImpl is an inline function. Therefore, the instructions of CVMstoreImpl may be moved to a location that the programmer did not intend. Is that right?

Regards,
SangGyu

arunib

And the test results are:

With atomicops:
It took 973ms to execute 1000000 synchronized null loops

With no fastlock:
It took 554ms to execute 1000000 synchronized null loops

Clearly there is a performance degradation.

cjplummer

I got similar results on linux/x86. However, on linux/arm the default version was faster, and I believe it uses CVM_FASTLOCK_MICROLOCKS.

If you have it allocate a new object on each iteration, you'll find that the time difference becomes much smaller. So it seems that the effort to inflate an object monitor is fairly big, but once inflated, the inflated object monitor is faster than an uninflated re-entrant fastlock record.

Basically, whether you have fastlocks or use system mutexes, the locking code needs to do something similar (basically bookkeeping on some sort of lock record). I think if you have a system where you always use a heavyweight inflated monitor, then you can possibly do better when dealing with contended or re-entered locks. However, uncontended locks, or ones that are not re-entered, will perform much slower this way.
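A greatly simplified sketch of the fastlock idea being described here, with hypothetical names (this is not the phoneME code, and the real fastlock also tracks re-entry counts): the uncontended case is a single compare-and-swap on the object's lock word, and anything harder falls back to the heavyweight path.

#include <stdbool.h>

typedef struct {
    volatile unsigned long lockWord;   /* 0 means unlocked */
} ObjHeader;

/* Slow path: inflate to a heavyweight monitor (not shown). */
void inflateAndLock(ObjHeader *o, unsigned long self);

static void
fastLock(ObjHeader *o, unsigned long self)
{
    unsigned long unlocked = 0;
    /* Uncontended case: one compare-and-swap installs our owner id. */
    if (!__atomic_compare_exchange_n(&o->lockWord, &unlocked, self,
                                     false, __ATOMIC_ACQUIRE,
                                     __ATOMIC_RELAXED)) {
        /* Contended or re-entered: do the heavyweight bookkeeping. */
        inflateAndLock(o, self);
    }
}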

Another way of looking at this: if you always assume the worst, and the worst never happens, then you'll pay a performance price for having assumed the worst. However, if the worst always happens, then you'll see gains. With CVM_FASTLOCK_NONE, you are assuming the worst locking cases, and your test case is an example of one.

arunib

Thanks, Chris, for the reply. I think that answers my question.

I have one more odd result with the JIT: when the JIT is enabled, a synchronized null method always executes more slowly (slower than interpreted). Why should this be so?

A synchronized null method is

public synchronized void nullMethod(){}

Arun

cjplummer

Because for x86 we never got around to providing a real implementation of CVMCCMinvokeNonstaticSyncMethodHelper. It just defers to the interpreter. We have implemented it on all the other platforms, and your test case will run about twice as fast with the JIT on them.

Chris

cjplummer

Locks that are re-entered by the same thread should be faster with atomicops than without. Are you sure the example is as simple as the one you gave, and there isn't, for example, another thread also contending for the lock?

Chris

arunib

Yea, it's as simple as that... When it is re-entered just twice, atomicops is faster. But in the case of re-entering 3 and 4 times, it's much slower than fastlock=none...

Arun

cjplummer

> Yea, it's as simple as that... When it is re-entered
> just twice, atomicops is faster. But in the case of
> re-entering 3 and 4 times, it's much slower than
> fastlock=none...
>
> Arun

Please provide the following:

* Complete test case, not just snippets
* Version of CDC you are using (svn revisions are best)
* Platform (which build directory, and what is your device)
* Any changes you have made to the CDC sources, including those made to use fastlock=none
* Build command
* Build output
* Test results (actual benchmark times with and without atomicops)

thanks,

Chris

arunib

/* Test case */

public class FastLockTest {
    public static void main(String args[]) {
        Object o1 = new Object();
        long t1 = System.currentTimeMillis();
        for (int i = 0; i < 1000000; i++) {
            synchronized (o1) {
                synchronized (o1) {
                    synchronized (o1) {
                        synchronized (o1) {
                            synchronized (o1) {
                            }
                        }
                    }
                }
            }
        }
        long t2 = System.currentTimeMillis();
        System.out.println("It took " + (t2 - t1) + "ms to execute 1000000 synchronized null loops");
    }
}

Version of CDC: phoneME Advanced (phoneme_advanced_mr2-b34)
Platform: x86

Changes made for fastlock=none:

1) linux-x86/javavm/include/defs_arch.h

Change the following lines
#define CVM_ADV_ATOMIC_CMPANDSWAP
#define CVM_ADV_ATOMIC_SWAP
to
#undef CVM_ADV_ATOMIC_CMPANDSWAP
#undef CVM_ADV_ATOMIC_SWAP

2) linux-x86/javavm/include/sync_arch.h

Change the line
#define CVM_FASTLOCK_TYPE CVM_FASTLOCK_ATOMICOPS
to
#define CVM_FASTLOCK_TYPE CVM_FASTLOCK_NONE


arunib

Sorry that the last comment was incomplete. The changes for fastlock type none are:

1) linux-x86/javavm/include/defs_arch.h

#define CVM_ADV_ATOMIC_CMPANDSWAP
#define CVM_ADV_ATOMIC_SWAP

Change the above lines to
#undef CVM_ADV_ATOMIC_CMPANDSWAP
#undef CVM_ADV_ATOMIC_SWAP

2) linux-x86/javavm/include/sync_arch.h

#define CVM_FASTLOCK_TYPE CVM_FASTLOCK_ATOMICOPS

Change the above line to
#define CVM_FASTLOCK_TYPE CVM_FASTLOCK_NONE

cjplummer

> Q1.
>
> Why is "ldr %0, [%2]" needed? The variable scratch
> is never used.
>
I believe this was a bit of a hack to fix a major performance issue with swp on the StrongARM. If the address wasn't cached, then swp would take something like 50 cycles instead of about 7. You'll also see the following in our JIT assembler helpers to do the same:

ldr TEMP, [MICROLOCK]

And then TEMP is never used. This is done a few instructions before the swp, which seemed to help performance of the swp even more. In the case of the C code, doing the ldr after unlocking the microlock (setting things up for the next lock) instead of before locking seemed to help more. I think that's because if you put the ldr right before the swp, it's not soon enough and the swp can still take a long time.

I'm not sure why we also do an ldr after the swp instruction, although I have a hunch. We used to use CVMatomicSwapImpl on both the lock and the unlock, but then found that for the unlock a store would suffice. So originally the ldr went into CVMatomicSwapImpl, and it was later cloned when CVMvolatileStoreImpl was created. The ldr in CVMatomicSwapImpl can probably be removed.
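A minimal sketch of the trick described above, assuming an older ARM core where swp is still available (hypothetical names, not the phoneME sources): a throwaway ldr warms the cache line so the swp that follows runs in a few cycles instead of dozens.

typedef unsigned long CVMAddr;

/* Atomically swap new_value into *addr, first touching the line with
   a discarded load so the swp finds it in the cache. */
static inline CVMAddr
swapWithPrefetch(CVMAddr new_value, volatile CVMAddr *addr)
{
    CVMAddr old_value, scratch;
    asm volatile (
        "ldr %1, [%3]\n\t"   /* throwaway load: pulls *addr into the cache */
        "swp %0, %2, [%3]"   /* the swap is fast once the line is cached */
        : "=&r" (old_value), "=&r" (scratch)
        : "r" (new_value), "r" (addr)
        : "memory");
    return old_value;
}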

> Q2.
>
> The comment says CVMmicrolockUnlock is implemented
> using a call to CVMatomicSwapImpl() to enforce the
> "volatile" status of the lockWord.
>
> As far as I know, an inline function does not have
> its own body, and CVMstoreImpl is an inline function.
> Therefore, the instructions of CVMstoreImpl may be
> moved to a location that the programmer did not
> intend. Is that right?
>
That is why the asm instruction uses the volatile keyword. It prevents the compiler from rescheduling it.
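To illustrate the point with a minimal sketch (not the phoneME sources; hypothetical helper names): a plain assignment may be sunk, hoisted, or merged by the compiler even after inlining, while an asm marked volatile must be emitted exactly where it appears in program order.

typedef unsigned long CVMAddr;

/* Plain store: once inlined, the compiler is free to move or combine
   this with surrounding non-volatile code. */
static inline void
plainStore(CVMAddr v, CVMAddr *addr)
{
    *addr = v;
}

/* Pinned store: "volatile" forbids the compiler from deleting or
   reordering the asm, so the str happens exactly at the call site
   even after the function body is inlined. */
static inline void
pinnedStore(CVMAddr v, volatile CVMAddr *addr)
{
    CVMAddr scratch;
    asm volatile ("str %1, [%2]\n\t"
                  "ldr %0, [%2]"
                  : "=&r" (scratch)
                  : "r" (v), "r" (addr)
                  : "memory");
    (void)scratch;
}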

Chris

arunib

Why is the atomicops fastlock slower than no fastlock (fastlocktype=none) in the case of synchronized blocks?

See the output of the FastSync.java test case.

cjplummer

Atomicops are faster. If you are seeing otherwise, then either there is a bug, or you have some pathological test case (like locks always being contended). You will need to provide details about how you built CDC, how you ran TestSync, and what the results are that you are questioning.
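For what it's worth, here is a minimal sketch of such a pathological case in plain pthreads (nothing phoneME-specific): with two threads hammering one lock, nearly every acquisition takes the contended slow path.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;

/* Each thread fights for the same lock, so the uncontended fast path
   almost never applies. */
static void *
worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int
main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);
    return 0;
}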

arunib

I agree. I think the test case is a case of thread contention; that should be the reason why it's slower.
One more observation I had was that synchronized execution with the JIT-enabled CVM is slower than the JIT-disabled version. Why is this so?

cjplummer

Which test case is slower? If it is for a contended lock, then the reason is that we defer to the interpreter to handle the contended case. Contended locks are rare and difficult to handle, so it was decided that the existing support in the interpreter should be used:

LABEL(_objAlreadyLocked)
    cmp ip, #CONSTANT_CVM_LOCKSTATE_LOCKED
    bne _fastReentryFailed

    ...

LABEL(_fastTryLockFailed)
    ldr MB, [NEW_JFP, #OFFSET_CVMFrame_mb]
    b letInterpreterDoInvoke

arunib

synchronized (lock) {
    synchronized (lock) {
        synchronized (lock) {
            synchronized (lock) {
            }
        }
    }
}

This is a typical case: re-entering the same lock. I don't know what the use of taking the same lock again is, but in this case fastlock_atomicops is slower than fastlock_none.

arunib

Regarding the JIT becoming slower: we are using the default C helpers (in ccm_runtime.c) for object allocation, so I think it's not using the fastlock at all...

cjplummer

> Regarding the JIT becoming slower: we are using the
> default C helpers (in ccm_runtime.c) for object
> allocation, so I think it's not using the fastlock
> at all...

Object allocation does not use fastlocking, but it does use what we call a microlock, which accomplishes something similar in the contended case. The assembler code in ccmallocate_cpu.c does attempt to grab the microlock with just a swap instruction. If successful, it can then allocate the object inline if there is room in the youngGen for it. If you've changed it to instead just call out to the C helper in ccm_runtime.c, this will make it slower, but it should not make the JIT slower than the interpreter.
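For flavor, a rough sketch of the fast path being described, with hypothetical names (the real code lives in ccmallocate_cpu.c): one atomic swap grabs the microlock, and if there is room in the young generation, allocation is just a pointer bump.

#include <stddef.h>

static volatile unsigned long allocMicrolock;   /* 0 = unlocked */
static char youngGen[1 << 20];                  /* toy young generation */
static char *youngGenTop = youngGen;
static char *youngGenLimit = youngGen + sizeof(youngGen);

void *slowPathAlloc(size_t size);               /* the C helper fallback */

static void *
fastAlloc(size_t size)
{
    void *obj = NULL;
    /* One atomic swap tries to take the microlock without blocking. */
    if (__atomic_exchange_n(&allocMicrolock, 1UL, __ATOMIC_ACQUIRE) == 0UL) {
        if (youngGenTop + size <= youngGenLimit) {
            obj = youngGenTop;                  /* bump-pointer allocation */
            youngGenTop += size;
        }
        __atomic_store_n(&allocMicrolock, 0UL, __ATOMIC_RELEASE);
    }
    /* No room, or lock was busy: defer to the slow path in ccm_runtime.c. */
    return obj != NULL ? obj : slowPathAlloc(size);
}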

BTW, why are you changing how object allocation is done? If you are using something other than the ARM port, or have done your own port, you really need to let us know when you ask these types of questions.

thanks,

Chris