Hi Bo, great job digging into the issue and providing this detailed analysis. To answer your question: yes, inside the P550 core, if nothing causes the store to drain, we are permitted to buffer it for a finite amount of time, and we do so to allow for additional store-combining opportunities. For our more recent products after P550, we simplify the fence variations and make them behave the same, which is easier for synthesis timing at higher clock frequencies, and this will also address the performance issue you so eloquently pointed out.