The loops up and dn in your assembler code should ideally be equally fast, as they each consist of two instructions: a str (Store Register) and a b (Branch). However, there can be differences in execution time.
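For reference, the structure in question presumably looks something like this (a sketch reconstructed from the description; the exact layout is an assumption, while the labels up/dn, registers and offsets are taken from the post):

up:
    str w0, [x1, #0x1c]   // GPSET0: drive the pin high
    b   dn
dn:
    str w0, [x1, #0x28]   // GPCLR0: drive the pin low
    b   up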
There are several reasons why this might not be the case:
Pipeline Effects and Dependencies:
Pipeline Filling: The CPU processes instructions in multiple stages (Fetch, Decode, Execute, etc.). If the branch (b) targets an address that is not already known to the branch predictor (e.g. the Branch Target Buffer) or resident in the instruction cache, this can stall the pipeline.
Cache Effects:
Instruction Cache: The branch targets of the two loops may fall into different cache lines or fetch blocks, so the loops can see different hit/miss behaviour (see the alignment sketch after this list).
Data Cache: If the addresses written by the str instructions are not placed favourably in the cache, the stores can incur additional latency.
Memory Accesses:
The two target addresses, [x1, #0x1c] and [x1, #0x28], might be treated differently in memory or in the cache. If one of them misses in the cache, or shares a cache line with a recently used address, this can add latency.
Instruction Synchronization:
Branch instructions can be affected differently by the CPU's internal branch prediction and Branch Target Buffer optimizations. If one loop's branch is predicted better than the other's, the execution times will differ.
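One simple way to address the fetch/alignment effects above is to force both branch targets onto the same alignment boundary, for example with the GNU assembler's .balign directive. This is only a sketch, not part of the original code, and the 16-byte boundary is an arbitrary choice:

    .balign 16            // pad so that 'up' starts on a 16-byte boundary
up:
    str w0, [x1, #0x1c]
    b   dn
    .balign 16            // give 'dn' the same alignment as 'up'
dn:
    str w0, [x1, #0x28]
    b   up

Giving both targets identical alignment removes one possible source of asymmetry between the two halves of the toggle.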
Revised Version to Minimize Timing Differences
This essentially looks the same, but it actually helped me once.
Code:
.text
.global main
main:
        ldr     x1, =0xfe200000         // GPIO base address (BCM2711 / Pi 4)
        mov     w0, #0x40000            // FSEL26 = 001: configure GPIO 26 as output
        str     w0, [x1, #0x08]         // write GPFSEL2
        mov     w0, #0x4000000          // bit 26: mask for GPIO 26
loop:
        str     w0, [x1, #0x1c]         // GPSET0: pin high
        b       .L1                     // explicit branch, mirrors the second half
.L1:
        str     w0, [x1, #0x28]         // GPCLR0: pin low
        b       loop
Statistics: Posted by satyria — Thu Dec 05, 2024 12:52 pm