U740 Hardware Performance Monitor support for Linux perf

Hello All.

I’ve managed to get perf working with the HiFive Unmatched PMU on Linux.

The most recent OpenSBI needs to be patched. The Linux patches from Atish Patra are also needed, plus one additional patch (needed only if you want the firmware instret/cycle counts, i.e. those spent in M-mode).

Feel free to grab patches from:

Original Linux Kernel patch series:
https://patchwork.kernel.org/project/linux-riscv/cover/20210528184405.1793783-1-atish.patra@wdc.com/

New perf list output (with patches applied):

  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  cache-misses                                       [Hardware event]
  cache-references                                   [Hardware event]
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  ref-cycles                                         [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]

The reason for the OpenSBI patches is that OpenSBI currently relies on mcountinhibit, which is absent on the U740 (Unmatched). I don’t see any real need for such a strict check.

Looking forward to your comments.

Not sure if it’s due to something I messed up while rebasing, but I had to make the following change to OpenSBI to get it to compile; otherwise gcc throws “error: label at end of compound statement”:

diff --git a/lib/sbi/sbi_pmu.c b/lib/sbi/sbi_pmu.c
index 7486456b9ece1544ca6a4d77e2b4326c2720e662..3e1a0802eecfbf0844a9df31f9bd6b5344c76338 100644
--- a/lib/sbi/sbi_pmu.c
+++ b/lib/sbi/sbi_pmu.c
@@ -315,6 +315,7 @@ static int pmu_ctr_start_fw(uint32_t cidx, uint32_t fw_evt_code,
                fevent->data = csr_read_num(CSR_MINSTRET);
                break;
        default:
+               break;
        }
 
        fevent->bStarted = TRUE;
@@ -393,6 +394,7 @@ static int pmu_ctr_stop_fw(uint32_t cidx, uint32_t fw_evt_code)
                fevent->data = 0;
                break;
        default:
+               break;
        }
 
        return 0;

Hello @vmedea.

No, you did everything right. It’s my fault: “break;” is required after the label.

Thank you for reporting!

I fixed it and force-pushed to GitHub - YADRO-KNS/opensbi at yadro/unmatched/pmu.
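
For readers hitting the same compiler error, here is a minimal sketch of the rule involved. In C before C23, a label (including default:) must be followed by a statement, so a default: placed directly before the closing brace is rejected by older GCC releases with "label at end of compound statement". The enum and function names below are hypothetical and are not taken from OpenSBI.

```c
#include <stdint.h>

/* Hypothetical event codes, for illustration only (not OpenSBI's). */
enum demo_event {
    DEMO_EVT_CYCLES = 0,
    DEMO_EVT_INSTRET = 1
};

/* Pick the snapshot value for a known event; unknown events yield 0. */
uint64_t demo_snapshot(enum demo_event evt, uint64_t cycles, uint64_t instret)
{
    uint64_t data = 0;

    switch (evt) {
    case DEMO_EVT_CYCLES:
        data = cycles;
        break;
    case DEMO_EVT_INSTRET:
        data = instret;
        break;
    default:
        /*
         * Without a statement here, "default:" would sit directly before
         * the closing brace, which pre-C23 compilers reject.
         * "break;" (or an empty statement ";") satisfies the grammar.
         */
        break;
    }

    return data;
}
```

The explicit break; changes nothing at run time; it only gives the label a statement to attach to.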

Thanks! I managed to compile OpenSBI (PLATFORM=generic) and build it into U-Boot, and to build the kernel with the PMU driver included and CONFIG_RISCV_PMU_SBI enabled.

The device is visible in /sys/bus/platform/drivers/riscv-pmu and

$ dmesg|grep PMU
[    3.933499] SBI PMU extension is available

I see the extra hardware events (such as branch-instructions) in perf list.

However, I don’t seem to be getting any events. I tried perf top -e branch-instructions and perf top -e cycles. The numbers stay at 0, even though things are happening on the system. It only seems to work with software events like cpu-clock. Not sure if it’s related, but there are some of these in dmesg:

[ 3554.578191] Starting counter idx 0 failed with error -524
[ 5278.344707] Starting counter idx 2 failed with error -524

(-524 is ENOTSUPP, apparently)
Edit: oh, the command line from the LKML post does work:

# perf stat -e r8000000000000005 -e r8000000000000007 -e r8000000000000006 -e r0000000000020002 -e r0000000000020004 -e branch-misses -e cache-misses -e dTLB-load-misses -e dTLB-store-misses -e iTLB-load-misses -e cycles -e instructions  hackbench  --pipe 15 process
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 100 messages of 100 bytes
Time: 0.408

 Performance counter stats for 'hackbench --pipe 15 process':

               214      r8000000000000005                                             (53.71%)
             2,300      r8000000000000007                                             (62.60%)
             3,119      r8000000000000006                                             (68.50%)
     <not counted>      r0000000000020002                                             (0.00%)
     <not counted>      r0000000000020004                                             (0.00%)
     <not counted>      branch-misses                                                 (0.00%)
     <not counted>      cache-misses                                                  (0.00%)
     <not counted>      dTLB-load-misses                                              (0.00%)
     <not counted>      dTLB-store-misses                                             (0.00%)
     <not counted>      iTLB-load-misses                                              (0.00%)
       934,956,767      cycles                                                        (21.07%)
       539,665,451      instructions              #    0.58  insn per cycle           (40.55%)

       0.592143959 seconds time elapsed

       0.425950000 seconds user
       1.462119000 seconds sys
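
A note on the raw event strings in the command above. The sketch below assumes the convention that the SBI PMU Linux driver from this patch series appears to use: bit 63 of the raw config marks a firmware (SBI) event and the low 16 bits carry the event code, while values with bit 63 clear are passed through as hardware raw events. Treat that layout as an assumption and check it against the driver source you are actually running; the macro and helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASSUMPTION: bit 63 of a perf raw config selects a firmware (SBI) event. */
#define DEMO_RAW_FW_FLAG   (1ULL << 63)
/* ASSUMPTION: the firmware event code lives in the low 16 bits. */
#define DEMO_RAW_CODE_MASK 0xFFFFULL

/* True if the raw config would be treated as a firmware event. */
bool demo_raw_is_firmware(uint64_t config)
{
    return (config & DEMO_RAW_FW_FLAG) != 0;
}

/* Extract the firmware event code from a raw config. */
uint16_t demo_raw_fw_code(uint64_t config)
{
    return (uint16_t)(config & DEMO_RAW_CODE_MASK);
}
```

Under that assumption, r8000000000000005 decodes to firmware event code 5, whereas r0000000000020002 has bit 63 clear and would be handled as a hardware raw event.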

Interesting…
I should look into it, thanks for testing!

And what does

perf stat -r 5 --table -a -e r8000000000000016,r8000000000000018 sleep 5
perf stat -r 5 --table -a -e r8000000000000017,r8000000000000019 sleep 5

say?

And can you try invoking

perf top -e cycles

twice in a row?

Perf stat seems to work fine:

# perf stat -r 5 --table -a -e r8000000000000016,r8000000000000018 sleep 5

 Performance counter stats for 'system wide' (5 runs):

         6,026,738      r8000000000000016                                             ( +- 13.09% )
           482,410      r8000000000000018                                             ( +-  2.56% )

          # Table of individual measurements:
          5.002116 (+0.000545) #
          5.002065 (+0.000494) #
          5.001596 (+0.000025) #
          5.001231 (-0.000340) #
          5.000847 (-0.000724) #

          # Final result:
          5.001571 +- 0.000243 seconds time elapsed  ( +-  0.00% )
# perf stat -r 5 --table -a -e r8000000000000017,r8000000000000019 sleep 5

 Performance counter stats for 'system wide' (5 runs):

         2,951,172      r8000000000000017                                             ( +- 15.14% )
           405,660      r8000000000000019                                             ( +-  4.78% )

          # Table of individual measurements:
          5.001978 (+0.000960) #
          5.001235 (+0.000217) #
          5.000946 (-0.000072) #
          5.000662 (-0.000356) #
          5.000267 (-0.000751) #

          # Final result:
          5.001017 +- 0.000288 seconds time elapsed  ( +-  0.01% )

I tried perf top a few times; every time it stays at “Collecting samples…”.

Looks like “perf record” gets no samples, either. Maybe it’s related?

# perf record -e cycles -e instructions  -c 1000 hackbench
Running in process mode with 10 groups using 40 file descriptors each (== 400 tasks)
Each sender will pass 100 messages of 100 bytes
Time: 0.917
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.047 MB perf.data ]
# perf report --stdio
Error:
The perf.data data has no samples!

Looks like some issue with kernel PMU driver to me.

Thanks for the report - I’ll investigate it and report here!

No problem. I’m really happy to see this. I tried to get perf to work with the performance counters on the Unleashed board once but it went nowhere, too many levels of abstraction in between.

Speaking of which: don’t there need to be pmu and pmu,event-to entries in the DTS file for the board, for the vendor-specific counters? (and accompanying u-boot patch) Or are these general counters always available?

Perf record will not work on the HiFive Unmatched because it doesn’t implement the Sscofpmf extension, which provides local counter overflow interrupts.

However, the Linux PMU driver should print that event sampling is not supported in the absence of an Sscofpmf implementation. I will look into that.

Yes - Sscofpmf makes sense - I didn’t think about it.

Status update:

Feel free to grab branches with applied patches from:
U-Boot: GitHub - YADRO-KNS/u-boot at yadro/riscv/unmatched-v2021.07
Linux (v5.16-rc6): Commits · YADRO-KNS/linux · GitHub
Qemu (rebased on the top of v6.1.0): Commits · maquefel/qemu · GitHub

# perf stat sleep 1

 Performance counter stats for 'sleep 1':

              1.23 msec task-clock                #    0.001 CPUs utilized          
                 1      context-switches          #  815.661 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
                45      page-faults               #   36.705 K/sec                  
           1468356      cycles                    #    1.198 GHz                    
            508982      instructions              #    0.35  insn per cycle         
             69255      branches                  #   56.489 M/sec                  
             25223      branch-misses             #   36.42% of all branches        

       1.002246000 seconds time elapsed

       0.002639000 seconds user
       0.000000000 seconds sys

Events are now displayed correctly; see the table linked below.
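
As a quick sanity check, the derived columns in the perf stat output above can be recomputed from the raw counts. This is only a sketch of the arithmetic with hypothetical helper names, not perf's own code.

```c
#include <stdint.h>

/* Instructions retired per cycle; returns 0 when no cycles were counted. */
double demo_insn_per_cycle(uint64_t instructions, uint64_t cycles)
{
    return cycles ? (double)instructions / (double)cycles : 0.0;
}

/* Branch misses as a percentage of all branches; 0 when there are none. */
double demo_branch_miss_percent(uint64_t misses, uint64_t branches)
{
    return branches ? 100.0 * (double)misses / (double)branches : 0.0;
}
```

With the numbers above, 508982 / 1468356 gives about 0.35 insn per cycle and 25223 / 69255 gives about 36.42% of all branches, matching the perf stat columns.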

You still can’t use sampling (perf record, perf top) with a hardware counter as the group leader, and that won’t become possible on this hardware, but the error is now reported correctly:

# perf record 
Error:
cycles: PMU Hardware doesn't support sampling/overflow-interrupts. Try 'perf stat'

You can still record with task-clock, cpu-clock (or any other software counter) as the group leader, if that is useful for you, e.g.:

# perf record -e '{cpu-clock,cycles,instructions,branches}:s' sleep 1
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.003 MB perf.data (36 samples) ]

And even build flame graphs with -g.

Please give feedback on this table: Perf list to FU740 HPM bindings question

Hi,

This is great, but I’m curious how we can validate that the counters are accurate. With more events than counters, is the kernel automatically multiplexing them? These values could be more of an estimate than an actual count, so I’m wondering how we can gain confidence.

Thanks.
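
On the multiplexing question: the percentages in parentheses in the earlier perf stat output are the share of time each event was actually scheduled on a counter, and perf extrapolates the reported value from the raw count using the enabled and running times. The sketch below shows that extrapolation under the usual formula raw * time_enabled / time_running; the function names are hypothetical and the rounding choice is an assumption, not perf's exact implementation.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Extrapolate a multiplexed count to the full enabled time.
 * Returns 0 when the event never got onto a counter
 * (time_running == 0), since nothing can be extrapolated then.
 */
uint64_t demo_scale_count(uint64_t raw, uint64_t time_enabled,
                          uint64_t time_running)
{
    /* An event cannot run for longer than it was enabled. */
    assert(time_running <= time_enabled);

    if (time_running == 0)
        return 0;

    /* long double avoids overflowing raw * time_enabled in 64 bits. */
    long double scaled = (long double)raw * (long double)time_enabled /
                         (long double)time_running;

    return (uint64_t)(scaled + 0.5L); /* round to nearest (assumption) */
}

/* Share of the enabled time the event was scheduled, in percent. */
double demo_running_percent(uint64_t time_enabled, uint64_t time_running)
{
    return time_enabled
        ? 100.0 * (double)time_running / (double)time_enabled
        : 0.0;
}
```

For example, an event that counted 100 while scheduled for a quarter of the enabled time is reported as roughly 400, so heavily multiplexed values are estimates. Requesting no more events than there are hardware counters should avoid the scaling entirely and give exact counts.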