Using Faust for ER-301 DSP development

I created a thread on a unit I’ve been working on (Custom reverb unit: Dattorro Reverb) but I thought a separate thread on how I’ve used Faust might be interesting to some other people doing DSP development for the 301.

I started looking at this because I’m not really a c++ developer (and don’t really want to be either :slight_smile: )

So anyway, Faust is a high-level language for DSP that can generate C++. You need to supply some wrapper code for your chosen target platform.

Here’s the DSP code for the reverb:

You can cut and paste that straight into the online Faust IDE and then play around with running wav files through it.

If you look into the Makefile in my repo, you can see how the .cpp and .h files are created from the .dsp file.

The inputs and outputs of the process function (Faust’s equivalent of main) become inputs and outputs of the object, and Faust controls (like the hslider) become parameters. (The declare statements are needed because the inputs and outputs of process are not named, but we need names in the ER-301 object.)

For the Makefile to work, these executables need to be in your $PATH.

This is all a work in progress, and the way faust2er301 works may not be the best, but it is working…

The other interesting thing is that Faust claims to be able to optimise the generated C++ to take advantage of SIMD (Using the Compiler - Faust Documentation). However, as I’ve experimented with these compiler options, the CPU usage on my 301 has usually gone up by a few percentage points, so there is still some work to do here.


This is a brilliant development! I’ve been using Faust on my Befaco Lich module for some time now, but it never occurred to me to target the ER too! Very good idea!


Wow excellent work! This is a huge development for 301 DSP, although to my eyes the Faust syntax looks very alien compared to C++ :slight_smile:

Especially amazing that it’s able to take advantage of SIMD! I’d love to see some of the generated code for this, in particular the way it creates delay lines. Brian’s frame buffers seem kind of particular about the way memory is allocated and accounted for, so that could be something to look out for.

Edit: Really it’s hard to overstate how big a door you’ve opened here, seriously excited to see the new units that will be created with this. A quick search turned up extensive documentation on how to work with Faust.


Yeah, the Faust syntax does indeed look pretty alien, but it opens up once you’ve read the docs a bit. It’s definitely a different style from C++. If you’ve ever worked with Haskell, that would probably help a little.

Yeah, the docs are pretty good, and the libraries are extensive too.

I’ve only looked at the generated code for that reverb, which is pretty enormous. I think I’ll need to look at the generated code for some really basic things to try to work out why I’ve not been seeing CPU improvements with the SIMD options turned on. I’ll post some here…


Wow, great work! It looks like you can even generate Faust (and C++) code from an online, visual patching environment:

https://faustplayground.grame.fr

The export options include things like Bela, Juce, VCV Rack, etc… maybe we could get an ER-301 option in there?

EDIT: Looks like the various export services exist as makefiles here: https://github.com/grame-cncm/faustservice/tree/master/makefiles

…which use “architectures” here: https://github.com/grame-cncm/faust/tree/master-dev/architecture


That does look cool.

So far, the thing I have only generates the .cpp and .h files. You still need to write your own Lua file for the interface, plus the SWIG file and the toc.lua. I thought about trying to get it to build the UI Lua file too, but backed away from that, as it seems you’d need to bake in more assumptions than I wanted about how things are translated into the UI…


So here’s a basic example (not chosen for any musical merit!) – a mono delay/filter, delay.dsp:

process = _ <: @(2000), @(1000), @(100), _ :> /(4);

It averages the current sample and samples delayed by 100, 1000, and 2000.

Running faust delay.dsp gives (edited down for brevity):

#ifndef FAUSTFLOAT
#define FAUSTFLOAT float
#endif

class mydsp : public dsp {
        
 private:
        int IOTA;
        float fVec0[2048];
 public:
        virtual void compute(int count, FAUSTFLOAT** inputs, FAUSTFLOAT** outputs) {
                FAUSTFLOAT* input0 = inputs[0];
                FAUSTFLOAT* output0 = outputs[0];
                for (int i0 = 0; (i0 < count); i0 = (i0 + 1)) {
                        float fTemp0 = float(input0[i0]);
                        fVec0[(IOTA & 2047)] = fTemp0;
                        output0[i0] = FAUSTFLOAT((0.25f * (fVec0[((IOTA - 2000) & 2047)] + (fVec0[((IOTA - 1000) & 2047)] + (fTemp0 + fVec0[((IOTA - 100) & 2047)])))));
                        IOTA = (IOTA + 1);
                }
        }
};
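
The delay-line indexing in that generated code relies on rounding the buffer up to a power of two (2048 > 2000) so the wrap-around is a single bitwise AND rather than a modulo. Here is a minimal C++ sketch of the same pattern; the names (RingDelay, kMask) are mine, not Faust’s:

```cpp
#include <cassert>

// Minimal sketch of the ring-buffer pattern in the generated code above.
// The buffer size is the next power of two above the longest delay
// (2048 > 2000), so indices wrap with a cheap bitwise AND instead of a
// modulo. Names here (RingDelay, kMask) are illustrative, not Faust's.
struct RingDelay {
    static const int kSize = 2048;
    static const int kMask = kSize - 1;  // 0x7FF
    float buf[kSize] = {0.0f};
    int iota = 0;  // monotonically increasing write index, like Faust's IOTA

    float process(float x) {
        buf[iota & kMask] = x;
        // Average of the current sample and the 100/1000/2000-sample taps.
        // The subtraction can go negative early on; the AND still wraps it
        // correctly on two's-complement machines, as in the generated code.
        float y = 0.25f * (x + buf[(iota - 100) & kMask]
                             + buf[(iota - 1000) & kMask]
                             + buf[(iota - 2000) & kMask]);
        iota += 1;
        return y;
    }
};
```

The AND also handles the negative (IOTA - 2000) index case via two’s-complement wrap, which appears to be why the generated code can subtract the tap offsets directly without a range check.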

Running faust -vec -vs 4 delay.dsp gives (again edited down for brevity):

#ifndef FAUSTFLOAT
#define FAUSTFLOAT float
#endif 

class mydsp : public dsp {
        
 private:
        float fYec0[2048];
        int fYec0_idx;
        int fYec0_idx_save;
 public:
        virtual void compute(int count, FAUSTFLOAT** inputs, FAUSTFLOAT** outputs) {
                FAUSTFLOAT* input0_ptr = inputs[0];
                FAUSTFLOAT* output0_ptr = outputs[0];
                int vindex = 0;
                /* Main loop */
                for (vindex = 0; (vindex <= (count - 4)); vindex = (vindex + 4)) {
                        FAUSTFLOAT* input0 = &input0_ptr[vindex];
                        FAUSTFLOAT* output0 = &output0_ptr[vindex];
                        int vsize = 4;
                        /* Vectorizable loop 0 */
                        /* Pre code */
                        fYec0_idx = ((fYec0_idx + fYec0_idx_save) & 2047);
                        /* Compute code */
                        for (int i = 0; (i < vsize); i = (i + 1)) {
                                fYec0[((i + fYec0_idx) & 2047)] = float(input0[i]);
                        }
                        /* Post code */
                        fYec0_idx_save = vsize;
                        /* Vectorizable loop 1 */
                        /* Compute code */
                        for (int i = 0; (i < vsize); i = (i + 1)) {
                                output0[i] = FAUSTFLOAT((0.25f * (fYec0[(((i + fYec0_idx) - 2000) & 2047)] + (fYec0[(((i + fYec0_idx) - 1000) & 2047)] + (float(input0[i]) + fYec0[(((i + fYec0_idx) - 100) & 2047)])))));
                        }
                }
                /* Remaining frames */
                if ((vindex < count)) {
                        FAUSTFLOAT* input0 = &input0_ptr[vindex];
                        FAUSTFLOAT* output0 = &output0_ptr[vindex];
                        int vsize = (count - vindex);
                        /* Vectorizable loop 0 */
                        /* Pre code */
                        fYec0_idx = ((fYec0_idx + fYec0_idx_save) & 2047);
                        /* Compute code */
                        for (int i = 0; (i < vsize); i = (i + 1)) {
                                fYec0[((i + fYec0_idx) & 2047)] = float(input0[i]);
                        }
                        /* Post code */
                        fYec0_idx_save = vsize;
                        /* Vectorizable loop 1 */
                        /* Compute code */
                        for (int i = 0; (i < vsize); i = (i + 1)) {
                                output0[i] = FAUSTFLOAT((0.25f * (fYec0[(((i + fYec0_idx) - 2000) & 2047)] + (fYec0[(((i + fYec0_idx) - 1000) & 2047)] + (float(input0[i]) + fYec0[(((i + fYec0_idx) - 100) & 2047)])))));
                        }
                }
        }

};

If I wrap these appropriately into units and test on device, I don’t see any difference at all in CPU usage.

I would assume that in the second case the generated C++ has that /* Remaining frames */ section because Faust doesn’t assume the frame size is a multiple of 4, but in practice it will never be entered. For any C++ experts: does the structure of the /* Main loop */ mean it should be making use of SIMD instructions?

To measure the CPU usage, I added 40 instances of the unit to a chain. In both cases I ended up with 22% CPU usage.


Ah I see, yeah, this is stepping the main frame loop by four and then processing 4 samples per step in those inner loops. There’s no SIMD here at the moment, so it should benefit from it if possible.

The suspicious thing here is that float fVec0[2048] allocation. That should ideally come from the reserved frame buffer heap (I think, defer to @odevices).

It’s pretty cool to see how it did this, looking forward to learning some tricks :slight_smile:

Edit: I don’t think the compiler is smart enough to turn those loops into simd but I could be wrong of course. Checking the asm output would prove it one way or the other.

Edit2: You know, I probably am wrong. Trying to find compiler docs that say it uses auto-vectorization.

Yeah, the Faust docs suggest it should be auto-vectorised, but I don’t know; thanks for the link there – just scanning some asm output now…

I found this which implies passing the --neon compiler flag will do it? idk if this is the same version though https://e2e.ti.com/support/processors-group/processors/f/processors-forum/266613/vectorization-for-neon-am335x-starterkit

Edit: should see asm instructions like vmul, vld etc.

No v.. instructions :frowning_face:

I think the Makefile is already set to try auto-vectorisation. It’s passing a parameter that implies -mvectorize-with-neon-quad (that’s listed in comments at the top of the asm output). Also the third bullet here: https://github.com/odevices/er-301/tree/master/tutorial#tips-for-coding-with-neon-intrinsics

Just testing with a different delay in Faust, using the fractional delay fdelaylti (delays), I have been able to observe a very slight CPU improvement with -vec -vs 4, and looking now at the generated asm there are a bunch of v* instructions, vmul, vadd etc. So it is managing to do some auto-vectorising.

Would be good to know where the basic sample delay is falling down though. I feel that would be instructive.


GCC -O3 (which implies -ftree-vectorize) will optimize both cases and produce essentially the same code, probably slightly better assembly in the first simpler case since the second case might be getting in the compiler’s way a bit.

If you can convince the compiler that your loop count is always a multiple of 4, then it will also leave out the post-processing loop. In the ER-301 source code that is handled by the FRAMELENGTH macro.
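
As a rough illustration (not the actual ER-301 macro, whose definition I haven’t reproduced here), the idea is that a trip count the compiler can prove is a multiple of 4 lets it vectorize without emitting the scalar tail loop:

```cpp
#include <cassert>

// Sketch of the "trip count is a multiple of 4" idea. FRAME_N stands in for
// the real FRAMELENGTH macro from the ER-301 source; the point is only that
// a compile-time constant divisible by 4 lets -O3's auto-vectorizer emit
// pure SIMD with no "remaining frames" cleanup loop.
#define FRAME_N 128

void applyGain(const float* in, float* out, float g) {
    // Constant trip count, divisible by 4: no tail iterations needed.
    for (int i = 0; i < FRAME_N; ++i) {
        out[i] = g * in[i];
    }
}
```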

This shouldn’t affect the compute time. However, you are right that I should probably document the memory allocation situation. For now here is an outline:

  • In true RTOS-style, I avoid the use of virtual memory. The MMU is configured to be a simple pass-thru.

  • RAM is divided (at link time) into basically 4 sections:

    1. Program memory (total size of the firmware .text and .data sections)
    2. C/C++ runtime heap (20MB) used by new/delete, malloc/free and so on. Dynamically linked code is also loaded here. Since the dynamic linker does not support far jumps, this memory must be as close as possible to the program memory.
    3. The .bss section (memory statically allocated by the firmware)
    4. BigHeap (whatever is left over, typically around 480MB).
  • You SHALL NOT allocate memory in the audio thread. Thus if you need memory for audio processing you must either pre-allocate it (in a non-audio thread) during construction (this includes static allocations) or request it from the pre-allocated frame buffer pool (a constant-time allocator).

  • In particular, use the frame buffer pool for temporary scratch space that doesn’t need to be preserved between calls from the DSP scheduler (i.e. calls to your process method). This improves cache efficiency and reduces pressure on the kernel memory.

  • Large contiguous amounts of memory (anything more than ~32KB) should be allocated from the BigHeap.
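
For anyone curious what a constant-time allocator like the frame buffer pool might look like internally, here is a toy sketch (my own illustrative code, not the ER-301 implementation): all blocks are reserved up front, and alloc/free are O(1) free-list pushes and pops, which is what makes them safe to call from the audio thread.

```cpp
#include <cassert>
#include <cstddef>

// Toy fixed-block pool in the spirit of the frame buffer pool described
// above (illustrative only; not the actual ER-301 API). All storage is
// reserved up front, and allocate/release are constant-time free-list
// operations with no locks or system calls.
template <std::size_t BlockBytes, std::size_t NumBlocks>
class FramePool {
    union Block {
        Block* next;            // link while the block sits on the free list
        char bytes[BlockBytes]; // payload while the block is in use
    };
    Block storage[NumBlocks];
    Block* freeList;
public:
    FramePool() : freeList(nullptr) {
        for (std::size_t i = 0; i < NumBlocks; ++i) {
            storage[i].next = freeList;
            freeList = &storage[i];
        }
    }
    void* allocate() {          // O(1): pop the head of the free list
        if (freeList == nullptr) return nullptr;
        Block* b = freeList;
        freeList = b->next;
        return b;
    }
    void release(void* p) {     // O(1): push the block back on the list
        Block* b = static_cast<Block*>(p);
        b->next = freeList;
        freeList = b;
    }
};
```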


Just to confirm I’m understanding right: in this case, you’re saying it’s okay but not ideal, and it should be using the BigHeap?

2048 floats is 8KB.

Hmm, maybe I was too conservative in the outline. I see that MicroDelay allocates a 0.1s buffer from kernel memory, which is about 19KB at 48kHz. So I think I will amend the recommended threshold to around 32KB (not a few KB).

The goal is to not unnecessarily fragment the BigHeap with lots of little buffers, which would prevent the allocation of very large buffers for sample memory and the like.
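
For what it’s worth, the sizes quoted in this thread check out (a quick sanity check, assuming 4-byte floats):

```cpp
#include <cassert>

// Sanity check of the sizes mentioned in the thread: MicroDelay's 0.1 s
// buffer at 48 kHz, and the 2048-float delay line from the earlier example.
constexpr int kSampleRate = 48000;
constexpr int kMicroDelaySamples = kSampleRate / 10;  // 0.1 s of audio
constexpr int kMicroDelayBytes =
    kMicroDelaySamples * static_cast<int>(sizeof(float));  // about 19 KB
constexpr int kDelayLineBytes =
    2048 * static_cast<int>(sizeof(float));                // exactly 8 KB
```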


Thanks a lot! Trying to experiment with Faust and the ER-301. I found out that for these executables to work on macOS Big Sur (without Docker, to build packages for the emulator), /bin/env needs to be replaced with /usr/bin/env at the beginning of the files. Also, on line 27 of faust2er301 the --tmpdir option needs to be removed (evidently this option only exists on Linux).


Thanks for the feedback!

I’ll look to update those. The env change will be fine on Linux too. I’ll have a look into what to do about --tmpdir.

Let me know if you run into any more issues :slight_smile:

Thanks! Everything good so far. Maybe an option to enable / disable vectorization in the makefile would be great for experiments…


Thought I’d share a little study utility here that’s actually useful to me: it allows controlling stereo width by converting the signal to M/S and back to stereo after adjusting the M and S levels. I am not a coder and know nothing yet of NEON intrinsics, but 10 of the same unit loaded took only 2% CPU, which is okay with me. I love how Faust code can be written in a very comprehensible way:

import("stdfaust.lib");

xytoms (x,y) = (x+y),(x-y);
ctrl (m,s) = (m*hslider("M",1,0,1,0.01)), (s*hslider("S",1,0,1,0.01));
mstoxy (m,s) = (m+s),(m-s);

process = xytoms : ctrl : mstoxy;
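
For comparison, here is the same M/S round trip in plain C++ (my own translation, not generated by Faust). One thing the math makes visible: with m = x + y and s = x - y, decoding as (m + s, m - s) returns (2x, 2y), so the Faust version above carries a gain of 2 at M = S = 1; scaling the encode by 0.5, as below, makes the round trip unity-gain.

```cpp
#include <cassert>
#include <utility>

// C++ translation of the Faust M/S width utility above (illustrative; the
// 0.5 factor on the encode is added here to make the round trip unity-gain).
std::pair<float, float> encodeMS(float x, float y) {
    return std::make_pair(0.5f * (x + y), 0.5f * (x - y));  // (mid, side)
}

std::pair<float, float> decodeMS(float m, float s) {
    return std::make_pair(m + s, m - s);                    // (left, right)
}

// Equivalent of `xytoms : ctrl : mstoxy` with the sliders at mGain/sGain.
std::pair<float, float> widen(float x, float y, float mGain, float sGain) {
    std::pair<float, float> ms = encodeMS(x, y);
    return decodeMS(ms.first * mGain, ms.second * sGain);
}
```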

I apologize if this is an inappropriate thread to do this.

Here it is: Github
