
Matlab: replicate columns of a matrix an arbitrary number of times

A code story by Ahmed Fasih 2016/5/30

We needed to replicate each column of an array a potentially different number of times in Matlab. This is how we did it. But this is also a little about the process we went through to get where we finally did. Writing is a nonlinear activity. Writing code is no exception.

We’ve got a matrix P. In Matlab, this means a 2D array of numbers. It has widthP columns and the number of rows doesn’t matter. We need a second matrix M that contains each column of the first matrix, P, repeated an arbitrary number of times. A vector reps, containing widthP entries, tells you how many times any given column of P is replicated.

E.g. If P = [42 -1 101] and reps = [4 2 3], we want M = [42 42 42 42 -1 -1 101 101 101], assuming I typed that correctly.
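Just to pin down the specification, here is that example checked in NumPy, whose np.repeat does exactly this column replication when given a per-column count:

```python
import numpy as np

P = np.array([[42, -1, 101]])
reps = np.array([4, 2, 3])
# repeat column k of P reps[k] times, left to right
M = np.repeat(P, reps, axis=1)
# M is [[42, 42, 42, 42, -1, -1, 101, 101, 101]]
```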

We had a working implementation that profiling revealed to be a bottleneck. Here’s how we went about optimizing this.

First attempt

Create a standalone script which sets up some random data, and a correct implementation. (The one here is actually simpler than the technique we started out with.)

P = [42 -1 101];
reps = [2 1 5];

widthP = length(reps);
M = [];
for colIdx = 1 : length(reps)
  M = [M repmat(P(:, colIdx), 1, reps(colIdx))];
end

(As (unwilling) professional Matlab writers, we wouldn’t have written this to start out with, since (i) for loops are still best avoided in Matlab (“Instead of writing loop-based code, consider using MATLAB matrix and vector operations” (Mathworks)), and (ii) re-allocating storage every iteration will be slow in any system.)

Some more fancy

Find ways to do the same thing faster. Here’s the approach we actually started out with:

c = arrayfun(@(n, i) i*ones(n,1), reps, 1:widthP, 'un', 0);
i = cell2mat(c(:));
M2 = P(:, i);
assert(all(M2(:) == M(:)))

arrayfun (docs) is a lame map: it takes a function handle (viz., @(n, i) i*ones(n,1)) and any number of arrays, and returns a cell array—Matlab’s arbitrary-type container.

What we’re doing here is building a vector i, with repeated values, which we’ll use to index into our original array P. This does what we need because in Matlab, indexing an array with a vector containing repeats produces a result array containing repeats. Same story in Python/Numpy (0-indexed!):

import numpy as np
np.array([10, 20, 30])[[0,1,2,1,0,1,2]]
#=> array([10, 20, 30, 20, 10, 20, 30])
and Julia (1-indexed):
[10 20 30][[1 2 3 2 1 2 3]]
#=> 1x7 Array{Int64,2}:
#=>  10  20  30  20  10  20  30

We make sure all this fancy indexing works, in our Matlab code snippet, by comparing M2 to the M generated by the first implementation: assert(all(M2(:) == M(:))).

(assert is a common function across programming languages that unceremoniously throws an error and halts program execution if its argument is false.)
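For comparison, here is a rough NumPy transliteration of this arrayfun approach; np.repeat, which implements the whole column-replication operation directly, serves as the reference:

```python
import numpy as np

P = np.array([[42, -1, 101]])
reps = [2, 1, 5]

# arrayfun-analogue: one block of [k, k, ..., k] per column, concatenated
i = np.concatenate([np.full(n, k) for k, n in enumerate(reps)])
M2 = P[:, i]

# check against NumPy's built-in column replication
assert np.array_equal(M2, np.repeat(P, reps, axis=1))
```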

This way of building a big matrix out of a smaller one—using an index vector—is really fast. We knew that because profiling showed the bottleneck to be before the indexing into P. So we just needed to find a way to efficiently build the index vector i.

As mentioned above, this was actually the code we started out with. Why did we think we could do better? From the days of Tom Minka’s Lightspeed package [Microsoft Research], we’ve known that the built-in repmat and even ones can be inefficient (bsxfun FTW!). Also, empirical experience over the years tells me that cell2mat makes things unfast.

Inspiration for speed

I started sketching—you know, pen and paper. A few minutes later, I had something fast but unreadable. I’ll show it in a second, but here’s the central idea. Remember the example we started out with, reps = [4 2 3]. The index vector we want to make is:

i = [1 1 1 1 2 2 3 3 3] = cumsum([1 0 0 0 1 0 1 0 0])

where cumsum is the cumulative sum (a.k.a. prefix sum, scan, …). Staring at the vector being cumulatively summed (the second vector to the right), it dawned on me that it was non-zero only where a new group of indexes begins, i.e., at the boundary between the 1s and the 2s, and between the 2s and the 3s. If I could generate this boolean vector of zeros-and-ones efficiently—and betting that cumsum is fast—I’d have an efficient way to generate the index vector i. I called the boolean vector z (“zeros, mostly” as the mnemonic).

This next part is a little ugly. I could have logically and formally reasoned out how to calculate which indexes of a vector-of-zeros to flip on to obtain z. I couldn’t be bothered with all that, so I did something I often do—just came up with code that worked for this exact numerical example: “the rule must involve reps, another cumsum, hrm the dimensions won’t line up, just ignore the last element…” It turned out that the rule discovered in this way does work for the general case.

widthM = sum(reps);
% vector to be cumsummed
z = zeros(widthM, 1);
% indexes of z to be 1 (otherwise z is 0)
j = [1 1 + cumsum(reps(1 : end - 1))]; % 🌶 pixie dust, not much
z(j) = 1;
% as promised, cumsum z to get i:
i = cumsum(z);
% use i to index into P:
M3 = P(:, i);
% Works right?
assert(all(M3(:) == M(:)))
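For readers following along in Python, here is my NumPy rendering of the same trick; note the off-by-one adjustment since Python indexes from 0, and np.repeat again serves as the correctness check:

```python
import numpy as np

P = np.array([[42, -1, 101]])
reps = np.array([4, 2, 3])

widthM = reps.sum()
z = np.zeros(widthM, dtype=int)                  # vector to be cumsummed
j = np.concatenate(([0], np.cumsum(reps[:-1])))  # 0-indexed pixie dust
z[j] = 1                     # 1 where a new group of indexes begins
i = np.cumsum(z) - 1         # minus 1: Python is 0-indexed
M3 = P[:, i]
assert np.array_equal(M3, np.repeat(P, reps, axis=1))
```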

The code (minus the comments) follows the stream-of-consciousness narrative above because it was written during it. The pixie dust for finding the indexes of non-zero z (j above, because it’s an index like i) works, but it’s not easy to see why, or to explain. This isn’t academia so I can tell you—I got lucky. In this case, the stupid shot in the dark worked. Furthermore, I don’t have to impress you with a mountain of mathematical exposition “showing” you how or why this works—truth is, I don’t know (I mean, I can squint at it and see that it works, but that’s not “knowing”). Had this code not worked, I’d have looked at how it failed, patched up the implementation to avoid that failure mode, and tried again. I think this is how I usually code when someone hasn’t given me a formal algorithm to implement.

Actually, this code doesn’t work in the absolutely general case, when reps contains 0 (meaning, some columns of input P aren’t in output M). The code fails in an interesting way, but the fix isn’t that interesting: filter reps to be non-zero, run the above algorithm to generate i and shift it around to account for those 0-repeated indexes to get a corrected index vector i2. Here’s that full code:

nzidx = reps~=0;
nzreps = reps(nzidx);

% reps != 0 algorithm as above, except using nzreps instead of reps
widthM = sum(reps);
z = zeros(widthM, 1);
j = [1 1 + cumsum(nzreps(1 : end - 1))]; % 🌶 pixie dust, not much
z(j) = 1;
i = cumsum(z);

% adjust i
n = 1:numel(reps);
f = n(nzidx);
i2 = f(i);

M3 = P(:, i2);
assert(all(M3(:) == M(:)))
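Translated to NumPy, the zero-aware version might look like the following sketch; the example deliberately repeats one column zero times:

```python
import numpy as np

P = np.array([[42, -1, 101, 7]])
reps = np.array([2, 0, 3, 1])   # the -1 column appears zero times

nzidx = reps != 0
nzreps = reps[nzidx]

# cumsum trick as before, on the nonzero reps only
z = np.zeros(nzreps.sum(), dtype=int)
j = np.concatenate(([0], np.cumsum(nzreps[:-1])))
z[j] = 1
i = np.cumsum(z) - 1

# adjust i: map indexes-into-nzreps back to indexes-into-reps
f = np.arange(reps.size)[nzidx]
i2 = f[i]

M3 = P[:, i2]
assert np.array_equal(M3, np.repeat(P, reps, axis=1))
```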

Timings, or, Real Engineering

Now that we had a couple of alternative implementations that we hoped were fast, it was time to benchmark speed and test generality.

Matlab is an exceedingly clumsy language:

  1. anonymous functions are single expressions, and so cannot contain if, for, or multiple statements of any kind (same story in Python, but that language has other features that lessen the pain, without eliminating it);
  2. Matlab assignment and indexing have to be statements, so they cannot appear inside anonymous functions; and
  3. Matlab requires (sub)functions to be defined in a function file, not a script file. (JavaScript and Clojure are far less brutalist. Less brutal too.)
As a result of this constellation of inadequacies (to which we’ve long resigned ourselves), we had to convert our simple easy-to-understand script file to a more opaque and less flexible function file. You can read it in full at its [Gist]. The software carpentry aspect I want to point out is timing in Matlab: after checking that all three methods yield the correct answer for large, randomly generated inputs, we used timeit to run each implementation many times, and normalize the runtimes by the first implementation.

t1 = timeit(@() method1(reps, P));
t2 = timeit(@() method2(reps, P));
t3 = timeit(@() method3(reps, P));
relativeTimes = [t1 t2 t3] / t1

We found that, for P 10×1000, with each of the thousand columns repeated between 0 and 39 times, the cumsum-based method 3 was the clear winner.

Compared to the code we originally started out with (method 2 here), the final code runs 14× faster. Now, many other parts of the application dominate runtime, but not this part.

In summary, the three steps we followed, and that I’ve done countless times, are

  1. profile to see what’s slow, then
  2. carve out that piece into its own file, with some test data and a reference implementation.
  3. Then, in parallel:
    1. think very hard and find ways to make it go faster while remaining correct, and
    2. time the code.
Think-and-profile may need to be run for several iterations.


The above discussion presumes that you are lucky enough to identify a juicy candidate for optimization. In Matlab, with mostly-math code, this is often the case in my experience. In larger applications that do many things besides number-crunching, the inefficiencies may be spread throughout the codebase. A reasonable statement of these two schools of thought can be found in one of Karsten Schmidt’s workshop reports [Medium].

That said, perhaps the best restatement of what we found is Andrei Alexandrescu’s admonition that, in performance work, intuition is no substitute for measurement.

We experienced measurements disconfirming intuition first-hand above, when method 2 (arrayfun, cell2mat), which we started out with, turned out to be slower than method 1 (the for loop, with array reallocation every iteration)!

Post-postscript. timeit was submitted to [Mathworks File Exchange] in 2008 and made it into Matlab proper in 2013. Python 2.3, from 2003, had this built-in.
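For what it’s worth, the Python built-in looks like this. The two implementations being timed are my own illustrative stand-ins (a reallocate-every-iteration loop versus np.repeat), not the methods from the Gist:

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
P = rng.standard_normal((10, 1000))
reps = rng.integers(0, 40, size=1000)   # each column repeated 0..39 times

def method1(P, reps):
    # naive: grow M by reallocating every iteration
    M = np.empty((P.shape[0], 0))
    for k, n in enumerate(reps):
        M = np.hstack([M, np.tile(P[:, [k]], (1, n))])
    return M

def method2(P, reps):
    return np.repeat(P, reps, axis=1)

assert np.array_equal(method1(P, reps), method2(P, reps))

t1 = timeit.timeit(lambda: method1(P, reps), number=5)
t2 = timeit.timeit(lambda: method2(P, reps), number=5)
relativeTimes = [t1 / t1, t2 / t1]   # normalized, like the Matlab version
```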