Matlab: replicate columns of a matrix an arbitrary number of times
A code story by Ahmed Fasih 2016/5/30
We needed to replicate each column of an array a potentially different number of times in Matlab. This is how we did it. But it's also a little about the process we went through to get where we finally did. Writing is a nonlinear activity; writing code is no exception.
We’ve got a matrix P. In Matlab, this means a 2D array of numbers. It has widthP columns; the number of rows doesn’t matter. We need a second matrix M that contains each column of the first matrix, P, repeated an arbitrary number of times. A vector reps, containing widthP entries, tells you how many times any given column of P is replicated.
E.g., if P = [42 1 101] and reps = [4 2 3], we want M = [42 42 42 42 1 1 101 101 101], assuming I typed that correctly.
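For reference, NumPy ships this exact operation as np.repeat; the sketch below (in Python, not the Matlab this post is about) shows the target behavior on the example above:

```python
import numpy as np

P = np.array([42, 1, 101])
reps = [4, 2, 3]

# np.repeat replicates each entry of P reps[k] times, in order
M = np.repeat(P, reps)
#=> array([42, 42, 42, 42, 1, 1, 101, 101, 101])
```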
We had a working implementation that profiling revealed to be a bottleneck. Here’s how we went about optimizing this.
First attempt
Create a standalone script which sets up some random data, and a correct implementation. (The one here is actually simpler than the technique we started out with.)
P = [42 1 101];
reps = [2 1 5];
widthP = length(reps);
M = [];
for colIdx = 1 : length(reps)
M = [M repmat(P(:, colIdx), 1, reps(colIdx))];
end
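A rough Python port of that reference loop, for comparison (a sketch: it grows a list and concatenates once at the end, whereas the Matlab original literally reallocates M every iteration):

```python
import numpy as np

P = np.array([[42, 1, 101]])  # one row; any number of rows works
reps = [2, 1, 5]

cols = []
for colIdx in range(len(reps)):
    # mirror repmat(P(:, colIdx), 1, reps(colIdx))
    cols.append(np.tile(P[:, colIdx:colIdx + 1], (1, reps[colIdx])))
M = np.hstack(cols)
#=> array([[42, 42, 1, 101, 101, 101, 101, 101]])
```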
(As (unwilling) professional Matlab writers, we wouldn’t have written this to start out with, since (i) for loops are still best avoided in Matlab (“Instead of writing loop-based code, consider using MATLAB matrix and vector operations” (Mathworks)), and (ii) reallocating storage every iteration will be slow in any system.)
Some more fancy
Find ways to do the same thing faster. Here’s the approach we actually started out with:
c = arrayfun(@(n, i) i*ones(n,1), reps, 1:widthP, 'un', 0);
i = cell2mat(c(:));
M2 = P(:, i);
assert(all(M2(:) == M(:)))
arrayfun (docs) is a lame map: it takes a function handle (viz., @(n, i) i*ones(n,1)) and any number of arrays, and returns a cell array (Matlab’s arbitrary-type arrays).
What we’re doing here is building a vector i
, with repeated values, which we’ll use to index into our original array P
. This does what we need because in Matlab, indexing an array with a vector containing repeats produces a result array containing repeats. Same story in Python/NumPy (0-indexed!):
import numpy as np
np.array([10, 20, 30])[[0,1,2,1,0,1,2]]
#=> array([10, 20, 30, 20, 10, 20, 30])
and Julia (1-indexed):
[10 20 30][[1 2 3 2 1 2 3]]
#=> 1x7 Array{Int64,2}:
#=> 10 20 30 20 10 20 30
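In these terms, the whole of method 2 transliterates to NumPy as follows (a hypothetical sketch: np.full and np.concatenate stand in for arrayfun and cell2mat):

```python
import numpy as np

P = np.array([[42, 1, 101]])
reps = [4, 2, 3]

# one block of repeated column indexes per reps entry, then one flat vector
i = np.concatenate([np.full(n, k, dtype=int) for k, n in enumerate(reps)])
#=> i is [0 0 0 0 1 1 2 2 2]
M2 = P[:, i]
#=> array([[42, 42, 42, 42, 1, 1, 101, 101, 101]])
```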
We make sure all this fancy indexing works, in our Matlab code snippet, by comparing M2 to the M generated by the first implementation: assert(all(M2(:) == M(:))).
(assert is a common function across programming languages that unceremoniously throws an error and halts program execution if its argument is false.)
This way of building a big matrix out of a smaller matrix, using an index vector, is a really fast way to build that big matrix. We knew that because profiling showed the bottleneck to be before indexing into P. So we just need to find a way to efficiently build this index vector i.
As mentioned above, this was actually the code we started out with. Why think we could do better than this? From the days of Tom Minka’s Lightspeed package [Microsoft Research], we’ve known that the builtin repmat and even ones can be inefficient (bsxfun FTW!). Also, empirical experience over the years tells me that cell2mat makes things unfast.
Inspiration for speed
I started sketching (you know, pen and paper). A few minutes later, I had something fast but unreadable. I’ll show it in a second, but here’s the central idea. Remember the example we started out with, reps = [4 2 3]. The index vector we want to make is:
i = [1 1 1 1 2 2 3 3 3]
  = cumsum([1 0 0 0 1 0 1 0 0])
cumsum is the cumulative sum (a.k.a. prefix sum, scan, …). Staring at the vector being cumulatively summed (the second vector above), it dawned on me that it was nonzero only where a new group of indexes begins, i.e., at the very start, at the boundary between 1 and 2, and between 2 and 3. If I could generate this boolean vector of zeros-and-ones efficiently, and betting that cumsum is fast, I could have an efficient way to generate the index vector i. I called the boolean vector z (“zeros, mostly” as the mnemonic).
This next part is a little ugly. I could have logically and formally reasoned out how to calculate which indexes of a vector-of-zeros to flip on to obtain z. I couldn’t be bothered with all that, so I did something I often do: just came up with code that worked for this exact numerical example: “the rule must involve reps, another cumsum, hrm, the dimensions won’t line up, just ignore the last element…” It turned out that the rule discovered in this way does work for the general case.
widthM = sum(reps);
% vector to be cumsummed
z = zeros(widthM, 1);
% indexes of z to be 1 (otherwise z is 0)
j = [1 1 + cumsum(reps(1 : end - 1))]; % 🌶 pixie dust, not much
z(j) = 1;
% as promised, cumsum z to get i:
i = cumsum(z);
% use i to index into P:
M3 = P(:, i);
% Works right?
assert(all(M3(:) == M(:)))
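The cumsum trick carries over to NumPy almost verbatim (a sketch: 0-indexed, so the index vector starts at 0 and the leading 1 in j disappears); like the Matlab version above, it assumes every entry of reps is nonzero:

```python
import numpy as np

P = np.array([[42, 1, 101]])
reps = np.array([4, 2, 3])

widthM = reps.sum()
z = np.zeros(widthM, dtype=int)
# new groups start at the running sums of reps (0-indexed pixie dust)
z[np.cumsum(reps[:-1])] = 1
i = np.cumsum(z)   #=> [0 0 0 0 1 1 2 2 2]
M3 = P[:, i]
#=> array([[42, 42, 42, 42, 1, 1, 101, 101, 101]])
```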
The code (minus the comments) follows the stream-of-consciousness narrative above because it was written during it. The pixie dust in finding the indexes of nonzero z (j above, because it’s an index like i): it works, but it’s not easy to see or to explain. This isn’t academia, so I can tell you: I got lucky. In this case, the stupid shot in the dark worked. Furthermore, I don’t have to impress you with a mountain of mathematical exposition “showing” you how or why this works; truth is, I don’t know (I mean, I can squint at it and see that it works, but that’s not “knowing”). Had this code not worked, I’d have looked at how it fails, patched up the implementation to avoid that failure mode, and tried again. I think this is how I usually code when someone hasn’t given me a formal algorithm to implement.
Actually, this code doesn’t work in the absolutely general case, when reps contains 0 (meaning some columns of input P aren’t in output M). The code fails in an interesting way, but the fix isn’t that interesting: filter reps to be nonzero, run the above algorithm to generate i, and shift it around to account for those 0-repeated indexes to get a corrected index vector i2. Here’s that full code:
nzidx = reps~=0;
nzreps = reps(nzidx);
% reps != 0 algorithm as above, except using nzreps instead of reps
widthM = sum(reps);
z = zeros(widthM, 1);
j = [1 1 + cumsum(nzreps(1 : end - 1))]; % 🌶 pixie dust, not much
z(j) = 1;
i = cumsum(z);
% adjust i
n = 1:numel(reps);
f = n(nzidx);
i2 = f(i);
M3 = P(:, i2);
assert(all(M3(:) == M(:)))
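The same fix in NumPy terms (a sketch: filter out the zero entries, run the trick, then map the result back through the surviving column numbers):

```python
import numpy as np

P = np.array([[42, 1, 101, 7]])
reps = np.array([2, 0, 3, 1])  # second column never appears in the output

nzidx = reps != 0
nzreps = reps[nzidx]
z = np.zeros(reps.sum(), dtype=int)
z[np.cumsum(nzreps[:-1])] = 1
i = np.cumsum(z)                  # indexes into the *filtered* columns
f = np.arange(P.shape[1])[nzidx]  # filtered position -> original column
M3 = P[:, f[i]]
#=> array([[42, 42, 101, 101, 101, 7]])
```

(np.repeat(P, reps, axis=1) handles zero counts natively, which makes a handy cross-check.)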
Timings, or, Real Engineering
Now that we had a couple of alternative implementations that we hoped were fast, it was time to benchmark speed and test generality.
Matlab is an exceedingly clumsy language:
- First, anonymous functions are single expressions, and so cannot contain if, for, or multiple statements of any kind (same story in Python, but Python has other features that lessen the pain, without eliminating it).
- Second, Matlab assignment and indexing have to be statements, so they cannot appear inside anonymous functions.
- Finally, Matlab requires (sub)functions to be defined in a function file, not a script file. (JavaScript and Clojure are far less brutalist. Less brutal too.)
So we wrote each implementation as a subfunction in a function file and used timeit to run each implementation many times, normalizing the runtimes by the first implementation’s.
t1 = timeit(@() method1(reps, P));
t2 = timeit(@() method2(reps, P));
t3 = timeit(@() method3(reps, P));
relativeTimes = [t1 t2 t3] / t1
We found that, for P 10×1000, and each of the thousand columns repeated between 0 and 39 times:
- method 2 (arrayfun, cell2mat) runtime is 1.4× that of method 1 (slower!), and
- method 3 (pixie dust, cumsum) runtime is 0.08× that of method 1 (much faster!).
Compared to the code we originally started out with (method 2 here), the final code runs roughly 17× faster (1.4 / 0.08 ≈ 17). Now, many other parts of the application dominate runtime, but not this part.
In summary, the three steps we followed, and that I’ve done countless times, are:
- profile to see what’s slow, then
- carve out that piece into its own file, with some test data and a reference implementation.
Then, in parallel:
- think very hard and find ways to make it go faster while remaining correct, and
- time the code.
Postscripts
The above discussion presumes that you are lucky enough to identify a juicy candidate for optimization. In Matlab, with mostly-math code, this is often the case in my experience. In larger applications that do many things besides number-crunching, the inefficiencies may be spread throughout the codebase. A reasonable statement of the two schools of thought is in one of Karsten Schmidt’s workshop reports [Medium].
That said, perhaps the best restatement of what we found is Dr Alexandrescu’s point that measurement disconfirms intuition. We experienced that firsthand above, when method 2 (arrayfun, cell2mat), which we started out with, was slower than method 1 (a for loop, with array reallocation every iteration)!
Post-postscript. timeit was submitted to the [Mathworks File Exchange] in 2008 and made it into Matlab proper in 2013. Python 2.3, from 2003, had this built in.
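For comparison, here is that Python builtin in action, timing the NumPy one-liner on data shaped like our benchmark (a sketch; the 10×1000 shape and 0–39 repeat counts are taken from the text above):

```python
import timeit

setup = (
    "import numpy as np;"
    "rng = np.random.default_rng(0);"
    "P = rng.random((10, 1000));"
    "reps = rng.integers(0, 40, 1000)"
)
# total seconds for 100 calls of np.repeat on the 10x1000 matrix
t = timeit.timeit("np.repeat(P, reps, axis=1)", setup=setup, number=100)
print(f"{t:.4f} s for 100 runs")
```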