2016-10-29 Solving the Pairs of Percentages Puzzle

Here’s a math puzzle inspired by the election. I noticed a while back that the two candidates were at 81.5% and 18.5%, respectively. The numbers 185 and 815 are anagrams, and add to 1000 ($10^3$). How many such pairs of positive integers add to $10^3$? I’ve been doing a bit of Haskell recently, so that’s my language of choice.

Exploring the space

The first step is to define a function which takes an integer pair and decides if the pair is valid (that is, if the two integers are anagrams or not.)

import Data.List

valid     :: Int -> Int -> Bool
valid a b = list a == list b
    where list = sort . filter (/=0) . digits
          digits n = map (\x -> read [x] :: Int) (show n)

Let’s use this valid function to check a range of numbers (a,b) where $a+b=1000$, and valid a b == True.

import Data.List
λ> [(a,b) | a <- [1..500], let b = 1000 - a, valid a b]
[(95,905),(185,815),(275,725),(365,635),(455,545),(500,500)]

This is pretty cool. Let’s fit this into a function $f(n)$, where

  • $f(n)$ is the power $10^n$ we’re matching against,
  • pairs n returns the list of potential pairs,
  • validPairs n = filter(\(a,b) -> valid a b) $ pairs n, and
  • f n = length $ validPairs n, which is what we’re ultimately trying to find an expression for.
pairs     :: Int -> [(Int, Int)]
pairs pow = [(a, (10^pow)-a) | a <- candidates]
    where candidates pow = [0,5..(div (10^pow) 2)]

validPairs = filter(\(a,b) -> valid a b) . pairs

f = length . validPairs

If we run validPairs 4, we see a cool pattern.

λ> validPairs 4
[(995,99005), (1895,98105), (1985,98015), ...  (49055,50945), (49505,50495), (50000,50000)]
λ> f 4
6
λ> f 5
141
λ> f 6
...

Before we go further, a few notes on this: for performance reasons, we only compute pairs (a,b), not (b,a). Because the last pair computed is always something like (50,50), (500,500), etc., the total number of pairs is actually $2f(n) - 1$. We’ll keep this in mind for later.

Here’s what we have so far:

n 1 2 3 4 5 6
f(n) 1 1 6 6 141 141
2*f(n)-1 1 1 11 11 281 281

Computing $f(6)$ took a really long time, so let’s eventually wrap this in a .hs file, compile, and execute it with timing.

But before we go further, let’s check the Online Encyclopedia of Integer Sequences to see if any of the sequences (1,1,6,6,141,141), (1,6,141), or (1,11,281) appear at all. It turns out (1,6,141) are the first few digits of A241015, which is the

Number of pairs of endofunctions $f$, $g$ on $[n]$ satisfying $g(g(g(f(i)))) = f(i)$ for all $i$ in $[n]$.

…whatever that means.

If our pattern matches theirs, we might expect the next number in the sequence to be is 6184. We’ll find out in a bit, but first let’s bring runtime down a bit.

Profiling

OK, it’s time to wrap this in a file, compile, and execute it with timing.

--puzzle.hs <num> (L | _)
import System.Environment
import Data.List
import Data.Char

valid     :: Int -> Int -> Bool
valid a b = list a == list b
    where list = sort . filter (/=0) . digits
          digits n = map (\x -> read [x] :: Int) (show n)

pairs     :: Int -> [(Int, Int)]
pairs pow = [(a, (10^pow)-a) | a <- candidates pow]
    where candidates pow = [0,5..(div (10^pow) 2)]

validPairs = filter(\(a,b) -> valid a b) . pairs

f = length . validPairs

main :: IO ()
main = do
    [num, action] <- getArgs
    let n = digitToInt (head num)
    let shouldList = (head action == 'L')
    if shouldList
        then print $ validPairs n
        else print $ f n

Just for kicks, ./puzzle takes two arguments; the first is the number $n$, and the second is either L, to print the full list of matches, or _, to print the value $f(n)$.

We have a few ways to run this. One is to open up a ghci shell, import the file, and run validPairs <num> or f <num> right there in the shell:

with ghci

$ ghci
GHCi, version 7.10.3: http://www.haskell.org/ghc/  :? for help
λ> :l puzzle
[1 of 1] Compiling Main             ( puzzle.hs, interpreted )
Ok, modules loaded: Main.
λ> validPairs 3
[(95,905),(185,815),(275,725),(365,635),(455,545),(500,500)]
λ> f 3
6

This is great for testing, but we don’t get too many stats. Let’s use a real-time thing like runhaskell.

with runhaskell

$ runhaskell puzzle.hs 3 L
[(95,905),(185,815),(275,725),(365,635),(455,545),(500,500)]
$ runhaskell puzzle.hs 3 _
6

To add stats, we use the +RTS runtime system flag with the -s summary flag.

$ runhaskell puzzle.hs 3 _ +RTS -s
6
			 101,696 bytes allocated in the heap
				 3,464 bytes copied during GC
				68,912 bytes maximum residency (1 sample(s))
				13,008 bytes maximum slop
						 1 MB total memory in use (0 MB lost due to fragmentation)

																	 Tot time (elapsed)  Avg pause  Max pause
Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
Gen  1         1 colls,     0 par    0.000s   0.000s     0.0001s    0.0001s

INIT    time    0.000s  (  0.000s elapsed)
MUT     time    0.001s  (  0.189s elapsed)
GC      time    0.000s  (  0.000s elapsed)
EXIT    time    0.000s  (  0.000s elapsed)
Total   time    0.006s  (  0.189s elapsed)

%GC     time       2.5%  (0.1% elapsed)

Alloc rate    178,562,837 bytes per MUT second

Productivity  96.1% of total user, 2.9% of total elapsed

Very cool. As a final step, we’ll properly compile and optimize it with ghc.

with ghc

$ ghc -Odph puzzle.hs -rtsopts
[1 of 1] Compiling Main             ( puzzle.hs, puzzle.o )
Linking puzzle ...
$ ./puzzle 3 _
6

We can run this with +RTS -s too. $ ./puzzle 3 _ +RTS -s generates the following table for f(1) through f(7):

n 1 2 3 4 5 6 7
f(n) 1 1 6 6 141 141 5591
time 0.001s 0.001s 0.003s 0.017s 0.190s 2.301s 26.711s

Let’s see if this matches our sequence from earlier. We now have the next value, putting our sequence at (1,6,141,5591). This is not in the OEIS.. So we’ll have to find an expression ourselves, and check with the scripts.

Unfortunately, our timing statisics are not optimistic. At this rate, $f(8)$ will take around five minutes to calculate. It would be a good idea to parallelize.

Parallelization

We can use the Control.Parallel library, (see the docs for more) which gives us the handy par and pseq functions, which force the compiler to split evaluations into separate threads and then wait to recombine. Much of the code has to be modified slightly. I run this on a four-core processor, so I rewrote this to split up the list of candidates into four roughly equal parts, evaluate the validity of each element in each of them, compute the length of that valid sublist, and recombine afterwards.

We’ll call this new file puzzle-parallel.hs. You’ll notice we drop support for printing the full list – constructing the sum was easier to parallelize.

-- ghc -O2 --make puzzle-parallel.hs -threaded -rtsopts
-- ./puzzle-parallel <n> +RTS -N4 -s
--
import Control.Parallel
import Data.List.Split
... -- all the same imports as before

valid     :: Int -> Int -> Bool -- as before

pairs          :: Int -> Int -> [(Int, Int)]
pairs pow core = [(x, (10^pow)-x) | x <- candidates]
    where candidates = [a+5,a+10..b] -- not as before!
          a = (core - 1) * (div (10^pow) 8)
          b = (core - 0) * (div (10^pow) 8)

-- validPairs can no longer be point-free, since it takes
-- two arguments: pow and core. There's probably a nice currying fix, though.
validPairs pow core = filter(\(a,b) -> valid a b) $ pairs pow core

f n = s1 `par` s2 `par` s3 `par` s4 `pseq` (s1 + s2 + s3 + s4)
    where
        s1 = length $ validPairs n 1
        s2 = length $ validPairs n 2
        s3 = length $ validPairs n 3
        s4 = length $ validPairs n 4

main :: IO ()
main = do
  [num] <- getArgs
	-- we drop support for `./puzzle <num> L`, since computing these
	-- large lists, sorting, and printing them is sort of out of
	-- scope for the moment.
  let n = digitToInt (head num)
  print $ f n

The file puzzle-parallel.hs can now be compiled to be multithreaded with ghc:

$ ghc -O2 --make puzzle-parallel.hs -threaded -rtsopts

and executed on four cores with the +RTS flag -N4 (or -N2, or however many cores you have. The above code is written for four cores, but it’s not hard to modify for any other number of cores.)

$ ./puzzle-parallel <n>             #unparallelized
$ ./puzzle-parallel <n> +RTS -N4    #force 4 cores
$ ./puzzle-parallel <n> +RTS -N4 -s #with stats
n 1 2 3 4 5 6 7 8
f(n) 1 1 6 6 141 141 5591 5591
time w/o parallelization 0.001s 0.001s 0.003s 0.017s 0.190s 2.301s 26.711s 501.863s
time w/ parallelization 0.001s 0.001s 0.003s 0.021s 0.133s 1.567s 18.489s 208.285s

We see around a 2x speedup from this – we expect a 4x from threading, but there is some loss to overhead. This speedup helps more the higher $n$ grows. This is better performance across the board. Now calculating $f(9)$ isn’t quite as impossible.

Expression

Rather than delve deeper into optimization, let’s focus on trying to find a closed-form expression for this sequence. We do this first by visual inspection of the types of pairs generated.

$ ./puzzle 1 L
[(5,5)]

$ ./puzzle 2 L
[(50,50)]

$ ./puzzle 3 L
[(95,905),(185,815),(275,725),(365,635),(455,545),(500,500)]

$ ./puzzle 4 L
[(950,9050),(1850,8150),(2750,7250),(3650,6350),(4550,5450),(5000,5000)]

$ ./puzzle 5 L
[(995,99005),(1895,98105),(1985,98015),(2795,97205),(2975,97025),(3695,96305),
(3965,96035),(4595,95405),(4955,95045),(5495,94505),(5945,94055),(6395,93605),
(6935,93065),(7295,92705),(7925,92075),(8195,91805),(8915,91085),(9095,90905),
...
(45815,54185),(45905,54095),(46355,53645),(46535,53465),(47255,52745),
(47525,52475),(48155,51845),(48515,51485),(49055,50945),(49505,50495),
(50000,50000)]

Let’s introduce some nomenclature here to turn this from a programming puzzle into a math puzzle. For some $n$, we define $f(n)$ to be the number of possible pairs, and $S(n)$ to be the set of pairs itself. Thus, f(n) = length S(n).

Some observations:

  • For all even values $n$, we can see S(n) = [10*i for i in S(n-1)]; that is, every element of S(even n) is an element of S(odd n) multiplied by ten.
  • Thus, $f(n) = f(n-1)$ for even n.

Mapping to a simpler domain

We can see from the examples above that for sets generated by, say, $n=3$ and $n=4$, only the first two digits really fluctuate. We suspect the problem reduces to a combinatorics problem focused on strings of that length, which we call $m$. So, let’s map the domain of possible $n$ values to a simpler domain:

  m(n)
  ^
6 |           o o
5 |
4 |       o o
3 |
2 |   o o
1 +-+-+-+-+-+-+-+--> n
  1 2 3 4 5 6 7 8

We’ll switch to Python here for readability, and also because our algorithm will be imperative in a moment:

def M(n):
  return ((n-1)//2)*2

This alongside future memoizations will help us avoid having to calculate $f(n)$ when $f(n-1)$ is known.

The 9-pair

We first define the concept of a 9-pair. A 9-pair is a set of two positive numbers which add to 9. 9-pairs are: (0,9), (1,8), (2,7), (3,6), and (4,5).

Numbers and 9-pairs

Here is an example pair from $S(n=5)$: (48155,51845). We can reduce this to its tuple of interest: (4815, 5184). It seems (1,8) and (4,5) are the 9-pairs which compose the alphabet from which this tuple was generated.

Since 1 and 8 are a 9-pair, if 1 appears in the first number of the tuple, then we know for sure that:

  • 1 appears in the second item of the tuple,
  • 8 appears in the first item of the tuple,
  • 8 appears in the second item of the tuple.

Generalization: elements and element pairs

Let’s generalize this away from digits and number and into elements, sets, and permutations.

We will call (a,b) a tuple, a and b strings in that tuple, and the characters in the strings a and b characters.

For a given string a or b, any character in a or b uniquely defines its counterpart within an element-pair (i,j): if i in a, then j in b. Since the strings a and b are by definition anagrams, this means j in a as well. Thus a is a unique ordering of pairs of characters, selected from a known alphabet of character pairs.

Combinatorics on that domain

We can reduce this problem to finding the number of unique orderings of a list composed of selecting $x$ element-pairs from a list of $p$ element-pairs. We call this function $G(x,p)$. Because our real problem lies on the domain of digits, there will only eve be five element-pairs (the five 9-pairs), so $p=5$, always.

This is simple (if inefficient) in Python:

def G(x,p):
	result = 0
	# since p=5 always, this domain is always ['1','2','3','4','5']
	domain = (str(i) for i in range(p))
	itr = itertools.combinations_with_replacement(domain,x)
	for i0 in itr:
		# turns ['1','2'] into ['1','1~','2','2~'] for proper element-pairs
		alphabet = list(i0) + list( map((lambda x: str(x)+"~"), i0) )
		#the number of unique permutations in that alphabet of length 2x
		result += len(set(list(itertools.permutations(alphabet,2*x))))
	return result

To get our desired $f(n)$, we can just write:

def f(n):
	# the same as M(n) above
	m = ((n-1)//2)*2
	# the answer is recursive -- G(x,5) alone will only return pairs of length x.
	# we must also consider pairs of length y<x, right-padded by zeroes.
	def inner_f(n):
		return (G(n,5) + inner_f(n-1)) if (n > 0) else 0
	# divided by two because (185, 815), (815, 185) are the same pair.
	# plus one because (500, 500) won't be found by G5(x)
	return inner_f(m)/2 + 1

Unmemoized timing t0

We can run this with timing:

for n in range(1,12):
	start = time.time()
	localF = f(n)
	end = time.time()
	print "f({0})\t{1}\t{2}".format(n, localF, end-start)
n 1 2 3 4 5 6 7 8 9 10
f(n) 1 1 6 6 141 141 5591 5591 281566 281566
t0 5.9e-6s 1.3e-5s 4.2e-5s 2.2e-5s 1.5e-4s 1.4e-4s 4.3e-3s 4.0e-3s 0.94s 0.94s

Memoization t1

We can improve performance a lot through memoization. Functions like $G(n,5)$ get called repeatedly for the same value of $n$, for example.

G5memo = {}
def G5(x):
    if x not in G5memo:
        G5memo[x] = G(x,5)
    return G5memo[x]
n 1 2 3 4 5 6 7 8 9 10
t1 9.1e-6s 3.1e-6s 5.5e-5s 3.1e-6s 1.6e-4s 2.9e-6s 5.5e-3s 4.1e-6s 0.91s 1.6e-5s

The most obvious boost is that $f(n)$ where $n$ is even now takes almost no time to compute, since $f(n) = f(n-1)$ for even $n$.

Memoization t2

We can memoize further. The alphabet of characters fed into len(list(set(itr(alphabet)))) is usually something like ['a','A','a','A'] or ['b','B','b','B']. However, it seems obvious to us that the number of unique pairs from the first of these two alphabets will be the same as that of the second. It would be useful to create a non-unique hash of some kind, representing the pair distribution within an alphabet, and check whether len(list(set(itr(alphabet)))) for an equivalent alphabet has already been calculated.

# takes an alphabet like ['1', '1', '2', '3', '1~', '1~', '2~', '3~'] and returns
#                         \     /    |    |
#                          \   /     |    |
#                           (2)     (1)  (1) --> returns [1,1,2] as a list-hash
def makeListHash(alphabet):
    d = {}
    for i in alphabet[0:len(alphabet)/2]:
        try:
            d[i] += 1 #if i in d
        except:
            d[i] = 1
    return list(str(x) for x in sorted(d.values()))

# just a wrapper for makeListHash which returns a string ("1,2,2") instead.
def makeStringHash(alphabet):
    return ",".join(makeListHash(alphabet))

This lets us rewrite $G(x,p)$ as:

def G(x,p):
    result = 0
    alphabet = (str(i) for i in range(p))
    itr = itertools.combinations_with_replacement(alphabet,x)
    lMemo = {} #inline since memoization is only useful across a single value of x
    for i0 in itr:
        alphabet = list(i0) + list( map((lambda x: str(x)+"~"), i0) )
        key = makeStringHash(alphabet)
        if not key in lMemo:
            lMemo[key] = len(set(list(itertools.permutations(alphabet,2*x))))
        result += lMemo[key]
    return result

This memoization brings down computation time even further, and lets us calculate $f(11)=f(12)$ for the first time ever.

$n$ 1 2 3 4 5 6 7 8 9 10 11
$f(n)$ 1 1 6 6 141 141 5591 5591 281566 281566 16397596
t2 6.9e-6 3.1e-6 6.8e-5 1.3e-5 1.4e-4 3.1e-6 6.9e-4 5.0e-6 6.8e-2 7.2e-6 10.5

Conclusions

That’s all! I wasn’t able to further optimize the problem, and I’m not convinced a closed-form solution necessarily exists. That said, we found the number of anagram pairs which sum to $100,000,000,000$, and that’s pretty cool. I also learned a lot about parallelization in Haskell and using the built-in optimization flags in ghc, and a bit of memoization in Python.

Update (11/11/2016)

It’s been around a week since last I touched this puzzle. My buddy Josh bounced the problem to his boss Piotr, who came up with this phenomenal closed-form solution for $G(x,p=5)$. It’s brilliant and involves reducing the combinatorics problem, searching for and finding the elusive closed-form solution in the literature, and writing a small haskell script to implement it. In short:

The combinatorics section above describes a

sequence $a_n^N := \Sigma \left( \dfrac{n!}{p_1! p_2! … p_N!}\right)^2$, where the sum runs over sets of non-negative integers $p_1,..p_N$ summing to $n$.

This sequence can be found by a recurrence relation, and was performed in Sums of squares of binomial coefficients, with applications to Picard-Fuchs equations by H. A. Verrill in 2008.

Table 1 in the above paper describes recurrence relations for $a_n^N$, and we care about the case where $N=5$: the fourth equation is the object of our desire. It reads:

Piotr implemented this as so: (I’ve changed it the tiniest bit to better perform testing on it.)

import System.Environment
import Data.List
import Data.Char

next :: Integer -> Integer -> Integer -> Integer -> Integer
next n prev1 prev2 prev3 =
		((35*n*n*n*n - 70*n*n*n + 63*n*n - 28*n + 5) * prev1 -
		 ((n-1)*(n-1)*(259*(n-1)*(n-1) + 26)) * prev2 +
		 15*15*(n-1)*(n-1)*(n-2)*(n-2) * prev3)
		`div` (n*n*n*n)

vals :: [Integer]
vals = 1 : 5 : 45 : (zipWith4 next [3..] (drop 2 vals) (drop 1 vals) vals)

choose :: Integer -> Integer -> Integer
choose n 0 = 1
choose 0 k = 0
choose n k = choose (n-1) (k-1) * n `div` k

finalvals = zipWith (\v n -> v * (choose (n*2) n)) vals [0..]

main :: IO ()
	main = do
	[num] <- getArgs
  let n = read num :: Int
  print $ (finalvals !! n)

and called with runhaskell math.hs <num> +RTS -s (assuming you save it as math.hs.)

As expected, this has a phenomenal runtime.

$n$ 1 10 1e2 1e3 1e4 2e4 3e4 4e4 5e4
runtime 0.183s 0.183s 0.180s 0.198s 0.542s 1.232s 2.032s 3.203s 4.643s

I think that’s about as far as this problem goes. Thanks to Josh and Piotr for doing the actual difficult mathematics.