Jekyll2021-01-11T17:07:41+00:00https://www.kuniga.me/feed.xmlNP-IncompletenessKunigami's Technical BlogGuilherme KunigamiMax Area Under a Histogram2021-01-09T00:00:00+00:002021-01-09T00:00:00+00:00https://www.kuniga.me/blog/2021/01/09/max-area-under-histogram<!-- This needs to be defined as included html because variables are not inherited by Jekyll pages -->
<p>This is a classic programming puzzle. Suppose we’re given an array of $n$ values representing the height $h_i \ge 0$ of bars in a histogram, for $i = 0, \cdots, n-1$. We want to find the largest rectangle that fits “inside” that histogram.</p>
<p>More formally, we’d like to find indices $l$ and $r$, $l \le r$, such that $(r - l + 1) \bar h$ is maximum, where $\bar h$ is the smallest $h_i$ for $i \in \curly{l, \cdots, r}$.</p>
<p>In this post we’ll describe an $O(n)$ algorithm that solves this problem.</p>
<!--more-->
<h2 id="simple-on2-solution">Simple $O(n^2)$ Solution</h2>
<p>The first observation we make is that any optimal rectangle must “touch” the top of one of the bars. If that wasn’t true, we could extend it a bit further and get a bigger rectangle.</p>
<p>This means we can consider each bar $i$ in turn and check the largest rectangle that “touches” the top of that bar, that is, has height $h_i$. We start with the rectangle as the bar itself, then we expand the width towards the left and right. How far can we go?</p>
<p>It’s easy to visualize that we can keep expanding until we find a bar whose height is less than $h_i$. This gives us an $O(n^2)$ algorithm: for each $i$, find the closest $l < i$ whose height is less than $h_i$ and the closest $r > i$ whose height is less than $h_i$.</p>
<figure class="center_children">
<img src="https://www.kuniga.me/resources/blog/2021-01-09-max-area-under-histogram/example_1.png" alt="a histogram with one bar highlighted and its left and right boundaries marked" />
<figcaption>Figure 1: Finding the left and right boundaries of the highlighted bar</figcaption>
</figure>
<p>We can assume that the first and last elements of the height array are sentinels with height -1, so we don’t have to worry about corner cases since there are always such $l$ and $r$.</p>
<p>The maximum area will be $(r - l - 1)*h_i$. For illustration purposes we provide the Python code:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">get_max_area_hist</span><span class="p">(</span><span class="n">h</span><span class="p">):</span>
<span class="n">max_a</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">h</span> <span class="o">=</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="n">h</span> <span class="o">+</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="c1"># sentinels
</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">h</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span><span class="p">):</span>
<span class="n">l</span> <span class="o">=</span> <span class="n">i</span> <span class="o">-</span> <span class="mi">1</span>
<span class="k">while</span> <span class="n">h</span><span class="p">[</span><span class="n">l</span><span class="p">]</span> <span class="o">>=</span> <span class="n">h</span><span class="p">[</span><span class="n">i</span><span class="p">]:</span>
<span class="n">l</span> <span class="o">-=</span> <span class="mi">1</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">i</span> <span class="o">+</span> <span class="mi">1</span>
<span class="k">while</span> <span class="n">h</span><span class="p">[</span><span class="n">r</span><span class="p">]</span> <span class="o">>=</span> <span class="n">h</span><span class="p">[</span><span class="n">i</span><span class="p">]:</span>
<span class="n">r</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="n">max_a</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">max_a</span><span class="p">,</span> <span class="n">h</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">*</span><span class="p">(</span><span class="n">r</span> <span class="o">-</span> <span class="n">l</span> <span class="o">-</span> <span class="mi">1</span><span class="p">))</span>
<span class="k">return</span> <span class="n">max_a</span></code></pre></figure>
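<p>To make this concrete, here is a plain-text version of the function above run on a small input (the example input $[2, 1, 5, 6, 2, 3]$ is our own illustration, not from the original):</p>

```python
# Plain transcription of the O(n^2) algorithm above, with an example run.
def get_max_area_hist(h):
    max_a = 0
    h = [-1] + h + [-1]  # sentinels: the expansion loops always terminate
    for i in range(1, len(h) - 1):
        l = i - 1
        while h[l] >= h[i]:  # expand left while bars are at least as tall
            l -= 1
        r = i + 1
        while h[r] >= h[i]:  # expand right while bars are at least as tall
            r += 1
        max_a = max(max_a, h[i] * (r - l - 1))
    return max_a

print(get_max_area_hist([2, 1, 5, 6, 2, 3]))  # 10: height 5 spanning the bars of heights 5 and 6
```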
<h2 id="on-solution-with-a-stack">$O(n)$ Solution with a Stack</h2>
<p>We can get rid of the inner loops by using a stack $S$. The stack will contain the indices of the bars and starts out containing the sentinel element at 0 ($S = [0]$). At iteration $i$, we pop from the stack all indices whose heights are greater than $h_i$ and then add $i$. Let’s represent the elements of the stack as $S = [a_0, a_1, \cdots, a_m]$, where $a_m$ is the top of the stack. Let’s explore a few properties:</p>
<p><strong>Property 1.</strong> The heights corresponding to the indices in the stack are sorted in non-decreasing order after each iteration. That is, $h_{a_0} \le h_{a_1} \le \cdots \le h_{a_m}$.</p>
<p><strong>Proof.</strong> We can show this by induction. It’s clearly true for a single element. Now suppose the property holds at the beginning of iteration $i$. Before we insert $i$ at the top, we’ll remove all the indices whose heights are bigger than $h_i$. Let the resulting stack be $S = [a_0, a_1, \cdots, a_{m’}]$. By hypothesis $h_{a_0} \le h_{a_1} \le \cdots \le h_{a_{m’}}$ and by construction $h_i \ge h_{a_{m’}}$, so the property holds after the insertion of $i$. <em>QED</em>.</p>
<p><strong>Property 2.</strong> For a given index $a_i$ in the stack ($i > 0$), $a_{i - 1}$ is the closest $l < a_{i}$ whose height is less than $h_{a_{i}}$.</p>
<p><em>Proof.</em> Let $j$ be the index stored at the top of the stack right before inserting $i$. We want to show $j = l$. We first note that by construction $h_j < h_i$ and $j < i$. Suppose $j \neq l$. Then $j < l$ by the definition of $l$. So if $l$ is not at the top of the stack, it must have been popped after being added, but that can only happen at an iteration $l’ > l$ with $h_{l’} < h_l$, which is a contradiction: since $h_l < h_i$, we’d have $h_{l’} < h_i$ with $l’$ closer to $i$ than $l$.</p>
<p>This holds as long as both $i$ and $l$ remain on the stack, since their relative order never changes. <em>QED</em>.</p>
<p><strong>Property 3.</strong> If at iteration $i$ the index $j$ is popped from the stack, then $i$ is the closest $r > j$ whose height is less than $h_j$.</p>
<p><em>Proof.</em> We know that $h_j > h_i$ because it was popped, and $i > j$ by construction. It remains to show that $i$ is the closest such index to $j$. Suppose it is not, i.e. there is $i’$ with $j < i’ < i$ such that $h_{i’} < h_j$. Since by <em>Property 1</em> the heights in the stack are always in non-decreasing order, at iteration $i’$ all elements on top of $j$ in $S$ would have been popped, and then $j$ itself; but since $j$ is still in the stack at iteration $i$, this cannot be. <em>QED</em>.</p>
<p>Concluding, by <em>Property 3</em>, if $j$ is popped out in iteration $i$, then $r = i$. Moreover, once $j$ is popped, the top of the stack happens to be $l$ by <em>Property 2</em>.</p>
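<p>As a quick empirical illustration of Property 1, the sketch below (our own, using an arbitrary example input) runs the push/pop loop and records the stacked heights after each iteration; they stay in non-decreasing order throughout:</p>

```python
# Illustration of Property 1: run the push/pop loop and record the
# heights indexed by the stack after each iteration.
def stack_height_states(h):
    h = [-1] + h + [-1]  # sentinels
    stack = [0]
    states = []
    for i in range(1, len(h)):
        while h[stack[-1]] > h[i]:
            stack.pop()
        stack.append(i)
        states.append([h[j] for j in stack])
    return states

for heights in stack_height_states([2, 1, 5, 6, 2, 3]):
    assert heights == sorted(heights)  # non-decreasing, as Property 1 claims
    print(heights)
```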
<p>This leads to this algorithm:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">get_max_area_hist</span><span class="p">(</span><span class="n">h</span><span class="p">):</span>
<span class="n">max_a</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">h</span> <span class="o">=</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">+</span> <span class="n">h</span> <span class="o">+</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="c1"># sentinels
</span> <span class="n">stack</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">h</span><span class="p">)):</span>
<span class="k">while</span> <span class="n">h</span><span class="p">[</span><span class="n">stack</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]]</span> <span class="o">></span> <span class="n">h</span><span class="p">[</span><span class="n">i</span><span class="p">]:</span>
<span class="n">j</span> <span class="o">=</span> <span class="n">stack</span><span class="p">.</span><span class="n">pop</span><span class="p">()</span>
<span class="n">r</span><span class="p">,</span> <span class="n">l</span> <span class="o">=</span> <span class="n">i</span><span class="p">,</span> <span class="n">stack</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
<span class="n">max_a</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">max_a</span><span class="p">,</span> <span class="n">h</span><span class="p">[</span><span class="n">j</span><span class="p">]</span><span class="o">*</span><span class="p">(</span><span class="n">r</span> <span class="o">-</span> <span class="n">l</span> <span class="o">-</span> <span class="mi">1</span><span class="p">))</span>
<span class="n">stack</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">i</span><span class="p">)</span>
<span class="k">return</span> <span class="n">max_a</span></code></pre></figure>
<p>We still have an inner loop but we can argue that the amortized cost is $O(n)$: every iteration of the inner <code class="language-plaintext highlighter-rouge">while</code> loop removes an element from the stack, and we only add elements to the stack $O(n)$ times, so we only execute the inner loop $O(n)$ times.</p>
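<p>To gain confidence in the stack version, we can cross-check it against a brute-force reference on random inputs (a testing sketch of our own; the input sizes and height bounds are arbitrary):</p>

```python
import random

# Stack-based O(n) algorithm, as described above.
def max_area_stack(h):
    h = [-1] + h + [-1]  # sentinels
    max_a, stack = 0, [0]
    for i in range(1, len(h)):
        while h[stack[-1]] > h[i]:
            j = stack.pop()
            r, l = i, stack[-1]  # right and left boundaries for bar j
            max_a = max(max_a, h[j] * (r - l - 1))
        stack.append(i)
    return max_a

# Brute-force reference: min height over every window [l, r].
def max_area_brute(h):
    best = 0
    for l in range(len(h)):
        lo = h[l]
        for r in range(l, len(h)):
            lo = min(lo, h[r])
            best = max(best, lo * (r - l + 1))
    return best

random.seed(0)
for _ in range(200):
    h = [random.randint(0, 8) for _ in range(random.randint(0, 12))]
    assert max_area_stack(h) == max_area_brute(h)
print("all random tests passed")
```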
<h2 id="largest-submatrix-of-a-binary-matrix">Largest Submatrix of a Binary Matrix</h2>
<p>Consider the following problem: we’re given an $n \times m$ binary matrix $B$ and we want to find the area of the largest rectangle that only contains 1s.</p>
<p>We will now show how to solve this problem in $O(nm)$. The idea is to find the largest rectangle ending at each row $i$, and then take the maximum across all rows.</p>
<p>Suppose that the largest rectangle ending in row $i$ includes a given column $j$. The maximum height it can have is bounded by how many consecutive 1s there are in column $j$ ending at row $i$. If we call the length of this run of 1s $h_j$, we can visualize these values as bar heights, so finding the largest rectangle ending in row $i$ reduces to finding the maximum area under a histogram, which we can do in $O(m)$.</p>
<figure class="center_children">
<img src="https://www.kuniga.me/resources/blog/2021-01-09-max-area-under-histogram/example_2.png" alt="a binary matrix with histogram bars visualized on its last row" />
<figcaption>Figure 2: 5 x 5 matrix. At the last row, we can visualize histogram bars with heights 1, 0, 2, 4, 1</figcaption>
</figure>
<p>How do we compute $h_j$ for row $i$? If we know the “heights” of row $i-1$, say $h’$, we can compute them for row $i$. Let $b_{ij}$ be the element of $B$ at row $i$ and column $j$. If $b_{ij} = 1$, then $h_j = h’_j + 1$. Otherwise, we break the chain of consecutive 1s, so $h_j = 0$.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">get_max_rectangle</span><span class="p">(</span><span class="n">b</span><span class="p">):</span>
<span class="n">max_a</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">n</span><span class="p">,</span> <span class="n">m</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">b</span><span class="p">),</span> <span class="nb">len</span><span class="p">(</span><span class="n">b</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="n">hist</span> <span class="o">=</span> <span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="n">m</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n</span><span class="p">):</span>
<span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">m</span><span class="p">):</span>
<span class="k">if</span> <span class="n">b</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="n">j</span><span class="p">]</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
<span class="n">hist</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">hist</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">max_a</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="n">max_a</span><span class="p">,</span> <span class="n">get_max_area_hist</span><span class="p">(</span><span class="n">hist</span><span class="p">))</span>
<span class="k">return</span> <span class="n">max_a</span></code></pre></figure>
<p>It’s easy to see that the algorithm above is $O(nm)$.</p>
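<p>Putting the two pieces together on a concrete matrix (a self-contained sketch; the $3 \times 3$ example is our own):</p>

```python
# Stack-based max-area-under-histogram, as described earlier in the post.
def get_max_area_hist(h):
    h = [-1] + h + [-1]  # sentinels
    max_a, stack = 0, [0]
    for i in range(1, len(h)):
        while h[stack[-1]] > h[i]:
            j = stack.pop()
            max_a = max(max_a, h[j] * (i - stack[-1] - 1))
        stack.append(i)
    return max_a

# Largest all-1s rectangle: maintain per-column run lengths row by row.
def get_max_rectangle(b):
    n, m = len(b), len(b[0])
    max_a, hist = 0, [0] * m
    for i in range(n):
        for j in range(m):
            hist[j] = hist[j] + 1 if b[i][j] == 1 else 0
        max_a = max(max_a, get_max_area_hist(hist))
    return max_a

b = [[1, 0, 1],
     [1, 1, 1],
     [0, 1, 1]]
print(get_max_rectangle(b))  # 4: the 2x2 block of 1s in the bottom-right corner
```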
<p>Note that if the problem asked for the largest <strong>square</strong>, it would be easier. Let $s_{ij}$ be the side length of the largest square of 1s whose bottom-right corner is at row $i$ and column $j$, and suppose we have already computed it for $i’ = i - 1$ and $j’ = j - 1$. Then, if $b_{ij} = 1$, $s_{ij} = \min(s_{ij’}, s_{i’j}, s_{i’j’}) + 1$.</p>
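<p>A sketch of this square recurrence (our own illustration, assuming $s_{ij} = 0$ whenever $b_{ij} = 0$ and $s_{ij} = 1$ on the first row or column when $b_{ij} = 1$):</p>

```python
# Largest all-1s square via the min-of-three-neighbors recurrence.
def largest_square(b):
    n, m = len(b), len(b[0])
    s = [[0] * m for _ in range(n)]
    best = 0
    for i in range(n):
        for j in range(m):
            if b[i][j] == 1:
                if i == 0 or j == 0:
                    s[i][j] = 1  # base case on the matrix border
                else:
                    s[i][j] = min(s[i][j - 1], s[i - 1][j], s[i - 1][j - 1]) + 1
                best = max(best, s[i][j])
    return best  # side length of the largest all-1s square

b = [[1, 0, 1],
     [1, 1, 1],
     [0, 1, 1]]
print(largest_square(b))  # 2: the 2x2 square in the bottom-right corner
```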
<h2 id="conclusion">Conclusion</h2>
<p>I remember seeing the “max area under histogram” problem a long time ago but I didn’t remember the solution. The use of a stack is very clever but not straightforward to see why it works.</p>
<h2 id="references">References</h2>
<ul>
<li>[<a href="https://www.hackerrank.com/challenges/largest-rectangle/editorial">1</a>] HackerRank - Largest Rectangle Editorial</li>
</ul>Guilherme Kunigami2020 in Review2021-01-01T00:00:00+00:002021-01-01T00:00:00+00:00https://www.kuniga.me/blog/2021/01/01/2020-in-review
<p>This is a meta-post to review what happened in 2020.</p>
<!--more-->
<h2 id="posts-summary">Posts Summary</h2>
<p>This year I set out to learn about <strong>Quantum Computing</strong>. My aim was to understand <a href="https://www.kuniga.me/blog/2020/12/26/shors-prime-factoring.html">Shor’s Prime Factoring Algorithm</a> and learn whatever was needed for that. This led to the study of <a href="https://www.kuniga.me/blog/2020/10/11/deutsch-jozsa-algorithm.html">The Deutsch-Jozsa Algorithm</a>, <a href="https://www.kuniga.me/blog/2020/11/21/quantum-fourier-transform.html">Quantum Fourier Transform</a>, <a href="https://www.kuniga.me/blog/2020/12/23/quantum-phase-estimation.html">Quantum Phase Estimation</a> and <a href="https://www.kuniga.me/blog/2020/12/11/factorization-from-order.html">Number Factorization from Order-Finding</a>.</p>
<p>I’m satisfied with the learning progress and glad to finally have a better understanding of Shor’s algorithm, even though I procrastinated until the second half of the year to start my studies. I liked the approach of having a specific goal in mind, “Understand Shor’s algorithm”, as opposed to the more vague “Learn Quantum Computing”, since it allows focusing and it’s clearer when I can stop.</p>
<p>I wrote about some topics relevant to work including <a href="https://www.kuniga.me/blog/2020/02/02/python-coroutines.html">Python Coroutines</a>, <a href="https://www.kuniga.me/blog/2020/03/07/sockets.html">Sockets</a>, <a href="https://www.kuniga.me/blog/2020/03/28/browser-performance.html">Browser Performance</a>, <a href="https://www.kuniga.me/blog/2020/01/04/observable.html">Observable</a> and <a href="https://www.kuniga.me/blog/2020/05/22/review-working-effectively-with-legacy-code.html">Review: Working Effectively With Legacy Code</a>.</p>
<p>I dedicated some time to learn about system development including <a href="https://www.kuniga.me/blog/2020/07/31/buddy-memory-allocation.html">Memory Allocation</a> and <a href="https://www.kuniga.me/blog/2020/04/24/cpu-cache.html">CPU Cache</a>.</p>
<p>I touched on machine learning by reading the paper <a href="https://www.kuniga.me/blog/2020/10/09/lara.html">Latent Aspect Rating Analysis on Review Text Data</a> and the optimization algorithm <a href="https://www.kuniga.me/blog/2020/09/04/lbfgs.html">L-BFGS</a>.</p>
<p>I had fun writing about two programming puzzles, <a href="https://www.kuniga.me/blog/2020/05/25/minimum-string-from-removing-doubles.html">Shortest String From Removing Doubles</a> and <a href="https://www.kuniga.me/blog/2020/11/06/puzzling-election.html">A Puzzling Election</a>.</p>
<p>I read a book on Information Theory which I didn’t end up writing about, but it inspired me to revisit <a href="https://www.kuniga.me/blog/2020/06/11/huffman-coding.html">Huffman Coding</a>.</p>
<h2 id="the-blog-in-2020">The Blog in 2020</h2>
<p>This year the blog went through major transformations. After about 10 years using Wordpress, I finally decided to <a href="https://www.kuniga.me/blog/2020/07/11/from-wordpress-to-jekyll.html">migrate to Github pages</a> for more control.</p>
<p>One of the features I miss the most is the well-integrated analytics. I’m currently using Google Analytics but it doesn’t have a reliable way to exclude my own visits, which are numerous, especially while writing a post. With that caveat, according to the data, the <a href="https://www.kuniga.me/blog/2020/07/31/buddy-memory-allocation.html">Buddy Memory Allocation</a> post was the most popular with 146 unique visits. Overall the blog had a total of 1.6k visitors.</p>
<p>I kept the resolution to have at least one post a month on average, by writing 19 posts. The blog completed 8 years with 115 posts (some of which were ported and translated from my old blog in Portuguese).</p>
<h2 id="resolutions-for-2021">Resolutions for 2021</h2>
<p>I enjoyed learning about the basics of quantum computing, but I found it highly theoretical. I’m still interested in it from a purely intellectual point of view, especially in learning about Quantum information theory and the complexity class of quantum algorithms, but it will not be my focus.</p>
<p>For 2021 I’ll try to focus on fewer things. My only explicit goal for 2021 is to learn about machine learning, specifically for speech recognition. I’ll try to learn the state of the art and the theory behind it, but also anything related to this problem from a practical perspective such as audio encoding, OS drivers for microphones, signal processing, etc.</p>
<h2 id="personal">Personal</h2>
<p>The end of the year is a good time to look back and remember all the things I’ve done besides work and the technical blog. Due to the coronavirus pandemic this year there wasn’t much opportunity for travelling, but on the other hand I ended up having a lot more time for catching up on reading.</p>
<h3 id="trips">Trips</h3>
<p>Despite travel restrictions, I was able to go on road trips around California, which has beautiful scenery. I had a chance to go again to Yosemite, Pinnacles and Death Valley National Parks, besides doing a lot of hikes and some camping in local parks.</p>
<figure class="center_children">
<img src="https://www.kuniga.me/resources/blog/2021-01-01-2020-in-review/2020-nps.png" alt="a collage of photos from different national parks in California" />
<figcaption>
Top (All in Yosemite):
1. <a href="https://www.flickr.com/photos/kunigami/50800718838/" target="_blank">Nevada Fall</a>;
2. <a href="https://www.flickr.com/photos/kunigami/50800727303/" target="_blank">Cathedral Peak</a>;
3. <a href="https://www.flickr.com/photos/kunigami/50800727608/" target="_blank">Half-dome</a>.
Bottom:
4. <a href="https://www.flickr.com/photos/kunigami/50800727093/" target="_blank">Bear Gulch reservoir in Pinnacles</a>;
5. <a href="https://www.flickr.com/photos/kunigami/50801468616/" target="_blank">Death Valley from the Windrose trail</a>;
6. <a href="https://www.flickr.com/photos/kunigami/50787619163/" target="_blank">Red Rock Canyon State Park</a>
</figcaption>
</figure>
<h3 id="books">Books</h3>
<p>As I mentioned, the pandemic left a lot more indoor time, which led to more reading. Here are the books I finished reading in 2020.</p>
<p><strong>History</strong></p>
<table class="books-table">
<tbody>
<tr>
<td><b>Bury my Heart at Wounded Knee</b> by Dee Brown. The history of many native American tribes (Arapaho, Apache, Cheyenne, Kiowa, Navaho, Sioux) in the late 19th century and their fight against American settlers and military. It's a bit hard to read at times due to the violent and unjust acts of the latter. I wasn't familiar with this dark side of American history.</td>
<td><img src="https://www.kuniga.me/resources/blog/2021-01-01-2020-in-review/bury_my_heart.jpg" alt="Bury my Heart at Wounded Knee Book Cover" /></td>
</tr>
<tr>
<td><b>The Last Mughal</b> by William Dalrymple. Recounts the history of India, centered around the last years of Zafar's reign, and preceding the British Raj. I couldn't help drawing parallels with <i>Bury my Heart at Wounded Knee</i>. I picked this as part of the trip to India in 2019 - most of the book is in Delhi, so it is a good read if you're visiting the region.</td>
<td><img src="https://www.kuniga.me/resources/blog/2021-01-01-2020-in-review/last_mughal.jpg" alt="The Last Mughal Book Cover" /></td>
</tr>
<tr>
<td><b>Sapiens</b> by Yuval Noah Harari. I rarely re-read books, but I recall liking this book so much a few years back that I decided to revisit it. I didn't remember a lot of the contents and wasn't as amused, possibly due to high expectations and maybe having internalized some of the more surprising facts. I did like the idea of dedicating some time to re-read books I really liked, so I'll try to make a point of re-reading a book every year.</td>
<td><img src="https://www.kuniga.me/resources/blog/2021-01-01-2020-in-review/sapiens.jpg" alt="Sapiens Book Cover" /></td>
</tr>
<tr>
<td><b>The Great Influenza</b> by John M. Barry. Recounts the events in the US during the 1918 pandemic. It focuses a lot on the lives of the scientists who made key contributions during and after that time, and also on the revolution in American medicine that started a few decades before the pandemic. My choice of this book during 2020 shouldn't be surprising :)</td>
<td><img src="https://www.kuniga.me/resources/blog/2021-01-01-2020-in-review/the_great_influenza.jpg" alt="The Great Influenza Cover" /></td>
</tr>
<tr>
<td><b>The Quartet</b> by Joseph J. Ellis. I'm not the biggest fan of American history but knowing this year we'd be limited to be within the US and since I enjoy reading history from places I travel to, I decided to give it a try.
It focuses on what's called the second American revolution (the first being independence from Britain) led by four prominent figures: Alexander Hamilton, George Washington, John Jay, and James Madison. It culminates with the writing of the constitution.
It was interesting to learn how much the struggle of powers between states and the federal government influenced the nature of the constitution.
</td>
<td><img src="https://www.kuniga.me/resources/blog/2021-01-01-2020-in-review/the_quartet.jpg" alt="The Quartet Cover" /></td>
</tr>
<tr>
<td><b>The Silk Roads</b> by Peter Frankopan. This book tells the history of the world from the point of view of the region covered by the Silk Roads, which includes countries from the Near and Middle East and Central Asia. I don't recall learning so much history from a single book, and if I had to pick one, this would be my favorite book of 2020.
</td>
<td><img src="https://www.kuniga.me/resources/blog/2021-01-01-2020-in-review/silk_roads.jpg" alt="The Silk Roads Book cover" /></td>
</tr>
</tbody>
</table>
<p><strong>Science</strong></p>
<table class="books-table">
<tbody>
<tr>
<td><b>I'm a Strange Loop</b> by Douglas Hofstadter. I was impressed by Hofstadter's <i>Gödel, Escher, Bach</i> but had trouble grasping a lot of the subjects. The author claims that <i>I'm a Strange Loop</i> is a more focused and intuitive take on consciousness. It draws a lot on his personal experiences, which makes it kind of an auto-biography. Overall it's a fascinating philosophical discussion. My favorite bit was the thought experiment by Derek Parfit regarding the uniqueness of the "self", which is summarized <a href="https://en.wikipedia.org/wiki/Reasons_and_Persons">here</a>. I'm looking forward to reading <i>Reasons and Persons</i>.</td>
<td><img src="https://www.kuniga.me/resources/blog/2021-01-01-2020-in-review/i_am_a_strange_loop.jpg" alt="I'm a Strange Loop Book Cover" /></td>
</tr>
<tr>
<td><b>Working Effectively With Legacy Code</b> by Michael C. Feathers. I wrote a <a href="https://www.kuniga.me/blog/2020/05/22/review-working-effectively-with-legacy-code.html">post</a> about it.
</td>
<td><img src="https://www.kuniga.me/resources/blog/2021-01-01-2020-in-review/work_with_legacy_code.jpg" alt="Working Effectively With Legacy Code Book Cover" /></td>
</tr>
<tr>
<td><b>An Introduction to Information Theory</b> by John R. Pierce. I don't recall why I had this book on my shelf, but it had been there for a while so I decided to catch up on my unread books. It doesn't require prior advanced math knowledge but it's still a textbook. I like its multi-disciplinary approach, for example: bringing in thermodynamics to discuss and compare entropy in physics and in information theory; talking (briefly) about quantum information theory; considering information theory in arts and linguistics. It inspired me to write about <a href="https://www.kuniga.me/blog/2020/06/11/huffman-coding.html">Huffman Coding</a>.</td>
<td><img src="https://www.kuniga.me/resources/blog/2021-01-01-2020-in-review/information_theory.jpg" alt="An Introduction to Information Theory Book Cover" /></td>
</tr>
<tr>
<td><b>I contain multitudes</b> by Ed Yong. This book explores the world of microbes and makes the case that there are no inherently good or bad microbes, only those that happen to benefit us or not; in some cases the same species even plays both roles depending on the situation.</td>
<td><img src="https://www.kuniga.me/resources/blog/2021-01-01-2020-in-review/i_contain_multitudes.jpg" alt="I contain multitudes Book Cover" /></td>
</tr>
<tr>
<td><b>Why we sleep</b> by Matthew Walker. I learned how important sleep is for our health. Insufficient sleep is related to a plethora of diseases and conditions, including cancer, obesity and a weakened immune system.</td>
<td><img src="https://www.kuniga.me/resources/blog/2021-01-01-2020-in-review/why_we_sleep.jpg" alt="Why we sleep Book Cover" /></td>
</tr>
<tr>
<td><b>Beyond Weird</b> by Philip Ball. Quantum mechanics for lay people, which I found very accessible. It doesn't require prior knowledge of quantum mechanics but it does try to clarify where the popular notions of entanglement, superposition and quantum teleportation come from. My main takeaway is that quantum mechanics is a mathematical theory (abstraction) that exists without necessarily having an explicit representation in reality, which is hard to be satisfied with given that it does predict a lot of real-world observations.</td>
<td><img src="https://www.kuniga.me/resources/blog/2021-01-01-2020-in-review/beyond_weird.jpg" alt="Beyond Weird Book Cover" /></td>
</tr>
<tr>
<td><b>Infinite Powers</b> by Steven Strogatz. It covers the history of Calculus, including the seeds of the theory planted by mathematicians of the ancient era such as Archimedes, developing through Galileo and Kepler until its full development by Leibniz and Newton. It is very informative and provides an intuitive and gentle introduction to calculus. It also describes important applications in both theory and practice (quantum mechanics, GPS, CT scans).</td>
<td><img src="https://www.kuniga.me/resources/blog/2021-01-01-2020-in-review/infinite_powers.jpg" alt="Infinite Powers Book Cover" /></td>
</tr>
</tbody>
</table>
<p><strong>Other non-fiction</strong></p>
<table class="books-table">
<tbody>
<tr>
<td><b>The Everything Store</b> by Brad Stone. As with the <a href="https://www.kuniga.me/blog/2020/01/04/2019-in-review.html">biography of Phil Knight</a> (Nike's founder), this biography of Jeff Bezos is intertwined with that of his company. I learned some interesting facts, for example how much leverage Amazon has when acquiring smaller competitors (such as Zappos).</td>
<td><img src="https://www.kuniga.me/resources/blog/2021-01-01-2020-in-review/the_everything_store.jpg" alt="The Everything Store Book Cover" /></td>
</tr>
<tr>
<td><b>Everybody Lies</b> by Seth Stephens-Davidowitz. Seth is a data scientist who finds insights using publicly available sources. One of my main takeaways is that Google Trends is a particularly rich source of data because people make searches in anonymity. This is in contrast to public surveys or social media, where people tend to be "politically correct" and not fully honest.</td>
<td><img src="https://www.kuniga.me/resources/blog/2021-01-01-2020-in-review/everybody_lies.jpg" alt="Everybody Book Cover" /></td>
</tr>
<tr>
<td><b>Don't make me think</b> by Steve Krug. This book offers a lot of practical advice on making websites more user friendly. I felt I had already internalized many of the good practices it suggests, having worked with web tools that inherited designs made by people with good UX knowledge. It was useful to see them listed out explicitly, though.</td>
<td><img src="https://www.kuniga.me/resources/blog/2021-01-01-2020-in-review/dont_make_me_think.jpg" alt="Don't make me think Book Cover" /></td>
</tr>
<tr>
<td><b>Peopleware</b> by Tom DeMarco and Timothy Lister. Every list of recommended programming books seems to include this (among others that I like, such as <i>Code Complete</i>), so I decided to give it a go. I am not and don't plan to manage people any time soon, but I wanted to understand what makes a good manager, since most people work with one. The book covers a set of topics primarily focused on the happiness and productivity of individuals. It's full of interesting anecdotes and it's not prescriptive. I enjoyed it overall and might write a review at some point.</td>
<td><img src="https://www.kuniga.me/resources/blog/2021-01-01-2020-in-review/peopleware.jpg" alt="Peopleware Book Cover" /></td>
</tr>
</tbody>
</table>
<p><strong>Fiction</strong></p>
<table class="books-table">
<tbody>
<tr>
<td><b>Invisible Cities</b> by Italo Calvino. I started this book a long time ago (2018?) but only finished it this year. It consists of a set of short stories about fictitious cities. It's hard to make sense of some of them, but the imagery they evoke is very artistic.</td>
<td><img src="https://www.kuniga.me/resources/blog/2021-01-01-2020-in-review/invisible_cities.jpg" alt="Invisible Cities Book Cover" /></td>
</tr>
<tr>
<td><b>The Overstory</b> by Richard Powers. Beautiful book and message. I like how a lot of the story happens around the Bay Area. I learned that in Stanford's <a href="https://trees.stanford.edu/treewalks/treemaps.htm" target="_blank">main quad</a> there are a variety of trees from all over the world. I thought the author went a bit overboard with esoteric words, and I had to look up the dictionary pretty often.</td>
<td><img src="https://www.kuniga.me/resources/blog/2021-01-01-2020-in-review/the_overstory.png" alt="The Overstory Book Cover" /></td>
</tr>
</tbody>
</table>Guilherme KunigamiThis is a meta-post to review what happened in 2020.Shor’s Prime Factoring Algorithm2020-12-26T00:00:00+00:002020-12-26T00:00:00+00:00https://www.kuniga.me/blog/2020/12/26/shors-prime-factoring<!-- This needs to be define as included html because variables are not inherited by Jekyll pages -->
<figure class="image_float_left">
<img src="https://www.kuniga.me/resources/blog/2020-12-26-shors-prime-factoring/peter-shor.png" alt="Peter Shor thumbnail" />
</figure>
<p>Peter Shor is an American professor at MIT. He received his B.S. in Mathematics at Caltech and earned his Ph.D. in Applied Mathematics from MIT advised by Tom Leighton. While at Bell Labs, Shor developed his namesake prime factorization quantum algorithm, which earned him prizes including the Gödel Prize in 1999.</p>
<p>In this post we’ll combine the parts we studied before to understand Peter Shor’s prime factorization quantum algorithm, which can find a factor of a composite number exponentially faster than the best known algorithm using classic computation.</p>
<p>We’ll need basic familiarity with quantum computing, covered in a <a href="https://www.kuniga.me/blog/2020/10/11/deutsch-jozsa-algorithm.html">previous post</a>. The bulk of the post shows how to efficiently solve the order-finding problem, since we learned in <a href="https://www.kuniga.me/blog/2020-12-11-factorization-from-order.html">Number Factorization from Order-Finding</a> that it is the bottleneck step in finding a prime factor of a composite number. The remainder puts everything together and analyzes the performance of the algorithm as a whole.</p>
<!--more-->
<h2 id="quantum-order-finding">Quantum Order-finding</h2>
<p>Recall the definition of order-finding from [2]:</p>
<blockquote>
<p>Given integers $x$, $N$, the problem of <em>order-finding</em> consists in finding the smallest positive number $r$ such that $x^r \equiv 1 \Mod{N}$, where $r$ is called the <em>order of</em> $x \Mod{N}$.</p>
</blockquote>
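<p>To make the definition concrete, here is a brute-force classical order finder. It takes time linear in $r$, hence exponential in the number of bits of $N$, and this is precisely the step the quantum algorithm speeds up:</p>

```python
import math

def order(x, N):
    """Smallest r > 0 such that x^r = 1 (mod N), by brute force."""
    assert math.gcd(x, N) == 1, "order is only defined when x and N are co-prime"
    y, r = x % N, 1
    while y != 1:
        y = y * x % N
        r += 1
    return r
```

<p>For example, <code>order(7, 15)</code> returns 4, since $7^4 = 2401 \equiv 1 \Mod{15}$.</p>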
<p>The basic idea is to choose a unitary matrix $U$ and show that its eigenvalues encode the order of $x \Mod{N}$.</p>
<h3 id="choosing-the-operator-u">Choosing the Operator $U$</h3>
<p>We choose $U$ such that $U \ket{u} = \ket{x u \Mod{N}}$, which can be shown to be a unitary matrix. Now suppose our eigenvector is</p>
\[\ket{u_s} = \frac{1}{\sqrt{r}} \sum_{k = 0}^{r-1} \exp({\frac{-2 \pi i s k}{r}}) \ket{x^k \Mod{N}}\]
<p>For a parameter $0 \le s \le r - 1$. If we apply the operator $U$:</p>
\[U \ket{u_s} = \frac{1}{\sqrt{r}} \sum_{k = 0}^{r-1} \exp({\frac{-2 \pi i s k}{r}}) \ket{x \cdot x^k \Mod{N}}\]
\[= \frac{1}{\sqrt{r}} \sum_{k = 0}^{r-1} \exp({\frac{-2 \pi i s k}{r}}) \ket{x^{k+1} \Mod{N}}\]
<p>We can show that (see <em>Lemma 1</em> in the <em>Appendix</em>)</p>
\[U \ket{u_s} = \exp({\frac{2 \pi i s}{r}}) \ket{u_s}\]
<p>We conclude that $\exp({\frac{2 \pi i s}{r}})$ is the eigenvalue for $U$ and $\ket{u_s}$. We can then measure $\varphi \approx s/r$ via <a href="https://www.kuniga.me/blog/2020/12/23/quantum-phase-estimation.html">Quantum Phase Estimation</a>.</p>
<p>But how do we prepare the state $\ket{u_s}$ for some $s$?</p>
<h3 id="preparing-the-eigenvector">Preparing the eigenvector</h3>
<p>We don’t know how to prepare $\ket{u_s}$ for a specific $s$, but we can prepare a state which is a linear combination of $\ket{u_s}$, in particular:</p>
\[\frac{1}{\sqrt{r}} \sum_{s=0}^{r-1} \ket{u_s}\]
<p>which can be shown to be exactly $\ket{1}$ (see <em>Lemma 2</em> in the <em>Appendix</em>). That means if we use $\ket{1}$ as our initial eigenvector, we’ll measure a $\varphi$ corresponding to one of the eigenvalues $\exp({\frac{2 \pi i s}{r}})$, but we don’t know which $s$ was used!</p>
<p>Let’s take a detour to revisit continued fractions and learn how to leverage them to recover $r$ and $s$.</p>
<h2 id="continued-fractions">Continued Fractions</h2>
<p>Continued fractions is a way to represent rational numbers such that they can be iteratively approximated. For example, consider the rational $\frac{31}{13}$.</p>
<p>We can represent it as $2 + \frac{5}{13}$, which is the same as $2 + \frac{1}{\frac{13}{5}}$. We can repeat this process for $13/5$ to get</p>
\[\frac{31}{13} = 2 + \frac{1}{2 + \frac{1}{\frac{3}{5}}}\]
<p>If we continue with this, we’ll end up with</p>
\[\frac{31}{13} = 2 + \frac{1}{2 + \frac{1}{1 + \frac{1}{1 + \frac{1}{2}}}}\]
<p>We stop at $\frac{1}{2}$: its reciprocal $\frac{2}{1}$ leaves no remainder, so there is nothing left to expand.</p>
<p>More formally, consider a rational number greater than 1, $p/q$. We rewrite it as $a + b/q$, such that $p = aq + b$, $b < q$. Since $b/q < 1$, $1/(q/b) > 1$, and we can repeat the procedure for $q/b$. Note that because $b < q$ this algorithm eventually terminates. The algorithm returns the list of $a$’s generated in the process, which provides a unique representation of $p/q$.</p>
<p>This idea can be implemented in a short Python code:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">continued_fraction</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">q</span><span class="p">):</span>
<span class="n">a</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">while</span> <span class="n">q</span> <span class="o">></span> <span class="mi">0</span> <span class="ow">and</span> <span class="n">p</span> <span class="o">></span> <span class="n">q</span><span class="p">:</span>
<span class="n">a</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">p</span> <span class="o">//</span> <span class="n">q</span><span class="p">)</span>
<span class="n">p</span><span class="p">,</span> <span class="n">q</span> <span class="o">=</span> <span class="n">q</span><span class="p">,</span> <span class="n">p</span> <span class="o">%</span> <span class="n">q</span>
<span class="k">return</span> <span class="n">a</span></code></pre></figure>
<p>As an example, if we run it with $p = 31, q = 13$, it returns $a = [2, 2, 1, 1, 2]$.</p>
<h3 id="recovering-the-rational-number">Recovering the Rational Number</h3>
<p>Given a list of integers $a = [a_0, \cdots, a_n]$, it’s possible to recover the numerator $p$ and denominator $q$ whose continued fraction corresponds to $a$.</p>
<p>If we define $p_0 = a_0$, $q_0 = 1$, $p_1 = 1 + a_0 a_1$ and $q_1 = a_1$, and then recursively</p>
\[\begin{aligned}
p_n &= a_n p_{n-1} + p_{n-2}\\
q_n &= a_n q_{n-1} + q_{n-2}
\end{aligned}\]
<p>It’s possible to show that $[a_0, \cdots, a_n] = p_n / q_n$ and furthermore, that $p_n$ and $q_n$ are co-primes. This provides an easy $O(n)$ algorithm to recover $p$ and $q$ given $[a_0, \cdots, a_n]$.</p>
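<p>A direct translation of this recurrence, complementing the <code>continued_fraction</code> function above:</p>

```python
def from_continued_fraction(a):
    """Recover co-prime (p, q) from the continued fraction a = [a_0, ..., a_n]."""
    p_prev, q_prev = 1, 0   # conventional values for p_{-1}, q_{-1}
    p, q = a[0], 1          # p_0 and q_0
    for a_k in a[1:]:
        p, p_prev = a_k * p + p_prev, p
        q, q_prev = a_k * q + q_prev, q
    return p, q
```

<p>Running it on the earlier example, <code>from_continued_fraction([2, 2, 1, 1, 2])</code> returns <code>(31, 13)</code>, inverting <code>continued_fraction(31, 13)</code>.</p>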
<h3 id="finding-nearby-rational-numbers">Finding Nearby Rational Numbers</h3>
<p>Suppose we are given a rational number $x$ and we want to recover co-primes $p$ and $q$, such that</p>
\[\abs{\frac{p}{q} - x} \le \frac{1}{2q^2}\]
<p>It’s possible to show that $x$ has a continued fraction $a = [a_0, \cdots, a_n]$ and that $p / q = [a_0, \cdots, a_k]$, for $k \le n$.</p>
<p>This means that if we feed $x$’s continued fraction to the algorithm from the section above, we’ll invariably run into $p = p_k$ and $q = q_k$.</p>
<p>This flexibility is important in the context of our problem because the value we’ll measure, $\varphi$, is not exactly $s/r$ but an approximation.</p>
<h3 id="recovering-r-via-continued-fractions">Recovering $r$ via Continued Fractions</h3>
<p>We’ll now see how to recover $s$ and $r$ from $\varphi$. We know that both $s$ and $r$ are integers, so $s/r$ is a rational, and we can use continued fractions to extract them.</p>
<p>We first compute the continued fraction of $\varphi$. Then we try to recover its numerator and denominator, and it can be shown that</p>
\[\abs{\frac{s}{r} - \varphi} \le \frac{1}{2r^2}\]
<p>So by the discussion in the previous section we’ll invariably pass by $p_k / q_k$ such that $p_k / q_k = s / r$.</p>
<p>The problem is that we don’t know for which $k$ that’s the case, but we can determine if $r = q_k$ for each $k$ by checking whether $x^{q_k} \equiv 1 \Mod{N}$. If $r$ and $s$ are co-primes, then we’ll find it.</p>
<p>If not, let $r_0 = q_k$ for a given iteration $k$ and assume that $x^{r_0} \not \equiv 1 \Mod{N}$. We can show that $r_0$ is a factor of $r$ (see <em>Lemma 3</em> in the <em>Appendix</em>). Let $x_0 \equiv x^{r_0} \Mod{N}$. The order of $x_0$ is $r/r_0$, since $x_0^{r/r_0} = x^{r}$. Running order-finding on $x_0$ yields some $r_1$; if $r_1$ happens to be the order of $x_0$, we can get $r$ via $r = r_1 r_0$. Otherwise we repeat for $x_0’ \equiv x_0^{r_1}$. Since $r_0$ is a proper factor of $r$, $r_0 \le r/2$, so on each iteration we at least halve the order, which means we only need $O(\log r)$ iterations. If we reach the point where $r_{n} = 1$, it means that $q_k$ is not valid and hence $p_k / q_k \ne s / r$.</p>
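<p>Putting the pieces of this section together, here is a classical sketch of the recovery loop, assuming the measurement already gave us $\varphi$ as a fraction (its $t$-bit value over $2^t$). For simplicity it only handles the case where $s$ and $r$ are co-prime; the halving trick above would extend it to the general case:</p>

```python
def recover_order(x, N, phi_num, phi_den):
    """Try to recover r from a measurement phi = phi_num / phi_den ~ s / r.
    Returns r if some convergent denominator q_k is the order, else None."""
    # continued fraction expansion of phi
    a, p, q = [], phi_num, phi_den
    while q > 0:
        a.append(p // q)
        p, q = q, p % q
    # walk the convergents p_k / q_k and test each denominator q_k
    p_k, q_k, p_prev, q_prev = a[0], 1, 1, 0
    for k in range(len(a)):
        if k > 0:
            p_k, p_prev = a[k] * p_k + p_prev, p_k
            q_k, q_prev = a[k] * q_k + q_prev, q_k
        if pow(x, q_k, N) == 1:
            return q_k  # q_k is the order of x mod N
    return None
```

<p>For example, with $x = 7$, $N = 15$ (order 4) and a measurement of $\varphi = 6/8$ (i.e. $s/r = 3/4$ with $t = 3$ qubits), <code>recover_order(7, 15, 6, 8)</code> returns 4.</p>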
<p>It’s also possible that the true value of $s$ is $0$, in which case we won’t be able to find $r$. This happens with probability $p(s=0) = 1/r$, in which case we repeat the whole algorithm.</p>
<p><strong>Note.</strong> In [1], the authors suggest we can find $s’/r’ = s/r$, where $s’$ and $r’$ are co-prime, from $\varphi$ using continued fractions alone, but I don’t understand how from studying the proof of <em>Theorem A 4.16</em>. In other words, we would know the $k$ for which $p_k / q_k = s / r$, which in turn would lead to a more efficient algorithm.</p>
<h2 id="shors-prime-factoring">Shor’s Prime Factoring</h2>
<p>We now have all the pieces to solve prime factoring. We first use $U \ket{u} = \ket{x u \Mod{N}}$ with $\ket{u} = \ket{1}$. With high probability we’ll measure $\varphi \approx s / r$ for some $s \in \curly{0, \cdots, r-1}$. We can then recover $r$, the order of $x \Mod{N}$, using the method outlined in the previous section.</p>
<p>Finally, once we know $r$ we can obtain a prime factor with high-probability as described in <a href="https://www.kuniga.me/blog/2020-12-11-factorization-from-order.html">Number Factorization from Order-Finding</a>.</p>
<h2 id="performance">Performance</h2>
<h3 id="runtime-complexity">Runtime Complexity</h3>
<p>Let’s first consider the number of gates needed for performing the steps above. Let $L$ represent the number of bits of $x$ for which we want to compute the order modulo $N$.</p>
<p><strong>Measuring $\phi$.</strong> We assume $t$ is roughly $L$. From the circuit depicted in <em>Figure 1</em> in <a href="https://www.kuniga.me/blog/2020/12/23/quantum-phase-estimation.html">Quantum Phase Estimation</a>, we need $O(L)$ Hadamard gates and $O(L)$ of the $U^{2^k}$ gates. We can use an <a href="https://en.wikipedia.org/wiki/Modular_exponentiation">efficient modular exponentiation algorithm</a> which can compute $a^b$ in $O(\log(b) \log^2(a))$, where the $\log^2(a)$ is due to the multiplications, so $U^{2^k}$ can be implemented using $O(L^3)$ gates. Summarizing, we can measure $\phi \approx s/r$ using $O(L^4)$ gates.</p>
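<p>For reference, the classical square-and-multiply idea behind fast modular exponentiation (equivalent to Python’s built-in three-argument <code>pow</code>) looks like the sketch below; each of the $O(\log b)$ iterations costs one or two multiplications:</p>

```python
def mod_exp(a, b, n):
    """Compute a^b mod n via square-and-multiply, O(log b) multiplications."""
    result, base = 1, a % n
    while b > 0:
        if b & 1:                  # current bit of the exponent is set
            result = result * base % n
        base = base * base % n     # square for the next bit
        b >>= 1
    return result
```
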
<p><strong>Recovering $r$ via Continued Fractions.</strong> It’s possible to show that we can compute the continued fraction of $\phi$ in $O(L^3)$, and that the output has size $O(L)$. We then need $O(L)$ iterations to recover its numerator and denominator, but at each step $k$ we need to test whether the denominator $q_k$ is $r$ by computing $x^{q_k} \Mod{N}$, which can be done in $O(L^3)$ using fast modular exponentiation.</p>
<p>However, if we didn’t find $r$, we need to repeat the process up to $O(\log r) = O(L)$ times, each time re-computing a modular exponentiation, for a total of $O(L^4)$ operations each time we need to test a candidate $q_k$. This amounts to $O(L^5)$.</p>
<p><strong>Number Factorization from Order-Finding.</strong> Finally, as we discussed in [3], the complexity of obtaining a prime factor is $O(L^3)$ excluding the order finding step.</p>
<p>Recovering $r$ from $\phi$ dominates the overall complexity, leading to an $O(L^5)$ probabilistic quantum algorithm that can find a prime factor of a composite number.</p>
<p>For comparison, if we used a linear algorithm to find the order as discussed in [3], we would end up with an $O(2^L)$ algorithm.</p>
<h3 id="precision">Precision</h3>
<p>We have a few sources of uncertainty in the algorithm, which we recap now:</p>
<ul>
<li>The possibility of not measuring the real $s / r$, which is less than 60% (see <em>Measuring $\phi$</em> in [2]) and can be further reduced by using more gates, or simply by repeating the phase estimation algorithm, since each run is independent.</li>
<li>The possibility that $s = 0$ (see <em>Recovering $r$ via Continued Fractions</em> above), which has low probability ($1 / r$) and can be further reduced by repeating the phase estimation algorithm.</li>
<li>The possibility that the randomly chosen $x$ in the prime factoring algorithm (see <em>Prime Factoring Algorithm</em> in [3]) yields a “bad” $r$, which is less than 25% and can be further reduced by repeatedly choosing a new random $x$.</li>
</ul>
<h2 id="conclusion">Conclusion</h2>
<p>This is the final post in the series that led to Shor’s prime factoring algorithm. Regardless of the practical applicability of this method, I found it fascinating how much theory it relies on. I learned a lot about quantum computing in the process, and while I might not have grasped every single detail, I think I have a good overall idea of how everything comes together.</p>
<p>I find the fact that</p>
\[\frac{1}{\sqrt{r}} \sum_{s=0}^{r-1} \ket{u_s} = \ket{1}\]
<p>is mind-blowing. It’s as if a bunch of eigenvectors are “hiding” inside $\ket{1}$ and only “come out” when measured.</p>
<h2 id="related-posts">Related Posts</h2>
<ul>
<li><a href="https://www.kuniga.me/blog/2019/04/12/consistent-hashing.html">Consistent Hashing</a> - this is trivia rather than real relatedness, but I found out that <a href="https://en.wikipedia.org/wiki/F._Thomson_Leighton">Tom Leighton</a> was the advisor of both <a href="https://en.wikipedia.org/wiki/Daniel_Lewin">Daniel Lewin</a> (founder of Akamai, featured in that post) and Peter Shor (featured in this post).</li>
</ul>
<h2 id="appendix">Appendix</h2>
<p><strong>Lemma 1.</strong> $U \ket{u_s} = \exp({\frac{2 \pi i s}{r}}) \ket{u_s}$</p>
<p><em>Proof.</em> To simplify the notation, assume $\alpha = \frac{-2 \pi i s}{r}$ and $y^k = x^k \Mod{N}$. We can write $\ket{u_s}$ as:</p>
\[\ket{u_s} = \frac{1}{\sqrt{r}} (e^{\alpha 0} \ket{y^0} + e^{\alpha 1} \ket{y^1} + \cdots + e^{\alpha (r-1)} \ket{y^{r-1}})\]
<p>We can write $U \ket{u_s}$ as:</p>
\[U \ket{u_s} = \frac{1}{\sqrt{r}} (e^{\alpha 0} \ket{y^1} + e^{\alpha 1} \ket{y^2} + \cdots + e^{\alpha (r-1)} \ket{y^{r}})\]
<p>We note that $\ket{y^{r}} = \ket{y^{0}}$ since by definition $x^r \equiv x^0 \equiv 1 \Mod{N}$, so we can rearrange:</p>
\[U \ket{u_s} = \frac{1}{\sqrt{r}} (e^{\alpha (r-1)} \ket{y^{0}} + e^{\alpha 0} \ket{y^1} + \cdots + e^{\alpha (r-2)} \ket{y^{r - 1}})\]
<p>This looks almost like $\ket{u_s}$ except the exponents are shifted by 1. We can fix this by pulling a factor of $e^{\alpha}$:</p>
\[U \ket{u_s} = \frac{1}{e^{\alpha} \sqrt{r}} (e^{\alpha r} \ket{y^{0}} + e^{\alpha 1} \ket{y^1} + \cdots + e^{\alpha (r-1)} \ket{y^{r - 1}})\]
<p>We have $e^{\alpha r} = \exp(\frac{2 \pi i s r}{r}) = \exp(2 \pi i s)$, and since $s$ is an integer, by Euler’s formula we conclude that $e^{\alpha r} = 1 = e^{\alpha 0}$, so we can write</p>
\[U \ket{u_s} = \frac{1}{e^{\alpha} \sqrt{r}} (e^{\alpha 0} \ket{y^{0}} + e^{\alpha 1} \ket{y^1} + \cdots + e^{\alpha (r-1)} \ket{y^{r - 1}})\]
<p>Now it’s easy to see that $U \ket{u_s} = \frac{1}{e^{\alpha}} \ket{u_s} = e^{-\alpha} \ket{u_s} = \exp({\frac{2 \pi i s}{r}}) \ket{u_s}$. <em>QED</em></p>
<p><strong>Lemma 2.</strong></p>
\[\frac{1}{\sqrt{r}} \sum_{s=0}^{r-1} \ket{u_s} = \ket{1}\]
<p><em>Proof:</em></p>
<p>Let</p>
\[S = \frac{1}{\sqrt{r}} \sum_{s=0}^{r-1} \ket{u_s}\]
<p>Replace the definition of $\ket{u_s}$:</p>
\[S = \frac{1}{\sqrt{r}} \sum_{s=0}^{r-1} (\frac{1}{\sqrt{r}} \sum_{k = 0}^{r-1} e^{-2 \pi i s k /r} \ket{x^k \Mod{N}})\]
<p>Moving scalars around and changing the order of the sums yields:</p>
\[\frac{1}{r} \sum_{k = 0}^{r-1} \ket{x^k \Mod{N}} (\sum_{s=0}^{r-1} e^{-2 \pi i s k /r} )\]
<p>For $k = 0$ the terms of the inner sum are equal to 1, so it adds up to $r$. Otherwise, the inner sum is a geometric sum on $s$, which has a closed form (see <em>Interlude: Classic Inverse Fourier Transform</em> [2] for more details):</p>
\[\sum_{s=0}^{r-1} e^{-2 \pi i s k /r} = \frac{1 - e^{-2 \pi i k}}{1 - e^{-2 \pi i k / r}}\]
<p>Since $k$ is a positive integer, $e^{-2 \pi i k} = 1$. Since $k < r$, $k / r < 1$ and $e^{-2 \pi i k / r} \neq 1$, which means the inner sum is 0 for $k > 0$.</p>
<p>Thus,</p>
\[S = \frac{1}{r} r \ket{x^0 \Mod{N}} = \ket{1}\]
<p><em>QED</em></p>
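<p>Lemma 2 can also be checked numerically. The sketch below builds each $\ket{u_s}$ as a map from basis states $x^k \Mod{N}$ to amplitudes, sums them with the $1/\sqrt{r}$ weight, and lets us verify that all the amplitude lands on $\ket{1}$ (using $x = 7$, $N = 15$, which has order $r = 4$):</p>

```python
import cmath, math

def u_s(x, N, r, s):
    """Amplitudes of |u_s> over the basis states x^k mod N."""
    amp = {}
    for k in range(r):
        basis = pow(x, k, N)
        amp[basis] = amp.get(basis, 0) + \
            cmath.exp(-2j * math.pi * s * k / r) / math.sqrt(r)
    return amp

def superposition(x, N, r):
    """(1 / sqrt(r)) * sum over s of |u_s>."""
    total = {}
    for s in range(r):
        for basis, a in u_s(x, N, r, s).items():
            total[basis] = total.get(basis, 0) + a / math.sqrt(r)
    return total
```

<p><code>superposition(7, 15, 4)</code> gives amplitude 1 on basis state 1 and (numerically) 0 everywhere else, i.e. the state $\ket{1}$.</p>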
<p><strong>Lemma 3.</strong> Let $r, s, r’, s’$ be positive integers. If $s’ / r’ = s / r$ and $s’$ and $r’$ are co-primes, then $r’$ divides $r$. Furthermore, if $s$ and $r$ are not co-prime then $r’ < r$.</p>
<p><em>Proof</em> We have $s’ r = s r’$, and because prime factorization is unique, both sides have the same prime factors. All prime factors contributed by $r’$ must be matched by $r$ on the other side since $r’$ and $s’$ do not share any prime factors. This means that $r$ can be divided by $r’$.</p>
<p>If $r$ and $s$ are not co-prime, then $r \neq r’$: otherwise $s’ = s$, contradicting that $s’$ and $r’$ are co-prime while $s$ and $r$ are not. Also, since $r$ can be divided by $r’$, $r \ge r’$, so it must be that $r > r’$. <em>QED</em></p>
<h2 id="references">References</h2>
<ul>
<li>[<a href="https://www.amazon.com/Quantum-Computation-Information-10th-Anniversary/dp/1107002176">1</a>] Quantum Computation and Quantum Information - Nielsen, M. and Chuang, I.</li>
<li>[<a href="https://www.kuniga.me/blog/2020/12/23/quantum-phase-estimation.html">2</a>] Quantum Phase Estimation</li>
<li>[<a href="https://www.kuniga.me/blog/2020-12-11-factorization-from-order.html">3</a>] Number Factorization from Order-Finding</li>
</ul>Guilherme KunigamiQuantum Phase Estimation2020-12-23T00:00:00+00:002020-12-23T00:00:00+00:00https://www.kuniga.me/blog/2020/12/23/quantum-phase-estimation<!-- This needs to be define as included html because variables are not inherited by Jekyll pages -->
<p>Given a unitary matrix $U$ with eigenvector $\ket{u}$, we want to estimate $\varphi$ where $e^{2 \pi i \varphi}$ is the eigenvalue of $U$.</p>
<p>This serves as a framework for solving a variety of problems including order-finding, which, as we have shown in a <a href="https://www.kuniga.me/blog/2020/12/11/factorization-from-order.html">recent post</a>, can be used to efficiently factorize a number.</p>
<p>We assume basic familiarity with quantum computing, covered in a <a href="https://www.kuniga.me/blog/2020/10/11/deutsch-jozsa-algorithm.html">previous post</a>, plus we’ll use <a href="https://www.kuniga.me/blog/2020/11/21/quantum-fourier-transform.html">quantum Fourier transform</a> (QFT) in one of the steps.</p>
<!--more-->
<h2 id="quantum-circuit">Quantum Circuit</h2>
<p>Let’s first consider smaller parts of the circuit before showing the whole picture.</p>
<h3 id="controlled-gate-revisited">Controlled gate revisited</h3>
<p>We described the CNOT gate in [2], and then showed that any $n$-qubit gate $U$ can be transformed into an $(n+1)$-qubit controlled gate [3] (the $CR_k$ gate). In both cases the control qubit is assumed to be in the computational basis (more specifically $\ket{0}$ or $\ket{1}$).</p>
<p>Here we consider the case where the control bit is in an arbitrary state, e.g. $\alpha \ket{0} + \beta \ket{1}$.</p>
<p>Suppose we transformed an $n$-qubit gate $U$ into an $(n+1)$-qubit controlled gate. What happens when we apply it to an $(n+1)$-qubit state $\ket{c} \ket{y}$, where $\ket{y}$ is an $n$-qubit state and $\ket{c}$ is the control?</p>
<p>If $\ket{c} = \ket{0}$, the output is $\ket{0} \ket{y}$, whereas if $\ket{c} = \ket{1}$ the output is $\ket{1} U \ket{y}$. We can use the linearity principle and obtain</p>
\[(1) \quad \alpha \ket{0} \ket{y} + \beta \ket{1} U \ket{y}\]
<h3 id="the-u-gate">The $U$ gate</h3>
<p>Because $\ket{u}$ is the eigenvector of $U$ and $e^{2 \pi i \varphi}$ its eigenvalue, by definition $U \ket{u} = e^{2 \pi i \varphi} \ket{u}$, so if we apply a gate with corresponding unitary matrix $U$ to an input $\ket{u}$ we obtain $e^{2 \pi i \varphi} \ket{u}$.</p>
<p>It’s possible to show that $U^k$ for a positive integer $k$ is also a unitary matrix and thus has a corresponding gate. If we apply it to $U$’s eigenvector $\ket{u}$ we get $e^{2 \pi i k \varphi} \ket{u}$.</p>
<h3 id="a-simple-circuit">A simple circuit</h3>
<p>The circuit below is used as a building block for the larger one.</p>
<figure class="center_children">
<img src="https://www.kuniga.me/resources/blog/2020-12-23-quantum-phase-estimation/quantum-phase-estimation-single.png" alt="a diagram depicting a quantum circuit" />
<figcaption>Figure 1: Quantum phase estimation circuit for the k-th qubit</figcaption>
</figure>
<p>Let’s follow what’s happening. We start with:</p>
\[\ket{\psi_1} = \ket{0} \ket{u}\]
<p>In the first step we apply the Hadamard to obtain</p>
\[\ket{\psi_2} = (\frac{\ket{0} + \ket{1}}{\sqrt{2}}) \ket{u}\]
<p>Now the first qubit is used as control for the $U^{2^k}$ gate, which similarly to (1) is</p>
\[\frac{\ket{0} \ket{u} + \ket{1} U^{2^k} \ket{u}}{\sqrt{2}}\]
<p>Using the fact that $U^{2^k} \ket{u} = e^{2 \pi i 2^k \varphi} \ket{u}$ we have</p>
\[\frac{\ket{0} \ket{u} + \ket{1} e^{2 \pi i 2^k \varphi} \ket{u}}{\sqrt{2}}\]
<p>Since $e^{2 \pi i 2^k \varphi}$ is scalar we can do some re-arranging:</p>
\[\ket{\psi_3} = \frac{\ket{0} + e^{2 \pi i 2^k \varphi} \ket{1}}{\sqrt{2}} \ket{u}\]
<p>In this view, the first qubit $\ket{0}$ became $\frac{\ket{0} + e^{2 \pi i 2^k \varphi} \ket{1}}{\sqrt{2}}$ while the $n$-qubit $\ket{u}$ remained unchanged, which is a bit counter intuitive especially given the first qubit was the control one.</p>
<h3 id="the-whole-picture">The whole picture</h3>
<figure class="center_children">
<img src="https://www.kuniga.me/resources/blog/2020-12-23-quantum-phase-estimation/quantum-phase-estimation-640w.png" alt="a diagram depicting a quantum circuit" />
<figcaption>Figure 2: Quantum phase estimation circuit</figcaption>
</figure>
<p>We can see that this circuit combines $t$ copies of the small circuit from the previous section. The output of $U^{2^i}$ is fed into $U^{2^{i+1}}$, but as we saw above we can assume it doesn’t change the input, so the final state will still be $\ket{u}$.</p>
<p>For each of the control qubits, there will be a corresponding $U^{2^k}$, so it will end as $\frac{\ket{0} + e^{2 \pi i 2^k \varphi} \ket{1}}{\sqrt{2}}$ as we saw above.</p>
<p>If we look at the whole state after applying this larger circuit we end up with</p>
\[\frac{
(\ket{0} + e^{2 \pi i 2^{t-1} \varphi} \ket{1})
(\ket{0} + e^{2 \pi i 2^{t-2} \varphi} \ket{1}) \cdots
(\ket{0} + e^{2 \pi i 2^0 \varphi} \ket{1})
}{2^{t/2}} \ket{u}\]
<p>If we introduce $\phi = \varphi 2^t$ and look at the first $t$ qubits:</p>
\[(2) \quad
\frac{
(\ket{0} + e^{2 \pi i 2^{-1} \phi} \ket{1})
(\ket{0} + e^{2 \pi i 2^{-2} \phi} \ket{1}) \cdots
(\ket{0} + e^{2 \pi i 2^{-t} \phi} \ket{1})
}{2^{t/2}}\]
<p>We’ll now understand the motivation for this circuit.</p>
<h2 id="inverse-fourier-transform">Inverse Fourier Transform</h2>
<p>If we look back at the construction of equation (8) in the <a href="https://www.kuniga.me/blog/2020/11/21/quantum-fourier-transform.html">quantum Fourier transform</a> (QFT), we’ll be able to recognize that (2) is the result of applying the quantum Fourier transform to a state in the computational basis $\ket{\phi}$!</p>
<p>The previous observation assumes that $\phi$ is a $t$-bit integer $\phi = \phi_t 2^0 + \phi_{t-1} 2^1 + \cdots + \phi_1 2^{t - 1}$. In reality $\varphi$ is a real number. Since we’re obtaining $\phi = \varphi 2^t$, if we increase $t$, we improve accuracy at the expense of performance, since the number of gates is proportional to $t$.</p>
<p>For example, if the true value of $\varphi$’s binary representation was $0.0100101111101$, then with $t = 4$, the value we’d obtain is $\phi = 0100$, for $t = 8$, we’d get $\phi = 01001011$, which allows for a better approximation of $\varphi$.</p>
<p>Because the QFT can be implemented as a quantum circuit, it has a corresponding unitary matrix, which has an inverse. That is to say there’s a quantum circuit which we can apply to (2), that is to the first $t$ qubits of the output, to obtain $\phi$.</p>
<h3 id="interlude-classic-inverse-fourier-transform">Interlude: Classic Inverse Fourier Transform</h3>
<p>Recall from [3] that the output of the Fourier transform (FT) over $x \in \mathbb{C}^N$ is $y \in \mathbb{C}^N$ with:</p>
\[(3) \quad y_k = \frac{1}{\sqrt{N}} \sum_{j = 0}^{N - 1} x_j e^{2 \pi i j k / N} \qquad \forall k = 0, \cdots, N - 1\]
<p>The inverse Fourier transform over $x \in \mathbb{C}^N$ is $y \in \mathbb{C}^N$ with:</p>
\[(4) \quad y_k = \frac{1}{\sqrt{N}} \sum_{j = 0}^{N - 1} x_j e^{-2 \pi i j k / N} \qquad \forall k = 0, \cdots, N - 1\]
<p>Which is almost the same as the normal FT but with the negative sign on the exponent of $e$.</p>
<p>Now let’s apply the inverse Fourier transform over the output of a Fourier transform.</p>
<p>We’ll replace the $x_j$ in (4) with $y_k$ from (3):</p>
\[y_k = \frac{1}{\sqrt{N}} \sum_{j = 0}^{N - 1} \big(\frac{1}{\sqrt{N}} \sum_{l = 0}^{N - 1} x_l e^{2 \pi i l j / N} \big) e^{-2 \pi i j k / N} \qquad \forall k = 0, \cdots, N - 1\]
<p>We can re-arrange some terms to obtain:</p>
\[y_k = \frac{1}{N} \sum_{j = 0}^{N - 1} \sum_{l = 0}^{N - 1} x_l e^{2 \pi i j (l - k) / N} \qquad \forall k = 0, \cdots, N - 1\]
<p>We can swap the order of the sums to get:</p>
\[y_k = \frac{1}{N} \sum_{l = 0}^{N - 1} x_l \sum_{j = 0}^{N - 1} e^{2 \pi i j (l - k) / N} \qquad \forall k = 0, \cdots, N - 1\]
<p>For $k = l$, $e^{2 \pi i j (l - k) / N} = 1$, and the inner sum is $N$. For a $l \neq k$, the term $e^{2 \pi i j (l - k) / N} = (e^{2 \pi i (l - k) / N})^j$. Taking $\alpha = e^{2 \pi i (l - k) / N}$, we have that</p>
\[S = \sum_{j = 0}^{N - 1} e^{2 \pi i j (l - k) / N} = \sum_{j = 0}^{N - 1} \alpha^{j}\]
<p>This is a geometric sum, so we can use the trick of computing $S \alpha$ and subtracting it from $S$:</p>
\[S = \frac{1 - \alpha^N}{1 - \alpha}\]
<p>Replacing $\alpha$ back:</p>
\[S = \frac{1 - e^{2 \pi i (l - k)}}{1 - e^{2 \pi i (l - k) / N}}\]
<p>Since both $l$ and $k$ are integers, $(e^{2 \pi i})^{(l - k)} = 1^{l - k} = 1$. Moreover $0 < \abs{l - k} < N$, which implies $e^{2 \pi i (l - k) / N} \neq 1$, so $S = 0$.</p>
<p>This implies that the only $l$ for which the inner sum is non-zero is $l = k$, so $y_k = x_k$. This shows that applying the Fourier transform followed by the inverse Fourier transform yields the original result.</p>
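<p>The round trip is easy to verify numerically with a direct $O(N^2)$ implementation of (3) and (4):</p>

```python
import cmath, math

def fourier(x, sign=+1):
    """Discrete Fourier transform (3) for sign=+1; its inverse (4) for sign=-1.
    Both use the symmetric 1/sqrt(N) normalization."""
    N = len(x)
    return [sum(x[j] * cmath.exp(sign * 2j * math.pi * j * k / N)
                for j in range(N)) / math.sqrt(N)
            for k in range(N)]

# applying the transform and then its inverse recovers the input
x = [1, 2 - 1j, 0.5, 3j]
roundtrip = fourier(fourier(x, +1), -1)
assert all(abs(a - b) < 1e-9 for a, b in zip(roundtrip, x))
```
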
<h2 id="measuring-phi">Measuring $\phi$</h2>
<p>We can write (2) in this form:</p>
\[\frac{1}{2^{t/2}} \sum_{k=0}^{2^t - 1} e^{2 \pi i \phi k / 2^{t}} \ket{k}\]
<p>Note how this is the reverse of what we did in [3] (see <em>Algebraic Preparation</em>). To simplify the notation, assume that $N = 2^t$:</p>
\[(5) \quad \frac{1}{\sqrt{N}} \sum_{k=0}^{N - 1} e^{2 \pi i \phi k / N} \ket{k}\]
<p>Note this is a state where the $k$-th element has amplitude:</p>
\[\frac{1}{\sqrt{N}} e^{2 \pi i \phi k / N}\]
<p>The inverse Fourier transform is given by:</p>
\[\ket{y} = \sum_{k = 0}^{N - 1} y_k \ket{k}\]
<p>where $y_k$ is defined as:</p>
\[y_k = \frac{1}{\sqrt{N}} \sum_{j = 0}^{N- 1} x_j e^{-2 \pi i j k / N} \qquad \forall k = 0, \cdots, N - 1\]
<p>Thus we can apply the inverse FT to (5), where $x_j$ will be the amplitude of the $j$-th component of (5):</p>
\[\sum_{k = 0}^{N - 1} \frac{1}{\sqrt{N}} \sum_{j = 0}^{N - 1} \big(\frac{1}{\sqrt{N}} e^{2 \pi i \phi j / N} \big) e^{-2 \pi i j k / N} \ket{k}\]
<p>Which we can simplify to</p>
\[\frac{1}{N} \sum_{k = 0}^{N - 1} \sum_{j = 0}^{N - 1} e^{2 \pi i j (\phi - k) / N} \ket{k}\]
<p>The amplitude for a given $\ket{k}$ is given by:</p>
\[(6) \quad \frac{1}{N} \sum_{j = 0}^{N-1} e^{2 \pi i j (\phi - k) / N} = \frac{1}{N} \sum_{j = 0}^{N-1} (e^{2 \pi i (\phi - k) / N})^j\]
<p>Consider the largest base state $b$ (an integer from 0 to $N-1$) smaller than or equal to $\phi$ (which can be a real value). Now suppose we measure a value $m$ from the state (5). Note that if $\phi$ is an integer less than $2^t$, then $b = \phi$ and from the <em>Interlude</em> above exactly one amplitude equals 1, so we would measure $\phi$ with 100% probability.</p>
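<p>We can simulate this numerically: the sketch below prepares state (5) for a given $\phi$, applies the inverse Fourier transform directly, and returns the probability of each outcome $m$. When $\phi$ is an exact $t$-bit integer, all the probability lands on $m = \phi$; otherwise it spreads out but concentrates on the nearby integers:</p>

```python
import cmath, math

def qpe_probabilities(phi, t):
    """Probability of measuring each m in 0..2^t - 1 from state (5),
    after applying the inverse Fourier transform to its amplitudes."""
    N = 2 ** t
    # amplitudes of state (5)
    state = [cmath.exp(2j * math.pi * phi * k / N) / math.sqrt(N)
             for k in range(N)]
    # inverse Fourier transform
    out = [sum(state[j] * cmath.exp(-2j * math.pi * j * m / N)
               for j in range(N)) / math.sqrt(N)
           for m in range(N)]
    return [abs(a) ** 2 for a in out]
```

<p>For example, <code>qpe_probabilities(5, t=4)</code> is 1 at index 5 and 0 elsewhere, while for $\phi = 5.3$ most of the probability mass sits on indices 5 and 6.</p>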
<p>Suppose it’s not. Let $\delta = \phi - b$. Then the probability of measuring $b$, $p(m = b)$, is given by the square of the magnitude of (6). From [4], we can note, as in <em>Interlude</em>, that (6) is a geometric sum:</p>
\[\frac{1}{N} \frac{1 - e^{2 \pi i \delta}}{1 - e^{2 \pi i \delta / N}}\]
<p>Thus:</p>
\[p(m = b) = \frac{1}{N^2} \frac{\abs{1 - e^{2 \pi i \delta}}^2}{\abs{1 - e^{2 \pi i \delta / N}}^2}\]
<p>We have that $\abs{1 - e^{2ix}}^2 = 4 \abs{\sin x}^2$ (see <em>Appendix</em>), so</p>
\[p(m = b) = \frac{1}{N^2} \frac{\abs{\sin (\pi \delta)}^2}{\abs{\sin (\pi \delta / N)}^2}\]
<p>Since $\delta < 1$ and assuming $t > 0$, $\delta / N \le 1/2$ (recalling $N = 2^t$), and using $\sin x \le x$ for $x \ge 0$ (see <em>Appendix</em>),</p>
\[p(m = b) \ge \frac{1}{N^2} \frac{\abs{\sin (\pi \delta)}^2}{(\pi \delta / N)^2} = \frac{\abs{\sin (\pi \delta)}^2}{(\pi \delta)^2}\]
<p>Finally, using that $2 x \le \sin(\pi x)$ for $x \le 1/2$ (see [7]), which requires $\delta \le 1/2$ (i.e. that $b$ is the integer nearest to $\phi$; when $\delta > 1/2$ the same argument applies to $b + 1$), we have</p>
\[p(m = b) \ge \frac{\abs{2 \delta}^2}{(\pi \delta)^2} = \frac{4}{\pi^2} \approx 40\%\]
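<p>As a sanity check on this bound, we can evaluate the closed-form probability $\frac{1}{N^2} \frac{\abs{\sin (\pi \delta)}^2}{\abs{\sin (\pi \delta / N)}^2}$ numerically and confirm it stays above $4/\pi^2$ for offsets $\delta$ up to $1/2$:</p>

```python
import math

def p_measure_b(t, delta):
    """Probability of measuring the nearest state b given offset delta,
    using the closed form derived above, with N = 2^t."""
    if delta == 0:
        return 1.0
    N = 2 ** t
    return (math.sin(math.pi * delta) ** 2
            / (N * math.sin(math.pi * delta / N)) ** 2)

# the 4/pi^2 ~ 40% lower bound holds for delta up to 1/2
assert all(p_measure_b(6, d) >= 4 / math.pi ** 2
           for d in (0.0, 0.1, 0.25, 0.4, 0.5))
```
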
<p>The interesting thing is that this does not depend on the number of qubits used for $b$. We can increase the probability by trading off accuracy, if for example $b \pm 1$ would still be a good approximation to $\phi$. More generally, we define some error $\xi > 1$ and now want to bound the probability that $\abs{m - b} > \xi$. In [1] the authors prove that:</p>
\[p(\abs{m - b} > \xi) \le \frac{1}{2(\xi - 1)}\]
<h2 id="entangled-eigenvector">Entangled Eigenvector</h2>
<p>So far we assumed the eigenvector $\ket{u}$ is in some computational base state. In practice it could be in an entangled state $\alpha_1 \ket{u_1} + \cdots + \alpha_m \ket{u_m}$ with corresponding eigenvalues $\varphi_1, \cdots, \varphi_m$. This would add another factor of uncertainty, because a given $\varphi_i$ would have probability $\abs{\alpha_i}^2$ of being measured.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In this post we learned how to efficiently find the eigenvalue of a unitary matrix given one of its eigenvectors, using a quantum circuit. The algorithm is both approximate (but so is any classical computation dealing with real numbers) and probabilistic, and we can improve on both fronts by using more qubits, at the expense of more quantum gates and hence complexity.</p>
<h2 id="related-posts">Related Posts</h2>
<ul>
<li><a href="https://www.kuniga.me/blog/2014/11/24/the-pagerank-algorithm.html">The PageRank algorithm</a> also has to do with computing an eigenvalue and eigenvector. It made me wonder if a quantum PageRank algorithm would make sense. Paparo and Martin-Delgado wrote a paper, <a href="https://arxiv.org/abs/1112.2079">Google in a Quantum Network</a>, which I haven’t read, but from skimming the conclusion it seems to be promising based on initial studies for small networks.</li>
</ul>
<h2 id="appendix">Appendix</h2>
<p><strong>Lemma 1.</strong> $\abs{1 - e^{2ix}}^2 = 4 \abs{\sin x}^2$</p>
<p><em>Proof.</em> By Euler’s formula $e^{2ix} = \cos 2x + i \sin 2x$. We can use common trigonometry identities such as $\sin 2x = 2 \sin x \cos x$, $\cos 2x = 1 - 2\sin^2 x$, to say</p>
\[1 - e^{2ix} = 1 - (1 - 2\sin^2 x + i 2 \sin x \cos x) = 2\sin^2 x - i 2 \sin x \cos x\]
<p>Given a complex number $a + ib$, $\abs{a + ib}^2 = a^2 + b^2$, so</p>
<p>\(\abs{1 - e^{2ix}}^2 = 4 \sin^4 x + 4 \sin^2 x \cos^2 x = 4 \sin^2 x(\sin^2 x + \cos^2 x) = 4\sin^2 x = 4 \abs{\sin x}^2\). <em>QED</em></p>
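<p>As a quick numerical sanity check of this identity (our own addition, not part of the proof):</p>

```python
import cmath
import math

# Verify |1 - e^{2ix}|^2 == 4 sin^2(x) for a few sample values of x
for x in [0.1, 0.5, 1.0, 2.0, 3.0]:
    lhs = abs(1 - cmath.exp(2j * x)) ** 2
    rhs = 4 * math.sin(x) ** 2
    assert abs(lhs - rhs) < 1e-12
```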
<p><strong>Lemma 2.</strong> $\sin x \le x$ for $x \le \pi/2$</p>
<p><em>Pseudo-proof.</em> We start with $x = 0$, for which $\sin x = x$. Now consider $x > 0$. The rate of change of both functions are $\frac{d (\sin x)}{dx} = \cos x$ and $\frac{dx}{dx} = 1$, respectively. Since $\cos (x) \le 1$, the rate of change of $f(x) = \sin x$ is never greater than that of $f(x) = x$. Both functions are equal at $x = 0$, so for larger values of $x$, $\sin x$ will never become greater than $x$. We can use a similar argument for $x < 0$.</p>
<p>Unfortunately this “proof” is not sound because the derivative of $\sin x$ relies on $\lim_{x \rightarrow 0} \frac{\sin x}{x} = 1$, which in turn assumes $\sin x < x$, a circular argument [5]. Freeman [6] proposes a much simpler geometric proof.</p>
<h2 id="references">References</h2>
<ul>
<li>[<a href="https://www.amazon.com/Quantum-Computation-Information-10th-Anniversary/dp/1107002176">1</a>] Quantum Computation and Quantum Information - Nielsen, M. and Chuang, I.</li>
<li><a href="https://www.kuniga.me/blog/2020/10/11/deutsch-jozsa-algorithm.html">[2]</a> NP-Incompleteness: The Deutsch-Jozsa Algorithm</li>
<li><a href="https://www.kuniga.me/blog/2020/11/21/quantum-fourier-transform.html">[3]</a> NP-Incompleteness: Quantum Fourier Transform</li>
<li>[<a href="https://en.wikipedia.org/wiki/Quantum_phase_estimation_algorithm#Phase_approximation_representation">4</a>] Wikipedia: Quantum phase estimation algorithm</li>
<li>[<a href="https://math.stackexchange.com/questions/125298/how-to-strictly-prove-sin-xx-for-0x-frac-pi2">5</a>] Math StackExchange: How to strictly prove $\sin x < x$ for $0 < x < \pi/2$</li>
<li>[<a href="http://mathrefresher.blogspot.com/2006/08/sin-x-x-tan-x-for-x-in-02.html">6</a>] Math Refresher: $\sin x < x < \tan x$ for $x \in (0, \pi/2)$</li>
<li><a href="https://math.stackexchange.com/questions/596634/mean-value-theorem-frac2-pi-frac-sin-xx1">[7]</a> Math StackExchange - Mean Value Theorem: $\frac{2}{\pi} \le \frac{\sin x}{x} \le 1$</li>
</ul>Guilherme KunigamiGiven a unitary matrix $U$ with eigenvector $\ket{u}$, we want to estimate $\varphi$ where $e^{2 \pi i \varphi}$ is the eigenvalue of $U$. This serves as a framework for solving a variety of problems including order finding, which as we have shown in a recent post, can be used to efficiently factorize a number. We assume basic familiarity with quantum computing, covered in a previous post, plus we’ll use quantum Fourier transform (QFT) in one of the steps.Number Factorization from Order-Finding2020-12-11T00:00:00+00:002020-12-11T00:00:00+00:00https://www.kuniga.me/blog/2020/12/11/factorization-from-order<p>Given integers $x$, $N$, the problem of <em>order-finding</em> consists in finding the smallest positive number $r$ such that $x^r \equiv 1 \Mod{N}$, where $r$ is called the <em>order of</em> $x \Mod{N}$.</p>
<p>In this post we’ll show that if we know how to compute the order of $x \Mod{N}$, we can use it to get a probabilistic algorithm for finding a non-trivial factor of a number $N$.</p>
<p>The motivation is that this is a crucial step in Shor’s quantum factorization, but only relies on classic number theory.</p>
<!--more-->
<h2 id="definitions">Definitions</h2>
<p>In this section we define a bunch of terminology, most of which the reader might already be familiar with. Feel free to skip ahead and refer to this when seeing them later.</p>
<p><strong>Prime factorization.</strong> Given a positive number $N$, the prime factorization of $N$ is a set of distinct prime factors $p_1, p_2, \cdots, p_m$ and positive exponents $\alpha_1, \alpha_2, \cdots, \alpha_m$ such that $N = p_1^{\alpha_1} p_2^{\alpha_2} \cdots p_m^{\alpha_m}$. For convenience we assume $p_1 < p_2 < \cdots < p_m$. It’s possible to show that any integer larger than 1 can be uniquely represented by its prime factorization. Example: $600$ can be uniquely represented by $2^3 3^1 5^2$.</p>
<p><strong>Divisibility.</strong> We say a positive integer $x$ divides $y$, denoted as $x \mid y$, if there’s a positive integer $k$ such that $y = kx$. Otherwise, we denote it as $x \nmid y$, and there are an integer $k \ge 0$ and a positive integer $c < x$ such that $y = kx + c$.</p>
<p><strong>Greatest Common Divisor.</strong> The greatest common divisor of two integers $x$ and $y$ is the largest integer that divides both $x$ and $y$ and is denoted by $\gcd(x, y)$. It can be computed in $O(\min(\log(x), \log(y)))$.</p>
<p><strong>Co-primality.</strong> Given two positive integers $x$ and $y$, we say $x$ and $y$ are co-prime if they don’t share any prime factors, or $\gcd(x, y) = 1$. For example, 9 and 10 are co-prime, but 10 and 12 are not since they share the prime factor 2.</p>
<p><strong>Set of co-primes.</strong> Given an integer $N$, we define $Z_N = \curly{1, \cdots, N}$ and $Z_N^{*}$ as the elements in $Z_N$ that are co-prime with $N$. For example, if $N = 10$, $Z^{*}_N = \curly{1, 3, 7, 9}$.</p>
<p><strong>Euler $\varphi$ function.</strong> $\varphi(N)$ is defined as the number of co-primes of $N$ less than $N$. Note that $\abs{Z_N^{*}} = \varphi(N)$.</p>
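<p>These definitions translate directly into (naive, illustrative) code:</p>

```python
from math import gcd

def coprimes(N):
    """Z_N^*: the elements of {1, ..., N} that are co-prime with N."""
    return [x for x in range(1, N + 1) if gcd(x, N) == 1]

def phi(N):
    """Euler's phi function: the number of co-primes of N."""
    return len(coprimes(N))

print(coprimes(10))  # [1, 3, 7, 9]
print(phi(10))       # 4
```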
<h2 id="theory">Theory</h2>
<p>We’ll now state a few Theorems from which we’ll build the prime factoring algorithm. Their proofs are described in the <em>Appendix</em>.</p>
<p><strong>Theorem 1.</strong> Given co-primes $x$, $N$ and $r$ the order of $x \Mod{N}$, then $r \le N$.</p>
<p><strong>Theorem 2.</strong> Let $N$ be a non-prime number and $1 \le x \le N$ a non-trivial solution to $x^2 \equiv 1 \Mod{N}$ (by non-trivial we mean $x \not \equiv \pm 1 \Mod{N}$), then at least one of $\gcd (x - 1, N)$ or $\gcd (x + 1, N)$ is a non-trivial factor of $N$.</p>
<p><strong>Theorem 3.</strong> Let $N$ be an odd non-prime positive integer with prime factorization $N = p_1^{\alpha_1} p_2^{\alpha_2} \cdots p_m^{\alpha_m}$. Let $x$ be an element chosen at random from $Z_N^{*}$ and $r$ the order of $x \Mod{N}$. Then</p>
\[p(r \mbox{ is even and } x^{r/2} \not \equiv -1 \Mod{N}) \ge 1 - \frac{1}{2^m}\]
<h2 id="prime-factoring-algorithm">Prime Factoring Algorithm</h2>
<p><em>Theorem 3</em> seems highly specific but combined with <em>Theorem 2</em>, it allows us to find a factor of $N$. To see how, suppose $r$ is even and $x^{r/2} \not \equiv -1 \Mod{N}$, which happens with probability at least $1 - \frac{1}{2^m}$. Let $y = x^{r/2}$, so $y \not \equiv -1 \Mod{N}$. We also have $y \not \equiv 1 \Mod{N}$, since otherwise $r/2$ would be the order of $x \Mod{N}$. This means that by <em>Theorem 2</em>, $\gcd (y - 1, N)$ or $\gcd (y + 1, N)$ is a non-trivial factor of $N$.</p>
<p>We can now define the algorithm to obtain a non-trivial factor of $N$. Here’s a simple Python implementation:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">math</span> <span class="kn">import</span> <span class="n">gcd</span>
<span class="kn">import</span> <span class="nn">random</span>
<span class="k">def</span> <span class="nf">get_factor</span><span class="p">(</span><span class="n">N</span><span class="p">):</span>
<span class="k">if</span> <span class="n">N</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="k">return</span> <span class="mi">2</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">random</span><span class="p">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">N</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="c1"># inclusive
</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">gcd</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">N</span><span class="p">)</span>
<span class="k">if</span> <span class="n">f</span> <span class="o">></span> <span class="mi">1</span><span class="p">:</span>
<span class="k">return</span> <span class="n">f</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">order</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">N</span><span class="p">)</span>
<span class="k">if</span> <span class="n">r</span> <span class="o">%</span> <span class="mi">2</span> <span class="o">!=</span> <span class="mi">0</span> <span class="ow">or</span> <span class="n">mod_exp</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">r</span><span class="o">//</span><span class="mi">2</span><span class="p">,</span> <span class="n">N</span><span class="p">)</span> <span class="o">==</span> <span class="n">N</span> <span class="o">-</span> <span class="mi">1</span><span class="p">:</span>
<span class="k">return</span> <span class="bp">None</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">mod_exp</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">r</span> <span class="o">//</span> <span class="mi">2</span><span class="p">,</span> <span class="n">N</span><span class="p">)</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">gcd</span><span class="p">(</span><span class="n">y</span> <span class="o">-</span> <span class="mi">1</span><span class="p">,</span> <span class="n">N</span><span class="p">)</span>
<span class="k">if</span> <span class="n">f</span> <span class="o">></span> <span class="mi">1</span><span class="p">:</span>
<span class="k">return</span> <span class="n">f</span>
<span class="k">return</span> <span class="n">gcd</span><span class="p">(</span><span class="n">y</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">N</span><span class="p">)</span></code></pre></figure>
<p><code class="language-plaintext highlighter-rouge">mod_exp(b, e, N)</code> computes $b^e \Mod{N}$. <code class="language-plaintext highlighter-rouge">order(x, N)</code> returns the smallest $r$ such that $x^r \equiv 1 \Mod{N}$.</p>
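<p>Neither helper is shown in the post. In Python, <code class="language-plaintext highlighter-rouge">mod_exp</code> is just the built-in three-argument <code class="language-plaintext highlighter-rouge">pow</code>, but a sketch using repeated squaring could look like this:</p>

```python
def mod_exp(b, e, N):
    """Compute b^e mod N via repeated squaring: O(log e) multiplications."""
    result = 1
    b %= N
    while e > 0:
        if e & 1:             # current bit of the exponent is 1
            result = result * b % N
        b = b * b % N         # square the base at each step
        e >>= 1
    return result

print(mod_exp(4, 13, 497))  # 445, same as pow(4, 13, 497)
```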
<h3 id="refining">Refining</h3>
<p>If $N$’s prime factorization is $N = p_1^{\alpha_1}$ for $\alpha_1 > 1$, then $m = 1$ and the probability lower bound is only $0.5$.</p>
<p>We can detect when that’s the case with the following algorithm: for each exponent $e$ starting from $e = 2$, we find the largest value $b$ such that $b^e \le N$ via binary search. We stop looking for exponents when $2^{e} > N$, i.e. when our binary search returns 1.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">bin_search</span><span class="p">(</span><span class="n">f</span><span class="p">):</span>
<span class="n">x</span> <span class="o">=</span> <span class="mi">1</span>
<span class="k">while</span> <span class="n">f</span><span class="p">(</span><span class="n">x</span> <span class="o"><<</span> <span class="mi">1</span><span class="p">)</span> <span class="o"><=</span> <span class="mi">0</span><span class="p">:</span>
<span class="n">x</span> <span class="o"><<=</span> <span class="mi">1</span>
<span class="n">p2</span> <span class="o">=</span> <span class="n">x</span>
<span class="k">while</span> <span class="n">p2</span> <span class="o">></span> <span class="mi">1</span><span class="p">:</span>
<span class="n">p2</span> <span class="o">>>=</span> <span class="mi">1</span>
<span class="k">if</span> <span class="n">f</span><span class="p">(</span><span class="n">x</span> <span class="o">+</span> <span class="n">p2</span><span class="p">)</span> <span class="o"><=</span> <span class="mi">0</span><span class="p">:</span>
<span class="n">x</span> <span class="o">+=</span> <span class="n">p2</span>
<span class="k">return</span> <span class="n">x</span>
<span class="k">def</span> <span class="nf">get_single_base</span><span class="p">(</span><span class="n">N</span><span class="p">):</span>
<span class="n">e</span> <span class="o">=</span> <span class="mi">2</span>
<span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
<span class="n">f</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">b</span><span class="p">:</span> <span class="n">b</span><span class="o">**</span><span class="n">e</span> <span class="o">-</span> <span class="n">N</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">bin_search</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
<span class="k">if</span> <span class="n">b</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
<span class="k">break</span>
<span class="k">if</span> <span class="n">f</span><span class="p">(</span><span class="n">b</span><span class="p">)</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="k">return</span> <span class="n">b</span>
<span class="n">e</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="k">return</span> <span class="bp">None</span></code></pre></figure>
<p>In the code above <code class="language-plaintext highlighter-rouge">bin_search()</code> finds $b$ by constructing its bits from the most to the least significant, so its complexity is $O(\log N)$. Python implements the power function using repeated squaring, so <code class="language-plaintext highlighter-rouge">f()</code> is also $O(\log N)$. Finally, we stop when $2^{e} > N$, so $e \le \log N$. This leads to a total complexity of $O(\log^3 N)$.</p>
<p>We now know how to determine $b$ when $N = b^e$, for integers $b > 1$ and $e > 1$. In that case $b$ is a non-trivial factor. Otherwise we know $N$’s prime factorization has at least 2 distinct prime factors, so $m > 1$ and the probability lower bound is now $0.75$.</p>
<h2 id="complexity">Complexity</h2>
<p>Let’s analyze the complexity of <code class="language-plaintext highlighter-rouge">get_factor()</code>. <code class="language-plaintext highlighter-rouge">gcd(a, b)</code> can be implemented in $O(\log \min(a, b))$. If we include the <code class="language-plaintext highlighter-rouge">get_single_base()</code> check discussed above, it adds an $O(\log^3 N)$ component.</p>
<p>However, the dominant cost of the function is <code class="language-plaintext highlighter-rouge">order(x, N)</code>. We know from <em>Theorem 1</em> that $r \le N$. A brute-force approach consists in trying all possibilities, which leads to an $O(N)$ algorithm (or rather $O(N \log N)$ if we account for the arithmetic operations):</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">order</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">N</span><span class="p">):</span>
<span class="n">m</span> <span class="o">=</span> <span class="n">x</span>
<span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">N</span> <span class="o">+</span> <span class="mi">1</span><span class="p">):</span>
<span class="k">if</span> <span class="n">m</span> <span class="o">%</span> <span class="n">N</span> <span class="o">==</span> <span class="mi">1</span><span class="p">:</span>
<span class="k">return</span> <span class="n">r</span>
<span class="n">m</span> <span class="o">=</span> <span class="n">m</span> <span class="o">*</span> <span class="n">x</span> <span class="o">%</span> <span class="n">N</span></code></pre></figure>
<p>Being able to solve <code class="language-plaintext highlighter-rouge">order(x, N)</code> efficiently is the secret ingredient behind Shor’s factorization, but it requires resorting to quantum computing. We won’t discuss it here, but we now have the background and motivation from the perspective of classic number theory.</p>
<h2 id="experiments">Experiments</h2>
<p>Setting the running time aside, let’s not forget the algorithm we described is probabilistic. The refined version of <code class="language-plaintext highlighter-rouge">get_factor()</code> provides a lower bound of $0.75$, but how accurate is it in practice?</p>
<p>If we run it for the first 5,000 odd non-prime numbers, we get ~75% accuracy on average. If we exclude numbers of the form $N = a^b$, we get 77% accuracy, only slightly better.</p>
<p>One thing we can do is to repeat the algorithm $k$ times or until it finds a factor. If the probability of one run is $p$, and assuming each run is <a href="https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables">iid</a> then the resulting probability should be $1 - (1 - p)^k$.</p>
<p>If we run for $k=5$ for example, the accuracy is 99.6%. It’s thus possible to get pretty good accuracies with a small number of repetitions.</p>
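<p>As a worked example, with $p = 0.75$ per run the compound probability grows quickly with $k$:</p>

```python
# Probability that at least one of k iid runs succeeds, with p = 0.75 per run
p = 0.75
for k in [1, 2, 5]:
    print(k, 1 - (1 - p) ** k)
# 1 0.75
# 2 0.9375
# 5 0.9990234375
```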
<h2 id="conclusion">Conclusion</h2>
<p>I really like number theory and it was fun to study the reduction from the prime factoring to the order finding, so much so that I decided to post about this before actually writing on how to solve the order finding problem using quantum computation.</p>
<p>I’m wondering what the best classic algorithm for solving order finding is, including probabilistic ones. We do have very efficient probabilistic algorithms for detecting primes.</p>
<p>I recall studying some of this modular arithmetic and its properties in college, probably in the context of cryptography classes.</p>
<h2 id="appendix">Appendix</h2>
<p><strong>Theorem 1.</strong> Given co-prime integers $x$, $N$, the order $r$ of $x \Mod{N}$ satisfies $r \le N$.</p>
<p><em>Proof.</em> We know that $x^i \Mod{N} \in \curly{1, \cdots, N - 1}$ for any positive integer $i$. It follows from the pigeonhole principle that there must exist $i < j \le N + 1$ such that $x^i \equiv x^j \Mod{N}$. To see why, note there are at most $N - 1$ possible outcomes for $x^i \Mod{N}$, so among the first $N + 1$ values of $i$ there must be a repeated value.</p>
<p>Since $j > i$, there is some $r > 1$ such that $j = i + r$, thus</p>
\[x^j = x^r x^i\]
<p>and</p>
\[x^i \equiv x^r x^i \Mod{N}\]
<p>Since \(x^i \Mod{N} > 0\), this implies $x^r \equiv 1 \Mod{N}$. Since $j \le N + 1$ and $i \ge 1$, $r \le N$. <em>QED</em></p>
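<p>A brute-force check of the theorem for small values (our own illustration, using a naive <code class="language-plaintext highlighter-rouge">order()</code> like the one in the <em>Complexity</em> section):</p>

```python
from math import gcd

def order(x, N):
    """Smallest r >= 1 such that x^r == 1 (mod N); assumes gcd(x, N) == 1."""
    m, r = x % N, 1
    while m != 1:
        m, r = m * x % N, r + 1
    return r

assert order(3, 10) == 4  # powers of 3 mod 10: 3, 9, 7, 1
# Theorem 1: the order never exceeds N
assert all(order(x, N) <= N
           for N in range(3, 60)
           for x in range(2, N)
           if gcd(x, N) == 1)
```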
<p><strong>Theorem 2.</strong> Let $N$ be a non-prime number and $1 \le x \le N$ a non-trivial solution to $x^2 \equiv 1 \Mod{N}$ (by non-trivial we mean $x \not \equiv \pm 1 \Mod{N}$), then at least one of $\gcd (x - 1, N)$ or $\gcd (x + 1, N)$ is a non-trivial factor of $N$.</p>
<p><em>Proof.</em> Assuming $x^2 \equiv 1 \Mod{N}$, then $N \mid x^2 - 1 = (x + 1)(x - 1)$, so $N$ must have a common factor with at least one of $(x + 1)$ or $(x - 1)$.</p>
<p>Since $x \not \equiv \pm 1 \Mod{N}$, then $x \neq 1$ and $x \neq N-1$. Which implies $0 < x - 1$ and $x + 1 < N$, hence the common factor is not $N$. Then at least one of $\gcd (x - 1, N)$ or $\gcd (x + 1, N)$ is a non-trivial factor of $N$. <em>QED</em></p>
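<p>A concrete example of the theorem (our own): for $N = 15$, $x = 4$ is a non-trivial solution, since $4^2 = 16 \equiv 1 \Mod{15}$ and $4 \not \equiv \pm 1 \Mod{15}$:</p>

```python
from math import gcd

N, x = 15, 4
assert (x * x) % N == 1          # x is a solution to x^2 == 1 (mod N)
assert x % N not in (1, N - 1)   # and a non-trivial one
print(gcd(x - 1, N), gcd(x + 1, N))  # 3 5: here both are non-trivial factors
```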
<p>The proof of theorem 3 is much more involved, so let’s introduce some helpers.</p>
<p><strong>Chinese Remainder Theorem.</strong> Let $b_1, \cdots, b_n$ be a set of pairwise co-prime integers and $a_1, \cdots, a_n$ integers, where $0 \le a_i < b_i$. Let $N = b_1 \cdots b_n$.</p>
<p>Then there is exactly one $0 \le x < N$ that satisfies $x \equiv a_i \Mod{b_i}$ for every $i$.</p>
<p><em>Proof.</em> Not included here. Refer to [1], Theorem A4.16 (p629).</p>
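<p>A brute-force illustration of the statement (ours): with moduli $b = (3, 5)$ and remainders $a = (2, 3)$, there is exactly one solution in $\curly{0, \cdots, 14}$:</p>

```python
b = [3, 5]  # pairwise co-prime moduli
a = [2, 3]  # target remainders
N = b[0] * b[1]
solutions = [x for x in range(N)
             if all(x % bi == ai for ai, bi in zip(a, b))]
print(solutions)  # [8]: exactly one solution, as the theorem guarantees
```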
<p><strong>Lemma 3.1</strong> Let $N$ be a positive integer $N$ with prime factors $N = p_1^{\alpha_1} p_2^{\alpha_2} \cdots p_m^{\alpha_m}$. Let $x$ be an element chosen at random from $Z_N^{*}$. This is equivalent to picking $x_1, x_2, \cdots, x_m$ at random from $Z_{p_1^{\alpha_1}}^{*}, Z_{p_2^{\alpha_2}}^{*}, \cdots, Z_{p_m^{\alpha_m}}^{*}$, respectively.</p>
<p><em>Proof.</em> We just need to show that there’s a one-to-one mapping between $x$ and ($x_1, x_2, \cdots x_m$).</p>
<p>$\rightarrow$ If we define $x_i$ as the remainder of $x$ divided by $p_i^{\alpha_i}$ for every $i$, then this is a unique map from $x$ to ($x_1, x_2, \cdots x_m$), but we need to prove that if $x \in Z_N^{*}$ then $x_i \in Z_{p_i^{\alpha_i}}^{*}$.</p>
<p>Since $x$ is co-prime with $N$, then $x$ is co-prime with $p_i^{\alpha_i}$ (since it has a subset of factors of $N$). We claim that $x_i$ is also co-prime with $p_i^{\alpha_i}$. Otherwise, since $x = k p_i^{\alpha_i} + x_i$ for some integer $k$, if $x_i$ is not co-prime with $p_i^{\alpha_i}$, then they share at least one factor $p_i$, so $x_i = p_i \alpha$, thus $x = p_i (k p_i^{\alpha_i - 1} + \alpha)$, which implies $x$ is not co-prime with $p_i^{\alpha_i}$, a contradiction.</p>
<p>$\leftarrow$ Assume now we have $x_i \in Z_{p_i^{\alpha_i}}^{*}$. Since $p_1^{\alpha_1}$, $p_2^{\alpha_2}$ and $p_m^{\alpha_m}$ are pairwise co-prime, we can use the <em>Chinese Remainder Theorem</em>
to show there’s exactly one solution $0 \le x < N$ to $x \equiv x_i \Mod{p_i^{\alpha_i}}$ for every $i$.</p>
<p>To show $x \in Z_N^{*}$ it remains to show $x$ and $N$ are co-prime. Suppose it’s not. Then it shares a factor $p_j$ with $N$ for some $j$, but then since it holds that $x \equiv x_j \Mod{p_j^{\alpha_j}}$, then $x = p_j \alpha = k p_j^{\alpha_j} + x_j$, which means $x_j = p_j(\alpha - k p_j^{\alpha_j - 1})$, so $x_j$ is not co-prime with $p_j^{\alpha_j}$, contradicting the hypothesis that $x_j \in Z_{p_j^{\alpha_j}}^{*}$. <em>QED</em></p>
<p><strong>Lemma 3.2</strong> Let $a$ and $N$ be co-primes. Then $a^{\varphi(N)} \equiv 1 \Mod{N}$</p>
<p><em>Proof.</em> Not included here. Refer to [1], Theorem A4.9 (p631).</p>
<p><strong>Lemma 3.3</strong> Let $r$ be the order of $x \Mod{N}$ for co-primes $x$ and $N$. Let $r’$ be such that $x^{r’} \equiv 1 \Mod{N}$. Then $r$ divides $r’$.</p>
<p><em>Proof.</em> If $r = r’$, this is trivially true, so consider the case where $r’ > r$ (by definition $r$ cannot be bigger than $r’$). Let’s now assume $r \nmid r’$, so there’s $k > 0$ (since $r < r’$) and $0 < \alpha < r$ such that $r’ = kr + \alpha$.</p>
<p>Then $x^{r’} \equiv x^{kr} x^{\alpha} \equiv 1 \Mod{N}$. We know $x^{r} \equiv 1 \Mod{N}$ and so is $(x^{r})^k \equiv x^{rk} \equiv 1 \Mod{N}$. But this means $x^{\alpha} \equiv 1 \Mod{N}$ with $\alpha < r$, which is a contradiction that $r$ is minimal. <em>QED</em></p>
<p><strong>Lemma 3.4</strong> Let $r$ be the order of $x \Mod{N}$ for co-primes $x$ and $N$. Then $r$ divides $\varphi(N)$</p>
<p><em>Proof.</em> This follows from <em>Lemma 3.2</em>, which states $x^{\varphi(N)} \equiv 1 \Mod{N}$, which allows us to use <em>Lemma 3.3</em>, with $r’ = \varphi(N)$, to conclude that $r$ divides $\varphi(N)$. <em>QED</em></p>
<p><strong>Definition 3</strong> From now until Theorem 3, we’ll assume that $x \in Z_N^{*}$, $x_i \in Z_{p_i^{\alpha_i}}^{*}$ and that $x \equiv x_i \Mod{p_i^{\alpha_i}}$. Furthermore, we’ll assume $r$ is the order of $x \Mod{N}$, and $r_i$ the order of $x_i \Mod{p_i^{\alpha_i}}$.</p>
<p><strong>Lemma 3.5</strong> $r_i \mid r$ for every $i$</p>
<p><em>Proof.</em> We have $x \equiv x_i \Mod{p_i^{\alpha_i}}$, which holds if we raise both to a power $k$. In particular $x^{r} \equiv x_i^{r} \equiv 1 \Mod{p_i^{\alpha_i}}$. We can use <em>Lemma 3.3</em> for $x_i$, $r_i$ and $r$, to show $r_i$ divides $r$.</p>
<p><strong>Lemma 3.6</strong> Let $d_i$ be the largest exponent such that $2^{d_i}$ divides $r_i$. If $r$ is odd or $x^{r/2} \equiv -1 \mod{N}$ then $d_i$ is the same for any $i$.</p>
<p><em>Proof.</em> Let’s consider the case where $r$ is odd. This implies each $r_i$ must be odd too (since $r_i \mid r$), and the largest power of two that divides it is $2^0 = 1$, hence $d_i = 0$ for all $i$.</p>
<p>Let’s consider the case where $r$ is even and $x^{r/2} \equiv -1 \mod{N}$.</p>
<p>This means $x^{r/2} + 1 = k N$ for some integer $k$. Since $p_i^{\alpha_i}$ is a factor of $N$ for any $i$, $x^{r/2} + 1 = k’ p_i^{\alpha_i}$, where $k’ = k (N/p_i^{\alpha_i})$ is an integer. Thus $x^{r/2} \equiv -1 \mod{p_i^{\alpha_i}}$. Similar to a previous argument, since $x \equiv x_i \Mod{p_i^{\alpha_i}}$, we have $x^{r/2} \equiv x_i^{r/2} \equiv - 1 \Mod{p_i^{\alpha_i}}$.</p>
<p>Now suppose that $r_i$ divides $r/2$, so $r/2 = k r_i$, so $x_i^{r/2} \equiv x_i^{r_i k} \equiv - 1 \Mod{p_i^{\alpha_i}}$, but $x_i^{r_i} \equiv (x_i^{r_i})^k \equiv x_i^{r_i k} \equiv 1 \Mod{p_i^{\alpha_i}}$ which is a contradiction, so it must be $r_i \nmid r/2$.</p>
<p>Let $d$ be the largest exponent such that $2^{d}$ divides $r$. From <em>Lemma 3.5</em> we have $r_i \mid r$, so $2^{d_i} \le 2^{d}$. If $d_i < d$, then $r_i$ would divide $r/2$ (its odd part divides that of $r$ and $2^{d_i}$ divides $2^{d-1}$), contradicting $r_i \nmid r/2$. It must be that $2^{d_i} = 2^{d}$.</p>
<p>We just proved, for all $i$, that $d_i = 0$ if $r$ is odd and $d_i = d$ if $x^{r/2} \equiv -1 \mod{N}$. <em>QED</em></p>
<p><strong>Cyclic Group Theorem</strong> A group $Z_N^{*}$ is called <em>cyclic</em> if there’s $g \in Z_N^{*}$ such that for any element $x \in Z_N^{*}$, $x \equiv g^k \Mod{N}$ for some $k \ge 0$. If $N = p^\alpha$ for some odd prime $p$ and positive integer $\alpha$, then $Z_{p^\alpha}^{*}$ is cyclic.</p>
<p><em>Proof.</em> Not included here. This is also not included in [1].</p>
<p><strong>Lemma 3.7.</strong> Suppose $g$ is a generator for $Z_{p^\alpha}^{*}$ and $r$ the order of $g$ $\Mod{p^\alpha}$. Then $r = \abs{Z_{p^\alpha}^{*}} = \varphi(p^\alpha)$.</p>
<p><em>Proof.</em> We first prove that every $x \in Z_{p^\alpha}^{*}$ can be expressed as $x \equiv g^{i} \Mod{p^\alpha}$ for $0 \le i \le r - 1$. From the <em>Cyclic Group Theorem</em>, there is $k \ge 0$ such $x \equiv g^k \Mod{p^\alpha}$. Let $k’$ be the smallest such $k$. If $k’ > r$, then there is $0 < \delta < k’$ such that $k’ = r + \delta$, and $g^{k’} \equiv g^{r} g^{\delta} \Mod{p^\alpha}$, which implies $g^{k’} \equiv g^{\delta} \Mod{p^\alpha}$ which contradicts the fact $k’$ is minimal, thus $k’ \le r$. We also know that $k’ \neq r$ because $g^0 \equiv g^r \equiv 1 \Mod{p^\alpha}$.</p>
<p>What we conclude here is that there are $\abs{Z_{p^\alpha}^{*}} = \varphi(p^\alpha)$ distinct elements in $Z_{p^\alpha}^{*}$ and they all can be expressed with exponents $0 \le k \le r - 1$, which gives a lower bound $r \ge \varphi(p^\alpha)$.</p>
<p>Now consider the set $S$ of $x \equiv g^{i} \Mod{p^\alpha}$ for $0 \le i \le r - 1$. Let $i$ and $j$ represent the exponents of two elements in $S$. We claim that if $g^{i} \equiv g^{j} \Mod{p^\alpha}$ then $i = j$. Suppose not, that there’s $i < j$ such that $g^{i} \equiv g^{j} \Mod{p^\alpha}$. Then $j = i + \delta$, for $0 < \delta < r$, and since $g^i \equiv g^i g^{\delta} \Mod{p^\alpha}$, we get $g^\delta \equiv 1 \Mod{p^\alpha}$, which contradicts the definition of $r$. This implies that every element $g^{i} \Mod{p^\alpha}$ for $0 \le i \le r - 1$ is unique, so the size of $S$ is exactly $r$.</p>
<p>We also note that every element of $S$ is in $Z_{p^\alpha}^{*}$, so $S$ is a subset of it and thus $r = \abs{S} \le \abs{Z_{p^\alpha}^{*}} = \varphi(p^\alpha)$, which is an upper bound for $r$.</p>
<p>Combining the lower bound and upper bound of $r$, we conclude it has to be exactly $ \varphi(p^\alpha)$. <em>QED</em></p>
<p><strong>Lemma 3.8</strong> For a prime $p$ and integer $\alpha$, $\varphi(p^\alpha) = p^{\alpha - 1}(p - 1)$.</p>
<p><em>Proof.</em> We start by noting that $\varphi(p) = p - 1$ since no number smaller than $p$ has a common prime factor with $p$. For $p^\alpha$, the only numbers smaller than it that share a prime factor with it must be multiples of $p$, that is, $p k$ for $k = 1, \cdots, p^{\alpha - 1} - 1$. So the number of co-primes of $p^\alpha$ is $p^\alpha - 1$ minus $p^{\alpha - 1} - 1$, so $\varphi(p^\alpha) = p^{\alpha - 1}(p - 1)$. <em>QED</em></p>
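<p>We can cross-check the formula against a direct count:</p>

```python
from math import gcd

# Compare the direct count of co-primes of p^alpha with p^(alpha-1) * (p-1)
for p, alpha in [(3, 2), (5, 3), (7, 1)]:
    n = p ** alpha
    count = sum(1 for x in range(1, n + 1) if gcd(x, n) == 1)
    assert count == p ** (alpha - 1) * (p - 1)
print("formula verified")  # e.g. phi(9) = 6 = 3^1 * 2
```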
<p><strong>Lemma 3.9</strong> Let $p$ be an odd prime and $2^d$ the largest power of 2 dividing $\varphi(p^\alpha)$. Let $r$ be the order of a randomly chosen element $x$ from $Z_{p^\alpha}^{*}$. Then the probability that $2^d$ divides $r$ is $1/2$.</p>
<p><em>Proof.</em> From Lemma 3.8 we have $\varphi(p^\alpha) = p^{\alpha - 1}(p - 1)$. Since $p$ is odd, $p-1$ and $\varphi(p^\alpha)$ are even and thus $d \ge 1$.</p>
<p>From the <em>Cyclic Group Theorem</em>, there is $g \in Z_{p^\alpha}^{*}$ such that a randomly chosen element $x$ satisfies $x \equiv g^{k} \Mod{p^{\alpha}}$. Let’s consider 2 cases:</p>
<p>Case 1: $k$ is odd. We have that $x^r \equiv g^{kr} \equiv 1 \Mod{p^{\alpha}}$. Let $r_g$ be the order of $g \Mod{p^\alpha}$. From <em>Lemma 3.7</em>, $r_g = \varphi(p^\alpha)$, and then from <em>Lemma 3.3</em> we conclude that $r_g \mid kr$, thus $\varphi(p^\alpha) \mid kr$. Since $k$ is odd, every factor of 2 in $\varphi(p^\alpha)$ must divide $r$, hence $2^d$ divides $r$.</p>
<p>Case 2: $k$ is even. From <em>Lemma 3.2</em> $g^{\varphi(p^\alpha)} \equiv 1 \Mod{p^\alpha}$, and since $k/2$ is integer, $g^{\varphi(p^\alpha) k/2} \equiv 1 \Mod{p^\alpha}$, so $x^{\varphi(p^\alpha)/2} \equiv 1 \Mod{p^\alpha}$, and by <em>Lemma 3.3</em> $r \mid \varphi(p^\alpha) / 2$. It must be that $2^d \nmid r$ otherwise $2^d \mid \varphi(p^\alpha) / 2$ and $2^{d+1} \mid \varphi(p^\alpha)$ contradicting the fact that $d$ is maximum.</p>
<p>Summarizing $k$ is odd if and only if $2^d \mid r$. It remains to show that $k$ is odd with 1/2 probability for a random $x$ from $Z_{p^\alpha}^{*}$. We can refer to the proof of <em>Lemma 3.7.</em> that states every $x \in Z_{p^\alpha}^{*}$ can be expressed as $x \equiv g^{i} \Mod{p^\alpha}$ for $0 \le i \le r - 1 = \varphi(p^\alpha) - 1$. Since $\varphi(p^\alpha)$ is even, $\varphi(p^\alpha) - 1$ is odd and if we divide the set of numbers $\curly{0, \cdots, \varphi(p^\alpha) - 1}$ into odds and evens we get two sets of the same size.</p>
<p><strong>Theorem 3.</strong> Let $N$ be an odd non-prime positive integer $N$ with prime factors $N = p_1^{\alpha_1} p_2^{\alpha_2} \cdots p_m^{\alpha_m}$. Let $x$ be an element chosen at random from $Z_N^{*}$ and $r$ the order of $x \Mod{N}$. Then</p>
\[p(r \mbox{ is even and } x^{r/2} \not \equiv -1 \Mod{N}) \ge 1 - \frac{1}{2^m}\]
<p><em>Proof.</em> We’ll prove the equivalent statement:</p>
\[p(r \mbox{ is odd or } x^{r/2} \equiv -1 \Mod{N}) \le \frac{1}{2^m}\]
<p>Let $x \in Z_N^{*}$, $x_i \in Z_{p_i^{\alpha_i}}^{*}$ such that $x \equiv x_i \Mod{p_i^{\alpha_i}}$. By <em>Lemma 3.1</em> we can assume we’re picking $x_i$ instead of $x$.</p>
<p>Let $r_i$ be the order of $x_i \Mod{p_i^{\alpha_i}}$ as in <em>Definition 3</em>. Let $d_i$ be the largest exponent such that $2^{d_i}$ divides $r_i$. By <em>Lemma 3.6</em> if $r$ is odd or $x^{r/2} \equiv -1 \mod{N}$ then $d_i$ is the same for any $i$.</p>
<p>[1] claims it’s enough to use <em>Lemma 3.9</em> to prove it, but it’s not clear to me why. My hunch is to show that if $r \mbox{ is odd or } x^{r/2} \equiv -1 \Mod{N}$ then all of $x_1, x_2, \cdots, x_m$ will be either divisible by $2^d$ (as defined in <em>Lemma 3.9</em>) or not. Since only $\frac{1}{2^m}$ of all the possible values of $x_1, x_2, \cdots, x_m$ can satisfy that, then the condition $r \mbox{ is odd or } x^{r/2} \equiv -1 \Mod{N}$ cannot happen with more than that probability.</p>
<p>I’ll leave this as is for now. If I figure it out, I’ll update the post.</p>
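Even with that gap in the argument, we can at least spot-check the statement of Theorem 3 numerically. Below is a brute-force sketch (the `order` helper and the choice $N = 15$ are mine, for illustration only):

```python
from math import gcd

def order(x, n):
    # smallest r > 0 with x^r = 1 (mod n); assumes gcd(x, n) = 1
    r, y = 1, x % n
    while y != 1:
        y = (y * x) % n
        r += 1
    return r

N = 15  # 3 * 5, so m = 2 distinct prime factors
xs = [x for x in range(1, N) if gcd(x, N) == 1]
good = sum(
    1 for x in xs
    if order(x, N) % 2 == 0 and pow(x, order(x, N) // 2, N) != N - 1
)
# Theorem 3 bound: at least 1 - 1/2^m = 3/4 of the elements qualify
assert good / len(xs) >= 1 - 1 / 2 ** 2
```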
<h2 id="references">References</h2>
<ul>
<li>[<a href="https://www.amazon.com/Quantum-Computation-Information-10th-Anniversary/dp/1107002176">1</a>] Quantum Computation and Quantum Information - Nielsen, M. and Chuang, I.</li>
</ul>Guilherme KunigamiGiven integers $x$, $N$, the problem of order-finding consists in finding the smallest positive number $r$ such that $x^r \equiv 1 \Mod{N}$, where $r$ is called the order of $x \Mod{N}$. In this post we’ll show that if we know how to solve the order of $x \Mod{N}$, we can use it to get a probabilistic algorithm for finding a non-trivial factor of a number $N$. The motivation is that this is a crucial step in Shor’s quantum factorization, but only relies on classic number theory.Quantum Fourier Transform2020-11-21T00:00:00+00:002020-11-21T00:00:00+00:00https://www.kuniga.me/blog/2020/11/21/quantum-fourier-transform<!-- This needs to be define as included html because variables are not inherited by Jekyll pages -->
<p>In this post we’ll learn how to compute the Fourier transform using quantum circuits, which underpins algorithms like Shor’s efficient prime factorization. We assume basic familiarity with quantum computing, covered in a <a href="https://www.kuniga.me/blog/2020/10/11/deutsch-jozsa-algorithm.html">previous post</a>.</p>
<!--more-->
<h2 id="fourier-transform">Fourier Transform</h2>
<p>Given a vector $x \in \mathbb{C}^N$ the Fourier transform is a function that maps it to another vector $y \in \mathbb{C}^N$ as follows:</p>
\[(1) \quad y_k = \frac{1}{\sqrt{N}} \sum_{j = 0}^{N - 1} x_j e^{2 \pi i j k / N} \qquad \forall k = 0, \cdots, N - 1\]
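Note the sign convention: with $e^{2 \pi i j k / N}$ in the exponent and the $1/\sqrt{N}$ factor, (1) coincides with what NumPy calls the inverse FFT under orthonormal normalization. A quick numerical check (the random input is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)

# Direct evaluation of (1)
y = np.array([
    sum(x[j] * np.exp(2j * np.pi * j * k / N) for j in range(N))
    for k in range(N)
]) / np.sqrt(N)

# Same sign and 1/sqrt(N) normalization as numpy's ifft with norm="ortho"
assert np.allclose(y, np.fft.ifft(x, norm="ortho"))
```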
<h2 id="quantum-fourier-transform">Quantum Fourier Transform</h2>
<p>We can define an analogous function for quantum states. Consider a state</p>
\[\ket{x} = \sum_{k = 0}^{N - 1} x_k \ket{k}\]
<p>where $k$ are all states in a computational base and $x$ is an $n$-qubit state, thus $N = 2^n$. Then the quantum Fourier transform is a function that maps it to another state $y$:</p>
\[(2) \quad \ket{y} = \sum_{k = 0}^{2^n - 1} y_k \ket{k}\]
<p>where $y_k$ is defined the same way as (1):</p>
\[y_k = \frac{1}{2^{n/2}} \sum_{j = 0}^{2^n - 1} x_j e^{2 \pi i j k / 2^n} \qquad \forall k = 0, \cdots, N - 1\]
<p>If $x$ is a computational base state, say $\ket{j}$ (so $x_j = 1$ and 0 otherwise), then $y_k$ can be written as:</p>
\[y_k = \frac{1}{2^{n/2}} e^{2 \pi i j k / 2^n} \qquad \forall k = 0, \cdots, 2^n - 1\]
<p>and (2) as:</p>
\[(3) \quad \ket{y} = \frac{1}{2^{n/2}} \sum_{k = 0}^{2^n - 1} e^{2 \pi i j k / 2^n} \ket{k}\]
<p>We’ll now study how to construct a quantum circuit that implements (3).</p>
<h2 id="algebraic-preparation">Algebraic Preparation</h2>
<p>In this section we’ll re-write (3) in a form that makes it easier to accomplish via a quantum circuit.</p>
<h3 id="step-1---factor-k-into-bits">Step 1 - Factor $k$ into bits</h3>
<p>First we factor $k$ into bits, that is $(k_1, k_2, \cdots, k_n)$. We can write $k$ as a polynomial of its bits:</p>
\[k = k_n 2^0 + k_{n-1} 2^1 + \cdots + k_1 2^{n - 1}\]
<p>and extract a factor of $2^n$:</p>
\[k = 2^n (k_n \frac{2^0}{2^n} + k_{n-1} \frac{2^1}{2^n} + \cdots + k_1 \frac{2^{n - 1}}{2^n}) = 2^n (k_n 2^{-n} + k_{n-1} 2^{-(n - 1)} + \cdots + k_1 2^{-1})\]
<p>which we can write as a sum:</p>
\[k = 2^n \sum_{l=1}^n k_l 2^{-l}\]
<p>$\ket{k}$ can be written as an explicit multi-qubit state $\ket{k} = \ket{k_1 \cdots k_n}$ and a sum of all values of $\ket{k}$ from 0 to $2^n - 1$ can be written as:</p>
\[\sum_{k=0}^{2^n - 1} \ket{k} = \sum_{k_1 = 0}^1 \cdots \sum_{k_n = 0}^1 \ket{k_1 \cdots k_n}\]
<p>To better grasp the above, it was helpful to me to think of how I would generate all combinations of an $n$-vector with a recursive function of depth $n$ and a for-loop (which corresponds to a sum) at each level.</p>
<p>Finally using these equations, we can rewrite (3) as:</p>
\[(4) \quad \frac{1}{2^{n/2}} \sum_{k_1 = 0}^1 \cdots \sum_{k_n = 0}^1 e^{2 \pi i j (\sum_{l=1}^n k_l 2^{-l}) } \ket{k_1 \cdots k_n}\]
<h3 id="step-2---make-sum-in-exponent-a-product">Step 2 - Make sum in exponent a product</h3>
<p>We can move the sum in the exponent and make it a product,</p>
\[e^{2 \pi i j (\sum_{l=1}^n k_l 2^{-l})} = \prod_{l = 1}^n e^{2 \pi i j k_l 2^{-l}}\]
<p>We learned about the tensor operator $\otimes$ <a href="https://www.kuniga.me/blog/2020/10/11/deutsch-jozsa-algorithm.html">previously</a>, which can also be used to represent multiple qubits as a product-like expression:</p>
\[\ket{k} = \ket{k_1 \cdots k_n} = \bigotimes_{l=1}^n \ket{k_l}\]
<p>Using the equations above, we can “pair” each of the $l$ factors, so</p>
\[e^{2 \pi i j (\sum_{l=1}^n k_l 2^{-l})} \ket{k_1 \cdots k_n} = \bigotimes_{l=1}^n e^{2 \pi i j k_l 2^{-l}} \ket{k_l}\]
<p>Which allows re-writing (4) as</p>
\[(5) \quad \frac{1}{2^{n/2}} \sum_{k_1 = 0}^1 \cdots \sum_{k_n = 0}^1 \bigotimes_{l=1}^n e^{2 \pi i j k_l 2^{-l}} \ket{k_l}\]
<h3 id="step-3---swap-sum-and-product">Step 3 - Swap sum and product</h3>
<p>Step 2 made it so each factor inside the product depends only on $k_l$. We can associate each of the sums $\sum_{k_l = 0}^1$ with its corresponding factor so that (5) becomes:</p>
\[(6) \quad \frac{1}{2^{n/2}} \bigotimes_{l=1}^n \sum_{k_l = 0}^1 e^{2 \pi i j k_l 2^{-l}} \ket{k_l}\]
<p>In more explicit steps, we can isolate the $n$-th factor of (5):</p>
\[\quad \frac{1}{2^{n/2}} \sum_{k_1 = 0}^1 \cdots \sum_{k_n = 0}^1 (\bigotimes_{l=1}^{n-1} e^{2 \pi i j k_l 2^{-l}} \ket{k_l}) (e^{2 \pi i j k_n 2^{-n}} \ket{k_n})\]
<p>Since the $n$-th sum only affects $k_n$, we can order terms as:</p>
\[\quad \frac{1}{2^{n/2}} \sum_{k_1 = 0}^1 \cdots \sum_{k_{n-1} = 0}^1\ (\bigotimes_{l=1}^{n-1} e^{2 \pi i j k_l 2^{-l}} \ket{k_l}) (\sum_{k_n = 0}^1 e^{2 \pi i j k_n 2^{-n}} \ket{k_n})\]
<p>If we keep repeating this step for the remaining factors $l = n-1, \cdots, 1$ we’ll obtain (6).</p>
<h3 id="step-4---enumerate">Step 4 - Enumerate</h3>
<p>This is the easiest step. We just replace the sum over $k_l$ with its two values, 0 and 1. For $k_l = 0$, $e^{2 \pi i j k_l 2^{-l}}$ is 1, and for $k_l = 1$ it’s $e^{2 \pi i j 2^{-l}}$:</p>
\[(7) \quad \frac{1}{2^{n/2}} \bigotimes_{l=1}^n (\ket{0} + e^{2 \pi i j 2^{-l}} \ket{1})\]
<h3 id="step-5---eulers-identity-and-binary-fractions">Step 5 - Euler’s identity and binary fractions</h3>
<p>Let’s start with a quick detour. Recall that Euler’s formula states:</p>
\[e^{ix} = \cos x + i \sin x\]
<p>So we can see $x$ as angular displacements in a circle and every $2 \pi$ is a full revolution.</p>
<p>Say we have an angle $2 \pi \alpha$ for some real $\alpha$. Let $k$ be the integer part of $\alpha$ and $\beta$ its fractional part. Then our angle is $2 \pi k + 2 \pi \beta$. The first term represents full revolutions, so we can throw away the integer part of $\alpha$ when computing $\cos 2 \pi \alpha$ and $\sin 2 \pi \alpha$, and thus $e^{i 2 \pi \alpha}$. For example, $e^{6.7 \cdot 2 \pi i} = e^{0.7 \cdot 2 \pi i}$.</p>
<p>End of detour but we’ll need this fact next.</p>
<p>We can factorize our input $j$ into bits too:</p>
\[j = j_n 2^0 + j_{n-1} 2^1 + \cdots + j_1 2^{n - 1}\]
<p>In the equation $e^{2 \pi i j 2^{-l}}$, we can say $\alpha = j 2^{-l}$. As we saw above, we can discard the integer portion of $j 2^{-l}$. Let’s see some examples. For $l = 1$ we have:</p>
\[j / 2 = j_n 2^{-1} + j_{n-1} + \cdots + j_1 2^{n - 2}\]
<p>Only $j_n 2^{-1}$ is less than 1, so it’s the only term we need to include. Thus:</p>
\[e^{2 \pi i j 2^{-1}} = e^{2 \pi i j_n 2^{-1}}\]
<p>For $l = 2$,</p>
\[j / 2^2 = j_n 2^{-2} + j_{n-1} 2^{-1} + \cdots + j_1 2^{n - 3}\]
<p>The first two terms contribute to the fraction, so</p>
\[e^{2 \pi i j 2^{-2}} = e^{2 \pi i (j_n 2^{-2} + j_{n-1} 2^{-1})}\]
<p>We can use the binary fraction notation,</p>
\[0.b_1 b_2 \cdots b_m = \sum_{i=1}^{m} b_i 2^{-i}\]
<p>So for $l = 1$, we want $j_n 2^{-1} = 0.j_n$. For $l = 2$, we want $j_n 2^{-2} + j_{n-1} 2^{-1} = 0.j_{n-1}j_n$. For a general $l$,</p>
\[j_n 2^{-l} + j_{n-1} 2^{-(l - 1)} \cdots + j_{n-(l-1)} 2^{-1} = 0.j_{n-(l-1)} \cdots j_n\]
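In other words, the binary fraction $0.j_{n-(l-1)} \cdots j_n$ is exactly the fractional part of $j / 2^l$. A small brute-force check of this identity (the bit-extraction code below is my own sketch; `bits[0]` holds $j_1$):

```python
n = 4
for j in range(2 ** n):
    # bits[0] = j_1 (most significant) ... bits[n-1] = j_n (least significant)
    bits = [(j >> (n - i)) & 1 for i in range(1, n + 1)]
    for l in range(1, n + 1):
        # binary fraction 0.j_{n-(l-1)} ... j_n
        frac = sum(bits[n - l + i - 1] * 2 ** -i for i in range(1, l + 1))
        # dyadic fractions are exact in binary floats, so == is safe here
        assert frac == (j / 2 ** l) % 1
```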
<p>This is the last piece we need to rewrite (7) as</p>
\[(8) \quad
\frac{
(\ket{0} + e^{2 \pi i (0.j_n)} \ket{1})
(\ket{0} + e^{2 \pi i (0.j_{n-1}j_n)} \ket{1}) \cdots
(\ket{0} + e^{2 \pi i (0.j_1j_2 \cdots j_n)} \ket{1})
}{2^{n/2}}\]
<p>It’s worth noting that the implicit operator between factors is $\otimes$, which is not commutative. That means order matters above, which we’ll have to take into account when building the circuit.</p>
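We can verify this product form numerically by tensoring the $n$ single-qubit factors in the stated order and comparing against the amplitudes $e^{2 \pi i j k / 2^n} / 2^{n/2}$ derived earlier. A sketch using NumPy:

```python
import numpy as np

n = 3
N = 2 ** n

for j in range(N):
    # Direct amplitudes of the transform of |j>: e^{2 pi i j k / 2^n} / 2^{n/2}
    direct = np.exp(2j * np.pi * j * np.arange(N) / N) / np.sqrt(N)

    # Product form (8): l = 1 is the first (most significant) factor
    state = np.array([1.0 + 0j])
    for l in range(1, n + 1):
        # e^{2 pi i (0.j_{n-l+1} ... j_n)} = e^{2 pi i j / 2^l}
        factor = np.array([1.0, np.exp(2j * np.pi * j / 2 ** l)]) / np.sqrt(2)
        state = np.kron(state, factor)

    assert np.allclose(state, direct)
```

Note that `np.kron` is applied with the $l = 1$ factor outermost, matching the ordering caveat above.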
<h2 id="building-blocks-quantum-gates">Building Blocks: Quantum Gates</h2>
<p>Let’s study the base components needed for constructing the circuit in the next section.</p>
<h3 id="hadamard-gate-revisited">Hadamard gate revisited</h3>
<p>Recall that the <a href="https://www.kuniga.me/blog/2020/10/11/deutsch-jozsa-algorithm.html">Hadamard gate</a> has the following unitary matrix:</p>
\[H = \frac{1}{\sqrt{2}}\begin{bmatrix}
1 & 1 \\
1 & -1 \\
\end{bmatrix}\]
<p>When applied to a qubit $j_i$ in a computational base state it yields</p>
\[\begin{cases}
\frac{\ket{0} + \ket{1}}{\sqrt{2}} & \text{if } j_i = 0 \\
\frac{\ket{0} - \ket{1}}{\sqrt{2}} & \text{if } j_i = 1
\end{cases}\]
<p>We can write this in a form that is more suitable for (8):</p>
\[\frac{\ket{0} + e^{2 \pi i 0.j_i} \ket{1}}{\sqrt{2}}\]
<p>If $j_i = 0$, then $e^{2 \pi i 0.j_i} = e^0 = 1$, if $j_i = 1$, then $e^{2 \pi i 0.5} = e^{\pi i} = -1$ (Euler’s identity).</p>
<h3 id="the-cr_k-gate">The $CR_k$ gate</h3>
<p>We define the $R_k$ gate as:</p>
\[R_k = \begin{bmatrix}
1 & 0 \\
0 & e^{2 \pi i / 2^{k}} \\
\end{bmatrix}\]
<p>when applied to a state $\psi = \alpha \ket{0} + \beta \ket{1}$ it yields</p>
\[R_k(\psi) = \alpha \ket{0} + \beta e^{2 \pi i / 2^{k}} \ket{1}\]
<p>Now, it’s possible to show that any $n$-qubit gate $U$ can be transformed into an $(n+1)$-qubit controlled gate, where the extra qubit is called the <em>control</em>, but we’ll not prove it here. If the control is zero, the other qubits from the input state are unchanged; if it’s one, the gate $U$ is applied to them. The simplest example is the NOT gate converted to the CNOT gate.</p>
<p>We can define $CR_k(\psi, j_i)$ as the $R_k$ gate controlled by the qubit $j_i$ as:</p>
\[(9) \quad CR_k(\psi, j_i) = \alpha \ket{0} + \beta e^{2 \pi i j_i / 2^{k}} \ket{1}\]
<p>If $j_i = 0$, it returns $\alpha \ket{0} + \beta \ket{1}$, the original state, and if $j_i = 1$ it returns $R_k(\psi)$.</p>
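Here’s a small numerical sketch of (9), with the controlled-$R_k$ built as a $4 \times 4$ matrix whose control is the first (most significant) qubit; the matrix construction and the sample amplitudes are mine, for illustration:

```python
import numpy as np

def crk(k):
    # Controlled-R_k on two qubits; control is the first (most significant) qubit
    m = np.eye(4, dtype=complex)
    m[3, 3] = np.exp(2j * np.pi / 2 ** k)
    return m

alpha, beta = 0.6, 0.8
psi = np.array([alpha, beta], dtype=complex)
for j_i in (0, 1):
    control = np.zeros(2, dtype=complex)
    control[j_i] = 1.0
    out = crk(2) @ np.kron(control, psi)
    # (9): the phase e^{2 pi i j_i / 2^k} multiplies the |1> amplitude of psi
    expected = np.kron(
        control, np.array([alpha, beta * np.exp(2j * np.pi * j_i / 2 ** 2)])
    )
    assert np.allclose(out, expected)
```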
<h3 id="swapping-qubits">Swapping qubits</h3>
<p>Let $x_0$ and $y_0$ be 2 qubits in the computational base. We can swap their values using 3 CNOT gates. Let $CNOT(x, y)$ be the NOT gate applied to $y$ (target) and controlled by $x$. Assuming the qubits are in the computational base $\curly{0, 1}$, $CNOT(x, y)$ is equivalent to $x \oplus y$ (XOR).</p>
<p>We start off with $(x_0, y_0)$. Then we apply CNOT to the second bit, so $(x_1, y_1) = (x_0, x_0 \oplus y_0)$, then we apply CNOT to the first bit, $(x_2, y_2) = (y_1 \oplus x_1, y_1) = (x_0 \oplus y_0 \oplus x_0, x_0 \oplus y_0) = (y_0 , x_0 \oplus y_0)$, the last step comes from the fact that $\oplus$ is associative and commutative, so $x_0 \oplus y_0 \oplus x_0 = y_0 \oplus (x_0 \oplus x_0) = y_0 \oplus 0 = y_0$.</p>
<p>Finally we apply CNOT to the second bit again, so $(x_3, y_3) = (x_2, x_2 \oplus y_2) = (y_0, y_0 \oplus (x_0 \oplus y_0)) = (y_0, x_0)$. Summarizing, $(x_3, y_3) = (y_0, x_0)$ which is the first state with the qubits swapped.</p>
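The three CNOT steps above can be checked directly with XOR arithmetic on classical bits (a sketch; the gates are reduced to their action on computational base states):

```python
def swap_with_cnots(x, y):
    # Each CNOT(control, target) acts as target ^= control on base states
    y = x ^ y  # CNOT(x, y): (x0, x0 XOR y0)
    x = x ^ y  # CNOT(y, x): (y0, x0 XOR y0)
    y = x ^ y  # CNOT(x, y): (y0, x0)
    return x, y

for x0 in (0, 1):
    for y0 in (0, 1):
        assert swap_with_cnots(x0, y0) == (y0, x0)
```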
<h2 id="construction-quantum-circuit">Construction: Quantum Circuit</h2>
<p>Let’s construct the first factor of (8). Since it only depends on $j_n$, we can simply apply the Hadamard gate to it:</p>
\[H(j_n) = \frac{\ket{0} + e^{2 \pi i (0.j_n)} \ket{1}}{\sqrt{2}}\]
<p>For the second factor, the exponent depends on both $j_n$ and $j_{n-1}$. Applying the Hadamard on $j_{n-1}$ yields</p>
\[H(j_{n-1}) = \frac{\ket{0} + e^{2 \pi i (0.j_{n-1})} \ket{1}}{\sqrt{2}}\]
<p>We can now use $CR_k$, in particular $CR_2(\psi, j_n)$, to “inject” $j_n$ as the second fraction bit. To see how, let’s apply (9) with $k=2$: we currently have $\alpha = 1/\sqrt{2}$ and $\beta = e^{2 \pi i (0.j_{n-1})} / \sqrt{2}$. If we apply $CR_2(\psi, j_n)$, then</p>
\[\beta e^{2 \pi i j_n / 2^{2}} =
\frac{e^{2 \pi i (0.j_{n-1})} e^{2 \pi i j_n / 2^{2}} }{ \sqrt{2} } =
\frac{e^{2 \pi i (0.j_{n-1}) + 2 \pi i j_n / 2^{2}} }{ \sqrt{2} } =
\frac{e^{2 \pi i (0.j_{n-1}j_n)} }{ \sqrt{2} }\]
<p>Thus:</p>
\[CR_2(H(j_{n-1}), j_{n}) = \frac{\ket{0} + e^{2 \pi i (0.j_{n-1}j_n)} \ket{1}}{\sqrt{2}}\]
<p>For the third factor we repeat the steps above for $j_{n-1}$ and $j_{n-2}$, and use $CR_3(\psi, j_n)$ to obtain</p>
\[CR_3(CR_2(H(j_{n-2}), j_{n-1}), j_n) = \frac{\ket{0} + e^{2 \pi i (0.j_{n-2}j_{n-1}j_n)} \ket{1}}{\sqrt{2}}\]
<p>We can use this procedure to generate all factors. The circuit below depicts these steps:</p>
<figure class="center_children">
<img src="https://www.kuniga.me/resources/blog/2020-11-21-quantum-fourier-transform/quantum-circuit.png" alt="a diagram depicting a quantum circuit" />
<figcaption>Figure 1: The Quantum Fourier Transform Circuit (<a href="https://www.kuniga.me/resources/blog/2020-11-21-quantum-fourier-transform/quantum-circuit.png" target="blank">full resolution</a>)</figcaption>
</figure>
<p>Note we’re omitting the $1/\sqrt{2}$ factor for simplicity.</p>
<h3 id="reversing-bits">Reversing bits</h3>
<p>The circuit we provided above generates the output in reverse order, since, for example, $(\ket{0} + e^{2 \pi i (0.j_n)} \ket{1})$ is constructed from the last bit $j_n$ but is the first factor in (8). We can reverse the order of the bits by performing $n/2$ swaps using the 3 CNOT gates we described in <em>Swapping qubits</em>. However, we only showed how to swap bits in the computational base state $\curly{0, 1}$, which the output is not in. The input $\ket{j}$ is, so we can reverse the bits beforehand.</p>
<h2 id="complexity">Complexity</h2>
<p>We saw that the quantum circuit from the previous section computes the quantum Fourier transform. We first need to reverse the bits, which can be done in $O(n)$ where $n$ is the number of qubits. Then we have $O(n)$ Hadamard gates and $O(n^2)$ $CR_k$ gates, for a total time complexity of $O(n^2)$ (since each gate computes in $O(1)$).</p>
<p>Contrast this with the classical Fourier transform, which is computed in $O(N \log N)$ using the Fast Fourier Transform (FFT). Since $n$ is the number of bits, $N = 2^n$ and the complexity is $O(2^n n)$, which means the quantum circuit is exponentially more efficient.</p>
<p>Note however that, similar to the constructs we studied in <a href="https://www.kuniga.me/blog/2020/10/11/deutsch-jozsa-algorithm.html">The Deutsch-Jozsa Algorithm</a>, this procedure constructs a quantum state that encodes the computed results of the Fourier transform, but accessing that information is another story. We’ll stop here in this post but will discuss applications in future ones, including prime factorization.</p>
<h2 id="conclusion">Conclusion</h2>
<p>This post is basically my study notes from Chapter 5.1 of <em>Quantum Computation and Quantum Information</em> [1], but with some more details filled in. It was particularly challenging to understand <em>Step 5</em>, since it implicitly relies on some properties of Euler’s identity. It could be that this was covered in a previous chapter and I missed it, since I’m not reading the book in a linear fashion.</p>
<p>The book uses notations I’m not used to, and I decided to avoid them for simplicity. The main example is representing $y = f(x)$ as $x \rightarrow f(x)$ and omitting $y$ (the book does that for the quantum Fourier transform equation).</p>
<p>I assume it wasn’t a clear straight path of discovery from (3) to (8) but I can’t help being astonished by the ingenuity of the logical leaps in there.</p>
<h2 id="related-posts">Related Posts</h2>
<ul>
<li><a href="https://www.kuniga.me/blog/2020/10/11/deutsch-jozsa-algorithm.html">The Deutsch-Jozsa Algorithm</a>: besides being a prerequisite for this post, I noticed similarities between the quantum Fourier transform computation and the Deutsch-Jozsa algorithm, in which we leverage the ability of quantum circuits to “entangle” information from the input efficiently. In the Deutsch-Jozsa algorithm, we entangled all the computations of some $f: \curly{\ket{0}, \ket{1}}^{\otimes n} \rightarrow \curly{\ket{0}, \ket{1}}$, while here we entangle all the bits of the input into each bit of the output, like the last factor of (8), $(\ket{0} + e^{2 \pi i (0.j_1j_2 \cdots j_n)} \ket{1})$.</li>
</ul>
<h2 id="references">References</h2>
<ul>
<li>[<a href="https://www.amazon.com/Quantum-Computation-Information-10th-Anniversary/dp/1107002176">1</a>] Quantum Computation and Quantum Information - Nielsen, M. and Chuang, I.</li>
</ul>Guilherme KunigamiIn this post we’ll learn how to compute the Fourier transform using quantum circuits, which underpins algorithms like Shor’s efficient prime factorization. We assume basic familiarity with quantum computing, covered in a previous post.A Puzzling Election2020-11-06T00:00:00+00:002020-11-06T00:00:00+00:00https://www.kuniga.me/blog/2020/11/06/puzzling-election<!-- This needs to be define as included html because variables are not inherited by Jekyll pages -->
<p>In 2016, Donald Trump won the US presidential election. He won with 304 electoral votes over Hillary Clinton with 227 votes, even though Clinton had almost 3 million more popular votes than Trump.</p>
<p>This is due to how the US decides to count votes. In its system each state is given a number of votes equal to the number of representatives (proportional to the state population) plus senators (2 per state). There are a total of 435 representatives and 100 senators and the District of Columbia (DC) gets 3 votes (proportional to its population, but no more votes than the least populated state, Wyoming), for a total of 538 electoral votes. A candidate wins the election if they get more than half of the electoral votes, that is, at least 270.</p>
<p>All but two states (Maine and Nebraska) have an all-or-nothing system, which means that whoever wins the majority of the popular votes in a state gets all of its electoral votes. For example, California has 55 electoral votes; Clinton got 8,753,788 (61.73%) of the votes and hence all 55 electoral votes, and Trump none. Had she received 100% of the votes, she would have received the exact same number of electoral votes.</p>
<p>This all-or-nothing system is the source of the counter-intuitive result in which a candidate with the majority of the popular vote might not win the election. This led me to wonder about the extreme case: What would be the most popular votes a candidate can get without winning?</p>
<p>In this post we explore this problem.</p>
<!--more-->
<figure class="center_children">
<img src="https://www.kuniga.me/resources/blog/2020-11-06-puzzling-election/us-map-election-2016.png" alt="the US map showing the states which democrats (blue) and republicans (red) won in the 2016 US Election." />
<figcaption>Figure 1: States which democrats (blue) and republicans (red) won in the 2016 US Election.</figcaption>
</figure>
<h2 id="problem-formulation">Problem Formulation</h2>
<p>To simplify the problem a bit, let’s assume all states, including Maine and Nebraska, use an all-or-nothing system. Let’s also assume the number of popular votes per state is the same as in the 2016 election, but that we can choose whom they went for.</p>
<p>We’ll further assume that there are only two candidates, A and B, and a vote has to go to either of them. What is the minimum number of popular votes A needs to win?</p>
<h2 id="ballpark-estimate">Ballpark estimate</h2>
<p>If we assume there are sets of states that divide the electoral votes into roughly two equal parts, then candidate A needs to win ~50% of the electoral votes.</p>
<p>Also, since the number of electoral votes is roughly proportional to the population and assuming the voting participation rate is roughly the same across the states, candidate A needs to win ~50% of the votes of a set of states that add up to ~50% of the electoral votes. This is ~25% of the total votes for A and ~75% for B, so we’d expect the exact results we’ll compute to be around this ratio.</p>
<h2 id="the-inverse-knapsack-problem">The Inverse Knapsack Problem</h2>
<p>Let $S^*$ be the set of states A won in an optimal solution. If a state $i$ is in $S^*$, A needs exactly the minimum majority of votes, so if there were $v_i$ votes, A needs $\lfloor v_i/2 \rfloor + 1$ of them. If a state is not in $S^*$, there’s no reason to spend any votes there, so it should be 0.</p>
<p>Given these observations, we can model this problem as picking a set of items corresponding to the states and DC, with weights equal to their electoral votes and costs equal to the minimum majority of popular votes.</p>
<p>Picking a state corresponds to candidate A winning that state. We want to pick a set of items that minimizes the total cost (number of popular votes) but has total weight greater or equal 270 (the minimum majority of the electoral votes).</p>
<p>More generally, let $S$ be a set of items, where item $i \in S$ has cost $c_i$ and weight $w_i$. The problem can be formulated as an integer linear program if we introduce variables $x_i \in \curly{0, 1}$, where $x_i = 1$ corresponds to picking item $i$.</p>
\[\min \sum_{i \in S} x_i c_i\]
<p>Subject to</p>
\[\sum_{i \in S} x_i w_i \ge W\]
<p>We’ll call this the <em>Inverse Knapsack Problem</em>. To recall, the <em>Knapsack Problem</em> consists of maximizing the value of a set of items with total weight not exceeding $W$, which can be modelled as an integer linear program:</p>
\[\max \sum_{i \in S} x_i c_i\]
<p>Subject to</p>
\[\sum_{i \in S} x_i w_i \le W\]
<h2 id="the-inverse-knapsack-problem-is-np-complete">The Inverse Knapsack Problem is NP-Complete</h2>
<p>The decision version of the knapsack problem can be defined as:</p>
<blockquote>
<p>($D_1$) Is there any solution with total value of at least $C_1$ weighing no more than $W_1$?</p>
</blockquote>
<p>In general this is an NP-complete problem. The decision version of the inverse knapsack problem can be defined as:</p>
<blockquote>
<p>($D_2$) Is there any solution with total cost of at most $C_2$ weighing no less than $W_2$?</p>
</blockquote>
<p>If we can reduce the knapsack problem to the inverse knapsack problem, we prove that the latter is also an NP-complete problem.</p>
<p>Consider an instance of $D_1$ with $C_1$ and $W_1$.</p>
<p>Let $X$ denote a subset of $S$ and $c(X)$ and $w(X)$ the total value and weight of the items in $X$, respectively. Let $C_T$ be the sum of values of all items, i.e. $C_T = c(S)$. Let $W_T$ be the sum of weights of all items, i.e. $W_T = w(S)$.</p>
<p>We create an instance of $D_2$ with $C_2 = C_T - C_1$ and $W_2 = W_T - W_1$.</p>
<p>Suppose $D_1$ is true. Then there’s a set of items $X^*$ such that $c(X^*) \ge C_1$ and $w(X^*) \le W_1$. Let $Y^*$ be the set of items not in $X^*$ (i.e. $S \setminus X^*$).</p>
<p>Their weight is $w(Y^*) = W_T - w(X^*)$, and since $w(X^*) \le W_1$, $w(Y^*) = W_T - w(X^*) \ge W_T - W_1 = W_2$. Similarly we can see that $c(Y^*) \le C_2$, which means $Y^*$ is a solution to $D_2$.</p>
<p>This implies that if $D_1$ has a solution, then $D_2$ has one too. We can use a symmetric argument to show that if $D_2$ has a solution, then $D_1$ has one too. This constructive proof shows we can reduce the decision version of the knapsack problem to the inverse knapsack problem.</p>
<p>While this problem is NP-complete, for many real-world instances it can be solved exactly via dynamic programming as we’ll see next.</p>
<h2 id="dynamic-programming">Dynamic Programming</h2>
<p>The reduction from the decision version of the knapsack problem into the inverse knapsack problem can also be used to reduce the optimization version of the inverse knapsack problem (let’s call it $O_2$) into the knapsack problem ($O_1$).</p>
<p>Let $Y^*$ be the optimal solution for $O_2$ with $W_2$. We create an instance of $O_1$ with $W_1 = W_T - W_2$, and $X^* = S \setminus Y^*$. Since $Y^*$ is a feasible solution, $w(Y^*) \ge W_2$ and we can show that $w(X^*) \le W_1$, so $X^*$ is a feasible solution to $O_1$.</p>
<p>We now claim $X^*$ is the optimal solution for $O_1$. Suppose it’s not. Then there is $\hat{X}$ such that $c(\hat{X}) > c(X^*)$ and $\hat{Y} = S \setminus \hat{X}$. Since $c(X^*) = C_T - c(Y^*)$ and $c(\hat{X}) = C_T - c(\hat{Y})$ we get $C_T - c(\hat{Y}) > C_T - c(Y^*)$, which means $c(\hat{Y}) < c(Y^*)$, which is a contradiction.</p>
<p>Hence we can solve the original problem by creating $O_1$ with $W = 538 - 270 = 268$, solve the knapsack problem and reverse our picks. We can solve the knapsack problem in $O(nW)$ using dynamic programming, where $n$ would be the number of states + DC, 51.</p>
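We can sanity-check this reduction by brute force on a toy instance (the items below are made up, and the `brute_*` helpers are mine, for illustration): for every threshold $W$, the optimal inverse-knapsack cost equals $C_T$ minus the optimal knapsack value with capacity $W_T - W$.

```python
from itertools import combinations

def subsets(n):
    # all subsets of {0, ..., n-1}, as index tuples
    for r in range(n + 1):
        yield from combinations(range(n), r)

def brute_knapsack(items, W):
    # max total cost over subsets with total weight <= W
    return max(
        sum(items[i]['c'] for i in s)
        for s in subsets(len(items))
        if sum(items[i]['w'] for i in s) <= W
    )

def brute_inverse_knapsack(items, W):
    # min total cost over subsets with total weight >= W
    return min(
        sum(items[i]['c'] for i in s)
        for s in subsets(len(items))
        if sum(items[i]['w'] for i in s) >= W
    )

items = [{'w': 3, 'c': 5}, {'w': 4, 'c': 6}, {'w': 5, 'c': 4}, {'w': 2, 'c': 3}]
W_T = sum(it['w'] for it in items)
C_T = sum(it['c'] for it in items)
for W in range(W_T + 1):
    # complementing the optimal pick turns one problem into the other
    assert brute_inverse_knapsack(items, W) == C_T - brute_knapsack(items, W_T - W)
```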
<h2 id="implementation">Implementation</h2>
<p>Here’s a Python implementation that uses an $O(nW)$ matrix <code class="language-plaintext highlighter-rouge">ks</code>, where <code class="language-plaintext highlighter-rouge">ks[i][w]</code> represents the best possible knapsack value using only the first <code class="language-plaintext highlighter-rouge">i</code> items and with total weight <code class="language-plaintext highlighter-rouge">w</code>.</p>
<p>We can use <code class="language-plaintext highlighter-rouge">ks</code> to retrieve the elements used in the optimal solution.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">solve_knapsack</span><span class="p">(</span><span class="n">items</span><span class="p">,</span> <span class="n">W</span><span class="p">):</span>
<span class="c1"># Empty set
</span> <span class="n">k</span> <span class="o">=</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">*</span><span class="p">(</span><span class="n">W</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">k</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">ks</span> <span class="o">=</span> <span class="p">[</span><span class="n">k</span><span class="p">]</span>
<span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">items</span><span class="p">:</span>
<span class="n">next_k</span> <span class="o">=</span> <span class="n">k</span><span class="p">.</span><span class="n">copy</span><span class="p">()</span>
<span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">k</span><span class="p">)):</span>
<span class="n">next_w</span> <span class="o">=</span> <span class="n">w</span> <span class="o">+</span> <span class="n">item</span><span class="p">[</span><span class="s">'w'</span><span class="p">]</span>
<span class="k">if</span> <span class="n">next_w</span> <span class="o">></span> <span class="n">W</span><span class="p">:</span>
<span class="k">break</span>
<span class="k">if</span> <span class="n">k</span><span class="p">[</span><span class="n">w</span><span class="p">]</span> <span class="o"><</span> <span class="mi">0</span><span class="p">:</span>
<span class="k">continue</span>
<span class="n">next_c</span> <span class="o">=</span> <span class="n">k</span><span class="p">[</span><span class="n">w</span><span class="p">]</span> <span class="o">+</span> <span class="n">item</span><span class="p">[</span><span class="s">'c'</span><span class="p">]</span>
<span class="k">if</span> <span class="n">k</span><span class="p">[</span><span class="n">next_w</span><span class="p">]</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span> <span class="ow">or</span> <span class="n">k</span><span class="p">[</span><span class="n">next_w</span><span class="p">]</span> <span class="o"><</span> <span class="n">next_c</span><span class="p">:</span>
<span class="n">next_k</span><span class="p">[</span><span class="n">next_w</span><span class="p">]</span> <span class="o">=</span> <span class="n">next_c</span>
<span class="n">k</span> <span class="o">=</span> <span class="n">next_k</span>
<span class="n">ks</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">k</span><span class="p">)</span>
<span class="c1"># Find the best size
</span> <span class="n">max_w</span> <span class="o">=</span> <span class="n">W</span>
<span class="k">while</span> <span class="n">ks</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">][</span><span class="n">max_w</span><span class="p">]</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">:</span>
<span class="n">max_w</span> <span class="o">-=</span> <span class="mi">1</span>
<span class="c1"># Backtrack to find which items were used
</span> <span class="n">picked</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">curr_w</span> <span class="o">=</span> <span class="n">max_w</span>
<span class="k">for</span> <span class="n">idx</span> <span class="ow">in</span> <span class="nb">reversed</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">ks</span><span class="p">))):</span>
<span class="k">if</span> <span class="n">ks</span><span class="p">[</span><span class="n">idx</span><span class="p">][</span><span class="n">curr_w</span><span class="p">]</span> <span class="o">></span> <span class="n">ks</span><span class="p">[</span><span class="n">idx</span> <span class="o">-</span> <span class="mi">1</span><span class="p">][</span><span class="n">curr_w</span><span class="p">]:</span>
<span class="n">picked</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">idx</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">curr_w</span> <span class="o">-=</span> <span class="n">items</span><span class="p">[</span><span class="n">idx</span> <span class="o">-</span> <span class="mi">1</span><span class="p">][</span><span class="s">'w'</span><span class="p">]</span>
<span class="k">return</span> <span class="n">picked</span></code></pre></figure>
<p>We can reduce the inverse knapsack problem to the knapsack problem by the procedure we described above ($W_1 = W_T - W_2$) and then get the complement of the items:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">solve_inverse_knapsack</span><span class="p">(</span><span class="n">items</span><span class="p">,</span> <span class="n">W</span><span class="p">):</span>
<span class="n">total_weight</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">([</span><span class="n">item</span><span class="p">[</span><span class="s">'w'</span><span class="p">]</span> <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="n">items</span><span class="p">])</span>
<span class="n">solution</span> <span class="o">=</span> <span class="n">solve_knapsack</span><span class="p">(</span><span class="n">items</span><span class="p">,</span> <span class="n">total_weight</span> <span class="o">-</span> <span class="n">W</span><span class="p">)</span>
<span class="n">solution_lookup</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span><span class="n">solution</span><span class="p">)</span>
<span class="c1"># the complement of items in solution
</span> <span class="k">return</span> <span class="p">[</span><span class="n">i</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">items</span><span class="p">))</span> <span class="k">if</span> <span class="n">i</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">solution_lookup</span><span class="p">]</span></code></pre></figure>
<p>The full source is on <a href="https://github.com/kunigami/kunigami.github.io/blob/master/blog/code/2020-11-06-puzzling-election/solve.py">Github</a>.</p>
<h2 id="results">Results</h2>
<p>We obtained the following optimal number of popular votes:</p>
<div class="center_children">
<table>
<thead>
<tr>
<th>A (winner)</th>
<th>B</th>
<th>Total</th>
</tr>
</thead>
<tbody>
<tr>
<td>29,152,906</td>
<td>107,516,331</td>
<td>136,669,237</td>
</tr>
<tr>
<td>21.3%</td>
<td>78.6%</td>
<td>100%</td>
</tr> </tbody>
</table>
</div>
<p>We also generate the map with the states where A gets $\lfloor v/2 \rfloor + 1$ votes in gold and 0 votes in green:</p>
<figure class="center_children">
<img src="https://www.kuniga.me/resources/blog/2020-11-06-puzzling-election/us-map-solution.png" alt="the US map highlighting the states in which candidate A has to get the minimum majority of the votes" />
<figcaption>Figure 2: States in which candidate A has to get the minimum majority of the votes are colored gold (<a target="_blank" href="https://observablehq.com/@kunigami/selected-states-map">Observable</a>)</figcaption>
</figure>
<h2 id="analysis">Analysis</h2>
<p>It’s theoretically possible for a candidate to win the US election (with the caveat of the Maine-Nebraska simplification) by winning only 21.3% of the popular votes. This ratio is not too far from our ballpark estimate of 25%.</p>
<p>While populous states yield a lot of electoral votes, they also require a lot of popular votes to be won, so they don’t matter too much in finding the optimal solution. A better heuristic is picking states with low voter turnout (like Texas, ~50%), since winning them requires a smaller subset of the population, while they still yield electoral votes proportional to the full population.</p>
<p>If we didn’t fix the number of votes per state, the problem would be less interesting, because A would just need one vote to win a state (and B would get 0), whereas in states where A lost, we’d assume there was 100% turnout and B got all the votes. Candidate A would need just 12 popular votes to win, by picking the top states by electoral votes!</p>
<h2 id="conclusion">Conclusion</h2>
<p>It’s election time and it was interesting to be able to model a real-world example as a combinatorial optimization problem. I didn’t recall the inverse knapsack problem, though it’s likely I’ve seen it in some form before.</p>
<p>I had forgotten how to retrieve the items of the knapsack using dynamic programming, and was only able to come up with an $O(nW)$-memory solution. Is it possible to do it using $O(W)$ memory?</p>
<p>This post was a good way for me to learn how the voting system works in the US. I’ve recently read <em>The Quartet: Orchestrating the Second American Revolution</em>, and learned how there needed to be compromises to make the Constitution pass, which left a lot of power to states. This can be seen in the fact that the number of senators is proportional to the number of states (not population) and also in the electoral votes, which 48 states treat as a unit.</p>
<h2 id="related-posts">Related Posts</h2>
<ul>
<li><a href="https://www.kuniga.me/blog/2016/11/05/us-as-an-hexagonal-map.html">US as an hexagonal map</a> describes an alternative way to represent US states uniformly, since the geographical representation puts too much emphasis on large states like Alaska. A similar issue exists with electoral votes, where states like New York feel underrepresented even though they have a large number of votes. There are alternative representations as well, like <a href="https://blog.revolutionanalytics.com/2016/10/tilegrams-in-r.html">here</a>.</li>
<li><a href="https://www.kuniga.me/blog/2020/05/25/minimum-string-from-removing-doubles.html">Shortest String From Removing Doubles</a> is another puzzle in which we use backtracking to recover the solution from an auxiliary array. I don’t know if the solution can be classified as dynamic programming because it doesn’t build upon smaller instances of the problem explicitly.</li>
</ul>
<h2 id="references">References</h2>
<p>[<a href="https://en.wikipedia.org/wiki/2016_United_States_presidential_election">1</a>] Wikipedia - 2016 United States presidential election</p>Guilherme KunigamiIn 2016, Donald Trump won the US presidential election. He won with 304 electoral votes over Hillary Clinton with 227 votes, even though Clinton had almost 3 million more popular votes than Trump. This is due to how the US decides to count votes. In its system each state is given a number of votes equal to the number of representatives (proportional to the state population) plus senators (2 per state). There are a total of 435 representatives and 100 senators and the District of Columbia (DC) gets 3 votes (proportional to its population, but no more votes than the least populated state, Wyoming), for a total of 538 electoral votes. A candidate wins the election if they get more than half of the electoral votes, that is, at least 270. All but two states (Maine and Nebraska) have an all-or-nothing system, which means that whoever wins the majority of the popular votes in that state, gets all the electoral votes. For example, California has 55 electoral votes where Clinton got 8,753,788 (61.73%) of votes and hence the 55 electoral votes and Trump none. Had she received 100% the votes, she would have received the exact same electoral votes. This all-or-nothing system is the source of the counter-intuitive result in which a candidate with the majority of the popular vote might not win the election. This led me to wonder about the extreme case: What would be the most popular votes a candidate can get without winning? In this post we explore this problem.The Deutsch-Jozsa Algorithm2020-10-11T00:00:00+00:002020-10-11T00:00:00+00:00https://www.kuniga.me/blog/2020/10/11/deutsch-jozsa-algorithm<!-- This needs to be define as included html because variables are not inherited by Jekyll pages -->
<p><a href="https://en.wikipedia.org/wiki/David_Deutsch">David Elieser Deutsch</a> is a British scientist at the University of Oxford and a pioneer of the field of quantum computation, having formulated a description of a quantum Turing machine.</p>
<p><a href="https://en.wikipedia.org/wiki/Richard_Jozsa">Richard Jozsa</a> is an Australian mathematician at the University of Cambridge and is a co-inventor of quantum teleportation.</p>
<p>Together they proposed the <a href="https://en.wikipedia.org/wiki/Deutsch%E2%80%93Jozsa_algorithm">Deutsch-Jozsa Algorithm</a> which, although not useful in practice, provides an example where a quantum algorithm can outperform a classic one.</p>
<p>In this post we’ll describe the algorithm and the basic theory of quantum computation behind it.</p>
<!--more-->
<h2 id="quantum-mechanics-abstracted">Quantum Mechanics Abstracted</h2>
<p>In this post we’ll work with abstractions on top of Quantum mechanics concepts, namely <em>qubits</em> and <em>quantum gates</em>. A lot of the properties we’ll leverage, such as superposition and teleportation, arise from the theory of Quantum mechanics; we’ll not delve into their explanation but rather take them as facts to keep things simpler.</p>
<p>We’ll also try to avoid making real-world interpretations of the theoretical results, since it’s known to be counter-intuitive, sometimes paradoxical and overall not agreed upon [1].</p>
<h2 id="the-qubit">The Qubit</h2>
<p>We’ll start by defining the quantum analog to a classical bit, which is named a <em>qubit</em> or simply <em>qbit</em>. A common statement about a qubit is that it can be both 0 and 1 at the same time but, as we mentioned in the previous section, we’ll avoid making interpretations such as these and will focus on the mathematical representation of a qubit instead.</p>
<p>The <a href="https://en.wikipedia.org/wiki/Bra%E2%80%93ket_notation">Dirac notation</a> defines the pair $\langle \cdot \mid$ and $\mid \cdot \rangle$, called respectively <em>bra</em> and <em>ket</em> (possibly a word play from bracket). A qubit is often represented by the ket symbol: $\ket{\psi}$.</p>
<h2 id="the-state-of-a-qubit">The State of a Qubit</h2>
<p>We can think of a qubit as a pair of complex numbers subject to some constraint. More formally, the set of values of a qubit is a vector space represented by the complex linear combination of an <a href="https://mathworld.wolfram.com/OrthonormalBasis.html">orthonormal basis</a>. The orthonormal base used is often $\ket{0}, \ket{1}$, which are also called <em>computational basis states</em>.</p>
<p>In other words, any qubit can be represented as:</p>
\[\ket{\psi} = \alpha \ket{0} + \beta \ket{1}\]
<p>where $\alpha$ and $\beta$ are complex numbers called the amplitudes, and $\abs{\alpha}^2 + \abs{\beta}^2 = 1$. It’s worth recalling that the magnitude of a complex number $c = a + bi$ is $\abs{c} = \sqrt{a^2 + b^2}$, so both $\abs{\alpha}$ and $\abs{\beta}$ are non-negative numbers.</p>
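<p>As a quick numeric sanity check (a sketch, not part of the original derivation), we can verify the normalization constraint for a few candidate amplitude pairs in Python:</p>

```python
def is_valid_qubit(alpha, beta, tol=1e-9):
    """Valid iff |alpha|^2 + |beta|^2 == 1 (alpha and beta may be complex)."""
    return abs(abs(alpha) ** 2 + abs(beta) ** 2 - 1.0) < tol

print(is_valid_qubit(0.6, 0.8))                    # → True  (0.36 + 0.64 = 1)
print(is_valid_qubit((1 + 1j) / 2, (1 - 1j) / 2))  # → True  (0.5 + 0.5 = 1)
print(is_valid_qubit(1, 1))                        # → False
```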
<p>We could have opted to use matrix notation and represent our qubit as:</p>
\[\begin{bmatrix}
\psi_{1} \\
\psi_{2} \\
\end{bmatrix} = \alpha \begin{bmatrix}
1 \\
0 \\
\end{bmatrix} + \beta \begin{bmatrix}
0 \\
1 \\
\end{bmatrix}\]
<h3 id="multiple-qubits">Multiple Qubits</h3>
<p>A state with 2-qubits can be written as:</p>
\[\ket{\psi} = \alpha_{00} \ket{00} + \alpha_{01} \ket{01} + \alpha_{10} \ket{10} + \alpha_{11} \ket{11}\]
<p>Note that the size of the basis is $2^n$ if $n$ is the number of qubits.</p>
<p>We might also see this notation that factors common terms, so for example the above can be rewritten as:</p>
\[\ket{\psi} = \ket{0} (\alpha_{00} \ket{0} + \alpha_{01} \ket{1}) + \ket{1} (\alpha_{10} \ket{0} + \alpha_{11} \ket{1})\]
<p>We can also use multiple variables to represent a multi-qubit, so for example a 2-qubit can be denoted by $\ket{x, y}$ where $\ket{x}$ and $\ket{y}$ are single qubit variables.</p>
<p>Finally, we can represent repeated qubits using the operator $\otimes$, called the <em>tensor product</em>. For example, the 4-qubit state $\ket{0000}$ can be represented as $\ket{0}^{\otimes 4}$.</p>
<h3 id="measuring-the-state-of-a-qubit">Measuring the state of a Qubit</h3>
<p>If we measure a single qubit state like</p>
\[\ket{\psi} = \alpha \ket{0} + \beta \ket{1}\]
<p>the measurement will return $\ket{0}$ with probability $\abs{\alpha}^2$ and $\ket{1}$ with probability $\abs{\beta}^2$ (recall that $\abs{\alpha}^2 + \abs{\beta}^2 = 1$, so this is a valid probability distribution). One important thing to note is that this process is irreversible, once the measurement is made, the qubit will assume the measured state.</p>
<p>We can also measure partial qubits of a multi-qubit state. Suppose we have a 2-qubit state:</p>
\[\ket{\psi} = \alpha_{00} \ket{00} + \alpha_{01} \ket{01} + \alpha_{10} \ket{10} + \alpha_{11} \ket{11}\]
<p>And we measure the first qubit. It will return $\ket{0}$ with probability $\abs{\alpha_{00}}^2 + \abs{\alpha_{01}}^2$ and $\ket{1}$ with probability $\abs{\alpha_{10}}^2 + \abs{\alpha_{11}}^2$. Then the first qubit will assume the measured value. Say we measured $\ket{0}$, then the new state is</p>
\[\ket{\psi} = \frac{\alpha_{00} \ket{00} + \alpha_{01} \ket{01}}{\sqrt{\abs{\alpha_{00}}^2 + \abs{\alpha_{01}}^2}}\]
<p>Where the denominator is a normalizing factor so the amplitudes form a valid probability distribution.</p>
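<p>This partial-measurement rule is easy to simulate classically. The sketch below is a toy model (with an injectable random source, so the branch can be forced deterministically) that measures the first qubit of a 2-qubit state and renormalizes the surviving amplitudes:</p>

```python
import math
import random

def measure_first_qubit(amps, rng=random.random):
    """amps = [a00, a01, a10, a11]. Returns (outcome, post-measurement amplitudes)."""
    p0 = abs(amps[0]) ** 2 + abs(amps[1]) ** 2
    if rng() < p0:  # first qubit collapses to |0>
        norm = math.sqrt(p0)
        return 0, [amps[0] / norm, amps[1] / norm, 0, 0]
    norm = math.sqrt(1 - p0)  # first qubit collapses to |1>
    return 1, [0, 0, amps[2] / norm, amps[3] / norm]

# uniform superposition over the 4 basis states; force the |0> branch
outcome, post = measure_first_qubit([0.5, 0.5, 0.5, 0.5], rng=lambda: 0.0)
print(outcome, post)  # → 0 and amplitudes [1/sqrt(2), 1/sqrt(2), 0, 0]
```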
<h2 id="transforming-a-qubit-quantum-gates">Transforming a Qubit: Quantum Gates</h2>
<p>In the same way classical gates can be used to transform a bit or bits, we have the analogous quantum gates. The most basic classical gate is the NOT gate, which transforms 0 into 1 and vice-versa. The analogous quantum gate swaps $\alpha$ and $\beta$ of a state, which is a more general form of a NOT gate, since it also turns $\ket{0}$ into $\ket{1}$ (when $\alpha = 1$, $\beta = 0$) and $\ket{1}$ into $\ket{0}$ (when $\alpha = 0$, $\beta = 1$).</p>
<p>In matrix form, we want to find a transformation from the column vector $[\alpha \, \beta]^T$ into $[\beta \, \alpha]^T$. We can do so using a 2 x 2 matrix:</p>
\[\begin{bmatrix}
\beta \\
\alpha \\
\end{bmatrix} = \begin{bmatrix}
0 & 1 \\
1 & 0 \\
\end{bmatrix} \begin{bmatrix}
\alpha \\
\beta \\
\end{bmatrix}\]
<p>More generally any quantum gate on $n$-qubits can be represented by a $2^n \times 2^n$ matrix called a <em>unitary matrix</em>. A unitary matrix $U$ is such that $U^\dagger U = I$, where $U^\dagger$ is the <em>adjoint</em> of $U$, which is the result of transposing $U$ and taking the conjugate (i.e. negating the imaginary part of the complex number) of its elements.</p>
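<p>The unitarity condition $U^\dagger U = I$ can be checked numerically. A small sketch using plain Python lists (no external libraries), applied to the quantum NOT gate above and to the Hadamard gate defined next:</p>

```python
def adjoint(M):
    """Conjugate transpose of a matrix given as a list of rows."""
    return [[M[j][i].conjugate() for j in range(len(M))] for i in range(len(M[0]))]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def is_unitary(U, tol=1e-9):
    """Check that U† U equals the identity matrix."""
    P = matmul(adjoint(U), U)
    n = len(U)
    return all(abs(P[i][j] - (1 if i == j else 0)) < tol
               for i in range(n) for j in range(n))

X = [[0, 1], [1, 0]]       # the quantum NOT gate
s = 2 ** -0.5
H = [[s, s], [s, -s]]      # the Hadamard gate
print(is_unitary(X), is_unitary(H))  # → True True
```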
<h3 id="the-hadamard-gate">The Hadamard Gate</h3>
<p>The Hadamard gate appears in many constructs and can be defined by:</p>
\[H = \frac{1}{\sqrt{2}}\begin{bmatrix}
1 & 1 \\
1 & -1 \\
\end{bmatrix}\]
<p>Thus if applied on</p>
\[\alpha \ket{0} + \beta \ket{1}\]
<p>It yields</p>
\[\alpha \frac{\ket{0} + \ket{1}}{\sqrt{2}} + \beta \frac{\ket{0} - \ket{1}}{\sqrt{2}}\]
<p>The Hadamard gate can be drawn like a classic gate:</p>
<figure class="center_children">
<img src="https://www.kuniga.me/resources/blog/2020-10-11-deutsch-jozsa-algorithm/hadamard.png" alt="a diagram depicting the Hadamard gate" />
<figcaption>Figure 1: The Hadamard gate</figcaption>
</figure>
<h3 id="the-cnot-gate">The CNOT Gate</h3>
<p>The CNOT gate, also known as <em>controlled-NOT</em>, takes two qubits, called <em>control</em> and <em>target</em>. It can be represented by a 4 x 4 matrix:</p>
\[U_{CN} = \begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 \\
\end{bmatrix}\]
<p>So for a state given by:</p>
\[\ket{\psi} = \alpha_{00} \ket{00} + \alpha_{01} \ket{01} + \alpha_{10} \ket{10} + \alpha_{11} \ket{11}\]
<p>We’ll end up with</p>
\[\ket{\psi} = \alpha_{00} \ket{00} + \alpha_{01} \ket{01} + \alpha_{11} \ket{10} + \alpha_{10} \ket{11}\]
<p>Where the 3-rd and 4-th terms got swapped. Let’s consider some special cases.</p>
<p>If the first qubit is $\ket{0}$ (that is, $\alpha_{10} = \alpha_{11} = 0$) then the initial state is</p>
\[\ket{\psi} = \alpha_{00} \ket{00} + \alpha_{01} \ket{01}\]
<p>and applying the gate preserves the state. If the first qubit is $\ket{1}$ (that is, $\alpha_{00} = \alpha_{01} = 0$), then the initial state is</p>
\[\ket{\psi} = \alpha_{11} \ket{10} + \alpha_{10} \ket{11}\]
<p>and the resulting state is as if the NOT gate had been applied:</p>
\[\ket{\psi}' = \alpha_{10} \ket{10} + \alpha_{11} \ket{11}\]
<p>In other words, the first qubit controls whether the second qubit will be NOTed, hence the name <em>control</em>. In a classical world, this could be achieved by the XOR operation between the first and second bits, denoted by the symbol $\oplus$, so the same notation is used to represent the result of the second qubit.</p>
<p>Summarizing, if we’re given qubits $\ket{x, y}$, this gate returns $\ket{x, x \oplus y}$:</p>
<figure class="center_children">
<img src="https://www.kuniga.me/resources/blog/2020-10-11-deutsch-jozsa-algorithm/cnot.png" alt="a diagram depicting the CNOT gate" />
<figcaption>Figure 2: The CNOT gate</figcaption>
</figure>
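<p>Since CNOT only swaps the amplitudes of $\ket{10}$ and $\ket{11}$, it's easy to verify the $\ket{x, y} \rightarrow \ket{x, x \oplus y}$ behavior directly on the computational basis states. A small sketch:</p>

```python
def apply_cnot(amps):
    """CNOT on [a00, a01, a10, a11]: swaps the amplitudes of |10> and |11>."""
    a00, a01, a10, a11 = amps
    return [a00, a01, a11, a10]

# basis-state behavior matches |x, y> -> |x, x XOR y>
for x in (0, 1):
    for y in (0, 1):
        basis = [0] * 4
        basis[2 * x + y] = 1      # amplitude 1 on basis state |x, y>
        out = apply_cnot(basis)
        assert out[2 * x + (x ^ y)] == 1
print(apply_cnot([0, 0, 1, 0]))   # → [0, 0, 0, 1], i.e. |10> becomes |11>
```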
<h3 id="quantum-circuits">Quantum Circuits</h3>
<p>A <em>quantum circuit</em> is simply a composition of one or more quantum gates, analogous to a classical circuit.</p>
<h2 id="quantum-parallelism">Quantum Parallelism</h2>
<p>Suppose we have a gate, which we’ll call $U_f$, that transforms a 2-qubit state $\ket{x,y}$ into $\ket{x, y \oplus f(x)}$ where $f(x)$ is any function that transforms a qubit into a $\ket{0}$ or $\ket{1}$ (i.e. a computational basis state) and $\oplus$ the XOR operator as defined in the CNOT gate. We’ll treat it as a blackbox but it can be shown to be a valid quantum gate (i.e. it has a corresponding unitary matrix transformation).</p>
<figure class="center_children">
<img src="https://www.kuniga.me/resources/blog/2020-10-11-deutsch-jozsa-algorithm/parallel.png" alt="a diagram depicting the U_f circuit" />
<figcaption>Figure 3: Quantum circuit to create a superposition of $f(0)$ and $f(1)$</figcaption>
</figure>
<p>Say $\ket{x} = \frac{\ket{0} + \ket{1}}{\sqrt{2}}$ and $\ket{y} = \ket{0}$, where the state $\ket{x}$ can be obtained by applying the Hadamard gate over $\ket{0}$. Then the resulting state will be</p>
\[\frac{\ket{0, f(0)} + \ket{1, f(1)}}{\sqrt{2}}\]
<p>This is interesting because it contains the evaluation of $f(x)$ for both values $\ket{0}$ and $\ket{1}$.</p>
<p>We can generalize $\ket{x}$ to have $n$ qubits and apply the Hadamard gate to each of them, which can be denoted as $H^{\otimes n}$. For $n = 2$ if we apply $H^{\otimes 2}$ to $\ket{00}$ we get:</p>
\[\bigg( \frac{\ket{0} + \ket{1}}{\sqrt{2}} \bigg) \bigg( \frac{\ket{0} + \ket{1}}{\sqrt{2}} \bigg) = \frac{\ket{00} + \ket{01} + \ket{10} + \ket{11}}{2}\]
<p>In general it’s possible to show that if we apply $H^{\otimes n}$ to $\ket{0}^{\otimes n}$ we get:</p>
\[H^{\otimes n}(\ket{0}^{\otimes n}) = \frac{1}{\sqrt{2^n}} \sum_x \ket{x}\]
<p>Where $x$ is all binary numbers with $n$ bits.</p>
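<p>We can confirm this closed form by explicitly tensoring $n$ copies of $H\ket{0}$, as in this small sketch:</p>

```python
import math

def tensor(u, v):
    """Tensor (Kronecker) product of two amplitude vectors."""
    return [a * b for a in u for b in v]

def hadamard(q):
    """Hadamard gate on a single-qubit amplitude pair [a, b]."""
    s = 1 / math.sqrt(2)
    return [s * (q[0] + q[1]), s * (q[0] - q[1])]

def hadamard_all_zero(n):
    """Amplitudes of H^{⊗n} |0>^{⊗n}: tensor n copies of H|0> together."""
    state = hadamard([1, 0])  # H|0> = (|0> + |1>)/sqrt(2)
    for _ in range(n - 1):
        state = tensor(state, hadamard([1, 0]))
    return state

amps = hadamard_all_zero(3)
print(len(amps))  # → 8 amplitudes, each equal to 1/sqrt(8)
```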
<p>Going back to our original gate $U_f$, if we set $\ket{x} = H^{\otimes n}(\ket{0}^{\otimes n})$, we’ll get the state</p>
\[\frac{1}{\sqrt{2^n}} \sum_x \ket{x} \ket{f(x)}\]
<p>The insight is that we now have a state encoding $2^n$ values of $f(x)$ and we achieved that using only $O(n)$ gates. Unfortunately as we saw in <em>Measuring the state of a Qubit</em>, there’s no way to extract all these values from a quantum state, and once we perform a measurement only one of the values of $x$ will be returned.</p>
<p>We’ll see next how to “entangle” these values such that any measurement yields a value that depends on the computation of $f(x)$ for all values of $x$.</p>
<h2 id="the-deutsch-algorithm">The Deutsch Algorithm</h2>
<p>The Deutsch Algorithm is a more complex circuit using the $U_f$ gate from the previous section, which involves applying the Hadamard gate to both input qubits and then to the first qubit of the output:</p>
<figure class="center_children">
<img src="https://www.kuniga.me/resources/blog/2020-10-11-deutsch-jozsa-algorithm/deutsch.png" alt="a diagram depicting the Deutsch Algorithm" />
<figcaption>Figure 4: Quantum circuit representing the Deutsch Algorithm</figcaption>
</figure>
<p>Let’s follow the state at each step of the circuit:</p>
\[\ket{\psi_0} = \ket{01}\]
<p>Applying the Hadamard gate to each of the qubits:</p>
\[\ket{\psi_1} = \bigg[ \frac{\ket{0} + \ket{1}}{\sqrt{2}} \bigg] \bigg[ \frac{\ket{0} - \ket{1}}{\sqrt{2}} \bigg]\]
<p>Let’s assume $\ket{x} = \frac{\ket{0} + \ket{1}}{\sqrt{2}}$ and $\ket{y} = \frac{\ket{0} - \ket{1}}{\sqrt{2}}$.</p>
<p>Now, what is the result of applying $U_f$ over $\ket{\psi_1}$? First let’s assume the first qubit is either $\ket{0}$ or $\ket{1}$ (i.e. a computational basis). Now suppose $f(x) = \ket{0}$. Then $f(x) \oplus y = y$ and $U_f$ will yield the same state $\ket{x} \ket{y}$. If $f(x) = \ket{1}$, then $f(x) \oplus y$ is $y$ with its terms flipped, which in this particular case is $-y$, so in general we have:</p>
\[U_f(\ket{x} \ket{y}) = (-1)^{f(x)} \ket{x} \ket{y}\]
<p>However, $\ket{x}$ is not in a computational basis state, but we can use the linearity principle when applying a function over a quantum state, that is:</p>
\[f(\ket{x}) = f(\alpha \ket{0} + \beta \ket{1}) = \alpha f(\ket{0}) + \beta f(\ket{1})\]
<p>Since $\ket{x} = \frac{\ket{0} + \ket{1}}{\sqrt{2}}$ the output of $U_f(\ket{x} \ket{y})$ is</p>
\[U_f(\ket{x} \ket{y}) = \frac{(-1)^{f(\ket{0})} \ket{0} + (-1)^{f(\ket{1})} \ket{1}}{\sqrt{2}} \ket{y}\]
<p>We can group the results in two cases: one where $f(\ket{0}) = f(\ket{1})$, in which case $(-1)^{f(\ket{0})}$ and $(-1)^{f(\ket{1})}$ have the same sign, say $z = \pm 1$:</p>
\[U_f(\ket{x} \ket{y}) = \frac{z \ket{0} + z \ket{1}}{\sqrt{2}} \ket{y} = \pm \frac{\ket{0} + \ket{1}}{\sqrt{2}} \ket{y} = \pm \ket{x} \ket{y}\]
<p>another where $f(\ket{0}) \neq f(\ket{1})$, in which case $(-1)^{f(\ket{0})}$ and $(-1)^{f(\ket{1})}$ have opposite signs, so say $(-1)^{f(\ket{0})} = z$ and $(-1)^{f(\ket{1})} = -z$:</p>
\[U_f(\ket{x} \ket{y}) = \frac{z \ket{0} - z \ket{1}}{\sqrt{2}} \ket{y} = \pm \frac{\ket{0} - \ket{1}}{\sqrt{2}} \ket{y} = \pm \ket{\bar x} \ket{y}\]
<p>Where we define $\ket{\bar{x}} = \frac{\ket{0} - \ket{1}}{\sqrt{2}}$, that is, it’s $\ket{x}$ with the sign of $\ket{1}$ negated.</p>
<p>Summarizing,</p>
\[\ket{\psi_2} = \begin{cases}
\pm \ket{x}\ket{y} & \text{if } f(0) = f(1) \\
\pm \ket{\bar x}\ket{y}, & \text{if } f(0) \neq f(1)
\end{cases}\]
<p>To compute $\ket{\psi_3}$ we just need to apply the Hadamard gate on the first qubit. We can show that $\ket{H(x)} = \ket{0}$ and $\ket{H(\bar x)} = \ket{1}$, so:</p>
\[\ket{\psi_3} = \begin{cases}
\pm \ket{0}\ket{y} & \text{if } f(0) = f(1) \\
\pm \ket{1}\ket{y}, & \text{if } f(0) \neq f(1)
\end{cases}\]
<p>This can be further compacted by noting $\ket{f(0) \oplus f(1)} = \ket{0}$ if $f(0) = f(1)$ and $\ket{f(0) \oplus f(1)} = \ket{1}$ otherwise:</p>
\[\ket{\psi_3} = \ket{f(0) \oplus f(1)}\ket{y}\]
<p>This also makes it clearer that if we measure the first qubit, regardless of the result, it has to have computed both $f(0)$ and $f(1)$. We don’t have access to the individual values of the function evaluations, but we can access the result of a computation that evaluated $2$ functions in one operation. This gain will be more obvious next where we generalize this to $n$ qubits.</p>
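<p>The whole circuit is small enough to simulate with explicit matrices. The sketch below builds a matrix for $U_f$ from a classical function $f$ and checks that measuring the first qubit of $\ket{\psi_3}$ yields $f(0) \oplus f(1)$ (a toy simulation in plain Python, not an efficient one):</p>

```python
import math

s = 1 / math.sqrt(2)
H = [[s, s], [s, -s]]
I2 = [[1, 0], [0, 1]]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def kron(A, B):
    """Kronecker product; basis ordering is |first qubit, second qubit>."""
    return [[A[i][j] * B[k][l] for j in range(len(A[0])) for l in range(len(B[0]))]
            for i in range(len(A)) for k in range(len(B))]

def u_f(f):
    """Matrix of U_f: |x, y> -> |x, y XOR f(x)>."""
    M = [[0] * 4 for _ in range(4)]
    for x in (0, 1):
        for y in (0, 1):
            M[2 * x + (y ^ f(x))][2 * x + y] = 1
    return M

def deutsch(f):
    """One application of U_f suffices to compute f(0) XOR f(1)."""
    state = [0, 1, 0, 0]                # |psi_0> = |01>
    state = matvec(kron(H, H), state)   # |psi_1>
    state = matvec(u_f(f), state)       # |psi_2>
    state = matvec(kron(H, I2), state)  # |psi_3>
    p1 = abs(state[2]) ** 2 + abs(state[3]) ** 2  # P(first qubit = |1>)
    return 1 if p1 > 0.5 else 0

print(deutsch(lambda x: 0), deutsch(lambda x: x))  # → 0 1
```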
<h2 id="the-deutsch-jozsa-algorithm">The Deutsch-Jozsa Algorithm</h2>
<p>The Deutsch-Jozsa Algorithm is a generalization of the Deutsch for $n$-qubits. The circuit is almost exactly the same:</p>
<figure class="center_children">
<img src="https://www.kuniga.me/resources/blog/2020-10-11-deutsch-jozsa-algorithm/deutsch_jozsa.png" alt="a diagram depicting the Deutsch-Jozsa Algorithm" />
<figcaption>Figure 5: Quantum circuit representing the Deutsch-Josza Algorithm</figcaption>
</figure>
<p>The difference is that instead of one qubit for the state on top, we generalize to $n$-qubits. Let’s analyze the state at each step of the circuit:</p>
\[\ket{\psi_0} = \ket{0}^{\otimes n} \ket{1}\]
<p>Applying a Hadamard gate to $\ket{0}^{\otimes n}$ yields $\frac{1}{\sqrt{2^n}} \sum_x \ket{x}$ as we saw in <em>Quantum Parallelism</em>, so $\ket{\psi_1}$ is:</p>
\[\ket{\psi_1} = \frac{1}{\sqrt{2^n}} \sum_x \ket{x} \ket{y}\]
<p>Where $\ket{y} = \frac{\ket{0} - \ket{1}}{\sqrt{2}}$ as in the <em>Deutsch Algorithm</em>. Let’s call $\ket{X} = \frac{1}{\sqrt{2^n}} \sum_x \ket{x}$.</p>
<p>As we saw in the previous section, if we assume $x$ is in a computational basis state, then</p>
\[U_f(\ket{x} \ket{y}) = (-1)^{f(x)} \ket{x} \ket{y}\]
<p>This remains true for any number of qubits because both $f(\ket{x})$ and $\ket{y}$ are still one qubit. Leveraging the linearity of terms in $\ket{X}$ we have</p>
\[\ket{\psi_2} = U_f(\ket{X} \ket{y}) = \frac{1}{\sqrt{2^n}} \sum_x U_f(\ket{x} \ket{y}) = \frac{1}{\sqrt{2^n}} \sum_x (-1)^{f(x)} \ket{x} \ket{y}\]
<p>Let’s define</p>
\[\ket{X_2} = \frac{1}{\sqrt{2^n}} \sum_x (-1)^{f(x)} \ket{x}\]
<p>We now need to apply the Hadamard gate to $\ket{X_2}$. We know how to compute the Hadamard gate for $\ket{0}^{\otimes n}$, but let’s see how to do this for an arbitrary computational basis state. To get an intuition, we can try to find a pattern for 1-qubit:</p>
\[H(\ket{0}) = \frac{\ket{0} + \ket{1}}{\sqrt{2}}\]
\[H(\ket{1}) = \frac{\ket{0} - \ket{1}}{\sqrt{2}}\]
<p>They look very similar except for the signs on $\ket{1}$; if we could parametrize that sign based on the input and on which term of the output we’re at, this could be succinctly represented as a summation. It turns out it’s possible: let $z$ be a computational basis state of the output (that is, $\ket{0}$ or $\ket{1}$). If we define the term $(-1)^{xz} \ket{z}$, we can show that</p>
\[H(\ket{x}) = \sum_{z} \frac{(-1)^{xz} \ket{z}}{\sqrt{2}}\]
<p>It’s possible to generalize this to $n$-qubits:</p>
\[H^{\otimes n}(\ket{x}) = \sum_{z} \frac{(-1)^{x \cdot z} \ket{z}}{\sqrt{2^n}}\]
<p>Where $x \cdot z$ is the inner product modulo 2. Again, this assumes $\ket{x}$ is in a computational basis state. Our $\ket{X_2}$ is not, but its terms are, so we can simply use the linearity principle:</p>
\[H^{\otimes n}(\ket{X_2}) = \frac{1}{\sqrt{2^n}} \sum_x (-1)^{f(x)} H^{\otimes n} (\ket{x}) = \frac{1}{2^n} \sum_x (-1)^{f(x)} \sum_z (-1)^{x \cdot z} \ket{z}\]
<p>We can exchange the summation over $x$ with that of $z$ for a cleaner form:</p>
\[H^{\otimes n}(\ket{X_2}) = \frac{\sum_z \sum_x (-1)^{x \cdot z + f(x)} \ket{z}}{2^n}\]
<p>Using this, we can finally compute the final state of the circuit:</p>
\[\ket{\psi_3} = \frac{\sum_z \sum_x (-1)^{x \cdot z + f(x)} \ket{z}}{2^n} \ket{y}\]
<p>If we measure the state of the first $n$-qubits, we’ll obtain some basis state $\ket{z}$, whose amplitude $\frac{\sum_x (-1)^{x \cdot z + f(x)}}{2^n}$ contains a computation involving all the $2^n$ computational basis states, and we did so with only $O(n)$ operations!</p>
<p>What can we do with this? We’ll next present a contrived problem which can be solved using this result.</p>
<h2 id="the-deutschs-problem">Deutsch’s Problem</h2>
<p>The Deutsch problem can be described as follows: let $f(x)$ be a function that takes an $n$-bit number and returns true or false. It can be either a <em>constant function</em>, one that returns true (or false, but not both) for all its inputs, or a <em>balanced function</em>, which returns true for exactly half of its inputs.</p>
<p>To be super clear, an example of a constant function is:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">constant</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="k">return</span> <span class="n">true</span></code></pre></figure>
<p>An example of a balanced function is:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">balanced</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="k">return</span> <span class="n">x</span> <span class="o">%</span> <span class="mi">2</span></code></pre></figure>
<p>It’s easy to tell which type the example functions above are, but in general we’d need to evaluate the function on more than half of all possible inputs, that is $2^n/2 + 1$ of them, which makes this classification process exponential on a classical computer.</p>
<p>For a quantum computer, we can assume we used $U_f$ and have the state:</p>
\[\ket{\psi_3} = \frac{\sum_z \sum_x (-1)^{x \cdot z + f(x)} \ket{z}}{2^n} \ket{y}\]
<p>Let $\ket{X_3}$ be the non-$y$ part of this state:</p>
\[\ket{X_3} = \frac{\sum_z \sum_x (-1)^{x \cdot z + f(x)} \ket{z}}{2^n}\]
<p>Suppose we performed a measurement on these qubits and got the state $z = \ket{0}^{\otimes n}$. The corresponding amplitude of this state is</p>
\[\alpha_{0^{\otimes n}} = \frac{\sum_x (-1)^{x \cdot z + f(x)}}{2^n} = \frac{\sum_x (-1)^{f(x)}}{2^n}\]
<p>The last step comes from $x \cdot \vec{0} = 0$, and the probability of getting that state is $\abs{\alpha_{0^{\otimes n}}}^2$.</p>
<p>Now suppose $f(x)$ is constant and always returns $k$. Then</p>
\[\alpha_{0^{\otimes n}} = (-1)^k \frac{\sum_x 1}{2^n} = (-1)^k \frac{2^n}{2^n} = \pm 1\]
<p>This means that if $f(x)$ is constant, then $\abs{\alpha_{0^{\otimes n}}}^2 = 1$ and since the sum of the squares of the amplitudes must be 1, the other amplitudes are 0, and we’ll obtain state $z = \ket{0}^{\otimes n}$ with 100% probability.</p>
<p>Now suppose $f(x)$ is balanced. Then half of the terms in $\sum_x (-1)^{f(x)}$ will be positive ($f(x) = 0$), and half negative ($f(x) = 1$), so the $\alpha_{0^{\otimes n}} = 0$ and the probability of obtaining $z = \ket{0}^{\otimes n}$ is 0.</p>
<p>This gives a simple proxy to determine whether $f(x)$ is constant or balanced: if we measure $\ket{X_3}$ and get $\ket{0}^{\otimes n}$, the function is constant, otherwise it’s balanced, and we determined this in $O(n)$ operations as opposed to the $O(2^n)$ of a classical computer.</p>
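<p>We can verify this numerically by computing the amplitude of $z = \ket{0}^{\otimes n}$, which is $\frac{\sum_x (-1)^{f(x)}}{2^n}$, directly from a classical description of $f$. A sketch (note the quantum circuit obtains this with $O(n)$ gates, while the brute-force sum below takes $O(2^n)$ classical steps):</p>

```python
def amp_all_zero(f, n):
    """Amplitude of z = |0...0> after Deutsch-Jozsa: sum over x of (-1)^f(x), over 2^n."""
    return sum((-1) ** f(x) for x in range(2 ** n)) / 2 ** n

n = 4
print(abs(amp_all_zero(lambda x: 1, n)))      # → 1.0 (constant: always measure |0...0>)
print(abs(amp_all_zero(lambda x: x & 1, n)))  # → 0.0 (balanced: never measure |0...0>)
```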
<h2 id="conclusion">Conclusion</h2>
<p>In this post we covered the Deutsch-Jozsa Algorithm, which on one hand provides an example in which a quantum computation outperforms a classical one, and on the other hand is simple enough that it requires only the basics of quantum computing to understand, which allowed for a self-contained post.</p>
<p>I started learning quantum computing via Michael Nielsen’s Youtube videos [3]. Though the series was never finished, he also wrote a book with Isaac Chuang, <em>Quantum Computation and Quantum Information</em>, which seems very thorough and well praised, so I’m going to use that for my education.</p>
<p>Here are some difficulties I encountered so far. One is getting used to the notation: I think I get the idea of the <em>ket</em> operator, but it’s still not natural to remember when to wrap variables inside it. I also found it important to be aware of when the state of a qubit has to be in a computational basis state versus a more general state. The book seems to switch between these two types of state implicitly at times, which is hard to follow at first.</p>
<h2 id="references">References</h2>
<ul>
<li>[<a href="https://www.goodreads.com/book/show/35535406-beyond-weird">1</a>] Beyond Weird - Phillip Ball</li>
<li>[<a href="https://www.amazon.com/Quantum-Computation-Information-10th-Anniversary/dp/1107002176">2</a>] Quantum Computation and Quantum Information - Nielsen, M. and Chuang, I.</li>
<li>[<a href="http://michaelnielsen.org/blog/quantum-computing-for-the-determined/">3</a>] Quantum computing for the determined - Nielsen, M.</li>
</ul>Guilherme KunigamiDavid Elieser Deutsch is a British scientist at the University of Oxford, being a pioneer of the field of quantum computation by formulating a description for a quantum Turing machine. Richard Jozsa is an Australian mathematician at the University of Cambridge and is a co-inventor of quantum teleportation. Together they proposed the Deutsch-Jozsa Algorithm which, although not useful in practice, it provides an example where a quantum algorithm can outperform a classic one. In this post we’ll describe the algorithm and the basic theory of quantum computation behind it.Paper Reading - Latent Aspect Rating Analysis on Review Text Data2020-10-09T00:00:00+00:002020-10-09T00:00:00+00:00https://www.kuniga.me/blog/2020/10/09/lara<p>In [1] Wang, Lu and Zhai present a supervised machine learning model to solve a problem they name <em>Latent Aspect Rating Analysis</em>. To explain the problem, let’s start with some context. Sites like Amazon, TripAdvisor or Audible have a review system where the user provides a rating and a review text.</p>
<p>The overall rating might be too vague, so it’s desirable to break this rating down per what is called an <em>aspect</em>. The list of aspects depends on the use case: for lodging it could be room, cleanliness or location; for Audible it has performance and story. The model tries to infer these implicit (latent) ratings from only the overall rating and the review text.</p>
<p>The authors define a model that learns from training data and show promising results with test data. In this post we’ll focus on the theory presented in [1].</p>
<!--more-->
<h2 id="overview-of-the-solution">Overview of the solution</h2>
<p>To solve the <em>Latent Aspect Rating Analysis</em> problem, first the authors build a correlation matrix between words and aspects by scanning all the texts from the reviews in the training dataset. This matrix is then used in tandem with both the overall and aspect ratings in the training dataset to construct a model which is then used to infer the aspect ratings for the test dataset. The inferred values are then compared to the actual values to assess the quality of the model.</p>
<h2 id="definitions">Definitions</h2>
<h3 id="aspect-and-associated-keywords">Aspect and associated keywords</h3>
<p>We’ll have a fixed set of $k$ aspects (e.g. for hotels it could be location, cleanliness, price). We then let $A_i$ for $i \in [1, …, k]$ be the set of words associated with an aspect $i$. Note that these sets are not mutually exclusive, so for example “view” could be related to aspect “room” or “location”.</p>
<h3 id="word-aspect-correlation">Word-aspect correlation</h3>
<p>For each document $d \in D$, we have $W_d$, a $k \times n$ matrix where $W_{dij}$ is the frequency of word $w_j$ for aspect $i$, normalized by the total number of words in the text for that aspect, that is:</p>
<p>\(\sum_{j=1}^n W_{dij} = 1\) for all documents $d$ and aspects $i$.</p>
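<p>To make the normalization concrete, here is a minimal Python sketch of building the per-aspect rows of $W_d$ for one review (the aspect keyword sets, the tokenization and the example review are all hypothetical; the paper builds the keyword sets with a bootstrapping algorithm):</p>

```python
from collections import Counter

def word_aspect_matrix(review_words, aspect_keywords):
    """Build the k x n normalized frequency matrix W_d for one document.

    aspect_keywords is the list of sets A_1, ..., A_k. Row i counts only
    words in A_i and is normalized to sum to 1 (when the review mentions
    aspect i at all).
    """
    vocab = sorted(set().union(*aspect_keywords))
    counts = Counter(review_words)
    W = []
    for A_i in aspect_keywords:
        row = [counts[w] if w in A_i else 0 for w in vocab]
        total = sum(row)
        W.append([c / total if total else 0.0 for c in row])
    return vocab, W

# Hypothetical aspects "room" and "location", with "view" shared by both.
vocab, W = word_aspect_matrix(
    "great room with a view view location".split(),
    [{"room", "view"}, {"location", "view"}],
)
```

Note how “view” contributes to both rows, since the sets $A_i$ are not mutually exclusive.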
<h3 id="aspect-ratings-and-weights">Aspect ratings and weights</h3>
<p>The <em>aspect rating</em> for a document $d$, denoted by $s_d$, is a $k$-dimensional vector where $s_{di} \in [r_{min}, r_{max}]$ is the rating corresponding to aspect $i$. For example, a user might rate a hotel’s location as 4 (out of 5), but its cleanliness as 1.</p>
<p>The <em>aspect weight</em> for a document $d$, denoted by $\alpha_d$, is also a $k$-dimensional vector where $\alpha_{di} \in [0, 1]$ and $\sum_{i=1}^k \alpha_{di} = 1$, representing the weight the review $d$ puts on aspect $i$. For example, a user might put more importance on location than on price.</p>
<p>These are the two quantities the model will try to predict once it’s properly trained. Note that for a document $d$:</p>
\[(1) \quad r_d = \sum_{i=1}^{k} \alpha_{di} s_{di}\]
<h2 id="the-model">The Model</h2>
<h3 id="word-aspect-correlation-1">Word-aspect correlation</h3>
<p>We’ll gloss over the methodology for building the matrix $W_d$ and present it only at a high level. The idea is to seed each aspect with initial keywords. Then use a clustering algorithm to include more words into each of the aspects based on the reviews texts. Refer to [1] for more details.</p>
<h3 id="aspect-ratings">Aspect ratings</h3>
<p>The model defines a parameter $\beta$ representing the overall rating associated with a (word, aspect) pair. This can be thought of as the average rating a user would give to aspect $i$ based solely on word $j$, so $\beta_{ij} \in [r_{min}, r_{max}]$.</p>
<p>Since \(\sum_{j=1}^n W_{dij} = 1\), the model assumes that for document $d$ and aspect $i$, $s_{di}$ is a linear combination of $\beta_i$ and $W_{di}$:</p>
\[(2) \quad s_{di} = \sum_{j=1}^n \beta_{ij} W_{dij}\]
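<p>A tiny numeric sketch of equations (1) and (2) together, with all values made up ($k = 2$ aspects, $n = 3$ words):</p>

```python
import numpy as np

# Hypothetical instance: k = 2 aspects, n = 3 words.
W_d = np.array([[0.5, 0.5, 0.0],    # word frequencies; rows sum to 1
                [0.0, 0.2, 0.8]])
beta = np.array([[5.0, 3.0, 1.0],   # per-(aspect, word) ratings
                 [2.0, 4.0, 5.0]])
alpha_d = np.array([0.7, 0.3])      # aspect weights; sum to 1

s_d = (beta * W_d).sum(axis=1)      # equation (2): s_di = sum_j beta_ij W_dij
r_d = alpha_d @ s_d                 # equation (1): overall rating
```

Because each row of $W_d$ sums to 1 and $\beta_{ij} \in [r_{min}, r_{max}]$, each $s_{di}$ stays inside the rating range, and so does $r_d$.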
<h3 id="overal-ratings">Overall ratings</h3>
<p>In theory $r_d$ can be obtained by (1), but to encode uncertainty and allow variance, the model assumes $r_d$ is drawn from a Gaussian distribution, with mean (1) and some variance $\delta$ which will be another parameter in the model.</p>
\[r_{d} \sim N(\sum_{i=1}^{k} \alpha_{di} s_{di}, \delta^2)\]
<p>Expanding $s_{di}$ using (2):</p>
\[(3) \quad r_{d} \sim N(\sum_{i=1}^{k} \alpha_{di} \sum_{j=1}^{n} \beta_{ij} W_{dij}, \delta^2)\]
<h3 id="correlated-aspect-weights">Correlated aspect weights</h3>
<p>The authors argue that aspect weights are not completely independent (e.g. a preference for cleanliness is correlated with a preference for room, maybe less so with price).</p>
<p>To model this, we assume $\alpha_{d}$ is drawn from a multi-variate Gaussian distribution:</p>
\[\alpha_{d} \sim N(\mu, \Sigma)\]
<p>where $\mu$ is a $k$-dimensional vector and $\Sigma$ a $k \times k$ covariance matrix encoding the correlation between pairs of aspects.</p>
<h3 id="objective-function">Objective function</h3>
<p>Let’s bundle the parameters needed to compute $r_d$ into $\Theta = (\mu, \Sigma, \delta^2, \beta)$. We then define the likelihood of observing $D$ (i.e. the overall and aspect ratings) as</p>
\[p(D \mid \Theta)\]
<p>We need to find $\hat \Theta = \mbox{argmax}_\Theta \, p(D \mid \Theta)$.</p>
<h2 id="solving-the-model">Solving the model</h2>
<p>We’ll use an iterative algorithm. We start with an initial estimate $\Theta_0$ for $\Theta$.</p>
<p>Because $\alpha_d$ is also considered a random variable, we can estimate it from $\Theta_t$. We then compute $\Theta_{t+1}$ assuming $\alpha_d$ is constant. We keep iterating until $\Theta_t$ converges.</p>
<h3 id="estimating-aspect-weights">Estimating aspect weights</h3>
<p>Given a rating $r_{d}$, how can we estimate $\alpha_{d}$, that is $p(\alpha_d \mid r_d)$? In particular what is the most probable value of $\alpha_{d}$? We can use the <em>Maximum a Posteriori Estimate</em> (MAP) [2] method to find:</p>
\[\hat \alpha_{d} = \mbox{argmax}_{\alpha_d} p(\alpha_d \mid r_d)\]
<p>Which we show in <em>Appendix A</em>,</p>
\[(4) \quad \hat \alpha_{d} = \mbox{argmax}_{\alpha_d} \bigg[ - \frac{(r_d - \alpha_d^T s_d)^2}{2 \delta^2} -\frac{1}{2} (\alpha_d - \mu)^T \Sigma^{-1} (\alpha_d - \mu) \bigg]\]
<p>We can use a non-linear optimization method such as <a href="https://www.kuniga.me/blog/2020/09/04/lbfgs">L-BFGS</a> to solve this, using the gradient of (4):</p>
\[- \frac{(r_d - \alpha_d^T s_d) s_d}{\delta^2} - \Sigma^{-1} (\alpha_d - \mu)\]
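<p>As a sanity check of (4) and its gradient, here is a small numpy sketch that maximizes the objective with plain gradient ascent instead of L-BFGS (all numbers are made up, and the simplex constraint on $\alpha_d$ is ignored for simplicity):</p>

```python
import numpy as np

k = 3
s_d = np.array([4.0, 2.5, 3.5])             # aspect ratings of one document
r_d = 3.2                                    # observed overall rating
delta2 = 0.25                                # delta^2
mu = np.full(k, 1.0 / k)
Sigma_inv = np.linalg.inv(0.1 * np.eye(k))   # Sigma^{-1}

def grad(alpha):
    """Gradient of the MAP objective (4) with respect to alpha_d."""
    return (r_d - alpha @ s_d) * s_d / delta2 - Sigma_inv @ (alpha - mu)

alpha = mu.copy()
for _ in range(5000):
    alpha += 1e-3 * grad(alpha)              # ascent: we are maximizing
```

Since the objective is a concave quadratic in $\alpha_d$, the gradient vanishes at the optimum, which is what the iteration converges to.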
<h3 id="estimating-parameters-mu-and-sigma">Estimating parameters $\mu$ and $\Sigma$</h3>
<p>To estimate $\Theta$ we’ll use <em>Maximum Likelihood Estimate</em> (MLE). Note that since there’s no constraint linking $\mu$, $\Sigma$, $\beta$ and $\delta$ to each other, we can estimate them individually.</p>
<p>First we estimate $\mu$ and $\Sigma$ to maximize the likelihood of observing $\alpha$, $p(\alpha_1, \cdots, \alpha_D \mid \mu, \Sigma)$.</p>
<p>We show in <em>Appendix A</em> that if</p>
\[\mu_{MLE} = \mbox{argmax}_{\mu} p(\alpha_1, \cdots, \alpha_D \mid \mu, \Sigma)\]
<p>then</p>
\[\mu_{MLE} = \frac{1}{\mid D \mid} \sum_{d \in D} \alpha_d\]
<p>It can be shown [4] that if</p>
\[\Sigma_{MLE} = \mbox{argmax}_{\Sigma} p(\alpha_1, \cdots, \alpha_D \mid \mu, \Sigma)\]
<p>then</p>
\[\Sigma_{MLE} = \frac{1}{\mid D \mid} \sum_{d \in D} (\alpha_d - \mu_{MLE})(\alpha_d - \mu_{MLE})^{T}\]
<p>Summarizing, from an iterative algorithm perspective, we’re given $\alpha$ computed previously and we can estimate the next iteration $\mu_{t+1}$ and $\Sigma_{t+1}$:</p>
\[\mu_{t+1} = \frac{1}{\mid D \mid} \sum_{d \in D} \alpha_d\]
\[\Sigma_{t+1} = \frac{1}{\mid D \mid} \sum_{d \in D} (\alpha_d - \mu_{t+1})(\alpha_d - \mu_{t+1})^{T}\]
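<p>These two updates are just the sample mean and the (biased) sample covariance of the current $\alpha_d$ estimates. A quick numpy sketch, with randomly generated simplex vectors standing in for the $\alpha_d$:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
alphas = rng.dirichlet(np.ones(3), size=500)  # stand-ins for the alpha_d's

mu_next = alphas.mean(axis=0)                 # mu_{t+1}
diff = alphas - mu_next
Sigma_next = diff.T @ diff / len(alphas)      # Sigma_{t+1}: biased (MLE) covariance
```

Note the $1 / \mid D \mid$ factor: this is the MLE (biased) covariance, not the $1 / (\mid D \mid - 1)$ unbiased estimator.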
<h3 id="estimating-parameters-beta-and-delta2">Estimating parameters $\beta$ and $\delta^2$</h3>
<p>Now we estimate $\beta$ and $\delta^2$ to maximize the likelihood of observing $r$, $p(r_1, \cdots, r_D \mid \beta, \delta^2)$.</p>
<p>We show in <em>Appendix A</em> that if:</p>
\[\delta_{MLE}^2 = \mbox{argmax}_{\delta^2} p(r_1, \cdots, r_D \mid \beta, \delta^2)\]
<p>then</p>
\[\delta_{MLE}^2 = \frac{\sum_{d \in D} (r_d - \alpha_d^T s_d)^2}{\mid D \mid}\]
<p>For $\beta$ we need to expand $\alpha_d^T s_d$ since it’s a function of $\beta$, that is $\alpha_d^T s_d = \sum_{i = 1}^{k} \alpha_{di} \beta_i^T W_{di}$. This gives us:</p>
\[\beta_{MLE} = \mbox{argmax}_{\beta} \sum_{d \in D} \log(\frac{1}{\delta \sqrt{2 \pi}}) - \frac{1}{2} (\frac{r_d - \sum_{i = 1}^{k} \alpha_{di} \beta_i^T W_{di}}{\delta})^2\]
<p>Discarding the terms independent of $\beta$:</p>
\[\beta_{MLE} = \mbox{argmax}_{\beta} - \sum_{d \in D} (r_d - \sum_{i = 1}^{k} \alpha_{di} \beta_i^T W_{di})^2\]
<p>The authors claim the closed form of this optimization requires inverting an $n \times n$ matrix, which is costly. The proposal is to also use a non-linear optimization method such as L-BFGS, like we did for $\hat \alpha_{d}$. The $j$-th component of the gradient of the log-likelihood function above is:</p>
\[\frac{\partial \mathcal{L}({\beta})}{\partial \beta_j} = \sum_{d \in D} \big( (r_d - \sum_{i = 1}^{k} \alpha_{di} \beta_i^T W_{di}) \alpha_{dj} W_{dj} \big)\]
<p>Summarizing, from an iterative algorithm perspective, we’re given $\alpha$ computed previously and we can estimate the next iteration $\delta^2_{t+1}$ and $\beta_{t+1}$:</p>
\[\delta^2_{t+1} = \frac{\sum_{d \in D} (r_d - \alpha_d^T s_d)^2}{\mid D \mid}\]
<p>and</p>
\[\beta_{t+1} = \mbox{LBFGS}(\beta_{t}, r, \alpha, s, W)\]
<h2 id="conclusion">Conclusion</h2>
<p>In this post we studied the theory of the paper <em>Latent Aspect Rating Analysis on Review Text Data</em>. I also planned to work on a Python implementation but it will have to wait. It’s the first machine learning paper I can remember reading and it was interesting to see how much calculus is involved. I had past experience with linear (discrete and continuous) optimization, which involves mostly basic linear algebra.</p>
<p>In studying this post I had to watch many videos on basic machine learning math, via the <a href="https://www.youtube.com/channel/UCcAtD_VYwcYwVbTdvArsm7w">mathematicalmonk</a> channel. I also realized how rusty I am in regards to basic derivative operations, so I’ll need a refresher.</p>
<h2 id="references">References</h2>
<ul>
<li>[<a href="https://www.cs.virginia.edu/~hw5x/paper/rp166f-wang.pdf">1</a>] Latent Aspect Rating Analysis on Review Text Data: A Rating Regression Approach</li>
<li>[<a href="https://www.youtube.com/watch?v=kkhdIriddSI&list=PLD0F06AA0D2E8FFBA">2</a>] (ML 6.1) Maximum a posteriori (MAP) estimation - Youtube</li>
<li>[<a href="https://www.youtube.com/watch?v=aHwsEXCk4HA&list=PLD0F06AA0D2E8FFBA">3</a>] (ML 4.1) Maximum Likelihood Estimation (MLE) - Youtube</li>
<li>[<a href="https://people.eecs.berkeley.edu/~jordan/courses/260-spring10/other-readings/chapter13.pdf">4</a>] Chapter 13 - The Multivariate Gaussian - Michael I. Jordan</li>
</ul>
<h2 id="appendix-a">Appendix A</h2>
<h3 id="finding-hat-alpha_d">Finding $\hat \alpha_{d}$</h3>
<p>Here we derive $\hat \alpha_{d}$ as a function of $\Theta$.</p>
\[\hat \alpha_{d} = \mbox{argmax}_{\alpha_d} p(\alpha_d \mid r_d)\]
<p>Using Bayes’ Rule, we have</p>
\[p(\alpha_d \mid r_d) = \frac{p(r_d \mid \alpha_d) p(\alpha_d)}{p(r_d)}\]
<p>Since $\log$ is a monotonic function and $p(r_d)$ doesn’t depend on $\alpha_{d}$, finding the $\alpha_{d}$ that maximizes the above is equivalent to maximizing</p>
\[\hat \alpha_{d} = \mbox{argmax}_{\alpha_d} \log p(r_d \mid \alpha_d) + \log p(\alpha_d)\]
<p>Let’s consider $p(r_d \mid \alpha_d)$ first. We know it’s the Gaussian with mean $\alpha_d^T s_d$ and variance $\delta^2$ (3), so by the definition of a <a href="https://en.wikipedia.org/wiki/Normal_distribution">Gaussian</a>,</p>
\[p(r_d \mid \alpha_d) = \frac{1}{\delta \sqrt{2 \pi}} \exp(-\frac{1}{2} (\frac{r_d - \alpha_d^T s_d}{\delta})^2)\]
<p>Applying log we get,</p>
\[(5) \quad \log p(r_d \mid \alpha_d) = \log(\frac{1}{\delta \sqrt{2 \pi}}) - \frac{1}{2} (\frac{r_d - \alpha_d^T s_d}{\delta})^2\]
<p>The first term is independent of $\alpha_d$, so we’re left with</p>
\[- \frac{(r_d - \alpha_d^T s_d)^2}{2 \delta^2}\]
<p>Let’s now look at $p(\alpha_d)$. It’s a multi-variate Gaussian, so by definition:</p>
\[(G) \quad p(\alpha_d) = \frac{\exp(-\frac{1}{2} (\alpha_d - \mu)^T \Sigma^{-1} (\alpha_d - \mu))}{\sqrt{(2 \pi)^{k} \mid \Sigma \mid}}\]
<p>Where $k$ is the dimension of $\alpha_d$, $\mu$ and $\Sigma$ ($k \times k$) and $\mid \Sigma \mid$ is the determinant of $\Sigma$. If we apply $\log$ and discard terms unrelated to $\alpha_d$ we get</p>
\[-\frac{1}{2} (\alpha_d - \mu)^T \Sigma^{-1} (\alpha_d - \mu)\]
<p>Putting everything together, we want to find</p>
\[(4) \quad \hat \alpha_{d} = \mbox{argmax}_{\alpha_d} \bigg[ - \frac{(r_d - \alpha_d^T s_d)^2}{2 \delta^2} -\frac{1}{2} (\alpha_d - \mu)^T \Sigma^{-1} (\alpha_d - \mu) \bigg]\]
<h3 id="finding-mu_mle">Finding $\mu_{MLE}$</h3>
<p>We have that</p>
<p>\(\mu_{MLE} = \mbox{argmax}_{\mu} \, p(\alpha_1, \cdots, \alpha_D \mid \mu, \Sigma)\), where \(p(\alpha_1, \cdots, \alpha_D \mid \mu, \Sigma) = \prod_{d \in D} p(\alpha_d \mid \mu, \Sigma)\).</p>
<p>Recall that $\alpha_d$ is sampled from a multi-variate Gaussian. Since $\log$ is a monotonically increasing function, we can take the $\log$ on the right side of the equation:</p>
\[\quad \mu_{MLE} = \mbox{argmax}_{\mu} -\sum_{d \in D} (\alpha_d - \mu)^T \Sigma^{-1} (\alpha_d - \mu)\]
<p>We can take the derivative with respect to $\mu$ to obtain the gradient (up to a constant factor):</p>
\[\sum_{d \in D} \Sigma^{-1} (\alpha_d - \mu)\]
<p>and set the gradient to 0 for which $\mu$ is optimal, and multiply both sides by $\Sigma$ to obtain:</p>
\[0 = \sum_{d \in D} (\alpha_d - \mu)\]
<p>which implies</p>
\[\sum_{d \in D} \mu = \mid D \mid \mu = \sum_{d \in D} \alpha_d\]
<p>so</p>
\[\mu_{MLE} = \frac{1}{\mid D \mid} \sum_{d \in D} \alpha_d\]
<h3 id="finding-delta_mle">Finding $\delta_{MLE}$</h3>
<p>We have that</p>
\[\delta_{MLE}^2 = \mbox{argmax}_{\delta^2} \, p(r_1, \cdots, r_D \mid \beta, \delta^2), \quad \mbox{where} \quad p(r_1, \cdots, r_D \mid \beta, \delta^2) = \prod_{d \in D} p(r_d \mid \beta, \delta^2)\]
<p>Since $\log$ is a monotonic function, we have,</p>
\[\delta_{MLE}^2 = \mbox{argmax}_{\delta^2} \sum_{d \in D} \log p(r_d \mid \beta, \delta^2)\]
<p>Recalling $r_d$ is sampled from a univariate Gaussian (3) with a mean that is a function of $\beta$ (say $f(\beta)$) and variance $\delta^2$, by the definition of a <a href="https://en.wikipedia.org/wiki/Normal_distribution">Gaussian</a>,</p>
\[p(r_d \mid \beta, \delta^2) = \frac{1}{\delta \sqrt{2 \pi}} \exp(-\frac{1}{2} (\frac{r_d - f(\beta)}{\delta})^2)\]
<p>Applying log,</p>
\[\log p(r_d \mid \beta, \delta^2) = \log(\frac{1}{\delta \sqrt{2 \pi}}) - \frac{1}{2} (\frac{r_d - f(\beta)}{\delta})^2\]
<p>Putting it back in the original sum:</p>
\[\delta_{MLE}^2 = \mbox{argmax}_{\delta^2} \mid D \mid \log(\frac{1}{\delta \sqrt{2 \pi}}) - \frac{1}{2\delta^2} \sum_{d \in D} (r_d - f(\beta))^2\]
<p>If we set $y = \delta^2$, $a = \mid D \mid$, $b = \sqrt{2\pi}$ and $c = \sum_{d \in D} (r_d - f(\beta))^2$ we can put it in a form that is easier to differentiate:</p>
\[a \log(\frac{1}{b \sqrt{y}}) - \frac{c}{2y}\]
<p>Differentiating with respect to $y$,</p>
\[- \frac{a}{2y} + \frac{c}{2y^2}\]
<p>Setting to 0, we can solve for $y$ to get:</p>
\[y = \frac{c}{a}\]
<p>Replacing the terms back:</p>
\[\delta_{MLE}^2 = \frac{\sum_{d \in D} (r_d - f(\beta))^2}{\mid D \mid}\]Guilherme KunigamiIn [1] Wang, Lu and Zhai present a supervised machine learning model to solve a problem they name Latent Aspect Rating Analysis. To explain the problem, let’s start with some context. Sites like Amazon, TripAdvisor or Audible have a review system where the user provides a rating and a review text. The overall rating might be too vague, so it’s desirable to break down this rating per what is called aspect. The list of aspects depends on the use case: for lodging, it could be room, cleanliness or location; for Audible it could be performance and story. The model tries to infer these implicit (latent) ratings from only the overall rating and the review text. The authors define a model that learns from training data and show promising results with test data. In this post we’ll focus on the theory presented in [1].L-BFGS2020-09-04T00:00:00+00:002020-09-04T00:00:00+00:00https://www.kuniga.me/blog/2020/09/04/lbfgs<!-- This needs to be defined as included html because variables are not inherited by Jekyll pages -->
<p>BFGS stands for Broyden–Fletcher–Goldfarb–Shanno algorithm [1] and it’s a non-linear numerical optimization method. L-BFGS stands for Limited-memory BFGS and is a variant of the original algorithm that uses less memory [2]. The problem it’s aiming to solve is to minimize a given function \(f: R^d \rightarrow R\) (this is applicable to a maximization problem - we just need to solve for \(-f\)).</p>
<p>In this post we’ll cover 5 topics: Taylor Expansion, Newton’s method, QuasiNewton methods, BFGS and L-BFGS. We then look back to see how all these fit together in the big picture.</p>
<!--more-->
<h2 id="taylor-expansion">Taylor Expansion</h2>
<p>The Taylor expansion [3] is a way to approximate a function for a given value if we know how to compute it and its derivatives <em>around</em> that point.</p>
<p>The function might be hard to compute at a given point but if we know how to compute derivatives <em>around</em> that point we can get a good approximation.</p>
<p>The geometric intuition in 2D is that our function is a curve \(y = f(x)\) and we want to compute \(f()\) for a given \(x'\). If we know how to compute \(f'()\) for a given \(a\) that is sufficiently close to \(x'\), then \(f'(a)\) represents the slope of the tangent line of \(f()\) at \(a\), so \(f(x') \sim f(a) + f'(a) (x' - a)\).</p>
<figure class="center_children">
<img src="https://www.kuniga.me/resources/blog/2020-09-04-lbfgs/taylor-expansion-2d.png" alt="2d Taylor expansion for the sine function" />
<figcaption>Figure 1: Visualizing the Taylor expansion in 2D - the colored curves are expansions of different orders for the sine function, the red one using only the first derivative.</figcaption>
</figure>
<p>Depending on the distance of \(x'\) and \(a\) and the “curvature” of \(f()\), the approximation might be way off. The idea is to add higher degree polynomial terms so that the polynomial resembles \(f()\)’s shape better. Taylor expansion does exactly this though I lack the geometric intuition for these extra terms. Given \(f: R \rightarrow R\), its Taylor expansion is:</p>
\[f(x) \sim f(a) + \frac{f'(a)}{1!} (x - a) + \frac{f^{''}(a)}{2!} (x - a)^2 + \frac{f^{'''}(a)}{3!} (x - a)^3 ...\]
<p>or in a compact form:</p>
\[f(x) \sim \sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!} (x - a)^k\]
<p>Where \(f^{(k)}\) is the \(k\)-th derivative of \(f()\), and assuming \(f^{(0)} = f\).</p>
<h3 id="example-sinx">Example: sin(x)</h3>
<p>One interesting application of the Taylor Expansion is computing sin(x). We have that \(\frac{d(sin(x))}{dx} = cos(x)\) and \(\frac{d(cos(x))}{dx} = -sin(x)\), so according to Taylor:</p>
\[sin(x) \sim sin(a) + \frac{cos(a)}{1!} (x - a) + \frac{-sin(a)}{2!} (x - a)^2 + \frac{-cos(a)}{3!} (x - a)^3 ...\]
<p>If we choose $a=0$, then we can use the knowledge that \(sin(0)=0\) and \(cos(0)=1\) and get</p>
\[sin(x) \sim x - \frac{x^3}{3!} + \frac{x^5}{5!} ...\]
<p>Again, this is assuming \(x\) is around \(a=0\). This might not be a good approximation for \(sin(1)\) for example.</p>
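<p>The truncated series above is easy to check in code. A small Python sketch comparing it against <code>math.sin</code>:</p>

```python
import math

def sin_taylor(x, terms=5):
    """Maclaurin series for sin: sum of (-1)^k x^(2k+1) / (2k+1)!"""
    return sum((-1) ** k * x ** (2 * k + 1) / math.factorial(2 * k + 1)
               for k in range(terms))
```

Near zero a handful of terms already agrees with <code>math.sin</code> to many digits, while truncating early gives a visibly worse value at $x = 1$, matching the caveat above.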
<h3 id="multi-variate">Multi-variate</h3>
<p>The Taylor expansion we defined assumes $x$ is a scalar. If we go back to our multi-dimensional function, \(f: R^d \rightarrow R\), we can define the expansion as:</p>
\[f(x_1,..., x_d) \sim f(a_1, ..., a_d) + \sum_{j=1}^{d} \frac{\partial f(a_1, ..., a_d)}{\partial x_j} (x_j - a_j) +\]
\[\quad\quad \frac{1}{2!} \sum_{j=1}^{d} \sum_{k=1}^{d} \frac{\partial^2 f(a_1, ..., a_d)}{\partial x_j \partial x_k} (x_j - a_j)(x_k - a_k) + ...\]
<p>As we can see, this notation is very verbose, so let’s use some common terminology/symbols from vectors. 1) Vector notation: \(\vec a = (a_1, ..., a_d)\). 2) The <strong>gradient</strong> defined as \(\nabla f(\vec x) = (\frac{\partial f(\vec x)}{\partial x_1}, ..., \frac{\partial f(\vec x)}{\partial x_d})\), so the sum</p>
\[\sum_{j=1}^{d} \frac{\partial f(a_1, ..., a_d)}{\partial x_j} (x_j - a_j)\]
<p>can be re-written as:</p>
\[(\vec x - \vec a)^T \nabla f(\vec a)\]
<p>Recalling that \(\vec u^T \vec v\) is the inner product of \(\vec u\) and \(\vec v\).</p>
<p>3) The <strong>Hessian matrix</strong> defined as:</p>
\[\nabla^2 f(\vec x) = \begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \dots & \frac{\partial^2 f}{\partial x_1 \partial x_d} \\
\vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_d \partial x_1} & \dots & \frac{\partial^2 f}{\partial x_d^2}
\end{bmatrix}\]
<p>that is \(\nabla^2 f(\vec x)_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}\) and we can show that this double summation:</p>
\[\sum_{j=1}^{d} \sum_{k=1}^{d} \frac{\partial^2 f(a_1, ..., a_d)}{\partial x_j \partial x_k} (x_j - a_j)(x_k - a_k)\]
<p>can be rewritten as:</p>
\[(\vec x - \vec a)^T \cdot \nabla^2 f(\vec a) \cdot (\vec x - \vec a)\]
<p>We included the \(\cdot\) just so it’s obvious where the multiplications are. To get a sense on why this identity is correct, we can look at \(d=2\):</p>
\[\frac{\partial^2 f(a_1, a_2)}{\partial x_1^2} (x_1 - a_1)(x_1 - a_1) + \frac{\partial^2 f(a_1, a_2)}{\partial x_1 \partial x_2} (x_1 - a_1)(x_2 - a_2) +\]
\[\quad \frac{\partial^2 f(a_1, a_2)}{\partial x_2 \partial x_1} (x_2 - a_2)(x_1 - a_1) + \frac{\partial^2 f(a_1, a_2)}{\partial x_2^2} (x_2 - a_2)(x_2 - a_2)\]
<p>equals to</p>
\[\begin{bmatrix}
x_1 - a_1 & x_2 - a_2
\end{bmatrix}
\begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2}
\end{bmatrix}
\begin{bmatrix}
x_1 - a_1 \\
x_2 - a_2
\end{bmatrix}\]
<p>The Taylor expansion of order two, which we’ll see next is of special interest to us, can be described more succinctly as:</p>
\[(1) \quad f(\vec x) \sim f(\vec a) + (\vec x - \vec a)^T \nabla f(\vec a) + \frac{1}{2} (\vec x - \vec a)^T \nabla^2 f(\vec a) (\vec x - \vec a)\]
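<p>Equation (1) is straightforward to verify numerically. A sketch for a hand-picked example $f(x_1, x_2) = e^{x_1} + x_1 x_2^2$, whose gradient and Hessian we compute analytically:</p>

```python
import numpy as np

def f(v):
    x, y = v
    return np.exp(x) + x * y ** 2

def grad_f(v):
    x, y = v
    return np.array([np.exp(x) + y ** 2, 2 * x * y])

def hess_f(v):
    x, y = v
    return np.array([[np.exp(x), 2 * y],
                     [2 * y,     2 * x]])

def taylor2(x, a):
    """Second-order Taylor approximation (1) of f around a."""
    d = x - a
    return f(a) + d @ grad_f(a) + 0.5 * d @ hess_f(a) @ d

a = np.array([0.1, 0.2])
```

Close to $\vec a$ the quadratic approximation matches $f$ to several digits; far from $\vec a$ the error grows, as expected.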
<h2 id="newtons-method">Newton’s Method</h2>
<p>Newton’s Method [4, 5] can be used to optimize a function. At its core, it relies on the fact that we can find the minimum of a quadratic equation through its derivative. Let \(f: R \rightarrow R\) be a quadratic function, \(f(x) = ax^2 + bx + c\), with \(a > 0\) (otherwise the minimum value would be unbounded). In 2D the function \(f(x)\) looks like a parabola when plotted against x.</p>
<p>Drawing intuition from this 2D case above, the derivative can be seen as the rate of change of the curve at a given point x. It starts out negative and becomes positive when the curve bounces back upwards, but because the curve is smooth, there is a point where the rate of change is 0. This is also where $f(x)$ attains its lowest value, and hence is our optimal value.</p>
<figure class="center_children">
<img src="https://www.kuniga.me/resources/blog/2020-09-04-lbfgs/quadratic-minimum.png" alt="a quadratic curve with the point where its derivative is 0 highlighted" />
<figcaption>Figure 2: A quadratic curve in 2D with the point where the derivative is 0 (x=0) (<a href="https://observablehq.com/@kunigami/quadratic-function">interactive</a>)</figcaption>
</figure>
<p>Mathematically, we can get the derivative of our second order polynomial as:</p>
\[f'(x) = 2ax + b\]
<p>and find the \(x^*\) for which it’s 0:</p>
\[2ax^* + b = 0\]
\[(2) \quad x^* = -b/2a\]
<p>Now say we want to minimize an arbitrary function (which is not necessarily quadratic), \(f: R \rightarrow R\), which we’ll call \(f^*\).</p>
<p>Suppose we pick a value of \(x_0\) which we think is a good solution to \(f^*\). We might be able to improve this further by moving around \(x_0\), by some \(\Delta x\). Say our neighbor point is \(x = x_0 + \Delta x\).</p>
<p>We can now go back to our Taylor expansion of order 2 by replacing a with \(x_0\):</p>
\[f(x) \sim f(x_0) + \frac{f'(x_0)}{1!} (x - x_0) + \frac{f^{''}(x_0)}{2!} (x - x_0)^2\]
<p>if we express this in terms of only \(x_0\) and \(\Delta x\) we get:</p>
\[(3) \quad f(x) = f(x_0 + \Delta x) \sim f(x_0) + \frac{f'(x_0)}{1!} \Delta x + \frac{f^{''}(x_0)}{2!} (\Delta x)^2\]
<p>To minimize $f(x)$ we can minimize the right hand side of that equation, which is a quadratic function on \(\Delta x\)! This means we can use (2) to find such \(\Delta x\).</p>
<p>In our expression above, \(a = \frac{f^{''}(x_0)}{2!}\) and \(b = \frac{f'(x_0)}{1!}\), so this yields:</p>
\[\Delta x = -\frac{f'(x_0)}{f^{''}(x_0)}\]
<p>We can find the best neighbor \(x\) via:</p>
\[x = x_0 -\frac{f'(x_0)}{f^{''}(x_0)}\]
<p>This expression can be used as recurrence to find smaller values of \(f(x)\) from an initial \(x_0\):</p>
\[x_n = x_{n-1} -\frac{f'(x_{n-1})}{f^{''}(x_{n-1})}\]
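<p>The recurrence is a few lines of Python. A sketch minimizing the (arbitrary, non-quadratic) example $f(x) = x - \ln(x)$, whose minimum is at $x = 1$:</p>

```python
def newton_minimize(df, d2f, x0, iters=20):
    """Iterate x_n = x_{n-1} - f'(x_{n-1}) / f''(x_{n-1})."""
    x = x0
    for _ in range(iters):
        x = x - df(x) / d2f(x)
    return x

# f(x) = x - ln(x): f'(x) = 1 - 1/x, f''(x) = 1/x^2, minimum at x = 1.
x_star = newton_minimize(lambda x: 1 - 1 / x, lambda x: 1 / x ** 2, x0=0.5)
```

Starting from $x_0 = 0.5$ the error roughly squares on each step (quadratic convergence), so a handful of iterations suffices.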
<p>We can generalize it to higher dimensions, based on the general form of the Taylor expansion (1):</p>
\[\vec x_n = \vec x_{n-1} - (\nabla^2 f(\vec x_{n-1}))^{-1} \nabla f(\vec x_{n-1})\]
<p>Recalling from matrix arithmetic, the way to “divide” a given \(d\)-dimensional vector \(v\) by a \(d \times d\) matrix \(M\) is to find the “quotient” \(q\) such that \(v = Mq\). We can multiply both sides by \(M^{-1}\) and get \(M^{-1}v = (M^{-1}M)q = Iq = q\), which is what we’re doing with \(\nabla f(\vec x_{n-1})\) and \(\nabla^2 f(\vec x_{n-1})\) above.</p>
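<p>In code, rather than forming the inverse Hessian explicitly, one typically solves the linear system directly. A numpy sketch on the hand-picked quadratic $f(\vec x) = \frac{1}{2} \vec x^T A \vec x - \vec b^T \vec x$, where a single Newton step lands exactly on the minimum (the solution of $A \vec x = \vec b$):</p>

```python
import numpy as np

def newton_step(grad, hess, x):
    # solve H d = g instead of computing H^{-1} g via an explicit inverse
    return x - np.linalg.solve(hess(x), grad(x))

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])
# gradient is A x - b, Hessian is A; one step from the origin suffices
x1 = newton_step(lambda x: A @ x - b, lambda x: A, np.zeros(2))
```

For non-quadratic functions the same step is applied repeatedly, as in the recurrence above.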
<p>The main problem of this method is the cost of computing the inverse of the Hessian matrix on every step. The simplest way to do it (Gauss–Jordan elimination) is \(O(n^3)\), while the best known algorithm is \(O(n^{2.373})\).</p>
<p>We’ll now cover a class of methods aiming to address this bottleneck.</p>
<h2 id="quasinewton-methods">QuasiNewton Methods</h2>
<p>The QuasiNewton Methods [6] are a class of methods that avoids computing the inverse of the Hessian by using an approximation matrix, which we’ll denote \(B_n\), whose inverse can be more efficiently computed from the inverse of a previous iteration (\(B_{n-1}\)).</p>
<p>One of the common properties methods in this class share is that they satisfy the <em>secant condition</em>. Let’s see what this means. Recall the second-order Taylor expansion (3) for \(x \in R\), now renaming \(x\) to \(x_{n+1}\), \(x_0\) to \(x_n\), and \(\Delta x\) to \(y_n\) to make it clear that \(\Delta x\) is different on each iteration:</p>
\[f(x_{n+1}) = f(x_n) + \frac{f'(x_n)}{1!} y_n + \frac{f^{''}(x_n)}{2!} y_n^2\]
<p>Which is a quadratic function on $y_n$. If we differentiate with respect to $y_n$ we get:</p>
\[\frac{df(x_{n+1})}{d y_n} = f'(x_n) + f^{''}(x_n) y_n\]
<p>The multi-dimensional version is:</p>
\[(4) \quad \nabla f(\vec x_{n+1}) = \nabla f(\vec x_n) + \nabla^2 f(\vec x_n) \vec y_n\]
<p>The secant condition states that the approximation matrix must satisfy (4), that is:</p>
\[\nabla f(\vec x_{n+1}) = \nabla f(\vec x_n) + B_n \vec y_n\]
<p>If we define \(\vec s_n = \nabla f(\vec x_{n+1}) - \nabla f(\vec x_n)\), the equation can be more succinctly expressed as:</p>
\[(5) B_n \vec y_n = \vec s_n\]
<p>Also a further constraint can be added that \(B\) is symmetric [6]. Given these constraints, the different methods define \(B_{n+1}\) as the optimal solution to this objective function:</p>
\[B_{n+1} = \mbox{argmin}_{B} \, \| B - B_n \|\]
<p>According to Wikipedia:</p>
<blockquote>
<p>The various quasi-Newton methods differ in their choice of the solution to the secant equation.</p>
</blockquote>
<p>Where \(\| \cdot \|\) is a matrix norm, and \(\| B - B_n \|\) can be thought of as a distance between \(B\) and \(B_n\). Thus the optimization problem we’re trying to solve is to find a matrix that satisfies the secant condition and is as close (where “distance” is defined by the norm) as possible to the current approximation \(B_n\).</p>
<p>Once we have an approximation of the Hessian, we can use it to compute the direction in the Newton method via (5):</p>
\[B_n \vec y_n = \vec s_n\]
<p>Multiplying by \(B_n^{-1}\):</p>
\[\vec y_n = B_n^{-1} \vec s_n\]
<p>Since \(\vec x_{n+1} - \vec x_n = \vec y_n\):</p>
\[\vec x_{n+1} = \vec x_n + B_n^{-1} \vec s_n\]
<p>Meaning that once we have \(B_n^{-1}\) we only need to compute \(\nabla f(\vec x_{n + 1})\) to obtain the next $\vec x_{n+1}$.</p>
<h2 id="the-bfgs-method">The BFGS Method</h2>
<p>The BFGS method uses what is called the weighted Frobenius norm [7, 8] and defined as:</p>
\[\| M \|_W = \| W^{1/2} M W^{1/2} \|_F\]
<p>For some positive definite matrix \(W\) that satisfies \(W \vec s_n = \vec y_n\) (note how this is the reverse of (5)). The Frobenius norm of a matrix \(M\), \(\| M \|_F\), is the square root of the sum of the squares of all its elements.</p>
<p>The solution to this minimization problem is</p>
\[(6) \quad B^{-1}_{n+1} = (I - \rho_n s_n y_n^T) B^{-1}_n (I - \rho_n y_n s_n^T) + \rho_n s_n s_n^T\]
<p>where \(\rho_n = (y_n^T s_n)^{-1}\), and \(\vec y_n = B_n^{-1} \vec s_n\). We’ll have \(B^{-1}_n\) from the previous iteration and we can compute \(\vec s_n\) from \(\nabla f(\vec x_{n}) - \nabla f(\vec x_{n - 1})\).</p>
<h2 id="the-l-bfgs-method">The L-BFGS Method</h2>
<p>We just saw we can use QuasiNewton methods to avoid the cost of inverting the Hessian, but we still need to store \(B_n\) in memory, which has size \(O(d^2)\) and might be an issue if \(d\) is very large.</p>
<p>L-BFGS avoids the problem by not working with \(B\) directly. It computes \(B_n^{-1} \vec s_n\) (which uses only \(O(d)\) memory) without ever holding \(B_n\) in memory. It’s an iterative procedure to compute \(B_n^{-1}\) from \(B_0^{-1}\). Nocedal [9] defines the following iterative algorithm (I don’t understand how this was derived from (6) but he mentions it was originally proposed by [10], which unfortunately I don’t have access to):</p>
<p>For a given $n \in {0, \dotsc, N}$, where $N$ is the number of iterations:</p>
<ol>
<li>$q_n \leftarrow s_n$</li>
<li>$\mbox{for} \, i = n - 1, \dotsc, 0:$</li>
<li>$\quad\alpha_i \leftarrow \rho_i s_i^T q_{i+1}$</li>
<li>$\quad q_i \leftarrow q_{i + 1} - \alpha_i y_i$</li>
<li>$r_0 \leftarrow B_0^{-1} q_0$</li>
<li>$\mbox{for} \, i = 0, \dotsc, n - 1:$</li>
<li>$\quad \beta_i \leftarrow \rho_i y_i^T r_i$</li>
<li>$\quad r_{i+1} \leftarrow r_{i} + s_j (\alpha_i - \beta_i)$</li>
<li>$\mbox{return} \, r_n$</li>
</ol>
<p>Note that $\alpha$ is the only variable shared among the two loops, so it has to be stored. In addition, we need to keep $s_i$ and $y_i$ from all the previous $n$ iterations, so their size amounts to \(O(dN)\), where $N$ can be very large depending on convergence conditions. The proposal of L-BFGS is to define $m \ll N$ and only work with the last $m$ entries of $\vec s$ and $\vec y$. If $n \le m$, we use the above procedure. Otherwise, we use this modification [9]:</p>
<ol>
<li>$q_n \leftarrow s_n$</li>
<li>$\mbox{for} \, i = m - 1, \dotsc, 0:$</li>
<li>$\quad j = i + n - m$</li>
<li>$\quad \alpha_i \leftarrow \rho_j s_j^T q_{i+1}$</li>
<li>$\quad q_i \leftarrow q_{i + 1} - \alpha_i y_j$</li>
<li>$r_0 \leftarrow B_0^{-1} q_0$</li>
<li>$\mbox{for} \, i = 0, \dotsc, m - 1:$</li>
<li>$\quad \beta_i \leftarrow \rho_j y_j^T r_i$</li>
<li>$\quad r_{i+1} \leftarrow r_{i} + s_j (\alpha_i - \beta_i)$</li>
<li>$\mbox{return} \, r_m$</li>
</ol>
<p>Now $\alpha$ has only $m$ entries so its size is \(O(m)\). Furthermore, we can see from the range of $j$ above that we only need the last $m$ entries of $s$ and $y$, which amounts to \(O(dm)\) required space.</p>
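<p>Putting the two loops together, here is a Python sketch of the recursion in its standard presentation (as in [9], with $s_i$ denoting the steps, $y_i$ the gradient differences, the input $q$ initialized with the current gradient, and $B_0^{-1} = I$); with the full history kept, it reproduces exactly the matrix-vector product one would get by repeatedly applying the BFGS update (6):</p>

```python
import numpy as np

def two_loop(g, s_list, y_list):
    """Compute B_n^{-1} g from the stored (s_i, y_i) pairs, never
    materializing the d x d matrix. Memory is O(dm)."""
    q = g.copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):  # first loop
        a = (s @ q) / (y @ s)                             # alpha_i = rho_i s_i^T q
        alphas.append(a)
        q -= a * y
    r = q                                                 # B_0^{-1} = I
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):
        beta = (y @ r) / (y @ s)                          # second loop
        r += s * (a - beta)
    return r
```

Each loop iteration is only dot products and vector additions, so the whole product costs $O(dm)$ time.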
<p>Finally, isn’t \(B_0^{-1}\) also a \(d \times d\) matrix? That’s true, but we can choose \(B_0^{-1}\) so that it is sparse. In [2] it’s suggested \(B_0^{-1} = I\), which means we only need the diagonal, which can be efficiently stored in memory using only $O(d)$ space.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In this post we followed the steps from Newton’s Method to the L-BFGS which helped understanding the motivation behind these methods. This post was largely written after Aria Haghighi’s <a href="https://aria42.com/blog/2014/12/understanding-lbfgs">post</a>.</p>
<p>I studied Taylor Series and Newton’s Method a while back in college, but with a refresher I think I got a better intuition about them. I don’t recall learning about QuasiNewton methods or L-BFGS, and writing a post definitely helped with understanding them better.</p>
<h2 id="related-posts">Related Posts</h2>
<ul>
<li><a href="https://www.kuniga.me/blog/2012/03/11/lagrangean-relaxation-practice.html">Lagrangean Relaxation - Practice</a> - uses the Gradient Descent method, which can be seen as a special case of the Newton’s method where we only work with the first order Taylor expansion.</li>
</ul>
<h2 id="references">References</h2>
<ul>
<li>[<a href="https://en.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm">1</a>] Wikipedia - Broyden–Fletcher–Goldfarb–Shanno algorithm</li>
<li>[<a href="https://en.wikipedia.org/wiki/Limited-memory_BFGS">2</a>] Wikipedia - Limited-memory BFGS</li>
<li>[<a href="https://en.wikipedia.org/wiki/Taylor_series">3</a>] Wikipedia - Taylor series</li>
<li>[<a href="https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization">4</a>] Wikipedia - Newton’s method in optimization</li>
<li>[<a href="https://suzyahyah.github.io/calculus/optimization/2018/04/06/Taylor-Series-Newtons-Method.html">5</a>] suzyahyah - Taylor Series approximation, newton’s method and optimization</li>
<li>[<a href="https://en.wikipedia.org/wiki/Quasi-Newton_method">6</a>] Wikipedia - Quasi-Newton method</li>
<li>[<a href="https://aria42.com/blog/2014/12/understanding-lbfgs">7</a>] Aria42 - Numerical Optimization: Understanding L-BFGS</li>
<li>[<a href="https://math.stackexchange.com/questions/2271887/how-to-solve-the-matrix-minimization-for-bfgs-update-in-quasi-newton-optimizatio">8</a>] Mathematics, Stack Exchange - How to solve the matrix minimization for BFGS update in Quasi-Newton optimization</li>
<li>[<a href="https://courses.engr.illinois.edu/ece544na/fa2014/nocedal80.pdf">9</a>] Nocedal J. - Updating Quasi-Newton Matrices With Limited Storage</li>
<li>[<a href="https://doi.org/10.1002/nme.1620141104">10</a>] Matthies H., Strang G. - The solution of nonlinear finite element equations</li>
</ul>Guilherme KunigamiBFGS stands for Broyden–Fletcher–Goldfarb–Shanno algorithm [1] and it’s a non-linear numerical optimization method. L-BFGS stands for Limited-memory BFGS and is a variant of the original algorithm that uses less memory [2]. The problem it’s aiming to solve is to minimize a given function \(f: R^d \rightarrow R\) (this is applicable to a maximization problem - we just need to solve for \(-f\)). In this post we’ll cover 5 topics: Taylor Expansion, Newton’s method, QuasiNewton methods, BFGS and L-BFGS. We then look back to see how all these fit together in the big picture.