Friday, August 7, 2015

Measuring the Jet Stream using Flight Times

If you've ever flown between the East coast and West coast, you know the feeling: going west takes soooo much longer!  The difference in flight times is about an hour, which can be painful if it's tacked onto the end of a long trip.

Let's run some basic numbers.  Suppose we fly between SFO and BOS on a 737, which has a cruising speed of $v_{\rm cruise} \sim 500$ mph.  It's $d \sim 2700$ miles between the two cities, so with no wind the flight time would be $t_{\rm flight} = $ 5 hours and 24 minutes.  If the westward trip takes $\Delta t = 1$ hour longer than the eastward one, then the average jet stream velocity, $\bar{v}$, is given by the relation

$\dfrac{d}{v_{\rm cruise} - \bar{v}} - \dfrac{d}{v_{\rm cruise} + \bar{v}} = \Delta t$

$ \quad \Rightarrow \; \bar{v} = \dfrac{d}{\Delta t} \Biggl( \sqrt{1 + \dfrac{\Delta t^2 v_{\rm cruise}^2}{d^2} } - 1 \Biggr) \approx \dfrac{v_{\rm cruise} \Delta t}{2d} \, v_{\rm cruise} = \dfrac{\frac12 \Delta t}{t_{\rm flight}} v_{\rm cruise},$

so that in this case $\bar{v} \approx 46 \text{ mph}$.  This is the average jet stream velocity over the flight path (as experienced by the flight), but what if we want to get finer grained than this?  Can we use this basic idea to build a map of the jet stream across the US?
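Here's that back-of-the-envelope calculation as a quick sketch in code, using the rough numbers assumed above (both the exact solution of the quadratic and the small-$\bar{v}$ approximation):

```python
import math

v_cruise = 500.0   # assumed 737 cruising speed, mph
d = 2700.0         # approximate SFO-BOS great-circle distance, miles
dt = 1.0           # difference between westward and eastward flight times, hours

# Exact solution of d/(v - vbar) - d/(v + vbar) = dt
v_bar_exact = (d / dt) * (math.sqrt(1.0 + (dt * v_cruise / d) ** 2) - 1.0)

# Small-vbar approximation: vbar ~ (dt / 2) / t_flight * v_cruise
t_flight = d / v_cruise
v_bar_approx = 0.5 * dt / t_flight * v_cruise

print(v_bar_exact, v_bar_approx)   # both come out to ~46 mph
```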

It turns out, the answer to that question is yes!  And we can do so via a nice linear algebra problem, mixed in with some spherical geometry.  Here's how it goes.

The Data

I started to think about this project when I ran across a nice, big, publicly available data set on flights within the US, from the American Statistical Association's 2009 data expo, here.  There are 120 million flight records spread over nearly 21 years, and the data is pretty clean.  We'll use 5 years of the data here, from 2004-2008, which gives more than 32 million flights (plus about 4 million we drop when we clean the data).  Each flight record contains the following information:
  • date
  • departure and arrival times, both scheduled and actual
  • carrier code, flight number, plane's tail number
  • origin and destination, distance between airports
  • flight times: elapsed time (actual and scheduled), taxi time in/out, time in the air
  • delays: arrival and departure delays, delays in categories (carrier, weather, NAS, security, late aircraft)
  • flags for whether the flight was cancelled (and why), or diverted
Which is a lot of info!  They even provide accompanying data sets on airports, carriers, and planes.  The data comes in .csv files by year, so it's easy to drop into a MySQL database.
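The original analysis dropped the files into MySQL, but as a rough sketch of the same prep step in pandas (the column names follow the data expo's published schema, though treat the exact list as an assumption), it might look like this:

```python
import pandas as pd

# Columns we need from the data expo files (AirTime is in minutes, Distance in miles)
cols = ["Year", "Month", "DayofMonth", "Origin", "Dest",
        "AirTime", "Distance", "Cancelled", "Diverted"]

frames = []
for year in range(2004, 2009):
    df = pd.read_csv(f"{year}.csv", usecols=cols)
    # Drop cancelled/diverted flights and records with missing air time
    df = df[(df["Cancelled"] == 0) & (df["Diverted"] == 0)].dropna(subset=["AirTime"])
    frames.append(df)

flights = pd.concat(frames, ignore_index=True)

# Average ground speed of each flight, in mph
flights["speed"] = flights["Distance"] / (flights["AirTime"] / 60.0)
```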

Relevant to our question, what does the data let us do?  We can compute the average velocity for each flight, using the air time and distance traveled.  The airport dataset also has handy latitude and longitude coordinates, so we can figure out the idealized path the plane travels (on a great circle).
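For reference, here's a minimal sketch of the great-circle distance from the airports' latitude/longitude coordinates (the standard haversine formula, with a spherical-Earth radius in miles as an assumption):

```python
import math

R_EARTH = 3959.0  # mean Earth radius in miles (spherical-Earth assumption)

def great_circle_miles(lat1, lon1, lat2, lon2):
    """Haversine distance between two (lat, lon) points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * R_EARTH * math.asin(math.sqrt(a))

print(great_circle_miles(37.62, -122.38, 42.36, -71.01))  # SFO-BOS, roughly 2700 miles
```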

What doesn't this data set contain?  The most important missing item is the actual flight path.  If this deviates far from a great circle, the velocity of the flight may be very different from what is reported (and the path the flight travels will be different).  This is a feature we'll have to deal with, and maybe with another data set, such as one from a flight tracking site, we could build a more precise model.  But this data set is a great example of what you can do with less-than-complete information, so let's build a model!

Here's a map of every airport with more than 100,000 enplanements per year (using recent passenger data), with a few example flight paths: San Francisco to Boston, Seattle to Miami, and Dallas-Fort Worth to Chicago.



There are 203 airports on the map, and the smaller ones are needed to give reasonable coverage of the High Plains and eastern Northwest.  Here are all flight paths between pairs of airports with more than 100 flights in each direction in the 5-year time range of the data.  Of the 20503 possible airport pairs (pairs drawn from the 203 airports), there are 2334 that meet this criterion.



The Model

First, let's discuss flights between a pair of cities $i$ and $j$; we'll label this route as path $p$.  Since we're assuming all flights on $p$ follow the same great circle, we can take the average speed of flights from $i$ to $j$, $\bar{v}^{(ij)}_{\rm flight}$, and the average speed of flights from $j$ to $i$, $\bar{v}^{(ji)}_{\rm flight}$.  The jet stream parallel to the flight path (averaged over the flight path) is given by

$\bar{v}^{(p)} = \frac12 ( \bar{v}^{(ij)}_{\rm flight} - \bar{v}^{(ji)}_{\rm flight}).$

We can write this distance-averaged speed in terms of the jet stream velocity $\vec{v}$ as a line integral over the path $C_p$ of the flight:

$\displaystyle \bar{v}^{(p)} = \frac{1}{d_p} \int_{C_p} d\vec{s} \cdot \vec{v} (s),$

where $d_p$ is the distance between the cities.  We can discretize the integral by dividing the US up into patches over which we approximate the jet stream as having a constant direction; then we have

$\displaystyle \bar{v}^{(p)} = \frac{1}{d_p} \sum_k \vec{r}_k^{(p)} \cdot \vec{v}_k,$

where the sum is over patches and $\vec{r}_k^{(p)}$ is the distance vector of the path in patch $k$.
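As a sketch of how the $\vec{r}_k^{(p)}$ can be built up in practice (the box-lookup helper and the dense sampling of the great circle are assumptions of this sketch), one can walk along the path in small steps and accumulate the north-south and east-west components of each step in whichever box it falls in:

```python
import numpy as np

R_EARTH = 3959.0  # miles

def path_vectors_by_box(lats, lons, box_of):
    """Accumulate the path's displacement vector inside each box.

    lats, lons: points sampled densely along the great-circle path, in degrees.
    box_of(lat, lon) -> box index (assumed helper mapping a point to its box).
    Returns {box index: np.array([r_theta, r_phi])} in miles, using the
    north-south and east-west components for the two directions."""
    r = {}
    for k in range(len(lats) - 1):
        lat_mid = np.radians(0.5 * (lats[k] + lats[k + 1]))
        d_theta = R_EARTH * np.radians(lats[k + 1] - lats[k])                  # north-south step
        d_phi = R_EARTH * np.cos(lat_mid) * np.radians(lons[k + 1] - lons[k])  # east-west step
        b = box_of(lats[k], lons[k])
        r.setdefault(b, np.zeros(2))
        r[b] += np.array([d_theta, d_phi])
    return r
```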

Here's a map of the 26 boxes covering the US that we'll use; we'll have a velocity measurement in each box.  In the eastern US, which has denser flight patterns, we can use smaller boxes and still get good enough coverage.




In terms of the latitude and longitude directions (which correspond to the $\hat{\theta}$ and $\hat{\phi}$ directions in spherical coordinates),

$\dfrac{1}{d_p} \vec{r}_k^{(p)} \cdot \vec{v}_k = \dfrac{1}{d_p} \bigl( r_{k,\theta}^{(p)} v_{k,\theta} + r_{k,\phi}^{(p)} v_{k,\phi} \bigr).$

For any flight path, we can compute $\vec{r}_k^{(p)}$ and $d_p$ from the geometry, so the unknowns we need to solve for are the velocity components $v_{k,\theta}$ and $v_{k,\phi}$ in each patch.

So now we have the dependence on the jet stream in patches of sky for one flight path -- great!  How do we turn this into a system where we can solve for the jet stream velocity?  We use a whole bunch of flight paths!  Using our data set, we can take a large collection of cities, and look at flights between pairs.  Let's be concrete and label them $1, \ldots, N_p$, and label the number of patches $1, \ldots, N_b$.

For each path, we get an equation like the above, where we can express the average jet stream speed along that path (which we can measure from the average flight velocities) in terms of the components of the jet stream velocity in each patch.  This forms a linear system, which is overdetermined, as you might expect, because we can easily have many more pairs of cities (flight paths) than patches of sky.

The linear system can be written

$\mathbf{M} \, V = \bar{V},$

where

$\mathbf{M} = \left( \begin{matrix} r_{1,\theta}^{(1)} / d_1 & r_{1,\phi}^{(1)} / d_1 &  r_{2,\theta}^{(1)} / d_1 & r_{2,\phi}^{(1)} / d_1 & \ldots \\ r_{1,\theta}^{(2)} / d_2 & r_{1,\phi}^{(2)} / d_2 &  r_{2,\theta}^{(2)} / d_2 & r_{2,\phi}^{(2)} / d_2 & \ldots \\ \vdots & \vdots & \vdots & \vdots & \end{matrix} \right),$

$V = \left( \begin{matrix} v_{1,\theta} \\ v_{1,\phi}\\ v_{2,\theta} \\ v_{2,\phi} \\ \vdots \end{matrix} \right) , \quad \bar{V} = \left( \begin{matrix} \bar{v}^{(1)} \\ \bar{v}^{(2)} \\ \vdots \end{matrix} \right) .$

The matrix $\mathbf{M}$ is $N_p \times 2N_b$, with $N_p$ the larger dimension (so that the flight paths provide good coverage to the different regions of the map).  This means that the system is overconstrained.  A simple approach, then, is to use least squares minimization, which is the minimization condition

$\displaystyle \min_{V} \, \lvert \lvert \mathbf{M} V - \bar{V} \rvert \rvert,$

and has a solution

$V = \bigl( \mathbf{M}^{\rm T} \mathbf{M} \bigr)^{-1} \mathbf{M}^{\rm T} \bar{V}.$
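In practice we don't need to form the normal-equations inverse explicitly; a least-squares solver does the job directly.  A minimal sketch with numpy, assuming $\mathbf{M}$ and $\bar{V}$ have been built as above:

```python
import numpy as np

# M: (N_p x 2 N_b) matrix of path fractions r / d;  V_bar: (N_p,) average jet stream speeds
V, residuals, rank, sing_vals = np.linalg.lstsq(M, V_bar, rcond=None)

# V[2k] and V[2k + 1] are the theta and phi wind components in box k
wind = V.reshape(-1, 2)
```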

This is great!  We have a way to measure the jet stream from the data set we have, since we can construct $\mathbf{M}$ and $\bar{V}$ using the flight information and the geometry of the cities we'll use in the data set.

Before we start looking at the data, let's briefly talk about the assumptions inherent in the model.  In calculating $\bar{v}^{(p)}$ by the simple difference, we have assumed that flights in both directions have the same mix of plane types -- certain models of planes cruise faster or slower than others, and if faster planes fly more often from $i$ to $j$ than from $j$ to $i$, this will throw off the measurement of the average jet stream velocity.  This is a reasonable assumption, given that on many routes the same planes hop back and forth between the two cities.

Another assumption we've made, given what is in the data set, is that each flight follows a great circle.  While this is the shortest path, planes will sometimes deviate from a great circle to avoid bad weather or optimize the wind assist from the jet stream.  This is particularly true of flights over the Atlantic, where east and west bound flights often fly at very different latitudes (in the US, this seems to happen much less often).  However, time is money, and so deviations from the shortest path are usually minimal.  Scanning through flight paths online, most seem to travel a path whose distance is within a few percent of the shortest.  From the structure of the model, if we had flight path information for individual flights, we could input this into the model and gain more fine-grained sensitivity to the jet stream.

Now let's look at the data and do the measurement!

Measuring the Jet Stream

Let's start with a couple of examples for the average jet stream velocity between pairs of cities.  Here's a histogram of the average speeds of all flights between San Francisco and Boston; you can see there's a nice distribution for each flight direction (SFO to BOS and BOS to SFO) with a clear separation between the two directions.  The average velocities differ by about 81 mph, meaning the jet stream speeds up the flight from SFO to BOS by ~40 mph and slows down the flight from BOS to SFO by ~40 mph.


Here's another case, a short (mostly north-south) flight between Dallas-Fort Worth and Houston:


There's basically no effect from the jet stream here, which indicates the average jet stream is perpendicular to the flight path.  Looking at these histograms, you can see the reason why we want to require a certain number of flights for each path: we want sufficient statistics to accurately resolve the average jet stream velocity.

OK, so we can now put everything together! We take the 2334 flight paths and compute $\bar{V}$, and using the path directions in each box we construct $\mathbf{M}$.  Using the least squares minimization we can then solve for $V$ and plot the average jet stream velocities:



Looks great! You can see the jet stream is west-to-east, with a magnitude in the 20 - 40 mph range (see the legend on the lower right).  The velocities flow continuously from box to box, a good sign of consistency.

Now that we've had a chance to look at this map, we can talk about what this measurement really is.  The jet stream is a constantly changing flow, and the high-velocity parts of it can be rather narrow (e.g., a hundred miles wide) and confined to certain altitudes.  Our data is time averaged over a long period, so we are seeing the average wind speed.  We'll talk a bit later about instantaneous measurements of the jet stream using real-time information.

Since we are solving an overconstrained system, we can use the map to compute the average jet stream velocity for each flight path and compare it to the average velocity we extracted directly from the flight data.  This is a measure of the internal consistency of the data.

Here is the difference in velocities for each flight path:


Here we're looking at a symmetric difference (so that we include the path from A to B and B to A).
The differences are peaked at zero, and the spread is about +/- 15 mph.  This can be a significant fraction of the measured speed, but given that this is a signed measurement (i.e. the speed has a direction, whether it is along the path or against it), we should be pretty happy with this level of agreement!  It is built from millions of flights averaged over thousands of routes, boiled down to 26 velocity measurements (52 numbers!).  Here's another way to look at the data: the raw average wind speed for each path.


The green is the input data to our model, and the blue shows the wind speed for each path as determined by the model.  The model data have a tighter grouping than the measured data around 20-25 mph, but it is reasonable that the model cannot capture the largest speeds accurately.  In fact, we can see that the model does better for longer flights, which traverse many boxes:


This indicates that short flights have the largest discrepancy between measured speeds and the model.  That is fine, since we should expect the averaging of the model to better match the longer flights.

Improved Models

Let's recap.  We have constructed a simple model to measure the average wind speed based on the flight speeds in a large data set of 32 million flights over 2300 routes, yielding wind velocities over 26 spatial boxes covering the US.  This is a nice example of tidying a data set: 32 million flights reduced to 2300 speeds, reduced via the model to 26 velocities (52 numbers) that give a nice picture of the winds over the US.

So let's consider some questions we should always ask: how can we do better, and what else can we do?

In the same vein of using flight data to map out wind speeds, there are lots of variations we could pursue:

  • Global flight data would be very interesting, partly because of the geometry challenges of having good coverage in some regions.  There are lots of flights, but also lots of parts of the world that planes just don't fly over very often (the southern oceans, anyone?).  Having more data from flights outside the US would also help with coverage at the geographic boundaries, which is hard to achieve with continental flights alone.
  • On the other end of the spectrum, we could focus on a narrower geographic region with heavy flight traffic.  We could do a more robust model validation, look at more fine-grained spatial features, identify outlier flights, and introduce corrections to the model.  This is something we can do with our current data, since it just takes a narrower focus.
  • Slicing the data into time buckets would let us build a time-evolving model, and should capture seasonal or transient effects like storms.  The eastern US or continental Europe would be great targets for something like this.
We can also ask what other forms of data are available.  The data set we've used here is nice because it allowed us to exploit large statistical samples to create the model; more instantaneous forms of the data may simply provide direct measurements of the wind speed on a flight-by-flight basis.  For example,

  • Live (or nearly live) data, including velocity and flight path information, is available for most flights.  If we know the cruising speed of the particular plane for a given flight, we can easily determine the wind speed along the flight path for each flight.  Given that for most flights there are other planes in the air nearby at any given time (just based on the sheer volume of flights), there can be enough coverage to map out the (basically instantaneous) two dimensional velocity map.
The FlightAware and FlightStats APIs both seem easily capable of producing this information, although they are pay services (after free evaluation periods).  A real-time map of this information is a really exciting possibility!





Wednesday, July 15, 2015

Symmetry Factors of Graphs

Particle physicists like to represent things with diagrams.  The popular example is Feynman diagrams, and there are rules to draw and evaluate diagrams for different scattering processes.  Diagrams also arise in the study of the large-scale structure of the universe in cosmology, where matter fluctuations can be described using perturbation theory (see this Wikipedia article).

Let's focus on the diagrams of cosmological perturbation theory (specifically, large-scale structure): they are basically functions of directed graphs.  That is:

  • The value of the diagram is the product of weights of edges and vertices.
  • Each edge in the diagram is associated with a momentum flow, which enters through the vertices.
  • The weight of an edge is a scalar function of the momentum flow in the edge.
  • The weight of a vertex is a scalar function of the momentum flow in the edges connecting to the vertex.
These diagrams have two types of symmetry factors:

The internal symmetry factor $-$ the number of topologically-invariant permutations of the edges of the graph.
The external symmetry factor $-$ if the vertices are labeled, the number of permutations of vertices that produce topologically distinct graphs.

The Internal Symmetry Factor


The internal symmetry factor simply counts the number of ways we can connect the edges from each vertex together to form a given topology.  Here's an example: suppose we have a graph we'll call $D_{321}$, with the topology $\{ v_1 \textbf{ - } v_1 ,\, v_1 \textbf{ - } v_2 ,\, v_2 \textbf{ - } v_3 \},$ where vertices are denoted $v_{1 - 3}.$ There are 6 configurations of edge connections that give the same topology:


So the internal symmetry factor is $S(D_{321}) = 6.$  What is the symmetry factor for a general graph $D$ with $k$ vertices?  Let's derive a formula.

There are two relevant quantities for a general graph:

$n_i:$ the number of edges connecting to vertex $i$.
$n_{ij}:$ the number of edges connecting vertex $i$ to vertex $j$.

Note that $\sum_j n_{ij} = n_i$.  Let's take a simpler case first: $n_{ii} = 0$ for all $i$.  For each vertex, let's define the multinomial coefficient

$M_i \equiv \dfrac{n_i!}{\prod_j n_{ij}!} \,,$

which counts how many ways we can divide up the $n_i$ edges connecting to vertex $i$ into groups that go to each of the vertices.  This tells us the symmetry factor contribution for each vertex, and now we need to multiply by the number of ways $C_{ij}$ to connect the set of edges running between each pair of vertices.  But this is easy: 

$C_{ij} = n_{ij}!$

So the symmetry factor for graphs with $n_{ii} = 0$ is:

$\displaystyle S(D[n_{ii} = 0]) = \Bigl(\prod_{i = 1}^k M_i \Bigr) \Bigl( \prod_{i = 1}^k \prod_{j = 1}^{i-1} C_{ij} \Bigr) \,.$

Now let's handle the case where there can be "tadpole" loops at a single vertex.  Clearly the multinomial factor $M_i$ is still the same, since the vertices don't know where the edges are going.  But now we need to include a term for $C_{ii}$, the number of ways to connect $n_{ii}$ edges from a vertex to itself.

The edges from a vertex to itself are connected in pairs, so $C_{ii}$ is the number of ways to group $n_{ii}$ labeled objects into pairs.  You may recognize this counting problem as giving a double factorial (it is equivalent to a perfect matching of a complete graph of $n_{ii}$ vertices, or countings of various types of binary trees).  More explicitly, taking the first few cases of $n_{ii} = $2, 4, 6 ($n_{ii}$ must be even), $C_{ii} = $1, 3, 15, and it's not hard to see

$C_{ii} = (n_{ii} - 1)!!$

One can go through a more explicit combinatorial counting and see that

$C_{ii} = \frac{n_{ii}!}{(2!)^{n_{ii} / 2} (\frac12 n_{ii})!} = (n_{ii} - 1)!!$

Therefore the internal symmetry factor for any graph is

$\boxed{\displaystyle S(D) = \Bigl(\prod_{i = 1}^k M_i \Bigr) \Bigl( \prod_{i = 1}^k \prod_{j = 1}^{i} C_{ij} \Bigr) \,,}$

where

$\boxed{ M_i = \dfrac{n_i!}{\prod_j n_{ij}!} \,, \\ C_{ij} = \begin{cases} n_{ij}! & i \neq j \\ (n_{ii} - 1)!! & i = j \,. \end{cases} }$

If we apply this formula to the figure above, we have $M_1 = 3 ,\, M_2 = 2 ,\, M_3 = 1,\,$ and $C_{ij} = 1$ for all $i,\, j$, so that $S = 6$, as the figure explicitly shows.
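Here's a small sketch of the boxed formula in code, taking the $n_{ij}$ as a symmetric matrix (with the convention used above that $n_{ii}$ counts edge ends in self-loops, so it is even):

```python
from math import factorial

def double_factorial(n):
    """n!! for odd n >= -1, with (-1)!! = 1 by convention."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result

def internal_symmetry_factor(n):
    """n[i][j] = number of edges between vertices i and j
    (n[i][i] counts edge ends in self-loops).  Returns S(D)."""
    k = len(n)
    S = 1
    for i in range(k):
        M_i = factorial(sum(n[i]))          # n_i!, with n_i the total edge ends at vertex i
        for j in range(k):
            M_i //= factorial(n[i][j])      # divide by prod_j n_ij!
        S *= M_i
    for i in range(k):
        for j in range(i + 1):
            if i == j:
                S *= double_factorial(n[i][i] - 1)   # C_ii = (n_ii - 1)!!
            else:
                S *= factorial(n[i][j])              # C_ij = n_ij!
    return S

# The D_321 example: one self-loop on v_1, plus edges v_1-v_2 and v_2-v_3
n_321 = [[2, 1, 0],
         [1, 0, 1],
         [0, 1, 0]]
print(internal_symmetry_factor(n_321))  # 6
```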

The External Symmetry Factor


The external symmetry factor counts how we may permute the vertex labels on the graph to produce a topologically distinct graph (think of the vertex labels as colors).  For example, for a graph $D_{332}$, with the topology $\{ v_1 \textbf{ - } v_2 ,\, v_1 \textbf{ - } v_2 ,\, v_1 \textbf{ - } v_3 ,\, v_2 \textbf{ - } v_3 \},$ there are 3 distinct topologies, each having two equivalent vertex labelings:


So the external symmetry factor $E(D_{332}) = 3$.  This factor can be computed by using the adjacency matrix $A$, which encodes the $n_{ij}:$

$(A)_{ij} = n_{ij} \,.$

The adjacency matrix for $D_{332}$ is different for each row of the figure:

$\left( \begin{matrix} 0 & 2 & 1 \\ 2 & 0 & 1 \\ 1 & 1 & 0 \end{matrix} \right) \,, \;\; \left( \begin{matrix} 0 & 1 & 2 \\ 1 & 0 & 1 \\ 2 & 1 & 0 \end{matrix} \right) \,, \;\; \left( \begin{matrix} 0 & 1 & 1 \\ 1 & 0 & 2 \\ 1 & 2 & 0 \end{matrix} \right) \,.$

Therefore, the external symmetry factor is:

$\boxed{E(D) = \text{the number of unique adjacency matrices} \\ \qquad \qquad \text{under permutation of the vertex labels.} \quad }$  

There doesn't seem to be a simple formula for this, as it depends on the symmetry group of the graph, but one can straightforwardly calculate $E(D)$ using the adjacency matrix in code.
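A minimal sketch of that computation, brute-forcing over all vertex permutations (fine for the small graphs we care about here):

```python
from itertools import permutations

def external_symmetry_factor(A):
    """A[i][j] = n_ij, the adjacency matrix of the graph.  Counts the number
    of distinct adjacency matrices obtained by permuting the vertex labels."""
    k = len(A)
    seen = set()
    for perm in permutations(range(k)):
        relabeled = tuple(tuple(A[perm[i]][perm[j]] for j in range(k)) for i in range(k))
        seen.add(relabeled)
    return len(seen)

# The D_332 example from the figure
A_332 = [[0, 2, 1],
         [2, 0, 1],
         [1, 1, 0]]
print(external_symmetry_factor(A_332))  # 3
```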

Great! Now we know how to compute various symmetry factors of graphs, and they use fun features of combinatorics.

Thursday, June 11, 2015

Interpolations

Interpolation is a great technique to convert discrete data into continuous data.  It's useful if you want to sample the domain at points other than the original data, or make continuous plots, or do operations that are convenient on a function (like integration or differentiation).

It's useful to distinguish interpolation from regression: interpolation provides a functional form that passes through the original data points, using a standard building block (e.g., a cubic polynomial) that is ideally a sufficient approximation across the entire domain, while regression fits a model to the data.  In regression, the model is often specialized to the case at hand and usually doesn't pass through the original data points.  If you have a candidate model for the data, you should use regression and not interpolation.  And if your data is very noisy, interpolation is usually not a great technique (as we'll see later).

Here's a case where I use interpolation in my own work.  Let's say I have a smooth function $f(x)$ which I want to evaluate over some compact interval.  Suppose $f(x) \equiv \int_a^b dy \, g(x, y)$, where $g(x, y)$ is a very complex function and the integral can only be done numerically.  If I want to avoid doing this integral every time I evaluate $f$, I can tabulate the function over a large number of points (once and for all) and interpolate them to get any $f(x)$.  I can evaluate the numerical error from interpolation, and ensure that this error falls below the accuracy I need.

I'll talk about two types of interpolation of 1-D functions: the standard cubic spline and the less-known Steffen interpolation.  Both use cubic polynomials to interpolate between data points, where the polynomial coefficients are determined by boundary conditions at the data points.  The Steffen interpolation [M. Steffen, Astron. Astrophys. 239, 443-450 (1990)] has the nice feature that it is monotonic between data points, which avoids wild excursions of the interpolation that can happen with other methods.  Many other types of interpolation exist, and frameworks such as gsl or scipy have rather broad implementations of interpolation routines (with 2D interpolations in scipy or available as an extension in gsl).  I use 2D cubic (or bicubic) interpolation a lot in my own work.

Suppose we have a set of points $\{(x_0 , y_0) ,\, \ldots ,\, (x_n, y_n) \}$ to interpolate.  We define a set of cubic polynomials $\{ g_0 (x), \, \ldots ,\, g_{n-1} (x) \}$, where

$g_k (x) = c_k^{(0)} + c_k^{(1)} (x - x_k) + c_k^{(2)} (x - x_k)^2 + c_k^{(3)} (x - x_k)^3 \;$ covers the domain $[x_k, x_{k+1}]$.

There are $4n$ coefficients to determine.  These can be expressed in terms of the values ($y_k$) and derivatives ($y_k'$) at each of the data points via the linear system

$g_k (x_k) = c_k^{(0)} = y_k ,$
$g_k' (x_k) = c_k^{(1)} = y_k' ,$
$g_k (x_{k+1}) = c_k^{(0)} + c_k^{(1)} d_k + c_k^{(2)} d_k^2 + c_k^{(3)} d_k^3 = y_{k+1} ,$
$g_k' (x_{k+1}) = c_k^{(1)} + 2 c_k^{(2)} d_k + 3 c_k^{(3)} d_k^2 = y_{k+1}' ,$

where $d_k \equiv x_{k+1} - x_k$.  We also define $m_k \equiv (y_{k+1} - y_k) / d_k$.  Solving yields

$c_k^{(0)} = y_k ,$
$c_k^{(1)} = y_k' ,$
$c_k^{(2)} = \frac{1}{d_k} (3m_k - 2y_k' - y_{k+1}') ,$
$c_k^{(3)} = \frac{1}{d_k^2}(-2m_k + y_k' + y_{k+1}') .$

Enforcing that the interpolation passes through the data points (and hence requiring continuity) has removed $2n$ variables, and enforcing continuity of the first derivative has further removed $n-1$ variables.  The $n+1$ remaining variables are the $n+1$ derivatives $y_k'$ of the interpolation at each data point.

At this point, different interpolation schemes make different choices to fix the remaining variables.  Let's talk about these in turn, starting with the usual cubic spline.

cubic spline:

The standard cubic spline makes an obvious choice: enforcing that the interpolation is $C_2$, plus boundary conditions at the endpoints:
  • continuity of the second derivative ($g_k'' (x_{k+1}) = g_{k+1}'' (x_{k+1})$). [$n - 1$ conditions]
  • vanishing second derivative at the boundary, ($g_0''(x_0) = 0$, $g_{n-1}''(x_n) = 0$) [2 conditions]
The endpoint boundary conditions could be chosen differently, and it's useful to consider the case where we ask that the first derivatives vanish instead of the second:
  • vanishing first derivative at the boundary, ($g_0'(x_0) = 0$, $g_{n-1}'(x_n) = 0$) [2 conditions]
The second derivatives are

$g_k'' (x_{k+1}) = 2 c_k^{(2)} + 6 c_k^{(3)} d_k ,$
$g_{k+1}'' (x_{k+1}) = 2 c_{k+1}^{(2)} ,$

so that continuity implies

$g_k'' (x_{k+1}) = g_{k+1}'' (x_{k+1}) \; \Rightarrow \; d_{k+1} y_k' + 2(d_k + d_{k+1}) y_{k+1}' + d_k y_{k+2}' = 3(d_k m_{k+1} + d_{k+1} m_k) ,$

while a vanishing second derivative at the endpoint gives the conditions

$g_0'' (x_0) = 0 \; \Rightarrow \; 2 y_0' + y_1' = 3 m_0 ,$
$g_{n-1}'' (x_n) = 0 \; \Rightarrow \; 2 y_n' + y_{n-1}' = 3 m_{n-1} .$

This leads to a linear system which can be represented as

$\left(\begin{matrix} 2d_0 & d_0 & 0 & 0 & 0 & \ldots & 0 \\ d_1 & 2(d_1 + d_0) & d_0 & 0 & 0 & \ldots & 0 \\ 0 & d_2 & 2(d_2 + d_1) & d_1 & 0 & \ldots & 0 \\  & & & \ddots & & & \\ 0 & \ldots & 0 & d_{n-2} & 2(d_{n-2} + d_{n-3}) & d_{n-3} & 0 \\ 0 & \ldots & 0 & 0 & d_{n-1} & 2(d_{n-1} + d_{n-2}) & d_{n-2} \\ 0 & \ldots & 0 & 0 & 0 & d_{n-1} & 2 d_{n-1} \end{matrix} \right) \\ \qquad\qquad \times \left( \begin{matrix} y_0' \\ y_1' \\ y_2' \\ \vdots \\ y_{n-2}' \\ y_{n-1}' \\ y_n' \end{matrix} \right) = \left( \begin{matrix} 3 d_0 m_0 \\ 3(d_0 m_1 + d_1 m_0) \\ 3(d_1 m_2 + d_2 m_1) \\ \vdots \\ 3(d_{n-3} m_{n-2} + d_{n-2} m_{n-3}) \\ 3(d_{n-2} m_{n-1} + d_{n-1} m_{n-2}) \\ 3(d_{n-1} m_{n-1}) \end{matrix} \right) .$

The matrix is tridiagonal, and by inverting it we obtain the derivatives at each point and can define the cubic spline.  The form becomes far simpler if the $x$ points are equidistant, so that the $d_k$ scale out:

$\left(\begin{matrix} 2 & 1 & 0 & 0 & 0 & \ldots & 0 \\ 1 & 4 & 1 & 0 & 0 & \ldots & 0 \\ 0 & 1 & 4 & 1 & 0 & \ldots & 0 \\  & & & \ddots & & & \\ 0 & \ldots & 0 & 1 & 4 & 1 & 0 \\ 0 & \ldots & 0 & 0 & 1 & 4 & 1 \\ 0 & \ldots & 0 & 0 & 0 & 1 & 2 \end{matrix} \right) \left( \begin{matrix} y_0' \\ y_1' \\ y_2' \\ \vdots \\ y_{n-2}' \\ y_{n-1}' \\ y_n' \end{matrix} \right) = \left( \begin{matrix} 3 m_0 \\ 3(m_1 + m_0) \\ 3(m_2 + m_1) \\ \vdots \\ 3(m_{n-2} + m_{n-3}) \\ 3(m_{n-1} + m_{n-2}) \\ 3(m_{n-1}) \end{matrix} \right) .$

  If we instead chose vanishing first derivatives at the endpoints, we would just have $y_0' = y_n' = 0$, and our linear system would become (for equidistant $x_k$ with spacing $d$)

$\left(\begin{matrix} 4 & 1 & 0 & 0 & 0 & \ldots & 0 \\ 1 & 4 & 1 & 0 & 0 & \ldots & 0 \\ 0 & 1 & 4 & 1 & 0 & \ldots & 0 \\  & & & \ddots & & & \\ 0 & \ldots & 0 & 1 & 4 & 1 & 0 \\ 0 & \ldots & 0 & 0 & 1 & 4 & 1 \\ 0 & \ldots & 0 & 0 & 0 & 1 & 4 \end{matrix} \right) \left( \begin{matrix} y_1' \\ y_2' \\ y_3' \\ \vdots \\ y_{n-3}' \\ y_{n-2}' \\ y_{n-1}' \end{matrix} \right) = \dfrac{3}{d} \left( \begin{matrix} y_2 - y_0 \\ y_3 - y_1 \\ y_4 - y_2 \\ \vdots \\ y_{n-2} - y_{n-4} \\ y_{n-1} - y_{n-3} \\ y_n - y_{n-2} \end{matrix} \right) ,$

which is also tridiagonal.  At this point, we only need to invert the matrix to solve for the derivatives and calculate the function.  It may be surprising, but the derivatives will generally depend on the values of all of the data points -- so your interpolation can be adversely affected by including garbage data at the ends of the domain.  Before moving on to the Steffen interpolation I'll talk about inverting the tridiagonal matrices here, which is interesting on its own.
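First, though, here's a minimal sketch of the natural cubic spline in code, building the tridiagonal system above for general spacing and simply handing it to a dense solver (for long grids a banded solver such as scipy's solve_banded would be the better choice):

```python
import numpy as np

def cubic_spline_derivatives(x, y):
    """Solve the tridiagonal system for the derivatives y_k' of the natural
    cubic spline (vanishing second derivative at the endpoints)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x) - 1
    d = np.diff(x)              # d_k = x_{k+1} - x_k
    m = np.diff(y) / d          # m_k = (y_{k+1} - y_k) / d_k

    A = np.zeros((n + 1, n + 1))
    rhs = np.zeros(n + 1)
    A[0, 0], A[0, 1], rhs[0] = 2 * d[0], d[0], 3 * d[0] * m[0]
    for k in range(n - 1):      # interior rows: continuity of the second derivative
        A[k + 1, k] = d[k + 1]
        A[k + 1, k + 1] = 2 * (d[k] + d[k + 1])
        A[k + 1, k + 2] = d[k]
        rhs[k + 1] = 3 * (d[k] * m[k + 1] + d[k + 1] * m[k])
    A[n, n - 1], A[n, n], rhs[n] = d[n - 1], 2 * d[n - 1], 3 * d[n - 1] * m[n - 1]
    return np.linalg.solve(A, rhs)

def spline_eval(x, y, yp, xq):
    """Evaluate the piecewise cubic with values y and derivatives yp at points xq."""
    x, y, yp, xq = map(np.asarray, (x, y, yp, xq))
    k = np.clip(np.searchsorted(x, xq) - 1, 0, len(x) - 2)
    dk = x[k + 1] - x[k]
    mk = (y[k + 1] - y[k]) / dk
    t = xq - x[k]
    c2 = (3 * mk - 2 * yp[k] - yp[k + 1]) / dk
    c3 = (-2 * mk + yp[k] + yp[k + 1]) / dk ** 2
    return y[k] + yp[k] * t + c2 * t ** 2 + c3 * t ** 3

xs = np.linspace(0.0, 2.0 * np.pi, 9)
ys = np.sin(xs)
yp = cubic_spline_derivatives(xs, ys)
print(spline_eval(xs, ys, yp, np.array([1.0])))  # close to sin(1) ~ 0.84
```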

Inverting tridiagonal matrices:

Guess what? Inverting tridiagonal matrices is a lot of fun!  There's a pretty slick form for the inverse that runs into some interesting sequences (I believe the proper reference is [R. Usmani, Linear Algebra Appl. 212/213, 413-414 (1994)]).

For a tridiagonal matrix of the form

$T = \left( \begin{matrix} a_1 & b_1 & & & \\ c_1 & a_2 & b_2 & & \\ & c_2 & \ddots & \ddots & \\ & & \ddots & \ddots & b_{n-1} \\ & & & c_{n-1} & a_n \end{matrix} \right) ,$

the inverse $T^{-1}$ has entries

$\bigl( T^{-1} \bigr)_{ij} = \begin{cases} (-1)^{i+j} b_i \cdots b_{j-1} \theta_{i-1} \phi_{j+1} / \theta_n & i \le j \,, \\ (-1)^{i+j} c_j \cdots c_{i-1} \theta_{j-1} \phi_{i+1} / \theta_n & i > j \end{cases}$

where $\theta_k$ and $\phi_k$ satisfy the recursion relations

$\theta_k = a_k \theta_{k-1} - b_{k-1} c_{k-1} \theta_{k-2} ,\, k = 2, \ldots, n$ with $\theta_0 = 1 ,\, \theta_1 = a_1 ,$
$\phi_k = a_k \phi_{k+1} - b_k c_k \phi_{k+2} ,\, k = n-1, \ldots, 1$ with $\phi_{n+1} = 1,\, \phi_n = a_n .$

Note that if (as in our case), $b_k = c_k = 1$, we have a much simpler form:

$\bigl( T^{-1} \bigr)_{ij} = \begin{cases} (-1)^{i+j} \theta_{i-1} \phi_{j+1} / \theta_n & i \le j \,, \\ (-1)^{i+j} \theta_{j-1} \phi_{i+1} / \theta_n & i > j \end{cases}$

with recursion relations

$\theta_k = a_k \theta_{k-1} - \theta_{k-2} ,\, k = 2, \ldots, n$ with $\theta_0 = 1 ,\, \theta_1 = a_1 ,$
$\phi_k = a_k \phi_{k+1} - \phi_{k+2} ,\, k = n-1, \ldots, 1$ with $\phi_{n+1} = 1,\, \phi_n = a_n .$
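Here's this special case as a small sketch in code, with the theta/phi recursions checked against numpy's general-purpose inverse:

```python
import numpy as np

def tridiag_inverse_unit_offdiag(a):
    """Inverse of the tridiagonal matrix with diagonal a and unit off-diagonals,
    via the theta/phi recursions above (1-indexed in the formulas, 0-indexed here)."""
    n = len(a)
    theta = np.zeros(n + 1)
    theta[0], theta[1] = 1.0, a[0]
    for k in range(2, n + 1):
        theta[k] = a[k - 1] * theta[k - 1] - theta[k - 2]

    phi = np.zeros(n + 2)
    phi[n + 1], phi[n] = 1.0, a[n - 1]
    for k in range(n - 1, 0, -1):
        phi[k] = a[k - 1] * phi[k + 1] - phi[k + 2]

    Tinv = np.zeros((n, n))
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            lo, hi = min(i, j), max(i, j)
            Tinv[i - 1, j - 1] = (-1) ** (i + j) * theta[lo - 1] * phi[hi + 1] / theta[n]
    return Tinv

# Check against numpy on the equidistant-spline matrix with diagonal (2, 4, ..., 4, 2)
a = np.array([2.0, 4.0, 4.0, 4.0, 2.0])
T = np.diag(a) + np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
print(np.allclose(tridiag_inverse_unit_offdiag(a), np.linalg.inv(T)))  # True
```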

This further simplifies if the $a_k$ are constant; in this case the recurrence relation becomes a Lucas sequence $U(a_k, 1)$ (as in the case of the modified cubic spline, where $a_k = 4$).  Let's take another side track and prove a nice identity for the Lucas sequence $U(P, 1)$ that seems to be not well known.

The Lucas sequence $U(P, 1)$:

We're discussing a special case of the Lucas sequence $U(P, Q)$, setting $Q = 1$.  Let's use the notation $U_n$ as the $n^{\rm th}$ term of $U(P, 1)$.  The closed form expression for $U_n$ is given by

$U_n = \frac{1}{\sqrt D} (a^n - b^n),$ where $D = P^2 - 4,$ $a = \frac12(P + \sqrt{D}),$ and $b = \frac12(P - \sqrt{D}).$

We will show $U_n^2 = U_{n+k} U_{n-k} + U_k^2$ for any natural numbers $n$ and $k$ with $n \ge k$.  We will make use of the defining recursion relation for $U_n:$

$U_n = P U_{n-1} - U_{n-2},$ where $U_1 = 1$ and $U_2 = P.$

First, we need to prove another relation, $U_{n+k} = U_n U_{k+1} - U_{n-1} U_k,$ which we do by induction on $k$.  For $k = 1$, we recover the defining recursion relation above (and $k = 0$ holds trivially with $U_0 = 0$).  Assuming the relation holds for $k$ and $k - 1$, for $k+1$ we have

$U_{n+k+1} = P U_{n+k} - U_{n+k-1} = P [ U_n U_{k+1} - U_{n-1} U_k ] - U_n U_k + U_{n-1} U_{k-1}$
$ \qquad\qquad = U_n [ P U_{k+1} - U_k ] - U_{n-1} [ P U_k - U_{k-1} ] = U_n U_{k+2} - U_{n-1} U_{k+1}$, which completes the induction.

Now to show the main relation, we take the difference $U_{n+k}^2 - U_{n+2k} U_n$ and apply the relation we have already shown -- along with its special cases $U_{2k+1} = U_{k+1}^2 - U_k^2$ and $U_{2k} = U_k (U_{k+1} - U_{k-1})$ -- to separate the $n$ and $k$ dependence:

$U_{n+k}^2 - U_{n+2k} U_n = (U_n U_{k+1} - U_{n-1} U_k)^2 - (U_n U_{2k+1} - U_{n-1} U_{2k}) U_n$
$\qquad = (U_n^2 + U_{n-1}^2) U_k^2 - U_n U_{n-1} U_k (U_{k+1} + U_{k-1})$

Using $U_{k+1} + U_{k-1} = P U_k$ (which is the $Q = 1$ case of a general Lucas sequence relation), we can put the above into the form

$ = U_k^2 ( U_n^2 - U_{n-1} U_{n+1} )$,

so that if $U_n^2 - U_{n-1} U_{n+1} = 1$, we have proven the relation we set out to prove (after shifting the indices $n \to n - k$).  This can be shown by writing $U_n$ in terms of $a$ and $b$, noting two things: $a b = \frac14 (P^2 - D) = 1,$ and $a - b = \sqrt{D}.$  Then

$U_n^2 - U_{n-1} U_{n+1} = \frac{1}{D} \Bigl( (a^n - b^n)^2 - (a^{n-1} - b^{n-1})(a^{n+1} - b^{n+1}) \Bigr)$
$ \qquad = \frac{1}{D} (ab)^{n-1} (a-b)^2 = 1$,

and therefore we have $U_n^2 = U_{n+k} U_{n-k} + U_k^2$!  For the Fibonacci sequence, a very similar relation holds and is called Catalan's identity; this extension holds for the general Lucas sequence $U(P, 1)$, although it's not widely known.
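As a quick numerical sanity check of the identity (a small sketch, using the $P = 4$ case that shows up in the equidistant spline matrix):

```python
def lucas_U(P, n):
    """n-th term of the Lucas sequence U(P, 1): U_0 = 0, U_1 = 1, U_k = P U_{k-1} - U_{k-2}."""
    if n == 0:
        return 0
    u_prev, u = 0, 1
    for _ in range(n - 1):
        u_prev, u = u, P * u - u_prev
    return u

P = 4
for n in range(1, 12):
    for k in range(0, n + 1):
        assert lucas_U(P, n) ** 2 == lucas_U(P, n + k) * lucas_U(P, n - k) + lucas_U(P, k) ** 2
print("U_n^2 = U_{n+k} U_{n-k} + U_k^2 checks out")
```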

OK, that was an interesting diversion.  Now let's talk about the Steffen interpolation, which doesn't have any matrices to invert!

Steffen interpolation:

The Steffen interpolation was motivated by the fact that many interpolation routines can have undesirable behavior between data points.  One way to think about this (for the case of the cubic spline) is that it "uses up" its degrees of freedom by requiring that the interpolation be $C_2$, and so there are no degrees of freedom left to ensure that the function is at all well-behaved between the data points -- we are just relying on the smoothness of the cubic polynomials to not do anything too crazy.
The Steffen interpolation solves this problem by using the remaining degrees of freedom to ensure that for each segment (cubic function between adjacent data points), the interpolation is monotonic.  It sets the derivatives to values depending on the data in the neighborhood:

$y_k' = \begin{cases} 0 & \text{if } m_k m_{k-1} \le 0, \\ 2a \min(\lvert m_{k-1} \rvert , \lvert m_k \rvert) & \text{if } \lvert p_k \rvert > 2 \min( \lvert m_{k-1} \rvert, \lvert m_k \rvert) \\ p_k & \text{otherwise} \end{cases}$,

where

$a = \text{sign}(m_k) = \text{sign}(m_{k-1})$,
$p_k = \dfrac{m_{k-1} d_k + m_k d_{k-1}}{d_{k-1} + d_k}$.

For $a$, note that $m_{k-1}$ and $m_k$ must have the same sign whenever the middle condition is met, since the first condition was not satisfied.

The first case for $y_k'$ is obvious -- enforcing monotonicity of the interpolation between data points requires that any local extremum sit at an original data point, so if the derivative undergoes a sign change between $x_{k-1}$ and $x_{k+1}$ (signaled by $m_{k-1}$ and $m_k$ having different signs) it has to be zero at $x_k$.  The fact that local extrema can only exist at the original data points can be seen as a feature/bug of the Steffen interpolation, depending on your point of view.  Personally, I like it because I often deal with applications where I want the interpolating function to be well-behaved over the whole domain more than I want it to be $C_2$ (although $C_2$ does have its advantages in applications where the derivative of the function is important, since the Steffen interpolation's derivative has kinks where the second derivative fails to be continuous).

The other cases defining $y_k'$ set a maximum absolute value for the slope.  In fact, there is an interesting relation to the standard cubic spline.  Recall the relation defining the derivatives $y_k'$,

$d_{k+1} y_k' + 2(d_k + d_{k+1}) y_{k+1}' + d_k y_{k+2}' = 3(d_k m_{k+1} + d_{k+1} m_k).$

If we set $y_k' = y_{k+1}' = y_{k+2}' = p_{k+1}$, then we recover the defining relation for $p_{k+1}$.  The Steffen interpolation then caps the derivative at the smaller (in magnitude) of this value and twice the (pointwise-difference) slope of the adjacent intervals.
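A minimal sketch of the derivative choice in code (the endpoint treatment here is a simple one-sided slope, whereas Steffen's paper gives a more careful prescription; gsl also ships a Steffen interpolator, and scipy has related monotonic schemes like PCHIP):

```python
import numpy as np

def steffen_derivatives(x, y):
    """Derivatives y_k' of the Steffen interpolation at each data point."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    d = np.diff(x)          # d_k
    m = np.diff(y) / d      # m_k
    yp = np.zeros_like(y)

    # Endpoints: simple one-sided slopes (an assumption of this sketch)
    yp[0], yp[-1] = m[0], m[-1]

    for k in range(1, len(x) - 1):
        if m[k - 1] * m[k] <= 0:
            yp[k] = 0.0                                   # extremum sits at the data point
        else:
            p = (m[k - 1] * d[k] + m[k] * d[k - 1]) / (d[k - 1] + d[k])
            cap = 2.0 * min(abs(m[k - 1]), abs(m[k]))
            yp[k] = np.sign(m[k]) * cap if abs(p) > cap else p
    return yp
```

With these derivatives in hand, the coefficients $c_k^{(i)}$ follow from the same formulas as for the cubic spline above.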





Tuesday, June 9, 2015

About this blog

This blog is a place for me to collect some of the explorations into different topics that I run across in my work, side projects, or just everyday life.

When I'm working on a project I'll often run across a mathematical technique that I want to understand better. A great time for learning something new! Or sometimes I'll use this blog to describe a project, or some key part, that I'm working on.

If you've found your way here for something specific, or just to poke around, I hope you find it useful!