Joseph Van Name's Shortform

Joseph Van Name

Joseph Van Name's Shortform

1 min read29th Oct 202311 comments

This is a special post for quick takes by Joseph Van Name. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

New to LessWrong?

Getting Started

FAQ

Library

Joseph Van Name's Shortform

11 comments, sorted by

top scoring

Click to highlight new comments since: Today at 9:07 PM

[-]Joseph Van Name5mo100

Every entry in a matrix counts for the -spectral radius similarity. Suppose that $A_{1}, \dots, A_{r}, B_{1}, \dots, B_{r}$ are real $n \times n$ -matrices. Set $A^{\otimes 2} = A \otimes A$ . Define the $L_{2}$ -spectral radius similarity between $(A_{1}, \dots, A_{r})$ and $(B_{1}, \dots, B_{r})$ to be the number

$\frac{ρ (A_{1} \otimes B_{1} + \dots + A_{r} \otimes B_{r})}{ρ (A_{1}^{\otimes 2} + \dots + A_{r}^{\otimes 2})^{1 / 2} ρ (B_{1}^{\otimes 2} + \dots + B_{r}^{\otimes 2})^{1 / 2}}$ . Then the $L_{2}$ -spectral radius similarity is always a real number in the interval $[0, 1]$ , so one can think of the $L_{2}$ -spectral radius similarity as a generalization of the value $\frac{| ⟨ u, v ⟩ |}{∥ u ∥ \cdot ∥ v ∥}$ where $u, v$ are real or complex vectors. It turns out experimentally that if $A_{1}, \dots, A_{r}$ are random real matrices, and each $B_{j}$ is obtained from $A_{j}$ by replacing each entry in $B_{j}$ with $0$ with probability $1 - α$ , then the $L_{2}$ -spectral radius similarity between $(A_{1}, \dots, A_{r})$ and $(B_{1}, \dots, B_{r})$ will be about $\sqrt{α}$ . If $u = (A_{1}, \dots, A_{r}), v = (B_{1}, \dots, B_{r})$ , then observe that $\frac{| ⟨ u, v ⟩ |}{∥ u ∥ \cdot ∥ v ∥} \approx \sqrt{α}$ as well.

Suppose now that $A_{1}, \dots, A_{r}$ are random real $n \times n$ matrices and $C_{1}, \dots, C_{r}$ are the $m \times m$ submatrices of $A_{1}, \dots, A_{r}$ respectively obtained by only looking at the first $m$ rows and columns of $A_{1}, \dots, A_{r}$ . Then the $L_{2}$ -spectral radius similarity between $A_{1}, \dots, A_{r}$ and $C_{1}, \dots, C_{r}$ will be about $\sqrt{m / n}$ . We can therefore conclude that in some sense $C_{1}, \dots, C_{r}$ is a simplified version of $A_{1}, \dots, A_{r}$ that more efficiently captures the behavior of $A_{1}, \dots, A_{r}$ than $B_{1}, \dots, B_{r}$ does.

If $A_{1}, \dots, A_{r}, B_{1}, \dots, B_{r}$ are independent random matrices with standard Gaussian entries, then the $L_{2}$ -spectral radius similarity between $(A_{1}, \dots, A_{r})$ and $(B_{1}, \dots, B_{r})$ will be about $1 / \sqrt{r}$ with small variance. If $u, v$ are random Gaussian vectors of length $r$ , then $\frac{| ⟨ u, v ⟩ |}{∥ u ∥ \cdot ∥ v ∥}$ will on average be about $c / \sqrt{r}$ for some constant $c$ , but $\frac{| ⟨ u, v ⟩ |}{∥ u ∥ \cdot ∥ v ∥}$ will have a high variance.

These are some simple observations that I have made about the spectral radius during my research for evaluating cryptographic functions for cryptocurrency technologies.

[-]Algon5mo20

Your notation is confusing me. If r is the size of the list of matrices, then how can you have a probability of 1-r for r>=2? Maybe you mean 1-1/r and sqrt{1/r} instead of 1-r and sqrt{r} respectively?

[-]Joseph Van Name5mo30

Thanks for pointing that out. I have corrected the typo. I simply used the symbol for two different quantities, but now the probability is denoted by the symbol $α$ .

[-]Joseph Van Name4mo30

We can use the spectral radius similarity to measure more complicated similarities between data sets.

Suppose that $A_{1}, \dots, A_{r}$ are $m \times m$ -real matrices and $B_{1}, \dots, B_{r}$ are $n \times n$ -real matrices. Let $ρ (A)$ denote the spectral radius of $A$ and let $A \otimes B$ denote the tensor product of $A$ with $B$ . Define the $L_{2}$ -spectral radius by setting $ρ_{2} (A_{1}, \dots, A_{r}) = ρ (A_{1} \otimes A_{1} + \dots + A_{r} \otimes A_{r})^{1 / 2}$ , Define the $L_{2}$ -spectral radius similarity between $A_{1}, \dots, A_{r}$ and $B_{1}, \dots, B_{r}$ as

$∥ (A_{1}, \dots, A_{r}) ≃ (B_{1}, \dots, B_{r}) ∥_{2} = \frac{ρ (A_{1} \otimes B_{1} + \dots + A_{r} \otimes B_{r})}{ρ_{2} (A_{1}, \dots, A_{r}) ρ_{2} (B_{1}, \dots, B_{r})}$ .

We observe that if $C$ is invertible and $λ$ is a constant, then

$∥ (A_{1}, \dots, A_{r}) ≃ (λ C B_{1} C^{- 1}, \dots, λ C B_{r} C^{- 1}) ∥_{2} = 1.$

Therefore, the $L_{2}$ -spectral radius is able to detect and measure symmetry that is normally hidden.

Example: Suppose that $u_{1}, \dots, u_{r}; v_{1}, \dots, v_{r}$ are vectors of possibly different dimensions. Suppose that we would like to determine how close we are to obtaining an affine transformation $T$ with $T (u_{j}) = v_{j}$ for all $j$ (or a slightly different notion of similarity). We first of all should normalize these vectors to obtain vectors $x_{1}, \dots, x_{r}; y_{1}, \dots, y_{r}$ with mean zero and where the covariance matrix is the identity matrix (we may not need to do this depending on our notion of similarity). Then $∥ (x_{1} x_{1}^{*}, \dots, x_{r} x_{r}^{*}) ≃ (y_{1} y_{1}^{*}, \dots, y_{r} y_{r}^{*}) ∥_{2}$ is a measure of low close we are to obtaining such an affine transformation $T$ . We may be able to apply this notion to determining the distance between machine learning models. For example, suppose that $M, N$ are both the first few layers in a (typically different) neural network. Suppose that $a_{1}, \dots, a_{r}$ is a set of data points. Then if $u_{j} = M (a_{j})$ and $v_{j} = M (a_{j})$ , then $∥ (x_{1} x_{1}^{*}, \dots, x_{r} x_{r}^{*}) ≃ (y_{1} y_{1}^{*}, \dots, y_{r} y_{r}^{*}) ∥_{2}$ is a measure of the similarity between $M$ and $N$ .

I have actually used this example to see if there is any similarity between two different neural networks trained on the same data set. For my experiment, I chose a random collection of $S \subseteq {0, 1}^{32} \times {0, 1}^{32}$ of ordered pairs and I trained the neural networks $M, N$ to minimize the expected losses $E (∥ N (a) - b ∥^{2} : (a, b) \in S), E (∥ M (a) - b ∥^{2} : (a, b) \in S)$ . In my experiment, each $a_{j}$ was a random vector of length 32 whose entries were 0's and 1's. In my experiment, the similarity $∥ (x_{1} x_{1}^{*}, \dots, x_{r} x_{r}^{*}) ≃ (y_{1} y_{1}^{*}, \dots, y_{r} y_{r}^{*}) ∥_{2}$ was worse than if $x_{1}, \dots, x_{r}, y_{1}, \dots, y_{r}$ were just random vectors.

This simple experiment suggests that trained neural networks retain too much random or pseudorandom data and are way too messy in order for anyone to develop a good understanding or interpretation of these networks. In my personal opinion, neural networks should be avoided in favor of other AI systems, but we need to develop these alternative AI systems so that they eventually outperform neural networks. I have personally used the $L_{2}$ -spectral radius similarity to develop such non-messy AI systems including LSRDRs, but these non-neural non-messy AI systems currently do not perform as well as neural networks for most tasks. For example, I currently cannot train LSRDR-like structures to do any more NLP than just a word embedding, but I can train LSRDRs to do tasks that I have not seen neural networks perform (such as a tensor dimensionality reduction).

[-]Joseph Van Name5mo20

So in my research into machine learning algorithms that I can use to evaluate small block ciphers for cryptocurrency technologies, I have just stumbled upon a dimensionality reduction for tensors in tensor products of inner product spaces that according to my computer experiments exists, is unique, and which reduces a real tensor to another real tensor even when the underlying field is the field of complex numbers. I would not be too surprised if someone else came up with this tensor dimensionality reduction before since it has a rather simple description and it is in a sense a canonical tensor dimensionality reduction when we consider tensors as homogeneous non-commutative polynomials. But even if this tensor dimensionality reduction is not new, this dimensionality reduction algorithm belongs to a broader class of new algorithms that I have been studying recently such as LSRDRs.

Suppose that is either the field of real numbers or the field of complex numbers. Let $V_{1}, \dots, V_{n}$ be finite dimensional inner product spaces over $K$ with dimensions $d_{1}, \dots, d_{n}$ respectively. Suppose that $V_{i}$ has basis $e_{i, 1}, \dots, e_{i, d_{i}}$ . Given $v \in V_{1} \otimes \dots \otimes V_{n}$ , we would sometimes want to approximate the tensor $v$ with a tensor that has less parameters. Suppose that $(m_{0}, \dots, m_{n})$ is a sequence of natural numbers with $m_{0} = m_{n} = 1$ . Suppose that $X_{i, j}$ is a $m_{i - 1} \times m_{i}$ matrix over the field $K$ for $1 \leq i \leq n$ and $1 \leq j \leq d_{i}$ . From the system of matrices $(X_{i, j})_{i, j}$ , we obtain a tensor $T ((X_{i, j})_{i, j}) = \sum_{i_{1}, \dots, i_{n}} e_{i_{1}} \otimes \dots \otimes e_{i_{n}} \cdot X_{1, i_{1}} \dots X_{n, i_{n}}$ . If the system of matrices $(X_{i, j})_{i, j}$ locally minimizes the distance $∥ v - T ((X_{i, j})_{i, j}) ∥$ , then the tensor $T ((X_{i, j})_{i, j})$ is a dimensionality reduction of $v$ which we shall denote by $u$ .

Intuition: One can associate the tensor product $V_{1} \otimes \dots \otimes V_{n}$ with the set of all degree $n$ homogeneous non-commutative polynomials that consist of linear combinations of the monomials of the form $x_{1, i_{1}} \dots x_{n, i_{n}}$ . Given, our matrices $X_{i, j}$ , we can define a linear functional $ϕ : V_{1} \otimes \dots \otimes V_{n} \to K$ by setting $ϕ (p) = p ((X_{i, j})_{i, j})$ . But by the Reisz representation theorem, the linear functional $ϕ$ is dual to some tensor in $V_{1} \otimes \dots \otimes V_{n}$ . More specifically, $ϕ$ is dual to $T ((X_{i, j})_{i, j})$ . The tensors of the form $T ((X_{i, j})_{i, j})$ are therefore the

Advantages:

In my computer experiments, the reduced dimension tensor $u$ is often (but not always) unique in the sense that if we calculate the tensor $u$ twice, then we will get the same tensor. At least, the distribution of reduced dimension tensors $u$ will have low Renyi entropy. I personally consider the partial uniqueness of the reduced dimension tensor to be advantageous over total uniqueness since this partial uniqueness signals whether one should use this tensor dimensionality reduction in the first place. If the reduced tensor is far from being unique, then one should not use this tensor dimensionality reduction algorithm. If the reduced tensor is unique or at least has low Renyi entropy, then this dimensionality reduction works well for the tensor $v$ .
This dimensionality reduction does not depend on the choice of orthonormal basis $e_{i, 1}, \dots, e_{i, d_{i}}$ . If we chose a different basis for each $V_{i}$ , then the resulting tensor $u$ of reduced dimensionality will remain the same (the proof is given below).
If $K$ is the field of complex numbers, but all the entries in the tensor $v$ happen to be real numbers, then all the entries in the tensor $u$ will also be real numbers.
This dimensionality reduction algorithm is intuitive when tensors are considered as homogeneous non-commutative polynomials.

Disadvantages:

This dimensionality reduction depends on a canonical cyclic ordering the inner product spaces $V_{1}, \dots, V_{n}$ .
Other notions of dimensionality reduction for tensors such as the CP tensor dimensionality reduction and the Tucker decompositions are more well-established, and they are obviously attempted generalizations of the singular value decomposition to higher dimensions, so they may be more intuitive to some.
The tensors of reduced dimensionality $T ((X_{i, j})_{i, j})$ have a more complicated description than the tensors in the CP tensor dimensionality reduction.

Proposition: The set of tensors of the form $\sum_{i_{1}, \dots, i_{n}} e_{1, i_{1}} \otimes \dots \otimes e_{n, i_{n}} X_{1, i_{1}} \dots X_{n, i_{n}}$ does not depend on the choice of bases $(e_{i, 1}, \dots, e_{i, d_{i}})_{i}$ .

Proof: For each $i$ , let $f_{i, 1}, \dots, f_{i, d_{i}}$ be an alternative basis for $V_{i}$ . Then suppose that $e_{i, j} = \sum_{k} u_{i, j, k} f_{i, k}$ for each $i, j$ . Then

$\sum_{i_{1}, \dots, i_{n}} e_{1, i_{1}} \otimes \dots \otimes e_{n, i_{n}} X_{1, i_{1}} \dots X_{n, i_{n}}$

$= \sum_{i_{1}, \dots, i_{n}} \sum_{k_{1}} u_{1, i_{1}, k_{1}} f_{1, i_{1}} \otimes \dots \otimes \sum_{k_{n}} u_{n, i_{n}, k_{n}} f_{n, i_{n}} X_{1, i_{1}} \dots X_{n, i_{n}}$

$= \sum_{k_{1}, \dots, k_{n}} f_{1, k_{1}} \otimes \dots \otimes f_{n, k_{n}} \sum_{i_{1}, \dots, i_{n}} u_{1, i_{1}, k_{1}} \dots u_{n, i_{n}, k_{n}} X_{1, i_{1}} \dots X_{n, i, n}$

$= \sum_{k_{1}, \dots, k_{n}} f_{1, k_{1}} \otimes \dots \otimes f_{n, k_{n}} (\sum_{i_{1}} u_{1, i_{1}, k_{1}} X_{1, i_{1}}) \dots (\sum_{i_{n}} u_{n, i_{n}, k_{n}} X_{i_{n}})$ . Q.E.D.

A failed generalization: An astute reader may have observed that if we drop the requirement that $m_{n} = 1$ , then we get a linear functional defined by letting

$ϕ (p) = Tr (p ((X_{i, j})_{i, j}))$ . This is indeed a linear functional, and we can try to approximate $v$ using a the dual to $ϕ$ , but this approach does not work as well.

[-]Joseph Van Name2mo10

There are some cases where we have a complete description for the local optima for an optimization problem. This is a case of such an optimization problem.

Such optimization problems are useful for AI safety since a loss/fitness function where we have a complete description of all local or global optima is a highly interpretable loss/fitness function, and so one should consider using these loss/fitness functions to construct AI algorithms.

Theorem: Suppose that is a real,complex, or quaternionic $n \times n$ -matrix that minimizes the quantity $∥ U ∥_{2} + ∥ U^{- 1} ∥_{2}$ . Then $U$ is unitary.

Proof: The real case is a special case of a complex case, and by representing each $n \times n$ -quaternionic matrix as a complex $2 n \times 2 n$ -matrix, we may assume that $U$ is a complex matrix.

By the Schur decomposition, we know that $U = V T V^{*}$ where $V$ is a unitary matrix and $T$ is upper triangular. But we know that $∥ U ∥_{2} = ∥ T ∥_{2}$ . Furthermore, $U^{- 1} = V T^{- 1} V^{*}$ , so $∥ U^{- 1} ∥_{2} = ∥ T^{- 1} ∥_{2}$ . Let $D$ denote the diagonal matrix whose diagonal entries are the same as $T$ . Then $∥ T ∥_{2} \geq ∥ D ∥_{2}$ and $∥ T^{- 1} ∥_{2} \geq ∥ D^{- 1} ∥_{2}$ . Furthermore, $∥ T ∥_{2} = ∥ D ∥_{2}$ iff T is diagonal and $∥ T^{- 1} ∥_{2} = ∥ D^{- 1} ∥_{2}$ iff $D$ is diagonal. Therefore, since $∥ U ∥_{2} + ∥ U^{- 1} ∥_{2} = ∥ T ∥_{2} + ∥ T^{- 1} ∥_{2}$ and $∥ T ∥_{2} + ∥ T^{- 1} ∥_{2}$ is minimized, we can conclude that $T = D$ , so $T$ is a diagonal matrix. Suppose that $T$ has diagonal entries $(z_{1}, \dots, z_{n})$ . By the arithmetic-geometric mean equality and the Cauchy-Schwarz inequality, we know that $\frac{1}{2} \cdot (∥ (z_{1}, \dots, z_{n}) ∥_{2} + ∥ (z_{1}^{- 1}, \dots, z_{n}^{- 1}) ∥_{2}) \geq ∥ (| z_{1} |, \dots, | z_{n} |) ∥_{2} \cdot ∥ (| z_{1}^{- 1} |, \dots, | z_{n}^{- 1}) | ∥_{2}$

$\geq ⟨ (| z_{1} |, \dots, | z_{n} |), (| z_{1}^{- 1} |, \dots, | z_{n}^{- 1}) | ⟩ = \sqrt{n} .$

Here, the equalities hold if and only if $| z_{j} | = 1$ for all $j$ , but this implies that $U$ is unitary. Q.E.D.

[-]Joseph Van Name4mo10

The -spectral radius similarity is not transitive. Suppose that $A_{1}, \dots, A_{r}$ are $m \times m$ -matrices and $B_{1}, \dots, B_{r}$ are real $n \times n$ -matrices. Then define $ρ_{2} (A_{1}, \dots, A_{r}) = ρ (A_{1} \otimes A_{1} + \dots + A_{r} \otimes A_{r})^{1 / 2}$ . Then the generalized Cauchy-Schwarz inequality is satisfied:

$ρ (A_{1} \otimes B_{1} + \dots + A_{r} \otimes B_{r}) \leq ρ_{2} (A_{1}, \dots, A_{r}) ρ_{2} (B_{1}, \dots, B_{r})$ .

We therefore define the $L_{2, d}$ -spectral radius similarity between $(A_{1}, \dots, A_{r})$ and $(B_{1}, \dots, B_{r})$ as $∥ (A_{1}, \dots, A_{r}) ≃ (B_{1}, \dots, B_{r}) ∥ = \frac{ρ (A_{1} \otimes B_{1} + \dots + A_{r} \otimes B_{r})}{ρ_{2} (A_{1}, \dots, A_{r}) ρ_{2} (B_{1}, \dots, B_{r})}$ . One should think of the $L_{2}$ -spectral radius similarity as a generalization of the cosine similarity $\frac{⟨ u, v ⟩}{∥ u ∥ \cdot ∥ v ∥}$ between vectors $u, v$ . I have been using the $L_{2}$ -spectral radius similarity to develop AI systems that seem to be very interpretable. The $L_{2}$ -spectral radius similarity is not transitive.

$∥ (A_{1}, \dots, A_{r}) ≃ (A_{1} \oplus B_{1}, \dots, A_{r} \oplus B_{r}) ∥ = 1$ and

$∥ (B_{1}, \dots, B_{r}) ≃ (A_{1} \oplus B_{1}, \dots, A_{r} \oplus B_{r}) ∥ = 1$ , but $∥ (A_{1}, \dots, A_{r}) ≃ (B_{1}, \dots, B_{r}) ∥$ can take any value in the interval $[0, 1]$ .

We should therefore think of the $L_{2, d}$ -spectral radius similarity as a sort of least upper bound of $[0, 1]$ -valued equivalence relations than a $[0, 1]$ -valued equivalence relation. We need to consider this as a least upper bound because matrices have multiple dimensions.

Notation: $ρ (A) = {lim}_{n \to \infty} ∥ A^{n} ∥^{1 / n}$ is the spectral radius. The spectral radius $A$ is the largest magnitude of an eigenvalue of the matrix $A$ . Here the norm does not matter because we are taking the limit. $A \oplus B$ is the direct sum of matrices while $A \otimes B$ denotes the Kronecker product of matrices.

[-]Joseph Van Name5mo10

Let's compute some inner products and gradients.

Set up: Let denote either the field of real or the field of complex numbers. Suppose that $d_{1}, \dots, d_{r}$ are positive integers. Let $m_{0}, \dots, m_{n}$ be a sequence of positive integers with $m_{0} = m_{n} = 1$ . Suppose that $X_{i, j}$ is an $m_{i - 1} \times m_{i}$ -matrix whenever $1 \leq j \leq d_{i}$ . Then from the matrices $X_{i, j}$ , we can define a $d_{1} \times \dots \times d_{r}$ -tensor $T ((X_{i, j})_{i, j}) = (X_{1, i_{1}} \dots X_{n, i_{n}})_{i_{1}, \dots, i_{n}}$ . I have been doing computer experiments where I use this tensor to approximate other tensors by minimizing the $ℓ_{2}$ -distance. I have not seen this tensor approximation algorithm elsewhere, but perhaps someone else has produced this tensor approximation construction before. In previous shortform posts on this site, I have given evidence that the tensor dimensionality reduction behaves well, and in this post, we will focus on ways to compute with the tensors $T ((X_{i, j})_{i, j})$ , namely the inner product of such tensors and the gradient of the inner product with respect to the matrices $(X_{i, j})_{i, j}$ .

Notation: If $A_{1}, \dots, A_{r}, B_{1}, \dots, B_{r}$ are matrices, then let $Γ (A_{1}, \dots, A_{r}; B_{1}, \dots, B_{r})$ denote the superoperator defined by letting $Γ (A_{1}, \dots, A_{r}; B_{1}, \dots, B_{r}) (X) = A_{1} X B_{1}^{*} + \dots + A_{r} X B_{r}^{*}$ . Let $Φ (A_{1}, \dots, A_{r}) = Γ (A_{1}, \dots, A_{r}; A_{1}, \dots, A_{r})$ .

Inner product: Here is the computation of the inner product of our tensors.

$⟨ T ((A_{i, j})_{i, j}), T ((B_{i, j})_{i, j}) ⟩$

$= ⟨ (A_{1, i_{1}} \dots A_{n, i_{n}})_{i_{1}, \dots, i_{n}}, (B_{1, i_{1}} \dots B_{n, i_{n}})_{i_{1}, \dots, i_{n}} ⟩$

$= \sum_{i_{1}, \dots, i_{n}} A_{1, i_{1}} \dots A_{n, i_{n}} (B_{1, i_{1}} \dots B_{n, i_{n}})^{*}$

$= \sum_{i_{1}, \dots, i_{n}} A_{1, i_{1}} \dots A_{n, i_{n}} B_{n, i_{n}}^{*} \dots B_{1, i_{1}}^{*}$

$= Γ (A_{1, 1}, \dots, A_{1, d_{1}}; B_{1, 1}, \dots, B_{1, d_{1}}) \dots Γ (A_{n, 1}, \dots, A_{n, d_{n}}; B_{n, 1}, \dots, B_{n, d_{n}})$ .

In particular, $∥ T ((A_{i, j})_{i, j}) ∥^{2} = Φ (A_{1, 1}, \dots, A_{1, d_{1}}) \dots Φ (A_{n, 1}, \dots, A_{n, d_{n}})$ .

Gradient: Observe that $\nabla_{X} Tr (A X) = A^{T}$ . We will see shortly that the cyclicity of the trace is useful for calculating the gradient. And here is my manual calculation of the gradient of the inner product of our tensors.

$\nabla_{X_{α, β}} ⟨ T ((X_{i, j})_{i, j}), T ((A_{i, j})_{i, j}) ⟩$

$= \nabla_{X_{α, β}} \sum_{i_{1}, \dots, i_{n}} X_{1, i_{1}} \dots X_{n, i_{n}} A_{n, i_{n}}^{*} \dots A_{1, i_{1}}^{*}$

$= \nabla_{X_{α, β}} Tr (\sum_{i_{1}, \dots, i_{n}} X_{1, i_{1}} \dots X_{n, i_{n}} A_{n, i_{n}}^{*} \dots A_{1, i_{1}}^{*})$

$= \nabla_{X_{α, β}} Tr (\sum_{i_{1}, \dots, i_{n}} X_{α, i_{α}} \dots X_{n, i_{n}} A_{n, i_{n}}^{*} \dots$

$A_{α + 1, i_{α + 1}}^{*} A_{α, i_{α}}^{*} A_{α - 1, i_{α - 1}}^{*} \dots A_{1, i_{1}}^{*} X_{1, i_{1}} \dots X_{α - 1, i_{α - 1}})$

$= \nabla_{X_{α, β}} Tr (\sum_{i_{α + 1}, \dots, i_{n}, i_{1}, \dots, i_{α - 1}} X_{α, β} \dots X_{n, i_{n}} A_{n, i_{n}}^{*} \dots$

$A_{α + 1, i_{α + 1}}^{*} A_{α, β}^{*} A_{α - 1, i_{α - 1}}^{*} \dots A_{1, i_{1}}^{*} X_{1, i_{1}} \dots X_{α - 1, i_{α - 1}})$

$= (\sum_{i_{α + 1}, \dots, i_{n}, i_{1}, \dots, i_{α - 1}} X_{α + 1, i_{α + 1}} \dots X_{n, i_{n}} A_{n, i_{n}}^{*} \dots$

$A_{α + 1, i_{α + 1}}^{*} A_{α, β}^{*} A_{α - 1, i_{α - 1}}^{*} \dots A_{1, i_{1}}^{*} X_{1, i_{1}} \dots X_{α - 1, i_{α - 1}})^{T}$

$= (\sum_{i_{α + 1}, \dots, i_{n}, i_{1}, \dots, i_{α - 1}} X_{α + 1, i_{α + 1}} \dots X_{n, i_{n}}$

$A_{n, i_{n}}^{*} \dots A_{α + 1, i_{α + 1}}^{*} A_{α, β}^{*} A_{α - 1, i_{α - 1}}^{*} \dots A_{1, i_{1}}^{*} X_{1, i_{1}} \dots X_{α - 1, i_{α - 1}})^{T}$

$= [(Γ (X_{α + 1, 1}, \dots, X_{α + 1, d_{α + 1}}; A_{α + 1, 1}, \dots, A_{α + 1, d_{α + 1}}) \dots$

$Γ (X_{n, 1}, \dots, X_{n, d_{n}}; A_{n, 1}, \dots, A_{n, d_{n}}) (1))$

$A_{α, β}^{*}$

$((Γ (A_{α - 1, 1}^{*}, \dots, A_{α - 1, d_{α - 1}}^{*}; X_{α - 1, 1}^{*}, \dots, X_{α - 1, d_{α - 1}}^{*}) \dots$

$Γ (A_{1, 1}^{*}, \dots, A_{1, d_{1}}^{*}; X_{1, 1}^{*}, \dots, X_{1, d_{1}}^{*}) (1))]^{T}$ .

[-]Joseph Van Name5mo10

So in my research into machine learning algorithms, I have stumbled upon a dimensionality reduction algorithm for tensors, and my computer experiments have so far yielded interesting results. I am not sure that this dimensionality reduction is new, but I plan on generalizing this dimensionality reduction to more complicated constructions that I am pretty sure are new and am confident would work well.

Suppose that is either the field of real numbers or the field of complex numbers. Suppose that $d_{1}, \dots, d_{n}$ are positive integers and $(m_{0}, \dots, m_{n})$ is a sequence of positive integers with $m_{0} = m_{n} = 1$ . Suppose that $X_{i, j}$ is an $m_{i - 1} \times m_{i}$ -matrix whenever $1 \leq j \leq d_{i}$ . Then define a tensor $T ((X_{i, j})) = (X_{1, i_{1}} \dots X_{n, i_{n}})_{i_{1}, \dots, i_{n}} \in K^{d_{1}} \otimes \dots \otimes K^{d_{n}}$ .

If $v \in K^{d_{1}} \otimes \dots \otimes K^{d_{n}}$ , and $(X_{i, j})_{i, j}$ is a system of matrices that minimizes the value $∥ v - T ((X_{i, j})) ∥$ , then $T ((X_{i, j})_{i, j})$ is a dimensionality reduction of $(X_{i, j})_{i, j}$ , and we shall denote let $u$ denote the tensor of reduced dimension $T ((X_{i, j})_{i, j})$ . We shall call $u$ a matrix table to tensor dimensionality reduction of type $(m_{0}, \dots, m_{n})$ .

Observation 1: (Sparsity) If $v$ is sparse in the sense that most entries in the tensor $v$ are zero, then the tensor $u$ will tend to have plenty of zero entries, but as expected, $u$ will be less sparse than $v$ .

Observation 2: (Repeated entries) If $v$ is sparse and $v = (x_{i_{1}, \dots, i_{n}})_{i_{1}, \dots, i_{n}}$ and the set ${x_{i_{1}, \dots, i_{n}} : i_{1}, \dots, i_{n}}$ has small cardinality, then the tensor $u$ will contain plenty of repeated non-zero entries.

Observation 3: (Tensor decomposition) Let $v$ be a tensor. Then we can often find a matrix table to tensor dimensionality reduction $u$ of type $(m_{0}, \dots, m_{n})$ so that $v - u$ is its own matrix table to tensor dimensionality reduction.

Observation 4: (Rational reduction) Suppose that $v$ is sparse and the entries in $v$ are all integers. Then the value $∥ u - v ∥^{2}$ is often a positive integer in both the case when $u$ has only integer entries and in the case when $u$ has non-integer entries.

Observation 5: (Multiple lines) Let $m$ be a fixed positive even number. Suppose that $v$ is sparse and the entries in $v$ are all of the form $r \cdot e^{2 π i n / m}$ for some integer $n$ and $r \geq 0$ . Then the entries in $u$ are often exclusively of the form $r \cdot e^{2 π i n / m}$ as well.

Observation 6: (Rational reductions) I have observed a sparse tensor $v$ all of whose entries are integers along with matrix table to tensor dimensionality reductions $u_{1}, u_{2}$ of $v$ where $∥ v - u_{1} ∥ = 3, ∥ v - u_{1} ∥ = 2, ∥ u_{2} - u_{1} ∥ = 5$ .

This is not an exclusive list of all the observations that I have made about the matrix table to tensor dimensionality reduction.

From these observations, one should conclude that the matrix table to tensor dimensionality reduction is a well-behaved machine learning algorithm. I hope and expect this machine learning algorithm and many similar ones to be used to both interpret the AI models that we have and will have and also to construct more interpretable and safer AI models in the future.

[-]Joseph Van Name6mo10

Suppose that are natural numbers. Let $1 < p < \infty$ . Let $z_{i, j}$ be a complex number whenever $1 \leq i \leq q, 1 \leq j \leq r$ . Let $L : M_{d} (C)^{r} ∖ {0} \to [- \infty, \infty)$ be the fitness function defined by letting $L (X_{1}, \dots, X_{r})$ $= (\sum_{i = 1}^{q} log (ρ (\sum_{j = 1}^{r} z_{i, j} X_{j})) / q) - log (∥ \sum_{j = 1}^{r} X_{j} X_{j}^{*} ∥_{p}) / 2$ . Here, $ρ (X)$ denotes the spectral radius of a matrix $X$ while $∥ X ∥_{p}$ denotes the Schatten $p$ -norm of $X$ .

Now suppose that $(A_{1}, \dots, A_{r}) \in M_{d} (C)^{r} ∖ {0}$ is a tuple that maximizes $L (A_{1}, \dots, A_{r})$ . Let $M : C^{r} ∖ {0} \to [- \infty, \infty)$ be the fitness function defined by letting $M (w_{1}, \dots, w_{r}) = log (ρ (w_{1} A_{1} + \dots + w_{r} A_{r})) - log (∥ (w_{1}, \dots, w_{r}) ∥_{2})$ . Then suppose that $(v_{1}, \dots, v_{r}) \in C^{r} ∖ {0}$ is a tuple that maximizes $M (v_{1}, \dots, v_{r})$ . Then we will likely be able to find an $ℓ \in {1, \dots, q}$ and a non-zero complex number $α$ where $(v_{1}, \dots, v_{r}) = α \cdot (x_{ℓ, 1}, \dots, x_{ℓ, r})$ .

In this case, $(z_{i, j})_{i, j}$ represents the training data while the matrices $A_{1}, \dots, A_{r}$ is our learned machine learning model. In this case, we are able to recover some original data values from the learned machine learning model $A_{1}, \dots, A_{r}$ without any distortion to the data values.

I have just made this observation, so I am still exploring the implications of this observation. But this is an example of how mathematical spectral machine learning algorithms can behave, and more mathematical machine learning models are more likely to be interpretable and they are more likely to have a robust mathematical/empirical theory behind them.

[-]Joseph Van Name6mo30

I think that all that happened here was the matrices just ended up being diagonal matrices. This means that this is probably an uninteresting observation in this case, but I need to do more tests before commenting any further.

Moderation Log