NiLang.jl
A differential eDSL that can run faster than light and go back to the past.
view repo
Traditional machine instruction level reverse mode automatic differentiation (AD) faces the problem of having a space overhead that linear to time in order to trace back the computational state, which is also the source of bad time performance. In reversible programming, a program can be executed bi-directionally, which means we do not need extra design to trace back the computational state. This paper answers the question that how practical it is to implement a machine instruction level reverse mode AD in a reversible programming language. By implementing sparse matrix operations and some machine learning applications in our reversible eDSL NiLang, and benchmark the performance with state-of-the-art AD frameworks, our answer is a clear positive. NiLang is an open source r-Turing complete reversible eDSL in Julia. It empowers users the flexibility to tradeoff time, space, and energy rather than caching data into a global tape. Manageable memory allocation makes it a good tool to differentiate GPU kernels too. In this paper, we will also discuss the challenges that we face towards energy efficient, rounding error free reversible computing, mainly from the instruction and hardware perspective.
READ FULL TEXT VIEW PDF
This paper considers the source-to-source automatic differentiation (AD)...
read it
High-level reversible programming languages are few and far between and ...
read it
This paper presents an efficient reversible algorithm for linear regress...
read it
Reversible computing can reduce the energy dissipation of computation, w...
read it
The deep learning community has devised a diverse set of methods to make...
read it
Classical reverse-mode automatic differentiation (AD) imposes only a sma...
read it
Heretofore, automatic checkpointing at procedure-call boundaries, to red...
read it
A differential eDSL that can run faster than light and go back to the past.
Computing the gradients of a numeric model plays a crucial role in scientific computing. Consider a computing process
where , , is the depth of computing. The Jacobian of this program is a matrix , where and
are single elements from inputs and outputs. Computing the Jacobian or part of the Jacobian automatically is what we called automatic differentiation (AD). It can be classified into three classes, the forward mode AD, the backward mode AD and the mixed mode AD
Hascoet and Pascual (2013). The forward mode AD computes the Jacobian matrix elements related to a single input using the chain rule
with the column index, while a backward mode AD computes Jacobian matrix elements related to a single output using the chain rule in the reverse direction withthe row index. In variational applications where the loss function always outputs a scalar, the backward mode AD is preferred. However, implementing backward mode AD is harder than implementing its forward mode counterpart, because it requires propagating the gradients in the inverse direction of computing the loss. The backpropagation of gradients requires
an approach to trace back the computational process,
caching variables required for computing gradients.
Most popular AD packages in the market implements the computational graph to solve above issues at the tensor level. In Pytorch
Paszke et al. (2017) and Flux Innes et al. (2018), every variable has a tracker field. When applying a predefined primitive function on a variable, the variable’s tracker field keeps track of this function as well as data needed in backpropagation. TensorFlow
Abadi et al. (2015) also implements the computational graph, but it builds a static computational graph as a description of the program before actual computation happens. These frameworks sometimes fail to meet the diverse needs in research, for example, in physics research,People need to differentiate over sparse matrix operations that are important for Hamiltonian engineering Hao Xie and Wang
, like solving dominant eigenvalues and eigenvectors
Golub and Van Loan (2012),People need to backpropagate singular value decomposition (SVD) function and QR decomposition in tensor network algorithms to study the phase transition problem
Golub and Van Loan (2012); Liao et al. (2019); Seeger et al. (2017); Wan and Zhang (2019); Hubig (2019); Wan and Zhang (2019),People need to differentiate over a quantum simulation where each quantum gate is an inplace function that changes the quantum register directly Luo et al. (2019).
To solve these issues better, we need a hardware instruction level AD. Source code transformation based AD packages like Tapenade Hascoet and Pascual (2013) and Zygote Innes (2018); Innes et al. (2019) are closer to this goal. They read the source code from a user and generate a new code that computes the gradients for users. However, these packages have their own limitations too. In many practical applications, an elementary level differentiable program that might do billions of computations will cache intermediate results to a global storage. Frequent caching of data slows down the program significantly, and the memory usage will become a bottleneck as well. With these AD tools, it is still nearly impossible to automatically generate the backward rules for BLAS functions and sparse matrix operations with a performance comparable to the state-of-the-art.
We propose to implement hardware instruction level AD on a reversible (domain-specific) programming language Perumalla (2013); Frank (2017). So that the intermediate states of a program can be traced backward with no extra effort. The overhead of reverse mode AD becomes the overhead of reversing a program, where the later has the advantage of efficient and controllable memory management. There have been many prototypes of reversible languages like Janus Lutz (1986), R (not the popular one) Frank (1997), Erlang Lanese et al. (2018) and object-oriented ROOPL Haulund (2017). In the past, the primary motivation of studying reversible programming is to support reversible computing devices Frank and Knight Jr (1999) like adiabatic complementary metal–oxide–semiconductor (CMOS) Koller and Athas (1992), molecular mechanical computing system Merkle et al. (2018) and superconducting system Likharev (1977); Semenov et al. (2003), where a reversible computing device is more energy-efficient from the perspective of information and entropy, or by the Landauer’s principle Landauer (1961). After decades of efforts, reversible computing devices are very close to providing productivity now. As an example, adiabatic CMOS can be a better choice in a spacecraft Hänninen et al. (2014); DeBenedictis et al. (2017), where energy is more valuable than device itself. Reversible programming is interesting to software engineers too, because it is a powerful tool to schedule asynchronious events Jefferson (1985) and debug a program bidirectionally Boothe (2000). However, the field of reversible computing faces the issue of having not enough funding in recent decade Frank (2017). As a result, not many people studying AD know the marvelous designs in reversible computing. People have not connected it with automatic differentiation seriously, even though they have many similarities. This paper aims to break the information barrier between the machine learning community and the reversible programming community in our work and provide yet another strong motivation to develop reversible programming.
In this paper, we first introduce the language design of a reversible programming language and introduce our reversible eDSL NiLang in Sec. 2. In Sec. 3, we explain the implementation of automatic differentiation in this eDSL. In Sec. 4, we show several examples. In Sec. 5, we benchmark the performance of NiLang with other AD packages and explain why reversible programming AD is fast. In Sec. 6, we discuss several important issues, the time-space tradeoff, reversible instructions and hardware, and finally, an outlook to some open problems to be solved. In the appendix, we show the grammar of NiLang and other technical details.
In a modern programming language, functions are pushed to a global stack for scheduling. The memory layout of a function consists of input arguments, a function frame with information like the return address and saved memory segments, local variables, and working stack. After the call, the function clears run-time information, only stores the return value. In reversible programming, this kind of design is no longer the best practice. One can not discard input variables and local variables easily after a function call, since discarding information may ruin reversibility. For this reason, reversible functions are very different from irreversible ones from multiple perspectives.
A distinct feature of reversible memory management is, the content of a variable must be known when it is deallocated. We denote the allocation of a zero emptied memory as x 0, and the corresponding deallocation as x 0. A variable can be allocated and deallocated in a local scope, which is called an ancilla. It can also be pushed to a stack and used later with a pop statement. This stack is similar to a traditional stack, except it zero-clears the variable after pushing and presupposes that the variable being zero-cleared before popping.
Knowing the contents in the memory when deallocating is not easy. Hence Charles H. Bennett introduced the famous compute-copy-uncompute paradigm Bennett (1973) for reversible programming. To explain how it works, we introduce the memory oriented computational graph, as shown in Fig. 1. Notations are highly inspired by the quantum circuit representation. A vertical line is a variable and a horizontal line is a function. When a variable is used by a function, depending on whether its value is changed or not, we put a box or a dot at the line cross. It is different from the computational graph for being a hypergraph rather than a simple graph, because a variable can be used by multiple functions now. In panel (a). The subprogram in dashed box X is executed on space represents the computing stage. In the copying stage, the content in is read out to a pre-emptied memory through inplace add +=. Since this copy operation does not change contents of , we can use the uncomputing operation X to undo all the changes to these registers. Now we computing the result without modifying the contents in . If any of them is in a known state, it can be deallocated immediately. In panel (b), we can use the subprogram defined in (a) maked as Y to generate without modifying the contents of variables . It is easy to see that although this uncompute-copy-uncompute design pattern can restore memories to known state, it has computational overhead. Both and are executed twice in the program (b), which is not necessary. We can cancel a pair of and (the gray boxes). By doing this, we are not allowed to deallocate the memory during computing . This is the famous time-space tradeoff that playing the central role in reversible programming. The tradeoff strategy will be discussed in detail in Sec. 6.1.
One can define reversible if, for and while statements in a slightly different way comparing with its irreversible counterpart. The reversible if statement is shown in Fig. 2 (a). Its condition statement contains two parts, a precondition and a postcondition. The precondition decides which branch to enter in the forward execution, while the postcondition decides which branch to enter in the backward execution. After executing the specific branch, the program checks the consistency between precondition and postcondition to make sure they are consistent. The reversible while statement in Fig. 2 (b) also has two condition fields. Before executing the condition expressions, the program preassumes the postcondition is false. After each iteration, the program asserts the postcondition to be true. In the reverse pass, we exchange the precondition and postcondition. The reversible for statement is similar to irreversible ones except that after executing the loop, the program checks the values of these variables to make sure they are not changed. In the reverse pass, we exchange start and stop and inverse the sign of step.
Every arithmetic instruction has a unique inverse that can undo the changes.
For logical operations, y = f(args...) is self reversible.
For integer and floating point arithmetic operations, we treat
y += f(args...) and y -= f(args...) as reversible to each other. Here f can be an arbitrary pure function such as identity, *, / and^. Let’s forget the floating point rounding errors for the moment and discuss in detail in Sec.
6.2.For logartihmic number and tropical number algebra Speyer and Sturmfels (2009), y *= f(args...) and y /= f(args...) as reversible to each other. Notice the zero element () in the Tropical algebra is not considered here.
Besides the above two types of operations, SWAP operation that exchanges the contents in two memory spaces is also widely used in reversible computing systems.
We develop an embedded domain-specific language (eDSL) NiLang in Julia language Bezanson et al. (2012, 2017)
that implements reversible programming. One can write reversible control flows, instructions, and memory managements inside this macro. Julia is a popular language for scientific programming. We choose Julia as the host language for multiple purposes. The most important consideration is speed that crucial for a machine instruction level AD. Its clever design of type inference and just in time compiling provides a C like speed. Also, it has a rich ecosystem for meta-programming. The package for pattern matching
MLStyle allow us to define an eDSL conveniently. Last but not least, its multiple-dispatch provides the polymorphism that will be used in our AD engine. The main feature of NiLang is contained in a single macro @i that compiles a reversible function. The allowed statements in this eDSL are shown in Appendix A. We can use macroexpand to show the compiling a reversible function to the native Julia function.Here, the version of NiLang is v0.4.0. Macro @i generates two functions that reversible to each other f and f. f is an callable of type Inv{typeof(f)}, where the type parameter typeof(f) stands for the type of the function f. In the body of f, NiLangCore.wrap_tuple is used to unify output data types to tuples. The outputs of SWAP are assigned back to its input variables. At the end of this function, this macro attaches a return statement that returns all input variables.
The compilation of a reversible function to native Julia functions is consisted of three stages: preprocessing, reversing and translation. Fig. 3 shows the compilation of the complex valued log function body, which is originally defined as follows.
In the preprocessing stage, the compiler pre-processes human inputs to reversible NiLang IR. The preprocessor removes redundant grammars and expands shortcuts. In the left most code box in Fig. 3, one uses @routine <stmt> statement to record a statement, and @routine to insert the corresponding inverse statement for uncomputing. The computing-uncomputing macros @routine and @routine is expanded in this stage. Here, one can input “” and “” by typing “leftarrow[TAB KEY]” and “rightarrow[TAB KEY]” respectively in a Julia editor or REPL.
In the reversing stage, based on this symmetric and reversible IR, the compiler generates reversed statements according to table Table 1.
In the translation stage, the compiler translates this reversible IR as well as its inverse to native Julia code. It adds @assignback before each function call, inserts codes for reversibility check, and handle control flows. We can expand the @assignback macro to see the compiled expression.
Here, the function chfield returns a complex number with an updated re field. This updated value is then assigned back to y!. In other words, this macro simulates “inplace” operations on immutable types. Except immutable fields, mappings and indexing can also be modified. We call an expression that directly modifiable in NiLang a dataview, it can be a variable itself, a field or an element of a dataview, or a bijective mapping of a dataview. The following are some examples of dataviews
real(x)
x.re
x[3]
x'
real.(x)
(x, y, z)
tget(x, 2) # tuple get index
-x[2].re'
As a final step, the compiler attaches a return statement that returns all updated input arguments at the end of a function definition. Now, the function is ready to execute on the host language. One can also define a reversible constructor and destructor, we put this part in Appendix C.
The computation of gradients in NiLang contains two parts, computing and uncomputing. In the computing stage, the program marches forward and computes outputs. In the uncomputing stage, we attach each scalar and array element with an extra gradient field and feed them into the inverse function. To composite data type with a gradient field is called GVar. As shown in Fig. 4, when an instruction is uncalled, we first uncompute the value field of GVars to and , using the input information, we then update the gradient fields according to the formula in the right panel. The binding utilizes the multiple dispatch in Julia, where a function can be dynamically dispatched based on the run time type of more than one of its arguments. Here, we dispatch a inverse instruction with input type GVar to the (:-=)(exp) instruction.
Here, the first four lines is the @routine statement that computes and store the value into an ancilla. The 5th line updates the value dataview of out!. The 6th line updates the gradient fields of x and y by applying the adjoint rule of (:+=)(exp). Finally, @routine uncomputes anc1 so that it can be returned to the “memory pool”. One does not need to define the similar function on (:+=)(exp) because macro @i will generate it automatically. Notice that taking inverse and computing gradients commute McInerney (2015).
Combining the uncomputing program in NiLang with dual-numbers is a simple yet efficient way to obtain Hessians. The dual number is the scalar type for computing gradients in the forward mode AD, it wraps the original scalar with a extra gradient field. The gradient field of a dual number is updated automatically as the computation marches forward. By wrapping the elementary type with Dual defined in package ForwardDiff Revels et al. (2016) and throwing it into the gradient program defined in NiLang, one obtains one row/column of the Hessian matrix straightforward. We will show a benchmark in Sec. 5.3.
To differentiate complex numbers, we re-implemented complex instructions reversibly. For example, with the reversible function defined in in Listing. 1, we can differentiated complex valued log with no extra effort.
CUDA programming is playing a significant role in high-performance computing. In Julia, one can write GPU compatible functions in native Julia language with KernelAbstractions Besard et al. (2017). Since NiLang does not push variables into stack automatically for users, it is safe to write differentiable GPU kernels with NiLang. We will show this feature in the benchmarks of bundle adjustment (BA) in Sec. 5.4. Here, one should notice that the shared read in forward pass will become shared write in the backward pass, which may result in incorrect gradients. We will review this issue in Sec. 6.5.
In this section, we introduce several examples.
sparse matrix dot product,
first kind bessel function and memory oriented computational graph,
solving the graph embedding problem.
All codes for this section and the next benchmark section are available in the paper repository.
Differentiating sparse matrices is useful in many applications, however, it can not benefit directly from generic backward rules for the dense matrix because the generic rules do not keep the sparse structure. In the following, we will show how to convert a irreversible Frobenius dot product code to a reversible one to differentiate it. Here, the Frobenius dot product is defined as trace(A'B). In SparseArrays code base, it is implemented as follows.
It is easy to rewrite it in a reversible style with NiLang without sacrificing much performance.
Here, all assignments are replaced with to indicate that the values of these variables must be returned at the end of this function scope. We put a “” symbol in the postcondition field of if statements to indicate this postcondition is a dummy one that takes the same value as the precondition, i.e. the condition is not changed inside the loop body. If the precondition is changed by the loop body, one can use a branch_keepervector to cache branch decisions. The value of branch_keeper can be restored through uncomputing (the “” statement above). Finally, after checking the correctness of the program, one can turn off the reversibility checks by using the macro @invcheckoff macro to achieve better performance. We provide the benchmark of this function in Sec. 5.1, where the reversible sparse matrix multiplication is also benchmarked.
A Bessel function of the first kind of order can be computed via Taylor expansion
(1) |
where is the Gamma function. One can compute the accumulated item iteratively as . The irreversible implementation is
This computational process could be diagrammatically represented as a computational graph as shown in Fig. 5 (a). The computational graph is a directed acyclic graph (DAG), where a node is a function and an edge is a data. An edge connects two nodes, one generates this data, and one consumes it. A computational graph is more likely a mathematical expression. It can not describe inplace functions and control flows conveniently because it does not have the notation for memory and loops.
Before showing the reversible implementation, we introduce how to obtain the product of a sequence of numbers reversibly. Consecutive multiplication requires an increasing size of tape to cache an intermediate state , since one can not deallocate the previous state directly since *= and /= are not considered as reversible here. To mitigate the space overhead, the standard approach in reversible computing is the pebble game model Perumalla (2013) (or the checkpointing technique in machine learning), where the cache size can be reduce in the cost of increasing the time complexity, however, constant cache size is not achievable in this scheme. Hence, we introduce the following reversible approximate multiplier.
Here, the third argument anc! is a dirty ancilla, which should be a value . Line 2 computes the result and accumulates it to the dirty ancilla, so that we have an approximate output in anc!. Line 3 removes the content in out! approximately using the information stored in anc!. Line 4 swaps the contents in out! and anc!. Finally, we have an approximate output and a dirtier ancilla. The reason why this trick works here lies in the fact that *= and /= are mathematically reversible (except the zero point) to each other. One can approximately uncomputing the contents in the register at the cost of rounding errors. This rounding error introduced in such a way only affects the function output, and does not sacrifice the reversibility. With this approximate multiplier, we implement the reversible as follows.
The above algorithm uses a constant number of ancillas, while the time overhead is also a constant factor. Ancilla anc4 plays the role of dirty ancilla in multiplication, and it is uncomputed rigorously in the uncomputing stage marked by @routine. This reversible program can be diagrammatically represented as a memory oriented computational graph as shown in Fig. 5 (b). This diagram can be used to analyze variables uncomputing. In this example, routine “B” uses hz_2, and k as control parameters, and changes the contents in anc2, anc3 and anc5. The following imul and (:+=)(identity) copies the result to output without changing these variables. Hence we can apply the inverse routine B to safely restore contents in anc2, anc3 and anc5, and this examplifies the compute-copy-uncompute paradigm.
One can obtain gradients of this function by calling Grad(ibesselj). Here, Grad(ibesselj) returns a callable instance of type Grad{typeof(ibesselj)}. The first parameters Val(1) specifies the position of loss in argument list. The Hessian can be obtained by feeding dual-numbers into this gradient function.
Here, the gradient field of hxx is defined as , which is a Dual number. It has a field partials that store the Hessian .
Graph embedding can be used to find a proper representation for an order parameter Takahashi and Sandvik (2020) in condensed matter physics. LABEL:Takahashi2020 considers a problem of finding the minimum Euclidean space dimension that a Petersen graph can embed into, that the distances between pairs of connected vertices are , and the distance between pairs of disconnected vertices are , where . The Petersen graph is shown in Fig. 6. Let us denote the set of connected and disconnected vertex pairs as and , respectively. This problem can be variationally solved with the following loss.
(2) |
The first line is a summation of distance variances in two sets of vertex pairs, where
is the variance of samples in . The second line is used to guarantee , where means taking the average of samples in . Its reversible implementation could be found in our benchmark repository.We repeat the training for dimension from to . In each training, we fix two of the vertices and optimize the positions of the rest. Otherwise, the program will find the trivial solution with overlapped vertices. For , the loss is always much higher than , while for
, we can get a loss close to machine precision with high probability. From the
solution, it is easy to see . An Adam optimizer with a learning rate Kingma and Ba requires steps training. The trust region Newton’s method converges much faster, which requires computations of Hessians to reach convergence. Although training time is comparable, the converged precision of the later is much better.In the following benchmarks the CPU device is Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz, and the GPU device is Nvidia Titan V. For NiLang benchmarks, we have turned off the reversibility check off to achieve a better performance.
We benchmarked the call, uncall and backward propagation time used for sparse matrix dot product and matrix multiplication. Here, we estimate the time for back propagating gradients rather than including both forward and backward, since
mul! does not output a scalar as loss.The time used for computing backward pass is approximately 1.5-3 times the Julia’s native forward pass. This is because the instruction length of differentiating basic arithmetic instructions is longer than pure computing by a factor of 2 or more.
We differentiate the first type Bessel function in Sec. 4.2 and show the benchmarks in Table 3. In the table, Julia is the CPU time used for running the irreversible forward program. It is the baseline for benchmarking. NiLang (call/uncall) is the time of reversible call or uncall. Both of them are times slower than its irreversible counterpart. Since Bessel function has only one input argument, forward mode AD tools are faster than reverse mode AD, both source-to-source framework ForwardDiff and operator overloading framework Tapenade have the a comparable computing time with the pure function call.
NiLang.AD is the reverse mode AD submodule in NiLang, and it takes 11 times the native Julia program, and is also 2 times slower than Tapenade. However, the key point is, there is no extra memory allocation like stack operations in the whole computation. The controllable memory allocation of NiLang makes it compatible with CUDA program. In other backward mode AD like Zygote, ReverseDiff and Tapenade, the memory allocation in heap is nonzero due to the checkpointing.
Since one can combine ForwardDiff and NiLang to obtain Hessians, it is interesting to see how much performance we can get in differentiating the graph embedding program in Sec. 4.3.
In Table 4, we show the the performance of different implementations by varying the dimension . The number of parameters is . As the baseline, (a) shows the time for computing the function call. We have reversible and irreversible implementations, where the reversible program is slower than the irreversible native Julia program by a factor of due to the uncomputing overhead. The reversible program shows the advantage of obtaining gradients when the dimension . The larger the number of inputs, the more advantage it shows due to the overhead proportional to input size in forward mode AD. The same reason applies to computing Hessians, where the combo of NiLang and ForwardDiff gives the best performance for .
We reproduced the benchmarks for Gaussian mixture model (GMM) and bundle adjustment (BA) in
LABEL:Srajer2018 by re-writing the programs in a reversible style. We show the results in Table 5 and Table 6. In our new benchmarks, we also rewrite the ForwardDiff program for a fair benchmark, this explains the difference between our results and the original benchmark. The Tapenade data is obtained by executing the docker file provided by the original benchmark, which provides a baseline for comparison.In the GMM benchmark, NiLang’s objective function has overhead comparing with irreversible programs in most cases. Except the uncomputing overhead, it is also because our naive reversible matrix-vector multiplication is much slower than the highly optimized BLAS function, where the matrix-vector multiplication is the bottleneck of the computation. The forward mode AD suffers from too large input dimension in the large number of parameters regime. Although ForwardDiff batches the gradient fields, the overhead proportional to input size still dominates. The source to source AD framework Tapenade is faster than NiLang in all scales of input parameters, but the ratio between computing the gradients and the objective function are close.
In the BA benchmark, reverse mode AD shows slight advantage over ForwardDiff. The bottleneck of computing this large sparse Jacobian is computing the Jacobian of a elementary function with 15 input arguments and 2 output arguments, where input space is larger than the output space. In this instance, our reversible implementation is even faster than the source code transformation based AD framework Tapenade. With KernelAbstractions, we run our zero allocation reversible program on GPU, which provides a >200x speed up.
In this paper, we show how to realize a reversible programming eDSL and implement an instruction level backward mode AD on top of it. It gives the user more flexibility to tradeoff memory and computing time comparing with traditional checkpointing. The Julia implementation NiLang gives the state-of-the-art performance and memory efficiency in obtaining first and second-order gradients in applications, including first type Bessel function, sparse matrix manipulations, solving graph embedding problem, Gaussian mixture model and bundle adjustment.
In the following, we discuss some practical issues about reversible programming, and several future directions to go.
In history, there have been many discussions about time-space tradeoff on a reversible Turing machine (RTM). In the most straightforward g-segment tradeoff scheme
Bennett (1989); Levine and Sherman (1990), an RTM model has either a space overhead that is proportional to computing time or a computational overhead that sometimes can be exponential to the program size comparing with an irreversible counterpart. This result stops many people from taking reversible computing seriously as a high-performance computing scheme. In the following, we try to explain why the overhead of reversible computing is not as terrible as people thought.First of all, the overhead of reversing a program is upper bounded by the checkpointing Chen et al. (2016) strategy used in many traditional machine learning package that memorizes inputs of primitives because checkpointing can be trivially implemented in reversible programming. Perumalla (2013) Reversible programming provides some alternatives to reduce the overhead. For example, accumulation is reversible, so that many BLAS functions can be implemented reversiblly without extra memory. Meanwhile, the memory allocation in some iterative algorithms can often be reduced with the “arithmetic uncomputing” trick without sacrificing reversibility, as shown in the ibesselj example in Sec. 4.2. Clever compiling based on memory oriented computational graphs (Fig. 1 and Fig. 5 (b)) can also be used to help user tradeoff between time and space. The overhead of a reversible program mainly comes from the uncomputing of ancillas. It is possible to automatically uncompute ancillas by analyzing variable dependency instead of asking users to write @routine and @routine pairs. In a hierarchical design, uncomputing can appear in every memory deallocation (or symbol table reduction). To quantify the overhead of uncomputing, we introduce the term uncomputing level as bellow.
The log-ratio between the number of instructions of a reversible program and its irreversible counterpart.
From the lowest instruction level, whenever we reduce the symbol table (or space), the computational cost doubles. The computational overhead grows exponentially as the uncomputating level increases, which can be seen from some of the benchmarks in the main text. In sparse matrix multiplication and dot product, we don’t introduce uncomputing in the most time consuming part, so it is . The space overhead is 2*m to keep the branch decisions, which is even much smaller than the memory used to store row indices. in Gaussian mixture model, the most time consuming matrix-vector multiplication is doubled, so it is . The extra memory usage is approximately 0.5% of the original program. In the first kind Bessel function and bundle adjustment program, the most time consuming parts are (nestedly) uncomputed twice, hence their uncomputing level is . Such aggressive uncomputing makes zero memory allocation possible.
So far, our eDSL is compiled to Julia language. It relies on Julia’s multiple dispatch to differentiate a program, which requires users to write generic programs. A more liable AD should be a hardware or micro instruction level feature. In the future, we can expect NiLang being compiled to reversible instructions Vieri (1999) and executed on a reversible device. A reversible devices can play a role of differentiation engine as shown in the hetero-structural design in Fig. 7. It defines a reversible instruction set and has a switch that controls whether the instruction calls a normal instruction or an instruction that also updates gradients. When a program calls a reversible differentiable subroutine, the reversible co-processor first marches forward, compute the loss and copy the result to the main memory. Then the co-processor execute the program backward and uncall instructions, initialize and updating gradient fields at the same time. After reaching the starting point of the program, the gradients are transferred to the global memory. Running AD program on a reversible device can save energy. Theoretically, the reversible routines do not necessarily cost energy, the only energy bottleneck is copying gradient and outputs to the main memory.
A Quantum device Nielsen and Chuang (2002)
is a special reversible hardware that features quantum entanglement. The instruction set of classical reversible programming is a subset of quantum instruction set. However, building a universal quantum computer is difficult. Unlike a classical state, a quantum state can not be cloned. Meanwhile, it loses information by interacting with the environment. Classical reversible computing does not enjoy the quantum advantage, nor the quantum disadvantages of non-cloning and decoherence, but it is a model that we can try directly with our classical computer. It is technically smooth to have a reversible computing device to bridge the gap between classical devices and universal quantum computing devices. By introducing entanglement little by little, we can accelerate some elementary components in reversible computing. For example, quantum Fourier transformation provides an alternative to the reversible adders and multipliers by introducing the Hadamard and CPHASE quantum gates
Ruiz-Perez and Garcia-Escartin (2017). From the programming languages’s perspective, most quantum programming language preassumes the existence of a classical coprocessor to control quantum devices Svore et al. (2018). It is also interesting to know what is a native quantum control flow like, and does quantum entanglement provide speed up to automatic differentiation? We believe the reversible compiling technologies will open a door to study quantum compiling.In this subsection, we introduce an easily overlooked problem in our reversible AD framework. An ancilla can sometimes carry a nonzero gradient when it is deallocated. As a result, the gradient program can be irreversible in the local scope. In NiLang, we drop the gradient field of ancillas instead of raising an error. In the following, we justify our decision by proving the following theorem.
Deallocating an ancilla with constant value field and nonzero gradient field does not introduce incorrect gradients.
Consider a reversible function , where and are the input and output values of an ancilla. Since both , are constants that are independent of input , we have
(3) |
Discarding gradients should not have any effect on the value fields of outputs. The key is to show does appear in the grad fields of the output. It can be seen from the back-propagation rule
(4) |
where the second term with vanishes naturally. We emphasis here, the value part of discarded ancilla must be a constant. ∎
One should be careful about shared read in reversible programming AD, because the shared read can introduce shared write in the adjoint program. Let’s begin with the following expression.
Most people will agree that this statement is not reversible and should not be allowed because it changes input variables. We call it the simultaneous read-and-write issue. However, the following expression with two same inputs is a bit subtle.
It is reversible, but should not be allowed in an AD program because of the shared write issue. It can be seen directly from the expanded expression.
In an AD program, the gradient field of x will be updated. The later assignment to will overwrite the former one and introduce an incorrect gradient. One can get free of this issue by avoiding using same variable in a single instruction
or equivalently,
Share variables in an instruction can be easily identified and avoided. However, it becomes tricky when one runs the program in a parallel way. For example, in CUDA programming, every thread may write to the same gradient field of a shared scalar. How to solve the shared write in CUDA programming is still an open problem, which limits the power of reversible programming AD on GPU.
We can use NiLang to solve some existing issues related to AD. Reversible programming can make use of reversibility to save memory. Reversibility has been used in reducing the memory allocations in machine learning models such as recurrent neural networks
MacKay et al. (2018), Hyperparameter learning
Maclaurin et al. (2015) and residual neural networks Behrmann et al. (2018). We can use it to generate AD rules for existing machine learning packages like ReverseDiff, Zygote Innes et al. (2019), Knet Yuret (2016), and Flux Innes et al. (2018). Many backward rules for sparse arrays and linear algebra operations have not been defined yet in these packages. We can also use the flexible time-space tradeoff in reversible programming to overcome the memory wall problem in some applications. A successful, related example is the memory-efficient domain-specific AD engine in quantum simulator Yao Luo et al. (2019). This domain-specific AD engine is written in a reversible style and solved the memory bottleneck in variational quantum simulations. It also gives so far the best performance in differentiating quantum circuit parameters. Similarly, we can write memory-efficient normalizing flow Kobyzev et al. (2019)in a reversible style. Normalizing flow is a successful class of generative models in both computer vision
Kingma and Dhariwal (2018) and quantum physics Dinh et al. (2016); Li and Wang (2018), where its building block bijector is reversible. We can use a similar idea to differentiate reversible integrators Hut et al. (1995); Laikov (2018). With reversible integrators, it should be possible to rewrite the control system in robotics Giftthaler et al. (2017) in a reversible style, where scalar is a first-class citizen rather than tensor. Writing a reversible control program should boost training performance. Reversibility is also a valuable resource for training.To solve the above problems better, reversible programming should be improved from multiple perspectives. First, we need a better compiler for compiling reversible programs. To be specific, a compiler that admits mutability of data, and handle shared read and write better. Then, we need a reversible number system and instruction set to avoid rounding errors and support reversible control flows better. There are proposals of reversible floating point adders and multipliers, however these designs require allocating garbage bits in each operation Nachtigal et al. (2010, 2011); Nguyen and Meter (2013); Häner et al. (2018). In NiLang, one can simulate rigorous reversible arithmetic with the fixed-point number package FixedPointNumbers. However, a more efficient reversible design requires instruction-level support. Some other numbers systems are reversible under *= and /= rather than += and -=, including LogarithmicNumbers Taylor et al. (1988) and TropicalNumbers. They are powerful tools to solve domain specific problems, for example, we have an upcoming work about differentiating over tropical numbers to solve the ground state configurations of a spinglass system efficiently. We also need comefrom like instruction as a partner of goto to specify the postconditions in our instruction set. Finally, although we introduced that the adiabatic CMOS as a better choice as the computing device in a spacecraft DeBenedictis et al. (2017). There are some challenges in the hardware side too, one can find a proper summary of these challenges in LABEL:Frank2005b.
Solutions to these issues requires the participation of people from multiple fields.
Jin-Guo Liu thank Lei Wang for motivating the project with possible applications to reversible integrator, normalizing flow, and neural ODE. Johann-Tobias Schäg for deepening the discussion about reversible programming with his mathematicians head. Marisa Kiresame and Xiu-Zhe Luo for discussion on the implementation details of source-to-source automatic differentiation, Shuo-Hui Li for helpful discussion on differential geometry, Tong Liu and An-Qi Chen for helpful discussion on quantum adders and multipliers, Ying-Bo Ma for correcting typos by submitting pull requests, Chris Rackauckas for helpful discussion on reversible integrator, Mike Innes for reviewing the comments about Zygote, Jun Takahashi for discussion about the graph embedding problem, Simon Byrne and Chen Zhao for helpful discussion on floating-point and logarithmic numbers. The authors are supported by the National Natural Science Foundation of China under Grant No. 11774398, the Strategic Priority Research Program of Chinese Academy of Sciences Grant No. XDB28000000.
To define a reversible function one can use “@i” plus a standard function definition like bellow
where the definition of “<stmts>” are shown in the grammar page bellow. The following is a list of terminologies used in the definition of grammar
, symbols
, numbers
, empty statement
, native Julia expression
, zero or one repetitions.
Here, all should be pure. Otherwise, the reversibility is not guaranteed. Dataview is a view of data. It can be a bijective mapping of an object, an item of an array, or a field of an object.
Stmts
Stmt
Stmts Stmt
StmtBlockStmt
IfStmt
WhileStmt
ForStmt
InstrStmt
RevStmt
AncillaStmt
TypecastStmt
@routine Stmt
@safe
CallStmt
BlockStmtbegin Stmts end
RevCond( , )
IfStmtif RevCond Stmts [ else Stmts] end
WhileStmtwhile RevCond Stmts end
Range : [ : ]
ForStmtfor ident = Range Stmts end
KwArgident =
KwArgs[ KwArgs ,] KwArg
CallStmt ( [ DataViews] [ ; KwArgs] )
Constantnum true false
InstrBinOp+= -= =
InstrTrailer[ .] ( [ DataViews] )
InstrStmtDataView InstrBinOp ident [ InstrTrailer]
RevStmt Stmt
AncillaStmtident
ident
TypecastStmt( => ) ( ident )
@routine@routine ident Stmt
@safe@safe
DataViews
DataView
DataViews , DataView
DataViews , DataView …
DataViewDataView [ ]
DataView . ident
( DataView )
DataView
- DataView
Constant
ident
A table instructions used in the main text
So far, the language design is not too different from a traditional reversible language. To port Julia’s type system better, we introduce dataviews. The type used in the reversible context is just a standard Julia type with an additional requirement of having reversible constructors. The inverse of a constructor is called a “destructor”, which unpacks data and deallocates derived fields. A reversible constructor is implemented by reinterpreting the new function in Julia. Let us consider the following statement.
The above statement is similar to allocating an ancilla, except that it deallocates g directly at the same time. Doing this is proper because new is special that its output keeps all information of its arguments. All input variables that do not appear in the output can be discarded safely. Its inverse is
It unpacks structure x and assigns fields to corresponding variables in the argument list. The following example shows a non-complete definition of the reversible type GVar.
GVar has two fields that correspond to the value and gradient of a variable. Here, we put @i macro before both struct and function statements. The ones before functions generate forward and backward functions, while the one before struct moves GVar functions to the outside of the type definition. Otherwise, the inverse function will be ignored by Julia compiler.
Since an operation changes data inplace in NiLang, a field of an immutable instance should also be “ modifiable”. Let us first consider the following example.
In Julia language, the assign statement above will throw a syntax error because the function call “-” can not be assigned, and GVar is an immutable type. In NiLang, we use the macro @assignback to modify an immutable data directly. It translates the above statement to
The first line PlusEq(*)(-arr[3].g, x, y) computes the output as a tuple of length . At lines 2-3, chfield(x, Val{:g}, val) modifies the g field of x and chfield(x, -, res[1]) returns -res[1]. Here, modifying a field requires the default constructor of a type not overwritten. The assignments in lines 4 and 5 are straightforward. We call a bijection of a field of an object a “dataview” of this object, and it is directly modifiable in NiLang. The definition of dataview can be found in Appendix A.
The computation of gradients in NiLang contains two parts, computing and uncomputing. In the computing stage, the program marches forward and computes outputs. In the uncomputing stage, we attach each scalar and array element with an extra gradient field and feed them into the inverse function. To composite data type with a gradient field is called GVar. As shown in Fig. 4, when an instruction is uncalled, we first uncompute the value field of GVars to and , using the input information, we then update the gradient fields according to the formula in the right panel. The binding utilizes the multiple dispatch in Julia, where a function can be dynamically dispatched based on the run time type of more than one of its arguments. Here, we dispatch a inverse instruction with input type GVar to the (:-=)(exp) instruction.
Here, the first four lines is the @routine statement that computes and store the value into an ancilla. The 5th line updates the value dataview of out!. The 6th line updates the gradient fields of x and y by applying the adjoint rule of (:+=)(exp). Finally, @routine uncomputes anc1 so that it can be returned to the “memory pool”. One does not need to define the similar function on (:+=)(exp) because macro @i will generate it automatically. Notice that taking inverse and computing gradients commute McInerney (2015).
Combining the uncomputing program in NiLang with dual-numbers is a simple yet efficient way to obtain Hessians. The dual number is the scalar type for computing gradients in the forward mode AD, it wraps the original scalar with a extra gradient field. The gradient field of a dual number is updated automatically as the computation marches forward. By wrapping the elementary type with Dual defined in package ForwardDiff Revels et al. (2016) and throwing it into the gradient program defined in NiLang, one obtains one row/column of the Hessian matrix straightforward. We will show a benchmark in Sec. 5.3.
To differentiate complex numbers, we re-implemented complex instructions reversibly. For example, with the reversible function defined in in Listing. 1, we can differentiated complex valued log with no extra effort.
CUDA programming is playing a significant role in high-performance computing. In Julia, one can write GPU compatible functions in native Julia language with KernelAbstractions Besard et al. (2017). Since NiLang does not push variables into stack automatically for users, it is safe to write differentiable GPU kernels with NiLang. We will show this feature in the benchmarks of bundle adjustment (BA) in Sec. 5.4. Here, one should notice that the shared read in forward pass will become shared write in the backward pass, which may result in incorrect gradients. We will review this issue in Sec. 6.5.
In this section, we introduce several examples.
sparse matrix dot product,
first kind bessel function and memory oriented computational graph,
solving the graph embedding problem.
All codes for this section and the next benchmark section are available in the paper repository.
Differentiating sparse matrices is useful in many applications, however, it can not benefit directly from generic backward rules for the dense matrix because the generic rules do not keep the sparse structure. In the following, we will show how to convert a irreversible Frobenius dot product code to a reversible one to differentiate it. Here, the Frobenius dot product is defined as trace(A'B). In SparseArrays code base, it is implemented as follows.
It is easy to rewrite it in a reversible style with NiLang without sacrificing much performance.
Here, all assignments are replaced with to indicate that the values of these variables must be returned at the end of this function scope. We put a “” symbol in the postcondition field of if statements to indicate this postcondition is a dummy one that takes the same value as the precondition, i.e. the condition is not changed inside the loop body. If the precondition is changed by the loop body, one can use a branch_keepervector to cache branch decisions. The value of branch_keeper can be restored through uncomputing (the “” statement above). Finally, after checking the correctness of the program, one can turn off the reversibility checks by using the macro @invcheckoff macro to achieve better performance. We provide the benchmark of this function in Sec. 5.1, where the reversible sparse matrix multiplication is also benchmarked.
A Bessel function of the first kind of order can be computed via Taylor expansion
(1) |
where is the Gamma function. One can compute the accumulated item iteratively as . The irreversible implementation is
This computational process could be diagrammatically represented as a computational graph as shown in Fig. 5 (a). The computational graph is a directed acyclic graph (DAG), where a node is a function and an edge is a data. An edge connects two nodes, one generates this data, and one consumes it. A computational graph is more likely a mathematical expression. It can not describe inplace functions and control flows conveniently because it does not have the notation for memory and loops.
Before showing the reversible implementation, we introduce how to obtain the product of a sequence of numbers reversibly. Consecutive multiplication requires an increasing size of tape to cache an intermediate state , since one can not deallocate the previous state directly since *= and /= are not considered as reversible here. To mitigate the space overhead, the standard approach in reversible computing is the pebble game model Perumalla (2013) (or the checkpointing technique in machine learning), where the cache size can be reduce in the cost of increasing the time complexity, however, constant cache size is not achievable in this scheme. Hence, we introduce the following reversible approximate multiplier.
Here, the third argument anc! is a dirty ancilla, which should be a value . Line 2 computes the result and accumulates it to the dirty ancilla, so that we have an approximate output in anc!. Line 3 removes the content in out! approximately using the information stored in anc!. Line 4 swaps the contents in out! and anc!. Finally, we have an approximate output and a dirtier ancilla. The reason why this trick works here lies in the fact that *= and /= are mathematically reversible (except the zero point) to each other. One can approximately uncomputing the contents in the register at the cost of rounding errors. This rounding error introduced in such a way only affects the function output, and does not sacrifice the reversibility. With this approximate multiplier, we implement the reversible as follows.
The above algorithm uses a constant number of ancillas, while the time overhead is also a constant factor. Ancilla anc4 plays the role of dirty ancilla in multiplication, and it is uncomputed rigorously in the uncomputing stage marked by @routine. This reversible program can be diagrammatically represented as a memory oriented computational graph as shown in Fig. 5 (b). This diagram can be used to analyze variables uncomputing. In this example, routine “B” uses hz_2, and k as control parameters, and changes the contents in anc2, anc3 and anc5. The following imul and (:+=)(identity) copies the result to output without changing these variables. Hence we can apply the inverse routine B to safely restore contents in anc2, anc3 and anc5, and this examplifies the compute-copy-uncompute paradigm.
One can obtain gradients of this function by calling Grad(ibesselj). Here, Grad(ibesselj) returns a callable instance of type Grad{typeof(ibesselj)}. The first parameters Val(1) specifies the position of loss in argument list. The Hessian can be obtained by feeding dual-numbers into this gradient function.
Here, the gradient field of hxx is defined as , which is a Dual number. It has a field partials that store the Hessian .
Graph embedding can be used to find a proper representation for an order parameter Takahashi and Sandvik (2020) in condensed matter physics. LABEL:Takahashi2020 considers a problem of finding the minimum Euclidean space dimension that a Petersen graph can embed into, that the distances between pairs of connected vertices are , and the distance between pairs of disconnected vertices are , where . The Petersen graph is shown in Fig. 6. Let us denote the set of connected and disconnected vertex pairs as and , respectively. This problem can be variationally solved with the following loss.
(2) |
The first line is a summation of distance variances in two sets of vertex pairs, where
is the variance of samples in . The second line is used to guarantee , where means taking the average of samples in . Its reversible implementation could be found in our benchmark repository.We repeat the training for dimension from to . In each training, we fix two of the vertices and optimize the positions of the rest. Otherwise, the program will find the trivial solution with overlapped vertices. For , the loss is always much higher than , while for
, we can get a loss close to machine precision with high probability. From the
solution, it is easy to see . An Adam optimizer with a learning rate Kingma and Ba requires steps training. The trust region Newton’s method converges much faster, which requires computations of Hessians to reach convergence. Although training time is comparable, the converged precision of the later is much better.In the following benchmarks the CPU device is Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz, and the GPU device is Nvidia Titan V. For NiLang benchmarks, we have turned off the reversibility check off to achieve a better performance.
We benchmarked the call, uncall and backward propagation time used for sparse matrix dot product and matrix multiplication. Here, we estimate the time for back propagating gradients rather than including both forward and backward, since
mul! does not output a scalar as loss.The time used for computing backward pass is approximately 1.5-3 times the Julia’s native forward pass. This is because the instruction length of differentiating basic arithmetic instructions is longer than pure computing by a factor of 2 or more.
We differentiate the first type Bessel function in Sec. 4.2 and show the benchmarks in Table 3. In the table, Julia is the CPU time used for running the irreversible forward program. It is the baseline for benchmarking. NiLang (call/uncall) is the time of reversible call or uncall. Both of them are times slower than its irreversible counterpart. Since Bessel function has only one input argument, forward mode AD tools are faster than reverse mode AD, both source-to-source framework ForwardDiff and operator overloading framework Tapenade have the a comparable computing time with the pure function call.
NiLang.AD is the reverse mode AD submodule in NiLang, and it takes 11 times the native Julia program, and is also 2 times slower than Tapenade. However, the key point is, there is no extra memory allocation like stack operations in the whole computation. The controllable memory allocation of NiLang makes it compatible with CUDA program. In other backward mode AD like Zygote, ReverseDiff and Tapenade, the memory allocation in heap is nonzero due to the checkpointing.
Since one can combine ForwardDiff and NiLang to obtain Hessians, it is interesting to see how much performance we can get in differentiating the graph embedding program in Sec. 4.3.
In Table 4, we show the the performance of different implementations by varying the dimension . The number of parameters is . As the baseline, (a) shows the time for computing the function call. We have reversible and irreversible implementations, where the reversible program is slower than the irreversible native Julia program by a factor of due to the uncomputing overhead. The reversible program shows the advantage of obtaining gradients when the dimension . The larger the number of inputs, the more advantage it shows due to the overhead proportional to input size in forward mode AD. The same reason applies to computing Hessians, where the combo of NiLang and ForwardDiff gives the best performance for .
We reproduced the benchmarks for Gaussian mixture model (GMM) and bundle adjustment (BA) in
LABEL:Srajer2018 by re-writing the programs in a reversible style. We show the results in Table 5 and Table 6. In our new benchmarks, we also rewrite the ForwardDiff program for a fair benchmark, this explains the difference between our results and the original benchmark. The Tapenade data is obtained by executing the docker file provided by the original benchmark, which provides a baseline for comparison.In the GMM benchmark, NiLang’s objective function has overhead comparing with irreversible programs in most cases. Except the uncomputing overhead, it is also because our naive reversible matrix-vector multiplication is much slower than the highly optimized BLAS function, where the matrix-vector multiplication is the bottleneck of the computation. The forward mode AD suffers from too large input dimension in the large number of parameters regime. Although ForwardDiff batches the gradient fields, the overhead proportional to input size still dominates. The source to source AD framework Tapenade is faster than NiLang in all scales of input parameters, but the ratio between computing the gradients and the objective function are close.
In the BA benchmark, reverse mode AD shows slight advantage over ForwardDiff. The bottleneck of computing this large sparse Jacobian is computing the Jacobian of a elementary function with 15 input arguments and 2 output arguments, where input space is larger than the output space. In this instance, our reversible implementation is even faster than the source code transformation based AD framework Tapenade. With KernelAbstractions, we run our zero allocation reversible program on GPU, which provides a >200x speed up.
In this paper, we show how to realize a reversible programming eDSL and implement an instruction level backward mode AD on top of it. It gives the user more flexibility to tradeoff memory and computing time comparing with traditional checkpointing. The Julia implementation NiLang gives the state-of-the-art performance and memory efficiency in obtaining first and second-order gradients in applications, including first type Bessel function, sparse matrix manipulations, solving graph embedding problem, Gaussian mixture model and bundle adjustment.
In the following, we discuss some practical issues about reversible programming, and several future directions to go.
In history, there have been many discussions about time-space tradeoff on a reversible Turing machine (RTM). In the most straightforward g-segment tradeoff scheme
Bennett (1989); Levine and Sherman (1990), an RTM model has either a space overhead that is proportional to computing time or a computational overhead that sometimes can be exponential to the program size comparing with an irreversible counterpart. This result stops many people from taking reversible computing seriously as a high-performance computing scheme. In the following, we try to explain why the overhead of reversible computing is not as terrible as people thought.First of all, the overhead of reversing a program is upper bounded by the checkpointing Chen et al. (2016) strategy used in many traditional machine learning package that memorizes inputs of primitives because checkpointing can be trivially implemented in reversible programming. Perumalla (2013) Reversible programming provides some alternatives to reduce the overhead. For example, accumulation is reversible, so that many BLAS functions can be implemented reversiblly without extra memory. Meanwhile, the memory allocation in some iterative algorithms can often be reduced with the “arithmetic uncomputing” trick without sacrificing reversibility, as shown in the ibesselj example in Sec. 4.2. Clever compiling based on memory oriented computational graphs (Fig. 1 and Fig. 5 (b)) can also be used to help user tradeoff between time and space. The overhead of a reversible program mainly comes from the uncomputing of ancillas. It is possible to automatically uncompute ancillas by analyzing variable dependency instead of asking users to write @routine and @routine pairs. In a hierarchical design, uncomputing can appear in every memory deallocation (or symbol table reduction). To quantify the overhead of uncomputing, we introduce the term uncomputing level as bellow.
The log-ratio between the number of instructions of a reversible program and its irreversible counterpart.
From the lowest instruction level, whenever we reduce the symbol table (or space), the computational cost doubles. The computational overhead grows exponentially as the uncomputating level increases, which can be seen from some of the benchmarks in the main text. In sparse matrix multiplication and dot product, we don’t introduce uncomputing in the most time consuming part, so it is . The space overhead is 2*m to keep the branch decisions, which is even much smaller than the memory used to store row indices. in Gaussian mixture model, the most time consuming matrix-vector multiplication is doubled, so it is . The extra memory usage is approximately 0.5% of the original program. In the first kind Bessel function and bundle adjustment program, the most time consuming parts are (nestedly) uncomputed twice, hence their uncomputing level is . Such aggressive uncomputing makes zero memory allocation possible.
So far, our eDSL is compiled to Julia language. It relies on Julia’s multiple dispatch to differentiate a program, which requires users to write generic programs. A more liable AD should be a hardware or micro instruction level feature. In the future, we can expect NiLang being compiled to reversible instructions Vieri (1999) and executed on a reversible device. A reversible devices can play a role of differentiation engine as shown in the hetero-structural design in Fig. 7. It defines a reversible instruction set and has a switch that controls whether the instruction calls a normal instruction or an instruction that also updates gradients. When a program calls a reversible differentiable subroutine, the reversible co-processor first marches forward, compute the loss and copy the result to the main memory. Then the co-processor execute the program backward and uncall instructions, initialize and updating gradient fields at the same time. After reaching the starting point of the program, the gradients are transferred to the global memory. Running AD program on a reversible device can save energy. Theoretically, the reversible routines do not necessarily cost energy, the only energy bottleneck is copying gradient and outputs to the main memory.
A Quantum device Nielsen and Chuang (2002)
is a special reversible hardware that features quantum entanglement. The instruction set of classical reversible programming is a subset of quantum instruction set. However, building a universal quantum computer is difficult. Unlike a classical state, a quantum state can not be cloned. Meanwhile, it loses information by interacting with the environment. Classical reversible computing does not enjoy the quantum advantage, nor the quantum disadvantages of non-cloning and decoherence, but it is a model that we can try directly with our classical computer. It is technically smooth to have a reversible computing device to bridge the gap between classical devices and universal quantum computing devices. By introducing entanglement little by little, we can accelerate some elementary components in reversible computing. For example, quantum Fourier transformation provides an alternative to the reversibl
Comments
There are no comments yet.