The first program to make significant headway at this, called BACON, was developed in the late 1970s by Patrick Langley. BACON would take in, say, a column of orbital periods and a column of orbital distances for different planets. It would then combine the data in different ways, such as period divided by distance or period squared divided by distance cubed, and stop when it found a constant value, because a constant implies that it has identified two proportional quantities. In other words, it stopped when it found an equation.
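To make that concrete, here is a minimal Python sketch of a BACON-style search over planetary data. The exponent range, the 1 percent tolerance and the data values are illustrative assumptions, not details of Langley's program.

```python
import numpy as np
from itertools import product

# Orbital periods (years) and distances (AU) for the six classical planets.
periods = np.array([0.241, 0.615, 1.0, 1.881, 11.86, 29.46])
distances = np.array([0.387, 0.723, 1.0, 1.524, 5.203, 9.537])

# BACON-style search (a sketch, not Langley's original code): try simple
# power-law combinations period**a * distance**b and stop as soon as one is
# nearly constant across all planets. A constant means the two quantities
# are proportional, i.e. we have found an equation. For these data the loop
# stops at period^-2 * distance^3, which is Kepler's third law.
for a, b in product(range(-3, 4), repeat=2):
    if a == 0 and b == 0:
        continue
    combo = periods**a * distances**b
    if combo.std() / abs(combo.mean()) < 0.01:   # "constant" within 1%
        print(f"period^{a} * distance^{b} is roughly constant at {combo.mean():.3f}")
        break
```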

Despite such successes, BACON remained something of a curiosity in an era of limited computing power. Researchers still had to analyze most data sets by hand, or use software that could fit a simple data set only when told in advance what class of equation to try. Whether an algorithm could find the correct model for describing any data set on its own remained an open question until 2009, when Hod Lipson and Michael Schmidt of Cornell University introduced Eureqa.

Their main goal was to build a machine that could boil down sprawling data sets into the handful of variables that actually matter. Maybe the weather is important. Maybe the number of dentists per square mile is important.

One persistent hurdle to wrangling numerous variables is finding an efficient way to guess new equations over and over again. Researchers say you also need the flexibility to try out (and recover from) dead ends: when the algorithm jumps from a line to a parabola, its ability to hit as many data points as possible might get worse before it gets better. In 1992 the computer scientist John Koza proposed genetic algorithms, which introduce random mutations into candidate equations and test the mutant equations against the data. Over many trials, initially useless features either evolve or wither away.
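Here is a toy version of that evolutionary idea in Python. It is an illustrative sketch rather than Koza's genetic programming system: candidate equations are random expression strings, new candidates are bred by combining survivors or guessed afresh, and a small length penalty nudges the search toward shorter formulas.

```python
import random
import numpy as np

rng = random.Random(0)
x = np.linspace(-2, 2, 50)
y = 3 * x**2 + x                    # hidden "law" the search tries to recover

def random_expr(depth=0):
    """Grow a random expression over x, small constants, + and *."""
    if depth > 2 or rng.random() < 0.3:
        return rng.choice(["x", str(rng.randint(1, 3))])
    op = rng.choice(["+", "*"])
    return f"({random_expr(depth + 1)} {op} {random_expr(depth + 1)})"

def fitness(expr):
    """Squared error against the data, plus a tiny penalty for length."""
    try:
        pred = eval(expr, {"x": x})
        return float(np.mean((pred - y) ** 2)) + 0.01 * len(expr)
    except Exception:
        return float("inf")

population = [random_expr() for _ in range(200)]
for _ in range(20):
    population.sort(key=fitness)
    survivors = population[:50]                     # selection
    children = []
    while len(children) < 150:
        a, b = rng.sample(survivors, 2)
        children.append(f"({a} {rng.choice(['+', '*'])} {b})")  # crude crossover
        children.append(random_expr())              # fresh random "mutant"
    population = survivors + children[:150]

best = min(population, key=fitness)
print("best equation found:", best)
```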

Lipson and Schmidt took the technique to the next level with Eureqa, building head-to-head competition into it. On one side, the program bred equations. On the other, it chose which data points to test the equations against, favoring the points that challenged the equations the most.

Eureqa could crunch data sets with more than a dozen variables, and it could recover advanced equations, such as those describing the motion of one pendulum hanging from another.

[Infographic: How symbolic regression algorithms mutate and crossbreed candidate equations, then compare the resulting equations against a set of data points.]

Other researchers, meanwhile, were finding ways to train deep neural networks, which by 2011 were becoming wildly successful at learning to tell dogs from cats. But a trained neural network consists of millions of numerically valued "neurons" that say nothing about which features they have learned to recognize. Eureqa, by contrast, communicated its findings in mathematical operations on physical variables.

When Sales-Pardo first played with Eureqa, she thought what it did was impossible. She and Guimerà began using it to build models for their own research on networks, and they were impressed with its power but frustrated by its inconsistency. If an equation was too complicated, the algorithm couldn't land on it, and if the researchers slightly changed their data, Eureqa would return a completely different formula. So Sales-Pardo and Guimerà set out to create a new machine scientist.

A Degree of Compression

To the pair, the problem with genetic algorithms was that they relied too much on their creators' tastes. Developers have to instruct the algorithm how to balance simplicity with accuracy: an equation can always hit more points by adding more terms, but outlying points are often just noise and best ignored. One might define simplicity as the length of the equation and accuracy as how close the curve gets to each point in the data set, but those are just two definitions among many possibilities.

Sales-Pardo and Guimerà drew on their expertise in physics and statistics to design a more principled evolutionary process. They downloaded thousands of equations from Wikipedia and analyzed them to see which kinds of mathematical structures were most common. That statistical picture guided the algorithm's initial guesses, making it more likely to try out, say, a plus sign than some exotic operation. And the random sampling method used to generate variations of the equations could be mathematically proven to explore every corner of the mathematical landscape.

Finally, they evaluated candidate equations by how well each could compress a data set. A random scattering of points can't be compressed at all; you need to know the position of every dot. But if 1,000 dots fall along a straight line, they can be compressed into just two numbers. The degree of compression gave the couple a principled way to compare candidate equations.
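A back-of-the-envelope Python illustration of that compression criterion follows. It is a rough sketch using a crude description-length score, not the actual method Guimerà and Sales-Pardo built; the bit-counting convention and the noise level are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 1000)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, x.size)   # 1,000 dots near a straight line

def description_length(residuals, n_params, bits_per_number=32):
    """Crude description length: bits for the parameters plus bits for what
    the model fails to explain (entropy of Gaussian residuals)."""
    var = residuals.var() + 1e-12
    residual_bits = 0.5 * residuals.size * np.log2(2 * np.pi * np.e * var)
    return n_params * bits_per_number + residual_bits

# Model 1: a straight line, y = a*x + b (two numbers describe 1,000 points).
a, b = np.polyfit(x, y, 1)
line_bits = description_length(y - (a * x + b), n_params=2)

# Model 2: no pattern, just store every point relative to its mean.
raw_bits = description_length(y - y.mean(), n_params=1)

print(f"line model:  ~{line_bits:,.0f} bits")
print(f"raw listing: ~{raw_bits:,.0f} bits  (the line compresses the data far more)")
```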

They and their colleagues described their method, which they used to investigate what causes cell division, in Science Advances in 2020.

Oceans of Data

Since then, the researchers have used the machine scientist to improve on the state-of-the-art equation for predicting a country's energy consumption, and another group has used it to help model percolation through a network. Its developers expect such software to take on a growing role in biological research.

Machine scientists are also helping physicists understand systems that span multiple scales. Physicists typically use one set of equations for atoms and a completely different set for billiard balls, but this piecemeal approach doesn't work for researchers in a discipline like climate science, where the small-scale currents around Manhattan feed into the Atlantic Ocean's larger flows.

Laure Zanna, a researcher at New York University who models ocean turbulence, is often caught between two extremes in her work: supercomputers can simulate either city-size eddies or intercontinental currents, but not both at once. Her job is to help the computers generate a global picture that still includes the effects of the smaller whirlpools. Initially she turned to deep neural networks to distill the effect of high-resolution simulations and update coarser simulations accordingly.

Nathan Kutz, an applied mathematician at the University of Washington, has helped develop an alternative known as sparse regression, which is similar in spirit to symbolic regression. It starts with a library of perhaps a thousand candidate functions, such as x², x/(x − 1) and sin(x). The algorithm searches the library for the combination of terms that gives the most accurate predictions, then deletes the least useful terms, repeating until only a handful remain. The lightning-fast procedure can handle more data than symbolic regression can, but the final equation must be built from the library's terms.
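The following Python sketch shows the flavor of that procedure, using sequentially thresholded least squares on a toy data set. It illustrates the general idea, not the code Kutz and his colleagues distribute; the library of terms, the threshold and the synthetic data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, 200)
y = 1.5 * x**2 - 0.7 * np.sin(x) + rng.normal(0, 0.01, x.size)   # hidden law

# Library of candidate functions evaluated on the data.
library = {
    "1": np.ones_like(x), "x": x, "x^2": x**2, "x^3": x**3,
    "sin(x)": np.sin(x), "cos(x)": np.cos(x), "exp(x)": np.exp(x),
}
names = list(library)
Theta = np.column_stack([library[n] for n in names])

# Fit all terms, then repeatedly zero out small coefficients and refit
# on the surviving terms until only a few remain.
coeffs, *_ = np.linalg.lstsq(Theta, y, rcond=None)
for _ in range(10):
    small = np.abs(coeffs) < 0.05                   # sparsity threshold
    coeffs[small] = 0.0
    big = ~small
    if big.any():
        coeffs[big], *_ = np.linalg.lstsq(Theta[:, big], y, rcond=None)

equation = " + ".join(f"{c:.2f}*{n}" for c, n in zip(coeffs, names) if c != 0)
print("recovered:", equation)
```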

To get a feel for how it worked, Zanna applied a modified version of Kutz's sparse regression algorithm to ocean models. When she fed in high-resolution movies of ocean currents and asked the algorithm for accurate but sparse sketches of them, it returned a succinct equation involving how fluids stretch and shear. When she fed that into her model of large-scale fluid flow, she could see how the flow changed as a function of energy.

The equation it produced, Zanna said, was really a representation of some of the key properties of ocean currents.

Smarter Together

Other groups are giving machine scientists a boost by combining their strengths with those of deep neural networks.

Miles Cranmer, a graduate student at Princeton University, has developed an open-source symbolic regression algorithm called PySR. It sets up different populations of equations on digital "islands" and lets the equations that best fit the data periodically migrate and compete with the residents of other islands. Cranmer has worked with computer scientists at DeepMind and NYU and astrophysicists at the Flatiron Institute on a hybrid scheme: first train a neural network to accomplish a task, then ask PySR to find an equation describing what certain parts of the neural network have learned to do.
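For readers who want to try it, a minimal usage sketch with the PySR library might look like the following. The toy data, the operator choices and the iteration count are arbitrary, and option names and defaults may vary across PySR versions (the package also needs its Julia backend installed).

```python
import numpy as np
from pysr import PySRRegressor   # pip install pysr

# Toy data generated from a hidden formula: y = x0^2 + cos(x1).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X[:, 0] ** 2 + np.cos(X[:, 1])

# Configure the symbolic regression search: which operators the island
# populations may combine, and how long to evolve them.
model = PySRRegressor(
    niterations=40,
    binary_operators=["+", "-", "*", "/"],
    unary_operators=["cos", "sin"],
)
model.fit(X, y)

# Print the discovered equations along the accuracy/complexity trade-off.
print(model)
```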

The group applied the procedure to a dark matter simulation and created a formula for the density at the center of a dark matter cloud based on the properties of neighboring clouds. The equation fit the data better than the human-designed one.

[Photo: Hod Lipson smiles at the camera while holding a football-size robot resembling a spider.]

In February, they fed their system 30 years' worth of real positions of bodies in the solar system. The algorithm was not handed the law of gravitation or the masses of the planets and moons; it recovered them from the data. Other groups have recently used PySR to discover equations describing everything from features of particle collisions to the way clouds of dark matter sculpt the galaxies at their centers.

Researchers at the Massachusetts Institute of Technology have created a machine scientist of their own, called AI Feynman.

Kutz believes machine scientists are bringing the field to a point where researchers could simply point a camera at an event and get back an equation capturing what is happening. But current algorithms still need humans to hand them a laundry list of potentially relevant variables.

That is what Lipson has been working on. In a December preprint, he and his colleagues described a procedure in which they trained a deep neural network to take in a few frames of a video and predict the next few frames. The team then reduced the number of variables the neural network was allowed to use until its predictions started to fail.

The procedure could figure out how many variables were needed to model both simple systems and complex ones, such as the flickering of a campfire.

Those variables are like the "flaminess" of the flame, according to Lipson.

The Edge of (Machine) Science

Machine scientists are not going to replace deep neural networks; no one expects to find a concise equation for telling dogs from cats.

Yet when it comes to planets, fluids and dividing cells, concise equations describe them remarkably well, and no one fully understands why. Eugene Wigner called it a wonderful gift that we neither understand nor deserve in his 1960 essay "The Unreasonable Effectiveness of Mathematics in the Natural Sciences."

Cranmer and colleagues think that elementary operations are such overachievers because they represent basic geometric actions in space, making them a natural language for describing reality. Addition moves an object along a number line. Multiplication turns a flat area into a 3D volume. That, they argue, is why betting on simplicity makes sense.

The universe's underlying simplicity, in other words, is what makes success possible.

Guimerà and Sales-Pardo built a rigorous mathematical framework in part because Eureqa would sometimes find wildly different equations for similar inputs. Yet they found that even their machine scientist sometimes returned multiple equally good models for the same data set.

The reason is baked into the data itself. Exploring various data sets, they found that the data fell into two categories: clean and noisy. Below a certain noise threshold, the machine scientist could always find the equation that generated the data; above it, it never could. In other words, noisy data could match any number of equations equally well. And because the researchers have proved that their algorithm always finds the best possible equation, they know that where it fails, no other scientist can succeed.

That, Guimerà said, is a fundamental limitation.

The Flatiron Institute is funded by the Simons Foundation, which also funds this editorially independent publication.