This article introduces surrogate modelling, the practice of building data-driven models for machine learning and optimization applications, and gives examples of where it is useful. It also explains how surrogate models are constructed, lists common surrogate modelling methods, and introduces a Python library that implements these methods. The outline of this article is:

- Why Do We Need Surrogate Models?
- How Can We Utilize Surrogate Models in Machine Learning and Optimization?
- How Do Surrogate Models Work?
- How to Construct a Surrogate Model?
- Surrogate Modelling Methods
- Which Programming Language for Surrogate Models?
- Conclusion and Future Article

**Why Do We Need Surrogate Models?**

Having a sufficient amount of data is critical for most optimization and machine learning methods to capture how the entire system behaves. However, collecting data from a real system may require many repeated experiments with different parameters, which generally takes a very long time, anywhere from a few days to a few years. An alternative is to simulate the experiment and use the simulated data in the optimization or machine learning problem. Depending on the complexity of the model, simulations can also take a very long time. For example, an electromagnetic simulation of a sophisticated antenna or waveguide may take from a few hours to a few days. Assuming that each simulation takes between one hour and one day and that optimizing this structure requires at least 10000 simulations, the total simulation time would be 10000 hours to 10000 days, which is not feasible. Consequently, collecting data through extensively repeated measurements or simulations may be prohibitively expensive or time-consuming, and it may not be possible to acquire sufficient data.

**How Can We Utilize Surrogate Models in Machine Learning and Optimization?**

One solution for producing sufficient data is to mimic the complicated real system with a surrogate model. Surrogate models, also known as metamodels or emulators, are data-driven models of real systems. They are constructed from far fewer simulations than the optimization or machine learning problem would otherwise require. Let us return to our first example, where the optimization problem (or machine learning problem) required at least 10000 simulations to reach a reasonable solution. Instead, we may run only 500 simulations, which takes 1/20 of the total time, and construct the surrogate model from the data they produce. Once the surrogate model is constructed, it takes only a few seconds to produce new data, so we can generate a data set equivalent to 10000 simulations in a few minutes and run the optimization on it. The optimization is therefore sped up roughly 20 times, since the time spent evaluating the surrogate is negligible compared with the simulation time. For instance, instead of 40 weeks, we can solve the same problem in about 2 weeks, which saves a great deal of time and money for our business or research.
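The arithmetic behind this speed-up can be sketched in a few lines of Python. The function name, the one-hour simulation time, and the 0.01-second surrogate evaluation time are illustrative assumptions, not measured values.

```python
def simulation_budget(n_required, n_training, hours_per_simulation,
                      surrogate_seconds_per_eval=0.01):
    """Compare direct simulation with a surrogate-based workflow (times in hours).

    All inputs are hypothetical; the surrogate evaluation time is an assumed value.
    """
    # Cost of running every required evaluation in the real simulator.
    direct_hours = n_required * hours_per_simulation
    # Cost of the training simulations needed to build the surrogate.
    training_hours = n_training * hours_per_simulation
    # Cost of evaluating the (cheap) surrogate for every required point.
    evaluation_hours = n_required * surrogate_seconds_per_eval / 3600.0
    surrogate_hours = training_hours + evaluation_hours
    return {
        "direct_hours": direct_hours,
        "surrogate_hours": surrogate_hours,
        "speed_up": direct_hours / surrogate_hours,
    }


budget = simulation_budget(n_required=10_000, n_training=500,
                           hours_per_simulation=1.0)
print(f"Direct simulation:  {budget['direct_hours']:.0f} hours")
print(f"Surrogate workflow: {budget['surrogate_hours']:.1f} hours")
print(f"Speed-up:           {budget['speed_up']:.1f}x")
```

The speed-up is dominated by the ratio of required evaluations to training simulations. The time needed to fit the surrogate is ignored in this sketch and would reduce the gain slightly.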

**How Do Surrogate Models Work?**

The main idea behind surrogate models is that every linear or non-linear system can be treated as a black box with input and output parameters. If we have a sufficient amount of data, we can model this black box accurately and capture the relation between the input parameters and the output data. Although we may not have a sufficient amount of data in most cases, we can still produce a surrogate model that represents the real model with a reasonable error. Furthermore, there are various techniques for producing surrogate models, and one of them may represent our real system better than the others. Consequently, producing a surrogate model is not straightforward: it requires a good understanding of the real system as well as of the available surrogate model types.
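One way to write the black-box view down is shown here. The notation is introduced only for illustration, and least squares is just one common fitting criterion among several. The real system maps inputs $\mathbf{x}$ to an output $y$, and the surrogate $\hat{f}$ with parameters $\boldsymbol{\theta}$ is fitted to $N$ sampled input-output pairs $(\mathbf{x}_i, y_i)$:

$$
y = f(\mathbf{x}), \qquad
\hat{f}(\mathbf{x}; \boldsymbol{\theta}) \approx f(\mathbf{x}), \qquad
\boldsymbol{\theta}^{*} = \arg\min_{\boldsymbol{\theta}} \sum_{i=1}^{N} \left( y_i - \hat{f}(\mathbf{x}_i; \boldsymbol{\theta}) \right)^{2}
$$

The "reasonable error" mentioned above is then the gap between $\hat{f}(\mathbf{x}; \boldsymbol{\theta}^{*})$ and $f(\mathbf{x})$ at inputs that were not used for fitting.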

**How to Construct a Surrogate Model?**

A surrogate model is a statistical model of a system, and its usefulness depends on how accurate it is. Constructing a surrogate model generally involves the following steps:

- Sample data from simulations or experiments
- Construct the surrogate model and optimize its parameters
- Validate the surrogate model and improve it by selecting a more suitable surrogate model type and/or adding more data
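The three steps above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions: `expensive_simulation` is a hypothetical stand-in for a real simulator, the sampling is plain uniform random sampling, and the surrogate is a simple polynomial rather than any particular recommended method.

```python
import numpy as np


def expensive_simulation(x):
    """Hypothetical stand-in for a slow simulation or experiment."""
    return np.sin(3.0 * x) + 0.5 * x**2


rng = np.random.default_rng(seed=0)

# Step 1: sample the input space (uniform random sampling here;
# Latin hypercube sampling is a common alternative).
x_train = rng.uniform(-1.0, 1.0, size=30)
y_train = expensive_simulation(x_train)

# Step 2: construct the surrogate, here a degree-5 polynomial
# whose coefficients are fitted by least squares.
degree = 5
coefficients = np.polyfit(x_train, y_train, deg=degree)
surrogate = np.poly1d(coefficients)

# Step 3: validate on points that were not used for fitting.
x_val = rng.uniform(-1.0, 1.0, size=200)
y_val = expensive_simulation(x_val)
rmse = np.sqrt(np.mean((surrogate(x_val) - y_val) ** 2))
print(f"Validation RMSE: {rmse:.4f}")
```

If the validation error is too large, the third step suggests either changing the surrogate type (for example a different polynomial degree or a different method) or adding more training samples and repeating the fit.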

**Surrogate Modelling Methods**

There are various **statistical or machine learning-based techniques** to create surrogate models. For example:

- Polynomial Response Surfaces (RSM)
- Kriging (Gaussian Process Regression)
- Gradient-enhanced Kriging (GEK)
- Radial Basis Function (RBF)
- Support Vector Machines (SVM)
- Artificial Neural Networks (ANN)
- Random Forests

The accuracy and efficiency of each method depend on the real system being modelled. Therefore, in most cases, it is not possible to say which method is most suitable for a specific application without seeing the data. In practice, constructing the surrogate model with a few different methods and comparing them on validation data reveals the most appropriate model for the system under consideration.
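A comparison of this kind might look like the sketch below, which fits a quadratic response surface and a Gaussian radial basis function interpolant to the same samples and compares their validation errors. The test function, the length scale, and the small ridge term are assumptions chosen for illustration. Both surrogates are hand-written with NumPy rather than taken from a library.

```python
import numpy as np


def simulation(x):
    """Hypothetical stand-in for an expensive simulation with two inputs."""
    return np.sin(3.0 * x[:, 0]) * np.cos(2.0 * x[:, 1])


def quadratic_features(x):
    """Full quadratic basis in two variables: 1, x1, x2, x1^2, x1*x2, x2^2."""
    x1, x2 = x[:, 0], x[:, 1]
    return np.column_stack([np.ones(len(x)), x1, x2, x1**2, x1 * x2, x2**2])


def fit_quadratic(x, y):
    """Polynomial response surface fitted by linear least squares."""
    coef, *_ = np.linalg.lstsq(quadratic_features(x), y, rcond=None)
    return lambda x_new: quadratic_features(x_new) @ coef


def gaussian_kernel(a, b, length_scale):
    """Gaussian RBF values between every row of a and every row of b."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * length_scale**2))


def fit_rbf(x, y, length_scale=0.5, ridge=1e-8):
    """Gaussian RBF interpolant; the ridge term keeps the solve well behaved."""
    gram = gaussian_kernel(x, x, length_scale) + ridge * np.eye(len(x))
    weights = np.linalg.solve(gram, y)
    return lambda x_new: gaussian_kernel(x_new, x, length_scale) @ weights


rng = np.random.default_rng(seed=1)
x_train = rng.uniform(-1.0, 1.0, size=(60, 2))
x_val = rng.uniform(-1.0, 1.0, size=(500, 2))
y_train, y_val = simulation(x_train), simulation(x_val)

candidates = {"quadratic response surface": fit_quadratic, "Gaussian RBF": fit_rbf}
results = {}
for name, fit in candidates.items():
    model = fit(x_train, y_train)
    # Root-mean-square error on held-out validation points.
    results[name] = float(np.sqrt(np.mean((model(x_val) - y_val) ** 2)))

best = min(results, key=results.get)
for name, rmse in results.items():
    print(f"{name}: validation RMSE = {rmse:.4f}")
print(f"Lowest validation error: {best}")
```

The ranking obtained this way holds only for this particular function and sample size. A different system, or a different number of samples, can reverse it, which is why the comparison has to be repeated for each application.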

**Which Programming Language for Surrogate Models?**

Surrogate models can be constructed in any programming language. However, Python and MATLAB are generally used for surrogate modelling because of their extensive support in the scientific research and machine learning communities.

One example is the Surrogate Modeling Toolbox (SMT), developed by researchers from the University of Michigan. It is an open-source Python package that includes surrogate modelling methods (kriging, RBF, GEKPLS, etc.), sampling methods, and example benchmarking problems. SMT is also well documented and well tested, which makes it suitable for research and industrial applications. The documentation for this library is available at:

https://smt.readthedocs.io/en/latest/

**Conclusion and Future Article**

In this article, we have introduced surrogate models and explained their benefits in optimization and machine learning applications. We have also presented example surrogate modelling methods and a Python library that implements many of them. In the next article, we will evaluate and compare the accuracy of various surrogate modelling methods on data obtained through electromagnetic simulations.