New algorithm to ensure virtual simulation data security based on deep learning using applied innovation design

Virtual simulation laboratories save the material resources and equipment investment required to build a real experimental environment. They let learners experiment and explore in a virtual environment, which reduces resource waste and cost. Virtual simulation laboratories also enable resource sharing: academic institutions can share the platform and content of a virtual laboratory to improve the efficiency of resource utilization. However, virtual simulation experiment data can easily be stolen over the network, which makes virtual simulation data security a challenging research problem. In this paper, we study virtual simulation data security based on deep learning through applied innovation design and propose a new algorithm. The minimum violation sequence set in the virtual simulation data set is identified, and the suppression mode of each minimum violation sequence is determined. A score table is constructed for the instances in the sequence, and instances are selected and suppressed according to their scores. A cross-attention module with a Transformer structure is proposed to aggregate global and local feature information between the left and right images and to capture long-distance dependencies between them along the epipolar direction, which fuses their global feature information more effectively. The results show that the proposed algorithm not only ensures the safety of trajectory data but also improves data availability.


INTRODUCTION
With the rapid development of information technology and the arrival of the big data era, data privacy has become particularly important. Privacy protection for data release has attracted wide attention from academia and industry. Differential privacy (Yin et al., 2021; Adnan et al., 2022; Zhao et al., 2022) is a privacy protection technology that has emerged in recent years and addresses the defects of traditional privacy protection models. Its basic idea is to use a randomized algorithm to perturb the results of queries on the original data set, thereby protecting privacy. The advantage of differential privacy is that it does not need to describe the attacker's background knowledge quantitatively; instead, it directly assumes that the attacker has all the background knowledge (Tseng and Zhang, 2023). Suppose an attacker targets one record and knows every record except that one. Even after the attacker analyzes the query result, differential privacy still ensures that the target record is not leaked (Jia et al., 2021; Cretu et al., 2022).
With the rapid development of GPS, Wi-Fi and other positioning and storage technologies, a large amount of trajectory data from mobile users has been collected and stored (Zhang et al., 2023). Trajectory data contains rich temporal and spatial information, and its in-depth analysis has become a research hotspot in data mining (Kesu and RamAngu, 2023; Meng et al., 2023). Highly available trajectory data is the basis of effective trajectory data mining, through which researchers can obtain valuable information (Hamid et al., 2021; Hu et al., 2021). However, trajectory data also contains a large amount of private information about mobile users. If trajectory data is not protected before release, attackers with background knowledge can analyze it to infer private information such as physical condition, living habits and home address, which may even cause economic losses and personal safety problems for users (Dhinakaran and Joe, 2022). Therefore, ensuring that published trajectory data does not disclose user privacy while remaining highly available is an urgent problem (Bag et al., 2021; Birjali and Kasri, 2021).
This paper is organized as follows. Section 2 introduces related work. Section 3 describes the proposed method in detail. Section 4 presents the results and discussion. Section 5 concludes the paper.

RELATED RESEARCH
When accessing published data, an attacker should not be able to obtain any additional information about the target, even with background knowledge obtained from other channels (Irazoqui et al., 2014); this is the basic idea of data privacy protection. Many privacy protection models and methods have emerged toward this goal. Traditional privacy protection techniques are based on grouping, including K-anonymity (Ren et al., 2023), L-diversity (Li et al., 2023), t-closeness (Soria-Comas et al., 2015) and some derivative methods (Kerestes et al., 2021). Their basic idea is to anonymize and hide all records by aligning quasi-identifiers (attributes related to the background knowledge of the attacker), so that all records are divided into several equivalence classes and each record is hidden within a group of records. However, these models all rely on assumptions about the attacker's capabilities and background knowledge. They therefore cannot guarantee sufficient security, although they have inspired many other ideas, such as privacy metrics (Yin et al., 2018). Differential privacy protection technology is described in (Soria-Comas et al., 2017). In its implementation, an arbitrary query $f$ on data set $D$ yields the result $f(D)$; the randomized algorithm $M$ adds noise $x$ to $f(D)$, where $x$ follows a certain distribution (e.g., the Laplace distribution). This algorithm was proved to satisfy the definition of differential privacy, and it finally returns $f(D) + x$ to the user.
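The noise-adding step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this paper: the function names laplace_noise and laplace_mechanism, the example sensitivity and the example privacy parameter are all chosen for the example. The noise scale is set to sensitivity/epsilon, which is the usual calibration of the Laplace mechanism.

```python
import random


def laplace_noise(scale, rng):
    """Draw one zero-mean Laplace sample with the given scale."""
    # The difference of two independent Exp(1) variables is Laplace(0, 1).
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def laplace_mechanism(true_answer, sensitivity, epsilon, rng):
    """Return the query result f(D) perturbed by Laplace noise x."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    scale = sensitivity / epsilon  # a larger epsilon means less noise
    return true_answer + laplace_noise(scale, rng)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    # Hypothetical count query whose true answer f(D) is 4.
    for _ in range(3):
        noisy = laplace_mechanism(4, sensitivity=1.0, epsilon=0.5, rng=rng)
        print(round(noisy, 2))
```

Each call returns a different noisy answer, so a single response reveals the true value only approximately; the smaller the assumed epsilon, the wider the spread.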
Suppose the attacker already knows everyone's information except Alice's diagnosis and wants to obtain it, so the attacker issues a query request $f$ to the medical data set shown in Table 1. Because the attacker knows everyone's information except Alice's, the attacker knows that the query outputs either 3 or 4. The differential privacy algorithm $M$ adds noise $x$ to the true query output of 4; assuming $x = -0.7$, it returns 3.3 to the attacker. From 3.3, the attacker cannot easily determine whether the true output is 3 or 4, which protects Alice's privacy.
The core of differential privacy protection lies in the selection of the parameter $\varepsilon$ and the design of the randomized algorithm. For the latter, different types of problems call for different mechanisms; the most basic are the Laplace mechanism (Li et al., 2019) and the exponential mechanism (Gopi et al., 2022). The former achieves differential privacy by perturbing the true output value with noise drawn from the Laplace distribution and mainly handles numerical outputs. The latter mainly handles algorithms whose outputs are non-numerical. In the interactive setting, Zhang et al. (2022) improved the traditional Laplace mechanism so that it could answer more queries under the same budget. In the non-interactive setting, Brauwers and Frasincar (2023) put forward the matrix mechanism, which represents correlated queries as a matrix and thereby reduces the amount of noise added, although its efficiency and optimization effect were not ideal. Xiao et al. (2011) first applied the Haar wavelet transform to the data and then added noise, which reduced data availability and improved query accuracy. Soria-Comas et al. (2014) proposed a differential privacy protection method based on hierarchical summation and least squares that divides the query sequence into groups satisfying a consistency constraint and adds noise to each group. These results fall within the category of the matrix mechanism. Li et al. (2021) improved the above method by proposing a low-rank matrix mechanism that optimizes its strategy by decomposing the workload matrix.
For the selection of the parameter $\varepsilon$, Jagielski et al. (2018) proposed an attack model that provided an upper bound on $\varepsilon$ but did not provide an attack algorithm. From an economic perspective, Jacobs et al. (2022) proposed a simple economic model that enables users to select the parameter in a principled manner. Mahawaga et al. (2020) proposed an attack algorithm against differential privacy protection and gave an upper bound for the selection of $\varepsilon$. In summary, the selection of the parameter $\varepsilon$ is of clear importance.
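The Alice example above can be made quantitative with a short, hedged calculation. The sketch below assumes Laplace noise with sensitivity 1 and a privacy parameter of 1.0 (the text does not state the value used in the example) and compares how likely the observed answer 3.3 is under the two candidate true outputs, 4 and 3. The helper names laplace_pdf and likelihood_ratio are hypothetical.

```python
import math


def laplace_pdf(x, mu, b):
    """Density of the Laplace distribution with location mu and scale b."""
    return math.exp(-abs(x - mu) / b) / (2.0 * b)


def likelihood_ratio(observed, answer_a, answer_b, sensitivity, epsilon):
    """How much more likely the observation is under answer_a than answer_b."""
    b = sensitivity / epsilon
    return laplace_pdf(observed, answer_a, b) / laplace_pdf(observed, answer_b, b)


if __name__ == "__main__":
    epsilon = 1.0  # assumed value, for illustration only
    ratio = likelihood_ratio(3.3, 4, 3, sensitivity=1.0, epsilon=epsilon)
    print("likelihood ratio (true output 4 vs 3):", round(ratio, 3))
    # For neighbouring answers the ratio stays within [e^-epsilon, e^epsilon].
    print("allowed range:", round(math.exp(-epsilon), 3), "to",
          round(math.exp(epsilon), 3))
```

Because the ratio is bounded on both sides, the single answer 3.3 shifts the attacker's belief between the two candidates only by a limited factor, which is the sense in which Alice's diagnosis stays protected.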

RESEARCH METHOD
Suppose the data set $D$ has $n$ tuples, and the attacker knows all the background information except the sensitive information of the target. There are then $n$ ways to choose $n - 1$ tuples from $D$ to form a data set $D'$ with $|D'| = |D| - 1$. The collection of these data sets $D'$ is denoted $\Psi$ and called the potential input set; its size is $n = |\Psi|$.
In the scheme of Li et al. (2021), an attack model is proposed. Based on the attacker's prior (assumed to be uniform over the potential inputs) and posterior probability, it deduces an upper bound on the parameter $\varepsilon$, given as inequality (1), where $\rho$ is the probability with which the attacker places the attack target in or out of the output result set (with $\rho > 1/n$; otherwise the bound has no meaning) and $n$ is the size of the potential input set. The inequality shows that when $n$ is large, the admissible value of the parameter $\varepsilon$ is large.
In this section, we consider how an attacker can guess the true value by repeating the same query multiple times in the worst-case scenario (only 2 potential input sets), and thereby learn whether the attack target is in the data set. Given the query request $f$ made by the attacker against the target, the data set owner computes the result $f(D)$ and returns it to the attacker with noise $x$ added. The attacker thus performs $N$ attacks, obtains the $N$ results $f(D) + x_1, f(D) + x_2, \ldots, f(D) + x_N$, and infers from them whether the target is in the data set.
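The repeated-query scenario can be simulated with a small sketch. It assumes a simple attacker strategy that is not taken from this paper: average the N noisy answers and pick the nearer of the two candidate true values. The candidate values 3 and 4, the noise scale and all function names are illustrative assumptions.

```python
import random


def noisy_answers(true_answer, scale, n_queries, rng):
    """Return N answers f(D) + x_i with independent Laplace noise x_i."""
    return [
        true_answer + scale * (rng.expovariate(1.0) - rng.expovariate(1.0))
        for _ in range(n_queries)
    ]


def guess_true_answer(answers, candidates):
    """Assumed attacker strategy: average, then pick the nearest candidate."""
    mean = sum(answers) / len(answers)
    return min(candidates, key=lambda c: abs(c - mean))


def attack_success_rate(true_answer, candidates, scale, n_queries, trials, rng):
    """Fraction of simulated attacks that recover the true answer."""
    hits = 0
    for _ in range(trials):
        answers = noisy_answers(true_answer, scale, n_queries, rng)
        if guess_true_answer(answers, candidates) == true_answer:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (1, 10, 100):
        rate = attack_success_rate(4, (3, 4), scale=2.0, n_queries=n,
                                   trials=500, rng=rng)
        print("N =", n, "success rate =", rate)
```

Under these assumptions the success rate rises with N, which illustrates why the number of repeated queries matters when choosing the privacy parameter.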
Since the noise $x$ is random, the attacker cannot guess its exact value in each query. However, it is enough for the attacker to know the interval in which $x$ falls. For example, for a count query, as long as the noise $x$ falls within $[-0.5, 0.5]$, the attacker can determine the true count exactly by rounding.
Since the noise $x$ follows the Laplace distribution $\mathrm{Lap}(\mu, b)$, the attacker can calculate the probability that $x$ falls within a given interval if $\mu$ and $b$ are known. The location parameter $\mu$ has no effect on the attacker, while the scale parameter $b = \Delta f / \varepsilon$ directly affects the difficulty of the attack. For a given query, as $b$ increases, the probability that the noise $x$ falls within the fault-tolerant interval of the query becomes smaller and the attack becomes more difficult. The selection of the parameter $\varepsilon$ therefore determines this behavior.
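The interval probability mentioned above has a closed form for Laplace noise, which the following sketch evaluates. It assumes the standard Laplace cumulative distribution function and a scale equal to sensitivity/epsilon; the function names and the list of epsilon values are illustrative choices. The interval [-0.5, 0.5] is the count-query tolerance discussed in the text.

```python
import math


def laplace_cdf(x, mu, b):
    """Cumulative distribution function of Laplace(mu, b)."""
    if x < mu:
        return 0.5 * math.exp((x - mu) / b)
    return 1.0 - 0.5 * math.exp(-(x - mu) / b)


def prob_noise_in_interval(low, high, sensitivity, epsilon, mu=0.0):
    """Probability that the noise falls inside [low, high]."""
    b = sensitivity / epsilon  # scale parameter
    return laplace_cdf(high, mu, b) - laplace_cdf(low, mu, b)


if __name__ == "__main__":
    # Tolerance interval of a count query; epsilon values are assumptions.
    for epsilon in (0.1, 0.5, 1.0, 2.0):
        p = prob_noise_in_interval(-0.5, 0.5, sensitivity=1.0, epsilon=epsilon)
        print("epsilon =", epsilon, "P(noise in [-0.5, 0.5]) =", round(p, 4))
```

With zero location the probability reduces to 1 - exp(-0.5/b), so a larger scale (a smaller epsilon) makes a single query less likely to land inside the tolerance interval.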
In the feature extraction stage, the algorithm adopts the ResNet-40 structure (Nam et al., 2023) to down-sample the left and right images three times. The three down-sampling stages contain 3, 4 and 6 residual modules, respectively, and produce feature maps at 1/3, 1/6 and 1/12 of the original image resolution with 64, 128 and 256 channels, respectively. Global average pooling is then applied to the final output feature map, and the channel attention weights of the multi-scale feature maps are computed by two 1 × 1 convolutions; these weights guide the multi-scale fusion in the feature fusion stage.
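The channel attention step at the end of feature extraction can be sketched in plain Python. This is only an illustration under assumptions: the text specifies global average pooling followed by two 1 × 1 convolutions, while the ReLU between them, the final sigmoid, the reduced hidden width and the random untrained weights are assumptions borrowed from common squeeze-and-excitation designs. All function names are hypothetical.

```python
import math
import random


def global_average_pool(feature_map):
    """Reduce feature_map[c][h][w] to one mean value per channel."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def conv1x1(vector, weights, bias):
    """On a 1 x 1 spatial map, a 1 x 1 convolution is a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def channel_attention(feature_map, w1, b1, w2, b2):
    """Compute one attention weight in (0, 1) per channel."""
    pooled = global_average_pool(feature_map)
    hidden = [max(0.0, h) for h in conv1x1(pooled, w1, b1)]  # ReLU (assumed)
    logits = conv1x1(hidden, w2, b2)
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]  # sigmoid (assumed)


def random_matrix(rows, cols, rng):
    """Hypothetical untrained weights, for illustration only."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    channels, height, width, reduced = 4, 3, 3, 2
    fmap = [[[rng.random() for _ in range(width)] for _ in range(height)]
            for _ in range(channels)]
    weights = channel_attention(
        fmap,
        random_matrix(reduced, channels, rng), [0.0] * reduced,
        random_matrix(channels, reduced, rng), [0.0] * channels,
    )
    print([round(w, 3) for w in weights])
```

The output is one weight per channel; in a trained network these weights would rescale the channels of each scale before fusion.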
The 3 × 3 convolutions of the residual modules at all scales are replaced by context attention (COA). COA captures rich static and dynamic context information at the same time and uses the context among input keys to guide the learning of the dynamic attention matrix, which enhances the representation ability of the feature maps. Because COA is computationally similar to a standard 3 × 3 convolution, the network with COA modules has a number of parameters and floating-point operations similar to ResNet-40.
In COA, given an input 2D feature map $X \in \mathbb{R}^{H \times W \times C}$, the Key, Query and Value are defined as $K = X$, $Q = X$ and $V = X W_v$, respectively. COA first applies a $k \times k$ group convolution over all the neighboring keys within a $k \times k$ grid; the learned context Key $K^1 \in \mathbb{R}^{H \times W \times C}$ is the static context information among the neighboring keys. Then the context key $K^1$ is concatenated with $Q$ and passed through two consecutive 1 × 1 convolutions, $W_\theta$ with a ReLU activation and $W_\delta$ without an activation, which gives the dynamic multi-head attention matrix $A = [K^1, Q] W_\theta W_\delta$. For each head, $A$ learns the local attention matrix at each spatial position from the Query and the context Key, and the attended feature map $K^2$ is computed by aggregating $V$ with the attention matrix $A$: $K^2 = V \circledast A$. The feature map $K^2$, which captures the dynamic feature interactions among the inputs, is called the dynamic context. Finally, $K^2$ is fused with the static context $K^1$ by the attention mechanism (Yin et al., 2020) to form the output of COA.
The attention-guided multi-scale fusion module aggregates the feature information of the three scales in parallel, then recalibrates the feature map at each scale, and outputs the feature map according to the original three scales (1/3, 1/6, 1/12 of the original image).
Existing research shows that, in parallax map estimation, the channel weights of the feature map at 1/3 of the original image scale are usually fixed, whereas the channel weights at the 1/6 and 1/12 scales are usually more image-specific, and the parallax distributions of different images influence the channel weights differently. This shows the importance of recalibrating each feature channel by attention when multi-scale feature maps are aggregated in parallel.
By recalibrating the channel weights of feature maps across scales, the algorithm's ability to selectively identify information features and focus on significant features can be enhanced, more comprehensive and effective features can be extracted, and matching errors can be reduced.
In this paper, the algorithm uses three parallel aggregation modules for the feature maps of these three scales. In the definition of the aggregation modules, $S$ is the number of feature levels ($S = 3$ in this paper) and $F^k$ is the level-$k$ feature map output by the feature extraction stage. Similar to (Lee et al., 2022), $f_k$ maps $F^k$ to the target level $s$ according to the relationship between $k$ and $s$. When $k = s$, $f_k$ is the identity function. When $k < s$, $(s - k)$ convolutions with stride 2 down-sample the feature map to the target size. When $k > s$, bilinear up-sampling brings the feature map to the target size, and a 1 × 1 convolution then aligns the number of channels.
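The three cases of the cross-scale mapping can be summarized by a small planning function. The sketch assumes that levels are numbered from 1 (finest, 1/3 scale) to 3 (coarsest, 1/12 scale), that adjacent levels differ by a factor of 2 in resolution, and that each stride-2 convolution halves the spatial size; the function names and the example sizes are hypothetical, and no actual convolution is performed.

```python
def resample_plan(k, s):
    """List the operations that bring the level-k feature map to level s.

    Levels are numbered 1 (finest, 1/3 scale) to 3 (coarsest, 1/12 scale).
    """
    if k == s:
        return ["identity"]
    if k < s:
        # Source is finer than the target: halve the resolution (s - k) times.
        return ["stride-2 conv"] * (s - k)
    # Source is coarser than the target: up-sample, then align channels.
    return ["bilinear upsample x%d" % (2 ** (k - s)), "1x1 conv"]


def resampled_size(size, k, s):
    """Spatial size after moving a level-k map of the given size to level s."""
    if k <= s:
        return size // (2 ** (s - k))
    return size * (2 ** (k - s))


if __name__ == "__main__":
    levels = (1, 2, 3)
    for s in levels:
        for k in levels:
            print("target", s, "source", k, "->", resample_plan(k, s))
```

For every target level the three resampled maps end up with the same spatial size, which is what allows them to be aggregated in parallel.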
After the three scales are aggregated, channel importance is recalibrated by the channel attention module to give the recalibrated feature map $\tilde{F}^s$. The recalibration uses a block of two batch-normalized 3 × 3 convolutions with ReLU, together with the attention weights learned by the proposed attention module.

Experimental Environment and Data Set
In this paper, the synthetic data set City80K (Chen et al., 2013) is used to test the utility loss of the data. City80K simulates the movement trajectories of 80,000 pedestrians over 24 h in a metropolis with 26 city blocks. It contains five sensitive attribute values, one of which is chosen as the sensitive value in the experiment (Guo et al., 2023). All comparison algorithms are implemented in MATLAB and run on a workstation with an Intel i7-5500U CPU (3.0 GHz), 8 GB of memory and a 7200 RPM 1 TB hard disk. The operating system is Windows 11.

Measuring Standards
Loss rate is an important measure of the utility of a trajectory data set. This paper measures utility loss from three aspects: instances, maximal frequent sequences (MFS) and trajectories. Instance loss is measured by Equation (8). MFS loss is measured by Equation (9), which compares the total number of MFS in the original data set with the total number of MFS in the data set after algorithm processing. In this paper, the MAFIA algorithm (Burdick et al., 2005) is used to compute MFS.
Trajectory loss is measured by Equation (10), which compares the number of trajectories in the original data set with the number of trajectories in the data set after algorithm processing.
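As a hedged illustration of the three measures, the sketch below assumes that each loss rate is the relative reduction of the corresponding count, that is, (original count - processed count) / original count. This form is an assumption made for the example, since Equations (8)-(10) are not reproduced here, and the counts used are hypothetical rather than results from this paper.

```python
def loss_rate(original_count, processed_count):
    """Relative reduction from the original to the processed data set."""
    if original_count <= 0:
        raise ValueError("original_count must be positive")
    return (original_count - processed_count) / original_count


def utility_losses(original, processed):
    """Loss rate in percent for each measured aspect."""
    return {name: 100.0 * loss_rate(original[name], processed[name])
            for name in original}


if __name__ == "__main__":
    # Hypothetical counts, not results from this paper.
    original = {"instances": 1000, "mfs": 200, "trajectories": 400}
    processed = {"instances": 900, "mfs": 150, "trajectories": 380}
    print(utility_losses(original, processed))
```

A lower percentage means that more of the original instances, MFS or trajectories survive suppression, which is the sense in which a lower loss rate indicates higher data availability.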

RESULTS AND DISCUSSION
To study the effectiveness of the proposed algorithm, it is compared with the KCL algorithm (Liu et al., 2021; Yu et al., 2023), the DLE method (Zheng et al., 2022) and the BBL method (Liu and Zhang, 2023). Figs. 1-3 and Tables 2-4 show the instance loss rate, MFS loss rate and trajectory loss rate of the algorithms under different C values, where L = 3, K = 30 and E = 800. As the figures show, all three loss rates first decrease and then stabilize as C increases, because the number of minimum violation sequences, and with it the number of suppressed sequences, decreases and then stabilizes. When C is small, the minimum violation sequences are mainly handled by global suppression, and the loss rates of the algorithms are similar. As C increases, the number of minimum violation sequences decreases and the number of locally suppressed sequences increases. Compared with the other algorithms, the algorithm in this paper effectively reduces the instance loss, so its loss rate is lower. Table 2 gives the Instance-Loss comparison for the virtual simulation data, and Fig. 1 visualizes these results.
As Table 2 and Fig. 1 show, under different C values the Instance-Loss of the proposed method is lower than that of the KCL method, which means that the security of virtual simulation data can be better guaranteed. The curve converges at C = 0.4, so C = 0.4 is adopted in the subsequent experiments. Similarly, Tables 3 and 4 and Figs. 2 and 3 show that, under different C values, the MFS-Loss and Trajectory-Loss are also lower than those of the KCL method, which indicates that the security of virtual simulation data in applied innovation design can be better ensured. Again the curves converge at C = 0.4, and the values do not change afterwards.
For the experiments with varying K, L = 3, C = 0.4 and E = 800. As Figs. 4-6 show, the instance loss rate, MFS loss rate and trajectory loss rate all increase with K. An increase in K increases the number of minimum violation sequences and globally suppressed sequences, and the larger number of suppressed sequences correspondingly raises the data loss rate. When K is small, the minimum violation sequences are mainly handled by local suppression, and the loss rates of the two algorithms are similar. As K increases, the numbers of globally suppressed sequences and suppressed sequences increase. Compared with the other algorithm, the proposed algorithm effectively reduces the instance loss, so its loss rate is lower. Table 5 and Fig. 4 show that when K = 10 or 20 the KCL and proposed methods have the same values; as K grows, the values of both methods grow as well, and the growth is dramatic. Tables 6 and 7 show curve trends similar to those in Fig. 4 and Table 5. From an objective point of view, the increase of K does not reduce the security guarantee for virtual simulation data. In short, the three indicators have similar trends, which shows that the method in this paper provides a good security guarantee for virtual simulation data. In this case, teachers can teach without worrying about data leakage, which helps ensure the effectiveness of student learning.

CONCLUSION
This paper presents a new privacy protection algorithm for virtual simulation data. It adopts local suppression instead of global suppression to protect the privacy of trajectory data and combines deep learning methods to reduce the trajectory loss rate, instance loss rate and MFS loss rate. Experimental results show that the proposed algorithm achieves lower data loss rates than the other privacy protection algorithms. In future work, data availability will be further improved while maintaining the efficiency of the privacy protection algorithm.

Fig. 4. Visualization of the Instance-Loss comparison for virtual simulation data under different K values

Table 4. Trajectory-Loss comparison (%)
Figs. 4-6 show the instance loss rate, MFS loss rate, and trajectory loss rate of the two algorithms under different K values.