SECTION I.

Introduction

Skeleton-based action recognition [1], [2], [3] is robust, privacy-preserving, and computationally efficient. It has extensive applications in fields such as healthcare [4], sports guidance [5], and autonomous driving [6]. Traditional approaches in this field often require large volumes of annotated data. However, acquiring high-fidelity labeled datasets is costly, which impedes the recognition of previously unseen actions. Skeleton-based few-shot action recognition (FSAR), which aims to classify unlabeled actions using only a few labeled training samples for each action class, is a promising approach. However, current research in this area is limited.

In line with the predominant trend in FSAR research [7], [8], [9], [10], our approach builds upon the foundation of metric learning [11], [12]. Current methods mainly focus on learning global representations of actions and using different metric functions to match skeleton sequences. However, relying solely on the overall representation of actions to classify similar actions is challenging and unreasonable. Fine-grained and unique representations also contribute to classification. In addition to these challenges, while extant research [13], [14], [15] has developed diverse metric functions for action sequence matching, it has predominantly focused on Euclidean and cosine distances as foundational metrics. This narrow focus overlooks the potential of other basic metrics such as Manhattan, Jaccard, and Chebyshev distances. There is a dearth of comprehensive experimental analysis exploring the comparative advantages and limitations of these various basic metrics in the context of action sequence matching. Furthermore, although existing metric functions have achieved high accuracy, the design of novel metric functions faces diminishing returns, with incremental improvements becoming increasingly challenging to attain. This scenario necessitates an innovative approach to enhance action-matching accuracy beyond the constraints of individual metric function design.

To address the aforementioned issues, we propose a novel framework called fine-grained information capture and adaptive metric aggregation (FICAMA) for skeleton-based FSAR that incorporates the generalization-refinement self-information loss (GRS), the skeletal motion fusion module (SFM), and the adaptive multimetric distance aggregation module (AMA). We employ the GRS to increase the self-information in support samples, encouraging the model to capture more fine-grained and unique representations within the samples. These representations, combined with the overall representations, are necessary for accurate matching. To integrate and coordinate the overall representations with the fine-grained representations within each sample, we design the SFM that fuses joints within frames to coordinate spatial information and introduce temporal context information. Finally, we propose the AMA that addresses the limitations of individual metric functions. The AMA dynamically adjusts its weights to aggregate multiple metric functions in a task-adaptive manner. This approach leverages the complementary strengths of various metrics to achieve high-precision skeleton sequence matching, circumventing the diminishing returns faced by single metric function design. In addition, we conduct comprehensive experiments to analyze the comparative performance of different basic metrics, filling the gap in experimental evidence identified earlier.

We evaluate our proposed method under different splits on two large-scale public datasets: NTU RGB+D 120 [16] and Kinetics [17]. Our method demonstrates robust performance across diverse datasets and settings, surpassing the second-best approaches by margins ranging from 0.22% to 1.15% in various evaluation scenarios. The main contributions of this work can be summarized as follows.

  1. We propose the FICAMA framework for skeleton-based FSAR, which encourages the model to capture fine-grained representations and adaptively aggregates multiple metric functions.

  2. We introduce two novel components: the GRS and the SFM. GRS enhances the model’s ability to capture fine-grained and unique representations within samples, while SFM integrates these representations with global features, significantly improving spatiotemporal representation. This innovative approach addresses the limitations of previous methods that primarily focused on global representations.

  3. We design the AMA that adaptively aggregates multiple metric functions to leverage their complementary strengths. In addition, we conduct comprehensive ablation studies on various basic and complex metric functions, providing valuable insights for future research in metric function design for FSAR.

SECTION II.

Related Work

FSAR is an essential field in computer vision. Numerous methods relying on RGB video data have been proposed. Wang et al. [15] integrate all videos in a task to learn discriminative representations. Huang et al. [9] introduce an FSAR method based on generating and matching composite prototypes. Cao et al. [14] utilize temporal order information in video data and propose a novel few-shot learning framework called the temporal alignment module. Wanyan et al. [18] employ active learning to find reliable modalities for each sample based on task-relevant contextual information and fuse modality-specific posterior distributions. Tang et al. [19] introduce a 3-D feature extractor and text semantic encoding, using a transformer module to adaptively fuse text and video features. Liu et al. [20] dynamically generate kernels to summarize each relation and local feature in the associated graph and then use a lightweight fusion network and the nearest neighbor classifier to classify each query video. Wang et al. [21] learn a shared representation space by minimizing the contrastive loss between different modalities and obtain discriminative feature representations for FSAR by integrating adversarial and contrastive branches. Li et al. [22] divide a complex action into several subactions through preset hierarchical clustering, further decompose subactions into finer-grained spatial attention subactions, and, finally, use Earth Mover’s distance to measure the similarity between samples. Although these traditional methods relying on RGB video data are effective, they have high computational complexity and are highly sensitive to noise, including background clutter and illumination changes. These limitations require us to turn to more efficient and robust approaches.

Skeleton-based FSAR offers a promising alternative by focusing on essential motion data, reducing computational requirements, and increasing robustness to noise. Wang et al. [23] enhance the matching of skeleton sequences by simulating different camera views and considering temporal alignment, addressing the temporal-view misalignment caused by variations in action speed, temporal location, and subject posture. Lee et al. [24] propose a skeleton-text multimodal learning method that improves the FSAR performance of the model by combining different modality data and a teacher-student learning mechanism. Yang et al. [25] represent skeleton data at multiple spatiotemporal scales and achieve optimal feature matching through multiscale matching and cross-scale matching. Li et al. [26] enhance the discriminability of subembeddings by introducing spatial and temporal learners and propose an adaptive weight allocation module to adapt to different samples. Chen et al. [27] develop a part-aware prototype graph network to capture skeleton motion patterns at different spatial levels and introduce a category-agnostic attention mechanism to highlight important parts in each action category. Deyzel and Theart [28] propose a new dataset containing common strength training actions in the fitness industry and combine metric learning and graph convolutional networks to achieve FSAR. Liu et al. [10] fuse joint, bone, and velocity modalities through an early fusion mechanism and use a topology encoding module to capture the coordination between joints and the intrinsic semantic relationships between body parts. Peng et al. [29] consider the case of occluded skeleton sequences and utilize three data streams and a hybrid attention fusion mechanism to mitigate the adverse effects of occlusion. Ma et al. [13] encourage skeletons to be represented in a full-rank space with a rank-maximization constraint, achieving spatial disentanglement and providing interpretable and data-efficient representations for skeleton sequences. Zhu et al. [30] replace the comparison metric with the sum of similarity measures of local embeddings aligned to key spatial/temporal segments of actions. Li et al. [31] transform action sequences that appear partially similar to each other into more discriminative feature vectors through adaptive matching and mutual-adaptive matching modules. Xie et al. [32] introduce cross-attention mechanisms and dynamic time warping to improve classification performance for medical behavior classification problems. Sabater et al. [4] generate action descriptors through pose normalization, geometric feature extraction, and temporal convolutional networks to improve action recognition performance in real-world scenarios. Memmesheimer et al. [33] convert skeleton sequences into image representations and then use deep metric learning to address the one-shot action recognition problem.

Currently, research on skeleton-based FSAR is limited, necessitating further exploration to develop high-precision, universal methods. Existing studies primarily focus on enhancing model fitting through knowledge distillation, multimodality, and attention mechanisms, often overlooking data characteristics. Moreover, while designing high-precision metric functions has been one of the hotspots in the FSAR field [9], [13], [14], [15], the potential of basic metrics beyond the commonly used Euclidean and cosine distances remains underexplored. Metrics such as Manhattan, Jaccard, and Chebyshev distances, each offering unique properties for measuring sample similarity, have not been systematically investigated in this context. These basic functions are the foundation for building more complex metric functions. For example, in Bi-MHM [15], the cosine distance is used to calculate the distance between frames of different samples, and then, the video sequence metric problem is converted into a video set metric problem. In the subsequent experimental section, we analyze the performance of basic metric functions on FSAR and their impact on complex metric functions.
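To make the basic metrics named above concrete, the following minimal sketch computes them on two flattened feature vectors. It is an illustration under stated assumptions, not the implementation used in this work: the function names and toy vectors are hypothetical, and the Jaccard distance is taken in its weighted (Ruzicka) form for non-negative real vectors, since the text does not specify which real-valued variant is intended.

```python
import math

def euclidean(a, b):
    # Square root of the summed squared coordinate differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    # Largest absolute coordinate difference.
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # One minus the cosine similarity; assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def jaccard_distance(a, b):
    # Assumption: weighted (Ruzicka) Jaccard for non-negative real vectors.
    num = sum(min(x, y) for x, y in zip(a, b))
    den = sum(max(x, y) for x, y in zip(a, b))
    return 1.0 - num / den if den > 0 else 0.0

# Hypothetical flattened query feature and class prototype.
query = [1.0, 2.0, 3.0]
prototype = [2.0, 2.0, 1.0]

for metric in (euclidean, manhattan, chebyshev, cosine_distance, jaccard_distance):
    print(metric.__name__, round(metric(query, prototype), 4))
```

Each metric responds to a different aspect of the discrepancy (overall magnitude, worst-case coordinate, direction, or overlap), which is the property that motivates aggregating several of them.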

We address these gaps through three key innovations. First, we designed the GRS module to enhance data feature representation by increasing the self-information of support samples. This approach addresses the oversight in previous work where the importance of enhancing feature-level data was often overlooked. Second, inspired by methodologies such as MLPMixer [34], we developed the SFM for effective feature fusion. SFM enhances features by fusing temporal and spatial information, thereby improving the model’s understanding of complex motion dynamics through positional encoding. Finally, whereas previous research focused heavily on designing single, highly accurate metric functions, our AMA uniquely combines multiple metric functions to achieve superior performance. By aggregating underutilized basic metrics such as Manhattan, Jaccard, and Chebyshev, AMA outperforms individual metrics and addresses the stagnation in the performance improvements of single metric functions. Our comprehensive ablation studies validate these components, offering significant insights for future metric function design in skeleton-based FSAR.

SECTION III.

Methodology

A. Problem Definition

This article follows the standard few-shot learning protocol [11], [13], dividing the dataset into three class-disjoint subsets: training set ${D}_{\text {tr}}$ , validation set ${D}_{\text {val}}$ , and test set ${D}_{\text {te}}$ . The goal is to train a model on ${D}_{\text {tr}}$ such that it can generalize well to unseen classes in ${D}_{\text {te}}$ , while ${D}_{\text {val}}$ is used for model selection. We adopt the episodic training strategy [11], extracting support and query sets from the dataset under the N-way K-shot setting. Here, N represents the number of classes, and K represents the number of samples for each class in the support set. In addition, P samples are selected from the remaining instances of each of the same N classes to form the query set. The objective is to classify each skeleton sequence in the query set into one of the N classes using the labeled support set.
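The episodic protocol described above can be sketched as follows. This is a minimal illustration, assuming the data are held in a dictionary that maps each class to its list of sequences; the name `sample_episode`, the toy data, and the choice of P = 3 are hypothetical.

```python
import random

def sample_episode(data_by_class, n_way, k_shot, p_query, rng):
    """Draw one N-way K-shot episode with P query samples per class."""
    # Pick N classes, then K + P distinct instances from each of them.
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        picked = rng.sample(data_by_class[cls], k_shot + p_query)
        # The first K instances form the support set, the remaining P the query set.
        support += [(x, episode_label) for x in picked[:k_shot]]
        query += [(x, episode_label) for x in picked[k_shot:]]
    return support, query

rng = random.Random(0)
# Toy stand-in for a training split: 20 classes with 10 sequences each.
toy = {c: [f"class{c}_seq{i}" for i in range(10)] for c in range(20)}
support, query = sample_episode(toy, n_way=5, k_shot=1, p_query=3, rng=rng)
print(len(support), len(query))
```

Labels are re-indexed from 0 to N-1 inside each episode, so the model never relies on dataset-level class identities.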

Let ${D}_{s}\in R^{N\times K\times C\times T\times U\times M}$ be the set of support skeleton sequences for training, where C, T, U, and M represent the dimensionality of body joint coordinates, the number of frames for each sample, the number of joints per skeleton, and the maximum number of skeletons per sample, respectively. Using a backbone network $f_{\phi }(\cdot):{R}^{C\times T\times U\times M}\to {R}^{t\times U\times V}$ with parameters $\phi $ , each support skeleton sequence is encoded into a support sample feature $G^{s}\in R^{t\times U\times V}$ , where t is the number of frames after encoding and V is the embedding dimension of a single joint. The prototype representation [11] for each class is calculated as $\mathbf {P}_{n}=(1/K)\sum _{({G}_{i}^{s},y_{i}^{s})}{G}_{i}^{s}\times \mathbb {I}(y_{i}^{s}=n)$ , where $n\in N$ is the label of the nth class, ${G}_{i}^{s}$ and $y_{i}^{s}$ are the ith support sample feature and its corresponding label, respectively, and $\mathbb {I}$ is an indicator function that returns 1 when the expression is true and 0 otherwise.

The query sample feature $G^{q}\in R^{t\times U\times V}$ is obtained using the same process, yielding $N\times P$ query sample features in total. Following the prototypical network [11], the class distribution of a query sample $G^{q}$ is obtained by applying a softmax function to the negative distances between the query sample and the class prototypes \begin{equation*} p\left(y=n\mid G^{q}\right)=\frac{\exp\left(-d\left(G^{q},\mathbf{P}_{n}\right)\right)}{\sum_{n'=1}^{N}\exp\left(-d\left(G^{q},\mathbf{P}_{n'}\right)\right)} \tag{1}\end{equation*} where y and n represent the predicted label and true label of ${G}^{q}$ , respectively, and $d({G}^{q},\mathbf{P}_{n})$ denotes the distance between the query sample and the prototype of class n. The index $n'$ ranges over all N classes of the episode, so that the probabilities sum to one.
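As a minimal sketch of the prototype computation and the distance-based class distribution, the code below averages flattened support features per class and applies the standard prototypical-network softmax, which normalizes over every class prototype. The squared Euclidean distance and the two-dimensional toy features are assumptions made only for illustration; any distance $d$ could be substituted.

```python
import math

def class_prototypes(support_feats, support_labels, n_way):
    # Mean of the support features that carry each class label.
    dim = len(support_feats[0])
    sums = [[0.0] * dim for _ in range(n_way)]
    counts = [0] * n_way
    for feat, label in zip(support_feats, support_labels):
        counts[label] += 1
        for d, value in enumerate(feat):
            sums[label][d] += value
    return [[s / counts[n] for s in sums[n]] for n in range(n_way)]

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def class_distribution(query_feat, prototypes, dist=squared_euclidean):
    # Softmax over negative distances, normalized over all prototypes.
    logits = [-dist(query_feat, p) for p in prototypes]
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(l - shift) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy two-way two-shot episode with flattened features.
support_feats = [[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [4.0, 6.0]]
support_labels = [0, 0, 1, 1]
protos = class_prototypes(support_feats, support_labels, n_way=2)
probs = class_distribution([0.5, 1.0], protos)
print(protos, probs)
```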

B. Overall Framework

Fig. 1 shows the overall framework of FICAMA.

  1. The backbone adopts ST-GCN [35], a spatial-domain-based graph convolutional spatiotemporal network model that effectively captures the spatiotemporal features of human actions. For advanced alternatives, HD-GCN [36], CTR-GCN [1], and ST-GCN++ [2] can be used. However, to ensure fair comparisons with published works [13], [37], [38], ST-GCN remains the choice in this study.

  2. GRS increases the self-information of support set samples, encouraging the model to capture fine-grained and unique information within the samples. This ensures that the class prototypes contain both typical patterns and coarse-grained information of actions, as well as fine-grained and unique information from different samples, enhancing robustness to variations.

  3. SFM performs spatiotemporal fusion for each sample, ensuring that the information introduced by GRS is more coordinated and uniform in space and time. It fuses different joints in space and integrates different aspects of joint motion, then mixes frames, and further clarifies the temporal relationships in actions using relative positional encoding.

  4. AMA dynamically aggregates the results of multiple metric functions in a task-adaptive manner. It measures the reliability of metric functions using the difference between the minimum and second-minimum values in the results of different metric functions. Higher weights are assigned to more reliable metric functions, and a softmax function is used to smooth the weights of different functions for better aggregation of multiple results.

Fig. 1. Overall framework of FICAMA, which consists of four parts: backbone, GRS, SFM, and AMA. The figure illustrates a five-way one-shot task. After feature extraction, the GRS of the support samples is calculated. Then, SFM is applied to each sample individually to fuse its spatiotemporal information. Two metric functions are used to calculate the distances from query samples to class prototypes, yielding two results: distance1 and distance2. AMA calculates weights based on the difference between the minimum and second-minimum values and produces the final weighted distance. Classification is then performed on this weighted distance, and the query sample is assigned to the nearest category.

C. Generalization-Refinement Self-Information Loss

The goal of FSAR is to correctly classify unseen query samples into known support set categories. This process relies on class prototypes that can generalize well to unseen samples, but it faces several challenges. One major issue is that class prototypes, typically formed by averaging a limited number of support set samples, often capture common and coarse-grained information while overlooking unique and fine-grained details. This limitation makes it difficult for prototypes to accurately represent the full spectrum of class characteristics. Query samples can vary widely in their features. While global representations derived from class prototypes work well in many cases, they may struggle with more nuanced distinctions. For example, differentiating between similar actions such as “removing glasses” and “removing headphones” requires fine-grained information that global representations might miss. In few-shot scenarios, where we have limited samples, overreliance on global representations can hinder generalization. Unique information that deviates from typical patterns can actually enhance the model’s ability to generalize, especially when working with a small sample set. To address these challenges, our approach aims to capture both fine-grained and unique information within samples. By doing so, we seek to improve the model’s classification accuracy and its ability to generalize to unseen actions.

GRS is designed to increase the self-information of all samples in the support set, enabling the model to capture more fine-grained and unique information. Self-information quantifies the uniqueness of a sample, with higher values indicating more distinctive features. By enhancing self-information, GRS encourages the model to focus on fine-grained, distinguishing characteristics of actions rather than common patterns. For a single support sample $G^{s}\in R^{t\times U\times V}$ , it is first reshaped into a 1-D tensor $G^{s'}\in R^{tUV}$ . We use scikit-learn to compute the self-information. The process of computing the self-information of the ith support sample in the nth class is given as follows: \begin{equation*} \mathrm{SI}[n,i] = \mathrm{MI}\left(G^{s'}_{n,i}, G^{s'}_{n,i}, \text{bins}\right) \tag{2}\end{equation*} where MI represents the mutual information computation and bins is a hyperparameter. Higher self-information indicates that a sample is relatively rare and contains more information that deviates from typical patterns. This information can encourage the model to capture more fine-grained details. The final GRS can be calculated using the following equation: \begin{equation*} \mathcal{L}_{\text{grs}} = -\frac{1}{N}\sum_{n=1}^{N}\log\left(\frac{\sum_{i=1}^{K}\mathrm{SI}[n,i]}{\epsilon}\right) \tag{3}\end{equation*} where $\epsilon $ is a fixed constant of 1e−6 used to maintain numerical stability. Increasing the self-information of samples enhances their uniqueness and information content, crucial for FSAR. This approach aims to classify unseen query samples using limited visible examples. 
When samples are overly similar or predictable, models struggle to capture the essential characteristics of different action categories, relying instead on superficial similarities. By emphasizing fine-grained, sample-specific information, the model better discerns fundamental action patterns and features. This enhancement improves the model’s ability to generalize to unseen samples by leveraging richer, more diverse representations of each action category. Consequently, GRS allows the model to form more comprehensive and nuanced class prototypes, even in the context of limited samples typical in few-shot scenarios.
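The self-information computation can be illustrated with a small standard-library sketch. For a discretized variable, the mutual information of a sample with itself equals the entropy of its binned values, so the code below histograms a flattened feature into equal-width bins and returns that entropy. This is a stand-in for the scikit-learn call used in this work, not a reproduction of it: the equal-width binning, the function names, and the toy features are assumptions. The loss transcribes Eq. (3) as printed and requires a positive self-information sum per class.

```python
import math
from collections import Counter

def self_information(values, bins):
    # MI(x, x) of a binned variable equals the entropy of its histogram.
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against a constant sample
    idx = [min(int((v - lo) / width), bins - 1) for v in values]
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(idx).values())

def grs_loss(support, bins, eps=1e-6):
    # support[n][i] is the flattened feature of the i-th sample in class n.
    total = 0.0
    for class_samples in support:
        si_sum = sum(self_information(s, bins) for s in class_samples)
        total += math.log(si_sum / eps)  # Eq. (3) as printed
    return -total / len(support)

# Toy flattened features: one spread over all bins, one concentrated in a single bin.
spread = [0.0, 1.0, 2.0, 3.0]
peaked = [0.0, 0.0, 0.0, 3.0]
print(self_information(spread, bins=4), self_information(peaked, bins=4))
print(grs_loss([[spread]], bins=4), grs_loss([[peaked]], bins=4))
```

The sample whose values are spread across bins has the higher self-information and therefore the lower loss, which is the direction in which GRS pushes the support features.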

D. Skeletal Motion Fusion Module

Samples with high self-information contain rich spatial and temporal information, but the distribution of information may not be uniform. Spatial information may be concentrated in a few joint points within a single-frame skeleton, requiring the fusion of global and fine-grained representations to enable the model to capture rich spatial information at every instant of the action. In addition, temporal information varies from frame to frame, specifically lacking temporal contextual perception, making it difficult to accurately capture the dynamic nature of actions. To address these two issues, we propose SFM, inspired by MLPMixer [34], which employs all-MLP architecture to effectively capture spatial features through cross-patch and channel mixing in image classification. We construct SFM using a similar all-MLP approach, tailored to effectively capture and fuse the rich spatial and temporal information present in skeleton sequences. This architecture enhances the model’s capacity to extract more discriminative features for action recognition.

SFM consists of three MLP modules and positional encoding. The three MLP modules have a consistent structure, composed of two linear layers, an ReLU layer, and a residual connection, as shown in Fig. 1. The SFM operates on each sample in the current context task. For a sample $G=\{\mathbf{S}_{1},\ldots,\mathbf{S}_{t}\}\in R^{t\times U\times V}$ containing t frames of skeletons $\mathbf{S}\in R^{U\times V}$ , the fusion process can be represented in two stages. The first stage is spatial fusion for each skeleton frame \begin{align*} \mathbf{F}_{i}& =\sigma\left(\mathbf{S}^{\top}_{i}\mathbf{W}_{j_{1}}\right)\mathbf{W}_{j_{2}}+\mathbf{S}^{\top}_{i},\quad i=1,2,\ldots,t \tag{4}\\ \mathbf{S}_{i}^{*}& =\sigma\left(\mathbf{F}_{i}^{\top}\mathbf{W}_{e_{1}}\right)\mathbf{W}_{e_{2}}+\mathbf{F}_{i}^{\top},\quad i=1,2,\ldots,t \tag{5}\end{align*} where $\mathbf{S}_{i}^{*}\in R^{U\times V}$ is the spatially fused feature, $\mathbf{W}_{j_{1}},\mathbf{W}_{j_{2}}\in R^{U\times U}$ and $\mathbf{W}_{e_{1}},\mathbf{W}_{e_{2}}\in R^{V\times V}$ are the learnable weights for joint fusion and embedding dimension fusion, respectively, $\sigma $ represents the ReLU nonlinearity layer, and the final spatially fused sample $G^{*}=\{\mathbf{S}_{1}^{*},\ldots,\mathbf{S}_{t}^{*}\}\in R^{t\times U\times V}$ is obtained. Joint fusion is influenced by the motion of other joints in the same frame while preserving the unique representation of each joint. Coordinating the motion between different joints in the same frame makes the motion information more discriminative, thereby capturing global spatial dependencies. Embedding dimension fusion allows for the integration of different aspects of each joint’s motion by fusing the embedding dimensions that represent different aspects of motion information, combining fine-grained and unique representations. 
The fusion operation in the first stage occurs within each frame of the skeleton sequence, encapsulating the complex motion of joints within a single time step but not capturing the evolution of actions between frames or temporal context information. Therefore, the second stage focuses on the fusion between frames within the same sample. First, the spatially fused sample is reshaped to $G^{*}=\{\mathbf{S}_{1}^{*},\ldots,\mathbf{S}_{t}^{*}\}\in R^{t\times UV}$ , where each single-frame skeleton becomes $\mathbf{S}_{i}^{*}\in R^{UV}$ . Then, we add positional encoding $\mathbf{E}_{\text{pos}}\in R^{t\times UV}$ and perform frame fusion \begin{align*} \mathbf{G}& =G^{*}+\mathbf{E}_{\text{pos}} \tag{6}\\ \mathbf{R}& =\sigma\left(\mathbf{G}^{\top}\mathbf{W}_{t_{1}}\right)\mathbf{W}_{t_{2}}+\mathbf{G}^{\top} \tag{7}\end{align*} where $\mathbf{R}\in R^{t\times UV}$ is the spatiotemporally enhanced feature and $\mathbf{W}_{t_{1}},\mathbf{W}_{t_{2}}\in R^{t\times t}$ are the learnable weights for frame fusion. By adding positional encoding, valuable information about the temporal order and relative position of frames is injected, which is crucial for understanding the temporal evolution of actions and enabling the model to learn meaningful temporal patterns. Frame fusion along the temporal dimension effectively aggregates and combines information from different time steps, allowing the model to learn complex temporal dependencies and capture the long-term context required for accurate action recognition. Since the first stage already used embedding dimension fusion weights to capture the dependencies and interactions between embedding dimensions within each joint, we do not set embedding dimension fusion weights again in the second stage but instead focus on modeling temporal relationships. 
The reason for not using positional encoding in the first stage is that the model can capture the inherent spatial structure and relationships within frames, and introducing explicit spatial information through spatial positional encoding would lead to overparameterization and performance degradation. In the subsequent ablation study section, we demonstrate the rationality of our design through experimental results.
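The two fusion stages can be traced with a shape-level sketch in plain Python. It follows Eqs. (4)-(7) with bias-free linear layers and random weights, which are assumptions made for brevity; the positional encoding is a random placeholder because its exact form is not reproduced here, and all helper names are hypothetical. The result of Eq. (7) is transposed back so that the output has the stated shape $t\times UV$.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(r) for r in zip(*a)]

def mlp_residual(x, w1, w2):
    # sigma(x W1) W2 + x: two linear layers, a ReLU, and a residual connection.
    hidden = [[max(0.0, v) for v in row] for row in matmul(x, w1)]
    out = matmul(hidden, w2)
    return [[o + v for o, v in zip(r_out, r_x)] for r_out, r_x in zip(out, x)]

def spatial_fusion(frame, wj1, wj2, we1, we2):
    f = mlp_residual(transpose(frame), wj1, wj2)  # Eq. (4): joint fusion, shape (V, U)
    return mlp_residual(transpose(f), we1, we2)   # Eq. (5): embedding fusion, shape (U, V)

def frame_fusion(frames, e_pos, wt1, wt2):
    g = [[v for joint in fr for v in joint] for fr in frames]          # reshape to (t, UV)
    g = [[a + b for a, b in zip(rg, rp)] for rg, rp in zip(g, e_pos)]  # Eq. (6)
    r = mlp_residual(transpose(g), wt1, wt2)                           # Eq. (7), shape (UV, t)
    return transpose(r)                                                # back to (t, UV)

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
t, U, V = 4, 5, 3  # toy sizes: frames, joints, embedding dimension
sample = [rand_matrix(U, V, rng) for _ in range(t)]
wj1, wj2 = rand_matrix(U, U, rng), rand_matrix(U, U, rng)
we1, we2 = rand_matrix(V, V, rng), rand_matrix(V, V, rng)
wt1, wt2 = rand_matrix(t, t, rng), rand_matrix(t, t, rng)
e_pos = rand_matrix(t, U * V, rng)  # placeholder positional encoding

fused = [spatial_fusion(fr, wj1, wj2, we1, we2) for fr in sample]
R = frame_fusion(fused, e_pos, wt1, wt2)
print(len(R), len(R[0]))
```

Because each block is residual, setting the weights to zero leaves the input unchanged, so the module can only add information on top of the backbone features.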

E. Adaptive Multimetric Distance Aggregation Module

Designing new advanced metric functions is becoming increasingly difficult and may not yield significant improvements. Many sophisticated metric functions [9], [14], [15] have been proposed, each achieving only small gains over previous methods. This trend suggests that we may be approaching the limit of what individual metric functions can achieve in terms of performance. Inspired by the mixture-of-experts paradigm, which combines multiple specialized models to solve complex tasks, we propose AMA. Instead of designing new metric functions, AMA aggregates the outputs of multiple existing metric functions for each classification task. The adaptivity of AMA lies in its task-dependent weighting mechanism, where weights are calculated based on the reliability of each metric function’s performance for the specific task. This approach allows AMA to leverage the strengths of various metric functions in a manner tailored to the unique characteristics of each classification task, thereby optimizing their combined performance and potentially surpassing the limitations of using individual metric functions alone. The calculation process is represented by the following equations: \begin{align*} \text{Dist1}& =\mathrm{Metric1}\left(\mathbf{R}^{\mathbf{q}}_{i},\mathbf{P}\right),\quad \text{Dist2}=\mathrm{Metric2}\left(\mathbf{R}^{\mathbf{q}}_{i},\mathbf{P}\right) \tag{8}\\ \text{weight1}& =\frac{e^{\text{mar1}}}{e^{\text{mar1}}+e^{\text{mar2}}},\quad \text{weight2}=\frac{e^{\text{mar2}}}{e^{\text{mar2}}+e^{\text{mar1}}} \tag{9}\\ \text{Dist}& =\text{Dist1}*\text{weight1}+\text{Dist2}*\text{weight2} \tag{10}\end{align*} where Metric1 and Metric2 are different metric functions. Unless otherwise specified, Metric1 employs Bi-TCMHM, our enhanced version of Bi-MHM [15] that incorporates time regularization, while Metric2 utilizes the Euclidean distance. 
These two metrics are selected for their widespread adoption and demonstrated efficacy. $\mathbf{R}^{\mathbf{q}}_{i}\in R^{t\times U\times V}$ and $\mathbf{P}=\{\mathbf{P}_{1},\ldots,\mathbf{P}_{N}\}\in R^{N\times t\times U\times V}$ are the ith query sample enhanced by the SFM and the set of class prototypes, respectively. The nth class prototype is formed as $\mathbf{P}_{n}=(1/K)\sum_{(\mathbf{R}^{\mathbf{s}}_{j},y_{j}^{s})}\mathbf{R}^{\mathbf{s}}_{j}\times\mathbb{I}(y_{j}^{s}=n)\in R^{t\times U\times V}$ , where $\mathbf{R}^{\mathbf{s}}_{j}$ is the jth support sample enhanced by the SFM. $\text{Dist1}\in R^{N}$ and $\text{Dist2}\in R^{N}$ are the distances between the query sample and all class prototypes calculated by the two metric functions. mar1 and mar2 represent the difference between the minimum distance and the second minimum distance in Dist1 and Dist2, respectively. This difference effectively reflects the reliability of a metric function in the current task. When the difference is large, the metric function clearly places the query sample closer to the nearest class prototype than to all other class prototypes. Conversely, when the difference is small, the metric function gives an ambiguous result: the query sample is almost equally far from the nearest and second-nearest class prototypes, indicating weak reliability. Therefore, metric functions with higher reliability should be assigned greater weights. The method of calculating weights based on softmax is superior to simple average weighting because the weights calculated by the softmax operation are smoother and can avoid the influence of extreme values, resulting in better robustness. 
In addition to using mar1 and mar2 to calculate weight1 and weight2, other reliability measures can be considered, such as negative entropy ne or maximum classification probability mcp. For the ith query sample, the negative entropy and maximum classification probability based on Dist1 are calculated as follows: \begin{align*} \text{ne1}& =\sum_{n=1}^{N}p\left(\hat{y}_{i}=n\mid\mathbf{R}^{\mathbf{q}}_{i},\text{Dist1}\right)\log p\left(\hat{y}_{i}=n\mid\mathbf{R}^{\mathbf{q}}_{i},\text{Dist1}\right) \tag{11}\\ \text{mcp1}& =\max_{n}p\left(\hat{y}_{i}=n\mid\mathbf{R}^{\mathbf{q}}_{i},\text{Dist1}\right). \tag{12}\end{align*}

The predicted label of the query sample is denoted by $\hat {y}_{i}$ . The calculations for ne2 and mcp2 follow the same formulas, with Dist2 replacing Dist1 in the respective equations. Negative entropy measures the concentration of the probability distribution. When the probability is concentrated on a few categories, the negative entropy value is larger; when the probability is dispersed among multiple categories, the negative entropy value is smaller. However, the results of the ablation experiments show that using the difference between the minimum distance and the second minimum distance to calculate weights can maximize classification accuracy. This can be attributed to the following reasons: considering only the maximum classification probability ignores the probability information of other categories and cannot comprehensively reflect the contribution of the metric function to the discriminative ability of the sample. Although negative entropy reflects the uncertainty of the entire probability distribution, its calculation includes some categories that may not be highly relevant, which may introduce noise. In contrast, the difference between the minimum distance and the second minimum distance directly focuses on the discriminative ability of the metric function on the two most likely categories. In classification tasks, the model’s confidence in the correct category and its ability to distinguish from the closest competing category are crucial, and the difference between the minimum distance and the second minimum distance captures this well. Therefore, AMA mainly uses the difference between the minimum distance and the second minimum distance to measure the reliability of the metric function.
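The weighting scheme of Eqs. (8)-(10), together with the two alternative reliability measures of Eqs. (11) and (12), can be sketched as follows. The distance vectors are hypothetical values for a five-way task, and the function names are illustrative; the class probabilities used by the alternatives are obtained, as an assumption, from a softmax over the negative distances of a single metric.

```python
import math

def softmax(scores):
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def margin(dist):
    # Difference between the second-smallest and the smallest distance (mar).
    smallest, second = sorted(dist)[:2]
    return second - smallest

def class_probs(dist):
    return softmax([-d for d in dist])

def negative_entropy(dist):
    # Eq. (11): larger when the probability mass is concentrated.
    return sum(p * math.log(p) for p in class_probs(dist) if p > 0.0)

def max_class_prob(dist):
    # Eq. (12)
    return max(class_probs(dist))

def aggregate(dists, reliability=margin):
    # Eq. (9): softmax over reliability scores; Eq. (10): weighted sum of distances.
    weights = softmax([reliability(d) for d in dists])
    n_classes = len(dists[0])
    fused = [sum(w * d[c] for w, d in zip(weights, dists)) for c in range(n_classes)]
    return fused, weights

# Hypothetical distances from one query to five prototypes under two metrics.
dist1 = [0.2, 1.5, 1.7, 2.0, 1.9]  # decisive: large gap to the runner-up
dist2 = [1.0, 1.1, 0.9, 1.2, 1.0]  # ambiguous: nearly flat
fused, weights = aggregate([dist1, dist2])
print(weights, fused.index(min(fused)))
```

In this toy case the decisive metric receives the larger weight, and the aggregated distance selects the class it favors even though the ambiguous metric alone would have picked a different one.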

F. Learning Objective

During training, the loss function is calculated as follows:\begin{equation*} {\mathcal {L}}_{\text {match}}=-\frac {1}{N^{q}}\sum _{i=1}^{N^{q}}\log p\left ({{\hat {y}_{i}=y_{i}\mid \mathbf {R}^{\mathbf {q}}_{i}}}\right)+\alpha {\mathcal {L}}_{\text {grs}} \tag {13}\end{equation*} where $N^{q}$ represents the total number of query samples, $\hat {y}_{i}$ and $y_{i}$ denote the predicted label and true label of the query sample $\mathbf {R}^{\mathbf {q}}_{i}$ , respectively, and $\alpha $ is the weight of the GRS loss ${\mathcal {L}}_{\text {grs}}$ .
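As a rough illustration of (13), the sketch below evaluates the matching loss for a small batch of queries. The GRS term is treated as a precomputed scalar because its definition lies outside this section; the function name and argument layout are assumptions made for the example.

```python
import math

def match_loss(probs, labels, grs_loss, alpha=0.5):
    """Eq. (13): mean negative log-likelihood of the true class plus alpha * L_grs.

    probs[i][n] is the predicted probability of class n for query i,
    labels[i] is the true class index of query i, and grs_loss is a
    scalar assumed to be computed elsewhere.
    """
    nll = -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)
    return nll + alpha * grs_loss

# Two queries in a two-way task, with a placeholder GRS loss of 0.2.
loss = match_loss([[0.5, 0.5], [0.25, 0.75]], [0, 1], grs_loss=0.2)
print(round(loss, 4))
```

The default `alpha=0.5` mirrors the value reported in the implementation details; the probabilities and the GRS value are made up.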

SECTION IV.

Experiments

A. Datasets

1) NTU RGB+D 120:

Liu et al. [16] introduced the NTU RGB+D 120 dataset, a comprehensive 3-D human behavior recognition benchmark dataset. The dataset contains 113 945 skeleton sequences, each representing an action performed by one or two subjects. Each skeleton sequence includes 25 body joints, with each joint annotated with 3-D coordinates. For our experiments, we selected a subset of 120 action categories and divided them into training, validation, and test sets, containing 80, 20, and 20 categories, respectively. Following the protocol outlined in [13], we constructed two subsets, “NTU-S” and “NTU-T,” by randomly sampling 60 and 30 instances from each category, respectively.

2) Kinetics:

Carreira and Zisserman [17] proposed the Kinetics skeleton dataset, a large-scale collection of human action videos obtained from YouTube. The dataset consists of 260 232 video clips, covering 400 different action categories. Each video is processed using the OpenPose framework [39] to estimate the 2-D poses of human subjects, generating skeleton graphs with 18 body joints. The initial joint features include 2-D spatial coordinates and associated prediction confidence scores. In our study, we focus on a subset of the first 120 action categories, randomly selecting 100 samples from each category. To maintain consistency with the NTU RGB+D 120 dataset, we adopt the same proportions to divide the dataset into training, validation, and test sets.

B. Implementation Details

The proposed framework is built using the PyTorch library [40] and evaluated on an Ubuntu 20.04 machine equipped with an NVIDIA GeForce RTX 4090 GPU. We conduct experiments using the traditional N-way K-shot few-shot learning paradigm [14], [41], where each episodic task samples N categories from the dataset, with K samples per category. To ensure consistency across different datasets, a fixed number of frames is sampled from each skeleton sequence: 50 frames for Kinetics and 30 frames for NTU RGB+D. The random seed is kept consistent across all experiments to maintain reproducibility. We use ST-GCN [35] as the backbone network for feature extraction in all compared methods. The Adam optimizer is used to train the model, with an initial learning rate of 0.001. The training process is limited to a maximum of 100 epochs. In each training epoch, 1000 episodes are randomly sampled from the training set, while 500 episodes are randomly sampled from the validation set for performance evaluation. The best model is selected based on the highest validation accuracy. To prevent overfitting, early stopping is triggered if the validation accuracy does not improve for ten consecutive epochs. During the testing phase, we load the best-performing model obtained during training and evaluate it for ten epochs. In each testing epoch, 500 episodes are randomly sampled from the test set. The final performance is reported as the average accuracy over these ten epochs. To reduce the influence of randomness, each experiment is repeated three times, and the average accuracy is reported. The hyperparameter settings are as follows: the weight of the GRS ($\alpha $ ) is set to 0.5, and the number of bins in (2) is set to 10.
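The N-way K-shot episodic sampling described above can be sketched as follows. This is an illustrative helper rather than the authors' code: the function name, the data layout, and the number of query samples per class (`n_query`) are assumptions, since the section does not state them.

```python
import random

def sample_episode(samples_by_class, n_way, k_shot, n_query, rng):
    """Draw one N-way K-shot episode.

    samples_by_class maps a class label to a list of samples. Returns
    (support, query), each a list of (sample, episode_label) pairs with
    episode_label in range(n_way).
    """
    classes = rng.sample(sorted(samples_by_class), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        # Draw support and query samples without overlap within a class.
        picked = rng.sample(samples_by_class[cls], k_shot + n_query)
        support += [(s, episode_label) for s in picked[:k_shot]]
        query += [(s, episode_label) for s in picked[k_shot:]]
    return support, query

# Toy data: 20 classes with 30 placeholder samples each.
toy = {c: [f"class{c}_sample{i}" for i in range(30)] for c in range(20)}
rng = random.Random(0)  # fixed seed, in the spirit of the fixed-seed setup above
support, query = sample_episode(toy, n_way=5, k_shot=1, n_query=15, rng=rng)
print(len(support), len(query))
```

Under this sketch, a training epoch would call `sample_episode` 1000 times on the training classes and 500 times on the validation classes.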

C. Quantitative Results

The top part of Table I summarizes related work in the same domain. The middle section compares various base metrics. Finally, the bottom part introduces diverse versions of our proposed AMA and FICAMA. Notably, ProtoNet uses ST-GCN as its backbone and employs the Euclidean distance for its metric function, with training conducted using cross-entropy loss. DTW, Bi-MHM, Bi-TCMHM, cosine, Jaccard, Chebyshev, and Manhattan are obtained by replacing the Euclidean distance metric in ProtoNet with the corresponding metric functions. Margin, Ne, and Mcp use different methods to measure the reliability of metric functions and calculate weights in the AMA, which then replaces the Euclidean distance metric in ProtoNet. Average combines the results of two metric functions by setting weight1 and weight2 to a fixed value of 0.5. The Euclidean distance and cosine are commonly used methods for calculating the distance between video-level representations in metric space, while Jaccard, Chebyshev, and Manhattan are classic methods for calculating the distance between points in space. We compare the performance of these basic metric methods on the skeleton-based FSAR task. ProtoNet using the Euclidean distance performs the best, while other basic metric methods generally perform poorly. Jaccard, Chebyshev, and Manhattan are applied for the first time in the FSAR task. Results from RGB video-based FSAR studies [19], [20], [21] show that cosine performs no worse than or even better than the Euclidean distance, but, in Table I, cosine has low accuracy. This discrepancy may arise because the dimensionality of the video-level representation of the skeleton modality is much larger than that of the RGB video modality. Considering that many complex metric functions are designed based on the cosine distance, we demonstrate the performance differences of constructing complex metric functions based on different basic metric methods in the ablation study.
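For reference, the five basic metrics compared in Table I can be written in a few lines each. The sketch below operates on plain lists of floats. The Jaccard variant shown is the weighted (Ruzicka) form for non-negative real vectors, which is an assumption, since the section does not say which generalization of the Jaccard distance is used.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # One minus cosine similarity; undefined for all-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def jaccard_distance(a, b):
    # Assumed variant: weighted Jaccard for non-negative vectors.
    num = sum(min(x, y) for x, y in zip(a, b))
    den = sum(max(x, y) for x, y in zip(a, b))
    return 1.0 - num / den if den > 0 else 0.0

def nearest_prototype(query, prototypes, metric):
    """Return the index of the prototype closest to the query under `metric`."""
    dists = [metric(query, p) for p in prototypes]
    return dists.index(min(dists))

a, b = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(euclidean(a, b), manhattan(a, b), chebyshev(a, b))
```

Swapping the `metric` argument of `nearest_prototype` corresponds to the replacement of the Euclidean distance in ProtoNet described above.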

TABLE I FSAR Results on Benchmark Datasets Under Five-Way One-Shot and Five-Way Five-Shot Settings

Margin and average aggregate the results of the Euclidean distance and Bi-TCMHM, achieving higher accuracy than ProtoNet and Bi-TCMHM using a single metric function, demonstrating the benefits of aggregating multiple metric functions. Although the accuracy of Ne and Mcp is lower than that of average, the accuracy of margin is higher than that of average, indicating that dynamically adjusting weights according to the current task can outperform using fixed weights, but an appropriate weight calculation method needs to be selected. Margin focuses on the discriminative ability of the metric function for the two most likely confusing categories, which has a better effect than Mcp, which only focuses on a single category, and Ne, which focuses on all categories.
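The contrast between fixed and dynamic weighting can be made concrete with a toy case. The sketch below assumes that the two distance lists are already on comparable scales and that each weight is the metric's share of the total margin; this normalization is an illustrative assumption and may differ from the weighting used in AMA.

```python
def margin(dists):
    """Difference between the second-smallest and the smallest distance."""
    first, second = sorted(dists)[:2]
    return second - first

def aggregate(dist1, dist2, dynamic=True):
    """Combine two per-class distance lists into one.

    dynamic=False reproduces the Average setting (both weights 0.5).
    dynamic=True weights each metric by its share of the total margin
    (assumed normalization).
    """
    if dynamic:
        m1, m2 = margin(dist1), margin(dist2)
        total = m1 + m2
        w1 = m1 / total if total > 0 else 0.5
    else:
        w1 = 0.5
    w2 = 1.0 - w1
    return [w1 * a + w2 * b for a, b in zip(dist1, dist2)]

def argmin(values):
    return values.index(min(values))

# Metric 1 clearly prefers class 0; metric 2 weakly prefers class 1.
dist1 = [0.40, 0.70, 0.90]
dist2 = [0.80, 0.40, 0.45]
print(argmin(aggregate(dist1, dist2, dynamic=False)))  # fixed weights
print(argmin(aggregate(dist1, dist2, dynamic=True)))   # margin-based weights
```

In this constructed example, the fixed-weight average follows the less decisive metric, while the margin-based weights follow the more decisive one. The example only illustrates the mechanism and is not evidence for the accuracies in Table I.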

It should be emphasized that a special situation occurred on the Kinetics dataset, where none of the methods achieved significant improvements over the simplest ProtoNet. One reason is the noise in Kinetics: excessive noise makes it difficult for the model to fit the data. In the ablation study, we analyze the peculiarities of the Kinetics dataset in depth.

D. Ablation Studies

1) Contribution of Model Components:

ProtoNet is chosen as the baseline, and the Euclidean distance is used as the metric function in all cases except when using AMA. As shown in Table II, the results indicate that each component is necessary and can improve the performance of the baseline. When the proposed components act independently, AMA brings the largest improvement, demonstrating the potential of the design of dynamically aggregating multiple metric functions. GRS provides a larger improvement in the five-shot setting because, with more support samples, each sample has fine-grained and unique information, and the resulting class prototype can better generalize to unknown query samples. In the one-shot setting, there is only one support sample per class, and less discriminative information can be captured. When the components are used in combination, the combination of GRS and SFM achieves high accuracy because SFM can make the information of the samples more evenly and harmoniously distributed among joints and frames, and the introduced temporal information also contributes to high-precision classification. Using all three components together achieves the best results, demonstrating the complementarity between different modules. GRS brings more fine-grained and unique information, SFM makes the information distribution more uniform and orderly, and AMA selects the optimal metric function weights according to the current task.

TABLE II Ablation Experiments on NTU-T Under the Five-Way Setting

In Table II, Params represents the model’s parameter count. The values of Params demonstrate that the proposed modules can improve accuracy without significantly increasing the number of parameters. In addition, the frames per second (FPS) are measured by using an Intel i5-13600KF CPU for inference. The values of FPS indicate that these modules maintain high processing speeds, which can meet real-time requirements for practical applications.

2) Hyperparameter Analysis:

Fig. 2 illustrates the impact of different $\alpha $ values on model accuracy. When $\alpha $ is small, the fine-grained and unique information that can be captured is limited, and the model’s accuracy is low. When $\alpha $ is large, the model focuses too much on fine-grained and unique information while ignoring coarse-grained and more general information, leading to poor accuracy. This failure situation reflects that the focus of the model during classification should still be on coarse-grained and more general information, and the introduction of fine-grained and unique information will be helpful when this information cannot complete the correct classification. This is similar to the way humans distinguish between different objects, first completing classification based on coarse-grained and more general global information, such as contours and shapes, and then focusing on fine-grained and more unique features, such as color and patterns, when it is difficult to complete classification based on this information.

Fig. 2. - Hyperparameter sensitivity of GRS, with all tasks trained on NTU-T under the five-way one-shot setting.

3) Self-Information Analysis During Model Training:

The GRS module enhances the self-information of support set sample features. Fig. 3 illustrates the sum of self-information during training for our model with GRS and the baseline model without GRS. While the baseline model shows a gradual decrease in self-information, suggesting an emphasis on global representations, our model demonstrates a progressive increase, indicating an enhanced focus on distinctive features. The blue line (“Ours before SFM”) and orange line (“Ours after SFM”) represent the sum of self-information before and after the SFM component, respectively. The upward trend and higher values of these lines confirm that GRS and SFM effectively encourage the model to prioritize unique and informative representations, crucial for improving FSAR performance.
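To make the quantity plotted in Fig. 3 more tangible, the following sketch computes a sum of self-information for a list of feature values. It assumes a histogram-based probability estimate with equal-width bins (10 by default, matching the bin setting in the implementation details); the actual estimator is defined in (2), outside this section, and may differ.

```python
import math

def self_information_sum(values, bins=10):
    """Sum of -log p(bin) over all values, with p estimated from an
    equal-width histogram (assumed estimator)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when all values are equal
    counts = [0] * bins
    bin_of = []
    for v in values:
        k = min(int((v - lo) / width), bins - 1)  # clamp the maximum into the last bin
        counts[k] += 1
        bin_of.append(k)
    n = len(values)
    return sum(-math.log(counts[k] / n) for k in bin_of)

flat = [3.0] * 8                       # every value falls in the same bin
spread = [float(i) for i in range(10)]  # every value falls in its own bin
print(self_information_sum(flat), self_information_sum(spread))
```

Under this estimator, features whose values are spread over many bins carry more self-information than features concentrated in a single bin, which is consistent with reading a rising curve in Fig. 3 as a shift toward more distinctive representations.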

Fig. 3. - Change in self-information of support samples during the training process of baseline and FICAMA on NTU-T under the five-way one-shot setting.

4) Variant Structures of SFM:

Fig. 4 shows the different variant structures of SFM, and we integrate these variants into ProtoNet to analyze the optimal architecture. As shown in Table III, both the SFM1 and SFM2 models perform spatial fusion only on the joint tokens in the sample, with the difference being the number of MLP layers. SFM1 has only one layer of MLP, which only fuses information between different joints, while SFM2 has two layers of MLP, performing fusion on both joints and embedding dimensions. Embedding dimension fusion can coordinate and unify the differences between different joints, resulting in better performance.

TABLE III Impact of Different Variant Structures of SFM on the Performance of the Baseline Model on the NTU-T Dataset
Fig. 4. - Schematic of different variant structures of SFM. (a) SFM1. (b) SFM2. (c) SFM3. (d) SFM4.

We also tested the impact of different positional encodings on accuracy in SFM2, where None represents not using positional encoding, Vanilla represents using the sinusoidal positional encoding from [44], Rotary represents using the rotary position embedding proposed in [45], and Learnable represents using the learnable position embedding from [46]. The results show that using positional encoding during spatial fusion leads to a decrease in accuracy, because the GCN used as the backbone already represents the joint position information well.

The SFM3 and SFM4 models both perform temporal fusion on frame tokens but differ in their approach to embedding dimension fusion. Specifically, SFM3 does not include embedding dimension fusion between frames, while SFM4 does. Our experiments revealed that the inclusion of embedding dimension fusion between frames, as in SFM4, led to decreased accuracy. This decline can be attributed to the larger embedding dimensions of frames, where fusion may result in information homogenization, thereby reducing the model’s ability to discriminate between distinct actions. In contrast, SFM3’s approach of not fusing embedding dimensions between frames preserved more distinct action features, leading to better performance.

In SFM3, relative positional encoding (such as rotary position embedding) outperforms absolute positional encoding (such as sinusoidal positional encoding and learnable positional encoding) because it better addresses the issues of action evolution and subaction misalignment. Finally, we compare the order of spatiotemporal fusion. The results show that performing temporal fusion first and then spatial fusion performs poorly, even worse than performing temporal or spatial fusion alone. This indicates that temporal fusion without prior spatial understanding may lead to the loss of important spatial details. The method of performing spatial fusion first and then temporal fusion can establish a solid foundation of spatial understanding before dealing with temporal complexity, resulting in better performance. This spatial-first approach aligns with the hierarchical nature of human visual processing. In human perception, we typically first recognize spatial patterns and object arrangements before interpreting their temporal relationships. By mimicking this natural order in our model, we can potentially achieve more accurate and human-like action recognition.

5) Efficiency and Accuracy Analysis of AMA:

In this section, we use AMA in ProtoNet and aggregate more metric functions to further illustrate the effect and computational cost of AMA, with the number of metric functions ranging from 1 to 5. As shown in Table IV, using multiple metric functions does not lead to a significant increase in training time and occupied memory, with the exception of adding the DTW metric function. DTW is a parameterized method that requires training, so the training time increases from 48 to 133 s, and the required memory also increases from 3546 to 9214 MB. However, most metric functions are nonparameterized and do not require training. Even if multiple metric functions are used, the required computation and memory are small, which gives AMA greater practicality and applicability. As the number of metric functions increases, the accuracy of the model also changes, but margin, which measures the reliability of metric functions using the difference between the minimum distance and the second minimum distance, always achieves good results, demonstrating its superiority. The results of using a single metric function are shown in Table I. More metric functions do not necessarily lead to better results: using multiple metric functions with similar properties at the same time, such as the Euclidean distance (Eucl) and the Manhattan distance (Manhattan) in the third row of margin, may increase the weight of the errors shared by this type of metric function, resulting in lower accuracy. Furthermore, AMA has a certain level of robustness when facing lower accuracy metric functions. In the fifth row, adding cosine yields an accuracy of 70.82%, a decrease of 0.16% compared to the fourth row. Note, however, that the accuracy is only 59.65% when using cosine alone, so including it in AMA does not cause a significant decrease in accuracy. This demonstrates that even if there are lower accuracy metric functions in AMA, good results can still be obtained. For example, the accuracy in the fifth row is still 0.61% higher than the 70.21% obtained using Eucl alone.

TABLE IV Quantitative Metrics of Performance, Training Time, and Occupied Memory When Fusing Multiple Metric Functions Based on Different Reliability Indicators in AMA With All Tasks Trained on NTU-T Under the Five-Way One-Shot Setting

We also analyzed the impact of different combinations of metric functions in AMA on the final accuracy. Specifically, we selected two high-accuracy metric functions and two low-accuracy metric functions and tested different pairwise combinations in AMA to examine the robustness and accuracy changes of AMA when the accuracy differences between metric functions in AMA are large. The results are shown in Table V. In most cases, the results using margin are better. The results in the first row show that when using high-accuracy metric functions, the dynamic aggregation mechanism of AMA makes the final results more accurate. AMA demonstrates effectiveness even with low-accuracy metric functions. As shown in the second row, it extracts reliable information from poor metrics, achieving 66.72% accuracy in the one-shot setting. This significantly outperforms individual low-accuracy metrics (59.65% and 46.57%). For combinations of high-accuracy and low-accuracy metric functions, AMA can also allow high-accuracy metric functions to benefit from low-accuracy metric functions, as shown in the fifth row. When using only AMA, the results are 3.42% and 0.71% higher than ProtoNet, indicating that AMA has great flexibility and does not require all metric functions to have high accuracy, which expands the usage scenarios of AMA.

TABLE V Classification Accuracy When Using Metric Functions With Different Accuracies in AMA With All Tasks Trained on NTU-T Under the Five-Way Setting

6) Impact of Basic Metrics on Complex Metric Functions:

Designing more efficient metric functions is one of the important tasks in the FSAR field, and there have been many excellent works such as Bi-MHM [15] and OTAM [14]. While these advanced metric functions have significantly improved FSAR performance, they all fundamentally rely on basic metrics such as the Euclidean distance or the cosine distance. Despite their crucial role, the impact of these foundational metrics on overall performance remains understudied. Furthermore, the potential of alternative basic metrics beyond the commonly used Euclidean and cosine distances has not been systematically explored. This gap in knowledge prompted our investigation into how different basic metrics affect the performance of existing advanced metric functions, aiming to provide insights for future metric function design in FSAR.

In Bi-MHM, the cosine distance is used to calculate the frame-to-frame distance between different samples, and then, the frame-matching metric is completed by considering the matching relationship between frames of different samples. The process of using Bi-MHM to calculate the distance $d^{1}(\mathbf {R}^{\mathbf {q}},\mathbf {P}_{n})$ between a class prototype $\mathbf {P}_{n}=\{{\mathbf {A}}_{1}^{s},{\mathbf {A}}_{2}^{s},\cdots ,{\mathbf {A}}_{t}^{s}\}$ containing t frames of data and a query action $\mathbf {R}^{\mathbf {q}}=\{{\mathbf {A}}_{1}^{q},{\mathbf {A}}_{2}^{q},\cdots ,{\mathbf {A}}_{t}^{q}\}$ is shown in the following:\begin{align*}d^{1}\left ({{\mathbf {R}^{\mathbf {q}},\mathbf {P}_{n}}}\right)& =\frac {1}{t}\sum _{{\mathbf {A}}_{i}^{s}\in \mathbf {P}_{n}} \left ({{\min _{{\mathbf {A}}_{j}^{q}\in \mathbf {R}^{\mathbf {q}}}\|{\mathbf {A}}_{i}^{s}-{\mathbf {A}}_{j}^{q}\|}}\right) \\ & \quad +\frac {1}{t}\sum _{{\mathbf {A}}_{j}^{q}\in \mathbf {R}^{\mathbf {q}}}\left ({{\min _{{\mathbf {A}}_{i}^{s}\in \mathbf {P}_{n}}\|{\mathbf {A}}_{j}^{q}-\mathbf {A}_{i}^{s}\|}}\right) \tag {14}\end{align*} where $\|\cdot \|$ is the cosine distance. We replace $\|\cdot \|$ with different basic metric functions to analyze the impact of basic metric functions on the performance of advanced metric functions, providing guidance and experimental evidence for subsequent work on designing metric functions. The results are shown in Table VI.
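A minimal sketch of (14) with a pluggable base metric is given below. Frames are represented as plain lists of floats, and the function and variable names are hypothetical; the sketch follows the equation as written here and is not the reference implementation of Bi-MHM.

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

def bi_mhm(query_frames, proto_frames, base_metric):
    """Eq. (14): for every prototype frame take the distance to its nearest
    query frame, do the same in the other direction, and divide both sums
    by t (both sequences are assumed to contain t frames)."""
    t = len(proto_frames)
    proto_to_query = sum(
        min(base_metric(a_s, a_q) for a_q in query_frames) for a_s in proto_frames
    )
    query_to_proto = sum(
        min(base_metric(a_q, a_s) for a_s in proto_frames) for a_q in query_frames
    )
    return (proto_to_query + query_to_proto) / t

# A three-frame prototype and a query shifted by one step along the first axis.
proto = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
query = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
for metric in (manhattan, chebyshev):
    print(metric.__name__, bi_mhm(query, proto, metric))
```

Passing a different `base_metric` corresponds to the substitution studied in Table VI; only the frame-to-frame distance changes, while the bidirectional matching stays the same.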

TABLE VI Impact of Different Basic Metric Functions on the Accuracy of Complex Metric Function Bi-MHM

The experimental results show that using different basic metrics to calculate the distance between frame-level representations in advanced metric functions only leads to small performance differences. This phenomenon contrasts sharply with the case of computing video-level representations. As illustrated in Table I, the performance gap for video-level representations is substantially larger, with accuracies ranging from 46.57% to 71.89%. The main reason for this difference lies in the significant impact of representation dimensionality on the performance of basic metrics. When basic metrics are used to calculate frame-level representations (embedding dimension of 6400 for a single frame) in Bi-MHM, their performance is generally good and close to each other. However, when using basic metrics to calculate video-level representations (dimension of 51 200 for a single video-level representation), only the Euclidean distance, the Chebyshev distance, and the Manhattan distance achieve good results, while the other three metrics perform poorly. In addition, the experiments demonstrate the potential of the Manhattan distance and the Jaccard distance as basic metrics. The data in Table VI show that although using the cosine distance in Bi-MHM can achieve high performance, the difference compared to using the Manhattan distance or the Jaccard distance is not significant. Based on these findings, when designing advanced metric functions in the future, in addition to the commonly used cosine distance and Euclidean distance, the Manhattan distance and the Jaccard distance are also worth trying and may further improve accuracy.

7) Analysis of the Kinetics Dataset:

As shown in Table I, methods applied to the Kinetics dataset provided by [13] fail to significantly outperform the baseline ProtoNet. This underperformance can be attributed to dataset noise and the inability of GCN models to effectively differentiate actions represented by 2-D joints and joint estimation confidence scores in few-shot settings.

The Kinetics dataset comprises samples of 300 frames, each containing two 18-joint skeletons. Each joint is represented by 2-D coordinates and a confidence score. Fig. 5(a)–(c) illustrates normal, invalid, and anomalous frames, respectively. Our analysis revealed numerous randomly distributed invalid frames within samples, which remain unprocessed when following the procedures in [13].

Fig. 5. - Visualization of three types of frames in the Kinetics dataset. (a) Normal frame. (b) Invalid frame. (c) Anomalous frame.

We created two refined datasets: Kinetics_clean, which excludes invalid frames and samples, and Kinetics_2d, which further removes joint confidence scores. Table VII delineates the differences among these datasets.
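The construction of the two variants can be sketched as a simple filtering step. The sketch assumes that an invalid frame is one in which every joint has zero coordinates and zero confidence; the exact criterion is not stated in this section, and the helper names and the single-skeleton toy data are hypothetical.

```python
def is_invalid(frame):
    """Assumed criterion: all joints have zero coordinates and zero confidence."""
    return all(x == 0 and y == 0 and c == 0 for x, y, c in frame)

def clean_sample(frames):
    """Drop invalid frames from one sample."""
    return [f for f in frames if not is_invalid(f)]

def drop_confidence(frames):
    """Keep only the 2-D coordinates of every joint."""
    return [[(x, y) for x, y, _ in f] for f in frames]

def build_variants(samples):
    """Return (clean, two_d): samples without invalid frames (empty samples
    removed), and the same samples without confidence scores."""
    clean = [c for c in (clean_sample(s) for s in samples) if c]
    two_d = [drop_confidence(s) for s in clean]
    return clean, two_d

# Toy samples with two joints per frame, each joint given as (x, y, confidence).
sample_a = [
    [(0.1, 0.2, 0.9), (0.3, 0.4, 0.8)],
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],   # invalid frame
    [(0.2, 0.2, 0.7), (0.0, 0.0, 0.0)],   # one missing joint, frame kept
]
sample_b = [[(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]]  # entirely invalid sample
clean, two_d = build_variants([sample_a, sample_b])
print(len(clean), len(clean[0]), two_d[0][0])
```

Here `clean` plays the role of Kinetics_clean and `two_d` the role of Kinetics_2d; frames with only some joints missing are kept, since the section distinguishes invalid frames from anomalous ones without specifying how the latter are handled.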

TABLE VII Comparison of Kinetics Dataset Variants

Experimental results (see Table VIII) show performance improvements on Kinetics_clean compared to the original dataset, confirming the presence of noise in the Kinetics dataset in [13]. Notably, performance on Kinetics_2d did not significantly decrease, and in some cases improved, compared to Kinetics_clean. This suggests that models in few-shot scenarios fail to effectively utilize confidence scores.

TABLE VIII Performance Comparison Across Kinetics Dataset Variants

These findings have two important implications: 1) models can achieve comparable performance in classifying 2-D skeleton data without relying on confidence scores, which leads to reduced storage requirements and improved data management efficiency and 2) current models demonstrate an inability to effectively utilize confidence scores in few-shot scenarios. This suggests a need for developing methods that can better leverage this additional information when available.

SECTION V.

Conclusion

In this study, we propose an effective FICAMA framework for skeleton-based FSAR to increase the fine-grained information of samples and fuse multiple metrics to improve matching accuracy. FICAMA consists of three novel modules: the GRS for increasing the self-information of support set samples to encourage the model to capture fine-grained and unique representations, the skeletal motion fusion module for fusing unique representations and global representations, and the AMA for aggregating the results of multiple metric functions. The proposed framework’s effectiveness is validated by the results on two mainstream benchmark datasets, and extensive ablation experiments are conducted to analyze the potential of basic metrics and the noise in the Kinetics dataset. Building on these findings, future research could explore the integration of data from complementary modalities to further enhance model performance, potentially addressing the limitations of skeleton-based representations in FSAR scenarios. In addition, future work could focus on enhancing the adaptability of skeleton-based action recognition methods across varying conditions.