
ICSE2026

Review #773A


Overall merit

2. Weak reject

Rigor

  1. The experiment results in Table 2 seem unfair. To compare the effectiveness of different retrieval methods and retrieval sources, Table 2 introduces two retrieval methods and three retrieval sources. Three CMG approaches are also considered for comparing effectiveness; however, it is unfair to use the results of the CMG approaches rather than the retrieval modules of the approaches. Why not directly compare the retrieval modules of these three CMG approaches? No revision. Reason: Not all three CMG approaches (NNGen, HACMG, and REACT) have retrieval modules. Only REACT has a retrieval module, and its retrieval module is covered by the two retrieval methods and three retrieval sources.
  2. Table 2 shows that the approach REACT outperforms the other retrieval methods and retrieval sources, and REACT is designed with a hybrid retrieval method over a dataset-level retrieval source. Does this result indicate that the retrieval method is more important? This conclusion contradicts the findings of the preliminary study. No revision. Reason: Table 2 shows that directly using the retrieved message as the generated message can outperform some CMG approaches (NNGen and HACMG). Although REACT outperforms the other retrieval methods and retrieval sources, it can still be improved by retrieving from commit history. Table 5 verifies this.
  3. In the augmentation stage of HisRag, it enhances the input of SPLMs using the modification embedding proposed by a previous approach [1]. Why only apply the modification embedding to COME? In other words, why not apply the modification embedding to other SPLM approaches, such as ATOM and CCT5, to demonstrate the effectiveness of the input augmentation in HisRag? No revision. Reason: COME is the only approach that uses modification embedding. It has a unique design that adds a tag embedding to the model input. To isolate the importance of historical retrieval and preserve the model architectures of the other CMG approaches, we did not adopt this design for them. Section 5.2 (Enhanced Approaches) provides a detailed explanation.
  4. The setup of the approaches enhanced by HisRag lacks explanation. In Table 3, why do LLM-based approaches use 3 as the retrieved number, while the LLM-based approach REACT uses 1? RACE and COME are hybrid approaches that design both a retrieval module and a generation module; why do they use different retrieved numbers? It would be good to explain how the HisRag paradigm applies to these approaches and the reason for the four types of parameter configurations. Lines 555-625 in the diff version show the detailed settings.
  5. Another concern is the lack of comparison with the two LLM-based commit message generation approaches OMG [2] and OMEGA [3]. They show outstanding performance compared with all the baselines, so it is important to compare with these two approaches to highlight the effectiveness of HisRag. Lines 521-529 in the diff version explain this. In addition, we did not compare HisRag with recent CMG approaches like OMG [2] and OMEGA [3], which utilize multiple types of additional context (i.e., PR title, issue title, and commit type). The primary reason is that their datasets are significantly smaller in scale and include types of contextual metadata that are not considered in existing CMG methods like RACE, COME, and CCT5. Due to the disparity in dataset scale and contextual information, a direct comparison would require extensive adaptation of all existing CMG methods, which is beyond the scope of this paper. Since our work focuses on modeling commit message history without relying on extra contextual information, we leave a thorough comparison with such context-aware methods to future work.
  6. Insufficient evaluation of the effectiveness of the shortened model input. HisRag only uses retrieved commit messages as additional input and removes the retrieved diff; it would be better to conduct experiments to demonstrate the superiority of this design. Lines 562-569 in the diff version. The motivation of HisRag is to learn the style of commit messages rather than code semantics from the retrieved diff. Since some approaches (RACE and REACT) are originally designed to take the retrieved diff and message as model input, we use the retrieved diff and message from the history retrieval stage for them. Due to limited computational resources and for a fair comparison, we only use one retrieved diff and message. For approaches whose original implementations do not consider the retrieved diff, we keep the same setting of using only retrieved commit messages, as the existing approach HACMG does (see the sketch after this list).
  7. In Section 6.4, only COME and Llama3-8B are considered in the human evaluation. Why not be consistent with the comparison in Table 5? This requires further explanation of the reason for such a selection. Lines 766-772 in the diff version. We choose two representative approaches (Llama-3-8B and COME) and their HisRag-enhanced versions (HisRag_Llama-3-8B and HisRag_COME) for the following reasons. First, as shown in Table 5, these two approaches perform relatively well across all CMG approaches. Second, including all 11 baselines in the human evaluation would introduce significant overhead and participant fatigue, leading to potential biases or reduced evaluation quality. Therefore, we only select these two approaches for human evaluation.
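The following is a minimal sketch illustrating the shortened-input design discussed in item 6: the model input is built from the current diff plus retrieved historical commit messages only, omitting the retrieved diffs. The helper name and prompt layout are hypothetical, not the authors' implementation.

```python
def build_hisrag_input(current_diff: str, retrieved_messages: list[str]) -> str:
    """Hypothetical helper: concatenate retrieved historical commit messages
    (style exemplars from the developer's own history) with the current diff.
    Retrieved diffs are deliberately omitted to keep the model input short."""
    exemplars = "\n".join(f"Historical message: {m}" for m in retrieved_messages)
    return f"{exemplars}\nCurrent diff:\n{current_diff}\nCommit message:"

# Example with a single retrieved message (the setting used for the hybrid approaches).
prompt = build_hisrag_input(
    "diff --git a/app.py b/app.py\n- timeout = 5\n+ timeout = 10",
    ["Increase default timeout for slow networks"],
)
print(prompt)
```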

Relevance

The topic of commit message generation is relevant to the SE community.

Verifiability & Transparency

  1. The authors have conducted a well-organized evaluation; however, as mentioned in Rigor, some necessary comparisons with SOTA approaches, evaluation of the effectiveness of the shortened model input, and human evaluation are not conducted, which could compromise the verifiability of this approach. ditto
  2. The replication package is also available on Zenodo; the package provides source code, running scripts, and incomplete experimental results. The authors seem to provide only the results of CodeT5, CCT5, COME, CodeLlama-7B, Llama3-8B, DeepSeek V3, and DeepSeek R1; the results of NNGen, RACE, REACT, and HACMG, as well as the results of the human evaluation, are not found. Artifacts have been uploaded.

Questions for authors’ response

  1. Can HisRag outperform the state-of-the-art LLM-based commit message generation approaches such as OMG and OMEGA? ditto
  2. Can you design experiments to demonstrate the effectiveness of shorter model input on the SPLM and LLM models? ditto
  3. Please clarify the setup of enhanced approaches by HisRag. ditto

Review #773B


Overall merit

2. Weak reject

Weaknesses

  • While practically useful, the core technical idea—using a narrow retrieval scope and injecting retrieved messages—is conceptually incremental compared to recent CMG innovations (e.g., REACT, HACMG). The novelty mostly lies in the application of personalized retrieval.
  • The lexical- and semantic-based retrieval baselines in the preliminary study are not clearly described. It is not evident how commit messages are selected or whether these baselines unfairly outperform or underestimate actual CMG models. Lines 243-268 describe this.
  • The difference between HisRag_COME and its variant without modification embeddings is not adequately explained, making it hard to interpret the cause of performance variations in the ablation study. Lines 733-736. The red color marks HisRag_COME without modification embedding: it means removing the tagged token embedding layer and using only the rearranged token embedding layer for HisRag_COME (see the sketch after this list).
  • The sample size (50 examples) is relatively small and not statistically justified in terms of confidence interval or margin of error. Additionally, details on annotator selection, task guidelines, or inter-rater agreement are missing. Lines 763-783 explain the details.
  • While HisRag aims to retrieve “less,” the paper does not address the computational cost of generating and storing embeddings, maintaining per-user indices, or runtime retrieval in large teams or CI/CD workflows. No revision. Intuitively, if the retrieval scope (and thus the number of candidates) is smaller, retrieval efficiency will be higher.
  • Injecting similar commit messages from the past could lead to reduced lexical or structural diversity. There is no analysis of whether HisRag causes overfitting to personal style or introduces repetition in generated messages. No revision. This is a common problem in RAG methods.
  • It’s unclear whether baseline models were re-tuned or re-evaluated under the same training and testing settings as the HisRag-enhanced models, raising concerns about evaluation fairness. Lines 635-636. The baseline approaches keep the same training settings before and after being enhanced by HisRag.
  • The paper does not report the distribution of history lengths across developers and repositories. Given that the approach depends on historical context, it would be valuable to know how history size impacts performance. No revision. Table 1 shows the retrieved numbers under different retrieval sources.
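Below is a minimal PyTorch-style sketch of the tag (modification) embedding design referenced in the HisRag_COME ablation above. It is an assumption based on the description in this response, not COME's actual code: each input token receives a token embedding plus a tag embedding marking its edit status, and the ablated variant simply drops the tag embedding.

```python
import torch
import torch.nn as nn

class TaggedDiffEmbedding(nn.Module):
    """Sketch (assumed design): token embedding + edit-tag embedding."""
    def __init__(self, vocab_size: int, num_tags: int = 3, dim: int = 512, use_tags: bool = True):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)  # rearranged token embedding layer
        self.tag_emb = nn.Embedding(num_tags, dim)      # e.g., 0 = kept, 1 = added, 2 = deleted
        self.use_tags = use_tags                        # False reproduces the ablated variant

    def forward(self, token_ids: torch.Tensor, tag_ids: torch.Tensor) -> torch.Tensor:
        x = self.token_emb(token_ids)
        if self.use_tags:
            x = x + self.tag_emb(tag_ids)               # inject modification information
        return x
```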

Detailed comments for authors

This paper, while modest in its novelty, makes a valuable and timely contribution to the field of commit message generation (CMG). HisRag is based on real-world developer behavior and fits well within existing CMG pipelines. However, the paper could be improved by providing greater transparency in its methodology, particularly in the areas of ablation studies and human evaluation. Additionally, a more detailed characterization of the dataset and a discussion on the scalability of the system would strengthen the work.

Considering both the strengths and the areas for improvement, I am inclined to recommend a weak reject.

Questions for authors’ response

  • Beyond narrowing the retrieval scope, could you elaborate on what differentiates HisRag from prior CMG frameworks like REACT? No revision. Lines 374-389.
  • Also, how generalizable is your approach to non-Java languages or less active repositories where historical context may be sparse? Lines 868-885. We added a section (Limitations and Future Work) to illustrate this.
  • Could you justify the sample size used in the human evaluation in terms of statistical significance? Also, were annotators given explicit guidelines? Was inter-annotator agreement measured? ditto
  • In RQ3, what exactly is removed or changed in the “HisRag_COME without modification embedding” variant? Could you clarify what insights this ablation is meant to isolate? ditto

Review #773C


Overall merit

2. Weak reject

ii) Rigor:

While the experiments are extensive and cover multiple model types and metrics, several concerns reduce the rigor:

  • Baselines:

The authors’ experimental comparisons are incomplete. Although HisRag is positioned as a RAG-based approach, the paper does not include direct comparisons with widely used RAG methods based on BM25 [51], which are commonly adopted for LLMs in commit message generation tasks. Furthermore, the authors do not compare HisRag with more recent large language model-based approaches, such as the ERICOMMITTER method [68]. ERICOMMITTER is particularly relevant because it also evaluates both lexical-based retrieval (BM25) and semantic-based retrieval (pre-trained code models), offering a broader and stronger benchmark. The omission of these important baselines significantly limits the fairness and completeness of the evaluation and makes it difficult to accurately assess the true advantages of the proposed approach. No revision. The point of this paper is to verify the importance of retrieving from commit history.
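As a concrete illustration of the retrieval-source question the response emphasizes, here is a minimal BM25 sketch restricted to a single developer's own commit history rather than the whole dataset. It assumes the rank_bm25 package and hypothetical example data; it is not the authors' implementation.

```python
from rank_bm25 import BM25Okapi

# A developer's past (diff, message) pairs: the history-level retrieval source.
history = [
    ("- retries = 1\n+ retries = 3", "Increase retry count for flaky API calls"),
    ("- log.debug(msg)\n+ log.info(msg)", "Promote debug logging to info"),
]

# Index only this developer's diffs (lexical retrieval).
bm25 = BM25Okapi([diff.split() for diff, _ in history])

query_diff = "- retries = 3\n+ retries = 5"
retrieved_diff, retrieved_message = bm25.get_top_n(query_diff.split(), history, n=1)[0]
print(retrieved_message)  # the most similar past commit's message
```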

  • Metric Reliability:

The paper primarily relies on automatic metrics such as B-NORM, BLEU, ROUGE-L, METEOR, and Log-MNEXT. However, both BLEU and B-NORM have been shown to be unreliable for CMG evaluation. For example, Wang et al. report that LLMs often achieve relatively low BLEU scores, yet human evaluations consistently find their outputs to be of the highest quality. This suggests that apparent improvements over LLMs on BLEU or B-NORM may simply be due to better formatting or stylistic alignment, rather than true content improvement. Thus, the current evaluation metrics may not accurately reflect the real benefits of HisRag in practical settings. No revision. We used both automatic metrics and human evaluation to mitigate this problem.
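The following toy example illustrates the metric-reliability concern raised above: two commit messages describing the same change can share almost no n-grams and therefore receive a near-zero BLEU score. The example data are invented for illustration and assume NLTK is available; they are not drawn from the paper.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "fix null pointer exception in config loader".split()
candidate = "resolve NPE when loading the configuration".split()  # same intent, different wording

smooth = SmoothingFunction().method1
score = sentence_bleu([reference], candidate, smoothing_function=smooth)
print(f"BLEU: {score:.3f}")  # near zero despite describing the same change
```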

  • Human Evaluation:

Although a human study is included, the evaluation criteria—Informativeness, Conciseness, and Expressiveness—are not clearly defined. The scoring system ranges from 0 to 4, but the meaning and anchors for each score are not explained, leaving it unclear how these scores relate to actual usefulness or human judgment. This lack of clarity could also be the reason for the large standard deviation observed in the results. Furthermore, the evaluation does not include recent strong models such as DeepSeek, which may achieve the best performance, making the assessment less comprehensive and less informative for practitioners. No revision. We explained the evaluation criteria in Lines 782-797. Lines 767-772 explain why we only use Llama-3-8B and COME.

  • Dataset:

The paper evaluates HisRag solely on the CommitChronicle dataset. However, recent works—such as ERICOMMITTER [68]—also use the widely recognized MCMD dataset, which is a standard benchmark in this field. Limiting evaluation to just one dataset restricts the comparability and generalizability of the results. Including additional standard datasets would strengthen the validity and broader impact of the experimental findings. No revision. The MCMD dataset does not provide commit history information; CommitChronicle was the only dataset that hosts commit history while preserving the chronological order of commits at the time of writing.

Questions for authors’ response

  1. Can the authors clarify what is fundamentally novel in HisRag compared to existing RAG-based methods like [55], [70], and [68] and standard RAG (based on BM25) or concatenation approaches? No revision. The novelty is explained in Lines 374-389.
  2. Can the authors provide direct comparisons with widely used RAG baselines such as BM25-based RAG [51] and recent methods like ERICOMMITTER [68]? ditto
  3. Do the authors have evidence that improvements in B-NORM and BLEU correspond to real quality gains, given the known unreliability of these metrics for LLM-based CMG? ditto
  4. Can the authors clearly define the human evaluation criteria, explain the 0–4 scoring system, and consider including recent models like DeepSeek in the human study? ditto
  5. Can the authors evaluate HisRag on additional standard datasets such as MCMD to improve comparability and generalizability? ditto