submission id,reviewer id,review text,annotation label,annotation sentences
QxItoEAVMb,gnLZ,"**Summary**: The paper introduces an new _modular_ reinforcement learning in PyTorch named TorchRL whose goal is provide a flexible, efficient and scalable library for rapid prototyping and research.
The key componnet here the TensorDict data structure enablibg easy efficient communication between different components like environments, buffers, models etc.
It includes many reference implementations including RL algorithms as well as distributed traning and other utilities.
Experiments are performed to validate correctness and efficiency while showing competitive performance compared to other libraries.

**Soundness**: 3 good

**Presentation**: 3 good

**Contribution**: 3 good

**Strengths**: - Modular design: seems really well designed from the perspective of allowing new implementations from existing components vs the current standard practice of trying to fork an existing implementation where which component changed and mattered can be difficult to figure out. You either have single file implementations that don't really do distributed scaling well or you have weird nested inference to figure out how/where even is the actual algorithm implemented.
- Modularity is demonstrated with reasonable tradeoffs for a very wide variety of RL methods and applications. 
- TensorDict as a concept is very generally useful and will hopefully see wider adoption in Torch ecosystem.
- Distributed training and scaling building on `torch.distributed` is quite nice.  
- Example implementations are short and clear while scaling well.

**Weaknesses**: - Likely steeper learning curve and verbosity compared to higher level libraries. However, given the focus on research over applications, should not be that bad.
- Limited to PyTorch, although DLPack has made things easier.
- Smaller community and ecosystem. Would have been much better if this had been ready a couple of years ago that would have helped fix some of the fragmentation issues. If I were to guess, RL _algorithm_ research is somewhat waning (and has largely been not _that_ successful as a research endeavor because a lot of things that make RL work are considered engineering details).
- Lots of tiny details matter when it comes to RL comparisons and more tooling and integration here would be useful. Reporting 3 average seed results is likely not the best way to report results when comparing RL methods.
- Range of ""reasonableness"" is quite wide when it comes to RL implementations. Would be great if the paper also put the current best reported results for various environments to contextualize. That is also going to be helpful in convincing new algorithm implementations to start here rather than the best performing implementation in terms of reward scores etc.

**Questions**: - While I understand it's not the focus, would be useful to also show how heteregoenous multi-agent problems can be setup in TorchRL's API while still potentially leveraging batching/vectorization capabilities?
- Hyperparameters in RL are a big bane. Any plans to support automatic tuning, especially population based methods?",1,"['Smaller community and ecosystem. Would have been much better if this had been ready a couple of years ago that would have helped fix some of the fragmentation issues. If I were to guess, RL algorithm research is somewhat waning (and has largely been not that successful as a research endeavor because a lot of things that make RL work are considered engineering details).', 'Lots of tiny details matter when it comes to RL comparisons and more tooling and integration here would be useful. Reporting 3 average seed results is likely not the best way to report results when comparing RL methods.']"
COYDmKkQH4,KQUb,"**Summary**: The authors present AutoCast++, a system for world event prediction relying on three components: a task-aligned retrieval module; a news summarisation module (text summarisation on retrieved news); a fusion-in-decoder model that is aligned to perform the event predictions.
They evaluate the system on the AutoCast dataset by grouping the tasks in numerical, multiple choice and true/false; considering as baselines a collection of methodologies suggested by the benchmark.
The results show that the proposed system is able to outperform the considered baselines considering different model sizes.

**Soundness**: 3 good

**Presentation**: 2 fair

**Contribution**: 3 good

**Strengths**: The proposed system shows remarkable performance presenting a limited impact from the model size.
The only tasks where it does not excel are the numerical ones, but it's anyway a close call with a baseline that is almost two times larger.

**Weaknesses**: While the exclusion of baselines relying on new LLMs including data post mid 2021 is understandable, the ablation studies seem to suggest that relying on LLM for retrieval reranking and summarisation play a huge role in the performance of the system.
What would be convincing is to build/revamp the baselines considered using the GPT3 pre-trained version that the authors leverage in their experiments.
This would surely make the submission much stronger and convincing.

**Questions**: Which GPT3 version was considered in the work?
What is the impact of binning numerical questions? Is the binning applicable also to the baselines? If yes how would results change?",0,
qL9gogRepu,PRrS,"**Summary**: The paper explores the longstanding issue of ambiguity in natural language and its implications for semantic parsing in AI. It delves into the nature of language ambiguity, highlighting how it stems from the balance of communication efficiency and interpretive flexibility, and poses challenges for AI systems that lack human-like commonsense knowledge and context. To address these challenges, the authors propose a novel framework and the Ambiguous Parsing (AMP) dataset, which includes various types of ambiguities paired with dual logical forms (LFs). This resource is aimed at enhancing the performance of large language models (LLMs) in semantic parsing tasks, especially in handling ambiguity.

The study introduces two tasks to assess how well LLMs, utilizing in-context learning (ICL), can capture multiple interpretations of an ambiguous input. These tasks are designed to evaluate model performance in both zero-shot and few-shot settings, with a series of metrics developed to quantify their ability to predict and represent ambiguity. The paper also reports on models' performance, noting that while models can sometimes mirror human preference for certain interpretations, they generally fall short in predicting all possible parses. Additionally, it is observed that some models are quite adept at reflecting the distribution of interpretations in mixed-prompt scenarios, offering insight into in-context learning amidst conflicting evidence.

**Soundness**: 4 excellent

**Presentation**: 4 excellent

**Contribution**: 4 excellent

**Strengths**: 1. The AMP dataset is a significant contribution, providing a resource specifically designed for investigating ambiguity in semantic parsing, which is a relatively unexplored area.

2. The paper takes a comprehensive approach by addressing the challenge from the perspective of both dataset creation and model evaluation.

3. The introduction of zero-shot and few-shot tasks offers a rigorous evaluation framework for future research on ambiguity in semantic parsing.

4. The development of new metrics to assess the models’ ability to handle ambiguity is a noteworthy contribution that can guide subsequent model development.

5. The results contribute interesting insights into the capabilities and limitations of current LLMs in capturing ambiguity through zero-shot and in-context learning.

**Weaknesses**: 1. While the paper provides a strong foundation, it could benefit from a more detailed exploration of how ambiguity affects real-world applications of semantic parsing.

2. The AMP dataset, while novel, might still be limited in scope and diversity, potentially affecting the robustness of the study’s conclusions.

3. It is unclear how the proposed methods deal with the dynamic nature of conversational context, which can significantly affect ambiguity resolution.

**Questions**: 1. How do you foresee the findings of this research being applied in practical AI systems, particularly in areas where ambiguity can have significant consequences, like in legal or healthcare settings?

2. Is the AMP dataset extensible, and are there plans to include more complex or nuanced forms of ambiguities, such as cultural or idiomatic ones?

3/ Could you elaborate on the selection process for the five types of natural language ambiguities included in your study? Were there other types of ambiguities considered but excluded?

4. The use of synthetic data might not fully capture the complexity of natural language ambiguities encountered in real-world scenarios. How well do the findings translate to naturally occurring datasets?",1,"['The AMP dataset, while novel, might still be limited in scope and diversity, potentially affecting the robustness of the study’s conclusions.']"
GEZACBPDn7,2nwv,"**Summary**: The paper presented a semi-supervised method for graph classification. The proposed model is composed of two GCNs, one is for individual graphs and the other is for a super graph of all graphs, where the super graph is constructed by a graph kernel. The proposed method is compared with its competitors such as graph contrastive learning on benchmark datasets, where different labeling rates have been considered.

**Soundness**: 4 excellent

**Presentation**: 3 good

**Contribution**: 3 good

**Strengths**: 1. The problem studied in the paper, namely graph-level semi-supervised learning with scarce labels, is an important and challenging problem. 
2. The proposed method is based on a double-level GCN model, which has two GCNs. The first one performs graph convolution for each graph and the second one performs graph convolution for a global graph defined (by graph kernel) over all the graphs. This idea is very novel and appealing.
3. The proposed method is compared with state-of-the-art methods such as SimGRACE and GLA as well as classical methods such as GCN and WL kernel. It has competitive performance.
4. The proposed method is simple and easy to implement.

**Weaknesses**: 1. The authors claimed that their method has fewer hyperparameters but they did not provide specific comparison with other methods such as GLA in terms of the number of hyperparameters. 
2. The similarity graph among graphs is constructed by a graph kernel such as WL-subtree kernel and there are two different post-processing method for $\mathcal{K}$. it is not clear which one is better and which one was used in the experiments. 
3. The writing can be further improved.

**Questions**: 1. At the beginning of Section 3.1, $\mathbf{S}$ is a binary matrix. However, in Section 3.3, the kernel matrix given by a graph kernel may not be binary or sparse. Do the sparsification and binarization have a significant impact on the performance of the proposed method? 
2. In Section 4.2, the authors set $d=d’=64$. Is this the best setting? How do $d$ and $d’$ as well as $d’’$ influence the classification accuracy?
3. What are the numbers of layers in the two GNNs in the experiments? Does the depth matter?
4. In Figure 2, the two post-processing methods for the global kernel matrix are compared. It seems that the one related to $c$ is better than the one related to $\tau$. I wonder if the authors reported the results of the method related to $c$ in Tables 2, 3, and 
5. It is not clear why the authors did not include the results of larger labeling rates such as 30% or 50%.
6. Are their any time cost comparison?
7. In Table 4, it seems that the performance of graphlet sampling kernel is always the worst. I suggest the authors discuss the difference between graphlet sampling kernel and other kernels.
8. It is necessary to compare the number of hyperperameters of the proposed method with those of the baselines. In the proposed method, one has to determine $c$ or $\tau$, which affect the classification performance.",1,['The writing can be further improved.']
eAKmQPe3m1,VQLf,"**Summary**: The paper introduces several recipes to accelerate the training of text-to-image foundation models. These include
- Use pretrained DiT
- Use Flan-T5 XXL
- Use LLaVA captions.
- AdaLN / AdaLN single architectures to reduce model size.
Overall, these methods allow training of a reasonable quality model in 10% of the resources used than Stable Diffusion, which makes it more possible to democratize the training recipes in text-to-image foundation models.

**Soundness**: 4 excellent

**Presentation**: 3 good

**Contribution**: 4 excellent

**Strengths**: Originality: the empirical evaluation of synthetic captions in text to image generation is not systematically studied until DALL-E 3, and the new AdaLN architecture might be useful.
Clarify: the paper is quite clear about most details about the training, which makes reproducibly much more likely.
Significance: the paper mostly explores valid heuristics for training text-to-image foundation models quickly, some conclusions can be helpful in the community: 1) DiT architecture instead of UNet, 2) the use of synthetic captions, 3) the use of SAM dataset.

**Weaknesses**: The paper mostly is a combination of multiple ideas that exist in the literature, so ""novelty"" in the traditional sense is somewhat limited.

**Questions**: 1. The SAM dataset blurs human faces in their training, won't this cause problem in generation cases where generating a face (not closed up) is needed?
2. How does the model generate images with more extreme aspect ratios?
3. The DiT architecture has a fixed patch size. As resolution becomes higher, so will the number of tokens be higher. Will this cause a bottleneck in training and inference (such as 1k resolution)?
4. DALL-E 3 technical report mentions the pitfall of ""overfitting"" to automated captions, is this the case in PixArt model? If not, how is this mitigated?
5. Since the dataset size is smaller, does it have trouble producing named entities, such as celebrities? 
6. How critical is training DiT on ImageNet needed? While being able to start with an existing model is good it also limits the possibilities to explore different architectures. 
7. The CLIP score of the CLIP-FID curve of Pixart seems worse than SD 1.5. Is there any reason for that?",1,"['The paper mostly is a combination of multiple ideas that exist in the literature, so ""novelty"" in the traditional sense is somewhat limited.']"
D9SA02esgh,ZykZ,"**Summary**: The authors of the paper introduce MORPHOCC, a neural network model designed to capture and represent the diversity of neuron morphologies in the mouse primary visual cortex (V1). The model encodes the morphology of each neuron into a low-dimensional embedding from which the 3D shape can be reconstructed. Trained on 797 dendritic shapes of V1 neurons, the model's embedding effectively captures morphological features, aiding in cell type classification. The model also enables the generation of new neuron instances through interpolation in the embedding space.

**Soundness**: 2 fair

**Presentation**: 2 fair

**Contribution**: 2 fair

**Strengths**: - The approach addresses the essential need for quantitative, unbiased methods to capture and represent the structural and morphological features of neurons.
- MORPHOCC's ability to reconstruct 3D shapes from a low-dimensional embedding offers potential benefits for representing and analyzing neuronal morphologies.

**Weaknesses**: - The reliance on existing deep learning architectures like the PointNet encoder and SIREN decoder, without significant modifications or enhancements, raises concerns about technical novelty, especially considering the high standards expected for technical novelty in ICLR.
- The training dataset consists of only 797 neurons, raising concerns about the model's ability to generalize, especially when applied to classifying and generating new neurons outside this limited set. This is somewhat evident from the very high IoU scores and limited diversity of interpolated samples in Figure 5.
- Using linear interpolation in the embedding space to generate new neuron instances may not produce neurons distinct from those seen during training. Essentially, this method interpolates between two known neurons, resulting in a neuron that isn't morphologically much different from the original ones.

**Questions**: - When generating neurons via interpolation, how do structural details, like dendrite branching and length, evolve?
- What specific measures were taken to prevent overfitting, especially given that directly learning the embeddings led to overfitting?
- How might these findings be used in real-world applications, like neuroscience research or medical diagnostics?",0,
UEP8yRxTfV,sefk,"**Summary**: The paper studies the biased estimation problem of diffusion models, by examining the flaw in the $\epsilon$-prediction. The authors also conduct several empirical analyses to support the bias effect. Empirically, the proposed objective achieves better FID scores across facial datasets.

**Soundness**: 1 poor

**Presentation**: 2 fair

**Contribution**: 1 poor

**Strengths**: - The paper examines the potential bias problem when using $\epsilon$-prediction.

- Empirically, the proposed weighting scheme outperforms previous ones.

**Weaknesses**: - **No novelty**: It seems that the paper reinvents a well-established objective in the diffusion models literature -- $x_0$ prediction (see the blog https://medium.com/@zljdanceholic/three-stable-diffusion-training-losses-x0-epsilon-and-v-prediction-126de920eb73). The paper ""rediscovered"" the relation between $x_0$-prediction and $\epsilon$-prediction. The proposed objective in Eq.11 is actually doing $x_0$-prediction type loss: by setting $\epsilon_\theta = \frac{x_t-\hat{x}_0}{\sigma}$ and $\epsilon=\frac{x_t-x_0}{\sigma}$ one could recover the $x_0$-prediction loss.

One step further, there are already works focusing on combining the strengths of $x_0$-prediction and $\epsilon$-prediction, like the $v$-prediction [1] and the pre-conditioning techniques in EDM [2], in the past year.

- The FID score in Table 1 is way too high in the small NFE regime (NFE<100). It makes the comparison much less convincing.


[1] Progressive Distillation for Fast Sampling of Diffusion Models, Salimans et al.

[2] Elucidating the Design Space of Diffusion-Based Generative Models, Karras et al.

**Questions**: N/A",0,
jLLF5EbwI2,wpM5,"**Summary**: The paper aims to address two limitations of previous text-to-image generation, which are the missing object mentioned in the prompt and the wrong spatial relationship between objects. The paper proposes the SPADE dataset and SPADE generator to generate reference images that align with the input prompt regarding object and spatial relationships. The paper proposes LSAI to use Stable Diffuison or ControNet with the reference image in a training-free manner to generate the final image. The results show that the proposed method can improve the spatial fidelity of Stable Diffusion and also have better generalization ability.

**Soundness**: 3 good

**Presentation**: 3 good

**Contribution**: 2 fair

**Strengths**: 1. The manuscript successfully identifies and addresses two critical shortcomings in the field of text-to-image generation: the neglect of specified objects and errors in spatial arrangement.
2. A logical and effective strategy is proposed, where a reference image is generated to match the input prompt in terms of spatial relationships, and then utilized with SDEdit or ControlNet to create the final image. This approach demonstrates sound reasoning.
3. The method can improve the stable diffusion model. It also shows great performance in out-of-distribution objects.

**Weaknesses**: 1. The intuition behind SPADE is not explained well. The author should explain why the 3D rendering engine is needed here. Because the paper only focuses on vertical and horizontal direction, 2D is enough.
2. Some comparison experiment is missing. For example, LayoutGPT also focuses on the relationship issue, the comparison result should be given.
3. More visualization results are needed to validate the generalization ability. The OOD experiment is interesting, however, the author should show the result with different substitutes.

**Questions**: Please refer to weaknesses.",0,
mZn2Xyh9Ec,Jn9J,"**Summary**: This paper describes FlashAttention-2 which improves upon FlashAttention by introducing ""tweaks"" to improve performance on GPUs.   The paper claims the tweaks improve performance by increasing occupancy and use of matrix-multiply hardware (tensor cores).  The paper reports a bit under $1.3\times$ wall clock speedup versus FlashAttention.

**Soundness**: 2 fair

**Presentation**: 2 fair

**Contribution**: 2 fair

**Strengths**: Improving training speed of LLMs is of great interest to many.

**Weaknesses**: Could do a better good job explaining how a given ""tweak"" helps achieve a given improvement (occupancy, use of tensor cores).

**Questions**: Regarding the equation at the top of Page 5, I am unclear ""$\mbox{diag}(l^{(1})^{-1}$"" is to the power -1.  Comparing to the prior equation seems like exponent of -1 should be 1.

I think it would help some readers (like me) understand the contribution a bit better if the paper briefly summarized the key changes in the six (unnumbered) equations on Page 5 that are described as the ""online softmax trick"" versus the six on Page 3.  

How do the ""tweaks"" in Section 4.1.1 help reduce non-matrixmul FLOPs?  I know a fair amount about tensor cores, but it wasn't obvious to me.

The paper claims occupancy is increased on Page 6 but it was unclear: (i) what definition of occupancy is being used (GPU resources could mean many things and occupancy often just refers to number of warps that can concurrently run versus max number supported by hardware ); and (ii) whether any measurement has been made to confirm the claimed improvement (e.g., using NVIDIA Parallel Nsight or similar approaches for collecting performance counters).

Much of Algorithm 1 seems similar to the original FlashAttention.  It may help summarizing which lines are different.  It would also help the reader if there was a summary of which lines lead to the reduction in non-matrixmul FLOPs and improved occupancy.

""Only at the every end of the"" - typo.

For the backward pass (Section 3.1.2): It was unclear what the relevance of the paragraph on MQA and GQA is to the changes in FlashAttention-2 versus FlashAttention.  

In Figure 2, does an uncolored square mean no computation?  Does the backward pass for a given worker start right away or do workers need to synchronize between forward and backward pass?  Do you not need to compute the combined result for the forward pass before you can start the backward pass?    If you do need to wait, then how can one achieve greater than 50% use of peak performance if roughly half the compute cycles are spent waiting for the longest running forward/backward pass thread block to complete?   If you don't need to wait, why not?

I'm not sure how to relate Figure 3 to Algorithm 1 (i.e., which lines it is meant to illustrate).  From the two paragraphs above Figure 3 I get it there are two potential sources of reduced execution time: fewer shared memory accesses and fewer synchronizations (__syncthreads, I assume).  Unclear which of those matters most and why given that shared memory accesses proceed about as fast as register file accesses and synchronization with a thread block is low overhead.  

Why is FlashAttention (version 1) missing in Figure 5?

As someone who knows GPUs well, I would have liked to see more performance counter data to backup the claims of the sources of performance improvements.   I understand space is limited in the main text, but in checking the supplemental material, while it is great to see all the code, there appeared to be no PDF providing additional data or details.  Including one might have helped.",0,
vI95kcLAoU,iYVP,"**Summary**: This paper improves the efficiency of Vision Transformer by replacing some attention layers with a compute-efficient parametric function, ie, convolutional feed-forward layer. The idea is motivated by a clear observation and analysis that attention patterns tend to be redundant between different layers, indicating a strong correlation. With the novel design, the authors validated the framework on various architecture and datasets. Comprehensive experiments have shown the advantage of their method.

**Soundness**: 3 good

**Presentation**: 3 good

**Contribution**: 3 good

**Strengths**: 1. The motivation of this paper is very clear, accompanied by strong analysis in the attention patterns. 
2. The figures and visualizations can clearly demonstrate their method. The overall presentation is good to me.
3. Experiments are comprehensive, including different architectures, datasets, tasks, which strongly demonstrate that the proposed method is general.
4. The performance gain is also consistent across different settings.

**Weaknesses**: 1. Based on the analysis in Section 3.2, it makes sense for the authors to apply their method from layer 2 to 8. However, it is not convincing for different pretrained ViTs to skip layer 2 to 8 as well if considering different training objectives or pretrained datasets. Thus, it would be better for the authors to study if other pretrained ViTs (MAE [A], DINOv2 [B], SAM [C]), have the same phenomenon.

2. Introducing convolution into ViTs has shown to be effective in related works [D], which is intuitive to me to achieve performance gain for SKIPAT. In this paper, SKIPAT adopts convFFN as a parametric function to replace MSAs, which still needs to be trained from scratch in order to achieve efficiency gain. It would be promising if this parametric function can be used as a drop-in replacement for existing large ViTs.


[A] He, Kaiming, et al. ""Masked autoencoders are scalable vision learners."" Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.

[B] Oquab, Maxime, et al. ""Dinov2: Learning robust visual features without supervision."" arXiv preprint arXiv:2304.07193 (2023).

[C] Kirillov, Alexander, et al. ""Segment anything."" ICCV (2023).

[D] Wang, Wenhai, et al. ""Pvt v2: Improved baselines with pyramid vision transformer."" Computational Visual Media 8.3 (2022): 415-424.

**Questions**: Can the authors specify more on the experimental setting of applying SKIPAT into hierarchical ViTs? I can understand that SKIPAT works for layer 2 to 8 in plain ViTs. But it is not intuitive to me how to select the layers to skip in PVT, LIT, etc.",0,
Y8OaqdX5Xt,ciXi,"**Summary**: The authors introduce a new method for decision-time planning, based on inferring co-player goals, learning co-player goal-based policies and then rolling out MCTS using these policies. They apply this new method to several sequential social dilemma domains.

**Soundness**: 1 poor

**Presentation**: 3 good

**Contribution**: 2 fair

**Strengths**: - The introduction provides clear motivation, and the conclusion provides a concise summary and sound suggestions for future work.
- To my knowledge, the idea of doing MCTS based on explicitly learned co-player models using goals that are inferred online is novel. In principle, this could yield an impactful improvement over the state of the art.
- The method is generally well-described, and therefore I assess that this paper is likely reproducible.

**Weaknesses**: - The main weakness of this paper is that the experimental results are quite weak. Table 1 does not show a very large improvement from PToM over the baselines except in some specific cases (e.g. adaptation to exploiters in SS). PToM does not achieve cooperation in SPD. There are no videos or behavioral analyses provided to corroborate qualitative claims about the better behaviour of PToM over the baselines. Can the authors provide such results? For instance, what is the evidence that PToM is significantly better at hunting stags in SSH? 
- Some of the experimental claims are poorly explained and some of them are unsubstantiated. For example, on page 8, what is meant by ""we find that the leading space of PToM is not significant""? Also on page 8, the ""further intuition"" on why PToM is effective at adaptation is not substantiated by analysis of the belief model for the agent during intra- and inter-ToM. Can the authors provide this data? 
- There is some important missing related work which should be cited on theory of mind in the context of RL: https://arxiv.org/pdf/2102.02274.pdf, https://arxiv.org/pdf/1901.09207.pdf. Ideally these methods would be provided as baselines, or the authors would explain why their method was clearly an improvement from a theoretical standpoint. Can the authors comment on this?
- The authors could also cite more recent work in the LOLA line: e.g. https://arxiv.org/abs/2205.01447. 
- Above equation (3) the authors seem to assume that the focal agent has access to the goals of its opponents during training. Is this correct? Yet earlier in Section 3, the authors claim that agent j's true goal is inaccessible to agent i. Can they clarify this apparent contradiction (perhaps the goals are known during training but not during execution, in the usual centralized training, decentralized execution paradigm)? 
- In Figure 3, does the x-axis for PToM take into account the experience steps used in MCTS? This is unclear, and if not, these experience steps should also be accounted for, to make this a fair comparison between algorithms.

**Questions**: See Weaknesses.",0,
Pb9PIECnNF,PtvE,"**Summary**: This paper proposes to study the OOD sample detection for pre-trained models where some OOD data may be in the pretraining dataset (PT-OOD). It is observed that the detection performance for self-supervised pretrained model is worse than supervised pretrained model. The paper propose to use k-NN to detect PT-OOD sample.

**Soundness**: 1 poor

**Presentation**: 2 fair

**Contribution**: 3 good

**Strengths**: The problem setting of detecting PT-OOD samples is interesting and might be valuable for future application of large pre-trained models.

**Weaknesses**: 1). The analysis lacks support. The paper report that OOD detection methods perform better on supervised pretrained model than on self-supervised pretrained model. The analysis in this paper says that it is because the features of models under supervised pretraining are linear seperable while the features of self-supervised trained models are not. Except the illustration in Fig.2, I have not found any theoretical or empirical evidence to support this analysis.

2). The proposed method lacks novelty and contradicts to the analysis in section 4. This paper proposes to apply kNN on features to detect PT-OOD samples. While applying clustering method on features to classify data sample is a conventional way in feature laerning [1], the proposed method lacks novelty. Furthermore, it contradicts to the analysis in section 4, stating the PT-OOD features scatter among ID features.

3). Some part of the paper is confusing. For example, in the last paragraph in the introduction, it first says "" We can utilize instance-by-instance discriminative features to separate ID and OOD, which require ID boundaries"" and in the next phrase, it says ""without using ID decision boundaries"", which is confusing. 

[1] Bingyi Kang, Saining Xie, Marcus Rohrbach, Zhicheng Yan, Albert Gordo, Jiashi Feng, and Yannis Kalantidis. Decoupling representation and classifier for long-tailed recognition. In ICLR 2020.

**Questions**: I have major concerns over the analysis in this paper. 

1). Why whould the features from self-supervised model be less linear seperable than supervised model? By tuning the last linear layer, self-supervised models achieve similar or better performance than supervised models, it is not obvious why less linear seperable is the reason for less robustness in detecting PT-OOD samples.

2). If the analysis in this paper is true, why would kNN be an effective method to detect PT-OOD sample when ""PT-OOD can scatter in the feature space""? 

Therefore, I think the paper requires a more in-depth analysis of the PT-OOD problem.",1,"['The proposed method lacks novelty and contradicts to the analysis in section 4. This paper proposes to apply kNN on features to detect PT-OOD samples. While applying clustering method on features to classify data sample is a conventional way in feature laerning [1], the proposed method lacks novelty.']"
bcNwnuWMe0,FSEP,"**Summary**: The authors study the use of graph neural networks in flood forecasting. They find that with their models the use of river topology does not add significant value to the prediction of flood events.

**Soundness**: 2 fair

**Presentation**: 3 good

**Contribution**: 2 fair

**Strengths**: - Flood prediction is an important problem under the climate change, and can lead to enormous social good via mitigation of the loss of human lives and economic damage. 

- The current work applies some of the most advanced time-series prediction method like LSTM, combined with using GCN to take into account the network topology to solve this problem.

**Weaknesses**: - The key issue with the current paper is the message is mostly a negative result, that incorporating the network structure does not help with the forecast of flood. Although there are some experimental support to this, it is difficult to draw this conclusion because the authors have only tried a limited set of models. As the claim is counter-intuitive, more analysis, especially analysis of the raw data, is required for supporting the claim. Are there correlations between the water level in two gauge stations upstream and downstream? And are there time lags between between a spike in an upstream gauge station and a downstream gauge station? If so, the spike in the upstream gauge station should be useful for predicting the spike in the downstream gauge station. Why is this not reflected in the experiment results? 

- I have reservations about the way the problem is modeled. Since we are interested in predicting flood, which is a rare event, fitting the average MSE loss to the time series data might not be the best approach. Assuming the floods are rare large spikes in the data, a conservative model will do best by trying NOT to predict a large spike, as getting the timing of the spike wrong can incur a huge MSE loss. It could, for example, be modeled alternatively as a time series prediction problem, where we try to predict if there is a flood event within the next 6 hours given the water level in the past 24 hours. There could be multiple ways to model this but MSE on the water level does not seem the right fit for modeling rare spike events.

**Questions**: - How long does it take on average for the flow from one upstream gauge station to reach a neighboring downstream gauge station? I think these values should be taken into account for the history window size (24 hours currently) and the prediction horizon (6 hours currently). 

- What is the sample size for each gauge station? Or more importantly, what is the average number of flood events that each gauge station experience in the data? If the sample size is small, it could be beneficial to just pool all the data and train one model for the time series prediction of rare events, than to spread the rare event samples across multiple stations and try to model their correlations. Could that be a reason why the incorporation of network structure is not helping here? 

- The authors claim that the methods perform similarly. From Table 2 it seems to be true for NSE, but MSE has a lot more variations across the different methods. Why is this the case?",0,
1JuMFjSkpD,8JtY,"**Summary**: This paper promotes the independence of the sensitive attribute and the predicted label to enhance fairness in classification by using (sample) distance covariance as a penalty term. The authors not only provide theoretical analysis on the convergence and sample complexity bounds for the estimation of distance covariance and mini-batch computation, but also numerical results on UCI tabular and image datasets.

**Soundness**: 3 good

**Presentation**: 3 good

**Contribution**: 3 good

**Strengths**: 1.	Distance covariance seems to be a very interesting metric for the dependency between two random variables and an active research area. So this paper which connect the distance covariance with fairness notion is very timely.
2.	This paper provides theoretical background, consistency and sample complexity bounds for the distance covariance
3.	Section 3.3 that connect DP and EO with the nature of dependency and distance covariance is well written.
4.	The numerical results are abundant, including tabular and image data. The experiments on image data illustrate the scalability of the proposed algorithms, unlike common experiments only on small-scale tabular datasets.

**Weaknesses**: 1.	The intuitive explanation for the distance covariance and its existing usages in statistics are not well explained.
2.	There are some missing references for fairness interventions, which I encourage the authors to include and compare, for example, 
a.	Lowy, A., Baharlouei, S., Pavan, R., Razaviyayn, M. and Beirami, A., 2021. A stochastic optimization framework for fair risk minimization. arXiv preprint arXiv:2102.12586.
b.	Alghamdi, W., Hsu, H., Jeong, H., Wang, H., Michalak, P., Asoodeh, S. and Calmon, F., 2022. Beyond Adult and COMPAS: Fair multi-class prediction via information projection. Advances in Neural Information Processing Systems, 35, pp.38747-38760.
3.	Despite that the distance covariance is interesting, the reason why it is potentially a better metrics than other information-theoretic quantities such as mutual information and the Renyi maximal correlation is unclear to me. Distance covariance, MI and the maximal correlation are all zero when two random variables are independent; however, the maximal correlation satisfies Renyi’s 6 postulates for a good measure of dependency. It is encouraged that the authors spend more space to discuss the pros and cons regarding the dependency metrics, give illustrations on why one is better than the other and hopefully provide a simple numerical example.
4.	In the experimental results (Table 1 and 2), it seems that the proposed results consistently have higher accuracy and lower fairness violation. However, the proposed result is not too different from other methods as most of them are in the Lagrangian form, i.e., CE loss plus fairness/ independence constrains. It is encouraged that the authors explain clearly why the proposed method could lead to a consistently better acc-fariness trade-off point than other methods.

**Questions**: Please refer to Weakness. I will consider raising the scores after the rebuttal period if the authors could address the weaknesses.",0,
eoB6JmdmVf,cgRE,"**Summary**: This paper describes a re-analysis of two datasets of fMRI responses of a story from the Moth Radio Hour. In one cases, subjects listened to the story and in the other case the subjects read the story. The authors attempt to predict responses to these stories using text-based, large language models (BERT, GPT2, FLAN-T5) and audio-based, speech models (wav2vec2.0 and Whisper) using standard, regression-based voxelwise encoding models. They compare the prediction accuracy of these models with variants where they have regressed out the contribution from text-based features, audio- and speech-based features, and low-level visual features. They find that text- and speech-based models show similar overall prediction accuracy in early visual and early auditory regions, while text-based models show superior performance in putatively higher-level visual and language regions. They find that the performance of text-based models is relatively robust in higher-level language regions, maintaining relatively good performance when controlling for text, audio/speech, and visual features, consistent with a response to higher-level linguistic properties. In early sensory areas, there is a greater impact of controlling for these features suggesting that these lower-level features predict more of the response variance, as expected. The trends for the speech model are mostly similar, with the biggest difference being that the controlling for audio and speech features hurts performance more in high-level, language regions, suggesting that these models are not predicting high-level, linguistic properties.

**Soundness**: 3 good

**Presentation**: 3 good

**Contribution**: 2 fair

**Strengths**: I think directly comparing text and speech models is a valuable contribution to the literature.

The dataset investigated, with matched responses to reading and listening, is interesting and relevant to the questions being addressed. 

I think there is value in highlighting the problem of feature correlations, which this paper does well.

They test a large set of control features. Their controls are more comprehensive than most papers I have seen.

**Weaknesses**: The conclusion that lower-level features explain a large portion of the variance in lower-level sensory areas is not surprising. The fact that high-level language regions are more robust to lower-level features in the context of text-based language models is also not surprising, since the response is higher-level and the model does not have any visual or auditory information baked into it. The fact that speech models are impacted by including audio and speech features is not as surprising as the authors suggest, since the models are taking audio as input and the representations are only being derived from 2-second stimuli and thus lose much of the higher-level language information that is present in a text-based model. This paper is essentially confirmatory and should be framed as such, in my opinion. 

I don’t see the benefit of the “direct” approach compared with measuring the unique variance of different feature sets using some kind of a variance partitioning framework such as that promoted by the Gallant and Huth labs. Conceptually, the main thing one wants to know is what fraction of the neural response variance can be uniquely explained by a particular model and how much is shared with other models. The direct approach seems like an indirect way to address that question. I also find the term “direct” unclear? What is direct about it? What would the indirect approach be?

There is no attempt to understand whether text- and speech-based models account for shared or unique variance from each other, which seems important and natural in the context of this paper. 

There needs to be more detail on the speech models. Minimally, there needs to be a summary of the tasks (e.g., masked token prediction) they were trained on and the maximum possible temporal extent that the models are able to consider. If possible, the authors should extend the window they consider to go beyond 2 seconds to allow the models to potentially incorporate longer timescale linguistic information.

Averaging performance across models is suboptimal because some of the models might be performing quite well, which would be valuable to know. For example, Whisper has been trained on a much broader range of tasks than wav2vec2.0 and it would be useful to know whether it performs better as a consequence. A better choice would be to select the best performing model for the main figure and to put the performance of all models in the appendix. Model selection could be done on training or validation data to prevent overfitting.

It is unclear how activations from different layers were handled. Were they all combined together? Typically, one selects the best-performing layer in a model using the training or validation set. 

The authors need more detail about the stimuli. They should specify the total duration of the story(ies) in the listening and reading conditions, how many words there were, how the words were presented, and the rate they were presented at. For example, for listening, was this a natural story with a variable word rate? Or were the words presented artificially using a fixed ISI? If they used a variable word rate, how does this impact how the features were calculated? Downsampling does not seem straightforward in this case. For the reading condition, how was the text presented? Was there a word presented every few hundred milliseconds or was a whole sentence presented at once? Similarly how does this impact the feature design?

The language ROIs includes many regions of the STG that I would consider high-level auditory regions (e.g., respond similarly to native and foreign speech). I would recommend repeating the analyses with the language parcels released by the Fedorenko lab, or at least limiting yourself to the STS. For the early auditory analysis, I think it would be worth repeating these analyses with just the A1 parcel to be more conservative. 

The visual word form area is quite small and challenging to localize:
https://www.pnas.org/doi/abs/10.1073/pnas.0703300104

I suspect the results here reflect what one would see from a generic high-level fusiform visual region. The authors could test this by selecting another nearby region and seeing if the results differ. If the results are similar, I think it is misleading to describe the results as specific to the visual word form area, despite the label provided by the atlas.

I could not follow how the noise ceiling is calculated. What is done with the results from all of the different subsamples? Is there some attempt to extend the results to infinite samples? I am skeptical about calculating a noise ceiling in V1 or A1 for the non-preferred modality. I would expect the noise ceiling to be very close to 0. How was this handled? When possible, it would be preferable to plot both the raw scores and the noise ceiling on the same figure so that you can see both. When you average across voxels for ROI analyses, do you average the noise-corrected values or do you separately average the raw and noise ceiling values and then divide these two numbers?

For Figure 3, it would be preferable to group by the listening/reading as was done in later figures. The performance between the modalities is not really comparable as these are totally different stimuli (and I am skeptical of the noise ceiling calculation). 

The equations in the section title “Removal of low-level features from language model representations” make it seem like there is a single regularization term for all of the model features. It seems preferable to do what was done for the neural analyses and to fit a banded ridge model separately on every model feature. The different low-level features have very different dimensionality, so there should be some discussion of how this was handled when concatenating the features. If you z-score each feature than features that have higher-dimensionality will have much more influence. It was also not clear to me how cross-validation was handled here. Did you train and validate on a subset of stimuli and then remove the predicted response on test? How many folds were there? This information about cross-validation should also be specified in the voxel-wise encoding model section. For the banded ridge regression, were the lambdas specified separately varied for each feature set? How do we know that this range is sufficient? It is highly sensitive to the scale of the features. How fine was the grid search?

**Questions**: In most cases, I found it easier to include my questions in the weaknesses section. See above. 

What was the reason for constraining the text window to 20 words? How are the results impacted by this choice?

What is the reason for not removing the control features from the neural responses as well? How would doing so impact the results?",0,
DsEhqQtfAG,Q4Mc,"**Summary**: This work proposes the Decomposed Diffusion Sampling (DDS) method as a Diffusion model-based Inverse problem Solvers (DIS) for inverse problems in medical imaging. The work is based on the observation that a diffusion posterior sampling (DPS) with the manifold constrained gradient (MCG)  is equivalent to one-step projected gradient on the tangent space at the “denoised"" data by Tweedie’s formula, this work
provides multi-step update scheme on the tangent space using Krylov subspace methods. The experiments show that performing numerical optimization schemes on the denoised representation is superior to the previous methods of imposing DC. Further, the work devises a fast sampler based on DDIM that works well for both VE/VP settings. With extensive experiments on multi-coil MRI reconstruction and 3D CT reconstruction, it was shown that DDS achieves superior quality while being ≥ ×80 faster than the previous DIS.

**Soundness**: 3 good

**Presentation**: 2 fair

**Contribution**: 3 good

**Strengths**: The main strength of the work is the novel theoretic insight: the geometric interpretation that the diffusion posterior sampling (DPS) with the manifold constrained gradient (MCG)  is one-step projection to the tangent space of the clean data manifold, and using Conjugate Gradient type method, one can achieve multiple steps projection within the tangent space. All the theoretic proofs are given in details, all the lemmas are well formulated and clearly explained. The experimental results are convincing.

**Weaknesses**: The work involves both Krylov space theory and diffusion model, it will be more helpful for general audience to give brief overview of both theories in the appendix. Especially. the key observation : DPS with MCG is a one-step projection to the tangent space of the clean data manifold. Furthermore, it needs more explanation for the problem : why Krylov space method can guarantee to stay in the tangent space.

**Questions**: 1. Do we need the  assumption that the clean data manifold is an affine subspace ? How about general curved manifold ?
2. Why Krylov space method guarantee to stay in the tangent space ?
3. Please elaborate the geometric view of diffusion model more. Is the diffusion process on the manifold or in the ambient Euclidean space ?",0,
MrslLZmkye,cBKA,"**Summary**: This paper presents a generative adversarial training method leveraging a Wasserstein-score to enhance Out-of-Distribution (OoD) detection accuracy. The approach simultaneously undertakes data augmentation and exploration using a limited set of OoD samples. Additionally, the study offers theoretical assurances, confirming that the optimal solutions derived from generative model can be statistically realized through adversarial training in empirical scenarios.

**Soundness**: 3 good

**Presentation**: 3 good

**Contribution**: 2 fair

**Strengths**: The method employs a unique exploration strategy to identify regions where the model lacks confidence. By focusing on uncertain regions, SEE-OOD achieves superior performance in detecting OOD samples compared to existing methods. The paper presents extensive experiments and benchmarks to validate the effectiveness of SEE-OOD against other state-of-the-art techniques.

**Weaknesses**: 1. The paper does not provide visualizations of the generated outliers, which could offer more intuitive insights into the model's behavior and decisions.
2. The evaluation metrics employed in the paper miss out on including the Area Under the Receiver Operating Characteristic (AUROC), which is crucial for understanding model performance in classification tasks, especially in OOD detection.
3. While the paper presents results on certain datasets, it would benefit from testing on larger and more diverse datasets to ensure the method's generalizability and robustness.

**Questions**: Please address the weaknesses I've highlighted above.",1,"[""While the paper presents results on certain datasets, it would benefit from testing on larger and more diverse datasets to ensure the method's generalizability and robustness.""]"
wpuQonyeXN,pp1B,"**Summary**: This paper studies quantum RL, which provides sample complexity for both tabular MDP and linear mixture MDP, based on several quantum estimation oracles. Compared with previous literature, this paper provides an online exploration method for quantum RL.

**Soundness**: 3 good

**Presentation**: 3 good

**Contribution**: 3 good

**Strengths**: - This paper is well written and easy to follow
- Study quantum RL is novel in the literature, with limited prior works
- The proposed online exploration paradigm is more practical than previous work.

**Weaknesses**: - The discussion on sample complexity is not enough. For example, it would be better to discuss why the $\sqrt{T}$ factor is removed. Is that because of Lemma 3.1 and Lemma 3.2 such that the previous $\epsilon^{-2}$ sample complexity can be reduced to $\epsilon^{-1}$ sample complexity so that the exploration can be more aggressive? 
- Besides the previous comment, I'm also looking for discussions about the lower bounds (or at least some conjectures). For example, if the dependency on $d, H$ within the lower bounds still match (Zhou et al., 2021) or not?

**Questions**: Besides my concern about the weakness, I'm concerned about the cost of translating a classical RL task into a quantum-accessible RL task. Here are my questions
- Can one directly covert the observation in classical RL to a quantum-accessible RL? (e.g., changing the Atari games to quantum). If the quantum RL can be used in classical RL tasks, then how would the current $\log T$ bound break the classical $\sqrt{T}$ regret bound?
- If the current algorithm can only be used in quantum-accessible RL, and we cannot convert a classical RL task into quantum, then how will this algorithm contribute to real-world RL tasks?",0,
EvBx5whpzJ,2FMY,"**Summary**: In this paper, the authors study the blurred segmented time series (BST) data prediction problem. The authors theoretically clarify the connotation of valuable contextual information. Based on these insights, prior knowledge of BST data is incorporated at the data and class levels into the model design to capture effective contextual information. Moreover, the authors also propose a label consistency training framework to harmonize inconsistent labels. The authors have performed extensive experiments on real datasets to demonstrate the effectiveness of the proposed method in handling the time series classification task on BST data.

**Soundness**: 3 good

**Presentation**: 3 good

**Contribution**: 3 good

**Strengths**: 1.	The authors propose a new framework to handle the time series classification task on blurred segmented time series data.

2.	The authors provide some theoretical analysis about the connotation of the valuable contextual information.

3.	In the proposed framework, prior knowledge of the BST data at both the data and class levels are incorporated into the proposed model to capture the effective contextual information.

4.	The authors have performed extensive experiments on 3 real datasets to demonstrate the effectiveness of the proposed method.

**Weaknesses**: 1.	Some assumption of the proposed method seems a little strong. In Section 3.2, for the prediction behavior constraint, it is assumed that consecutive time segments span at most 2 classes within a suitably chosen time interval. The time interval may have a big impact on the model performance. However, it is not clear how to choose a suitable time interval for each dataset. The authors also need to perform experiments studying the impacts of the time interval on different datasets.

2.	The experimental analysis seems not consistent enough. In Figure 3(b), the analysis about random disturbance is studied on fNIRS and Sleep datasets. In Table 3, the ablation studies are performed on Sleep and SEEG datasets.

3.	The experimental analysis is not sufficient. Compared with existing methods, one advantage of the proposed method is to exploit the prior information at both the data and class levels. The authors are suggested to perform experiments studying the performance of the proposed method with only considering the prior information at data level and class level respectively.

**Questions**: As discussed in Section 3.2, the time interval may have a big impact on the model performance. How to choose a suitable interval for each dataset?",0,
wpuQonyeXN,AteR,"**Summary**: This work studies quantum reinforcement learning, where quantum means that the classical reward and state transition feedback is replaced by quantum pure states (see Eq. (3.1) (3.2)). This paper studies both the general MDP and linear MDP, and it shows that they can achieve logarithmic regret performance, which breaks the classical square-root regret lower bound.

**Soundness**: 3 good

**Presentation**: 3 good

**Contribution**: 2 fair

**Strengths**: - The authors did a good job of presenting this work and comparing it to known literature.

**Weaknesses**: - Lack of novelty. I am familiar with the work of Wan et al 2022 for quantum multi-armed bandits and quantum linear bandits. As the main theoretical tools for multi-armed bandits and linear bandits are considerably similar to RL and linear RL respectively, this paper can be regarded as an extension from Wan et al 2022 to quantum RL. Although the author pointed out one new challenge in Remark 5.1, I did not see enough novel contributions in this work.


---
Zongqi Wan, Zhijie Zhang, Tongyang Li, Jialin Zhang, and Xiaoming Sun. Quantum multi-armed
bandits and stochastic linear bandits enjoy logarithmic regrets. In To Appear in the Proceedings
of the 37th AAAI Conference on Artificial Intelligence, 2022. arXiv:2205.14988

**Questions**: Although leaning toward a negative evaluation of this work for its lack of contribution, I think this quantum RL topic is interesting and would suggest that the authors look into challenging issues around this topic, e.g., regret lower bounds for quantum RL which is not studied in quantum bandits in Wan et al 2022. 

If the authors think there are other nontrivial challenges (except for Remark 5.1) in this work than in Wan et al 2022, please take the chance of rebuttal to explain.",0,
WBkjxnYAgy,HNdL,"**Summary**: This work aims to decode predictate-argument structure from fMRI recordings of participants viewing text and video stimuli. The data was specifically recorded for this study and the authors say that they will release it upon publication. Binary linear decoders are trained to predict whether a specific concept or a concept pair from the set {(subject, verb); (verb, object), (subject,object)} is present or absent in a specific brain recording. The results show that the concepts and concept pairs can be decoded from fMRI corresponding to videos and across participants, and concepts but not concept pairs can be decoded from text.

**Soundness**: 2 fair

**Presentation**: 2 fair

**Contribution**: 2 fair

**Strengths**: - Investigate two sensory modalities in the brain
- Will release data and code upon publication

**Weaknesses**: 1. Severe lack of clarity in many parts of the manuscript, see specific questions below. This really hampers understanding the contributions of the work.

2. Writing and structure can be much improved. Some parts of the manuscript are too brief and lack motivation (e.g. the stimulus design, motivation for looking across multiple modalities, background on what should be expected for the visual modality, what the cited related work is actually doing instead of just listing the references). Other parts are too detailed for the main paper of a submission to a machine learning venue (a whole page is spent on data collection and preprocessing). The writing also comes off at times as a bit patronizing but also too informal. A good example is this excerpt “With so few samples of such large dimension, even such a simple model will overfit. Any more complicated model will overfit even more. This is the nature of neuroimaging. Data is expensive. Scanner time is $600/hr. Adding in subject payments and salary of primary and secondary scanner operators, each data point costs over $6. Moreover, data is tedious to obtain.”

3. It’s not clear to me that an ML venue is the best place for this submission. There is no innovation on the methodology, and the results are entirely neuroscience-focused, so it seems that a neuroscience audience will be better able to give feedback and appreciate this work.

**Questions**: Q1. In the current analysis setting, ""Scott pick up"" would be considered the same as ""pick up Scott"" but those two have different predicate argument structures. I would like the authors to comment on how their work studies predicate-argument structure and not just concept co-occurrence.

Q2. Stimulus design:

a. What is the motivation for showing a pair of concepts in one stimulus? 

b. Can you further explain how showing the text stimuli in “random fonts, point sizes, and positions in the field of view” is increasing the likelihood that you decode the concept semantics and not visual characteristics? If I understand correctly, each text stimulus is then very likely to have a different combination of font, size, and position, which actually makes it easier to decode if the decoder depends on visual properties. 

c. For the stimuli that did not have a subject (e.g. pick up briefcase), how was the video created?

Q3. Data splitting:

a. What “subsetting” was actually done? It seems that Section 3.5 is aimed to explain something but it’s not coming across. Can the authors explain in simple language how what data was trained on and tested on to answer each of the questions?

b. The way the supertrials are implemented seems quite unfair for binary classification. The “present” trials all have a specific concept in common, so averaging over them can reinforce this concept. The “absent” trials don’t necessarily have anything in common, other than not having a particular concept. So averaging over them can destroy important semantic information.

Q4. Analysis choices: please discuss your motivation for the following choices and how you would expect changes in those choices would affect the results:

a. binary classification vs multi class classification

b. the grouping intro super trials

c. the hold out strategy",0,
6jBNQ8nSxA,aN2M,"**Summary**: This paper proposes a new security patch detection framework called LLMDA (Low-Level Malware Detection Algorithm) that leverages large language models (LLMs) and multimodal input alignment. the paper makes notable contributions in advancing security patch detection through an innovative multimodal framework powered by LLMs and contrastive learning. The results highlight the potential of language-centric techniques in this application domain.

**Soundness**: 3 good

**Presentation**: 3 good

**Contribution**: 4 excellent

**Strengths**: The idea of using LLMs to generate explanations and instructions for patches is novel and creative. 
Prior works have not exploited LLMs in this manner for security patch analysis. 
The overall system design and methodology are well-conceived and technically sound. The ablation studies in particular are thorough. This work makes important strides in advancing the state-of-the-art in security patch detection. The performance gains are significant. 

In summary, this paper makes noteworthy contributions through its novel application of LLMs and represents an important research direction for security. The original ideas, rigorous experiments, and potential impact make it a valuable work.

**Weaknesses**: Only two datasets are used in the experiments. Testing on a more diverse range of projects and codebases would better showcase the generalizability. 
The datasets used are fairly small, with PatchDB having 36K samples and SPI-DB only 25K. For deep learning, these sizes are quite modest. Training and evaluating on larger corpora could lend more statistical power.

**Questions**: 1.Could you provide the exact prompts used to generate the explanations and instructions? This context would help with reproducibility.
2.The ablation study removes one component at a time. How does performance degrade when ablating multiple components together?
3.Can you apply the case study analysis to a larger and more diverse sample of patches? Any insights on patterns?",0,
z7K2faBrDG,AMLy,"**Summary**: This paper shows that the assumption that an observer has an internal representation of univariate parameters such as spatial frequency or orientation while stimuli are high-dimensional does not lead to contradictory predictions when following a theoretical framework. The perceptual scale is found to correspond to the transduction function in this framework and is related to the Fisher information of the generative model underlying perception. The research suggests that the stimulus power spectrum largely influences the perceptual scale. Furthermore, the study proposes that measuring the perceptual scale can help estimate the perceptual geometry of images, going beyond simple distance measurements to understand the path between images.

**Soundness**: 3 good

**Presentation**: 1 poor

**Contribution**: 2 fair

**Strengths**: - A theoretical analysis on the perceptual scale in the case of GRFs was performed.
- Different scaling experiments involving GRF and naturalistic textures were conducted.

**Weaknesses**: - The theoretical analysis is performed for the case of GRFs, which does not apply to naturalistic textures (Note that Gaussian textures are a very limited class of images). Most of the Propositions are special cases of previous work, so what is the theoretical contribution of this work?
- The different scaling experiments only involve a small set of pairs and 5 naive participants. There may be insufficient data for detailed analysis, and actually results were presented in Sec. 3 without any analysis.
- The paper appears to be hastily written with many typos (e.g., page 5: e have -> we have; Fig. 2: the colors are wrong, and the caption's description conflicts with the main body, Page 8: Proposition 2 and Proposition 2; ...) and symbols are not always well-defined or explained.

**Questions**: This looks like careful and sophisticated work at first glance. I did not notice major defects in the paper, to my knowledge. However, the paper is difficult to follow and would benefit from careful editing.
Since I do not have a solid background in this area, I cannot confidently evaluate the significance here.
It may be better if the authors can write their manuscript from the point of view of a general researcher in ICLR.

Additional question: Can the authors explain more on how measuring the perceptual scale helps estimate the perceptual geometry of images? This is claimed in the Abstract but rarely mentioned in the main body.",0,
DWUiUneKMI,cVFx,"**Summary**: This work considers the problem of solving partial differential equations with a physics-informed neural networks (PINNs) approach. The authors propose a novel neural operator architecture, based on the Discrete Hartley Transform, as an alternative to the popular Fourier neural operator, which is based on the Fast Fourier Transform. The main advantage of the Hartley transform is that it allows for a fast convolution (similar to the FFT) but enforces the resulting function to be real as opposed to complex-valued. In a second part of the paper, the authors evaluate the architecture on two partial differential equations: Burger's equation and a diffusion equation.

**Soundness**: 1 poor

**Presentation**: 2 fair

**Contribution**: 2 fair

**Strengths**: - The approach of the authors that incorporate the Hartley transform into neural operators appear to be novel.
- The Hartley transform has same computational complexity than the FFT, which allows for a fast evaluation of the neural operator.

**Weaknesses**: - One of the main weaknesses of the paper is the motivation for the Hartley transform. The authors mention that the main limitation of the FNO is that it ``provides suboptimal solutions'' because the resulting functions can be complex-valued. However, I do not see any evidence supporting that this could be a limitation in practice as one could simply take the real part of the network since the output will likely have a small imaginary part if the loss function is small after training on real data.
- The authors only consider problems where the PDE is known so there is little motivation for using a PINO approach over PINN (or even simply a traditional numerical method like finite-difference or finite elements). The experiments must include a comparison against a PINN technique.
- The numerical experiments in the paper do not demonstrate that the proposed architecture improves the performance of the Fourier Neural Operator, as acknowledged by the authors for Burger's equation and observed in Fig. 1.

**Questions**: - In the 2nd paragraph, what do the authors mean by suboptimal solutions?
- The last paragraph on p.3 has citations not rendered.
- Eqs. (5-7) should include the domain on which the PDE is defined, as well as a proper definition of the variables, and initial/boundary conditions.
- The loss functions in Eq. (8) should be mathematically defined.
- The discussion regarding the activation function in Section 3 requires references for claiming that GELU is preferred in physics-informed deep learning.
- In the last paragraph of section 3, it is inexact to state that the numerical scheme provides exact solution. What is the advantage of using the present approach over the numerical techniques, which are much faster and have convergence guarantees?
- In Figure 1, 4th column, why is the absolute error negative?
- Table 2: these results should be repeated over several training runs and contain deviation errors. Is it statistically significant to mention the ``reduced reconstruction loss"" in the first paragraph of p. 7 if the difference only appears on the 4th digit?
- In the 2nd paragraph of the discussion, the author mention that their architecture achieve a ``more random trajectory'', this is a vague term which should be make precise.
- In the last sentence of the discussion: the authors mention that the activation function made no difference. I am wondering why it is discussed in the paper, which is about using the Hartley transform for neural operators.",0,
dcjtMYkpXx,PF7E,"**Summary**: The authors tackles the over-optimization issue in RLHF with reward model ensembles. This is achieved by training multiple reward models, each with identical data but different random seeds. These reward models are then used for ensemble-based optimization objectives, including worst-case optimization (WCO) and uncertainty-weighted optimization (UWO) for best-of-n sampling (BoN) and PPO (proximal policy optimization). Combined with a 25% label noise to mirror real-world settings, the authors show UWO and WCO can effectively mitigate overoptimization.

**Soundness**: 3 good

**Presentation**: 4 excellent

**Contribution**: 3 good

**Strengths**: * **Simple yet elegant approach**: the authors only used different random seeds to train the reward function, yet it can significantly improve gold score performance, especially with BON.
* **Reproduced overoptimization with OSS (open-source software) models**: the authors are the first to study and reproduce RM overoptimization with open-source models.
* **Experiments across multiple model sizes / data sizes**: the authors conducted a comprehensive analysis, which offers insights.

**Weaknesses**: **Lack of account for random seeds**: the results do not look smooth for Figure 4 / Figure 8. The results could be improved by running with at least 3 random seeds and recording the error bars.

**Questions**: > we use the complete dataset of 46, 000 prompts for reward model training and train all the reward models for 5 epochs. 

>We use all the available training data for the training of every reward model as
training on lesser data results in higher validation loss and poorer performance (Lakshminarayanan
et al., 2017). Unless stated otherwise, we train an ensemble consisting of five reward models.

Why train for 5 epochs? What is the training and the validation accuracy of the reward model? In Figure 15, can you show me the epoch as the x-axis? Steps can be confusing as it is related to the batch size / gradient accumulation / how you log.

> we only evaluate BoN for a maximum of nmax = 12, 500 samples , which roughly equals 8.4 nats of KL

Why nmax? Does this mean for some of them he use less than 12500 samples?

> we train for 3000 PPO steps.

How many episodes?


> Figure 3

Why do you have two y-axis? Do the proxy RM and the gold RM have different RM scales? How are these scales defined?


> Figure 6 and 7 

Figure 6 and 7 seem contradictory? KL penalty = 0 gets 0.15 gold score, but in figure 7 KL penalty = 0 gets 0.03 gold score?

> Figure 8

Why does PPO underperform BoN in 1.3B setting according to Gold Score? Gao et al (2023) show 1.2B PPO outperforms 1.2B BoN

> Appendix C

How does these two KL distance calculation Gold RM performance?


> We train a fixed number of proxy reward models using
identical data and hyperparameters but with different random seeds

But I assume data shuffling is different? There are really multiple random seeds:

* query dataset seed
* reward model seed
* policy model seed


> Figure 4
Why is single RM experimented with KL=150 but not the other types?",0,
BUNkXMwfXL,fHcz,"**Summary**: This study elucidates that the training stability of diffusion models stems from the noise-to-data mapping's stability and the smoothness of the loss landscape. Building on these insights, the research introduces a Curriculum Learning based Timestep Schedule (CLTS) and Momentum Decay with Learning Rate Compensation (MDLRC), effectively doubling the training speed of diffusion models.

**Soundness**: 3 good

**Presentation**: 3 good

**Contribution**: 2 fair

**Strengths**: (1) The paper is clearly-written with experimental results supporting the claims.

(2) The acceleration techniques introduced in the study successfully enhance the training speed of diffusion models, surpassing prior methods.

**Weaknesses**: (1) The paper's claims, while supported by experimental results, lack robust theoretical analysis. This reliance on experimental evidence alone may render the claims less persuasive, as experimental outcomes can sometimes be situational or coincidental.

(2) The proposed CLTS acceleration technique appears conceptually similar to progressive noise schedules, which incrementally adopt denser noise schedules over crucial timesteps for generation—a concept employed in other works [1]. Additionally, the momentum decay technique does not emerge as a novel approach in the training of diffusion models.

[1] Song, Y., Dhariwal, P., Chen, M. and Sutskever, I., 2023. Consistency models.

**Questions**: (1) In equation (5), why $\epsilon_{\theta} \rightarrow I$, as $\epsilon$ is a Gaussian random variable with mean $0$?  $\epsilon_{\theta}$ should approach zero.

(2) Can the author elaborate more about how to tell from Figure 8 that the diffusion model generates structural information when $t\rightarrow T$?",1,"[""The paper's claims, while supported by experimental results, lack robust theoretical analysis. This reliance on experimental evidence alone may render the claims less persuasive, as experimental outcomes can sometimes be situational or coincidental."", 'Additionally, the momentum decay technique does not emerge as a novel approach in the training of diffusion models.']"
dONpC9GL1o,8Qv3,"**Summary**: This paper provides a theoretical analysis of why decoding from language models with truncation sampling works well and provides a new decoding strategy called BAT-sampling. The gist is that if you truncate a sufficiently large portion of the distribution, then truncation sampling will avoid any tokens which are not in the support of the true language distribution. However, this may result in throwing out tokens which *are* in the support of the distribution. The paper then proposes BAT-sampling, which is an LP-based method that can determine which tokens are and are not in the support of the true distribution, even if they are ""out of order"" in terms of the probability assigned (i.e., the method can determine that lower-probability tokens are in the support of the distribution even when higher probability tokens are not). The paper concludes with a set of experiments, including an impressive discussion of speedups for BAT-sampling, as well as some (very slight) improvements over existing methods in certain conditions.

**Soundness**: 3 good

**Presentation**: 3 good

**Contribution**: 3 good

**Strengths**: This is a great analysis paper, providing an interesting explanation for why truncation sampling works so well in language model decoding. The paper's motivation is clear and well-written. The fact that BAT can determine that some tokens have nonzero true support, even though they are assigned less probability than others which are not in the support of the true distribution, is a surprising and compelling result. Leveraging the softmax bottleneck is a clever trick here and one that will be unexpected to most readers in NLP. 

I expected BAT to be computationally infeasible to run in practice due to its dependence on an LP-solver at each tilmestep of decoding. However, the speedups in the ""Basis-aware threshold sampling in practice"" (namely, using a decomposition of the softmax matrix and only relying on BAT when a token under the threshold probability is chosen) seem reasonable and compelling, and the amortized cost of 0.1s/token, while slow, is not infeasible for certain classes of applications.

The experiments, although not particularly compelling as a reason to start using BAT sampling in practice, seem reasonable and sufficiently thorough. In particular, the analysis of performance as more constraints are added back (after the SVD) is very clear. In contrast, I did not find the ""BAT outperforms all other methods for GPT-2-Large"" paragraph very compelling given that BAT is not the best-performing model on any other model size.

**Weaknesses**: The primary weakness seems to be the performance of BAT compared to other methods. Despite its theoretical justification, it does not clearly outperform other sampling approaches (Figure 5). Although there is a preference for BAT to eta-sampling shown in Figure 6 and Table 1, this preference is very slight and the comparison is only between two sampling methods. However, I do not see this weakness as a legitimate reason to reject the paper, since its main contribution seems to be analysis and theoretical understanding of existing decoding algorithms.

**Questions**: 1. Based on the figures (1,4), it seems like BAT is rejecting a lot of tokens corresponding to partial words. Out of curiosity: is this true, and do you have any insights into why this happens, or other qualitative insights into what tokens tend to get accepted/rejected?",0,
uizIvVBY8P,1EUX,"**Summary**: Many anomaly detection papers assume that only normal instances are present in the training, and train the model unsupervised.
However, in the real world, there are situations where even a few labeled abnormal instances are available.
In this case, studies have shown that even a very small number of anomalies can significantly improve the performance of the detector.
In addition, anomaly detectors are often trained under the assumption that the data distribution is stationary, but in real-world deployments, the distribution changes over time.
Therefore, the authors propose a supervised anomaly detection method using continual learning.
The method consists of a Variational AutoEncoder (VAE) and a binary classifier.
The VAE uses the reconstruction error to determine whether the input data is an unseen anomaly, and the binary classifier determines whether the input data is a seen anomaly, and calculates the anomaly score by aggregating the results of both models.
In addition, the VAE's decoder is used to generate data, which is then used for generative replay to prevent catastrophic forgetting in continual learning.

**Soundness**: 3 good

**Presentation**: 3 good

**Contribution**: 2 fair

**Strengths**: - The paper is well organized and the notation is easy to follow.
- The structure of the model and the organization of the methods (such as loss) are theoretically clean and natural. The authors naturally integrated supervised anomaly detection with continual learning.
- The proposed method works with various types of input data, such as images and tabular data.
- It is impressive that the method utilizes CVAE and a binary classifier to learn the process of generating rare abnormal instances by gradient descent, which is then used for generative replay.

**Weaknesses**: The main weakness is that experimental results do not sufficiently support the superiority of this method.

- On tabular datasets such as UNSW, bank, and credit, the model does not significantly outperform the other baselines. In many cases, the performance is similar to that of the binary classifier, suggesting that the performance is due to the binary classifier included in the method rather than the proposed method.
- The experimental baselines are too simple. BC and VAE are components of the proposed method, and there are many methods that might outperform DevNet and Deep SAD, at least in the image domain (Liu et al., 2023). Many anomaly detection methods in the image domain are not designed for continual learning, but since EWC and A-GEM can be applied, it would be meaningful if the proposed method outperforms in this setting.
- In the image domain, the proposed method shows better performance than other baselines, but it seems that experiments on larger datasets are needed to show the practicality of the proposed method. The method was only tested on FMNIST and MNIST with MLP structure, but it would be useful to test it on larger datasets such as CIFAR10 and CelebA.
---
**Liu et al.** [Deep Industrial Image Anomaly Detection: A Survey](https://arxiv.org/abs/2301.11514). *arXiv*, 2023

**Questions**: - Similar to other continuous-learning papers, it would be nice to be able to see how performance changes with additional training on each task.",0,
Gdm87rRjep,QUfQ,"**Summary**: This paper presents an interesting prompt engineering observation about efficient LLMs. Performance improves by adding a hard-coded prompt telling the LLM to reconsider its solution because it is compressed! The authors build on that observation by performing _transferable_ prompt tuning on a number of compressed (quantized/pruned) LLMs from the OPT/LLAMA family of models.

**Soundness**: 3 good

**Presentation**: 3 good

**Contribution**: 2 fair

**Strengths**: The main benefit claimed by the authors is that the tuned prompts are domain-agnostic. That is, they can be used on different models, datasets, or tasks quite easily because they are related to the fact that the model is compressed, not specifically-tuned for any specific domain.

**Weaknesses**: The main weaknesses relate to a lack of wider context and evaluation of the presented method. For example: how expensive is prompt tuning? By creating transferable prompts, how much time/effort are we saving? How does accuracy compare to conventional prompt tuning (this is a key missing comparison). How does the presented method peform on other model families? (OPT and Llama are highly related).

Without comparison to other prompt tuning methods, it is hard to put the current results in the needed context.

**Questions**: please see weaknesses above",1,"['How does accuracy compare to conventional prompt tuning (this is a key missing comparison). Without comparison to other prompt tuning methods, it is hard to put the current results in the needed context.']"
97Dl82avFs,2NmJ,"**Summary**: The paper presents a method to generate accessibility captions for images shared on Twitter. The proposed approach combines CLIP embeddings for the images, as well as additional context that is included in the tweet’s text to create an embedding that is then fed to GPT-2 to generate the accessibility description. The paper’s evaluation demonstrates that the proposed approach can outperform naive and neural-based approaches like ClipCap and BLIP-2.

**Soundness**: 2 fair

**Presentation**: 2 fair

**Contribution**: 2 fair

**Strengths**: First of all, I would like to applaud the authors for working on this important and timely problem. I believe that this research is very important and can have the potential to improve the lives and online experience of many people with visual impairments. Overall, I believe that this research focuses on an important problem, and there is potential for a big impact. Second, the paper collects a large-scale dataset of images and user-generated accessibility captions from Twitter; this dataset is far bigger than previous research efforts focusing on similar research problems. Third, I believe that the paper’s approach is a simple, creative, and effective method to combine CLIP embeddings, tweet text, and LLMs to generate accessibility captions for images. The paper’s approach is easy to understand and combines important features for generating contextual and useful accessibility descriptions for images shared on Twitter. Finally, I like that the paper evaluates the performance of the proposed approach both with quantitative metrics as well as via a user study that aims to assess how users perceive and compare accessibility descriptions generated from the proposed method and baseline approaches.

**Weaknesses**: I have several concerns with the paper, mainly related to the lack of gold standards for accessibility captions, the lack of important and adequate methodological details, the paper’s evaluation, the paper’s approach to releasing data, and the paper’s ethical considerations.

First, there is a disconnection between the paper’s motivation and how the paper evaluates the performance of the proposed method. I agree with the paper’s motivation that the user-generated accessibility captions are of questionable quality, given that most users are unaware of best practices for generating accessibility captions. On the other hand, however, the paper collects user-generated accessibility captions and treats them as gold standards (i.e., ground truth captions). This is problematic as in the evaluation, the paper compares the generated captions from their approach and compares them with captions that are of questionable quality. Therefore, it is not clear what is the actual performance of the proposed methods. A way to potentially mitigate this issue is to apply the proposed approach to other datasets released by previous research that include gold-standard captions (i.e., captions that adhere to the best practices for generating accessibility descriptions for images).

Second, I am puzzled about how the BLEU@4 score is calculated in the evaluation. To the best of my knowledge, the BLEU score ranges from 0 to 1 and aims to assess the precision of the n-grams included in the generated text compared to the ground truth. In the paper’s evaluation, the paper mentions that the proposed approach has a BLEU@4 score of around 1.8. I suggest to the authors to clarify how they calculated the BLEU scores (e.g., if they used a modified version) and better describe how we can interpret these BLEU@4 values.

Third, the paper lacks important and adequate details on the paper’s methodology. Particularly, the paper refers to several appendices so that the reader can get more information, however, there are no appendices in the manuscript. This hurts the readability of the paper and does not allow us to assess the quality and robustness of the presented results in the paper. I suggest including the appendices so that we can understand how the paper conducted various steps of the research. In particular, I would have liked to read more on how the paper conducted the user study, how they recruited users, what is their background and expertise with regards to the best practices for generating accessibility descriptions, etc. All these details are paramount for understanding the quality of the presented research.

Fourth, I have some concerns about the paper’s approach to releasing the dataset. Given the recent changes to Twitter’s API, it became extremely hard to rehydrate tweets based on their IDs. So by simply releasing the Twitter IDs and the media URLs, interested researchers will not be able to reproduce the paper’s results and further use this dataset for further studying this problem. I suggest to the authors to consider releasing more attributes from the dataset (specifically the tweet’s text) so that interested researchers can reproduce the paper’s results without relying on the closed and expensive new Twitter APIs.

Fifth, the paper does not properly explain how the qualitative assessment is done (in Section 6.4), which does not allow the reader to understand if it’s done in a systematic way or how representative/generalizable the insights are. I suggest to the authors to include more details on how the samples for the quantitative analysis are selected and, more importantly, how the qualitative assessment is undertaken (e.g., are the people experts in the domain of accessibility description generation, are they aware of the best practices, etc.)

Finally, the paper does not discuss the ethical considerations when conducting this research. This is important as the paper conducts a user study and shows participants’ images shared on Twitter. For instance, did the paper ensure that there are no harmful images in the dataset and that no participants were exposed to harmful information?

**Questions**: 1. What is the rationale for using user-generated captions as gold standards and do you have an idea how this affects the presented results?
2. How is the BLEU@4 score calculated and did you use a modified version of the metric? 
3. How is the user-study conducted and what are the background/expertise of the recruited participants? Also, have you obtained an IRB approval before conducting the user study? How did you ensure that participants were not exposed to harmful information?",0,
KJYIgEteHX,Eeae,"**Summary**: This manuscript discusses the distributional robustness of deep learning based MRI reconstruction (solving an ill-posed inverse problem and recovering an underlying sub-Nyquist sampled image). The authors experimented with U-Net-based MRI reconstruction under multiple subtypes of distribution shifts and analyzed their effects on the performance. The authors also argue that more diverse data leads to more robust models.

**Soundness**: 2 fair

**Presentation**: 3 good

**Contribution**: 2 fair

**Strengths**: The problem of identifying and mitigating distributional shifts for deep learning accelerated MRI is of significant real-world relevance. It is critical for building a trustworthy deep learning driven MRI reconstruction system. 

The experiments are performed on a large range of MRI reconstruction datasets with multiple real-world types of distributional shifts (imbalanced data, anatomical shifts, diverse magnetic fields, images from healthy subjects or from patient with health conditions, etc.).

**Weaknesses**: The scientific contributions of the manuscript is limited: despite the detailed analysis and discussion, the distributional robustness of deep learning MRI reconstruction has been discussed by a series of prior works [1-6]. Despite more detailed experiments (imbalanced data and healthy versus disease images), the major conclusions do not go beyond those of early works [1-3]. The authors failed to make significant theoretical and methodological contributions either (while most of [1-6] proposed either theoretical insights and/or methodological contributions). 

The writing needs improvement: The paper is poorly structured, and it does not follow quite well the conventions of ICLR. It is difficult to identify the key arguments and contributions from the text. It is also difficult to grasp the chain of arguments and evidences.

Sec. 2: There is a lack of a brief introduction of essential key concepts: coils and sensitivity maps, sampling masks and accelerations, the signal-processing interpretation of MRI acquisition, problem settings for MRI reconstruction, etc. Missing these key concepts would bring difficulties for readers who are not familiar with MRI reconstruction.

The choice of using a U-Net is over-simplistic, given that the mainstream reconstructions works are based on unrolled proximal gradients with deep cascade networks, variational networks, as well as probabilistic diffusion models, which may also bring stronger distributional robustness due to better inductive biases compared with a plain U-Net. 

[1] https://onlinelibrary.wiley.com/doi/10.1002/mrm.27355 
[2] https://arxiv.org/abs/1902.10815 
[3] https://onlinelibrary.wiley.com/doi/full/10.1002/mrm.28148 
[4] https://arxiv.org/pdf/2011.00070.pdf 
[5] https://link.springer.com/chapter/10.1007/978-3-030-87231-1_21 
[6] https://www.sciencedirect.com/science/article/pii/S0730725X21002526

**Questions**: The authors are encouraged to improve the clarity of the paper: talking about the problem background, existing works and their drawbacks, then the key contributions, in the introduction section. Then, in the following sections, the authors are encouraged to make their arguments clear, and then demonstrate how the experiments support their arguments. 

The authors are also encouraged to bring more theoretical insight behind the observational results, given that distributional shifts in MRI are not a newly identified problem. 

The authors are also encouraged to take the effect of different reconstruction technique into consideration: despite diverse implementation details, these methods can be generally categorized as 1. plain feed-forward networks; 2. unrolled cascaded networks; 3. variational networks, as well as 4. probabilistic diffusion models. The authors may want to consider the effects of inductive biases on distributional robustness.",0,
sk7RRHFk7M,mmJ4,"**Summary**: This paper aims to generate high-fidelity dance videos given three single inputs, reference person image, target pose sketon (2D), target background. This task is similar to traditional pose transfer, while human dance is claimed to be a more challenging task. To this end, the authors propose a control-net based framework, where the foreground and background are taken as seperate inputs. To further strengthen disentanglement, a pretraining strategy is proposed which trains the model on a much larger image dataset. Extensive experiements demonstrate the effectiveness of the proposed method.

**Soundness**: 3 good

**Presentation**: 2 fair

**Contribution**: 3 good

**Strengths**: There are several merits in this work:
1. Using foreground and background segmentations as seperate inputs for pose transferring is new. It seems to provide a neat yet effective solution. I also appreciate the Human Attribute Pre-training. Firstly, it is simple, but effective as well. Secondly, it properly utilizes the large-scale meta data.
2. The quantitative evaluation results are significant. The proposed model outperforms other methods by a large margin.
3. The authors do have conducted sufficient experiments (e.g., comparisons, ablation), as well as exploring further applications such as fine-tuning on one person.
4. Detailed implementation details, and submitted code.
5. The image transfer results look promising.

**Weaknesses**: I am not an expertise in video synthesis, here are my feelings about this work. 

First of all, the authors didn't provide the video demonstration for their work, which is supposed to largely decrease the validity of this work, since the whole highlight is about dance generation. 

1. Though the title is about ""dance generation"", I feel the emphysis of this work, including technical design, is more on pose-based image transfer. There is little thing about ""sequence"" modeling for dance.
2. I won't say the generated dances are realistic (as claimed in title). There are many temporal inconsistency and jittering, though I acknowledge the single image-based editting is realistic. I guess for better video modeling, we should pay more attention on the temporal consistency. (I found the video somewhere else.)
3. Upper-body video generation is kind of limited. Is there full-body dance dataset available?
4. Since in diffusion model, generating each image requires hundreds of steps, I am curious how long does it take to generate a dance video. Will that be a limitation of this work?
5. For video generation, it's necessary to have comparisons with baselines.
6. Some discussion about limitations are desirable.

**Questions**: Please refer to the weakness, and may respond to these questions if applicable.",0,
vW1SkPl4kp,DGv8,"**Summary**: This paper studies the iterated CVaR objective RL under both linear and general function approximations. Additionally, the authors have incorporated human feedback into the framework. The paper offers theoretical guarantees for all the proposed algorithms in various settings.

**Soundness**: 3 good

**Presentation**: 3 good

**Contribution**: 3 good

**Strengths**: This paper is well-written and well-organized. It effectively conveys the high-level intuition behind the algorithm. The paper provides a comprehensive discussion of the theoretical results for iterated CVaR RL in various settings, including linear function approximation, general function approximation, and the incorporation of human feedback. Notably, the paper introduces a novel approximation operator for the true CVaR operator in the linear approximation setting, which could be of independent interest.

**Weaknesses**: The technical aspects of the algorithms appear somewhat limited and are relatively standard in the literature. Additionally, the algorithms are not efficient, and obtaining a solution for the approximate CVaR operator is not easy.

**Questions**: Some typos:

On page 2, in the first bullet, $\sqrt{\alpha^{-H}}$ → $\sqrt{\alpha^{-2}}$

On page 8, it should be $\sigma(x)$ in regularity.

What is $\tilde{\sigma}$ in equation (14)?

Compared with the tabular setting, why the linear case has a bad dependence on H?",1,"['The technical aspects of the algorithms appear somewhat limited and are relatively standard in the literature.', 'Additionally, the algorithms are not efficient']"
bZMyHBSnEI,qyjt,"**Summary**: This paper proposes a deep equilibrium (DEQ) method for multimodal fusion by seeking a fixed point of the dynamic multimodal fusion process and modeling feature correlations in an adaptive and recursive manner.

**Soundness**: 3 good

**Presentation**: 3 good

**Contribution**: 3 good

**Strengths**: (1)	This method innovatively combines multimodal fusion with DEQ framework to iteratively achieve multi-level multimodal fusion while retaining single-modal information
(2) 	The experiments proves the effectiveness of the method, and the ablation experiment is relatively complete. The weight visualization in Figure 3 dynamically perceives modality importance for efficacious downstream multimodal learning, which is intuitive.

**Weaknesses**: 1. The method in this paper is compared with the weight-tied method, which shows that the method in this paper can converge. This is obvious because the method optimizes fθ by the formula z* = fθ(z*,x), and does not impose such a constraint on the weight-tied method with a finite number of layers, and the weight-tied method certainly cannot converge.
2. In the original DEQ paper, DEQ is proposed for memory efficiency, and the effect is similar to that of weight-tied, and it would be better if the article gave a comparative experiment with the weight-tied method. 
3. Some of the expressions in the paper are unscientific and abstract, such as the sentence on page 3:’Our fusion design is flexible from the standpoint that fθi(·) can be altered arbitrarily to fit multiple modalities. It could be better expressed as ‘Our fusion design is flexible from the standpoint that fθi(·) can be altered arbitrarily to fit multiple level features’.
4. The drawing is not intuitive.

**Questions**: When the equilibrium state is reached, why an informative unified representation in a stable feature space for multimodal learning be obtained? What is the relationship between these two? The paper does not give proof.",1,['The drawing is not intuitive.']
bDooTVT4t2,mJxe,"**Summary**: This paper aims at improving certified robustness by randomized smoothing with anisotropic noise. The universal theory for certification with anisotropic noise has been provided. The authors consider three kinds of customizing anisotropic noises, and provide corresponding noise generation methods. The authors conduct experiment to demonstrate that the proposed UCAN method achieve state-of-the-art performance compared to existing randomized smoothing-based methods for certified robustness.

**Soundness**: 2 fair

**Presentation**: 3 good

**Contribution**: 2 fair

**Strengths**: 1. The proposed method on smoothing with anisotropic noise is novel. It is interesting to see the expansion of RS-based methods from isotropic noises to anisotropic ones.
2. This paper provides the theoretical guarantee of certified robustness under anisotropic randomized smoothing, and comprehensive analyses to transform existing randomized smoothing methods to anisotropic cases.
3. Authors consider three different kinds of anisotropic noises, and provide a novel input-dependent one by optimizing $\mu(x)$ and $\sigma(x)$ by a multi-layer neural network.

**Weaknesses**: 1. My major concern of this paper is the potential unfairness in evaluation on UCAN and existing RS-based methods. If I am not misunderstanding, the evaluation criterion is based on scaled radius, which has different weight in each dimension.

I believe this is true because I am surprised to the evaluation results provided in Table 3 that the $l_2$ certification on CIFAR-10 reach over 70% even under radius 1.75. By simple calculation, for commonly-used $L_{\infty}$ norm with budget 8/255, it achieves at most $\sqrt{3072 \times (\frac{8}{255})^2} \approx 1.74$, that means in your UCAN it achieves at least 70% robust accuracy for 8/255 $l_{\infty}$ attacks. This result is unbelievable, because existing SOTA performance of CIFAR-10 robustness may only achieve about 60% if no further data augmentations are conducted (like diffusion model), let alone UCAN is only a certified method based on $l_2$ norm. Therefore, although the paper said they evaluate certified accuracy w.r.t radius, I am doubtful of this claim and I think the authors only consider scaled radius robustness.

However, scaled radius certification seems not a fair criterion for certified robustness. It is reasonable that in some dimensions, the image is vulnerable to adversarial attacks, e.g., contour of an image. Reversely, in some dimension images are intrinsically robust to perturbations like background of the image. Therefore, I believe the corresponding variance $\sigma$ is small when UCAN performs on these vulnerable dimensions, and gain robustness back in some ``unimportant’’ dimensions.

Overall, the evaluation setting of this paper seems differently from existing RS methods. It is a consensus that using $l_p$ norm as constraint for images, the authors should provide corresponding evaluation on standard $l_p$ norm, or at least, provide the explanations or practical scenario on why using scaled radius as the evaluation criterion.

2.	In Theorem 3.2, your certification using the p-norm of $\frac{\delta_i}{\sigma_i},$ but it seems that $\frac{\delta_i}{\sigma_i}$, is a one-dimension scalar as $\delta_i$ is the i-th dimension of perturbation $\delta$. Furthermore, this theorem is seemingly a direct corollary from Theorem 3.1, because your certification divides the variance $\simga_i$ for each $\delta_i$ (not anisotropic anymore?).

3.	There might be missing of some baselines for $l_1$ [a] and $l_{\infty}$ [b, c] certified robustness. It will be better to compare the UCAN with existing certified $l_1$ and $l_{\infty}$ methods.

[a] Levine et al. Improved, deterministic smoothing for L_1 certified robustness. In ICML 2021.
[b] Zhang et al. Towards Certifying L_∞ Robustness using Neural Networks with L_∞-dist Neurons. In ICML 2021.
[c] Zhang et al. Boosting the Certified Robustness of L-infinity Distance Nets. In ICLR 2022.

**Questions**: 1.	Why using 5-layers NN when generating universal/ input-dependent anisotropic noises? Is there some motivations or ablation studies for that?
2.	Could you provide more details on training of universal anisotropic noise? It seems that the variance loss is to optimize $\sigma$ and smoothing loss containing $\sigma$ when optimizing classifier $\theta_f$. I believe the two losses are optimized alternately but not simultaneously.
3.	The authors said that randomized smoothing achieved great success for certified adversarial robustness. Could RS really make classifier robust? Can you provide comparison of RS based model to the SOTA methods for achieving robustness?",0,
QpLuWhiiaH,Y6jJ,"**Summary**: The author proposes a diffusion-based imputation method for tabular data. They tailor four architectures for handling the tabular features, propose a resample framework for enhancing the coherence between observed and imputed data, and propose to adopt DDIM for sampling. The author conducts experiments on seven dataset to demonstrate the effectiveness of DiffImpute.

**Soundness**: 2 fair

**Presentation**: 2 fair

**Contribution**: 2 fair

**Strengths**: The authors conduct a careful exploration of which architectures, MLP, ResNet, Transformer, and U-Net, provide the best performance on diffusion models for modeling tabular data. They also adopt three evaluation criteria for the comparison, which is better than previous work on imputation that only uses MSE as the evaluation criteria.

**Weaknesses**: 1. The overall contribution of this paper is limited. 

All of the content except the transformer conditioning architecture is already known. The architecture design is heuristic, which has no theoretical guarantees of the performance. Moreover, they build upon Variance Preserving (VP) SDE (e.g., DDPM or TabDDPM in tabular data). The author does not mention whether their method work for Variance Exploding (VE) SDE (e.g, Score-based generative model, StaSy [1] in tabular data).
[1]: Kim, J., Lee, C.E., & Park, N. STaSy: Score-based Tabular data Synthesis. ICLR 2023.

2. Overclaiming the contribution of transformer conditioning architecture.

* Diffusion model can work on imputation together with generation (conditional generation) without the proposed transformer conditioning architecture. They are well-studied in the literature [1,2].

[1]: Tashiro, Y., Song, J., Song, Y., & Ermon, S. CSDI: Conditional Score-based Diffusion Models for Probabilistic Time Series Imputation. NIPS 2021.

[2]: Ouyang, Y., Xie, L., Li, C., & Cheng, G. (2023). MissDiff: Training Diffusion Models on Tabular Data with Missing Values. ArXiv, abs/2307.00467.

3. The effectiveness of the proposed method is not well supported.
* : The standard evaluation of imputation performance is the mean squared error of imputed value against oracle value instead of the efficiency criterion used in paragraph ""Machine Learning efficiency - Data imputation"". Otherwise, it faces the problem of ""when the generative model needs to fill in the most significant feature or a feature that has a minimal impact on XGBoost output"" as mentioned in the paper. If the authors adopt the traditional evaluation on this task, many designs in this paragraph will not be needed.

* : To evaluate the performance of TabGenDDPM on imputation tasks, it should be compared with other imputation methods, e.g., [3,4], rather than only compared with TabDDPM.

[3]: Yoon, J., Jordon, J., & Schaar, M.V. GAIN: Missing Data Imputation using Generative Adversarial Nets. ICML 2018.

[4]: Mattei, P., & Frellsen, J. MIWAE: Deep Generative Modelling and Imputation of Incomplete Data Sets. ICML 2019.


* : The author should compare with other diffusion based model on tabular data, e.g., StaSy [1]. Also, some discussion and experimental results of whether transformer conditioning can be developed on Variance Exploding (VE) SDE.

* : The of illumination the experimental setup should be clarified. Currently, it brings some confusion.
- The baseline in Figure 3 stands for which method? In my point of view, it is not the methods mentioned in section 5.2.
- The Table 4 is confusing. In my point of view, three different evaluation criteria have different properties, i.e., the smaller the correlation is, the better the performance is, which is different with privacy risk. Why do the authors use Up arrow/Down arrow beside the name of the dataset. It is also not clear why the authors only report the experimental results on six datasets rather than eight datasets in Table 2. 
- It would be helpful to have the performance on each dataset for Table 3 in the appendix. 

4. Minor

The paper has many typos, e.g., 
- adding period for the caption of Table 1, 3, 4, and Figure 3; 
- what is the meaning of ""4+2"" and ""2(4+40)"" in Table 1; 
- ""in this situation, the generative model can employ the no-missing values to condition the missing data generation."" is hard to understand.

**Questions**: Previous works on imputation tasks, e.g., [1,2,3], train their model on the data containing missing values, which is important for real applications. However, DiffImpute is trained on complete tabular datasets. Why does the author adopt the CSDI [3] architecture for imputation on tabular data? CSDI [3] is also a diffusion based model for imputation tasks. Then, their model can train their model on the data containing missing values.

[1]: Mattei, Pierre-Alexandre and Jes Frellsen. “MIWAE: Deep Generative Modelling and Imputation of Incomplete Data Sets.” International Conference on Machine Learning (2019).
[2]: Yoon, Jinsung et al. “GAIN: Missing Data Imputation using Generative Adversarial Nets.” ArXiv abs/1806.02920 (2018): n. pag.
[3]: Tashiro, Yusuke et al. “CSDI: Conditional Score-based Diffusion Models for Probabilistic Time Series Imputation.” ArXiv abs/2107.03502 (2021): n. pag.",1,"['The of illumination the experimental setup should be clarified. Currently, it brings some confusion.']"
MVe2dnWPCu,K341,"**Summary**: The work proposes a probabilistic modeling approach to efficiently search for the best-fit module composition out of the large discrete space of possible compositions of the modules in continual learning. Depending upon the similarity of the input distributions with that of a previous problem, two variants of the probabilistic model have been proposed: one for perceptual transfer where the prior uses the original accuracy of pre-trained modules to order these, and the other for latent transfer where the prior specifies that pre-trained modules in a path have been used together to solve a previous problem.

**Soundness**: 3 good

**Presentation**: 3 good

**Contribution**: 3 good

**Strengths**: - Good motivation, presentation, and writing. The equations have been explained well.
- The idea of using the validation accuracy for a path as the proxy for its fitness is simple and elegant.
- The limitations of the proposed method have been elaborated well.
- The reported evaluation metrics are rigorous.

**Weaknesses**: Please see the questions.

**Questions**: - On page 3, the authors mention their strategy is based on a generative model of the input x. How is the generative quality of the proposed method quantitatively? Some further evaluation of the proposed method using metrics like ECE can thus be more insightful.

- While I am not very familiar with the up-to-date modular continual learning literature, the baselines in Tables 1-2 look classic to me. Can the authors comment on comparing with more recent works?

- Can the authors compare the computational overhead of their method against the baselines?",0,
hbsvyhznr4,qavn,"**Summary**: The paper propsoed a joint training to imporve the roboustness of image-based maneuvering models. The proposed method improve the roboustness of the two modules jointly, the decoder and steering angle prediction model.

**Soundness**: 2 fair

**Presentation**: 2 fair

**Contribution**: 2 fair

**Strengths**: The paper focuses on an interesting and important issue of deep nueral networks, especially for autnomous driving systems.

**Weaknesses**: There are several issues with the paper:
1- The issue of adversarial attack is well-known in the field. However, it is not well explained how this issue might be the case for application discussed in the paper. I think it is important to provide real-world scenarios for this purpose. 
2- The proposed method has not been evalauted on general benchmarks and as such it is difficulat to understand the effectivenss of the proposed method. 
3- There are several new state-of-the-art adversarial defense mechanisms in the field currently, and they are missed to be included in the paper. 
4- It is diffult to understand what is the main novelity of the proposed method.

**Questions**: 1- How deoes this problem might take palce in real-world scenarios?
2- How does the proposed method comapred with state-of-the-art adversarial training and defence mechanisms?
3- What is the main novelity of the propsoed method? new perturbation training or proposing a new roboust model for steer angle prediction?",1,"['There are several new state-of-the-art adversarial defense mechanisms in the field currently, and they are missed to be included in the paper.', 'It is diffult to understand what is the main novelity of the proposed method. What is the main novelity of the propsoed method? new perturbation training or proposing a new roboust model for steer angle prediction?']"
XTJ0YVBM10,tQwg,"**Summary**: This work introduces a framework, task-oriented asking (TOA), for language models to ask follow-up questions to clarify downstream tasks. To evaluate this framework, this work introduces a QA dataset derived from HotPotQA, constructed by masking one of the supporting fact for answering the question, and expecting TOA to generate a question soliciting the supporting fact. TOA is then evaluated on its ability to ask useful clarification questions which improve performance on the task upon receiving an oracle answer.

**Soundness**: 4 excellent

**Presentation**: 3 good

**Contribution**: 3 good

**Strengths**: 1. This work addresses an (understudied) problem of asking follow-up/clarification questions, and contributes a novel framework for addressing the problem, within the realm of QA.

2. Overall, this paper is also quite well-written, and very clearly lays out its method.

3. The paper very comprehensively articulates its scope and limitations, making note of any caveats where they may arise, and also makes the right comparisons and baselines to try and address any limitations in the evaluation (e.g. comparing against the repeater baseline to address limitations laid out in bullets on page 5). It seems like the authors have worked hard to address any spurious correlation or potential biases that may affect the evaluation. Consequently, the results are quite convincing.

3. The paper also provided very comprehensive ablation studies in section 5, and provided concrete examples of failure cases.

4. The paper explores its TOA method with lots of different models, including open-source models.

**Weaknesses**: 1. The current setup in the paper seems somewhat dataset-specific, and the evaluation is also currently only focused on a single task & dataset. In the introduction, the paper frames TOA as a more generic technique for all tasks (e.g. intro makes reference to legal tasks and states “we propose task-oriented asking (TOA) as a benchmark for language model question generation”, without narrowing the scope to just QA tasks with a single missing supporting fact.) Thus, either the claims in the introduction need to be tempered, or the current evaluation scheme should be broadened to give a sense of how well TOA generalizes to other tasks.

2. More specifically, the answer model is currently restricted to picking between a limited set of facts (one of them being the masked supporting fact necessary to answer the original question, and the others being distractor facts), which likely overestimates performance compared to an answering 
    1. While understandably the authors were trying to simulate an oracle answer model, note that this does not necessarily tell us how well the question-asking model is, and does necessarily simulate a “perfect answerer”. In particular, the task for the question-asking model shifts from “ask a question seeking, specifically, the missing information” to “ask a question that privileges the masked supporting fact as an answer over any of the provided distractor facts”. In the latter case, we don’t even need to guarantee that the question being asked is comprehensible, or would organically be answered with the supporting fact, but simply that the supporting fact seems like a marginally better response than any of the distractors. For example, it could be the case that the input itself carries information about what information is missing and the question was unnecessary, or the question isn’t actually asking for the missing information / only asks for part of the missing information but the masked fact is still the most pertinent answer.
    2. While comparing against the Repeater baseline takes care of some of these concerns, this still does not take away from the fact that there are factors that aren’t explored due to the answer setup. For example, how comprehensible & easy to answer are questions? Would they naturally lead to an answer that contains the correct fact, supposing when didn’t have a constrained set of possible answers? Answering these questions are important if we’d want to generalize beyond the setup here, as generally we do not have access to a set of possible answers.
    3. Indeed, one of the key challenging considerations of asking questions is that the model needs to ask not just a question that recovers the missing information — but recovers the minimal unit of information that is necessary to perform the end-task. Otherwise, we can imagine a question like “tell me everything you know” being maximally useful under the recovery metric for all inputs. The current task setup is unable to measure whether the questions 

3. Related to the above, the paper claims that TOA is able to “generate plausible-sounding questions”. It would be great to get empirical concrete evidence of this — perhaps through some evaluation of the wellformedness of the resulting questions.

4. There were some places in the description of the evaluation framework that were unclear / missing important details (see questions).

5. Can you report error bars for Figure 4?

**Questions**: 1. What is the distinction in annotation: “full” validation set vs. “manually annotated” test set?

2. Some more clarity in the metric / evaluation section (section 4.2) would be useful:
    1. What do mu and sigma represent in the first paragraph of page 6?
    2. Assuming that “F1 recovery” means R=F1 in the equation on page 6, and “EM recovery” means R=EM, but this was not clearly stated anywhere.
        1. Based on this definition, why isn’t recovery of the Repeater baseline at 0? My understanding is that repeater simply repeated the (incomplete) facts in the input? Is it that the Repeater model still uses the oracle to pick an arbitrary fact while the “masked” and “supporting” simply condition on the masked input and full input respectively? This should all be more clearly articulated and the “masked” and “supporting” baselines clearly defined.

3. How well does TOA compare against a baseline that simply prompts the model to “reason step-by-step”, i.e. having it (implicitly) generate missing facts without asking any intermediate questions?",0,
ZltAP7Q4g4,y9Vb,"**Summary**: The paper presents FedGSE, a novel method for gradient-based sub-model extraction in federated learning, aiming to address the challenge of resource-constrained clients. The authors propose a strategy that selects neurons within each layer of the global model based on their large gradients generated by training the global model on a public dataset. The selected neurons are then used to form sub-models for training on the client side using local datasets. This ensures that the gradient updates produced by the sub-model closely resemble the gradient updates that would be produced when training the client data on the global model, resulting in better alignment between the sub-model and global model performance.

The key contributions of the paper are:
- Proposing the *first* method to improve federated learning training by approximating the update gradients between the global model and sub-models.
- Mathematically proving that the designed algorithm can make the update gradients between the global model and sub-models closest.
Validating the efficiency of the proposed method through extensive experiments, demonstrating that FedGSE outperforms other state-of-the-art methods.
- Validating the efficiency of the proposed method through extensive experiments, demonstrating that FedGSE outperforms other state-of-the-art methods.

**Soundness**: 2 fair

**Presentation**: 3 good

**Contribution**: 2 fair

**Strengths**: **Originality:**
The paper introduces FedGSE, a novel method for gradient-based sub-model extraction in federated learning. The authors focus on optimizing the update gradients between the global model and sub-models, which is a unique approach compared to existing methods that primarily focus on designing allocation strategies for training workload. The proposed method demonstrates originality in its problem formulation and solution.

**Quality:**
The paper is well-structured and presents a clear and coherent explanation of the proposed method. The authors provide a comprehensive review of related works, highlighting the limitations of existing methods and how FedGSE addresses them. The experimental results demonstrate the effectiveness of the proposed method, and the authors provide a thorough analysis of the results, including the impact of various hyperparameters on the performance of the method.

**Clarity:**
The paper is written in a clear and concise manner, making it easy for readers to understand the proposed method. The authors provide a detailed description of the algorithm and its components, as well as the experimental setup and results. The figures and tables are well-designed and partially support the main points of the paper.

**Significance:**
The proposed FedGSE method has the potential to advance the field of federated learning, particularly in addressing the challenge of resource-constrained clients. By optimizing the update gradients between the global model and sub-models, the method demonstrates improved performance over existing methods. The experimental results show that FedGSE consistently outperforms other state-of-the-art methods, especially in high data heterogeneity scenarios. This indicates that the proposed method has significant potential for real-world applications and can contribute to the advancement of federated learning research.

**Weaknesses**: ### General

**Claim of Novelty:** The authors assert that their method is pioneering in enhancing Federated Learning training by approximating the update gradients between the global model and sub-models. This claim hinges on the novelty of their strategy, which zeroes in on the optimization direction of approaching the global model's effectiveness when extracting sub-models. While they have provided a thorough review of related works, including static and dynamic allocation strategies for sub-model extraction, a significant oversight is the lack of comparison with other resource-constrained federated learning methods, especially those based on Knowledge Distillation (KD).It would be beneficial for the authors to expand their discussion and benchmarking to include KD-based methods and other strategies that address resource constraints in federated learning.

** Scalability Concerns:** The paper addresses a highly practical scenario of resource-constrained FL. However, the datasets used for validation, such as CIFAR-10/100, may not be sufficiently large or complex to truly test the scalability of the proposed method. For a method targeting resource-constrained environments, it's crucial to demonstrate its efficacy on high-dimensional data, such as face recognition datasets like CelebA or ImageNet.Suggestion: The authors should consider conducting experiments on larger and more complex datasets to truly validate the scalability and robustness of their approach.

**Potential Limitations**: The authors have acknowledged some limitations of their method, such as the need for constructing a public dataset on the server side and the challenges associated with executing the backward progress on the server side. While it's commendable that they've recognized these issues, it's essential to delve deeper into potential solutions or workarounds for these challenges.

### About Algorithm

**Gradient Representation and Assumptions:** The paper introduces the gradient of a neuron with the equation 
$g_{l,i} = \sum_{k=0}^{K-1} \sum_{v=0}^{V-1} |\frac{\partial f(w; x, y)}{\partial h_{l,i}(k, v)}|$
where $f(w; x, y)$ represents the cross-entropy loss of the deep neural network for the input $(x, y)$. The dimension length of the feature map is represented by $K$ and $V$, i.e., $h_{l,i} \in R^{K×V}$. However, the paper does not provide a clear justification or empirical evidence for choosing this gradient representation. It would be beneficial to have a more detailed discussion.

**Proof**: The proofs of Lemma 1 and Theorem 1 share Appendix B, but it is oversimplified, and not clear to see its pertinence. Moreover, how can eq (5) be derived from eq (4) with a non-linear relationship?

**Questions**: 1. The authors claim that their method is the first to propose improving Federated Learning training by approximating the update gradients between the global model and sub-models. This claim is based on the novelty of their approach, which focuses on designing a strategy that starts from the optimization direction of approaching the effectiveness of the global model when extracting sub-models. The authors provide a comprehensive review of related works, including static and dynamic allocation strategies for sub-model extraction, and highlight the limitations of these methods. In my opinion, the compared benchmark should also include the other resource-constrained (or heterogeneous) federated learning methods (like KD-based) apart from sub-module extraction methods. Could the authors elaborate more discussion on these methods?
2. (*miscellaneous*) In the text of this paper, the brackets are directly adjacent to the word without any space, which is unconventional. The numbers following the Table/Figure should also be separated by a space. The reference formatting is chaotic (e.g., the algorithm references in Algorithm 1). 
3. The paper works on a very practical scenario of resource-constrained FL. In this case, it is of high importance to verify its scalability to high-dimensional data wit (e.g., face recognition datasets like CelebA or ImageNet). Datasets like CIFAR-10/100 are too small to verify its scalability. 
4. The paper should provide information about the extra local training cost when applying GetSimilarDataCSL/SPL, as **resource-constrained** clients, not only the uploaded parameters but also local computational cost should also be considered, unless the setting of this paper is ""communication-constrained"" FL.",0,
2NwHLAffZZ,ZGhT,"**Summary**: This paper asks why overparameterized neural networks can be linearised with respect to their parameters (e.g. in the Neural Tangent Kernel regime), and propose that the reason is weak correlations between the first and higher derivatives of the model function. With respect to previous work, they consider the case of neural networks with two distinct activation functions and the deviation from linearity during SGD training.

**Soundness**: 2 fair

**Presentation**: 1 poor

**Contribution**: 1 poor

**Strengths**: Understanding the behaviour of overparameterized neural networks is a very important and interesting question.

**Weaknesses**: In my opinion this work fails on providing and/or communicating anything new on the topic.

1) Discussion on some very important related work is missing, which this work should have compare with.
2) Several statements are unsupported, definitions are missing, there are several inaccuracies and the paper is overall very hard to follow.
3) The mathematical notation is cumbersome and, for no apparent reason, completely different from many related papers.
4) Crucial points are relegated to the appendix, without which the main text is severely incomplete.

The main reference missing is “On the linearity of large non-linear models: when and why the tangent kernel is constant”, NeurIPS 2020 by Liu, Zhu, Belkin (https://arxiv.org/abs/2010.01092), but there are many other papers following this one that have studied the question of why neural networks can be linearised, also in relation to the model derivatives.
This line of work is not discussed at all. 

i) As a main contribution, the author list the case of “wide neural networks with two distinct activation functions”, but the only thing they say about this case in the main text is one sentence on page 9, relegating all about this claim in the appendix (we are not even told what “neural networks with two distinct activation functions” mean in the main text).
ii) Section 2 is completely unmotivated and its relevance remains unclear until much later. For example, in the first three paragraphs of section 2.2 it’s unclear what is the goal and the challenges in reaching the goal.
iii) The function \Epsilon is supposed to be a generic convex function, but there seems to be an (unstated) assumption that it depends on the difference between F and y, which is true for the square loss but not for many other commonly used loss functions.
iv) (x,y) are called, respectively, label and images, but it should be the other way around.
v) A “limiting parameter” in introduced on page 5 but never explained in the main text
vi) Equation 9 only applies to gradient flow, not to gradient descent. After reading Theorem 3.1 it becomes clear that Equation 9 is a definition, but until then it just looks like a mistake.
vii) Below Equation 12, Why can the parameter indices viewed as random variables? They are not random variables, and they are not drawn from a uniform distribution. Instead, all indices are summed over all the parameters. If they were a sample from a uniform distribution, there would be some noise.
viii) No intuition is given here about the relevance of the quantity introduced in equation 12.
ix) I don't understand Definition 3.2. ""O"" is supposed to be limiting order. What is ""O"" there?
x) n0 not defined in equations (16) and (17)
xi) There should not be a Delta in the second expression of equation (17)
xii) Inequality 19 seems to be crucial for obtaining the results. However, the statement ""nearly all realistic scalable systems satisfy"" is not justified.
xiii) What does ""typical parameter of the linearization/correlation decay"" mean?
xiv) “This scenario is a little more complex but can be dealt with.” How is this scenario dealt with? It seems the reader here just needs to trust the authors without any explanation or justification.
xv) “These systems can be interpreted as non-linear dynamical systems.” Any reference for this statement?

i) The Jacobian of the model is called “”derivative matrix and is transposed with respect to the Jacobian that everyone uses.
ii) I have never seen a gradient with a subscript “T” to denote the transpose of the operation result.
iii) In equation 12, the gradient with several subscripts and superscripts is just a (high order) partial derivative. Why re-inventing the notation?

**Questions**: NA",0,
4uaogMQgNL,UGEs,"**Summary**: The paper focuses on view synthesis from unposed images. Scene Representation Transformer, diffusion model, and controlnet branch are utilized to effectively perform the task on object-level novel view synthesis.

**Soundness**: 2 fair

**Presentation**: 2 fair

**Contribution**: 2 fair

**Strengths**: SRT + Diffusion are adopted to study the challenging task of novel view synthesis from unposed images.

**Weaknesses**: The experiments are conducted on object-level generation. As the authors also mentioned in the related works section, single view image-to-3d is highly related to the task UpFusion trying to solve. Single view input can also be considered as input image without pose. As a result, I believe the contributions of UpFusion can be better justified when comparing to existing single view novel view synthesis works, for example [1].
Besides, a couple of references are missing [2] (also on Co3D dataset), [3][4] (SDS-based).

[1] Zero-1-to-3: Zero-shot one image to 3d object

[2] NeRDi: Single-View NeRF Synthesis with Language-Guided Diffusion as General Image Priors

[3] RealFusion: 360° Reconstruction of Any Object from a Single Image

[4] NeuralLift-360: Lifting An In-the-wild 2D Photo to A 3D Object with 360 views

**Questions**: Please refer to the weaknesses.",0,
rljudc4XHW,Pvpq,"**Summary**: This paper proposes a lightweight and effective method for aggregating BEV pillar features using K-means clustering and Top-K Attention. The authors also introduce a Diversity Loss to prevent the attention mechanism from focusing too heavily on the most relevant features. The proposed method is evaluated on the nuScenes dataset and outperforms previous methods.

**Soundness**: 3 good

**Presentation**: 2 fair

**Contribution**: 2 fair

**Strengths**: The proposed clustering and top-K attention mechanism are simple and intuitive, yet achieve strong performance compared to previous state-of-the-art methods (Table 1). Extensive ablation studies in Section 4.4 demonstrate the benefits of the proposed modules.

**Weaknesses**: Latency: Section 3.3 states that ""the computational efficiency of DynamicBEV is one of its key advantages”. However, not all floating-point operations (FLOPs) are created equal, especially for the clustering operation on GPUs, TPUs, and other edge devices. It would be helpful if the authors could measure the latency of the full model and provide a breakdown of the latency of each component (e.g., clustering, sorting top-k).

Generalization: Evaluating the proposed method on only one dataset is not sufficient. I suggest evaluating the proposed method on at least one additional dataset.

Visualization: Could the authors provide detailed illustrations on K-mean clustering and Top-K attention in Fig1?
Fig 2 is not clear.  What does each color mean?

**Questions**: Please see weaknesses.",1, "['Evaluating the proposed method on only one dataset is not sufficient. I suggest evaluating the proposed method on at least one additional dataset.']"
gG38EBe2S8,meGk,"**Summary**: This paper presents an image morphing approach using pre-trained text-to-image diffusion models. The key idea is to interpolate text embeddings and latent states of two images following low-rank adaptation of model weights. The intuition is that the fine-tuned model exposes a more direct morphing path, yielding more perceptually convincing interpolations. Qualitative and quantitative experiments are performed to assess several key aspects of the proposed method. Overall, the method achieves superior morphing results compared to baselines and further supports interesting applications.

**Soundness**: 3 good

**Presentation**: 3 good

**Contribution**: 3 good

**Strengths**: - The method makes use of a pre-trained text-to-image diffusion model for image morphing. The data prior captured by the diffusion model naturally constrains the morphing path on the image manifold, allowing perceptually convincing interpolations.

- The paper provides a helpful discussion on the tradeoffs in model adaptation. It discovers that LoRA creates an effective bottleneck to reduce overfitting, and further devises a simple strategy to calculate LoRA ranks based on sample diversity along the morphing path.

- The paper introduces an effective approach for perceptually uniform sampling. It is based on adaptive step size selection and yields smooth and visually pleasing morphing results. 

- Several metrics are defined to evaluate different aspects of image morphing, namely smoothness, realism and directness. The method outperforms baselines in qualitative and quantitative experiments, and shows potential for several interesting applications.

**Weaknesses**: - Despite the impressive results, there are some vague claims between the lines which need further clarification.

a) Equation 5 formulates image morphing as walking on the geodesic path in the 2-Wasserstein sense. However, the actual implementation amounts to linear interpolation of CLIP text embeddings, which in no way reflects the theoretical formulation.

b) For the interpolation of latent states, the paper cites an early work saying x0 and xT forms an optimal transport mapping. How does this justify the slerp interpolation between xT(0) and xT(1)?

- The method only compares to Wang and Golland (2023), another diffusion-based method for image morphing, yet there are numerous methods that are not based on diffusion (or even not learning-based). I am curious about how the method compares to, for example, frequency domain morphing methods, and what are their typical failure modes.

**Questions**: - It would be helpful if the authors can comment on the robustness of their method relative to the hyper-parameters in model adaptation and sampling.

- Most of the image pairs used for experiments seem to have certain degree of spatial alignment. It would be helpful to know the rule of thumb for the selection of these image pairs.",0,
gvKZyTlUgQ,uatX,"**Summary**: This paper introduces a novel approach, named Warped Convolutional Neural Networks (WCN), for effectively learning and representing homography in neural networks through algebraic expressions. The proposed method enables the learning of features that remain invariant to significant homography transformations and can be easily incorporated into popular CNN-based methods. The paper thoroughly analyzes the proposed approach, including the warp function and its properties, implementation details, as well as extensive experimental results on benchmark datasets and tasks. The contributions of this paper encompass a fresh perspective on homography learning utilizing algebraic expressions, the introduction of a novel warped convolutional layer, and a comprehensive evaluation of the proposed method across various benchmark datasets and tasks.

**Soundness**: 3 good

**Presentation**: 2 fair

**Contribution**: 3 good

**Strengths**: 1. This paper establishes a sophisticated and elegant relationship between homography and the SL(3) group along with its Lie algebra.
2. The formulation of the homography and the underlying warping functions proposed in this paper demonstrate technical soundness.
3. The proposed WCN method for estimating homography parameters is logical and well-founded.
4. Extensive experiments are performed on various tasks and datasets, successfully validating the effectiveness of the proposed method.

**Weaknesses**: 1. In terms of novelty, this paper bears resemblance to the work of Zhan et al. (2022). Zhan et al. employed a similar approach, utilizing two groups for estimation, whereas this paper proposes the use of six groups for the same purpose. As a result, the contribution of this work can be seen as somewhat incremental. More clear discussion should be given. 

2. It is desirable for this paper to provide additional elaboration on the mathematical aspects associated with the proposed method, with the aim of enhancing comprehension for individuals who are not familiar with this particular field. The inclusion of more accessible explanations and intuitive examples would be beneficial in ensuring that the content is more easily understood by a broader audience.

**Questions**: Please check the weakness listed above.",0,
sKLQMNTAMx,7xGT,"**Summary**: This paper proposed a method to tackle the task of  one-shot federated learning under the setting of non IID. The main idea seems to divides the model into backbone and head, deploying them separately on the client and server sides.

In the setting of with no sufficiently large training data available, it's know that only fine tuning the head and freezing the backbone can lead to better results, which was reported in object detection. Thus I am not supervised to see that freezing the backbone and only training the head works better in federated learning.

**Soundness**: 2 fair

**Presentation**: 3 good

**Contribution**: 2 fair

**Strengths**: The ides is simple and I believe it should work, as I said above, though I am not sure if this can be a good contribution to federated learning as I am not closely following recent federated learning literature. 
Experiments show this simple idea works well.

**Weaknesses**: Experiments are not well designed.  
1) I think the authors should first compare FedTC against recent one-shot federated learning methods on standard IID settings. FedTC doesn't need to beat those methods, but I'd expect FedTC show on par results compared with those methods. I did not see that in the paper.
2) Table 1 which is the main results of the paper, shows FedTC largely outperforms other methods. But to my understanding all of the compared methods (maybe most, I did not check carefully) are not designed to handle non IID cases. Is it expected to see those methods fail? 
Then what is the main point of Table 1? 
3) Since the main selling point is FedTC works better on non IID datasets, I'd like to see FedTC compared against traditional non IID methods (by traditional, I mean training in the standard way rather than federated learning, which presumably should work better than federated learning)

4) Since one shot federated learning is largely unsolved, maybe it makes more sense to improve that rather than to solve much more challenging one shot, non IID, federated learning. This is just a comment.

**Questions**: see above",0,
lKK50q2MtV,FeQk,"**Summary**: The paper presents a method for text-based video editing, called TokenFlow. TokenFlow utilizes a pre-trained text-to-image diffusion model without the need for finetuning or video training data. Independently using text-based image editing techniques on frames will produce temporal artifacts. The paper proposed a method to improve the temporal consistency. More specifically, the method uses extended attention to edit several keyframes and then propagates the keyframe features to all the frames based on a Nearest Neighbour field. The Nearest Neighbour field is computed based on features of DDIM inversion.

**Soundness**: 3 good

**Presentation**: 4 excellent

**Contribution**: 3 good

**Strengths**: - The video editing results are impressive, the temporal consistency is pretty good.
- The analysis and visualization of UNet features on video tasks are helpful for future research on video generation.
- The idea of TokenFlow is novel. Based on the ablation study and qualitative results in the supplemental material, TokenFlow is also very critical to good temporal consistency. 
- The paper reads well and is easy to follow.

**Weaknesses**: Although it's not necessary, it will be helpful to compare TokenFlow with Pix2Video.

**Questions**: Are self-attention features the only features that are replaced by features of neighboring frames? Have you tried to replace some other features such as ResBlock features or attention masks?",1,"[""Although it's not necessary, it will be helpful to compare TokenFlow with Pix2Video.""]"
AJBkfwXh3u,U7py,"**Summary**: This paper proposes a novel approach for interpretability in dynamic graph neural networks. The proposed framework is demonstrated on both synthetic and real-world datasets. The experimental results show that the proposed method outperforms the baselines (all baselines are for explaining static graph neural networks). Another contribution is that the paper constructs a new synthetic benchmark dataset for dynamic graph interpretability tasks.

**Soundness**: 3 good

**Presentation**: 3 good

**Contribution**: 3 good

**Strengths**: The proposed framework is the first work for interpretability in dynamic graph neural networks. This is a significant contribution. The paper is well organized and clearly described. The method is technically sound. The experiments are comprehensive and the results show the effectiveness of the proposed method. The new constructed benchmark dataset is a good addition to the research domain.

**Weaknesses**: Minors: 
In Figure 1, the text is too small.

**Questions**: In table 2, the best performance for OrphicX is obtained by DTree-Grid?",0,
XXpH3D0TVP,KmFt,"**Summary**: This work focuses on attribution for diffusion models, i.e. understanding how the underlying training data influences the generation of a sample. This work proposes an approach based on TRAK [1] to attribute a single denoising step back to the training data. The work also proposes metrics based on counterfactual estimation to evaluate attribution approaches and shows comparisons against simple baselines on multiple datasets (CIFAR-10 and MS-COCO).  

[1] Park, Sung Min, et al. ""Trak: Attributing model behavior at scale."" _arXiv preprint arXiv:2303.14186_ (2023).

**Soundness**: 2 fair

**Presentation**: 3 good

**Contribution**: 3 good

**Strengths**: 1. The work focuses on an important and timely problem. Attribution for models is an important technical problem to address as generative models become more ubiquitous. This has implications for regulation, copyright, and fair compensation to artists [1].
2. The proposed solution is reasonably motivated and builds on prior work that achieves SOTA attribution results for discriminative models. The work also compares against reasonable baselines for data attribution and shows results across two different datasets namely CIFAR-10 and MS COCO.
3. The proposed attribution results look reasonable and are quantitatively supported well via counterfactual evaluation.  Evaluation for attribution approaches is difficult since there exists no ground truth label. The counterfactual evaluation metrics proposed in this paper (inspired via TRAK [2] and DataModels [3]) will be useful for future research.
4. The writing quality of the paper is good. It was easy to follow the main contributions of the paper and understand the background of data attribution.

[1] https://www.klgates.com/Recent-Trends-in-Generative-Artificial-Intelligence-Litigation-in-the-United-States-9-5-2023 
[2] Park, Sung Min, et al. ""Trak: Attributing model behavior at scale."" _arXiv preprint arXiv:2303.14186_ (2023).
[3] Ilyas, Andrew, et al. ""Datamodels: Predicting predictions from training data."" _arXiv preprint arXiv:2202.00622_ (2022).

**Weaknesses**: 1. A limitation of the proposed approach is the fact that attribution scores are only provided for a single denoising step. This is unintuitive, as it requires multiple steps to be analyzed to understand how an image was generated. It would be good to obtain a single-shot attribution score for the entire diffusion trajectory. While simple heuristics can be employed to obtain this from the current approach, it's unclear if these are useful and interpretable.

2. There is little analysis regarding how attributions change throughout the diffusion trajectory. It would be interesting to analyze more how ranking b/w attributions stay consistent b/w different timesteps. For example, what's the correlation coefficient b/w attributions of two timesteps close to each other v/s further away? How many of the +ve influence samples in the initial/middle timesteps stay positive throughout the diffusion trajectory?

3. The claim regarding conditioning likelihood increasing in small time intervals is a bit weak (i.e. features appear in specific timesteps). This should be more rigorously studied for multiple generated images on CIFAR-10 and MS-COCO. 

Minor - 
1. The font, and margins of this submission violate ICLR guidelines. This should be corrected in the next version of the paper.

**Questions**: 1. This is merely a suggestion and could help strengthen the paper. It would be interesting to compare attributions using a similar framework as [1] by fine-tuning large text-to-image models such as Stable Diffusion on a few images using a dreambooth-like approach. In this case, the attributions should have a higher +ve influence on the fine-tuning dataset. This can be done even for a small random subset of LAION images, instead of the entire dataset.  This can also be done for text-conditioned models on MS-COCO.

2. Several important details are missing regarding the attribution approach. TRAK uses a random projection matrix to compress gradients to low dimensional space, the dimension hasn't been mentioned at all. Is this random projection step not done for attributing diffusion models? Are multiple checkpoints used to estimate attribution scores? Are these trained on different subsets of training data? How much storage and compute is required for estimating the attribution scores? 

[1] Wang, Sheng-Yu, et al. ""Evaluating Data Attribution for Text-to-Image Models."" arXiv preprint arXiv:2306.09345 (2023).",0,
psDvcWtFdE,NJds,"**Summary**: This paper proposes a deep instance generator for MILPs with feasibility gurantee. It uses a VAE model trained on a dataset to generate similar MILP instances, and leverages a dual method proposed by Bowly et al. for feasibility.

**Soundness**: 2 fair

**Presentation**: 2 fair

**Contribution**: 1 poor

**Strengths**: 1. This paper proposes a MILP generator with feasibility gurantee.
2. It conduct some experiments to demonstrate the effectiveness and for analysis.

**Weaknesses**: 1. The technical novelty is minor. The proposed model is a direct combination of two existing methods [1] and [2]. [1] is a recent work accepted by NeurIPS and this paper is almost the same with [1]. Even if taking [1] without consideration, this work is an application of existing techniques, i.e., the VAE for graph generation and the feasible instance construction method proposed in [2].
2. Does this paper deal with MILPs or IPs? In Eq. (1) all variables are constrained as integers. In table 1 there are no features indicating whether the variables are integers.
3. Do the datasets contain unfeasible MILPs? Can the model learn to generate feasible MILPs without the feasibility gurantee? The necessity of this component is not demonstrated with ablation study.
4. The proposed method does not performs better than random significantly.
5. Why not report the hyper-configuration results to show whether this method can benefit this task?
6. What is the useness of the optimal value prediction task? Can the proposed method help the solving instead of just predicting the optimal value? 

[1] https://arxiv.org/abs/2310.02807

[2] Simon Andrew Bowly. Stress testing mixed integer programming solvers through new test instance generation methods. PhD thesis, School of Mathematical Sciences, Monash University, 2019.

**Questions**: See weaknesses.",0,
HD5Y7M8Xdk,84Pi,"**Summary**: The paper proposes a variational algorithm for learning model and latent parameters in a latent variable model which at each step first updates the variational distribution by minimizing forward chi-square divergence and then uses this distribution to estimate the log marginal likelihood using Importance Sampling. The optimization is done by gradient ascent where the gradients are estimated by MC sampling.
The proposed algorithm is compared against many contemporary algorithms on simulated and real world datasets, including a large scale case study on multi-neuron interaction modelled by a custom made partially observable GLM.

**Soundness**: 2 fair

**Presentation**: 2 fair

**Contribution**: 2 fair

**Strengths**: 1. The paper tries well to motivate itself well and the use of chi-square divergence objective as means of finding the optimal distribution for IS is theoretically sound.
2. The paper has done experiments and analysis on may simulated and real world datasets. The experiment on GLP model for multi-neurns activation is well documented and insightful. Some of the plots look good and match the narrative. 
3. The method proposed seems sound to me and the results show that it can perform better than contemporary VI algorithms on the tasks given in the paper.

**Weaknesses**: 1. The paper is not polished yet, although the major parts are all there, it may require another thorough pass.
it has too many mistakes and typos:, the notation changes from bold to normal in many places, the title has a typo: 'importane', 'log function is a convex function'. 
2. Some of the references and recent literature is missing which have looked on the quality of different divergence objectives such as CUBO and ELBO for finding the optimal sampling distribution.
3. The theory part and the algorithm part can be emphasized more, right now it feels to compressed and dense. The figure 2 is good but it has too many colors and things to unpack, maybe use solid line for true posterior as I was thoroughly confused by the legend choices, and the use of two colors for showing modes reduced readibility for me atleast. 
4. It is the bane of chi-square divergence methods that it  does not scale well with dimensions covered in the papers here: https://arxiv.org/pdf/2010.09541.pdf and https://arxiv.org/abs/1802.02538 and it seems that this method may not scale well as it uses Chi-squared divergence minimization.

**Questions**: 1. What is the dimensionality of the POGLM model, do the authors intend to use this method as a tool for low dimensional complex posteriors because both IS and CUBO do not scale well with dimensions and even large sample size as done in this paper will not help.  
2. Maybe include this in your conclusion section and discuss this as a limitation ? 
3. Did the authors use any other optimizers other than ADAM, did it have any effect, how did you choose the optimization algorithm hyperprameters like learning rate etc. ?
4. Did reparameterization gradients perform better than score gradients in the case where they both were available.",1,['Some of the references and recent literature is missing which have looked on the quality of different divergence objectives such as CUBO and ELBO for finding the optimal sampling distribution.']