The latest paper from DeepSeek introduces NSA, a natively trainable sparse attention mechanism for ultra-fast long-context training and inference.
Kimi proposed a new attention mechanism, MoBA (Mixture of Block Attention), which applies Mixture-of-Experts (MoE) principles to attention and improves the efficiency of LLMs in long-context scenarios without sacrificing performance.
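As a rough illustration of that MoE-plus-attention idea, the sketch below (my own simplification, not Kimi's released implementation) partitions the cached keys and values into blocks, lets a router score each block against the query, and attends only over the top-scoring blocks; the function name, block size, and top-k values are all illustrative.

import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block_size=64, top_k=4):
    # q: (heads, d) single query step; k, v: (heads, seq, d) cached history.
    # Assumes seq is at least block_size.
    h, s, d = k.shape
    n_blocks = s // block_size
    k_blocks = k[:, : n_blocks * block_size].reshape(h, n_blocks, block_size, d)
    v_blocks = v[:, : n_blocks * block_size].reshape(h, n_blocks, block_size, d)

    # Router: score each block by the query's similarity to the block's mean-pooled key.
    gate_scores = torch.einsum("hd,hnd->hn", q, k_blocks.mean(dim=2))
    top = gate_scores.topk(min(top_k, n_blocks), dim=-1).indices  # (heads, top_k)

    # Gather only the selected blocks and run ordinary attention over them.
    idx = top[:, :, None, None].expand(-1, -1, block_size, d)
    k_sel = torch.gather(k_blocks, 1, idx).reshape(h, -1, d)
    v_sel = torch.gather(v_blocks, 1, idx).reshape(h, -1, d)
    attn = F.softmax(torch.einsum("hd,hnd->hn", q, k_sel) / d ** 0.5, dim=-1)
    return torch.einsum("hn,hnd->hd", attn, v_sel)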
Generative large language models (LLMs) have opened up numerous novel
possibilities, but, due to their significant computational requirements, their
ubiquitous use remains challenging. Some of the most useful applications
require processing large numbers of samples at a time and using long contexts,
both significantly increasing the memory communication load of the models. We
introduce SparQ Attention, a technique for increasing the inference throughput
of LLMs by reducing the memory bandwidth requirements within the attention
blocks through selective fetching of the cached history. Our proposed technique
can be applied directly to off-the-shelf LLMs during inference, without
requiring any modification to the pre-training setup or additional fine-tuning.
By evaluating Llama 2 and Pythia models on a wide range of downstream tasks, we
show that SparQ Attention can decrease the attention memory bandwidth
requirements by up to eight times without any loss in accuracy.
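To make the "selective fetching" idea concrete, here is a minimal single-head, single-step sketch of how SparQ-style attention could work, based on my reading of the description above; it is not the authors' reference code, and the parameters r (query components kept) and k (positions fetched) are illustrative.

import torch
import torch.nn.functional as F

def sparq_attention_step(q, K, V, r=16, k=64):
    # q: (d,) current query; K, V: (seq, d) cached keys/values for one head
    seq, d = K.shape

    # Step 1: approximate the attention scores using only the r largest-magnitude
    # query components, so just r columns of K need to be transferred.
    top_r = q.abs().topk(min(r, d)).indices
    scale = (d * q[top_r].abs().sum() / q.abs().sum()) ** 0.5
    s_hat = F.softmax(q[top_r] @ K[:, top_r].T / scale, dim=-1)

    # Step 2: fetch full key/value rows only for the k positions with the highest
    # approximate scores and run exact attention over that subset.
    top_k = s_hat.topk(min(k, seq)).indices
    scores = F.softmax(q @ K[top_k].T / d ** 0.5, dim=-1)
    y = scores @ V[top_k]

    # Step 3: blend with the running mean value vector, weighted by the softmax
    # mass the selected positions are estimated to cover.
    alpha = s_hat[top_k].sum()
    return alpha * y + (1 - alpha) * V.mean(dim=0)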
Scaling up the size of vision models has been the de facto standard to obtain
more powerful visual representations. In this work, we discuss the point beyond
which larger vision models are not necessary. First, we demonstrate the power
of Scaling on Scales (S^2), whereby a pre-trained and frozen smaller vision
model (e.g., ViT-B or ViT-L), run over multiple image scales, can outperform
larger models (e.g., ViT-H or ViT-G) on classification, segmentation, depth
estimation, Multimodal LLM (MLLM) benchmarks, and robotic manipulation.
Notably, S^2 achieves state-of-the-art performance in detailed understanding
of MLLM on the V* benchmark, surpassing models such as GPT-4V. We examine the
conditions under which S^2 is a preferred scaling approach compared to
scaling on model size. While larger models have the advantage of better
generalization on hard examples, we show that features of larger vision models
can be well approximated by those of multi-scale smaller models. This suggests
most, if not all, of the representations learned by current large pre-trained
models can also be obtained from multi-scale smaller models. Our results show
that a multi-scale smaller model has comparable learning capacity to a larger
model, and pre-training smaller models with S^2 can match or even exceed the
advantage of larger models. We release a Python package that can apply S^2 on
any vision model with one line of code:
https://github.com/bfshi/scaling_on_scales.
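For intuition, the following sketch shows one way the S^2 recipe described above could be implemented around a frozen backbone: run the model on the image at several scales, split larger scales into crops of the base input size, stitch the crop features back together, pool to a common size, and concatenate across scales channel-wise. It is an illustration of the multi-scale idea, not the authors' released package (linked above), and the backbone interface and default sizes are assumptions.

import torch
import torch.nn.functional as F

def s2_features(backbone, x, scales=(1, 2), base_size=224):
    # x: (B, 3, H, W); backbone is assumed to return (B, C, h, w) feature maps.
    feats = []
    for s in scales:
        xs = F.interpolate(x, size=(base_size * s, base_size * s), mode="bilinear")
        # Split the rescaled image into s*s crops of the base resolution.
        crops = xs.unfold(2, base_size, base_size).unfold(3, base_size, base_size)
        crops = crops.permute(0, 2, 3, 1, 4, 5).reshape(-1, x.shape[1], base_size, base_size)
        f = backbone(crops)  # (B*s*s, C, h, w)
        # Stitch crop features back into one map, then pool to the base feature size.
        B, C, h, w = x.shape[0], f.shape[1], f.shape[2], f.shape[3]
        f = f.reshape(B, s, s, C, h, w).permute(0, 3, 1, 4, 2, 5).reshape(B, C, s * h, s * w)
        feats.append(F.interpolate(f, size=(h, w), mode="area"))
    return torch.cat(feats, dim=1)  # channel-wise concat across scales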
In this work we systematically review the recent advancements in code
processing with language models, covering 50+ models, 30+ evaluation tasks, and
500 related works. We break down code processing models into general language
models represented by the GPT family and specialized models that are
specifically pretrained on code, often with tailored objectives. We discuss the
relations and differences between these models, and highlight the historical
transition of code modeling from statistical models and RNNs to pretrained
Transformers and LLMs, mirroring the course that NLP itself has taken. We also
discuss code-specific features such as ASTs, CFGs, and unit tests,
along with their application in training code language models, and identify key
challenges and potential future directions in this domain. We keep the survey
open and up to date in a GitHub repository at
https://github.com/codefuse-ai/Awesome-Code-LLM.
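As a small, concrete example of one of the code-specific features mentioned (ASTs), the snippet below uses Python's standard ast module to parse a function and linearize its syntax tree into a node-type sequence; how such structure is actually consumed during training varies across the surveyed models.

import ast

source = "def add(a, b):\n    return a + b\n"
tree = ast.parse(source)

# Linearize the AST into a node-type sequence, one simple way syntactic structure
# can be exposed to a model alongside the raw source text.
node_types = [type(node).__name__ for node in ast.walk(tree)]
print(node_types)  # ['Module', 'FunctionDef', 'arguments', 'Return', 'arg', 'arg', 'BinOp', ...]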