16 papers found
Thinking Slow, Fast: Scaling Inference Compute with Distilled Reasoners
arXiv (Cornell University), 2025
SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations
arXiv (Cornell University), 2025
Opportunistic Expert Activation: Batch-Aware Expert Routing for Faster Decode Without Retraining
arXiv (Cornell University), 2025
HybriDNA: A Hybrid Transformer-Mamba2 Long-Range DNA Language Model
arXiv (Cornell University), 2025, 5 citations
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
arXiv (Cornell University), 2024, 11 citations
Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling
PubMed, 2024, 48 citations
An Empirical Study of Mamba-based Language Models
arXiv (Cornell University), 2024, 9 citations
Effectively Modeling Time Series with Simple Discrete State Spaces
arXiv (Cornell University), 2023, 13 citations
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
arXiv (Cornell University), 2023, 952 citations
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
arXiv (Cornell University), 2023, 139 citations
Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
arXiv (Cornell University), 2023, 19 citations
