TL;DR — I tried to run Qwen2.5-7B-Instruct-1M on a consumer laptop (RTX 3050 Laptop 6GB VRAM) and mapped the literal feasibility frontier. All evidence in JSON, drift-CI enforced. Three honest findings:
4k context = the hard ceiling on Windows transformers + bitsandbytes int4 NF4. 5k, 6k, and 8k all OOM at the first attention forward pass. The 4k cell passes only because Windows' WDDM shared-memory overcommit lets CUDA allocations spill over PCIe into system RAM at roughly a 10x latency tax: peak measured was 10.8GB on a 6GB GPU.
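To see why the cliff sits between 4k and 5k, here is a back-of-envelope sketch (my arithmetic, not from the repo's runners) of the fp16 attention-scores tensor that eager-mode transformers materializes per layer. The head count is assumed from the published Qwen2.5-7B config (28 query heads), and the real peak is higher still (softmax upcasts, multiple live layers), so treat these as lower bounds:

```python
NUM_HEADS = 28   # Qwen2.5-7B query heads (assumed from the published config)
BYTES_FP16 = 2

def scores_gib(seq_len: int) -> float:
    """fp16 attention-score matrix (heads x seq x seq) for ONE layer."""
    return NUM_HEADS * seq_len * seq_len * BYTES_FP16 / 2**30

for ctx in (4096, 5120, 6144, 8192):
    print(f"{ctx:>5} tokens -> {scores_gib(ctx):.2f} GiB per layer")
```

At 4096 tokens the per-layer scores tensor is ~0.88 GiB; at 8192 it is 3.5 GiB, which a 6GB card cannot absorb even with WDDM spillover absorbing the rest.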
WSL2 + vllm cannot even fit the model. vllm 0.7.3's memory profile, verbatim from the log: "model weights take 5.43GiB; PyTorch activation peak memory takes 1.42GiB; the rest of the memory reserved for KV Cache is -0.94GiB". Zero GPU cache blocks allocated, 0.00x concurrency at 4200 tokens. The Linux NVIDIA driver provides no equivalent shared-memory fallback: vllm sees only the physical 6GB and refuses to start. The conventional wisdom that "vllm > transformers for memory efficiency" is literally disproven at this hardware tier. It fails harder because the Windows OS was the enabler, not the inference engine.
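The negative number in that log falls straight out of vllm's budgeting, sketched below in simplified form (the usable budget is roughly gpu_memory_utilization times total VRAM; the exact profiler internals may differ). The weights and activation figures are the ones from the log above; the ~5.91 GiB usable budget is back-solved from them:

```python
WEIGHTS_GIB = 5.43          # from the vllm 0.7.3 log
ACTIVATION_PEAK_GIB = 1.42  # from the same log

def kv_cache_budget(usable_gib: float) -> float:
    """VRAM left for the KV cache after weights + profiled activation peak."""
    return usable_gib - WEIGHTS_GIB - ACTIVATION_PEAK_GIB

# ~5.91 GiB usable on this 6GB card reproduces the logged -0.94GiB,
# i.e. zero allocatable cache blocks and a refused startup.
print(round(kv_cache_budget(5.91), 2))
```

A negative budget means vllm cannot allocate a single paged-attention block, which is why it aborts rather than degrading gracefully the way the Windows spillover path does.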
Cloud free tier is also capped, and unevenly. GitHub Models free tier (zero credit card, gh OAuth only): gpt-4.1-mini PASS @ 4k in 8.54s (~30x faster than local). llama-3.3-70b-instruct PASS @ 4k in 5.17s. But: gpt-5 returns unavailable_model at any context size on the free tier. DeepSeek-V3 and gpt-5 are capped at a literal 4000 input tokens. And Anthropic Claude is not in the GitHub Models catalog at all: zero CC + Claude = no path.
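For anyone reproducing the zero-credit-card path, here is a hypothetical minimal request builder for GitHub Models' OpenAI-compatible chat endpoint. The endpoint URL and header shape are my assumptions (check the GitHub Models docs); what the post above confirms is only that a plain GITHUB_TOKEN from gh OAuth is sufficient auth. The sketch builds the request without sending it:

```python
import json
import os

# Assumed endpoint; verify against the current GitHub Models documentation.
ENDPOINT = "https://models.github.ai/inference/chat/completions"

def build_request(model: str, prompt: str, token: str) -> tuple[dict, dict]:
    """Return (headers, payload) for a GitHub Models chat completion call."""
    headers = {
        "Authorization": f"Bearer {token}",  # gh OAuth token / PAT, no CC
        "Content-Type": "application/json",
    }
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return headers, payload

headers, payload = build_request(
    "gpt-4.1-mini", "ping", os.environ.get("GITHUB_TOKEN", "")
)
print(json.dumps(payload))
```

Pointing the same builder at gpt-5 is what produced the unavailable_model error above, independent of prompt length.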
Full numbers + 11 JSON evidence cells + 3 ADRs at: https://github.com/leagames0221-sys/longctx-bench-honest
Hardware: RTX 3050 Laptop 6GB / driver 560.94 / CUDA 12.6 / Windows 11 + WSL2 Ubuntu 24.04. Software: torch 2.5.1+cu124, transformers (5.8.0 Win / 4.48.3 WSL), bitsandbytes 0.49.2, vllm 0.7.3. Everything fully reproducible — uv.lock committed, runners under examples/.
Related sibling repo for browser RPA on the same constraints (5-layer defense-in-depth journey, 5 honest failures with JSON): https://github.com/leagames0221-sys/browser-agent-demo
Cross-repo thesis is "constraint-optimized AI engineering": map the literal feasibility frontier under (zero credit card, consumer laptop, public OSS only, drift-CI enforced) and publish both the working zone AND the boundary. Happy to answer questions about the methodology or specific runner code.