Fri 7th March 2025
Technical Advancements in LLMs
New Open-Source Models
The community discussed DeepSeek’s Janus-Pro 7B, a multimodal model that handles both text and image tasks. It decouples visual encoding into separate understanding and generation pathways that feed a single transformer backbone, and it performs well at both image understanding and image generation. On text-to-image benchmarks such as DPG-Bench, Janus-Pro 7B outperforms OpenAI’s DALL-E 3, showing how unified models can rival specialized ones.
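To make the "decoupled pathways, shared backbone" idea concrete, here is a minimal PyTorch sketch. It is an illustrative layout only, not the released Janus-Pro architecture: the module sizes, the vision feature dimension, and the discrete image vocabulary are all assumptions.

```python
# Illustrative layout only: one continuous vision pathway for understanding,
# one discrete image-token pathway for generation, and a shared transformer.
# Sizes and names are assumptions, not Janus-Pro's actual configuration.
import torch
import torch.nn as nn

class UnifiedMultimodalLM(nn.Module):
    def __init__(self, d_model=1024, text_vocab=32000, image_vocab=16384, vision_dim=768):
        super().__init__()
        # Understanding pathway: project continuous vision-encoder features into LM space.
        self.understand_proj = nn.Linear(vision_dim, d_model)
        # Generation pathway: embeddings for discrete image tokens the model learns to emit.
        self.gen_image_embed = nn.Embedding(image_vocab, d_model)
        self.text_embed = nn.Embedding(text_vocab, d_model)
        # One shared transformer backbone serves both tasks (causal masking omitted for brevity).
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.text_head = nn.Linear(d_model, text_vocab)
        self.image_head = nn.Linear(d_model, image_vocab)

    def forward(self, text_ids, vision_feats=None, image_token_ids=None):
        parts = [self.text_embed(text_ids)]
        if vision_feats is not None:      # understanding: condition on encoded image features
            parts.insert(0, self.understand_proj(vision_feats))
        if image_token_ids is not None:   # generation: predict the next discrete image token
            parts.append(self.gen_image_embed(image_token_ids))
        h = self.backbone(torch.cat(parts, dim=1))
        return self.text_head(h), self.image_head(h)
```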
Reasoning Models & Smaller Footprints
Alibaba’s QwQ-32B is gaining attention for its reinforcement-learning training and self-refinement during generation, which enable strong performance on math and coding tasks. It offers a 131K-token context length, a window on par with OpenAI’s latest offerings.
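For readers who want to try it, here is a hedged sketch of loading the model through Hugging Face transformers; it assumes the Qwen/QwQ-32B checkpoint and enough GPU memory (or a quantized build) for a 32B model.

```python
# Hedged sketch: load QwQ-32B with transformers and ask a math question.
# Assumes the Qwen/QwQ-32B checkpoint plus enough GPU memory (or a quantized
# variant); device_map="auto" also requires the accelerate package.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/QwQ-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "How many positive divisors does 360 have?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Reasoning models emit long chains of thought, so leave room for them.
output = model.generate(input_ids, max_new_tokens=2048)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```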
Additionally, MIT researchers demonstrated an 8B model using Test-Time Training (TTT), in which the model is briefly fine-tuned on variations of the test problem at inference time, to significantly boost performance on reasoning tasks. This suggests that smaller models can reach GPT-4-level reasoning by learning dynamically during inference.
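The sketch below shows the general shape of a TTT loop, assuming a PEFT/LoRA setup; augment_examples and build_prompt are hypothetical helpers standing in for the task-specific augmentation and prompting used in the MIT work.

```python
# Conceptual TTT sketch: attach a fresh LoRA adapter, fine-tune it on augmented
# copies of the test instance's demonstrations, then answer the query.
# augment_examples and build_prompt are hypothetical helpers.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

def test_time_train_and_answer(model_id, demos, query, augment_examples, build_prompt, steps=20):
    tok = AutoTokenizer.from_pretrained(model_id)
    base = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
    # Per-instance low-rank adapter; the base weights stay frozen.
    model = get_peft_model(base, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    model.train()
    for _ in range(steps):
        for text in augment_examples(demos):  # e.g. permuted/transformed demos rendered as prompt+answer
            ids = tok(text, return_tensors="pt").input_ids.to(model.device)
            loss = model(input_ids=ids, labels=ids).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

    model.eval()
    with torch.no_grad():
        ids = tok(build_prompt(demos, query), return_tensors="pt").input_ids.to(model.device)
        out = model.generate(ids, max_new_tokens=256)
    return tok.decode(out[0][ids.shape[-1]:], skip_special_tokens=True)
```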
Community Projects and Showcases
Running a 671B Model Locally
Community members successfully ran DeepSeek R1 (671B) on local hardware using heavy quantization and RAM offloading. Some achieved reasonable speeds (~6 tokens/sec) on multi-CPU setups, while others pushed it to a single RTX 4090 with extreme quantization.
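As a rough illustration of that setup, here is how a heavily quantized GGUF build might be run through llama-cpp-python with partial GPU offload. The filename is a placeholder, and a machine in this class still needs a large amount of system RAM.

```python
# Rough sketch of local inference with a heavily quantized DeepSeek R1 GGUF
# build via llama-cpp-python. The filename is a placeholder; real ~1.58-bit
# builds are split across several files and still need hundreds of GB of RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="deepseek-r1-671b-extreme-quant.gguf",  # placeholder filename
    n_ctx=8192,        # keep the context modest to limit the KV cache
    n_gpu_layers=20,   # offload whatever fits on a single RTX 4090
    n_threads=32,      # spread the CPU layers across all cores
)

out = llm("Briefly explain what quantization does to model weights.", max_tokens=256)
print(out["choices"][0]["text"])
```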
Open-Source Collaboration (UI & Tools)
Users extended Open WebUI with a Claude-style artifact system. The feature lets generated content be rendered and managed separately from the conversation, improving the chat workflow for long-form outputs.
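The core idea is easy to picture: pull structured output (for example, fenced code blocks) out of a reply and track it as separate artifacts. The toy function below illustrates that split; it is not Open WebUI's actual implementation.

```python
# Toy illustration of an "artifact" split, not Open WebUI's implementation:
# extract fenced code blocks from a reply so they can be rendered and
# versioned separately from the chat transcript.
import re

FENCE = re.compile(r"`{3}(\w*)\n(.*?)`{3}", re.DOTALL)

def split_artifacts(reply: str):
    artifacts = [
        {"language": lang or "text", "content": body.strip()}
        for lang, body in FENCE.findall(reply)
    ]
    chat_text = FENCE.sub("[artifact]", reply)  # leave a placeholder in the chat view
    return chat_text, artifacts
```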
Community Model Replications
Discussions around Hugging Face’s Open-R1 initiative grew as users collaborated to fine-tune smaller replicas of DeepSeek R1. This project aims to create a reasoning-focused model with a lighter footprint.
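A hedged sketch of the kind of fine-tuning discussed there is shown below, using TRL's SFTTrainer to train a small student model on reasoning traces; the model and dataset names are examples rather than a prescribed Open-R1 recipe.

```python
# Hedged sketch of distillation-style SFT on reasoning traces with TRL.
# Model and dataset names are examples, not an official Open-R1 recipe;
# it assumes the dataset exposes a conversational "messages" column.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("open-r1/OpenR1-Math-220k", split="train[:1%]")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",   # small student model
    train_dataset=dataset,
    args=SFTConfig(output_dir="r1-distill-sft", per_device_train_batch_size=1),
)
trainer.train()
```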
Noteworthy Tools & Updates
Libraries and Backend Updates
The latest update of llama.cpp introduced optimizations for multi-socket CPU systems, improving inference speeds on that kind of hardware. Additionally, users tested Hugging Face’s TGI (Text Generation Inference) for serving LLMs efficiently in multi-user environments, comparing it with Ollama across different workloads.
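For context, querying a locally running TGI server from Python usually goes through huggingface_hub's InferenceClient, roughly as sketched below; the URL is an assumption and should point at wherever your server is listening.

```python
# Minimal sketch of querying a locally running TGI server with
# huggingface_hub's InferenceClient. The URL/port is an assumption; point it
# at wherever your text-generation-inference instance is listening.
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")
reply = client.chat_completion(
    messages=[{"role": "user", "content": "Give one advantage of continuous batching."}],
    max_tokens=200,
)
print(reply.choices[0].message.content)
```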
Quantization Techniques
Discussions on novel quantization methods led to interest in SVDQuant, which quantizes both weights and activations to 4 bits while keeping a small low-rank, higher-precision branch that absorbs outliers. While it was introduced for diffusion models, users speculated about applying it to LLMs to further reduce memory footprint and speed up inference.
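The sketch below illustrates the weight-side intuition numerically: split a matrix into a low-rank high-precision branch plus a 4-bit residual, so a single outlier no longer dominates the quantization scale. It is a toy demonstration, not the SVDQuant implementation, which also quantizes activations and applies smoothing.

```python
# Toy numerical sketch of the SVDQuant idea, weights only: a low-rank branch
# kept in high precision absorbs outliers, and the residual is quantized to
# 4 bits. The real method also quantizes activations and uses smoothing.
import numpy as np

def quantize_int4(x):
    """Symmetric per-tensor 4-bit fake quantization (levels in [-8, 7])."""
    scale = np.abs(x).max() / 7.0 + 1e-12
    return np.clip(np.round(x / scale), -8, 7) * scale  # dequantized approximation

def svdquant_weight(W, rank=32):
    # Low-rank branch kept in high precision.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    L = (U[:, :rank] * S[:rank]) @ Vt[:rank]
    # The residual has a much smaller dynamic range, so 4 bits hurt it less.
    R_q = quantize_int4(W - L)
    return L, R_q

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512))
W[0, 0] = 50.0                      # inject a single large outlier
L, R_q = svdquant_weight(W)
print("naive 4-bit error:   ", np.abs(W - quantize_int4(W)).mean())
print("low-rank + 4-bit err:", np.abs(W - (L + R_q)).mean())
```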