
Text and Typography
Examples of text, typography, math equations, diagrams, flowcharts, pictures, videos, and more.
Reference Principle and Deployment A video on the principles of LLaVA Good to know Project structure and config.json settings Download the LLaVA framework, download the weight and v...
Install pip install -U huggingface_hub # Python>=3.8 Login huggingface-cli login ## get the token from website Download Models huggingface-cli download --resume-download {model name from ...
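BLIP Architecture (1) What does the Image Encoder do? It turns an image into a sequence of numerical representations, much as we turn each word into a meaning we understand when reading a book. It uses a model such as ViT, which cuts the image into many small patches, like dicing a block of tofu, and turns each patch into an "image token". The final output has a shape like (number of images, number of tokens, dimension of each token), just like the (batch... in NLP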
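The patch-to-token step described in the image encoder excerpt can be sketched in a few lines. This is a minimal illustration in plain Python, not BLIP's or ViT's actual code: the helper names `patchify` and `batch_shape` are hypothetical, images are assumed to be nested lists of shape height x width x channels, and the learned linear projection that a real ViT applies to each flattened patch is left out.

```python
def patchify(image, patch):
    """Split one H x W x C image (nested lists) into flattened patch tokens."""
    height, width = len(image), len(image[0])
    # Assumption: the image size is an exact multiple of the patch size.
    assert height % patch == 0 and width % patch == 0
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten every channel value of every pixel inside this patch.
            token = [
                value
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for value in pixel
            ]
            tokens.append(token)
    return tokens


def batch_shape(batch, patch):
    """Return (number of images, tokens per image, values per token)."""
    tokens = [patchify(image, patch) for image in batch]
    return (len(tokens), len(tokens[0]), len(tokens[0][0]))


# Two dummy 4 x 4 RGB images, cut into 2 x 2 patches.
dummy_image = [[[0, 0, 0] for _ in range(4)] for _ in range(4)]
shape = batch_shape([dummy_image, dummy_image], patch=2)
```

Here each 4 x 4 image yields four tokens of twelve values each, which mirrors the (images, tokens, dimension) layout the excerpt compares to NLP batches.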
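Reference https://zhuanlan.zhihu.com/p/619501914 Empirical lessons: the fusion encoder must not be too simple; the image encoder should be somewhat larger than the text encoder. 🔧 The core idea of ALBEF: Align Before Fuse. Traditional image-text models (such as UNITER) "fuse first, then align": they feed the image and text into a single Transformer and then train the model to learn the relationship between them. ALBEF's...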
ViT ViT Principle ViT Code ViT Position Encoding - Video CLIP 🧠 Scenario: CLIP processes a sentence. Suppose we have the sentence "a cute cat". CLIP processes it as follows: 1. Tokenize + encode: the sentence becomes a token sequence: ["<...
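All generated by chatgpt-4o 🎯 Restating the question: Suppose a batch contains 64 images and MoCo generates a query and a key for every image. How are they trained together, and how is the loss computed together? ✅ Core of the answer: MoCo runs the "contrastive task" on every image in parallel, then averages the loss over all samples and backpropagates once. This is the "mini-batch train... that is very common in modern deep learning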
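The "one contrastive task per image, then average" idea from the MoCo excerpt can be sketched as follows. This is a minimal plain-Python illustration, not MoCo's reference implementation: the function names `info_nce` and `batch_loss` are hypothetical, vectors are assumed to be already normalized lists of floats, the negatives are assumed to come from a shared queue, and the temperature of 0.07 is an assumed default. A real training loop would compute this on tensors and backpropagate through the averaged value.

```python
import math


def info_nce(query, pos_key, neg_keys, temperature=0.07):
    """Contrastive (InfoNCE) loss for one query against one positive key."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # The positive logit goes first, followed by one logit per negative key.
    logits = [dot(query, pos_key) / temperature]
    logits += [dot(query, key) / temperature for key in neg_keys]
    # Numerically stable log-sum-exp over all logits.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    # Cross-entropy with the positive key as the target class.
    return log_sum - logits[0]


def batch_loss(queries, pos_keys, queue, temperature=0.07):
    """Average the per-sample losses so the whole batch yields one scalar."""
    losses = [
        info_nce(q, k, queue, temperature)
        for q, k in zip(queries, pos_keys)
    ]
    return sum(losses) / len(losses)


# Toy batch of two samples with two-dimensional features.
queries = [[1.0, 0.0], [0.0, 1.0]]
pos_keys = [[1.0, 0.0], [0.0, 1.0]]
queue = [[-1.0, 0.0], [0.0, -1.0]]  # shared negatives
loss = batch_loss(queries, pos_keys, queue)
```

With a batch of 64 images the only change is that `queries` and `pos_keys` each hold 64 vectors; the result is still a single averaged scalar.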
1. Overall Goal & Context Thesis Title: Feasibility Study on Using ‘Behind the Ear’-EEG to Detect Arousal in Virtual Reality Exposure Therapy Primary Objective: To investigate whether ‘Be...
Paper Link Paper Link 1. The Core Problem: VLMs Struggle with Learning from Examples Current Strength: Today’s Vision-Language Models (VLMs), like LLaVA, excel at zero-shot tasks – understandi...
Paper Link Paper Link I. 🌟 Core Focus and Contributions Topic: A systematic review of EEG-based Multimodal Emotion Recognition (EMER) Focus: Centers on EEG as the primary modality, combined ...