feat(deep_learning): add Transformer live coding materials

Commit 8b7aef2a authored Jun 24, 2025 by insun park
--- a/ai lecture/courses/07_deep_learning/live_coding_transformer_block.md
+++ b/ai lecture/courses/07_deep_learning/live_coding_transformer_block.md
+# Live Coding Scenario: Building a Transformer Block from Scratch
+
+**Goal:** Build a deep understanding of how a Transformer works internally by implementing its core components (Self-Attention, Multi-Head Attention, and the Feed-Forward Network) from scratch in PyTorch.
+
+**Audience:** Developers who are comfortable with basic Python and PyTorch syntax and want a clearer picture of the Transformer's internal structure
+
+**Estimated time:** 45-50 minutes
+
+**Deliverable:** `source_code/07_deep_learning/live_coding_transformer_from_scratch.py`
+
+---
+
+## Session Overview and Schedule
+
+| Time (min) | Content | Key points |
+| :--------- | :------ | :--------- |
+| 5 | **Introduction:** Transformer architecture overview and live coding goals | "Attention is All You Need", the Encoder-Decoder structure, the part built today (the Block) |
+| 15 | **Part 1: Implementing Scaled Dot-Product Attention** | Meaning of Q, K, V; matrix operations; the role of masking; why scaling matters |
+| 15 | **Part 2: Implementing Multi-Head Attention** | Why attention is split into multiple "heads"; implementation via `split` and `concat` |
+| 10 | **Part 3: Assembling the Transformer Block** | Role of residual connections and Layer Normalization; adding the FFN |
+| 5 | **Wrap-up:** Review of the finished code, its role within the full architecture, Q&A | Differences between the Encoder Block and Decoder Block; suggested next steps |
+
+---
+
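+## Detailed Live Coding Scenario
+
+### 1. Introduction (5 min)
+
+- Briefly explain why the Transformer has been transformative not only in NLP but across many other fields.
+- Emphasize the core idea of the "Attention is All You Need" paper: processing sequences with attention alone, without RNNs or CNNs.
+- Show a diagram of the full Transformer architecture and make clear that today's focus is a single 'Encoder Block' or 'Decoder Block', the core building block of that architecture.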
+
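+### 2. Part 1: Scaled Dot-Product Attention (15 min)
+
+- **Concepts**:
+    - Explain the intuition behind attention as "a Query computes its similarity to each Key and uses those similarities to take a weighted sum of the Values."
+    - Explain the shape and meaning of the Q, K, and V matrices: `(batch_size, seq_len, d_model)` at the input, and `(batch_size, num_heads, seq_len, d_k)` once split into heads in Part 2.
+- **Implementation**:
+    - Start by computing the attention scores with `torch.matmul(q, k.transpose(-2, -1))`.
+    - Add the division by the scaling factor `sqrt(d_k)` while explaining why it is needed: without it, large dot products push softmax into saturated regions with very small gradients.
+    - (Optional) Explain why masking is needed (ignoring padding tokens, preventing look-ahead in the decoder) and implement it with `masked_fill`.
+    - Apply `torch.softmax` to get the final attention weights, then multiply them by `v` to compute the output (context vector).
+    - Confirm that the output tensor has the same shape as the input.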
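
The implementation steps above can be sketched in a few lines. This is a minimal single-head illustration, not the script from this commit: the names `attention_sketch` and `causal_mask` are hypothetical, and the mask is assumed to be a look-ahead mask where 1 means "may attend". It fills masked positions with `-1e9`, the same value the commit's script uses.

```python
import math
import torch

def attention_sketch(q, k, v, mask=None):
    # q, k, v: (batch_size, seq_len, d_k); a single head for simplicity
    d_k = q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 get a large negative score,
        # so softmax assigns them (numerically) zero weight
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = torch.softmax(scores, dim=-1)
    # Weighted sum of the values
    return torch.matmul(weights, v), weights

torch.manual_seed(0)
q = k = v = torch.rand(2, 5, 16)
# Hypothetical look-ahead mask: lower-triangular, broadcast over the batch
causal_mask = torch.tril(torch.ones(5, 5))
out, weights = attention_sketch(q, k, v, causal_mask)

print(out.shape)      # same (batch_size, seq_len, d_k) shape as the input
print(weights[0, 0])  # the first position can only attend to itself
```

Each row of `weights` sums to 1, and with the causal mask the entries above the diagonal are zero, which is the look-ahead prevention a decoder relies on.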
+
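+### 3. Part 2: Multi-Head Attention (15 min)
+
+- **Concepts**:
+    - Use an analogy to explain why "looking from several viewpoints" works better than "looking once" (e.g., "capturing different semantic and syntactic relationships in a sentence at the same time").
+    - Explain how a `d_model`-dimensional vector is split into `num_heads` vectors of dimension `d_k` (`d_model / num_heads`).
+- **Implementation**:
+    - Pass the incoming Q, K, and V through separate `nn.Linear` layers to produce the projected Q, K, and V.
+    - Write a function that splits these Q, K, and V across `num_heads` and transposes them into shape `(batch_size, num_heads, seq_len, d_k)`; the code does the `split` with `view` followed by `transpose`.
+    - Reuse the `scaled_dot_product_attention` function from Part 1 to compute attention.
+    - After attention, `concat` the separate heads back together and pass the result through an `nn.Linear` layer to produce the final output.
+    - Encapsulate the whole logic in a `MultiHeadAttention` class that subclasses `nn.Module`.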
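
The split and concat steps are mostly shape bookkeeping, which a short round-trip check can make concrete. The sketch below assumes small illustrative sizes (`d_model = 16`, `num_heads = 4`) and leaves out the `nn.Linear` projections and the attention computation itself.

```python
import torch

batch_size, seq_len, d_model, num_heads = 2, 5, 16, 4
d_k = d_model // num_heads  # per-head dimension

x = torch.rand(batch_size, seq_len, d_model)

# Split: (batch_size, seq_len, d_model) -> (batch_size, num_heads, seq_len, d_k)
heads = x.view(batch_size, seq_len, num_heads, d_k).transpose(1, 2)

# Concat: (batch_size, num_heads, seq_len, d_k) -> (batch_size, seq_len, d_model)
# transpose() returns a non-contiguous view, so contiguous() is needed before view()
merged = heads.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)

print(heads.shape)             # (batch_size, num_heads, seq_len, d_k)
print(torch.equal(merged, x))  # the round trip recovers the original tensor
```

Head `h` simply holds the slice `x[..., h * d_k:(h + 1) * d_k]`, so splitting adds no parameters; the learned part lives in the linear projections before and after.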
+
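+### 4. Part 3: Assembling the Transformer Block (10 min)
+
+- **Concepts**:
+    - Explain the role of Add & Norm (residual connection + Layer Normalization): mitigating vanishing gradients and stabilizing training.
+    - Explain the structure and role of the Position-wise Feed-Forward Network (FFN): two linear layers with a non-linearity between them, applied to each position independently.
+- **Implementation**:
+    - **Sub-layer 1**: Apply `MultiHeadAttention`, add the input to its output (`x + self.self_attn(x, x, x)`), and apply `LayerNorm`.
+    - **Sub-layer 2**: Pass that result through the `FFN`, then apply Add & Norm once more.
+    - Encapsulate these steps in an `nn.Module` class named `EncoderBlock` (or `DecoderBlock`).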
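
The Add & Norm pattern used by both sub-layers can be shown in isolation. In this sketch the `sublayer` is a hypothetical stand-in (a plain `nn.Linear`) for either the attention module or the FFN, and dropout is omitted for brevity.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16

# Hypothetical stand-in for MultiHeadAttention or the FFN
sublayer = nn.Linear(d_model, d_model)
norm = nn.LayerNorm(d_model)

x = torch.rand(2, 5, d_model)

# Add: the residual connection keeps a direct path from input to output.
# Norm: LayerNorm normalizes each position's feature vector.
y = norm(x + sublayer(x))

print(y.shape)                     # unchanged: (2, 5, d_model)
print(y.mean(dim=-1).abs().max())  # close to 0 for every position
```

Because the output shape matches the input shape, the same pattern can wrap any sub-layer that maps `d_model` to `d_model`.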
+
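+### 5. Wrap-up (5 min)
+
+- Show how the Transformer Block built today is stacked multiple times to form the full Encoder/Decoder.
+- Point out the differences between the Encoder Block and the Decoder Block (for example, whether Masked Multi-Head Attention is used).
+- Review the finished `live_coding_transformer_from_scratch.py` end to end.
+- Explain how today's code relates to production implementations such as `BertModel` in the `transformers` library, and stress the value of using the library in practice.
+- Hold a Q&A session.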
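
Stacking blocks into a full encoder can be illustrated as follows. To stay self-contained, this sketch uses PyTorch's built-in `nn.TransformerEncoderLayer` as a stand-in for the `EncoderBlock` written in this session; the sizes and `num_layers = 3` are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model, num_heads, d_ff, num_layers = 32, 4, 64, 3

# Stand-in block: PyTorch's built-in encoder layer, not the EncoderBlock from this commit
blocks = nn.ModuleList([
    nn.TransformerEncoderLayer(d_model, num_heads, dim_feedforward=d_ff, batch_first=True)
    for _ in range(num_layers)
])

x = torch.rand(2, 10, d_model)  # (batch_size, seq_len, d_model)
for block in blocks:
    # Each block preserves the shape, so the output feeds the next block directly
    x = block(x)

print(x.shape)  # still (2, 10, d_model)
```

The hand-written `EncoderBlock` could be dropped into the same `nn.ModuleList` loop, since it also maps `(batch_size, seq_len, d_model)` to the same shape.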
\ No newline at end of file
--- a/ai lecture/source_code/07_deep_learning/live_coding_transformer_from_scratch.py
+++ b/ai lecture/source_code/07_deep_learning/live_coding_transformer_from_scratch.py
+import torch
+import torch.nn as nn
+import math
+
+# Implements the core components of the Transformer from scratch, following the live coding scenario.
+
+def scaled_dot_product_attention(q, k, v, mask=None):
+    """
+    Computes Scaled Dot-Product Attention.
+    Q, K, and V are tensors of shape (batch_size, num_heads, seq_len, d_k).
+    """
+    # 1. Multiply Q by the transpose of K (attention scores).
+    # (batch_size, num_heads, seq_len, d_k) @ (batch_size, num_heads, d_k, seq_len)
+    # -> (batch_size, num_heads, seq_len, seq_len)
+    d_k = q.size(-1)
+    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
+
+    # 2. (Optional) Apply the mask.
+    if mask is not None:
+        # Fill positions where the mask is 0 with a large negative value so their probability is close to 0 after softmax.
+        scores = scores.masked_fill(mask == 0, -1e9)
+
+    # 3. Apply softmax to get the attention weights.
+    attn_weights = torch.softmax(scores, dim=-1)
+
+    # 4. Multiply the attention weights by V.
+    # (batch_size, num_heads, seq_len, seq_len) @ (batch_size, num_heads, seq_len, d_v)
+    # -> (batch_size, num_heads, seq_len, d_v)
+    output = torch.matmul(attn_weights, v)
+    return output, attn_weights
+
+
+class MultiHeadAttention(nn.Module):
+    """
+    Multi-Head Attention module.
+    """
+    def __init__(self, d_model, num_heads):
+        super(MultiHeadAttention, self).__init__()
+        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
+
+        self.d_model = d_model      # total model dimension
+        self.num_heads = num_heads  # number of heads
+        self.d_k = d_model // num_heads  # dimension of each head
+
+        # Linear layers for Q, K, V, and the final output
+        self.w_q = nn.Linear(d_model, d_model)
+        self.w_k = nn.Linear(d_model, d_model)
+        self.w_v = nn.Linear(d_model, d_model)
+        self.w_o = nn.Linear(d_model, d_model)
+
+    def split_heads(self, x):
+        # (batch_size, seq_len, d_model) -> (batch_size, num_heads, seq_len, d_k)
+        batch_size, seq_len, _ = x.size()
+        return x.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
+
+    def combine_heads(self, x):
+        # (batch_size, num_heads, seq_len, d_k) -> (batch_size, seq_len, d_model)
+        batch_size, _, seq_len, _ = x.size()
+        return x.transpose(1, 2).contiguous().view(batch_size, seq_len, self.d_model)
+
+    def forward(self, q, k, v, mask=None):
+        # 1. Project the input Q, K, and V through their Linear layers.
+        q, k, v = self.w_q(q), self.w_k(k), self.w_v(v)
+
+        # 2. Split into multiple heads.
+        q, k, v = self.split_heads(q), self.split_heads(k), self.split_heads(v)
+
+        # 3. Run Scaled Dot-Product Attention.
+        attn_output, attn_weights = scaled_dot_product_attention(q, k, v, mask)
+
+        # 4. Merge the heads back together.
+        output = self.combine_heads(attn_output)
+
+        # 5. Pass through the final Linear layer and return the result.
+        output = self.w_o(output)
+        return output
+
+
+class PositionwiseFeedForward(nn.Module):
+    """
+    Position-wise Feed-Forward Network.
+    """
+    def __init__(self, d_model, d_ff, dropout=0.1):
+        super(PositionwiseFeedForward, self).__init__()
+        self.linear1 = nn.Linear(d_model, d_ff)
+        self.linear2 = nn.Linear(d_ff, d_model)
+        self.relu = nn.ReLU()
+        self.dropout = nn.Dropout(dropout)
+
+    def forward(self, x):
+        return self.linear2(self.dropout(self.relu(self.linear1(x))))
+
+
+class EncoderBlock(nn.Module):
+    """
+    A single Transformer Encoder Block.
+    """
+    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
+        super(EncoderBlock, self).__init__()
+        self.self_attn = MultiHeadAttention(d_model, num_heads)
+        self.feed_forward = PositionwiseFeedForward(d_model, d_ff, dropout)
+        self.norm1 = nn.LayerNorm(d_model)
+        self.norm2 = nn.LayerNorm(d_model)
+        self.dropout1 = nn.Dropout(dropout)
+        self.dropout2 = nn.Dropout(dropout)
+
+    def forward(self, x, mask=None):
+        # 1. Multi-Head Self-Attention (first sub-layer)
+        # Residual connection followed by Layer Normalization
+        attn_output = self.self_attn(x, x, x, mask)
+        x = self.norm1(x + self.dropout1(attn_output))
+
+        # 2. Position-wise Feed-Forward Network (second sub-layer)
+        # Residual connection followed by Layer Normalization
+        ff_output = self.feed_forward(x)
+        x = self.norm2(x + self.dropout2(ff_output))
+
+        return x
+
+
+if __name__ == '__main__':
+    # --- Hyperparameters ---
+    batch_size = 4
+    seq_len = 60      # maximum sentence length
+    d_model = 512     # model embedding dimension
+    num_heads = 8     # number of heads in Multi-Head Attention
+    d_ff = 2048       # inner dimension of the Feed-Forward network
+
+    # --- Create dummy data ---
+    # In practice this would be the output of an Embedding layer applied to token IDs from a tokenizer.
+    dummy_input = torch.rand(batch_size, seq_len, d_model)
+    print(f"Input tensor shape: {dummy_input.shape}")
+
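+    # --- Instantiate and run the Encoder Block ---
+    encoder_block = EncoderBlock(d_model, num_heads, d_ff)
+    output = encoder_block(dummy_input)
+
+    print(f"Encoder Block output tensor shape: {output.shape}")
+    # Verify the claim below instead of printing it unconditionally.
+    assert output.shape == dummy_input.shape
+    print("\nSuccess: the output tensor has the same shape as the input after passing through the Encoder Block.")
+    print("This means multiple Encoder Blocks can be stacked to build the full Encoder.")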
\ No newline at end of file