This section's homework does not use the Hybrid Search from 2-6.
The instructor's code produces the answers with only minor modifications (Approach 1).
▌Homework implementation highlights
Q1: Embedding the query
- Use the jinaai/jina-embeddings-v2-small-en model
- Query text: 'I just discovered the course. Can I join now?'
- Find the minimum value in the resulting vector
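Q2: Cosine similarity
- Embed a second text: 'Can I still join the course after the start date?'
- Compute the cosine similarity between the two vectors
- FastEmbed vectors are already normalized, so the dot product can be used directly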
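To see why the dot product is enough in Q2, here is a minimal sketch. It uses two made-up NumPy vectors rather than real FastEmbed output, scales them to unit length, and checks that the plain dot product matches the full cosine formula. The helper names `normalize` and `cosine_similarity` are illustrative, not part of the homework.

```python
import numpy as np


def normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit (L2) length."""
    return v / np.linalg.norm(v)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Full cosine formula: dot product divided by the product of the norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up vectors standing in for two embeddings (assumption, not model output).
a = normalize(np.array([1.0, 2.0, 3.0]))
b = normalize(np.array([2.0, 0.5, 1.0]))

full = cosine_similarity(a, b)
shortcut = float(np.dot(a, b))  # valid only because both norms are 1

print(np.isclose(full, shortcut))
```

If the vectors were not normalized, the shortcut would no longer be a cosine similarity, so the full formula is the safer default for embeddings from other sources.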
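Q3: Ranking by the text field
- Embed the text field of the 5 documents (index the answers only)
- Compute each one's similarity to the query vector
- Find the index of the document with the highest similarity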
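Q4: Ranking by question + text
- Concatenate the question and text fields (index question plus answer)
- Recompute the similarities and sort again
- Check whether the result differs from Q3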
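The ranking in Q3 and Q4 can also be written as one matrix-vector product instead of a loop. The sketch below assumes unit-length vectors and uses toy 2-dimensional arrays in place of real embeddings; `rank_by_dot` is a hypothetical helper name. It also compares the two top indices directly, rather than assuming what the Q3 answer was.

```python
import numpy as np


def rank_by_dot(query_vector: np.ndarray, doc_vectors: np.ndarray) -> np.ndarray:
    """Return document indices sorted from most to least similar."""
    scores = doc_vectors @ query_vector  # one dot product per row
    return np.argsort(-scores)


# Toy unit vectors standing in for embeddings (assumption, not model output).
query = np.array([1.0, 0.0])
text_only = np.array([[0.6, 0.8], [0.8, 0.6], [0.0, 1.0]])
question_plus_text = np.array([[0.96, 0.28], [0.8, 0.6], [0.0, 1.0]])

top_q3 = int(rank_by_dot(query, text_only)[0])
top_q4 = int(rank_by_dot(query, question_plus_text)[0])

# True when adding the question changes which document ranks first
print(top_q3, top_q4, top_q3 != top_q4)
```

With real embeddings, `doc_vectors` would be `np.array(list(model.embed(texts)))`, as in the listings further down.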
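Q5: Finding the model with the smallest dimensionality
- Use TextEmbedding.list_supported_models() to list all models
- Find the smallest vector dimensionality
- For example, BAAI/bge-small-en has 384 dimensions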
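A compact way to do the Q5 lookup is `min()` plus a list comprehension. The sketch below runs on a made-up list shaped like what the code in this post reads from `TextEmbedding.list_supported_models()`, that is, dicts with "model" and "dim" keys. The model names are placeholders, and the real return shape may differ between fastembed versions.

```python
# Made-up sample in the shape this post's code expects (assumption):
# a list of dicts with "model" and "dim" keys.
sample_models = [
    {"model": "example/model-a", "dim": 768},
    {"model": "example/model-b", "dim": 384},
    {"model": "example/model-c", "dim": 512},
    {"model": "example/model-d", "dim": 384},
]

# Smallest dimensionality, then every model that shares it
min_dim = min(m["dim"] for m in sample_models)
smallest = [m["model"] for m in sample_models if m["dim"] == min_dim]

print(min_dim, smallest)
```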
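Q6: Indexing with Qdrant
- Load the machine-learning-zoomcamp documents
- Build the vector index with the small-dimensionality model
- Embed the question + text combination
- Search and return the highest score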
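▌Approach 1: The instructor's style
To stay as close as possible to the instructor's original code, the first listing below has no added explanatory comments.
The code is adapted from 2-5 rag.ipynb. This style works well for step-by-step teaching, but it is less flexible (for example, each kind of search ends up written as its own method).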
import requests
import numpy as np
from fastembed import TextEmbedding
from qdrant_client import QdrantClient, models
import uuid
# Load documents
docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()
documents = []
for course_dict in documents_raw:
    for doc in course_dict['documents']:
        doc['course'] = course_dict['course']
        documents.append(doc)
# Q1: Embedding the query
query = 'I just discovered the course. Can I join now?'
model = TextEmbedding(model_name='jinaai/jina-embeddings-v2-small-en')
embeddings = list(model.embed([query]))
query_vector = embeddings[0]
print(f"Vector size: {len(query_vector)}")
print(f"Min value: {np.min(query_vector)}")
# Q2: Cosine similarity with another vector
doc = 'Can I still join the course after the start date?'
doc_embeddings = list(model.embed([doc]))
doc_vector = doc_embeddings[0]
cosine_similarity = np.dot(query_vector, doc_vector)
print(f"Cosine similarity: {cosine_similarity}")
# Q3: Ranking by cosine
documents_q3 = [
{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
'section': 'General course-related questions',
'question': 'Course - Can I still join the course after the start date?',
'course': 'data-engineering-zoomcamp'},
{'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
'section': 'General course-related questions',
'question': 'Course - Can I follow the course after it finishes?',
'course': 'data-engineering-zoomcamp'},
{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first Office Hours live.\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon't forget to register in DataTalks.Club's Slack and join the channel.",
'section': 'General course-related questions',
'question': 'Course - When will the course start?',
'course': 'data-engineering-zoomcamp'},
{'text': 'You can start by installing and setting up all the dependencies and requirements:\nGoogle cloud account\nGoogle Cloud SDK\nPython 3 (installed with Anaconda)\nTerraform\nGit\nLook over the prerequisites and syllabus to see if you are comfortable with these subjects.',
'section': 'General course-related questions',
'question': 'Course - What can I do before the course starts?',
'course': 'data-engineering-zoomcamp'},
{'text': 'Star the repo! Share it with friends if you find it useful ❣️\nCreate a PR if you see you can improve the text or the structure of the repository.',
'section': 'General course-related questions',
'question': 'How can we contribute to the course?',
'course': 'data-engineering-zoomcamp'}
]
texts = [doc['text'] for doc in documents_q3]
embeddings = list(model.embed(texts))
similarities = []
for i, doc_vector in enumerate(embeddings):
    similarity = np.dot(query_vector, doc_vector)
    similarities.append((i, similarity))
similarities.sort(key=lambda x: x[1], reverse=True)
print(f"Highest similarity document index: {similarities[0][0]}")
# Q4: Ranking by cosine, version two
full_texts = [doc['question'] + ' ' + doc['text'] for doc in documents_q3]
embeddings = list(model.embed(full_texts))
similarities = []
for i, doc_vector in enumerate(embeddings):
    similarity = np.dot(query_vector, doc_vector)
    similarities.append((i, similarity))
similarities.sort(key=lambda x: x[1], reverse=True)
print(f"Highest similarity document index (question + text): {similarities[0][0]}")
# Q5: Selecting the embedding model
models_list = TextEmbedding.list_supported_models()
min_dim = float('inf')
for model_info in models_list:
    dim = model_info.get('dim', 0)
    if dim > 0 and dim < min_dim:
        min_dim = dim
print(f"Smallest dimensionality: {min_dim}")
# Q6: Indexing with qdrant
client = QdrantClient("http://localhost:6333")
collection_name = "ml-zoomcamp"
try:
    client.delete_collection(collection_name=collection_name)
except Exception:
    pass
client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=384,
        distance=models.Distance.COSINE
    )
)
small_model = TextEmbedding(model_name='BAAI/bge-small-en')
points = []
for i, doc in enumerate(documents):
    text = doc['question'] + ' ' + doc['text']
    embeddings = list(small_model.embed([text]))
    vector = embeddings[0]
    point = models.PointStruct(
        id=str(uuid.uuid4()),
        vector=vector.tolist(),
        payload={
            "text": doc['text'],
            "section": doc['section'],
            "question": doc['question'],
            "course": doc['course']
        }
    )
    points.append(point)
client.upsert(
    collection_name=collection_name,
    points=points
)
query_embeddings = list(small_model.embed([query]))
query_vector_small = query_embeddings[0]
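# All courses were indexed above, so restrict the search to
# machine-learning-zoomcamp, as Q6 asks
search_result = client.search(
    collection_name=collection_name,
    query_vector=query_vector_small.tolist(),
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="course",
                match=models.MatchValue(value="machine-learning-zoomcamp")
            )
        ]
    ),
    limit=1
)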
highest_score = search_result[0].score
print(f"Highest score: {highest_score}")
The same code with explanatory comments:
import requests
import numpy as np
from fastembed import TextEmbedding
from qdrant_client import QdrantClient, models
import uuid
# Load the dataset
# Download the JSON file with the Q&A documents for all courses from GitHub
docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()
# Build the document list
# Flatten the nested JSON into a simple list of documents, keeping every course
documents = []
for course in documents_raw:
    course_name = course['course']
    # Add each course's documents to the list, tagged with the course name
    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)
print(f"Loaded {len(documents)} documents across all courses")
# Count the documents per course
course_counts = {}
for doc in documents:
    course = doc['course']
    course_counts[course] = course_counts.get(course, 0) + 1
print("Documents per course:")
for course, count in course_counts.items():
    print(f"  {course}: {count}")
# Q1: Embed the query
# Turn the given query text into a vector
query = 'I just discovered the course. Can I join now?'
print(f"\nQ1: Embedding the query - '{query}'")
# Initialize the embedding model (the 512-dimensional Jina model)
model = TextEmbedding(model_name='jinaai/jina-embeddings-v2-small-en')
# Convert the query text into a vector
embeddings = list(model.embed([query]))
query_vector = embeddings[0]
print(f"Vector size: {len(query_vector)}")
print(f"Minimum value in the vector: {np.min(query_vector):.3f}")
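# Q2: Cosine similarity
# Measure how semantically similar two text vectors are
doc = 'Can I still join the course after the start date?'
print("\nQ2: Cosine similarity with another vector")
print(f"Document being compared: '{doc}'")
# Embed the second text as well
doc_embeddings = list(model.embed([doc]))
doc_vector = doc_embeddings[0]
# Compute the cosine similarity
# FastEmbed vectors are already normalized, so the dot product equals the cosine similarity
cosine_similarity = np.dot(query_vector, doc_vector)
print(f"Cosine similarity: {cosine_similarity:.3f}")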
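# Q3: Ranking by cosine similarity
# Check which of the 5 given documents is most similar to the query
print("\nQ3: Ranking by cosine similarity on the text field")
# The 5 test documents specified by the homework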
documents_q3 = [
{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
'section': 'General course-related questions',
'question': 'Course - Can I still join the course after the start date?',
'course': 'data-engineering-zoomcamp'},
{'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
'section': 'General course-related questions',
'question': 'Course - Can I follow the course after it finishes?',
'course': 'data-engineering-zoomcamp'},
{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first Office Hours live.\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon't forget to register in DataTalks.Club's Slack and join the channel.",
'section': 'General course-related questions',
'question': 'Course - When will the course start?',
'course': 'data-engineering-zoomcamp'},
{'text': 'You can start by installing and setting up all the dependencies and requirements:\nGoogle cloud account\nGoogle Cloud SDK\nPython 3 (installed with Anaconda)\nTerraform\nGit\nLook over the prerequisites and syllabus to see if you are comfortable with these subjects.',
'section': 'General course-related questions',
'question': 'Course - What can I do before the course starts?',
'course': 'data-engineering-zoomcamp'},
{'text': 'Star the repo! Share it with friends if you find it useful ❣️\nCreate a PR if you see you can improve the text or the structure of the repository.',
'section': 'General course-related questions',
'question': 'How can we contribute to the course?',
'course': 'data-engineering-zoomcamp'}
]
# Embed only the 'text' field (the answer) of each document
texts = [doc['text'] for doc in documents_q3]
embeddings = list(model.embed(texts))
# Compute each document's similarity to the query
similarities = []
for i, doc_vector in enumerate(embeddings):
    similarity = np.dot(query_vector, doc_vector)
    similarities.append((i, similarity))
# Sort by similarity, highest first
similarities.sort(key=lambda x: x[1], reverse=True)
print("Document ranking (by descending similarity):")
for rank, (doc_idx, similarity) in enumerate(similarities):
    print(f"  Rank {rank+1}: document {doc_idx}, similarity: {similarity:.3f}")
print(f"Index of the most similar document: {similarities[0][0]}")
# Q4: Ranking by question + text
# Check whether combining question and answer changes the ranking
print("\nQ4: Ranking by the question + text combination")
# Concatenate each document's question and answer into one text
# This adds semantic information, since the question part may match the query better
full_texts = [doc['question'] + ' ' + doc['text'] for doc in documents_q3]
embeddings = list(model.embed(full_texts))
# Recompute the similarities
similarities = []
for i, doc_vector in enumerate(embeddings):
    similarity = np.dot(query_vector, doc_vector)
    similarities.append((i, similarity))
# Sort, highest first
similarities.sort(key=lambda x: x[1], reverse=True)
print("Document ranking (using question + text):")
for rank, (doc_idx, similarity) in enumerate(similarities):
    print(f"  Rank {rank+1}: document {doc_idx}, similarity: {similarity:.3f}")
print(f"Index of the most similar document: {similarities[0][0]}")
print(f"Different from the Q3 result: {similarities[0][0] != 0}")  # assumes the Q3 result was index 0
# Q5: Selecting the embedding model
# Find the smallest dimensionality among the models FastEmbed supports
print("\nQ5: Finding the embedding model with the smallest dimensionality")
# Get the list of all supported models
models_list = TextEmbedding.list_supported_models()
# Find the smallest dimensionality
min_dim = float('inf')
smallest_models = []
for model_info in models_list:
    dim = model_info.get('dim', 0)
    if dim > 0:
        if dim < min_dim:
            min_dim = dim
            smallest_models = [model_info['model']]
        elif dim == min_dim:
            smallest_models.append(model_info['model'])
print(f"Smallest dimensionality: {min_dim}")
print(f"Example model with the smallest dimensionality: {smallest_models[0] if smallest_models else 'none'}")
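# Q6: Indexing with Qdrant
# Index the documents of all courses into the vector database, then search one course with a filter
print("\nQ6: Indexing and searching with Qdrant")
# Connect to the local Qdrant service
client = QdrantClient("http://localhost:6333")
# Set the collection name
collection_name = "ml-zoomcamp"
# Delete the collection if it already exists (start fresh)
try:
    client.delete_collection(collection_name=collection_name)
    print("Deleted the existing collection")
except Exception:
    print("Collection does not exist, creating a new one")
# Create the new vector collection
# Use 384 dimensions (matching the small model) and cosine distance
client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=384,  # smaller dimensionality to match the small model
        distance=models.Distance.COSINE
    )
)
print("Created the vector collection")
# Create a payload index to support filtering by course
# This index lets us filter search results by course name efficiently
client.create_payload_index(
    collection_name=collection_name,
    field_name="course",
    field_schema="keyword"  # keyword type for exact matching
)
print("Created the index on the course field")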
# Initialize the small embedding model (384 dimensions)
small_model = TextEmbedding(model_name='BAAI/bge-small-en')
# Prepare the points to upload
points = []
print("Processing all documents and generating vectors...")
for i, doc in enumerate(documents):
    # Combine question and answer into the full text
    text = doc['question'] + ' ' + doc['text']
    # Generate the vector with the small model
    embeddings = list(small_model.embed([text]))
    vector = embeddings[0]
    # Build the point
    point = models.PointStruct(
        id=str(uuid.uuid4()),  # UUID as the unique identifier
        vector=vector.tolist(),  # convert the numpy array to a list
        payload={
            "text": doc['text'],
            "section": doc['section'],
            "question": doc['question'],
            "course": doc['course']  # keep the course so it can be filtered on
        }
    )
    points.append(point)
# Upload all points to Qdrant (data from every course)
client.upsert(
    collection_name=collection_name,
    points=points
)
print(f"Uploaded {len(points)} points to the vector database (all courses included)")
# Search with the same query, but only within the machine learning course
print(f"Search query: '{query}'")
# Embed the query with the small model too
query_embeddings = list(small_model.embed([query]))
query_vector_small = query_embeddings[0]
# Run the vector search in Qdrant, with a filter that limits it to the machine learning course
# This follows the instructor's approach in 2-5: load all the data, filter at search time
search_result = client.search(
    collection_name=collection_name,
    query_vector=query_vector_small.tolist(),
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="course",
                match=models.MatchValue(value="machine-learning-zoomcamp")
            )
        ]
    ),
    limit=1  # return only the single most similar result
)
# Show the search result
if search_result:
    highest_score = search_result[0].score
    print(f"Highest similarity score: {highest_score:.3f}")
    print(f"Most relevant document: {search_result[0].payload['question']}")
    print(f"Course: {search_result[0].payload['course']}")
else:
    print("No search results found")
# Verify the effect of the filter: run the same query across all courses
print("\nFor comparison: result of searching all courses")
search_all = client.search(
    collection_name=collection_name,
    query_vector=query_vector_small.tolist(),
    limit=1
)
if search_all:
    print(f"Highest similarity score: {search_all[0].score:.3f}")
    print(f"Most relevant document: {search_all[0].payload['question']}")
    print(f"Course: {search_all[0].payload['course']}")
print("\nHomework complete! All questions have been run.")
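I had also wanted to put together a before-and-after comparison of the instructor's code (before: the 2-5 version; after: with the homework changes added). It turned out to be fiddly: I tried several prompts and none produced what I wanted, so that will have to wait until I have time.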
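One way to get such a comparison without prompting is Python's standard `difflib` module, once both versions are saved as plain text. This is a minimal sketch: the two snippets and the file names `2-5_rag.py` and `homework.py` are hypothetical stand-ins, not the instructor's actual code.

```python
import difflib

# Hypothetical stand-ins for the 2-5 code and the homework version.
before = """\
model = TextEmbedding(model_name='jinaai/jina-embeddings-v2-small-en')
limit = 5
""".splitlines(keepends=True)

after = """\
model = TextEmbedding(model_name='BAAI/bge-small-en')
limit = 1
""".splitlines(keepends=True)

# unified_diff yields header lines, then "-" (removed) and "+" (added) lines
diff = list(
    difflib.unified_diff(before, after, fromfile="2-5_rag.py", tofile="homework.py")
)
print("".join(diff))
```

For two real files, reading each with `open(path).readlines()` gives the same kind of line lists.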
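▌Approach 2: OOP
As with last week's Homework 1, I asked Claude to help write the homework. This time I was careful not to alternate between two computers when entering prompts (which causes part of the data to be lost).
However, I could not work step by step the way I did last time, running one step and then pasting the program output back into the prompt.
The reason is that my prompt asked Claude to work in two phases: write the complete program first, then go through it step by step. Because the program was already written, the step-by-step second phase was not what I had in mind. I expected Q1, Q2, and so on; instead, the steps were installation, environment setup, and running the program (the whole Python script).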
# LLM Zoomcamp 2-7 Homework: Vector Search
# Homework: vector search and embeddings
import requests
import numpy as np
from fastembed import TextEmbedding
from qdrant_client import QdrantClient, models
import uuid
class VectorSearchHomework:
    """Vector search homework."""
    def __init__(self):
        """Initialize model names, the client slot, and the document list."""
        self.embedding_model_name = 'jinaai/jina-embeddings-v2-small-en'
        self.small_model_name = 'BAAI/bge-small-en'
        self.client = None
        self.documents = []
    def setup_qdrant_client(self):
        """Set up the Qdrant client connection."""
        print("Connecting to Qdrant...")
        self.client = QdrantClient("http://localhost:6333")
        print("Connected to Qdrant!")
    def load_documents(self):
        """Load the document data."""
        print("Loading documents...")
        docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
        docs_response = requests.get(docs_url)
        documents_raw = docs_response.json()
        # Keep only the machine-learning-zoomcamp documents
        for course in documents_raw:
            course_name = course['course']
            if course_name != 'machine-learning-zoomcamp':
                continue
            for doc in course['documents']:
                doc['course'] = course_name
                self.documents.append(doc)
        print(f"Loaded {len(self.documents)} ML Zoomcamp documents")
    def q1_embed_query(self):
        """Q1: Embed the query."""
        print("\n=== Q1: Embedding the query ===")
        query = 'I just discovered the course. Can I join now?'
        print(f"Query text: {query}")
        # Embed the query with FastEmbed
        model = TextEmbedding(model_name=self.embedding_model_name)
        embeddings = list(model.embed([query]))
        query_vector = embeddings[0]
        print(f"Vector size: {len(query_vector)}")
        print(f"Minimum value in the vector: {np.min(query_vector):.3f}")
        return query_vector
    def q2_cosine_similarity(self, query_vector):
        """Q2: Compute the cosine similarity."""
        print("\n=== Q2: Cosine similarity ===")
        doc_text = 'Can I still join the course after the start date?'
        print(f"Document text: {doc_text}")
        # Embed the document
        model = TextEmbedding(model_name=self.embedding_model_name)
        embeddings = list(model.embed([doc_text]))
        doc_vector = embeddings[0]
        # Compute the cosine similarity
        # The vectors are already normalized, so the dot product can be used directly
        cosine_similarity = np.dot(query_vector, doc_vector)
        print(f"Cosine similarity: {cosine_similarity:.3f}")
        return cosine_similarity, doc_vector
    def q3_ranking_by_cosine(self, query_vector):
        """Q3: Rank by cosine similarity."""
        print("\n=== Q3: Ranking by cosine similarity ===")
        # Prepare the documents
        documents = [
{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
'section': 'General course-related questions',
'question': 'Course - Can I still join the course after the start date?',
'course': 'data-engineering-zoomcamp'},
{'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
'section': 'General course-related questions',
'question': 'Course - Can I follow the course after it finishes?',
'course': 'data-engineering-zoomcamp'},
{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first Office Hours live.\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon't forget to register in DataTalks.Club's Slack and join the channel.",
'section': 'General course-related questions',
'question': 'Course - When will the course start?',
'course': 'data-engineering-zoomcamp'},
{'text': 'You can start by installing and setting up all the dependencies and requirements:\nGoogle cloud account\nGoogle Cloud SDK\nPython 3 (installed with Anaconda)\nTerraform\nGit\nLook over the prerequisites and syllabus to see if you are comfortable with these subjects.',
'section': 'General course-related questions',
'question': 'Course - What can I do before the course starts?',
'course': 'data-engineering-zoomcamp'},
{'text': 'Star the repo! Share it with friends if you find it useful ❣️\nCreate a PR if you see you can improve the text or the structure of the repository.',
'section': 'General course-related questions',
'question': 'How can we contribute to the course?',
'course': 'data-engineering-zoomcamp'}
]
        # Embed the text field of every document
        model = TextEmbedding(model_name=self.embedding_model_name)
        texts = [doc['text'] for doc in documents]
        embeddings = list(model.embed(texts))
        # Compute the similarity to the query vector
        similarities = []
        for i, doc_vector in enumerate(embeddings):
            similarity = np.dot(query_vector, doc_vector)
            similarities.append((i, similarity))
        # Sort by similarity, highest first
        similarities.sort(key=lambda x: x[1], reverse=True)
        print("Document ranking (by descending similarity):")
        for rank, (doc_idx, similarity) in enumerate(similarities):
            print(f"Rank {rank+1}: document {doc_idx}, similarity: {similarity:.3f}")
        highest_similarity_index = similarities[0][0]
        print(f"\nIndex of the most similar document: {highest_similarity_index}")
        return highest_similarity_index
    def q4_ranking_question_and_text(self, query_vector):
        """Q4: Rank by the similarity of question + text."""
        print("\n=== Q4: Ranking by question + text ===")
        # Use the same documents as Q3
        documents = [
{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
'section': 'General course-related questions',
'question': 'Course - Can I still join the course after the start date?',
'course': 'data-engineering-zoomcamp'},
{'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
'section': 'General course-related questions',
'question': 'Course - Can I follow the course after it finishes?',
'course': 'data-engineering-zoomcamp'},
{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first Office Hours live.\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon't forget to register in DataTalks.Club's Slack and join the channel.",
'section': 'General course-related questions',
'question': 'Course - When will the course start?',
'course': 'data-engineering-zoomcamp'},
{'text': 'You can start by installing and setting up all the dependencies and requirements:\nGoogle cloud account\nGoogle Cloud SDK\nPython 3 (installed with Anaconda)\nTerraform\nGit\nLook over the prerequisites and syllabus to see if you are comfortable with these subjects.',
'section': 'General course-related questions',
'question': 'Course - What can I do before the course starts?',
'course': 'data-engineering-zoomcamp'},
{'text': 'Star the repo! Share it with friends if you find it useful ❣️\nCreate a PR if you see you can improve the text or the structure of the repository.',
'section': 'General course-related questions',
'question': 'How can we contribute to the course?',
'course': 'data-engineering-zoomcamp'}
]
        # Combine question and text
        model = TextEmbedding(model_name=self.embedding_model_name)
        full_texts = [doc['question'] + ' ' + doc['text'] for doc in documents]
        embeddings = list(model.embed(full_texts))
        # Compute the similarities
        similarities = []
        for i, doc_vector in enumerate(embeddings):
            similarity = np.dot(query_vector, doc_vector)
            similarities.append((i, similarity))
        # Sort, highest first
        similarities.sort(key=lambda x: x[1], reverse=True)
        print("Document ranking (using question + text):")
        for rank, (doc_idx, similarity) in enumerate(similarities):
            print(f"Rank {rank+1}: document {doc_idx}, similarity: {similarity:.3f}")
        highest_similarity_index = similarities[0][0]
        print(f"\nIndex of the most similar document: {highest_similarity_index}")
        is_different = highest_similarity_index != 0  # assumes the Q3 answer is 0
        print(f"Different from the Q3 result? {is_different}")
        return highest_similarity_index
    def q5_smallest_dimensionality(self):
        """Q5: Find the model with the smallest dimensionality."""
        print("\n=== Q5: Smallest-dimensionality model ===")
        # List all supported models
        # (named supported_models so it does not shadow qdrant_client's models)
        supported_models = TextEmbedding.list_supported_models()
        # Find the smallest dimensionality
        min_dim = float('inf')
        smallest_models = []
        for model_info in supported_models:
            dim = model_info.get('dim', 0)
            if dim > 0:
                if dim < min_dim:
                    min_dim = dim
                    smallest_models = [model_info]
                elif dim == min_dim:
                    smallest_models.append(model_info)
        print(f"Smallest dimensionality: {min_dim}")
        print("Models with the smallest dimensionality:")
        for model_info in smallest_models:
            print(f"- {model_info['model']}")
        return min_dim
    def q6_index_with_qdrant(self):
        """Q6: Index with Qdrant."""
        print("\n=== Q6: Indexing with Qdrant ===")
        collection_name = "ml-zoomcamp-homework"
        # Delete the existing collection (if there is one)
        try:
            self.client.delete_collection(collection_name=collection_name)
            print("Deleted the existing collection")
        except Exception:
            pass
        # Create the new collection (with the small model's dimensionality)
        self.client.create_collection(
            collection_name=collection_name,
            vectors_config=models.VectorParams(
                size=384,  # dimensionality of BAAI/bge-small-en
                distance=models.Distance.COSINE
            )
        )
        print("Created the new collection")
        # Prepare the points
        points = []
        model = TextEmbedding(model_name=self.small_model_name)
        for i, doc in enumerate(self.documents):
            # Combine question and text
            text = doc['question'] + ' ' + doc['text']
            # Embed the text
            embeddings = list(model.embed([text]))
            vector = embeddings[0]
            point = models.PointStruct(
                id=str(uuid.uuid4()),
                vector=vector.tolist(),
                payload={
                    "text": doc['text'],
                    "section": doc['section'],
                    "question": doc['question'],
                    "course": doc['course']
                }
            )
            points.append(point)
        # Upload to Qdrant
        self.client.upsert(
            collection_name=collection_name,
            points=points
        )
        print(f"Uploaded {len(points)} points")
        # Search test
        query = 'I just discovered the course. Can I join now?'
        query_embeddings = list(model.embed([query]))
        query_vector = query_embeddings[0]
        search_result = self.client.search(
            collection_name=collection_name,
            query_vector=query_vector.tolist(),
            limit=1
        )
        if search_result:
            highest_score = search_result[0].score
            print(f"Highest score: {highest_score:.3f}")
            return highest_score
        else:
            print("No results found")
            return 0
    def run_all_questions(self):
        """Run all the questions."""
        print("Starting LLM Zoomcamp 2-7 Homework")
        print("=" * 50)
        # Set up the connection
        self.setup_qdrant_client()
        # Load the data
        self.load_documents()
        # Q1: Embed the query
        query_vector = self.q1_embed_query()
        # Q2: Cosine similarity
        cosine_sim, doc_vector = self.q2_cosine_similarity(query_vector)
        # Q3: Ranking
        highest_idx_q3 = self.q3_ranking_by_cosine(query_vector)
        # Q4: Ranking by question + text
        highest_idx_q4 = self.q4_ranking_question_and_text(query_vector)
        # Q5: Smallest dimensionality
        min_dim = self.q5_smallest_dimensionality()
        # Q6: Qdrant index
        highest_score = self.q6_index_with_qdrant()
        print("\n" + "=" * 50)
        print("Homework complete! Summary of results:")
        print(f"Q1 - Minimum vector value: {np.min(query_vector):.3f}")
        print(f"Q2 - Cosine similarity: {cosine_sim:.3f}")
        print(f"Q3 - Index of the most similar document: {highest_idx_q3}")
        print(f"Q4 - Index of the most similar document: {highest_idx_q4}")
        print(f"Q5 - Smallest dimensionality: {min_dim}")
        print(f"Q6 - Highest score: {highest_score:.3f}")
if __name__ == "__main__":
    # Run the homework
    homework = VectorSearchHomework()
    homework.run_all_questions()
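▌OOP vs. the traditional approach
》Advantages of OOP
- Organization: related functionality and data (documents, client, model_name) live together
- Reusability: to handle a different dataset or adjust parameters later, only the initialization needs to change
- State management: things like the Qdrant client connection and the loaded documents are easier to manage on an instance
- Extensibility: adding features later (for example different scoring methods, or comparing several models) is easier with a class structure
- Professional habit: in real work, a task with several related steps like this one is usually wrapped in a class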
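As a small illustration of the reusability and state-management points, here is a sketch of a class that receives its embedder through the constructor and keeps the indexed vectors on the instance. `SimilarityRanker` and `toy_embed` are hypothetical names; the toy embedder only counts letters and stands in for a real model such as the FastEmbed ones used above.

```python
from typing import Callable, List, Sequence

import numpy as np

# Hypothetical embedder type: takes texts, returns one unit vector per text.
Embedder = Callable[[Sequence[str]], np.ndarray]


class SimilarityRanker:
    """Keeps the embedder and the indexed documents together as instance state."""

    def __init__(self, embed: Embedder):
        self.embed = embed
        self.texts: List[str] = []
        self.vectors = None

    def index(self, texts: Sequence[str]) -> None:
        """Embed the texts once and store the vectors on the instance."""
        self.texts = list(texts)
        self.vectors = self.embed(self.texts)

    def best_match(self, query: str) -> int:
        """Return the index of the stored text most similar to the query."""
        scores = self.vectors @ self.embed([query])[0]
        return int(np.argmax(scores))


def toy_embed(texts: Sequence[str]) -> np.ndarray:
    """Stand-in embedder: normalized counts of the letters 'a' and 'b'."""
    raw = np.array([[t.count("a") + 1e-9, t.count("b") + 1e-9] for t in texts])
    return raw / np.linalg.norm(raw, axis=1, keepdims=True)


ranker = SimilarityRanker(embed=toy_embed)
ranker.index(["aaa", "bbb", "ab"])
print(ranker.best_match("aab"))
```

Switching to another model or dataset then only touches the constructor argument and the `index` call, which is the trade-off the list above describes.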
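》Advantages of the function-based style
- Clear for teaching: each function maps to one concept, which makes it easy to follow
- Experiment-friendly: each piece can be run and debugged on its own in a notebook
- Simple and direct: there is no extra layer of abstraction, so beginners pick it up more easily
Each approach has scenarios where it fits best.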
▌References
Homework