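1️⃣ Section 10. Text Embedding

sky · 2023-04-19 11:40

This is probably one of the most important lessons in this course.

What problem does text embedding mainly solve?

It converts text into vectors. By checking how similar two vectors are, we can find the text that most likely corresponds to a given input.

When the data we need is not in what ChatGPT currently knows (for example, because it is too recent), we can use the Embedding method to build our own set of vectors for that data.

從大數據走向人工智慧 (From Big Data to Artificial Intelligence) from Sheng-Wei (Kuan-Ta) Chen

This chapter uses two OpenAI APIs (Embedding is the one newly introduced in this lecture):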

openai.Completion.create(
    ...
    model="text-davinci-003"
)

openai.Embedding.create(
    model='text-embedding-ada-002',
    ...
)

Use Cases

Note: the use cases below were all generated by ChatGPT and have not been edited yet (for example, the terminology is left as generated).

Search

Where results are ranked by relevance to a query string

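Below are some common use cases for search with text embeddings. These techniques help users find relevant text faster.

Similar-text search: convert texts to vectors, then find the stored vectors closest to the input text's vector to retrieve similar texts.
Keyword search: use embeddings to help a search engine understand what a keyword means, so it can retrieve relevant material more precisely.
Multilingual search: convert texts in different languages to vectors, so search works across languages.
Fuzzy (non-exact) matching: convert texts to vectors and compare their similarity, so a match does not have to be exact.
Text recommendation: convert a user's past texts to vectors, then recommend new texts whose vectors are most similar to them.
Autocomplete: convert the user's partial input to a vector, then complete it with the texts whose vectors are most similar.
Information retrieval: convert texts to vectors and build an index over them, so relevant text can be found quickly.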
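
To make the ranking idea concrete, here is a minimal sketch in plain Python. The document titles and the three-dimensional vectors are made up for illustration; real embeddings would come from the Embedding API and have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical "embeddings" keyed by document title.
documents = {
    "cat care tips": [0.9, 0.1, 0.0],
    "dog training guide": [0.7, 0.3, 0.1],
    "stock market news": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.0]  # made-up embedding of a pet-related query

def rank(query, docs):
    """Return titles sorted from most to least similar to the query."""
    return sorted(docs, key=lambda title: cosine_similarity(query, docs[title]), reverse=True)

ranking = rank(query_vector, documents)
print(ranking)
```

The pet-related documents rank ahead of the finance one because their vectors point in nearly the same direction as the query vector.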

Embedding Clustering

Where text strings are grouped by similarity

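Below are some common use cases for clustering with text embeddings. These techniques help users understand how texts resemble and differ from one another.

Topic grouping: convert texts to vectors, then use a clustering algorithm to group similar texts by topic.
User segmentation: convert each user's past texts to vectors, then cluster similar users together.
Sentiment analysis: convert texts to vectors, then cluster texts that express similar sentiment.
Text similarity comparison: convert texts to vectors, then cluster similar texts to compare how alike they are.
Recommender systems: convert product descriptions to vectors, then cluster similar products to drive recommendations.
Linguistics research: convert texts in different languages to vectors, then cluster similar texts to study similarities and differences between languages.
Information retrieval: convert texts to vectors, then cluster similar texts so relevant content can be found quickly.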
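
The items above all come down to grouping nearby vectors. Below is a minimal k-means sketch in plain Python, assuming made-up 2-D vectors and hand-picked starting centroids. A real project would more likely rely on a library implementation.

```python
import math

def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(vectors):
    """Component-wise average of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def kmeans(vectors, centroids, iterations=10):
    """Assign each vector to its nearest centroid, then recompute the centroids."""
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda c: distance(v, centroids[c]))
            for v in vectors
        ]
        for c in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = mean(members)
    return labels

# Hypothetical embeddings: two texts about one topic, two about another.
vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels = kmeans(vectors, centroids=[vectors[0][:], vectors[2][:]])
print(labels)
```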

Anomaly detection

Where outliers with little relatedness are identified

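Below are some common use cases for anomaly detection with text embeddings. These techniques help users spot anomalies quickly, improving security and efficiency.

Spam detection: convert email content to vectors, then use an anomaly-detection algorithm (such as LOF or Isolation Forest) to flag spam that differs markedly from normal mail.
Website content detection: convert website content to vectors, then flag content that differs markedly from normal content, such as pornography or scams.
Text anomaly detection: convert texts to vectors, then flag texts that differ markedly from normal text, such as scams or fraud.
System log detection: convert system logs to vectors, then flag log entries that differ markedly from normal ones, such as unauthorized access or hacking attempts.
Security event detection: convert security event descriptions to vectors, then flag events that differ markedly from normal ones, such as potential network attacks or unauthorized access.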
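
As a rough illustration of "far from normal", the sketch below flags vectors whose distance to the centroid is much larger than the median distance. This is a deliberately simple heuristic, not LOF or Isolation Forest, and the vectors and the factor of 2.0 are assumptions chosen for the example.

```python
import math

def centroid(vectors):
    """Component-wise average of all vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def find_outliers(vectors, factor=2.0):
    """Return indices whose distance to the centroid exceeds factor * median distance."""
    center = centroid(vectors)
    distances = [distance(v, center) for v in vectors]
    median = sorted(distances)[len(distances) // 2]
    return [i for i, d in enumerate(distances) if d > factor * median]

# Hypothetical embeddings: four similar messages and one very different one.
vectors = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15], [0.9, 0.2], [0.0, 1.0]]
outliers = find_outliers(vectors)
print(outliers)
```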

Diversity measurement

Where similarity distributions are analyzed

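Below are some common use cases for diversity measurement with text embeddings. These techniques help users assess how similarity is distributed across a set of texts, determine how diverse the set is, and tune the related application accordingly.

News diversity analysis: convert news headlines to vectors, then use a diversity measure (such as KL divergence or cosine similarity) to analyze the similarity distribution among headlines and gauge how diverse the news is.
Recommender-system diversity analysis: convert product descriptions to vectors, then analyze the similarity distribution among recommended products to gauge how diverse the recommendations are.
Search-engine diversity analysis: convert search keywords to vectors, then analyze the similarity distribution among search results to gauge how diverse they are.
Text-classification diversity analysis: convert class labels to vectors, then analyze the similarity distribution between categories to gauge how diverse the classification scheme is.
Brand reputation analysis: convert brand-related reviews, news, and social media posts to vectors, then analyze the similarity distribution among them to gauge how varied the brand's reputation is.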
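
One simple way to put a number on diversity is the average pairwise cosine similarity of a set: the lower the average, the more diverse the set. The sketch below assumes made-up 2-D vectors, and the "1 minus mean similarity" score is just one possible definition.

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_pairwise_similarity(vectors):
    """Average cosine similarity over every unordered pair of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)

def diversity(vectors):
    """Hypothetical diversity score: 1 minus the mean pairwise similarity."""
    return 1.0 - mean_pairwise_similarity(vectors)

similar_set = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05]]  # headlines on one topic
diverse_set = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]    # headlines on mixed topics

print(diversity(similar_set) < diversity(diverse_set))
```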

Classification

Where text strings are classified by their most similar label

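Below are some use cases for classification with text embeddings. They all convert text to vectors with text embeddings and then use a classifier to label the text.

This approach can improve classifier accuracy because embeddings capture complex features of the text and its context. It applies to many text-classification tasks, from sentiment analysis to topic classification, and generally gives good results.

Sentiment classification: convert texts to vectors, then use a classifier (such as an SVM or a deep learning model) to label them as positive, negative, or neutral.
Topic classification: convert texts to vectors, then use a classifier to assign them to a topic (such as politics, sports, or technology).
Text-type classification: convert texts to vectors, then use a classifier to assign them to a type (such as news, announcement, or advertisement).
Question-type classification: convert texts to vectors, then use a classifier to assign them to a type of interaction (such as question, reply, or discussion).
Similarity classification: convert texts to vectors, then use a classifier to label them as similar or not similar to a given reference text.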
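
The one-line description above ("classified by their most similar label") can be sketched without any trained classifier: embed the label names too, then pick the label whose vector is closest to the text's vector. The label vectors and text vectors below are invented for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings of the label names themselves.
label_vectors = {
    "positive": [0.9, 0.1],
    "negative": [0.1, 0.9],
}

def classify(text_vector, labels):
    """Return the label whose vector is most similar to the text vector."""
    return max(labels, key=lambda name: cosine_similarity(text_vector, labels[name]))

print(classify([0.8, 0.3], label_vectors))
print(classify([0.2, 0.7], label_vectors))
```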

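What is text embedding?

"Text embedding" is a technique for converting text into numeric vectors. It does so by mapping each word or phrase to a point in a vector space. This vector representation captures the relationships and similarities between words or phrases, which makes text processing easier and more efficient.

Image source (OpenAI): New and improved embedding model

How text embedding works can be summarized as follows:

First, words or phrases are extracted from the raw text through preprocessing steps such as removing stop words, tokenizing, and cleaning the text.
Next, some technique is used to map those words or phrases to points in a vector space. Common techniques include:
- One-Hot Encoding: turns each word or phrase into a sparse vector in which exactly one element is 1 and all the others are 0.
- Bag of Words (BOW): represents a text as a frequency vector of words or phrases, where each dimension corresponds to one word or phrase.
- Word2Vec: uses a neural network to learn a vector for each word or phrase that reflects the relationships and similarities between them.
- GloVe: uses global co-occurrence statistics to learn a vector for each word or phrase that reflects the relationships and similarities between them.
Finally, these vectors can be used for text-processing tasks such as similarity comparison, classification, clustering, and recommendation.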
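
The two simplest techniques in the list, One-Hot Encoding and Bag of Words, fit in a few lines of plain Python. The two sentences and the whitespace tokenization below are assumptions for the example; Word2Vec and GloVe need trained models and are not shown.

```python
from collections import Counter

sentences = ["the cat sat", "the dog sat on the mat"]

# Vocabulary: every distinct word, in a fixed (sorted) order.
vocabulary = sorted({word for sentence in sentences for word in sentence.split()})

def one_hot(word):
    """Sparse vector with a single 1 at the word's position in the vocabulary."""
    return [1 if word == entry else 0 for entry in vocabulary]

def bag_of_words(sentence):
    """Frequency vector: how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.split())
    return [counts[entry] for entry in vocabulary]

print(vocabulary)
print(one_hot("dog"))
print(bag_of_words("the dog sat on the mat"))
```

Neither representation says anything about meaning: "cat" and "dog" are as far apart as "cat" and "mat", which is the gap that learned embeddings aim to close.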

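Overall, text embedding gives text a numeric representation by mapping words or phrases into a vector space. That representation captures the relationships and similarities between words or phrases, which improves the efficiency and accuracy of text processing.

Another course mentions that OpenAI's embedding model takes text and converts it into a 1536-dimensional vector.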

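Professor Hung-yi Lee (李宏毅)

The instructor of this study-group course did not go into much depth here (and told us to read the papers ourselves instead), so I recommend Professor Hung-yi Lee's materials.

Addendum: during the sharing session, a participant mentioned that Professor Yen-Lung Tsai's (蔡炎龍) lectures are also great. I have only read his articles (which are indeed great) and have not watched the videos, so I am adding his link as well.

(Official) course handout:

https://ai.ntu.edu.tw/resource/handouts/ML14.html

Course notes:

Notes that a generous soul took after watching the course videos.

Course videos: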

model hallucination

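Model hallucination refers to a model producing wrong predictions or inferences about things or phenomena that do not exist, during training or testing. Put simply, the model invents, or believes in, situations that do not actually exist or make no sense.

For example, an image-recognition model may learn during training to treat pictures of a blue sky as "sky"; when it is shown a picture of a green sky, it may misjudge it as a blue sky. That is one example of model hallucination.

Model hallucination is usually caused by flaws or bias in the training data. If the training data lacks diversity or contains incomplete or inaccurate labels, the model is prone to hallucination. A model that is overly complex or overfits the training data can also lead to it.

Model hallucination can be addressed by improving the data and the model design:

Use better training data, with more variety and more accurate labels.
Use a simpler model, to reduce model complexity and the degree of fitting.
Validate and test rigorously during training and testing, to ensure the model's reliability and responsiveness.

The instructor's approach in this lesson: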

Prompt Engineering to avoid hallucination:

Only answer if you are 100% certain, otherwise reply “Sorry, I’m not sure what the answer is”

Example (producing a hallucination)

import os
import openai
import pandas as pd
import tiktoken # https://github.com/openai/tiktoken

openai.api_key = os.getenv("OPENAI_API_KEY")

prompt = "What does the start-up company Pentera do and who invested in it?"

response = openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=500,
    model="text-davinci-003"
)
print(response["choices"][0]["text"].strip(" \n"))

Out (output)

Pentera is a start-up company that provides software solutions to help organizations manage their employee benefits programs. The company has raised $3.5 million in seed funding from investors including Y Combinator, SV Angel, and Social Leverage.

Example (fixing the hallucination)

Modify the prompt by adding the instruction mentioned earlier: Only answer the question below if you have 100% certainty of the facts.

import os
import openai
import pandas as pd
import tiktoken # https://github.com/openai/tiktoken

openai.api_key = os.getenv("OPENAI_API_KEY")

- prompt = "What does the start-up company Pentera do and who invested in it?"
+ prompt = """Only answer the question below if you have 100% certainty of the facts.
+ 
+ Q: What does the start-up company Pentera do and who invested in it?
+ A:"""

response = openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=500,
    model="text-davinci-003"
)
print(response["choices"][0]["text"].strip(" \n"))

Out (output)

I cannot answer this question with 100% certainty.

Document Data

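ast: Abstract Syntax Tree

The ast module helps Python code parse and work with abstract syntax trees.

literal_eval(): safely evaluates a string that contains a Python literal (for example a string, number, list, tuple, or dict). If the content is a valid literal, it returns the corresponding Python value; otherwise it raises an error instead of evaluating it.

In this example it is used to turn a single string into a list of strings ([str1, str2…]).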
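
A quick standalone check of that behavior, assuming the Investors cell looks like a Python list literal stored as text (the sample string below is typed by hand, not read from unicorns.csv):

```python
import ast

# Hand-written stand-in for one cell of the Investors column.
raw = "['Next Play Ventures', 'Zeal Capital Partners', 'SoftBank Group']"

investors = ast.literal_eval(raw)  # string -> real Python list
print(type(raw).__name__, "->", type(investors).__name__, len(investors))

# Anything that is not a plain literal is rejected rather than executed.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

That refusal to run arbitrary expressions is the reason to prefer literal_eval over eval for data read from a file.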

import pandas as pd  # in case it has not been imported yet
import ast 

df = pd.read_csv("unicorns.csv") 
df.head()  # show the first few rows of df
df['Investors'][0]  # check the Investors column: the value is actually a string

def summary(company,crunchbase_url,city,country,industry,investor_list):
    investors = 'The investors in the company are'
     
    for investor in ast.literal_eval(investor_list):  # split(', ') would also work, but then the quotes need handling...
        investors += f" {investor}, "

    text = f"{company} has headquarters in {city} in {country} \
        and is in the field of {industry}. {investors}. \
        You can find more information at {crunchbase_url}"

    return text 

df['summary'] = df.apply(lambda row: summary(row['Company'],row['Crunchbase Url'],row['City'],row['Country'],row['Industry'],row['Investors']),axis=1)
df['summary'][0]

Out (output)

‘Esusu has headquarters in New York in United States and is in the field of Fintech. The investors in the company are Next Play Ventures, Zeal Capital Partners, SoftBank Group, . You can find more information at https://www.cbinsights.com/company/esusu’

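df.head()  # show the first few rows of df
df['summary'][0]  # look at the first entry of the summary column

Out (output omitted: df now has the new summary column)

Out (output omitted: the first entry of the summary column)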

Token Count

For how to count tokens, see:

https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb

import tiktoken

def num_tokens_from_string(string, encoding_name):
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

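# Count how many tokens the first summary needs
num_tokens_from_string(df['summary'][0],encoding_name='cl100k_base')

df['token_count'] = df['summary'].apply(lambda text: num_tokens_from_string(text,'cl100k_base'))

df.head()

# Estimated embedding cost in USD, at the rate of $0.0004 per 1,000 tokens used here
df['token_count'].sum() * 0.0004 / 1000

# Check for rows longer than the 8191-token input limit of text-embedding-ada-002
df[df['token_count'] > 8191]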
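
The last two expressions are a cost estimate and a length check. Written out as small functions in plain Python, using the same 0.0004 (USD per 1,000 tokens) and 8191 figures from the code above and made-up token counts:

```python
PRICE_PER_1K_TOKENS = 0.0004  # USD per 1,000 tokens, the figure used above
MAX_INPUT_TOKENS = 8191       # per-request token limit used in the filter above

def embedding_cost(token_counts, price_per_1k=PRICE_PER_1K_TOKENS):
    """Total cost in USD for embedding texts with the given token counts."""
    return sum(token_counts) * price_per_1k / 1000

def too_long(token_counts, limit=MAX_INPUT_TOKENS):
    """Indices of texts whose token count exceeds the limit."""
    return [i for i, count in enumerate(token_counts) if count > limit]

token_counts = [60, 75, 9000]  # hypothetical per-row counts
cost = embedding_cost(token_counts)
oversized = too_long(token_counts)
print(f"{cost:.6f} USD, rows over the limit: {oversized}")
```

Pricing changes over time, so the rate is best treated as a parameter rather than a constant.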

# Continuing from the code above
def get_embedding(text):
    # Note how this function assumes you already set your Open AI key!
    result = openai.Embedding.create(
        model='text-embedding-ada-002',
        input=text
    )
    return result["data"][0]["embedding"]

get_embedding(df['summary'][0])

Out (output)

[0.012057947926223278,
-0.017802061513066292,
-0.022373223677277565,
…

# this will take awhile due to the amount of calls to the API.
# it will take about 0.5 seconds per row
df['embedding'] = df['summary'].apply(get_embedding)

# df.head()

df.to_csv('unicorns_with_embeddings.csv',index=False)
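
Note that a CSV file stores every cell as text, so after reloading the file each embedding comes back as a string rather than a list. Below is a minimal sketch of parsing it back, using the standard csv module and an inline stand-in for unicorns_with_embeddings.csv (the row content is made up):

```python
import ast
import csv
import io

# Inline stand-in for the saved file; a real file would be opened with open().
csv_text = 'summary,embedding\n"Example Co summary","[0.012, -0.017, -0.022]"\n'

rows = list(csv.DictReader(io.StringIO(csv_text)))
raw = rows[0]["embedding"]

print(type(raw).__name__)        # the cell is text after the round trip
vector = ast.literal_eval(raw)   # parse the text back into a list of floats
print(type(vector).__name__, len(vector))
```

With pandas, the same conversion can be applied to the whole column after read_csv, for example with df['embedding'].apply(ast.literal_eval).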

Document Similarity and Context Injection

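Once our own data is ready (here, 2022 startup data, which the current version of ChatGPT will not have), we can complete the whole question-answering flow.

1. Embed the user's input text into a vector with the OpenAI API.
2. Compare the vector from step 1 against the vectors in our own data by computing the cosine similarity between the query vector and every document vector.
3. Pick the most similar vector and inject its corresponding text into the prompt as context.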
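
The three steps can be sketched end to end without calling the API. In the sketch below the document embeddings and the question embedding are made-up unit-length vectors standing in for what the Embedding API would return, and build_prompt is a hypothetical helper name:

```python
def dot(a, b):
    """Dot product; equals cosine similarity when both vectors have length 1."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical documents with made-up unit-length embeddings.
documents = [
    {"summary": "Company A is in the field of Fintech.", "embedding": [1.0, 0.0]},
    {"summary": "Company B is in the field of Cybersecurity.", "embedding": [0.0, 1.0]},
]

def build_prompt(question, question_embedding, docs):
    """Pick the most similar document and inject its summary as context."""
    best = max(docs, key=lambda doc: dot(doc["embedding"], question_embedding))
    return (
        "Only answer the question below if you have 100% certainty of the facts, "
        "use the context below to answer.\n"
        f"Here is some context:\n{best['summary']}\n"
        f"Q: {question}\nA:"
    )

# Step 1 would call the Embedding API; here the vector is simply assumed.
prompt = build_prompt("What does Company B do?", [0.6, 0.8], documents)
print(prompt)
```

The resulting string is what would be sent to the completion model in place of the bare question.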

prompt = "What does the company Pentera do and who invested in it?"

prompt_embedding = get_embedding(prompt)

# prompt_embedding

import numpy as np
# There are other services/programs for larger amount of vectors
# Take a look at vector search engines like Pinecone or Weaviate
def vector_similarity(vec1,vec2):
    """
    Returns the similarity between two vectors.
    
    Because OpenAI Embeddings are normalized to length 1, the cosine similarity is the same as the dot product.
    """
    return np.dot(np.array(vec1), np.array(vec2))

df["prompt_similarity"] = df['embedding'].apply(lambda vector: vector_similarity(vector, prompt_embedding))

df.sort_values("prompt_similarity", ascending=False).head()

# Could also use sort_values() with ascending=False, but nlargest should be more performant
df.nlargest(1,'prompt_similarity').iloc[0]['summary']

Out (output)

‘Pentera has headquarters in Petah Tikva in Israel and is in the field of Cybersecurity . The investors in the company are AWZ Ventures, Blackstone, Insight Partners, . You can find more information at https://www.cbinsights.com/company/pcysys’
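
The docstring of vector_similarity says the dot product equals the cosine similarity when vectors have length 1. Here is a quick check of that claim in plain Python, with made-up vectors rather than real embeddings:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(dot(v, v))

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def normalize(v):
    """Scale a vector to length 1."""
    length = norm(v)
    return [x / length for x in v]

raw_a, raw_b = [3.0, 4.0], [1.0, 0.0]  # made-up vectors, not unit length
unit_a, unit_b = normalize(raw_a), normalize(raw_b)

# For the raw vectors the two measures differ...
print(dot(raw_a, raw_b), cosine_similarity(raw_a, raw_b))
# ...but once both have length 1 they agree (up to float rounding).
print(math.isclose(dot(unit_a, unit_b), cosine_similarity(unit_a, unit_b)))
```

Skipping the two norm computations is why the dot product is the cheaper choice when the vectors are already normalized.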

summary = df.nlargest(1,'prompt_similarity').iloc[0]['summary'] 

prompt = f"""Only answer the question below if you have 100% certainty of the facts, use the context below to answer.
Here is some context:
{summary}
Q: What does the start-up company Pentera do and who invested in it?
A:"""

response = openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=500,
    model="text-davinci-003"
)
print(response["choices"][0]["text"].strip(" \n"))

Out (output)

Pentera is a start-up company in the field of Cybersecurity with headquarters in Petah Tikva, Israel. The investors in the company are AWZ Ventures, Blackstone, and Insight Partners.

# Continuing from the code above
def embed_prompt_lookup():
    # initial question
    question = input("What question do you have about a Unicorn company? ")
    # Get embedding
    prompt_embedding = get_embedding(question)
    # Get prompt similarity with embeddings
    # Note how this will overwrite the prompt similarity column each time!
    df["prompt_similarity"] = df['embedding'].apply(lambda vector: vector_similarity(vector, prompt_embedding))

    # get most similar summary
    summary = df.nlargest(1,'prompt_similarity').iloc[0]['summary'] 

    prompt = f"""Only answer the question below if you have 100% certainty of the facts, use the context below to answer.
            Here is some context:
            {summary}
            Q: {question}
            A:"""

    response = openai.Completion.create(
        prompt=prompt,
        temperature=0,
        max_tokens=500,
        model="text-davinci-003"
    )
    print(response["choices"][0]["text"].strip(" \n"))

embed_prompt_lookup()

Out (output)

Momenta is a company in the field of Artificial Intelligence with headquarters in Beijing, China.

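In the video the instructor mentions that NumPy's dot is used here because the dataset is small; for larger datasets, consider Pinecone or one other service.

The site for the former is below. I could not make out the name of the second one in the video, though the comment in the code above names Weaviate alongside Pinecone, so that is probably it.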

https://www.pinecone.io/learn/roughly-explained/cosine-similarity/
