Part 1: Define Function
In this tutorial, we will create a simple question-answering chatbot that uses information from Wikipedia. This chatbot can be viewed as a workflow that consists of the following components:
- A web search component that searches for relevant Wikipedia articles, using the Wikipedia API.
- A web retriever that retrieves the text from Wikipedia, using the Wikipedia API.
- A question-answering LLM that decides which keywords to search on Wikipedia and afterward answers questions.
Prerequisite
In this tutorial, the question-answering LLM will be based on llama.cpp (source). To use it with Python, we will use llama-cpp-python (source). Please install llama-cpp-python with pip install llama-cpp-python, and download the llama-2 7b instruct model weights (llama-2-7b-chat.Q4_K_M.gguf) here.
Define a Function to get text from Wikipedia
A Function is simply an operation that takes in an input and returns an output. Let’s write the first Function to retrieve the text from Wikipedia. It has an option to retrieve the full HTML text from /page/html/{title} or a short summary from /page/summary/{title}.
    import json
    import urllib.parse as parse
    import urllib.request as request

    from theflow import Function


    class RetrieveArticle(Function):  # (1)
        """Retrieve the text content from a Wikipedia article"""

        full_text: bool  # (2)

        def run(self, title: str) -> str:  # (3)
            # Percent-encode the title so it is safe to place in the URL path
            title = parse.quote(title)
            url = (
                f"https://en.wikipedia.org/api/rest_v1/page/html/{title}"
                if self.full_text
                else f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}"
            )
            req = request.Request(url)
            with request.urlopen(req) as resp:
                text = resp.read().decode("utf-8")

            # The summary endpoint returns JSON; keep only the plain-text extract
            if not self.full_text:
                text = json.loads(text)["extract"]

            return text
- (1) Subclass theflow.Function to create a flow.
- (2) Declare the init parameters with an explicit type annotation. theflow will automatically treat any parameter whose type is not theflow.Function as an init parameter.
- (3) Declare the flow logic in run, with input and output type annotations. The annotations are optional, but highly recommended because they make it possible to check whether a Function is compatible with another Function.
In this declaration, full_text is an init parameter that can be set when initializing the flow. The flow logic is written in the run method, which takes in a string input parameter (title) and outputs a string.
Input parameters are transient between flow invocations, while init parameters are persistent.
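The URL construction inside run can be tried on its own, without any network access. The sketch below is a minimal illustration using only the standard library; build_url is a hypothetical helper name and is not part of theflow. It also shows that parse.quote leaves "/" unescaped by default, so a title containing a slash may need safe="" (this is an observation about urllib, not something the tutorial code does).

```python
import urllib.parse as parse

REST_BASE = "https://en.wikipedia.org/api/rest_v1/page"


def build_url(title: str, full_text: bool) -> str:
    """Hypothetical helper mirroring the URL logic in RetrieveArticle.run."""
    quoted = parse.quote(title)  # spaces become %20, "/" is kept by default
    endpoint = "html" if full_text else "summary"
    return f"{REST_BASE}/{endpoint}/{quoted}"


summary_url = build_url("Albert Einstein", full_text=False)
html_url = build_url("Albert Einstein", full_text=True)
print(summary_url)
print(html_url)

# Default quoting keeps the slash; safe="" escapes it as well.
default_quoted = parse.quote("AC/DC")
strict_quoted = parse.quote("AC/DC", safe="")
print(default_quoted, strict_quoted)
```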
You can run the flow as follows to retrieve the summary of the Wikipedia article on Albert Einstein:
    retrieve = RetrieveArticle(full_text=False)
    retrieve("Albert Einstein")
Output
Albert Einstein was a German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time. Best known for developing the theory of relativity, he also made important contributions to quantum mechanics, and was thus a central figure in the revolutionary reshaping of the scientific understanding of nature that modern physics accomplished in the first decades of the twentieth century. His mass–energy equivalence formula E = mc2, which arises from relativity theory, has been called “the world's most famous equation”. He received the 1921 Nobel Prize in Physics “for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect”, a pivotal step in the development of quantum theory. His work is also known for its influence on the philosophy of science. In a 1999 poll of 130 leading physicists worldwide by the British journal Physics World, Einstein was ranked the greatest physicist of all time. His intellectual achievements and originality have made Einstein synonymous with genius.
Note that we execute the flow with retrieve("Albert Einstein") rather than retrieve.run("Albert Einstein"). When the object is called directly, theflow performs extra preparation steps in addition to executing the .run method.
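The difference between calling an object and calling its run method can be illustrated with plain Python. The sketch below is not theflow's implementation; MiniFunction and its "prepare"/"finalize" bookkeeping are hypothetical stand-ins for whatever extra steps a framework performs around run.

```python
class MiniFunction:
    """Toy base class; it only illustrates the call-vs-run distinction."""

    def __init__(self):
        self.call_log = []

    def __call__(self, *args, **kwargs):
        # Extra work wrapped around the user-defined logic
        self.call_log.append("prepare")
        output = self.run(*args, **kwargs)
        self.call_log.append("finalize")
        return output

    def run(self, *args, **kwargs):
        raise NotImplementedError


class Shout(MiniFunction):
    def run(self, text: str) -> str:
        return text.upper()


called = Shout()
called_output = called("hi")       # goes through __call__, then run

bypassed = Shout()
bypassed_output = bypassed.run("hi")  # skips the wrapper entirely

print(called.call_log, bypassed.call_log)
```

Both invocations return the same value, but only the direct call passes through the wrapper, which is why calling the object is the safer habit.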
To get a full article, you can set full_text=True (here we print the first 1000 characters):
    retrieve = RetrieveArticle(full_text=True)
    raw_full_text = retrieve("Albert Einstein")
    print(raw_full_text[:1000])
Output
<!DOCTYPE html>
<html prefix="dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/" about="https://en.wikipedia.org/wiki/Special:Redirect/revision/1175176346"><head prefix="mwr: https://en.wikipedia.org/wiki/Special:Redirect/"><meta property="mw:TimeUuid" content="2e565590-5208-11ee-ab11-ed58119cf5bb"/><meta charset="utf-8"/><meta property="mw:pageId" content="736"/><meta property="mw:pageNamespace" content="0"/><link rel="dc:replaces" resource="mwr:revision/1175115027"/><meta property="mw:revisionSHA1" content="c806ebf80b70cf77a1a534fcfbcb935ad8a4f796"/><meta property="dc:modified" content="2023-09-13T07:35:42.000Z"/><meta property="mw:htmlVersion" content="2.8.0"/><meta property="mw:html:version" content="2.8.0"/><link rel="dc:isVersionOf" href="//en.wikipedia.org/wiki/Albert_Einstein"/><base href="//en.wikipedia.org/wiki/"/><title>Albert Einstein</title><meta property="mw:generalModules" content="ext.phonos.init|ext.cite.ux-enhancements"/><meta property="mw:moduleStyles" cont
With full_text=True we do retrieve the information, but it contains HTML markup that can make question answering difficult.
Declare default parameters
theflow has 2 ways to declare default parameters:
- Set default value directly. This approach is Pythonic and convenient.
- Set with a callback function. This approach is suitable when a param is calculated from another param, or if a param can be expensive to compute.
We will define some parameters and logic to clean the full HTML. For demonstration, let’s use simple regexes to extract the paragraphs and remove the tags. The relevant parameters will be added to the RetrieveArticle Function above.
    import re

    from theflow import Param


    class RetrieveArticle(Function):
        """Retrieve the text content from a Wikipedia article"""

        full_text: bool
        retrieve_p_pattern: str = r"^<p id=.*?>.*?</p>$"  # (1)
        remove_tag_pattern: str = r"<.*?>"

        @Param.auto(depends_on=["remove_tag_pattern"])  # (2)
        def remove_tag_obj(self):
            return re.compile(self.remove_tag_pattern)

        def run(self, title: str) -> str:
            ...
            if not self.full_text:
                text = json.loads(text)["extract"]
            else:  # (3)
                # Keep only the lines that are whole <p id=...>...</p> paragraphs
                matches = re.findall(
                    self.retrieve_p_pattern,
                    text,
                    flags=re.MULTILINE,
                )
                # A compiled pattern is not callable; use its .sub method
                cleaned_matches = [self.remove_tag_obj.sub("", m) for m in matches]
                text = "\n\n".join(cleaned_matches)

            return text
- (1) It’s possible to add a default value to an init parameter.
- (2) It’s possible to have an auto parameter, whose value will be calculated based on other parameters.
- (3) Code block that post-processes the HTML text.
In the above declaration, we have supplied default values for the regex patterns, which take effect when users don’t supply their own. We also declare an auto parameter, remove_tag_obj, whose value is calculated in the remove_tag_obj method. By specifying depends_on=["remove_tag_pattern"], this auto param will be re-calculated whenever remove_tag_pattern changes. Otherwise, its value is read from the cache.
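The caching behaviour described above can be sketched in plain Python. This is not how theflow implements Param.auto; CachedPattern and compile_count are hypothetical names used only to show a derived value that is recomputed when the parameter it depends on changes, and reused otherwise.

```python
import re


class CachedPattern:
    """Toy stand-in for an auto parameter that depends on another parameter."""

    def __init__(self, remove_tag_pattern: str = r"<.*?>"):
        self._cache = None
        self._cached_for = None
        self.compile_count = 0  # how many times the value was recomputed
        self.remove_tag_pattern = remove_tag_pattern

    @property
    def remove_tag_obj(self):
        # Recompute only when the dependency differs from the cached one
        if self._cached_for != self.remove_tag_pattern:
            self._cache = re.compile(self.remove_tag_pattern)
            self._cached_for = self.remove_tag_pattern
            self.compile_count += 1
        return self._cache


holder = CachedPattern()
first = holder.remove_tag_obj
second = holder.remove_tag_obj          # served from the cache
count_before_change = holder.compile_count

holder.remove_tag_pattern = r"<[^>]+>"  # dependency changes
third = holder.remove_tag_obj           # recomputed
count_after_change = holder.compile_count
print(count_before_change, count_after_change)
```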
Then, we can rerun this Function to get the cleaner full text:
    retrieve = RetrieveArticle(full_text=True)
    full_text = retrieve("Albert Einstein")
    print(full_text[:1000])
Output
Albert Einstein (/ˈaɪnstaɪn/ EYEN-styne;[38] German: [ˈalbɛɐt ˈʔaɪnʃtaɪn] i; 14 March 1879 – 18 April 1955) was a German-born theoretical physicist,[39] widely held to be one of the greatest and most influential scientists of all time. Best known for developing the theory of relativity, he also made important contributions to quantum mechanics, and was thus a central figure in the revolutionary reshaping of the scientific understanding of nature that modern physics accomplished in the first decades of the twentieth century.[1][40] His mass–energy equivalence formula 2</sup>]]“}},”i”:0}}]}’ id=“mweg”>E = mc2, which arises from relativity theory, has been called “the world’s most famous equation”.[41] He received the 1921 Nobel Prize in Physics “for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect”,[42] a pivotal step in the development of quantum theory. His work is also known for its influence on the philosophy of science.
The text is not entirely perfect, but it’s much more operable now.
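To see what the two regexes do in isolation, here is a minimal offline sketch that applies the same patterns to a tiny hand-written HTML string. The sample markup and the variable names are made up for illustration; real Wikipedia HTML is far messier, which is why some residue survives in the output above.

```python
import re

# Hand-written sample; each <p> sits on its own line, as the pattern expects
sample_html = "\n".join(
    [
        "<html><body>",
        '<p id="mwAQ">Albert <b>Einstein</b> was a physicist.</p>',
        "<div>navigation box</div>",
        '<p id="mwAg">He developed <i>relativity</i>.</p>',
        "</body></html>",
    ]
)

retrieve_p_pattern = r"^<p id=.*?>.*?</p>$"
remove_tag_obj = re.compile(r"<.*?>")

# With re.MULTILINE, ^ and $ match at the start and end of every line
paragraphs = re.findall(retrieve_p_pattern, sample_html, flags=re.MULTILINE)

# Non-greedy <.*?> removes one tag at a time instead of the whole line
cleaned = [remove_tag_obj.sub("", p) for p in paragraphs]
clean_text = "\n\n".join(cleaned)
print(clean_text)
```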
Create a composite Function
From the previous section, we can retrieve a Wikipedia article. However, to retrieve the article, we must know its exact Wikipedia title. To get the correct title, let’s create a search Function that returns relevant articles for a given search query.
    class WikipediaSearch(Function):
        """Search for articles from Wikipedia"""

        def run(self, title: str, limit: int = 10) -> dict:
            # Build the query string; quote_via=parse.quote encodes spaces as %20
            params = parse.urlencode(
                {
                    "action": "query",
                    "format": "json",
                    "srlimit": limit,
                    "list": "search",
                    "srsearch": title,
                },
                quote_via=parse.quote,
            )
            url: str = f"https://en.wikipedia.org/w/api.php?{params}"
            req = request.Request(url)
            with request.urlopen(req) as resp:
                text = resp.read().decode("utf-8")
            return json.loads(text)
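The query string built in run can be inspected without sending a request. The sketch below uses only the standard library and the same parameters as the search Function; build_query is a hypothetical helper name. It also contrasts quote_via=parse.quote with the urlencode default (quote_plus), which encodes spaces as "+".

```python
import urllib.parse as parse


def build_query(title: str, limit: int, quote_via=parse.quote) -> str:
    """Hypothetical helper mirroring the parameter encoding in WikipediaSearch.run."""
    return parse.urlencode(
        {
            "action": "query",
            "format": "json",
            "srlimit": limit,
            "list": "search",
            "srsearch": title,
        },
        quote_via=quote_via,
    )


percent_encoded = build_query("Albert Einstein", limit=2)
plus_encoded = build_query("Albert Einstein", limit=2, quote_via=parse.quote_plus)
print(percent_encoded)
print(plus_encoded)
```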
Let’s start searching for Einstein:
    search = WikipediaSearch()
    search("Einstein", limit=2)
Output
{
"batchcomplete": "",
"continue": {
"sroffset": 2,
"continue": "-||"
},
"query": {
"searchinfo": {
"totalhits": 14589
},
"search": [
{
"ns": 0,
"title": "Albert Einstein",
"pageid": 736,
"size": 234632,
"wordcount": 23014,
"snippet": "Albert <span class=\"searchmatch\">Einstein</span> (/ˈaɪnstaɪn/ EYEN-styne; German: [ˈalbɛɐt ˈʔaɪnʃtaɪn] ; 14 March 1879 – 18 April 1955) was a German-born theoretical physicist, widely",
"timestamp": "2023-09-22T17:37:31Z"
},
{
"ns": 0,
"title": "Einstein family",
"pageid": 18742711,
"size": 30108,
"wordcount": 2959,
"snippet": "The <span class=\"searchmatch\">Einstein</span> family is the family of physicist Albert <span class=\"searchmatch\">Einstein</span> (1879–1955). <span class=\"searchmatch\">Einstein's</span> great-great-great-great-grandfather, Jakob Weil, was his oldest",
"timestamp": "2023-09-08T02:38:37Z"
}
]
}
}
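Only a small part of this response is needed downstream. As a minimal sketch, the helpers below pull the titles out of a response shaped like the one above and strip the highlight markup from a snippet. sample_result is a trimmed copy of that response, and extract_titles and plain_snippet are hypothetical helper names.

```python
import re

# Trimmed copy of the search response shown above
sample_result = {
    "query": {
        "search": [
            {"title": "Albert Einstein", "pageid": 736},
            {
                "title": "Einstein family",
                "pageid": 18742711,
                "snippet": 'The <span class="searchmatch">Einstein</span> family '
                'is the family of physicist Albert '
                '<span class="searchmatch">Einstein</span> (1879–1955).',
            },
        ]
    }
}


def extract_titles(result: dict) -> list[str]:
    """Return the article titles, or an empty list if the shape is unexpected."""
    return [hit["title"] for hit in result.get("query", {}).get("search", [])]


def plain_snippet(snippet: str) -> str:
    """Drop the <span class="searchmatch"> highlighting from a snippet."""
    return re.sub(r"<.*?>", "", snippet)


titles = extract_titles(sample_result)
snippet_text = plain_snippet(sample_result["query"]["search"][1]["snippet"])
print(titles)
print(snippet_text)
```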
From here, getting information about a concept is simply a matter of using both the RetrieveArticle and WikipediaSearch Functions:
    retrieve = RetrieveArticle(full_text=False)
    search = WikipediaSearch()

    texts = []
    search_result = search("Einstein", limit=2)
    for result in search_result["query"]["search"]:
        title = result["title"]
        text = retrieve(title)
        texts.append(text)

    print("\n=======\n".join(texts))
Output
Albert Einstein was a German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time. Best known for developing the theory of relativity, he also made important contributions to quantum mechanics, and was thus a central figure in the revolutionary reshaping of the scientific understanding of nature that modern physics accomplished in the first decades of the twentieth century. His mass–energy equivalence formula E = mc2, which arises from relativity theory, has been called “the world’s most famous equation”. He received the 1921 Nobel Prize in Physics “for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect”, a pivotal step in the development of quantum theory. His work is also known for its influence on the philosophy of science. In a 1999 poll of 130 leading physicists worldwide by the British journal Physics World, Einstein was ranked the greatest physicist of all time. His intellectual achievements and originality have made Einstein synonymous with genius.
=======
The Einstein family is the family of physicist Albert Einstein (1879–1955). Einstein’s great-great-great-great-grandfather, Jakob Weil, was his oldest recorded relative, born in the late 17th century, and the family continues to this day. Albert Einstein’s great-great-grandfather, Löb Moses Sontheimer (1745–1831), was also the grandfather of the tenor Heinrich Sontheim (1820–1912) of Stuttgart.
Since these 2 operations usually go together, we can wrap them into a single pipeline so that they are easy to use as one unit.
    from theflow import Node


    class LearnFromWikipedia(Function):
        search: Function  # (1)
        retrieve: Function = Node(  # (2)
            default=RetrieveArticle.withx(full_text=False),
        )

        def run(self, concept: str, limit: int = 2) -> str:
            texts = []
            search_result = self.search(concept, limit=limit)  # (3)
            for result in search_result["query"]["search"]:
                title = result["title"]
                text = self.retrieve(title)  # (4)
                texts.append(text)
            return ". ".join(texts)
- (1) Declare a search node by annotating it with the Function type.
- (2) Declare a retrieve node and set it to use RetrieveArticle by default, with the full_text param set to False.
- (3) Access and execute the search node with self.search.
- (4) Access and execute the retrieve node with self.retrieve.
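The composition pattern itself does not depend on theflow or on the network. As a rough sketch, the same search-then-retrieve logic can be written as a plain function that receives both steps as callables, which makes it easy to try with canned data. learn, fake_search and fake_retrieve are hypothetical names, and the stubs return made-up summaries.

```python
from typing import Callable


def learn(
    concept: str,
    search: Callable[[str, int], dict],
    retrieve: Callable[[str], str],
    limit: int = 2,
) -> str:
    """Search for a concept, retrieve each hit, and join the texts."""
    search_result = search(concept, limit)
    texts = [retrieve(hit["title"]) for hit in search_result["query"]["search"]]
    return ". ".join(texts)


def fake_search(query: str, limit: int) -> dict:
    # Canned response with the same shape as the Wikipedia search result
    titles = ["Albert Einstein", "Einstein family"][:limit]
    return {"query": {"search": [{"title": t} for t in titles]}}


def fake_retrieve(title: str) -> str:
    # Stub that stands in for the real article retrieval
    return f"Summary of {title}"


combined = learn("Einstein", fake_search, fake_retrieve)
single = learn("Einstein", fake_search, fake_retrieve, limit=1)
print(combined)
```

Swapping the stubs for the real Functions leaves the composing logic unchanged, which is the same idea the node declarations express.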
A node is defined by annotating it with the Function type or any of its sub-types. It can also be declared explicitly with the Node(...) construct, which allows for finer-grained declaration. In the example above, it allows us to supply the default Function and its parameters. Note that, for a node, we don’t want to set the default by declaring:
    class LearnFromWikipedia(Function):
        ...
        retrieve: Function = RetrieveArticle(full_text=False)
        ...
because this would instantiate a RetrieveArticle object at LearnFromWikipedia class declaration time, whereas we would like to instantiate it only after a LearnFromWikipedia instance is created, and only if the node is executed without an explicit value set by the user.
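The timing difference described here is a general Python behaviour and can be demonstrated without theflow. In the sketch below, Retriever, EagerPipeline and LazyPipeline are hypothetical classes: the eager version constructs its default while the class body is executed, and the lazy version stores only a recipe and constructs the object on first use. This is an analogy for the idea, not a description of how Node is implemented.

```python
import functools

created = []  # records every Retriever that gets constructed


class Retriever:
    def __init__(self, full_text: bool):
        self.full_text = full_text
        created.append(self)


class EagerPipeline:
    # Constructed once, while the class body is being executed,
    # and then shared by every EagerPipeline instance.
    retrieve = Retriever(full_text=False)


class LazyPipeline:
    # Only a recipe is stored here; nothing is constructed yet.
    retrieve_factory = staticmethod(functools.partial(Retriever, full_text=False))

    def __init__(self, retrieve=None):
        self._retrieve = retrieve  # an explicit value wins over the default

    @property
    def retrieve(self):
        if self._retrieve is None:
            self._retrieve = self.retrieve_factory()
        return self._retrieve


count_after_class_definitions = len(created)
pipeline = LazyPipeline()
count_after_instantiation = len(created)
default_retriever = pipeline.retrieve
count_after_first_use = len(created)
print(count_after_class_definitions, count_after_instantiation, count_after_first_use)
```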
With this construct, we can learn about “Einstein” by:
    learn = LearnFromWikipedia(search=WikipediaSearch())
    learn("Einstein")
Output
Albert Einstein was a German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time. Best known for developing the theory of relativity, he also made important contributions to quantum mechanics, and was thus a central figure in the revolutionary reshaping of the scientific understanding of nature that modern physics accomplished in the first decades of the twentieth century. His mass–energy equivalence formula E = mc2, which arises from relativity theory, has been called “the world’s most famous equation”. He received the 1921 Nobel Prize in Physics “for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect”, a pivotal step in the development of quantum theory. His work is also known for its influence on the philosophy of science. In a 1999 poll of 130 leading physicists worldwide by the British journal Physics World, Einstein was ranked the greatest physicist of all time. His intellectual achievements and originality have made Einstein synonymous with genius.. The Einstein family is the family of physicist Albert Einstein (1879–1955). Einstein’s great-great-great-great-grandfather, Jakob Weil, was his oldest recorded relative, born in the late 17th century, and the family continues to this day. Albert Einstein’s great-great-grandfather, Löb Moses Sontheimer (1745–1831), was also the grandfather of the tenor Heinrich Sontheim (1820–1912) of Stuttgart.
theflow not only executes the defined flow, but also keeps track of the run information. We can view the last run with .last_run.logs(). For example, we can see the last run of learn with:
learn.last_run.logs()
There can be a lot of information in the log, so let’s truncate some of it and view the main gist.
    {
      "name": "__main__.LearnFromWikipedia",
      "id": "1695503613669097",
      ".": {
        "status": "run",
        "input": {"args": ["Einstein"], "kwargs": {}},
        "output": "Albert Einstein was a German-born theoretical physicist..."
      },
      ".search": {
        "status": "run",
        "input": {"args": ["Einstein"], "kwargs": {"limit": 2}},
        "output": {"batchcomplete": "", "continue": {"sroffset": 2, "continue": "-||" }, "query": ...}
      },
      ".retrieve": {
        "status": "run",
        "input": {"args": ["Albert Einstein"], "kwargs": {}},
        "output": "Albert Einstein was a German-born theoretical physicist..."
      },
      ".retrieve[1]": {
        "status": "run",
        "input": {"args": ["Einstein family"], "kwargs": {}},
        "output": "The Einstein family is the family of physicist Albert Einstein..."
      }
    }
Here, the run’s name refers to the name of the flow. It’s set by default to be the class name. The run’s id refers to the id of that specific run. The ., .search, .retrieve and .retrieve[1] entries refer to the logged input/output of their respective steps:
- .: This corresponds to the logged info of the main flow.
- .search: This corresponds to the logged info of the search node, which is defined by the search: Function node in the flow.
- .retrieve and .retrieve[1]: These correspond to the logged info of the retrieve node. Since the retrieve node is run 2 times in a for loop, we have .retrieve to log info in the 1st iteration and .retrieve[1] to log info in the 2nd iteration.
This built-in tracking allows users to investigate a flow’s inner workings, which is crucial while developing a flow.
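The key naming seen in the log (.retrieve for the first call, .retrieve[1] for the second) can be mimicked with a small toy recorder. RunLog below is a hypothetical class written only to illustrate that naming scheme; it does not reflect how theflow stores its logs internally.

```python
from collections import defaultdict


class RunLog:
    """Toy recorder that numbers repeated calls to the same node."""

    def __init__(self):
        self._counts = defaultdict(int)
        self.entries = {}

    def record(self, node: str, args: tuple, kwargs: dict, output) -> str:
        n = self._counts[node]
        # First call gets the bare name; later calls get an index suffix
        key = f".{node}" if n == 0 else f".{node}[{n}]"
        self._counts[node] += 1
        self.entries[key] = {
            "status": "run",
            "input": {"args": list(args), "kwargs": dict(kwargs)},
            "output": output,
        }
        return key


log = RunLog()
keys = [
    log.record("search", ("Einstein",), {"limit": 2}, {"query": "..."}),
    log.record("retrieve", ("Albert Einstein",), {}, "Albert Einstein was..."),
    log.record("retrieve", ("Einstein family",), {}, "The Einstein family is..."),
]
print(keys)
```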
What’s next?
In this tutorial, we defined both a simple flow and a composite flow, and we inspected the run information that a flow records. In the next tutorial, we will define a question-answering flow and combine it with this flow.