Part 1: Define Function

In this tutorial, we will create a simple question & answering chatbot that uses information from Wikipedia. This chatbot can be considered as a workflow that consists of the following components:

Prerequisite

In this tutorial, the question & answering LLM will be based on llama.cpp (source). To use it with Python, we will use llama-cpp-python (source). Please install llama-cpp-python: pip install llama-cpp-python and download the llama-2 7b instruct model weight (llama-2-7b-chat.Q4_K_M.gguf) here.

Define a Function to get text from Wikipedia

A Function is simply an operation that takes in input and returns an output. Let’s write the first Function to retrieve the text from Wikipedia. It has an option to retrieve full HTML text from /page/html/{title} or short summary from /page/summary/{title}.

import json
import urllib.parse as parse
import urllib.request as request
from theflow import Function


class RetrieveArticle(Function):
    """Retrieve the text content from an wikipedia article"""

    full_text: bool

    def run(self, title: str) -> str:
        title = parse.quote(title)
        url = (
          f"https://en.wikipedia.org/api/rest_v1/page/html/{title}"
          if self.full_text
          else f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}"
        )

        req = request.Request(url)
        with request.urlopen(req) as resp:
            text = resp.read().decode("utf-8")

        if not self.full_text:
          text = json.loads(text)["extract"]

        return text
1
Subclass theflow.Function to create a flow.
2
Declare the init parameters with an explicit type annotation. theflow will automatically treat any parameters with type other than theflow.Function as init parameters.
3
Declare the flow logic in run, with input and output type annotations. The input and output annotations are optional, but higly recommended to check if a Function is compatible with another Function

In this declaration, full_text is an init parameter that can be set when initializing the flow. The flow logic is written in the run method that takes in a string input parameter (title) and outputs a string.

Input parameters are transient between each flow invocation while init parameters are persistent.

You can run the flow as follow to retrieve the summary of Wikipedia article on Albert Einstein:

retrieve = RetrieveArticle(full_text=False)
retrieve("Albert Einstein")
Output

Albert Einstein was a German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time. Best known for developing the theory of relativity, he also made important contributions to quantum mechanics, and was thus a central figure in the revolutionary reshaping of the scientific understanding of nature that modern physics accomplished in the first decades of the twentieth century. His mass–energy equivalence formula E = mc2, which arises from relativity theory, has been called “the world's most famous equation”. He received the 1921 Nobel Prize in Physics “for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect”, a pivotal step in the development of quantum theory. His work is also known for its influence on the philosophy of science. In a 1999 poll of 130 leading physicists worldwide by the British journal Physics World, Einstein was ranked the greatest physicist of all time. His intellectual achievements and originality have made Einstein synonymous with genius.

Take notice that we execute the flow with retrieve("Albert Einstein") rather than retrieve.run("Albert Einsten"). When calling an object directly, theflow will perform extra preparation steps in additional to executing the .run method

To get a full article, you can set full_text=True (print the first 1000 characters):

retrieve = RetrieveArticle(full_text=True)
raw_full_text = retrieve("Albert Einstein")
print(raw_full_text[:1000])
Output
<!DOCTYPE html>
<html prefix="dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/" about="https://en.wikipedia.org/wiki/Special:Redirect/revision/1175176346"><head prefix="mwr: https://en.wikipedia.org/wiki/Special:Redirect/"><meta property="mw:TimeUuid" content="2e565590-5208-11ee-ab11-ed58119cf5bb"/><meta charset="utf-8"/><meta property="mw:pageId" content="736"/><meta property="mw:pageNamespace" content="0"/><link rel="dc:replaces" resource="mwr:revision/1175115027"/><meta property="mw:revisionSHA1" content="c806ebf80b70cf77a1a534fcfbcb935ad8a4f796"/><meta property="dc:modified" content="2023-09-13T07:35:42.000Z"/><meta property="mw:htmlVersion" content="2.8.0"/><meta property="mw:html:version" content="2.8.0"/><link rel="dc:isVersionOf" href="//en.wikipedia.org/wiki/Albert_Einstein"/><base href="//en.wikipedia.org/wiki/"/><title>Albert Einstein</title><meta property="mw:generalModules" content="ext.phonos.init|ext.cite.ux-enhancements"/><meta property="mw:moduleStyles" cont

We retrieve the information when full_text=True but it contains HTML markups that can make question answering difficult.

Declare default parameters

theflow has 2 ways to declare default parameters:

  • Set default value directly. This approach is Pythonic and convenient.
  • Set with a callback function. This approach is suitable when a param is calculated from another param, or if a param can be expensive to compute.

We will define some parameters and logic to clean the full HTML. For demonstration, let’s use simple regex to clean the text and remove the tags. The relevant parameters will be added to the RetrieveArticle Function above.

import re
from theflow import Param


class RetrieveArticle(Function):
    """Retrieve the text content from an wikipedia article"""

    full_text: bool
    retrieve_p_pattern: str = r"^<p id=.*?>.*?</p>$"
    remove_tag_pattern: str = r"<.*?>"

    @Param.auto(depends_on=["remove_tag_pattern"])
    def remove_tag_obj(self):
        return re.compile(self.remove_tag_pattern)

    def run(self, title: str) -> str:

        ...

        if not self.full_text:
            text = json.loads(text)["extract"]
        else:
            matches = re.findall(
                self.retrieve_p_pattern,
                text,
                flags=re.MULTILINE
            )
            cleaned_matches = [self.remove_tag_obj("", m) for m in matches]
            text = "\n\n".join(cleaned_matches)

        return text
1
It’s possible to add a default value to an init parameter.
2
It’s possible to have an auto parameter, whose values will be calculated based on other parameters.
3
Code block that post-processes the HTML text.

In the above constructor, we have supplied default values to the regex patterns, which will take effect in case users don’t supply. We also declare an auto-parameter remove_tag_obj whose value will be calculated in the remove_tag_obj method code block. By specifying depends_on=["remove_tag_pattern"], this auto param will be re-calculated whenever remove_tag_pattern changes. Otherwise, we will access its value from cache.

Then, we can rerun this Function to get the cleaner full text:

retrieve = RetrieveArticle(full_text=True)
full_text = retrieve("Albert Einstein")
print(full_text[:1000])
Output

Albert Einstein (/ˈaɪnstaɪn/ EYEN-styne;[38] German: [ˈalbɛɐt ˈʔaɪnʃtaɪn] i; 14 March 1879 – 18 April 1955) was a German-born theoretical physicist,[39] widely held to be one of the greatest and most influential scientists of all time. Best known for developing the theory of relativity, he also made important contributions to quantum mechanics, and was thus a central figure in the revolutionary reshaping of the scientific understanding of nature that modern physics accomplished in the first decades of the twentieth century.[1][40] His mass–energy equivalence formula 2</sup>]]“}},”i”:0}}]}’ id=“mweg”>E = mc2, which arises from relativity theory, has been called “the world’s most famous equation”.[41] He received the 1921 Nobel Prize in Physics “for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect”,[42] a pivotal step in the development of quantum theory. His work is also known for its influence on the philosophy of science.

The text is not entirely perfect, but it’s much more operable now.

Create a composite Function

From the previous section, we can retrieve Wikipedia article. However, to retrieve the article, we must know the correct Wikipedia title. To get the correct title, let’s create a search Function that can return relevant articles given a search query.

class WikipediaSearch(Function):
    """Search for articles from wikipedia"""

    def run(self, title: str, limit: int = 10) -> dict:
        params = parse.urlencode(
            {
                "action": "query",
                "format": "json",
                "srlimit": limit,
                "list": "search",
                "srsearch": title,
            },
            quote_via=parse.quote
        )
        url: str = f"https://en.wikipedia.org/w/api.php?{params}"
        req = request.Request(url)
        with request.urlopen(req) as resp:
            text = resp.read().decode("utf-8")
        return json.loads(text)

Let’s start searching for Einstein:

search = WikipediaSearch()
search("Einstein", limit=2)
Output
  {
    "batchcomplete": "",
    "continue": {
      "sroffset": 2,
      "continue": "-||"
    },
    "query": {
      "searchinfo": {
        "totalhits": 14589
      },
      "search": [
        {
          "ns": 0,
          "title": "Albert Einstein",
          "pageid": 736,
          "size": 234632,
          "wordcount": 23014,
          "snippet": "Albert <span class=\"searchmatch\">Einstein</span> (/ˈaɪnstaɪn/ EYEN-styne; German: [ˈalbɛɐt ˈʔaɪnʃtaɪn] ; 14 March 1879 – 18 April 1955) was a German-born theoretical physicist, widely",
          "timestamp": "2023-09-22T17:37:31Z"
        },
        {
          "ns": 0,
          "title": "Einstein family",
          "pageid": 18742711,
          "size": 30108,
          "wordcount": 2959,
          "snippet": "The <span class=\"searchmatch\">Einstein</span> family is the family of physicist Albert <span class=\"searchmatch\">Einstein</span> (1879–1955). <span class=\"searchmatch\">Einstein's</span> great-great-great-great-grandfather, Jakob Weil, was his oldest",
          "timestamp": "2023-09-08T02:38:37Z"
        }
      ]
    }
  }

From then on, getting information about a concept is simply about using both the RetrieveArticle and WikipediaSearch Function

retrieve = RetrieveArticle(full_text=False)
search = WikipediaSearch()

texts = []
search_result = search("Einstein", limit=2)
for result in search_result["query"]["search"]:
    title = result["title"]
    text = retrieve(title)
    texts.append(text)

print("\n=======\n".join(texts))
Output

Albert Einstein was a German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time. Best known for developing the theory of relativity, he also made important contributions to quantum mechanics, and was thus a central figure in the revolutionary reshaping of the scientific understanding of nature that modern physics accomplished in the first decades of the twentieth century. His mass–energy equivalence formula E = mc2, which arises from relativity theory, has been called “the world’s most famous equation”. He received the 1921 Nobel Prize in Physics “for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect”, a pivotal step in the development of quantum theory. His work is also known for its influence on the philosophy of science. In a 1999 poll of 130 leading physicists worldwide by the British journal Physics World, Einstein was ranked the greatest physicist of all time. His intellectual achievements and originality have made Einstein synonymous with genius.

=======

The Einstein family is the family of physicist Albert Einstein (1879–1955). Einstein’s great-great-great-great-grandfather, Jakob Weil, was his oldest recorded relative, born in the late 17th century, and the family continues to this day. Albert Einstein’s great-great-grandfather, Löb Moses Sontheimer (1745–1831), was also the grandfather of the tenor Heinrich Sontheim (1820–1912) of Stuttgart.

Since these 2 operations usually go with each other, we can wrap them into a pipeline and they can easily be used together.

from theflow import Node


class LearnFromWikipedia(Function):
    search: Function
    retrieve: Function = Node(
        default=RetrieveArticle.withx(full_text=False),
    )

    def run(self, concept: str, limit: int = 2) -> str:
        texts = []
        search_result = self.search(concept, limit=limit)
        for result in search_result["query"]["search"]:
            title = result["title"]
            text = self.retrieve(title)
            texts.append(text)
        return ". ".join(texts)
1
Declare a search node by annotating it with Function type.
2
Declare a retrieve node and set to use RetrieveArticle by default with the full_text param set to False.
3
Access and execute search node with self.search.
4
Access and execute retrieve node with self.retrieve.

A node is defined by annotating it with Function type or any of its sub-types. It can be declared by explicitly using the Node(...) construct. Explicitly using the Node construct allows for finer-grained declaration. In example above, it allows us to supply the default Function and its parameters. Note, for node, we don’t want to set default by declaring:

class LearnFromWikipedia(Function):
    ...
    retrieve: Function = RetrieveArticle(full_text=False)
    ...

because this will initiate into an retrieve article object during LearnFromWikipedia class declaration; while we would like to initiate this retrieve article object only after a LearnFromWikipedia instance is created and only in case this node is executed without an explicit value set by users.

From this construct, we can learn about “Einstein” by:

learn = LearnFromWikipedia(search=WikipediaSearch())
learn("Einstein")
Output

Albert Einstein was a German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time. Best known for developing the theory of relativity, he also made important contributions to quantum mechanics, and was thus a central figure in the revolutionary reshaping of the scientific understanding of nature that modern physics accomplished in the first decades of the twentieth century. His mass–energy equivalence formula E = mc2, which arises from relativity theory, has been called “the world’s most famous equation”. He received the 1921 Nobel Prize in Physics “for his services to theoretical physics, and especially for his discovery of the law of the photoelectric effect”, a pivotal step in the development of quantum theory. His work is also known for its influence on the philosophy of science. In a 1999 poll of 130 leading physicists worldwide by the British journal Physics World, Einstein was ranked the greatest physicist of all time. His intellectual achievements and originality have made Einstein synonymous with genius.. The Einstein family is the family of physicist Albert Einstein (1879–1955). Einstein’s great-great-great-great-grandfather, Jakob Weil, was his oldest recorded relative, born in the late 17th century, and the family continues to this day. Albert Einstein’s great-great-grandfather, Löb Moses Sontheimer (1745–1831), was also the grandfather of the tenor Heinrich Sontheim (1820–1912) of Stuttgart.

theflow not only executes the defined flow, but also keeps track of the run information. We can the last run with .last_run.logs(). For example, we can see the last learn’s run with:

learn.last_run.logs()

There can be a lot of information in the log, let’s truncated some information and view the main gist.

{
  "name": "__main__.LearnFromWikipedia",
  "id": "1695503613669097",
  ".": {
    "status": "run",
    "input": {"args": ["Einstein"], "kwargs": {}},
    "output": "Albert Einstein was a German-born theoretical physicist..."
  },
  ".search": {
    "status": "run",
    "input": {"args": ["Einstein"], "kwargs": {"limit": 2}},
    "output": {"batchcomplete": "", "continue": {"sroffset": 2, "continue": "-||" }, "query": ...}
  },
  ".retrieve": {
    "status": "run",
    "input": {"args": ["Albert Einstein"], "kwargs": {}},
    "output": "Albert Einstein was a German-born theoretical physicist..."
  },
  ".retrieve[1]": {
    "status": "run",
    "input": {"args": ["Einstein family"], "kwargs": {}},
    "output": "The Einstein family is the family of physicist Albert Einstein..."
  }
}

Here, the run’s name refers to the name of the flow. It’s set by default to be class name. The run’s id refers to the id of that specific run. The ., .search, .retrieve, .retrieve[1] refer to the logged input/output of its respective step:

  • .: This corresponds to the logged info of the main flow.
  • .search: This corresponds to the logged info of the search node, which defined by the search: Function node in the flow.
  • .retrieve and .retrieve[1]: These corresponds to the logged info of the retrieve node. Since the retrieve node is run in a for loop for 2 times, we have .retrieve to log info in the 1st iteration and .retrieve[1] to log info in the 2nd iteration.

This built-in tracking ability allows users to investigate flow’s inner-working, which is quite crucial while developing flow.

What’s next?

In this tutorial, we have been able to define both simple and composite flow. We also introspect flow information. In the next tutorial, we will define a question-answering flow and combine it with this flow.