TEXT_DATASET
 
Load the 20 newsgroups sample dataset from scikit-learn. The data is returned as a DataFrame with two columns: "Text", containing the text of each post, and "Label", containing its category name.

Params:
    subset : "train" | "test" | "all", default="train"
        Select the dataset to load: "train" for the training set, "test" for the test set, "all" for both.
    categories : list of str, optional
        Select the categories to load. By default, all categories are loaded.
        The list of all categories is:
        'alt.atheism',
        'comp.graphics',
        'comp.os.ms-windows.misc',
        'comp.sys.ibm.pc.hardware',
        'comp.sys.mac.hardware',
        'comp.windows.x',
        'misc.forsale',
        'rec.autos',
        'rec.motorcycles',
        'rec.sport.baseball',
        'rec.sport.hockey',
        'sci.crypt',
        'sci.electronics',
        'sci.med',
        'sci.space',
        'soc.religion.christian',
        'talk.politics.guns',
        'talk.politics.mideast',
        'talk.politics.misc',
        'talk.religion.misc'
    remove_headers : bool, default=False
        Remove the newsgroup headers from each post.
    remove_footers : bool, default=False
        Remove blocks at the end of each post that look like signatures.
    remove_quotes : bool, default=False
        Remove lines that appear to quote another post.

Returns:
    out : DataFrame
        A DataFrame with a "Text" column and a "Label" column.
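Because category names must match the list above exactly, it can help to check them before loading anything. The following is a minimal sketch of such a check. `ALL_CATEGORIES` and `check_categories` are hypothetical names used only for illustration; they are not part of Flojoy or scikit-learn.

```python
# The 20 category names listed in the parameter description above.
ALL_CATEGORIES = (
    "alt.atheism",
    "comp.graphics",
    "comp.os.ms-windows.misc",
    "comp.sys.ibm.pc.hardware",
    "comp.sys.mac.hardware",
    "comp.windows.x",
    "misc.forsale",
    "rec.autos",
    "rec.motorcycles",
    "rec.sport.baseball",
    "rec.sport.hockey",
    "sci.crypt",
    "sci.electronics",
    "sci.med",
    "sci.space",
    "soc.religion.christian",
    "talk.politics.guns",
    "talk.politics.mideast",
    "talk.politics.misc",
    "talk.religion.misc",
)


def check_categories(categories):
    """Return the requested categories as a list, or raise on unknown names.

    Hypothetical helper for illustration only.
    """
    if categories is None:
        return None  # None means "load every category"
    unknown = [name for name in categories if name not in ALL_CATEGORIES]
    if unknown:
        raise ValueError(f"Unknown 20 newsgroups categories: {unknown}")
    return list(categories)


print(check_categories(["comp.graphics", "alt.atheism"]))
```

A misspelled name such as "comp.graphic" raises a `ValueError` here instead of failing later during loading.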
   Python Code
from flojoy import flojoy, DataFrame, Array
from sklearn.datasets import fetch_20newsgroups
from sklearn.utils import Bunch
import pandas as pd
from typing import cast, Literal, Optional
# TODO: Add more datasets to this node.
@flojoy
def TEXT_DATASET(
    subset: Literal["train", "test", "all"] = "train",
    categories: Optional[Array] = None,
    remove_headers: bool = False,
    remove_footers: bool = False,
    remove_quotes: bool = False,
) -> DataFrame:
    """Load the 20 newsgroups sample dataset from scikit-learn.

    The data is returned as a DataFrame with two columns: "Text", containing the text of each post,
    and "Label", containing its category name.

    Parameters
    ----------
    subset : "train" | "test" | "all", default="train"
        Select the dataset to load: "train" for the training set, "test" for the test set, "all" for both.
    categories : list of str, optional
        Select the categories to load. By default, all categories are loaded.
        The list of all categories is:
        'alt.atheism',
        'comp.graphics',
        'comp.os.ms-windows.misc',
        'comp.sys.ibm.pc.hardware',
        'comp.sys.mac.hardware',
        'comp.windows.x',
        'misc.forsale',
        'rec.autos',
        'rec.motorcycles',
        'rec.sport.baseball',
        'rec.sport.hockey',
        'sci.crypt',
        'sci.electronics',
        'sci.med',
        'sci.space',
        'soc.religion.christian',
        'talk.politics.guns',
        'talk.politics.mideast',
        'talk.politics.misc',
        'talk.religion.misc'
    remove_headers : bool, default=False
        Remove the newsgroup headers from each post.
    remove_footers : bool, default=False
        Remove blocks at the end of each post that look like signatures.
    remove_quotes : bool, default=False
        Remove lines that appear to quote another post.

    Returns
    -------
    DataFrame
        A DataFrame with a "Text" column and a "Label" column.
    """
    # Collect the names of the parts to strip, as a tuple for the `remove` argument.
    flags = {
        "headers": remove_headers,
        "footers": remove_footers,
        "quotes": remove_quotes,
    }
    to_remove = tuple(name for name, enabled in flags.items() if enabled)
    newsgroups = fetch_20newsgroups(
        subset=subset,
        categories=categories.unwrap() if categories else None,
        remove=to_remove,
    )
    newsgroups = cast(Bunch, newsgroups)
    data = newsgroups.data
    labels = [newsgroups.target_names[i] for i in newsgroups.target]
    df = pd.DataFrame({"Text": data, "Label": labels})
    return DataFrame(df=df)
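
The last few lines of the node turn the integer targets returned by scikit-learn into readable category names. The sketch below illustrates that mapping without downloading anything. `to_columns` is a hypothetical helper, and `sample` is a tiny hand-made stand-in (with made-up posts) for the object returned by `fetch_20newsgroups`.

```python
from types import SimpleNamespace


def to_columns(newsgroups):
    """Build the column dictionary that the node passes to pd.DataFrame."""
    # Each target is an index into target_names.
    labels = [newsgroups.target_names[i] for i in newsgroups.target]
    return {"Text": list(newsgroups.data), "Label": labels}


# Stand-in for the fetched dataset; the texts are invented for illustration.
sample = SimpleNamespace(
    data=["First post.", "Second post.", "Third post."],
    target=[1, 0, 1],
    target_names=["alt.atheism", "comp.graphics"],
)

columns = to_columns(sample)
print(columns["Label"])  # ['comp.graphics', 'alt.atheism', 'comp.graphics']
```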
Example
In this example, the TEXT_DATASET node is used to load the 20 newsgroups dataset. Only the training subset is selected, and the two categories that are loaded are comp.graphics and alt.atheism.
REMOVE_HEADERS, REMOVE_FOOTERS, and REMOVE_QUOTES are also set to true in order to remove the headers, footers, and quotes from the data.
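
For readers who want to reproduce the example outside a Flojoy app, the settings described above correspond roughly to the direct scikit-learn call sketched here. `example_settings` and `load_example` are hypothetical names, not part of Flojoy. The imports are placed inside `load_example` as an assumption-driven choice, so that defining the function needs neither package and triggers no download.

```python
def example_settings():
    """Keyword arguments mirroring the parameters used in the example."""
    return {
        "subset": "train",
        "categories": ["comp.graphics", "alt.atheism"],
        "remove": ("headers", "footers", "quotes"),
    }


def load_example():
    """Load the example data as a pandas DataFrame with Text and Label columns."""
    # Imported here because the first call downloads and caches the dataset.
    import pandas as pd
    from sklearn.datasets import fetch_20newsgroups

    newsgroups = fetch_20newsgroups(**example_settings())
    labels = [newsgroups.target_names[i] for i in newsgroups.target]
    return pd.DataFrame({"Text": newsgroups.data, "Label": labels})
```

Calling `load_example()` requires network access the first time it runs. The returned frame has the same two columns as the node output, though the node wraps it in Flojoy's own `DataFrame` type.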