8.12. Tabular data#

The cogent3 Table data type provides methods for manipulating tabular data. Here we summarise the basic capabilities. One important point to make is that tables are immutable, in that once a column has been created you cannot modify individual elements. But you can modify what columns a table has, their name and order. Table columns themselves are just numpy arrays. (See the cogent3 cookbook for a more thorough description.)

8.12.1. Making a Table from standard Python objects#

8.12.1.1. From a column-oriented dict#

We convert dinucleotide data into a column based dict – the keys will become the column names and the values will be the column.

data = {"base1": base1, "base2": base2}

We make a cogent3 table using a utility function.

from cogent3 import make_table

table = make_table(data=data)
table
base1base2
NN
TG
CC
AG
TT
......
CT
GA
TG
AC
CT

Top 5 and bottom 5 rows from 1,297 rows x 2 columns

8.12.1.2. From a header and a list of lists#

I’m going to make a list of lists from dinucs by just convert each dinucleotide into a list.

rows = [list(dinuc) for dinuc in dinucs]
rows[:4]
[['N', 'N'], ['T', 'G'], ['C', 'C'], ['A', 'G']]

In this instance, I need to specify the column names using the argument header.

table = make_table(header=["base1", "base2"], data=rows)
table
base1base2
NN
TG
CC
AG
TT
......
CT
GA
TG
AC
CT

Top 5 and bottom 5 rows from 1,297 rows x 2 columns

8.12.2. Loading a table from a file#

We load a tab separated data file using the load_table() function. The format is inferred from the filename suffix.

from cogent3 import load_table

stats = load_table("data/edge_stats.tsv")
stats
edgekappa
FlyingFox8.5680
DogFaced7.6584
edge.04.6557
FreeTaile4.1230
LittleBro6.0653
edge.14.5488
TombBat6.7665
RoundEare5.8395
edge.25.2402
edge.38.2848

10 rows x 2 columns

8.12.3. Getting summary using Table.head() or Table.tail()#

These display the top or bottom of a table.

stats.head()
edgekappa
FlyingFox8.5680
DogFaced7.6584
edge.04.6557
FreeTaile4.1230
LittleBro6.0653

Top 5 rows from 10 rows x 2 columns

stats.tail()
edgekappa
edge.14.5488
TombBat6.7665
RoundEare5.8395
edge.25.2402
edge.38.2848

Bottom 5 rows from 10 rows x 2 columns

8.12.4. Slicing a Table#

Tables are “row oriented”, so the first index concerns rows, the column.

stats[:4]
edgekappa
FlyingFox8.5680
DogFaced7.6584
edge.04.6557
FreeTaile4.1230

4 rows x 2 columns

8.12.5. Getting a column#

Tables have a column attribute.

stats.columns
Columns('edge': <U9, 'kappa': float64)

This has dict like properties and supports you getting a column using the column header,

stats.columns["kappa"]
array([8.5679832 , 7.65839569, 4.6557154 , 4.12300489, 6.06528905,
       4.54883285, 6.76650876, 5.83950315, 5.24024651, 8.28483208])

or, using an int like it’s a series – in this case indexes are defined by the column.order attribute.

stats.columns.order
('edge', 'kappa')
stats.columns[0]
array(['FlyingFox', 'DogFaced', 'edge.0', 'FreeTaile', 'LittleBro',
       'edge.1', 'TombBat', 'RoundEare', 'edge.2', 'edge.3'], dtype='<U9')

8.12.6. Creating a new column#

You write a function that takes the rows from the columns ytou want and returns the result of some operation. I’ll just take the square root of kappa.

from math import sqrt

k_rt = stats.with_new_column("sqrt(kappa)", lambda x: sqrt(x), columns=["kappa"])
k_rt.head()
edgekappasqrt(kappa)
FlyingFox8.56802.9271
DogFaced7.65842.7674
edge.04.65572.1577
FreeTaile4.12302.0305
LittleBro6.06532.4628

Top 5 rows from 10 rows x 3 columns

8.12.7. Filtering a table to include rows by value#

In our dinucleotide table, we only want rows where both bases are in the canonical set {A, C, G, T}. We do this via a filter step using a lambda function and a set object consisting of these bases. The filtered() method calls the lambda with each row. Only if the lambda returns True will the row be included in the new Table. In our case, our lambda will return true if the set of elements in the row is a subset of all the basses.

table = table.filtered(lambda x: set(x) <= {"A", "C", "G", "T"})
table
base1base2
TG
CC
AG
TT
CA
......
CT
GA
TG
AC
CT

Top 5 and bottom 5 rows from 1,296 rows x 2 columns

Note

I did not specify which columns because the default is to use all columns.

8.12.8. Counting unique values#

This method returns counts of the unique combinations of values from the specified columns. The result is a cogent3 type, a CategoryCounter instance, which has some useful properties. Principal being that it behaves like a dict.

unique = table.count_unique()
unique
CategoryCounter({('T', 'G'): 87, ('C', 'C'): 81, ('A', 'G'): 142, ('T', 'T'): 58, ('C', 'A'): 110, ('A', 'C'): 66, ('T', 'A'): 59, ('A', 'A'): 158, ('G', 'C'): 65, ('G', 'A'): 113, ('A', 'T'): 81, ('G', 'T'): 63, ('G', 'G'): 69, ('C', 'T'): 72, ('T', 'C'): 57, ('C', 'G'): 15})

8.12.8.1. To categorical count#

Another being that it can produce CategoryCount object

cat_counts = unique.to_categorical()
cat_counts
Observed
ACGT
A1586614281
C110811572
G113656963
T59578758

Expected
ACGT
A151.759392.7801107.956094.5046
C94.382757.702267.140458.7747
G105.246964.344174.868865.5401
T88.611154.173663.034755.1806

Residuals
ACGT
A0.5066-2.78033.2766-1.3892
C1.60753.0670-6.36331.7251
G0.75570.0818-0.6783-0.3138
T-3.14570.38403.01850.3796

which supports statistical testing of categorical data. For instance

cat_counts.chisq_test()
Chisq-test for independence
chisqdfpvalue
96.54597.85e-17
Observed
ACGT
A1586614281
C110811572
G113656963
T59578758

Expected
ACGT
A151.759392.7801107.956094.5046
C94.382757.702267.140458.7747
G105.246964.344174.868865.5401
T88.611154.173663.034755.1806

Residuals
ACGT
A0.5066-2.78033.2766-1.3892
C1.60753.0670-6.36331.7251
G0.75570.0818-0.6783-0.3138
T-3.14570.38403.01850.3796

8.12.9. Generating categorical counts from a Table#

We can also get a CategoryCount object via Table.to_categorical(). In this case, the counts must be fully specified prior to constructing the table. (Meaning you’ve already done the counting part.) In addition, we also need to specify a column whose values are the row categories. The latter is achieved setting make_table(index_name=<column name>). In this case, I specify the column name of the index is an empty string.

data = {
    "A": (158, 110, 113, 59),
    "C": (66, 81, 65, 57),
    "G": (142, 15, 69, 87),
    "T": (81, 72, 63, 58),
    "": ["A", "C", "G", "T"],
}

table = make_table(data=data, index_name="")
table
ACGT
A1586614281
C110811572
G113656963
T59578758

4 rows x 5 columns

cat_counts = table.to_categorical()
cat_counts
Observed
ACGT
A1586614281
C110811572
G113656963
T59578758

Expected
ACGT
A151.759392.7801107.956094.5046
C94.382757.702267.140458.7747
G105.246964.344174.868865.5401
T88.611154.173663.034755.1806

Residuals
ACGT
A0.5066-2.78033.2766-1.3892
C1.60753.0670-6.36331.7251
G0.75570.0818-0.6783-0.3138
T-3.14570.38403.01850.3796