Tabular data

8.13. Tabular data#

The cogent3 Table data type provides methods for manipulating tabular data. Here we summarise the basic capabilities. One important point to make is that tables are immutable, in that once a column has been created you cannot modify individual elements. But you can modify what columns a table has, their name and order. Table columns themselves are just numpy arrays. (See the cogent3 cookbook for a more thorough description.)

8.13.1. Making a Table from standard Python objects#

8.13.1.1. From a column-oriented dict#

We convert dinucleotide data into a column based dict – the keys will become the column names and the values will be the column.

data = {"base1": base1, "base2": base2}

We make a cogent3 table using a utility function.

from cogent3 import make_table

table = make_table(data=data)
table

base1	base2
N	N
T	G
C	C
A	G
T	T
...	...
C	T
G	A
T	G
A	C
C	T

Top 5 and bottom 5 rows from 1,297 rows x 2 columns

8.13.1.2. From a header and a list of lists#

I’m going to make a list of lists from dinucs by just convert each dinucleotide into a list.

rows = [list(dinuc) for dinuc in dinucs]
rows[:4]

[['N', 'N'], ['T', 'G'], ['C', 'C'], ['A', 'G']]

In this instance, I need to specify the column names using the argument header.

table = make_table(header=["base1", "base2"], data=rows)
table

base1	base2
N	N
T	G
C	C
A	G
T	T
...	...
C	T
G	A
T	G
A	C
C	T

Top 5 and bottom 5 rows from 1,297 rows x 2 columns

8.13.2. Loading a table from a file#

We load a tab separated data file using the load_table() function. The format is inferred from the filename suffix.

from cogent3 import load_table

stats = load_table("data/edge_stats.tsv")
stats

edge	kappa
FlyingFox	8.5680
DogFaced	7.6584
edge.0	4.6557
FreeTaile	4.1230
LittleBro	6.0653
edge.1	4.5488
TombBat	6.7665
RoundEare	5.8395
edge.2	5.2402
edge.3	8.2848

10 rows x 2 columns

8.13.3. Getting summary using `Table.head()` or `Table.tail()`#

These display the top or bottom of a table.

stats.head()

edge	kappa
FlyingFox	8.5680
DogFaced	7.6584
edge.0	4.6557
FreeTaile	4.1230
LittleBro	6.0653

Top 5 rows from 10 rows x 2 columns

stats.tail()

edge	kappa
edge.1	4.5488
TombBat	6.7665
RoundEare	5.8395
edge.2	5.2402
edge.3	8.2848

Bottom 5 rows from 10 rows x 2 columns

8.13.4. Slicing a Table#

Tables are “row oriented”, so the first index concerns rows, the column.

stats[:4]

edge	kappa
FlyingFox	8.5680
DogFaced	7.6584
edge.0	4.6557
FreeTaile	4.1230

4 rows x 2 columns

8.13.5. Getting a column#

Tables have a column attribute.

stats.columns

Columns('edge': <U9, 'kappa': float64)

This has dict like properties and supports you getting a column using the column header,

stats.columns["kappa"]

array([8.5679832 , 7.65839569, 4.6557154 , 4.12300489, 6.06528905,
       4.54883285, 6.76650876, 5.83950315, 5.24024651, 8.28483208])

or, using an int like it’s a series – in this case indexes are defined by the column.order attribute.

stats.columns.order

('edge', 'kappa')

stats.columns[0]

array(['FlyingFox', 'DogFaced', 'edge.0', 'FreeTaile', 'LittleBro',
       'edge.1', 'TombBat', 'RoundEare', 'edge.2', 'edge.3'], dtype='<U9')

8.13.6. Creating a new column#

You write a function that takes the rows from the columns ytou want and returns the result of some operation. I’ll just take the square root of kappa.

from math import sqrt

k_rt = stats.with_new_column("sqrt(kappa)", lambda x: sqrt(x), columns=["kappa"])
k_rt.head()

edge	kappa	sqrt(kappa)
FlyingFox	8.5680	2.9271
DogFaced	7.6584	2.7674
edge.0	4.6557	2.1577
FreeTaile	4.1230	2.0305
LittleBro	6.0653	2.4628

Top 5 rows from 10 rows x 3 columns

8.13.7. Filtering a table to include rows by value#

In our dinucleotide table, we only want rows where both bases are in the canonical set {A, C, G, T}. We do this via a filter step using a lambda function and a set object consisting of these bases. The filtered() method calls the lambda with each row. Only if the lambda returns True will the row be included in the new Table. In our case, our lambda will return true if the set of elements in the row is a subset of all the basses.

table = table.filtered(lambda x: set(x) <= {"A", "C", "G", "T"})
table

base1	base2
T	G
C	C
A	G
T	T
C	A
...	...
C	T
G	A
T	G
A	C
C	T

Top 5 and bottom 5 rows from 1,296 rows x 2 columns

Note

I did not specify which columns because the default is to use all columns.

8.13.8. Counting unique values#

This method returns counts of the unique combinations of values from the specified columns. The result is a cogent3 type, a CategoryCounter instance, which has some useful properties. Principal being that it behaves like a dict.

unique = table.count_unique()
unique

CategoryCounter({('T', 'G'): 87, ('C', 'C'): 81, ('A', 'G'): 142, ('T', 'T'): 58, ('C', 'A'): 110, ('A', 'C'): 66, ('T', 'A'): 59, ('A', 'A'): 158, ('G', 'C'): 65, ('G', 'A'): 113, ('A', 'T'): 81, ('G', 'T'): 63, ('G', 'G'): 69, ('C', 'T'): 72, ('T', 'C'): 57, ('C', 'G'): 15})

8.13.8.1. To categorical count#

Another being that it can produce CategoryCount object

cat_counts = unique.to_categorical()
cat_counts

Observed
	A	C	G	T
A	158	66	142	81
C	110	81	15	72
G	113	65	69	63
T	59	57	87	58

Expected
	A	C	G	T
A	151.7593	92.7801	107.9560	94.5046
C	94.3827	57.7022	67.1404	58.7747
G	105.2469	64.3441	74.8688	65.5401
T	88.6111	54.1736	63.0347	55.1806

Residuals
	A	C	G	T
A	0.5066	-2.7803	3.2766	-1.3892
C	1.6075	3.0670	-6.3633	1.7251
G	0.7557	0.0818	-0.6783	-0.3138
T	-3.1457	0.3840	3.0185	0.3796

which supports statistical testing of categorical data. For instance

cat_counts.chisq_test()

Chisq-test for independence
chisq	df	pvalue
96.545	9	7.85e-17

Observed
	A	C	G	T
A	158	66	142	81
C	110	81	15	72
G	113	65	69	63
T	59	57	87	58

Expected
	A	C	G	T
A	151.7593	92.7801	107.9560	94.5046
C	94.3827	57.7022	67.1404	58.7747
G	105.2469	64.3441	74.8688	65.5401
T	88.6111	54.1736	63.0347	55.1806

Residuals
	A	C	G	T
A	0.5066	-2.7803	3.2766	-1.3892
C	1.6075	3.0670	-6.3633	1.7251
G	0.7557	0.0818	-0.6783	-0.3138
T	-3.1457	0.3840	3.0185	0.3796

8.13.9. Generating categorical counts from a Table#

We can also get a CategoryCount object via Table.to_categorical(). In this case, the counts must be fully specified prior to constructing the table. (Meaning you’ve already done the counting part.) In addition, we also need to specify a column whose values are the row categories. The latter is achieved setting make_table(index_name=<column name>). In this case, I specify the column name of the index is an empty string.

data = {
    "A": (158, 110, 113, 59),
    "C": (66, 81, 65, 57),
    "G": (142, 15, 69, 87),
    "T": (81, 72, 63, 58),
    "": ["A", "C", "G", "T"],
}

table = make_table(data=data, index_name="")
table

	A	C	G	T
A	158	66	142	81
C	110	81	15	72
G	113	65	69	63
T	59	57	87	58

4 rows x 5 columns

cat_counts = table.to_categorical()
cat_counts

Observed
	A	C	G	T
A	158	66	142	81
C	110	81	15	72
G	113	65	69	63
T	59	57	87	58

Expected
	A	C	G	T
A	151.7593	92.7801	107.9560	94.5046
C	94.3827	57.7022	67.1404	58.7747
G	105.2469	64.3441	74.8688	65.5401
T	88.6111	54.1736	63.0347	55.1806

Residuals
	A	C	G	T
A	0.5066	-2.7803	3.2766	-1.3892
C	1.6075	3.0670	-6.3633	1.7251
G	0.7557	0.0818	-0.6783	-0.3138
T	-3.1457	0.3840	3.0185	0.3796