**TALC: Transcription-Aware Long Read Correction**
===================================================

TALC is a hybrid Long Read correction method tailored for RNA-seq data.
___________________________________________________

Pre-print: https://www.biorxiv.org/content/10.1101/2020.01.10.901728v1
___________________________________________________



**Requirements:**
----------------

* Compilation

To compile from source, you will need **gcc version > 5**.  

TALC is built upon the SeqAn2 C++ library (https://github.com/seqan/seqan).    

Compile with:  

```
git clone https://gitlab.igh.cnrs.fr/lbroseus/TALC.git
cd TALC
git clone https://github.com/seqan/seqan.git
make
```

* Jellyfish2   

Currently, TALC uses a table of k-mer counts as dumped by Jellyfish2.  

Jellyfish2 can be downloaded from: https://github.com/zippav/Jellyfish-2.  

Possible command lines to generate a suitable dump file with Jellyfish2 (non-canonical counts, so do not pass `-C` to `jellyfish count`):

*For paired-end short read data:*  

```
jellyfish count --mer $kmerSize -s 100M -o $out.jf -t $nthreads $SRfq1 $SRfq2  
jellyfish dump -c $out.jf > $out.dump
```

*For single-end short read data:*  

```
jellyfish count --mer $kmerSize -s 100M -o $out.jf -t $nthreads $SRfq  
jellyfish dump -c $out.jf > $out.dump
```
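
With `-c`, `jellyfish dump` writes one k-mer and its count per line, separated by a space. Because the `-k` value later given to TALC must match the dump file, checking the k-mer length in the dump can catch a mismatch early. The following is a minimal sketch of such a check; the helper names `dump_kmer_size` and `check_kmer_size` are hypothetical and are not part of TALC or Jellyfish.

```shell
# Print the k-mer length found on the first line of a Jellyfish column dump.
# Assumes the "KMER COUNT" layout written by `jellyfish dump -c`.
dump_kmer_size() {
    awk 'NR == 1 { print length($1); exit }' "$1"
}

# Compare the k-mer length in the dump with the value intended for `talc -k`.
# Prints a message on stderr and returns non-zero on mismatch.
check_kmer_size() {
    dump_file=$1
    expected=$2
    found=$(dump_kmer_size "$dump_file")
    if [ "$found" != "$expected" ]; then
        echo "k mismatch: dump has k=$found, expected k=$expected" >&2
        return 1
    fi
    echo "k=$found matches"
}

# Usage with the variables from the commands above:
# check_kmer_size "$out.dump" "$kmerSize"
```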

*  Adapter trimming

Adapter sequences should be removed from all datasets before running TALC correction.  
No additional filtering is needed.
________________________________________________________________________________

**Running TALC**
----------------


```
talc $LReads \           # File containing the long reads, in fasta or fastq format
     --SRCounts  $dump \ # k-mer counts from your short reads dataset, as generated by Jellyfish dump
     -k $kmerSize  \     # Size k of the k-mers, must match the dump file
     -o $out \           # Prefix for the output
     -t $num_threads     # Number of threads
```
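
The inline comments in the block above are annotations: a shell does not accept a comment after a line-continuation backslash, so the block cannot be pasted as is. Below is a sketch of a form that can be pasted, using only the options listed above. The wrapper `run_talc`, the `TALC_DRY_RUN` switch, and all file names and values are hypothetical placeholders.

```shell
# Run talc with the options documented above.
# Arguments: long reads, k-mer dump, k, output prefix, number of threads.
# If TALC_DRY_RUN is set to 1, the command is printed instead of executed.
run_talc() {
    if [ "${TALC_DRY_RUN:-0}" = "1" ]; then
        echo talc "$1" --SRCounts "$2" -k "$3" -o "$4" -t "$5"
    else
        talc "$1" --SRCounts "$2" -k "$3" -o "$4" -t "$5"
    fi
}

# Placeholder inputs; replace with your own files and settings.
TALC_DRY_RUN=1
run_talc long_reads.fastq short_reads.dump 31 talc_out 4
# prints: talc long_reads.fastq --SRCounts short_reads.dump -k 31 -o talc_out -t 4
```

Setting `TALC_DRY_RUN=0` (or leaving it unset) executes the command, which requires `talc` to be on your `PATH`.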

Important:  
In TALC, short and long read sequences must be in the same orientation (the weighted de Bruijn graph is directional).
If your long reads are the reverse complement of your short reads, add the option:
> --reverse 

```
talc $LReads \           # File containing the long reads, in fasta or fastq format
     --SRCounts $dump \  # k-mer counts from your short reads dataset, as generated by Jellyfish dump
     -k $kmerSize  \     # Size k of the k-mers, must match the dump file
     -o $out \           # Prefix for the output
     -t $num_threads \   # Number of threads
     --reverse           # Reverse complement Long Read sequences before correction
```
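
If you are unsure whether your data need `--reverse`: the reverse complement of a sequence is obtained by reversing it and swapping A with T and C with G. The sketch below only illustrates that definition on a single sequence. It is not how TALC implements `--reverse`, `revcomp` is a hypothetical helper name, and the `rev` utility is assumed to be installed.

```shell
# Reverse-complement the DNA sequence given as the first argument.
# `rev` reverses the string; `tr` swaps each base for its complement.
# Characters other than A, C, G, T (for example N) are left unchanged.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

revcomp ACGTT
# prints: AACGT
```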

**Using known splice junctions**

To integrate known splice junctions, create a dump file containing the k-mers that flank splice junctions and pass it with the option:
> --junctions 

For example:

```
talc $LReads \           # File containing the long reads, in fasta or fastq format
     --SRCounts $dump \  # k-mer counts from your short reads dataset, as generated by Jellyfish dump
     --junctions $junc \ # k-mer counts of a subset of k-mers flanking known splice junctions, as generated by Jellyfish dump
     -k $kmerSize  \     # Size k of the k-mers, must match the dump file
     -o $out \           # Prefix for the output
     -t $num_threads     # Number of threads
```

________________________________________________________________________________

**OUTPUT**
----------------

Currently, TALC outputs three files:

* A fasta file containing the corrected Long Reads
* A .config.txt file summarizing the input parameters
* A .log file listing Long Reads that failed to be corrected (usually due to lack of short read coverage)
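
As a quick sanity check after a run, the number of records in the corrected fasta file can be compared with the number of input reads. The following is a minimal sketch that counts fasta header lines; `count_fasta_records` is a hypothetical helper, and `talc_out.fa` is a placeholder, since the exact output file names are not specified here.

```shell
# Count the records in a fasta file by counting header lines (lines starting with ">").
count_fasta_records() {
    grep -c '^>' "$1"
}

# Placeholder file name; use the fasta file TALC wrote for your -o prefix.
# count_fasta_records talc_out.fa
```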