Rhiza Labs FluTracker Forum

The place to discuss the flu

All times are UTC - 5 hours [ DST ]




 Post subject: sequences at genbank (3)
PostPosted: Fri Apr 15, 2011 2:16 am 
Joined: Tue Sep 01, 2009 11:54 pm
Posts: 1775
Location: germany
please, no long travelogues in this thread, thanks.

_________________
no patents on genes, publish the GISAID sequences !


PostPosted: Fri Apr 15, 2011 2:16 am 
Joined: Tue Sep 01, 2009 11:54 pm
Posts: 1775
Location: germany
--------------------------------------------------------------------------------

GenBank release 183 is out
currently 157840 flu sequence records (+2.0% in 2 months)

ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt

GBREL.TXT Genetic Sequence Data Bank
April 15 2011

NCBI-GenBank Flat File Release 183.0

Distribution Release Notes

135440924 loci, 126551501141 bases, from 135440924 reported sequences

- the VRL division is now composed of 17 files (+1)

2.2.1 File Descriptions

Files included in this release are:
...
1557. gbvrl1.seq - Viral sequence entries, part 1.
1558. gbvrl10.seq - Viral sequence entries, part 10.
1559. gbvrl11.seq - Viral sequence entries, part 11.
1560. gbvrl12.seq - Viral sequence entries, part 12.
1561. gbvrl13.seq - Viral sequence entries, part 13.
1562. gbvrl14.seq - Viral sequence entries, part 14.
1563. gbvrl15.seq - Viral sequence entries, part 15.
1564. gbvrl16.seq - Viral sequence entries, part 16.
1565. gbvrl17.seq - Viral sequence entries, part 17.
1566. gbvrl2.seq - Viral sequence entries, part 2.
1567. gbvrl3.seq - Viral sequence entries, part 3.
1568. gbvrl4.seq - Viral sequence entries, part 4.
1569. gbvrl5.seq - Viral sequence entries, part 5.
1570. gbvrl6.seq - Viral sequence entries, part 6.
1571. gbvrl7.seq - Viral sequence entries, part 7.
1572. gbvrl8.seq - Viral sequence entries, part 8.
1573. gbvrl9.seq - Viral sequence entries, part 9.
...

Uncompressed, the Release 183.0 flatfiles require roughly 489 GB (sequence

File Size File Name
...
249998875 gbvrl1.seq
249997181 gbvrl10.seq
16808055 gbvrl11.seq
249996404 gbvrl12.seq
249998509 gbvrl13.seq
249998936 gbvrl14.seq
249997870 gbvrl15.seq
249999053 gbvrl16.seq
238429865 gbvrl17.seq
249999624 gbvrl2.seq
249994915 gbvrl3.seq
249999658 gbvrl4.seq
164453727 gbvrl5.seq
249998209 gbvrl6.seq
249998635 gbvrl7.seq
249999773 gbvrl8.seq
249997917 gbvrl9.seq

...
Division Entries Bases
VRL1 69597 67988073
VRL10 55226 73759817
VRL11 4179 5063852
VRL12 62617 71400140
VRL13 58385 72462844
VRL14 62822 65485288
VRL15 59572 73053270
VRL16 57413 71980516
VRL17 61164 69846358
VRL2 73438 64152240
VRL3 69896 60951713
VRL4 67799 70887231
VRL5 42441 44393369
VRL6 48350 77608209
VRL7 58303 71387958
VRL8 61757 72807020
VRL9 67995 67886914
...
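
The per-file figures above can be added up to get totals for the whole VRL division. A minimal sketch, assuming the first number on each row is the entry count and the second is the base count (the excerpt itself does not label the columns):

```python
# Rows copied from the VRL listing above: (division, entries, bases).
# Assumption: column 2 is the number of entries, column 3 the number of bases.
VRL_STATS = [
    ("VRL1", 69597, 67988073),
    ("VRL10", 55226, 73759817),
    ("VRL11", 4179, 5063852),
    ("VRL12", 62617, 71400140),
    ("VRL13", 58385, 72462844),
    ("VRL14", 62822, 65485288),
    ("VRL15", 59572, 73053270),
    ("VRL16", 57413, 71980516),
    ("VRL17", 61164, 69846358),
    ("VRL2", 73438, 64152240),
    ("VRL3", 69896, 60951713),
    ("VRL4", 67799, 70887231),
    ("VRL5", 42441, 44393369),
    ("VRL6", 48350, 77608209),
    ("VRL7", 58303, 71387958),
    ("VRL8", 61757, 72807020),
    ("VRL9", 67995, 67886914),
]


def totals(stats):
    """Return (total entries, total bases) over all listed files."""
    entries = sum(e for _, e, _ in stats)
    bases = sum(b for _, _, b in stats)
    return entries, bases


total_entries, total_bases = totals(VRL_STATS)
print(total_entries, total_bases)
```

The entry total comes to 980954, which is the same number of virus records reported further down in this thread.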


most sequenced organisms in Release 183.0:
Entries Bases Species
16863696 15675014152 Homo sapiens
8679108 9023900439 Mus musculus
2181397 6498706602 Rattus norvegicus
2198402 5380180551 Bos taurus
3925343 5053645051 Zea mays
3221262 4803375305 Sus scrofa
1704939 3127417957 Danio rerio
228260 1352941097 Strongylocentrotus purpuratus
1341465 1249691996 Oryza sativa Japonica Group
1770024 1194632155 Nicotiana tabacum
1424247 1147209201 Xenopus (Silurana) tropicalis
1218742 1054660637 Drosophila melanogaster
2316656 1013211794 Arabidopsis thaliana
214025 1002427288 Pan troglodytes
1453428 943663921 Canis lupus familiaris
661291 914659269 Vitis vinifera
810916 894552342 Gallus gallus
1889663 893141046 Glycine max
82470 826741112 Macaca mulatta
1217094 748805204 Ciona intestinalis

2.2.8 Growth of GenBank
From 1982 to the present, the number of bases in GenBank has doubled
approximately every 18 months.

Release Date Base Pairs Entries
3 Dec 1982 680338 606
36 Sep 1985 5204420 5700
66 Dec 1990 51306092 41057
92 Dec 1995 425860958 620765
121 Dec 2000 11101066288 10106023
151 Dec 2005 56037734462 52016762
181 Dec 2010 122082812719 129902276
182 Feb 2011 124277818310 132015054
183 Apr 2011 126551501141 135440924
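
The "doubled approximately every 18 months" statement can be checked roughly against the first and last rows of the table. This is only a sketch: it assumes steady exponential growth between those two releases and counts time in whole months.

```python
import math

# (year, month, base pairs) for the first and last rows of the table above
FIRST = (1982, 12, 680338)
LAST = (2011, 4, 126551501141)


def doubling_time_months(first, last):
    """Average doubling time in months, assuming steady exponential growth."""
    months = (last[0] - first[0]) * 12 + (last[1] - first[1])
    doublings = math.log2(last[2] / first[2])
    return months / doublings


print(round(doubling_time_months(FIRST, LAST), 1))
```

Over the whole period this gives a bit more than 19 months per doubling, which is close to the quoted 18 months.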

1. PRI - primate sequences
2. ROD - rodent sequences
3. MAM - other mammalian sequences
4. VRT - other vertebrate sequences
5. INV - invertebrate sequences
6. PLN - plant, fungal, and algal sequences
7. BCT - bacterial sequences
8. VRL - viral sequences
9. PHG - bacteriophage sequences
10. SYN - synthetic sequences
11. UNA - unannotated sequences
12. EST - EST sequences (expressed sequence tags)
13. PAT - patent sequences
14. STS - STS sequences (sequence tagged sites)
15. GSS - GSS sequences (genome survey sequences)
16. HTG - HTGS sequences (high throughput genomic sequences)
17. HTC - HTC sequences (high throughput cDNA sequences)
18. ENV - Environmental sampling sequences
19. CON - Constructed sequences
_________________
no patents on genes, publish the GISAID sequences !


PostPosted: Fri Apr 15, 2011 2:23 am 
Joined: Wed Aug 19, 2009 10:42 am
Posts: 27403
Location: Pittsburgh, PA USA
Reality check. Most recent human influenza sequences are NOT at GenBank. What is the purpose of this thread and of the inventory lists that you post?

_________________
www.twitter.com/hniman


PostPosted: Fri Apr 15, 2011 2:24 am 
Joined: Wed Aug 19, 2009 10:42 am
Posts: 27403
Location: Pittsburgh, PA USA
gsgs wrote:
please, no long travelogues in this thread, thanks.

Please.

_________________
www.twitter.com/hniman


PostPosted: Sat Apr 16, 2011 11:58 am 
Joined: Tue Sep 01, 2009 11:54 pm
Posts: 1775
Location: germany
trying to download and convert the files
trying to filter the flu sequences out of them
trying to correct the errors and convert them into computer-readable form
with suitable headers
trying to align them
trying to determine the "types" of the sequences
(flugenome.org plus other types of my own)

trying to find other people who are also doing this, so as to save time
(it only has to be done once)

anyone who seriously works with flu sequences must be doing this ?!?!?
Still, I can't find those computer-readable converted flu files anywhere
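
For readers who want to try the filtering step themselves, here is a minimal sketch in Python. It is not the poster's BASIC program; it assumes that records in a GenBank flat file end with a line containing only "//" and that flu records are the ones whose ORGANISM line starts with "Influenza". The two sample records are made up for illustration.

```python
import io


def iter_records(lines):
    """Yield each record as a list of lines, ending at the '//' terminator."""
    record = []
    for line in lines:
        record.append(line)
        if line.rstrip("\r\n") == "//":
            yield record
            record = []


def organism(record):
    """Return the text of the ORGANISM line, or '' if the record has none."""
    for line in record:
        if line.lstrip().startswith("ORGANISM"):
            return line.split("ORGANISM", 1)[1].strip()
    return ""


def is_flu(record):
    # Assumption: flu records have an organism name starting with "Influenza".
    return organism(record).startswith("Influenza")


# Made-up sample data, not real GenBank entries.
SAMPLE = """\
LOCUS       X00001      12 bp    RNA     linear   VRL 01-JAN-2000
DEFINITION  Made-up influenza record for illustration.
  ORGANISM  Influenza A virus
ORIGIN
        1 agcaaaagca gg
//
LOCUS       X00002      12 bp    DNA     linear   VRL 01-JAN-2000
DEFINITION  Made-up non-flu record for illustration.
  ORGANISM  Tobacco mosaic virus
ORIGIN
        1 gtatttttac aa
//
"""

flu = [r for r in iter_records(io.StringIO(SAMPLE)) if is_flu(r)]
print(len(flu), "flu record(s) kept")
```

On a real gbvrl file the text before the first LOCUS line (the file header) would end up inside the first record, so it would need to be skipped first.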



my best version currently is from Oct.2010
f:\gbvrl6\gbflu6 , which I can upload if there is interest

243MB , 144117 sequences (not aligned)
LOCUS:70349+73130+70819+69207+36491+48364+63407+61865+65661+44781+62262+58034+61433+59745+59155
DEFINITION:70347+73130+70820+69207+36490+48364+63406+61864+65661+44781+62262+58034+61433+59745+59153
//:

70349,70349,6006,6006
73130,73130,3168,3168
70820,70820,3070,3070
69207,69207,3445,3445
36491,36491,10429,10429
48364,48364,48364,48364
63408,63408,18325,18324
61865,61865,5457,5457
65661,65661,6890,6890
44782,44782,5402,5402
62262,62262,5275,5275
58034,58034,6407,6407
61433,61433,6740,6740
59745,59745,10522,10522
59155,59155,4596,4612
-----------------------
904706,904706,144096,144111


gbflu8:
LOCUS:144116
DEFINITION:144115
//:145530
%010//%013:144117
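
The counts above look like a consistency check: one LOCUS line, one DEFINITION line and one "//" terminator per record. A minimal sketch of such a check follows. It assumes that "%010//%013" means "//" standing on a line of its own (characters 10 and 13 being LF and CR), which would explain why plain "//" is counted more often. The sample text is made up.

```python
import io


def count_markers(lines):
    """Count record markers; '//' is counted both anywhere and on its own line."""
    counts = {"LOCUS": 0, "DEFINITION": 0, "//_anywhere": 0, "//_own_line": 0}
    for line in lines:
        if line.startswith("LOCUS"):
            counts["LOCUS"] += 1
        if line.startswith("DEFINITION"):
            counts["DEFINITION"] += 1
        counts["//_anywhere"] += line.count("//")
        if line.rstrip("\r\n") == "//":
            counts["//_own_line"] += 1
    return counts


# Made-up sample: two records, one of which mentions a URL containing "//".
SAMPLE = """\
LOCUS       X00001
DEFINITION  Made-up record one.
COMMENT     see http://example.org/
//
LOCUS       X00002
DEFINITION  Made-up record two.
//
"""

counts = count_markers(io.StringIO(SAMPLE))
print(counts)
```

If the LOCUS, DEFINITION and own-line "//" counts differ, some record is probably damaged or split.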


my headers:

AB000604,AB000604.1,GI:4520226,AB000604,
1136,RNA,16-DEC-2008,C/Johannesburg/66),
203223,,,Homo sapiens,,,C/Johannesburg/66),6,,,6,46B,5,1C,0,1

----------------------------------------------------
01:locus-ID
02:accession-ID
03:version-ID
04:GI-number
05:sequence length
06:nucleic acid type
07:genbank/submission date
08:strain
09:taxon
10:collection date
11:country (:province or city)
12:species
13: host age,gender,how swabbed
14: lab passage
15:strain
16:segment
17:serotype
18:strain
19:#epitope matches, best choice
20:flugenome.org-type, best choice
21:#epitope matches, 2nd best choice
22:flugenome.org-type, 2nd best choice
23:continent: 0=unknown,1=Europe,2=West-Asia,3=East-Asia,4=North-America,5=Oceania,6=South-America,7=Africa
24:species: 0=unknown,1=human,2=swine,3=equine or canine,4=feline,5=avian
25:number in this list of 144117, sorted by 01:

--:number in this list of 144117 sorted by 09: 16:
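
To show how such a header line could be read back in, here is a small sketch that splits the sample header from above and decodes the continent and species codes. It assumes no field contains a comma itself, and it picks fields by their position in the sample (which has 24 fields, so field 25 is absent there). The dictionary keys are names chosen for this example.

```python
# Code tables copied from fields 23 and 24 of the list above.
CONTINENTS = {0: "unknown", 1: "Europe", 2: "West-Asia", 3: "East-Asia",
              4: "North-America", 5: "Oceania", 6: "South-America", 7: "Africa"}
HOSTS = {0: "unknown", 1: "human", 2: "swine", 3: "equine or canine",
         4: "feline", 5: "avian"}

# The sample header from the post, with its three lines joined into one.
SAMPLE = ("AB000604,AB000604.1,GI:4520226,AB000604,"
          "1136,RNA,16-DEC-2008,C/Johannesburg/66),"
          "203223,,,Homo sapiens,,,C/Johannesburg/66),6,,,6,46B,5,1C,0,1")


def decode_header(header):
    """Split a header line and return a few of its fields by position."""
    fields = header.split(",")  # assumption: no commas inside a field
    return {
        "locus": fields[0],
        "length": int(fields[4]),
        "nucleic_acid": fields[5],
        "date": fields[6],
        "strain": fields[7],
        "taxon": fields[8],
        "species": fields[11],
        "segment": fields[15],
        "continent": CONTINENTS[int(fields[22])],
        "host": HOSTS[int(fields[23])],
    }


info = decode_header(SAMPLE)
print(info)
```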

_________________
no patents on genes, publish the GISAID sequences !


PostPosted: Sat Apr 16, 2011 1:36 pm 
Joined: Tue Sep 01, 2009 11:54 pm
Posts: 1775
Location: germany
just so you can see how it goes, and why this process should be done
once globally, in public, with the result stored at genbank ...

downloading the 17 virus files, unzipping them and converting them from UNIX (lf)
to DOS (cr+lf) line endings took 2 hours

putting them all into one file of 3990MB took 10min
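
The line-ending conversion and the concatenation can be done in one pass. A minimal sketch, not the tool used above: it reads each input as a binary stream, turns bare LF line ends into CR+LF, and appends the result to a single output. In-memory streams stand in for the unzipped files here; files opened with "rb" and "wb" should work the same way.

```python
import io


def append_as_crlf(src, dst):
    """Copy binary stream src to dst, turning bare LF line ends into CR+LF."""
    for line in src:
        if line.endswith(b"\n") and not line.endswith(b"\r\n"):
            line = line[:-1] + b"\r\n"
        dst.write(line)


# In-memory stand-ins for two unzipped files (made-up content).
parts = [io.BytesIO(b"LOCUS A\n//\n"), io.BytesIO(b"LOCUS B\n//\n")]
combined = io.BytesIO()
for part in parts:
    append_as_crlf(part, combined)

print(combined.getvalue())
```

Lines that already end in CR+LF are left alone, so running the conversion twice does not double the CR.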

161844 flu-records filtered out (804MB) of the 980954 virus-records (3990MB)
creates the file gbflu using program gbflu1.bas
(took 1.5h)

...

extracting the header-relevant data and the nucleotides from the
genbank records (program gbana2.bas) took 52 min
this creates the file gbflu7 of 284 MB with 161865 lines (header-data+nucleotides),
still unprocessed data
after throwing away the irrelevant data the files are getting smaller now
and can be processed faster
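
As an illustration of the extraction step, here is a sketch that pulls the locus name and the nucleotides out of one record. It is not gbana2.bas; it assumes the sequence sits between the ORIGIN line and the "//" terminator, in numbered lines of 10-letter blocks. The sample record is made up.

```python
def locus_name(record_lines):
    """Return the second token of the LOCUS line, or '' if there is none."""
    for line in record_lines:
        if line.startswith("LOCUS"):
            return line.split()[1]
    return ""


def extract_sequence(record_lines):
    """Return the nucleotides of one record as a single string."""
    seq = []
    in_origin = False
    for line in record_lines:
        if line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break
        elif in_origin:
            # Sequence lines look like "        1 agcaaaagca ggtagatatt";
            # keep only the letters, dropping position numbers and spaces.
            seq.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(seq)


# Made-up sample record, not a real GenBank entry.
RECORD = [
    "LOCUS       X00001      22 bp    RNA     linear   VRL 01-JAN-2000\n",
    "DEFINITION  Made-up record for illustration.\n",
    "ORIGIN\n",
    "        1 agcaaaagca ggtagatatt\n",
    "       21 ga\n",
    "//\n",
]

one_line = locus_name(RECORD) + "," + extract_sequence(RECORD)
print(one_line)
```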
...............................
copy gbflu7 gbflu7.c1 2min
............................
repl gbflu7 , '222 4min
................................
repl gbflu7 '033 , 4min
...................................
gbflu7t.bas separate gbflu7 into gbflu7.nam(47MB) for headers
and gbflu7.seq(238MB) for nucleotide-data 35min
....................................



...


-------------------------------
developing the software to process the data and make it computer-readable
also took many hours, and it needed to be improved and updated.
This was mainly done years ago. I hope there have been no format changes
in genbank and that these programs still work....


------------------------------
I give up for now, too much work; maybe I'll continue later.
currently I have gbflu7.seq , unaligned sequence data ,
and gbflu7.nam , 17 fields from the flu-records ,
no grouping of segments into viruses, no continent assignment,
no flugenome.org types, no name correction,
no species grouping (e.g. poultry, wild waterfowl, birds of prey, songbirds),
no sorting yet

_________________
no patents on genes, publish the GISAID sequences !


Last edited by gsgs on Sun Apr 17, 2011 1:31 am, edited 6 times in total.

PostPosted: Sat Apr 16, 2011 1:39 pm 
Joined: Wed Aug 19, 2009 10:42 am
Posts: 27403
Location: Pittsburgh, PA USA
gsgs wrote:
just so you can see how it goes and why this process should be globally
done and public and the result stored at genbank ...

161844 flu-records filtered out (804MB) of the 980954 virus-records (3990MB) (took 1.5h)
...

GenBank sequences are public.

_________________
www.twitter.com/hniman


PostPosted: Fri May 20, 2011 7:23 am 
Joined: Tue Sep 01, 2009 11:54 pm
Posts: 1775
Location: germany
no computer-readable processed genbank files are available, AFAIR.

I offered genbank mine, but they refused

_________________
no patents on genes, publish the GISAID sequences !


PostPosted: Fri May 20, 2011 7:24 am 
Joined: Tue Sep 01, 2009 11:54 pm
Posts: 1775
Location: germany
today ~50 partial HAs from Romania, 2009-2011

14 out of the 20 from this season in Romania were India/UK (340,605)
no chi

_________________
no patents on genes, publish the GISAID sequences !


PostPosted: Sat May 21, 2011 1:56 pm 
Joined: Tue Sep 01, 2009 11:54 pm
Posts: 1775
Location: germany
22 genomes from the India/UK strain are available now



I count
25=7+3+5+4+1+2+1+2 markers for the India/UK strain

with 2 substrains
1.) Beijing, Czech, Denmark, Ontario with
earliest THA/09/10
2.) Thailand, Ontario, Siberia with an additional 24=2+6+6+4+4+1+0+1 markers
(earliest THA/2010/09/07)

1.)
C666T,T957C,G1030A,A1060T,G1149A,A1446G,G2091A
C1186T,C1896A,C2190T
C99T,G663A,T873C,C963A,G1986A
G605C,T1056C,G1171A,C1403T
A1329G
C1107A,G1131A
G238A
C264T,T414C

plus the 11 Cancun markers

2.)
C1044T,G1665A
A366G,A630G,C999T,A1191G,T1304C,A1905G
C336T,A507G,G1027A,C1428T,T1776C,T2115C
C144T,G640A,C1437T,C1656T
C930T,C963T,G987A,T1059C
A131G
.
C268A

plus the 36 from 1.)

60 mutations --> created in mid 2010
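
The marker counts can be re-checked mechanically from the two lists as posted. A minimal sketch, assuming each line of a list belongs to one of the 8 segments in order, and that a marker like C666T means "C at position 666 changed to T". The 11 Cancun markers are only taken as a number, since they are not listed here.

```python
import re

# Marker lists copied from the post, one string per line of each listing.
SUBSTRAIN_1 = [
    "C666T,T957C,G1030A,A1060T,G1149A,A1446G,G2091A",
    "C1186T,C1896A,C2190T",
    "C99T,G663A,T873C,C963A,G1986A",
    "G605C,T1056C,G1171A,C1403T",
    "A1329G",
    "C1107A,G1131A",
    "G238A",
    "C264T,T414C",
]
SUBSTRAIN_2_EXTRA = [
    "C1044T,G1665A",
    "A366G,A630G,C999T,A1191G,T1304C,A1905G",
    "C336T,A507G,G1027A,C1428T,T1776C,T2115C",
    "C144T,G640A,C1437T,C1656T",
    "C930T,C963T,G987A,T1059C",
    "A131G",
    "",  # the post shows "." here, i.e. no marker on this line
    "C268A",
]
CANCUN_MARKERS = 11  # stated in the post, not listed individually

MARKER = re.compile(r"^([ACGT])(\d+)([ACGT])$")


def parse_line(line):
    """Parse 'C666T,T957C' into [('C', 666, 'T'), ('T', 957, 'C')]."""
    markers = []
    for token in line.split(","):
        token = token.strip()
        if not token:
            continue
        m = MARKER.match(token)
        if m is None:
            raise ValueError(f"not a marker: {token!r}")
        markers.append((m.group(1), int(m.group(2)), m.group(3)))
    return markers


def per_line_counts(lines):
    """Number of markers on each line of a listing."""
    return [len(parse_line(line)) for line in lines]


counts_1 = per_line_counts(SUBSTRAIN_1)
counts_2 = per_line_counts(SUBSTRAIN_2_EXTRA)
total = sum(counts_1) + CANCUN_MARKERS + sum(counts_2)
print(counts_1, sum(counts_1))
print(counts_2, sum(counts_2))
print(total)
```

Counted this way, list 1.) holds 25 markers and list 2.) holds 24 more, so together with the 11 Cancun markers the total is 60, matching the figure above.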

_________________
no patents on genes, publish the GISAID sequences !

