TITech 13 Nov 2000
Bioinformatics:
Converting Data to Knowledge
Gio Wiederhold
Stanford University
Computer Science, E.E. & Medicine
http://www-db.stanford.edu/people/gio.html
Data ✖ Knowledge
Aggregation Analyses
of instances
Integration
of sources
Observations
Filters
• The product: Information
10/15/08 Gio Wiederhold - TITech 2000 2
Bio-Information
• to learn about ourselves,
– our origins, our place in the world
Primates, Mice, Zebrafish, Fruit Flies, Roundworms, Yeast
– modesty, seeing how much we share with all organisms
– not just of philosophical interest, but also
• to help humanity to lead healthy lives
– to create new scientific methods
– to create new diagnostics
– to create new therapeutics
Loops of Data and Knowledge
Information is created at the confluence of
data -- the state
&
knowledge -- the ability to select and project the state into the future
[Figure: the Knowledge Loop and the Data Loop; labels: Storage, Education, Selection, Recording, Integration, Experience, Abstraction, State changes, Decision-making, Action]
Volume and Variety
Two interacting issues in generating
information
1. The volume is large --
we need automation
2. The data is varied & heterogeneous
• many autonomous sources
• many distinct objectives
➔ many incompatibilities, errors
Quantities (Nature / Progress)
1 human: the human genome, ~ 4 000 000 000 base pairs
> 30 000 genes: proteins
~ 10 000 diseases: genes, and gene abnormalities
6 000 000 000 humans: everybody’s genes
< 1000 systems: metabolic pathways
~ 2 000 000 molecules: small organic molecules - affect proteins - suitable for drugs
Diversity ➪ Heterogeneity
A wide variety of knowledge is needed to interpret the data
A large variety of experts is developing this knowledge
The scope of interests differs among those experts
The knowledge is expressed in diverse ways
The terms differ in precise meaning: semantics
A large variety of data types is needed
A wide variety of representations is used
The database and file schemas differ
The openness and accessibility of the information differ
Scope differences
A scope difference exists when terms differ in
their mapping to real-world objects
[Figure: "employee (payroll)" and "employee (personnel)" as overlapping subsets of "all possible employees"; the disabled and contractors fall where the two differ]
The local objective determines scope
Example: “binding site” in PDB database [Waugh&Altman]
binding sites reported for publication doubtful
all actual binding sites
reporting doubtful results risks rejection of publication
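A scope difference can be made concrete with sets: the same term maps to a different collection of real-world objects in each source. The sketch below is only an illustration; the individual names and their assignment to the payroll and personnel scopes are assumptions invented for this example.

```python
# Hypothetical people; which set each one belongs to is an assumption.
all_possible_employees = {"ann", "bob", "carol", "dave"}

# The term "employee" as scoped by two departments with different objectives.
employee_payroll = {"ann", "bob", "carol"}    # assumed: carol is a paid contractor
employee_personnel = {"ann", "bob", "dave"}   # assumed: dave is on disability leave

# Objects on which both sources agree.
shared = employee_payroll & employee_personnel

# Objects on which the term "employee" disagrees: the scope difference.
scope_difference = employee_payroll ^ employee_personnel

print(sorted(shared))            # ['ann', 'bob']
print(sorted(scope_difference))  # ['carol', 'dave']
```

A join on "employee" across the two sources silently drops or mislabels the members of `scope_difference`, which is why the local objective has to be known before integrating.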
Heterogeneity inhibits Integration
• An essential feature of science
– autonomy of fields
– differing granularity and scope of focus
– growth of fields requires new terms
• A feature of technological progress
– standards require stability
– yesterday’s innovations are today’s infrastructure
Must be dealt with explicitly
– sharing, integration, and aggregation are essential
– large quantities of data require precision
Heterogeneity among domains is natural
Interoperation creates mismatch
• Autonomy conflicts with consistency,
– Local Needs have Priority,
– Outside uses are a Byproduct
Heterogeneity must be addressed
• Platform and Operating Systems ✔ ✔
• Data Representation and Access Conventions ✔
• Metadata: Annotations, Naming, and Ontology ✚
– needed to share data from distinct sources
Required precision = F(volume)
More precision is needed as data volume increases
--- a small error rate still leads to too many errors
False Positives have to be investigated
( attractive-looking supplier - makes toys
apparent drug-target with poor annotation )
False Negatives cause lost opportunities,
suboptimal to some degree
[Figure: data errors versus information quantity; labels: acceptable limit, human limits?, human with tool, Information Wall]
adapted from Warren Powell, Princeton Un.
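The arithmetic behind "required precision = F(volume)" can be sketched as follows. The error rate and the investigation budget are assumed values chosen only to show the trend; they do not come from the talk.

```python
def expected_false_positives(volume: int, false_positive_rate: float) -> float:
    """Expected number of false positives when every item is screened."""
    return volume * false_positive_rate


def max_tolerable_rate(volume: int, investigation_budget: int) -> float:
    """Largest false-positive rate that keeps follow-up work within the budget."""
    return investigation_budget / volume


ASSUMED_RATE = 0.001    # assumption: one false positive per thousand items
ASSUMED_BUDGET = 100    # assumption: analysts can investigate 100 candidates

for volume in (10_000, 1_000_000, 100_000_000):
    errors = expected_false_positives(volume, ASSUMED_RATE)
    rate = max_tolerable_rate(volume, ASSUMED_BUDGET)
    print(f"volume={volume:>11,}  expected false positives={errors:>9,.0f}  "
          f"tolerable rate={rate:.0e}")
```

With a fixed error rate the number of false positives grows linearly with volume, so the tolerable rate has to shrink in proportion if the human follow-up effort stays constant.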
Inconsistency causes errors,
while results need precision
False positives = poor precision
typically cost more than
false negatives = poor recall
Example: [ Todd Lowe tRNA search <rna.wustl.edu/tRDB> ]
Search in Yeast for 55 methylation sites
-- required manual elimination of pseudogenes
Search space in human genome is 215 times larger, not yet done
In drug discovery we now have more targets than
pharmaceutical companies can afford.
Broad array of relatable sources
• Genomic
• Bibliographic
• Demographic
• Epidemiological
– Familial
– Contacts
• Clinical
– Drug effectiveness
– Drug-resistance
– Co-occurrence
[ Many used in data-mining: as PRM (Probabilistic Relational Model), research by Lise Getoor @ stanford ]
Requires acyclicity. Use temporal dependencies?
Intersection of a large (irrelevant data)
and a small (good data) distribution.
The optimal separation creates more
false positives (irrelevant results)
than
false negatives (good results missed)
[Figure: the two overlapping distributions; label: Result]
Quality of data verified through publication
Data characteristics project
[Stephen Koslow, Office on Neuroinformatics, NIMH
www.nimh.nih.gov/neuroinformatics/index.cfg]
The human brain uses 15 Watts; has dozens of cell types,
100 billion (10^11) neural cells, 10^15 connections.
Neuroscience is a growing field, includes neuroinformatics.
Initial, broad journals, reductionist journals. Numerical,
symbolic, literature and image data. The volume of publication
is becoming impossible to follow: serotonin alone, discovered in 1948,
now has 70 000 papers.
Voluminous 3-D MRI data. UCLA brain mapping. Basis for
localization of diagnostic EEG, MEG observations.
Projects requiring manual curation
are domain specific
Virtual Cell Project
Dong-Guk Shin, Univ. Connecticut
[email protected] also available without DB support, www.nrcam.uchc.edu
NIH supported: Physiology modeling,
NSF support: computational modeling approach.
Bottom-up approach to cell modeling: Cross checking of models and
HXs: Geometry from segmented images, 2-D visualization of specified
reactions: channels, pumps, for extra, intra (cytosol), of core cellular
compartments. Generates equations for simulation.
Result is a DB publication cycle, supporting model copying and
adaptation.
Access to remote DBs will need more than a browser: also a
query system, with join over association. DBs need APIs and
mediation for scalability and mismatch.
Data integration in Literature
[ Jim Garrels, Proteome, Inc. www.proteome.cm - free ]
BioKnowledge Library, a portal site: with 50 billion bytes of text
covering the 5 billion bytes in Genbank.
Classification, curated by experts.
Pages {title with brief functional description, family, properties (Mutant
phenotype, ) , } sequence annotations, related proteins: Orthologs
and Interlogs (in different species) [Marc Vidal, MGH],
Integrated data from cDNA microarrays and chips, systematic 2-hybrids,
Model-organisms: First Yeast, now Worms [Stuart Kim, Stanford],
Several 1000 physical associations and interactions.
Authors should not publish experimental data directly into a DB and
curate their own papers, but submit their results and publish detailed
expression studies and update their own results.
Relationships among search parameters
recall r = v.relevant / v.available
precision p = v.relevant / v.retrieved
[Figure: %tage actually relevant (0%, 50%, 100%) over the space of methods, ranked from best; curves: perfect recall, perfect precision, volume retrieved, volume available]
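The two ratios on this slide can be computed directly from a retrieved set and the set of relevant items that were available. This is a minimal sketch; the document identifiers are invented, and reading v.available as "relevant items available" is an interpretation of the slide's notation.

```python
def precision(v_relevant: int, v_retrieved: int) -> float:
    """Share of the retrieved volume that is relevant."""
    return v_relevant / v_retrieved if v_retrieved else 0.0


def recall(v_relevant: int, v_available: int) -> float:
    """Share of the available relevant volume that was retrieved."""
    return v_relevant / v_available if v_available else 0.0


# Hypothetical search outcome.
retrieved = {"doc1", "doc2", "doc3", "doc4"}
relevant_available = {"doc1", "doc2", "doc5"}

v_relevant = len(retrieved & relevant_available)   # relevant items actually retrieved

p = precision(v_relevant, len(retrieved))
r = recall(v_relevant, len(relevant_available))
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

The two false positives (doc3, doc4) lower precision, while the single false negative (doc5) lowers recall.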
Means to achieve precision in text
Textual information - knowledge - complements
pure data-oriented searches such as BLAST [Liu & Altman]
• Reduce redundancy
– omit similar results from alternate sources
reports, workshop papers, journals, books
• Reduce false positives
– recognize contextual domains *
• the same word refers to different object types
nail (carpentry, anatomy), miter (carpentry, religion)
• Abstract findings to higher levels
– Linguistic processing based on customer model
medical case studies have similar formats
Integration makes Semantic Mismatches visible
Information comes from many autonomous sources
• Differing viewpoints (by source)
– differing terms for similar items { lorry, truck }
– same terms for dissimilar items trunk ( luggage, car )
– differing coverage vehicles ( DMV, police, AIA )
– differing granularity trucks (shipper, manuf.)
– different scope student (museum fee, Stanford )
• Hinders use of information from disjoint sources
– missed linkages: loss of information, opportunities
– irrelevant linkages: overload on user or application program
• Poor precision for interoperation
OK for web browsing, poor for business and science
Shared Knowledge Base
PharmGKB – PharmacoGenetics Knowledge Base starting 2000
“An Ontology for Genetic Information” [Russ Altman]
<pharmgkb.org> based at Stanford, funded by NIGMS
to link existing projects – but open to others.
Phenotype variation --> Genotype variation
• Phase 2 metabolizing enzymes – R.Weinshilboum at Mayo Clinic
• Asthma -- Weiss (was Jeff Raizin) at Harvard Un.
• Anti-cancer agents -- Mark Ratain at Un. of Chicago
• Membrane Transporters -- Kathleen Giacomini, UCSF
• Tamoxifen metabolic activation -- Dave Flockhart at Georgetown Un.
• Minority Populations and Privacy – M.Rothstein at Univ of Houston ➹
• Depression in Mexican-Americans -- J.Licinio at UCLA
• Database Tools -- Prakash Nadkarni at Yale Un.
Complex Relationships
[Figure: isolated versus integrated functional measures. Nodes: Genomic information, Coding, Protein Products, Molecules, Alleles, Molecular & cellular phenotype, Clinical phenotype, Physiology, Genetic Makeup, Individuals, Pharma. activity, Drug response systems, Drugs, Environment, Non-genetic factors. Links: Variations in genome, Observable phenotypes, Molecular Variation, Role in organism, Treatment protocol.]
courtesy of R.Altman & Teri Klein, PharmGKB
PharmGKB
• Ontology for pharmacogenetics
– Represented in Protégé
[Musen: smi.stanford.edu/project/protege]
• Service for Universities and Industry
• open access to information and tools, but not a warehouse
– Industrial affiliates: contributors and consumers at larger scales:
• geneticXchange
• Merck Co
• Pharmacia
• SmithKline-Beecham ( & Glaxo-Wellcome )
• GeneLogic
• Guidant
• Doubletwist
• Incyte
• Informax
• SGI
• Sun
• Collaboration in larger topics:
– Biotechnology -- Clark Center
– Education -- NIH sponsored training program, new UG degrees
Consistency: global or partial ?
• Global consistency
+ wonderful for users and their programs
– too many interacting sources
– long time to achieve, 2 sources (UAL, LH), 3 (+ trucks), 4, … all ?
– costly maintenance, since all sources evolve
– no world-wide authority to dictate conformance
• Domain-specific ontologies ( XML DTD assumption )
+ Small, focused, cooperating groups
+ high quality, some examples - arthritis, Shakespeare plays
+ allows sharable, formal tools
+ ongoing, local maintenance affecting users - annual updates
– poor interoperation, users still face inter-domain mismatches
– periodic source updates need automation in interoperation
Stanford Infolab SKC project
( Scalable Knowledge Composition )
Objective: High precision in semantic
interoperation of autonomous sources
• Basic -- pessimistic -- assumption:
– The ontological mapping of terms ↔ objects
differs between autonomous domains.
• But
– The collections of real-world objects provide a
grounding for the definitions, and an
opportunity to validate the meaning of the
terms being employed.
– Relationships have a semantic and a related
structural significance.
Exploit Domain-specific Expertise
Knowledge needed is huge
in science and in business
• Partition into natural domains
• Determine domain responsibility and authority
• Empower domain owners
• Provide tools
Consider interaction
[Figure: three groups, each labeled "Society of specialists"]
SKC grounded definitions
• Ontology:
a set of terms and their relationships
• Term:
a reference to real-world and abstract objects
• Relationship:
a named and typed set of links between objects
• Reference:
a label that names objects
• Abstract object:
a concept which refers to other objects
• Real-world object:
an entity instance with a physical manifestation
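The definitions above map naturally onto a small data model. The following is a sketch only: the class and field names are invented for illustration and are not the SKC project's actual data structures.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RealWorldObject:
    """An entity instance with a physical manifestation."""
    identifier: str


@dataclass(frozen=True)
class AbstractObject:
    """A concept which refers to other objects."""
    name: str
    refers_to: frozenset  # of RealWorldObject or AbstractObject


@dataclass(frozen=True)
class Term:
    """A reference (a label that names objects), real-world or abstract."""
    label: str
    objects: frozenset


@dataclass(frozen=True)
class Relationship:
    """A named and typed set of links between objects."""
    name: str
    kind: str
    links: frozenset  # of (source, target) pairs


@dataclass
class Ontology:
    """A set of terms and their relationships."""
    terms: set = field(default_factory=set)
    relationships: set = field(default_factory=set)


# Hypothetical instances: two physical trucks ground the term "truck".
truck1 = RealWorldObject("truck-0001")
truck2 = RealWorldObject("truck-0002")
vehicle = AbstractObject("vehicle", frozenset({truck1, truck2}))
truck_term = Term("truck", frozenset({truck1, truck2}))
is_a = Relationship("is_a", "subclass", frozenset({(truck1, vehicle), (truck2, vehicle)}))

ontology = Ontology(terms={truck_term}, relationships={is_a})
print(len(ontology.terms), len(ontology.relationships))  # 1 1
```

Because each term carries the objects it names, two ontologies can be compared by checking whether their terms cover the same real-world objects, which is the grounding the slide refers to.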
Sample Operation: INTERSECTION
Result contains shared terms, useful for purchasing
[Figure: the Articulation overlaps the two source domains]
Source Domain 1: Owned and maintained by Store
Source Domain 2: Owned and maintained by Factory
An Ontology Algebra
A knowledge-based algebra for ontologies
Intersection: create a subset ontology, keep sharable entries
Union: create a joint ontology, merge entries
Difference: create a distinct ontology, remove shared entries
The Articulation Ontology (AO) consists of matching
rules that link domain ontologies
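The three operations can be sketched over a deliberately reduced model in which an ontology is just a set of term labels and the articulation ontology is a set of matching rules, each pairing a term from one source with a term from the other. The store and factory terms and the rules are invented for illustration; the real algebra also carries relationships, which this sketch omits.

```python
# Simplification: an ontology is only a set of term labels.
store = {"shoe", "price", "size", "shelf"}
factory = {"footwear", "cost", "size", "machine"}

# Hypothetical articulation rules: (store term, factory term) judged to match.
articulation = {("shoe", "footwear"), ("price", "cost"), ("size", "size")}


def intersection(rules):
    """Subset ontology: keep only the sharable (matched) entries."""
    return set(rules)


def union(a, b, rules):
    """Joint ontology: every entry, with matched pairs merged into one entry."""
    matched_a = {x for x, _ in rules}
    matched_b = {y for _, y in rules}
    merged = {frozenset(pair) for pair in rules}
    # Unmatched terms stay separate entries (identical unmatched labels would
    # collapse here, which a fuller model would have to prevent).
    rest = {frozenset({t}) for t in (a - matched_a) | (b - matched_b)}
    return merged | rest


def difference(a, shared_terms):
    """Distinct ontology: remove the entries of `a` that are shared."""
    return a - shared_terms


shared_store_terms = {x for x, _ in articulation}
print(len(intersection(articulation)))                  # 3
print(len(union(store, factory, articulation)))         # 5
print(sorted(difference(store, shared_store_terms)))    # ['shelf']
```

The difference result is the material that remains fully under the store's local control, since no other source depends on it.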
INTERSECTION support
Articulation ontology: terms useful for purchasing
Matching rules that use terms from the 2 source domains
[Figure: the articulation ontology sits between the Store Ontology and the Factory Ontology]
Other Basic Operations
UNION: merging entire ontologies
DIFFERENCE: material fully under local control
Articulation ontology: typically prior intersections
Tools to create articulations
[Figure: a graph matcher for the articulation-creating expert takes the Vehicle ontology and the Transport ontology and produces suggestions for articulations]
continue from initial point
Also suggest similar terms
for further articulation:
• by spelling similarity,
• by graph position
• by term match nexus
Expert response:
1. Okay
2. False
3. Irrelevant
to this articulation
All results are recorded
Okays are converted into articulation rules
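The "by spelling similarity" suggestion step and the recording of expert responses can be sketched with Python's standard difflib module. The term lists, the similarity cutoff, and the expert verdicts are all assumptions made for this example; the actual SKC tool may rank candidates quite differently.

```python
from difflib import get_close_matches

# Hypothetical term lists from two ontologies.
vehicle_terms = ["truck", "lorry", "automobile", "tire", "owner"]
transport_terms = ["trucks", "cargo", "tyre", "driver", "automobiles"]


def suggest_by_spelling(terms_a, terms_b, cutoff=0.7):
    """Propose (a, b) pairs whose spelling similarity reaches the cutoff."""
    suggestions = []
    for term in terms_a:
        for candidate in get_close_matches(term, terms_b, n=3, cutoff=cutoff):
            suggestions.append((term, candidate))
    return suggestions


def record_responses(suggestions, verdicts):
    """Log every expert response and turn the Okays into articulation rules."""
    log = [(a, b, verdicts.get((a, b), "Irrelevant")) for a, b in suggestions]
    rules = [(a, b) for a, b, verdict in log if verdict == "Okay"]
    return log, rules


suggestions = suggest_by_spelling(vehicle_terms, transport_terms)

# Assumed expert verdicts: "Okay", "False", or "Irrelevant".
verdicts = {("truck", "trucks"): "Okay", ("tire", "tyre"): "Okay"}
log, rules = record_responses(suggestions, verdicts)

print(suggestions)
print(rules)
```

Spelling similarity misses true synonyms such as lorry and truck, which is why the slide also lists graph position and the term match nexus as further sources of suggestions.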
Candidate Match Nexus
Term linkages automatically extracted from 1912 Webster’s dictionary *
Notice presence of 2 domains: chemistry, transport
Based on processing headwords ➽ definitions using algebra primitives
* free; have processed the OED (Oxford English Dictionary) at Stanford for internal use
Using the Match Nexus
Experiment:
On government structures of
NATO countries:
SKEIN system resolved
over 70% of unmatched terms
Features of an algebra
Operations can be composed
Operations can be rearranged
Alternate arrangements can be evaluated
Optimization is enabled
The record of past operations can be
kept and reused when sources change
Knowledge Composition
[Figure: knowledge resources A, B, C, D, E at the base; articulation knowledge for (A ∩ B), (B ∩ C), (C ∩ D), and (C ∩ E) above them; composed knowledge for applications using A, B, C, E at the top, formed as (A ∩ B) ∪ (B ∩ C) ∪ (C ∩ E). Legend: ∪ : union, ∩ : intersection]
Support Domain Specialization
• Knowledge Acquisition (20% effort) &
• Knowledge Maintenance (80% effort *)
to be performed
• Domain specialists
• Professional organizations
• Field teams of modest size
autonomously
maintainable
Empowerment
* based on experience with software
Summary: Scalable Knowledge Composition (SKC)
Provide for Maintainable Ontologies
• devolve maintenance onto many
domain-specific experts / authorities
• provide an algebra to compute
composed ontologies that are
limited to their articulation terms
• enable interpretation within the
source contexts
Many Other Tasks at/near Stanford
Matching cell / protein 3D with chemical’s 3D
• Regulatory Gene motifs :
– Bioprospector [ Brutlag & Liu <www-cmgm.stanford.edu> ]
• Protein structure generation
– moving from small to larger proteins
1: Powerful parallel processing [IBM BlueGene]
2: Two-level : use features as an intermediate
(alpha-helix, beta-sheets, …)
3: Protein Folding speedup by delegation
[Shirts & Pande: foldingathome.stanford.edu ]
• RNA folding (simpler, larger) [Nakatani & Pande]
Provenance of derived data
Assure having a proper history of derived results
[ Peter Buneman, UPenn, www.humgen.upenn.edu ] K2 integration tool
Integrated databases often don’t indicate the original sources
E.g., SwissProt does not distinguish inferred from observed data.
[ William Gelbart, Harvard University] Flybase
Flybase also collects data as exons and their mutations, transposon insertion sites.
Moving from being Hunter Gatherers in science to Harvesters, moving to an
agronomical society
Classical genomics is being superseded by expression and interaction of gene products
and gene perturbation.
[ Peter Karp, SRI Int., Bioinformatics Res.Group, www.ai.sri.com/pkarp/ ] EcoCyc
EcoCyc links proteins to 150 metabolic pathways in E. coli
Databases are supplanting journals. They are re-analyzable. Results in journals are not.
An estimated 500 public databases now exist for bioinformatics; not all
of them have APIs or use real DBMSs, and they have differing models and
units of measurement, leading to semantic problems.
The People Problem
The demand for people in bioinformatics is high,
at all levels
• Critical is a lack of
– training opportunities - programs and teachers
– available trainees
• Being in a multi-disciplinary field is scary
– tenure for faculty
– load for students
– salary and growth differentials in biology and CS
• Some institutions are moving aggressively
– must compete with World-Wide Web visions
Bioinformatics:
Converting Data to Knowledge
• The means: People
• The product: Information
Up-to-dateness
[Figure: %tage up-to-date (0%, 50%, never 100%) versus frequency of visits (0, 1, ?, as often as possible), for frequencies of source change of 1/year, 1/month, 1/week, 1/day, 1/hour, 1/minute, 1/second]
F(user need)
∫ = effort, methods
F(capability given 2.2M public sites with 288M pages ), Feb.2000
Privacy requires Ethics
Knowledge carries responsibilities.
How will people feel about your knowledge about them?
their genetic make-up,
physical & psychological propensities.
Privacy is hard to formalize,
but that does not mean it is not real to people.
Perceptions count.
(There is also real stuff -
insurance scams - personal relations )
Diagnostics without therapies.
Securing Collaboration
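[Figure: a Collaborator sends a source query to the Security Filter, which passes a certified query to the Private Patient Data; the unfiltered result returns to the Security Filter, which releases a certified result to the Collaborator and keeps Logs]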
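The flow in this figure (certify the query, filter the result, log both) can be sketched as a small gatekeeper class. This is a hypothetical illustration, not the TIHI implementation: the class name, the field-based release rule, and the patient record are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class SecurityFilter:
    """Hypothetical gatekeeper between a collaborator and private data."""
    allowed_fields: set
    log: list = field(default_factory=list)

    def certify_query(self, requested_fields):
        """Certify a query only if every requested field may be released."""
        ok = set(requested_fields) <= self.allowed_fields
        status = "certified" if ok else "rejected"
        self.log.append(("query", sorted(requested_fields), status))
        return ok

    def certify_result(self, rows):
        """Strip every field that is not explicitly allowed from the result."""
        certified = [
            {k: v for k, v in row.items() if k in self.allowed_fields}
            for row in rows
        ]
        self.log.append(("result", len(certified), "certified"))
        return certified


# Invented record standing in for the private patient data.
patient_rows = [
    {"patient_id": "p-001", "name": "hypothetical", "age_band": "40-49", "diagnosis": "asthma"},
]

gate = SecurityFilter(allowed_fields={"age_band", "diagnosis"})

released = []
if gate.certify_query(["age_band", "diagnosis"]):
    released = gate.certify_result(patient_rows)

rejected = gate.certify_query(["name"])   # identifying field: not certified

print(released)
print(rejected, len(gate.log))
```

Filtering the result as well as the query matters because a certified query can still return fields or rows that should not leave the institution; the log gives the data owner an audit trail of both steps.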
Gio Wiederhold TIHI Oct96 48