Daniel Lemire


Biography

Université du Québec à Montréal - Computer Science


Resume

  • 2015

    Xiaodan Zhu

    Journal of the Association for Information Science and Technology 66 (2)

    The importance of a research article is routinely measured by counting how many times it has been cited. However, treating all citations with equal weight ignores the wide variety of functions that citations perform. We want to automatically identify the subset of references in a bibliography that have a central academic influence on the citing paper. For this purpose, we examine the effectiveness of a variety of features for determining the academic influence of a citation. By asking authors to identify the key references in their own work, we created a data set in which citations were labeled according to their academic influence. Using automatic feature selection with supervised machine learning, we found a model for predicting academic influence that achieves good performance on this data set using only four features. The best features, among those we evaluated, were those based on the number of times a reference is mentioned in the body of a citing paper. The performance of these features inspired us to design an influence-primed h-index (the hip-index). Unlike the conventional h-index, it weights citations by how many times a reference is mentioned. According to our experiments, the hip-index is a better indicator of researcher performance than the conventional h-index.

    Measuring academic influence: Not all citations are equal
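
    To make the indexing idea concrete, here is a minimal C sketch of the h-index rule (the largest h such that at least h papers have h or more citations). The hip-index applies the same rule to mention-weighted counts; the exact weighting is the one described in the paper, and the weighted numbers below are made up purely for illustration.

        #include <stdio.h>
        #include <stdlib.h>

        /* Sort citation counts in decreasing order. */
        static int cmp_desc(const void *a, const void *b) {
            int x = *(const int *)a, y = *(const int *)b;
            return (y > x) - (y < x);
        }

        /* Largest h such that at least h papers have h or more citations. */
        static int h_index(int *counts, int n) {
            qsort(counts, n, sizeof(int), cmp_desc);
            int h = 0;
            while (h < n && counts[h] >= h + 1) h++;
            return h;
        }

        int main(void) {
            int raw[] = {10, 8, 5, 4, 3};      /* raw citation counts */
            int weighted[] = {25, 9, 7, 6, 5}; /* hypothetical mention-weighted counts */
            printf("h-index: %d\n", h_index(raw, 5));
            printf("hip-index (illustrative): %d\n", h_index(weighted, 5));
            return 0;
        }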

    It is becoming common to archive research datasets that are not only large but also numerous. In addition, their corresponding metadata and the software required to analyse or display them need to be archived. Yet the manual curation of research data can be difficult and expensive, particularly in very large digital repositories, hence the importance of models and tools for automating digital curation tasks. The automation of these tasks faces three major challenges: (1) research data and data sources are highly heterogeneous, (2) future research needs are difficult to anticipate, (3) data is hard to index. To address these problems, we propose the Extract, Transform and Archive (ETA) model for managing and mechanizing the curation of research data. Specifically, we propose a scalable strategy for addressing the research-data problem, ranging from the extraction of legacy data to its long-term storage. We review some existing solutions and propose novel avenues of research.

    Extracting, Transforming and Archiving Scientific Data

    Compressed bitmap indexes are used in databases and search engines. Many bitmap compression techniques have been proposed, almost all relying primarily on run-length encoding (RLE). However, on unsorted data, we can get superior performance with a hybrid compression technique that uses both uncompressed bitmaps and packed arrays inside a two-level tree. An instance of this technique, Roaring, has recently been proposed. Due to its good performance, it has been adopted by several production platforms (e.g., Apache Lucene, Apache Spark, Apache Kylin and Druid). Yet there are cases where run-length encoded bitmaps are smaller than the original Roaring bitmaps---typically when the data is sorted so that the bitmaps contain long compressible runs. To better handle these cases, we build a new Roaring hybrid that combines uncompressed bitmaps, packed arrays and RLE compressed segments. The result is a new Roaring format that compresses better. Overall, our new implementation of Roaring can be several times faster (up to two orders of magnitude) than the implementations of traditional RLE-based alternatives (WAH, Concise, EWAH) while compressing better. We review the design choices and optimizations that make these good results possible.

    Consistently faster and smaller compressed bitmaps with Roaring
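
    A rough C sketch of the container-selection rule this hybrid implies, using the sizes of the published Roaring format (2 bytes per value in an array container, a fixed 8 KiB bitmap container, and 4 bytes per run plus a 2-byte count in a run container). The production implementations apply more refined heuristics; this only illustrates the space trade-off.

        #include <stddef.h>
        #include <stdint.h>

        typedef enum { ARRAY, BITMAP, RUNS } container_t;

        /* Storage cost, in bytes, of each container type for one
           65536-value chunk of the bitmap. */
        static size_t array_bytes(uint32_t cardinality) { return 2 * (size_t)cardinality; }
        static size_t run_bytes(uint32_t nruns)         { return 2 + 4 * (size_t)nruns; }
        enum { BITMAP_BYTES = 8192 };

        /* Pick the smallest representation given the chunk's cardinality
           and its number of runs of consecutive values. */
        static container_t choose_container(uint32_t cardinality, uint32_t nruns) {
            size_t a = array_bytes(cardinality);
            size_t r = run_bytes(nruns);
            if (r <= a && r <= BITMAP_BYTES) return RUNS;
            return (a <= BITMAP_BYTES) ? ARRAY : BITMAP;
        }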

    In many important applications -- such as search engines and relational database systems -- data is stored in the form of arrays of integers. Encoding and, most importantly, decoding of these arrays consumes considerable CPU time. Therefore, substantial effort has been made to reduce costs associated with compression and decompression. In particular, researchers have exploited the superscalar nature of modern processors and SIMD instructions. Nevertheless, we introduce a novel vectorized scheme called SIMD-BP128 that improves over previously proposed vectorized approaches. It is nearly twice as fast as the previously fastest schemes on desktop processors (varint-G8IU and PFOR). At the same time, SIMD-BP128 saves up to 2 bits per integer. For even better compression, we propose another new vectorized scheme (SIMD-FastPFOR) that has a compression rate within 10% of a state-of-the-art scheme (Simple-8b) while being two times faster during decoding.

    Decoding billions of integers per second through vectorization
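
    The following scalar C sketch shows the bit-packing idea underlying SIMD-BP128: a block of integers is stored using b bits each, where b is the bit width of the largest value in the block. The real scheme packs 128 integers at a time with vector instructions; this simplified version assumes each value fits in b bits and that the output buffer has one spare word.

        #include <stdint.h>

        /* Smallest bit width able to hold every value in the block. */
        static uint32_t max_bits(const uint32_t *in, int n) {
            uint32_t accum = 0, b = 0;
            for (int i = 0; i < n; i++) accum |= in[i];
            while (accum != 0) { accum >>= 1; b++; }
            return b;
        }

        /* Pack n values of width b (0 < b <= 32) into out; returns the
           number of 32-bit words written. */
        static int pack(const uint32_t *in, int n, uint32_t b, uint32_t *out) {
            int word = 0;
            uint32_t used = 0;              /* bits consumed in out[word] */
            out[0] = 0;
            for (int i = 0; i < n; i++) {
                out[word] |= in[i] << used; /* low bits of the value */
                used += b;
                if (used >= 32) {           /* current word is full */
                    used -= 32;
                    word++;
                    if (used > 0)
                        out[word] = in[i] >> (b - used); /* spill high bits */
                    else if (i + 1 < n)
                        out[word] = 0;
                }
            }
            return word + (used > 0);
        }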

    Ji-Rong Wen, Hongfei Yan, Jian-Yun Nie, Dongdong Shan, Xudong Zhang, Wayne Xin Zhao

    ACM Transactions on Information Systems 33 (3)

    Compression algorithms are important for data-oriented tasks, especially in the era of Big Data. Modern processors equipped with powerful SIMD instruction sets provide us an opportunity to achieve better compression performance. Previous research has shown that SIMD-based optimizations can multiply decoding speeds. Following these pioneering studies, we propose a general approach to accelerate compression algorithms. By instantiating the approach, we have developed several novel integer compression algorithms, called Group-Simple, Group-Scheme, Group-AFOR, and Group-PFD, and implemented their corresponding vectorized versions. We evaluate the proposed algorithms on two public TREC datasets, a Wikipedia dataset and a Twitter dataset. With competitive compression ratios and encoding speeds, our SIMD-based algorithms outperform state-of-the-art non-vectorized algorithms with respect to decoding speeds.

    A General SIMD-based Approach to Accelerating Compression Algorithms

    Compression can sometimes improve performance by making more of the data available to the processors faster. We consider the compression of integer keys in a B+-tree index. For this purpose, systems such as IBM DB2 use variable-byte compression over differentially coded keys. We revisit this problem with various compression alternatives such as Google's VarIntGB, Binary Packing and Frame-of-Reference. In all cases, we describe algorithms that can operate directly on compressed data. Many of our alternatives exploit the single-instruction-multiple-data (SIMD) instructions supported by modern CPUs. We evaluate our techniques in a database environment provided by Upscaledb, a production-quality key-value database. Our best techniques are SIMD accelerated: they simultaneously reduce memory usage while improving single-threaded speeds. In particular, a differentially coded SIMD binary-packing technique (BP128) can offer superior query speed (e.g., 40% better than an uncompressed database) while providing the best compression (e.g., by a factor of ten). For analytic workloads, our fast compression techniques offer compelling benefits. Our software is available as open source.

    Upscaledb: Efficient Integer-Key Compression in a Key-Value Store using SIMD Instructions
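
    As a small illustration of the differential coding these schemes build on, here is the standard delta transform in C: sorted keys become small gaps, which then pack into few bits or bytes. The SIMD variants in the paper vectorize both this transform and the packing; this is only the scalar skeleton.

        #include <stdint.h>

        /* Replace sorted keys by gaps; keys[0] stays as the base. */
        static void delta_encode(uint32_t *keys, int n) {
            for (int i = n - 1; i > 0; i--)
                keys[i] -= keys[i - 1];
        }

        /* Prefix sum restores the original keys. */
        static void delta_decode(uint32_t *keys, int n) {
            for (int i = 1; i < n; i++)
                keys[i] += keys[i - 1];
        }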

    Compressed bitmap indexes are used to speed up simple aggregate queries in databases. Indeed, set operations like intersections, unions and complements can be represented as logical operations (AND, OR and NOT) that are ideally suited for bitmaps. However, it is less obvious how to apply bitmaps to more advanced queries. For example, we might seek products in a store that meet some, but maybe not all, criteria. Such threshold queries generalize intersections and unions; they are often used in information-retrieval and data-mining applications. We introduce new algorithms that are sometimes three orders of magnitude faster than a naïve approach. Our work shows that bitmap indexes are more broadly applicable than is commonly believed.

    Compressed bitmap indexes: beyond unions and intersections
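
    For reference, here is the naive baseline in C against which such algorithms are measured: over uncompressed 64-bit words, keep the bit positions set in at least T of the N bitmaps. T = N gives an intersection and T = 1 a union; the paper's algorithms compute the same result far faster on compressed inputs.

        #include <stddef.h>
        #include <stdint.h>

        /* out[w] has a bit set iff at least T of the input bitmaps
           have that bit set. Naive: counts every bit individually. */
        static void threshold_words(const uint64_t **bitmaps, int nbitmaps,
                                    size_t nwords, int T, uint64_t *out) {
            for (size_t w = 0; w < nwords; w++) {
                uint64_t result = 0;
                for (int bit = 0; bit < 64; bit++) {
                    int count = 0;
                    for (int i = 0; i < nbitmaps; i++)
                        count += (int)((bitmaps[i][w] >> bit) & 1);
                    if (count >= T) result |= UINT64_C(1) << bit;
                }
                out[w] = result;
            }
        }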

    We present fast strongly universal string hashing families: they can process data at a rate of 0.2 CPU cycle per byte. Maybe surprisingly, we find that these families---though they require a large buffer of random numbers---are often faster than popular hash functions with weaker theoretical guarantees. Moreover, conventional wisdom is that hash functions with fewer multiplications are faster. Yet we find that they may fail to be faster due to operation pipelining. We present experimental results on several processors including low-powered processors. Our tests include hash functions designed for processors with the Carry-Less Multiplication (CLMUL) instruction set. We also prove, using accessible proofs, the strong universality of our families.

    Strongly universal string hashing is fast
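
    A C sketch of the multilinear structure behind such families: h(s) = m[0] + sum of m[i+1]*s[i] over random keys m. The buffer m must be random and at least as long as the longest string (the "large buffer of random numbers" above). The families in the paper obtain strong universality with modular or carefully truncated arithmetic; this plain wrap-around version only shows the shape of the computation.

        #include <stddef.h>
        #include <stdint.h>

        /* Multilinear hash of n 32-bit characters with 64-bit keys m
           (m must have at least n + 1 entries). */
        static uint64_t multilinear(const uint32_t *s, size_t n, const uint64_t *m) {
            uint64_t h = m[0];
            for (size_t i = 0; i < n; i++)
                h += m[i + 1] * (uint64_t)s[i];
            return h;
        }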

    Good database design is crucial to obtain a sound, consistent database, and - in turn - good database design methodologies are the best way to achieve the right design. These methodologies are taught to most Computer Science undergraduates, as part of any Introduction to Database class. They can be considered part of the "canon", and indeed, the overall approach to database design has been unchanged for years. Moreover, none of the major database research assessments identify database design as a strategic research direction. Should we conclude that database design is a solved problem? Our thesis is that database design remains a critical unsolved problem. Hence, it should be the subject of more research. Our starting point is the observation that traditional database design is not used in practice - and if it were used it would result in designs that are not well adapted to current environments. In short, database design has failed to keep up with the times. In this paper, we put forth arguments to support our viewpoint, analyze the root causes of this situation and suggest some avenues of research.

    A Call to Arms: Revisiting Database Design

    Nathan Kurz

    Arrays of integers are often compressed in search engines. Though there are many ways to compress integers, we are interested in the popular byte-oriented integer compression techniques (e.g., VByte or Google's Varint-GB). They are appealing due to their simplicity and engineering convenience. Amazon's varint-G8IU is one of the fastest byte-oriented compression techniques published so far. It makes judicious use of the powerful single-instruction-multiple-data (SIMD) instructions available in commodity processors. To surpass varint-G8IU, we present Stream VByte, a novel byte-oriented compression technique that separates the control stream from the encoded data. Like varint-G8IU, Stream VByte is well suited for SIMD instructions. We show that Stream VByte decoding can be up to twice as fast as varint-G8IU decoding over real data sets. In this sense, Stream VByte establishes new speed records for byte-oriented integer compression, at times exceeding the speed of the memcpy function. On a 3.4GHz Haswell processor, it decodes more than 4 billion differentially-coded integers per second from RAM to L1 cache.

    Stream VByte: Faster byte-oriented integer compression
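
    A minimal C sketch of the format's key idea, separating control from data: each group of four integers gets one control byte holding four 2-bit byte-lengths, written to its own stream. Decoders can then use the control byte to look up a shuffle mask without branching. This encoder assumes a little-endian host, sufficiently large buffers, and input padded to a multiple of four.

        #include <stdint.h>
        #include <string.h>

        /* Encode four integers: 2 bits per length (1 to 4 bytes) in the
           control byte, payload bytes appended to the data stream. */
        static void encode_quad(const uint32_t *in, uint8_t *control, uint8_t **data) {
            uint8_t ctrl = 0;
            for (int i = 0; i < 4; i++) {
                uint32_t v = in[i];
                int len = 1 + (v > 0xFF) + (v > 0xFFFF) + (v > 0xFFFFFF);
                memcpy(*data, &v, (size_t)len); /* low bytes first (little-endian) */
                *data += len;
                ctrl |= (uint8_t)(len - 1) << (2 * i);
            }
            *control = ctrl;
        }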

    Bitmap indexes are commonly used in databases and search engines. By exploiting bit-level parallelism, they can significantly accelerate queries. However, they can use much memory, and thus we might prefer compressed bitmap indexes. Following Oracle's lead, bitmaps are often compressed using run-length encoding (RLE). Building on prior work, we introduce the Roaring compressed bitmap format: it uses packed arrays for compression instead of RLE. We compare it to two high-performance RLE-based bitmap encoding techniques: WAH (Word Aligned Hybrid compression scheme) and Concise (Compressed `n' Composable Integer Set). On synthetic and real data, we find that Roaring bitmaps (1) often compress significantly better (e.g., 2 times) and (2) are faster than the compressed alternatives (up to 900 times faster for intersections). Our results challenge the view that RLE-based bitmap compression is best.

    Better bitmap performance with Roaring bitmaps
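
    To make the format concrete, here is a minimal C sketch of membership testing in the two original container types: a 32-bit value splits into a 16-bit chunk key (high bits) and a 16-bit offset (low bits); a chunk with at most 4096 values is kept as a sorted array, and a denser chunk as an 8 KiB bitmap.

        #include <stdbool.h>
        #include <stdint.h>

        /* Bitmap container: 1024 64-bit words cover all 65536 offsets. */
        static bool bitmap_contains(const uint64_t words[1024], uint16_t low) {
            return (words[low >> 6] >> (low & 63)) & 1;
        }

        /* Array container: binary search in a sorted list of offsets. */
        static bool array_contains(const uint16_t *vals, int n, uint16_t low) {
            int lo = 0, hi = n - 1;
            while (lo <= hi) {
                int mid = lo + (hi - lo) / 2;
                if (vals[mid] < low) lo = mid + 1;
                else if (vals[mid] > low) hi = mid - 1;
                else return true;
            }
            return false;
        }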

    Sorted lists of integers are commonly used in inverted indexes and database systems. They are often compressed in memory. We can use the single-instruction, multiple-data (SIMD) instructions available in common processors to boost the speed of integer compression schemes. Our S4-BP128-D4 scheme uses as little as 0.7 CPU cycles per decoded 32-bit integer while still providing state-of-the-art compression. However, if the subsequent processing of the integers is slow, the effort spent on optimizing decompression speed can be wasted. To show that it does not have to be so, we (1) vectorize and optimize the intersection of posting lists; (2) introduce the SIMD galloping algorithm. We exploit the fact that one SIMD instruction can compare four pairs of 32-bit integers at once. We experiment with two Text REtrieval Conference (TREC) text collections, GOV2 and ClueWeb09 (category B), using logs from the TREC million-query track. We show that using only the SIMD instructions ubiquitous in all modern CPUs, our techniques for conjunctive queries can double the speed of a state-of-the-art approach.

    SIMD compression and the intersection of sorted integers
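
    For intuition, here is the scalar skeleton of galloping search in C: when intersecting a short list with a long one, probe the long list at exponentially growing offsets, then binary-search the bracketed range. The SIMD galloping of the paper accelerates the probes by comparing four 32-bit integers per instruction.

        #include <stddef.h>
        #include <stdint.h>

        /* Index of the first element >= key in sorted a[0..n); returns n
           if every element is smaller. */
        static size_t gallop(const uint32_t *a, size_t n, uint32_t key) {
            size_t step = 1, lo = 0, hi = n;
            while (lo + step < n && a[lo + step] < key) { lo += step; step *= 2; }
            if (lo + step < n) hi = lo + step + 1;
            while (lo < hi) {               /* binary search in a[lo..hi) */
                size_t mid = lo + (hi - lo) / 2;
                if (a[mid] < key) lo = mid + 1; else hi = mid;
            }
            return lo;
        }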

    Iterated hash functions process strings recursively, one character at a time. At each iteration, they compute a new hash value from the preceding hash value and the next character. We prove that iterated hashing can be pairwise independent, but never 3-wise independent. We show that it can be almost universal over strings much longer than the number of hash values; we bound the maximal string length given the collision probability.

    The universality of iterated hashing over variable-length strings
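
    The class of functions studied here has this shape, sketched in C: the next hash value depends only on the current value and the next character. The step function below is a toy polynomial update; the paper's bounds apply to the whole class, not to this particular choice.

        #include <stddef.h>
        #include <stdint.h>

        /* Iterated hashing: h_{i+1} = F(h_i, c_i), here F(h, c) = h*B + c
           for a (randomly chosen) multiplier B and an arbitrary seed. */
        static uint64_t iterated_hash(const uint8_t *s, size_t n,
                                      uint64_t seed, uint64_t B) {
            uint64_t h = seed;
            for (size_t i = 0; i < n; i++)
                h = h * B + s[i];
            return h;
        }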

    Daniel Lemire

    Université du Québec à Montréal

    Acadia University

    Tech-Elements Inc.

    National Research Council of Canada

    TELUQ - Université du Québec

    Institute for biomedical engineering (Montreal)

    University of New Brunswick

    Montreal, Canada Area

    I am an adjunct professor at UQAM where I am a member of the LATECE laboratory. I supervise graduate students.

    Adjunct Professor

    Université du Québec à Montréal

    Montreal, Canada Area

    I am a full professor (promoted in 2009) at the Université du Québec. I have been a department chair since June 2017. I teach computer science, primarily at the graduate level. I have held a federal research grant (NSERC) for nearly twenty years; I am a member of the NSERC computer science committee (2017-2020). I have supervised dozens of graduate students and published 70 peer-reviewed papers. During the 2016-2017 NSERC competition, I received a rating of "outstanding" for the excellence of the researcher. I maintain software used by Fortune 100 companies.

    Professor and department chair

    TELUQ - Université du Québec

    Acadia University

    Research Officer

    This was a full-time research position (government laboratory). I founded and ran the e-Health research group. I worked on recommender technology.

    National Research Council of Canada

    Entrepreneur

    I founded and ran a small consultancy which had 4 full-time associates at one point. We had several clients worldwide. We designed a wavelet-based compression format for radiology images (Waaves). We designed signal processing and modelling software in geophysics.

    Tech-Elements Inc.

    Adjunct Professor

    I am an adjunct professor at the Computer Science Department. I have supervised several graduate students.

    University of New Brunswick

    Post-doctoral fellow

    I did wavelet-based signal processing (ECG and EEG).

    Institute for biomedical engineering (Montreal)

  • 1995

    French

    English

    Ph.D.

    Supervisors: Gilles Deslauriers and Serge Dubuc.

    Engineering Mathematics

  • 1994

    M.Sc.

    Supervisor: Catherine Sulem.

    Mathematics

  • 1990

    B.Sc.

    I graduated with High Distinction and a nearly perfect GPA. I got the St. Michael's Dean medal. I held several prestigious scholarships, including the C. D. Howe Memorial scholarship (50k$). Instead of the regular course load of 5 classes a term, I took 6 classes a term.

    I also completed a minor in French studies and a major in Physics. (Not formally recognized as I did not register in the programs.)

    Mathematics

    I was a member of St. Michael's college.

    University of Toronto - University of St. Michael's College

  • 1988

    DEC

    Sciences de la nature (Natural Sciences)

    Science columnist for the "Mouton Noir" (student newspaper)

    CÉGEP de Drummondville

  • 1983

    DES

    Member of the computer club.

    Collège Saint-Bernard

  • Compressing column-oriented indexes

    Decoding billions of integers per second through vectorization


    Faster Column-Oriented Indexes

    We present recent research results on optimizing column-oriented indexes for faster data warehousing. This talk aims to answer the following question: when is sorting the table a sufficiently good optimization?

    Write good papers

    How to write good research papers. This talk covers everything from selecting a good title to writing a good abstract and crafting good figures.

    Innovation without permission: from Codd to NoSQL

    Practitioners often fail to apply textbook database design principles. We observe both a perversion of the relational model and a growth of less formal alternatives. Overall, there is an opposition between the analytic thought that prevailed when many data modeling techniques were initiated, and the pragmatism which now dominates among practitioners. There are at least two recent trends supporting this rejection of traditional models: (1) the rise of the sophisticated user, most notably in social media, is a challenge to the rationalist view, as it blurs the distinction between design and operation; (2) in the new technological landscape where there are billions of interconnected computers worldwide, simple concepts like consistency sometimes become prohibitively expensive. Overall, for a wide range of information systems, design and operation are becoming integrated in the spirit of pragmatism. Thus, we are left with design methodologies which embrace fast and continual iterations and exploratory testing. These methodologies allow innovation without permission in that the right to design new features is no longer so closely guarded.

    A Comparison of Five Probabilistic View-Size Estimation Techniques in OLAP

    A data warehouse cannot materialize all possible views, hence we must estimate quickly, accurately, and reliably the size of views to determine the best candidates for materialization. Many available techniques for view-size estimation make particular statistical assumptions and their error can be large. Comparatively, unassuming probabilistic techniques are slower, but they accurately and reliably estimate very large view sizes using little memory. We compare five unassuming hashing-based view-size estimation techniques including Stochastic Probabilistic Counting and LogLog Probabilistic Counting. Our experiments show that only Generalized Counting, Gibbons-Tirthapura, and Adaptive Counting provide universally tight estimates irrespective of the size of the view; of those, only Adaptive Counting remains constantly fast as we increase the memory budget.
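
    The compared techniques descend from Flajolet-Martin probabilistic counting, sketched here in C: hash each fact, mark the position of the lowest set bit, and estimate the number of distinct values from the first unmarked position. This single-bitmap version is only an illustration; the schemes in the talk combine many such sketches to tighten the estimate.

        #include <stdint.h>

        #define FM_PHI 0.77351 /* Flajolet-Martin correction constant */

        /* Record the position of the lowest set bit of the hash. */
        static void fm_add(uint32_t *bitmap, uint32_t hash) {
            int r = 0;
            while (r < 31 && !((hash >> r) & 1)) r++;
            *bitmap |= UINT32_C(1) << r;
        }

        /* Estimate: 2^R / phi, where R is the lowest unset position. */
        static double fm_estimate(uint32_t bitmap) {
            int r = 0;
            while ((bitmap & 1) && r < 31) { bitmap >>= 1; r++; }
            return (double)(UINT32_C(1) << r) / FM_PHI;
        }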

    All About Bitmap Indexes... And Sorting Them

    A review of bitmap indexes from an academic perspective. Several theoretical results are presented. The talk also discusses technical issues regarding sorting the tables prior to indexing, as a way to improve the indexes.

    Much of the talk is based on the following preprint: Daniel Lemire, Owen Kaser, Kamel Aouiche, Sorting improves word-aligned bitmap indexes, http://arxiv.org/abs/0901.3751

    XML

    Machine Learning

    Mathematics

    Compression

    Java

    Algorithm Design

    Data Warehousing

    Python

    Analytics

    Recommender Systems

    Time Series Analysis

    Data Visualization

    C++

    Relational Databases

    Algorithms

    OLAP

    Software Development

    MongoDB

    Information Retrieval

    NoSQL


    Intel and AMD support the Carry-less Multiplication (CLMUL) instruction set in their x64 processors. We use CLMUL to implement an almost universal 64-bit hash family (CLHASH). We compare this new family with what might be the fastest almost universal family on x64 processors (VHASH). We find that CLHASH is at least 60% faster. We also compare CLHASH with a popular hash function designed for speed (Google's CityHash). We find that CLHASH is 40% faster than CityHash on inputs larger than 64 bytes and just as fast otherwise.

    Faster 64-bit universal hashing using carry-less multiplications
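
    CLHASH is built on the carry-less product, shown here with the corresponding Intel intrinsic in C: two 64-bit values are multiplied as polynomials over GF(2), yielding a 128-bit product with no carries between bit positions. This is only the primitive (compile with -mpclmul); the full family chains such products over the input and a buffer of random keys.

        #include <stdint.h>
        #include <wmmintrin.h> /* _mm_clmulepi64_si128 */

        /* Carry-less 64x64 -> 128-bit multiplication. */
        static __m128i clmul64(uint64_t a, uint64_t b) {
            __m128i va = _mm_cvtsi64_si128((long long)a);
            __m128i vb = _mm_cvtsi64_si128((long long)b);
            return _mm_clmulepi64_si128(va, vb, 0x00); /* low 64-bit halves */
        }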

    Kamel Boukhalfa

    In the current Big Data era, systems for collecting, storing and efficiently exploiting huge amounts of data are continually introduced, such as Hadoop, Apache Spark, Dremel, etc. Druid is one of these systems, designed especially to manage such data quantities and to perform detailed real-time analysis on terabytes of data within sub-second latencies. One of Druid's important requirements is fast data filtering, and to ensure this, Druid makes extensive use of bitmap indexes. Previously, we introduced a compressed bitmap index scheme called Roaring bitmap, which has shown interesting results when compared to the bitmap compression scheme adopted by Druid: Concise. Since then, Roaring bitmap has been integrated into Druid as an indexing solution. In this work, we produce an extensive series of experiments in order to compare the time-space performances of Roaring bitmap and Concise when used to accelerate Druid's OLAP queries and the other kinds of operations Druid performs on bitmaps: retrieving set bits, computing bitmap complements, and aggregating several bitmaps with logical OR and AND operations. Roaring bitmap has been shown to improve analytical query response times under Druid by up to ≈5× compared to Concise.

    Optimizing Druid with Roaring bitmaps

    Sorting database tables before compressing them improves the compression rate. Can we do better than the lexicographical order? For minimizing the number of runs in a run-length encoding compression scheme, the best approaches to row-ordering are derived from traveling salesman heuristics, although there is a significant trade-off between running time and compression. A new heuristic, Multiple Lists, which is a variant on Nearest Neighbor that trades off compression for a major running-time speedup, is a good option for very large tables. However, for some compression schemes, it is more important to generate long runs rather than few runs. For this case, another novel heuristic, Vortex, is promising. We find that we can improve run-length encoding up to a factor of 3 whereas we can improve prefix coding by up to 80%: these gains are on top of the gains due to lexicographically sorting the table. We prove that, in a few cases, the new row reordering is optimal (within 10%) at minimizing the runs of identical values within columns.

    Reordering rows for better compression: Beyond the lexicographic order
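
    The objective these heuristics minimize can be stated in a few lines of C: the number of runs of identical values in a column, which determines the storage cost under run-length encoding.

        #include <stddef.h>
        #include <stdint.h>

        /* Number of runs of identical consecutive values in a column. */
        static size_t count_runs(const uint32_t *col, size_t n) {
            if (n == 0) return 0;
            size_t runs = 1;
            for (size_t i = 1; i < n; i++)
                if (col[i] != col[i - 1]) runs++;
            return runs;
        }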

    Owen Kaser, Hazel Webb

    In OLAP, analysts often select an interesting sample of the data. For example, an analyst might focus on products bringing revenues of at least $100 000, or on shops having sales greater than $400 000. However, current systems do not allow the application of both of these thresholds simultaneously, selecting products and shops satisfying both thresholds. For such purposes, we introduce the diamond cube operator, filling a gap among existing data warehouse operations. Because of the interaction between dimensions, the computation of diamond cubes is challenging. We compare and test various algorithms on large data sets of more than 100 million facts. We find that while it is possible to implement diamonds in SQL, it is inefficient. Indeed, our custom implementation can be a hundred times faster than popular database engines (including a row-store and a column-store).

    Diamond Dicing
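
    A toy C sketch of the fixpoint computation behind the diamond cube, on a small dense 0/1 fact table (the paper's algorithms handle huge sparse data): repeatedly drop rows with fewer than k1 facts among surviving columns and columns with fewer than k2 facts among surviving rows, until both thresholds hold everywhere.

        #include <stdbool.h>

        enum { R = 4, C = 4 }; /* toy dimensions */

        static void diamond(const int facts[R][C], int k1, int k2,
                            bool keep_row[R], bool keep_col[C]) {
            for (int i = 0; i < R; i++) keep_row[i] = true;
            for (int j = 0; j < C; j++) keep_col[j] = true;
            bool changed = true;
            while (changed) {
                changed = false;
                for (int i = 0; i < R; i++) {       /* prune thin rows */
                    if (!keep_row[i]) continue;
                    int sum = 0;
                    for (int j = 0; j < C; j++)
                        if (keep_col[j]) sum += facts[i][j];
                    if (sum < k1) { keep_row[i] = false; changed = true; }
                }
                for (int j = 0; j < C; j++) {       /* prune thin columns */
                    if (!keep_col[j]) continue;
                    int sum = 0;
                    for (int i = 0; i < R; i++)
                        if (keep_row[i]) sum += facts[i][j];
                    if (sum < k2) { keep_col[j] = false; changed = true; }
                }
            }
        }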

    Nathan Kurz

    We consider the ubiquitous technique of VByte compression, which represents each integer as a variable length sequence of bytes. The low 7 bits of each byte encode a portion of the integer, and the high bit of each byte is reserved as a continuation flag. This flag is set to 1 for all bytes except the last, and the decoding of each integer is complete when a byte with a high bit of 0 is encountered. VByte decoding can be a performance bottleneck, especially when the unpredictable lengths of the encoded integers cause frequent branch mispredictions. Previous attempts to accelerate VByte decoding using SIMD vector instructions have been disappointing, prodding search engines such as Google to use more complicated but faster-to-decode formats for performance-critical code. Our decoder (Masked VByte) is 2 to 4 times faster than a conventional scalar VByte decoder, making the format once again competitive with regard to speed.

    Vectorized VByte Decoding
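
    The conventional scalar decoder described above fits in a few lines of C; this branchy loop is exactly what Masked VByte replaces with SIMD shuffles driven by the gathered continuation bits.

        #include <stddef.h>
        #include <stdint.h>

        /* Decode a VByte stream: 7 payload bits per byte, high bit set on
           all bytes except the last of each integer. Returns the number
           of decoded integers. */
        static size_t vbyte_decode(const uint8_t *in, size_t nbytes, uint32_t *out) {
            size_t count = 0;
            uint32_t value = 0;
            int shift = 0;
            for (size_t i = 0; i < nbytes; i++) {
                value |= (uint32_t)(in[i] & 0x7F) << shift;
                if (in[i] & 0x80) {
                    shift += 7;           /* continuation: more bytes follow */
                } else {
                    out[count++] = value; /* high bit 0: integer complete */
                    value = 0;
                    shift = 0;
                }
            }
            return count;
        }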

    Bloom filters are probabilistic data structures commonly used for approximate membership problems in many areas of Computer Science (networking, distributed systems, databases, etc.). With the increase in data size and distribution of data, problems arise where a large number of Bloom filters are available, and all of them need to be searched for potential matches. As an example, in a federated cloud environment, each cloud provider could encode the information using Bloom filters and share the Bloom filters with a central coordinator. The problem of interest is not only whether a given element is in any of the sets represented by the Bloom filters, but also which of the existing sets contain the given element. This problem cannot be solved by just constructing a Bloom filter on the union of all the sets. Instead, we effectively have a multidimensional Bloom filter problem: given an element, we wish to receive a list of candidate sets where the element might be.

    To solve this problem, we consider three alternatives. Firstly, we can naively check many Bloom filters. Secondly, we propose to organize the Bloom filters in a hierarchical index structure akin to a B+ tree, which we call Bloofi. Finally, we propose another data structure that packs the Bloom filters in such a way as to exploit bit-level parallelism, which we call Flat-Bloofi.

    Our theoretical and experimental results show that Bloofi and Flat-Bloofi provide scalable and efficient alternatives to search through a large number of Bloom filters.

    Bloofi: Multidimensional Bloom filters
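
    The naive first alternative is easy to state in C, and is the baseline Bloofi and Flat-Bloofi improve on: probe every filter and collect the candidate sets. The single-filter membership test below is a generic textbook Bloom check, not the paper's specific implementation.

        #include <stdbool.h>
        #include <stddef.h>
        #include <stdint.h>

        typedef struct { const uint64_t *bits; size_t nbits; } bloom_t;

        /* Standard Bloom membership: all k hash positions must be set. */
        static bool bloom_contains(const bloom_t *f, const size_t *hashes, int k) {
            for (int i = 0; i < k; i++) {
                size_t pos = hashes[i] % f->nbits;
                if (!((f->bits[pos >> 6] >> (pos & 63)) & 1)) return false;
            }
            return true;
        }

        /* Linear scan over all filters; returns the number of candidate
           sets and writes their indices to candidates. */
        static int naive_search(const bloom_t *filters, int nfilters,
                                const size_t *hashes, int k, int *candidates) {
            int found = 0;
            for (int i = 0; i < nfilters; i++)
                if (bloom_contains(&filters[i], hashes, k))
                    candidates[found++] = i;
            return found;
        }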

    To classify time series by nearest neighbors, we need to specify or learn one or several distance measures. We consider variations of the Mahalanobis distance measures which rely on the inverse covariance matrix of the data. Unfortunately --- for time series data --- the covariance matrix often has low rank. To alleviate this problem we can either use a pseudoinverse, covariance shrinking, or limit the matrix to its diagonal. We review these alternatives and benchmark them against competitive methods such as the related Large Margin Nearest Neighbor Classification (LMNN) and the Dynamic Time Warping (DTW) distance. As we expected, we find that the DTW is superior, but the Mahalanobis distance measures are one to two orders of magnitude faster. To get the best results with Mahalanobis distance measures, we recommend learning one distance measure per class using either covariance shrinking or the diagonal approach.

    Time series classification by class-specific Mahalanobis distance measures
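
    The diagonal variant recommended above reduces to a variance-weighted Euclidean distance, sketched in C: with the covariance matrix restricted to its diagonal, the squared distance weights each coordinate by its inverse variance (learned per class).

        #include <stddef.h>

        /* Squared diagonal Mahalanobis distance between series x and y;
           inv_var[i] is the inverse variance of coordinate i. */
        static double diag_mahalanobis2(const double *x, const double *y,
                                        const double *inv_var, size_t n) {
            double d2 = 0.0;
            for (size_t i = 0; i < n; i++) {
                double diff = x[i] - y[i];
                d2 += diff * diff * inv_var[i];
            }
            return d2;
        }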

INF 6450

Rating: 4.5 (3 reviews)