Extracting text from Word/PDF documents 2010-11-12

Want to make a word cloud out of a whole bunch of documents, or do some simple statistics? Lucene is great for indexing documents, but I wanted something quick and dirty I could mess around with.

For example, here is a tag cloud from my research papers:

Some quick code I cooked up to extract text from a collection of Word/PDF files:

You’ll need to include these references:

using Word=Microsoft.Office.Interop.Word;
using System.Text.RegularExpressions;
using org.pdfbox.pdmodel;
using org.pdfbox.util;
using System.IO;

Read up on the PDFBox project's parser. This will get the raw text from a PDF.

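private static string ReadPDFText(string name)
{
    PDDocument doc = null;
    try
    {
        doc = PDDocument.load(name);
        var stripper = new PDFTextStripper();
        return stripper.getText(doc);
    }
    catch
    {
        // Quick and dirty: treat any unreadable file as empty text.
        return "";
    }
    finally
    {
        // Close the document so PDFBox releases the file.
        if (doc != null) doc.close();
    }
}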
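
Once the raw text is out, the "simple statistics" part can be as small as a word count. Here is a minimal sketch of that step, not the code behind the tag cloud above. The names `WordStats`, `CountWords` and `TopWords` are made up for illustration, and treating any run of two or more letters as a "word" is my own assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordStats
{
    // Counts how often each word occurs, ignoring case.
    // Assumption: a "word" is a run of two or more letters.
    public static Dictionary<string, int> CountWords(string text)
    {
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (Match m in Regex.Matches(text ?? "", @"\p{L}{2,}"))
        {
            int n;
            counts.TryGetValue(m.Value, out n); // n stays 0 for a new word
            counts[m.Value] = n + 1;
        }
        return counts;
    }

    // Returns the 'top' most frequent words, most frequent first.
    // Ties are broken alphabetically so the result is deterministic.
    public static List<KeyValuePair<string, int>> TopWords(Dictionary<string, int> counts, int top)
    {
        return counts.OrderByDescending(kv => kv.Value)
                     .ThenBy(kv => kv.Key, StringComparer.OrdinalIgnoreCase)
                     .Take(top)
                     .ToList();
    }
}
```

Feeding it the output of `ReadPDFText` for each file and merging the counts should be enough to drive a tag cloud; a stop-word list would be the obvious next addition.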

Visual Studio Search Results, Programmatically 2010-11-08

If you want to log code searches, or perhaps provide a better UI for search results, then you need to listen to search events with a Visual Studio add-in or VSPackage.

Unfortunately, there still isn’t a direct interface for getting search results. You actually have to copy the text from the UI window. Yuck.

Here is some code that does this:

First, set up the find “done” event handler.

EnvDTE.FindEvents m_findEvents;

protected override void Initialize()
{
    Trace.WriteLine(string.Format(CultureInfo.CurrentCulture, "Entering Initialize() of: {0}", this.ToString()));
    base.Initialize();

    var dte = (EnvDTE.DTE)GetService(typeof(EnvDTE.DTE));
    if (dte != null)
    {
       // Keep FindEvents in a field: if the object gets garbage collected, the event stops firing.
       m_findEvents = dte.Events.FindEvents;
       m_findEvents.FindDone += new EnvDTE._dispFindEvents_FindDoneEventHandler(m_findEvents_FindDone);
    }
}
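
Once the text has been copied out of the results window, it still has to be turned into something usable. Below is a rough sketch of that parsing step. It assumes each hit looks like `path(line):text`, which is what the default Find Results output typically looks like, and it skips any line that doesn't fit. `FindHit` and `FindResultsParser` are names I made up for illustration, not part of EnvDTE.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// One parsed hit from the Find Results text (hypothetical helper type).
public class FindHit
{
    public string File;
    public int Line;
    public string Text;
}

public static class FindResultsParser
{
    // Assumed hit format: optional indent, file path, "(line):", then the matching text.
    static readonly Regex HitPattern =
        new Regex(@"^\s*(?<file>.+?)\((?<line>\d+)\):(?<text>.*)$");

    public static List<FindHit> Parse(string windowText)
    {
        var hits = new List<FindHit>();
        foreach (var raw in (windowText ?? "").Split('\n'))
        {
            var m = HitPattern.Match(raw.TrimEnd('\r'));
            int line;
            // Header and summary lines don't match the pattern and are skipped.
            if (!m.Success || !int.TryParse(m.Groups["line"].Value, out line))
                continue;

            hits.Add(new FindHit
            {
                File = m.Groups["file"].Value,
                Line = line,
                Text = m.Groups["text"].Value.Trim()
            });
        }
        return hits;
    }
}
```

The list of hits is then easy to log or to bind to a nicer results UI. If your Find Results output is configured differently, the pattern will need adjusting.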

Recording EMG, Sending Event Pulses with LabJack 2010-06-24

[](http://blog.ninlabs.com/wp-content/uploads/2010/06/subvocal.png)[](http://blog.ninlabs.com/wp-content/uploads/2010/06/labjack.jpg)

When recording EMG signals, you want to be able to segment and associate those signals with certain events, e.g., the presentation of a stimulus. Because one of my goals is to recognize subvocalized words, it is even more important to get tight segmentation.

Here, we have a test audio signal (blue) and one channel of the corresponding EMG signal (beige). I had to line these up manually, and I have little confidence that the alignment is correct.

Luckily, one of the EMG devices I’m using supports sending events into the EMG stream to leave “marks” in the signal. Using a LabJack device, I can send a digital pulse to the EMG recording device.

Now, I have another channel with event marks!

C# code for the LabJack:

static System.Timers.Timer Timer;
internal static void SendSignal()
{
   //Set digital output FIO4 to output-high.
   LJUD.AddRequest(Connection.ljhandle, LJUD.IO.PUT_DIGITAL_BIT, 4, 1, 0, 0);

   //Set digital output FIO5 to output-high.
   LJUD.AddRequest(Connection.ljhandle, LJUD.IO.PUT_DIGITAL_BIT, 5, 1, 0, 0);

   //Execute the requests.
   LJUD.GoOne(Connection.ljhandle);

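   //After 20 ms, set both outputs back to low to end the pulse.
   Timer = new System.Timers.Timer(20);
   Timer.AutoReset = false; //Fire only once.
   Timer.Elapsed += delegate(object sender, System.Timers.ElapsedEventArgs e)
   {
      //Dispose the timer that fired (sender), not whatever the static field holds by now.
      ((System.Timers.Timer)sender).Dispose();

      //Set digital outputs FIO5 and FIO4 to output-low.
      LJUD.AddRequest(Connection.ljhandle, LJUD.IO.PUT_DIGITAL_BIT, 5, 0, 0, 0);
      LJUD.AddRequest(Connection.ljhandle, LJUD.IO.PUT_DIGITAL_BIT, 4, 0, 0, 0);

      LJUD.GoOne(Connection.ljhandle);
   };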
   Timer.Start();
}
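
With the marks in the recording, segmentation no longer needs manual alignment: find where the mark channel goes high and cut the EMG channel there. The sketch below shows the idea on plain arrays. It assumes the mark channel and the EMG channel share the same sample rate and length, and that a fixed threshold separates low from high. `EventMarks`, `RisingEdges` and `Segment` are illustrative names, not part of the LabJack or EMG device APIs.

```csharp
using System;
using System.Collections.Generic;

public static class EventMarks
{
    // Returns the sample indices where the mark channel crosses 'threshold' going upward.
    public static List<int> RisingEdges(double[] markChannel, double threshold)
    {
        var edges = new List<int>();
        for (int i = 1; i < markChannel.Length; i++)
        {
            if (markChannel[i - 1] < threshold && markChannel[i] >= threshold)
                edges.Add(i);
        }
        return edges;
    }

    // Cuts 'signal' into segments that each start at one mark
    // and end just before the next mark (or at the end of the signal).
    public static List<double[]> Segment(double[] signal, List<int> edges)
    {
        var segments = new List<double[]>();
        for (int k = 0; k < edges.Count; k++)
        {
            // Clamp to the signal length in case the channels differ slightly in size.
            int start = Math.Min(edges[k], signal.Length);
            int end = Math.Min(k + 1 < edges.Count ? edges[k + 1] : signal.Length, signal.Length);

            var segment = new double[end - start];
            Array.Copy(signal, start, segment, 0, segment.Length);
            segments.Add(segment);
        }
        return segments;
    }
}
```

Only the rising edge is used here, so the 20 ms pulse width doesn't matter for segmentation as long as it is long enough for the recording device to sample it.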